Boston Dataset In R Download
Introduction
This project aims to identify the factors that affect residential property values in the city of Boston. Factors such as per capita income, environmental conditions, educational facilities and property size are considered in order to determine the most significant predictors. We build multiple linear regression models using subset selection and LASSO, and compare their performance with the linear regression model containing all the variables. We use the following metrics to compare the models: R-squared, adjusted R-squared, AIC, BIC and Mean Squared Error (MSE).
Packages Required
The following packages are required for the project:
library(corrr)
library(gridExtra)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(DT)
library(MASS)
library(leaps)
library(glmnet)
library(PerformanceAnalytics)
Data Exploration
Checking for Data structure
Our data contains 506 observations of 14 variables. The data types are as follows:
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
Data Summary
A quick summary of the distribution of every variable in the data:
summary(Boston)
##       crim                zn             indus            chas
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       nox               rm             age              dis
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       rad              tax           ptratio          black
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      lstat            medv
##  Min.   : 1.73   Min.   : 5.00
##  1st Qu.: 6.95   1st Qu.:17.02
##  Median :11.36   Median :21.20
##  Mean   :12.65   Mean   :22.53
##  3rd Qu.:16.95   3rd Qu.:25.00
##  Max.   :37.97   Max.   :50.00
Checking for Null Values
There are no missing (NA) values in the data:
colSums(is.na(Boston))
##    crim      zn   indus    chas     nox      rm     age     dis     rad
##       0       0       0       0       0       0       0       0       0
##     tax ptratio   black   lstat    medv
##       0       0       0       0       0
Data sneak peek
A quick glance into the data:
Boston %>% datatable(caption = "Boston Housing")
Correlation Matrices
Correlation of target variable with predictor variables:
- rm and lstat are highly correlated with the target variable medv
- black, dis, rm, chas, zn are positively correlated with medv
- crim, indus, nox, age, rad, tax, ptratio, lstat are negatively correlated with medv
Boston %>% correlate() %>% focus(medv)
## # A tibble: 13 x 2
##    rowname   medv
##    <chr>    <dbl>
##  1 crim    -0.388
##  2 zn       0.360
##  3 indus   -0.484
##  4 chas     0.175
##  5 nox     -0.427
##  6 rm       0.695
##  7 age     -0.377
##  8 dis      0.250
##  9 rad     -0.382
## 10 tax     -0.469
## 11 ptratio -0.508
## 12 black    0.333
## 13 lstat   -0.738
Correlation among predictor variables:
On plotting the pairwise correlations between each of the variables, we see the following:
The strongest positive correlations are between "rad" and "tax" and between "indus" and "nox"; the strongest negative correlations are between "dis" and "age" and between "dis" and "nox".
chart.Correlation(Boston[,-14], histogram=TRUE, pch=19)
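The pairwise correlations read off the chart can also be listed numerically. The following is a minimal sketch using only base R and the MASS copy of the data; the names pred.cor, pairs.df and top.pairs are introduced here for illustration and are not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame

# Correlation matrix of the 13 predictors (column 14 is medv)
pred.cor <- cor(Boston[, -14])

# Keep each pair once by blanking the diagonal and the lower triangle
pred.cor[lower.tri(pred.cor, diag = TRUE)] <- NA

# Reshape to long format and order by absolute correlation
pairs.df <- na.omit(as.data.frame(as.table(pred.cor)))
names(pairs.df) <- c("var1", "var2", "correlation")
pairs.df <- pairs.df[order(-abs(pairs.df$correlation)), ]

# The five most strongly correlated predictor pairs
top.pairs <- head(pairs.df, 5)
top.pairs
```

Sorting by absolute value puts strong negative pairs such as dis and nox next to the strong positive ones.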
Distributions
Predictor vars vs Target var
We plot the target variable medv against each of the other variables. The scatter plots for rm and lstat show a curved, roughly parabolic relationship with medv.
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
Boxplots for Predictors
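The boxplots flag outlying points (drawn in red) for several heavily skewed predictors, most visibly crim, zn and black, whose extremes lie far beyond the interquartile range reported in the summary above. All observations are kept for modelling.
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()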
Histograms for Predictors
The histograms of predictors give the following insights:
- rad and tax each show two separate peaks with no data in between
- rm is approximately normally distributed
- Most of the other distributions are skewed
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
Splitting the Data
We split the data in an 80:20 ratio into training and test sets. We use the training data for modelling and the test data for validation.
set.seed(12420352)
index <- sample(nrow(Boston), nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
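Because 80% of 506 is 404.8, the size of the training sample is not a whole number. The sketch below makes the rounding explicit with floor() and checks the resulting sizes; the rows actually drawn depend on the R version as well as the seed, so they may differ from the split used in this post.

```r
library(MASS)  # provides the Boston data frame

set.seed(12420352)                    # same seed as above
n <- nrow(Boston)                     # 506 rows
index <- sample(n, floor(n * 0.80))   # floor(404.8) = 404 row indices
Boston.train <- Boston[index, ]
Boston.test  <- Boston[-index, ]

# Sizes of the two parts
c(train = nrow(Boston.train), test = nrow(Boston.test))
```

A training size of 404 agrees with the 390 residual degrees of freedom reported for the full model, which estimates 14 coefficients.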
Linear Regression using all predictors
We build a linear regression model using all the variables in the data.
We notice that indus and age have very high p-values and appear to be non-significant.
The estimated coefficients are as follows:
model1 <- lm(medv~ ., data = Boston.train)
sum.model1 <- summary(model1)
sum.model1
##
## Call:
## lm(formula = medv ~ ., data = Boston.train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.3673  -2.7380  -0.5821   1.6192  24.5081
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  41.864329   6.019967   6.954 1.50e-11 ***
## crim         -0.133280   0.039488  -3.375 0.000812 ***
## zn            0.054777   0.015498   3.534 0.000458 ***
## indus         0.037333   0.074162   0.503 0.614975
## chas          3.430143   1.022925   3.353 0.000877 ***
## nox         -17.948596   4.412690  -4.067 5.75e-05 ***
## rm            3.154796   0.480915   6.560 1.71e-10 ***
## age           0.002563   0.014992   0.171 0.864349
## dis          -1.602194   0.229928  -6.968 1.37e-11 ***
## rad           0.356819   0.076813   4.645 4.65e-06 ***
## tax          -0.013827   0.004391  -3.149 0.001764 **
## ptratio      -1.003724   0.153330  -6.546 1.86e-10 ***
## black         0.010420   0.003061   3.404 0.000733 ***
## lstat        -0.564569   0.057875  -9.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.855 on 390 degrees of freedom
## Multiple R-squared:  0.7359, Adjusted R-squared:  0.7271
## F-statistic: 83.59 on 13 and 390 DF,  p-value: < 2.2e-16
Model Stats:
We evaluate the model using the following metrics: MSE (the squared residual standard error on the training data), R-squared, adjusted R-squared, test MSPE (mean squared prediction error on the test data), AIC and BIC:
# In-sample metrics from the model summary
model1.mse <- (sum.model1$sigma)^2
model1.rsq <- sum.model1$r.squared
model1.arsq <- sum.model1$adj.r.squared
# Out-of-sample mean squared prediction error on the test set
test.pred.model1 <- predict(model1, newdata=Boston.test)
model1.mpse <- mean((Boston.test$medv-test.pred.model1)^2)
model1.aic <- AIC(model1)
model1.bic <- BIC(model1)
stats.model1 <- c("full", model1.mse, model1.rsq, model1.arsq, model1.mpse, model1.aic, model1.bic)
comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE", "AIC", "BIC")
data.frame(cbind(comparison_table, stats.model1))
##     comparison_table      stats.model1
## 1         model type              full
## 2                MSE    23.57194554198
## 3          R-Squared 0.735883231832148
## 4 Adjusted R-Squared 0.727079339559886
## 5          Test MSPE  19.7293018987321
## 6                AIC   2438.9171384413
## 7                BIC  2498.93836161072
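To make the definitions of these metrics concrete, the sketch below recomputes them by hand and compares them with the values R reports. It fits the model on the full Boston data purely for illustration, so the numbers differ from the table above; the variable names ending in .hand are introduced here and are not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame

fit <- lm(medv ~ ., data = Boston)   # full data, for illustration only
s   <- summary(fit)

n   <- nrow(Boston)
p   <- length(coef(fit)) - 1         # number of predictors (13)
sse <- sum(residuals(fit)^2)         # residual sum of squares
sst <- sum((Boston$medv - mean(Boston$medv))^2)

mse.hand  <- sse / (n - p - 1)       # squared residual standard error
rsq.hand  <- 1 - sse / sst
arsq.hand <- 1 - (1 - rsq.hand) * (n - 1) / (n - p - 1)

# AIC and BIC from the Gaussian log-likelihood;
# k counts the p + 1 coefficients plus the error variance
k        <- p + 2
loglik   <- -n / 2 * (log(2 * pi) + log(sse / n) + 1)
aic.hand <- -2 * loglik + 2 * k
bic.hand <- -2 * loglik + log(n) * k

# TRUE if the hand-computed values agree with R's
all.equal(c(mse.hand, rsq.hand, arsq.hand, aic.hand, bic.hand),
          c(s$sigma^2, s$r.squared, s$adj.r.squared, AIC(fit), BIC(fit)))
```

Adjusted R-squared, AIC and BIC all add a penalty that grows with the number of coefficients, which is why they can prefer a smaller model even though R-squared never does.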
Subset Selection
We will use subset selection techniques for variable selection. The three methods employed are:
* Forward Variable selection
* Backward Variable Selection
* Exhaustive Variable Selection
Forward Variable Selection
We start with the forward selection method, where at each step we add the variable that most improves the model.
lstat, rm and ptratio are the most significant variables.
The following table shows the variables added to the model at each step, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.
model2 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13, method="forward")
sum.model2 <- summary(model2)
model2.subsets <- cbind(sum.model2$which, sum.model2$bic, sum.model2$rsq, sum.model2$adjr2,sum.model2$cp)
model2.subsets <- as.data.frame(model2.subsets)
colnames(model2.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model2.subsets
##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    1   0  1   0   0   0   0       1     0
## 5            1    0  0     0    1   0  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  1     0    1   1  1   0   1   0   0       1     0
## 8            1    0  1     0    1   1  1   0   1   0   0       1     1
## 9            1    0  1     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8811 0.6756594 0.6724078  84.92776
## 5      1 -432.9888 0.6867910 0.6828562  70.49059
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -454.2284 0.7115310 0.7064318  37.95897
## 8      1 -457.1622 0.7178410 0.7121264  30.64152
## 9      1 -455.7460 0.7210252 0.7146527  27.93961
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
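The greedy idea behind forward selection can be sketched in a few lines of base R: at each step, add the candidate that gives the lowest residual sum of squares (RSS). This is an illustrative re-implementation, not the algorithm used internally by regsubsets, and forward.order is a hypothetical helper name. It runs on the full Boston data, so the order of entry may differ from the table above, which is based on the training split.

```r
library(MASS)  # provides the Boston data frame

# Greedy forward selection by residual sum of squares (RSS)
forward.order <- function(data, response) {
  candidates <- setdiff(names(data), response)
  selected   <- character(0)
  rss.path   <- numeric(0)
  while (length(candidates) > 0) {
    # RSS of the current model extended by each remaining candidate
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best       <- names(which.min(rss))
    selected   <- c(selected, best)
    rss.path   <- c(rss.path, min(rss))
    candidates <- setdiff(candidates, best)
  }
  data.frame(step = seq_along(selected), added = selected, rss = rss.path)
}

fwd <- forward.order(Boston, "medv")
fwd
```

Since models of the same size are being compared at each step, choosing the lowest RSS is the same as choosing the highest R-squared.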
Plotting Model metrics
Comparing the 13 models of increasing size, we plot the model metrics to find the best model. R-squared keeps increasing as variables are added and hence always favors the model with the most variables.
The model with 11 variables gives the highest adjusted R-squared value and the lowest Cp and BIC values.
#PLOTS OF R2, ADJ R2, CP, BIC#
rsq <- data.frame(round(sum.model2$rsq,5))
model2.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(round(sum.model2$adjr2,4))
model2.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(round(sum.model2$bic,4))
model2.bic.plot <- ggplot(data = bic, aes(y = bic, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(round(sum.model2$cp,4))
model2.cp.plot <- ggplot(data = cp, aes(y = cp, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model2.rsq.plot,model2.adjrsq.plot,model2.bic.plot,model2.cp.plot, ncol=2)
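The trade-off between fit and model size can be checked with a little arithmetic. The BIC values in the forward selection table appear to be consistent with the form n*log(1 - R-squared) + (k + 1)*log(n), where k is the number of predictors; this form is an assumption inferred from the numbers in the table, not something stated in the original analysis. The sketch below plugs in the R-squared values of the 1-, 11- and 13-variable models.

```r
n   <- 404                       # training rows: 390 residual df + 14 coefficients
k   <- c(1, 11, 13)              # number of predictors in three rows of the table
rsq <- c(0.5510738, 0.7356932, 0.7358832)   # rsq column for those rows

# Assumed form of the reported BIC (relative to the intercept-only model):
# the first term rewards fit, the second penalizes every extra coefficient
bic <- n * log(1 - rsq) + (k + 1) * log(n)
round(bic, 2)
```

Going from 11 to 13 variables raises R-squared only in the fourth decimal place, so the log(n) penalty for the two extra coefficients dominates and BIC gets worse.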
Selecting best subset
Reiterating our findings from the plots, we find that the best model is the one with all variables except age and indus.
which.max(sum.model2$rsq)
## [1] 13
which.max(sum.model2$adjr2)
## [1] 11
which.min(sum.model2$cp)
## [1] 11
which.min(sum.model2$bic)
## [1] 11
coef(model2,11)
##  (Intercept)         crim           zn         chas          nox
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513
##           rm          dis          rad          tax      ptratio
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845
##        black        lstat
##   0.01043348  -0.55927297
Backward Variable Selection
Next comes the backward selection method, where at each step we remove the least influential variable from the model.
lstat, rm and ptratio are the most significant variables.
The following table shows the variables included in models of each size, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.
model3 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13, method="backward")
sum.model3 <- summary(model3)
model3.subsets <- cbind(sum.model3$which, sum.model3$bic, sum.model3$rsq, sum.model3$adjr2,sum.model3$cp)
model3.subsets <- as.data.frame(model3.subsets)
colnames(model3.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model3.subsets
##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    0   0  1   0   1   0   0       1     0
## 5            1    0  0     0    0   1  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  0     0    1   1  1   0   1   0   0       1     1
## 8            1    0  0     0    1   1  1   0   1   1   0       1     1
## 9            1    1  0     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8374 0.6756242 0.6723724  84.97961
## 5      1 -441.2403 0.6931232 0.6892680  61.14028
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -453.2593 0.7108382 0.7057267  38.98203
## 8      1 -454.6626 0.7160898 0.7103397  33.22737
## 9      1 -457.6475 0.7223352 0.7159926  26.00530
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
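Backward elimination can be sketched in the same spirit: start from the full model and repeatedly drop the predictor whose removal increases the residual sum of squares (RSS) the least. As before, this is an illustrative base R version rather than the regsubsets implementation, backward.order is a hypothetical helper name, and the full Boston data is used instead of the training split.

```r
library(MASS)  # provides the Boston data frame

# Greedy backward elimination by residual sum of squares (RSS)
backward.order <- function(data, response) {
  kept    <- setdiff(names(data), response)
  dropped <- character(0)
  while (length(kept) > 1) {
    # RSS of the model after removing each remaining predictor in turn
    rss <- sapply(kept, function(v) {
      f <- reformulate(setdiff(kept, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    weakest <- names(which.min(rss))   # its removal costs the least fit
    dropped <- c(dropped, weakest)
    kept    <- setdiff(kept, weakest)
  }
  list(dropped = dropped, last = kept)
}

bwd <- backward.order(Boston, "medv")
bwd$dropped   # predictors in the order they were removed
bwd$last      # the single predictor left at the end
```

Reading bwd$dropped from left to right gives the models of size 12, 11, 10 and so on, which is the backward table read from the bottom up.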
Plotting Model metrics
Comparing the 13 models of increasing size, we plot the model metrics to find the best model. R-squared keeps increasing as variables are added and hence always favors the model with the most variables.
The model with 11 variables gives the highest adjusted R-squared value and the lowest Cp and BIC values. This is consistent with the forward selection model.
#PLOTS OF R2, ADJ R2, CP, BIC#
rsq <- data.frame(round(sum.model3$rsq,5))
model3.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(round(sum.model3$adjr2,4))
model3.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(round(sum.model3$bic,4))
model3.bic.plot <- ggplot(data = bic, aes(y = bic, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(round(sum.model3$cp,4))
model3.cp.plot <- ggplot(data = cp, aes(y = cp, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model3.rsq.plot,model3.adjrsq.plot,model3.bic.plot,model3.cp.plot, ncol=2)
Selecting best subset
We find that the best model is the one with all variables except age and indus.
which.max(sum.model3$rsq)
## [1] 13
which.max(sum.model3$adjr2)
## [1] 11
which.min(sum.model3$cp)
## [1] 11
which.min(sum.model3$bic)
## [1] 11
coef(model3,11)
##  (Intercept)         crim           zn         chas          nox
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513
##           rm          dis          rad          tax      ptratio
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845
##        black        lstat
##   0.01043348  -0.55927297
Exhaustive Subset Selection
The last subset selection method is exhaustive search. Here we find the best subset of variables for each model size.
lstat, rm and ptratio are the most significant variables.
The following table shows the variables included in models of each size, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.
model4 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13)
sum.model4 <- summary(model4)
model4.subsets <- cbind(sum.model4$which, sum.model4$bic, sum.model4$rsq, sum.model4$adjr2,sum.model4$cp)
model4.subsets <- as.data.frame(model4.subsets)
colnames(model4.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model4.subsets
##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    1   0  1   0   0   0   0       1     0
## 5            1    0  0     0    0   1  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  1     0    1   1  1   0   1   0   0       1     0
## 8            1    0  1     0    1   1  1   0   1   0   0       1     1
## 9            1    1  0     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8811 0.6756594 0.6724078  84.92776
## 5      1 -441.2403 0.6931232 0.6892680  61.14028
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -454.2284 0.7115310 0.7064318  37.95897
## 8      1 -457.1622 0.7178410 0.7121264  30.64152
## 9      1 -457.6475 0.7223352 0.7159926  26.00530
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
Plotting Model metrics
Comparing the 13 models of increasing size, we plot the model metrics to find the best model. R-squared keeps increasing as variables are added and hence always favors the model with the most variables.
The model with 11 variables gives the highest adjusted R-squared value and the lowest Cp and BIC values.
#PLOTS OF R2, ADJ R2, CP, BIC#
rsq <- data.frame(round(sum.model4$rsq,5))
model4.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(round(sum.model4$adjr2,4))
model4.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(round(sum.model4$bic,4))
model4.bic.plot <- ggplot(data = bic, aes(y = bic, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(round(sum.model4$cp,4))
model4.cp.plot <- ggplot(data = cp, aes(y = cp, x = 1:13)) +
  geom_point() + geom_line() +
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model4.rsq.plot,model4.adjrsq.plot,model4.bic.plot,model4.cp.plot, ncol=2)
Selecting best subset
Again we find that the best model is the one with all variables except age and indus, so we use it as our selected model.
which.max(sum.model4$rsq)
## [1] 13
which.max(sum.model4$adjr2)
## [1] 11
which.min(sum.model4$cp)
## [1] 11
which.min(sum.model4$bic)
## [1] 11
coef(model4,11)
##  (Intercept)         crim           zn         chas          nox
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513
##           rm          dis          rad          tax      ptratio
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845
##        black        lstat
##   0.01043348  -0.55927297
SELECTED MODEL = 11, -AGE -INDUS
From our subset selection techniques, we select the model without indus and age as our best model. Summary of the model:
model.ss <- lm(medv ~ . -indus -age, data=Boston.train)
sum.model.ss <- summary(model.ss)
sum.model.ss
##
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston.train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.4298  -2.7600  -0.5466   1.6243  24.6067
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  41.416833   5.928789   6.986 1.22e-11 ***
## crim         -0.134279   0.039352  -3.412 0.000711 ***
## zn            0.053744   0.015242   3.526 0.000472 ***
## chas          3.493021   1.013727   3.446 0.000631 ***
## nox         -17.056505   3.994774  -4.270 2.46e-05 ***
## rm            3.156191   0.466475   6.766 4.83e-11 ***
## dis          -1.633661   0.218068  -7.492 4.56e-13 ***
## rad           0.345219   0.073382   4.704 3.53e-06 ***
## tax          -0.012803   0.003891  -3.291 0.001089 **
## ptratio      -0.988398   0.150104  -6.585 1.47e-10 ***
## black         0.010433   0.003043   3.429 0.000671 ***
## lstat        -0.559273   0.054487 -10.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.844 on 392 degrees of freedom
## Multiple R-squared:  0.7357, Adjusted R-squared:  0.7283
## F-statistic: 99.19 on 11 and 392 DF,  p-value: < 2.2e-16
Getting the model stats:
model.ss.mse <- (sum.model.ss$sigma)^2
model.ss.rsq <- sum.model.ss$r.squared
model.ss.arsq <- sum.model.ss$adj.r.squared
test.pred.model.ss <- predict(model.ss, newdata=Boston.test)
model.ss.mpse <- mean((Boston.test$medv-test.pred.model.ss)^2)
modelss.aic <- AIC(model.ss)
modelss.bic <- BIC(model.ss)
#ROW#
stats.model.ss <- c("model.SS", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse, modelss.aic, modelss.bic)
data.frame(cbind(comparison_table, stats.model.ss))
##     comparison_table    stats.model.ss
## 1         model type          model.SS
## 2                MSE  23.4685578277093
## 3          R-Squared 0.735693156684753
## 4 Adjusted R-Squared 0.728276383020295
## 5          Test MSPE  19.6614756361586
## 6                AIC   2435.2077778495
## 7                BIC  2487.22617126299
LASSO Variable Selection
Next we use the LASSO variable selection technique, which shrinks the coefficient estimates of non-significant variables to zero.
Here lambda is the penalty parameter that drives variable selection: the higher the lambda, the fewer variables are retained in the model.
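The way the penalty removes variables is easiest to see in the special case of standardized, mutually uncorrelated predictors, where (under one common scaling of the objective) the LASSO estimate is the least-squares estimate shrunk towards zero by lambda and set to zero once it is smaller than lambda in absolute value. The sketch below illustrates this soft-thresholding rule; the coefficient values are hypothetical and soft.threshold is a name introduced here. glmnet itself handles correlated predictors with an iterative algorithm, so this is an intuition aid rather than its implementation.

```r
# Soft-thresholding: shrink towards zero by lambda, zero out anything smaller
soft.threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Hypothetical least-squares coefficients of three standardized predictors
ols.coef <- c(x1 = 3.0, x2 = -0.5, x3 = 1.2)

small.penalty <- soft.threshold(ols.coef, lambda = 0.4)   # all three survive
large.penalty <- soft.threshold(ols.coef, lambda = 1.0)   # x2 is removed

small.penalty
large.penalty
```

With lambda = 1 the coefficient of x2 becomes exactly zero while x1 and x3 are only shrunk, which mirrors how a larger lambda leaves fewer variables in the model.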
STANDARDIZE COVARIATES
We need to standardize the variables before using them in model creation:
Boston.X.std <- scale(dplyr::select(Boston, -medv))
X.train <- as.matrix(Boston.X.std)[index,]
X.test <- as.matrix(Boston.X.std)[-index,]
Y.train <- Boston[index, "medv"]
Y.test <- Boston[-index, "medv"]
FIT MODEL
We fit the LASSO model to our data. From the plot below, we see that as the value of lambda keeps on increasing, the coefficients for the variables tend to 0.
lasso.fit <- glmnet(x=X.train, y=Y.train, alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)
CV TO GET OPTIMAL LAMBDA
Using cross-validation, we now find an appropriate value of lambda from the plot of cross-validated error versus lambda.
We take the lambda with the lowest cross-validated error (lambda.min) as well as the largest lambda whose error is within one standard error of that minimum (lambda.1se), and build a model for each. The larger lambda.1se selects fewer variables.
For the model with lambda.min, the coefficients of age and indus are reduced to zero. For the model with lambda.1se, the coefficients of indus, age, rad and tax are reduced to zero.
cv.lasso <- cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 10)
plot(cv.lasso)
names(cv.lasso)
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"
#Lambda with minimum error
cv.lasso$lambda.min
## [1] 0.02847133
#Largest lambda with error within 1 standard error of the minimum
cv.lasso$lambda.1se
## [1] 0.3198253
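The two lambda values can be recovered from the cvm and cvsd components listed by names(cv.lasso). The sketch below applies the one-standard-error rule to a small set of made-up cross-validation results; the numbers are hypothetical and are not taken from the fit above.

```r
# Hypothetical cross-validation results, ordered from largest to smallest lambda
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(30.0, 26.0, 24.5, 24.0, 24.2)   # mean CV error for each lambda
cvsd   <- c(2.0, 1.8, 1.6, 1.5, 1.5)        # standard error of each mean

# lambda.min: the lambda with the lowest mean CV error
i.min      <- which.min(cvm)
lambda.min <- lambda[i.min]

# lambda.1se: the largest lambda whose error is within one
# standard error of the minimum error
threshold  <- cvm[i.min] + cvsd[i.min]
lambda.1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambda.min, lambda.1se = lambda.1se)
```

Choosing lambda.1se trades a small, statistically indistinguishable increase in cross-validated error for a sparser model.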
#Coefficients for Lambda min
coef(lasso.fit, s=cv.lasso$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.4388805
## crim        -1.0571455
## zn           1.1597953
## indus        .
## chas         0.8851066
## nox         -1.8571240
## rm           2.2585490
## age          .
## dis         -3.2309659
## rad          2.5828855
## tax         -1.8169441
## ptratio     -2.0912414
## black        0.9226658
## lstat       -4.0046371
#Coefficients for lambda 1se
coef(lasso.fit, s=cv.lasso$lambda.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.4923614
## crim        -0.2688068
## zn           0.3094614
## indus        .
## chas         0.7857114
## nox         -0.5591879
## rm           2.5469712
## age          .
## dis         -1.1976543
## rad          .
## tax          .
## ptratio     -1.6591923
## black        0.6574595
## lstat       -4.0502353
Model Stats
Computing various model performance metrics:
#TRAIN DATA PREDICTION
pred.lasso.train.min <- predict(lasso.fit, newx = X.train, s=cv.lasso$lambda.min)
pred.lasso.train.1se <- predict(lasso.fit, newx = X.train, s=cv.lasso$lambda.1se)

#TEST DATA PREDICTION
pred.lasso.test.min <- predict(lasso.fit, newx = X.test, s=cv.lasso$lambda.min)
pred.lasso.test.1se <- predict(lasso.fit, newx = X.test, s=cv.lasso$lambda.1se)

#MSE (404 = number of training rows; the divisors are hard-coded)
lasso.min.mse <- sum((Y.train-pred.lasso.train.min)^2)/(404-14)
lasso.1se.mse <- sum((Y.train-pred.lasso.train.1se)^2)/(404-11)

#MSPE
lasso.min.mpse <- mean((Y.test-pred.lasso.test.min)^2)
lasso.1se.mpse <- mean((Y.test-pred.lasso.test.1se)^2)

#R_squared
sst <- sum((Y.train - mean(Y.train))^2)
sse_min <- sum((Y.train-pred.lasso.train.min)^2)
sse_1se <- sum((Y.train-pred.lasso.train.1se)^2)
rsq_min <- 1 - sse_min / sst
rsq_1se <- 1 - sse_1se / sst

#adj_R_squared
#adj r squared = 1 - ((n-1)/(n-p-1))(1-r_squared)
#p is hard-coded as 12 and 10; the fits above keep 11 (lambda.min) and 9 (lambda.1se) predictors
adj_rsq_min <- 1 - (dim(X.train)[1]-1)*(1-rsq_min)/(dim(X.train)[1]-12-1)
adj_rsq_1se <- 1 - (dim(X.train)[1]-1)*(1-rsq_1se)/(dim(X.train)[1]-10-1)

stats.model.lasso.min <- c("model.lasso.min", lasso.min.mse, rsq_min, adj_rsq_min, lasso.min.mpse)
stats.model.lasso.1se <- c("model.lasso.1se", lasso.1se.mse, rsq_1se, adj_rsq_1se, lasso.1se.mpse)
comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE")
data.frame(cbind(comparison_table, stats.model.lasso.min, stats.model.lasso.1se))
##     comparison_table stats.model.lasso.min stats.model.lasso.1se
## 1         model type       model.lasso.min       model.lasso.1se
## 2                MSE      23.6285729645258      26.3963732376923
## 3          R-Squared     0.735248738094418     0.701961239023836
## 4 Adjusted R-Squared     0.727123379672764     0.694377555538438
## 5          Test MSPE      19.2715641026489      18.8801839579017
Comparing models from Subset selection and LASSO with the Full model
Comparing the performance of the 4 models obtained so far:
- MSE: The MSEs of all models are comparable at around 23.5, except the LASSO.1se model, which gives an MSE of 26.40
- R-Squared: The full model performs best in this category, as expected, and the LASSO.1se model performs worst, again as expected
- Adjusted R-squared: A better metric for comparing models with different numbers of variables; the subset selection model performs best here
- Test MSPE: The LASSO.1se model performs best here with an MSPE of 18.88. All other models also do well, with values between 19.2 and 19.8
We select the subset selection model as our best model: Full model - age - indus
data.frame(cbind(comparison_table,
                 c("full", model1.mse, model1.rsq, model1.arsq, model1.mpse),
                 c("model.SS", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse),
                 stats.model.lasso.min,
                 stats.model.lasso.1se))
##     comparison_table                V2                V3
## 1         model type              full          model.SS
## 2                MSE    23.57194554198  23.4685578277093
## 3          R-Squared 0.735883231832148 0.735693156684753
## 4 Adjusted R-Squared 0.727079339559886 0.728276383020295
## 5          Test MSPE  19.7293018987321  19.6614756361586
##   stats.model.lasso.min stats.model.lasso.1se
## 1       model.lasso.min       model.lasso.1se
## 2      23.6285729645258      26.3963732376923
## 3     0.735248738094418     0.701961239023836
## 4     0.727123379672764     0.694377555538438
## 5      19.2715641026489      18.8801839579017
Residual Analysis plots
We do a quick residual analysis of the selected subset model and observe the following:
- The variance of the residuals is not completely constant, so the assumption of constant variance is not fully satisfied
- From the Q-Q plot we see that the residuals are not completely normal and are a little skewed to the right
- There is no autocorrelation observed in the model
- There are no observed outliers
par(mfrow=c(2,2))
plot(model.ss)
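The visual checks can be complemented with simple numeric ones. The following sketch, which is not part of the original analysis, refits the selected formula on the full Boston data (the exact training split depends on the R version), counts standardized residuals larger than 3 in absolute value as a rough outlier screen, and runs a Shapiro-Wilk normality test; the cutoff of 3 is a common rule of thumb, assumed here for illustration.

```r
library(MASS)  # provides the Boston data frame

# Selected formula, refitted on the full data for illustration only
fit <- lm(medv ~ . - indus - age, data = Boston)

std.res <- rstandard(fit)                 # standardized residuals
n.large <- sum(abs(std.res) > 3)          # rough screen for outlying points
sw      <- shapiro.test(residuals(fit))   # Shapiro-Wilk test of normality

n.large      # number of residuals beyond 3 standard deviations
sw$p.value   # a small p-value suggests departure from normality
```

These numbers are best read alongside the diagnostic plots, since a formal test on several hundred observations can flag departures that matter little in practice.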
Source: https://rstudio-pubs-static.s3.amazonaws.com/366382_50808315651c444fbccb04c60df8f041.html
Posted by: fivesreds.blogspot.com