The notable pairs with a good amount of correlation are -
- Illiteracy vs. Murder (direct relationship)
- Illiteracy vs. HS Grad (inverse relationship)
- Illiteracy vs. Frost (inverse relationship)
- Murder vs. Life Exp (inverse relationship)
- Frost vs. Murder (inverse relationship)
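These relationships can be read off the pairwise correlation matrix of the built-in `state.x77` dataset; a minimal sketch:

```r
# Pairwise correlations among all variables in state.x77
round(cor(state.x77), 2)

# e.g., the correlation behind "Illiteracy vs. Murder"
cor(state.x77[, "Illiteracy"], state.x77[, "Murder"])
```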
## [1] 80.82607
The predictor variables explain 80.83% of the variance in the murder rate across states.
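A sketch of how this figure can be obtained. The object name `murder_lm` and the conversion of `state.x77` to a data frame are assumptions; the `lm` call itself matches the one shown in the model summary below.

```r
# state.x77 is a matrix; convert to a data frame so column names like
# "Life Exp" become syntactic names (Life.Exp, HS.Grad)
state.x77 <- as.data.frame(state.x77)

# Full model: Murder regressed on all remaining variables
murder_lm <- lm(Murder ~ ., data = state.x77)

# Percentage of variance in Murder explained by the predictors
summary(murder_lm)$r.squared * 100
```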
We see in the above graph that the trend line fitted to the residuals has essentially zero slope, which statistically means that there is no linear relationship between the residuals and the fitted values. This is what a well-specified linear model should show: if the linearity assumption holds, the residuals should scatter randomly around zero with no trend against the fitted values.
A flat trend line by itself, however, does not tell us whether murder_lm satisfies the remaining model assumptions, which is my main concern.
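A sketch of how the residuals-versus-fitted plot can be drawn for murder_lm (the exact plotting call used here is an assumption):

```r
# Residuals vs. fitted values, with a zero reference line and a fitted trend line
plot(fitted(murder_lm), resid(murder_lm),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
abline(lm(resid(murder_lm) ~ fitted(murder_lm)), col = "red")  # trend line (slope ~ 0)
```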
Let us evaluate the assumptions of the linear model on murder_lm.
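The diagnostics below are consistent with the standard helpers in the car package; a sketch of the calls (the exact functions used are an assumption):

```r
library(car)

outlierTest(murder_lm)        # Bonferroni-adjusted test for the largest studentized residual
ncvTest(murder_lm)            # score test for non-constant error variance
spreadLevelPlot(murder_lm)    # prints a suggested power transformation
vif(murder_lm)                # variance inflation factors
sqrt(vif(murder_lm)) > 2      # rule-of-thumb check for problematic multicollinearity
```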
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## Nevada 2.316971 0.025578 NA
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.06620948 Df = 1 p = 0.7969379
##
## Suggested power transformation: 0.7326267
## Population Income Illiteracy Life.Exp HS.Grad Frost
## 1.342691 1.989395 4.135956 1.901430 3.437276 2.373463
## Area
## 1.690625
## Population Income Illiteracy Life.Exp HS.Grad Frost
## FALSE FALSE TRUE FALSE FALSE FALSE
## Area
## FALSE
- Global test of model assumptions
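A minimal sketch of how this global validation can be run; the gvlma call itself is confirmed by the output below, while the object name gvmodel is an assumption:

```r
library(gvlma)

# Global validation of linear model assumptions for the full model
gvmodel <- gvlma(murder_lm)
summary(gvmodel)   # prints the lm summary followed by the assessment table
```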
##
## Call:
## lm(formula = Murder ~ ., data = state.x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4452 -1.1016 -0.0598 1.1758 3.2355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.222e+02 1.789e+01 6.831 2.54e-08 ***
## Population 1.880e-04 6.474e-05 2.905 0.00584 **
## Income -1.592e-04 5.725e-04 -0.278 0.78232
## Illiteracy 1.373e+00 8.322e-01 1.650 0.10641
## Life.Exp -1.655e+00 2.562e-01 -6.459 8.68e-08 ***
## HS.Grad 3.234e-02 5.725e-02 0.565 0.57519
## Frost -1.288e-02 7.392e-03 -1.743 0.08867 .
## Area 5.967e-06 3.801e-06 1.570 0.12391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.746 on 42 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.7763
## F-statistic: 25.29 on 7 and 42 DF, p-value: 3.872e-13
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = murder_lm)
##
## Value p-value Decision
## Global Stat 1.60319 0.8082 Assumptions acceptable.
## Skewness 0.07024 0.7910 Assumptions acceptable.
## Kurtosis 0.93137 0.3345 Assumptions acceptable.
## Link Function 0.30714 0.5794 Assumptions acceptable.
## Heteroscedasticity 0.29444 0.5874 Assumptions acceptable.
Thus, we see from the above table that the following assumptions of the linear model are acceptable -
- Global Stat
- Skewness
- Kurtosis
- Link Function
- Heteroscedasticity
Thus, on the whole, the model is not the best, but it is acceptable.
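Next, stepwise selection is applied to the full model. The selection trace, the summary of the selected model, and the ANOVA comparison shown below can be reproduced along these lines; the object name step_lm, the direction argument, and the chi-squared test option are assumptions:

```r
library(MASS)

# Stepwise selection by AIC, starting from the full model
step_lm <- stepAIC(murder_lm, direction = "both")

# Summary of the selected (reduced) model
summary(step_lm)

# Compare the reduced model against the full model
anova(step_lm, murder_lm, test = "Chisq")
```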
## Start: AIC=63.01
## Murder ~ Population + Income + Illiteracy + Life.Exp + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Income 1 0.236 128.27 61.105
## - HS.Grad 1 0.973 129.01 61.392
## <none> 128.03 63.013
## - Area 1 7.514 135.55 63.865
## - Illiteracy 1 8.299 136.33 64.154
## - Frost 1 9.260 137.29 64.505
## - Population 1 25.719 153.75 70.166
## - Life.Exp 1 127.175 255.21 95.503
##
## Step: AIC=61.11
## Murder ~ Population + Illiteracy + Life.Exp + HS.Grad + Frost +
## Area
##
## Df Sum of Sq RSS AIC
## - HS.Grad 1 0.763 129.03 59.402
## <none> 128.27 61.105
## - Area 1 7.310 135.58 61.877
## - Illiteracy 1 8.715 136.98 62.392
## - Frost 1 9.345 137.61 62.621
## + Income 1 0.236 128.03 63.013
## - Population 1 27.142 155.41 68.702
## - Life.Exp 1 127.500 255.77 93.613
##
## Step: AIC=59.4
## Murder ~ Population + Illiteracy + Life.Exp + Frost + Area
##
## Df Sum of Sq RSS AIC
## <none> 129.03 59.402
## - Illiteracy 1 8.723 137.75 60.672
## + HS.Grad 1 0.763 128.27 61.105
## + Income 1 0.026 129.01 61.392
## - Frost 1 11.030 140.06 61.503
## - Area 1 15.937 144.97 63.225
## - Population 1 26.415 155.45 66.714
## - Life.Exp 1 140.391 269.42 94.213
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Life.Exp + Frost +
## Area, data = state.x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2976 -1.0711 -0.1123 1.1092 3.4671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.202e+02 1.718e+01 6.994 1.17e-08 ***
## Population 1.780e-04 5.930e-05 3.001 0.00442 **
## Illiteracy 1.173e+00 6.801e-01 1.725 0.09161 .
## Life.Exp -1.608e+00 2.324e-01 -6.919 1.50e-08 ***
## Frost -1.373e-02 7.080e-03 -1.939 0.05888 .
## Area 6.804e-06 2.919e-06 2.331 0.02439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.712 on 44 degrees of freedom
## Multiple R-squared: 0.8068, Adjusted R-squared: 0.7848
## F-statistic: 36.74 on 5 and 44 DF, p-value: 1.221e-14
## Analysis of Variance Table
##
## Model 1: Murder ~ Population + Illiteracy + Life.Exp + Frost + Area
## Model 2: Murder ~ Population + Income + Illiteracy + Life.Exp + HS.Grad +
## Frost + Area
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 44 129.03
## 2 42 128.03 2 0.99851 0.8489
The two models perform very similarly; the main difference is that the stepwise model does not include Income or HS.Grad among its predictors, whereas the full multiple regression model uses all available predictors.
Comparing the summaries of the two models, we see only a very minor difference in the residual standard error, the multiple R-squared, and the adjusted R-squared, so the models can be considered almost equivalent.
This is because the stepwise procedure eliminates the predictors with minimal impact on the response variable, so it uses fewer predictors than the full model and yields a slightly better, more parsimonious model.
Displaying predicted residual sum of squares (PRESS)
## [1] 162.7271
Displaying sum of squared errors of prediction (SSE)
## [1] 129.0316
Thus, we see that all but one of the 10 cross-validation lines have the same intercept and a similar slope, which suggests that the model is indeed generalizable. Moreover, the values of PRESS (predicted residual sum of squares) and SSE (sum of squared errors of prediction) suggest that the model's performance is decent enough.
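The PRESS and SSE values above can be computed directly from the stepwise model; a sketch, assuming the model object is called step_lm:

```r
# PRESS: sum of squared leave-one-out prediction errors,
# computed from the ordinary residuals and the hat (leverage) values
press <- sum((residuals(step_lm) / (1 - hatvalues(step_lm)))^2)
press

# SSE: sum of squared residuals of the fitted model
sse <- sum(residuals(step_lm)^2)
sse
```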
Assessing R-squared shrinkage using k-fold cross-validation
## [,1]
## Murder 0.8067654
## [,1]
## Murder 0.7522147
Thus, we see that the raw R-squared and the cross-validated R-squared are 0.807 and 0.752 respectively; the small amount of shrinkage suggests the model generalizes reasonably well.
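A sketch of one way to estimate this shrinkage, following the common crossval-based helper built on the bootstrap package; the helper function, its name, and k = 10 folds are all assumptions:

```r
library(bootstrap)

# Compare raw R-squared with k-fold cross-validated R-squared
shrinkage <- function(fit, k = 10) {
  theta.fit <- function(x, y) lsfit(x, y)
  theta.predict <- function(fit, x) cbind(1, x) %*% fit$coefficients

  x <- as.matrix(fit$model[, -1])   # predictor columns
  y <- fit$model[, 1]               # response (Murder)

  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  r2   <- cor(y, fit$fitted.values)^2   # raw R-squared
  r2cv <- cor(y, results$cv.fit)^2      # cross-validated R-squared
  c(raw = r2, cv = r2cv)
}

shrinkage(step_lm)
```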
##
## Regression tree:
## tree(formula = Murder ~ state.x77$Life.Exp + Population + Illiteracy +
## Frost + Area, data = state.x77)
## Variables actually used in tree construction:
## [1] "state.x77$Life.Exp" "Population" "Illiteracy"
## [4] "Frost"
## Number of terminal nodes: 7
## Residual mean deviance: 2.813 = 121 / 43
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.50000 -1.18900 0.02222 0.00000 0.74290 4.02000
The output above summarizes the regression tree fit on the state.x77 data; the residuals range from -3.50 to 4.02.
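A sketch of how this tree can be fit with the tree package. The output above references state.x77$Life.Exp explicitly, but with data = state.x77 the bare column name is equivalent; the object name murder_tree is an assumption:

```r
library(tree)

# Regression tree for Murder using the stepwise-selected predictors
murder_tree <- tree(Murder ~ Life.Exp + Population + Illiteracy + Frost + Area,
                    data = state.x77)
summary(murder_tree)
```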
Let us use cross-validation to select the “best” tree -
The “best” tree is the one with the minimum deviance. From the above graph, we can see that deviance is lowest at the largest tree size (i.e., 7), which means the original regression tree that we fit already performs best, and we need not prune it in an attempt to improve model performance.
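A sketch of the cross-validation step behind that graph, using cv.tree from the tree package (the seed and plotting details are assumptions):

```r
# Cross-validate the tree over a sequence of pruned sizes
set.seed(1)  # CV folds are random; seed chosen here only for reproducibility
cv_murder <- cv.tree(murder_tree)

# Deviance against tree size (number of terminal nodes)
plot(cv_murder$size, cv_murder$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "Deviance")
```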
Predicting murder rates for the given values of the predictor variables using the stepwise-selection model
Calculating the mean squared error of the fitted values
## [1] 2.580632
Predicting murder rates for the given values of the predictor variables using the regression tree model
Calculating the mean squared error of the fitted values
## [1] 2.41919
The mean squared error is lower for the regression tree than for the stepAIC model, so I prefer the regression tree.
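A sketch of how both mean squared errors can be computed from the fitted values, assuming the objects step_lm and murder_tree from the sketches above:

```r
# MSE of the stepwise linear model over the 50 states
mse_step <- mean((state.x77$Murder - predict(step_lm))^2)
mse_step   # approximately 2.58

# MSE of the regression tree over the 50 states
mse_tree <- mean((state.x77$Murder - predict(murder_tree))^2)
mse_tree   # approximately 2.42
```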