The notable pairs with a good amount of correlation are -
- Illiteracy vs. Murder (direct relationship)
- Illiteracy vs. HS Grad (inverse relationship)
- Illiteracy vs. Frost (inverse relationship)
- Murder vs. Life Exp (inverse relationship)
- Frost vs. Murder (inverse relationship)
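These relationships can be read off the pairwise correlation matrix of the built-in `state.x77` dataset; a minimal sketch:

```r
# Pairwise correlations among all variables in state.x77
round(cor(state.x77), 2)

# e.g., the correlation behind "Illiteracy vs. Murder"
cor(state.x77[, "Illiteracy"], state.x77[, "Murder"])
```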
## [1] 80.82607
The predictor variables explain 80.83% of the variance in the murder rate across states.
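A sketch of how this figure can be obtained. The object name `murder_lm` and the conversion of `state.x77` to a data frame are assumptions; the `lm` call itself matches the one shown in the model summary below.

```r
# state.x77 is a matrix; convert to a data frame so column names like
# "Life Exp" become syntactic names (Life.Exp, HS.Grad)
state.x77 <- as.data.frame(state.x77)

# Full model: Murder regressed on all remaining variables
murder_lm <- lm(Murder ~ ., data = state.x77)

# Percentage of variance in Murder explained by the predictors
summary(murder_lm)$r.squared * 100
```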
We see in the above graph that the trend line fitted to the residuals has essentially zero slope, which statistically means that there is no linear relationship between the residuals and the fitted values. This is what a well-specified linear model should show: if the linearity assumption holds, the residuals should scatter randomly around zero with no trend against the fitted values.
A flat trend line by itself, however, does not tell us whether murder_lm satisfies the remaining model assumptions, which is my main concern.
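A sketch of how the residuals-versus-fitted plot can be drawn for murder_lm (the exact plotting call used here is an assumption):

```r
# Residuals vs. fitted values, with a zero reference line and a fitted trend line
plot(fitted(murder_lm), resid(murder_lm),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
abline(lm(resid(murder_lm) ~ fitted(murder_lm)), col = "red")  # trend line (slope ~ 0)
```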
Let us evaluate the assumptions of the linear model on murder_lm.
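The diagnostics below are consistent with the standard helpers in the car package; a sketch of the calls (the exact functions used are an assumption):

```r
library(car)

outlierTest(murder_lm)        # Bonferroni-adjusted test for the largest studentized residual
ncvTest(murder_lm)            # score test for non-constant error variance
spreadLevelPlot(murder_lm)    # prints a suggested power transformation
vif(murder_lm)                # variance inflation factors
sqrt(vif(murder_lm)) > 2      # rule-of-thumb check for problematic multicollinearity
```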
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## Nevada 2.316971 0.025578 NA
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.06620948 Df = 1 p = 0.7969379
##
## Suggested power transformation: 0.7326267
## Population Income Illiteracy Life.Exp HS.Grad Frost
## 1.342691 1.989395 4.135956 1.901430 3.437276 2.373463
## Area
## 1.690625
## Population Income Illiteracy Life.Exp HS.Grad Frost
## FALSE FALSE TRUE FALSE FALSE FALSE
## Area
## FALSE
- Global test of model assumptions
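A minimal sketch of how this global validation can be run; the gvlma call itself is confirmed by the output below, while the object name gvmodel is an assumption:

```r
library(gvlma)

# Global validation of linear model assumptions for the full model
gvmodel <- gvlma(murder_lm)
summary(gvmodel)   # prints the lm summary followed by the assessment table
```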
##
## Call:
## lm(formula = Murder ~ ., data = state.x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4452 -1.1016 -0.0598 1.1758 3.2355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.222e+02 1.789e+01 6.831 2.54e-08 ***
## Population 1.880e-04 6.474e-05 2.905 0.00584 **
## Income -1.592e-04 5.725e-04 -0.278 0.78232
## Illiteracy 1.373e+00 8.322e-01 1.650 0.10641
## Life.Exp -1.655e+00 2.562e-01 -6.459 8.68e-08 ***
## HS.Grad 3.234e-02 5.725e-02 0.565 0.57519
## Frost -1.288e-02 7.392e-03 -1.743 0.08867 .
## Area 5.967e-06 3.801e-06 1.570 0.12391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.746 on 42 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.7763
## F-statistic: 25.29 on 7 and 42 DF, p-value: 3.872e-13
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = murder_lm)
##
## Value p-value Decision
## Global Stat 1.60319 0.8082 Assumptions acceptable.
## Skewness 0.07024 0.7910 Assumptions acceptable.
## Kurtosis 0.93137 0.3345 Assumptions acceptable.
## Link Function 0.30714 0.5794 Assumptions acceptable.
## Heteroscedasticity 0.29444 0.5874 Assumptions acceptable.
Thus, we see from the above table that the following assumptions of the linear model are acceptable -
- Global Stat
- Skewness
- Kurtosis
- Link Function
- Heteroscedasticity
Thus, on the whole, the model is not the best, but it is acceptable.
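Next, stepwise selection is applied to the full model. The selection trace, the summary of the selected model, and the ANOVA comparison shown below can be reproduced along these lines; the object name step_lm, the direction argument, and the chi-squared test option are assumptions:

```r
library(MASS)

# Stepwise selection by AIC, starting from the full model
step_lm <- stepAIC(murder_lm, direction = "both")

# Summary of the selected (reduced) model
summary(step_lm)

# Compare the reduced model against the full model
anova(step_lm, murder_lm, test = "Chisq")
```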
## Start: AIC=63.01
## Murder ~ Population + Income + Illiteracy + Life.Exp + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Income 1 0.236 128.27 61.105
## - HS.Grad 1 0.973 129.01 61.392
## <none> 128.03 63.013
## - Area 1 7.514 135.55 63.865
## - Illiteracy 1 8.299 136.33 64.154
## - Frost 1 9.260 137.29 64.505
## - Population 1 25.719 153.75 70.166
## - Life.Exp 1 127.175 255.21 95.503
##
## Step: AIC=61.11
## Murder ~ Population + Illiteracy + Life.Exp + HS.Grad + Frost +
## Area
##
## Df Sum of Sq RSS AIC
## - HS.Grad 1 0.763 129.03 59.402
## <none> 128.27 61.105
## - Area 1 7.310 135.58 61.877
## - Illiteracy 1 8.715 136.98 62.392
## - Frost 1 9.345 137.61 62.621
## + Income 1 0.236 128.03 63.013
## - Population 1 27.142 155.41 68.702
## - Life.Exp 1 127.500 255.77 93.613
##
## Step: AIC=59.4
## Murder ~ Population + Illiteracy + Life.Exp + Frost + Area
##
## Df Sum of Sq RSS AIC
## <none> 129.03 59.402
## - Illiteracy 1 8.723 137.75 60.672
## + HS.Grad 1 0.763 128.27 61.105
## + Income 1 0.026 129.01 61.392
## - Frost 1 11.030 140.06 61.503
## - Area 1 15.937 144.97 63.225
## - Population 1 26.415 155.45 66.714
## - Life.Exp 1 140.391 269.42 94.213
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Life.Exp + Frost +
## Area, data = state.x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2976 -1.0711 -0.1123 1.1092 3.4671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.202e+02 1.718e+01 6.994 1.17e-08 ***
## Population 1.780e-04 5.930e-05 3.001 0.00442 **
## Illiteracy 1.173e+00 6.801e-01 1.725 0.09161 .
## Life.Exp -1.608e+00 2.324e-01 -6.919 1.50e-08 ***
## Frost -1.373e-02 7.080e-03 -1.939 0.05888 .
## Area 6.804e-06 2.919e-06 2.331 0.02439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.712 on 44 degrees of freedom
## Multiple R-squared: 0.8068, Adjusted R-squared: 0.7848
## F-statistic: 36.74 on 5 and 44 DF, p-value: 1.221e-14
## Analysis of Variance Table
##
## Model 1: Murder ~ Population + Illiteracy + Life.Exp + Frost + Area
## Model 2: Murder ~ Population + Income + Illiteracy + Life.Exp + HS.Grad +
## Frost + Area
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 44 129.03
## 2 42 128.03 2 0.99851 0.8489
The two models perform very similarly; the main difference is that the stepwise model does not include Income or HS.Grad among its predictors, whereas the full multiple regression model uses all available predictors.
Comparing the summaries of the two models, we see only a very minor difference in the residual standard error, the multiple R-squared, and the adjusted R-squared, so the models can be considered almost equivalent.
This is because the stepwise procedure eliminates the predictors with minimal impact on the response variable, so it uses fewer predictors than the full model and yields a slightly better, more parsimonious model.
Displaying predicted residual sum of squares (PRESS)
## [1] 162.7271
Displaying sum of squared errors of prediction (SSE)
## [1] 129.0316
Thus, we see that all but one of the 10 cross-validation lines have the same intercept and a similar slope, which suggests that the model is indeed generalizable. Moreover, the values of PRESS (predicted residual sum of squares) and SSE (sum of squared errors of prediction) suggest that the model's performance is decent enough.
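The PRESS and SSE values above can be computed directly from the stepwise model; a sketch, assuming the model object is called step_lm:

```r
# PRESS: sum of squared leave-one-out prediction errors,
# computed from the ordinary residuals and the hat (leverage) values
press <- sum((residuals(step_lm) / (1 - hatvalues(step_lm)))^2)
press

# SSE: sum of squared residuals of the fitted model
sse <- sum(residuals(step_lm)^2)
sse
```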
Assessing R-squared shrinkage using k-fold cross-validation
## [,1]
## Murder 0.8067654
## [,1]
## Murder 0.7522147
Thus, we see that the raw R-squared and the cross-validated R-squared are 0.807 and 0.752 respectively; the small amount of shrinkage suggests the model generalizes reasonably well.
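A sketch of one way to estimate this shrinkage, following the common crossval-based helper built on the bootstrap package; the helper function, its name, and k = 10 folds are all assumptions:

```r
library(bootstrap)

# Compare raw R-squared with k-fold cross-validated R-squared
shrinkage <- function(fit, k = 10) {
  theta.fit <- function(x, y) lsfit(x, y)
  theta.predict <- function(fit, x) cbind(1, x) %*% fit$coefficients

  x <- as.matrix(fit$model[, -1])   # predictor columns
  y <- fit$model[, 1]               # response (Murder)

  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  r2   <- cor(y, fit$fitted.values)^2   # raw R-squared
  r2cv <- cor(y, results$cv.fit)^2      # cross-validated R-squared
  c(raw = r2, cv = r2cv)
}

shrinkage(step_lm)
```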
##
## Regression tree:
## tree(formula = Murder ~ state.x77$Life.Exp + Population + Illiteracy +
## Frost + Area, data = state.x77)
## Variables actually used in tree construction:
## [1] "state.x77$Life.Exp" "Population" "Illiteracy"
## [4] "Frost"
## Number of terminal nodes: 7
## Residual mean deviance: 2.813 = 121 / 43
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.50000 -1.18900 0.02222 0.00000 0.74290 4.02000
The output above summarizes the regression tree fit on the state.x77 data; the residuals range from -3.50 to 4.02.
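A sketch of how this tree can be fit with the tree package. The output above references state.x77$Life.Exp explicitly, but with data = state.x77 the bare column name is equivalent; the object name murder_tree is an assumption:

```r
library(tree)

# Regression tree for Murder using the stepwise-selected predictors
murder_tree <- tree(Murder ~ Life.Exp + Population + Illiteracy + Frost + Area,
                    data = state.x77)
summary(murder_tree)
```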
Let us use cross-validation to select the “best” tree -
The “best” tree is the one with the minimum deviance. From the above graph, we can see that deviance is lowest at the largest tree size (i.e., 7), which means the original regression tree that we fit already performs best, and we need not prune it in an attempt to improve model performance.
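A sketch of the cross-validation step behind that graph, using cv.tree from the tree package (the seed and plotting details are assumptions):

```r
# Cross-validate the tree over a sequence of pruned sizes
set.seed(1)  # CV folds are random; seed chosen here only for reproducibility
cv_murder <- cv.tree(murder_tree)

# Deviance against tree size (number of terminal nodes)
plot(cv_murder$size, cv_murder$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "Deviance")
```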
Predicting murder rates for the given values of the predictor variables using the stepwise-selection model
Calculating the mean squared error of the fitted values
## [1] 2.580632
Predicting murder rates for the given values of the predictor variables using the regression tree model
Calculating the mean squared error of the fitted values
## [1] 2.41919
The mean squared error is lower for the regression tree than for the stepAIC model, so I prefer the regression tree.
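A sketch of how both mean squared errors can be computed from the fitted values, assuming the objects step_lm and murder_tree from the sketches above:

```r
# MSE of the stepwise linear model over the 50 states
mse_step <- mean((state.x77$Murder - predict(step_lm))^2)
mse_step   # approximately 2.58

# MSE of the regression tree over the 50 states
mse_tree <- mean((state.x77$Murder - predict(murder_tree))^2)
mse_tree   # approximately 2.42
```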