Data Analysis and Interpretation Specialization: Test a Multiple Regression Model (python)

Introduction:

Data set: "gapminer.csv"
Response Variable: "internetuserate" - (Internet Use Rate)
Explanatory variable 1 : "urbanrate" -(Urbanization Rate)
Explanatory variable 2 : "polityscore" -(polity score)

Python Output 1:

---------------------
From the graph, both two plots are set with order 2 ( second order of the polynomial) to fit the spots.

Python Output 2:
urbanrate_c polityscore_c
count 155.00 155.00
mean 0.00 0.00
std 22.75 6.24
min -44.97 -13.85
25% -18.82 -5.35
50% 1.91 3.15
75% 16.86 5.15

max 44.63 6.15

--------------------------------------
See the residual plots, both two have obvious pattern and trend but not in random. This violates the the assumption of independently random distribution of the error terms.

Python Output 3:

------------------
The qq-plot regression analysis and the leverage plot is for the model:

internetuserate = beta_0+(beta_1) urbanrate_c + (beta_2)polityscore_c+(beta_3)urbanrate_c**2+(beta_4)polityscore_c**4

And the results of parameters:
beta_0=16.2094
beta_1=0.5783
beta_2=2.9886
beta_3=0.0052
beta_4=0.3525

All their p-values are lower than the 0.05 significant level and accept these parameters.

Conclusion:
My chosen model:
internetuserate =16.2094+(0.5783) urbanrate_c + (2.9886)polityscore_c+(0.0052)urbanrate_c**2+(0.3525)polityscore_c**4

Although the partial residual plots show it violates the assumption and outliers presents, the model meets normality assumption and the all the parameters are significant in regression analysis.

But ideally, more work can be done such as removal of the outliers and repeat the regression analysis,
remove highly correlated predictors, etc.

Python Code:

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn

pandas.set_option('display.float_format', lambda x:'%.2f'%x)

data = pandas.read_csv('gapminder.csv')

data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

sub1 = data[['urbanrate', 'internetuserate', 'polityscore']].dropna()

scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True,order=2, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

plt.show()

scat2 = seaborn.regplot(x="polityscore", y="internetuserate", scatter=True,order=2, data=sub1)
plt.xlabel('Polity score')
plt.ylabel('Internet Use Rate')
plt.show()

sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['polityscore_c'] = (sub1['polityscore'] - sub1['polityscore'].mean())
sub1[["urbanrate_c", "polityscore_c"]].describe()

reg1 = smf.ols('internetuserate ~ urbanrate_c + polityscore_c+I(urbanrate_c**2)+I(polityscore_c**2)', data=sub1).fit()
print (reg1.summary())

fig1=sm.qqplot(reg1.resid, line='r')

stdres=pandas.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg1, "urbanrate_c", fig=fig2)
plt.show()

fig3 = plt.figure(figsize=(12,8))
fig3 = sm.graphics.plot_regress_exog(reg1, "polityscore_c", fig=fig3)
plt.show()

fig4=sm.graphics.influence_plot(reg1, size=8)
print(fig2)

fig5=sm.graphics.influence_plot(reg1, size=8)

print(fig3)

Data Analysis and Interpretation Specialization

Saturday, January 23, 2016

Test a Multiple Regression Model (python)

No comments:

Post a Comment