Sunday, January 17, 2016

Test a Basic Linear Regression Model (Python)

Introduction

Dataset: gapminder.csv

Response Variable: Internet Use Rate

Explanatory Variable: Polity Score

Both variables are quantitative. The explanatory variable "polityscore" is centered: its mean is subtracted so that the centered variable has mean 0.
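As a minimal sketch of what centering does, the snippet below uses a handful of made-up polity scores (hypothetical values, not taken from gapminder.csv). It also shows how centering differs from full standardization, which additionally divides by the standard deviation.

```python
import numpy as np

# Hypothetical polity scores for illustration only; NaN marks a missing value.
polity = np.array([-10.0, -3.0, 0.0, 6.0, 10.0, np.nan])

# Centering: subtract the mean of the observed (non-missing) values.
centered = polity - np.nanmean(polity)

# Standardizing goes one step further and divides by the sample standard deviation.
standardized = centered / np.nanstd(polity, ddof=1)

print("mean of centered values:", np.nanmean(centered))
print("std of centered values:", np.nanstd(centered, ddof=1))
print("std of standardized values:", np.nanstd(standardized, ddof=1))
```

The mean of the centered values is zero up to floating-point rounding, while their spread is unchanged. Only the standardized version has a standard deviation of 1.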

Python Output

Mean of centered polityscore (effectively zero, up to floating-point rounding)

-3.91681166451608e-16



-----------

Conclusion:

Consider the linear regression model Y = b0 + b1X + c, where c is the random error term, assumed to be independent and identically distributed and to follow a normal distribution.

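From the output, b1 is 1.6043 and its p-value is less than 0.05, so the slope is statistically significant and the null hypothesis that b1 is zero is rejected. The intercept is 32.2811 (%, internet use rate); because the polity score is centered, this is the predicted internet use rate at the mean polity score. Therefore, according to the simple linear regression analysis, the polity score (the democracy level) is significantly associated with the internet use rate: each one-point increase in polity score corresponds to an estimated increase of 1.6043 percentage points in internet use rate.
The model is: Y = 32.2811 + 1.6043X + c.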
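To make the model concrete, here is a small sketch that plugs a few centered polity scores into the fitted line. The coefficients are the ones reported above. The function name and the example inputs are my own choices for illustration, and the error term is left out because only the expected value is being predicted.

```python
# Coefficients reported in the regression output above.
B0 = 32.2811  # intercept: predicted internet use rate (%) at the mean polity score
B1 = 1.6043   # slope: change in internet use rate per one-point change in polity score


def predict_internet_use(centered_polity):
    """Predicted internet use rate (%) for a centered polity score (hypothetical helper)."""
    return B0 + B1 * centered_polity


# Example inputs: 5 points below the mean, at the mean, and 5 points above the mean.
for x in (-5.0, 0.0, 5.0):
    print(x, round(predict_internet_use(x), 4))
```

An input of 0 corresponds to a country with the average polity score, so the prediction there is just the intercept.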

Python Code:
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt



data = pandas.read_csv('gapminder.csv')


data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

# Center polityscore by subtracting its mean.
# Missing values are skipped when the mean is computed and stay missing afterwards;
# the regression below drops those rows automatically.
m = numpy.mean(data['polityscore'])
data['polityscore2'] = data['polityscore'] - m
a = numpy.mean(data['polityscore2'])
print('Mean of centered polityscore')
print(a)


scat1 = seaborn.regplot(x="polityscore2", y="internetuserate", scatter=True, data=data)
plt.xlabel('polityscore')
plt.ylabel('Internet Use Rate')
plt.title ('Scatterplot for the Association Between Polityscore and Internet Use Rate')
plt.show()

print ("OLS regression model for the association between polity score and internet use rate")
reg1 = smf.ols('internetuserate ~ polityscore2', data=data).fit()
print (reg1.summary())





