Saturday, January 23, 2016

Test a Multiple Regression Model (Python)

Introduction:

Data set: "gapminer.csv"
Response Variable: "internetuserate"    - (Internet Use Rate)
Explanatory variable 1 : "urbanrate"     -(Urbanization Rate)
Explanatory variable 2 :  "polityscore"  -(polity score)

Python Output 1:

---------------------
In the graphs, both fitted curves are drawn with order 2 (a second-order polynomial) to follow the curvature of the points.
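As an optional check (not part of the original output), the quadratic trend that seaborn's regplot draws with order=2 can be reproduced directly with numpy.polyfit; a minimal sketch, assuming the cleaned data frame sub1 built in the code below:

import numpy

# sketch: fit a second-order polynomial to the same points that regplot(order=2) smooths
coeffs = numpy.polyfit(sub1['urbanrate'], sub1['internetuserate'], deg=2)
print(coeffs)   # returns [c2, c1, c0] for the curve c2*x**2 + c1*x + c0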


Python Output 2:
       urbanrate_c  polityscore_c
count       155.00         155.00
mean          0.00           0.00
std          22.75           6.24
min         -44.97         -13.85
25%         -18.82          -5.35
50%           1.91           3.15
75%          16.86           5.15
max          44.63           6.15

--------------------------------------
Looking at the residual plots, both show an obvious pattern and trend rather than random scatter. This violates the assumption that the error terms are independently and randomly distributed.
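As a supplement to the visual check (an optional sketch, not part of the original analysis), the Durbin-Watson statistic can be computed on the residuals of the fitted model reg1 from the code below; a value far from 2 also points to correlated, non-random residuals:

from statsmodels.stats.stattools import durbin_watson

# sketch: a Durbin-Watson statistic near 2 suggests uncorrelated residuals;
# values well below or above 2 indicate positive or negative autocorrelation
print(durbin_watson(reg1.resid))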


Python Output 3:




------------------
The Q-Q plot, the regression diagnostic plots and the leverage plot are for the model:

internetuserate = beta_0 + beta_1 * urbanrate_c + beta_2 * polityscore_c + beta_3 * urbanrate_c**2 + beta_4 * polityscore_c**2

And the estimated parameters are:
beta_0=16.2094
beta_1=0.5783
beta_2=2.9886
beta_3=0.0052
beta_4=0.3525

All of their p-values are below the 0.05 significance level, so all the parameters are retained as statistically significant.
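These estimates and p-values can also be pulled directly from the fitted results object rather than read off the summary table; a minimal sketch, assuming reg1 is the model fitted in the code below:

# sketch: the coefficients and p-values reported above live on the fitted results object
print(reg1.params)     # beta_0 ... beta_4
print(reg1.pvalues)    # each reported as below 0.05 in the summary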


Conclusion:
My chosen model:
internetuserate = 16.2094 + 0.5783 * urbanrate_c + 2.9886 * polityscore_c + 0.0052 * urbanrate_c**2 + 0.3525 * polityscore_c**2

Although the partial residual plots show that the model violates the independence assumption and outliers are present, the model meets the normality assumption and all the parameters are significant in the regression analysis.

Ideally, though, more work could be done, such as removing the outliers and repeating the regression analysis, or removing highly correlated predictors.
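For the highly-correlated-predictors point above, one possible sketch (not part of the original analysis) uses variance inflation factors from statsmodels on the two centered predictors:

import pandas
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# sketch: VIF values well above roughly 5-10 would flag problematic collinearity;
# the VIF for the added constant term can be ignored
X = add_constant(sub1[['urbanrate_c', 'polityscore_c']])
vifs = pandas.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                     index=X.columns)
print(vifs)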

Python Code:

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn

pandas.set_option('display.float_format', lambda x:'%.2f'%x)

data = pandas.read_csv('gapminder.csv')


# convert the variables of interest to numeric; invalid entries become NaN
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

# keep the three variables of interest and drop rows with missing values
sub1 = data[['urbanrate', 'internetuserate', 'polityscore']].dropna()


# scatterplot of internet use rate vs. urbanization rate, with linear and second-order fits
scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True, order=2, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

plt.show()

# scatterplot of internet use rate vs. polity score with a second-order fit
scat2 = seaborn.regplot(x="polityscore", y="internetuserate", scatter=True, order=2, data=sub1)
plt.xlabel('Polity Score')
plt.ylabel('Internet Use Rate')
plt.show()



# center the explanatory variables by subtracting their means
sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['polityscore_c'] = (sub1['polityscore'] - sub1['polityscore'].mean())
print(sub1[["urbanrate_c", "polityscore_c"]].describe())


# multiple regression with linear and quadratic terms for the centered predictors
reg1 = smf.ols('internetuserate ~ urbanrate_c + polityscore_c + I(urbanrate_c**2) + I(polityscore_c**2)',
               data=sub1).fit()
print(reg1.summary())

# Q-Q plot of the residuals to check the normality assumption
fig1 = sm.qqplot(reg1.resid, line='r')
plt.show()

# standardized residuals plotted against observation number
stdres = pandas.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()

# diagnostic plots (including the partial regression plot) for urbanrate_c
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg1, "urbanrate_c", fig=fig2)
plt.show()

# diagnostic plots (including the partial regression plot) for polityscore_c
fig3 = plt.figure(figsize=(12,8))
fig3 = sm.graphics.plot_regress_exog(reg1, "polityscore_c", fig=fig3)
plt.show()

# leverage/influence plot to identify outliers and influential observations
fig4 = sm.graphics.influence_plot(reg1, size=8)
plt.show()


Sunday, January 17, 2016

Test a Basic Linear Regression Model (Python)

Introduction

Dataset: gapminder.csv

Response Variable: Internet Use Rate

Explanatory Variable: Polity Score

The two variables are both quantitative, and the explanatory variable "polityscore" will be centered to have mean 0.

Python Output

Mean of centered polityscore (the mean is essentially zero)

-3.91681166451608e-16



-----------

Conclusion:

Consider the linear regression model: Y = b0 + b1*X + e, where e is the error term, assumed to be random, independently and identically distributed, and normally distributed.

From the output, b1 is 1.6043 and its p-value is less than 0.05. It is significant, and the null hypothesis that b1 is zero is rejected. The intercept is 32.2811 (%, internet use rate). Therefore, the polity score (the level of democracy) does affect the internet use rate according to the simple linear regression analysis.
And the fitted model is: Y = 32.2811 + 1.6043*X + e.
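As a quick illustration of the fitted equation (the polity score value below is hypothetical, not taken from the output), a prediction can be computed by hand:

# sketch: predicted internet use rate from Y = 32.2811 + 1.6043 * X,
# where X is the centered polity score; X = 5 is a hypothetical example value
x_centered = 5.0
y_hat = 32.2811 + 1.6043 * x_centered
print(y_hat)   # about 40.3 (% internet use rate)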

Python Code:
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt



data = pandas.read_csv('gapminder.csv')


data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

# drop rows with missing values, then center polityscore so its mean is (approximately) zero
data = data.dropna(subset=['internetuserate', 'polityscore'])
m = numpy.mean(data['polityscore'])
data['polityscore2'] = data['polityscore'] - m
a = numpy.mean(data['polityscore2'])
print('Mean of centered polityscore')
print(a)


scat1 = seaborn.regplot(x="polityscore2", y="internetuserate", scatter=True, data=data)
plt.xlabel('polityscore')
plt.ylabel('Internet Use Rate')
plt.title ('Scatterplot for the Association Between Polityscore and Internet Use Rate')
print(scat1)

print ("OLS regression model for the association between urban rate and internet use rate")
reg1 = smf.ols('internetuserate ~ polityscore2', data=data).fit()
print (reg1.summary())






Saturday, January 9, 2016

Testing a Potential Moderator (Python)

Introduction:
From the "gapminder" dataset, the variables "urbanrate" and "internetuserate" are going to be tested for their correlation with the potential moderator "politiyscore".

"polityscore" has been divided into 4 levels : No_Democracy, Unsatisfied_Democracy, Satisfied_Democracy and Good_Democracy. The purpose of this research is to investigate the effect of this variable on the correlation between "urbanrate" and "internetuserate".

Python Output1:

association between urbanrate and internetuserate for NO_Democracy countries
(0.60873731612685533, 0.0020528103271054998)
     
association between urbanrate and internetuserate for Unsatisfied_Democracy countries
(0.70570435251349417, 8.1192786139000039e-05)
     
association between urbanrate and internetuserate for Satisfied_Democracy countries
(0.45972994415542473, 0.047663537308730138)
association between urbanrate and internetuserate for Good_Democracy countries

(0.68152965983424918, 2.7020062268884203e-13)





The values of the 4 correlation coefficients (plotted as a bar chart)

---------

Conclusion:
The Pearson correlation coefficient shows no obvious pattern or trend across the levels of the candidate moderator "polityscore". In other words, "polityscore" is not a significant moderator of the relationship between "urbanrate" and "internetuserate".
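To go beyond eyeballing the four coefficients, one optional follow-up (not part of the original analysis) would be a Fisher z-test comparing two of the subgroup correlations; the group sizes used below are hypothetical placeholders, since the real counts come from the value_counts output in the code below:

import numpy
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    # Fisher z-test for the difference between two independent correlations
    z1, z2 = numpy.arctanh(r1), numpy.arctanh(r2)
    se = numpy.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value

# hypothetical group sizes n1 and n2; substitute the actual counts from chk1
print(compare_correlations(0.46, 19, 0.71, 26))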

Python Code
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# convert the variables of interest to numeric; invalid entries become NaN
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')

data_clean=data.dropna()

print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['internetuserate']))

# collapse polityscore into 4 ordered levels:
# 1 = NO_Democracy, 2 = Unsatisfied_Democracy, 3 = Satisfied_Democracy, 4 = Good_Democracy
def politysco (row):
   if row['polityscore'] <= -5:
      return 1
   elif row['polityscore'] <= 0 :
      return 2
   elif row['polityscore'] <= 5:
      return 3
   elif row['polityscore'] > 5:
      return 4
data_clean['politysco'] = data_clean.apply (lambda row: politysco (row),axis=1)

chk1 = data_clean['politysco'].value_counts(sort=False, dropna=False)
print(chk1)

sub1=data_clean[(data_clean['politysco']== 1)]
sub2=data_clean[(data_clean['politysco']== 2)]
sub3=data_clean[(data_clean['politysco']== 3)]
sub4=data_clean [(data_clean['politysco']== 4)]

print ('association between urbanrate and internetuserate for NO_Democracy countries')
print (scipy.stats.pearsonr(sub1['urbanrate'], sub1['internetuserate']))
print ('       ')
print ('association between urbanrate and internetuserate for Unsatisfied_Democracy countries')
print (scipy.stats.pearsonr(sub2['urbanrate'], sub2['internetuserate']))
print ('       ')
print ('association between urbanrate and internetuserate for Satisfied_Democracy countries')
print (scipy.stats.pearsonr(sub3['urbanrate'], sub3['internetuserate']))
print ('association between urbanrate and internetuserate for Good_Democracy countries')
print (scipy.stats.pearsonr(sub4['urbanrate'], sub4['internetuserate']))




scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for NO_Democracy countries')
print (scat1)
plt.show()

scat2 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Unsatisfied_Democracy countries')
print (scat2)
plt.show()

scat3 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Satisfied_Democracy countries')
print (scat3)
plt.show()

scat4 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub4)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Good_Democracy countries')
print (scat4)
plt.show()

a=scipy.stats.pearsonr(sub1['urbanrate'], sub1['internetuserate'])[0]
b=scipy.stats.pearsonr(sub2['urbanrate'], sub2['internetuserate'])[0]
c=scipy.stats.pearsonr(sub3['urbanrate'], sub3['internetuserate'])[0]
d=scipy.stats.pearsonr(sub4['urbanrate'], sub4['internetuserate'])[0]

x=pandas.DataFrame([a,b,c,d])

print('The values of the 4 correlation coefficients')

x.plot(kind='bar', legend=True)
plt.show()

Friday, January 1, 2016

Generating a Correlation Coefficient with Python

Introduction:
This is a simple exercise to find the correlation between "internetuserate" and "polityscore", two variables from the "gapminder" data set.

Python output 1:
association between polityscore and internet use rate
(0.36438422712027008, 3.14535959202636e-06)

R-square
0.132775864974

---
From the scatter plot and the Pearson correlation coefficient of about 0.364, we can conclude that the two variables have only a weak positive linear relationship, although the small p-value (about 3.1e-06) shows the correlation is statistically significant. The R-square value is 0.1328, which means only about 13.3% of the variation in internet use rate can be predicted and explained by polityscore.
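The R-square value is just the square of the Pearson r reported above; a one-line check:

# sketch: R-square equals the squared Pearson correlation coefficient printed above
r = 0.36438422712027008
print(r ** 2)   # approximately 0.1328, matching the reported R-square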

Python Code:
import pandas 
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

gapminder=pandas.read_csv('gapminder.csv',low_memory=False)
print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))


# convert the variables of interest to numeric; invalid entries become NaN
gapminder['internetuserate'] = pandas.to_numeric(gapminder['internetuserate'], errors='coerce')
gapminder['polityscore'] = pandas.to_numeric(gapminder['polityscore'], errors='coerce')

# keep only the two variables of interest and drop rows with missing values
data = gapminder[['polityscore', 'internetuserate']].dropna()



scat=seaborn.regplot(x="polityscore",y="internetuserate",fit_reg=False,data=data)
plt.xlabel("polityscore")
plt.ylabel("internetuserate")
plt.title('Scatterplot for the Association Between Internet Use Rate and PolityScore')

print('association between polityscore and internet use rate')
print(scipy.stats.pearsonr(data['internetuserate'],data['polityscore']))

print('R-square')
r = scipy.stats.pearsonr(data['internetuserate'], data['polityscore'])[0]
r_sq = r * r
print(r_sq)