Wednesday, March 2, 2016

Running a k-means Cluster Analysis (Python)

Introduction:

Dataset: "gapminder.csv"

Variables: All variables within the dataset


Results:

Python Output 1:
-------------
According to the graph, I will choose 2-cluster or 4-cluster as the possible solutions for the k-means analysis  by the elbow method.

Python Output 2:
--------------

The two clusters are separated in a appropriate extent.




Python Output 4:
----------
The 4 clusters are also clearly separated.

he following are the mean values of the trained dataset :

Python Output 5:
Clustering variable means by 7-cluster
         level_0    index  polityscore  incomeperperson  alcconsumption  \
cluster                                                                  
0          20.00   96.900    -0.491388        -0.804625       -0.376205  
1          22.75  102.875     0.471166         0.594527        0.428931  

         armedforcesrate  breastcancerper100th  co2emissions  \
cluster                                                      
0               0.075222             -0.745935     -0.167833  
1              -0.295073              0.611362     -0.086379  

         femaleemployrate   hivrate  internetuserate  lifeexpectancy  \
cluster                                                              
0               -0.009326  0.245639        -0.843254       -0.763695  
1                0.082395 -0.159996         0.729558        0.740224  

         oilperperson  relectricperperson  suicideper100th  employrate  \
cluster                                                                
0           -0.483513           -0.657739        -0.222527    0.105981  
1            0.186257            0.427704        -0.043312   -0.018832  

         urbanrate
cluster            
0        -0.523812
1         0.420491


Python Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv("gapminder.csv")

data['polityscore'] = pd.to_numeric(data['polityscore'], errors='coerce')
data['incomeperperson'] =  pd.to_numeric(data['incomeperperson'], errors='coerce')
data['alcconsumption'] =  pd.to_numeric(data['alcconsumption'], errors='coerce')
data['armedforcesrate'] =  pd.to_numeric(data['armedforcesrate'],errors='coerce')
data['breastcancerper100th'] =  pd.to_numeric(data['breastcancerper100th'], errors='coerce')
data["co2emissions"] =  pd.to_numeric(data["co2emissions"], errors='coerce')
data["femaleemployrate"] =  pd.to_numeric(data["femaleemployrate"], errors='coerce')
data['hivrate'] =  pd.to_numeric(data['hivrate'], errors='coerce')
data['internetuserate'] =  pd.to_numeric(data['internetuserate'], errors='coerce')
data['lifeexpectancy'] =  pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['oilperperson'] =  pd.to_numeric(data['oilperperson'], errors='coerce')
data['relectricperperson'] =  pd.to_numeric(data['relectricperperson'], errors='coerce')
data['suicideper100th'] =  pd.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] =  pd.to_numeric(data['employrate'], errors='coerce')
data['urbanrate'] =  pd.to_numeric(data["urbanrate"], errors='coerce')
data_clean = data.dropna()

cluster= data_clean[['polityscore',"incomeperperson","alcconsumption","armedforcesrate",
"breastcancerper100th","co2emissions","femaleemployrate","hivrate",
"internetuserate","lifeexpectancy","oilperperson","relectricperperson",
"suicideper100th","employrate","urbanrate"]]

clustervar=cluster.copy()

from sklearn import preprocessing

clustervar['polityscore']=preprocessing.scale(clustervar['polityscore'].astype('float64'))
clustervar['incomeperperson']=preprocessing.scale(clustervar['incomeperperson'].astype('float64'))
clustervar['alcconsumption']=preprocessing.scale(clustervar['alcconsumption'].astype('float64'))
clustervar['armedforcesrate']=preprocessing.scale(clustervar['armedforcesrate'].astype('float64'))
clustervar['breastcancerper100th']=preprocessing.scale(clustervar['breastcancerper100th'].astype('float64'))
clustervar['co2emissions']=preprocessing.scale(clustervar['co2emissions'].astype('float64'))
clustervar['femaleemployrate']=preprocessing.scale(clustervar['femaleemployrate'].astype('float64'))
clustervar['hivrate']=preprocessing.scale(clustervar['hivrate'].astype('float64'))
clustervar['internetuserate']=preprocessing.scale(clustervar['internetuserate'].astype('float64'))
clustervar['lifeexpectancy']=preprocessing.scale(clustervar['lifeexpectancy'].astype('float64'))
clustervar['oilperperson']=preprocessing.scale(clustervar['oilperperson'].astype('float64'))
clustervar['relectricperperson']=preprocessing.scale(clustervar['relectricperperson'].astype('float64'))
clustervar['suicideper100th']=preprocessing.scale(clustervar['suicideper100th'].astype('float64'))
clustervar['employrate']=preprocessing.scale(clustervar['employrate'].astype('float64'))
clustervar['urbanrate']=preprocessing.scale(clustervar['urbanrate'].astype('float64'))

clus_train, clus_test = train_test_split(clustervar, test_size=0.2, random_state=23)

from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
    / clus_train.shape[0])
   
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

#Solution 1 : 2 Clusters
model1=KMeans(n_clusters=2)
model1.fit(clus_train)
clusassign=model.predict(clus_train)

from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model1.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters')
plt.show()



#Solution 2 : 4 Clusters
model2=KMeans(n_clusters=4)
model2.fit(clus_train)
clusassign=model2.predict(clus_train)

from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model2.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 4 Clusters')
plt.show()

#mean values of the variables of the training dataset
clus_train.reset_index(level=0, inplace=True)
cluslist2=list(clus_train['index'])
labels2=list(model2.labels_)
newlist2=dict(zip(cluslist, labels))
newlist2
newclus2=DataFrame.from_dict(newlist2, orient='index')
newclus2
newclus2.columns = ['cluster']
newclus2.reset_index(level=0, inplace=True)
merged_train2=pd.merge(clus_train, newclus2, on='index')
merged_train2.head(n=100)
merged_train2.cluster.value_counts()

clustergrp2 = merged_train2.groupby('cluster').mean()
print ("Clustering variable means by 7-cluster")
print(clustergrp2)

Saturday, February 27, 2016

Running a Lasso Regression Analysis (Python)

Introduction:

Dataset: "gapminder.csv"

Target Variable: "polityscore"

Predictors: "incomeperperson","alcconsumption","armedforcesrate","breastcancerper100th","co2emissions",
"femaleemployrate","hivrate","internetuserate","lifeexpectancy","oilperperson","relectricperperson",
"suicideper100th","employrate","urbanrate"

Algorithms:

-K-fold cross validation
-(LASSO) Least Angle Regression

For the target variable, "polityscore", its value ranges from -10 to 10 from the original dataset. Now it is divided into 2 groups:
[-10,0]=0
(0,10]=1

Results:

regression coefficients:

{'alcconsumption': 0.0,
 'armedforcesrate': -0.094224107805211468,
 'breastcancerper100th': 0.0025113191339298656,
 'co2emissions': 0.0,
 'employrate': 0.0,
 'femaleemployrate': 0.0,
 'hivrate': 0.0,
 'incomeperperson': 0.0,
 'internetuserate': 0.01682045898832887,
 'lifeexpectancy': 0.0,
 'oilperperson': 0.0,
 'relectricperperson': 0.0,
 'suicideper100th': 0.0,

 'urbanrate': 0.0}





MSE from training and test data:

training data MSE
0.111268760118
test data MSE
0.112600452979

R-square from training and test data
training data R-square
0.244465249376
test data R-square
-0.0847176970291

Summary:
The chosen predictors are 'armedforcesrate', 'breastcancerper100th' and 'internetuserate' according to the output.

The correlation of "internetuserate" and "polityscore" has been studied by my previous exercise. The other two variables seem difficult to be explained the relationship with "polityscore" in human sense.. In fact, this is also the limitation of the LASSO Regression or "Big Data, Simple Algorithm" which seeks for the "optimization"  of the prediction not the explanation of the variables.

Python Code:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

#Load the dataset
data = pd.read_csv("gapminder.csv")

# Data Management


data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['armedforcesrate'] = data['armedforcesrate'].convert_objects(convert_numeric=True)
data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
data["co2emissions"] = data["co2emissions"].convert_objects(convert_numeric=True)
data["femaleemployrate"] = data["femaleemployrate"].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['oilperperson'] = data['oilperperson'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
data['urbanrate'] = data["urbanrate"].convert_objects(convert_numeric=True)
data_clean = data.dropna(axis=0, how='any')

def politysco (row):
 

   if row['polityscore'] <= 0 :
      return 0
   elif row['polityscore'] <= 10:
      return 1

data_clean['polityscore'] = data_clean.apply (lambda row: politysco (row),axis=1)


#select predictor variables and target variable as separate data sets
predvar= data_clean[["incomeperperson","alcconsumption","armedforcesrate",
"breastcancerper100th","co2emissions","femaleemployrate","hivrate",
"internetuserate","lifeexpectancy","oilperperson","relectricperperson",
"suicideper100th","employrate","urbanrate"]]

target = data_clean.polityscore


# standardize predictors to have mean=0 and sd=1
predictors=predvar.copy()
predictors=predictors.dropna()
from sklearn import preprocessing

predictors['incomeperperson']=preprocessing.scale(predictors['incomeperperson'].astype('float64'))
predictors['alcconsumption']=preprocessing.scale(predictors['alcconsumption'].astype('float64'))
predictors['armedforcesrate']=preprocessing.scale(predictors['armedforcesrate'].astype('float64'))
predictors['breastcancerper100th']=preprocessing.scale(predictors['breastcancerper100th'].astype('float64'))
predictors['co2emissions']=preprocessing.scale(predictors['co2emissions'].astype('float64'))
predictors['femaleemployrate']=preprocessing.scale(predictors['femaleemployrate'].astype('float64'))
predictors['hivrate']=preprocessing.scale(predictors['hivrate'].astype('float64'))
predictors['internetuserate']=preprocessing.scale(predictors['internetuserate'].astype('float64'))
predictors['lifeexpectancy']=preprocessing.scale(predictors['lifeexpectancy'].astype('float64'))
predictors['oilperperson']=preprocessing.scale(predictors['oilperperson'].astype('float64'))
predictors['relectricperperson']=preprocessing.scale(predictors['relectricperperson'].astype('float64'))
predictors['suicideper100th']=preprocessing.scale(predictors['suicideper100th'].astype('float64'))
predictors['employrate']=preprocessing.scale(predictors['employrate'].astype('float64'))
predictors['urbanrate']=preprocessing.scale(predictors['urbanrate'].astype('float64'))


# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
       

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

Sunday, February 7, 2016

Running a Random Forest (Python)

Introduction:

Dataset: gapminder.csv

Predictors: 'internetuserate','urbanrate','employrate','lifeexpectancy','alcconsumption',
'armedforcesrate','breastcancerper100th','co2emissions','femaleemployrate','hivrate'

Targets: 'polityscore'


"polityscore" reflects the democracy level of a country. The score ranges from -10 to 10. 10 marks means the country is the most democratic. I divided it into 2 levels :[-10,0),[0,10),which return as 0 and 1 respectively,

Results:

Data Partitioning:

-predictors in training dataset: 10 variables and 76 observations
-predictors in test dataset: 10 variables and 52 observations
-target in training dataser: 1 variable and 76 observations
-target in test dataset: 1 variable and 52 observations

Training-test ratio: 0.6

Confusion matrix for the target_test sample:
[[ 8,  3],
 [ 7, 34]]

Accuracy=0.80769230769230771

Feature-importance score:
[ 0.08430852  0.08336156  0.09066508  0.14997917  0.0512591   0.07579398
  0.11722497  0.07404398  0.15731402  0.11604963]
The 'femaleemployrate' 's score is highest (=0.15731402)



Accuracy Scores with different number of trees:





Python Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier



data = pd.read_csv("gapminder.csv")
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['armedforcesrate'] = data['armedforcesrate'].convert_objects(convert_numeric=True)
data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['femaleemployrate'] = data['femaleemployrate'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)



data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()

def politysco (row):

   if row['polityscore'] <= 0 :
      return 0
   elif row['polityscore'] <= 10:
      return 1
 
data_clean['polityscore'] = data_clean.apply (lambda row: politysco (row),axis=1)


predictors = data_clean[['internetuserate','urbanrate','employrate','lifeexpectancy','alcconsumption',
'armedforcesrate','breastcancerper100th','co2emissions','femaleemployrate','hivrate']]

targets =data_clean['polityscore']

pred_train, pred_test, tar_train, tar_test  = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

#Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=9)
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

#Displaying the decision tree
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)




trees=range(9)
accuracy=np.zeros(9)

for idx in range(len(trees)):
   classifier=RandomForestClassifier(n_estimators=idx + 1)
   classifier=classifier.fit(pred_train,tar_train)
   predictions=classifier.predict(pred_test)
   accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)

print('Accuracy Scores with different number of trees' )  
plt.cla()
plt.plot(trees, accuracy)








Running a Classification Tree (Python)

Introduction:

Dataset: gapminder.csv

Predictors: 'internetuserate','urbanrate','employrate','lifeexpectancy'

Targets: polityscore

"polityscore" reflects the democracy level of a country. The score ranges from -10 to 10. 10 marks means the country is the most democratic. I divided it into 2 levels :[-10,0),[0,10),which return as 0 and 1 respectively,

Results:
Data Partitioning:

-predictors in training dataset: 4 variables and 91 observations
-predictors in test dataset: 4 variables and 61 observations
-target in training dataser: 1 variable and 91 observations
-target in test dataset: 1 variable and 61 observations

Training-test ratio: 0.6

Confusion matrix for the target_test sample:
[[ 6, 18],
 [ 9, 28]]

True Negative=6
True Positive =28
False Negative =9
False Positive=18

Accuracy=0.5901639344262295

Binary Decision  Tree:


Python Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics



data = pd.read_csv("gapminder.csv")
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)

data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()

def politysco (row):

   if row['polityscore'] <= 0 :
      return 0
   elif row['polityscore'] <= 10:
      return 1
 
data_clean['polityscore'] = data_clean.apply (lambda row: politysco (row),axis=1)


predictors = data_clean[['internetuserate','urbanrate','employrate','lifeexpectancy',]]

targets =data_clean['polityscore']

pred_train, pred_test, tar_train, tar_test  = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO
from io import StringIO
#from StringIO import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)

import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())




Saturday, January 23, 2016

Test a Multiple Regression Model (python)

Introduction:

Data set: "gapminer.csv"
Response Variable: "internetuserate"    - (Internet Use Rate)
Explanatory variable 1 : "urbanrate"     -(Urbanization Rate)
Explanatory variable 2 :  "polityscore"  -(polity score)

Python Output 1:

---------------------
From the graph, both two plots are set with order 2 ( second order of the polynomial) to fit the spots.


Python Output 2:
urbanrate_c            polityscore_c
count       155.00         155.00
mean          0.00           0.00
std          22.75             6.24
min         -44.97          -13.85
25%         -18.82          -5.35
50%           1.91           3.15
75%          16.86           5.15

max          44.63           6.15

--------------------------------------
See the residual plots, both two have obvious pattern and trend but not in random. This violates the the assumption of independently random distribution of the error terms.


Python Output 3:




------------------
The qq-plot regression analysis and the leverage plot is for the model:

internetuserate = beta_0+(beta_1) urbanrate_c + (beta_2)polityscore_c+(beta_3)urbanrate_c**2+(beta_4)polityscore_c**4

And the results of parameters:
beta_0=16.2094
beta_1=0.5783
beta_2=2.9886
beta_3=0.0052
beta_4=0.3525

All their p-values are lower than the 0.05 significant level and accept these parameters.


Conclusion:
My chosen model:
internetuserate =16.2094+(0.5783) urbanrate_c + (2.9886)polityscore_c+(0.0052)urbanrate_c**2+(0.3525)polityscore_c**4

Although the partial residual plots show it violates the assumption and outliers presents, the model meets normality assumption and the all the parameters are significant in regression analysis.

But ideally, more work can be done such as removal of the outliers and repeat the regression analysis,
remove highly correlated predictors, etc.

Python Code:

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn

pandas.set_option('display.float_format', lambda x:'%.2f'%x)

data = pandas.read_csv('gapminder.csv')


data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')


sub1 = data[['urbanrate', 'internetuserate', 'polityscore']].dropna()


scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", scatter=True,order=2, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Internet Use Rate')

plt.show()

scat2 = seaborn.regplot(x="polityscore", y="internetuserate", scatter=True,order=2, data=sub1)
plt.xlabel('Polity score')
plt.ylabel('Internet Use Rate')
plt.show()



sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['polityscore_c'] = (sub1['polityscore'] - sub1['polityscore'].mean())
sub1[["urbanrate_c", "polityscore_c"]].describe()


reg1 = smf.ols('internetuserate ~ urbanrate_c + polityscore_c+I(urbanrate_c**2)+I(polityscore_c**2)', data=sub1).fit()
print (reg1.summary())

fig1=sm.qqplot(reg1.resid, line='r')

stdres=pandas.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg1,  "urbanrate_c", fig=fig2)
plt.show()

fig3 = plt.figure(figsize=(12,8))
fig3 = sm.graphics.plot_regress_exog(reg1,  "polityscore_c", fig=fig3)
plt.show()

fig4=sm.graphics.influence_plot(reg1, size=8)
print(fig2)


fig5=sm.graphics.influence_plot(reg1, size=8)

print(fig3)


Sunday, January 17, 2016

Test a Basic Linear Regression Model (Python)

Introduction

Dataset: gapminder.csv

Response Variable: Internet Use Rate

Explanatory Variable: Polity Score

The two variables are both quantitative and the explanatory variable "polityscore" is going to be standardized to be mean 0.

Python Outout

Mean of Standardized polityscore (mean tends to be zero)

-3.91681166451608e-16



-----------

Conclusion:

Consider the linear regression model : Y=b0 + b1X + c, where c is the error term which is random , identically independent distributed and follows normal distribution.

From the output, b1 is 1.6043 and its P-value  is less than 0.05. It is significant and the null hypothesis of b1 is zero is rejected. The intercept is 32.2811 (%, internet use rate). Therefore the polity score (the democracy level) did affect the internet use rate according to the simple linear regression analysis.
And the model is : Y=32.2811 + 1.6043X + c.

Python Code:
import numpy
import pandas
import statsmodels.formula.api as smf
import seaborn
import matplotlib.pyplot as plt



data = pandas.read_csv('gapminder.csv')


data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

data['polityscore']=data['polityscore'].dropna()
m = numpy.mean(data['polityscore'])
data['polityscore2']=data['polityscore']-m
data['polityscore2']=data['polityscore2']
a=numpy.mean(data['polityscore2'])
print('Mean of Standardized polityscore')
print(a)


scat1 = seaborn.regplot(x="polityscore2", y="internetuserate", scatter=True, data=data)
plt.xlabel('polityscore')
plt.ylabel('Internet Use Rate')
plt.title ('Scatterplot for the Association Between Polityscore and Internet Use Rate')
print(scat1)

print ("OLS regression model for the association between urban rate and internet use rate")
reg1 = smf.ols('internetuserate ~ polityscore2', data=data).fit()
print (reg1.summary())






Saturday, January 9, 2016

Testing a Potential Moderator(Python)

Introduction:
From the "gapminder" dataset, the variables "urbanrate" and "internetuserate" are going to be tested for their correlation with the potential moderator "politiyscore".

"polityscore" has been divided into 4 levels : No_Democracy, Unsatisfied_Democracy, Satisfied_Democracy and Good_Democracy. The purpose of this research is to investigate the effect of this variable on the correlation between "urbanrate" and "internetuserate".

Python Output1:

association between urbanrate and internetuserate for NO_Democracy countries
(0.60873731612685533, 0.0020528103271054998)
     
association between urbanrate and internetuserate for Unsatisfied_Democracy countries
(0.70570435251349417, 8.1192786139000039e-05)
     
association between urbanrate and internetuserate for Satisfied_Democracy countries
(0.45972994415542473, 0.047663537308730138)
association between urbanrate and internetuserate for Good_Democracy countries

(0.68152965983424918, 2.7020062268884203e-13)





The values of the 4 correlation coefficient

---------

Conclusion:
The change of value of the Pearson Correlation Coefficient due to the testing moderator-"polityscore" does not have obvious pattern or trend.  In other words, "polityscore" is not a significant moderator for the relationship between "urbanrate" and "internetuserate".

Python Code
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['polityscore']=data['polityscore'].replace(' ', numpy.nan)

data_clean=data.dropna()

print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['internetuserate']))

def politysco (row):
   if row['polityscore'] <= -5:
      return 1
   elif row['polityscore'] <= 0 :
      return 2
   elif row['polityscore'] <= 5:
      return 3
   elif row['polityscore'] > 5:
      return 4
data_clean['politysco'] = data_clean.apply (lambda row: politysco (row),axis=1)

chk1 = data_clean['politysco'].value_counts(sort=False, dropna=False)
print(chk1)

sub1=data_clean[(data_clean['politysco']== 1)]
sub2=data_clean[(data_clean['politysco']== 2)]
sub3=data_clean[(data_clean['politysco']== 3)]
sub4=data_clean [(data_clean['politysco']== 4)]

print ('association between urbanrate and internetuserate for NO_Democracy countries')
print (scipy.stats.pearsonr(sub1['urbanrate'], sub1['internetuserate']))
print ('       ')
print ('association between urbanrate and internetuserate for Unsatisfied_Democracy countries')
print (scipy.stats.pearsonr(sub2['urbanrate'], sub2['internetuserate']))
print ('       ')
print ('association between urbanrate and internetuserate for Satisfied_Democracy countries')
print (scipy.stats.pearsonr(sub3['urbanrate'], sub3['internetuserate']))
print ('association between urbanrate and internetuserate for Good_Democracy countries')
print (scipy.stats.pearsonr(sub4['urbanrate'], sub4['internetuserate']))




scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate forNO_Democracy countries countries')
print (scat1)
plt.show()

scat2 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Unsatisfied_Democracy countries')
print (scat2)
plt.show()

scat3 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Satisfied_Democracy countries countries')
print (scat3)
plt.show()

scat4 = seaborn.regplot(x="urbanrate", y="internetuserate", data=sub4)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for  Good_Democracy countries')
print (scat4)
plt.show()

a=scipy.stats.pearsonr(sub1['urbanrate'], sub1['internetuserate'])[0]
b=scipy.stats.pearsonr(sub2['urbanrate'], sub2['internetuserate'])[0]
c=scipy.stats.pearsonr(sub3['urbanrate'], sub3['internetuserate'])[0]
d=scipy.stats.pearsonr(sub4['urbanrate'], sub4['internetuserate'])[0]

x=pandas.DataFrame([a,b,c,d])

print('The values of the 4 correlation coefficient')

x.plot(kind='bar', legend=True)

Friday, January 1, 2016

Generating a Correlation Coefficient with python

Generating a Correlation Coefficient with Python

Introduction:
This is a simple practice to find the correlation between "internetuserate" and "polityscore" which are the variables of "gapminder" data set.

Python output 1:
association between polityscore and internet use rate
(0.36438422712027008, 3.14535959202636e-06)

R-square
0.132775864974

---
From the scatter plot and the Pearson Correlation Coefficient = 0.36438422712027008, we can conclude that those two variables do not have significant linear relationship (neither positive or negative). The R-square value is 0.132775864974 and it means only about 13.3% of internet use rate and be predicted and explained bt polityscore.

Python Code:
import pandas 
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

gapminder=pandas.read_csv('gapminder.csv',low_memory=False)
print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))


gapminder['internetuserate']=gapminder['internetuserate'].convert_objects(convert_numeric=True)
gapminder['polityscore']=gapminder['polityscore'].convert_objects(convert_numeric=True)

data=pandas.DataFrame()
data[['polityscore','internetuserate']]=gapminder[['polityscore','internetuserate']].dropna()



scat=seaborn.regplot(x="polityscore",y="internetuserate",fit_reg=False,data=data)
plt.xlabel("polityscore")
plt.ylabel("internetuserate")
plt.title('Scatterplot for the Association Between Internet Use Rate and PolityScore')

print('association between polityscore and internet use rate')
print(scipy.stats.pearsonr(data['internetuserate'],data['polityscore']))

print('R-square')
r_sq=scipy.stats.pearsonr(data['internetuserate'],data['polityscore'])[0]*\
scipy.stats.pearsonr(data['internetuserate'],data['polityscore'])[0]
print(r_sq)