Data Analysis and Interpretation Specialization: 2015

Objective of the study
Testing the independence/dependence of "polityscore" (democracy level of a country) and "Internet use rate"

Explanation and Methodologies

The objective of this study is very similar to that of my previous study, which used ANOVA to investigate the relationship between a country's democracy level and its internet use rate. As before, the "Gapminder" dataset is the data source, and two of its variables are used: "polityscore" and "internetuserate".

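First, "polityscore" is divided into 4 groups according to its value. As binned by pandas.cut in the code below, each range includes its upper bound, so a score of exactly -5 falls in "No_democracy".

polityscore	group(level)
-10 to -5	No_democracy
-5 to 0	Unsatisfied_democracy
0 to 5	Satisfied_democracy
5 to 10	Good_democracy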

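Second, "internetuserate" is converted into 2 categorical groups split at the 50% quantile (the median): "userate=50%tile" for values up to the median and "userate=100%tile" for values above it.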
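The two recoding steps above can be sketched on a small table. This is only an illustration: the eight sample values are hypothetical, and the explicit bin edges are borrowed from the ANOVA code later in this document rather than from the equal-width binning used in this post's code.

```python
import pandas as pd

# Hypothetical sample values; the real analysis reads them from gapminder.csv.
toy = pd.DataFrame({
    "polityscore": [-10, -7, -3, 0, 2, 5, 8, 10],
    "internetuserate": [2.0, 46.7, 3.7, 12.5, 44.0, 9.0, 81.3, 75.9],
})

# Explicit bin edges. The left edge is -11 so that a score of -10 is included,
# because pandas.cut uses right-closed intervals by default.
toy["polityscore_grp"] = pd.cut(
    toy["polityscore"],
    bins=[-11, -5, 0, 5, 10],
    labels=["No_democracy", "Unsatisfied_democracy",
            "Satisfied_democracy", "Good_democracy"],
)

# Two quantile bins: values up to the median, and values above it.
toy["internetuserate_tile"] = pd.qcut(
    toy["internetuserate"], 2,
    labels=["userate=50%tile", "userate=100%tile"],
)

print(toy)
```

Because the intervals are right-closed, boundary scores such as 0 and 5 land in the lower of the two neighbouring groups.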

Python Output1:

number of rows and columns of gapminder

213

a dataframe combining 2 chosen variables

internetuserate_tile polityscore_grp

0 userate=50%tile Unsatisfied_democracy

1 userate=100%tile Good_democracy

2 userate=50%tile Satisfied_democracy

4 userate=50%tile Unsatisfied_democracy

6 userate=100%tile Good_democracy

7 userate=100%tile Satisfied_democracy

9 userate=100%tile Good_democracy

10 userate=100%tile Good_democracy

11 userate=100%tile No_democracy

13 userate=100%tile No_democracy

14 userate=50%tile Satisfied_democracy

16 userate=100%tile No_democracy

17 userate=100%tile Good_democracy

19 userate=50%tile Good_democracy

21 userate=50%tile Satisfied_democracy

22 userate=50%tile Good_democracy

24 userate=50%tile Good_democracy

25 userate=100%tile Good_democracy

27 userate=100%tile Good_democracy

28 userate=50%tile Unsatisfied_democracy

29 userate=50%tile Good_democracy

30 userate=50%tile Satisfied_democracy

31 userate=50%tile Unsatisfied_democracy

32 userate=100%tile Good_democracy

35 userate=50%tile Unsatisfied_democracy

36 userate=50%tile Unsatisfied_democracy

37 userate=100%tile Good_democracy

38 userate=100%tile No_democracy

39 userate=100%tile Good_democracy

40 userate=50%tile Good_democracy

.. ... ...

175 userate=100%tile Good_democracy

176 userate=50%tile Good_democracy

178 userate=50%tile Good_democracy

179 userate=100%tile Good_democracy

180 userate=50%tile Good_democracy

183 userate=50%tile No_democracy

184 userate=100%tile Good_democracy

185 userate=100%tile Good_democracy

186 userate=50%tile No_democracy

188 userate=50%tile Unsatisfied_democracy

189 userate=50%tile Unsatisfied_democracy

190 userate=50%tile Satisfied_democracy

191 userate=50%tile Good_democracy

192 userate=50%tile Unsatisfied_democracy

194 userate=100%tile Good_democracy

195 userate=100%tile Unsatisfied_democracy

196 userate=100%tile Good_democracy

197 userate=50%tile No_democracy

199 userate=50%tile Unsatisfied_democracy

200 userate=100%tile Good_democracy

201 userate=100%tile No_democracy

202 userate=100%tile Good_democracy

203 userate=100%tile Good_democracy

204 userate=100%tile Good_democracy

205 userate=50%tile No_democracy

207 userate=100%tile Unsatisfied_democracy

208 userate=50%tile No_democracy

210 userate=50%tile Unsatisfied_democracy

211 userate=50%tile Good_democracy

212 userate=50%tile Satisfied_democracy

-----

Now that both variables are categorical, a Chi-Square test of independence can be applied.

Python Output2:

The contigency table of 'internetuserate_tile' and 'polityscore_grp'

polityscore_grp No_democracy Unsatisfied_democracy Satisfied_democracy Good_democracy

internetuserate_tile

userate=50%tile 12 21 17 35

userate=100%tile 11 4 2 53

The contigency table in percentage

polityscore_grp No_democracy Unsatisfied_democracy Satisfied_democracy Good_democracy

internetuserate_tile

userate=50%tile 0.521739 0.84 0.894737 0.397727

userate=100%tile 0.478261 0.16 0.105263 0.602273

The Chi-Square test

(25.918522100123596, 9.9194759541075989e-06, 3, array([[ 12.61290323, 13.70967742, 10.41935484, 48.25806452],

[ 10.38709677, 11.29032258, 8.58064516, 39.74193548]]))

-----

From the output, the p-value of the Chi-Square test is 9.92e-06 (chi-square = 25.92 with 3 degrees of freedom), which is below the 0.05 significance level. We therefore reject the null hypothesis that democracy level and internet use rate are independent; that is, the two variables are associated.
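As a quick check of the decision above, the test can be rerun directly on the counts printed in Python Output2. This is a minimal sketch: the variable names (observed, alpha) are illustrative, and the counts are typed in by hand instead of being built from gapminder.csv.

```python
import numpy as np
import scipy.stats as ss

# Observed counts copied from the contingency table in Python Output2.
# Rows: userate=50%tile, userate=100%tile.
# Columns: No_democracy, Unsatisfied_democracy, Satisfied_democracy, Good_democracy.
observed = np.array([
    [12, 21, 17, 35],
    [11,  4,  2, 53],
])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the table of expected counts under independence.
chi2, p_value, dof, expected = ss.chi2_contingency(observed)

alpha = 0.05
print("chi-square = %.4f, dof = %d, p = %.3g" % (chi2, dof, p_value))
print("reject independence" if p_value < alpha else "fail to reject independence")
```

The expected counts are the row total times the column total divided by the grand total, which is why they match the array shown in the original output.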

Python Output3:

-----

The chart shows that the relationship between the two variables is not a simple linear one, although they are dependent.

The post hoc tests for the Chi-Square test follow:

Python Output4:

Post hoc test for the Chi-Sqaure test

1,
NO_democracy vs Unsatisfied_democracy
NoVSUn No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],

[ 7.1875, 7.8125]]))

2,
No_democracy vs Satisfied_democracy
No_VS_Sat No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],
[ 7.1875, 7.8125]]))

3,
No_democracy vs Good_democracy
No_VS_Go Good_democracy_democracy No_democracy
internetuserate_tile
userate=50%tile 35 12
userate=100%tile 53 11
(0.69683212191731103, 0.4038501988882941, 1, array([[ 37.26126126, 9.73873874],

[ 50.73873874, 13.26126126]]))

4,
Satisfied_democracy vs Unsatisfied_democracy
Sat_VS_Un Satisfied_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 17 21
userate=100%tile 2 4
(0.0065004616805170515, 0.93573983334993682, 1, array([[ 16.40909091, 21.59090909],

[ 2.59090909, 3.40909091]]))

5,
Unsatisfied_democracy vs Good_democracy
Un_VS_Go Good_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 35 21
userate=100%tile 53 4
(13.516299876110732, 0.00023650025871216763, 1, array([[ 43.61061947, 12.38938053],
[ 44.38938053, 12.61061947]]))

6,
Satisfied_democracy vs Good_democracy
Sat_VS_Go Good_democracy Satisfied_democracy
internetuserate_tile
userate=50%tile 35 17
userate=100%tile 53 2
(13.526401267691638, 0.00023523068534581418, 1, array([[ 42.76635514, 9.23364486],

[ 45.23364486, 9.76635514]]))
-----
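The post hoc tests use the Bonferroni adjustment: the significance level for rejecting the null hypothesis is 0.05/(number of comparisons) = 0.05/6 = 0.0083. From the output, only the 5th and 6th comparisons (Unsatisfied_democracy vs Good_democracy and Satisfied_democracy vs Good_democracy) reject the null hypothesis. No comparison involving "No_democracy" is significant, so the dependence between democracy level and internet use rate shows up only in the contrast between "Good_democracy" and the two middle levels. Note that the table printed for the 2nd comparison is identical to that of the 1st, so the output does not actually test No_democracy against Satisfied_democracy.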
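The pairwise procedure can also be written as a loop over all pairs of democracy levels, which avoids repeating one block of code per comparison. The sketch below is an assumption-laden illustration, not the code that produced the output above: it rebuilds the contingency table by hand from Python Output2, and the names ct, adjusted_alpha and significant are made up for the example.

```python
from itertools import combinations

import pandas as pd
import scipy.stats as ss

# Observed counts copied from the contingency table in Python Output2.
ct = pd.DataFrame(
    {
        "No_democracy": [12, 11],
        "Unsatisfied_democracy": [21, 4],
        "Satisfied_democracy": [17, 2],
        "Good_democracy": [35, 53],
    },
    index=["userate=50%tile", "userate=100%tile"],
)

pairs = list(combinations(ct.columns, 2))
adjusted_alpha = 0.05 / len(pairs)  # Bonferroni adjustment: 0.05 / 6

significant = []
for left, right in pairs:
    # 2x2 sub-table for this pair of levels. With 1 degree of freedom,
    # scipy applies Yates' continuity correction by default.
    chi2, p_value, dof, _ = ss.chi2_contingency(ct[[left, right]])
    if p_value < adjusted_alpha:
        significant.append((left, right))
    print("%s vs %s: chi2=%.4f, p=%.6f" % (left, right, chi2, p_value))

print("adjusted alpha = %.4f" % adjusted_alpha)
print("significant pairs:", significant)
```

Because every pair is generated from the column names, each of the six comparisons is guaranteed to use its own pair of levels.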

Conclusion

The democracy level of a country and its internet use rate are dependent, but the relationship is not linear. The post hoc tests do not give a concrete, meaningful explanation of the dependence, so further research may be required.

Python Code
import pandas
import numpy
import scipy.stats as ss
import seaborn
import matplotlib.pyplot as p

gapminder = pandas.read_csv('gapminder.csv',low_memory=False)

print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))

# coerce both columns to numeric; blank or non-numeric entries become NaN
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')

gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,4,labels=["No_democracy","Unsatisfied_democracy",
"Satisfied_democracy","Good_democracy"])

gapminder['internetuserate_tile']=pandas.qcut(gapminder.internetuserate,2,labels=["userate=50%tile",
"userate=100%tile"])

print('a dataframe combining 2 chosen variables')
sub1=gapminder[['internetuserate_tile','polityscore_grp']].dropna()
print(sub1)

print("The contigency table of 'internetuserate_tile' and 'polityscore_grp' ")
ct1=pandas.crosstab(sub1['internetuserate_tile'],sub1['polityscore_grp'])
print(ct1)

print("The contigency table in percentage ")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

print("The Chi-Square test")
cs1=ss.chi2_contingency(ct1)
print(cs1)

# bring the numeric internet use rate back in for plotting (aligned on the row index)
sub1['internetuserate']=gapminder['internetuserate'].dropna()

print("Bar chart of mean 'internetuserate' by 'polityscore_grp'")
# catplot is the current name of the older factorplot; ci=None hides the error bars
seaborn.catplot(x='polityscore_grp',y='internetuserate',data=sub1,kind='bar',ci=None)
p.xlabel('polityscore_grp')
p.ylabel('mean internetuserate')
print()

print("Post hoc test for the Chi-Sqaure test")
print("NO_democracy vs Unsatisfied_democracy")
recode1={"No_democracy":"No_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["No_VS_Un"]=sub1['polityscore_grp'].map(recode1)
cc1=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Un"])
print(cc1)
ccs1=ss.chi2_contingency(cc1)
print(ccs1)

print()
print("No_democracy vs Satisfied_democracy")
recode2={"No_democracy":"No_democracy","Satisfied_democracy":"Satisfied_democracy"}
sub1["No_VS_Sat"]=sub1['polityscore_grp'].map(recode2)
cc2=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Sat"])
print(cc2)
ccs2=ss.chi2_contingency(cc2)
print(ccs2)

print()
print("No_democracy vs Good_democracy")
recode3={"No_democracy":"No_democracy","Good_democracy":"Good_democracy"}
sub1["No_VS_Go"]=sub1['polityscore_grp'].map(recode3)
cc3=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Go"])
print(cc3)
ccs3=ss.chi2_contingency(cc3)
print(ccs3)

print()
print("Satisfied_democracy vs Unsatisfied_democracy")
recode4={"Satisfied_democracy":"Satisfied_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["Sat_VS_Un"]=sub1['polityscore_grp'].map(recode4)
cc4=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Un"])
print(cc4)
ccs4=ss.chi2_contingency(cc4)
print(ccs4)

print()
print("Unsatisfied_democracy vs Good_democracy")
recode5={"Unsatisfied_democracy":"Unsatisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Un_VS_Go"]=sub1['polityscore_grp'].map(recode5)
cc5=pandas.crosstab(sub1['internetuserate_tile'],sub1["Un_VS_Go"])
print(cc5)
ccs5=ss.chi2_contingency(cc5)
print(ccs5)

print()
print("Satisfied_democracy vs Good_democracy")
recode6={"Satisfied_democracy":"Satisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Sat_VS_Go"]=sub1['polityscore_grp'].map(recode6)
cc6=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Go"])
print(cc6)
ccs6=ss.chi2_contingency(cc6)
print(ccs6)

Objective of the study:
Investigate the correlation between a country's democracy level and its internet use rate.

Explanation and Methodologies:

From the "Gapminder" dataset, two variables are extracted: "polityscore" and "internetuserate". Each observation in the dataset is a country. The variable "polityscore" measures the extent of a country's democracy on a scale from -10 to 10. In the following explanation, the term "democracy level" is used instead of "polity score" to make the meaning of the variable clearer.

First, "polityscore" is divided into four groups (4 levels) according to its value as follows:

polityscore	group(level)
-10 to -5	No_democracy
-5 to 0	Unsatisfied_democracy
0 to 5	Satisfied_democracy
5 to 10	Good_democracy

The "polityscore" groups will be the explanatory variable and "internetuserate" the response variable in my study. A one-way ANOVA across the four levels is then carried out.

The null hypothesis: the mean internet use rate is the same at every democracy level (i.e. mean1 = mean2 = mean3 = mean4).
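To make the hypothesis test concrete, here is a minimal sketch of a one-way ANOVA on four small groups. Two things are assumptions made for illustration: the internet use rates are hypothetical numbers, not Gapminder data, and the test uses scipy.stats.f_oneway in place of the statsmodels OLS model used in this post's code.

```python
import scipy.stats as ss

# Hypothetical internet use rates (percent) per democracy level, for illustration only.
groups = {
    "No_democracy": [30.0, 35.0, 40.0, 20.0],
    "Unsatisfied_democracy": [5.0, 10.0, 12.0, 3.0],
    "Satisfied_democracy": [8.0, 13.0, 11.0, 4.0],
    "Good_democracy": [45.0, 70.0, 60.0, 80.0],
}

# f_oneway compares between-group variance with within-group variance
# and returns the F statistic and its p-value.
f_stat, p_value = ss.f_oneway(*groups.values())

print("F = %.3f, p = %.3g" % (f_stat, p_value))
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

A small p-value only says that at least one group mean differs; it does not say which one.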

Python Output 1:

number of rows and columns of gapminder

213

a dataframe combining 4 chosen variables

internetuserate polityscore_grp

0 3.654122 Unsatisfied_democracy

1 44.989947 Good_democracy

2 12.500073 Satisfied_democracy

4 9.999954 Unsatisfied_democracy

6 36.000335 Good_democracy

7 44.001025 Satisfied_democracy

9 75.895654 Good_democracy

10 72.731576 Good_democracy

11 46.679702 No_democracy

13 54.992809 No_democracy

14 3.700003 Satisfied_democracy

16 32.052144 No_democracy

17 73.733934 Good_democracy

19 3.129962 Good_democracy

21 13.598876 Satisfied_democracy

22 20.001710 Good_democracy

24 5.999836 Good_democracy

25 40.650098 Good_democracy

27 45.986590 Good_democracy

28 1.400061 Unsatisfied_democracy

29 2.100213 Good_democracy

30 1.259934 Satisfied_democracy

31 3.999977 Unsatisfied_democracy

32 81.338393 Good_democracy

35 2.300027 Unsatisfied_democracy

36 1.700031 Unsatisfied_democracy

37 45.000000 Good_democracy

38 34.377790 No_democracy

39 36.499875 Good_democracy

40 5.098265 Good_democracy

.. ... ...

175 69.339971 Good_democracy

176 5.001375 Good_democracy

178 12.334893 Good_democracy

179 65.808554 Good_democracy

180 11.999971 Good_democracy

183 9.007736 No_democracy

184 90.016190 Good_democracy

185 82.166660 Good_democracy

186 20.663156 No_democracy

188 11.549391 Unsatisfied_democracy

189 11.000055 Unsatisfied_democracy

190 21.200072 Satisfied_democracy

191 0.210066 Good_democracy

192 5.379820 Unsatisfied_democracy

194 48.516818 Good_democracy

195 36.562553 Unsatisfied_democracy

196 39.820178 Good_democracy

197 2.199998 No_democracy

199 12.500255 Unsatisfied_democracy

200 44.585355 Good_democracy

201 77.996781 No_democracy

202 84.731705 Good_democracy

203 74.247572 Good_democracy

204 47.867469 Good_democracy

205 19.445021 No_democracy

207 35.850437 Unsatisfied_democracy

208 27.851822 No_democracy

210 12.349750 Unsatisfied_democracy

211 10.124986 Good_democracy

212 11.500415 Satisfied_democracy

This is the dataframe containing the two variables that are going to be analyzed.

Python Output 2:

-----

The p-value of the F-statistic is 8.74e-08, which is lower than 0.05, the significance level for rejecting the null hypothesis. Therefore, the null hypothesis that the mean internet use rate is the same across the democracy level groups is rejected.

This result does not tell us which groups differ from the others; that is, we do not know which pairs of democracy levels have statistically different mean internet use rates. A post hoc test for the ANOVA is therefore carried out: Tukey's Honestly Significant Difference (HSD) test.

Python Output 3:
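A post hoc comparison of this kind can be sketched with statsmodels. The example below is illustrative only: the group values are hypothetical, and it calls pairwise_tukeyhsd from statsmodels.stats.multicomp instead of the MultiComparison object used in this post's code.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical internet use rates (percent) per democracy level, for illustration only.
groups = {
    "No_democracy": [30.0, 35.0, 40.0, 20.0],
    "Unsatisfied_democracy": [5.0, 10.0, 12.0, 3.0],
    "Satisfied_democracy": [8.0, 13.0, 11.0, 4.0],
    "Good_democracy": [45.0, 70.0, 60.0, 80.0],
}

# Flatten to one array of values and one array of matching group labels.
values = np.concatenate([np.asarray(v) for v in groups.values()])
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])

# One row per pair of groups: mean difference, confidence interval, reject flag.
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```

Each row of the summary reports whether the null hypothesis of equal means is rejected for that pair at the chosen alpha.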

The means and standard deviations of the internet use rate for each democracy level group are given in the output for reference. One thing that drew my attention is that the mean of the "No_democracy" group is unexpectedly high: since it is the lowest democracy level, I expected the mean difference between this group and "Good_democracy" to be the largest among the four groups.

According to Tukey's test, there is a statistically significant difference in mean internet use rate between:

"Good_democracy" and "Satisfied_democracy"

"Good_democracy" and "Unsatisfied_democracy"

Conclusion

According to the ANOVA, internet use rate differs significantly across democracy levels, but my study does not verify any causal relationship between the two variables.

I expected the internet use rate to be positively related to the democracy level (a higher value of one corresponding to a higher value or level of the other). However, Tukey's Honestly Significant Difference test gives an unexpected result. Therefore, there is no simple positive linear relationship between internet use rate and democracy level, and further analysis may be required.

Python Code
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gapminder = pandas.read_csv('gapminder.csv',low_memory=False)

print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))

# coerce both columns to numeric; blank or non-numeric entries become NaN
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')

gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,[-11,-5,0,5,10],labels=["No_democracy","Unsatisfied_democracy","Satisfied_democracy","Good_democracy"])

print('a dataframe combining 4 chosen variables')
sub1=gapminder[['internetuserate','polityscore_grp']].dropna()
print(sub1)

model1=smf.ols(formula='internetuserate~C(polityscore_grp)',data=sub1).fit()
print(model1.summary())

print('means of internetuserate among the four groups')
m2=sub1.groupby('polityscore_grp').mean()
print(m2)
print('standard deviatons of internetuserate among the four groups')
sd2=sub1.groupby('polityscore_grp').std()
print(sd2)

print()
print("Tukey's Honestly Significant Difference Test")
model2=multi.MultiComparison(sub1['internetuserate'],sub1['polityscore_grp']).tukeyhsd()

print(model2.summary())

Thursday, December 24, 2015

Running a Chi-Square Test of Independence (Python)

Sunday, December 20, 2015

Running an analysis of variance (Python)