Data Analysis and Interpretation Specialization: Running a Chi-Square Test of Independence (Python)

Objective of the study
Testing the independence/dependence of "polityscore" (democracy level of a country) and "Internet use rate"

Explanation and Methodologies

The objective of this study is much the same as that of my previous one, which used ANOVA to investigate the relationship between a country's democracy level and its internet use rate. As before, the "Gapminder" dataset is the data source, and we use two of its variables, "polityscore" and "internetuserate".

First, "polityscore" is divided into 4 equal-width groups according to its value. pandas.cut uses right-closed intervals by default, so a boundary value such as -5 falls in the lower group.

polityscore	group(level)
-10 to -5	No_democracy
-5 to 0	Unsatisfied_democracy
0 to 5	Satisfied_democracy
5 to 10	Good_democracy
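
The grouping above can be sketched with pandas.cut. This is a minimal illustration on a handful of made-up scores (the values in `scores` are hypothetical, not taken from Gapminder). It assumes the scores span the full range from -10 to 10, so that 4 equal-width bins have edges at -5, 0 and 5.

```python
import pandas

# Hypothetical polity scores, chosen to include the bin boundaries
scores = pandas.Series([-10, -7, -5, 0, 3, 5, 10])

labels = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]

# 4 equal-width bins over the observed range; intervals are closed on the right
groups = pandas.cut(scores, 4, labels=labels)

print(pandas.DataFrame({"polityscore": scores, "group": groups}))
```

Because the intervals are closed on the right, the scores -5, 0 and 5 land in the lower of the two groups they border.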

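Second, "internetuserate" is converted into 2 categorical groups split at the median (the 50% quantile): "userate=50%tile" for the lower half and "userate=100%tile" for the upper half.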
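
The median split can be sketched with pandas.qcut. The rates below are hypothetical values invented for illustration, not Gapminder data.

```python
import pandas

# Hypothetical internet use rates (percent), not taken from Gapminder
rates = pandas.Series([2.1, 9.5, 15.0, 44.0, 63.2, 90.0])

# q=2 splits at the median: the lower half gets the first label,
# the upper half gets the second
tiles = pandas.qcut(rates, 2, labels=["userate=50%tile", "userate=100%tile"])

print(rates.median())
print(tiles.value_counts())
```

Unlike pandas.cut, which makes bins of equal width, pandas.qcut makes bins that hold roughly equal numbers of observations.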

Python Output1:

number of rows and columns of gapminder

213

a dataframe combining 2 chosen variables

internetuserate_tile polityscore_grp

0 userate=50%tile Unsatisfied_democracy

1 userate=100%tile Good_democracy

2 userate=50%tile Satisfied_democracy

4 userate=50%tile Unsatisfied_democracy

6 userate=100%tile Good_democracy

7 userate=100%tile Satisfied_democracy

9 userate=100%tile Good_democracy

10 userate=100%tile Good_democracy

11 userate=100%tile No_democracy

13 userate=100%tile No_democracy

14 userate=50%tile Satisfied_democracy

16 userate=100%tile No_democracy

17 userate=100%tile Good_democracy

19 userate=50%tile Good_democracy

21 userate=50%tile Satisfied_democracy

22 userate=50%tile Good_democracy

24 userate=50%tile Good_democracy

25 userate=100%tile Good_democracy

27 userate=100%tile Good_democracy

28 userate=50%tile Unsatisfied_democracy

29 userate=50%tile Good_democracy

30 userate=50%tile Satisfied_democracy

31 userate=50%tile Unsatisfied_democracy

32 userate=100%tile Good_democracy

35 userate=50%tile Unsatisfied_democracy

36 userate=50%tile Unsatisfied_democracy

37 userate=100%tile Good_democracy

38 userate=100%tile No_democracy

39 userate=100%tile Good_democracy

40 userate=50%tile Good_democracy

.. ... ...

175 userate=100%tile Good_democracy

176 userate=50%tile Good_democracy

178 userate=50%tile Good_democracy

179 userate=100%tile Good_democracy

180 userate=50%tile Good_democracy

183 userate=50%tile No_democracy

184 userate=100%tile Good_democracy

185 userate=100%tile Good_democracy

186 userate=50%tile No_democracy

188 userate=50%tile Unsatisfied_democracy

189 userate=50%tile Unsatisfied_democracy

190 userate=50%tile Satisfied_democracy

191 userate=50%tile Good_democracy

192 userate=50%tile Unsatisfied_democracy

194 userate=100%tile Good_democracy

195 userate=100%tile Unsatisfied_democracy

196 userate=100%tile Good_democracy

197 userate=50%tile No_democracy

199 userate=50%tile Unsatisfied_democracy

200 userate=100%tile Good_democracy

201 userate=100%tile No_democracy

202 userate=100%tile Good_democracy

203 userate=100%tile Good_democracy

204 userate=100%tile Good_democracy

205 userate=50%tile No_democracy

207 userate=100%tile Unsatisfied_democracy

208 userate=50%tile No_democracy

210 userate=50%tile Unsatisfied_democracy

211 userate=50%tile Good_democracy

212 userate=50%tile Satisfied_democracy

-----

Now we have two categorical variables, so a Chi-Square test of independence can be applied.
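
As a small sketch of how two categorical columns become a contingency table, the example below runs pandas.crosstab on a toy dataframe. The rows and the level names "low", "high", "A" and "B" are hypothetical and only stand in for the real groups.

```python
import pandas

# Hypothetical records: one row per country, two categorical columns
toy = pandas.DataFrame({
    "internetuserate_tile": ["low", "low", "high", "high", "low", "high"],
    "polityscore_grp":      ["A",   "B",   "A",    "B",    "B",   "B"],
})

# Rows are the levels of the first variable, columns the levels of the second;
# each cell counts the records that fall in that combination
ct = pandas.crosstab(toy["internetuserate_tile"], toy["polityscore_grp"])
print(ct)
```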

Python Output2:

The contingency table of 'internetuserate_tile' and 'polityscore_grp'

internetuserate_tile	No_democracy	Unsatisfied_democracy	Satisfied_democracy	Good_democracy
userate=50%tile	12	21	17	35
userate=100%tile	11	4	2	53

The contingency table in percentage (proportion of each column total)

internetuserate_tile	No_democracy	Unsatisfied_democracy	Satisfied_democracy	Good_democracy
userate=50%tile	0.521739	0.84	0.894737	0.397727
userate=100%tile	0.478261	0.16	0.105263	0.602273

The Chi-Square test

(25.918522100123596, 9.9194759541075989e-06, 3, array([[ 12.61290323, 13.70967742, 10.41935484, 48.25806452],

[ 10.38709677, 11.29032258, 8.58064516, 39.74193548]]))

-----

From the output, we can see that the p-value of the Chi-Square test is 9.9194759541075989e-06, which is lower than the 0.05 significance level, so we reject the null hypothesis that democracy level and internet use rate are independent (i.e. the two variables are associated).
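
The decision above can be reproduced from the counts alone. The sketch below types the observed counts from Python Output2 into an array by hand (rather than rebuilding them from gapminder.csv) and unpacks the four values that scipy.stats.chi2_contingency returns. The names `observed` and `alpha` are chosen here for illustration.

```python
import numpy
import scipy.stats as ss

# Observed counts copied from the contingency table in Python Output2
# columns: No_democracy, Unsatisfied_democracy, Satisfied_democracy, Good_democracy
observed = numpy.array([[12, 21, 17, 35],   # userate=50%tile
                        [11,  4,  2, 53]])  # userate=100%tile

# Returns the test statistic, the p-value, the degrees of freedom,
# and the counts expected under independence
chi2, p_value, dof, expected = ss.chi2_contingency(observed)

alpha = 0.05
print("chi-square = %.4f, dof = %d, p-value = %.3g" % (chi2, dof, p_value))
print("reject independence" if p_value < alpha else "fail to reject independence")
```

The degrees of freedom are (rows - 1) x (columns - 1) = 1 x 3 = 3, which matches the third value in the output tuple.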

Python Output3:

-----

The chart shows that the relationship between the two variables is not a simple linear one, although they are dependent.

The post hoc tests for the Chi-Square test follow:

Python Output4:

Post hoc test for the Chi-Square test

1,
No_democracy vs Unsatisfied_democracy
No_VS_Un No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],

[ 7.1875, 7.8125]]))

2,
No_democracy vs Satisfied_democracy
No_VS_Sat No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],
[ 7.1875, 7.8125]]))

3,
No_democracy vs Good_democracy
No_VS_Go Good_democracy_democracy No_democracy
internetuserate_tile
userate=50%tile 35 12
userate=100%tile 53 11
(0.69683212191731103, 0.4038501988882941, 1, array([[ 37.26126126, 9.73873874],

[ 50.73873874, 13.26126126]]))

4,
Satisfied_democracy vs Unsatisfied_democracy
Sat_VS_Un Satisfied_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 17 21
userate=100%tile 2 4
(0.0065004616805170515, 0.93573983334993682, 1, array([[ 16.40909091, 21.59090909],

[ 2.59090909, 3.40909091]]))

5,
Unsatisfied_democracy vs Good_democracy
Un_VS_Go Good_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 35 21
userate=100%tile 53 4
(13.516299876110732, 0.00023650025871216763, 1, array([[ 43.61061947, 12.38938053],
[ 44.38938053, 12.61061947]]))

6,
Satisfied_democracy vs Good_democracy
Sat_VS_Go Good_democracy Satisfied_democracy
internetuserate_tile
userate=50%tile 35 17
userate=100%tile 53 2
(13.526401267691638, 0.00023523068534581418, 1, array([[ 42.76635514, 9.23364486],

[ 45.23364486, 9.76635514]]))
-----
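The post hoc tests apply a Bonferroni adjustment. The significance level for rejecting the null hypothesis is 0.05/(number of comparisons) = 0.05/6 = 0.0083. From the output, only the 5th and 6th comparisons reject the null hypothesis: "Good_democracy" differs significantly from both "Unsatisfied_democracy" and "Satisfied_democracy". No comparison involving "No_democracy" is significant, so the dependence between democracy level and internet use rate can only be deduced among the groups other than "No_democracy". Note that the output shown for the 2nd comparison repeats the table of the 1st comparison (its columns are "No_democracy" and "Unsatisfied_democracy"), so it does not actually test "No_democracy" against "Satisfied_democracy".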
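
The Bonferroni procedure described above can be sketched as a loop over all pairs of democracy groups. This is an illustrative alternative to the six hand-written comparisons, not the code used for the output: it starts from the counts in Python Output2, typed in by hand, and the names `ct`, `pairs` and `significant` are chosen here.

```python
from itertools import combinations

import pandas
import scipy.stats as ss

# Observed counts copied from the contingency table in Python Output2
ct = pandas.DataFrame(
    {"No_democracy": [12, 11],
     "Unsatisfied_democracy": [21, 4],
     "Satisfied_democracy": [17, 2],
     "Good_democracy": [35, 53]},
    index=["userate=50%tile", "userate=100%tile"])

pairs = list(combinations(ct.columns, 2))   # 6 pairwise comparisons
alpha = 0.05 / len(pairs)                   # Bonferroni-adjusted level

significant = []
for a, b in pairs:
    # 2x2 sub-table for this pair; scipy applies Yates' correction when dof == 1
    chi2, p_value, dof, expected = ss.chi2_contingency(ct[[a, b]])
    if p_value < alpha:
        significant.append((a, b))
    print("%s vs %s: chi-square = %.4f, p-value = %.4g" % (a, b, chi2, p_value))

print("adjusted alpha = %.4f" % alpha)
print("significant pairs:", significant)
```

Dividing 0.05 by the number of comparisons keeps the chance of at least one false rejection across all six tests at or below 0.05.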

Conclusion

The democracy level of a country and its internet use rate are dependent, but the relationship is not linear. The post hoc tests do not give a concrete and meaningful explanation of the dependence. Further research may be required.

Python Code
import pandas
import numpy
import scipy.stats as ss
import seaborn
import matplotlib.pyplot as p

gapminder = pandas.read_csv('gapminder.csv',low_memory=False)

print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))

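# Coerce both columns to numeric; entries that cannot be parsed become NaN
# (pandas.to_numeric replaces convert_objects, which was removed from pandas)
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')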

gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,4,labels=["No_democracy","Unsatisfied_democracy",
"Satisfied_democracy","Good_democracy"])

gapminder['internetuserate_tile']=pandas.qcut(gapminder.internetuserate,2,labels=["userate=50%tile",
"userate=100%tile"])

print('a dataframe combining 2 chosen variables')
sub1=gapminder[['internetuserate_tile','polityscore_grp']].dropna()
print(sub1)

print("The contingency table of 'internetuserate_tile' and 'polityscore_grp' ")
ct1=pandas.crosstab(sub1['internetuserate_tile'],sub1['polityscore_grp'])
print(ct1)

print("The contingency table in percentage ")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

print("The Chi-Square test")
cs1=ss.chi2_contingency(ct1)
print(cs1)

# Add the numeric internet use rate back so the chart can show its mean per group
# (the assignment aligns on the row index, so only the rows kept in sub1 are filled)
sub1['internetuserate']=gapminder['internetuserate']

print("Bar chart of mean 'internetuserate' by 'polityscore_grp'")
# catplot is the current name of factorplot; errorbar=None needs seaborn 0.12 or later
# (on older versions, ci=None has the same effect)
seaborn.catplot(x='polityscore_grp',y='internetuserate',data=sub1,kind='bar',errorbar=None)
p.xlabel('polityscore_grp')
p.ylabel('mean internetuserate')
print()

print("Post hoc test for the Chi-Square test")
print("No_democracy vs Unsatisfied_democracy")
recode1={"No_democracy":"No_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["No_VS_Un"]=sub1['polityscore_grp'].map(recode1)
cc1=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Un"])
print(cc1)
ccs1=ss.chi2_contingency(cc1)
print(ccs1)

print()
print("No_democracy vs Satisfied_democracy")
recode2={"No_democracy":"No_democracy","Satisfied_democracy":"Satisfied_democracy"}
# Map with recode2 so that this comparison uses No_democracy and Satisfied_democracy
sub1["No_VS_Sat"]=sub1['polityscore_grp'].map(recode2)
cc2=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Sat"])
print(cc2)
ccs2=ss.chi2_contingency(cc2)
print(ccs2)

print()
print("No_democracy vs Good_democracy")
recode3={"No_democracy":"No_democracy","Good_democracy":"Good_democracy"}
sub1["No_VS_Go"]=sub1['polityscore_grp'].map(recode3)
cc3=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Go"])
print(cc3)
ccs3=ss.chi2_contingency(cc3)
print(ccs3)

print()
print("Satisfied_democracy vs Unsatisfied_democracy")
recode4={"Satisfied_democracy":"Satisfied_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["Sat_VS_Un"]=sub1['polityscore_grp'].map(recode4)
cc4=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Un"])
print(cc4)
ccs4=ss.chi2_contingency(cc4)
print(ccs4)

print()
print("Unsatisfied_democracy vs Good_democracy")
recode5={"Unsatisfied_democracy":"Unsatisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Un_VS_Go"]=sub1['polityscore_grp'].map(recode5)
cc5=pandas.crosstab(sub1['internetuserate_tile'],sub1["Un_VS_Go"])
print(cc5)
ccs5=ss.chi2_contingency(cc5)
print(ccs5)

print()
print("Satisfied_democracy vs Good_democracy")
recode6={"Satisfied_democracy":"Satisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Sat_VS_Go"]=sub1['polityscore_grp'].map(recode6)
cc6=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Go"])
print(cc6)
ccs6=ss.chi2_contingency(cc6)
print(ccs6)

Thursday, December 24, 2015
