Thursday, December 24, 2015

Running a Chi-Square Test of Independence (Python)

Objective of the study
Testing the independence/dependence of "polityscore" (democracy level of a country) and "Internet use rate"

Explanation and Methodologies
The objective of this study is very similar to my previous one, which used ANOVA to investigate the relationship between a country's democracy level and its internet use rate. As before, the "Gapminder" dataset is the data source, and two of its variables are used: "polityscore" and "internetuserate".

First of all, "polityscore" is going to be divided into 4 groups according to its values.

polityscore       group (level)
-10 to -5         No_democracy
above -5 to 0     Unsatisfied_democracy
above 0 to 5      Satisfied_democracy
above 5 to 10     Good_democracy
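
The grouping rule can be sketched in plain Python. This is only an illustration, not the code used for the study: the helper name polity_group is made up here, and it assumes that each upper bound belongs to the lower group, which is how pandas.cut treats its bins by default.

```python
from bisect import bisect_left

# Upper edges of the first three groups; scores above 5 fall into the last group
UPPER_EDGES = [-5, 0, 5]
LABELS = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]

def polity_group(score):
    """Map a polity score in [-10, 10] to its group (upper bounds inclusive)."""
    return LABELS[bisect_left(UPPER_EDGES, score)]

# Show where the boundary values land
for score in (-10, -5, -4, 0, 1, 5, 6, 10):
    print(score, polity_group(score))
```

With this rule a score of exactly -5, 0 or 5 stays in the lower of the two neighbouring groups.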

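Secondly, "internetuserate" is converted into 2 categorical groups split at the 50% quantile (the median): "userate=50%tile" for the lower half and "userate=100%tile" for the upper half.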
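
A minimal sketch of the median split, using made-up rates rather than the Gapminder values. The helper name userate_tile is hypothetical, and the sketch assumes a value equal to the median goes to the lower group, as pandas.qcut does.

```python
import statistics

# Made-up internet use rates (percent), purely for illustration
rates = [3.7, 12.6, 31.1, 44.6, 65.8, 81.3]

median = statistics.median(rates)

def userate_tile(rate):
    """Lower half (up to and including the median) vs upper half."""
    return "userate=50%tile" if rate <= median else "userate=100%tile"

tiles = [userate_tile(r) for r in rates]
for rate, tile in zip(rates, tiles):
    print(rate, tile)
```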

Python Output1:


number of rows and columns of gapminder
213
16
a dataframe combining 2 chosen variables
    internetuserate_tile        polityscore_grp
0        userate=50%tile  Unsatisfied_democracy
1       userate=100%tile         Good_democracy
2        userate=50%tile    Satisfied_democracy
4        userate=50%tile  Unsatisfied_democracy
6       userate=100%tile         Good_democracy
7       userate=100%tile    Satisfied_democracy
9       userate=100%tile         Good_democracy
10      userate=100%tile         Good_democracy
11      userate=100%tile           No_democracy
13      userate=100%tile           No_democracy
14       userate=50%tile    Satisfied_democracy
16      userate=100%tile           No_democracy
17      userate=100%tile         Good_democracy
19       userate=50%tile         Good_democracy
21       userate=50%tile    Satisfied_democracy
22       userate=50%tile         Good_democracy
24       userate=50%tile         Good_democracy
25      userate=100%tile         Good_democracy
27      userate=100%tile         Good_democracy
28       userate=50%tile  Unsatisfied_democracy
29       userate=50%tile         Good_democracy
30       userate=50%tile    Satisfied_democracy
31       userate=50%tile  Unsatisfied_democracy
32      userate=100%tile         Good_democracy
35       userate=50%tile  Unsatisfied_democracy
36       userate=50%tile  Unsatisfied_democracy
37      userate=100%tile         Good_democracy
38      userate=100%tile           No_democracy
39      userate=100%tile         Good_democracy
40       userate=50%tile         Good_democracy
..                   ...                    ...
175     userate=100%tile         Good_democracy
176      userate=50%tile         Good_democracy
178      userate=50%tile         Good_democracy
179     userate=100%tile         Good_democracy
180      userate=50%tile         Good_democracy
183      userate=50%tile           No_democracy
184     userate=100%tile         Good_democracy
185     userate=100%tile         Good_democracy
186      userate=50%tile           No_democracy
188      userate=50%tile  Unsatisfied_democracy
189      userate=50%tile  Unsatisfied_democracy
190      userate=50%tile    Satisfied_democracy
191      userate=50%tile         Good_democracy
192      userate=50%tile  Unsatisfied_democracy
194     userate=100%tile         Good_democracy
195     userate=100%tile  Unsatisfied_democracy
196     userate=100%tile         Good_democracy
197      userate=50%tile           No_democracy
199      userate=50%tile  Unsatisfied_democracy
200     userate=100%tile         Good_democracy
201     userate=100%tile           No_democracy
202     userate=100%tile         Good_democracy
203     userate=100%tile         Good_democracy
204     userate=100%tile         Good_democracy
205      userate=50%tile           No_democracy
207     userate=100%tile  Unsatisfied_democracy
208      userate=50%tile           No_democracy
210      userate=50%tile  Unsatisfied_democracy
211      userate=50%tile         Good_democracy
212      userate=50%tile    Satisfied_democracy

-----
Now that both variables are categorical, a Chi-Square test of independence can be applied.

Python Output2:

The contingency table of 'internetuserate_tile' and 'polityscore_grp'

polityscore_grp       No_democracy  Unsatisfied_democracy  Satisfied_democracy  Good_democracy
internetuserate_tile
userate=50%tile                 12                     21                   17              35
userate=100%tile                11                      4                    2              53


The contingency table in percentage
polityscore_grp       No_democracy  Unsatisfied_democracy  Satisfied_democracy  Good_democracy
internetuserate_tile
userate=50%tile           0.521739                   0.84             0.894737        0.397727
userate=100%tile          0.478261                   0.16             0.105263        0.602273
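
Each value in the second table is a count divided by its column total, so every column sums to 1. A small sketch that recomputes these proportions from the counts printed above (the variable names are only for this illustration):

```python
# Counts per democracy group: (userate=50%tile, userate=100%tile)
counts = {
    "No_democracy": (12, 11),
    "Unsatisfied_democracy": (21, 4),
    "Satisfied_democracy": (17, 2),
    "Good_democracy": (35, 53),
}

# Divide each count by the total of its own column (democracy group)
proportions = {
    group: tuple(n / sum(pair) for n in pair)
    for group, pair in counts.items()
}

for group, (low, high) in proportions.items():
    print(f"{group}: {low:.6f} {high:.6f}")
```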


The Chi-Square test
(25.918522100123596, 9.9194759541075989e-06, 3, array([[ 12.61290323,  13.70967742,  10.41935484,  48.25806452],
       [ 10.38709677,  11.29032258,   8.58064516,  39.74193548]]))

-----

From the output, the p-value of the Chi-Square test is 9.92e-06 (chi-square statistic 25.92 with 3 degrees of freedom), which is below the 0.05 significance level. We therefore reject the null hypothesis that democracy level and internet use rate are independent, i.e. the two variables are associated.
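
To see where these numbers come from, here is a sketch that recomputes the test by hand from the counts in the contingency table. It is an illustration, not the scipy implementation: the p-value uses a closed-form expression for the chi-square tail that holds only for 3 degrees of freedom, which is the case for this 2x4 table.

```python
import math

# Observed counts (rows: userate=50%tile, userate=100%tile)
observed = [
    [12, 21, 17, 35],
    [11, 4, 2, 53],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(col_totals) - 1)

# Upper tail of the chi-square distribution, closed form for dof == 3 only
assert dof == 3
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(chi2, p_value, dof)
```

The statistic, p-value and expected counts should agree with the tuple printed under "The Chi-Square test" in Python Output2.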


Python Output3:
[Bar chart of internet use rate by polityscore_grp]

-----
The chart shows that the relationship between the two variables is not a simple linear one, even though they are dependent.

The post hoc tests for the Chi-Square test follow:

Python Output4:

Post hoc test for the Chi-Square test

1,
No_democracy vs Unsatisfied_democracy
No_VS_Un              No_democracy  Unsatisfied_democracy
internetuserate_tile                                     
userate=50%tile                 12                     21
userate=100%tile                11                      4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125,  17.1875],
       [  7.1875,   7.8125]]))


2,
No_democracy vs Satisfied_democracy
No_VS_Sat             No_democracy  Unsatisfied_democracy
internetuserate_tile                                     
userate=50%tile                 12                     21
userate=100%tile                11                      4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125,  17.1875],
       [  7.1875,   7.8125]]))

3,
No_democracy vs Good_democracy
No_VS_Go              Good_democracy  No_democracy
internetuserate_tile                              
userate=50%tile                   35            12
userate=100%tile                  53            11
(0.69683212191731103, 0.4038501988882941, 1, array([[ 37.26126126,   9.73873874],
       [ 50.73873874,  13.26126126]]))

4,
Satisfied_democracy vs Unsatisfied_democracy
Sat_VS_Un             Satisfied_democracy  Unsatisfied_democracy
internetuserate_tile                                            
userate=50%tile                        17                     21
userate=100%tile                        2                      4
(0.0065004616805170515, 0.93573983334993682, 1, array([[ 16.40909091,  21.59090909],
       [  2.59090909,   3.40909091]]))

5,
Unsatisfied_democracy vs Good_democracy
Un_VS_Go              Good_democracy  Unsatisfied_democracy
internetuserate_tile                                       
userate=50%tile                   35                     21
userate=100%tile                  53                      4
(13.516299876110732, 0.00023650025871216763, 1, array([[ 43.61061947,  12.38938053],
       [ 44.38938053,  12.61061947]]))

6,
Satisfied_democracy vs Good_democracy
Sat_VS_Go             Good_democracy  Satisfied_democracy
internetuserate_tile                                     
userate=50%tile                   35                   17
userate=100%tile                  53                    2
(13.526401267691638, 0.00023523068534581418, 1, array([[ 42.76635514,   9.23364486],
       [ 45.23364486,   9.76635514]]))
-----
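The post hoc tests use the Bonferroni adjustment: the significance level for rejecting the null hypothesis is 0.05/(number of comparisons) = 0.05/6 ≈ 0.0083. From the output, only the 5th and 6th comparisons (Unsatisfied_democracy vs Good_democracy and Satisfied_democracy vs Good_democracy) have p-values below this threshold and reject the null hypothesis. In other words, a dependence between democracy level and internet use rate can be established only between groups other than "No_democracy". Note that the output printed for the 2nd comparison repeats the table of the 1st, so No_democracy vs Satisfied_democracy was not actually tested in that run.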
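
The Bonferroni decision rule can be sketched in plain Python from the counts in the overall contingency table. The helper name yates_chi2_2x2 is made up for this illustration. It applies the continuity correction that scipy's chi2_contingency uses by default for 2x2 tables, and it uses the closed form of the chi-square tail for 1 degree of freedom. Because it works from the counts directly, it also covers No_democracy vs Satisfied_democracy, whose printed output above repeats the first comparison's table.

```python
import math
from itertools import combinations

# Counts per democracy group: (userate=50%tile, userate=100%tile)
counts = {
    "No_democracy": (12, 11),
    "Unsatisfied_democracy": (21, 4),
    "Satisfied_democracy": (17, 2),
    "Good_democracy": (35, 53),
}

def yates_chi2_2x2(col_a, col_b):
    """Chi-square statistic with Yates correction and p-value for a 2x2 table."""
    observed = [[col_a[0], col_b[0]], [col_a[1], col_b[1]]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col_a), sum(col_b)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            # Continuity correction: shrink each deviation by 0.5, not below 0
            diff = max(0.0, abs(observed[i][j] - expected) - 0.5)
            chi2 += diff ** 2 / expected
    # Tail probability of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold: 0.05 / 6

results = {}
for a, b in pairs:
    chi2, p = yates_chi2_2x2(counts[a], counts[b])
    results[(a, b)] = (chi2, p)
    verdict = "reject" if p < alpha else "fail to reject"
    print(f"{a} vs {b}: chi2={chi2:.4f}, p={p:.6f} -> {verdict}")
```

Under these assumptions the recomputed No_democracy vs Satisfied_democracy comparison stays above the adjusted threshold, so the two comparisons against Good_democracy remain the only rejections.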



Conclusion
The democracy level of a country and its internet use rate are dependent, but the relationship is not linear. The post hoc tests do not give a concrete, meaningful explanation of this dependence, so further research may be required.


Python Code
import pandas
import numpy
import scipy.stats as ss
import seaborn
import matplotlib.pyplot as p

gapminder = pandas.read_csv('gapminder.csv',low_memory=False)

print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))


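# convert_objects was removed from pandas; to_numeric with errors='coerce' turns unparseable values into NaN
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')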


gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,4,labels=["No_democracy","Unsatisfied_democracy",
"Satisfied_democracy","Good_democracy"])

gapminder['internetuserate_tile']=pandas.qcut(gapminder.internetuserate,2,labels=["userate=50%tile",
"userate=100%tile"])


print('a dataframe combining 2 chosen variables')
sub1=gapminder[['internetuserate_tile','polityscore_grp']].dropna()
print(sub1)

print("The contingency table of 'internetuserate_tile' and 'polityscore_grp' ")
ct1=pandas.crosstab(sub1['internetuserate_tile'],sub1['polityscore_grp'])
print(ct1)

print("The contingency table in percentage ")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)


print("The Chi-Square test")
cs1=ss.chi2_contingency(ct1)
print(cs1)

# Attach the numeric internet use rate (aligned on the row index) for the bar chart
sub1['internetuserate']=gapminder['internetuserate'].dropna()

# astype returns a new Series, so the result has to be assigned back
sub1['polityscore_grp']=sub1['polityscore_grp'].astype('category')

print("Bar chart of mean 'internetuserate' by 'polityscore_grp'")
# factorplot was renamed to catplot in later seaborn releases
seaborn.catplot(x='polityscore_grp',y='internetuserate',data=sub1,kind='bar',ci=None)
p.xlabel('polityscore_grp')
p.ylabel('mean internetuserate')
print()

print("Post hoc test for the Chi-Square test")
print("No_democracy vs Unsatisfied_democracy")
recode1={"No_democracy":"No_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["No_VS_Un"]=sub1['polityscore_grp'].map(recode1)
cc1=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Un"])
print(cc1)
ccs1=ss.chi2_contingency(cc1)
print(ccs1)

print()
print("No_democracy vs Satisfied_democracy")
recode2={"No_democracy":"No_democracy","Satisfied_democracy":"Satisfied_democracy"}
# Keep only the No_democracy and Satisfied_democracy rows; other groups map to NaN
sub1["No_VS_Sat"]=sub1['polityscore_grp'].map(recode2)
cc2=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Sat"])
print(cc2)
ccs2=ss.chi2_contingency(cc2)
print(ccs2)


print()
print("No_democracy vs Good_democracy")
recode3={"No_democracy":"No_democracy","Good_democracy":"Good_democracy"}
sub1["No_VS_Go"]=sub1['polityscore_grp'].map(recode3)
cc3=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Go"])
print(cc3)
ccs3=ss.chi2_contingency(cc3)
print(ccs3)

print()
print("Satisfied_democracy vs Unsatisfied_democracy")
recode4={"Satisfied_democracy":"Satisfied_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["Sat_VS_Un"]=sub1['polityscore_grp'].map(recode4)
cc4=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Un"])
print(cc4)
ccs4=ss.chi2_contingency(cc4)
print(ccs4)

print()
print("Unsatisfied_democracy vs Good_democracy")
recode5={"Unsatisfied_democracy":"Unsatisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Un_VS_Go"]=sub1['polityscore_grp'].map(recode5)
cc5=pandas.crosstab(sub1['internetuserate_tile'],sub1["Un_VS_Go"])
print(cc5)
ccs5=ss.chi2_contingency(cc5)
print(ccs5)


print()
print("Satisfied_democracy vs Good_democracy")
recode6={"Satisfied_democracy":"Satisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Sat_VS_Go"]=sub1['polityscore_grp'].map(recode6)
cc6=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Go"])
print(cc6)
ccs6=ss.chi2_contingency(cc6)
print(ccs6)
