Testing the independence/dependence of "polityscore" (democracy level of a country) and "Internet use rate"
Explanation and Methodologies
The objective of this study is similar to that of my previous one, which used ANOVA to investigate the relationship between a country's democracy level and its internet use rate. As before, the "Gapminder" dataset is the data source, and we use two of its variables, "polityscore" and "internetuserate".
First, "polityscore" is divided into 4 groups according to its value.
polityscore | group(level) |
-10 to -5 | No_democracy |
above -5 to 0 | Unsatisfied_democracy |
above 0 to 5 | Satisfied_democracy |
above 5 to 10 | Good_democracy |
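As a minimal sketch of this grouping step, `pandas.cut` with 4 equal-width bins reproduces the table above. The sample scores below are made up for illustration and are not taken from the Gapminder file.

```python
import pandas as pd

# Hypothetical polity scores spanning the full -10..10 range
scores = pd.Series([-10, -7, -5, -4, 0, 1, 5, 6, 10])

labels = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]

# 4 equal-width bins over [min, max]; each bin includes its upper edge,
# so a score of -5 lands in the first group and 0 in the second
groups = pd.cut(scores, 4, labels=labels)

print(pd.DataFrame({"polityscore": scores, "group": groups}))
```

Because the bins are derived from the observed minimum and maximum, the cut points match the table only when the data actually spans -10 to 10.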
Second, "internetuserate" is converted into 2 categorical groups, split at the 50% quantile (the median).
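The median split can be sketched with `pandas.qcut`. The rates below are hypothetical values chosen only to show how the two labels are assigned.

```python
import pandas as pd

# Hypothetical internet use rates (percent of population)
rates = pd.Series([2.0, 10.0, 25.0, 40.0, 65.0, 90.0])

# q=2 splits at the median: lower half vs upper half
tiles = pd.qcut(rates, 2,
                labels=["userate=50%tile", "userate=100%tile"])

print(tiles.value_counts())
```

Here "userate=50%tile" marks values at or below the median and "userate=100%tile" marks values above it.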
Python Output1:
number of rows and columns of gapminder
213
16
a dataframe combining 2 chosen variables
internetuserate_tile polityscore_grp
0 userate=50%tile Unsatisfied_democracy
1 userate=100%tile Good_democracy
2 userate=50%tile Satisfied_democracy
4 userate=50%tile Unsatisfied_democracy
6 userate=100%tile Good_democracy
7 userate=100%tile Satisfied_democracy
9 userate=100%tile Good_democracy
10 userate=100%tile Good_democracy
11 userate=100%tile No_democracy
13 userate=100%tile No_democracy
14 userate=50%tile Satisfied_democracy
16 userate=100%tile No_democracy
17 userate=100%tile Good_democracy
19 userate=50%tile Good_democracy
21 userate=50%tile Satisfied_democracy
22 userate=50%tile Good_democracy
24 userate=50%tile Good_democracy
25 userate=100%tile Good_democracy
27 userate=100%tile Good_democracy
28 userate=50%tile Unsatisfied_democracy
29 userate=50%tile Good_democracy
30 userate=50%tile Satisfied_democracy
31 userate=50%tile Unsatisfied_democracy
32 userate=100%tile Good_democracy
35 userate=50%tile Unsatisfied_democracy
36 userate=50%tile Unsatisfied_democracy
37 userate=100%tile Good_democracy
38 userate=100%tile No_democracy
39 userate=100%tile Good_democracy
40 userate=50%tile Good_democracy
.. ... ...
175 userate=100%tile Good_democracy
176 userate=50%tile Good_democracy
178 userate=50%tile Good_democracy
179 userate=100%tile Good_democracy
180 userate=50%tile Good_democracy
183 userate=50%tile No_democracy
184 userate=100%tile Good_democracy
185 userate=100%tile Good_democracy
186 userate=50%tile No_democracy
188 userate=50%tile Unsatisfied_democracy
189 userate=50%tile Unsatisfied_democracy
190 userate=50%tile Satisfied_democracy
191 userate=50%tile Good_democracy
192 userate=50%tile Unsatisfied_democracy
194 userate=100%tile Good_democracy
195 userate=100%tile Unsatisfied_democracy
196 userate=100%tile Good_democracy
197 userate=50%tile No_democracy
199 userate=50%tile Unsatisfied_democracy
200 userate=100%tile Good_democracy
201 userate=100%tile No_democracy
202 userate=100%tile Good_democracy
203 userate=100%tile Good_democracy
204 userate=100%tile Good_democracy
205 userate=50%tile No_democracy
207 userate=100%tile Unsatisfied_democracy
208 userate=50%tile No_democracy
210 userate=50%tile Unsatisfied_democracy
211 userate=50%tile Good_democracy
212 userate=50%tile Satisfied_democracy
-----
We now have two categorical variables, so a Chi-Square test can be applied.
Python Output2:
The contingency table of 'internetuserate_tile' and 'polityscore_grp'
polityscore_grp No_democracy Unsatisfied_democracy Satisfied_democracy Good_democracy
internetuserate_tile
userate=50%tile 12 21 17 35
userate=100%tile 11 4 2 53
The contingency table in percentage
polityscore_grp No_democracy Unsatisfied_democracy Satisfied_democracy Good_democracy
internetuserate_tile
userate=50%tile 0.521739 0.84 0.894737 0.397727
userate=100%tile 0.478261 0.16 0.105263 0.602273
The Chi-Square test
(25.918522100123596, 9.9194759541075989e-06, 3, array([[ 12.61290323, 13.70967742, 10.41935484, 48.25806452],
[ 10.38709677, 11.29032258, 8.58064516, 39.74193548]]))
-----
From the output, the p-value of the Chi-Square test is 9.92e-06 (chi-square = 25.92 with 3 degrees of freedom), which is below the 0.05 significance level. We therefore reject the null hypothesis that democracy level and internet use rate are independent, i.e. the two variables are associated.
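The test statistic can be checked directly from the observed counts in Python Output2. The sketch below feeds those counts to `scipy.stats.chi2_contingency` and also recomputes the statistic by hand from the expected counts; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
# rows: userate=50%tile, userate=100%tile
# cols: No, Unsatisfied, Satisfied, Good democracy
observed = np.array([[12, 21, 17, 35],
                     [11,  4,  2, 53]])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Expected count = row total * column total / grand total
manual_expected = (np.outer(observed.sum(axis=1), observed.sum(axis=0))
                   / observed.sum())
# Pearson statistic: sum of (O - E)^2 / E over all cells
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print(chi2, p_value, dof)
print(manual_chi2)
```

For a table with more than one degree of freedom, `chi2_contingency` applies no continuity correction, so the manual value and the library value agree.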
Python Output3:
[Bar chart: mean "internetuserate" for each "polityscore_grp"]
-----
The chart shows that there is no simple linear relationship between the two variables, although they are dependent.
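The same non-linear pattern is visible in the column proportions of Python Output2. As a rough check, the sketch below recomputes the share of above-median countries in each democracy group from the observed counts and tests whether that share rises or falls steadily across the groups. It works on the counts, not on the mean rates plotted in the chart.

```python
import pandas as pd

# Observed counts copied from the contingency table in Python Output2
counts = pd.DataFrame(
    [[12, 21, 17, 35],
     [11, 4, 2, 53]],
    index=["userate=50%tile", "userate=100%tile"],
    columns=["No_democracy", "Unsatisfied_democracy",
             "Satisfied_democracy", "Good_democracy"],
)

# Divide each column by its total to get within-group proportions
col_share = counts / counts.sum(axis=0)
upper = col_share.loc["userate=100%tile"]

print(upper.round(3))
# A steady trend across the ordered groups would make one of these True
print(upper.is_monotonic_increasing, upper.is_monotonic_decreasing)
```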
The following are the post hoc tests for the Chi-Square test:
Python Output4:
Post hoc test for the Chi-Square test
1,
NO_democracy vs Unsatisfied_democracy
NoVSUn No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],
[ 7.1875, 7.8125]]))
2,
No_democracy vs Satisfied_democracy
No_VS_Sat No_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 12 21
userate=100%tile 11 4
(4.2634624505928862, 0.038940474422515331, 1, array([[ 15.8125, 17.1875],
[ 7.1875, 7.8125]]))
3,
No_democracy vs Good_democracy
No_VS_Go Good_democracy_democracy No_democracy
internetuserate_tile
userate=50%tile 35 12
userate=100%tile 53 11
(0.69683212191731103, 0.4038501988882941, 1, array([[ 37.26126126, 9.73873874],
[ 50.73873874, 13.26126126]]))
4,
Satisfied_democracy vs Unsatisfied_democracy
Sat_VS_Un Satisfied_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 17 21
userate=100%tile 2 4
(0.0065004616805170515, 0.93573983334993682, 1, array([[ 16.40909091, 21.59090909],
[ 2.59090909, 3.40909091]]))
5,
Unsatisfied_democracy vs Good_democracy
Un_VS_Go Good_democracy Unsatisfied_democracy
internetuserate_tile
userate=50%tile 35 21
userate=100%tile 53 4
(13.516299876110732, 0.00023650025871216763, 1, array([[ 43.61061947, 12.38938053],
[ 44.38938053, 12.61061947]]))
6,
Satisfied_democracy vs Good_democracy
Sat_VS_Go Good_democracy Satisfied_democracy
internetuserate_tile
userate=50%tile 35 17
userate=100%tile 53 2
(13.526401267691638, 0.00023523068534581418, 1, array([[ 42.76635514, 9.23364486],
[ 45.23364486, 9.76635514]]))
-----
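The post hoc tests use the Bonferroni adjustment: the significance level for rejecting the null hypothesis is 0.05/(number of comparisons) = 0.05/6 = 0.0083. From the output, only the 5th and 6th comparisons (p ≈ 0.00024 in both) reject the null hypothesis, so "Good_democracy" differs significantly from both "Unsatisfied_democracy" and "Satisfied_democracy" in internet use rate, while no comparison involving "No_democracy" is significant. Note that the table printed for comparison 2 is identical to the one for comparison 1 (it shows "Unsatisfied_democracy" rather than "Satisfied_democracy"), so the No_democracy vs Satisfied_democracy pair was not actually tested in this output.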
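A compact way to run all six pairwise comparisons is to loop over the column pairs of the contingency table. This is a sketch rather than the code that produced the output above; it reuses the counts from Python Output2, and the variable names are illustrative.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table in Python Output2
counts = pd.DataFrame(
    [[12, 21, 17, 35],
     [11, 4, 2, 53]],
    index=["userate=50%tile", "userate=100%tile"],
    columns=["No_democracy", "Unsatisfied_democracy",
             "Satisfied_democracy", "Good_democracy"],
)

pairs = list(combinations(counts.columns, 2))
# Bonferroni: divide the overall level by the number of comparisons
alpha_adj = 0.05 / len(pairs)

significant = []
for a, b in pairs:
    # 2x2 sub-table for this pair; the default Yates correction applies
    chi2, p, dof, _ = chi2_contingency(counts[[a, b]].to_numpy())
    reject = p < alpha_adj
    if reject:
        significant.append((a, b))
    print(f"{a} vs {b}: chi2={chi2:.3f}, p={p:.5f}, reject={reject}")
```

Looping over `combinations` guarantees that each pair is tabulated exactly once, without writing a separate recode dictionary for every comparison.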
Conclusion
The democracy level of a country and its internet use rate are dependent, but the relationship is not linear. The post hoc tests do not give a concrete, meaningful explanation of this dependence, so further research may be required.
Python Code
import pandas
import numpy
import scipy.stats as ss
import seaborn
import matplotlib.pyplot as p
gapminder = pandas.read_csv('gapminder.csv',low_memory=False)
print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))
# convert_objects was removed from pandas; to_numeric turns unparsable values into NaN
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')
gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,4,labels=["No_democracy","Unsatisfied_democracy",
"Satisfied_democracy","Good_democracy"])
gapminder['internetuserate_tile']=pandas.qcut(gapminder.internetuserate,2,labels=["userate=50%tile",
"userate=100%tile"])
print('a dataframe combining 2 chosen variables')
sub1=gapminder[['internetuserate_tile','polityscore_grp']].dropna()
print(sub1)
print("The contingency table of 'internetuserate_tile' and 'polityscore_grp' ")
ct1=pandas.crosstab(sub1['internetuserate_tile'],sub1['polityscore_grp'])
print(ct1)
print("The contingency table in percentage ")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
print("The Chi-Square test")
cs1=ss.chi2_contingency(ct1)
print(cs1)
# attach the numeric rate (aligned on the row index) for plotting
sub1['internetuserate']=gapminder['internetuserate']
print("Bar chart of mean 'internetuserate' by 'polityscore_grp'")
# factorplot was renamed catplot in newer seaborn; errorbar=None needs seaborn>=0.12 (use ci=None on older versions)
seaborn.catplot(x='polityscore_grp',y='internetuserate',data=sub1,kind='bar',errorbar=None)
p.xlabel('polityscore_grp')
p.ylabel('mean internetuserate')
print()
print("Post hoc test for the Chi-Square test")
print("No_democracy vs Unsatisfied_democracy")
recode1={"No_democracy":"No_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["No_VS_Un"]=sub1['polityscore_grp'].map(recode1)
cc1=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Un"])
print(cc1)
ccs1=ss.chi2_contingency(cc1)
print(ccs1)
print()
print("No_democracy vs Satisfied_democracy")
recode2={"No_democracy":"No_democracy","Satisfied_democracy":"Satisfied_democracy"}
# map with recode2 (not recode1) so this comparison keeps the Satisfied_democracy group
sub1["No_VS_Sat"]=sub1['polityscore_grp'].map(recode2)
cc2=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Sat"])
print(cc2)
ccs2=ss.chi2_contingency(cc2)
print(ccs2)
print()
print("No_democracy vs Good_democracy")
recode3={"No_democracy":"No_democracy","Good_democracy":"Good_democracy"}
sub1["No_VS_Go"]=sub1['polityscore_grp'].map(recode3)
cc3=pandas.crosstab(sub1['internetuserate_tile'],sub1["No_VS_Go"])
print(cc3)
ccs3=ss.chi2_contingency(cc3)
print(ccs3)
print()
print("Satisfied_democracy vs Unsatisfied_democracy")
recode4={"Satisfied_democracy":"Satisfied_democracy","Unsatisfied_democracy":"Unsatisfied_democracy"}
sub1["Sat_VS_Un"]=sub1['polityscore_grp'].map(recode4)
cc4=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Un"])
print(cc4)
ccs4=ss.chi2_contingency(cc4)
print(ccs4)
print()
print("Unsatisfied_democracy vs Good_democracy")
recode5={"Unsatisfied_democracy":"Unsatisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Un_VS_Go"]=sub1['polityscore_grp'].map(recode5)
cc5=pandas.crosstab(sub1['internetuserate_tile'],sub1["Un_VS_Go"])
print(cc5)
ccs5=ss.chi2_contingency(cc5)
print(ccs5)
print()
print("Satisfied_democracy vs Good_democracy")
recode6={"Satisfied_democracy":"Satisfied_democracy","Good_democracy":"Good_democracy"}
sub1["Sat_VS_Go"]=sub1['polityscore_grp'].map(recode6)
cc6=pandas.crosstab(sub1['internetuserate_tile'],sub1["Sat_VS_Go"])
print(cc6)
ccs6=ss.chi2_contingency(cc6)
print(ccs6)