Investigate the correlation between a country's democracy level and its internet use rate.
Explanation and Methodologies:
Two variables are extracted from the "Gapminder" dataset: "polityscore" and "internetuserate". Each observation in the dataset is a country. The variable "polityscore" measures the extent of a country's democracy on a scale from -10 to 10. In the explanation below, the term "democracy level" is used in place of "polity score" to make the meaning of the variable clearer.
First, "polityscore" is divided into four groups (4 levels) according to its value as follows (the upper bound of each range is included in that group):
polityscore | group(level) |
-10 to -5 | No_democracy |
above -5 to 0 | Unsatisfied_democracy |
above 0 to 5 | Satisfied_democracy |
above 5 to 10 | Good_democracy |
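The grouping above can be checked with a minimal sketch that uses the same bin edges and labels as the analysis code. The scores below are made-up values chosen to sit on and around the bin edges; they are not taken from the Gapminder data.

```python
import pandas as pd

# Hypothetical polity scores placed on and around the bin edges.
scores = pd.Series([-10, -5, -4, 0, 1, 5, 6, 10])

# Same bin edges and labels as in the analysis code. pandas.cut uses
# right-closed intervals by default: (-11, -5], (-5, 0], (0, 5], (5, 10].
bins = [-11, -5, 0, 5, 10]
labels = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]
groups = pd.cut(scores, bins=bins, labels=labels)

# Show which group each score falls into.
for score, group in zip(scores, groups):
    print(score, group)
```

Starting the first bin at -11 rather than -10 is what keeps a score of exactly -10 inside "No_democracy", since the lower edge of each interval is excluded.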
In my study, "polityscore" is the explanatory variable and "internetuserate" is the response variable. A one-way ANOVA across the four levels is carried out.
The null hypothesis: the mean internet use rate does not differ among the levels of democracy (i.e. mean1 = mean2 = mean3 = mean4 for internet use rate).
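To make the test behind this hypothesis concrete, here is a minimal sketch of how a one-way ANOVA F-statistic is formed from between-group and within-group variation. The function name `one_way_f` and the toy numbers are assumptions for illustration only; the actual analysis relies on statsmodels and the Gapminder data.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data: three small groups, not Gapminder values.
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(toy)
print(f_stat, df_b, df_w)
```

A large F means the group means are spread out relative to the scatter inside each group; the p-value is then read from the F distribution with these two degrees of freedom.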
Python Output 1:
number of rows and columns of gapminder
213
16
a dataframe combining 4 chosen variables
internetuserate polityscore_grp
0 3.654122 Unsatisfied_democracy
1 44.989947 Good_democracy
2 12.500073 Satisfied_democracy
4 9.999954 Unsatisfied_democracy
6 36.000335 Good_democracy
7 44.001025 Satisfied_democracy
9 75.895654 Good_democracy
10 72.731576 Good_democracy
11 46.679702 No_democracy
13 54.992809 No_democracy
14 3.700003 Satisfied_democracy
16 32.052144 No_democracy
17 73.733934 Good_democracy
19 3.129962 Good_democracy
21 13.598876 Satisfied_democracy
22 20.001710 Good_democracy
24 5.999836 Good_democracy
25 40.650098 Good_democracy
27 45.986590 Good_democracy
28 1.400061 Unsatisfied_democracy
29 2.100213 Good_democracy
30 1.259934 Satisfied_democracy
31 3.999977 Unsatisfied_democracy
32 81.338393 Good_democracy
35 2.300027 Unsatisfied_democracy
36 1.700031 Unsatisfied_democracy
37 45.000000 Good_democracy
38 34.377790 No_democracy
39 36.499875 Good_democracy
40 5.098265 Good_democracy
.. ... ...
175 69.339971 Good_democracy
176 5.001375 Good_democracy
178 12.334893 Good_democracy
179 65.808554 Good_democracy
180 11.999971 Good_democracy
183 9.007736 No_democracy
184 90.016190 Good_democracy
185 82.166660 Good_democracy
186 20.663156 No_democracy
188 11.549391 Unsatisfied_democracy
189 11.000055 Unsatisfied_democracy
190 21.200072 Satisfied_democracy
191 0.210066 Good_democracy
192 5.379820 Unsatisfied_democracy
194 48.516818 Good_democracy
195 36.562553 Unsatisfied_democracy
196 39.820178 Good_democracy
197 2.199998 No_democracy
199 12.500255 Unsatisfied_democracy
200 44.585355 Good_democracy
201 77.996781 No_democracy
202 84.731705 Good_democracy
203 74.247572 Good_democracy
204 47.867469 Good_democracy
205 19.445021 No_democracy
207 35.850437 Unsatisfied_democracy
208 27.851822 No_democracy
210 12.349750 Unsatisfied_democracy
211 10.124986 Good_democracy
212 11.500415 Satisfied_democracy
This is the dataframe containing the two variables to be analyzed.
Python Output 2:
-----
The p-value of the F-statistic is 8.74e-08, which is lower than 0.05, the significance level used to reject the null hypothesis. Therefore, the null hypothesis that the mean internet use rate does not differ among the democracy level groups is rejected.
This result does not tell us which groups differ from the others, that is, which pairs of groups (democracy levels) have a statistically significant mean difference in internet use rate. Therefore a post hoc test for the ANOVA is carried out next: Tukey's Honestly Significant Difference Test.
Python Output 3:
The output lists the means and standard deviations of the internet use rate for each democracy level group for reference. What drew my attention is that the mean of the "No_democracy" group is unexpectedly high: since it is the lowest democracy level, I expected the mean difference between this group and "Good_democracy" to be the largest among the four groups.
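When comparing group means like this, it can help to look at the group sizes alongside the means and standard deviations, since a mean based on few countries is less stable. The sketch below shows one way to get all three in a single table; the column names and values are made up for illustration and are not Gapminder results.

```python
import pandas as pd

# Hypothetical toy data: two groups with three observations each.
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Mean, sample standard deviation and number of observations per group.
summary = toy.groupby("group")["value"].agg(["mean", "std", "count"])
print(summary)
```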
According to Tukey's test, there is a statistically significant mean difference in internet use rate between:
- "Good_democracy" and "Satisfied_democracy"
- "Good_democracy" and "Unsatisfied_democracy"
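How a list of significant pairs like the one above is read off a Tukey test can be sketched on toy data. This example uses `pairwise_tukeyhsd` from statsmodels, a convenience function closely related to the `MultiComparison(...).tukeyhsd()` call in the analysis code. The group names A, B, C and all values are assumptions for illustration, not Gapminder results.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Toy data: groups A and B hold identical values, group C is shifted by 40.
base = np.array([9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0])
values = np.concatenate([base, base, base + 40.0])
labels = np.repeat(["A", "B", "C"], len(base))

# Compare every pair of groups at a 5% family-wise error rate.
result = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(result.summary())

# One flag per pair, in the order (A, B), (A, C), (B, C):
# True means the null hypothesis of equal means is rejected for that pair.
reject = [bool(r) for r in result.reject]
print(reject)
```

Pairs whose `reject` flag is True are the ones reported as having a statistically significant mean difference; here A and B have a mean difference of zero by construction, while C lies far from both.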
According to the ANOVA, the internet use rate differs significantly among democracy levels, but my study does not verify any causal relationship between the two variables.
In fact, I expected the internet use rate to be positively related to the democracy level (the value of one increases as the value or level of the other increases). However, Tukey's Honestly Significant Difference Test gives an unexpected result. Therefore, the relationship between internet use rate and democracy level is not a simple positive linear one, and further analysis may be required.
Python Code
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Load the Gapminder data and report its size
gapminder = pandas.read_csv('gapminder.csv', low_memory=False)
print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))

# Convert both variables to numeric; entries that cannot be parsed become NaN.
# pandas.to_numeric replaces convert_objects, which is no longer available in current pandas.
gapminder['internetuserate'] = pandas.to_numeric(gapminder['internetuserate'], errors='coerce')
gapminder['polityscore'] = pandas.to_numeric(gapminder['polityscore'], errors='coerce')

# Bin polityscore into four right-closed groups: (-11,-5], (-5,0], (0,5], (5,10]
gapminder['polityscore_grp'] = pandas.cut(gapminder.polityscore, [-11, -5, 0, 5, 10], labels=["No_democracy", "Unsatisfied_democracy", "Satisfied_democracy", "Good_democracy"])

# Keep only the two analysis variables and drop rows with missing values
print('a dataframe combining 4 chosen variables')
sub1 = gapminder[['internetuserate', 'polityscore_grp']].dropna()
print(sub1)

# One-way ANOVA: OLS with the democracy level as a categorical predictor
model1 = smf.ols(formula='internetuserate~C(polityscore_grp)', data=sub1).fit()
print(model1.summary())

# Group means and standard deviations of the response variable
print('means of internetuserate among the four groups')
m2 = sub1.groupby('polityscore_grp').mean()
print(m2)
print('standard deviations of internetuserate among the four groups')
sd2 = sub1.groupby('polityscore_grp').std()
print(sd2)
print()

# Post hoc pairwise comparison of the four groups
print("Tukey's Honestly Significant Difference Test")
model2 = multi.MultiComparison(sub1['internetuserate'], sub1['polityscore_grp']).tukeyhsd()
print(model2.summary())