Sunday, December 20, 2015

Running an analysis of variance (Python)

Objective of the study:
Investigate the relationship between a country's democracy level and its internet use rate.

Explanation and Methodologies:
From the "Gapminder" dataset, two variables are extracted: "polityscore" and "internetuserate". Each observation in the dataset is a country. The variable "polityscore" measures the extent of a country's democracy and ranges from -10 to 10. In the explanation below, the term "democracy level" is used instead of "polity score" to make the meaning of the variable clearer.

First, "polityscore" is divided into four groups (4 levels) according to its value as follows:


polityscore       group (level)
-10 to -5         No_democracy
above -5 to 0     Unsatisfied_democracy
above 0 to 5      Satisfied_democracy
above 5 to 10     Good_democracy


In my study, "polityscore" (grouped into these four levels) is the explanatory variable and "internetuserate" is the response variable. A one-way ANOVA across the four levels is then carried out.
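The four-level grouping described above can be sketched without pandas to make the bin boundaries explicit. This is a minimal illustration, assuming right-closed intervals with edges -11, -5, 0, 5, 10 (the default behaviour of pandas.cut with those edges); the helper name polity_group is hypothetical and is not part of the analysis code.

```python
from bisect import bisect_left

# Bin edges and labels; intervals are right-closed:
# (-11,-5], (-5,0], (0,5], (5,10]
EDGES = [-11, -5, 0, 5, 10]
LABELS = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]

def polity_group(score):
    """Return the democracy-level label for a polity score, or None if out of range."""
    # bisect_left finds the first edge >= score, so a score equal to an
    # edge is assigned to the interval that ends at that edge.
    i = bisect_left(EDGES, score) - 1
    if 0 <= i < len(LABELS):
        return LABELS[i]
    return None

# Show how boundary scores are assigned
for s in (-10, -5, -4, 0, 1, 5, 6, 10):
    print(s, polity_group(s))
```

A score that sits exactly on an edge, such as 0, falls into the lower of the two neighbouring groups.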

The null hypothesis: The mean internet use rate does not differ among the levels of democracy (i.e. for internet use rate, mean1 = mean2 = mean3 = mean4).
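To make the hypothesis concrete, the F statistic that a one-way ANOVA relies on can be computed by hand. The sketch below uses small made-up groups (not Gapminder data) and a hypothetical helper f_statistic; it only illustrates the ratio of between-group to within-group variance and does not compute a p-value.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up toy data: three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(toy_groups))
```

A large F means the group means are spread out relative to the variation inside each group, which is evidence against the null hypothesis of equal means.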

Python Output 1:
 number of rows and columns of gapminder
213
16
a dataframe combining 4 chosen variables
     internetuserate        polityscore_grp
0           3.654122  Unsatisfied_democracy
1          44.989947         Good_democracy
2          12.500073    Satisfied_democracy
4           9.999954  Unsatisfied_democracy
6          36.000335         Good_democracy
7          44.001025    Satisfied_democracy
9          75.895654         Good_democracy
10         72.731576         Good_democracy
11         46.679702           No_democracy
13         54.992809           No_democracy
14          3.700003    Satisfied_democracy
16         32.052144           No_democracy
17         73.733934         Good_democracy
19          3.129962         Good_democracy
21         13.598876    Satisfied_democracy
22         20.001710         Good_democracy
24          5.999836         Good_democracy
25         40.650098         Good_democracy
27         45.986590         Good_democracy
28          1.400061  Unsatisfied_democracy
29          2.100213         Good_democracy
30          1.259934    Satisfied_democracy
31          3.999977  Unsatisfied_democracy
32         81.338393         Good_democracy
35          2.300027  Unsatisfied_democracy
36          1.700031  Unsatisfied_democracy
37         45.000000         Good_democracy
38         34.377790           No_democracy
39         36.499875         Good_democracy
40          5.098265         Good_democracy
..               ...                    ...
175        69.339971         Good_democracy
176         5.001375         Good_democracy
178        12.334893         Good_democracy
179        65.808554         Good_democracy
180        11.999971         Good_democracy
183         9.007736           No_democracy
184        90.016190         Good_democracy
185        82.166660         Good_democracy
186        20.663156           No_democracy
188        11.549391  Unsatisfied_democracy
189        11.000055  Unsatisfied_democracy
190        21.200072    Satisfied_democracy
191         0.210066         Good_democracy
192         5.379820  Unsatisfied_democracy
194        48.516818         Good_democracy
195        36.562553  Unsatisfied_democracy
196        39.820178         Good_democracy
197         2.199998           No_democracy
199        12.500255  Unsatisfied_democracy
200        44.585355         Good_democracy
201        77.996781           No_democracy
202        84.731705         Good_democracy
203        74.247572         Good_democracy
204        47.867469         Good_democracy
205        19.445021           No_democracy
207        35.850437  Unsatisfied_democracy
208        27.851822           No_democracy
210        12.349750  Unsatisfied_democracy
211        10.124986         Good_democracy
212        11.500415    Satisfied_democracy

This is the dataframe containing the two variables that are going to be analyzed.


Python Output 2:

-----


We can see that the p-value of the F-statistic is 8.74e-08, which is lower than 0.05, the significance level used to reject the null hypothesis. Therefore, the null hypothesis of no difference in mean internet use rate among the democracy level groups is rejected.

Now we are going to take a deeper look, since the previous result does not tell us which groups differ from the others. That is, we do not know which pairs of groups (democracy levels) have a statistically significant difference in mean internet use rate. Therefore, a post hoc test for the ANOVA is carried out: Tukey's Honestly Significant Difference test.
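The need for a post hoc procedure can be illustrated with a little arithmetic. The sketch below lists the pairwise comparisons among the four groups and, under the simplifying assumption that the comparisons are independent, computes the chance of at least one false positive if each pair were tested at 0.05 without any correction. Tukey's test is designed to keep that family-wise error rate at the chosen level.

```python
from itertools import combinations

levels = ["No_democracy", "Unsatisfied_democracy",
          "Satisfied_democracy", "Good_democracy"]

# Every unordered pair of democracy levels
pairs = list(combinations(levels, 2))

# Assumption: independent tests, each at alpha = 0.05
alpha = 0.05
family_wise_error = 1 - (1 - alpha) ** len(pairs)

print(len(pairs))                   # 6
print(round(family_wise_error, 4))  # 0.2649
```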

Python Output 3:


The means and standard deviations of the internet use rate for each democracy level group are given in the output for reference. One thing that draws my attention is that the mean of the "No_democracy" group is unexpectedly high: since it is the lowest democracy level, I expected the mean difference between this group and "Good_democracy" to be the largest among the four groups.

According to Tukey's test, there is a statistically significant difference in mean internet use rate between:
  • "Good_democracy" and "Satisfied_democracy"
  • "Good_democracy" and "Unsatisfied_democracy"

Conclusion
According to the ANOVA, democracy level is significantly associated with internet use rate, but my study does not establish a causal relationship between the two variables.

In fact, I expected the internet use rate to be positively related to democracy level (that is, one variable takes higher values as the value or level of the other increases). However, Tukey's Honestly Significant Difference test gives an unexpected result. This suggests that the relationship between internet use rate and democracy level is not a simple positive linear one, and further analysis may be required.

Python Code
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gapminder = pandas.read_csv('gapminder.csv',low_memory=False)

print('number of rows and columns of gapminder')
print(len(gapminder))
print(len(gapminder.columns))


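# convert_objects() is no longer available in current pandas;
# to_numeric(errors='coerce') turns non-numeric entries into NaN instead
gapminder['internetuserate']=pandas.to_numeric(gapminder['internetuserate'],errors='coerce')
gapminder['polityscore']=pandas.to_numeric(gapminder['polityscore'],errors='coerce')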


# bin polityscore into four right-closed intervals: (-11,-5], (-5,0], (0,5], (5,10]
gapminder['polityscore_grp']=pandas.cut(gapminder.polityscore,[-11,-5,0,5,10],labels=["No_democracy","Unsatisfied_democracy","Satisfied_democracy","Good_democracy"])



print('a dataframe combining 4 chosen variables')
sub1=gapminder[['internetuserate','polityscore_grp']].dropna()
print(sub1)


model1=smf.ols(formula='internetuserate~C(polityscore_grp)',data=sub1).fit()
print(model1.summary())

print('means of internetuserate among the four groups')
m2=sub1.groupby('polityscore_grp').mean()
print(m2)
print('standard deviations of internetuserate among the four groups')
sd2=sub1.groupby('polityscore_grp').std()
print(sd2)

print()
print("Tukey's Honestly Significant Difference Test")
model2=multi.MultiComparison(sub1['internetuserate'],sub1['polityscore_grp']).tukeyhsd()

print(model2.summary())
