One-way ANOVA (Analysis of Variance) is a technique for hypothesis testing. It is used to test whether the means of different groups are really different.
In this notebook we will use the data-set International football results from 1872 to 2019, which is available from the Kaggle website.
- Load libraries
- Load and explore data-set
- The Hypothesis
- ANOVA
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
%matplotlib inline
matplotlib.style.use('fivethirtyeight')
rng = np.random.RandomState(201910)
%reload_ext watermark
%watermark -v -m --iversions
We are now ready to load the data-set mentioned above and explore it.
data = pd.read_csv('results.csv')
Getting a sense of the data-set.
data.head()
data.info()
Rebuild the index as a datetime index from the `date` column
data.index = pd.to_datetime(data.date)
data.index
data.shape
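As a minimal sketch of why a datetime index is convenient, the toy frame below (made-up rows, not taken from `results.csv`) is indexed the same way and then sliced by year. The column names mirror the ones used in this notebook; the values are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; not real match results
toy = pd.DataFrame({
    'date': ['1930-01-01', '1950-06-15', '2018-03-20'],
    'home_score': [4, 2, 1],
    'away_score': [1, 2, 3],
})

# Same step as above: parse the string column and use it as the index
toy.index = pd.to_datetime(toy.date)

# A sorted DatetimeIndex supports partial-string slicing by year
recent = toy.loc['2000':]

# Date components are available directly on the index
years = list(toy.index.year)
print(recent)
print(years)
```

With the index in place, selecting a period such as "everything since 2000" becomes a one-line slice.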
Several types of tournament are present in this data-set.
data.tournament.unique()[:40]
The Hypothesis
Getting a sense of the score distribution in World Cup related tournaments
fifa_wc_data = data[data.tournament.isin(['FIFA World Cup', 'FIFA World Cup qualification'])]
fifa_wc_data.tournament.unique()
fifa_wc_data = fifa_wc_data.assign(score=fifa_wc_data.home_score - fifa_wc_data.away_score)
fifa_wc_data.dtypes
fifa_wc_data.index
plt.figure(figsize=(12, 7))
plt.plot(fifa_wc_data['score'].index, fifa_wc_data['score'].values, 'ro', label = 'Home/Away `score` in match')
plt.ylabel("Scores")
plt.title("Home-Away score in match")
plt.show()
plt.figure(figsize=(12, 7))
plt.title('Home-Away score distributions')
plt.ylabel('pdf')
sns.distplot(fifa_wc_data.score)
plt.xlim(-10, 10)
plt.show()
plt.figure(figsize=(12, 7))
sns.distplot(fifa_wc_data[fifa_wc_data.neutral == True].score, label='Neutral venue')
sns.distplot(fifa_wc_data[fifa_wc_data.neutral == False].score, label='Home venue')
plt.title('Score distribution for each type of venue')
plt.legend()
plt.xlim(-10, 10)
plt.show()
# Select the score column first so that only numeric data is aggregated
fifa_wc_data.groupby('neutral').score.agg(
    [np.mean, np.median, np.count_nonzero, np.std]
)
fifa_wc_data.groupby(['neutral', 'tournament']).score.agg(
    [np.mean, np.median, np.count_nonzero, np.std]
)
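One detail worth keeping in mind when reading the summary above: `np.count_nonzero` counts only the matches whose score difference is not zero, so draws are left out of that column. The sketch below uses a small made-up frame (not the Kaggle data) to contrast it with pandas' `'count'` aggregation, which counts every non-missing row.

```python
import pandas as pd

# Made-up data for illustration: two neutral-venue and three home-venue matches
toy = pd.DataFrame({
    'neutral': [True, True, False, False, False],
    'score': [0, 2, 1, 0, 3],
})

# String aggregation names; 'count' counts all non-missing rows per group
summary = toy.groupby('neutral')['score'].agg(['mean', 'median', 'count', 'std'])

# Number of matches with a non-zero goal difference (draws excluded)
n_nonzero = toy.groupby('neutral')['score'].apply(lambda s: int((s != 0).sum()))

print(summary)
print(n_nonzero)
```

If the group size is what matters, `'count'` is likely the safer choice; the non-zero count is only the number of matches that were not draws.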
We can see a difference in score between home and neutral venues.
ANOVA
The null hypothesis of the one-way ANOVA is $μ_{score_{neutral}} = μ_{score_{home}}$,
and the test checks whether the data are consistent with it.
Let's assume that we fixed our confidence level at 99.99% in advance, which means that we accept a 0.01% error rate (alpha = 0.0001).
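To make the test statistic concrete, here is a minimal sketch of what a one-way ANOVA computes: the ratio of between-group to within-group variance, compared against an F distribution. It runs on synthetic samples rather than the football data, and it assumes SciPy is installed (SciPy is not imported elsewhere in this notebook). The helper `one_way_f` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic goal-difference samples (illustrative only, not the Kaggle data)
rng = np.random.RandomState(0)
neutral = rng.normal(loc=0.2, scale=2.0, size=200)
home = rng.normal(loc=0.8, scale=2.0, size=300)
groups = [neutral, home]

def one_way_f(groups):
    """Return (F, p) for a one-way ANOVA over a list of 1-D samples."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Upper-tail probability of the F distribution
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return f_stat, p_value

f_manual, p_manual = one_way_f(groups)
f_scipy, p_scipy = stats.f_oneway(neutral, home)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

The manual computation and `scipy.stats.f_oneway` should agree up to floating-point error. A large F (small p-value) means the group means differ by more than the within-group spread would explain.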
mod = ols('score ~ neutral', fifa_wc_data[fifa_wc_data.tournament=='FIFA World Cup']).fit()
anova_table = sm.stats.anova_lm(mod, typ=2)
print('ANOVA table for FIFA World Cup')
print('----------------------')
print(anova_table)
print()
mod = ols('score ~ neutral', fifa_wc_data[fifa_wc_data.tournament=='FIFA World Cup qualification']).fit()
anova_table = sm.stats.anova_lm(mod, typ=2)
print('ANOVA table for FIFA World Cup qualification')
print('----------------------')
print(anova_table)
There are two p-values (PR(>F)) to look at here: one for the World Cup itself and one for the qualification games.
For the World Cup, we cannot reject the null hypothesis at the 99.99% confidence level, because the p-value is greater than our alpha (0.000493 > 0.0001). So, at this threshold, playing at a home venue is not a significant factor in the final tournament.
For qualification, since the p-value PR(>F) is less than our error rate (1.195318e-12 < 0.0001), we can reject the null hypothesis. This means we are quite confident that there is a difference in score between venue types in qualification games.
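The decision rule can be written down in a few lines. The sketch below applies it to the two p-values quoted above; the helper name `decide` is hypothetical and exists only for this illustration.

```python
ALPHA = 0.0001  # 99.99% confidence level, i.e. a 0.01% error rate

# p-values (PR(>F)) as quoted in the text above
p_values = {
    'FIFA World Cup': 0.000493,
    'FIFA World Cup qualification': 1.195318e-12,
}

def decide(p_value, alpha=ALPHA):
    """Reject H0 only when the p-value is below the chosen alpha."""
    return 'reject H0' if p_value < alpha else 'fail to reject H0'

decisions = {name: decide(p) for name, p in p_values.items()}
for name, decision in decisions.items():
    print(f'{name}: {decision}')
```

Note that failing to reject the null hypothesis is not the same as accepting it: with a less strict alpha such as 0.001, the World Cup result would also be declared significant.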