```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r loadLib}
library("lubridate")
library("RColorBrewer")
library("car")
library("scales")
library("FactoMineR")
library("factoextra")
```
## Introduction
The current dataset comes from kaggle
https://www.kaggle.com/ralle360/historic-tour-de-france-dataset and provides
information about the Tour de France race, including 8 different variables
corresponding to 2,236 stages of the race. The data have been extracted from
Wikipedia and prepared by RasmusFiskerBang and is made available through a
CSV file that you can download on my website (the dataset is licensed under the
CC0 license - Public Domain).
## Data importation
Even though CSV files can be opened with Excel, we strongly discourage this
practice and we will use **R** directly for this task:
```{r importation}
tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
stringsAsFactors = FALSE, encoding = "UTF-8",
quote = "\"")
head(tdf_data)
```
where:
* `sep` is used to provide the character separating columns;
* `header = TRUE` indicates that column names are included in the file (in the
first row);
* `stringsAsFactor = FALSE` indicates that strings must not be converted to
type `factor` (this is the default behavior since **R** 4.0.0);
* `quote = "\""` indicates which character(s) has(have) to be considered as
quotation character and not part of the data.
Other information and options are available in the help page `?read.table` or
in `?read.csv`, `?read.csv2`, `?read.delim`, `?read.delim2`.
We can take a first look at the data with:
```{r head}
head(tdf_data)
summary(tdf_data)
```
When missing data are present, this last command shows that many rows contain
missing values (identified with `NA`) in each column. The dataset dimension is
obtained with:
```{r dim}
dim(tdf_data)
```
Information on column types can be obtained with:
```{r classes}
sapply(tdf_data, class)
```
which indicates that all columns are character except for the third one, which
is numeric. Sometimes, numeric variables (and more specifically integers) are
used to code for categorical variables but this is not the case in this dataset.
To be allowed to perform properly the subsequent analysis, it is best to
precisely define the different types of the columns:
* Stage: is a character variable but that correspond to the number (or the
symbol) of the stage each year. It is better recoded as a `factor`, which is the
proper type in R for categorical variables with a finite number of possible
values
```{r convertStageToFactor}
tdf_data$Stage <- factor(tdf_data$Stage)
```
* Date: is a date and there is a special type in R for dates, which is better
used here (for instance, to easily extract the year or the month of the date)
as a numeric variable
```{r convertDateYear}
tdf_data$Date <- as.Date(tdf_data$Date, format = c("%Y-%m-%d"))
tdf_data$Year <- year(tdf_data$Date)
```
* Distance: is indeed a numeric variable;
* Origin and Destination: are categorical variables but the number of possible
values (town of origin and destination of the stage) is so large that there is
only a minor benefit in converting it to a factor
* Type: is a categorical variable that is better converted to a factor
```{r convertTypeToFactor}
tdf_data$Type <- factor(tdf_data$Type)
```
* Winner: is the winner name so the number of possible values is so large that
there is only a minor benefit in converting it to a factor
* Winner_Country: is the country code of the winner so is better converted to
a factor:
```{r convertCountryToFactor}
tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)
```
After these stages, the summary of the dataset is already more informative:
```{r summaryPostCleaning}
summary(tdf_data)
```
## Univariate statistics
### Numerical characteristics
The variable `Distance` is the distance of the stage:
```{r numChar}
mean(tdf_data$Distance) # mean
median(tdf_data$Distance) # median
min(tdf_data$Distance) # minimum
max(tdf_data$Distance) # maximum
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
```
The option `na.rm = TRUE` must be used when you have missing values that you
don't want to be taken into account for the computation (otherwise, most of
these functions would return the value `NA`).
*Exercise*: How to interpret these values? More precisely, what do they say
about the variable distribution?
Some of these values are also available with
```{r numSum}
summary(tdf_data$Distance)
```
Dispersion characteristics are obtained with:
```{r numVaribility}
var(tdf_data$Distance) # variance
sd(tdf_data$Distance) # standard deviation
range(tdf_data$Distance) # range
diff(range(tdf_data$Distance))
```
*Exercise*: How would you compute the inter-quartile range (in just one line of
code)?
```{r interQuartRange, echo=FALSE}
#diff(quantile(range(tdf_data$Distance), probs = c(0.25, 0.75)))
diff(quantile(tdf_data$Distance, probs = c(0.25, 0.75)))
IQR(tdf_data$Distance)
```
*Exercise*: What is the coefficient of variation (CV) for this variable?
```{r CV, echo=FALSE}
sd(tdf_data$Distance) / mean(tdf_data$Distance)
```
Standard modifications of data include:
* *binarization*:
```{r cuts}
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
```
*Exercise*: What is the mode of `cut3`?
```{r cut3Mod}
names(sort(-table(tdf_data$cut3)))[1]
```
* *centering and scaling*
```{r scale}
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
var(tdf_data$DScaled)
```
### Charts
Before we start, a short note on color palettes:
```{r palettes}
display.brewer.all()
```
* For a **factor** (`cut2`):
```{r factorCharts}
pie(table(tdf_data$Type), col = c(brewer.pal(9, "Set1"), brewer.pal(9, "Set2"))) # 18 niveaux > 18 couleurs
```
```{r factorCharts2}
barplot(table(tdf_data$Winner_Country), col = "darkgreen", las = 3,
cex.names = 0.6, cex.axis=0.6)
```
* For *numeric variables*
```{r numCharts}
hist(tdf_data$Distance)
hist(tdf_data$Distance, main = "Distribution of stage distances",
xlab = "distance (km)", breaks = 20)
plot(density(tdf_data$Distance))
hist(tdf_data$Distance, main = "Distribution of stage distances",
xlab = "distance (km)", breaks = 20, freq = FALSE)
lines(density(tdf_data$Distance), col = "red", lwd = 2)
boxplot(tdf_data$Distance)
boxplot(tdf_data$Distance, horizontal = TRUE, notch = TRUE)
```
## Bivariate statistics
### Factor versus factor
Contingency tables are obtained with the function `table` (here on a subset of
the winner countries and of the stage type).
```{r contingencyTable}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
cont_table <- table(tdf_data$Type[selected_stages],
tdf_data$Winner_Country[selected_stages, drop = TRUE])
cont_table
cont_table / rowSums(cont_table)
```
*Exercise*: How to interpret the numbers in the last table (we call them
"row profiles")? For instance, what does 0.4545454545, in the column of France,
mean? Compute the column profiles.
```{r columnProfiles, echo=FALSE}
cont_table / colSums(cont_table)
```
```{r cramerV}
chisq <- chisq.test(cont_table)$statistic
ddl <- min(nrow(cont_table) - 1, ncol(cont_table) - 1)
cramerV <- sqrt(chisq / (sum(cont_table) * ddl))
cramerV
```
Barplots are obtained using the contingency table as well:
```{r barplot2Vars}
selected_stages <- selected_stages &
(tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
cont_table <- table(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Winner_Country[selected_stages, drop = TRUE])
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
cex.names = 0.6)
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
beside = TRUE, col = brewer.pal(3, "Set3"), cex.names = 0.6)
```
### Numeric versus numeric
Covariances can be obtained using the function `cov`:
```{r covariances}
cov(tdf_data$Distance, tdf_data$Year)
cov(tdf_data$Distance, tdf_data$Year, method = "spearman")
cov(tdf_data[ ,c("Distance", "Year", "DScaled")])
cor(tdf_data$Distance, tdf_data$Year)
cor(tdf_data$Distance, tdf_data$Year, method = "spearman")
cor(tdf_data[ ,c("Distance", "Year", "DScaled")])
```
and correlations are obtained with the function `cor` that takes the same
arguments than `cov`.
Partial correlation between `Distance` and `Year` given `DScaled` is obtained
with:
```{r partialCor}
cor(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals,
lm(tdf_data$Year ~ tdf_data$DScaled)$residuals)
```
Is it an expected result? Check at the residuals of the first model: what can
you say?
```{r}
summary(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals)
```
Dot plots (scatterplots) between two variables are obtained with:
```{r dotplots}
plot(tdf_data$Year, tdf_data$Distance, pch = 19)
```
or with scatterplot matrices:
```{r scatterplotMatrix}
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")])
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")],
col = "black", regLine = FALSE, smooth = FALSE, pch = "+")
```
Can you comment on the specific plot between `Distance` and `DScaled`?
### Numeric versus factor
Within and between variance of `Distance` with respect to `Winner_Country` are
obtained from the function `anova` (in the column `Mean Sq`, within group
variance corresponds to the row `residuals`):
```{r BetWithVars}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
anova(lm(tdf_data$Distance[selected_stages] ~
factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
```
and the correlation ratio is the square root of the column `F value`:
```{r corrRatio}
cor_ratio <- anova(lm(tdf_data$Distance[selected_stages] ~
factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
sqrt(cor_ratio[1, "Mean Sq"]/(cor_ratio[1, "Mean Sq"] + cor_ratio[2, "Mean Sq"]))
```
Parallel boxplots are obtained using the same syntax with the `~`:
```{r parallelPlots}
boxplot(tdf_data$Distance[selected_stages] ~
tdf_data$Winner_Country[selected_stages, drop = TRUE], las = 3,
xlab = "winner country", ylab = "distance")
```
To obtain multiple histograms on the same plot, you can use:
```{r multipleHistogramsB, echo=FALSE, fig.width=15, fig.height=15}
par(mfrow = c(5, 4))
for (wc in selected_countries) {
tmp <- tdf_data[tdf_data$Winner_Country == wc, c("Distance")]
hist(tmp, main = paste("Histogram of Distance for ", wc), xlab = "distance")
}
```
## Tests
### With one variable
To test if `Distance` average is equal to 200.
* with a **parametric test**, we need to first test if the variable is
distributed as a Gaussian variable. Several tests exist to do that but we will
use the Shapiro-Wilk's test:
```{r shapiroTest}
shapiro.test(tdf_data$Distance)
```
Despite significant deviation to normality, we will perform a Student test (for
the sake of the example):
```{r oneSTTest}
res <- t.test(tdf_data$Distance, mu = 200, conf.level = 0.99)
res
res$statistic
res$p.value
res$conf.int
```
* with a *non-parametric* Wilcoxon test (that tests the median instead of the
mean):
```{r oneWilcoxTest}
res <- wilcox.test(tdf_data$Distance, mu = 200, conf.int = TRUE)
res
res$statistic
res$p.value
res$conf.int
```
### With two factors
For small contingency tables, the independence between rows and columns can be
tested with a Fisher's exact test (to be preferred) or a $\chi^2$ test (only
when Fisher's exact test is computationally too heavy to run or when the sample
size is sufficiently large).
```{r contTable}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
selected_stages <- selected_stages &
(tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Type", "Distance")
cont_table <- table(tdf_small$Winner_Country, tdf_small$Type)
cont_table
```
```{fisherTest, eval=FALSE}
res <- fisher.test(cont_table)
# Error in fisher.test(cont_table) :
# FEXACT error 7(location). LDSTP=18330 is too small for this problem,
# (pastp=11.752, ipn_0:=ipoin[itp=563]=3932, stp[ipn_0]=10.2479).
# Increase workspace or consider using 'simulate.p.value=TRUE'
```
In addition to being more suited to large contingency tables, $\chi^2$ test also
provide interesting statistics for interpretation of the results:
```{r chiSQ}
res <- chisq.test(cont_table)
res
res$observed
res$expected
res$residuals^2
```
```{r barplot0}
barplot(t(cont_table), legend.text = TRUE, xlab = "country", ylab = "frequency",
cex.names = 0.6, col = brewer.pal(3, "Set3"))
```
*Exercise*: `Titanic[ , Sex = "Male", Age = "Adult", ]` is the contingency
table of Male Adults in Titanics for the variables `Class` and `Survival`. Are
these two variables independant? Which classes deviate the most from the
independence? Same questions for Women.
```{r titanic, echo=FALSE}
fisher.test(Titanic[ , Sex = "Male", Age = "Adult", ])
res <- chisq.test(Titanic[ , Sex = "Male", Age = "Adult", ])
res
res$observed
res$expected
res$residuals^2
res <- chisq.test(Titanic[ , Sex = "Female", Age = "Adult", ])
res
res$observed
res$expected
res$residuals^2
```
### With two numeric variables
Correlation tests are performed with:
```{r corrTests}
cor.test(tdf_data$Distance, tdf_data$Year)
cor.test(tdf_data$Distance, tdf_data$Year, method = "spearman")
plot(tdf_data$Distance, tdf_data$Year)
```
*Exercise*: The object `Soils` contain characteristics on soil samples.
Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these
variables and pick two variables that you think are linearly correlated. Test
your hypothesis.
```{r soils, echo=FALSE}
scatterplotMatrix(Soils[ ,6:13], smooth = FALSE, regLine = FALSE, col = "black",
pch = "+")
with(Soils[ ,c("pH", "Ca")], cor.test(pH, Ca))
```
### With a numeric variable explained by a factor ($K=2$, independent)
Comparing the means of the given numeric variable `Distance` for the different
countries of origin of the winner `Winner_Country`, for the countries DEN and
FRA:
```{r mean2Sample}
selected_countries <- c("DEN", "FRA")
selected_stages <- tdf_data$Winner_Country %in% selected_countries
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Distance")
t.test(tdf_small$Distance ~ tdf_small$Winner_Country)
t.test(tdf_small$Distance ~ tdf_small$Winner_Country, var.equal = TRUE)
```
How to know if two variances are equal? Bartlett test or Fisher's test...
```{r bartlett}
bartlett.test(tdf_small$Distance ~ tdf_small$Winner_Country)
var.test(tdf_small$Distance ~ tdf_small$Winner_Country)
```
Comparing the medians of the given numeric variable `Distance` between the two
winner countries, FRA and DEN:
```{r median2Sample}
wilcox.test(tdf_small$Distance ~ tdf_small$Winner_Country)
```
*Exercise*: In the dataset `Soils`, does the pH significantly differ between
samples on depressions and on slopes (variable `Contour`)? Visually confirm with
the appropriate plot.
```{r SoilsContourPh, echo=FALSE}
shapiro.test(Soils$pH[Soils$Contour == "Depression"])
shapiro.test(Soils$pH[Soils$Contour == "Slope"])
bartlett.test(pH ~ Contour,
data = Soils[Soils$Contour %in% c("Depression", "Slope"), ])
t.test(pH ~ Contour,
data = Soils[Soils$Contour %in% c("Depression", "Slope"), ],
var.equal = TRUE)
df <- Soils[Soils$Contour %in% c("Depression", "Slope"), ]
df$Contour <- droplevels(df$Contour)
with(df, boxplot(pH ~ Contour))
```
### With a numeric variable explained by a factor ($K > 2$, independent)
ANOVA is based on the previous fitting of a linear model. For instance, if we
want to check if the means `Distance` are different between `Type` (with a
restricted dataset), we have to run (under normality and equal variance
assumptions):
```{r ANOVA}
selected_stages <- (tdf_data$Type %in%
c("Individual time trial", "Plain stage",
"Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
anova(lm(tdf_small$Distance ~ tdf_small$Type))
```
Kruskal-Wallis nonparametric test is performed with:
```{r KW}
kruskal.test(tdf_small$Distance ~ tdf_small$Type)
```
*Exercise*: In the dataset `Soils`, does the pH significantly differ between
the different contours? Visually confirm with the appropriate plot.
```{r SoilsContourPh2, echo=FALSE}
shapiro.test(Soils$pH[Soils$Contour == "Top"])
bartlett.test(pH ~ Contour, data = Soils)
anova(lm(pH ~ Contour, data = Soils))
boxplot(pH ~ Contour, data = Soils)
# Alternatives de programmation
#boxplot(Soils$pH ~ Soils$Contour)
#plot(Soils$Contour, Soils$pH)
```
### With a numeric variables in two paired samples (two groups with the same individuals)
The `sleep` data show the effect (`extra`) of two soporific drugs (`group`) on
10 patients (`ID`):
```{r sleep}
head(sleep)
tail(sleep)
dim(sleep)
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
dim(sleep2)
head(sleep2)
```
Paired comparison tests are performed with (under normality assumption):
```{r pairedTTest}
with(sleep2, t.test(extra.1, extra.2, paired = TRUE))
```
and with a nonparametric test:
```{r pairedWTest}
with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE))
```
Si beaucoup d'ex-aequo ne pas utiliser cette p-value mais si le nombre d'ex-aequo est faible la p-value peut etre utilisee.
```{r}
boxplot(extra ~ group, data = sleep)
```
```{r}
sleep2$diff <- sleep2$extra.2 - sleep2$extra.1
head(sleep2)
t.test(x=sleep2$diff)
wilcox.test(x=sleep2$diff)
```
## Linear regression and models
### A numeric variable explained by one or several numeric variables
```{r linearReg}
res <- lm(tdf_data$Distance ~ tdf_data$Year)
res
summary(res)
coefficients(res)
confint(res)
plot(res)
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# For details on validations plots
```
```{r linearModels}
tdf_small <- tdf_data[, c("Distance", "Year")]
head(tdf_small)
res <- lm(Distance ~ ., data = tdf_small)
summary(res)
plot(res)
```
### A numeric variable explained by one (or several) factor(s)
```{r ANOVALM}
selected_stages <- (tdf_data$Type %in%
c("Individual time trial", "Plain stage",
"Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
head(tdf_small)
res <- lm(Distance ~ Type, data = tdf_small)
summary(res)
anova(res)
```
*Exercise*: Explain the pH of `Soils` by an additive effect between contour and
`Ca` and with a additive effect with interaction between contour and Depth.
```{r completeLin, echo=FALSE}
res <- lm(pH ~ Contour + Ca, data = Soils)
summary(res)
res = aov(pH ~ Contour + Depth, data = Soils)
TukeyHSD(res)
plot(TukeyHSD(res))
res <- lm(pH ~ Contour + Depth + Contour:Depth, data = Soils)
summary(res)
anova(res)
```
### A binary variable is explained by the linear relation between predictor(s)
```{r logit}
res <- glm(Type ~ Distance, data = tdf_small, family = binomial(link = "logit"))
summary(res)
```
## Multiple testing correction
If we test the differences between the levels of Contour of the mean for all
numeric variables in `Soils`, we obtain 9 p-values on which multiple testing
can be performed:
```{r multTest}
all_pvals <- apply(Soils[ ,6:14], 2, function(avar)
anova(lm(avar ~ Soils$Group))[1, "Pr(>F)"])
all_pvals
p.adjust(all_pvals, method = "BH")
p.adjust(all_pvals, method = "bonferroni")
```
## PCA
```{r loadUSA}
data("USArrests")
summary(USArrests)
```
```{r PCA}
pca_usa <- PCA(USArrests, graph = FALSE)
fviz_eig(pca_usa)
```
```{r PCAplotVar}
fviz_pca_var(pca_usa, choice = "var")
```
```{r PCAPlotInd}
fviz_pca_ind(pca_usa, choice = "ind")
```
```{r PCAPlotInd2}
fviz_pca_ind(pca_usa, choice = "ind", col.ind = USArrests$UrbanPop)
```
*Exercise*: Perform PCA of the dataset `athle_records.csv` after
log-transformation of the data.
## Clustering
```{r hc}
scaled_usa <- scale(USArrests)
hc_usa <- hclust(dist(scaled_usa)^2, method = "ward.D")
plot(hc_usa)
```
```{r withinss}
plot(49:1, cumsum(hc_usa$height), type = "b")
1 - cumsum(hc_usa$height)[46] / cumsum(hc_usa$height)[49] # reproduced inertia
```
```{r cutdendro}
plot(hc_usa)
rect.hclust(hc_usa, k = 4)
```
```{r clusteringhc}
clust_hc <- cutree(hc_usa, k = 4)
clust_hc
fviz_pca_ind(pca_usa, col.ind = factor(clust_hc))
```
```{r kmeans}
init_center <- by(scaled_usa, clust_hc, colMeans)
init_center <- Reduce("rbind", init_center)
clust_kmeans <- kmeans(scaled_usa, centers = init_center, nstart = 20)
clust_kmeans
```
```{r clusteringkmeans}
fviz_pca_ind(pca_usa, col.ind = factor(clust_kmeans$cluster))
```
*Exercise*: Perform clustering of the dataset `athle_records.csv` after
log-transformation and scaling of the data.
## Session information
This file has been compiled with the current system:
```{r sessionInfo}
sessionInfo()
```