library("lubridate")
library("RColorBrewer")
library("car")
library("scales")
library("FactoMineR")
library("factoextra")
The current dataset comes from Kaggle (https://www.kaggle.com/ralle360/historic-tour-de-france-dataset) and provides information about the Tour de France race: 8 variables describing 2,236 stages of the race. The data were extracted from Wikipedia and prepared by RasmusFiskerBang; they are made available as a CSV file that you can download on my website (the dataset is licensed under the CC0 license - Public Domain).
Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use R directly for this task:
tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
stringsAsFactors = FALSE, encoding = "UTF-8",
quote = "\"")
where:
sep is used to provide the character separating columns;
header = TRUE indicates that column names are included in the file (in the first row);
stringsAsFactors = FALSE indicates that strings must not be converted to type factor (this is the default behavior since R 4.0.0);
quote = "\"" indicates which character(s) must be considered as quotation characters and not as part of the data.
Other information and options are available in the help page ?read.table or in ?read.csv, ?read.csv2, ?read.delim, ?read.delim2.
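As a minimal sketch of how these functions relate, the snippet below writes a tiny, made-up CSV file to a temporary location and reads it back with both read.table and read.csv. The file content and the object names (toy_file, with_table, with_csv) are hypothetical and only serve the illustration. read.csv is expected to give the same result here because its defaults already include sep = "," and header = TRUE.

```r
# Hypothetical toy file, not the Tour de France data
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Stage,Distance,Type",
             "1,14.0,Individual time trial",
             "2,203.5,Flat stage"),
           toy_file)

# Explicit call, with the same options as in the text
with_table <- read.table(toy_file, sep = ",", header = TRUE,
                         stringsAsFactors = FALSE, quote = "\"")

# read.csv already sets sep = "," and header = TRUE by default
with_csv <- read.csv(toy_file, stringsAsFactors = FALSE)

# Compare the two data frames
identical(with_table, with_csv)

# Clean up the temporary file
unlink(toy_file)
```

On files that contain incomplete rows or the character #, the two calls may differ, because read.table and read.csv do not share the same defaults for fill and comment.char.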
We can take a first look at the data with:
head(tdf_data)
## Stage Date Distance Origin Destination
## 1 1 2017-07-01 14.0 Düsseldorf Düsseldorf
## 2 2 2017-07-02 203.5 Düsseldorf Liège
## 3 3 2017-07-03 212.5 Verviers Longwy
## 4 4 2017-07-04 207.5 Mondorf-les-Bains Vittel
## 5 5 2017-07-05 160.5 Vittel La Planche des Belles Filles
## 6 6 2017-07-06 216.0 Vesoul Troyes
## Type Winner Winner_Country
## 1 Individual time trial Geraint Thomas GBR
## 2 Flat stage Marcel Kittel GER
## 3 Medium mountain stage Peter Sagan SVK
## 4 Flat stage Arnaud Démare FRA
## 5 Medium mountain stage Fabio Aru ITA
## 6 Flat stage Marcel Kittel GER
summary(tdf_data)
## Stage Date Distance Origin
## Length:2236 Length:2236 Min. : 1.0 Length:2236
## Class :character Class :character 1st Qu.:156.0 Class :character
## Mode :character Mode :character Median :199.0 Mode :character
## Mean :196.8
## 3rd Qu.:236.0
## Max. :482.0
## Destination Type Winner Winner_Country
## Length:2236 Length:2236 Length:2236 Length:2236
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
When missing data are present, this last command also shows how many rows contain missing values (identified with NA) in each column.
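Since summary() only reports Length, Class and Mode for character columns, missing values in such columns can be counted directly. The sketch below does so on a hypothetical toy data frame (toy and its content are made up for the illustration, not taken from the dataset).

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(Distance = c(14, NA, 212.5, 207.5),
                  Winner   = c("A", "B", NA, NA),
                  stringsAsFactors = FALSE)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

The same two lines can be applied to any data frame, for instance to tdf_data once it is loaded.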
The dataset dimension is obtained with:
dim(tdf_data)
## [1] 2236 8
Information on column types can be obtained with:
sapply(tdf_data, class)
## Stage Date Distance Origin Destination
## "character" "character" "numeric" "character" "character"
## Type Winner Winner_Country
## "character" "character" "character"
which indicates that all columns are character except for the third one, which is numeric. Sometimes, numeric variables (and more specifically integers) are used to code for categorical variables but this is not the case in this dataset.
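To perform the subsequent analyses properly, it is best to define the type of each column precisely:
Stage: is a categorical variable that is better converted to a factor, which is the proper type in R for categorical variables with a finite number of possible values:
tdf_data$Stage <- factor(tdf_data$Stage)
Date: is a date, better converted to the Date type (the year is then extracted with the function year of lubridate):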
tdf_data$Date <- as.Date(tdf_data$Date, format = c("%Y-%m-%d"))
tdf_data$Year <- year(tdf_data$Date)
Distance: is indeed a numeric variable;
Origin and Destination: are categorical variables, but the number of possible values (towns of origin and destination of the stages) is so large that there is only a minor benefit in converting them to factors;
Type: is a categorical variable that is better converted to a factor:
tdf_data$Type <- factor(tdf_data$Type)
Winner: is the name of the winner, so the number of possible values is so large that there is only a minor benefit in converting it to a factor;
Winner_Country: is the country code of the winner, so it is better converted to a factor:
tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)
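The conversions listed above can be tried on a small example first. The following sketch uses a hypothetical three-row data frame (toy) that mimics some of the columns. To stay free of dependencies, it extracts the year with base R's format() rather than with lubridate's year(), which is a substitution made for this illustration only.

```r
# Hypothetical toy data frame mimicking three of the columns
toy <- data.frame(Date = c("2017-07-01", "2017-07-02", "1903-07-01"),
                  Type = c("Flat stage", "Flat stage", "Plain stage"),
                  Distance = c(14, 203.5, 467),
                  stringsAsFactors = FALSE)

# Categorical variable with few possible values: convert to factor
toy$Type <- factor(toy$Type)

# Character string holding a date: convert to Date
toy$Date <- as.Date(toy$Date, format = "%Y-%m-%d")

# Base R alternative to lubridate::year(), used here to avoid a dependency
toy$Year <- as.integer(format(toy$Date, "%Y"))

# Check the resulting column types and the factor levels
sapply(toy, class)
levels(toy$Type)
```

Once a column is a factor, summary() and table() report counts per level instead of only the length of the column.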
After these steps, the summary of the dataset is already more informative:
summary(tdf_data)
## Stage Date Distance Origin
## 4 : 101 Min. :1903-07-01 Min. : 1.0 Length:2236
## 11 : 100 1st Qu.:1938-07-17 1st Qu.:156.0 Class :character
## 2 : 100 Median :1969-07-02 Median :199.0 Mode :character
## 9 : 100 Mean :1966-08-02 Mean :196.8
## 10 : 99 3rd Qu.:1991-07-20 3rd Qu.:236.0
## 6 : 99 Max. :2017-07-23 Max. :482.0
## (Other):1637
## Destination Type Winner
## Length:2236 Plain stage :1053 Length:2236
## Class :character Stage with mountain(s): 530 Class :character
## Mode :character Individual time trial : 205 Mode :character
## Flat stage : 110
## Team time trial : 87
## Hilly stage : 76
## (Other) : 175
## Winner_Country Year
## FRA :691 Min. :1903
## BEL :460 1st Qu.:1938
## ITA :262 Median :1969
## NED :157 Mean :1966
## ESP :125 3rd Qu.:1991
## GER : 71 Max. :2017
## (Other):470
The variable Distance is the distance of the stage:
mean(tdf_data$Distance) # mean
## [1] 196.783
median(tdf_data$Distance) # median
## [1] 199
min(tdf_data$Distance) # minimum
## [1] 1
max(tdf_data$Distance) # maximum
## [1] 482
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1 156 199 236 482
The option na.rm = TRUE must be used when you have missing values that you don’t want to be taken into account for the computation (otherwise, most of these functions would return the value NA).
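A short sketch of this behavior, on a hypothetical vector x with one missing value (the values are made up for the illustration):

```r
# Hypothetical distances, one of which is missing
x <- c(14, NA, 212.5, 207.5)

# Without na.rm, the missing value propagates to the result
mean_with_na <- mean(x)
mean_with_na

# With na.rm = TRUE, only the three observed values are used
mean_without_na <- mean(x, na.rm = TRUE)
mean_without_na

# Same option for the median, computed here through quantile()
median_without_na <- quantile(x, probs = 0.5, na.rm = TRUE)
median_without_na
```

Note that quantile() is one of the exceptions: without na.rm = TRUE it stops with an error on missing values instead of returning NA.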
Exercise: How to interpret these values? More precisely, what do they say about the variable distribution?
Some of these values are also available with:
summary(tdf_data$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 156.0 199.0 196.8 236.0 482.0
Dispersion characteristics are obtained with:
var(tdf_data$Distance) # variance
## [1] 8131.78
sd(tdf_data$Distance) # standard deviation
## [1] 90.17639
range(tdf_data$Distance) # range
## [1] 1 482
diff(range(tdf_data$Distance))
## [1] 481
Exercise: How would you compute the inter-quartile range (in just one line of code)?
## 75%
## 80
## [1] 80
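Two possible one-line answers are sketched below on a hypothetical toy vector x. They would be consistent with the two outputs shown above (a value named 75%, then an unnamed value), although the original solution code is not shown, so this is only an assumption about how those outputs were produced.

```r
# Hypothetical toy values (not the Tour de France distances)
x <- 1:9

# Difference between the third and the first quartile;
# the result keeps the name of the last quantile ("75%")
iqr_by_hand <- diff(quantile(x, probs = c(0.25, 0.75)))
iqr_by_hand

# Dedicated base R function, which returns an unnamed value
iqr_builtin <- IQR(x)
iqr_builtin
```

Applied to tdf_data$Distance, both expressions should give 236 - 156 = 80, in agreement with the quartiles reported earlier.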
Exercise: What is the coefficient of variation (CV) for this variable?
## [1] 0.4582529
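The coefficient of variation is the standard deviation divided by the mean. A minimal sketch is given below; the helper name coef_var is hypothetical (base R has no function of that name), and the last line simply redoes the division with the rounded values reported in the text.

```r
# Hypothetical helper: coefficient of variation = sd / mean
coef_var <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Toy check: sd(c(1, 2, 3)) is 1 and the mean is 2
cv_toy <- coef_var(c(1, 2, 3))
cv_toy

# Check against the standard deviation and mean reported in the text
cv_from_text <- 90.17639 / 196.783
cv_from_text
```

With the actual data, coef_var(tdf_data$Distance) should match the value shown above, up to rounding.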
Standard modifications of data include binning the variable into intervals (with cut) and centering and scaling it (with scale):
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
##
## (0.519,97.2] (97.2,193] (193,290] (290,386] (386,482]
## 320 676 950 211 79
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
##
## 1 2 3 4 5
## 320 676 950 211 79
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
##
## (0,100] (100,200] (200,300] (300,400] (400,500]
## 324 820 825 210 57
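The three calls differ only in how the intervals are chosen and labelled. With breaks = 5, cut() splits the range of the data into five intervals of equal length and slightly extends the outer limits, which is why the first interval above starts at 0.519 rather than 1. The sketch below illustrates the handling of interval boundaries on a hypothetical vector x (made-up values).

```r
# Hypothetical distances, including one exactly on a break
x <- c(1, 50, 100, 150, 482)
breaks <- seq(0, 500, by = 100)

# Intervals are right-closed by default, so 100 falls in (0,100]
by_breaks <- cut(x, breaks = breaks)
as.character(by_breaks)

# Same binning, returned as integer codes instead of interval labels
by_codes <- cut(x, breaks = breaks, labels = FALSE)
by_codes

# right = FALSE switches to left-closed intervals, so 100 moves to [100,200)
by_left <- cut(x, breaks = breaks, right = FALSE)
as.character(by_left)

# A single number asks for that many equal-width intervals
by_count <- cut(x, breaks = 5)
nlevels(by_count)
```

Values outside the range of explicit breaks are turned into NA, so the breaks should cover the whole range of the variable (here 0 to 500 for distances between 1 and 482).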
Exercise: What is the mode of cut3?
## [1] "(200,300]"
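Base R has no function returning the statistical mode (mode() returns the storage mode of an object), so one possible approach is to take the most frequent level of the contingency table. The sketch below shows it on a hypothetical factor f, then on the counts copied from the table of cut3 above; the original solution code is not shown, so this is one way among others.

```r
# Hypothetical toy factor
f <- factor(c("a", "b", "b", "c", "b", "a"))
counts <- table(f)

# Name of the most frequent level (the first one in case of ties)
mode_toy <- names(counts)[which.max(counts)]
mode_toy

# Same idea on the counts reported in the text for cut3
counts_cut3 <- c("(0,100]" = 324, "(100,200]" = 820, "(200,300]" = 825,
                 "(300,400]" = 210, "(400,500]" = 57)
mode_cut3 <- names(which.max(counts_cut3))
mode_cut3
```

On the real data, the equivalent call would be names(which.max(table(tdf_data$cut3))).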
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.17111 -0.45226 0.02458 0.00000 0.43489 3.16288
var(tdf_data$DScaled)
## [1] 1
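With its default arguments, scale() subtracts the mean and divides by the standard deviation. The following sketch checks this by hand on a few distances taken from the first rows shown by head(); the object names by_hand and by_scale are hypothetical.

```r
# A few distances, copied from the first rows displayed earlier
x <- c(14, 203.5, 212.5, 207.5, 160.5, 216)

# Centering and scaling by hand
by_hand <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with attributes;
# as.vector() drops them to get a plain numeric vector
by_scale <- as.vector(scale(x))

# Both versions agree, with mean 0 and standard deviation 1
all.equal(by_hand, by_scale)
mean(by_scale)
sd(by_scale)
```

This is why the summary of DScaled above has a mean of 0 and why its variance is 1.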
Before we start, a short note on color palettes:
display.brewer.all()