



The current dataset comes from Kaggle and provides information about the Tour de France, with 8 variables describing 2,236 stages of the race. The data were extracted from Wikipedia and prepared by RasmusFiskerBang. They are available as a CSV file that you can download from my website (the dataset is released under the CC0 license, Public Domain).

Data importation

Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use R directly for this task:

tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
                       stringsAsFactors = FALSE, encoding = "UTF-8",
                       quote = "\"")


  • sep is used to provide the character separating columns;

  • header = TRUE indicates that column names are included in the file (in the first row);

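  • stringsAsFactors = FALSE indicates that strings must not be converted to factors (this has been the default behavior since R 4.0.0);

  • encoding = "UTF-8" marks the character strings read from the file as UTF-8 (it declares the encoding rather than converting the input);

  • quote = "\"" indicates which character(s) must be treated as quotation marks rather than as part of the data.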

Other information and options are available in the help page ?read.table or in ?read.csv, ?read.csv2, ?read.delim, ?read.delim2.
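As a hedged illustration of how these functions relate, read.csv is a wrapper around read.table whose defaults suit comma-separated files (sep = ",", header = TRUE, quote = "\""). The sketch below uses a small temporary file with made-up rows (not the Tour de France file) to check that both calls return the same data frame:

```r
# Toy CSV written to a temporary file (illustrative rows only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Stage,Distance,Origin",
             "1,160.5,\"Vittel\"",
             "2,216.0,\"Vesoul\""), tmp)

# Explicit call, as in the import command above
with_table <- read.table(tmp, sep = ",", header = TRUE,
                         stringsAsFactors = FALSE, quote = "\"")

# Shorter call relying on the defaults of read.csv
with_csv <- read.csv(tmp, stringsAsFactors = FALSE)

# Compare the two results
all.equal(with_table, with_csv)

unlink(tmp) # remove the temporary file
```

read.csv2 and read.delim2 follow the same idea with defaults for files that use a comma as the decimal mark.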

We can take a first look at the data with:

head(tdf_data)
##   Stage       Date Distance            Origin                  Destination
## 1     1 2017-07-01     14.0        Düsseldorf                   Düsseldorf
## 2     2 2017-07-02    203.5        Düsseldorf                        Liège
## 3     3 2017-07-03    212.5          Verviers                       Longwy
## 4     4 2017-07-04    207.5 Mondorf-les-Bains                       Vittel
## 5     5 2017-07-05    160.5            Vittel La Planche des Belles Filles
## 6     6 2017-07-06    216.0            Vesoul                       Troyes
##                    Type         Winner Winner_Country
## 1 Individual time trial Geraint Thomas            GBR
## 2            Flat stage  Marcel Kittel            GER
## 3 Medium mountain stage    Peter Sagan            SVK
## 4            Flat stage  Arnaud Démare            FRA
## 5 Medium mountain stage      Fabio Aru            ITA
## 6            Flat stage  Marcel Kittel            GER
summary(tdf_data)
##     Stage               Date              Distance        Origin         
##  Length:2236        Length:2236        Min.   :  1.0   Length:2236       
##  Class :character   Class :character   1st Qu.:156.0   Class :character  
##  Mode  :character   Mode  :character   Median :199.0   Mode  :character  
##                                        Mean   :196.8                     
##                                        3rd Qu.:236.0                     
##                                        Max.   :482.0                     
##  Destination            Type              Winner          Winner_Country    
##  Length:2236        Length:2236        Length:2236        Length:2236       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  

When missing data are present, this last command also shows how many rows contain missing values (identified with NA) in each column. The dataset dimension is obtained with:

dim(tdf_data)
## [1] 2236    8
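Note that summary() reports NA counts for numeric, date and factor columns but not for character columns. A minimal sketch of an explicit count, using a made-up data frame (the names toy, na_per_column and rows_with_na are illustrative, not part of the dataset):

```r
# Toy data frame with one missing value in each column
toy <- data.frame(Distance = c(14, NA, 212.5),
                  Winner = c("A", "B", NA),
                  stringsAsFactors = FALSE)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows containing at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na
```

The same two calls can be applied to any data frame, whatever the types of its columns.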

Information on column types can be obtained with:

sapply(tdf_data, class)
##          Stage           Date       Distance         Origin    Destination 
##    "character"    "character"      "numeric"    "character"    "character" 
##           Type         Winner Winner_Country 
##    "character"    "character"    "character"

which indicates that all columns are character except for the third one, which is numeric. Sometimes, numeric variables (and more specifically integers) are used to code for categorical variables but this is not the case in this dataset.
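For the case where integers do code for categories, a short sketch of the usual conversion follows. The coding (1 = flat, 2 = mountain, 3 = time trial) is hypothetical and does not come from this dataset:

```r
# Hypothetical integer coding of a categorical variable
stage_type_code <- c(1L, 2L, 1L, 3L)
class(stage_type_code) # "integer": R would treat it as a number

# Convert to a factor, attaching a readable label to each code
stage_type <- factor(stage_type_code, levels = 1:3,
                     labels = c("Flat", "Mountain", "Time trial"))

# Counts per category instead of numerical summaries
table(stage_type)
```

Without this conversion, functions such as summary() or mean() would treat the codes as quantities, which is meaningless for categories.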

To perform the subsequent analyses properly, it is best to define the type of each column precisely:

  • Stage: is a character variable that corresponds to the number (or the symbol) of the stage within each year. It is better recoded as a factor, which is the proper type in R for categorical variables with a finite number of possible values:
tdf_data$Stage <- factor(tdf_data$Stage)
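  • Date: is a date, and R has a dedicated type for dates, which is better used here (for instance, to easily extract the year or the month). The year is then stored as a separate numeric variable (year() is not a base R function; it is assumed here to come from the lubridate package):
library(lubridate) # assumed source of year()
tdf_data$Date <- as.Date(tdf_data$Date, format = "%Y-%m-%d")
tdf_data$Year <- year(tdf_data$Date)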
  • Distance: is indeed a numeric variable;

  • Origin and Destination: are categorical variables, but the number of possible values (towns of origin and destination of the stages) is so large that there is only a minor benefit in converting them to factors;

  • Type: is a categorical variable that is better converted to a factor

tdf_data$Type <- factor(tdf_data$Type)
  • Winner: is the name of the winner; the number of possible values is so large that there is only a minor benefit in converting it to a factor;

  • Winner_Country: is the country code of the winner so is better converted to a factor:

tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)
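Since year() is not part of base R, here is a hedged base R alternative for extracting parts of a date with format(). The two dates are sample values and the variable names are illustrative:

```r
# Two sample dates in the same "YYYY-MM-DD" layout as the dataset
dates <- as.Date(c("2017-07-01", "1903-07-19"), format = "%Y-%m-%d")

# format() returns character strings, hence the conversion to numeric
years <- as.numeric(format(dates, "%Y"))
months_num <- as.numeric(format(dates, "%m"))

years
months_num
```

This avoids an extra package at the cost of a slightly more verbose call.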

After these steps, the summary of the dataset is already more informative:

summary(tdf_data)
##      Stage           Date               Distance        Origin         
##  4      : 101   Min.   :1903-07-01   Min.   :  1.0   Length:2236       
##  11     : 100   1st Qu.:1938-07-17   1st Qu.:156.0   Class :character  
##  2      : 100   Median :1969-07-02   Median :199.0   Mode  :character  
##  9      : 100   Mean   :1966-08-02   Mean   :196.8                     
##  10     :  99   3rd Qu.:1991-07-20   3rd Qu.:236.0                     
##  6      :  99   Max.   :2017-07-23   Max.   :482.0                     
##  (Other):1637                                                          
##  Destination                            Type         Winner         
##  Length:2236        Plain stage           :1053   Length:2236       
##  Class :character   Stage with mountain(s): 530   Class :character  
##  Mode  :character   Individual time trial : 205   Mode  :character  
##                     Flat stage            : 110                     
##                     Team time trial       :  87                     
##                     Hilly stage           :  76                     
##                     (Other)               : 175                     
##  Winner_Country      Year     
##  FRA    :691    Min.   :1903  
##  BEL    :460    1st Qu.:1938  
##  ITA    :262    Median :1969  
##  NED    :157    Mean   :1966  
##  ESP    :125    3rd Qu.:1991  
##  GER    : 71    Max.   :2017  
##  (Other):470

Univariate statistics

Numerical characteristics

The variable Distance is the distance of the stage:

mean(tdf_data$Distance) # mean
## [1] 196.783
median(tdf_data$Distance) # median
## [1] 199
min(tdf_data$Distance) # minimum
## [1] 1
max(tdf_data$Distance) # maximum
## [1] 482
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
##   0%  25%  50%  75% 100% 
##    1  156  199  236  482

The option na.rm = TRUE must be used when you have missing values that you don’t want to be taken into account for the computation (otherwise, most of these functions would return the value NA).
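A minimal sketch of the effect of na.rm on a made-up vector (Distance itself contains no missing values, as the summaries above show):

```r
# Toy vector with one missing value
x <- c(14, 203.5, NA, 212.5)

# Without na.rm, the missing value propagates to the result
mean_with_na <- mean(x)
mean_with_na # NA

# With na.rm = TRUE, the statistic is computed on the 3 observed values
mean_no_na <- mean(x, na.rm = TRUE)
median_no_na <- median(x, na.rm = TRUE)
mean_no_na
median_no_na

# quantile() is stricter: it stops with an error unless na.rm = TRUE
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

Removing missing values this way changes the number of observations behind each statistic, which is worth keeping in mind when comparing variables.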

Exercise: How would you interpret these values? More precisely, what do they say about the distribution of the variable?

Some of these values are also available with

summary(tdf_data$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   156.0   199.0   196.8   236.0   482.0

Dispersion characteristics are obtained with:

var(tdf_data$Distance) # variance
## [1] 8131.78
sd(tdf_data$Distance) # standard deviation
## [1] 90.17639
range(tdf_data$Distance) # range
## [1]   1 482
diff(range(tdf_data$Distance)) # width of the range
## [1] 481

Exercise: How would you compute the inter-quartile range (in just one line of code)?

quantile(tdf_data$Distance, 0.75) - quantile(tdf_data$Distance, 0.25)
## 75% 
##  80
IQR(tdf_data$Distance)
## [1] 80

Exercise: What is the coefficient of variation (CV) for this variable?

sd(tdf_data$Distance) / mean(tdf_data$Distance)
## [1] 0.4582529

Standard modifications of data include:

  • binning (discretization into intervals):
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
## (0.519,97.2]   (97.2,193]    (193,290]    (290,386]    (386,482] 
##          320          676          950          211           79
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
##   1   2   3   4   5 
## 320 676 950 211  79
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
##   (0,100] (100,200] (200,300] (300,400] (400,500] 
##       324       820       825       210        57

Exercise: What is the mode of cut3?

names(which.max(table(tdf_data$cut3)))
## [1] "(200,300]"
  • centering and scaling:
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.17111 -0.45226  0.02458  0.00000  0.43489  3.16288
sd(tdf_data$DScaled)
## [1] 1
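To make explicit what these two transformations do, here is a small sketch on a made-up vector of distances (the names d, bins, d_scaled and d_manual are illustrative). It shows that cut() builds right-closed intervals by default and that scale() subtracts the mean and divides by the standard deviation:

```r
# Toy distances, including a value exactly on a break (100)
d <- c(14, 100, 160.5, 203.5, 216)

# cut(): intervals are right-closed by default, so 100 falls in (0,100]
bins <- cut(d, breaks = seq(0, 300, by = 100))
table(bins)

# scale(): centering and scaling written out by hand
d_scaled <- as.vector(scale(d))
d_manual <- (d - mean(d)) / sd(d)
all.equal(d_scaled, d_manual)

# The scaled variable has mean 0 and standard deviation 1
mean(d_scaled)
sd(d_scaled)
```

With explicit breaks, a value equal to the lowest break would be coded as NA unless include.lowest = TRUE is set, which is one reason to check the table of counts after binning.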


Before we start, a short note on color palettes:
