Make your data tidy

Nathalie Vialaneix and Sébastien Déjean
12 octobre 2018

library(tidyverse)

Basic concepts

What is tidyverse?
A new operator

Basic description of tidyverse

tidyverse is a collection of packages aiming at manipulating data easier:

ggplot2 to make graphics
dplyr to transform and summarize the content of datasets
tidyr to transform the structure of data tables (see also reshape2)
purrr to allow better functional programming in R
tibble to provide a better framework for data tables
readr to read datasets faster

New operators: pipe %>%

x %>% f() means f(x)
x %>% f(y) means f(x, y)
x %>% f(y, .) means f(y, x)

x <- 3
log(x)

[1] 1.098612

Exercise: use pipe to perform the same operation

New operators: pipe %>%

x %>% f() means f(x)
x %>% f(y) means f(x, y)
x %>% f(y, .) means f(y, x)

x <- 3
x %>% log()

[1] 1.098612

Exercise: use pipe to perform the same operation

Manipulating data with dplyr

Main functions

dplyr, as ggplot2, is based on a grammar that aims at manipulating data. In this grammar, the main actions are:

select() to select variables based on their names
arrange() to modify the ordering of a dataset
filter() to select observations based on their values
mutate() to add new variables to a dataset (coming from existing variables)
summarise() to summarize multiple values (usually combined with group_by())

data(diamonds)
set.seed(25091309)
sample1000 <- sample(1:nrow(diamonds), 1000, replace = FALSE)
diamonds <- diamonds[sample1000, ]

select()

small_diam <- diamonds %>% select(cut, color, price)
small_diam

# A tibble: 1,000 x 3
   cut       color price
   <ord>     <ord> <int>
 1 Very Good D      1658
 2 Premium   G     10766
 3 Premium   I      6173
 4 Ideal     E      5962
 5 Premium   F      2839
 6 Premium   H      5266
 7 Ideal     E       723
 8 Premium   G     11032
 9 Fair      D      4441
10 Good      E      1246
# ... with 990 more rows

arrange()

ordered_diams <- diamonds %>% arrange(desc(color))
ordered_diams

# A tibble: 1,000 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.310 Ideal     J     VS2      62.1  53.6   400  4.36  4.39  2.72
 2 1.23  Ideal     J     VS2      61.8  56.0  4986  6.87  6.81  4.23
 3 1.55  Ideal     J     SI1      62.4  57.0  8301  7.50  7.45  4.67
 4 0.400 Very Good J     VS1      63.4  58.0   810  4.64  4.61  2.93
 5 0.710 Premium   J     VS2      62.8  61.0  1917  5.71  5.63  3.56
 6 1.00  Good      J     SI1      58.7  62.0  3614  6.47  6.51  3.81
 7 0.550 Very Good J     VS2      62.6  57.0  1062  5.17  5.21  3.25
 8 1.24  Very Good J     SI2      63.3  60.0  3908  6.83  6.75  4.29
 9 1.56  Good      J     VS2      62.3  64.0  8107  7.41  7.36  4.60
10 1.75  Ideal     J     VS2      62.1  56.0  9890  7.74  7.69  4.79
# ... with 990 more rows

filter()

vg_diam <- diamonds %>% filter(cut == "Ideal")
summary(vg_diam)

     carat               cut      color     clarity       depth      
 Min.   :0.2300   Fair     :  0   D:43   VS2    :84   Min.   :58.50  
 1st Qu.:0.3400   Good     :  0   E:65   SI1    :61   1st Qu.:61.30  
 Median :0.5300   Very Good:  0   F:68   VS1    :58   Median :61.80  
 Mean   :0.6758   Premium  :  0   G:75   SI2    :47   Mean   :61.66  
 3rd Qu.:0.9100   Ideal    :357   H:50   VVS1   :46   3rd Qu.:62.20  
 Max.   :2.1600                   I:37   VVS2   :41   Max.   :63.70  
                                  J:19   (Other):20                  
     table           price             x               y        
 Min.   :53.00   Min.   :  360   Min.   :3.970   Min.   :3.990  
 1st Qu.:55.00   1st Qu.:  855   1st Qu.:4.490   1st Qu.:4.470  
 Median :56.00   Median : 1761   Median :5.220   Median :5.240  
 Mean   :56.09   Mean   : 3123   Mean   :5.439   Mean   :5.453  
 3rd Qu.:57.00   3rd Qu.: 4008   3rd Qu.:6.220   3rd Qu.:6.230  
 Max.   :60.00   Max.   :18682   Max.   :8.400   Max.   :8.340  

       z        
 Min.   :2.430  
 1st Qu.:2.760  
 Median :3.210  
 Mean   :3.358  
 3rd Qu.:3.860  
 Max.   :5.100

vg_diam

# A tibble: 357 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 1.11  Ideal E     SI2      60.6  56.0  5962  6.76  6.78  4.10
 2 0.330 Ideal E     VS2      61.5  57.0   723  4.41  4.47  2.73
 3 0.310 Ideal D     VS2      62.5  56.0   734  4.29  4.32  2.69
 4 0.310 Ideal J     VS2      62.1  53.6   400  4.36  4.39  2.72
 5 0.530 Ideal D     VS2      60.9  57.0  1783  5.17  5.24  3.17
 6 0.330 Ideal F     VVS1     61.9  56.0   955  4.42  4.47  2.75
 7 0.380 Ideal D     VS2      62.0  56.0   998  4.68  4.64  2.89
 8 0.790 Ideal E     SI1      62.0  57.0  3384  5.92  5.96  3.68
 9 1.23  Ideal J     VS2      61.8  56.0  4986  6.87  6.81  4.23
10 0.380 Ideal F     VVS1     62.3  54.0  1096  4.64  4.70  2.91
# ... with 347 more rows

Exercice: Make a dataset with the top 10% most expensive diamonds.

filter()

vg_diam <- diamonds %>% filter(cut == "Ideal")
summary(vg_diam)

     carat               cut      color     clarity       depth      
 Min.   :0.2300   Fair     :  0   D:43   VS2    :84   Min.   :58.50  
 1st Qu.:0.3400   Good     :  0   E:65   SI1    :61   1st Qu.:61.30  
 Median :0.5300   Very Good:  0   F:68   VS1    :58   Median :61.80  
 Mean   :0.6758   Premium  :  0   G:75   SI2    :47   Mean   :61.66  
 3rd Qu.:0.9100   Ideal    :357   H:50   VVS1   :46   3rd Qu.:62.20  
 Max.   :2.1600                   I:37   VVS2   :41   Max.   :63.70  
                                  J:19   (Other):20                  
     table           price             x               y        
 Min.   :53.00   Min.   :  360   Min.   :3.970   Min.   :3.990  
 1st Qu.:55.00   1st Qu.:  855   1st Qu.:4.490   1st Qu.:4.470  
 Median :56.00   Median : 1761   Median :5.220   Median :5.240  
 Mean   :56.09   Mean   : 3123   Mean   :5.439   Mean   :5.453  
 3rd Qu.:57.00   3rd Qu.: 4008   3rd Qu.:6.220   3rd Qu.:6.230  
 Max.   :60.00   Max.   :18682   Max.   :8.400   Max.   :8.340  

       z        
 Min.   :2.430  
 1st Qu.:2.760  
 Median :3.210  
 Mean   :3.358  
 3rd Qu.:3.860  
 Max.   :5.100

vg_diam

# A tibble: 357 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 1.11  Ideal E     SI2      60.6  56.0  5962  6.76  6.78  4.10
 2 0.330 Ideal E     VS2      61.5  57.0   723  4.41  4.47  2.73
 3 0.310 Ideal D     VS2      62.5  56.0   734  4.29  4.32  2.69
 4 0.310 Ideal J     VS2      62.1  53.6   400  4.36  4.39  2.72
 5 0.530 Ideal D     VS2      60.9  57.0  1783  5.17  5.24  3.17
 6 0.330 Ideal F     VVS1     61.9  56.0   955  4.42  4.47  2.75
 7 0.380 Ideal D     VS2      62.0  56.0   998  4.68  4.64  2.89
 8 0.790 Ideal E     SI1      62.0  57.0  3384  5.92  5.96  3.68
 9 1.23  Ideal J     VS2      61.8  56.0  4986  6.87  6.81  4.23
10 0.380 Ideal F     VVS1     62.3  54.0  1096  4.64  4.70  2.91
# ... with 347 more rows

Exercice: Make a dataset with the top 10% most expensive diamonds.

top_exp <- diamonds %>% filter(price >= quantile(price, probs = 0.9))
summary(top_exp$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9641   11168   13113   13399   15230   18788

mutate()

large_diam <- diamonds %>% mutate(ratio = price / carat,
                                  nothing = NA,
                                  weird = paste(cut, color, sep = "-")) %>%
  select(carat, price, color, cut, ratio, nothing, weird)
large_diam

# A tibble: 1,000 x 7
   carat price color cut       ratio nothing weird      
   <dbl> <int> <ord> <ord>     <dbl> <lgl>   <chr>      
 1 0.510  1658 D     Very Good 3251. NA      Very Good-D
 2 1.56  10766 G     Premium   6901. NA      Premium-G  
 3 1.51   6173 I     Premium   4088. NA      Premium-I  
 4 1.11   5962 E     Ideal     5371. NA      Ideal-E    
 5 0.710  2839 F     Premium   3999. NA      Premium-F  
 6 1.12   5266 H     Premium   4702. NA      Premium-H  
 7 0.330   723 E     Ideal     2191. NA      Ideal-E    
 8 1.71  11032 G     Premium   6451. NA      Premium-G  
 9 1.03   4441 D     Fair      4312. NA      Fair-D     
10 0.530  1246 E     Good      2351. NA      Good-E     
# ... with 990 more rows

Exercice: Make a dataset with only the diamonds with color 'D' and two additional variables: the \( \log_{10} \) of their price and the combination of clarity and cut.

mutate()

large_diam <- diamonds %>% mutate(ratio = price / carat,
                                  nothing = NA,
                                  weird = paste(cut, color, sep = "-")) %>%
  select(carat, price, color, cut, ratio, nothing, weird)
large_diam

# A tibble: 1,000 x 7
   carat price color cut       ratio nothing weird      
   <dbl> <int> <ord> <ord>     <dbl> <lgl>   <chr>      
 1 0.510  1658 D     Very Good 3251. NA      Very Good-D
 2 1.56  10766 G     Premium   6901. NA      Premium-G  
 3 1.51   6173 I     Premium   4088. NA      Premium-I  
 4 1.11   5962 E     Ideal     5371. NA      Ideal-E    
 5 0.710  2839 F     Premium   3999. NA      Premium-F  
 6 1.12   5266 H     Premium   4702. NA      Premium-H  
 7 0.330   723 E     Ideal     2191. NA      Ideal-E    
 8 1.71  11032 G     Premium   6451. NA      Premium-G  
 9 1.03   4441 D     Fair      4312. NA      Fair-D     
10 0.530  1246 E     Good      2351. NA      Good-E     
# ... with 990 more rows

Exercice: Make a dataset with only the diamonds with color 'D' and two additional variables: the \( \log_{10} \) of their price and the combination of clarity and cut.

ld <- diamonds %>% filter(color == "D") %>%
  mutate(log10 = log10(price),
         combo = paste(cut, clarity, sep = "-")) %>%
  select(price, cut, clarity, color, log10, combo)
ld

# A tibble: 113 x 6
   price cut       clarity color log10 combo        
   <int> <ord>     <ord>   <ord> <dbl> <chr>        
 1  1658 Very Good VS2     D      3.22 Very Good-VS2
 2  4441 Fair      SI1     D      3.65 Fair-SI1     
 3   734 Ideal     VS2     D      2.87 Ideal-VS2    
 4   717 Good      SI1     D      2.86 Good-SI1     
 5  1752 Premium   VS2     D      3.24 Premium-VS2  
 6  1783 Ideal     VS2     D      3.25 Ideal-VS2    
 7   998 Ideal     VS2     D      3.00 Ideal-VS2    
 8  2079 Ideal     VS2     D      3.32 Ideal-VS2    
 9  1838 Good      SI1     D      3.26 Good-SI1     
10  1753 Ideal     VS2     D      3.24 Ideal-VS2    
# ... with 103 more rows

summarise() and group_by()

new_diams <- diamonds %>% summarise(av_depth = mean(depth),
                                    sd_depth = sd(depth))
new_diams

# A tibble: 1 x 2
  av_depth sd_depth
     <dbl>    <dbl>
1     61.7     1.55

new_diams <- diamonds %>% 
  group_by(color) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price))
new_diams

# A tibble: 7 x 3
  color av_price sd_price
  <ord>    <dbl>    <dbl>
1 D        2852.    3101.
2 E        3424.    3531.
3 F        3061.    3374.
4 G        4178.    4193.
5 H        4020.    4138.
6 I        5701.    4611.
7 J        4816.    3746.

Exercice: Summarize (with mean, sd and frequentcy) the datasets for all combinations of color and cut for colors 'D' and 'E'.

summarise() and group_by()

Exercice: Summarize (with mean, sd and frequentcy) the datasets for all combinations of color and cut for colors 'D' and 'E'.

new_diamb <- diamonds %>% filter(color %in% c("D", "E")) %>%
  group_by(color, cut) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price),
            count = length(price))
new_diamb

# A tibble: 10 x 5
# Groups:   color [?]
   color cut       av_price sd_price count
   <ord> <ord>        <dbl>    <dbl> <int>
 1 D     Fair         6240.    6888.     4
 2 D     Good         3817.    3817.    10
 3 D     Very Good    3097.    2999.    32
 4 D     Premium      2723.    3052.    24
 5 D     Ideal        2202.    2363.    43
 6 E     Fair         2614.    1802.     2
 7 E     Good         5091.    5162.    19
 8 E     Very Good    3770.    3757.    58
 9 E     Premium      3707.    3563.    47
10 E     Ideal        2449.    2407.    65

summarise() and group_by()

Useful for:

p <- ggplot(new_diamb, aes(x = cut, y = av_price, colour = color, group = color)) + geom_point() + 
  geom_line() + geom_errorbar(aes(ymin = av_price - sd_price / sqrt(count),
                                  ymax = av_price + sd_price / sqrt(count)))
p

plot of chunk ggplotSumm

Clean data with tidyr

Main functions

tidyr is used to clean datasets so as: 1/ each variable is in a column; 2/ each observation is in a row; 3/ each value is in a cell. The main functions are:

gather() that gathers multiple columns into a key-value pairs (where key is the former column name)
spread() that takes two columns (key, value) and spreads them into multiple columns (one column for each key)
separate() and extract() to pull appart a column with multiple values based on a separator or a regular expression

separate()

grades <- tibble(
  Name = c("Tommy", "Mary", "Gary", "Cathy"),
  Sexage = c("m.15", "f.15", "m.16", "f.14"),
  Math = c(10, 15, 16, 14),
  Philo = c(11, 13, 10, 12),
  English = c(12, 13, 17, 10)
)
grades

# A tibble: 4 x 5
  Name  Sexage  Math Philo English
  <chr> <chr>  <dbl> <dbl>   <dbl>
1 Tommy m.15     10.   11.     12.
2 Mary  f.15     15.   13.     13.
3 Gary  m.16     16.   10.     17.
4 Cathy f.14     14.   12.     10.

grades <- grades %>% 
  separate(Sexage, into = c("Sex", "Age")) # default separator is any nonalphanumeric character

gather()

modif_grades <- grades %>%
  gather(Math, Philo, English, key = Topic, value = Grade)
modif_grades

# A tibble: 12 x 5
   Name  Sex   Age   Topic   Grade
   <chr> <chr> <chr> <chr>   <dbl>
 1 Tommy m     15    Math      10.
 2 Mary  f     15    Math      15.
 3 Gary  m     16    Math      16.
 4 Cathy f     14    Math      14.
 5 Tommy m     15    Philo     11.
 6 Mary  f     15    Philo     13.
 7 Gary  m     16    Philo     10.
 8 Cathy f     14    Philo     12.
 9 Tommy m     15    English   12.
10 Mary  f     15    English   13.
11 Gary  m     16    English   17.
12 Cathy f     14    English   10.

gather()

Usefull for:

p <- ggplot(modif_grades, aes(x = Topic, y = Grade)) + geom_boxplot() + 
  theme_bw()
p

plot of chunk ggplotGather

gather()

Exercise:

# A tibble: 6 x 7
# Groups:   Topic [3]
  Name  Sex   Age   Topic   Grade minval maxval
  <chr> <chr> <chr> <chr>   <dbl>  <dbl>  <dbl>
1 Mary  f     15    Math      15.    14.    15.
2 Cathy f     14    Math      14.    14.    15.
3 Mary  f     15    Philo     13.    12.    13.
4 Cathy f     14    Philo     12.    12.    13.
5 Mary  f     15    English   13.    10.    13.
6 Cathy f     14    English   10.    10.    13.

plot of chunk gratherEx2

Credits

Slides built with material coming from:

Christophe Ambroise (LaMME, Université d'Evry) inspire from Hadley Wickham's book “R for data science: import, tidy, transform, visualize and model data”