Nathalie Vialaneix and Sébastien Déjean
12 octobre 2018
library(tidyverse)
What is tidyverse?
A new operator
tidyverse is a collection of packages aiming at manipulating data easier:
ggplot2 to make graphicsdplyr to transform and summarize the content of datasetstidyr to transform the structure of data tables (see also reshape2)purrr to allow better functional programming in Rtibble to provide a better framework for data tablesreadr to read datasets fasterx %>% f() means f(x)
x %>% f(y) means f(x, y)
x %>% f(y, .) means f(y, x)
x <- 3
log(x)
[1] 1.098612
Exercise: use pipe to perform the same operation
x %>% f() means f(x)
x %>% f(y) means f(x, y)
x %>% f(y, .) means f(y, x)
x <- 3
x %>% log()
[1] 1.098612
Exercise: use pipe to perform the same operation
dplyr, as ggplot2, is based on a grammar that aims at manipulating data. In this grammar, the main actions are:
select() to select variables based on their namesarrange() to modify the ordering of a datasetfilter() to select observations based on their valuesmutate() to add new variables to a dataset (coming from existing variables)summarise() to summarize multiple values (usually combined with group_by())data(diamonds)
set.seed(25091309)
sample1000 <- sample(1:nrow(diamonds), 1000, replace = FALSE)
diamonds <- diamonds[sample1000, ]
small_diam <- diamonds %>% select(cut, color, price)
small_diam
# A tibble: 1,000 x 3
   cut       color price
   <ord>     <ord> <int>
 1 Very Good D      1658
 2 Premium   G     10766
 3 Premium   I      6173
 4 Ideal     E      5962
 5 Premium   F      2839
 6 Premium   H      5266
 7 Ideal     E       723
 8 Premium   G     11032
 9 Fair      D      4441
10 Good      E      1246
# ... with 990 more rows
ordered_diams <- diamonds %>% arrange(desc(color))
ordered_diams
# A tibble: 1,000 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.310 Ideal     J     VS2      62.1  53.6   400  4.36  4.39  2.72
 2 1.23  Ideal     J     VS2      61.8  56.0  4986  6.87  6.81  4.23
 3 1.55  Ideal     J     SI1      62.4  57.0  8301  7.50  7.45  4.67
 4 0.400 Very Good J     VS1      63.4  58.0   810  4.64  4.61  2.93
 5 0.710 Premium   J     VS2      62.8  61.0  1917  5.71  5.63  3.56
 6 1.00  Good      J     SI1      58.7  62.0  3614  6.47  6.51  3.81
 7 0.550 Very Good J     VS2      62.6  57.0  1062  5.17  5.21  3.25
 8 1.24  Very Good J     SI2      63.3  60.0  3908  6.83  6.75  4.29
 9 1.56  Good      J     VS2      62.3  64.0  8107  7.41  7.36  4.60
10 1.75  Ideal     J     VS2      62.1  56.0  9890  7.74  7.69  4.79
# ... with 990 more rows
vg_diam <- diamonds %>% filter(cut == "Ideal")
summary(vg_diam)
     carat               cut      color     clarity       depth      
 Min.   :0.2300   Fair     :  0   D:43   VS2    :84   Min.   :58.50  
 1st Qu.:0.3400   Good     :  0   E:65   SI1    :61   1st Qu.:61.30  
 Median :0.5300   Very Good:  0   F:68   VS1    :58   Median :61.80  
 Mean   :0.6758   Premium  :  0   G:75   SI2    :47   Mean   :61.66  
 3rd Qu.:0.9100   Ideal    :357   H:50   VVS1   :46   3rd Qu.:62.20  
 Max.   :2.1600                   I:37   VVS2   :41   Max.   :63.70  
                                  J:19   (Other):20                  
     table           price             x               y        
 Min.   :53.00   Min.   :  360   Min.   :3.970   Min.   :3.990  
 1st Qu.:55.00   1st Qu.:  855   1st Qu.:4.490   1st Qu.:4.470  
 Median :56.00   Median : 1761   Median :5.220   Median :5.240  
 Mean   :56.09   Mean   : 3123   Mean   :5.439   Mean   :5.453  
 3rd Qu.:57.00   3rd Qu.: 4008   3rd Qu.:6.220   3rd Qu.:6.230  
 Max.   :60.00   Max.   :18682   Max.   :8.400   Max.   :8.340  
       z        
 Min.   :2.430  
 1st Qu.:2.760  
 Median :3.210  
 Mean   :3.358  
 3rd Qu.:3.860  
 Max.   :5.100  
vg_diam
# A tibble: 357 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 1.11  Ideal E     SI2      60.6  56.0  5962  6.76  6.78  4.10
 2 0.330 Ideal E     VS2      61.5  57.0   723  4.41  4.47  2.73
 3 0.310 Ideal D     VS2      62.5  56.0   734  4.29  4.32  2.69
 4 0.310 Ideal J     VS2      62.1  53.6   400  4.36  4.39  2.72
 5 0.530 Ideal D     VS2      60.9  57.0  1783  5.17  5.24  3.17
 6 0.330 Ideal F     VVS1     61.9  56.0   955  4.42  4.47  2.75
 7 0.380 Ideal D     VS2      62.0  56.0   998  4.68  4.64  2.89
 8 0.790 Ideal E     SI1      62.0  57.0  3384  5.92  5.96  3.68
 9 1.23  Ideal J     VS2      61.8  56.0  4986  6.87  6.81  4.23
10 0.380 Ideal F     VVS1     62.3  54.0  1096  4.64  4.70  2.91
# ... with 347 more rows
Exercice: Make a dataset with the top 10% most expensive diamonds.
vg_diam <- diamonds %>% filter(cut == "Ideal")
summary(vg_diam)
     carat               cut      color     clarity       depth      
 Min.   :0.2300   Fair     :  0   D:43   VS2    :84   Min.   :58.50  
 1st Qu.:0.3400   Good     :  0   E:65   SI1    :61   1st Qu.:61.30  
 Median :0.5300   Very Good:  0   F:68   VS1    :58   Median :61.80  
 Mean   :0.6758   Premium  :  0   G:75   SI2    :47   Mean   :61.66  
 3rd Qu.:0.9100   Ideal    :357   H:50   VVS1   :46   3rd Qu.:62.20  
 Max.   :2.1600                   I:37   VVS2   :41   Max.   :63.70  
                                  J:19   (Other):20                  
     table           price             x               y        
 Min.   :53.00   Min.   :  360   Min.   :3.970   Min.   :3.990  
 1st Qu.:55.00   1st Qu.:  855   1st Qu.:4.490   1st Qu.:4.470  
 Median :56.00   Median : 1761   Median :5.220   Median :5.240  
 Mean   :56.09   Mean   : 3123   Mean   :5.439   Mean   :5.453  
 3rd Qu.:57.00   3rd Qu.: 4008   3rd Qu.:6.220   3rd Qu.:6.230  
 Max.   :60.00   Max.   :18682   Max.   :8.400   Max.   :8.340  
       z        
 Min.   :2.430  
 1st Qu.:2.760  
 Median :3.210  
 Mean   :3.358  
 3rd Qu.:3.860  
 Max.   :5.100  
vg_diam
# A tibble: 357 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 1.11  Ideal E     SI2      60.6  56.0  5962  6.76  6.78  4.10
 2 0.330 Ideal E     VS2      61.5  57.0   723  4.41  4.47  2.73
 3 0.310 Ideal D     VS2      62.5  56.0   734  4.29  4.32  2.69
 4 0.310 Ideal J     VS2      62.1  53.6   400  4.36  4.39  2.72
 5 0.530 Ideal D     VS2      60.9  57.0  1783  5.17  5.24  3.17
 6 0.330 Ideal F     VVS1     61.9  56.0   955  4.42  4.47  2.75
 7 0.380 Ideal D     VS2      62.0  56.0   998  4.68  4.64  2.89
 8 0.790 Ideal E     SI1      62.0  57.0  3384  5.92  5.96  3.68
 9 1.23  Ideal J     VS2      61.8  56.0  4986  6.87  6.81  4.23
10 0.380 Ideal F     VVS1     62.3  54.0  1096  4.64  4.70  2.91
# ... with 347 more rows
Exercice: Make a dataset with the top 10% most expensive diamonds.
top_exp <- diamonds %>% filter(price >= quantile(price, probs = 0.9))
summary(top_exp$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9641   11168   13113   13399   15230   18788 
large_diam <- diamonds %>% mutate(ratio = price / carat,
                                  nothing = NA,
                                  weird = paste(cut, color, sep = "-")) %>%
  select(carat, price, color, cut, ratio, nothing, weird)
large_diam
# A tibble: 1,000 x 7
   carat price color cut       ratio nothing weird      
   <dbl> <int> <ord> <ord>     <dbl> <lgl>   <chr>      
 1 0.510  1658 D     Very Good 3251. NA      Very Good-D
 2 1.56  10766 G     Premium   6901. NA      Premium-G  
 3 1.51   6173 I     Premium   4088. NA      Premium-I  
 4 1.11   5962 E     Ideal     5371. NA      Ideal-E    
 5 0.710  2839 F     Premium   3999. NA      Premium-F  
 6 1.12   5266 H     Premium   4702. NA      Premium-H  
 7 0.330   723 E     Ideal     2191. NA      Ideal-E    
 8 1.71  11032 G     Premium   6451. NA      Premium-G  
 9 1.03   4441 D     Fair      4312. NA      Fair-D     
10 0.530  1246 E     Good      2351. NA      Good-E     
# ... with 990 more rows
Exercice: Make a dataset with only the diamonds with color 'D' and two additional variables: the \( \log_{10} \) of their price and the combination of clarity and cut.
large_diam <- diamonds %>% mutate(ratio = price / carat,
                                  nothing = NA,
                                  weird = paste(cut, color, sep = "-")) %>%
  select(carat, price, color, cut, ratio, nothing, weird)
large_diam
# A tibble: 1,000 x 7
   carat price color cut       ratio nothing weird      
   <dbl> <int> <ord> <ord>     <dbl> <lgl>   <chr>      
 1 0.510  1658 D     Very Good 3251. NA      Very Good-D
 2 1.56  10766 G     Premium   6901. NA      Premium-G  
 3 1.51   6173 I     Premium   4088. NA      Premium-I  
 4 1.11   5962 E     Ideal     5371. NA      Ideal-E    
 5 0.710  2839 F     Premium   3999. NA      Premium-F  
 6 1.12   5266 H     Premium   4702. NA      Premium-H  
 7 0.330   723 E     Ideal     2191. NA      Ideal-E    
 8 1.71  11032 G     Premium   6451. NA      Premium-G  
 9 1.03   4441 D     Fair      4312. NA      Fair-D     
10 0.530  1246 E     Good      2351. NA      Good-E     
# ... with 990 more rows
Exercice: Make a dataset with only the diamonds with color 'D' and two additional variables: the \( \log_{10} \) of their price and the combination of clarity and cut.
ld <- diamonds %>% filter(color == "D") %>%
  mutate(log10 = log10(price),
         combo = paste(cut, clarity, sep = "-")) %>%
  select(price, cut, clarity, color, log10, combo)
ld
# A tibble: 113 x 6
   price cut       clarity color log10 combo        
   <int> <ord>     <ord>   <ord> <dbl> <chr>        
 1  1658 Very Good VS2     D      3.22 Very Good-VS2
 2  4441 Fair      SI1     D      3.65 Fair-SI1     
 3   734 Ideal     VS2     D      2.87 Ideal-VS2    
 4   717 Good      SI1     D      2.86 Good-SI1     
 5  1752 Premium   VS2     D      3.24 Premium-VS2  
 6  1783 Ideal     VS2     D      3.25 Ideal-VS2    
 7   998 Ideal     VS2     D      3.00 Ideal-VS2    
 8  2079 Ideal     VS2     D      3.32 Ideal-VS2    
 9  1838 Good      SI1     D      3.26 Good-SI1     
10  1753 Ideal     VS2     D      3.24 Ideal-VS2    
# ... with 103 more rows
new_diams <- diamonds %>% summarise(av_depth = mean(depth),
                                    sd_depth = sd(depth))
new_diams
# A tibble: 1 x 2
  av_depth sd_depth
     <dbl>    <dbl>
1     61.7     1.55
new_diams <- diamonds %>% 
  group_by(color) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price))
new_diams
# A tibble: 7 x 3
  color av_price sd_price
  <ord>    <dbl>    <dbl>
1 D        2852.    3101.
2 E        3424.    3531.
3 F        3061.    3374.
4 G        4178.    4193.
5 H        4020.    4138.
6 I        5701.    4611.
7 J        4816.    3746.
Exercice: Summarize (with mean, sd and frequentcy) the datasets for all combinations of color and cut for colors 'D' and 'E'.
Exercice: Summarize (with mean, sd and frequentcy) the datasets for all combinations of color and cut for colors 'D' and 'E'.
new_diamb <- diamonds %>% filter(color %in% c("D", "E")) %>%
  group_by(color, cut) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price),
            count = length(price))
new_diamb
# A tibble: 10 x 5
# Groups:   color [?]
   color cut       av_price sd_price count
   <ord> <ord>        <dbl>    <dbl> <int>
 1 D     Fair         6240.    6888.     4
 2 D     Good         3817.    3817.    10
 3 D     Very Good    3097.    2999.    32
 4 D     Premium      2723.    3052.    24
 5 D     Ideal        2202.    2363.    43
 6 E     Fair         2614.    1802.     2
 7 E     Good         5091.    5162.    19
 8 E     Very Good    3770.    3757.    58
 9 E     Premium      3707.    3563.    47
10 E     Ideal        2449.    2407.    65
Useful for:
p <- ggplot(new_diamb, aes(x = cut, y = av_price, colour = color, group = color)) + geom_point() + 
  geom_line() + geom_errorbar(aes(ymin = av_price - sd_price / sqrt(count),
                                  ymax = av_price + sd_price / sqrt(count)))
p
tidyr is used to clean datasets so as: 1/ each variable is in a column; 2/ each observation is in a row; 3/ each value is in a cell. The main functions are:
gather() that gathers multiple columns into a key-value pairs (where key is
the former column name)spread() that takes two columns (key, value) and spreads them into 
multiple columns (one column for each key)separate() and extract() to pull appart a column with multiple values 
based on a separator or a regular expressiongrades <- tibble(
  Name = c("Tommy", "Mary", "Gary", "Cathy"),
  Sexage = c("m.15", "f.15", "m.16", "f.14"),
  Math = c(10, 15, 16, 14),
  Philo = c(11, 13, 10, 12),
  English = c(12, 13, 17, 10)
)
grades
# A tibble: 4 x 5
  Name  Sexage  Math Philo English
  <chr> <chr>  <dbl> <dbl>   <dbl>
1 Tommy m.15     10.   11.     12.
2 Mary  f.15     15.   13.     13.
3 Gary  m.16     16.   10.     17.
4 Cathy f.14     14.   12.     10.
grades <- grades %>% 
  separate(Sexage, into = c("Sex", "Age")) # default separator is any nonalphanumeric character
modif_grades <- grades %>%
  gather(Math, Philo, English, key = Topic, value = Grade)
modif_grades
# A tibble: 12 x 5
   Name  Sex   Age   Topic   Grade
   <chr> <chr> <chr> <chr>   <dbl>
 1 Tommy m     15    Math      10.
 2 Mary  f     15    Math      15.
 3 Gary  m     16    Math      16.
 4 Cathy f     14    Math      14.
 5 Tommy m     15    Philo     11.
 6 Mary  f     15    Philo     13.
 7 Gary  m     16    Philo     10.
 8 Cathy f     14    Philo     12.
 9 Tommy m     15    English   12.
10 Mary  f     15    English   13.
11 Gary  m     16    English   17.
12 Cathy f     14    English   10.
Usefull for:
p <- ggplot(modif_grades, aes(x = Topic, y = Grade)) + geom_boxplot() + 
  theme_bw()
p
Exercise:
# A tibble: 6 x 7
# Groups:   Topic [3]
  Name  Sex   Age   Topic   Grade minval maxval
  <chr> <chr> <chr> <chr>   <dbl>  <dbl>  <dbl>
1 Mary  f     15    Math      15.    14.    15.
2 Cathy f     14    Math      14.    14.    15.
3 Mary  f     15    Philo     13.    12.    13.
4 Cathy f     14    Philo     12.    12.    13.
5 Mary  f     15    English   13.    10.    13.
6 Cathy f     14    English   10.    10.    13.
Slides built with material coming from: