This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The focus is put on testing interaction between a genotype (wt or mutant) and an environment (with or without a treatment). The data are provided by courtesy of the transcriptomic platform of IPS2. Gene names have been shuffled so that the results are not biologically interpretable. The data are composed of three files, all included in the directory data:

In the rest of this file, we:

1/ first import and describe the data;

2/ then perform a standard normalization (TMM) and explore its effect on data;

3/ finally perform a differential analysis using different methods and models, including a model in which the interaction is properly defined.

We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages ggplot2, reshape2, mixOmics, RColorBrewer, edgeR, VennDiagram and devtools are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file, in Section “Session information”.

The packages are loaded with:

library(ggplot2)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(gridExtra)
library(edgeR)
library(VennDiagram)
library(devtools)

Data description and importation

The data used in this practical session correspond to 24 samples of RNAseq obtained from Arabidopsis thaliana plants. There are four genotypes: the wild type and three types of mutations (“dbMT”, “oxMT” and “siMT”). For each genotype, gene expression is measured in two conditions (treated and controled) for three groups of plants grown at different time (biological replicates). In addition to the wild type plants, three types of mutations are studied (“dbMT”, “oxMT” and “siMT”) and for each of the four types of plant, a treatment and a control condition are measured.

The datasets are included in the repository data located at the root of the material for this practical session. They can be loaded using:

raw_counts <- read.table("../data/D2-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("../data/D2-targets.txt", header = TRUE, 
                     stringsAsFactors = FALSE)

Raw counts

The dimensions of the raw count matrix are equal to:

dim(raw_counts)
## [1] 33602    24

which shows that the data contains 24 columns (samples) and 33602 rows (genes). And a quick overview is obtained by:

head(raw_counts)
##             WT_control_rep1 WT_control_rep2 WT_control_rep3
## AT1G01010.1            1202            1172            1245
## AT1G01020.1               0               0               0
## AT1G01030.1            1089            1191            1161
## AT1G01040.2               9               7               9
## AT1G01046.1               0               0               0
## AT1G01050.1               1               1               0
##             dbMT_control_rep1 dbMT_control_rep2 dbMT_control_rep3
## AT1G01010.1              1069              1291              1183
## AT1G01020.1                 1                 0                 0
## AT1G01030.1              1160              1468              1347
## AT1G01040.2                 6                 9                 2
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 1
##             dbMT_treated_rep1 dbMT_treated_rep2 dbMT_treated_rep3
## AT1G01010.1              1268              1120              1036
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1113               930               894
## AT1G01040.2                 2                 2                 2
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             oxMT_control_rep1 oxMT_control_rep2 oxMT_control_rep3
## AT1G01010.1              1166              1041              1005
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1241              1242              1024
## AT1G01040.2                 5                 2                 5
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             oxMT_treated_rep1 oxMT_treated_rep2 oxMT_treated_rep3
## AT1G01010.1              1442              1490              1403
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1044              1168              1190
## AT1G01040.2                 3                 6                 4
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             siMT_control_rep1 siMT_control_rep2 siMT_control_rep3
## AT1G01010.1               943              1386              1335
## AT1G01020.1                 0                 0                 0
## AT1G01030.1               959              1389              1522
## AT1G01040.2                 5                13                 6
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 1                 0
##             siMT_treated_rep1 siMT_treated_rep2 siMT_treated_rep3
## AT1G01010.1              1151              1379              1224
## AT1G01020.1                 0                 0                 0
## AT1G01030.1               793              1041               923
## AT1G01040.2                 3                 4                 9
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             WT_treated_rep1 WT_treated_rep2 WT_treated_rep3
## AT1G01010.1            1309            1562            1367
## AT1G01020.1               0               0               0
## AT1G01030.1             897            1180             905
## AT1G01040.2               6               2               3
## AT1G01046.1               0               0               0
## AT1G01050.1               0               0               0

which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.

Experimental design

The experimental design is described in the object design:

design
##               labels        group replicat factor1 factor2
## 1    WT_control_rep1   WT_control  repbio1      WT control
## 2    WT_control_rep2   WT_control  repbio2      WT control
## 3    WT_control_rep3   WT_control  repbio3      WT control
## 4  dbMT_control_rep1 dbMT_control  repbio1    dbMT control
## 5  dbMT_control_rep2 dbMT_control  repbio2    dbMT control
## 6  dbMT_control_rep3 dbMT_control  repbio3    dbMT control
## 7  dbMT_treated_rep1 dbMT_treated  repbio1    dbMT treated
## 8  dbMT_treated_rep2 dbMT_treated  repbio2    dbMT treated
## 9  dbMT_treated_rep3 dbMT_treated  repbio3    dbMT treated
## 10 oxMT_control_rep1 oxMT_control  repbio1    oxMT control
## 11 oxMT_control_rep2 oxMT_control  repbio2    oxMT control
## 12 oxMT_control_rep3 oxMT_control  repbio3    oxMT control
## 13 oxMT_treated_rep1 oxMT_treated  repbio1    oxMT treated
## 14 oxMT_treated_rep2 oxMT_treated  repbio2    oxMT treated
## 15 oxMT_treated_rep3 oxMT_treated  repbio3    oxMT treated
## 16 siMT_control_rep1 siMT_control  repbio1    siMT control
## 17 siMT_control_rep2 siMT_control  repbio2    siMT control
## 18 siMT_control_rep3 siMT_control  repbio3    siMT control
## 19 siMT_treated_rep1 siMT_treated  repbio1    siMT treated
## 20 siMT_treated_rep2 siMT_treated  repbio2    siMT treated
## 21 siMT_treated_rep3 siMT_treated  repbio3    siMT treated
## 22   WT_treated_rep1   WT_treated  repbio1      WT treated
## 23   WT_treated_rep2   WT_treated  repbio2      WT treated
## 24   WT_treated_rep3   WT_treated  repbio3      WT treated

that contains 5 columns: the first one labels is the sample name, the second one is group which combines the genotype and the treatment/control factor, the third column replicat indicates the biological replicate number, the fourth column called factor1 indicates the genotype and the last column factor2 indicates condition (control or treated).

Basic exploratory analysis of raw counts

Pseudo raw count distribution

We start the analysis by filtering out the genes for which no count has been found:

raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)
## [1] 27732    24

The number of genes which have been filtered out is thus equal to 5870.

Pseudo-count distribution is then obtained and visualized using the variable “group” to colour the different boxplots:

pseudo_counts <- log2(raw_counts_wn + 1)

df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2] <- c("id", "sample")
df_raw$group <- substr(as.character(df_raw$sample), 1,
                       nchar(as.character(df_raw$sample)) - 5)
df_raw$method <- rep("Raw counts", nrow(df_raw))  

p <- ggplot(data=df_raw, aes(x=sample, y=value, fill=group))
p <- p + geom_boxplot()
p <- p + scale_fill_brewer(type = "qual", palette = 7)
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of raw pseudo counts")
p <- p + ylab(expression(log[2] ~ (raw ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

PCA

A first exploratory analysis is performed with PCA:

resPCA <- pca(t(pseudo_counts), ncomp = 12)
plot(resPCA)

3 principal components seem to be enough to retrieve most of the variability in the dataset (the colors are identical to the ones used in the previous boxplots):

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          title = "1st and 2nd PCs")

The first principal component shows a strong separation between the two groups of treatment. On the second principal component, the different genotypes are approximately organized in separated groups, even though the separation is not perfect.

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          comp = c(2,3), title = "2nd and 3rd PCs")