This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The focus is put on testing interaction between a genotype (wt or mutant) and an environment (with or without a treatment). The data are provided by courtesy of the transcriptomic platform of IPS2. Gene names have been shuffled so that the results are not biologically interpretable. The data are composed of three files, all included in the directory data
:
D2-counts.txt
contains the raw counts of the experiment (25 columns: the first one contains the gene names, the others correspond to 24 different samples);
D2-genesLength.txt
contains information about gene lengths (not used in this practical session);
D2-targets.txt
contains information about the sample and the experimental design.
In the rest of this file, we:
1/ first import and describe the data;
2/ then perform a standard normalization (TMM) and explore its effect on data;
3/ finally perform a differential analysis using different methods and models, including a model in which the interaction is properly defined.
We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages ggplot2
, reshape2
, mixOmics
, RColorBrewer
, edgeR
, VennDiagram
and devtools
are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file, in Section “Session information”.
The packages are loaded with:
library(ggplot2)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(gridExtra)
library(edgeR)
library(VennDiagram)
library(devtools)
The data used in this practical session correspond to 24 samples of RNAseq obtained from Arabidopsis thaliana plants. There are four genotypes: the wild type and three types of mutations (“dbMT”, “oxMT” and “siMT”). For each genotype, gene expression is measured in two conditions (treated and controled) for three groups of plants grown at different time (biological replicates). In addition to the wild type plants, three types of mutations are studied (“dbMT”, “oxMT” and “siMT”) and for each of the four types of plant, a treatment and a control condition are measured.
The datasets are included in the repository data
located at the root of the material for this practical session. They can be loaded using:
raw_counts <- read.table("../data/D2-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("../data/D2-targets.txt", header = TRUE,
stringsAsFactors = FALSE)
The dimensions of the raw count matrix are equal to:
dim(raw_counts)
## [1] 33602 24
which shows that the data contains 24 columns (samples) and 33602 rows (genes). And a quick overview is obtained by:
head(raw_counts)
## WT_control_rep1 WT_control_rep2 WT_control_rep3
## AT1G01010.1 1202 1172 1245
## AT1G01020.1 0 0 0
## AT1G01030.1 1089 1191 1161
## AT1G01040.2 9 7 9
## AT1G01046.1 0 0 0
## AT1G01050.1 1 1 0
## dbMT_control_rep1 dbMT_control_rep2 dbMT_control_rep3
## AT1G01010.1 1069 1291 1183
## AT1G01020.1 1 0 0
## AT1G01030.1 1160 1468 1347
## AT1G01040.2 6 9 2
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 1
## dbMT_treated_rep1 dbMT_treated_rep2 dbMT_treated_rep3
## AT1G01010.1 1268 1120 1036
## AT1G01020.1 0 0 0
## AT1G01030.1 1113 930 894
## AT1G01040.2 2 2 2
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 0
## oxMT_control_rep1 oxMT_control_rep2 oxMT_control_rep3
## AT1G01010.1 1166 1041 1005
## AT1G01020.1 0 0 0
## AT1G01030.1 1241 1242 1024
## AT1G01040.2 5 2 5
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 0
## oxMT_treated_rep1 oxMT_treated_rep2 oxMT_treated_rep3
## AT1G01010.1 1442 1490 1403
## AT1G01020.1 0 0 0
## AT1G01030.1 1044 1168 1190
## AT1G01040.2 3 6 4
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 0
## siMT_control_rep1 siMT_control_rep2 siMT_control_rep3
## AT1G01010.1 943 1386 1335
## AT1G01020.1 0 0 0
## AT1G01030.1 959 1389 1522
## AT1G01040.2 5 13 6
## AT1G01046.1 0 0 0
## AT1G01050.1 0 1 0
## siMT_treated_rep1 siMT_treated_rep2 siMT_treated_rep3
## AT1G01010.1 1151 1379 1224
## AT1G01020.1 0 0 0
## AT1G01030.1 793 1041 923
## AT1G01040.2 3 4 9
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 0
## WT_treated_rep1 WT_treated_rep2 WT_treated_rep3
## AT1G01010.1 1309 1562 1367
## AT1G01020.1 0 0 0
## AT1G01030.1 897 1180 905
## AT1G01040.2 6 2 3
## AT1G01046.1 0 0 0
## AT1G01050.1 0 0 0
which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.
The experimental design is described in the object design
:
design
## labels group replicat factor1 factor2
## 1 WT_control_rep1 WT_control repbio1 WT control
## 2 WT_control_rep2 WT_control repbio2 WT control
## 3 WT_control_rep3 WT_control repbio3 WT control
## 4 dbMT_control_rep1 dbMT_control repbio1 dbMT control
## 5 dbMT_control_rep2 dbMT_control repbio2 dbMT control
## 6 dbMT_control_rep3 dbMT_control repbio3 dbMT control
## 7 dbMT_treated_rep1 dbMT_treated repbio1 dbMT treated
## 8 dbMT_treated_rep2 dbMT_treated repbio2 dbMT treated
## 9 dbMT_treated_rep3 dbMT_treated repbio3 dbMT treated
## 10 oxMT_control_rep1 oxMT_control repbio1 oxMT control
## 11 oxMT_control_rep2 oxMT_control repbio2 oxMT control
## 12 oxMT_control_rep3 oxMT_control repbio3 oxMT control
## 13 oxMT_treated_rep1 oxMT_treated repbio1 oxMT treated
## 14 oxMT_treated_rep2 oxMT_treated repbio2 oxMT treated
## 15 oxMT_treated_rep3 oxMT_treated repbio3 oxMT treated
## 16 siMT_control_rep1 siMT_control repbio1 siMT control
## 17 siMT_control_rep2 siMT_control repbio2 siMT control
## 18 siMT_control_rep3 siMT_control repbio3 siMT control
## 19 siMT_treated_rep1 siMT_treated repbio1 siMT treated
## 20 siMT_treated_rep2 siMT_treated repbio2 siMT treated
## 21 siMT_treated_rep3 siMT_treated repbio3 siMT treated
## 22 WT_treated_rep1 WT_treated repbio1 WT treated
## 23 WT_treated_rep2 WT_treated repbio2 WT treated
## 24 WT_treated_rep3 WT_treated repbio3 WT treated
that contains 5 columns: the first one labels
is the sample name, the second one is group
which combines the genotype and the treatment/control factor, the third column replicat
indicates the biological replicate number, the fourth column called factor1
indicates the genotype and the last column factor2
indicates condition (control or treated).
We start the analysis by filtering out the genes for which no count has been found:
raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)
## [1] 27732 24
The number of genes which have been filtered out is thus equal to 5870.
Pseudo-count distribution is then obtained and visualized using the variable “group” to colour the different boxplots:
pseudo_counts <- log2(raw_counts_wn + 1)
df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2] <- c("id", "sample")
df_raw$group <- substr(as.character(df_raw$sample), 1,
nchar(as.character(df_raw$sample)) - 5)
df_raw$method <- rep("Raw counts", nrow(df_raw))
p <- ggplot(data=df_raw, aes(x=sample, y=value, fill=group))
p <- p + geom_boxplot()
p <- p + scale_fill_brewer(type = "qual", palette = 7)
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of raw pseudo counts")
p <- p + ylab(expression(log[2] ~ (raw ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
A first exploratory analysis is performed with PCA:
resPCA <- pca(t(pseudo_counts), ncomp = 12)
plot(resPCA)
3 principal components seem to be enough to retrieve most of the variability in the dataset (the colors are identical to the ones used in the previous boxplots):
plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
title = "1st and 2nd PCs")
The first principal component shows a strong separation between the two groups of treatment. On the second principal component, the different genotypes are approximately organized in separated groups, even though the separation is not perfect.
plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
comp = c(2,3), title = "2nd and 3rd PCs")