Practical session about Normalization and differential analysis of RNAseq data - Effect of normalization and differential test

Data description and importation
- Raw counts
- Experimental design
Basic exploratory analysis of raw counts
Normalization
- DESeq
- edgeR
- Comparison
Differential analysis
Session information

This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The data are provided by courtesy of the transcriptomic platform of IPS2 and the name of the genes were shuffled. Hence the results are not biologically interpretable. They are composed of three files, all included in the directory data:

D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples);
D1-genesLength.txt contains information about gene lengths;
D1-targets.txt contains information about the sample and the experimental design.

In the rest of this file, we:

1/ first import and describe the data;

2/ then perform and compare different normalizations;

3/ finally perform a differential analysis using different methods and models.

We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages devtools, ggplot2, gridExtra, reshape2, mixOmics, RColorBrewer, DESeq, edgeR and VennDiagram are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file, in Section “Session information”.

The packages are loaded with:

library(ggplot2)
library(gridExtra)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(DESeq)
library(edgeR)
library(VennDiagram)
library(devtools)

Data description and importation

The data used in this practical session correspond to 12 samples of RNAseq obtained from microdissections on plants. Plants have four different genotypes: the wild type (“wt”“) plant and three types of mutants and are observed three times in the same conditions. for information Mutants 1 and 2 have a similar phenotype (more complicated leaves than the wild type) whereas mutant 3 has the opposite phenotype (simpler leaves than the wild type).

The datasets are included in the repository data located at the root of the material for this practical session. They can be loaded using:

raw_counts <- read.table("../data/D1-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("../data/D1-targets.txt", header = TRUE, 
                     stringsAsFactors = FALSE)
gene_lengths <- scan("../data/D1-genesLength.txt")

Raw counts

The dimensions of the raw count matrix are equal to:

dim(raw_counts)

## [1] 50897    12

which shows that the data contains 12 columns (samples) and 50897 rows (genes). And a quick look is obtained by:

head(raw_counts)

##                  wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1    0    0    0      1      0      1      0      0      0
## Medtr0001s0070.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0100.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0120.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0160.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0190.1    0    0    0      0      0      0      0      0      0
##                  mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1      0      1      0
## Medtr0001s0070.1      0      0      0
## Medtr0001s0100.1      0      0      0
## Medtr0001s0120.1      0      0      0
## Medtr0001s0160.1      0      0      0
## Medtr0001s0190.1      0      0      0

which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.

Experimental design

The experimental design is described in the object design:

design

##    labels group replicat
## 1    wt_1    wt  repbio1
## 2    wt_2    wt  repbio2
## 3    wt_3    wt  repbio3
## 4  mut1_1  mut1  repbio1
## 5  mut1_2  mut1  repbio2
## 6  mut1_3  mut1  repbio3
## 7  mut2_1  mut2  repbio1
## 8  mut2_2  mut2  repbio2
## 9  mut2_3  mut2  repbio3
## 10 mut3_1  mut3  repbio1
## 11 mut3_2  mut3  repbio2
## 12 mut3_3  mut3  repbio3

that contains 3 columns: the first one labels is the sample name, the second one group is the group of the sample (wild type or one of three types of mutant) and the last one replicat is the biological replicate number.

Basic exploratory analysis of raw counts

We start the analysis by filtering out the genes for which no count has been found:

raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)

## [1] 27916    12

The number of genes which have been filtered out is thus equal to 22981.

It is often useful, to visualize the count distribution, to compute “pseudo counts”, which are log-transformed counts:

pseudo_counts <- log2(raw_counts_wn + 1)
head(pseudo_counts)

##                      wt_1     wt_2     wt_3   mut1_1   mut1_2   mut1_3
## Medtr0001s0010.1 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000
## Medtr0001s0200.1 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.357552 6.768184 6.643856 6.169925 6.507795 6.189825
## Medtr0001s0360.1 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
## Medtr0001s0490.1 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0002s0040.1 7.707359 6.918863 7.826548 2.584963 2.000000 4.643856
##                    mut2_1   mut2_2   mut2_3   mut3_1   mut3_2   mut3_3
## Medtr0001s0010.1 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
## Medtr0001s0200.1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.209453 6.392317 6.672425 6.820179 6.554589 6.539159
## Medtr0001s0360.1 0.000000 1.584963 0.000000 1.000000 0.000000 0.000000
## Medtr0001s0490.1 0.000000 0.000000 1.584963 1.000000 0.000000 1.000000
## Medtr0002s0040.1 6.643856 7.651052 8.044394 3.169925 3.459432 1.584963

df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2]<- c("id", "sample")
df_raw$method <- rep("Raw counts", nrow(df_raw))  
head(df_raw)

##                 id sample    value     method
## 1 Medtr0001s0010.1   wt_1 0.000000 Raw counts
## 2 Medtr0001s0200.1   wt_1 0.000000 Raw counts
## 3 Medtr0001s0260.1   wt_1 6.357552 Raw counts
## 4 Medtr0001s0360.1   wt_1 0.000000 Raw counts
## 5 Medtr0001s0490.1   wt_1 1.000000 Raw counts
## 6 Medtr0002s0040.1   wt_1 7.707359 Raw counts

The object df_raw will be used later to compare the effect of different normalization on the count distribution in the different samples.

Count distribution

First, we provide an overview of the distribution of the first sample (without the genes that have been filtered out) by ploting the histograms of raw counts and pseudo counts:

df <- data.frame(rcounts = raw_counts_wn[ ,1], prcounts = pseudo_counts[ ,1])

p <- ggplot(data=df, aes(x = rcounts, y = ..density..))
p <- p + geom_histogram(fill = "lightblue")
p <- p + theme_bw()
p <- p + ggtitle(paste0("count distribution '", design$labels[1], "'"))
p <- p + xlab("counts")

p2 <- ggplot(data=df, aes(x = prcounts, y = ..density..))
p2 <- p2 + geom_histogram(fill = "lightblue")
p2 <- p2 + theme_bw()
p2 <- p2 + ggtitle(paste0("count distribution - '", design$labels[1], "'"))
p2 <- p2 + xlab(expression(log[2](counts + 1)))

grid.arrange(p, p2, ncol = 2)

This figure shows that the count distribution is highly skewed with a large proportion of genes with a count equal to 0 and a few number of genes with an extreme count (more that \(2^{15}\)).

Relation between mean and variance

Another important feature of RNAseq data is the fact that they are overdispersed. This means that the variance of the count of a given gene over different biological samples within a given condition is larger than the average count for the same gene. This feature is illustrated by plotting the graphics of the mean vs variance for condition “wt” for all genes with an average count smaller than 5000 (otherwise the chart can not be read easily):

df <- data.frame(mean = apply(raw_counts_wn[ ,design$group == "wt"], 1, mean),
                 var = apply(raw_counts_wn[ ,design$group == "wt"], 1, var))
df <- df[df$mean <= 5000, ]
p <- ggplot(data=df, aes(x = mean, y = var))
p <- p + geom_point(colour = "orange")
p <- p + theme_bw()
p <- p + geom_abline(aes(intercept=0, slope=1))
p <- p + ggtitle("Variance versus mean in counts") + ylab("variance")
print(p)

In this figure, the black line is the “y=x” diagonal. It is easy to see that, for most genes, the variance is much larger than the mean.

PCA

Finally, a PCA is performed to identify the main sources of variability in the dataset. Pseudo-counts are used to perform this PCA.

resPCA <- pca(t(pseudo_counts), ncomp = 12)
plot(resPCA)

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(4, "Dark2"))

This figure shows that mutants 1 and 3 are well separated from the other wild type but that wild type is not well separated from mutant 2.

Normalization

This section performs different normalization approaches for RNAseq data. These normalization are performed using either DESeq (RLE) or edgeR (TC, RPKM, UQ, TMM). The final count distribution is compared to the raw count distribution in the last part of this section.

DESeq

To perform the normalization with DESeq, an object of type CountDataSet has to be created:

dge <- newCountDataSet(raw_counts_wn, conditions = design$group)
dge

## CountDataSet (storageMode: environment)
## assayData: 27916 features, 12 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: wt_1 wt_2 ... mut3_3 (12 total)
##   varLabels: sizeFactor condition
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

In this object, the attribute sizeFactor is at first missing:

sizeFactors(dge)

##   wt_1   wt_2   wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3 mut3_1 
##     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
## mut3_2 mut3_3 
##     NA     NA

The function estimateSizeFactors() allows to estimate the size factors. The dge object is unchanged but the attribute sizeFactor is now given:

dge <- estimateSizeFactors(dge)
sizeFactors(dge)

##      wt_1      wt_2      wt_3    mut1_1    mut1_2    mut1_3    mut2_1 
## 0.9753728 1.1491346 1.0187414 0.9167701 1.0326086 1.1436435 0.9664323 
##    mut2_2    mut2_3    mut3_1    mut3_2    mut3_3 
## 0.9570126 0.9826726 1.1355102 0.9271699 0.9202688

Normalized counts can be obtained using the function counts with the option normalized = TRUE. The calculation of the normalized counts is performed with the formula \(\tilde{K}_{gj} = \frac{K_{gj}}{s_j}\). The following command lines compare the result of the counts function with this calculation:

deseq_normcount <- counts(dge, normalized = TRUE)
test_normcount <- sweep(raw_counts_wn, 2, sizeFactors(dge), "/")
sum(test_normcount != deseq_normcount)

## [1] 0

Finally, pseudo counts are computed and stored in a data frame for further comparison.

pseudo_deseq <- log2(deseq_normcount + 1)
df_deseq <- melt(pseudo_deseq, id = rownames(raw_counts_wn))
names(df_deseq)[1:2]<- c("id", "sample")
df_deseq$method <- rep("DESeq (RLE)", nrow(df_raw))

edgeR

In this section, the package edgeR is perform to compare the other normalization approaches. To do so, a DGEList object has to be created from the count data:

dge2 <- DGEList(raw_counts_wn)
dge2

## An object of class "DGEList"
## $counts
##                  wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1    0    0    0      1      0      1      0      0      0
## Medtr0001s0200.1    0    1    0      0      0      0      0      0      0
## Medtr0001s0260.1   81  108   99     71     90     72     73     83    101
## Medtr0001s0360.1    0    0    1      0      0      0      0      2      0
## Medtr0001s0490.1    1    3    0      0      0      0      0      0      2
##                  mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1      0      1      0
## Medtr0001s0200.1      0      0      0
## Medtr0001s0260.1    112     93     92
## Medtr0001s0360.1      1      0      0
## Medtr0001s0490.1      1      0      1
## 27911 more rows ...
## 
## $samples
##        group lib.size norm.factors
## wt_1       1  5932414            1
## wt_2       1  7129204            1
## wt_3       1  6223759            1
## mut1_1     1  5577595            1
## mut1_2     1  6272344            1
## 7 more rows ...

The library sizes and the normalization factors for all samples are obtained by:

dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414            1
## wt_2       1  7129204            1
## wt_3       1  6223759            1
## mut1_1     1  5577595            1
## mut1_2     1  6272344            1
## mut1_3     1  6634542            1
## mut2_1     1  5897356            1
## mut2_2     1  5785184            1
## mut2_3     1  5767730            1
## mut3_1     1  7106881            1
## mut3_2     1  5872052            1
## mut3_3     1  6035692            1

Total count

Normalization by Total Count (TC) is obtained directly by the function cpm. Pseudo-counts (\(\log_2\) transformed counts) are computed and stored in a data frame for later use.

pseudo_TC <- log2(cpm(dge2) + 1)

df_TC <- melt(pseudo_TC, id = rownames(raw_counts_wn))
names(df_TC)[1:2] <- c ("id", "sample")
df_TC$method <- rep("TC", nrow(df_TC))

RPKM

RPKM needs information about gene lengths which have to be passed to the function rpkm. Again, pseudo-counts are computed and stored in a data frame for further use.

gene_lengths_wn <- gene_lengths[rowSums(raw_counts) > 0]
pseudo_RPKM <- log2(rpkm(dge2, gene.length = gene_lengths_wn) + 1)

df_RPKM <- melt(pseudo_RPKM, id = rownames(raw_counts_wn))
names(df_RPKM)[1:2] <- c ("id", "sample")
df_RPKM$method <- rep("RPKM", nrow(df_RPKM))

Upper quartile

Upper quartile normalization is obtained with the function calcNormFactors. This function changes the variable norm.factors in the DGEList object and thus also the way the functions cpm and rpkm are computing counts: now, a normalized library size, equal to the original library size multiplied by the normalization factor, is used to compute normalized counts: \(\tilde{K}_{gj} = \frac{K_{gj}}{\mbox{modified LS}}*10^6\). The normalization factors are computed using:

dge2 <- calcNormFactors(dge2, method = "upperquartile")
dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414    1.0047308
## wt_2       1  7129204    0.9810476
## wt_3       1  6223759    1.0130558
## mut1_1     1  5577595    1.0068753
## mut1_2     1  6272344    0.9942229
## mut1_3     1  6634542    1.0593861
## mut2_1     1  5897356    0.9873347
## mut2_2     1  5785184    1.0124342
## mut2_3     1  5767730    1.0453656
## mut3_1     1  7106881    0.9792811
## mut3_2     1  5872052    0.9857220
## mut3_3     1  6035692    0.9361638

The previous formula is compared to the result of the function cpm in the following command lines:

test_normcount <- sweep(dge2$counts, 2,
                        dge2$samples$lib.size*dge2$samples$norm.factors / 10^6,
                        "/")
range(as.vector(test_normcount - cpm(dge2)))

## [1] -1.818989e-12  0.000000e+00

which shows no difference. Finally, pseudo counts are obtained and stored in a data frame:

pseudo_UQ <- log2(cpm(dge2) + 1)

df_UQ <- melt(pseudo_UQ, id = rownames(raw_counts_wn))
names(df_UQ)[1:2] <- c ("id", "sample")
df_UQ$method <- rep("UQ", nrow(df_UQ))

TMM

TMM normalization works similarly as UQ and updates the value of the variable norm.factors:

dge2 <- calcNormFactors(dge2, method = "TMM")
dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414    1.0082044
## wt_2       1  7129204    0.9850930
## wt_3       1  6223759    1.0047944
## mut1_1     1  5577595    1.0064979
## mut1_2     1  6272344    1.0094894
## mut1_3     1  6634542    1.0593265
## mut2_1     1  5897356    0.9977013
## mut2_2     1  5785184    1.0096719
## mut2_3     1  5767730    1.0387295
## mut3_1     1  7106881    0.9791254
## mut3_2     1  5872052    0.9637238
## mut3_3     1  6035692    0.9429276

Normalized pseudo counts are obtained with the function cpm and stored in a data frame:

pseudo_TMM <- log2(cpm(dge2) + 1)

df_TMM <- melt(pseudo_TMM, id = rownames(raw_counts_wn))
names(df_TMM)[1:2] <- c ("id", "sample")
df_TMM$method <- rep("TMM", nrow(df_TMM))

Comparison

This last section shows the comparison between all normalization methods (and original raw data) for this data set. First, the distributions of all samples and all normalization methods are compared with boxplots and density plots:

df_allnorm <- rbind(df_raw, df_deseq, df_TC, df_RPKM, df_UQ, df_TMM)
df_allnorm$method <- factor(df_allnorm$method,
                            levels = c("Raw counts", "DESeq (RLE)", "TC", 
                                       "RPKM", "TMM", "UQ"))

p <- ggplot(data=df_allnorm, aes(x=sample, y=value, fill=method))
p <- p + geom_boxplot()  
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of normalized pseudo counts\n
for all samples by normalization methods")
p <- p + facet_grid(. ~ method) 
p <- p + ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

p <- ggplot(data=df_allnorm, aes(x=value, colour=sample))
p <- p + geom_density()  
p <- p + theme_bw()
p <- p + ggtitle("Density of normalized pseudo counts\n
for all samples by normalization methods")
p <- p + facet_grid(. ~ method) 
p <- p + ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

Visually, DESeq, UQ and TMM seem to behave similarly and provide comparable distributions between samples. Finally, a PCA is performed on the pseudo counts obtained after TMM normalization:

resPCA <- pca(t(pseudo_TMM), ncomp = 12)
plot(resPCA)

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(4, "Dark2"))

This figure shows that a better separation between wild type samples and mutant 2 samples. On the contrary mutants 2 and 1 are less well separated than in raw count data. This is a biologically plausible result since the organization of the different mutants are concordant with their phenotypes. The normalization has thus improved the data quality.

Differential analysis

This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type plants and each of the mutants:

a standard NB exact test between two conditions;
a GLM with only the plant type effect;
a GLM with the plant and genotype effects;
LM after a transformation of the data.

All analyses are performed with the R package edgeR, except for the last one which is made with limma. A last section compares the results obtained using the different methods.

First approach: pairwise exact test

The following command lines loop over the different mutant types (stored in the vector all_conditions) and perform the following operations:

first, a DGEList object is created with the wild type samples and the current mutant type samples. The “wt” condition is chosen as the reference level. Normalization is performed using TMM;
then, the functions estimateCommonDisp and estimateTagwiseDisp respectively estimate a common dispersion parameter and uses it as a prior to estimate gene specific dispersions \(\phi_g\) for all \(g\). The “Biological Coefficient of Variation” plot is also displayed which displays \(\phi_g\) versus the average (normalized) count for all genes;
finally, an exact test is performed with the function exactTest to find genes which are differentially expressed between the two conditions (wild type and mutant). The output of the function is printed and p-values and adjusted p-values (Benjamini and Hochberg approach) are stored in pvals_pairwiseExact. The function topTags is used to print the genes with the smallest p-values and the function decideTestsDGE is used to extract genes whose adjusted p-value is smaller than 5%. The names of those genes as well as the fact that they are up/down regulated in the mutant type are saved in DGE_pairwiseExact.

all_conditions <- setdiff(unique(design$group), "wt")
DEG_pairwiseExact <- NULL
pvals_pairwiseExact <- NULL
par(mfrow = c(1,3))
for (ind in seq_along(all_conditions)) {
  # create dataset with 'wt' and current mutant and normalize (TMM)
  sel_cols <- union(grep("wt", colnames(raw_counts_wn)),
                    grep(all_conditions[ind], colnames(raw_counts_wn)))
  cur_counts <- raw_counts_wn[ ,sel_cols]
  group <- relevel(as.factor(design$group[sel_cols]), ref = "wt")
  cur_dge <- DGEList(cur_counts, group = group)
  cur_dge <- calcNormFactors(cur_dge, method = "TMM")
  
  # estimate dispersions
  cur_dge <- estimateCommonDisp(cur_dge)
  cur_dge <- estimateTagwiseDisp(cur_dge)
  plotBCV(cur_dge, 
          main = paste0("BCV plot 'wt' vs '", all_conditions[ind], "'"))
  
  # exact test
  res_et <- exactTest(cur_dge)
  cat("Exact test results:\n")
  print(res_et)
  pvals_pairwiseExact <- c(pvals_pairwiseExact, res_et$table$PValue,
                           p.adjust(res_et$table$PValue, method = "BH"))

  cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
  print(topTags(res_et))
  
  cur_res <- decideTestsDGE(res_et, adjust.method = "BH", p.value = 0.05)
  print(cur_res)
  sel_deg <- which(cur_res[ ,1] != 0)
  cur_res <- cbind(rownames(cur_counts)[sel_deg], cur_res[sel_deg,1],
                   rep(all_conditions[ind], length(sel_deg)))
  DEG_pairwiseExact <- rbind(DEG_pairwiseExact, cur_res)
}

## Exact test results:
## An object of class "DGEExact"
## $table
##                       logFC    logCPM    PValue
## Medtr0001s0010.1  2.6776698 -1.431356 0.5100990
## Medtr0001s0200.1 -1.8534460 -1.539448 1.0000000
## Medtr0001s0260.1 -0.2643299  3.819200 0.2980937
## Medtr0001s0360.1 -1.8661639 -1.538425 1.0000000
## Medtr0001s0490.1 -3.5116569 -1.241010 0.1584303
## 27911 more rows ...
## 
## $comparison
## [1] "wt"   "mut1"
## 
## $genes
## NULL
## 
## Top 10 DEG for 'wt' vs ' mut1 ':
## Comparison of groups:  mut1-wt 
##                      logFC   logCPM        PValue           FDR
## Medtr7g079200.1  10.428936 5.646475 1.084483e-128 3.027442e-124
## Medtr6g034805.1  -5.621053 3.157062  3.813887e-36  5.323424e-32
## Medtr2g028530.1  -9.381843 2.800091  2.136579e-31  1.988158e-27
## Medtr3g092475.1  -2.761047 4.927626  1.219502e-29  8.510903e-26
## Medtr7g116150.1   6.314021 2.897514  3.541276e-29  1.977165e-25
## Medtr6g051975.1   8.734298 2.175163  6.808323e-27  3.167686e-23
## Medtr6g445300.1  -8.201467 1.703607  4.842629e-20  1.931241e-16
## Medtr2g054640.1   1.999710 6.178289  2.147915e-19  7.495149e-16
## Medtr0024s0070.1  6.840129 2.166796  4.779187e-19  1.482397e-15
## Medtr0002s0040.1 -4.131323 3.991809  7.549003e-18  2.107380e-14
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...

## Exact test results:
## An object of class "DGEExact"
## $table
##                        logFC    logCPM    PValue
## Medtr0001s0010.1  0.00000000 -1.613063 1.0000000
## Medtr0001s0200.1 -1.82136561 -1.498958 1.0000000
## Medtr0001s0260.1 -0.04218856  3.920703 0.8831503
## Medtr0001s0360.1  0.88126012 -1.290109 1.0000000
## Medtr0001s0490.1 -0.75807651 -1.031106 0.7324608
## 27911 more rows ...
## 
## $comparison
## [1] "wt"   "mut2"
## 
## $genes
## NULL
## 
## Top 10 DEG for 'wt' vs ' mut2 ':
## Comparison of groups:  mut2-wt 
##                     logFC   logCPM       PValue          FDR
## Medtr7g079200.1  6.264926 1.624597 7.995954e-09 0.0001026475
## Medtr2g054640.1  1.394714 5.716404 8.559340e-09 0.0001026475
## Medtr4g035895.1 -1.788930 6.346641 1.103104e-08 0.0001026475
## Medtr3g073360.1 -6.724211 2.152298 4.357221e-08 0.0002551703
## Medtr2g006070.1  2.088873 4.624899 4.570323e-08 0.0002551703
## Medtr8g467910.1  3.591974 2.640817 1.190725e-07 0.0005540047
## Medtr3g036280.1 -2.653538 2.594173 1.455923e-07 0.0005556901
## Medtr8g073335.1 -4.249142 4.712342 1.592464e-07 0.0005556901
## Medtr4g117170.1 -2.288529 2.964487 1.974584e-07 0.0006124721
## Medtr2g025455.1 -2.499058 3.965837 4.045263e-07 0.0011292757
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...

## Exact test results:
## An object of class "DGEExact"
## $table
##                        logFC    logCPM    PValue
## Medtr0001s0010.1  1.91369814 -1.558248 1.0000000
## Medtr0001s0200.1 -1.83589426 -1.560024 1.0000000
## Medtr0001s0260.1  0.12727186  3.961898 0.5998775
## Medtr0001s0360.1  0.04230309 -1.452970 1.0000000
## Medtr0001s0490.1 -0.80141626 -1.091878 0.7178534
## 27911 more rows ...
## 
## $comparison
## [1] "wt"   "mut3"
## 
## $genes
## NULL
## 
## Top 10 DEG for 'wt' vs ' mut3 ':
## Comparison of groups:  mut3-wt 
##                      logFC   logCPM        PValue           FDR
## Medtr7g079200.1  10.398918 5.576979 1.152665e-137 3.217779e-133
## Medtr3g092475.1  -8.644769 4.695309  4.918009e-92  6.864556e-88
## Medtr6g034805.1  -5.376733 3.125204  2.710513e-33  2.522223e-29
## Medtr2g028530.1  -9.357128 2.765291  4.307709e-31  3.006350e-27
## Medtr5g024095.1  -5.182918 3.295389  1.039588e-28  5.804225e-25
## Medtr0002s0040.1 -4.712504 3.927106  4.775120e-28  2.221704e-24
## Medtr6g445300.1  -8.178158 1.674507  1.476583e-19  5.888613e-16
## Medtr6g051975.1   8.104036 1.561939  5.101524e-19  1.780177e-15
## Medtr0044s0130.1 -4.041116 3.016525  2.071577e-18  6.425572e-15
## Medtr7g066070.1  -4.195235 2.974816  8.616258e-17  2.405315e-13
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...

Finally, pvals_pairwiseExact and DEG_pairwiseExact are transformed into data frames for further use:

DEG_pairwiseExact <- as.data.frame(DEG_pairwiseExact)
names(DEG_pairwiseExact) <- c("name", "UD", "condition")
listDEG_pairwiseExact <- unique(DEG_pairwiseExact$name)

pvals_pairwiseExact <- data.frame("pvalue" = pvals_pairwiseExact,
                                  "type" = rep(rep(c("raw", "adjusted"),
                                                   each = nrow(raw_counts_wn)),
                                               length(all_conditions)),
                                  "condition" = rep(all_conditions,
                                                    each=nrow(raw_counts_wn)*2))

In particular, the number of DEGs corresponding to each mutant type, the distribution of up/down regulated DEGs versus the mutant type and the total number of (unique) DEGs are obtained with:

table(DEG_pairwiseExact$condition)

## 
## mut1 mut2 mut3 
##  362   51  127

table(DEG_pairwiseExact$condition, DEG_pairwiseExact$UD) # 1 means 'up-regulated'

##       
##          1  -1
##   mut1 283  79
##   mut2  13  38
##   mut3  53  74

length(listDEG_pairwiseExact)

## [1] 480

Finally, the distributions of p-values and adjusted p-values for all three tests are displayed by:

p <- ggplot(data = pvals_pairwiseExact, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histograms of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

Second approach: overall GLM tests with group effect

In this section, we fit a generalized linear model to account for the group effect. The model writes \(K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)\) with \(\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})\). Here, \(j\) denotes one of the four genotypes (wild type or mutant 1, 2 and 3) and \(\lambda_{gj}\) is explained by: \(\log(\lambda_{g,\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g2} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_3}\). In this model, the wild type plant is the reference genotype and thus, testing if the second (resp., the third, the fourth) coefficient in the model is equal to zero is equivalent to find DEGs between the wild type and mutant 1 (resp. 2, 3).

First, a DGEList object is created with all samples and is normalized by TMM:

# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = design$group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")

Then, a design matrix is created: the group condition is stored in a variable group in which the reference level is set to “wt”. Then, this variable is used in the function model.matrix to obtain the design matrix:

group <- relevel(as.factor(design$group), ref = "wt")
design_matrix <- model.matrix(~ group)
design_matrix

##    (Intercept) groupmut1 groupmut2 groupmut3
## 1            1         0         0         0
## 2            1         0         0         0
## 3            1         0         0         0
## 4            1         1         0         0
## 5            1         1         0         0
## 6            1         1         0         0
## 7            1         0         1         0
## 8            1         0         1         0
## 9            1         0         1         0
## 10           1         0         0         1
## 11           1         0         0         1
## 12           1         0         0         1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Then, the dispersion is estimated in three steps: a common dispersion is first estimated, then a trended dispersion (which relates the dispersion to overall level of expression of the gene) and finally a gene specific dispersion:

cur_dge <- estimateGLMCommonDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTrendedDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTagwiseDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))

Finally, the model is fitted with the function glmFit. Then, coefficients are tested using the function glmLRT in which the argument coef is used to select a given mutant type. topTags is used to print the DEGs with the smallest p-values for each mutant type. Also, decideTestsDGE allows to extract DEGs and the results (p-values, DEGs and information about up and down regulation) are all saved in pvals_GLM1 and listDEG_GLM1 similarly as what was done for the results of the Exact Test.

# GLM fit
fit <- glmFit(cur_dge, design_matrix)
DEG_GLM1 <- NULL
pvals_GLM1 <- NULL
for (ind in 1:3) {
  res_GLM1 <- glmLRT(fit, coef = ind+1)
  pvals_GLM1 <- c(pvals_GLM1, res_GLM1$table$PValue,
                  p.adjust(res_GLM1$table$PValue, method = "BH"))

  cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
  print(topTags(res_GLM1))
  
  cur_res <- decideTestsDGE(res_GLM1, adjust.method = "BH", p.value = 0.05)
  print(cur_res)
  sel_deg <- which(cur_res[ ,1] != 0)
  cur_res <- cbind(rownames(raw_counts_wn)[sel_deg], cur_res[sel_deg,1],
                   rep(all_conditions[ind], length(sel_deg)))
  DEG_GLM1 <- rbind(DEG_GLM1, cur_res)
}

## Top 10 DEG for 'wt' vs ' mut1 ':
## Coefficient:  groupmut1 
##                      logFC   logCPM        LR       PValue          FDR
## Medtr7g079200.1  10.406886 5.655526 237.35327 1.485330e-53 4.146448e-49
## Medtr6g034805.1  -5.629671 3.149726 177.65306 1.577141e-40 2.201374e-36
## Medtr2g028530.1  -9.344921 2.667981 149.38073 2.367662e-34 2.203189e-30
## Medtr6g051975.1   8.687375 1.917974 128.19714 1.016301e-29 7.092767e-26
## Medtr6g445300.1  -8.164661 1.570747  89.69065 2.784673e-21 1.554739e-17
## Medtr2g436330.1  -2.315931 3.962352  84.01861 4.901388e-20 2.280452e-16
## Medtr2g054640.1   1.988198 6.048253  74.99260 4.724826e-18 1.884261e-14
## Medtr0002s0040.1 -4.149904 4.018977  63.85203 1.341238e-15 4.680250e-12
## Medtr3g043760.1   1.793755 4.098330  58.45934 2.075362e-14 6.437312e-11
## Medtr0196s0020.1  2.158905 5.643368  52.81686 3.661446e-13 1.022129e-09
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut2 ':
## Coefficient:  groupmut2 
##                     logFC   logCPM       LR       PValue          FDR
## Medtr7g079200.1  6.268296 5.655526 69.02311 9.731628e-17 2.716681e-12
## Medtr2g006070.1  2.090780 5.149846 39.47340 3.325582e-10 4.641847e-06
## Medtr2g054640.1  1.394952 6.048253 37.52270 9.035520e-10 8.407852e-06
## Medtr4g035895.1 -1.788964 5.999316 34.82308 3.610668e-09 2.519885e-05
## Medtr4g117170.1 -2.288755 2.788306 29.74795 4.920270e-08 2.747085e-04
## Medtr1g018840.1 -2.335843 4.731286 28.06178 1.175040e-07 5.467067e-04
## Medtr7g046260.1 -2.263240 2.882804 25.90610 3.584354e-07 1.429441e-03
## Medtr7g116150.1  4.615098 2.544361 25.58139 4.241107e-07 1.479934e-03
## Medtr5g006710.1 -1.853209 5.174918 24.18845 8.735477e-07 2.375306e-03
## Medtr8g035810.1 -1.771165 4.080911 24.15524 8.887385e-07 2.375306e-03
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut3 ':
## Coefficient:  groupmut3 
##                      logFC   logCPM        LR       PValue          FDR
## Medtr7g079200.1  10.387927 5.655526 236.46749 2.317253e-53 6.468845e-49
## Medtr6g034805.1  -5.380946 3.149726 170.06094 7.175589e-39 1.001569e-34
## Medtr2g028530.1  -9.344921 2.667981 146.78328 8.752026e-34 8.144052e-30
## Medtr0044s0130.1 -4.049583 3.911734  98.46413 3.309650e-23 2.309805e-19
## Medtr6g051975.1   8.083961 1.917974  96.47444 9.040336e-23 5.047401e-19
## Medtr6g445300.1  -8.164661 1.570747  87.87981 6.955508e-21 3.236166e-17
## Medtr7g066070.1  -4.202221 3.470112  85.52089 2.292665e-20 9.143149e-17
## Medtr5g024095.1  -5.180186 3.732410  75.05921 4.568074e-18 1.544177e-14
## Medtr0002s0040.1 -4.722476 4.018977  74.88940 4.978360e-18 1.544177e-14
## Medtr3g055120.1  -3.806717 2.945337  59.59270 1.166674e-14 3.256886e-11
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...

DEG_GLM1 <- as.data.frame(DEG_GLM1)
names(DEG_GLM1) <- c("name", "UD", "condition")
listDEG_GLM1 <- unique(DEG_GLM1$name)

pvals_GLM1 <- data.frame("pvalue" = pvals_GLM1,
                         "type" = rep(rep(c("raw", "adjusted"),
                                          each = nrow(raw_counts_wn)),
                                      length(all_conditions)),
                         "condition" = rep(all_conditions,
                                           each=nrow(raw_counts_wn)*2))

The distributions of p-values and adjusted p-values for all three mutant types are obtained with:

p <- ggplot(data = pvals_GLM1, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

The total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:

table(DEG_GLM1$condition)

## 
## mut1 mut2 mut3 
##  329   75  165

table(DEG_GLM1$condition, DEG_GLM1$UD) # 1 means 'up-regulated'

##       
##          1  -1
##   mut1 243  86
##   mut2  21  54
##   mut3  54 111

length(listDEG_GLM1)

## [1] 501

Compared to what was found with the Exact test, the number of obtained DEGs is of the same order (only a bit larger). A more precise comparison is made in the last part of this section.

Third approach: overall GLM tests with group and replicate effects

The previous model does not take into account all information about the experimental design. Only the group effect has been tested whereas an additional effect might influence the expression level: the replicat effect. In this section we fit a model with an additive contribution of both effects. This problem is equivalent to introduce a (fixed) individual effect in the model. In this situation, the average level of \(\log\) normalized counts is expressed as: \(\log(\lambda_{g,\textrm{rep},\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{rep}_1} + \beta_{g2} \mathbb{I}_{\textrm{rep}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g4} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g5} \mathbb{I}_{\textrm{mutant}_3}\) and DEGs for each of the three mutant types are obtained by testing coefficients \(\beta_{g3}\), \(\beta_{g4}\) and \(\beta_{g5}\).

The method is performed similarly as before: a DGEList is created and normalization factors are obtained with the TMM approach. Then a design matrix that incorporates the two factors in an additive model is created:

# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = design$group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")

# create design matrix
design_matrix <- model.matrix(~ design$replicat + group)
design_matrix

##    (Intercept) design$replicatrepbio2 design$replicatrepbio3 groupmut1
## 1            1                      0                      0         0
## 2            1                      1                      0         0
## 3            1                      0                      1         0
## 4            1                      0                      0         1
## 5            1                      1                      0         1
## 6            1                      0                      1         1
## 7            1                      0                      0         0
## 8            1                      1                      0         0
## 9            1                      0                      1         0
## 10           1                      0                      0         0
## 11           1                      1                      0         0
## 12           1                      0                      1         0
##    groupmut2 groupmut3
## 1          0         0
## 2          0         0
## 3          0         0
## 4          0         0
## 5          0         0
## 6          0         0
## 7          1         0
## 8          1         0
## 9          1         0
## 10         0         1
## 11         0         1
## 12         0         1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`design$replicat`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

The dispersion is then estimated using the same approach as in the previous section:

cur_dge <- estimateGLMCommonDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTrendedDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTagwiseDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))

And finally, the model is fit with glmFit and tests are performed on the coefficients of interest with glmLRT. Results extracted with topTags and decideTestsDGE are saved in pvals_GLM2 and listDEG_GLM2:

# GLM fit
fit <- glmFit(cur_dge, design_matrix)
DEG_GLM2 <- NULL
pvals_GLM2 <- NULL
for (ind in 1:3) {
  res_GLM2 <- glmLRT(fit, coef = ind+3)
  pvals_GLM2 <- c(pvals_GLM2, res_GLM2$table$PValue,
                  p.adjust(res_GLM2$table$PValue, method = "BH"))

  cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
  print(topTags(res_GLM2))
  
  cur_res <- decideTestsDGE(res_GLM2, adjust.method = "BH", p.value = 0.05)
  print(cur_res)
  sel_deg <- which(cur_res[ ,1] != 0)
  cur_res <- cbind(rownames(raw_counts_wn)[sel_deg], cur_res[sel_deg,1],
                   rep(all_conditions[ind], length(sel_deg)))
  DEG_GLM2 <- rbind(DEG_GLM2, cur_res)
}

## Top 10 DEG for 'wt' vs ' mut1 ':
## Coefficient:  groupmut1 
##                      logFC   logCPM        LR       PValue          FDR
## Medtr7g079200.1  10.474995 5.655524 258.39513 3.840109e-58 1.072005e-53
## Medtr6g034805.1  -5.629742 3.149727 153.69225 2.703860e-35 3.774048e-31
## Medtr2g028530.1  -9.353618 2.667978 148.21874 4.249283e-34 3.954099e-30
## Medtr6g051975.1   8.687313 1.917958 116.21369 4.267458e-27 2.978259e-23
## Medtr6g445300.1  -8.180778 1.570739  85.50786 2.307823e-20 1.288504e-16
## Medtr2g436330.1  -2.323846 3.962353  83.96557 5.034671e-20 2.342465e-16
## Medtr0002s0040.1 -4.276996 4.018988  75.21492 4.221655e-18 1.683596e-14
## Medtr2g054640.1   1.997224 6.048253  70.74996 4.054917e-17 1.414963e-13
## Medtr0024s0070.1  7.031074 1.392281  69.41668 7.971152e-17 2.472474e-13
## Medtr0196s0020.1  2.164260 5.643357  54.42308 1.616534e-13 4.512717e-10
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut2 ':
## Coefficient:  groupmut2 
##                      logFC     logCPM       LR       PValue          FDR
## Medtr7g079200.1   6.165822  5.6555242 67.77259 1.834830e-16 5.122111e-12
## Medtr2g006070.1   2.172786  5.1498495 38.93033 4.392029e-10 6.130394e-06
## Medtr5g069205.1  -7.592679  0.4681538 35.35716 2.744580e-09 2.232755e-05
## Medtr2g054640.1   1.397947  6.0482526 35.05864 3.199248e-09 2.232755e-05
## Medtr7g006245.1  -6.932746  1.5022720 33.74887 6.270537e-09 3.500966e-05
## Medtr4g082315.1  -6.207794 -0.5376342 31.93052 1.597868e-08 7.434345e-05
## Medtr4g035895.1  -1.742579  5.9993076 30.56317 3.231701e-08 1.288802e-04
## Medtr4g117170.1  -2.253610  2.7882773 25.98388 3.442802e-07 1.084179e-03
## Medtr1g019080.1  -3.558257  2.4657083 25.95464 3.495348e-07 1.084179e-03
## Medtr0508s0010.1  1.517946  3.1080756 25.06802 5.534317e-07 1.432564e-03
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut3 ':
## Coefficient:  groupmut3 
##                      logFC   logCPM        LR       PValue          FDR
## Medtr7g079200.1  10.480542 5.655524 258.40033 3.830090e-58 1.069208e-53
## Medtr6g034805.1  -5.409679 3.149727 147.54793 5.955959e-34 8.313327e-30
## Medtr2g028530.1  -9.324268 2.667978 145.11142 2.030493e-33 1.889441e-29
## Medtr0044s0130.1 -4.008075 3.911732 105.42133 9.874120e-25 6.891149e-21
## Medtr5g024095.1  -5.143842 3.732409  90.00130 2.380041e-21 1.328824e-17
## Medtr6g051975.1   8.083946 1.917958  88.99814 3.951833e-21 1.683232e-17
## Medtr7g066070.1  -4.144825 3.470096  88.86791 4.220741e-21 1.683232e-17
## Medtr6g445300.1  -8.156283 1.570739  83.45149 6.529892e-20 2.278606e-16
## Medtr0002s0040.1 -4.643405 4.018988  81.37178 1.870091e-19 5.800606e-16
## Medtr3g055120.1  -3.742504 2.945316  62.19284 3.114179e-15 8.693541e-12
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...

DEG_GLM2 <- as.data.frame(DEG_GLM2)
names(DEG_GLM2) <- c("name", "UD", "condition")
listDEG_GLM2 <- unique(DEG_GLM2$name)

pvals_GLM2 <- data.frame("pvalue" = pvals_GLM2,
                         "type" = rep(rep(c("raw", "adjusted"),
                                          each = nrow(raw_counts_wn)),
                                      length(all_conditions)),
                         "condition" = rep(all_conditions,
                                           each=nrow(raw_counts_wn)*2))

The distribution of p-values is given by:

p <- ggplot(data = pvals_GLM2, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

Finally, the total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:

table(DEG_GLM2$condition)

## 
## mut1 mut2 mut3 
##  327   68  190

table(DEG_GLM2$condition, DEG_GLM2$UD) # 1 means 'up-regulated'

##       
##          1  -1
##   mut1 247  80
##   mut2  21  47
##   mut3  74 116

length(listDEG_GLM2)

## [1] 504

Again, the number of obtained DEGs is of the same order (only very slightly larger) than what was found with the other two methods.

Fourth approach: voom and LM tests

In this last approach, the limma package is used to fit a linear model on transformed data. Similarly as in the previous section, the linear model is designed to have two additive effects, a group effect and a replicat effect.

This approach starts with the definition of a DGEList object as previsously and normalization factors are computed with the TMM approach. The function voom is then used to transform the data with the same design matrix as in the previous model:

cur_dge <- DGEList(raw_counts_wn)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")
vdge <- voom(cur_dge, design_matrix, plot = TRUE)

Then, the function lmFit is invoked to fit the model with the same design matrix and finally, the coefficients are shrinked with a Bayesian approach implemented in eBayes. The function decideTests allows to extract the DEGs after multiple testing correction (DEGs are indicated in the column corresponding to the tested condition with a 1, for up regulated genes, or a -1, for down regulated genes):

# LM fit
fit <- lmFit(vdge, design_matrix)
fit <- eBayes(fit)
cur_gres <- decideTests(fit, adjust.method = "BH", p.value = 0.05)
head(cur_gres)

##                  (Intercept) design$replicatrepbio2 design$replicatrepbio3
## Medtr0001s0010.1          -1                      0                      0
## Medtr0001s0200.1          -1                      0                      0
## Medtr0001s0260.1           1                      0                      0
## Medtr0001s0360.1          -1                      0                      0
## Medtr0001s0490.1          -1                      0                      0
## Medtr0002s0040.1           1                      0                      0
##                  groupmut1 groupmut2 groupmut3
## Medtr0001s0010.1         0         0         0
## Medtr0001s0200.1         0         0         0
## Medtr0001s0260.1         0         0         0
## Medtr0001s0360.1         0         0         0
## Medtr0001s0490.1         0         0         0
## Medtr0002s0040.1         0         0         0

DEGs corresponding to each of the three mutant types and corresponding status (up/down regulated) as well as the p-values are saved in DEG_voom and pvals_voom:

DEG_voom <- NULL
pvals_voom <- NULL
for (ind in 1:3) {
  res_voom <- fit$p.value
  pvals_voom <- c(pvals_voom, res_voom[ ,ind+3],
                  p.adjust(res_voom[ ,ind+3], method = "BH"))

  cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
  print(topTable(fit, coef = ind+3))
  
  sel_deg <- which(cur_gres[ ,ind+3] != 0)
  cur_res <- cbind(rownames(raw_counts_wn)[sel_deg], 
                   cur_gres[sel_deg,ind+3],
                   rep(all_conditions[ind], length(sel_deg)))
  DEG_voom <- rbind(DEG_voom, cur_res)
}

## Top 10 DEG for 'wt' vs ' mut1 ':
##                      logFC     AveExpr          t      P.Value
## Medtr6g051975.1   6.739374 -0.43009811  20.245425 4.867541e-10
## Medtr2g028530.1  -7.342487 -0.03160517 -19.031312 9.429348e-10
## Medtr2g436330.1  -2.307696  3.63095676 -11.827159 1.383082e-07
## Medtr6g445300.1  -6.193881 -0.61791117 -16.455440 4.423730e-09
## Medtr7g079200.1   9.762553  3.00431424  13.080276 4.901524e-08
## Medtr1g026140.1   5.141030 -1.94784797  13.347305 3.974531e-08
## Medtr2g054640.1   1.998888  5.83811133   9.064372 1.990549e-06
## Medtr1g094840.1   1.543271  3.28196917   9.007936 2.115814e-06
## Medtr0024s0070.1  6.024795 -1.25354315  10.237915 5.964608e-07
## Medtr6g034805.1  -5.447605  1.26135305  -9.904975 8.296906e-07
##                     adj.P.Val        B
## Medtr6g051975.1  1.316148e-05 8.472273
## Medtr2g028530.1  1.316148e-05 8.388615
## Medtr2g436330.1  6.435019e-04 7.741279
## Medtr6g445300.1  4.116429e-05 7.578494
## Medtr7g079200.1  2.736619e-04 6.817848
## Medtr1g026140.1  2.736619e-04 6.290930
## Medtr2g054640.1  4.543466e-03 5.443729
## Medtr1g094840.1  4.543466e-03 5.391351
## Medtr0024s0070.1 2.081350e-03 5.153579
## Medtr6g034805.1  2.316164e-03 5.012548
## Top 10 DEG for 'wt' vs ' mut2 ':
##                       logFC  AveExpr         t      P.Value adj.P.Val
## Medtr2g054640.1   1.4013639 5.838111  6.224875 6.555715e-05 0.4575234
## Medtr0508s0010.1  1.5052159 2.813566  5.898000 1.044513e-04 0.4859770
## Medtr2g006070.1   2.2098591 4.829084  5.644011 1.515410e-04 0.5860616
## Medtr5g063740.1  -0.9408826 4.385796 -5.419138 2.122790e-04 0.5860616
## Medtr4g104170.1  -0.8233931 5.508924 -5.390001 2.218708e-04 0.5860616
## Medtr6g060340.1   1.0231411 3.078198  5.546134 1.753314e-04 0.5860616
## Medtr8g063830.1  -1.1594136 5.241249 -5.306687 2.519250e-04 0.5860616
## Medtr4g104510.1   0.8412229 4.870297  5.082582 3.562885e-04 0.6024473
## Medtr4g035895.1  -1.7142203 5.760276 -5.044998 3.778768e-04 0.6024473
## Medtr8g064370.1   1.2056288 5.229920  4.944765 4.424966e-04 0.6024473
##                          B
## Medtr2g054640.1  2.0827830
## Medtr0508s0010.1 1.2864437
## Medtr2g006070.1  1.1988973
## Medtr5g063740.1  0.9635849
## Medtr4g104170.1  0.9451127
## Medtr6g060340.1  0.9207591
## Medtr8g063830.1  0.8258245
## Medtr4g104510.1  0.5015316
## Medtr4g035895.1  0.4345313
## Medtr8g064370.1  0.2972860
## Top 10 DEG for 'wt' vs ' mut3 ':
##                      logFC     AveExpr          t      P.Value
## Medtr2g028530.1  -7.292767 -0.03160517 -18.949440 9.873397e-10
## Medtr6g051975.1   6.144723 -0.43009811  17.919284 1.791176e-09
## Medtr6g445300.1  -6.142565 -0.61791117 -16.340956 4.762712e-09
## Medtr7g079200.1   9.745243  3.00431424  13.057438 4.991081e-08
## Medtr0079s0070.1 -5.041555  0.06658128 -11.292945 2.216452e-07
## Medtr6g034805.1  -5.711564  1.26135305 -10.392510 5.132504e-07
## Medtr2g436330.1  -1.381085  3.63095676  -8.310958 4.611795e-06
## Medtr4g075345.1  -4.888321 -1.36136651 -10.018936 7.403141e-07
## Medtr4g046737.1   1.441677  4.13880651   7.752055 8.932441e-06
## Medtr0044s0130.1 -3.916287  3.12999397  -8.280226 4.778367e-06
##                     adj.P.Val        B
## Medtr2g028530.1  2.500124e-05 7.417968
## Medtr6g051975.1  2.500124e-05 6.902496
## Medtr6g445300.1  4.431862e-05 6.667024
## Medtr7g079200.1  3.483275e-04 6.120476
## Medtr0079s0070.1 1.237490e-03 4.777822
## Medtr6g034805.1  2.387983e-03 4.777653
## Medtr2g436330.1  1.212663e-02 4.632578
## Medtr4g075345.1  2.952372e-03 4.181997
## Medtr4g046737.1  1.858069e-02 4.017856
## Medtr0044s0130.1 1.212663e-02 3.696410

DEG_voom <- as.data.frame(DEG_voom)
names(DEG_voom) <- c("name", "UD", "condition")
listDEG_voom <- unique(DEG_voom$name)

pvals_voom <- data.frame("pvalue" = pvals_voom,
                         "type" = rep(rep(c("raw", "adjusted"),
                                          each = nrow(raw_counts_wn)),
                                      length(all_conditions)),
                         "condition" = rep(all_conditions,
                                           each=nrow(raw_counts_wn)*2))

The distribution of p-values is given by:

p <- ggplot(data = pvals_voom, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

The total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:

table(DEG_voom$condition)

## 
## mut1 mut3 
##   30   18

table(DEG_voom$condition, DEG_voom$UD) # 1 means 'up-regulated'

##       
##         1 -1
##   mut1 19 11
##   mut3  5 13

length(listDEG_voom)

## [1] 35

The number of DEGs found with this approach is much smaller than what was obtained with the Negative Binomial models. In particular, the linear models does not find any DEG for mutant 2.

Comparison

In this last part, a Venn diagram is displayed for the results (unique genes for any type of mutant) of the four approaches:

vd <- venn.diagram(x=list("Exact test" = listDEG_pairwiseExact,
                          "GLM\n group" = listDEG_GLM1, 
                          "GLM\n group + replicate" = listDEG_GLM2,
                          "voom" = listDEG_voom), 
                   fill = brewer.pal(4, "Set3"), 
                   cat.col = c("darkgreen", "black", "darkblue", "darkred"),
                   cat.cex = 1.5, fontface="bold", filename=NULL)
grid.draw(vd)

This chart yields to several conclusions:

first, the results of voom are very consistant with the results of the GLM approach using the same design matrix (with two additive factors), only they are much more stringent (the number of genes found by that method is more than 10 times smaller than what was found by the GLM approach);
second, a large number of genes (\(321 + 33 = 354\) out of approximately 500) are in common between all three negative binomial approaches. As expected, the two more consistent methods are the two GLM approaches (with an additional number of \(73+2=75\) DEGs in common and \(31+41+59+16 = 147\) DEGs that are specific of a given model). The Exact Test has 31 more genes in common with the GLM on group only and 16 more DEGs in common with the GLM on group and replicate;
the 59 DEGs that were found exclusively by GLM with two additive factors might be interesting because they are DEGs which can not be found if the genotype effect is not included in the model.

Session information

To obtain results similar to mines, make sure that your session has identical features than the one used to compile this document:

session_info()

## Session info --------------------------------------------------------------

##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2016-07-03

## Packages ------------------------------------------------------------------

##  package        * version   date       source        
##  annotate         1.48.0    2016-07-03 Bioconductor  
##  AnnotationDbi    1.32.3    2016-07-03 Bioconductor  
##  assertthat       0.1       2013-12-06 CRAN (R 3.3.1)
##  Biobase        * 2.30.0    2016-07-03 Bioconductor  
##  BiocGenerics   * 0.16.1    2016-07-03 Bioconductor  
##  colorspace       1.2-6     2015-03-11 CRAN (R 3.3.1)
##  corpcor          1.6.8     2015-07-08 CRAN (R 3.3.0)
##  DBI              0.4-1     2016-05-08 CRAN (R 3.3.1)
##  DESeq          * 1.22.1    2016-07-03 Bioconductor  
##  devtools       * 1.12.0    2016-06-24 CRAN (R 3.3.1)
##  digest           0.6.9     2016-01-08 CRAN (R 3.3.1)
##  dplyr            0.5.0     2016-06-24 CRAN (R 3.3.1)
##  edgeR          * 3.12.1    2016-07-03 Bioconductor  
##  ellipse          0.3-8     2013-04-13 CRAN (R 3.3.0)
##  evaluate         0.9       2016-04-29 CRAN (R 3.3.0)
##  formatR          1.4       2016-05-09 CRAN (R 3.3.0)
##  futile.logger  * 1.4.1     2015-04-20 CRAN (R 3.2.2)
##  futile.options   1.0.0     2010-04-06 CRAN (R 3.2.2)
##  genefilter       1.52.1    2016-07-03 Bioconductor  
##  geneplotter      1.48.0    2016-07-03 Bioconductor  
##  ggplot2        * 2.1.0     2016-03-01 CRAN (R 3.3.1)
##  gridExtra      * 2.2.1     2016-02-29 CRAN (R 3.3.0)
##  gtable           0.2.0     2016-02-26 CRAN (R 3.3.1)
##  htmltools        0.3.5     2016-03-21 CRAN (R 3.3.0)
##  igraph           1.0.1     2015-06-26 CRAN (R 3.3.0)
##  IRanges          2.4.8     2016-07-03 Bioconductor  
##  knitr            1.13      2016-05-09 CRAN (R 3.3.0)
##  labeling         0.3       2014-08-23 CRAN (R 3.2.2)
##  lambda.r         1.1.7     2015-03-20 CRAN (R 3.2.2)
##  lattice        * 0.20-33   2015-07-14 CRAN (R 3.2.1)
##  limma          * 3.26.9    2016-07-03 Bioconductor  
##  locfit         * 1.5-9.1   2013-04-20 CRAN (R 3.2.2)
##  magrittr         1.5       2014-11-22 CRAN (R 3.2.2)
##  MASS           * 7.3-45    2016-04-21 CRAN (R 3.3.1)
##  Matrix           1.2-6     2016-05-02 CRAN (R 3.3.1)
##  memoise          1.0.0     2016-01-29 CRAN (R 3.3.0)
##  mixOmics       * 6.0.0     2016-06-14 CRAN (R 3.3.1)
##  munsell          0.4.3     2016-02-13 CRAN (R 3.3.1)
##  plyr             1.8.4     2016-06-08 CRAN (R 3.3.1)
##  R6               2.1.2     2016-01-26 CRAN (R 3.3.0)
##  RColorBrewer   * 1.1-2     2014-12-07 CRAN (R 3.3.1)
##  Rcpp             0.12.5    2016-05-14 CRAN (R 3.3.1)
##  reshape2       * 1.4.1     2014-12-06 CRAN (R 3.2.2)
##  rgl              0.94.1131 2014-09-08 CRAN (R 3.1.1)
##  rmarkdown        0.9.6     2016-05-01 CRAN (R 3.3.0)
##  RSQLite          1.0.0     2014-10-25 CRAN (R 3.2.2)
##  S4Vectors        0.8.11    2016-07-03 Bioconductor  
##  scales           0.4.0     2016-02-26 CRAN (R 3.3.1)
##  stringi          1.1.1     2016-05-27 CRAN (R 3.3.1)
##  stringr          1.0.0     2015-04-30 CRAN (R 3.2.2)
##  survival         2.39-5    2016-06-26 CRAN (R 3.3.1)
##  tibble           1.0       2016-03-23 CRAN (R 3.3.1)
##  tidyr            0.5.1     2016-06-14 CRAN (R 3.3.1)
##  VennDiagram    * 1.6.17    2016-04-18 CRAN (R 3.3.0)
##  withr            1.0.2     2016-06-20 CRAN (R 3.3.1)
##  XML              3.98-1.4  2016-03-01 CRAN (R 3.3.1)
##  xtable           1.8-2     2016-02-05 CRAN (R 3.3.1)
##  yaml             2.1.13    2014-06-12 CRAN (R 3.3.0)

Practical session about Normalization and differential analysis of RNAseq data - Effect of normalization and differential test

Nathalie Villa-Vialaneix

SPS Summer School - July 18-19th 2016

Data description and importation

Raw counts

Experimental design

Basic exploratory analysis of raw counts

Count distribution

Relation between mean and variance

PCA

Normalization

DESeq

edgeR

Total count

RPKM

Upper quartile

TMM

Comparison

Differential analysis

First approach: pairwise exact test

Second approach: overall GLM tests with group effect

Third approach: overall GLM tests with group and replicate effects

Fourth approach: voom and LM tests

Comparison

Session information