```{r globalOptions, include=FALSE} # comment / uncomment to show / hide R code knitr::opts_chunk$set(echo = TRUE) ``` This file analyses the data of the count table included in ```countData.txt```. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. One observation in a given condition has a clone in the second condition: this will be refered as the "genotype effect" in the rest of the document. The data were provided as a courtesy of Mohamed Zouine (ENSAT, Toulouse). # Where to start? (installation and alike) Download the following files: * http://www.nathalievilla.org/doc/R/solution_edgeR-tomato.R (**R** script) * http://www.nathalievilla.org/doc/txt/countData.txt (dataset, text format) * ... (design, text format) and save them all in the same directory. Then start RStudio and using the menu "File/Open File...", open the file solution_edgeR-tomato.R (that contains the **R** script of this document). Also set your working directory to the directory where you have downloaded all files with the menu "Session / Set working directory / To source file location". The command lines in the **R** script can be executed using the button "Run" or the shortcut ```Ctrl+Enter```. The analysis has been performed using **R** and the **R** packages edgeR, limma, RColorBrewer, mixOmics and HTSFilter. Precise versions of the **R** software used in this document is given in the last section of this document. The required packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievilla.org/teaching/rnaseq.html to install it by yourself (```HTSFilter``` is probably not installed; you can not use the standard command ```install.packages``` to install it because it is a Bioconductor package: check the webpage!). ```{r loadLib, results='hide', message=FALSE} ## Do Not Run # how to install required packages? # install.packages(c("RColorBrewer","mixOmics")) # source("http://bioconductor.org/biocLite.R") # biocLite("edgeR") ## End DNR library(edgeR) library(limma) library(RColorBrewer) library(mixOmics) library(VennDiagram) library(HTSFilter) ``` # Preparing datasets ## Description of the files (count table and design) If you open the file ```countData.txt``` with a simple text reader (wordpad for instance), you see that: - the first column contains the gene names; - the next three columns contain the WT samples; - the last three columns contain the Mutant samples; - columns are separated by tabs. If you open the file ```design.csv``` with a simple text reader (wordpad for instance), you see that: - the first column contains the sample names; - the second column contains the condition (M for Mutant or WT for Wild Type); - columns are separated by commas; - the first row contains column names. ## Importing the files The two files are imported in, respectively, ```rawCountTable``` and ```sampleInfo``` with the function ```read.table```: ```{r importData} rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1) sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1) ``` Their content can be checked with ```head``` that print the first 6 rows: ```{r checkData} head(rawCountTable) nrow(rawCountTable) ``` `r nrow(rawCountTable)` genes are included in this file. Similarly, informations about samples are contained into ```sampleInfo```. The entry ```genotype```, that corresponds to the clone number, is converted into a factor for subsequent analyses. ```{r checkDesign} head(sampleInfo) sampleInfo$genotype <- as.factor(sampleInfo$genotype) ``` ## Creating a ```DGEList``` object A DGEList object is needed to process RNAseq datasets. This object is created using the count table and the design file: ```{r DGEListCreation} dgeFull <- DGEList(rawCountTable, remove.zeros = TRUE) dgeFull ``` The information about sample genotypes and conditions can be added to the ```$samples``` entry with: ```{r DGEListSamples} dgeFull$samples$condition <- relevel(sampleInfo$condition, ref = "WT") dgeFull$samples$genotype <- sampleInfo$genotype dgeFull ``` # Data exploration and quality assessment Exploratory analysis is generally perormed on $\log_2$-transformed counts to avoid problems due to skewness of the count distribution: ```{r pseudoCounts} pseudoCounts <- log2(dgeFull$counts + 1) head(pseudoCounts) ``` Standard exploratory analyses include: ## Histogram for pseudo-counts (sample ```Cond.WT.Rep.1```) ```{r histoPseudoCounts} hist(pseudoCounts[ ,"Cond.WT.Rep.1"], main = "", xlab = "counts") ``` ## Boxplot for pseudo-counts ```{r boxplotPseudoCounts} par(mar = c(8,4,1,2)) boxplot(pseudoCounts, col = "gray", las = 3, cex.names = 1) ``` ## MA-plots between first two WT samples (using ```limma``` package) ```{r maPlotPseudoCounts, fig.width=10} limma::plotMA(pseudoCounts[ ,1:2], xlab = "M", ylab = "A", main = "") abline(h = 0, col = "red") ``` ## MDS for pseudo-counts (using ```limma``` package) MDS is similar to PCA when Euclidean distances are used to assess the distances between samples. In ```limma```, only the 500 top genes (the most variable genes accross samples). ```{r MDSPseudoCounts} colConditions <- brewer.pal(3, "Set2") colConditions <- colConditions[match(sampleInfo$condition, levels(sampleInfo$condition))] pchGenotypes <- c(8, 15, 16)[match(sampleInfo$genotype, levels(sampleInfo$genotype))] plotMDS(pseudoCounts, pch = pchGenotypes, col = colConditions) legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], legend = levels(sampleInfo$condition)) legend("bottomright", pch = c(8, 15, 16), legend = levels(sampleInfo$genotype)) ``` ## Heatmap for pseudo-counts (using ```mixOmics``` package) ```{r cimPseudoCounts, fig.width=10, fig.height=10} sampleDists <- as.matrix(dist(t(pseudoCounts))) sampleDists cimColor <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16) cim(sampleDists, color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7) ``` # Normalization ## Compute normalization factors ```{r estimateNormFactors} dgeFull <- calcNormFactors(dgeFull, method="TMM") dgeFull ``` **Important**: using ```calcNormFactors``` does not change the counts: it just updates the column ```norm.factors``` in ```$samples```. It is therefore recommanded that you use the same name (```dgeFull```) to save the result of this function: ```{r countsAfterNorm} head(dgeFull$counts) ``` ```{r sampleInfoAfterNorm} dgeFull$samples ``` ## Normalized counts exploratory analysis Normalized counts and pseudo-counts can be extracted from ```dgeFull``` using the function ```cpm```: ```{r estimateNormCounts, fig.width=10, fig.height=10} normCounts <- cpm(dgeFull) pseudoNormCounts <- cpm(dgeFull, log = TRUE, prior.count = 1) par(mar = c(8,4,1,2)) boxplot(pseudoNormCounts, col = "gray", las = 3, cex.names = 1) ``` ```{r normMDS} plotMDS(pseudoNormCounts, pch = pchGenotypes, col = colConditions) legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], legend = levels(sampleInfo$condition)) legend("bottomright", pch = c(8, 15, 16), legend = levels(sampleInfo$genotype)) ``` A further analysis and comparison of the different normalizations provided in the **R** packages ```edgeR``` and ```DESeq2``` is provided in [this document](http://www.nathalievilla.org/doc/html/TP1_normalization.html) (the design of the dataset used in this practical application is similar to the design of the tomato dataset). # Differential analysis This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type tomatoes and the mutants: * a standard NB exact test between two conditions; * a GLM with the plant and genotype effects. ## First approach: exact test between the two groups In this first approach, the differences between the two groups (WT and M) is tested using an exact NB test between the two groups. This method is performed by: * creating a ```DGEList``` object using the argument ```group``` and using the same normalization factors than in ```dgeFull``` ; * estimating the dispersion for this object with the functions ```estimateCommonDisp``` and ```estimateTagwiseDisp``` ; * performing the test with the function ```exactTest```. ### Using the argument ```group``` in ```DGEList``` A new ```DGEList``` object is created with the argument ```group```. Normalization factors are updated from that in ```dgeFull```. ```{r DGEListCreationG} dgeFull.group <- DGEList(rawCountTable, remove.zeros = TRUE, group = dgeFull$samples$condition) dgeFull.group$samples$norm.factors <- dgeFull$samples$norm.factors dgeFull.group ``` ### Estimate dispersion Common and then tagwise dispersions can be estimated with: ```{r estimateDispersion, cache=TRUE} dgeFull.group <- estimateCommonDisp(dgeFull.group) dgeFull.group <- estimateTagwiseDisp(dgeFull.group) dgeFull.group ``` ```edgeR``` also contains a function to estimate dispersions in a more robust way, that can be used if the previous approach seems to fail. This function is ```estimateGLMRobustDisp```. The quality of the variability estimatino can be assessed with the BCV versus average log CPM plot (that plots $\phi_g$ versus the average normalized count for all genes): ```{r plotBCV} plotBCV(dgeFull.group) ``` ### Perform the test An exact perform an exact test for the difference in expression between the two conditions "WT" and "M": ```{r fisherExact} dgeExactTest <- exactTest(dgeFull.group) dgeExactTest ``` $p$-values are corrected with the function ```topTags``` : ```{r topTag} resExactTest <- topTags(dgeExactTest, n = nrow(dgeExactTest$table)) head(resExactTest$table) ``` $p$-value and (BH) adjusted $p$-value distribution can be assessed with: ```{r histPValExact, fig.width=10, fig.height=6} par(mfrow = c(1,2)) hist(resExactTest$table$PValue, xlab = "p-value", main = "raw p-values") hist(resExactTest$table$FDR, xlab = "p-value", main = "adjusted p-values") ``` And finally, genes with a FDR smaller than 5% and a log Fold Change larger than 1 or smaller than -1 are extracted: ```{r DEGExact} selectedET <- resExactTest$table$FDR < 0.05 & abs(resExactTest$table$logFC) > 1 selectedET <- resExactTest$table[selectedET, ] nrow(selectedET) head(selectedET) ``` which shows that `r nrow(selectedET)` genes are found differential with this method. The column ```logFC``` can be used to found up/down regulated genes in the M: ```{r UDregExact} selectedET$updown <- factor(ifelse(selectedET$logFC > 0, "up", "down")) head(selectedET) ``` This list can be exported with: ```{r exportExact} write.table(selectedET, file = "tomatoDEG.csv", sep = ",") ``` ## Second approach: GLM with condition and genotype effects Here, we fit a GLM to account for the genotype effect. The model writes $K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)$ with $\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})$ in which $j$ is the sample number, $s_j$ is the normalization factor and $\log(\lambda_{gj})$ is explained by $\log(\lambda_{gj}) = \lambda_{g0} + \beta_{g,j \textrm{ is M}} + \gamma_{g,j\textrm{ is clone 2}} + \gamma_{g,j\textrm{ is clone +}}$ (WT condition and clone number 1 are reference levels). ### Estimate dispersion The model is first encoded in a design matrix: ```{r designMatrix} design.matrix <- model.matrix(~ dgeFull$samples$condition + dgeFull$samples$genotype) design.matrix ``` Common, trended and then tagwise dispersions can be estimated with: ```{r estimateDispersionGLM, cache=TRUE} dgeFull <- estimateDisp(dgeFull, design.matrix) dgeFull ``` The quality of the variability estimation can be assessed with the BCV versus average log CPM plot (that plots $\phi_g$ versus the average normalized count for all genes): ```{r plotBCVGLM} plotBCV(dgeFull) ``` ### Fit GLM and perform the test The GLM is fitted with the function ```glmFit```: ```{r GLMfit} fit <- glmFit(dgeFull, design.matrix) fit ``` Then tests can be performed with a log-ratio test (function ```glmRT```). For instance, to testing differential genes between WT and M, is equivalent to testing the nullity of the second coefficient (see the design matrix) : ```{r LRTtest} dgeLRTtest <- glmLRT(fit, coef = 2) dgeLRTtest ``` Testing differential genes between clone number 2 and 3 is equivalent to testing the equality of coefficients 3 and 4: ```{r LRTtest2} contrasts <- rep(0, ncol(design.matrix)) contrasts[3] <- 1 contrasts[4] <- -1 dgeLRTtest2 <- glmLRT(fit, contrast = contrasts) dgeLRTtest2 ``` Finally, DEGs can be extracted as previously, using the function ```topTags```: ```{r topTagsGLM} resLRT <- topTags(dgeLRTtest, n = nrow(dgeFull$counts)) head(resLRT$table) ``` ```{r extractDEGGLM} selectedLRT <- resLRT$table$FDR < 0.05 & abs(resLRT$table$logFC) > 1 selectedLRT <- resLRT$table[selectedLRT, ] nrow(selectedLRT) head(selectedLRT) ``` ## Comparison A Venn diagram comparing the two approaches is provided below: ```{r VennDiagram} vd <- venn.diagram(x = list("Exact test" = rownames(selectedET), "GLM" = rownames(selectedLRT)), fill = brewer.pal(3, "Set2")[1:2], filename = NULL) grid.draw(vd) ``` More complex models and a more detailed comparison between the different approches for differential analysis is provided in [this document](http://www.nathalievilla.org/doc/html/TP1_normalization.html) (the same that was previously pointed for normalization comparison) and [this document](http://www.nathalievilla.org/doc/html/TP2_interaction.html) (with the analysis of an interaction effect). # Filtering ## Differential analysis after independent filtering Independant filtering can be performed with the package ```HTSFilter``` after the dispersion has been estimated: ```{r filter, cache=TRUE} dgeFilt <- HTSFilter(dgeFull)$filteredData dgeFilt ``` Then, the differential analysis (GLM approach) is performed: ```{r GLMfilt} fit <- glmFit(dgeFilt, design.matrix) dgeLRTfilt <- glmLRT(fit, coef = 2) resLRTfilt <- topTags(dgeLRTfilt, n = nrow(dgeFilt$counts)) selectedFilt <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1 selectedFilt <- resLRTfilt$table[selectedFilt, ] nrow(selectedFilt) head(selectedFilt) ``` ## Comparison A Venn diagram comparing the two approaches is provided below: ```{r VennDiagramFilt} vd <- venn.diagram(x = list("No filtering" = rownames(selectedLRT), "Filtering" = rownames(selectedFilt)), fill = brewer.pal(3, "Set2")[1:2], filename = NULL) grid.draw(vd) ``` # Exploratory analysis of DEGs ## Create a MA plot with differentially expressed genes To create a MA plot between M and WT, the entry ```$samples$group``` of the ```DGEList``` object must be filled with the indication of what the two groups are. ```{r MADEG} dgeFilt$samples$group <- dgeFilt$samples$condition plotSmear(dgeFilt, de.tags = rownames(selectedFilt)) ``` ## Volcano plot ```{r volcanoPlot} volcanoData <- cbind(resLRTfilt$table$logFC, -log10(resLRTfilt$table$FDR)) colnames(volcanoData) <- c("logFC", "negLogPval") DEGs <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1 point.col <- ifelse(DEGs, "red", "black") plot(volcanoData, pch = 16, col = point.col, cex = 0.5) ``` ## Heatmap ```{r selectDEHeatmap, fig.width=10, fig.height=10} selY <- cpm(dgeFilt, log = TRUE, prior.count = 1) selY <- selY[match(rownames(selectedFilt), rownames(dgeFilt$counts)), ] finalHM <- cim(t(selY), color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7) ``` If you are interested in the result of the gene clustering, the result of HAC is saved into ```$ddc```: ```{r plotDendoGenes, fig.width=10, fig.height=10} plot(finalHM$ddc, leaflab="none") abline(h=10, lwd=2, col="pink") ``` Using this dendrogram, we might want to cut the tree at level $h=10$ (for instance). This can be performed using the function ```cutree```, which will provide a cluster membership for each gene. ```{r cutDendoGenes} geneClust <- cutree(as.hclust(finalHM$ddc), h=10) head(geneClust) ``` For instance, the number of clusters is equal to ```{r nbClustGene} length(unique(geneClust)) ``` and the genes in cluster 1 are: ```{r geneClust1} names(which(geneClust == 1)) ``` # Session information ```{r sessionInformation, echo=TRUE} sessionInfo() ``` ```{r deleteVD, echo=FALSE} system("rm VennDiagram*.log") ```