DESeq2 - Differential Expression

DESeq2 Differential Expression analysis - Identifying differentially expressed genes

So far we have counted the number of reads for each gene in each sample and get it into a form that DESeq2 can analyse, displayed a heatmap (with clustering) of the top 50 most variable genes and created a PCA plot of our samples. We are now ready to find up and downregulated genes.

Note: that we use the raw counts of the gene expression rather than the log transform which we used for visualisation.

To run the DESeq2 analysis:

dc14DESeqDataSet = DESeq(dc14DESeqDataSet)

You’ll see some text representing the progress of the analysis and once it has finished you can see the format of the results:

results(dc14DESeqDataSet)

DESeq2 DE

You can also get a description of the columns:

mcols(results(dc14DESeqDataSet),use.names=T)

DESeq2 columns

The columns we are most interested in (unless you are a statistician) are:

  • log2foldchange - a value of 1 means that the expression level for a gene in the mutant is twice that in the wild type.
  • padj values - this is the p-value that we use to judge significance. It has been adjusted because we have done thousands of tests, which can introduce error. It is sometimes called a false discovery rate and more information can be found here.

Let’s check the 10 most significantly upregulated genes and see whether the calculations look plausible.

First lets extract the top 10 differentially expressed genes (ranked by padj values)

topUpregGenes = head(order(results(dc14DESeqDataSet)$padj),n=10)
results(dc14DESeqDataSet)[topUpregGenes,]

If we look at AT5G47390 we can see that the log2FoldChange tells us that the gene seems to be upregulated in the mutant strain and the padj value shows us that this change is highly significant. As a sanity check we can look at the raw count numbers these genes:

assay(dc14DESeqDataSet)[topUpregGenes,]

The counts confirm the identified differential expression with the mutant strains showing many more counts in AT5G47390 than in the wild-type strains.

We can also use an MA Plot visualize all the results. This plots the log ratio (M) on the x axis and the average expression (A) on the y axis.

plotMA(dc14DESeqDataSet,ylim=c(-2.5,2.5))

DESeq volcano

This plot shows one point for each gene - if DESeq2 considers the gene to be significantly up or down regulated it is shown in red, otherwise in grey. The curved boundary between red and grey shows that as the expression level of the gene increases DESeq2 can call genes with lower fold change ratios.

A large number of genes have been identified as being differentially regulated; this is consistent with the large difference in phenotypic response of the two strains.

To create files containing all statistically significantly up and down-regulated (by 2 fold) genes we use the which() command to select them

downregGenes = which(results(dc14DESeqDataSet)$log2FoldChange < -1)
upregGenes = which(results(dc14DESeqDataSet)$log2FoldChange > 1)

Remember here that a 2 fold change in expression is represented by a log2 fold change of 1.

Next, we extract all the genes that show significant differential expression using a padj cutoff of 0.05.

sigGenes = which(results(dc14DESeqDataSet)$padj < 0.05)

Finally, we write the intersection (appears in both) of significantly differentially expressed genes with those that are up and down regulated. These files will be useful for downstream analysis programs or for making tables for publications.

write.table (file='downreg.tab', results(dc14DESeqDataSet) [intersect(downregGenes, sigGenes ),])
write.table (file='upreg.tab', results(dc14DESeqDataSet) [intersect(upregGenes, sigGenes ),])

Summary so far

You should have some appreciation of the process and challenges of analysing RNA-seq data for differentially expressed genes. To recap the process we followed was:

  • prepare the reference genome
  • align reads to the genome using a splicing aware program
  • count the number of reads which align to each gene
  • generate heat maps and PCA plots to visualise our experiment
  • apply the DESeq2 analysis toolkit to identify DE genes
  • filter most highly DE genes and visualise results.