Prepare Reference

Preparing a reference genome

We are now almost ready to align our reads (FASTQ) to the reference genome, but first the reference has to be prepared. The reference consists of a set of chromosomal sequences (stored in a FASTA file) and a set of gene annotations (stored in a GTF file). We will build an index of the reference genome so that our chosen aligner can search the genome efficiently when aligning our reads.

In this workshop we will be using STAR, an ultrafast universal RNA-Seq aligner. If you’re interested in using STAR in your own work, it can be downloaded from GitHub.

To build our reference we will:

Move into the reference directory and view the files

cd ~/rna_tutorial/reference
ls -l Ataliana*

Build the reference with STAR using the command

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir . --genomeFastaFiles Ataliana.fa --sjdbGTFfile Ataliana.gtf --sjdbOverhang 100

The program will then take a few minutes to run and build your genome index. This command contains several parameters:

  • --runThreadN The number of threads to use to run the command
  • --runMode Which STAR mode to run; here genomeGenerate builds a genome index
  • --genomeDir The directory where the index files are stored; . is the current directory
  • --genomeFastaFiles The FASTA file(s) that contain the genome sequences
  • --sjdbGTFfile The GTF file that contains the known gene annotations
  • --sjdbOverhang The length of genomic sequence to use on each side of an annotated splice junction. Ideally this is the read length minus 1 (for example, 99 for 100-base reads), though 100 will work well in most cases.
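If you would rather derive --sjdbOverhang from your own data than rely on 100, the read length can be measured directly from a FASTQ file. The following is a minimal sketch, assuming an uncompressed FASTQ file with four lines per record; the function name overhang_from_fastq is made up for this illustration, and only the first 1000 reads are sampled.

```shell
# Hypothetical helper: print (longest read length - 1) for a FASTQ file.
# Assumes an uncompressed FASTQ with exactly four lines per record.
overhang_from_fastq() {
    # Sequence lines are lines 2, 6, 10, ... so NR % 4 == 2 selects them.
    # Stop after 4000 lines (the first 1000 reads) to keep this quick.
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         NR >= 4000  { exit }
         END         { print max - 1 }' "$1"
}
```

Calling the function with the path to one of your FASTQ files prints a single number, which can then be passed to --sjdbOverhang in place of 100. For gzipped reads, decompress the file first.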

Once STAR has finished successfully, we can view the newly created files.

ls -lh

We now have many new files that contain genome information, chromosome information and our index. These files will be used by STAR to perform the alignment of our reads.
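As a quick sanity check before moving on to alignment, you can confirm that the core index files are present and not empty. This is a sketch rather than an exhaustive check: the function name check_star_index is made up for this illustration, and the list covers only a handful of the files STAR writes (Genome, SA, SAindex, chrName.txt and genomeParameters.txt).

```shell
# Hypothetical helper: verify that key STAR index files exist and are non-empty.
# Takes the index directory as its argument (defaults to the current directory).
check_star_index() {
    dir="${1:-.}"
    missing=0
    for f in Genome SA SAindex chrName.txt genomeParameters.txt; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 means all listed files were found; 1 means at least one was not.
    return "$missing"
}
```

Running check_star_index . from the reference directory prints nothing and returns 0 if all of the listed files were found. Any file reported as missing or empty suggests the index build did not complete, in which case the Log.out file written by STAR is a good place to look.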