Align Reads

Aligning reads to a reference genome

Now we’re ready to align our reads (fastq) to the reference genome index we created in the last section. For this workshop we’re only going to align a subset of reads for a single sample, as aligning all reads for all samples would take far too long. For further analysis you will be provided with pre-prepared files.

Lets change directory to see our reads

cd ~/rna_tutorial/sample_data_for_alignment

Next we can run STAR to align our reads

STAR --runThreadN 4 --genomeDir ../reference/ --readFilesIn R1_sub_sample.fastq R2_sub_sample.fastq --outSAMtype BAM SortedByCoordinate

This will take a minute or two to finish. There are a few parameters used here:

  • --runThreadN The number of threads we wish to use to run the command
  • --genomeDir The directory where we stored our genome index
  • --readFilesIn Our fastq file(s). Note for paired end data we have an R1 and R2 file. These are listed separated by a space.
  • --outSAMtype Here, we specifiy that the output alignment file should be binary (BAM) as it saves storage space and be sorted by mapping position. Many downstream programs we will use will want BAM files sorted by position. You can find details of alignment files here.

There is one other parameter that we haven’t used that is worth some discussion:

  • --quantMode This parameter that can be run with a value ‘GeneCounts’ will count the number of reads that map to each gene. This is a very important step in expression quantification and gene expression analysis. However, in this workshop we will use htseq in the next section.