Now we’re ready to align our reads (fastq) to the reference genome index we created in the last section. For this workshop we’re only going to align a subset of reads for a single sample, as aligning all reads for all samples would take far too long. For further analysis you will be provided with pre-prepared files.
Lets change directory to see our reads
cd ~/rna_tutorial/sample_data_for_alignment
Next we can run STAR to align our reads
STAR --runThreadN 4 --genomeDir ../reference/ --readFilesIn R1_sub_sample.fastq R2_sub_sample.fastq --outSAMtype BAM SortedByCoordinate
This will take a minute or two to finish. There are a few parameters used here:
--runThreadN
The number of threads we wish to use to run the command--genomeDir
The directory where we stored our genome index--readFilesIn
Our fastq file(s). Note for paired end data we have an R1 and R2 file. These are listed separated by a space.--outSAMtype
Here, we specifiy that the output alignment file should be binary (BAM) as it saves storage space and be sorted by mapping position. Many downstream programs we will use will want BAM files sorted by position. You can find details of alignment files here.There is one other parameter that we haven’t used that is worth some discussion:
--quantMode
This parameter that can be run with a value ‘GeneCounts’ will count the number of reads that map to each gene. This is a very important step in expression quantification and gene expression analysis. However, in this workshop we will use htseq in the next section.