We will filter the data so that low-quality reads are removed and reads containing adaptor sequence are either trimmed or discarded. To do this we will use the fastq-mcf program from the ea-utils package. It is fast, and it keeps the read 1 and read 2 files in the same order after filtering, so read pairs stay matched.
Note: Typically when submitting raw Illumina data to NCBI or EBI you would submit unfiltered data, so don’t delete your original fastq files!
The fastq-mcf program performs both adaptor trimming and removal of low-quality bases. To remove adaptor sequences, we need to supply them to the program. A list of the most commonly used adaptors is given in the file:
~/rna_tutorial/reference/adaptors.fasta
You can view the file with the command:
more ~/rna_tutorial/reference/adaptors.fasta
Note: Quit out of the file viewer by pressing q
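If you only want to know how many adaptor sequences the file holds, you can count its header lines rather than paging through it. The following is a minimal sketch; it relies on the FASTA convention that every record starts with a line beginning with ">", and the helper name count_fasta_records is our own, not part of ea-utils.

```shell
# Count the sequences in a FASTA file.
# Each record starts with a header line beginning with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

adaptors="$HOME/rna_tutorial/reference/adaptors.fasta"

# Only report if the tutorial file is present on this machine.
if [ -f "$adaptors" ]; then
    echo "Adaptor sequences: $(count_fasta_records "$adaptors")"
fi
```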
Change to the directory containing the sample data
cd ~/rna_tutorial/sample_data_for_alignment
Then run the command
fastq-mcf ~/rna_tutorial/reference/adaptors.fasta R1_sub_sample.fastq R2_sub_sample.fastq -o R1_sub_sample.filtered.fastq -o R2_sub_sample.filtered.fastq -C 1000000 -l 50 -q 20 -p 10 -u -x 0.01
To see the function of all the options listed above use
fastq-mcf -h
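To make the filtering command easier to read and reuse, the same call can be wrapped in a small function with named settings. This is only a sketch: the variable names and the filter_pair helper are our own, and the comments paraphrase the help text as we understand it, so please confirm each option against the output of fastq-mcf -h for your installed version.

```shell
# Values copied from the tutorial command above.
subsample=1000000     # -C: number of reads to use for subsampling
min_length=50         # -l: minimum remaining sequence length
min_quality=20        # -q: quality threshold causing base removal
max_adaptor_diff=10   # -p: maximum adaptor difference percentage
max_n_percent=0.01    # -x: 'N' percentage causing cycle removal
# -u (no value) forces Illumina PF filtering on

# Hypothetical helper: $1 = read 1 file, $2 = read 2 file, $3 = adaptor FASTA.
# Output names replace the .fastq suffix with .filtered.fastq.
filter_pair() {
    fastq-mcf "$3" "$1" "$2" \
        -o "${1%.fastq}.filtered.fastq" \
        -o "${2%.fastq}.filtered.fastq" \
        -C "$subsample" -l "$min_length" -q "$min_quality" \
        -p "$max_adaptor_diff" -u -x "$max_n_percent"
}
```

Called as filter_pair R1_sub_sample.fastq R2_sub_sample.fastq ~/rna_tutorial/reference/adaptors.fasta, this should run the same command as the one given above.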
Finally, run the command
ls -lh
This lists the files in the directory along with their sizes. The filtered files are only slightly smaller than the originals, which is consistent with our assessment that the data are of good quality.
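File sizes give only a rough picture, so it can help to count the reads directly. The sketch below assumes each FASTQ record occupies exactly four lines, which holds for standard Illumina output but not for files with wrapped sequence lines. The helper name count_fastq_reads is our own. The final check only confirms that the two filtered files hold the same number of reads; it does not verify that the read names match.

```shell
# Count reads in a FASTQ file, assuming four lines per record.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Report read counts before and after filtering for any files that exist.
for f in R1_sub_sample.fastq R1_sub_sample.filtered.fastq \
         R2_sub_sample.fastq R2_sub_sample.filtered.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fastq_reads "$f") reads"
    fi
done

# Paired files should contain the same number of reads after filtering.
r1=R1_sub_sample.filtered.fastq
r2=R2_sub_sample.filtered.fastq
if [ -f "$r1" ] && [ -f "$r2" ]; then
    if [ "$(count_fastq_reads "$r1")" -eq "$(count_fastq_reads "$r2")" ]; then
        echo "Read 1 and read 2 contain the same number of reads"
    else
        echo "Warning: read 1 and read 2 differ in read count"
    fi
fi
```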
Try running fastqc on the filtered files to see whether the reports differ from those for the unfiltered data.
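One way to do this is sketched below. The output directory name fastqc_filtered is our own choice, not something the tutorial prescribes; fastqc writes its reports into the directory given with -o, which must already exist. The guard simply skips the run if fastqc or the filtered files are not available.

```shell
outdir=fastqc_filtered   # hypothetical directory name for the reports

if command -v fastqc >/dev/null 2>&1 \
   && [ -f R1_sub_sample.filtered.fastq ] \
   && [ -f R2_sub_sample.filtered.fastq ]; then
    # fastqc expects the output directory to exist before it runs.
    mkdir -p "$outdir"
    fastqc -o "$outdir" R1_sub_sample.filtered.fastq R2_sub_sample.filtered.fastq
else
    echo "fastqc or the filtered files were not found; nothing to do"
fi
```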