Filtering

Filtering fastq files

We will filter the data to remove low-quality reads and to trim or discard reads that contain adaptor sequences. To do this we will use the fastq-mcf program from the ea-utils package. It is fast, and it keeps the read 1 and read 2 files in the same order after filtering, so the read pairs stay matched.

Note: Typically when submitting raw Illumina data to NCBI or EBI you would submit unfiltered data, so don’t delete your original fastq files!

We will run the fastq-mcf program, which performs both adaptor trimming and removal of low-quality bases. To remove adaptor sequences, we need to supply them to the program. A list of the most commonly used adaptors is given in the file:

~/rna_tutorial/reference/adaptors.fasta

You can view the file with the command:

more ~/rna_tutorial/reference/adaptors.fasta

Note: Quit out of the file viewer by pressing q

Change to the sample data directory

cd ~/rna_tutorial/sample_data_for_alignment

Then run the command

fastq-mcf ~/rna_tutorial/reference/adaptors.fasta R1_sub_sample.fastq R2_sub_sample.fastq -o R1_sub_sample.filtered.fastq -o R2_sub_sample.filtered.fastq -C 1000000 -l 50 -q 20 -p 10 -u -x 0.01

To see what each of the options listed above does, use

fastq-mcf -h
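
For reference, the options in the command above can be annotated as in the sketch below. The descriptions are paraphrased from the fastq-mcf help text as we understand it, so check them against the output of fastq-mcf -h for your installed version. The names FILTER_OPTS, filtered_name and filter_pair are hypothetical helpers introduced only for this illustration; they are not part of ea-utils.

```shell
# Options used in the tutorial command. The descriptions paraphrase the
# fastq-mcf help text; confirm them with `fastq-mcf -h` for your version.
#   -C 1000000  number of reads to subsample when looking for adaptors
#   -l 50       minimum length a read must keep after trimming
#   -q 20       quality threshold below which bases are removed
#   -p 10       maximum adaptor difference percentage
#   -u          force-enable Illumina PF filtering
#   -x 0.01     percentage of N (bad) reads that triggers cycle removal
FILTER_OPTS="-C 1000000 -l 50 -q 20 -p 10 -u -x 0.01"

# filtered_name (hypothetical helper): R1.fastq -> R1.filtered.fastq
filtered_name() {
  printf '%s\n' "${1%.fastq}.filtered.fastq"
}

# filter_pair (hypothetical wrapper): ADAPTORS READ1 READ2
# $FILTER_OPTS is left unquoted on purpose so it splits into separate options.
filter_pair() {
  fastq-mcf "$1" "$2" "$3" \
    -o "$(filtered_name "$2")" \
    -o "$(filtered_name "$3")" \
    $FILTER_OPTS
}
```

Calling filter_pair ~/rna_tutorial/reference/adaptors.fasta R1_sub_sample.fastq R2_sub_sample.fastq from the sample data directory should be equivalent to the command given above.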

Finally, run the command

ls -lh

This lists the files in the directory along with their sizes. You can see that the filtered files are only slightly smaller than the originals; this is consistent with our assessment that the data are of good quality.
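
File size is only a rough guide, because trimming shortens reads as well as removing them. A more direct comparison is to count the reads in each file. The sketch below assumes uncompressed fastq files in which every record occupies exactly four lines; count_reads is a hypothetical helper name, not a tool supplied with the tutorial.

```shell
# count_reads FILE: print the number of reads in an uncompressed FASTQ file,
# assuming every record occupies exactly four lines.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Report read counts for whichever of the tutorial files are present.
for f in R1_sub_sample.fastq R1_sub_sample.filtered.fastq \
         R2_sub_sample.fastq R2_sub_sample.filtered.fastq; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
  fi
done
```

If the two filtered files report the same number of reads, that is a quick sign that read 1 and read 2 are still paired. A count that is not a whole number suggests a file that does not follow the four-line layout.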

Try running fastqc on the filtered files to see if there is any difference.