Linux Advanced (Units 36-40)

Unit 36 - Getting fancy with regular expressions

Many Unix programs, and most programming languages, support regular expressions. These let you specify quite complex search patterns, which helps you narrow a huge amount of data down to a few very specific lines of output. For example, you might want to find lines that start with ‘ATG’ and finish with ‘TGA’, and that also contain at least three consecutive AC dinucleotides somewhere in between:

$ grep "^ATG.*ACACAC.*TGA$" chr1.fasta

ATGAACCTTGTACTTCACCGGGTGCCCTCAAAGACGTTCTGCTCGGAAGGTTTGTCTTACACACTTTGATGTCAAATGA
ATGATAGCTCAACCACGAAATGTCATTACCTGAAACCCTTAAACACACTCTACCTCAAACTTACTGGTAAAAACATTGA
ATGCATACCTCAGTTGCATCCCGGCGCAGGGCAAGCATACCCGCTTCAACACACACTGCTTTGAGTTGAGCTCCATTGA

You’ll learn more about regular expressions when you start programming. The ^ character is a special character that tells grep to only match a pattern if it occurs at the start of a line. Similarly, the $ tells grep to match patterns that occur at the end of the line.
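To see the two anchors in isolation, here is a minimal sketch that uses a small made-up file (demo.txt and its three sequences are invented for illustration and are not part of the course data). Single quotes are used so that the shell leaves the $ and * characters alone:

```shell
# Create a tiny, hypothetical test file (not part of the course data)
printf 'ATGCCCTGA\nCCATGTGAC\nATGACACACTGA\n' > demo.txt

# Lines that start with ATG
starts=$(grep '^ATG' demo.txt)
echo "$starts"

# Lines that end with TGA
ends=$(grep 'TGA$' demo.txt)
echo "$ends"

# Lines that do both, with ACACAC somewhere in between
both=$(grep '^ATG.*ACACAC.*TGA$' demo.txt)
echo "$both"
```

The second line of demo.txt contains both ATG and TGA, but not at the start or end of the line, so neither anchored pattern matches it.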

Task

The . and * characters are also special characters that form part of the regular expression. Try to understand how the following patterns all differ. Try using each of these patterns with grep against any one of the sequence files. Can you predict which of the four patterns will generate the most matches?

ACGT
AC.GT
AC*GT
AC.*GT

The asterisk in a regular expression is similar to, but NOT the same as, the other asterisks that we have seen so far. An asterisk in a regular expression means: ‘match zero or more of the preceding character or pattern’.
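As a small sketch of what ‘zero or more’ means in practice, the example below runs two patterns against a made-up sequence (star_demo.txt and its contents are invented, with x used as a visible separator). It assumes your grep supports the -o option, which prints only the matched text rather than the whole line; GNU and BSD grep both do:

```shell
# A hypothetical one-line test sequence
printf 'AGTxACGTxACCCGTxACTGT\n' > star_demo.txt

# C* means "zero or more C", so AGT, ACGT and ACCCGT all match AC*GT
star=$(grep -o 'AC*GT' star_demo.txt)
echo "$star"

# The dot means "exactly one character, of any kind", so only ACTGT matches AC.GT
dot=$(grep -o 'AC.GT' star_demo.txt)
echo "$dot"
```

Unlike the dot, the asterisk never matches a character by itself; it only says how many times the item before it may repeat.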

Try searching for the following patterns to ensure you understand what . and * are doing:

A...T
AG*T
A*C*G*T*

Unit 37 - Counting with grep

Rather than showing you the lines that match a certain pattern, grep can also just give you a count of how many lines match. This is one of the frequently used grep options. Running grep -c simply counts how many lines match the specified pattern. It doesn’t show you the lines themselves, just a number:

$ grep -c i2 intron_IME_data.fasta
9785
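The same idea on a file small enough to check by eye. This is only a sketch: count_demo.fasta and its three records are invented for the example:

```shell
# Hypothetical three-record FASTA file, invented for this example
printf '>seq1_i2\nACGT\n>seq2_i1\nGGCC\n>seq3_i2\nTTAA\n' > count_demo.fasta

# Without -c, grep prints the matching lines themselves
lines=$(grep 'i2' count_demo.fasta)
echo "$lines"

# With -c, grep prints only the number of matching lines
count=$(grep -c 'i2' count_demo.fasta)
echo "$count"
```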

Task

Count how many lines match each pattern from the Task in Unit 36 across all of the sequence files (specifying *.fasta will match all of the sequence files).
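When grep -c is given several files, it reports one count per file, prefixed with the file name. It also counts matching lines rather than individual matches, so a line containing the pattern twice is counted once. A rough sketch with two made-up files (the count_many directory and its contents are invented, and the -o option is assumed to be available, as it is in GNU and BSD grep):

```shell
# Two hypothetical sequence files in their own scratch directory
mkdir -p count_many
printf 'ACGTACGT\nTTTT\n' > count_many/a.fasta
printf 'ACGT\nGGACGT\n' > count_many/b.fasta

# One count per file: the number of lines containing ACGT
per_file=$(grep -c 'ACGT' count_many/*.fasta)
echo "$per_file"

# Total number of matching lines across all files
total_lines=$(cat count_many/*.fasta | grep -c 'ACGT')
echo "$total_lines"

# Total number of matches: -o prints each match on its own line for wc -l to count
total_hits=$(cat count_many/*.fasta | grep -o 'ACGT' | wc -l)
echo "$total_hits"
```

Here the first line of a.fasta contains ACGT twice, which is why the number of matches is one higher than the number of matching lines.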


Unit 38 - Regular expressions in less

You have already seen how to use less to view files and to search for patterns. While viewing a file with less, type a forward-slash (/) followed by a pattern, and less will search for (and highlight) all matches to that pattern. Technically, it searches forward from your current position in the file. You can also type a question mark (?) to search backwards instead. The real bonus is that the patterns you specify can be regular expressions.

Task

Try viewing a sequence file with less and then searching for a pattern such as ATCG.*TAG$. This should make it easier to see exactly where your regular expression pattern matches. After typing a forward-slash (or a question-mark), you can press the up and down arrows to select previous searches.


Unit 39 - Let me transl(iter)ate that for you

We have seen that these sequence files contain upper-case characters. What if we wanted to turn them into lower-case characters (because maybe another bioinformatics program will only work if they are lower-case)? The Unix command tr (short for transliterate) does just this: it takes one range of characters that you specify and changes them into another range of characters:

$ head -n 2 chr1.fasta

>Chr1 dumped from ADB: Mar/14/08 12:28; last updated: 2007-12-20
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT

$ head -n 2 chr1.fasta | tr 'A-Z' 'a-z'

>chr1 dumped from adb: mar/14/08 12:28; last updated: 2007-12-20
ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaaatccctaaatctttaaatcctacatccat
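The two arguments to tr do not have to be ranges; any two sets of characters are paired up position by position. As a hedged illustration (the sequence below is made up, and complementing bases is just one possible use, not something the course data requires):

```shell
# A made-up sequence, not taken from chr1.fasta
seq='CCCTAAACC'

# Ranges: every upper-case letter becomes its lower-case equivalent
lower=$(echo "$seq" | tr 'A-Z' 'a-z')
echo "$lower"

# Explicit sets are paired position by position: A->T, C->G, G->C, T->A
comp=$(echo "$seq" | tr 'ACGT' 'TGCA')
echo "$comp"
```

Note that tr works on every character it is given, which is why the FASTA header in the example above was lower-cased along with the sequence.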

Unit 40 - That’s what she sed

The tr command lets you change a range of characters into another range. But what if you wanted to change a particular pattern into something completely different? Unix has a very powerful command called sed that is capable of performing a variety of text manipulations. Let’s assume that you want to change the way the FASTA header looks:

$ head -n 1 chr1.fasta

>Chr1 dumped from ADB: Mar/14/08 12:28; last updated: 2007-12-20

$ head -n 1 chr1.fasta | sed 's/Chr1/Chromosome 1/'

>Chromosome 1 dumped from ADB: Mar/14/08 12:28; last updated: 2007-12-20

The ‘s’ part of the sed command puts sed in ‘substitute’ mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (between the second and third forward slashes). Note that this doesn’t actually change the contents of the file; it just changes the screen output from the previous command in the pipe. We will learn later on how to send the output from a command into a new file.
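One detail worth knowing is that a plain substitution replaces only the first match on each line. A minimal sketch, using a made-up header in which Chr1 appears twice (sed_demo.fasta is invented for this example), shows the difference made by adding the g flag after the final slash:

```shell
# Hypothetical one-line FASTA header saved to a scratch file
printf '>Chr1 dumped from ADB: Chr1 copy\n' > sed_demo.fasta

# Without a flag, only the first match on each line is replaced
first=$(sed 's/Chr1/Chromosome 1/' sed_demo.fasta)
echo "$first"

# Adding g after the final slash replaces every match on the line
all=$(sed 's/Chr1/Chromosome 1/g' sed_demo.fasta)
echo "$all"

# The file on disk still holds the original text
unchanged=$(cat sed_demo.fasta)
echo "$unchanged"
```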