Linux Advanced (Units 41-45)

Unit 41 - Word up

For this section we want to work with a different type of file. It is sometimes good to get a feeling for how large a file is before you start running lots of commands against it. The ls -l command will tell you how big a file is, but for many purposes it is often more desirable to know how many ‘lines’ it has. That is because many Unix commands like grep and sed work on a line by line basis. Fortunately, there is a simple Unix command called wc (word count) that does this:

$ cd ~/tutorial_materials/Data/Arabidopsis/ $ wc At_genes.gff 531497 4783473 39322356 At_genes.gff

The three numbers in the output above count the number of lines, words and bytes in the specified file(s). If we had run wc -l, the -l option would have shown us just the line count.


Unit 42 - GFF and the art of redirection

The Arabidopsis directory also contains a GFF file. This is a common file format in bioinformatics and GFF files are used to describe the location of various features on a DNA sequence. Features can be exons, genes, binding sites etc, and the sequence can be a single gene or (more commonly) an entire chromosome.

This GFF file describes of all of the gene-related features from chromosome I of A. thaliana. We want to play around with some of this data, but don’t need all of the file…just 10,000 lines will do (rather than the ~500,000 lines in the original). We will create a new (smaller) file that contains a subset of the original:

$ head -n 10000 At_genes.gff > At_genes_subset.gff
$ ls -l
total 195360
-rwxrwxrwx  1 keith  staff  39322356 Jul  9 15:02 At_genes.gff
-rwxrwxrwx  1 keith  staff    705370 Jul 10 13:33 At_genes_subset.gff
-rwxrwxrwx  1 keith  staf f 17836225 Oct  9  2008 At_proteins.fasta
-rwxrwxrwx  1 keith  staff  30817851 May  7  2008 chr1.fasta
-rwxrwxrwx  1 keith  staff  11330285 Jul 10 11:11 intron_IME_data.fasta

This step introduces a new concept. Up till now we have sent the output of any command to the screen (this is the default behavior of Unix commands), or through a pipe to another program. Sometimes you just want to redirect the output into an actual file, and that is what the > symbol is doing, it acts as one of three redirection operators in Unix.

As already mentioned, the GFF file that we are working with is a standard file format in bioinformatics. For now, all you really need to know is that every GFF file has 9 fields, each separated with a tab character. There should always be some text at every position (even if it is just a ‘.’ character). The last field often is used to store a lot of text.


Unit 43 - Not just a pipe dream

The 2nd and/or 3rd fields of a GFF file are usually used to describe some sort of biological feature. We might be interested in seeing how many different features are in our file:

$ cut -f 3 At_genes_subset.gff | sort | uniq

CDS
chromosome
exon
five_prime_UTR
gene
mRNA
miRNA
ncRNA
protein
pseudogene
pseudogenic_exon
pseudogenic_transcript
snoRNA
tRNA
three_prime_UTR
transposable_element_gene

In this example, we combine three separate Unix commands together in one go. Let’s break it down (it can be useful to just run each command one at at time to see how each additional command is modifying the preceding output):

  1. The cut command first takes the At_genes_subset.gff file and ‘cuts’ out just the 3rd column (as specified by the -f option). Luckily, the default behavior for the cut command is to split text files into columns based on tab characters (if the columns were separated by another character such as a comma then we would need to use another command line option to specify the comma).
  2. The sort command takes the output of the cut command and sorts it alphanumerically.
  3. The uniq command (in its default format) only keeps lines which are unique to the output (otherwise you would see thousands of fields which said ‘curated’, ‘Coding_transcript’ etc.)

Now let’s imagine that you might want to find which features start earliest in the chromosome sequence. The start coordinate of features is always specified by column 4 of the GFF file, so:

$ cut -f 3,4 At_genes_subset.gff | sort -n -k 2 | head

chromosome	1
exon	3631
five_prime_UTR	3631
gene	3631
mRNA	3631
CDS	3760
protein	3760
CDS	3996
exon	3996
CDS	4486

Here we first cut out just two columns of interest (3 & 4) from the GFF file. The -f option of the cut command lets us specify which columns we want to remove. The output is then sorted with the sort command. By default, sort will sort alphanumerically, rather than numerically, so we use the -n option to specify that we want to sort numerically. We have two columns of output at this point and we could sort based on either column. The -k 2 specifies that we use the second column. Finally, we use the head command to get just the 10 rows of output. These should be lines from the GFF file that have the lowest starting coordinate.


Unit 44 - The end of the line

When you press the return/enter key on your keyboard you may think that this causes the same effect no matter what computer you are using. The visible effects of hitting this key are indeed the same…if you are in a word processor or text editor, then your cursor will move down one line. However, behind the scenes pressing enter will generate one of two different events (depending on what computer you are using). Technically speaking, pressing enter generates a newline character which is represented internally by either a line feed or carriage return character (actually, Windows uses a combination of both to represent a newline). If this is all sounding confusing, well it is, and it is even more complex than we are revealing here.

The relevance of this to Unix is that you will sometimes receive a text file from someone else which looks fine on their computer, but looks unreadable in the Unix text viewer that you are using. In Unix (and in Perl and other programming languages) the patterns \n and \r can both be used to denote newlines. A common fix for this requires substituting \r for \n.

Use less to look at the Data/Misc/excel_data.csv file. This is a simple 4-line file that was exported from a Mac version of Microsoft Excel. You should see that if you use less, then this appears as one line with the newlines replaced with ^M characters. You can convert these carriage returns into Unix-friendly line-feed characters by using the tr command like so:

$ cd ~/tutorial_materials/Data/Misc
$ tr '\r' '\n' < excel_data.csv
sequence 1,acacagagag
sequence 2,acacaggggaaa
sequence 3,ttcacagaga
sequence 4,cacaccaaacac

This will convert the characters but not save the resulting output, if you wanted to send this output to a new file you will have to use a second redirect operator:

$ tr '\r' '\n' < excel_data.csv > excel_data_formatted.csv

Unit 45 - This one goes to 11

Finally, let’s parse the Arabidopsis intron_IME_data.fasta file to see if we can extract a subset of sequences that match criteria based on something in the FASTA header line. Every intron sequence in this file has a header line that contains the following pieces of information:

  • gene name
  • intron position in gene
  • distance of intron from transcription start site (TSS)
  • type of sequence that intron is located in (either CDS or UTR)

Let’s say that we want to extract five sequences from this file that are: a) from first introns, b) in the 5’ UTR, and c) closest to the TSS. Therefore we will need to look for FASTA headers that contain the text ‘i1’ (first intron) and also the text ‘5UTR’.

We can use grep to find header lines that match these terms, but this will not let us extract the associated sequences. The distance to the TSS is the number in the FASTA header which comes after the intron position. So we want to find the five introns which have the lowest values.

Before I show you one way of doing this in Unix, think for a moment how you would go about this if you didn’t know any Unix or Perl…would it even be something you could do without manually going through a text file and selecting each sequence by eye? Note that this Unix command is so long that — depending on how you are viewing this document — it may appear to wrap across two lines. When you type this, it should all be on a single line:

$ tr '\n' '@' < intron_IME_data.fasta | sed 's/>/#>/g' | tr '#' '\n' | grep "i1_.*5UTR" | sort -nk 3 -t "_" | head -n 5 | tr '@' '\n'

>AT4G39070.1_i1_7_5UTR
GTGTGAAACCAAAACCAAAACAAGTCAATTTGGGGGCATTGAAAGCAAAGGAGAGAGTAG
CTATCAAATCAAGAAAATGAGAGGAAGGAGTTAAAAAAGACAAAGGAAACCTAAGCTGCT
TATCTATAAAGCCAACACATTATTCTTACCCTTTTGCCCACACTTATACCCCATCAACCT
CTACATACACTCACCCACATGAGTGTCTCTACATAAACACTACTATATAGTACTGGTCCA
AAGGTACAAGTTGAGGGAG

>AT5G38430.1_i1_7_5UTR
GCTTTTTGCCTCTTACGGTTCTCACTATATAAAGATGACAAAACCAATAGAAAAACAATT
AAG

>AT1G31820.1_i1_14_5UTR
GTTTGTACTTCTTTACCTCTCGTAAATGTTTAGACTTTCGTATAAGGATCCAAGAATTTA
TCTGATTGTTTTTTTTTCTTTGTTTCTTTGTGTTGATTCAG

>AT3G12670.1_i1_18_5UTR
GTAGAATTCGTAAATTTCTTCTGCTCACTTTATTGTTTCGACTCATACCCGATAATCTCT
TCTATGTTTGGTAGAGATATCTTCTCAAAGTCTTATCTTTCCTTACCGTGTTCTGTGTTT
TTTGATGATTTAG

>AT1G26930.1_i1_19_5UTR
GTATAATATGAGAGATAGACAAATGTAAAGAAAAACACAGAGAGAAAATTAGTTTAATTA
ATCTCTCAAATATATACAAATATTAAAACTTCTTCTTCTTCAATTACAATTCTCATTCTT
TTTTTCTTGTTCTTATATTGTAGTTGCAAGAAAGTTAAAAGATTTTGACTTTTCTTGTTT
CAG

That’s a long command, but it does a lot. Try to break down each step and work out what it is doing (you will need to consult the man page for some commands maybe). Notice that I use one of the other redirect operators < to read from a file. It took seven Unix commands to do this, but these are all relatively simple Unix commands; it is the combination of them together which makes them so powerful. One might argue that when things get this complex with Unix that it might be easier to do it in a programming language.