Welcome to the Introduction to Genomics Course

Introduction

Welcome to the Genomics workshop. Generating reams of data in biology is easy these days. In little more than a fortnight we can generate more data than the entire Human Genome Project generated in over a decade of work. Making biological sense of that data, and understanding its limitations and how the analysis algorithms work, is now the major challenge for researchers. The aim of this workshop is to take you through a few example projects and tasks. On the way you will learn how to evaluate the quality of data as provided by a sequencing facility, how to align the data against a known and annotated reference genome, and how to perform a de-novo assembly. You will also learn how to compare results between different samples.

This workshop is broken into six parts. Feel free to take as long as you like on each part. It is much more important that you understand each part thoroughly than that you race through the entire workshop.

The six parts are:

Introduction
Remapping a strain of E. coli to a reference sequence
Assembly of unmapped reads
Complete de-novo assembly of all reads
Repeating on strains of V. parahaemolyticus and comparing them
Long read de-novo assembly

For this workshop we will assume little background knowledge, except a basic familiarity with the Linux operating system and the OpenStack cloud. We will cover the basics of how genomic DNA libraries are generated and sequenced, and the principles behind short-read paired-end sequencing. We will look at why data can vary in quality, why adaptor sequences need to be filtered out, and how to quality-control data.
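To give a first feel for what "quality" means for a sequencing read, here is a minimal Python sketch. It is not part of the workshop's own toolchain. It assumes the reads arrive in FASTQ format with Phred+33 encoded quality strings, as current Illumina instruments produce. The function names and the mean-quality cutoff of 20 are hypothetical, chosen only for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality(quality_line, offset=33):
    """Arithmetic mean of the Phred scores of one read."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)


def passes_filter(quality_line, min_mean_q=20, offset=33):
    """Keep a read only if its mean quality reaches the cutoff.

    min_mean_q=20 is an illustrative threshold, not a workshop setting.
    """
    return mean_quality(quality_line, offset) >= min_mean_q
```

For example, the character `I` decodes to a Phred score of 40 (about a 1 in 10,000 chance that the base call is wrong), while `!` decodes to 0. Dedicated quality control tools report far more than a single mean per read, but the underlying scores are decoded in this way.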

In the second part we will take the plunge and align the filtered reads to a reference genome, call variants and compare them against the published genome to identify missing, truncated or altered genes. This will involve the use of a publicly available set of E. coli Illumina reads and a reference genome. In parts 3 and 4 we will look at how one can identify novel sequences which are not present in the reference genome. In part 5, you will be asked to repeat parts 2, 3 and 4 on other data sets and to compare the results. In part 6 we will look at an assembly process using only long reads.

A word on notation. If you see something like this:
cd ~/genomics_tutorial/reference/sequence
It means you should type the highlighted text into your terminal.