Bioinformatics

Underpinning our bioinformatics team is a cutting-edge bioinformatics infrastructure with secure, high-throughput disk storage and a high-performance compute cluster capable of handling even the most challenging of bioinformatics tasks.

The role of bioinformatics in genomics is to distil the millions of data points generated by high-throughput technologies into information and knowledge that can be interpreted biologically. Within Edinburgh Genomics, we have a large bioinformatics team with a wide range of skills, including software development and data analysis. We have a huge wealth of experience working with large-scale genomics data, and we were one of the first labs in the UK to adopt next-generation sequencing technology. We therefore have experience of working with data from all next-generation sequencing platforms, including Illumina, ABI SOLiD and 454. Our team of post-doctoral and post-graduate bioinformaticians are ready, willing and able to collaborate with you on your research projects.

We have developed and implemented pipelines for quality assessment and control, sequence alignment, genome assembly, metagenomics, transcriptomics, environmental genomics and functional genomics, and we can help with virtually any project including:

  • De novo genome sequencing, assembly and annotation

  • Genome resequencing, alignment and variant calling

  • Transcriptomics (RNA-Seq, microRNA-Seq, mRNA-Seq)

  • Epigenetics (MeDIP-Seq, Bisulfite sequencing)

  • Genotyping by sequencing (e.g. RAD sequencing)

  • Sequence capture (e.g. exome sequencing)

  • Metagenomics

  • Meta-barcoding (e.g. 16S)

  • Pathogen discovery and sequencing

We can help with experimental design, and will take your data from QC to biological interpretation.
 

Applications
 

Our bioinformatics team has a huge wealth of expertise in the analysis of genomics data, and we have mature pipelines for the analysis of a range of datasets. We use mostly open-source, published bioinformatics software, although we augment those tools with commercial software where it has proven to provide better speed or quality. Most problems fall into one of two categories, assembly or alignment (mapping), depending on whether the data come from a novel or a known genome. Some applications are listed below:
 

Assembly
 

  • De novo genome assembly: This is possibly the most difficult task in bioinformatics, because constructing an entire genome from short fragments is very challenging. Many paradigms exist for genome assembly, including "overlap-layout-consensus" (e.g. Celera assembler), de Bruijn graph-based approaches (e.g. Velvet, SOAPdenovo) and string graph approaches (e.g. SGA). We use a range of tools depending on the complexity of the genome, striving to produce the best assembly possible.
     
  • De novo transcriptome assembly: Defining the transcribed genes of an organism is an important step in analysis, and we have extensive experience of assembling and validating transcriptomes from both short-read and long-read data.
     
  • Metagenomics (WGS): Sequencing every genome within an entire ecosystem is challenging, and the task of assembling those genomes is far harder. Several assemblers designed for this task have been published, and we have experience of using MetaVelvet and Meta-IDBA/IDBA-UD. Annotation of genes can be carried out using Glimmer-MG, and taxon assignment using a variety of tools such as Metawatt and Blobology (produced by Edinburgh Genomics director Mark Blaxter's group). For data sharing, we also use Meta4, a web-application developed in Mick Watson's group.
     
  • Pathogen discovery and sequencing: This is a growing area, and one which greatly benefits from the high-throughput nature of our Illumina HiSeq instruments. Reads can be aligned directly to reference databases, or assembled de novo, after which we attempt to characterise the resulting contigs.
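
As a rough illustration of the de Bruijn graph paradigm mentioned above, the following Python sketch builds a graph from a few toy reads and walks it to recover a contig. It is a minimal teaching example, not how Velvet or SOAPdenovo are implemented: the reads, the value of k and the function names are made up for illustration, and it ignores sequencing errors, reverse complements, repeats and coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk(graph, start):
    """Follow unambiguous edges from `start`, spelling out a contig."""
    contig = start
    node = start
    seen = set()
    while node in graph and node not in seen:
        seen.add(node)
        successors = set(graph[node])
        if len(successors) != 1:
            break  # a branch: a real assembler must resolve this
        node = successors.pop()
        contig += node[-1]
    return contig


# Toy reads (hypothetical) sampled from the sequence "ATGGCGTGCA".
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk(graph, "ATG")
print(contig)
```

With these toy reads, every (k-1)-mer has a single successor, so the walk never meets a branch. Real genomes contain repeats that create branches and cycles, which is a large part of why assembly is difficult.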
     

Alignment
 

  • Genome resequencing: Possibly the most common task in modern bioinformatics is to align the millions of reads from next-generation sequencing platforms to a reference genome and detect the differences (SNPs, CNVs, etc.). We use industry-standard tools such as BWA, Samtools and GATK for this type of analysis. For highly accurate results, we also occasionally use specialist tools such as Novoalign and Stampy.
     
  • Genotyping by sequencing: This approach generates genotyping data for individual organisms or pools through sequencing rather than through directed, polymorphism-specific assays. Often a reduced representation of the genome is assayed, as, for example, in RAD sequencing. This technology has been established in Edinburgh for many years, and we use mature pipelines for allele calling and analysis of these data.
     
  • Sequence capture and exome sequencing: Rather than sequence the whole of a genome, it is possible to subsample specific parts using capture reagents. Analysis of these data is largely similar to whole genome resequencing, with special attention to the particular properties of the capture, and we use many of the same tools e.g. BWA, GATK and Samtools.
     
  • Transcriptomics (RNA-Seq): Sequencing of RNA allows scientists to define the location and structure of genes and simultaneously quantify their expression. For spliced alignment, we use tools such as TopHat and STAR; for transcript definition, tools such as Cufflinks; for quantification, HTSeq; and for de novo assembly, Trinity and Velvet/Oases. Differential expression analysis is carried out using edgeR and DESeq.
     
  • Epigenetic sequencing: Identifying the regulatory "marks" on DNA, and the interactions between proteins and DNA, is an important facet of systems approaches to biology. Data are generated through approaches such as ChIP-Seq and Bisulfite-Seq, and then mapped to a reference genome. We use specialist tools to identify regions of binding or modification and to perform comparisons across samples. For Bisulfite sequencing (identifying methylated DNA), we make use of specialist Bisulfite aligners such as Bismark.
     
  • Metabarcoding: The identification of biological species that are present in a complex sample can be tackled by sequencing a marker gene (such as the ribosomal RNA genes (16S, 18S) and the cytochrome oxidase 1 gene) and using this sequence to indicate which taxa are present. This DNA barcoding approach has been adapted for new sequencing technologies, and we have a lot of experience of analyses of these data. We use widely respected tools for sample processing, clustering, taxon identification and statistical analysis (such as Mothur and QIIME).
     
  • Small RNA sequencing: We have extensive experience in small RNA sequencing and bioinformatics, including both host- and pathogen-encoded microRNA studies, and siRNA/piRNA studies in insects and plants.
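
To make the resequencing idea above concrete (align reads to a reference, then detect the differences), here is a toy Python sketch that piles up already-aligned, ungapped reads and reports positions where most reads disagree with the reference. It is only an illustration under simplifying assumptions: production analyses rely on tools such as BWA, Samtools and GATK, and the function name, thresholds and example data below are hypothetical.

```python
from collections import Counter


def call_variants(reference, alignments, min_depth=2, min_fraction=0.8):
    """Report (position, ref_base, alt_base, depth) for simple substitutions.

    `alignments` is a list of (start, read) pairs: 0-based start positions
    of ungapped reads on the reference. Thresholds are illustrative only.
    """
    # Count the bases observed at every reference position.
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


# Hypothetical reference and reads; all three reads carry a G at position 5.
reference = "ACGTACGTAC"
alignments = [(0, "ACGTAG"), (2, "GTAGGT"), (4, "AGGTAC")]
variants = call_variants(reference, alignments)
print(variants)
```

Real variant callers additionally model base and mapping qualities, indels, diploid genotypes and strand bias, none of which is attempted here.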

We are very happy to discuss your bioinformatics requirements with you, and we have many skills beyond those listed above. Please get in touch: donald.dunbar@ed.ac.uk or mick.watson@roslin.ed.ac.uk
 

Our technologies
 

Edinburgh Genomics has a secure and sophisticated suite of technologies for the delivery of high-quality bioinformatics. We operate a secure, powerful compute cluster running SGE, with several hundred cores and 2 terabytes (TB) of RAM. The University of Edinburgh's HECToR supercomputer and the Edinburgh Compute and Data Facility (ECDF), a parallel compute cluster with over 3000 cores and several TB of RAM, provide additional infrastructure.

We have an advanced data store (with many hundreds of TB of capacity) based on highly resilient technology to assure availability and security of data.

Cloud computing (e.g. Amazon EC2) is becoming an ever more popular tool for big data applications such as genomics, and our staff are experienced in using this approach for high-performance analysis.

We strive to use open-source software tools where available, but also use commercial tools where these have a particular advantage.