Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high-throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large, and setting up and using analysis tools is riddled with complexities outside the scope of core research activities. The authors believe that Galaxy (galaxyproject.org) provides a powerful solution that simplifies data acquisition and analysis in an intuitive web application, granting all researchers access to key informatics tools previously available only to computational specialists working in Unix-based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy brings together 1) data retrieval from public and private sources, for example, UCSC’s Eukaryote and Microbial Genome Browsers (genome.ucsc.edu), and 2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations) and third-party analysis tools, for example, Bowtie/Tuxedo Suite (bowtie-bio.sourceforge.net), Lastz (www.bx.psu.edu/~rsharris/lastz/), SAMTools (samtools.sourceforge.net), FASTX-toolkit (hannonlab.cshl.edu/fastx_toolkit), and MACS (liulab.dfci.harvard.edu/MACS), and how it creates results formatted for visualization in tools such as the Galaxy Track Browser (GTB, galaxyproject.org/wiki/Learn/Visualization), UCSC Genome Browser (genome.ucsc.edu), Ensembl (www.ensembl.org), and GeneTrack (genetrack.bx.psu.edu).
Galaxy has rapidly become the most popular choice for integrated next-generation sequencing (NGS) analytics and collaboration: users can perform, document, and share complex analyses within a single interface.
Most experimental biologists cannot take full advantage of genomic data because of a formidable wall of computational obstacles. The goal of Galaxy [Goecks, et al., 2010] is to remove these obstacles. Consider the following example: a researcher wants to identify protein-coding exons containing the highest density of SNPs. Most biologists know three primary sources of genome-wide data for vertebrates: Entrez at NCBI [unit 1.3; Gibney et al., 2011], the Genome Browser at UCSC [unit 1.4; Karolchik et al., 2009], and Ensembl [unit 1.15; Fernández Suárez et al., 2010] at the EBI/Wellcome Trust Sanger Institute (UK). Although these three sources offer extensive information about genes, including genomic structure, gene expression profiles, and SNPs, the end user must still perform this task elsewhere, because these resources do not provide the functionality needed for the analysis. Typically, this project ends up in the hands of a graduate student who might initially try to achieve this using popular desktop applications. Unfortunately, Excel (like many other desktop applications) cannot handle that much data. As a result, this relatively simple task becomes a complex endeavor that may easily take weeks or months. In the authors’ view, this does not have to be complicated. Galaxy bridges the gap between data and analyses by allowing experimental biologists without programming experience to easily perform large-scale studies from within their Web browsers.
In this unit, the authors describe the functionality of Galaxy using a series of examples that correspond to the following protocols: Basic Protocol 1 covers the most fundamental features of Galaxy. Basic Protocol 2 elaborates on different types of data accepted by Galaxy. It also shows the user how to upload data and set data attributes. Basic Protocol 3 demonstrates analysis with ChIP-seq high throughput sequencing data. Basic Protocol 4 shows that manipulation of genomic intervals is one of Galaxy’s greatest strengths. Basic Protocol 5 explains how Galaxy enables users to manipulate multiple alignments.
In addition, each protocol has a corresponding Galaxy tutorial including a Screencast (web video) hosted on the Galaxy wiki at http://galaxycast.org/CurrentProtocolsBioinfo2012.
Suppose one wants to find the top hundred protein-coding exons in the human genome with the highest density of single nucleotide polymorphisms (SNPs). Answering this question is not trivial. To do so, one needs to compare all human exons to all human SNPs. To put this into perspective, the current version of the human genome at UCSC for hg19 includes over 350,000 known coding exons and dbSNP build 134 [Sherry et al., 2001] contains nearly 49 million SNPs. Galaxy is specifically designed to make such large-scale analyses fast and user-friendly. Galaxy’s interface is accessible from http://usegalaxy.org. In the following protocol, the authors will use RefSeq [Pruitt et al., 2005] exons and dbSNP annotations on chromosome 22 extracted from the UCSC Table Browser [Fujita et al., 2011].
An Internet-connected computer.
clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: RefSeq Genes
region: position
position: chr22
output format: BED – browser extensible data
Send output to: Galaxy

clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
group: Variation and Repeats
track: Common SNPs(132)
region: position
position: chr22
output format: BED – browser extensible data
Send output to: Galaxy

Join: Exons hg19 chr22
with: SNPs hg19 chr22
with min overlap: 1
Return: Only records that are joined (INNER JOIN)

from dataset: Exon-SNP Pairings
Count occurrences of values in column(s): c4
delimited by: Tab

Sort Query: Exon SNP Counts, unsorted
on column: c1
with flavor: Numerical sort
everything in: Descending order

Select first: 100
from: Exon SNP Counts, sorted

Compare: Exons hg19 chr22
using column: c4
Against: Exon SNP Counts, top 100
using column: c2
To find: Matching rows of 1st dataset
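The join, count, sort, select, and compare settings above can be sketched in plain Python. This is only an illustration of the logic, not how Galaxy implements these tools. It assumes BED-like tuples of (chrom, start, end, name) with zero-based, half-open coordinates; the function names and the toy records are hypothetical.

```python
from collections import Counter

def overlaps(a, b, min_overlap=1):
    """True if two (chrom, start, end, name) intervals share at least min_overlap bases."""
    if a[0] != b[0]:
        return False
    return min(a[2], b[2]) - max(a[1], b[1]) >= min_overlap

def top_exons_by_snp_count(exons, snps, n=100):
    # Join: keep every exon-SNP pairing that overlaps (inner join).
    pairs = [(exon, snp) for exon in exons for snp in snps if overlaps(exon, snp)]
    # Count: number of pairings per exon name (the name is column c4 of a BED record).
    counts = Counter(exon[3] for exon, _ in pairs)
    # Sort in descending order of count and select the first n names.
    keep = {name for name, _ in counts.most_common(n)}
    # Compare: return the full exon records whose name is among the selected names.
    return [exon for exon in exons if exon[3] in keep], counts

# Toy data (hypothetical), standing in for the chr22 exon and SNP datasets.
exons = [("chr22", 100, 200, "exonA"), ("chr22", 300, 400, "exonB"), ("chr22", 500, 600, "exonC")]
snps = [("chr22", 110, 111, "rs1"), ("chr22", 150, 151, "rs2"), ("chr22", 350, 351, "rs3")]

top, counts = top_exons_by_snp_count(exons, snps, n=1)
print(top)
print(dict(counts))
```

The nested loop compares every exon with every SNP, so it is suitable only for toy inputs; it is meant to make the sequence of operations concrete rather than to scale to whole-chromosome data.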
In Galaxy, information is stored in “datasets”, which are analogous to files. Datasets can be added to your history by uploading files from your computer or by extracting data from external sources integrated with Galaxy, such as UCSC’s ENCODE datasets [Raney et al., 2011]. This protocol also explains transferring external data via HTTP/FTP, copying from shared or public Galaxy histories and libraries, and running data manipulation and analysis tools within Galaxy. In addition to its data contents, each Galaxy dataset is associated with “metadata”: information that describes the characteristics of the dataset. Metadata can include the assigned or user-given name and annotation, the associated reference genome and build, the format datatype, and frequently additional datatype-specific labels and definitions.
In this protocol we demonstrate how metadata is assigned and modified for common genome analysis datasets uploaded into Galaxy using the methods listed above. We also use Galaxy to transform a dataset from a custom format into a standard BED format.
An Internet-connected computer.
Host: main.g2.bx.psu.edu
Username: your username on Galaxy Main
Password: your password on Galaxy Main

Cut columns: c2
Delimited by: Tab
From: MPromDB Promoters chr19
and click “Execute”.

Convert all: Colons
In Query: The dataset produced by the Cut operation.
and click “Execute”.

Convert all: Dots
In Query: The dataset produced by the previous Convert operation.
and click “Execute”.

Paste: the 3 column dataset
and: MPromDB Promoters chr19
Delimit by: Tab
and click “Execute”.

Strand column: 10
Name/Identifier column: 8
and click “Save”.

Cut columns: c1,c2,c3,c8,c13,c10
Delimited by: Tab
From: MPromDB Promoters chr19 interval
and click “Execute”.

clade: Mammal
genome: Mouse
assembly: July 2007 (NCBI37/mm9)
group: Genes and Gene Predictions Tracks
track: RefSeq Genes
region: position
position: chr19
output format: BED – browser extensible data
Send output to: Galaxy
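The Cut, Convert, Paste, and Cut settings above turn a single location column into separate chromosome, start, and end columns and then reorder the columns into BED. A minimal Python sketch of the same idea follows. It assumes the location field looks like `chr19:3205000..3205500`, which is a guess based on the colon and dot conversions; the real MPromDB column layout is not reproduced here, the row and column indices are hypothetical, and no coordinate-system adjustment is attempted.

```python
import re

def location_to_bed_fields(location):
    """Split a 'chrom:start..end' string into (chrom, start, end)."""
    # Colons and dots both act as delimiters; a run of them counts as one split point.
    chrom, start, end = re.split(r"[:.]+", location.strip())
    return chrom, int(start), int(end)

def row_to_bed(row, loc_col, name_col, strand_col, score="0"):
    """Build a six-column BED line from one tab-separated row.

    loc_col, name_col, and strand_col are zero-based column indices (hypothetical layout).
    """
    fields = row.rstrip("\n").split("\t")
    chrom, start, end = location_to_bed_fields(fields[loc_col])
    return "\t".join([chrom, str(start), str(end), fields[name_col], score, fields[strand_col]])

# Hypothetical custom-format row: identifier, location, strand, gene symbol.
row = "P001\tchr19:3205000..3205500\t+\tGeneX"
bed_line = row_to_bed(row, loc_col=1, name_col=0, strand_col=2)
print(bed_line)
```

In Galaxy the same result is reached interactively, and the metadata step (assigning the strand and name columns) is what tells downstream tools how to interpret each column.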
The decreasing cost and increasing throughput of sequencing technologies have made chromatin immunoprecipitation followed by sequencing (ChIP-seq) an essential tool for genome-wide profiling of protein binding, histone modification, and nucleosome positioning [Park 2009; Pepke et al. 2009]. There are numerous tools for the various stages of ChIP-seq analysis; this protocol focuses on the use of MACS (Model-based Analysis of ChIP-Seq) [Zhang et al. 2008] to perform peak calling that identifies regions of the mouse genome that are positive for zinc-finger CTCF tags versus a control. CTCF is a transcription factor that can function as either a repressor or an activator. It is known to bind to several thousand different genomic locations and has also been experimentally associated with cancers including, but not limited to, those of the testis, prostate, lung, and breast [Phillips and Corces 2009]. This protocol begins with FASTQ Tag and Control datasets that are groomed (using FASTQ Groomer, a Galaxy tool that normalizes quality scores and FASTQ formatting) [Blankenberg et al., 2010] and mapped (using Bowtie, a DNA short-read aligner) [Langmead et al. 2009], and ends with peak calling by MACS.
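To make the grooming step more concrete, the sketch below shows one kind of quality-score normalization: re-encoding a FASTQ quality string from Phred+64 (older Illumina output) to Phred+33 (Sanger). It is an illustration only. FASTQ Groomer handles more input variants than this, the function names are hypothetical, and whether the datasets in this protocol use Phred+64 is an assumption.

```python
def illumina64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not valid Phred+64")
    return "".join(chr(score + 33) for score in scores)

def groom_record(record):
    """Convert one four-line FASTQ record (given as a list of lines) to Sanger encoding."""
    header, seq, _plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # The repeated read name on the '+' line is optional, so it is dropped here.
    return [header, seq, "+", illumina64_to_sanger(qual)]

# Hypothetical read with Phred scores 40, 40, 39, 38 in Phred+64 encoding.
record = ["@read1", "ACGT", "+read1", "hhgf"]
groomed = groom_record(record)
print(groomed)
```

Normalizing to a single encoding before mapping matters because downstream tools interpret the quality characters numerically, and a wrong offset silently shifts every score.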
An Internet-connected computer.
The protocol for finding the human exons with the highest SNP density (Basic Protocol 1) used the Join operation to find all protein-coding exons that contain SNPs. This is just one of many interval operations offered in Galaxy, which are based on the bx-python package (https://bitbucket.org/james_taylor/bx-python/wiki/Home) developed at Penn State University and Emory University. These include intersect, subtract, complement, merge, concatenate, cluster, coverage, base coverage, and join. Some operations are analogous to relational database queries, such as join and coverage [unit 9.2; Jamison, 2003]. Other operations are analogous to set operations. Figures 10.5.19 and 10.5.20 show examples of input and output produced by individual interval operations. In the following protocol, the authors use two human chromosome 22 annotation datasets as examples. The first dataset, "Exons", representing protein-coding exons, is imported from the "Basic Protocol 1" history. The second dataset, "Repeats", representing interspersed repeats (also known as transposable elements, or simply repeats in the text), is retrieved from the UCSC Table Browser.
[*Figs 19 and 20 near here]
An Internet-connected computer.
clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
group: Variation and Repeats
track: RepeatMasker
region: position
position: chr22
output format: BED – browser extensible data
Send output to: Galaxy

Return: Overlapping Intervals
of: Exons
that intersect: Repeats
For at least: 1

Return: Overlapping pieces of Intervals
of: Exons
that intersect: Repeats
For at least: 1

Subtract: Repeats
from: Exons
Return: Intervals with no overlap
where minimal overlap is: 1

Subtract: Repeats
from: Exons
Return: Non-overlapping pieces of intervals
where minimal overlap is: 1

Concatenate: Exons
with: Repeats
Both datasets are the same filetype: checked

Cluster intervals of: Repeats
max distance between intervals: 100
min number of intervals per cluster: 2
Return type: Merge clusters into single intervals

Join: Exons
with: Repeats
with min overlap: 1
Return: Only records that are joined (INNER JOIN)
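Three of the operations configured above (overlapping pieces, non-overlapping pieces, and merging nearby intervals) can be sketched in a few lines of Python. This is an illustration of the set-like logic only, not the bx-python implementation. It assumes (start, end) pairs on a single chromosome with half-open coordinates, the function names are hypothetical, and the "min number of intervals per cluster" filter is not implemented.

```python
def intersect_pieces(a_intervals, b_intervals):
    """Overlapping pieces of A intervals that intersect B intervals."""
    pieces = []
    for a_start, a_end in a_intervals:
        for b_start, b_end in b_intervals:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if end > start:
                pieces.append((start, end))
    return sorted(pieces)

def merge(intervals, max_distance=0):
    """Merge intervals that overlap or lie within max_distance of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_distance:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def subtract_pieces(a_intervals, b_intervals):
    """Non-overlapping pieces of A intervals after removing everything covered by B."""
    result = []
    blockers = merge(b_intervals)
    for a_start, a_end in a_intervals:
        cursor = a_start
        for b_start, b_end in blockers:
            if b_end <= cursor or b_start >= a_end:
                continue  # this B interval does not touch the remaining part of A
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < a_end:
            result.append((cursor, a_end))
    return result

# Toy coordinates (hypothetical) standing in for the Exons and Repeats datasets.
exons = [(100, 200), (300, 400)]
repeats = [(150, 180), (390, 450), (500, 520), (560, 600)]

print(intersect_pieces(exons, repeats))
print(subtract_pieces(exons, repeats))
print(merge(repeats, max_distance=100))
```

Comparing the first two outputs shows why intersect and subtract are described as set-like: together, the overlapping and non-overlapping pieces tile the original exons exactly.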
Galaxy includes several tools to specifically work with paired and multiple sequence alignment format (MAF) datasets. The tool functions can upload, extract, and summarize the content of MAF datasets sourced from the UCSC Browser with the goal of maximizing analytical access to the underlying data. Both custom and standard MAF datasets can be uploaded and used with the majority of tools. The MAF manipulation tools used in this protocol were developed by the Galaxy team [Blankenberg et al. 2011].
Part A of this protocol demonstrates how to extract regions from a standard Conservation MAF reference track (hg19), based on the query interval ranges from Basic Protocol 1, Step 20: the top 100 SNP-containing human coding exons on chromosome 22.
Part B of this protocol demonstrates how to generate coverage statistics from a standard Conservation MAF reference track (hg19), based on the query interval ranges from Basic Protocol 1, Step 20: the top 100 SNP-containing human coding exons on chromosome 22.
Part C of this protocol demonstrates how to extract and manipulate syntenic “transcript” FASTA sequences from a standard Conservation MAF reference track (hg19), based on the query interval ranges from a human RefSeq Genes track, extracted in BED format from the UCSC Table Browser and limited to chromosome 22.
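As a rough illustration of what a coverage statistic over a MAF alignment means (Part B), the sketch below parses the sequence lines of a single MAF block and reports, for each non-reference species, the fraction of reference bases aligned to a non-gap character. It is not the Galaxy MAF tool. The function names, the alignment block, and its coordinates are hypothetical, and only the `s` lines of the MAF format are read.

```python
def parse_maf_block(block_text):
    """Return {species: aligned_text} for the 's' lines of one MAF block."""
    rows = {}
    for line in block_text.splitlines():
        fields = line.split()
        # 's' lines have the form: s src start size strand srcSize text
        if fields and fields[0] == "s":
            species = fields[1].split(".", 1)[0]
            rows[species] = fields[6]
    return rows

def reference_coverage(rows, reference="hg19"):
    """Fraction of reference bases aligned to a non-gap base in each other species."""
    ref_text = rows[reference]
    ref_columns = [i for i, ch in enumerate(ref_text) if ch != "-"]
    coverage = {}
    for species, text in rows.items():
        if species == reference:
            continue
        covered = sum(1 for i in ref_columns if text[i] != "-")
        coverage[species] = covered / len(ref_columns)
    return coverage

# Hypothetical three-species block; sequences and positions are made up.
block = """a score=1000.0
s hg19.chr22     1000 10 + 51304566 ACGT--ACGTAC
s panTro2.chr22  2000 12 + 50000000 ACGTTTACGTAC
s rheMac2.chr10  3000  8 + 90000000 ACG---ACG-AC
"""

rows = parse_maf_block(block)
coverage = reference_coverage(rows)
print(coverage)
```

Columns where the reference itself has a gap are skipped, so the statistic is expressed per reference base, which is the natural unit when the query intervals are exons on hg19.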
An Internet-connected computer.
- Human Homo sapiens Feb. 2009 hg19/GRCh37
- Chimp Pan troglodytes Mar. 2006 panTro2
- Gorilla Gorilla gorilla gorilla Oct. 2008 gorGor1
- Orangutan Pongo pygmaeus abelii July 2007 ponAbe2
- Rhesus Macaca mulatta Jan. 2006 rheMac2
- Baboon Papio hamadryas Nov. 2008 papHam1
- Marmoset Callithrix jacchus June 2007 calJac1
- Tarsier Tarsius syrichta Aug. 2008 tarSyr1
- Mouse lemur Microcebus murinus Jun. 2003 micMur1
- Bushbaby Otolemur garnettii Dec. 2006 otoGar1

clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: RefSeq Genes
region: position
position: chr22:1-51304566
output format: BED – browser extensible data
Send output to: Galaxy

- Rhesus Macaca mulatta Jan. 2006 rheMac2
[*Fig 37 near here]
Galaxy was designed to be an interactive system, and in most cases results will be self-descriptive, depending on which tools were applied to the original data. As always, caution should be used when interpreting genomic data: the information produced by Galaxy is only as good as the underlying data imported.
Modern Web-based genomic resources offer many facilities for retrieving and visualizing data. However, few of these resources offer sophisticated tools for further analysis of these data. As a result, almost every experimental biologist has to analyze data on their own, struggling with numerous difficulties arising from format incompatibility or incomprehensible user interfaces. Although our computational colleagues are happy to help, few are willing to devote the time and resources needed to develop a good user interface (a significant challenge). Galaxy is a system designed to help both sides. For experimental biologists, Galaxy provides an intuitive user interface offering a direct connection to many widely used data sources and browsers, a simplified FTP data loading procedure, and a custom genome option for most tools, including the native Galaxy Track Browser (GTB, or Trackster). The Galaxy workspace includes a unique history system to organize, label, and display data, to track datasets and analyses for sharing and/or publishing, and to extract analysis functions into workflows for reuse. For computational biologists, Galaxy provides a framework that can integrate command-line tools with almost no effort. For each tool, Galaxy generates an interface and provides all housekeeping (e.g., input and output management, job control, error catching, and testing facilities). As this text was compiled with experimental biologists in mind, it does not cover technical aspects of the Galaxy system, which are documented at http://galaxyproject.org/wiki/.
Galaxy supports a very wide range of analyses on genomic data. In designing the system, the authors tried to put as few constraints on the user as possible. In that sense Galaxy is similar to a car with a manual gearbox: it gives you more control if you know what you are doing (e.g., you do not shift from fifth to reverse). Fortunately, user feedback indicates that a short test drive is sufficient to understand how Galaxy works. This text is equivalent to such a test drive. Below, the authors list the most common problems encountered by Galaxy users. They fall into two categories: (1) data format issues and (2) genome build incompatibilities.
Galaxy “understands” several datatypes, including genomic coordinates (e.g., BED, GFF/GTF, Wig), sequences (e.g., FASTQ, FASTA), and alignments (e.g., SAM/BAM and MAF). Most of the tools require data to be in one of these formats. For example, the genomic interval operations described in Basic Protocol 4 can only be performed on data in interval format. In most cases, changing your data to interval format is as simple as correctly setting metadata, as shown in Step 6 of Basic Protocol 2.
Galaxy supports interactive genome analyses that use a mix of different genomes within a single analysis space (History). In the authors’ opinion, such “mixing” is essential for a true comparative genomics resource. The ease of mixing also means that users will sometimes accidentally attempt to compare data from different genomes. Thus, when using tools that operate on more than one history item (e.g., most genomic interval operations), make sure that all data come from the same genome build.
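The build check recommended above amounts to confirming that every input carries the same genome build label before an interval operation is run. A minimal sketch follows, assuming each dataset is represented as a dictionary with a `dbkey` entry for its build; the dictionary layout and the function name are hypothetical and used here only for illustration.

```python
def common_build(datasets):
    """Return the shared genome build, or raise ValueError if the builds differ."""
    builds = {dataset["dbkey"] for dataset in datasets}
    if len(builds) != 1:
        found = ", ".join(sorted(builds)) or "none"
        raise ValueError("expected exactly one genome build, found: " + found)
    return next(iter(builds))

# Hypothetical history items.
history = [
    {"name": "Exons", "dbkey": "hg19"},
    {"name": "Repeats", "dbkey": "hg19"},
]
print(common_build(history))

# Adding a mouse dataset makes the history mixed, so the check fails.
mixed = history + [{"name": "MPromDB Promoters chr19", "dbkey": "mm9"}]
try:
    common_build(mixed)
except ValueError as err:
    print(err)
```

A check of this kind catches the accidental case without preventing deliberate cross-genome analyses, which would run through tools designed for comparing builds.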
A vision for Galaxy was originally articulated by Ross Hardison, who is also the major source of support and critical feedback. The authors would like to thank Jim Kent and David Haussler for their continuing support and making UCSC Genome Browser uplink and connection possible. Istvan Albert pioneered initial aspects of Galaxy design. Efforts of the Galaxy Team (Enis Afgan, Guru Ananda, Dannon Baker, Nate Coraor, Jeremy Goecks, Greg Von Kuster, Ross Lazarus) were instrumental for making this work happen. The following individuals also contributed to the Galaxy project at different stages: Richard Burhans, Ramkrishna Chakrabarty, Laura Elnitski, Belinda Giardiane, Bob Harris, Jianbin He, Kanwei Li, Webb Miller, Cathy Riemer, Kelly Vincent, and Yi Zhang. Robert Castelo, France Denoeud, Roderic Guigo, Erika Kvikstad, Julien Lagarde, and Kateryna Makova provided critical comments during software testing. Ramana Davuluri gave permission to use the MPromDB data in these protocols. This work was funded by an NIH grant GM07226405S2 to KDM, a Beckman Foundation Young Investigator Award to AN, NSF grant DBI 0543285 and NIH grant HG004909 to AN and JT, NIH grants HG005133 and HG005542 to JT and AN, as well as funds from Penn State University and the Huck Institutes for the Life Sciences to AN and from Emory University to JT. Additional funding is provided, in part, under a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.