MERCATOR (Dewey 2007) is a tool for whole genome alignment using protein coding exons as anchors in the alignment procedure. MERCATOR produces a map of the synteny blocks among the genomes compared and can be used for pairwise or multi-way alignments.
This protocol assumes that the user has already run the MERCATOR pipeline to produce DNA sequence alignments and that the GBrowse package (see Unit 9.9) has been installed. The examples shown in this protocol were run on Linux (CentOS release 5.3) using MySQL server version 5.0.77.
Steps for running MERCATOR are outlined in Dewey (2007) and in the appendix of Dewey (2006). The results of MERCATOR include several files and directories; however, the only folder needed for this procedure is the ‘alignments’ directory. Running MERCATOR requires generating gene annotation and genome files (not shown).
The typical directory structure, if following the MERCATOR instructions, includes an ‘input’ and an ‘output’ directory. Within the ‘output’ directory there is a directory called ‘alignments’, which contains all the data necessary for transformation to GBrowse_syn.
Example data are taken from pairwise alignments for the species Drosophila yakuba and D. erecta from the web site http://www.biostat.wisc.edu/~cdewey/fly_CAF1. These data are the result of a MERCATOR and MAVID alignment (Bray and Pachter, 2004) between these two fly species. Although MAVID is used for the example data, other DNA sequence alignment software could be used on the synteny blocks identified by MERCATOR.
Necessary Resources
Hardware
- Unix (Linux, Solaris, or other variety) workstation or Macintosh with OS X 10.2.3 or higher
- Internet connection
Software
- All necessary software should be installed if Support Protocol 1 has been completed.
Files
1) Download the example MERCATOR/MAVID data for the D. erecta and D. yakuba pairwise alignments from http://www.biostat.wisc.edu/~cdewey/fly_CAF1/ (note that long command lines in this protocol are wrapped; a ‘\’ indicates a line break inside a single command).
2) Unpack the compressed archive:
$ tar zxf DroYak_CAF1-DroEre_CAF1.tar.gz
where
| z | decompress the gzipped archive |
| x | extract the files |
| f | use the archive file DroYak_CAF1-DroEre_CAF1.tar.gz |
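To check what an archive contains before unpacking it, the t flag can be used in place of x. The following is a minimal sketch that builds a tiny stand-in archive in a scratch directory so it can be run without downloading anything; the names demo_alignments and demo.tar.gz are hypothetical and are not part of the real data set.

```shell
#!/bin/sh
# Hypothetical stand-in for the real archive, built in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo_alignments/1"
printf '>DroYak_CAF1\nACGT\n' > "$workdir/demo_alignments/1/mavid.mfa"
# c = create an archive; -C changes into the given directory first
tar zcf "$workdir/demo.tar.gz" -C "$workdir" demo_alignments

# t = list the archive contents without extracting anything
listing=$(tar ztf "$workdir/demo.tar.gz")
printf '%s\n' "$listing"

# x = extract; -C selects the directory to unpack into
mkdir "$workdir/unpacked"
tar zxf "$workdir/demo.tar.gz" -C "$workdir/unpacked"
```

The same listing step, applied to DroYak_CAF1-DroEre_CAF1.tar.gz, shows the numbered directories without writing them to disk.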
3) The directory DroYak_CAF1-DroEre_CAF1 is equivalent to the ‘alignments’ directory described above. Examine the directory with ls:
$ \ls -1 DroYak_CAF1-DroEre_CAF1
1
10
100
--- truncated ---
98
99
DroEre_CAF1.agp
genomes
map
treefile
NOTE: The -1 option for ls lists one file per line. There are a total of 116 numbered directories; the list has been truncated for display purposes. Each numbered directory contains a single file, mavid.mfa. The key files for conversion to GBrowse_syn are:
| x/mavid.mfa | multiple sequence alignment produced by MAVID |
| genomes | lists the prefix names used when the MERCATOR alignments were run |
| map | encodes the chromosome, start, stop, and strand locations of each synteny block in each of the genomes aligned, in the order listed |
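Before converting, it may be worth confirming that the directory has the layout described above. The following is a minimal sketch, assuming only what is stated here: a genomes file, a map file, and numbered subdirectories that each hold one alignment file. The function name check_alignments_dir is hypothetical and is not part of MERCATOR or GBrowse.

```shell
#!/bin/sh
# check_alignments_dir DIR [ALIGNMENT_FILE]
# Prints one line per problem found; returns 0 only if nothing is missing.
check_alignments_dir() {
    dir=$1
    aln=${2:-mavid.mfa}
    rc=0
    # The two index files that sit next to the numbered directories
    for f in genomes map; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f"
            rc=1
        fi
    done
    # Every all-digit subdirectory should hold one alignment file
    for sub in "$dir"/*/; do
        name=$(basename "$sub")
        case $name in
            ''|*[!0-9]*) continue ;;   # skip entries that are not numbered
        esac
        if [ ! -f "$sub$aln" ]; then
            echo "missing alignment: $sub$aln"
            rc=1
        fi
    done
    return $rc
}

# Example call (assumes the archive was unpacked in the current directory):
# check_alignments_dir DroYak_CAF1-DroEre_CAF1 && echo "layout looks complete"
```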
4) Convert the data to GBrowse_syn loading format using the mercatoraln_to_synhits.pl script. If Support Protocol 1 has been completed and the current stable GBrowse is installed, this script will be pre-installed in the executable path, typically /usr/bin (this may vary by operating system), and can be run without specifying the path to the script. The program prints to STDOUT, so redirect the output to a file. The command is entered as a single line:
$ mercatoraln_to_synhits.pl -d DroYak_CAF1-DroEre_CAF1 \
    -a mavid.mfa > mercator.tab
where
| -d | the path to the folder with the necessary input files |
| -a | the name of the alignment file in each of the numbered subdirectories |
The file mercator.tab is in a tab-delimited format designed for direct loading into the GBrowse_syn alignment (or joining) database. The format has one record per line. Each line represents a synteny block, or alignment, with 13 fields:
- Reference Species
- Reference Seqid
- Reference Start
- Reference End
- Reference Strand
- Reference Cigar-string (not used; reserved for future use)
- Target Species
- Target Seqid
- Target Start
- Target End
- Target Strand
- Target Cigar-string (not used; reserved for future use)
- Coordinate map (optional)
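A quick way to confirm that a converted file matches this layout is to count the tab-delimited fields on each line. The sketch below assumes one record per line with no header line, and treats both 12 and 13 fields as acceptable because the final field is optional; the function name check_synhits_fields is hypothetical.

```shell
#!/bin/sh
# check_synhits_fields FILE
# Reports every line whose tab-delimited field count is not 12 or 13.
# Returns non-zero if any such line is found.
check_synhits_fields() {
    awk -F '\t' '
        NF != 12 && NF != 13 {
            printf "line %d: %d fields\n", NR, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call:
# check_synhits_fields mercator.tab && echo "field counts look consistent"
```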
The coordinate map stores pairwise nucleotide residue coordinates for columns in the aligned sequences. It is not necessary to store coordinates for every column. GBrowse_syn usually uses multiples of 10, typically 100. The purpose of storing the coordinate information is to position grid lines in the graphical display so that large insertions and deletions in the sequences are visible and intuitive. The grid lines are equidistant on the reference sequence but can show insertions or deletions by increasing or decreasing the distance between the lines, respectively, on the target sequence. The format of field 13 (with spaces, not tabs) is:
rcoord1 tcoord1 rcoord2 tcoord2 | tcoorda rcoorda tcoordb rcoordb
where
| rcoordn | reference nucleotide residue number n |
| tcoordn | target nucleotide residue number n |
| n | column in the alignment |
| ‘|’ | symbol delimiting reciprocal coordinate maps |
NOTE: calculating the coordinate map is computationally intensive and the script may take a long time to run.
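To inspect a coordinate map by hand, the first half of field 13 (the part before the ‘|’ delimiter) can be split into reference/target pairs. This is a minimal sketch under the format described above and is not how GBrowse_syn itself reads the field; the function name print_coordinate_pairs is hypothetical.

```shell
#!/bin/sh
# print_coordinate_pairs FILE
# For each record that has a coordinate map, print "reference<TAB>target"
# pairs taken from the part of field 13 that precedes the '|' delimiter.
print_coordinate_pairs() {
    awk -F '\t' '
        NF >= 13 && $13 != "" {
            split($13, halves, "[|]")        # halves[1]: reference then target
            n = split(halves[1], c, " ")     # space-separated coordinates
            for (i = 1; i + 1 <= n; i += 2)
                printf "%s\t%s\n", c[i], c[i + 1]
        }
    ' "$1"
}

# Example call:
# print_coordinate_pairs mercator.tab | head
```

The second half of the field lists the same kind of pairs in the opposite order (target first), so it would need the two columns swapped to be read the same way.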
5) Load the GBrowse_syn alignment database with the script load_alignment_database.pl. If Support Protocol 1 has been completed, the script is pre-installed and can be run without specifying the path. Substitute your MySQL user name and password in the command below:
$ load_alignment_database.pl -u user -p pass -d database -v -c \
    mercator.tab
where
| -u | username with root-level MySQL privileges |
| -p | password (if required) |
| -d | database name |
| -v | verbose progress reporting (optional) |
| -c | start a new database; this option overwrites any existing database of that name (recommended) |
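Because the -c option overwrites an existing database, it can be useful to confirm what mercator.tab contains before loading it. The sketch below counts records per species pair using fields 1 and 7 of the tab-delimited format; the function name summarize_synhits is hypothetical and the check is only a suggestion, not part of the loading script.

```shell
#!/bin/sh
# summarize_synhits FILE
# Print "reference species<TAB>target species<TAB>record count",
# one line per species pair, sorted alphabetically.
summarize_synhits() {
    awk -F '\t' '
        { count[$1 "\t" $7]++ }
        END { for (pair in count) printf "%s\t%d\n", pair, count[pair] }
    ' "$1" | sort
}

# Example call:
# summarize_synhits mercator.tab
```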