Many eukaryotic lineages have experienced whole genome duplication (WGD) events, including fungi [1
], animals [2
] and especially flowering plants, where WGDs are prevalent [4
]. Over evolutionary time, evidence of WGDs is obscured by loss of duplicate genes, gene movement and genome rearrangements. Unequal evolutionary rates for different lineages and gene families further complicate this problem, making phylogenetic inferences from distributions of pairwise distances between paralogous genes difficult and occasionally leading to erroneous findings. Conservation of collinear gene order (or "synteny") is less subject than sequence conservation to difficulties with phylogenetic inference, and is the basis for the discovery and dating of ancient genomic events where whole genome sequence is available [6
A typical pipeline for genome structure comparison starts with the enumeration of "synteny blocks" - regions of chromosomes between two or more input genomes that shared a common order of homologous genes and are therefore inferred to be derived from a common ancestor. Synteny blocks are often viewed as "diagonals" on a syntenic dot plot, where dots represent putative homologous gene pairs or marker pairs as inferred by sequence similarity (Figure ).
Figure 1 Typical syntenic dot plot between genomes that have undergone shared ancient WGD events. In this example, the synteny plot is between A. lyrata scaffold 1 (x-axis) and A. thaliana chromosome 1 (y-axis). Gray dots represent putative homologous gene pairs, (more ...)
WGDs can present a major challenge for accurately attributing synteny blocks to different evolutionary origins, especially when there are multiple sequential WGD events. In particular, some WGDs may be species-specific and others shared by multiple species (i.e. the polyploidy occurred before their lineages diverged). In order to automate the classification of different evolutionary origins for sets of syntenic blocks in lineages with a history of WGDs, it is essential to identify the relative ages of the WGDs and their placement relative to the divergence of the genomes being compared before performing in-depth analyses (see an example in Figure ).
Current softwares to identify synteny blocks often uses chaining or clustering of putative homologous gene pairs [4
]. It is common to then use custom schemes to score each block and apply a cutoff [8
]. Existing methods do not differentiate the evolutionary origin and age of these synteny blocks. However, specification of identity and age is crucial for any downstream evolutionary analysis. These software packages for identifying syntenic blocks will identify some blocks that are derived from shared, ancient whole genome duplications as well as false syntenic regions created by repeats and local gene duplications, creating ambiguity in the identification of true syntenic orthologs.
Methods that find syntenic regions between vertebrate genomes often rely on "best-in-genome" (or "one-to-one" reciprocal best) criteria [12
] in order to remove noise from the exhaustive enumeration of synteny blocks. This is not appropriate when studying plant or other genomes affected by multiple rounds of polyploidy. The orthologous blocks between two genomes under WGD scenarios can be one-to-many, or many-to-many depending on when the WGD(s) occurred in the evolutionary history of one or both lineages.
There have been ad-hoc
rules in pairwise comparisons to extract synteny blocks under the influence of WGDs. For example, approaches that find "Doubly Conserved Syntenies" (DCS) were extensively used in yeast [1
] and fish [3
] genomes. The DCS method attempts to find 2 chromosomal regions that both match the same single region in the outgroup. However, it is only designed to work with the 2-to-1 case and does not deal with hexaploidy, double tetraploidy, or more general n
-fold polyploidy. Additionally, the DCS method still relies on ad-hoc
rules to classify the 2:1 pattern. Without an explicit optimization objective, the DCS method is not fully reproducible. DCS also requires the sequence of an unduplicated outgroup species, a resource not available in many cases.
To generalize the concept behind DCS, we observe that when aligning two genomes with known polyploidy events in one or both lineages, we often have expectations for the number of subgenomes, or "multiplication level", or "depths" [6
]. Such a priori
information can be used as a guide to screen synteny blocks. If we set up an upper bound ("quota") for the expected number of subgenomes, the synteny blocks will then "compete" for the specified depths and the identification of synteny blocks will then be selective, excluding weaker matches that are more ancient or simply artifacts. As more flowering plant lineages with complicated polyploidy histories are sequenced, tools to automate the identification of synteny blocks with known evolutionary origins will become valuable to a wider range of researchers.
The algorithm we present here, called QUOTA-ALIGN, is a method that screens synteny blocks based on the expected number of subgenomes, effectively eliminating more ancient (weaker) or spurious alignments. In its simplest usage, a quota of 1:1 between genomes of different species corresponds to orthologous blocks in the traditional sense, i.e. neither of the genomes have duplicated since their divergence (e.g. between two mammalian genomes) and will contain an approximately 1:1 syntenic mapping of genomic regions. In QUOTA-ALIGN, the quota constraint is generalized to QX:QY in order to handle lineages with different duplication histories, a case found frequently in flowering plant lineages.
The quota-based screening of synteny blocks is a difficult problem because the goal is to simultaneously maximize synteny blocks' scores on both the x
-axis and y
-axis in a dot plot [15
], where simple sorting along any one axis will not necessarily be optimal on the other axis. Even in the case of 1:1 quota, the problem is known to be NP-hard [15
]. Herein, we show that it is possible to translate the problem into "Binary Integer Programming" (BIP), which is well-studied and has efficient software implementations. After converting to a BIP problem and solving it, QUOTA-ALIGN produces cleaner sets of synteny blocks and eliminates most ambiguous matches.