Influenza A viruses are widespread and diverse within mammals and birds. As with other RNA viruses, they undergo frequent mutations, allowing rapid evolution in response to natural selection. The enormous health impact of influenza viruses in humans and animals, and the potentially catastrophic effects of influenza pandemics, have led to large-scale surveillance of mammalian and avian influenza viruses, with thousands of gene sequences being generated each year for molecular characterization. Protection against influenza viruses depends mainly on vaccination and naturally acquired immunity, but the rapid antigenic evolution of these viruses allows them to escape population immunity. Phylogenetic analyses, particularly those based on hemagglutinin (HA) genes, can help characterize groups of related viruses into clades and lineages expected to share common immunologic and/or phenotypic features.
H5N1 and H9N2 are two avian influenza A virus subtypes with significant pandemic potential. Both subtypes are geographically widespread in domestic poultry and have caused occasional disease in humans. Since the identification of highly pathogenic H5N1 in China in 1996, descendants of the A/goose/Guangdong/1/1996-like (Gs/GD-like) hemagglutinin gene have spread across Asia, Africa, and Europe to more than 63 countries. While these viruses are not readily transmissible between humans, the case fatality ratio is approximately 58% among more than 600 laboratory-confirmed human infections. By contrast, low pathogenicity H9N2 viruses have been detected infrequently in humans, in association with mild influenza-like illness. Nonetheless, H9N2 continues to cause disease outbreaks in poultry populations throughout much of the world. In recent years, reported human zoonotic infections have coincided with increased detection of these viruses in domestic poultry throughout Asia and the Middle East.
Both H5N1 and H9N2 viruses have persisted in domestic birds for many years, with viral diversification driven by pronounced spread during outbreaks, continuous interspecies transmission among avian hosts, geographic isolation, and genetic selection, all of which have resulted in the emergence of multiple genetic lineages. H5N1 viruses in particular have been grouped into over 30 genetic clades by the WHO/OIE/FAO H5N1 Evolution Working Group since its classification system was put into place less than ten years ago.
The nomenclature recommendations and ongoing clade determinations for H5N1 are based on phylogenetic analyses and quantification of clade sequence divergence (WHO/OIE/FAO 2008). As such, accurate assignment of new sequences requires the use of the appropriate annotated guide tree (www.who.int/influenza/gisrs_laboratory/h5n1_nomenclature) along with careful application of the WHO/OIE/FAO H5N1 Evolution Working Group guidelines. For large datasets the clade determination process can be time-consuming, requiring the alignment of query and reference sequences, manual correction of alignments, phylogenetic tree construction with sufficient bootstrap replicates, and, finally, pairwise genetic distance calculations.
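
The last step of the manual process, pairwise genetic distance calculation, can be illustrated with a short sketch. It assumes the sequences are already aligned to equal length and uses the simple proportion of differing sites (p-distance). The function names and toy sequences are hypothetical, and the Working Group's actual distance model and criteria are not reproduced here.

```python
from itertools import combinations, product


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_within(group):
    """Average p-distance over all pairs inside one group of sequences."""
    pairs = list(combinations(group, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(group1, group2):
    """Average p-distance over all cross-group pairs."""
    pairs = list(product(group1, group2))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy aligned fragments (hypothetical, not real HA sequences).
group_a = ["AAAA", "AAAT", "AATT"]
group_b = ["CCAA", "CCAT"]
print(mean_within(group_a), mean_between(group_a, group_b))
```

Even in this reduced form the number of comparisons grows quadratically with the number of sequences, which is one reason the full procedure becomes slow on large datasets.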
A number of automated sequence comparison methods have been developed for lineage assignment, subtyping, and genotyping. BLAST-based methods are fast but are vulnerable to error when new sequences have diverged from the reference library. Phylogenetic tree-based methods, such as the “two-time test” for genotyping in which viruses from a recent time window must cluster with those from an earlier time window, are highly accurate but require computationally intensive multiple sequence alignment, as well as tree construction, to identify each gene lineage. Composition-based approaches, such as Chaos Game Representation, have been shown to identify HIV-1 subtypes effectively, with greater efficacy for whole genomes than for sub-genomic regions. However, their discriminatory power may be limited for the analysis of viruses with segmented genomes, such as influenza, where lineage assignment is done on relatively small gene segments and where clades can have very similar nucleotide composition.
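
To make the composition-based idea concrete, the following is a minimal sketch of a Chaos Game Representation. It assumes the common corner convention (A, C, G, T at the corners of the unit square) and a start at the center; the function names are hypothetical and the code is not taken from any published subtyping tool. Each base moves the current point halfway toward its corner, and binning the points on a 2^k by 2^k grid gives a k-mer composition vector.

```python
# Corner assignment is a convention (assumed here), not a requirement.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: one (x, y) point per unambiguous base."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # ambiguous bases and gaps are simply skipped in this sketch
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the corner
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points on a 2^k x 2^k grid (grid[row][col]).

    The first k-1 points are dropped because their cell still depends on the
    starting point rather than on a full k-mer. Floating point is adequate for
    illustrative input but loses exactness after very long homopolymer runs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


def composition_vector(seq, k):
    """Flatten the grid into relative frequencies."""
    flat = [count for row in fcgr(seq, k) for count in row]
    total = sum(flat)
    return [count / total for count in flat] if total else [0.0] * len(flat)


print(composition_vector("ACGTACGGTCA", 2))
```

Two sequences with near-identical k-mer composition yield near-identical vectors regardless of where along the gene their differences lie, which is consistent with the limitation noted above for closely related clades.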
Herein we provide a new method and pipeline for the automated clade annotation of influenza hemagglutinin sequences. The new tool, termed “Lineage assignment by extended learning” (LABEL), can be trained to characterize lineages with broad diversity (e.g., HA subtypes), minor differences (e.g., emerging HA sub-clades) or both, provided the initial lineages are pre-defined. LABEL uses profile hidden Markov models (pHMMs) to analyze sequence similarity to various clades and passes the results to support vector machines (SVMs) for making lineage assignment decisions. Profile HMMs have found use in remote homolog identification and the determination of protein family membership. SVMs have been used previously in metagenomics, splice-site recognition, gene finding, and sequence classification.
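
The general idea of scoring a query against one profile per clade can be sketched as follows. This is a deliberately reduced illustration and not LABEL's implementation: it assumes pre-aligned sequences, replaces each profile HMM with a position-specific log-odds profile (match states only, with no insert or delete states), and uses a plain highest-score rule as a stand-in for the SVM decision step. All names and toy sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (0.25)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("training sequences must be aligned to the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in aligned_seqs)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        profile.append(
            {b: math.log(((counts[b] + pseudocount) / total) / 0.25) for b in ALPHABET}
        )
    return profile


def score(profile, seq):
    """Sum of column log-odds; gaps and ambiguous bases contribute zero."""
    if len(seq) != len(profile):
        raise ValueError("query must be aligned to the profile length")
    return sum(col.get(base, 0.0) for col, base in zip(profile, seq.upper()))


def score_vector(profiles, seq):
    """One score per clade profile: the feature vector for a classifier."""
    return {name: score(profile, seq) for name, profile in profiles.items()}


# Toy training alignments for two hypothetical clades.
training = {
    "clade_X": ["ACGT", "ACGT", "ACGA"],
    "clade_Y": ["TTGA", "TTGA", "TCGA"],
}
profiles = {name: build_profile(seqs) for name, seqs in training.items()}
scores = score_vector(profiles, "ACGT")
best = max(scores, key=scores.get)  # stand-in for the SVM decision
print(best, scores)
```

The score vector is the useful intermediate here: rather than taking the maximum, a trained classifier can learn decision boundaries over the full set of per-clade scores.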
LABEL was developed, validated and optimized using two influenza A virus HA gene subtypes: highly pathogenic H5N1 and low pathogenicity H9N2 avian influenza viruses. We show excellent accuracy for full-length hemagglutinin gene analysis and fast runtime compared to the usual phylogenetic tree methods. Furthermore, we demonstrate how HMM profile scores can be used to visualize clustering patterns for the annotation of sequences that fail to cluster consistently using traditional phylogenetic analyses. The use of LABEL to rapidly and accurately assign new influenza virus sequences into lineages will aid viral surveillance and disease control activities as well as advance research into finding new clade-specific phenotypes.