Patents in the biotechnology domain cover a wide range of areas, including health (e.g. vaccines, antibodies and diagnostics), industrial microbiology (e.g. genetically modified microbes) and agriculture (e.g. GMOs and cultivars). Patent data are therefore a valuable resource, not only for the intellectual-property world but also for the scientific community. Information in patent data can be more detailed than in the scientific literature (1), and may appear earlier than in the literature or not appear there at all (2). The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data resources, including abstracts, chemical compounds and sequences (http://www.ebi.ac.uk/patentdata/). The patent abstracts collection contains abstracts of biology-related patent applications derived from data products of the European Patent Office (EPO). Chemical compounds appearing in patents are available in ChEBI (3), a dictionary of molecular entities focused on small chemical compounds.
The sequences appearing in patent applications are an important resource for patent-related searches. During the past decade, the number of biological sequences appearing in patent applications has increased dramatically (). Today, millions of nucleotide and protein sequences extracted from patent documents are available from both the commercial sector and the public domain. Proprietary efforts include GENESEQ™ (Thomson Reuters; http://thomsonreuters.com/products_services/science/science_products/life_sciences/biology/geneseq), GQ-PAT (GenomeQuest, Inc.; http://www.genomequest.com), CAS REGISTRY (Chemical Abstracts Service; http://www.cas.org), PCTGEN (FIZ Karlsruhe; http://www.fiz-karlsruhe.de/sci_tech_patent_information.html) and USGENE (SequenceBase Corporation; http://www.sequencebase.com/). The major public databases are the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) member databases: EMBL-Bank (5), GenBank (6) and DDBJ (7). These include data provided by the EPO, the Japan Patent Office (JPO), the Korean Intellectual Property Office (KIPO) and the United States Patent and Trademark Office (USPTO). EMBL-Bank has a specific data class (PAT) for nucleotide sequences obtained from patents. The EMBL-EBI also collates protein sequences provided by the EPO, JPO, KIPO and USPTO into the Patent Proteins data set, available from the EMBL-EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/) and via the SRS server (http://srs.ebi.ac.uk/). FASTA format files for sequence searching are also available (ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/patent/).
Data growth of EMBL-Bank patent class. The curve indicates that the number of entries in the EMBL-Bank patent class has increased dramatically during the past decade.
Searching patent sequence databases can inspire scientific innovation and reveal existing inventions (e.g. industrial processes) relevant to the researcher's work. However, as mentioned earlier, sequences may appear multiple times because the same invention is filed with multiple patent offices. Furthermore, the same sequence may be used by different inventors in different inventions. Information relating to the source patent may be incomplete, and biological information available in the patent document may not be reflected in the annotation. Thus, search and analysis of these data have become increasingly challenging (8). Recent efforts have been made to create non-redundant patent sequence resources to improve access to, and direct analysis of, the sequences. PatGen (10), a database containing non-redundant data from the public resources, allowed queries against patent bibliographic data and sequences. Unfortunately, the method of redundancy removal in PatGen has not been published in detail and the database is no longer available online. Patome (11) is a non-redundant patent sequence set also derived from the public resources, providing additional annotations from RefSeq (12), OMIM (13) and Gene Ontology (GO) (14). Patome is useful for the identification of disease-related patent sequences. Duplicated sequences were removed in Patome according to the patent number (PN) and the sequence identifier in the sequence listing; however, identical sequences granted with different PNs by different patent offices are not detected as duplicates in Patome. Neither of these studies attempts to establish publicly available non-redundant patent sequence databases at both the sequence level and the patent family level.
In this article, we describe a publicly available collection of non-redundant patent sequence databases, which have been created at two levels and cover the EMBL-Bank patent class nucleotides and the patent proteins from the EPO, JPO, KIPO and USPTO. The proprietary patent resources have been excluded due to the restrictions on their use. The non-redundant sequences are identified using MD5 (Message-Digest algorithm 5; http://www.faqs.org/rfcs/rfc1321.html) checksums of the sequences. Members of a level-1 cluster are 100% identical over their whole length. Level-2 clusters are defined by sub-grouping level-1 clusters based on the patent equivalents which have been published by different patent offices for a single invention. The clusters contain value-added annotations, such as publication patent corrections and earliest publication dates. Level-2 clusters also offer merged biological features. The data collection significantly enhances the quality of patent sequence data and allows for better tracking and cross-referencing in patent search.
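
As an illustration of the two-level grouping described above, the following Python sketch clusters entries by the MD5 checksum of their sequence (level 1) and then sub-groups each cluster by patent family (level 2). It is not the pipeline used to build the databases: the record fields (`accession`, `patent`, `family_id`, `sequence`), the example values and the normalisation step (removing whitespace and upper-casing before hashing) are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def sequence_md5(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence.

    Assumption: whitespace is removed and letters are upper-cased first,
    so that formatting differences do not change the checksum.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def level1_clusters(entries):
    """Group entries whose sequences are identical over their whole length."""
    clusters = defaultdict(list)
    for entry in entries:
        clusters[sequence_md5(entry["sequence"])].append(entry)
    return dict(clusters)


def level2_clusters(level1):
    """Sub-group each level-1 cluster by patent family.

    Assumption: each entry carries a hypothetical 'family_id' that links
    equivalent patents published by different offices for one invention.
    """
    result = {}
    for checksum, members in level1.items():
        by_family = defaultdict(list)
        for member in members:
            by_family[member["family_id"]].append(member)
        result[checksum] = dict(by_family)
    return result


# Hypothetical example records (not real accessions or patent numbers).
entries = [
    {"accession": "A1", "patent": "EP0000001", "family_id": "F1", "sequence": "ACGTACGT"},
    {"accession": "A2", "patent": "US0000001", "family_id": "F1", "sequence": "acgtacgt"},
    {"accession": "A3", "patent": "JP0000002", "family_id": "F2", "sequence": "ACGTACGT"},
    {"accession": "A4", "patent": "KR0000003", "family_id": "F3", "sequence": "TTGACC"},
]

level1 = level1_clusters(entries)
level2 = level2_clusters(level1)

for checksum, families in level2.items():
    for family_id, members in families.items():
        accessions = [m["accession"] for m in members]
        print(checksum, family_id, accessions)
```

In this toy data set, A1, A2 and A3 share one level-1 cluster because their sequences are identical, while level 2 separates A3 from A1 and A2 because it belongs to a different patent family. Hashing gives a fixed-size key for exact-match grouping without pairwise sequence comparison, which is why a checksum suits the 100% identity criterion but not similarity searches.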