With the completion of genome sequencing projects for more than 30 plant species, large volumes of genome sequences have been produced and stored in online databases. Advancements in sequencing technologies have reduced the cost and time of whole genome sequencing enabling more and more plants to be subjected to genome sequencing. Despite this, genome sequence qualities of multiple plants have not been evaluated.
Integrity and accuracy were calculated to evaluate the genome sequence quality of 32 plants. The integrity of a genome sequence is presented by the ratio of chromosome size and genome size (or between scaffold size and genome size), which ranged from 55.31% to nearly 100%. The accuracy of genome sequence was presented by the ratio between matched EST and selected ESTs where 52.93% ∼ 98.28% and 89.02% ∼ 98.85% of the randomly selected clean ESTs could be mapped to chromosome and scaffold sequences, respectively. According to the integrity, accuracy and other analysis of each plant species, thirteen plant species were divided into four levels. Arabidopsis thaliana, Oryza sativa and Zea mays had the highest quality, followed by Brachypodium distachyon, Populus trichocarpa, Vitis vinifera and Glycine max, Sorghum bicolor, Solanum lycopersicum and Fragaria vesca, and Lotus japonicus, Medicago truncatula and Malus × domestica in that order. Assembling the scaffold sequences into chromosome sequences should be the primary task for the remaining nineteen species. Low GC content and repeat DNA influences genome sequence assembly.
The quality of plant genome sequences was found to be lower than envisaged and thus the rapid development of genome sequencing projects as well as research on bioinformatics tools and the algorithms of genome sequence assembly should provide increased processing and correction of genome sequences that have already been published.