In 2001, the Human Genome Epidemiology Network (HuGENet) established the HuGE Published Literature database (HuGE Pub Lit), a continually updated, searchable, online database of population-based genetic epidemiology articles [7]. Relevant studies are identified weekly from NCBI PubMed [8] by a genetic epidemiologist who records the study design, genes and diseases of interest, and interacting environmental factors [7]. This information, along with the title, contributing authors, abstract, journal, date of publication, and unique PubMed Identifier (PMID), is deposited in the HuGE Pub Lit database [7]. As of May 21, 2007, the database included a total of 27,386 articles, published in 2,773 journals, that examined genotype-phenotype associations (for both qualitative and quantitative traits). Further details regarding the contents of this database have been described previously [7].
To select articles for this analysis, we queried the HuGE Pub Lit database for population-based studies that used observational study designs (i.e., case-control, cohort, and cross-sectional studies) to investigate gene-disease associations, interactions between genetic variants (interlocus or gene-gene interactions), or gene-environment interactions. Family-based linkage studies were not collected systematically in HuGE Pub Lit and, therefore, were not included in this study. In addition, we restricted our analysis to full-text articles because studies presented only as concise summaries (e.g., as letters or abstracts) could have increased the heterogeneity of our sample.
Our evaluation was designed in 2004, and data collection and analyses were conducted in 2004–2007. For the main analysis, we drew a five percent simple random sample (SRS) of the articles that were returned by the query described above, published from 2001 to 2003, and curated in HuGE Pub Lit before May 30, 2004 (n = 8,115), which yielded a dataset of 406 articles. To provide an updated description of reporting practices and to assess improvements in reporting, we also drew a simple random sample of 40 articles from those that were returned by our database query, published during 2006, added to PubMed in 2006, and curated in HuGE Pub Lit before May 18, 2007 (n = 5,353). After each article was read, 91 articles from 2001–2003 and 12 from 2006 were excluded from the analysis for the following reasons: not written in English (2001–2003: n = 28, 2006: n = 6), population screening studies (2001–2003: n = 23, 2006: n = 0), clinical trials or pharmacogenomic studies (2001–2003: n = 16, 2006: n = 3), not full-length articles (i.e., letters or abstracts) (2001–2003: n = 11, 2006: n = 0), failed to fulfill the inclusion criteria for HuGE Pub Lit [7] on closer scrutiny (2001–2003: n = 6, 2006: n = 1), family studies (2001–2003: n = 3, 2006: n = 2), studies of genetic tests (2001–2003: n = 2, 2006: n = 0), or meta-analyses (2001–2003: n = 2, 2006: n = 0).
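The sampling and exclusion arithmetic above can be checked with a short sketch. It is illustrative only: the article identifiers are placeholder integers rather than real PMIDs, the seeds are arbitrary, and the function name `simple_random_sample` is an assumption of this example, not part of the original analysis. The numbers of articles remaining after exclusion are obtained here by subtraction.

```python
import random

def simple_random_sample(article_ids, fraction, seed=None):
    """Draw a simple random sample, without replacement, of the given fraction."""
    rng = random.Random(seed)
    n = round(len(article_ids) * fraction)
    return rng.sample(article_ids, n)

# Placeholder identifiers standing in for the articles returned by the query.
eligible_2001_2003 = list(range(8115))
eligible_2006 = list(range(5353))

# Five percent SRS for 2001-2003; fixed-size SRS of 40 articles for 2006.
sample_2001_2003 = simple_random_sample(eligible_2001_2003, 0.05, seed=2004)
sample_2006 = random.Random(2007).sample(eligible_2006, 40)

# Exclusion counts as reported in the text: (2001-2003, 2006).
EXCLUSIONS = {
    "not written in English": (28, 6),
    "population screening study": (23, 0),
    "clinical trial or pharmacogenomic study": (16, 3),
    "not a full-length article": (11, 0),
    "did not meet HuGE Pub Lit inclusion criteria": (6, 1),
    "family study": (3, 2),
    "study of a genetic test": (2, 0),
    "meta-analysis": (2, 0),
}
excluded_2001_2003 = sum(early for early, _ in EXCLUSIONS.values())
excluded_2006 = sum(late for _, late in EXCLUSIONS.values())

# Sampled, excluded, and remaining articles for each period.
print(len(sample_2001_2003), excluded_2001_2003,
      len(sample_2001_2003) - excluded_2001_2003)   # 406 91 315
print(len(sample_2006), excluded_2006,
      len(sample_2006) - excluded_2006)             # 40 12 28
```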
Data were abstracted from each original publication in duplicate by two independent data extractors. All discrepancies between the independent extractors were discussed and a consensus was reached.
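As an illustration of the duplicate-extraction step, the following sketch lists the items on which two extractors disagree so that they can be discussed and resolved. The item names and values are hypothetical, and the source does not describe any software used for this comparison; this is one possible way to flag discrepancies.

```python
def find_discrepancies(extraction_a, extraction_b):
    """Return {item: (value_a, value_b)} for items the two extractors coded differently.

    An item recorded by only one extractor is reported with None for the other.
    """
    items = sorted(set(extraction_a) | set(extraction_b))
    return {
        item: (extraction_a.get(item), extraction_b.get(item))
        for item in items
        if extraction_a.get(item) != extraction_b.get(item)
    }

# Hypothetical abstraction records for one article.
extractor_1 = {"study_design": "case-control", "genotyping_method_reported": True,
               "n_polymorphisms": 3}
extractor_2 = {"study_design": "case-control", "genotyping_method_reported": False,
               "n_polymorphisms": 3}

for item, (a, b) in find_discrepancies(extractor_1, extractor_2).items():
    print(f"{item}: extractor 1 = {a}, extractor 2 = {b}")
```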
For the 2001–2003 articles, a standardized abstraction form was developed and piloted on 10 articles; the form was revised according to the results of this pilot study to ensure that the definitions of the collected items were clear and unambiguous. Items on the final form were designed to collect information on the reporting of study design, genotyping method, population stratification, analytical methods (including the analysis of multiple genetic variants and gene-environment interactions), and study inferences. In addition, the final form accommodated different observational study designs, multiple groups of study participants, and the consideration of more than one postulated genetic risk factor as well as additional environmental factors. Articles were coded as potentially misclassifying the disease or environmental exposure status of study participants when the article did not explicitly state that these factors were directly measured for all participants in the study population. When multiple groups of study participants were reported in a study, we recorded the sample size of the largest group for cohort and cross-sectional studies and the sample sizes of the largest case group and the largest control group for case-control studies. Items were collected separately for case and control groups in case-control studies and for all study participants, regardless of disease status, in cohort and cross-sectional studies. For the purpose of this analysis, data collected for case and control groups were combined so that statistics could be calculated for all study participants. Information (e.g., mean or median age and sex distribution) was considered to be reported for all study participants only if it was provided for all case and control groups.
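The two coding rules for studies with multiple participant groups can be sketched as follows. The record layout (`role`, `n`, `reported`) and the example study are hypothetical and are not taken from the abstraction form itself.

```python
def largest_group_sizes(design, groups):
    """Sample size(s) recorded for a study with one or more participant groups."""
    if design == "case-control":
        # Largest case group and largest control group are recorded separately.
        return {
            "cases": max(g["n"] for g in groups if g["role"] == "case"),
            "controls": max(g["n"] for g in groups if g["role"] == "control"),
        }
    # Cohort and cross-sectional studies: the single largest group.
    return {"participants": max(g["n"] for g in groups)}

def reported_for_all_participants(groups, item):
    """An item counts as reported only if every group in the study reports it."""
    return all(item in g["reported"] for g in groups)

# Hypothetical case-control study with two case groups and one control group.
study_groups = [
    {"role": "case", "n": 120, "reported": {"age", "sex"}},
    {"role": "case", "n": 85, "reported": {"age", "sex"}},
    {"role": "control", "n": 200, "reported": {"sex"}},
]

print(largest_group_sizes("case-control", study_groups))
print(reported_for_all_participants(study_groups, "sex"))  # True
print(reported_for_all_participants(study_groups, "age"))  # False: controls lack age
```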
Additionally, for case-control studies, we recorded whether cases and controls were described as drawn from the same population according to one or more of the following definitions: 1) geographic region, 2) clinical population, or 3) general population (i.e., ethnic group). We also recorded whether information on the choice of suitable controls was missing or incomplete.
Fourteen items were assessed in the HuGE articles published in 2006. These included the number of study participants, genes, polymorphisms, and environmental factors assessed in gene-environment interactions. In addition, we selected ten items that were applicable to all study designs and that had been reported in fewer than 50% of the articles published in 2001–2003.
The data analysis was conducted using SAS 9.1.3 (SAS Institute, Cary, NC). Counts and percentages were calculated for the items abstracted from the articles. Comparisons of articles published in 2006 with those published in 2001–2003 used the Mann-Whitney U test for continuous variables and Fisher's exact test for binary variables.
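For readers who want to see how the two tests operate, here is a minimal pure-Python sketch. It is not the SAS procedure used in the analysis: the Mann-Whitney p-value uses a normal approximation without tie or continuity correction, so results may differ slightly from SAS output, and all input values shown are hypothetical.

```python
from math import comb, erfc, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p)

def mann_whitney_u(x, y):
    """Return (U, p) for samples x and y.

    U counts the pairs (xi, yj) with xi > yj, with ties counted as one half.
    p is a two-sided normal approximation (no tie or continuity correction).
    """
    m, n = len(x), len(y)
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    z = (u - m * n / 2) / sqrt(m * n * (m + n + 1) / 12)
    return u, erfc(abs(z) / sqrt(2))

# Hypothetical binary item: reported / not reported in two publication periods.
print(fisher_exact_two_sided(3, 1, 1, 3))

# Hypothetical continuous item: sample sizes in two small sets of articles.
print(mann_whitney_u([150, 220, 310], [400, 520, 980]))
```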