Advances in technology have driven down the price and difficulty of genotyping, but until recently, the same has not been true of phenotyping. We propose that web-based collection of self-reported data on medical phenotypes is an efficient and effective method for phenotyping a large cohort of individuals, as evidenced by our ability to replicate a high percentage of associations across a wide range of conditions. Relative to medical record review, internet-based phenotyping is fast (we assessed more than 20,000 people for 50 phenotypes in approximately 12 months using only a small team of people). To our knowledge, this is the largest number of replications across a wide variety of diseases ever reported, demonstrating the value of gathering self-reported data on a large genotyped population.
While many of the associations tested here have been replicated before, there are a few that are, based on our literature review, the first independent replications of these associations in a population of European descent: basal cell carcinoma (PADI4), plasma levels of liver enzymes (PNPLA3), and bone mineral density (MEF2C; these associations have already been replicated in a population of Asian descent). Though our study has been performed in a population of European ancestry, a similar study would be feasible in other populations. Such a study could potentially improve risk prediction in non-European populations as well as further our understanding of disease architecture (e.g., understanding how effect size varies across populations could provide insight into how tightly linked associations are to the causal variants). Furthermore, while it is true that we are able to replicate previously identified associations using our research platform, the reverse is also true: novel discoveries using our method have been independently replicated using other modes of data collection for both traits and medical conditions.
Although most studies use medical records as the gold standard against which self-reported data are compared, there are some inherent challenges to the use of medical records. As very few people have received all their health care from the same provider, the medical records from different stages of their lives are stored at different sites of care. Thus, a childhood diagnosis of asthma might be stored in a record at the pediatrician's office but not be reported in the record at the adult medical practice. In addition, extracting data from medical records often requires either manual curation, which is time-consuming and expensive, or reliance on ICD-9-CM or CPT codes, which may have been miscoded. For example, a replication study was carried out using the BioVU DNA databank at Vanderbilt University by applying natural language processing techniques and billing-code queries to electronic medical records. Their algorithms achieved high positive predictive value (as measured by independent record review by two physicians) but required manual review and significant iterative work. Out of 21 SNPs in five phenotypes, they were able to replicate eight associations. In contrast, we were able to examine 50 phenotypes and replicate over 180 associations. For cases in which the information required may be difficult for individuals to report but can be extracted from electronic medical records (such as lab values), these two methods can provide complementary sources of data.
We replicated approximately 75% of the associations we expected to (excluding those for which our power may be substantially overestimated), based on power calculations. There are several possible reasons why we did not replicate all the associations we expected to (see and Figure S1
for instances in which our success ratio did not overlap the 95% prediction interval). One factor is systematic inflation of odds ratios in the initial reports due to the winner's curse, a bias in the effect size estimates from the first publication to report an association, generally occurring when the discovery sample is poorly powered to detect the association. For example, if we were to assume a systematic inflation of 15% in the log-odds ratio, the replication rate would change from 70% to 77% (or 75% to 82% if we again exclude the nine associations that are not clearly true positives). This amount of inflation is entirely within the confidence intervals for most studies: it corresponds to an estimated odds ratio of 1.3 where the true odds ratio was 1.25 or an estimate of 1.5 where the true odds ratio was 1.41. There are more sophisticated methods to perform bias correction for odds ratios, but these require an analysis of the original experimental design that is beyond the scope of this paper.
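The arithmetic behind the two odds ratios quoted above can be checked with a short sketch. The text does not spell out exactly how the 15% inflation is defined; the reading assumed here, which reproduces the quoted values, is that the true log-odds ratio equals the reported log-odds ratio scaled by 0.85. The function name `deflate_odds_ratio` is ours, for illustration only.

```python
import math


def deflate_odds_ratio(reported_or: float, inflation: float = 0.15) -> float:
    """Shrink a reported odds ratio on the log scale.

    Assumption: "15% inflation in the log-odds ratio" is read as
    log(true OR) = (1 - inflation) * log(reported OR).
    """
    return math.exp((1.0 - inflation) * math.log(reported_or))


for reported in (1.3, 1.5):
    adjusted = deflate_odds_ratio(reported)
    print(f"reported OR {reported:.2f} -> adjusted OR {adjusted:.2f}")
```

Under this reading, a reported odds ratio of 1.3 shrinks to about 1.25 and a reported 1.5 to about 1.41, matching the figures in the paragraph above.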
While winner's curse probably explains part of the deviation from expected, some classes of diseases were likely not well phenotyped in this study, through some combination of misdiagnosis and misreport. For example, autoimmune diseases are more challenging because they may be of low prevalence, have non-specific symptoms, and have a high rate of misdiagnosis. In a study of rheumatoid arthritis diagnoses by non-rheumatologists, 23–82% were judged to be misdiagnoses, while another study showed that relative to assessment in a specialist setting, patients in a community setting who received a diagnosis of celiac disease were actually misdiagnosed more than 50% of the time. Some of the underperformance of this approach for autoimmune diseases is therefore likely due to patients reporting a mistaken diagnosis by a non-specialist.
Because we chose to keep the burden of answering surveys low for our participants, many of the conditions in this study were assessed with single questions such as “Have you ever been diagnosed by a doctor with schizophrenia?” This assessment likely led to reporting errors for some diseases. For example, psychiatric diseases or mental disorders such as Alzheimer's disease, for which diagnosis requires a somewhat subjective clinical evaluation of a patient's symptoms or an autopsy, were each assessed via a single question in this study. More questions are needed here to gather information about the clinical features that led to the diagnosis. In addition, in some cases it may make more sense to have a family member, friend, or caregiver provide information for an individual.
On occasion, the nature of people's answers to such single questions necessitated making judgment calls on how to define a phenotype. Because some people may have type 2 diabetes but are only aware of having high blood sugar, we included people who self-reported having hyperglycemia as type 2 diabetes cases. For chronic obstructive pulmonary disease (COPD), we included individuals who reported having emphysema or chronic bronchitis. However, there are likely to be individuals who repeatedly get bronchitis associated with a cold or flu and reported having “chronic bronchitis”, not knowing that the clinical definition of this condition is developing bronchitis lasting at least three months in two consecutive years. This confusion may have reduced our power to replicate associations with COPD. In other cases, we were unable to come up with an acceptable match for a condition. For example, most GWAS of age-related macular degeneration (AMD) have focused on advanced AMD and generally only included cases with large drusen, geographic atrophy, and/or neovascularization. Our question asked only about AMD without assessing severity and thus our study may have included individuals with small or intermediate drusen and/or pigmentary abnormalities as cases. Such phenotypes from the GWAS catalog without direct analogs in our database were skipped for the main calculations in this paper. For all such conditions, more in-depth questions will be necessary to collect data more accurately.
These in-depth questions, which will be important when attempting to unravel the complex biological underpinnings of most phenotypes, can be asked up front for phenotypes that we suspect a priori may be challenging to assess. However, having a recontactable cohort makes the process of refinement possible when more information must be gathered. For celiac disease, starting with the question “Have you ever been diagnosed by a doctor with celiac disease?”, we replicated only one association out of almost six expected. As the prevalence of celiac disease in our cohort appeared to be somewhat higher than the reported prevalence in the United States, we chose to return to our customer database with a refined question of “Have you ever been diagnosed with celiac disease, as confirmed by a biopsy of the small intestine? If your diagnosis was not confirmed by a biopsy, please select no.” As a result, with a much smaller number of cases (which also reduced the number of associations we expected to replicate), we successfully replicated 4 out of 4.5 associations expected for celiac disease. This approach could also be used to examine endophenotypes or to divide broad phenotypes into subclasses with more defined characteristics.
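A fractional "expected" count such as 4.5 arises naturally when the expected number of replications is taken as the sum of the per-association power estimates. The sketch below illustrates that idea, together with one way to form a 95% prediction interval for the observed count by treating each association as an independent Bernoulli trial (a Poisson-binomial distribution). The independence assumption, the function names, and the power values are ours for illustration; they are not taken from this study.

```python
def replication_count_pmf(powers):
    """Exact distribution of the number of replications.

    Each association is assumed to replicate independently with
    probability equal to its estimated power.
    """
    pmf = [1.0]  # probability of 0 replications among 0 associations
    for p in powers:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - p)  # this association fails
            updated[k + 1] += mass * p      # this association replicates
        pmf = updated
    return pmf


def prediction_interval(pmf, level=0.95):
    """Central interval from the lower and upper tail quantiles of the pmf."""
    tail = (1.0 - level) / 2.0
    lower = upper = None
    cumulative = 0.0
    for k, mass in enumerate(pmf):
        cumulative += mass
        if lower is None and cumulative >= tail:
            lower = k
        if upper is None and cumulative >= 1.0 - tail:
            upper = k
    if upper is None:  # guard against floating-point shortfall in the sum
        upper = len(pmf) - 1
    return lower, upper


# Hypothetical per-association power estimates (illustrative values only).
powers = [0.95, 0.90, 0.85, 0.70, 0.60, 0.50]
expected = sum(powers)
low, high = prediction_interval(replication_count_pmf(powers))
print(f"expected replications: {expected:.1f}, 95% prediction interval: {low} to {high}")
```

An observed count falling outside such an interval would flag a phenotype whose replication rate deviates from what the power estimates predict.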
The trend in GWAS research has been towards ever increasing sample sizes and reuse of previously genotyped cohorts whenever possible. Because it is relatively straightforward for our participants to provide information that is relevant for a variety of studies, any given individual can be a case or a control in multiple analyses at once. This could potentially reduce the total amount of work for the patient (sample collection needs to occur only once to participate in many studies) as well as the total number of people an investigator needs to genotype. In addition, for most conditions, this framework leads to a much larger number of controls than cases, which increases the study's power up to a certain point. Though self-report may lead to a slight increase in phenotyping error, in many cases, the lower phenotyping cost may lead to a more powerful study. For example, a study with 3,000 cases and 3,000 controls and a phenotyping error rate of 5% would have 77% power to detect a SNP at a minor allele frequency of 30% and an odds ratio of 1.3 with a p-value threshold of 10^-7. But a study with 5,000 cases and 5,000 controls with a phenotyping error rate of 10% would have 95% power to detect such an association. Even if the error rate were 15%, the 10,000-person study would have 77% power and would have many more people to follow up with. Although more data are needed to evaluate the true costs of this model relative to other models, we believe that this method has the potential to collect high-quality phenotype data in an efficient manner.
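The trade-off between sample size and phenotyping error can be explored with a rough power calculation. The power model used for the figures above is not described here, so the sketch below rests on assumptions of our own: an allelic two-proportion z-test, the minor allele frequency taken as the control allele frequency, a per-allele odds ratio, and a symmetric error model in which a fraction `error_rate` of labeled cases are really controls and vice versa. The function name `allelic_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(n_cases, n_controls, maf, odds_ratio, error_rate, alpha=1e-7):
    """Approximate power of a two-sided allelic test with mislabeled phenotypes."""
    std_normal = NormalDist()

    # True allele frequencies in controls and cases (per-allele odds ratio).
    p_control = maf
    case_odds = odds_ratio * maf / (1.0 - maf)
    p_case = case_odds / (1.0 + case_odds)

    # Observed frequencies after symmetric misclassification (assumption).
    q_case = (1.0 - error_rate) * p_case + error_rate * p_control
    q_control = (1.0 - error_rate) * p_control + error_rate * p_case

    # Each person contributes two alleles to the comparison.
    se = sqrt(q_case * (1.0 - q_case) / (2 * n_cases)
              + q_control * (1.0 - q_control) / (2 * n_controls))
    z = abs(q_case - q_control) / se
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)

    # Two-sided power; the second term is negligible for real effects.
    return std_normal.cdf(z - z_crit) + std_normal.cdf(-z - z_crit)


scenarios = [(3000, 0.05), (5000, 0.10), (5000, 0.15)]
for n, err in scenarios:
    power = allelic_power(n, n, maf=0.30, odds_ratio=1.3, error_rate=err)
    print(f"{n} cases / {n} controls, error {err:.0%}: power {power:.2f}")
```

Under these assumptions the three scenarios land in the same range as the figures quoted above, though a different test or error model could shift the values by a few percentage points.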
The framework described here, in which additional questions can be directed at participants at any time with relatively low marginal effort, facilitates follow-up on specific topics as shown in the celiac disease example. Thus, one possible model for large-scale phenotyping could start with broad but shallow phenotyping by self-report on a very large cohort of individuals, followed by targeted recontact of specific subsets of individuals for deeper phenotyping based on the initial information gathered. The additional phenotyping could involve more in-depth questions to the participants or a completely different type of data collection that may require an in-person visit. A platform like this one that maintains an ongoing relationship with the participants, including sharing data with them, may motivate individuals to participate and stay active in research (for example, more than 80% of our research participants have taken more than one research survey).
There are many benefits to having a large, recontactable cohort. Testing new hypotheses, following up on initial data, and assessing the accuracy of different risk prediction models are easier when the need to assemble a new cohort every time is obviated. This raises the question: how large a cohort is needed? With 20,000 generally unselected people, we expected to replicate approximately 40% of the associations that we tested. Only a 10× increase to 200,000 individuals would raise the expected proportion of replications to 80%, and with a million the expected replication rate would be more than 97%. A simple sum of the initial sample sizes in the papers reported in the GWAS catalog totals nearly 1,400,000. This is clearly an overestimate of the number of genotyped individuals as certain cohorts are reported in more than one study, but even if only 70% of these individuals are unique, this would constitute a resource of a million individuals with genome-wide genotype data who may be interested in participating in further research if given the opportunity. Unfortunately, because of the way research is currently done, these individuals come from dozens of different cohorts and it would be impractical if not impossible to recontact them all. As we move into studies that require ever larger sample sizes, such as those investigating gene-gene or gene-environment interactions, developing more efficient methods of conducting this type of research will become a necessity. We believe that this model, in which investigators maintain long-term relationships with research participants and facilitate their participation through online tools, is a significant step in that direction.