This study described in detail a two-stage, iterative process for drawing up code lists, and examined the utility of indicator markers occurring prior to the definitive diagnostic code in a) the description of early care, and b) the case definition, of rheumatoid arthritis patients, with a view to identifying uncoded cases (false negatives). After a panel of clinicians drew up a first list of codes, data from RA patients were scrutinised to determine whether additions to the lists should be made. This second stage resulted in considerable modification of the lists, suggesting it is a valuable addition to traditional methods of code list design. The chosen indicator markers were found to appear in the file well before the RA diagnostic code and to give important information on the diagnostic process of RA. Indicator markers were widespread in the records of RA patients, with 83.5% of patients having at least two markers in their file before the RA diagnostic code was recorded.
This strategy of drawing up code lists, combining a priori and a posteriori (data-driven) stages, will be applicable to other conditions and coding systems. The second-stage, a posteriori approach allowed us to make significant modifications to the code lists. For example, synovitis and non-specific auto-immune investigations had been raised by the expert group at the initial stage but were not included in the a priori searches as they were considered too non-specific. However, during the data-driven stage they were found with some regularity in the records of RA patients. The data-driven stage thus revealed codes that had been missed, as well as some errors in classification, and so served as a form of triangulation. The finding that this data-driven modification stage is so important poses a challenge to researchers and data providers, as this additional stage of code searching may not be possible in the design stage of all research studies, owing to difficulties with advance access to data.
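As a rough illustration of the two-stage process described above, the data-driven refinement might be sketched as follows. The Read-style codes, record structure and frequency threshold here are all hypothetical, not the study's actual lists or criteria:

```python
# Sketch of the a posteriori (data-driven) stage: scan RA patient records
# for codes absent from the a priori list that recur often enough to
# merit expert review. Codes and threshold are invented for illustration.
from collections import Counter

def refine_code_list(a_priori_codes, patient_records, min_patients=5):
    """Return candidate codes seen in at least min_patients records
    but missing from the a priori list."""
    seen_in = Counter()
    for record in patient_records:            # each record is a set of codes
        for code in set(record) - set(a_priori_codes):
            seen_in[code] += 1
    return [c for c, n in seen_in.items() if n >= min_patients]

# Toy example: "N040." (an RA code) is a priori; a synovitis-style code
# "N045." emerges from the data in 6 of 10 records.
a_priori = {"N040."}
records = [{"N040.", "N045."}] * 6 + [{"N040."}] * 4
print(refine_code_list(a_priori, records, min_patients=5))  # ['N045.']
```

Candidates returned this way would still go back to the expert panel for review, mirroring the iterative loop described above.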
Secondly, our study examined the value of using indicator markers to understand the diagnostic process of early RA as it presents in primary care. Using these markers, we aimed to identify locations in the EHR where the diagnosis of RA was being considered and to describe the early course of disease presentation in these patients, as shown by coded symptoms of the disease, referrals to appropriate services, or tests being ordered or results recorded. Indicator markers were widespread within the RA patient records, in many cases long before the RA diagnostic code appeared in the file. These results suggest, as was hypothesised, that the diagnosis of RA may have been known about for some time before a diagnostic code was added, with specific markers frequently found more than six months before the recorded diagnosis. This discrepancy between dates may indicate diagnostic uncertainty or delay in coding a known or presumptive diagnosis. Further field work would be needed to determine which of these two processes has most influence in these data. UK guidance from NICE suggests that RA should only be diagnosed after it has been in evidence for 6 months. However, these data precede this guidance, and this is unlikely to be a full explanation for the delay in diagnostic coding.
We were also able to explore gender differences in indicator markers, to check for systematic differences in recording by gender. We found no evidence of gender differences in the prevalence of most markers, except for named inflammatory arthritis and DMARD prescriptions, both of which were more common in men. The prevalence of referral and investigation markers was similar in men and women, although there was some evidence that the gap between referral, non-specific symptoms, non-specific investigations and the coding of diagnosis was longer in women than in men.
Thirdly, this study explored indicator markers as a way of broadening case definitions in this disease, thereby potentially identifying individuals with the disease who do not yet have a diagnostic code. The majority of patients (83.5%) had two or more indicator markers, so combinations of codes may help to describe the early presentation and diagnostic process of these patients. One finding, however, was that the most common symptom codes in our study were the non-specific ones, and these could relate to conditions other than RA. Despite this, 86% of patients had at least one specific marker. When a combination of the four most indicative markers was examined, 36% of patients had two of these markers, 24.4% had three, and 9.1% had all four. This suggests that using these markers in combination may facilitate the identification of RA cases, both in the early stages before an RA code is recorded and in cases where no code has been recorded at all and the cases would ordinarily slip “under the radar”, producing false negatives and biasing a subsequent study.
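The combination counts above could be tallied along these lines. The marker names and patient data below are invented stand-ins, not the study's four chosen markers:

```python
# Sketch of counting how many of a chosen set of "most indicative" markers
# each patient has before the RA code. Marker names are hypothetical.
TOP_MARKERS = {"rheumatology_referral", "rf_test",
               "named_inflammatory_arthritis", "dmard_rx"}

def marker_count(patient_markers):
    """Number of top markers present in one patient's record."""
    return len(TOP_MARKERS & set(patient_markers))

def share_with_at_least(patients, k):
    """Proportion of patients with at least k of the top markers."""
    return sum(marker_count(p) >= k for p in patients) / len(patients)

patients = [
    {"rf_test", "rheumatology_referral"},                       # 2 markers
    {"dmard_rx"},                                               # 1 marker
    {"rf_test", "dmard_rx", "rheumatology_referral",
     "named_inflammatory_arthritis"},                           # 4 markers
    set(),                                                      # none
]
print(share_with_at_least(patients, 2))  # 0.5
```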
However, these findings suggest that while coded indicator markers give a good indication of the presentation of RA, in 16.5% of patients there was no marker, or only one, before the RA diagnostic code. With coded information alone it may not be possible to reliably identify the early presentation of every case. Extra information may be available in unstructured free text, including both notes made by the GP during a consultation and the content of referral letters and correspondence from specialists. Information governance requirements mean that free text must be anonymised before release to researchers. It is not currently possible to do this reliably by automated means, and expensive manual labour is therefore required to ensure that datasets are anonymised. Research projects using EHRs have therefore relied almost exclusively on coded structured data rather than accessing free text. Text data are also difficult to analyse in large numbers of patients and require some form of processing or structuring to allow quantitative analysis. The supplementation of coded information with free text would significantly strengthen findings in studies using EHRs, as has been found in previous studies of case definition of RA using EHRs.
In order to identify a “case” for a study using EHRs, we therefore propose three levels of classification for sources of variation in recording, each level posing an increasing challenge to the researcher. Level 1 consists of diagnostic codes from within the same domain, such as different diagnostic codes for rheumatoid arthritis. Because these are diagnostic codes, it is probable that patients with these codes have the condition of interest; however, their sole use may miss cases where diagnostic coding is delayed or incomplete. At level 2, codes from different domains are used, such as symptoms or test results rather than a diagnosis, e.g. “joint stiffness”, “referral to rheumatology clinic” or “rheumatoid factor positive”. A combination of these codes yields a probabilistic definition of the condition, but such coding patterns leave the researcher less certain of the diagnosis and may also result in the inclusion of cases which do not have the disease of interest. The final level, level 3, is where additional disease-specific information is found in free text, either alongside a more general diagnostic code, a code for a symptom, or a still more general code such as “had a chat to patient”. Free text may supply additional information allowing a diagnosis to be made with more certainty than if coded data alone were used.
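A minimal sketch of how these three levels might be operationalised in a case-finding script. The code sets and the free-text keyword test are illustrative assumptions, not the study's definitions:

```python
# Hedged sketch of the proposed three-level classification. Hypothetical
# code sets: a real implementation would use the study's full code lists
# and a proper text-processing step at level 3.
DIAGNOSTIC_CODES = {"N040."}                                   # level 1: RA diagnoses
INDICATOR_CODES = {"joint_stiffness", "rheum_referral",
                   "rf_positive"}                              # level 2: indicators

def classify_record(codes, free_text=""):
    """Return the lowest (most certain) level at which a record
    qualifies as a possible RA case, or None if nothing suggests RA."""
    if DIAGNOSTIC_CODES & set(codes):
        return 1                      # diagnostic code present
    if INDICATOR_CODES & set(codes):
        return 2                      # symptom/referral/test codes only
    if "rheumatoid" in free_text.lower():
        return 3                      # evidence only in free text
    return None

print(classify_record({"joint_stiffness"}))                     # 2
print(classify_record(set(), "Letter mentions rheumatoid..."))  # 3
```

The design choice here, returning the lowest qualifying level, reflects the point above that certainty decreases from level 1 to level 3.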
How findings fit with previous literature
Literature on the methodology of drawing up code-lists for the identification of cases, treatments or outcomes for health research is sparse, despite its pivotal importance in such research. Some work has been published on the operationalisation of code-lists into statistical programmes. There are techniques available to address the proliferation of diagnostic codes, such as ancestor/descendant tracing within SNOMED (“Systematised Nomenclature of Medicine Clinical Terms”, an international comprehensive terminology system being adopted by the NHS), hierarchies within Read codes, and query building within clinical trial systems. However, high-level search strategies, and how to address variation in lists of codes below Level 1 (outlined above), have not been discussed in the literature.
A lack of standardisation in methodology means that code-lists, and hence case definitions, may vary across studies in a variety of research contexts. Very little is known about how this variation in code-lists affects study results, although there are indications that it may be important: high rates of false negatives or false positives, for example, introduce biases into estimates of the uptake of certain tests or treatments within disease populations. An additional concern is that the use and type of coding structures may be specific to individual software systems, so that the codes used to describe, for example, a polyarthritis may differ across primary care software systems. Such system-specific coding reinforces the need for a data-driven stage in any case-finding process and the need to explore the data before finalising code selections.
Other authors have also highlighted the deficits of coded data. Jordan and colleagues showed that the number of consultations related to knee pain was underestimated without the use of text from primary care databases, and, as mentioned, Tate and colleagues have shown that the diagnosis date of ovarian cancer recorded in primary care records is later than other evidence in the record. This is consistent with our results, indicating that the diagnosis may have been known for some time before it was formally coded. In addition, two studies have found that using free text within algorithms for case definition significantly improves the positive predictive value of subject classification in RA when using EHRs.
Previous studies looking at the presentation and management of early RA have been situated in secondary care and are therefore prone to substantial referral bias. The existing work suggests that delay in seeking medical attention is a more important factor than delay in recognition and referral to specialist services. Data from primary care describing the care process from first presentation are needed, but our results suggest that this cannot be achieved within routine primary care electronic databases using codes alone, as much information, particularly that generated outside primary care, may be recorded in free text.
The indicator markers have not been validated by a review of the associated text in the record, and it is therefore possible that non-specific codes, for example for a painful ankle, were unrelated to the subsequent RA diagnosis. Similarly, prescriptions for DMARDs such as methotrexate, and especially for NSAIDs, may be unrelated to RA. This will lead to an over-estimation of the presenting symptoms and treatment in RA patients. The risk of under-estimation is perhaps greater: it will occur if events or consultations relating to RA diagnosis and management are missed because general codes have been used and the important information is recorded in text, which we have described as level 3 variation. The next stage in the project is to extract text around the time of indicator markers and perform keyword searches on the extracted text.
Although we used a panel of four experts to generate the a priori lists of indicator markers, a larger panel or a more in-depth process, such as a Delphi process, might have generated a different or fuller list. However, other studies in this field have used similar-sized panels of experts to scrutinise possible codes. Indeed, some studies have not drawn up lists at all but have used a single diagnostic Read code to identify RA patients. We are not aware of any studies which have reported the details of a data-driven stage for modifying their lists. Greater clarity in reporting how lists are drawn up would result in increased standardisation of case definition and increase the generalisability of EHR study results. Previous studies reporting validation of cases using EHRs have inadequately described the methods for this validation. We believe that this documentation of the process is an important resource for researchers new to the field, and we encourage other researchers to publish their code list strategies in a replicable way.
Furthermore, we did not look at control data to compare the incidence of the indicator markers in patients with no RA diagnostic code. This comparison would be valuable to ascertain whether these markers do indeed occur in greater numbers just before an RA diagnostic code than at other times or in other patients. Further work will examine these markers in control data and also assess their utility in estimating the rates of false negatives which may result from using only diagnostic codes for case definition, thereby introducing bias into a study. The use of control data will allow us to estimate the positive predictive value of clusters of indicator markers, to ascertain which combinations are most predictive of an RA diagnostic code being added to the file. However, given the small proportion of patients with three or more of the chosen codes in their record (around 10%) prior to the RA code, there is likely to be more information concealed in the free text which would contribute to case definition. Our plan is therefore to extract this free-text information and estimate its contribution to case definition before estimating the predictive value of clusters of indicators.
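The proposed positive-predictive-value estimation for a marker cluster, taking a later RA diagnostic code as the reference standard, amounts to a simple ratio. A sketch with invented counts:

```python
# Sketch of PPV estimation for a marker cluster against case/control data.
# All counts are invented for illustration; the reference standard assumed
# here is the later appearance of an RA diagnostic code.
def ppv(flagged_cases, flagged_controls):
    """PPV = true positives / all flagged records."""
    flagged = flagged_cases + flagged_controls
    return flagged_cases / flagged if flagged else float("nan")

# Hypothetical counts for patients with >= 3 of the chosen markers:
cases_flagged = 90      # cluster present AND RA code later added
controls_flagged = 10   # cluster present, no RA code ever added
print(ppv(cases_flagged, controls_flagged))  # 0.9
```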
Implications for future research and practice
Preparing indicator markers and code-lists is time-consuming and requires input from clinical experts. Context and temporal relationships are often important when examining markers and building up patterns. This preparation is likely to be a rate-limiting step in future research projects, particularly if multiple clinical entities need to be included in any one study. The design of future EHR systems should consider the need for reporting and aggregation of cases, and should facilitate the recording of the code-selection process, including a data-driven stage and the ability to search within code-sets. Future research might also consider whether machine learning techniques could contribute to this process.
If EHRs are to be used appropriately in research, we need to develop a wider understanding of the drivers of coding in the clinical environment. As work on different diseases develops, it may also be possible to develop a typology of which kinds of clinical entities are liable to long gaps between the onset of clinical suspicion, referral and treatment, and the coding of diagnosis, and for which these are reliably contemporaneous, in order to avoid wasted effort on over-complex analyses. Our findings here may generalise to some other chronic diseases, but not all.
Sharing of resources between researchers is desirable, to avoid unnecessary duplication. Breaking up search strategies into different markers, with publication of precise search terms and results, may allow transparency and replicability in the preparation of code-lists and facilitate their sharing. For example, codes relating to a certain presentation might be stored and made available for use by another study. At present there is no standardisation of methods for preparing case definitions and code-lists, which limits the possibilities for sharing results, and no recognised repository exists to aid sharing. Any standardisation would require a common set of meta-data relating to case-finding, detailing what has been done, the decisions that have been made and why. The format these meta-data should take is not obvious.
These findings emphasise the need for research using EHRs to go beyond the simple use of diagnostic codes and to adopt more sophisticated strategies for case-finding, including the use of free text. Relying on coded diagnoses alone may not produce accurate case definitions, in turn leading to inaccurate estimates for disease registries and assessments of service needs. The development of automated methods to allow access to information in text without manual anonymisation should be an urgent priority.