Reproducibility of Electronic Health Record Research Data Requests

Objectives/Goals: Translational research, including observational, comparative effectiveness, clinical trial, and population health studies, increasingly depends on the secondary use of existing data, including electronic health record (EHR) data, as a source for knowledge discovery. Researchers often provide natural language descriptions of their data needs. Research data teams use these descriptions to develop queries to run against EHR systems and return the results to the researcher. Within this process, the data team and the researcher usually engage in complex written and verbal communication to mediate the details of the natural language description. The data team then abstracts the request, as understood, to the complexities present in the EHR system. This is followed by the development of appropriate query scripts; the extraction, transformation, and provisioning of query results; and analysis. The data team and researcher usually iterate over these steps multiple times to develop the final data deliverable. In this study, we analyze the reproducibility of the current process of using natural language descriptions to acquire research data from the EHR.

Methods/Study Population: We provided a natural language description of an Upper Respiratory Tract Infection (URTI) study to data teams at three CTSA sites. The teams were blinded to the true nature of the study, which was to understand the reproducibility of the natural language description. We analyzed the results and the processes followed at each site. The following is a summary of the URTI data request:

     Patients eligible for enrollment between July 1, 2012 and September 30, 2015. Data will be reviewed up to six months pre-index to identify baseline characteristics and exclusion criteria. Subjects identified with an ICD-9-CM diagnosis code for an URTI during an outpatient patient encounter will be included. The index date is defined as the first ICD-9-CM documentation for an URTI. A 6-month pre-index period will be used to identify exclusion criteria, and to observe baseline characteristics. Patients will be examined for outcomes of interest within 24-hours of the index clinic date and time. In addition, we exclude patients with a positive rapid antigen detection test (RADT) for group A streptococcal pathogens at the initial visit (results available within 24-hours) as this is an instance when antibiotic prescribing is appropriate.

Inclusion criteria:

  1. Age >18 years old
  2. Diagnosis (ICD-9 code list provided) of a URTI in the outpatient setting

Exclusion criteria:

  1. AIDS/HIV (ICD-9 code list provided)
  2. COPD/Asthma (ICD-9 code list provided)
  3. Cancer (ICD-9 code list provided)
  4. Conditions for which antibiotic prescribing may be appropriate
    • An URTI diagnosis within the 180-day pre-index period
    • Additional infectious diseases (ICD-9 code list provided)
    • A positive rapid antigen detection test for group A streptococcal pathogen (LOINC code list provided).
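
To make concrete how many interpretive choices the request above leaves to a data team, the criteria can be sketched as a small cohort-selection routine. This is only one possible reading, not the query any site actually ran. The `Encounter` record, its field names, and `select_cohort` are hypothetical, and the code lists are passed in as plain sets. Several decisions are assumptions marked in the comments: age is taken at the index encounter, the pre-index period is 180 days rather than six calendar months, and exclusion diagnoses recorded at the index visit itself also exclude the patient.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta


@dataclass
class Encounter:
    """Hypothetical, simplified encounter record (not a real EHR schema)."""
    patient_id: str
    when: datetime
    outpatient: bool
    age_years: int                              # age at this encounter
    dx_codes: set = field(default_factory=set)  # ICD-9-CM codes recorded
    radt_positive: bool = False                 # positive group A strep RADT within 24 h


def select_cohort(encounters, urti_codes, exclusion_codes,
                  start, end, lookback_days=180):
    """Return {patient_id: index datetime} under one reading of the request."""
    by_patient = {}
    for enc in sorted(encounters, key=lambda e: e.when):
        by_patient.setdefault(enc.patient_id, []).append(enc)

    cohort = {}
    for pid, visits in by_patient.items():
        # Assumption: index = first outpatient URTI diagnosis inside the
        # enrollment window (both boundary dates included).
        index = next((v for v in visits
                      if v.outpatient
                      and start <= v.when.date() <= end
                      and v.dx_codes & urti_codes), None)
        if index is None:
            continue
        # Assumption: "Age >18" is evaluated at the index encounter.
        if index.age_years <= 18:
            continue
        # Assumption: "six months" is read as 180 days before the index.
        window_start = index.when - timedelta(days=lookback_days)
        prior = [v for v in visits if window_start <= v.when < index.when]
        # Exclude on any listed condition or an earlier URTI in the pre-index period.
        if any(v.dx_codes & (exclusion_codes | urti_codes) for v in prior):
            continue
        # Assumption: exclusion codes at the index visit also exclude,
        # as does a positive RADT at the initial visit.
        if index.dx_codes & exclusion_codes or index.radt_positive:
            continue
        cohort[pid] = index.when
    return cohort
```

Changing any one of the marked assumptions (for example, evaluating age at the time of the query instead of at the index visit) yields a different cohort from the same natural language description.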


Results/Anticipated Results: The three sites had 684,478, 460,159, and 412,942 individuals, respectively, with outpatient visits between July 1, 2012 and September 30, 2015. Of these, 18.7%, 1.7%, and 3.1% had a URTI at the respective sites. Of those with a URTI, 17.5%, 34.2%, and 39.9% were over 18 years of age. After all exclusion criteria were applied, 6,797, 623, and 3,092 patients remained at the three sites, respectively. These patients would NOT be expected to receive antibiotics. Of them, 9.4%, 0.3%, and 36.5% were prescribed antibiotics within the first 24 hours, and 11.2%, 0.3%, and 40.3% within the first 8 days, at the three sites respectively.

Discussion/Significance of Impact: Analysis of the results at each stage of query building showed differences across the sites ranging from two-fold to ten-fold. The sites that differed, and the magnitude of the differences, varied at each step of the inclusion/exclusion criteria. To ascertain possible reasons for these discrepancies, we asked the data team at each site to describe its data query process and analyzed the query results against the processes undertaken. We found that contextual, organizational, and data-analyst-specific issues played a significant role in how the data teams constructed their queries. In addition, differences in how data were transformed and stored from the EHR source systems, and in how they were presented to the data team, played an important role.
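
The between-site spread at each cohort-building stage can be checked directly from the figures reported in the Results. The sketch below is only an arithmetic illustration: the stage labels and the `spread` helper are our own naming, and the "spread" is simply the largest site value divided by the smallest.

```python
# Figures copied from the Results paragraph, ordered site 1, site 2, site 3.
stages = {
    "outpatient population": [684_478, 460_159, 412_942],
    "URTI (% of outpatients)": [18.7, 1.7, 3.1],
    "over 18 (% of URTI patients)": [17.5, 34.2, 39.9],
    "final cohort after exclusions": [6_797, 623, 3_092],
}


def spread(values):
    """Ratio of the largest to the smallest site value."""
    return max(values) / min(values)


for name, values in stages.items():
    print(f"{name}: {spread(values):.1f}-fold")
```

Under this definition the spreads come out at roughly 1.7, 11.0, 2.3, and 10.9 for the four stages, and the site with the largest or smallest value is not the same at every stage.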

Data obtained from the EHR have great potential for use in translational research. However, the inability to reproduce data requests across different sites is an important consideration. Reproducing research data requires effective communication between the research and data teams. In addition, there is a need for (1) structured or semi-structured data requisition methods using templates, and (2) context-sensitive and metadata-driven workflows that support the entire life cycle of research data requisition, including the development of the natural language description of the research data, query mediation, data abstraction, data extraction, and provisioning of results and analysis.
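
As a rough illustration of what a structured requisition template might look like, the sketch below encodes part of the URTI request as explicit fields and reports any field the requester left unspecified. It is not a proposed standard: the class name, field names, and the `unresolved` method are all hypothetical, and a real template would need many more fields (code lists, data sources, output format).

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class DataRequestTemplate:
    """Hypothetical structured request; every field must be filled explicitly."""
    enrollment_start: Optional[date] = None
    enrollment_end: Optional[date] = None
    index_event: Optional[str] = None
    lookback_days: Optional[int] = None
    minimum_age_exclusive: Optional[int] = None
    age_measured_at: Optional[str] = None      # e.g. "index date" or "query date"
    encounter_setting: Optional[str] = None
    outcome_window_hours: Optional[int] = None

    def unresolved(self):
        """Names of fields the requester has not yet specified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Values taken from the URTI request; it does not say when age is measured.
request = DataRequestTemplate(
    enrollment_start=date(2012, 7, 1),
    enrollment_end=date(2015, 9, 30),
    index_event="first ICD-9-CM URTI diagnosis",
    lookback_days=180,
    minimum_age_exclusive=18,
    encounter_setting="outpatient",
    outcome_window_hours=24,
)
print(request.unresolved())  # ['age_measured_at']
```

A template of this kind surfaces open questions before query development begins, so they can be settled during query mediation rather than resolved independently, and possibly differently, by each data team.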