Conceptual background

Achieving high data quality is essential for the valid study of diseases, their risk factors, and their consequences. This requires informative data quality indicators and tools to assess and report data quality. Yet, despite many available works (e.g. Kahn et al. 2016, Weiskopf et al. 2013, Weiskopf et al. 2017, Nonnemacher et al. 2014), no standards have been established in our field of research. Existing data quality frameworks target registries and electronic health records (EHR) rather than data collected directly for research purposes.

The lack of common standards is partly due to the large heterogeneity of data structures and data collection processes (Keller et al. 2017). If data quality is understood as “the degree to which a set of inherent characteristics of data fulfills requirements” (ISO 8000), this heterogeneity is to be expected: requirements and their operationalizations differ considerably within and across areas of research, studies, and data bodies.

Aims and scope

Against this background, we developed a data quality framework with related implementations to facilitate standardized assessments of data quality. The core area of application is observational data collections in medical research, yet applications are not limited to this area.

We focus on intrinsic data quality, i.e. “data have quality in their own right”, as opposed to contextual data quality, “which highlights the requirement that data quality must be considered within the context of the task” (Wang and Strong 1996).

Intrinsic data quality targets basic aspects such as (1) processable data, (2) complete data, and (3) error-free data. These requirements are common to virtually all substantive scientific research. In contrast, contextual data quality is largely situation-specific, so a uniform approach is harder to define. Examples of contextual requirements are whether the variables relevant to a given research question are available, or whether statistical power is sufficient for the planned analyses.
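To make the three intrinsic aspects more concrete, the following Python sketch counts, per variable, values that are missing (completeness), values that cannot be processed as numbers (processability), and numeric values outside an admissible range (one simple kind of error). It is only an illustration: the records, variable names, limits, and the function `check_variable` are hypothetical and do not reproduce the framework's R or Stata implementations.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "sbp": 120},
    {"id": 2, "age": None, "sbp": 135},
    {"id": 3, "age": 51, "sbp": 400},     # implausibly high blood pressure
    {"id": 4, "age": "n/a", "sbp": 110},  # text where a number is expected
]

# Hypothetical admissible ranges (inclusive) for each variable.
LIMITS = {"age": (0, 120), "sbp": (50, 300)}


def is_number(value):
    """True for int or float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_variable(records, name, limits):
    """Count missing, unprocessable, and out-of-range values of one variable."""
    low, high = limits
    values = [record.get(name) for record in records]
    return {
        "n": len(values),
        # completeness: the value is absent
        "missing": sum(v is None for v in values),
        # processability: a value is present but is not numeric
        "unprocessable": sum(v is not None and not is_number(v) for v in values),
        # error-free data: a numeric value lies outside the admissible range
        "out_of_range": sum(is_number(v) and not (low <= v <= high) for v in values),
    }


report = {name: check_variable(records, name, limits) for name, limits in LIMITS.items()}
for name, counts in report.items():
    print(name, counts)
```

None of these checks depends on a particular research question, which is what distinguishes them from contextual requirements such as sufficient statistical power.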

The revised TMF guideline for data quality (Nonnemacher et al. 2014, Stausberg et al. 2019) was used as the initial point of reference for this work because it targets aspects of primary data collections. An empirical evaluation of the indicators described in the TMF guideline was conducted by representatives of the participating cohorts (Schmidt et al. 2019). This evaluation was used to identify indicators of particular relevance as well as potential areas for improvement. The concept is described in the respective section.

An important feature is that the data quality framework is accompanied by statistical implementations that facilitate and harmonize the assessments. The focus is on R, but a Stata environment has also been created; both are described in Software.

Disclaimer: Work in progress

The development of the concept and implementations is still ongoing. Therefore, the scope of the content is expected to grow.

Kahn, M.G., Callahan, T.J., Barnard, J., Bauck, A.E., Brown, J., Davidson, B.N., Estiri, H., Goerg, C., Holve, E., and Johnson, S.G. (2016). A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. eGEMs 4.
Keller, S., Korkmaz, G., Orr, M., Schroeder, A., and Shipp, S. (2017). The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches.
Nonnemacher, M., Nasseh, D., and Stausberg, J. (2014). Datenqualität in der medizinischen Forschung (Medizinisch Wissenschaftliche Verlagsgesellschaft).
Schmidt, C.O., Richter, A., Enzenbach, C., Pohlabeln, H., Meisinger, C., Wellman, J., Selder, S., Houben, R., Nonnemacher, M., and Stausberg, J. (2019). Assessment of a data quality guideline by representatives of German epidemiologic cohort studies. MIBE 15.
Stausberg, J., Bauer, U., Nasseh, D., Pritzkuleit, R., Schmidt, C., Schrader, T., and Nonnemacher, M. (2019). Indicators of data quality: Review and requirements from the perspective of networked medical research [Indikatoren zur Datenqualität: Stand und Anforderungen aus Sicht der vernetzten medizinischen Forschung]. GMS Med Inform Biom Epidemiol 15.
Wang, R.Y., and Strong, D.M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12, 5–33.
Weiskopf, N.G., Hripcsak, G., Swaminathan, S., and Weng, C. (2013). Defining and measuring completeness of electronic health records for secondary use. Journal of Biomedical Informatics 46, 830–836.
Weiskopf, N.G., Bakken, S., Hripcsak, G., and Weng, C. (2017). A data quality assessment guideline for electronic health record data reuse. eGEMs (Generating Evidence & Methods to Improve Patient Outcomes) 5.