This tutorial introduces the creation of data quality reports in R with dataquieR.

Loading data and metadata

Creating reports requires the appropriate setup of study data and metadata, as shown in the figure below:

We can load the synthetic example data from dataquieR via the following:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

The example study data contain 3000 observations and 53 variables; an excerpt is shown below:

sd1
v00000    v00001  v00002  v00003  v00004  v00005  v01003  v01002  v00103  v00006
     3  LEIIX715       0      49     127      77      49       0   40-49     3.8
     1  QHNKM456       0      47     114      76      47       0   40-49     1.9
     1  HTAOB589       0      50     114      71      50       0   50-59     0.8
     5  HNHFV585       0      48     120      65      48       0   40-49     3.8
     1  UTDLS949       0      56     119      78      56       0   50-59     4.1
     5  YQFGE692       1      47     133      81      47       1   40-49     9.5
     1  AVAEH932       0      53     114      78      53       0   50-59     5.0
     3  QDOPT378       1      48     116      86      48       1   40-49     9.6
     3  BMOAK786       0      44     115      71      44       0   40-49     2.0
     5  ZDKNF462       0      50     116      74      50       0   50-59     2.4


We can see that the study data variables have abstract names (e.g. v00001, v00002). Hence, the appropriate labels must be mapped from the metadata. Besides all variables' data types and labels, the metadata stores further expected characteristics and static information about the study data.
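To illustrate what mapping labels from the metadata means, here is a minimal base R sketch. It uses small hand-typed excerpts of the example data and metadata (the objects md_toy, sd_toy and sd_labelled are hypothetical names introduced for this illustration). dataquieR performs the mapping itself when generating a report, so this is not the package's implementation, only the underlying idea.

```r
# Hand-typed excerpt of the example metadata (three of its rows).
md_toy <- data.frame(
  VAR_NAMES = c("v00000", "v00002", "v00003"),
  LABEL     = c("CENTER_0", "SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Hand-typed excerpt of the study data with abstract column names.
sd_toy <- data.frame(
  v00000 = c(3L, 1L, 1L),
  v00002 = c(0L, 0L, 0L),
  v00003 = c(49L, 47L, 50L)
)

# Look up each column name in VAR_NAMES and take the matching LABEL.
sd_labelled <- sd_toy
names(sd_labelled) <- md_toy$LABEL[match(names(sd_toy), md_toy$VAR_NAMES)]

names(sd_labelled)
```

The lookup via match() keeps the column order of the study data, so the metadata rows do not need to be sorted the same way.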

We can read in the example metadata via the following:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data
md1
VAR_NAMES LABEL DATA_TYPE VALUE_LABELS MISSING_LIST JUMP_LIST HARD_LIMITS DETECTION_LIMITS
v00000 CENTER_0 integer 1 = Berlin | 2 = Hamburg | 3 = Leipzig | 4 = Cologne | 5 = Munich NA NA NA NA
v00001 PSEUDO_ID string NA NA NA NA NA
v00002 SEX_0 integer 0 = females | 1 = males NA NA NA NA
v00003 AGE_0 integer NA NA NA [18;Inf) NA
v00103 AGE_GROUP_0 string NA NA NA NA NA
v01003 AGE_1 integer NA NA NA [18;Inf) NA
v01002 SEX_1 integer 0 = females | 1 = males NA NA NA NA
v10000 PART_STUDY integer 0 = no | 1 = yes NA NA NA NA
v00004 SBP_0 float NA 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [80;180] [0;265]
v00005 DBP_0 float NA 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [50;Inf) [0;265]
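As a rough illustration of how such metadata entries can be read, the following base R sketch applies the MISSING_LIST and HARD_LIMITS values listed for SBP_0 to a few made-up measurements. The vector sbp and all object names are hypothetical, and the interval [80;180] is typed in by hand rather than parsed from the metadata; dataquieR's own checks are more elaborate.

```r
# Metadata entries for SBP_0, typed in by hand from the table above.
missing_codes <- 99980:99995   # MISSING_LIST: codes 99980 to 99995
lower <- 80                    # HARD_LIMITS [80;180], lower bound
upper <- 180                   # HARD_LIMITS [80;180], upper bound

# Hypothetical measurements: one missing code and two implausible values.
sbp <- c(127, 114, 99980, 75, 190, 133)

# Step 1: replace missing codes with NA so they are not read as measurements.
sbp_clean <- sbp
sbp_clean[sbp_clean %in% missing_codes] <- NA

# Step 2: flag remaining values that fall outside the hard limits.
outside <- !is.na(sbp_clean) & (sbp_clean < lower | sbp_clean > upper)

which(outside)
```

Replacing the missing codes first matters: otherwise a code such as 99980 would itself be flagged as a limit violation.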


For more information on the synthetic example data and metadata, see here.

Generating a report

We can create a default report using the dq_report() function, which requires only the data and metadata as input:

dq_report(study_data = sd1, 
          meta_data = md1)
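Because the report relies on both inputs, it can help to confirm beforehand that the column names of the study data and the VAR_NAMES column of the metadata agree. The base R sketch below shows one way to do this on toy data; the variable v99999 and all object names are hypothetical, and this check is a suggestion rather than a step that dq_report() requires.

```r
# Toy study data; v99999 is a made-up column without a metadata row.
sd_toy <- data.frame(
  v00000 = c(3L, 1L),
  v00001 = c("LEIIX715", "QHNKM456"),
  v99999 = c(1, 2),
  stringsAsFactors = FALSE
)

# Toy metadata; v10000 is described here but absent from sd_toy.
md_toy <- data.frame(
  VAR_NAMES = c("v00000", "v00001", "v10000"),
  LABEL     = c("CENTER_0", "PSEUDO_ID", "PART_STUDY"),
  stringsAsFactors = FALSE
)

# Columns present in the data but not described in the metadata.
undocumented <- setdiff(names(sd_toy), md_toy$VAR_NAMES)

# Variables described in the metadata but absent from the data.
unused <- setdiff(md_toy$VAR_NAMES, names(sd_toy))

undocumented
unused
```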

Minimal workflow example

The animation below shows a quick workflow for reporting data quality with dataquieR:

This example uses data from the Study of Health in Pomerania (SHIP) project, which is also included in dataquieR. You can see the example report generated by dq_report() here.

Example code

The full code shown in the animation to produce a report is given here:

# --------------------------------------------------------------------------------------------------
# D A T A    Q U A L I T Y   I N    E P I D E M I O L O G I C A L    R E S E A R C H
#
# == dataquieR
#
# dq_report() eases the generation of data quality reports as it automatically calls dataquieR functions
# 
#
# Installation/Further Information -----------------------------------------------------------------
#
# Please see our website:
# https://dataquality.ship-med.uni-greifswald.de/
#
# install dataquieR from CRAN using

install.packages("dataquieR")

# Alternatively, you may install the development version as described
# on https://dataquality.ship-med.uni-greifswald.de/DownloadR.html

# load the package

library(dataquieR)

# data ---------------------------------------------------------------------------------------------

# Study of Health in Pomerania example data

sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))

summary(sd1)

# metadata

md1 <- readRDS(system.file("extdata", "ship_meta.RDS", package = "dataquieR"))


# dq_report() - a crude approach -------------------------------------------------------------------

my_dq_report <- dq_report(study_data = sd1,
                          meta_data  = md1,
                          label_col  = LABEL)

# view the results

my_dq_report

The function dq_report() accepts further arguments and settings. Even so, this minimal call is a good starting point for gaining insight into the data and can serve as the basis for more tailored reports.