This document describes the generation of a SHIP data quality report using a 50% random sample of data from the Study of Health in Pomerania (SHIP-0, 1997-2001) examination. For further information on this cohort study, please see Völzke et al. To secure anonymity and for illustrative purposes, some noise has been added to the data.

INTEGRITY

The first step in the data quality assessment workflow addresses the compliance of the submitted study data with the respective metadata regarding formal and structural requirements. Both need to be provided as data frames.

Note:

The metadata file is the primary point of reference for generating data quality reports:

  • First, it defines the number of variables for which to generate reports.
  • Second, it is the expected truth against which the study data are assessed.

Study data

In this example, the SHIP data are loaded from the dataquieR package.

sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))

The imported study data consist of:

  • N = 2154 observations and
  • P = 29 study variables

Metadata

Similarly, the respective metadata must be loaded from dataquieR.

md1 <- readRDS(system.file("extdata", "ship_meta.RDS", package = "dataquieR"))

The imported metadata provide information for:

  • P = 29 study variables and
  • Q = 20 variable level attributes

An identical number of variables in both files is desirable but not necessary. Attributes, i.e. columns in the metadata, comprise information on each variable of the study data file such as labels or admissibility limits.

Integrity check

Now the actual integrity checks start with a call of the function pro_applicability_matrix(). It generates a heatmap-like plot of the applicability of all dataquieR functions to the study data, using the provided metadata as the point of reference:

appmatrix <- pro_applicability_matrix(study_data = sd1, meta_data = md1, 
                                      label_col = LABEL, split_segments = TRUE)

The heatmap can be retrieved by using the command:

appmatrix$ApplicabilityPlot

As the argument split_segments = TRUE is used in the call above, all output is organized by the study segments defined in the metadata. In this case, there are data from four examination segments: the computer-assisted interview; intro (basic information on the participants, such as sociodemographic information and examination date); laboratory variables; and the somatometric examination. The assignment of variables to segments is provided in the metadata file.

The results of the applicability checks are of a technical nature, i.e. the function compares, for example, the data type defined in the metadata with the one observed in the study data. The light blue areas indicate that additional checks would be possible for many variables if additional metadata were provided.

Note:

It is not advisable to apply all technically feasible data quality implementations to all study data variables. For example, detection limits are not at all meaningful for participant IDs. However, the variable ID is represented as an integer, which technically allows a check on detection limits.

Solving integrity issues

All data type issues found by pro_applicability_matrix() should be checked data element by data element. Regarding the variable WAIST_CIRC_0, a major issue was found: the variable is represented in the study data with data type character, which differs from the expected data type float defined in the metadata. Some basic checks reveal the misuse of the comma as decimal delimiter.

Simply converting WAIST_CIRC_0 to data type numeric in R would coerce the affected values to NA, which should be avoided. Instead, we replaced the comma with the correct delimiter and then corrected the data type without loss of data values. The resulting applicability plot shows no more issues.
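The correction described above can be sketched in base R as follows (a toy vector stands in for the affected study-data column; the actual column name behind WAIST_CIRC_0 depends on the VAR_NAMES entry in the metadata):

```r
# Toy character values with comma as decimal delimiter (hypothetical, not SHIP data)
x <- c("88,5", "101,2", "95,0")

# Replace the comma by a period first, then convert; converting directly
# would coerce every affected value to NA
x_num <- as.numeric(gsub(",", ".", x, fixed = TRUE))

stopifnot(!anyNA(x_num))  # no values lost in the conversion
```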

pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL)$ApplicabilityPlot

COMPLETENESS

The next major step in the data quality assessment workflow is to assess the occurrence and patterns of missing data.

The sequence of checks in this example is ordered according to common stages of a data collection:

  • Unit missingness: subjects without information on any of the provided study variables
  • Segment missingness: subjects without information for all variables of a defined study segment (e.g. some examination)
  • Item missingness: subjects without information on single data fields within segments

Following this sequence enables the calculation of correct denominators for item missingness. This is particularly important for complex cohort studies in which different levels of examination programs are conducted. For example, only half of a study population might be scheduled for an MRI examination; in the remaining 50%, the respective MRI variables are, by study design, not populated. This must be considered when item missingness is examined.

Unit missingness

This check identifies subjects without any measurements on the provided target variables for a data quality check.

Note:

The interpretation of findings depends on the scope of the provided variables and data records. In this example, the study data set comprises examined SHIP participants, not the target sample. Accordingly, the check is not about study participation. Rather, it identifies cases for which unexpectedly no information has been provided at all. Any identified case would indicate a data management problem.

The indicator covered by com_unit_missingness() is:

  • DQI-2001 Missing values with an implementation at the level “Units”

Unit missingness can be assessed by using the command:

my_unit_missings2 <- com_unit_missingness(study_data  = sd1, 
                                          meta_data   = md1,
                                          label_col   = LABEL,
                                          id_vars     = "ID")

In total, 0 units in these data have missing values in all variables of the study data.

Thus for each participant there is at least one variable with information.

Segment missingness

Subsequently, a check is performed that identifies subjects without any measurements within each of the four defined study segments.

The indicator covered by com_segment_missingness() is:

  • DQI-2001 Missing values with an implementation at the level “Segments”

In this example, the call with a table output is:

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    threshold_value = 1, 
                                    direction = "high",
                                    exclude_roles = c("secondary", "process"))

MissSegs$SummaryData

Exploring segment missingness over time requires another variable in the study data.

Information regarding this variable can be added to the metadata using the dataquieR function prep_add_to_meta():

# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))

# add metadata for this variable
md1 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exyear", 
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_YEAR_0",
                                   VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
                                   VARIABLE_ROLE = "process",
                                   meta_data = md1)

With a discretized variable for examination year (EX_YEAR_0) the occurrence pattern by year can subsequently be assessed using the command com_segment_missingness():

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    threshold_value = 1, 
                                    label_col = LABEL,
                                    group_vars = "EX_YEAR_0",
                                    direction = "high",
                                    exclude_roles = "process")

MissSegs$SummaryPlot

The plot is a descriptor, assigned to the indicator:

  • DQI-2001 Missing values with an implementation at the level “Segments”

It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportions of missing data occurring in the 1st, 2nd, and 5th years.

Item missingness

Finally, in the completeness dimension, a check is performed to identify missing data fields in the variables of all study segments. The indicators covered by the function com_item_missingness() are:

  • DQI-1008 Uncertain missingness status
  • DQI-2001 Missing values with an implementation at the level “Item”
  • DQI-2005 Missing due to specified reason

Item missingness can be assessed by using the following call:

# code_labels: a data frame assigning labels to missing codes (prepared beforehand)
item_miss <- com_item_missingness(study_data      = sd1,
                                  meta_data       = md1,
                                  show_causes     = TRUE,
                                  cause_label_df  = code_labels,
                                  label_col       = "LABEL",
                                  include_sysmiss = TRUE,
                                  threshold_value = 95)

Summary table

A result overview can be obtained by requesting a summary table of this function:

item_miss$SummaryTable

The table provides one line for each of the 29 variables. Of particular interest are:

  • System missings N: the number of data fields for a variable without any valid data entry, indicating technically inferior coding (DQI-1008)
  • Missing codes: the number of data fields with valid missing codes
  • Jump codes: the number of data fields for which no data collection was attempted by design
  • Measurements: the inverse of DQI-2001 Missing values with an implementation at the level “Items”

The table shows that one variable is affected by many missing values: HOUSE_INCOME_MONTH_0, the net household income. In addition, the age at onset of diabetes (DIAB_AGE_ONSET_0) was coded for only 173 subjects, but most values are missing because of an intended jump.

Note:

If jump codes have been used, e.g. for the variable CONTRACEPTIVE_EVER_0, the denominator for the calculation of item missingness is corrected by the number of jump codes used.
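The denominator correction can be illustrated with a small sketch (all counts below are assumed toy values, not SHIP figures):

```r
# Item missingness with a jump-code correction (toy counts)
n_obs     <- 2000   # observations for the item
n_jump    <- 1200   # jump codes: no data collection attempted by design
n_missing <- 40     # genuinely missing data fields

# The denominator excludes the data fields covered by jump codes
pct_missing <- 100 * n_missing / (n_obs - n_jump)
```

Without the correction, the rate would be diluted by participants for whom no data collection was ever intended.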

Summary plot

The summary plot provides a different view on missing data by showing the frequencies of the specified reasons for missing data:

  • DQI-2005 Missing due to specified reason.

The balloon size is determined by the number of missing data fields.

It can now be inferred that, for example, the elevated number of missing values for the item HOUSE_INCOME_MONTH_0 is mainly caused by refusals of participants to answer the respective question.

CONSISTENCY

After completeness has been examined, consistency is targeted as the first part of the data quality dimension correctness. The removal of missing and jump codes is one prerequisite for the application of correctness checks. Consistency describes the degree to which data values are free of breaks in conventions or of contradictions. Different data types may be addressed by the respective checks.

Inadmissible numerical values

The indicator covered by con_limit_deviations() when specifying limits = “HARD_LIMITS” is:

  • DQI-3001 Inadmissible numerical values

Note:

When specifying the argument limits = “SOFT_LIMITS”, the check identifies uncertain rather than inadmissible values, according to the specified ranges.

The call in this example with regards to inadmissible numerical values is:

MyValueLimits <- con_limit_deviations(study_data = sd1,
                                      meta_data  = md1,
                                      label_col  = "LABEL",
                                      limits     = "HARD_LIMITS")

Summary table

Subsequently, a table output may be requested. It provides the number and percentage of all range violations for the affected variables:

MyValueLimits$SummaryTable

The last column of the table provides a grading: if the percentage of violations exceeds a threshold, a problem grading (= 1) is assigned; otherwise the grading is 0. In this case, any occurrence is classified as problematic.
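The grading rule amounts to a simple threshold comparison; a minimal sketch with toy percentages (the threshold value is an assumption for illustration) is:

```r
# 1 = problematic if the percentage of violations exceeds the threshold, else 0
threshold      <- 0                  # here: any occurrence is problematic
pct_violations <- c(0, 0.3, 2.1)     # toy percentages, one per variable

grading <- as.integer(pct_violations > threshold)
```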

The following statement assigns all variables identified as problematic to the R object whichdeviate, to subsequently enable more targeted output, for example a plot of the distributions of all variables with violations of the specified limits:

# select variables with deviations
whichdeviate <- as.character(MyValueLimits$SummaryTable$Variables)[MyValueLimits$SummaryTable$GRADING == 1]

Summary plot

In this case, the plot output has been restricted to the variables with limit deviations, i.e. those with a grading of 1 in the table above (only the first two are displayed to reduce file size):

head(MyValueLimits$SummaryPlotList[whichdeviate], 2)

Inadmissible categorical values

A comparable check may be performed for categorical variables using the command con_inadmissible_categorical().

The covered indicator is:

  • DQI-3003 Inadmissible categorical values

The call is:

IAVCatAll <- con_inadmissible_categorical(study_data = sd1, 
                                          meta_data  = md1, 
                                          label_col  = "LABEL")

As with inadmissible numerical values, a table output may be requested. It displays the observed categories, the defined categories, any non-matching levels, their counts, and a grading:

IAVCatAll$SummaryTable

The results show two variables, SCHOOL_GRAD_0 and OBS_SOMA_0, with one inadmissible level each. Regarding the variable OBS_SOMA_0, either the metadata did not include the respective missing or jump code, or a false code has been used in the study data.

Contradictions

The second main type of checks within the consistency dimension concerns contradictions.

Contradictions are assessed using the command con_contradictions().

Rules to identify contradictions must first be loaded from a spreadsheet format. The creation of this spreadsheet is supported by a Shiny app. Overall, 11 different logical comparisons can be applied; an overview is given in the respective tutorial. Each line of the spreadsheet defines one check rule.

checks <- read.csv(system.file("extdata", 
                               "ship_contradiction_checks.csv",
                               package = "dataquieR"), 
                            header = TRUE, sep = "#")

Subsequently, the command con_contradictions() may be triggered, using the table checks as the point of reference for a check on contradictions:

AnyContradictions <- con_contradictions(study_data      = sd1,
                                        meta_data       = md1,
                                        label_col       = "LABEL",
                                        check_table     = checks,
                                        threshold_value = 1)

Summary table

A summary table may be requested to show the number and percentage of contradictions for each defined rule:

AnyContradictions$SummaryTable

In this example, one rule leads to the identification of 35 contradictions: the age at onset of diabetes is provided, but the variable on the presence of diabetes does not indicate a known disease.
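Conceptually, such a rule is a logical comparison evaluated per record. A minimal base-R sketch of the diabetes rule just described (hypothetical column names and toy data, not dataquieR's implementation) is:

```r
# Toy data: 1 = diabetes known, 0 = not known; age at onset, NA if never recorded
diab_known     <- c(1, 0, 0, 1)
diab_age_onset <- c(45, NA, 52, 60)

# Contradiction: an age at onset is provided although no diabetes is indicated
contradiction <- !is.na(diab_age_onset) & diab_known == 0
sum(contradiction)
```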

Summary plot

The distributions may also be displayed as a plot:

AnyContradictions$SummaryPlot

ACCURACY

The second dimension related to correctness is accuracy. It targets the degree of agreement between observed and expected distributions and associations.

In contrast to most consistency-related indicators, findings here indicate an elevated probability that some data quality issue exists, rather than a certain issue.

Univariate outlier

Based on statistical criteria, univariate outliers are addressed first.

The function acc_robust_univariate_outlier() identifies outliers according to the approaches of Tukey, SixSigma, Hubert, and the heuristic SigmaGap approach. It may be called as follows:

UnivariateOutlier <- dataquieR:::acc_robust_univariate_outlier(study_data      = sd1,
                                                               meta_data       = md1,
                                                               label_col       = "LABEL")

Summary table

As with other dataquieR implementations, one output option is a table. It provides descriptive statistics and the detected outliers according to the different criteria:

UnivariateOutlier$SummaryTable

Outliers according to at least three criteria affect all targeted variables, but only for HDL cholesterol (CHOLES_HDL_0) have two outliers been detected using the SigmaGap criterion.
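For orientation, the Tukey criterion mentioned above flags values outside the interquartile fences. A minimal base-R sketch on toy data (not dataquieR's implementation) is:

```r
# Tukey fences: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR count as outliers
x <- c(1, 2, 3, 4, 5, 50)                       # toy data with one gross outlier
q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                              # interquartile range

is_out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
x[is_out]
```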

Summary plot

To obtain better insight into the univariate distributions, a plot can be requested. It highlights, for each variable, the observations according to the number of violated rules (only the first four are shown to reduce file size).

Multivariate outlier

The function acc_multivariate_outlier() identifies multivariate outliers. It uses the same rules as acc_robust_univariate_outlier() for the identification of outliers.

The following function call relates the systolic and diastolic blood pressure measurements to age and body weight; a table output reports the number of detected multivariate outliers:

MVO_SBP0.1 <- acc_multivariate_outlier(resp_vars = c("SBP_0.1", "DBP_0.1", "AGE_0", "BODY_WEIGHT_0"),
                                       study_data      = sd1,
                                       meta_data       = md1,
                                       id_vars         = "ID",
                                       label_col       = "LABEL")

MVO_SBP0.1$SummaryTable

The number of outliers varies considerably, depending on the criterion. Subsequently a parallel-coordinate-plot may be requested to further inspect results:

MVO_SBP0.1$SummaryPlot

Another example is the inspection of the first and second systolic blood pressure measurements:

MVO_DBP <- acc_multivariate_outlier(resp_vars  = c("SBP_0.1", "SBP_0.2"),
                                    study_data = sd1,
                                    meta_data  = md1,
                                    label_col  = "LABEL")

MVO_DBP$SummaryTable
MVO_DBP$SummaryPlot


Distribution

The function acc_distributions() describes distributions using histograms and displays empirical cumulative distribution functions (ECDF) if a grouping variable is provided. The function is purely descriptive and as such not related to a specific indicator; rather, its results may be relevant to most indicators within the unexpected-distribution domain.

The following example examines somatometric measurements, considering a possible influence of the observers.

ECDFSoma <- acc_distributions(resp_vars  = c("WAIST_CIRC_0", "BODY_HEIGHT_0", "BODY_WEIGHT_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data = sd1,
                              meta_data  = md1,
                              label_col  = "LABEL")

The respective list of plots may be displayed using the following command (to decrease the file size, only the first 2 plots):

invisible(lapply(head(ECDFSoma$SummaryPlotList,2), print))

The print function is applied to each list element; the surrounding invisible() call is used so that some annotations and messages are suppressed.


Margins

The function acc_margins() provides descriptive output such as violin plots and box plots for continuous variables, count plots for categorical data, and density plots for both. Its main application is to make inferences on effects related to process variables such as examiners, devices, or study centers. The R function determines whether measurements are provided as continuous or discrete; alternatively, the metadata may provide this information.

In the first example, acc_margins() is applied to the variable waist circumference (WAIST_CIRC_0). Dependencies related to the examiners (OBS_SOMA_0) are examined while the raw measurements are adjusted for age and sex (AGE_0, SEX_0):

Waist circumference

marginal_dists <- acc_margins(resp_vars  = "WAIST_CIRC_0",
                              co_vars    = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data = sd1,
                              meta_data  = md1,
                              label_col  = "LABEL")

A plot may be requested to review the results:

marginal_dists$SummaryPlot

Based on a statistical test, the mean waist circumference of no examiner differed significantly (p < 0.05) from the overall mean.

Myocardial infarction

The situation is quite different when assessing the coded myocardial infarction across examiners while controlling for age and sex:

marginal_dists <- acc_margins(resp_vars  = "MYOCARD_YN_0",
                              co_vars    = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_INT_0",
                              study_data = sd1,
                              meta_data  = md1,
                              label_col  = "LABEL")

marginal_dists$SummaryPlot

The result shows elevated proportions for the examiners 05 and 07.

An important and related issue is the quantification of the observed examiner effects:

Variance components

This is accomplished by the function acc_varcomp(). It computes the percentage of the variance of a target variable that is attributable to the grouping variable, while controlling for covariables (here age and sex). The output may be reviewed in table format:

vcs <- acc_varcomp(resp_vars  = "WAIST_CIRC_0",
                   co_vars    = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_SOMA_0",
                   study_data = sd1,
                   meta_data  = md1,
                   label_col  = "LABEL")

vcs$SummaryTable

For the variable WAIST_CIRC_0, an ICC of 0.019 was found, which is below the threshold. The same holds for the variable MYOCARD_YN_0, probably because the case counts for the two deviant observers 05 and 07 are low:
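For orientation, the ICC reported above is the share of between-observer variance in the total variance. A minimal sketch with assumed toy variance components (chosen to land near the reported value; not acc_varcomp()'s exact estimator) is:

```r
# ICC = between-group variance / (between-group + within-group variance)
var_between <- 0.4    # assumed variance between observers
var_within  <- 20.6   # assumed residual variance within observers

icc <- var_between / (var_between + var_within)   # close to 0.019
```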

vcs <- acc_varcomp(resp_vars  = "MYOCARD_YN_0",
                   co_vars    = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_INT_0",
                   study_data = sd1,
                   meta_data  = md1,
                   label_col  = "LABEL")

vcs$SummaryTable


LOESS

A particular complexity is the study of effects across groups and time. As a descriptor, this is realized by the function acc_loess(), which may also be used to obtain information on other indicators in the domain of unexpected distributions. A sample call with graphical output, using waist circumference as the target variable, is:

timetrends <- acc_loess(resp_vars  = "WAIST_CIRC_0",
                        co_vars    = c("AGE_0", "SEX_0"),
                        group_vars = "OBS_SOMA_0",
                        time_vars  = "EXAM_DT_0",
                        study_data = sd1,
                        meta_data  = md1,
                        label_col  = "LABEL")

invisible(lapply(timetrends$SummaryPlotList, print))

The graph for this variable indicates no major discrepancies between the observers over the examination period.

Shape

Next to location parameters, the shape of a distribution is an important aspect of accuracy.

Observed distributions can be tested against expected distributions using the function acc_shape_or_scale().

In this example, the uniformity of the distribution of measurement-device use is examined.

MyUnexpDist1 <- acc_shape_or_scale(resp_vars  = "DEV_BP_0", 
                                   guess      = TRUE, 
                                   label_col  =  "LABEL",
                                   dist_col   = "DISTRIBUTION",
                                   study_data = sd1, 
                                   meta_data  = md1)

MyUnexpDist1$SummaryPlot

The plot illustrates that devices have not been used with comparable frequencies.

In another example, the normality of the distribution of systolic blood pressure is examined.

MyUnexpDist2 <- acc_shape_or_scale(resp_vars  = "SBP_0.2", 
                                   guess      = TRUE, 
                                   label_col  =  "LABEL",
                                   dist_col   = "DISTRIBUTION",
                                   study_data = sd1, 
                                   meta_data  = md1)

MyUnexpDist2$SummaryPlot

The results reveal a slight discrepancy from the normality assumption. It is up to the person responsible for data quality assessments to decide whether such a discrepancy is of relevance.


End digit preferences

The analysis of end digit preferences is a specific check of observed against expected end digit distributions.

In this example, the uniformity of the end digits of body height is examined. In SHIP-0, body height was a measurement that required manual reading and transfer of the value into an eCRF.

MyEndDigits <- acc_end_digits(resp_vars  = "BODY_HEIGHT_0", 
                              label_col  = LABEL,
                              study_data = sd1, 
                              meta_data  = md1)

MyEndDigits$SummaryPlot

The graph highlights no effects of relevance across the ten categories.
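Conceptually, such a check compares the observed end-digit frequencies with a uniform expectation, e.g. via a chi-squared test. A minimal base-R sketch on toy values (not acc_end_digits()'s implementation) is:

```r
# Toy body heights in cm; the digit after the decimal point is the end digit
heights   <- c(171.4, 168.2, 182.4, 175.4, 160.9, 158.4)
end_digit <- round(heights * 10) %% 10

# Counts per possible end digit 0-9
tab <- table(factor(end_digit, levels = 0:9))

# chisq.test(tab) would test the counts against the uniform expectation
```

A visible preference for particular digits (here, an excess of 4s) would suggest rounding or reading habits during manual data entry.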

Output within the accuracy dimension frequently combines descriptive and inferential content. This is necessary to support valid conclusions on data quality issues. Further details on all functions can be obtained by following the links and via the software section of the data quality web page.