This document describes the generation of a SHIP data quality report using a 50% random sample of data from the Study of Health in Pomerania (SHIP-0, 1997-2001) examination. For further information on this cohort study, please see Völzke et al. To secure anonymity and for illustrative purposes, some noise has been introduced to the data.
The first step in the data quality assessment workflow addresses the compliance of the submitted study data with the respective metadata regarding formal and structural requirements. Both study data and metadata need to be provided as data frames.
The metadata file is the primary point of reference for generating data quality reports:
In this example, the SHIP data are loaded from the package's example data:
sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))
The imported study data consists of:
Similarly, the respective metadata are loaded from the package:
md1 <- readRDS(system.file("extdata", "ship_meta.RDS", package = "dataquieR"))
The imported metadata provide information for:
An identical number of variables in both files is desirable but not necessary. Attributes, i.e. columns in the metadata, comprise information on each variable of the study data file such as labels or admissibility limits.
Now the actual integrity checks start with a call of the function pro_applicability_matrix(). The data quality indicators covered by this function are:
pro_applicability_matrix() generates a heatmap-like plot for the applicability of all dataquieR functions to the study data, using the provided metadata as a point of reference:
appmatrix <- pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL, split_segments = TRUE)
The heatmap can be retrieved by using the command:
As the formal argument split_segments is used in the call above, all output is organized by study segments defined in the metadata. In this case, there are data from four examination segments: the computer-assisted interview; intro (basic information on the participants, such as sociodemographic information and the examination date); laboratory variables; and the somatometric examination. The assignment of variables to segments is provided with the metadata file.
The results of the applicability checks are of a technical nature, i.e. the function compares, for example, the data type defined in the metadata with that observed in the study data. The light blue areas indicate that additional checks would be possible for many variables if additional metadata were provided.
It is not advisable to apply all technically feasible data quality implementations to all study data variables. For example, detection limits are not at all meaningful for participant IDs. However, the variable ID is represented in an integer format, which technically allows a check on detection limits.
All data type issues found by pro_applicability_matrix() should be checked data element by data element. Regarding the variable WAIST_CIRC_0, a major issue was found: the variable is represented in the study data with the data type character, which differs from the expected data type float as defined in the metadata. Some basic checks reveal the misuse of a comma as the decimal delimiter.
To correct this issue in R, a simple conversion of WAIST_CIRC_0 to the data type numeric would coerce the affected values to NA, which should be avoided. Instead, we replaced the comma with the correct delimiter and then corrected the data type without loss of data values. The resulting applicability plot shows no more issues.
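A minimal sketch of such a correction in base R; the values below are toy examples, not the actual SHIP measurements:

```r
# Toy character values with a comma as the decimal delimiter
# (invented for illustration; not the actual WAIST_CIRC_0 data)
waist_raw <- c("88,5", "101,2", "75,0")

# Direct coercion would produce NAs; replacing the comma first avoids data loss
waist_num <- as.numeric(gsub(",", ".", waist_raw, fixed = TRUE))
waist_num
```

An analogous substitution applied to the WAIST_CIRC_0 column of the study data achieves the correction described above.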
pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL)$ApplicabilityPlot
The next major step in the data quality assessment workflow is to assess the occurrence and patterns of missing data.
The sequence of checks in this example is ordered according to common stages of a data collection:
|Unit missingness|Subjects without information on any of the provided study variables|
|Segment missingness|Subjects without information on all variables of a defined study segment (e.g. some examination)|
|Item missingness|Subjects without information on single data fields within segments|
Following this sequence enables the calculation of correct denominators for item missingness. This is particularly important for complex cohort studies in which different levels of examination programs are conducted. For example, only half of a study population might be scheduled for an MRI examination. For the remaining 50%, the respective MRI variables are, by study design, not populated. This should be considered when item missingness is examined.
This check identifies subjects without any measurements on the provided target variables for a data quality check.
The interpretation of findings depends on the scope of the provided variables and data records. In this example, the study data set comprises examined SHIP participants, not the target sample. Accordingly, the check is not about study participation. Rather, it identifies cases for which unexpectedly no information has been provided at all. Any identified case would indicate a data management problem.
The covered indicator by com_unit_missingness() is:
Unit missingness can be assessed by using the command:
my_unit_missings2 <- com_unit_missingness(study_data = sd1, meta_data = md1, label_col = LABEL, id_vars = "ID")
In total, 0 units in these data have missing values in all variables of the study data. Thus, for each participant, there is at least one variable with information.
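The logic of this check can be sketched in base R on a toy data frame (variable names and values invented for illustration):

```r
# Toy study data: row 2 has no information on any target variable
df <- data.frame(ID = 1:4,
                 v1 = c(1, NA, NA, 4),
                 v2 = c(NA, NA, 3, 4))

target_vars  <- c("v1", "v2")
unit_missing <- rowSums(!is.na(df[target_vars])) == 0  # TRUE if all values are NA

sum(unit_missing)  # number of units without any information
```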
Subsequently, a check is performed that identifies subjects without any measurements within each of the four defined study segments.
The covered indicator by com_segment_missingness() is:
In this example, the function is called with a table output:
MissSegs <- com_segment_missingness(study_data = sd1, meta_data = md1,
                                    threshold_value = 1, direction = "high",
                                    exclude_roles = c("secondary", "process"))
MissSegs$SummaryData
Exploring segment missingness over time requires an additional variable in the study data. Information regarding this variable can be added to the metadata using the function prep_add_to_meta():
# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))

# add metadata for this variable
md1 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exyear",
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_YEAR_0",
                                   VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
                                   VARIABLE_ROLE = "process",
                                   meta_data = md1)
With a discretized variable for examination year (EX_YEAR_0) the occurrence pattern by year can subsequently be assessed using the command
MissSegs <- com_segment_missingness(study_data = sd1, meta_data = md1,
                                    threshold_value = 1, label_col = LABEL,
                                    group_vars = "EX_YEAR_0", direction = "high",
                                    exclude_roles = "process")
MissSegs$SummaryPlot
The plot is a descriptor, assigned to the indicator:
It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th year.
Finally, in the completeness dimension, a check is performed to identify subjects with missing information in the variables of all study segments. The covered indicators by the function com_item_missingness() are:
Item missingness can be assessed by using the following call:
item_miss <- com_item_missingness(study_data = sd1, meta_data = md1,
                                  show_causes = TRUE, cause_label_df = code_labels,
                                  label_col = "LABEL", include_sysmiss = TRUE,
                                  threshold_value = 95)
A result overview can be obtained by requesting a summary table of this function:
The table provides one line for each of the 29 variables. Of particular interest are:
The table shows that one variable is affected by many missing values: HOUSE_INCOME_MONTH_0, the net household income. In addition, the age of onset of diabetes (DIAB_AGE_ONSET_0) was coded for only 173 subjects, but most values are missing because of an intended jump.
If Jump-codes have been used, e.g. for the variable CONTRACEPTIVE_EVER_0, the denominator for the calculation of item missingness is corrected by the number of used Jump-codes.
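The denominator correction can be sketched as follows; the jump code 99980 and all values are invented for illustration:

```r
# Toy vector: two values carry an (invented) jump code, two are truly missing
x         <- c(1, 2, NA, 99980, 99980, NA, 3)
jump_code <- 99980

n_jump   <- sum(x == jump_code, na.rm = TRUE)  # missing by design
n_total  <- length(x) - n_jump                 # corrected denominator
n_miss   <- sum(is.na(x))                      # observed missing values
miss_pct <- 100 * n_miss / n_total
miss_pct
```

Without the correction, the by-design gaps would inflate the denominator and understate the true item missingness.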
The summary plot provides a different view on missing data by showing the frequency of each specified reason for missing data:
The balloon size is determined by the number of missing data fields.
It can now be inferred that, for example, the elevated number of missing values for the item HOUSE_INCOME_MONTH_0 is mainly caused by refusals of participants to answer the respective question.
After completeness, consistency is examined as the first part of the data quality dimension correctness. The removal of Missing- and Jump-codes is one prerequisite for the application of correctness checks. Consistency, as the first main aspect of correctness, describes the degree to which data values are free of breaks in conventions or contradictions. Different data types may be addressed in the respective checks.
The covered indicator by con_limit_deviations() when specifying limits = "HARD_LIMITS" is:
When specifying the formal argument limits = "SOFT_LIMITS", the check identifies not inadmissible but uncertain values, according to the specified ranges. The related indicator then is:
The call in this example with regard to inadmissible numerical values is:
MyValueLimits <- con_limit_deviations(study_data = sd1, meta_data = md1, label_col = "LABEL", limits = "HARD_LIMITS")
Subsequently, a table output may be requested. It provides the number and percentage of all range violations per variable:
The last column of the table also provides a grading. If the percentage of violations is above the specified threshold, a problem grading (= 1) is assigned; in this case, any occurrence is classified as problematic. Otherwise, the grading is 0.
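The grading rule described above can be sketched as follows (the percentages and the threshold value are chosen for illustration):

```r
# Toy percentages of limit violations per variable
violation_pct <- c(0, 0.4, 2.1)
threshold     <- 0   # any occurrence counts as problematic

grading <- as.integer(violation_pct > threshold)
grading
```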
The following statement assigns all variables identified as problematic to the R object whichdeviate to subsequently enable a more targeted output, for example a plot of the distributions of any variable with violations along the specified limits:
# select variables with deviations
whichdeviate <- as.character(MyValueLimits$SummaryTable$Variables)[MyValueLimits$SummaryTable$GRADING == 1]
In this case, the plot has been restricted to the variables with limit deviations, i.e. those with a grading of 1 in the table above.
(Only the first two are displayed to reduce file size.)
A comparable check may be performed for categorical variables using the command con_inadmissible_categorical().
The covered indicator is:
The call is:
IAVCatAll <- con_inadmissible_categorical(study_data = sd1, meta_data = md1, label_col = "LABEL")
As with inadmissible numerical values, a table output may be requested. It displays the observed categories, the defined categories, any non-matching levels, their counts, and a grading:
The results show two variables, SCHOOL_GRAD_0 and OBS_SOMA_0, with one inadmissible level each. Regarding the variable OBS_SOMA_0, either the metadata did not include the respective Missing- or Jump-code, or a false code has been used in the study data.
The second main type of checks within the consistency dimension concerns contradictions.
The covered indicators by the command con_contradictions() are:
Rules to identify contradictions must first be loaded from a spreadsheet file. The creation of this spreadsheet is supported by a Shiny app. Overall, 11 different logical comparisons can be applied; an overview is given in the respective tutorial. Each line of the spreadsheet defines one check rule.
checks <- read.csv(system.file("extdata", "ship_contradiction_checks.csv", package = "dataquieR"), header = TRUE, sep = "#")
Subsequently, the command con_contradictions() may be triggered, using the table checks as the point of reference for the check on contradictions:
AnyContradictions <- con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", check_table = checks, threshold_value = 1)
A summary table may be requested to show the number and percentage of contradictions for each defined rule:
In this example, one rule leads to the identification of 35 contradictions: the age of onset of diabetes is provided, but the variable on the presence of diabetes does not indicate a known disease.
The distributions may also be displayed as a plot:
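The logic behind this rule can be sketched in base R; the indicator variable DIAB_KNOWN_0 and all values are invented for illustration (only DIAB_AGE_ONSET_0 appears in the actual study data):

```r
# Toy records: row 3 reports an onset age although no diabetes is indicated
dat <- data.frame(DIAB_KNOWN_0     = c(1, 0, 0, 1),   # hypothetical indicator
                  DIAB_AGE_ONSET_0 = c(45, NA, 51, 60))

contradiction <- !is.na(dat$DIAB_AGE_ONSET_0) & dat$DIAB_KNOWN_0 == 0
sum(contradiction)  # number of contradictory records
```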
The second dimension related to correctness is accuracy. It targets the degree of agreement between observed and expected distributions and associations.
In contrast to most consistency-related indicators, findings here indicate an elevated probability that some data quality issue exists, rather than a certain issue.
Based on statistical criteria, univariate outliers are addressed. The covered indicator is:
UnivariateOutlier <- dataquieR:::acc_robust_univariate_outlier(study_data = sd1, meta_data = md1, label_col = "LABEL")
As with other dataquieR implementations, one output option is a table. It provides descriptive statistics and the detected outliers according to the different criteria:
Outliers according to at least three criteria affect all targeted variables, but only for the variable HDL cholesterol (CHOLES_HDL_0) have two outliers been detected using the sigma-gap criterion.
To obtain better insight into the univariate distributions, a plot can be requested. It highlights, for each variable, observations according to the number of violated rules (only the first four are shown to reduce file size).
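One widely used univariate criterion, Tukey's fences, can be sketched as follows; this is a generic illustration on toy values, not dataquieR's exact implementation:

```r
# Toy measurements with one gross outlier
x <- c(10, 11, 12, 12, 13, 13, 14, 40)

q     <- quantile(x, c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr   # lower fence
upper <- q[[2]] + 1.5 * iqr   # upper fence

x[x < lower | x > upper]      # values outside Tukey's fences
```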
acc_multivariate_outlier() identifies outliers related to the indicator:
acc_multivariate_outlier() uses the same rules as acc_robust_univariate_outlier() for the identification of outliers.
The following function call relates systolic and diastolic blood pressure measurement to age and weight and a table output is created for the number of detected multivariate outliers:
MVO_SBP0.1 <- acc_multivariate_outlier(resp_vars = c("SBP_0.1", "DBP_0.1", "AGE_0", "BODY_WEIGHT_0"),
                                       study_data = sd1, meta_data = md1,
                                       id_vars = "ID", label_col = "LABEL")
MVO_SBP0.1$SummaryTable
The number of outliers varies considerably, depending on the criterion. Subsequently, a parallel coordinate plot may be requested to further inspect the results:
Another example is the inspection of the first and second systolic blood pressure measurements:
MVO_DBP <- acc_multivariate_outlier(resp_vars = c("SBP_0.1", "SBP_0.2"),
                                    study_data = sd1, meta_data = md1,
                                    label_col = "LABEL")
MVO_DBP$SummaryTable
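The general idea behind multivariate outlier detection can be sketched with Mahalanobis distances in base R; the blood pressure values are toy data, and dataquieR's implementation differs in detail:

```r
# Toy blood pressure data: the last observation breaks the joint pattern
dat <- data.frame(sbp = c(120, 125, 130, 135, 140, 118, 128, 138, 122, 200),
                  dbp = c( 80,  82,  85,  88,  90,  79,  84,  89,  81,  60))

# squared Mahalanobis distance of each observation from the multivariate centre
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# one common flagging rule: distances beyond a chi-squared quantile
flag <- d2 > qchisq(0.975, df = ncol(dat))
which.max(d2)  # index of the most extreme observation
```

A point can be an unremarkable value on each axis separately and still be a clear multivariate outlier because it breaks the joint correlation structure.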
acc_distributions() describes distributions using histograms and displays empirical cumulative distribution functions (ECDFs) in case a grouping variable is provided. The function is purely descriptive and as such not related to a specific indicator; rather, its results may be of relevance to most indicators within the unexpected-distributions domain.
The following example examines measurements in which a possible influence of the observers is considered.
ECDFSoma <- acc_distributions(resp_vars = c("WAIST_CIRC_0", "BODY_HEIGHT_0", "BODY_WEIGHT_0"), group_vars = "OBS_SOMA_0", study_data = sd1, meta_data = md1, label_col = "LABEL")
The respective list of plots may be displayed using the following command (to decrease the file size, only the first two plots are shown):
Due to the nested call of the print function, which is applied to each list element, some annotations and messages are suppressed.
acc_margins() is mainly related to the indicators:
However, it also provides descriptive output such as violin plots and box plots for continuous variables, count plots for categorical data, and density plots for both. The main application of acc_margins() is to make inferences on effects related to process variables such as examiners, devices, or study centers. The R function determines whether measurements are provided as continuous or discrete; alternatively, metadata specifications may provide this information.
In the first example, acc_margins() is applied to the variable waist circumference (WAIST_CIRC_0). In this case, dependencies related to the examiners (OBS_SOMA_0) are examined while the raw measurements are controlled for age and sex (AGE_0, SEX_0):
marginal_dists <- acc_margins(resp_vars = "WAIST_CIRC_0", co_vars = c("AGE_0", "SEX_0"), group_vars = "OBS_SOMA_0", study_data = sd1, meta_data = md1, label_col = "LABEL")
A plot may be requested to review the results.
Based on a statistical test, no examiner's mean waist circumference differed significantly (p < 0.05) from the overall mean.
The situation is quite different when assessing the coded myocardial infarction across examiners while controlling for age and sex:
marginal_dists <- acc_margins(resp_vars = "MYOCARD_YN_0", co_vars = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_INT_0", study_data = sd1,
                              meta_data = md1, label_col = "LABEL")
marginal_dists$SummaryPlot
The result shows elevated proportions for the examiners 05 and 07.
An important and related issue is the quantification of the observed examiner effects. This is accomplished by the function acc_varcomp(), related to the indicators:
It computes the percentage of variance of some target variable that is attributable to the grouping variable, here while controlling for the control variables age and sex. The output may be reviewed in table format:
vcs <- acc_varcomp(resp_vars = "WAIST_CIRC_0", co_vars = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_SOMA_0", study_data = sd1,
                   meta_data = md1, label_col = "LABEL")
vcs$SummaryTable
For the variable WAIST_CIRC_0, an ICC of 0.019 has been found which is below the threshold. The same is the case for the variable MYOCARD_YN_0, probably because the case count in the two deviant observers 05 and 07 is low:
vcs <- acc_varcomp(resp_vars = "MYOCARD_YN_0", co_vars = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_INT_0", study_data = sd1,
                   meta_data = md1, label_col = "LABEL")
vcs$SummaryTable
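The concept of an ICC can be sketched with a one-way ANOVA in base R. This illustrates the variance decomposition only; acc_varcomp() additionally adjusts for the control variables. The observer labels and values are toy data:

```r
# Toy measurements from three observers (balanced design, 3 values each)
y     <- c(10, 11, 12, 20, 21, 22, 30, 31, 32)
group <- factor(rep(c("obs1", "obs2", "obs3"), each = 3))

fit <- aov(y ~ group)
ms  <- summary(fit)[[1]][["Mean Sq"]]  # between- and within-group mean squares
k   <- 3                               # observations per group

# share of total variance attributable to the grouping
icc <- (ms[1] - ms[2]) / (ms[1] + (k - 1) * ms[2])
icc
```

Here nearly all variance lies between observers, so the ICC is close to 1; in a well-standardized examination it should be close to 0.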
A particular complexity is the study of effects across groups and time. As a descriptor, this is realized using the function acc_loess(). While it primarily provides information related to the indicator:
it may be used to obtain information with regard to other indicators in the domain unexpected distributions as well. A sample call with graphical output using waist circumference as the target variable is:
timetrends <- acc_loess(resp_vars = "WAIST_CIRC_0", co_vars = c("AGE_0", "SEX_0"),
                        group_vars = "OBS_SOMA_0", time_vars = "EXAM_DT_0",
                        study_data = sd1, meta_data = md1, label_col = "LABEL")
invisible(lapply(timetrends$SummaryPlotList, print))
The graph for this variable indicates no major discrepancies between the observers over the examination period.
Assessing the shape of a distribution is, next to location parameters, an important aspect of accuracy.
The related indicator is:
Observed distributions can be tested against expected distributions using the function acc_shape_or_scale().
In this example, the uniform distribution of the use of measurement devices is examined:
MyUnexpDist1 <- acc_shape_or_scale(resp_vars = "DEV_BP_0", guess = TRUE,
                                   label_col = "LABEL", dist_col = "DISTRIBUTION",
                                   study_data = sd1, meta_data = md1)
MyUnexpDist1$SummaryPlot
The plot illustrates that devices have not been used with comparable frequencies.
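The underlying idea, a goodness-of-fit test against equal expected frequencies, can be sketched with a chi-squared test on toy device counts (not the actual SHIP data):

```r
# Toy usage counts for three devices
device_counts <- c(dev1 = 180, dev2 = 150, dev3 = 90)

# chisq.test defaults to equal expected probabilities, i.e. a uniform distribution
test <- chisq.test(device_counts)
test$p.value  # small p-values speak against uniform device use
```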
In another example, the normal distribution of blood pressure is examined:
MyUnexpDist2 <- acc_shape_or_scale(resp_vars = "SBP_0.2", guess = TRUE,
                                   label_col = "LABEL", dist_col = "DISTRIBUTION",
                                   study_data = sd1, meta_data = md1)
MyUnexpDist2$SummaryPlot
The results reveal a slight discrepancy from the normality assumption. It is up to the person responsible for data quality assessments to decide whether such a discrepancy is of relevance.
The analysis of end digit preferences is a specific implementation related to the indicator:
In this example, the uniform distribution of the end digits of body height is examined. Body height in SHIP-0 was a measurement that required manual reading and transfer of the data into an eCRF.
MyEndDigits <- acc_end_digits(resp_vars = "BODY_HEIGHT_0", label_col = LABEL,
                              study_data = sd1, meta_data = md1)
MyEndDigits$SummaryPlot
The graph highlights no effects of relevance across the ten categories.
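The extraction step behind such a check can be sketched in base R with invented height values:

```r
# Toy body height measurements in millimetres (invented values)
height_mm <- c(1723, 1688, 1745, 1701, 1699, 1732, 1687, 1754, 1716, 1708)

end_digit <- height_mm %% 10   # final digit of each measurement
table(end_digit)

# with sufficient observations, uniformity of the digit frequencies could be
# tested, e.g. via chisq.test(table(end_digit))
```

Marked peaks at 0 or 5 in the digit frequencies would suggest rounding by the examiners during manual reading.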
Output within the accuracy dimension frequently combines descriptive and inferential content. This is necessary to support valid conclusions about data quality issues. Further details on all functions can be obtained by following the links and via the software section of the data quality web page.