This document describes the generation of a quality report using a 50% random sample of data from the Study of Health in Pomerania (SHIP-0, 1997-2001) examination. For further information on this cohort study please see Völzke et al. 2010. Some noise has been introduced to the data to secure anonymity and for illustrative purposes.
The first step in the data quality assessment workflow addresses the compliance of the submitted study data with the respective metadata, regarding formal and structural requirements. Both data and metadata need to be provided as data frames.
Note:
The metadata file is the primary point of reference for generating data quality reports:
In this example, the SHIP data are loaded from the dataquieR package:
sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))
The imported study data consists of:
Similarly, the respective metadata must be loaded from dataquieR:
md1 <- readRDS(system.file("extdata", "ship_meta.RDS", package = "dataquieR"))
The imported metadata provide information for:
An identical number of variables in both files is desirable but not necessary. Attributes (i.e. columns in the metadata) comprise information on each variable of the study data file, such as labels or admissibility limits.
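A quick cross-check of both objects can be made directly in R. The following sketch only assumes that each metadata row describes one study variable:

# sketch: each row of the metadata should describe one study variable
dim(sd1)    # observations and variables in the study data
dim(md1)    # described variables and their attributes in the metadata
names(md1)  # available metadata attributes, e.g. LABEL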
The integrity check starts by calling the function pro_applicability_matrix(). The data quality indicators covered by this function are:
pro_applicability_matrix() generates a heatmap-like plot for the applicability of all dataquieR functions to the study data, using the provided metadata as a point of reference:
appmatrix <- pro_applicability_matrix(study_data = sd1,
                                      meta_data = md1,
                                      label_col = LABEL,
                                      split_segments = TRUE)
The heatmap can be retrieved by the command:
appmatrix$ApplicabilityPlot
As split_segments = TRUE
was used as an argument, all
output is organized by the study segments defined in the metadata. In
this case, there are data from four examination segments: the
computer-assisted interview, intro (basic information on the
participants, such as sociodemographic information and examination
date), laboratory variables, and somatometric examination. The
assignment of variables to segments is done in the metadata file.
The results of the applicability checks are technical, i.e. the function compares, for example, the data type defined in the metadata with the one observed in the study data. The light blue areas indicate that additional checks would be possible for many variables if additional metadata were provided.
Note:
Applying all technically feasible data quality implementations to all study data variables is not advisable. For example, detection limits are not meaningful for participants' IDs. However, the variable ID is represented in an integer format, which technically allows checking detection limits.
All datatype issues found by pro_applicability_matrix()
should be checked data element by data element. For instance, a major
issue was found in the variable WAIST_CIRC_0. This variable is
represented in the study data with datatype character, which
differs from the expected datatype float defined in the
metadata. Some basic checks show the misuse of a comma as the decimal
delimiter.
To correct this issue, a direct conversion of WAIST_CIRC_0 to datatype numeric would coerce the affected values to NA, which should be avoided. Hence, we replaced the comma with the correct delimiter and corrected the datatype without losing data values. The resulting applicability plot shows no more issues.
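The correction can be sketched as follows; the column name used below is illustrative only, as the actual name of WAIST_CIRC_0 in the study data has to be looked up via the metadata (VAR_NAMES):

# sketch of the correction (illustrative column name):
# replace the decimal comma and convert the datatype without losing values
sd1$waist_circ <- as.numeric(gsub(",", ".", sd1$waist_circ, fixed = TRUE))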
pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL)$ApplicabilityPlot
The next major step in the data quality assessment workflow is to assess the occurrence and patterns of missing data.
The sequence of checks in this example is ordered according to common stages of a data collection:
| Level | Description |
|---|---|
| Unit missingness | Subjects without information on any of the provided study variables |
| Segment missingness | Subjects without information for all variables of a defined study segment (e.g. some examination) |
| Item missingness | Subjects without information on data fields within segments |
Following this sequence enables the calculation of correct denominators for item missingness. This is particularly important for complex cohort studies in which different levels of examination programs are conducted. For example, only half of a study population might be foreseen for an MRI examination. In the remaining 50%, the respective MRI variables are, by study design, not populated. This should be considered when item missingness is examined.
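The effect of the denominator can be illustrated with a minimal, hypothetical example in which only half of the participants are foreseen for the MRI examination:

# hypothetical example: two of four participants are foreseen for MRI
mri <- data.frame(eligible = c(1, 1, 0, 0), value = c(2.1, NA, NA, NA))
mean(is.na(mri$value))                     # 0.75 when all participants are counted
mean(is.na(mri$value[mri$eligible == 1]))  # 0.50 with the correct denominator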
This check identifies subjects without any measurements on the provided target variables for a data quality check.
Note:
The interpretation of findings depends on the scope of the provided variables and data records. In this example, the study data set comprises examined SHIP participants, not the target sample. Accordingly, the check is not about study participation. Rather, it identifies cases for which unexpectedly no information has been provided at all. Any identified case would indicate a data management problem.
The indicator covered by com_unit_missingness() is:
Unit missingness can be assessed by using the command:
my_unit_missings2 <- com_unit_missingness(study_data = sd1,
                                          meta_data = md1,
                                          label_col = LABEL,
                                          id_vars = "ID")
In total, 0 units in these data have missing values in all variables of the study data. Thus, for each participant there is at least one variable with information.
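Assuming the result structure common to dataquieR functions, the affected units (if any) can also be inspected in the returned object:

# sketch: inspect the result object (assuming a SummaryData element is returned)
my_unit_missings2$SummaryData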
Subsequently, a check is performed that identifies subjects without any measurements within each of the four defined study segments.
The indicator covered by com_segment_missingness() is:
In this example, the call with a table output is used:
MissSegs <- com_segment_missingness(study_data = sd1,
                                    meta_data = md1,
                                    threshold_value = 1,
                                    direction = "high",
                                    exclude_roles = c("secondary", "process"))
MissSegs$SummaryData
Exploring segment missingness over time requires another variable in the study data. Information regarding this variable can be added to the metadata using the dataquieR function prep_add_to_meta():
# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))
# add metadata for this variable
md1 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exyear",
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_YEAR_0",
                                   VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
                                   VARIABLE_ROLE = "process",
                                   meta_data = md1)
With a discretized variable for examination year (EX_YEAR_0) the occurrence pattern by year can subsequently be assessed using the command com_segment_missingness():
MissSegs <- com_segment_missingness(study_data = sd1,
                                    meta_data = md1,
                                    threshold_value = 1,
                                    label_col = LABEL,
                                    group_vars = "EX_YEAR_0",
                                    direction = "high",
                                    exclude_roles = "process")
MissSegs$SummaryPlot
The plot is a descriptor, assigned to the indicator:
It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th year.
Finally, in the completeness dimension, a check is performed to identify subjects with missing information in variables of all study segments. The indicators covered by the function com_item_missingness() are:
Item missingness can be assessed by using the following call:
item_miss <- com_item_missingness(study_data = sd1,
                                  meta_data = md1,
                                  show_causes = TRUE,
                                  cause_label_df = code_labels,
                                  label_col = "LABEL",
                                  include_sysmiss = TRUE,
                                  threshold_value = 95)
A result overview can be obtained by requesting a summary table of this function:
item_miss$SummaryTable
The table provides one line for each of the 29 variables. Of particular interest are:
The table shows that one variable is affected by many missing values: HOUSE_INCOME_MONTH_0 on the net household income. In addition, age at onset of diabetes (DIAB_AGE_ONSET_0) was only coded for 173 subjects, but most values are missing because of an intended jump.
Note:
In case Jump-codes have been used, e.g. for the variable CONTRACEPTIVE_EVER_0, the denominator for the calculation of item missingness is corrected for the number of Jump-codes used.
The summary plot provides a different view on missing data by showing the frequency of the specified reasons for missing data:
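Assuming the standard SummaryPlot element of the result object, this plot can be requested with:

item_miss$SummaryPlot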
The balloon size is determined by the number of missing data fields.
It can now be inferred that, for example, the elevated number of missing values for the item HOUSE_INCOME_MONTH_0 is mainly caused by refusals of participants to answer the respective question.
After completeness has been examined, consistency is targeted as the first part of the data quality dimension Correctness. The removal of Missing- and Jump-codes is one prerequisite for the application of correctness checks. Consistency, as the first main aspect of correctness, describes the degree to which data values are free of breaks in conventions or contradictions. Different data types may be addressed in respective checks.
The indicator covered by con_limit_deviations() when specifying limits = "HARD_LIMITS" is:
Note:
When specifying limits = "SOFT_LIMITS", the check identifies not inadmissible but uncertain values, according to the specified ranges. The related indicator then is:
The call in this example with regard to inadmissible numerical values is:
MyValueLimits <- con_limit_deviations(study_data = sd1,
                                      meta_data = md1,
                                      label_col = "LABEL",
                                      limits = "HARD_LIMITS")
Subsequently, a table output may be requested. It provides the number and percentage of all range violations for the checked variables:
MyValueLimits$SummaryTable
The last column of the table also provides a GRADING. If the percentage of violations is above some threshold, a problem GRADING (=1) is assigned. In this case any occurrence is classified as problematic. Otherwise the GRADING is 0.
The following statement assigns all variables identified as problematic to the R object whichdeviate to subsequently enable a more targeted output, for example a plot of distributions for any variable with violations of the specified limits:
# select variables with deviations
whichdeviate <- as.character(MyValueLimits$SummaryTable$Variables)[MyValueLimits$SummaryTable$GRADING == 1]
In this case, the plot has been restricted to the variables with limit deviations, i.e. those with a GRADING of 1 in the table above (only the first two are displayed to reduce file size).
head(MyValueLimits$SummaryPlotList[whichdeviate], 2)
A comparable check may be performed for categorical variables using the command con_inadmissible_categorical():
The covered indicator is:
The call is:
IAVCatAll <- con_inadmissible_categorical(study_data = sd1,
                                          meta_data = md1,
                                          label_col = "LABEL")
As with inadmissible numerical values, a table output may be requested. It displays the observed categories, the defined categories, any non-matching levels, their counts, and a GRADING:
IAVCatAll$SummaryTable
The results show that there are two variables, SCHOOL_GRAD_0 and OBS_SOMA_0, with one inadmissible level occurring for each variable. Regarding the variable OBS_SOMA_0, either the metadata did not include the respective Missing- or Jump-code, or a false code has been used in the study data.
The second main type of checks within the consistency dimension concerns contradictions.
The indicators covered by the command con_contradictions() are:
Rules to identify contradictions must first be loaded from a spreadsheet file. The creation of this spreadsheet is supported by a Shiny app. Overall, 11 different logical comparisons can be applied; an overview is given in the respective tutorial. Each line within the spreadsheet defines one check rule.
checks <- read.csv(system.file("extdata",
                               "ship_contradiction_checks.csv",
                               package = "dataquieR"),
                   header = TRUE, sep = "#")
Subsequently, the command con_contradictions() may be called, using the table checks as the point of reference for the contradiction checks:
AnyContradictions <- con_contradictions(study_data = sd1,
                                        meta_data = md1,
                                        label_col = "LABEL",
                                        check_table = checks,
                                        threshold_value = 1)
A summary table may be requested to show the number and percentage of contradictions for each defined rule:
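Assuming the usual result structure, the table can be obtained with:

AnyContradictions$SummaryTable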
In this example, one rule leads to the identification of 35 contradictions: Age onset for diabetes is provided but the variable on the presence of diabetes does not indicate a known disease.
The distributions may also be displayed as a plot:
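Again assuming the standard result element, the plot can be displayed with:

AnyContradictions$SummaryPlot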
The second dimension related to correctness is accuracy. It targets the degree of agreement between observed and expected distributions and associations.
In contrast to most consistency-related indicators, findings indicate an elevated probability that some data quality issue exists, rather than a certain issue.
Based on statistical criteria, univariate outliers are addressed. The covered indicator is:
The function acc_robust_univariate_outlier() identifies outliers according to the approaches of Tukey, SixSigma, Hubert, and the heuristic approach of SigmaGap. It may be called as follows:
UnivariateOutlier <- dataquieR:::acc_robust_univariate_outlier(study_data = sd1,
                                                               meta_data = md1,
                                                               label_col = "LABEL")
As with other dataquieR implementations, one output option is a table. It provides descriptive statistics and detected outliers according to the different criteria:
UnivariateOutlier$SummaryTable
Outliers according to at least three criteria affect all targeted variables, but only for the variable HDL cholesterol (CHOLES_HDL_0) have two outliers been detected using the Sigma-gap criterion.
To obtain better insight into the univariate distributions, a plot can be requested. It highlights observations for each variable according to the number of violated rules (only the first 4 are shown to reduce file size).
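Assuming the plot list element of the result object, the restricted output can be requested with:

head(UnivariateOutlier$SummaryPlotList, 4)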
The function acc_multivariate_outlier() identifies outliers related to the indicator:
acc_multivariate_outlier() uses the same rules as acc_robust_univariate_outlier() for the identification of outliers.
The following function call relates the systolic and diastolic blood pressure measurements to age and body weight, and a table output is created for the number of detected multivariate outliers:
MVO_SBP0.1 <- acc_multivariate_outlier(resp_vars = c("SBP_0.1", "DBP_0.1", "AGE_0", "BODY_WEIGHT_0"),
                                       study_data = sd1,
                                       meta_data = md1,
                                       id_vars = "ID",
                                       label_col = "LABEL")
MVO_SBP0.1$SummaryTable
The number of outliers varies considerably, depending on the criterion. Subsequently, a parallel coordinate plot may be requested to further inspect the results:
MVO_SBP0.1$SummaryPlot
Another example is the inspection of the first and second systolic blood pressure measurements:
MVO_DBP <- acc_multivariate_outlier(resp_vars = c("SBP_0.1", "SBP_0.2"),
                                    study_data = sd1,
                                    meta_data = md1,
                                    label_col = "LABEL")
MVO_DBP$SummaryTable
MVO_DBP$SummaryPlot
The function acc_distributions() describes distributions using histograms and displays empirical cumulative distribution functions (ECDF) in case a grouping variable is provided. The function is only descriptive and as such not related to a specific indicator. Rather, results may be of relevance to most indicators within the unexpected distributions domain.
The following example examines measurements in which a possible influence of the observers is considered.
ECDFSoma <- acc_distributions(resp_vars = c("WAIST_CIRC_0", "BODY_HEIGHT_0", "BODY_WEIGHT_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data = sd1,
                              meta_data = md1,
                              label_col = "LABEL")
The respective list of plots may be displayed using the following command (to decrease the file size, only the first 2 plots are shown):
invisible(lapply(head(ECDFSoma$SummaryPlotList,2), print))
In this nested call, the print function is applied to each list element, and the surrounding invisible() suppresses some annotations and messages.
The function acc_margins() is mainly related to the indicators:
However, it also provides descriptive output such as violin plots and box plots for continuous variables, count plots for categorical data, and density plots for both. The main application of acc_margins() is to make inference on effects related to process variables such as examiners, devices, or study centers. The R function determines whether measurements are provided as continuous or discrete. Alternatively, metadata specifications may provide this information.
In the first example, acc_margins() is applied to the variable waist circumference (WAIST_CIRC_0). In this case, dependencies related to the examiners (OBS_SOMA_0) are examined while the raw measurements are controlled for the variables age and sex (AGE_0, SEX_0):
marginal_dists <- acc_margins(resp_vars = "WAIST_CIRC_0",
                              co_vars = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data = sd1,
                              meta_data = md1,
                              label_col = "LABEL")
A plot may be requested to review the results:
marginal_dists$SummaryPlot
Based on a statistical test, no examiner's mean waist circumference differed significantly (p < 0.05) from the overall mean.
The situation is quite different when assessing the coded myocardial infarction across examiners while controlling for age and sex:
marginal_dists <- acc_margins(resp_vars = "MYOCARD_YN_0",
                              co_vars = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_INT_0",
                              study_data = sd1,
                              meta_data = md1,
                              label_col = "LABEL")
marginal_dists$SummaryPlot
The result shows elevated proportions for the examiners 05 and 07.
An important and related issue is the quantification of the observed examiner effects:
This is accomplished by the function acc_varcomp(), related to the indicators:
It computes the percentage of variance of the target variable that is attributable to the grouping variable, while controlling for the control variables (age and sex). The output may be reviewed in a table format:
vcs <- acc_varcomp(resp_vars = "WAIST_CIRC_0",
                   co_vars = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_SOMA_0",
                   study_data = sd1,
                   meta_data = md1,
                   label_col = "LABEL")
vcs$SummaryTable
For the variable WAIST_CIRC_0, an ICC of 0.019 has been found, which is below the threshold. The same is the case for the variable MYOCARD_YN_0, probably because the case count for the two deviant observers 05 and 07 is low:
vcs <- acc_varcomp(resp_vars = "MYOCARD_YN_0",
                   co_vars = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_INT_0",
                   study_data = sd1,
                   meta_data = md1,
                   label_col = "LABEL")
vcs$SummaryTable
A particular complexity is the study of effects across groups and time. As a descriptor, this is realized using the function acc_loess(). While providing primarily information related to the indicator:
it may be used to obtain information with regard to other indicators in the domain unexpected distributions as well. A sample call with graphical output using waist circumference as the target variable is:
timetrends <- acc_loess(resp_vars = "WAIST_CIRC_0",
                        co_vars = c("AGE_0", "SEX_0"),
                        group_vars = "OBS_SOMA_0",
                        time_vars = "EXAM_DT_0",
                        study_data = sd1,
                        meta_data = md1,
                        label_col = "LABEL")
invisible(lapply(timetrends$SummaryPlotList, print))
The graph for this variable indicates no major discrepancies between the observers over the examination period.
Assessing the shape of a distribution is, next to location parameters, an important aspect of accuracy.
The related indicator is:
Observed distributions can be tested against expected distributions using the function acc_shape_or_scale().
In this example the uniform distribution for the use of measurement devices can be examined.
MyUnexpDist1 <- acc_shape_or_scale(resp_vars = "DEV_BP_0",
                                   guess = TRUE,
                                   label_col = "LABEL",
                                   dist_col = "DISTRIBUTION",
                                   study_data = sd1,
                                   meta_data = md1)
MyUnexpDist1$SummaryPlot
The plot illustrates that devices have not been used with comparable frequencies.
In another example the normal distribution of blood pressure is examined.
MyUnexpDist2 <- acc_shape_or_scale(resp_vars = "SBP_0.2",
                                   guess = TRUE,
                                   label_col = "LABEL",
                                   dist_col = "DISTRIBUTION",
                                   study_data = sd1,
                                   meta_data = md1)
MyUnexpDist2$SummaryPlot
The results reveal a slight discrepancy from the normality assumption. It is up to the person responsible for data quality assessments to decide whether such a discrepancy is of relevance.
The analysis of end digit preferences is a specific implementation related to the indicator:
In this example the uniform distribution of the end digits of body height is examined. Body height in SHIP-0 was a measurement which required manual reading and transfer of the data into an eCRF.
MyEndDigits <- acc_end_digits(resp_vars = "BODY_HEIGHT_0",
                              label_col = LABEL,
                              study_data = sd1,
                              meta_data = md1)
MyEndDigits$SummaryPlot
The graph highlights no effects of relevance across the ten categories.
Output within the accuracy dimension frequently combines descriptive and inferential content. This seemed necessary to support valid conclusions on data quality issues. Further details on all functions can be obtained by following the links and via the software section of the data quality web page.