Introduction

This document illustrates the use of metadata for data quality (DQ) assessments. Metadata are considered “data that describe other data” (Nadkarni 2011). They provide information to support the correct interpretation of study data and to guide statistical analyses. The focus in this document is on metadata related to single variables, since most DQ assessments focus on this structural level of data. Examples of metadata are lists of value codes used to examine reasons for incomplete data, or value labels that support interpretable reports. Some metadata are specific to certain DQ assessments, while others are used across DQ implementations. This is detailed below.

For further information on metadata, please see Richter et al. (2019).


Storage of metadata

Metadata are commonly stored in so-called data dictionaries (DDs). DDs frequently contain, for example, the name of a variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). DDs should be available for study data in each research study. However, DDs often host only a subset of the information necessary for data quality assessments. A natural consequence is to extend DDs with aspects related to data quality.

If this is not possible, metadata may also be stored in a spreadsheet-type format, for example as data frames. dataquieR uses predefined metadata provided as data frames, as described below.


How dataquieR uses metadata

The R-package dataquieR uses predefined metadata in two ways:

  1. For each variable of the study data that is named in a function call of a DQ implementation, the respective metadata are read from a data frame of metadata.

  2. Some implementations also search for relations between variables, such as a date-time stamp that belongs to a measurement. The definition of such relations is explained in the paragraph KEY-COLUMNS.

Therefore, metadata and study data must be in 1:1 correspondence, i.e. each variable of the study data is identifiable in the metadata. The key for this mapping is the variable name, which is listed in the column:

\(\Rightarrow\) VAR_NAMES

in the metadata. A necessary convention regarding variable names is their uniqueness, i.e. none of the variable names should have a duplicate. Further, to better distinguish between column names in metadata and study data, all columns of the metadata are defined in upper-case letters.

The 1:1 correspondence implies that each variable name is unique.
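
The mapping rule can be checked mechanically before any DQ function is called. The following is a minimal sketch in Python, not part of dataquieR (which is an R package); the function name check_var_names and the example inputs are hypothetical.

```python
from collections import Counter


def check_var_names(study_columns, metadata_var_names):
    """Report problems that break the 1:1 mapping between study data and metadata.

    study_columns: column names of the study data.
    metadata_var_names: entries of the VAR_NAMES column in the metadata.
    """
    counts = Counter(metadata_var_names)
    # A variable name listed more than once in VAR_NAMES violates uniqueness.
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    # A study variable without a metadata row cannot be interpreted.
    missing_in_metadata = sorted(set(study_columns) - set(metadata_var_names))
    return {"duplicates": duplicates, "missing_in_metadata": missing_in_metadata}


# Hypothetical inputs: v00002 is listed twice and v00003 has no metadata row.
study_columns = ["v00000", "v00001", "v00002", "v00003"]
metadata_var_names = ["v00000", "v00001", "v00002", "v00002"]
problems = check_var_names(study_columns, metadata_var_names)
```

Both lists are empty when the correspondence holds.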


Metadata for data quality reporting


VARIABLE AND VALUE LABELS

Appropriate labels are a necessary precondition for readable data quality reports. Their absence does not affect the functionality of statistical implementations.

CAVEAT: A necessary convention for all labels in the current project phase is that they are unique and short. This is necessary because overly long labels may corrupt reports.

LABEL

Assigning labels to variables is important because variable names in the study data are rather technical and hinder interpretation. As is the case for variable names, each variable label should be unique. In addition, labels should be as short as possible to ensure a readable output.

To enhance presentation and plotting quality, a label specified in LABEL should not exceed 20 characters.

VAR_NAMES LABEL
v00000 CENTER_0
v00001 PSEUDO_ID
v00002 SEX_0
v00003 AGE_0
v00103 AGE_GROUP_0

All implementations of dataquieR support the use of LABELS.
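
The two label conventions (uniqueness and a maximum of 20 characters) are easy to verify before a report is generated. This is a small Python sketch of such a check, not dataquieR functionality; check_labels is a hypothetical helper, and the labels are the ones from the table above.

```python
MAX_LABEL_LENGTH = 20  # recommended maximum stated in the text


def check_labels(labels, max_length=MAX_LABEL_LENGTH):
    """Return (duplicated labels, labels longer than max_length)."""
    seen = set()
    duplicated = []
    too_long = []
    for label in labels:
        if label in seen and label not in duplicated:
            duplicated.append(label)
        seen.add(label)
        if len(label) > max_length:
            too_long.append(label)
    return duplicated, too_long


labels = ["CENTER_0", "PSEUDO_ID", "SEX_0", "AGE_0", "AGE_GROUP_0"]
duplicated, too_long = check_labels(labels)
```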

LONG_LABEL

Under some circumstances a short label or variable name is insufficient to provide all necessary information. The column “LONG_LABEL” can therefore be filled with self-explanatory annotations for variables. Long labels are more relevant for table output than for graphical output.

In all implementations of dataquieR, the formal argument label_col specifies whether short or long labels are used.

VALUE_LABELS

Categorical variables in the study data are often coded as integers (e.g. 0, 1). Because the number itself is non-informative, labels are essential to ensure understandable reports, e.g.:

  • The sex of participants can be coded as \(1 = females\) and \(2 = males\).
  • The presence of a disease can be coded as \(0 = no\) and \(1 = yes\).

To make use of VALUE_LABELS in dataquieR, the following convention has been made: all values of a study variable and their respective labels are summarized in a single list, using the pipe symbol \(|\) as separator, for example \(0 = no\: |\: 1 = yes\). The separator is crucial for the use of DQ implementations.

To enhance presentation and plotting quality the character length of a value label specified in VALUE_LABELS should not exceed 20 characters.
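
To make the convention concrete, the following Python sketch splits a VALUE_LABELS entry into a mapping from code to label. It only illustrates the notation and is not how dataquieR parses the column internally; parse_value_labels is a hypothetical name. The sketch assumes that labels do not themselves contain the pipe symbol, and it treats an entry without "=" as a level that serves as its own label.

```python
def parse_value_labels(spec):
    """Split 'code = label | code = label' into a dict of code -> label.

    Codes and labels are returned as stripped strings.
    """
    mapping = {}
    for entry in spec.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing separator
        code, sep, label = entry.partition("=")
        code = code.strip()
        # Entries without '=' (plain level names) label themselves.
        mapping[code] = label.strip() if sep else code
    return mapping


sex_labels = parse_value_labels("1 = females | 2 = males")
disease_labels = parse_value_labels("0 = no | 1 = yes")
```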

The function dataquieR::con_inadmissible_categorical() searches for all observed levels in the study data and compares them with the pre-defined categories in the metadata. The column NON_MATCHING denotes observed levels which have not been defined in the metadata.

    Variables       NUM_con_rvv_icat  PCT_con_rvv_icat  GRADING  FLG_con_rvv_icat
8   ARM_CUFF_0      0                 0.0               0        FALSE
9   USR_VO2_0       0                 0.0               0        FALSE
10  USR_BP_0        0                 0.0               0        FALSE
11  PART_PHYS_EXAM  0                 0.0               0        FALSE
12  PART_LAB        0                 0.0               0        FALSE
13  EDUCATION_0     0                 0.0               0        FALSE
14  EDUCATION_1     3                 0.1               1        TRUE
15  FAM_STAT_0      2389              79.6              1        TRUE
16  MARRIED_0       0                 0.0               0        FALSE
17  EATING_PREFS_0  0                 0.0               0        FALSE

Another application of value labels relates to the number of admissible levels of a categorical variable. If three distinct levels are observed in the data but the metadata (DD) define value codes and value labels for only two levels, the third level is an inadmissible value.
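
The comparison of observed and admissible levels amounts to a set difference. The Python sketch below shows the idea with hypothetical data; it is not the implementation of con_inadmissible_categorical(), and it assumes that missing codes have already been removed from the observed values.

```python
def non_matching_levels(observed_values, admissible_codes):
    """Return the sorted distinct observed values that are not defined in the metadata."""
    admissible = set(admissible_codes)
    return sorted({value for value in observed_values if value not in admissible})


# Hypothetical variable coded 0 = no | 1 = yes, with one undefined level (2).
observed = [0, 1, 1, 0, 2, 1]
admissible = [0, 1]
unexpected = non_matching_levels(observed, admissible)
```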


DATA_TYPE

In contrast to LABEL, the definition of the DATA_TYPE is crucial because the applicability of DQ implementations may depend on the data type.

The following DATA_TYPES are differentiated in dataquieR:

  • float
  • integer
  • datetime
  • string

The list appears small compared to the types offered by some electronic data capture systems (e.g. REDCap, Harris et al. 2009) or Shiny apps (Chang et al. 2018). However, the data type should not be confused with the data entry type, which can be very different, for example sliders or radio buttons. Similarly, the data type is not a statistical property such as an ordinal scale.

The function dataquieR::pro_applicability_matrix() provides an overview of applicable DQ-implementations according to the defined data type.

FIG 1: Sketch of a matrix summarizing the applicability of DQ implementations


VALUE CODES

Data often contain values that qualify an observation rather than represent a measurement, for example codes for missing values. Figure 2 shows the use of such codes in the variable V_0101. Both measurement values and missing codes are considered data values.


FIG 2: Categorization of measurements and missing values in dataquieR


Using such codes may complicate the application of standardized routines for DQ assessment, since coded missing measurements must be correctly interpreted. For example, it must be ensured that a data value representing a missing code is not treated as a measurement value, to avoid spurious results when addressing data accuracy. Therefore, codes representing non-measurement values must be identified and handled correctly.

The R-package dataquieR distinguishes two different code lists for data values: MISSING_LIST and JUMP_LIST. The conceptual difference is described in the concomitant data quality concept.


MISSING_LIST

Codes specified in the MISSING_LIST indicate unexpected missingness of measurements, for example missing values due to refusals or technical problems.

The MISSING_LIST is a list of pipe \(|\) separated numeric codes: \(99980\: |\: 99983\: |\: 99988\).


The DQ-implementation dataquieR::com_item_missingness() examines the presence of missing codes for all variables:

FIG 2: Analysis of reasons for missingness


JUMP_LIST

Codes in the JUMP_LIST indicate measurements which are missing by design. For example, if a sub-sample of a study population does not participate in a specific examination (by design) then jump-codes should be used to indicate this reason for missingness.

The JUMP_LIST is a list of pipe \(|\) separated numeric codes: \(88880\: |\: 88883\: |\: 88884\).
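
How the two code lists partition the data values can be sketched in a few lines of Python. This is an illustration of the concept only, not dataquieR code; parse_code_list and classify are hypothetical names, system missingness is represented here by None, and the observed values are invented.

```python
def parse_code_list(spec):
    """Turn a pipe-separated list such as '99980 | 99983' into a set of integers."""
    return {int(code) for code in spec.split("|") if code.strip()}


def classify(value, missing_codes, jump_codes):
    """Assign a data value to one of four categories."""
    if value is None:
        return "sysmiss"        # no data value at all
    if value in jump_codes:
        return "jump"           # missing by design
    if value in missing_codes:
        return "missing code"   # unexpected missingness
    return "measurement"


missing_codes = parse_code_list("99980 | 99983 | 99988")
jump_codes = parse_code_list("88880 | 88883 | 88884")

values = [2, None, 99983, 88880, 0]  # hypothetical observations of one variable
categories = [classify(v, missing_codes, jump_codes) for v in values]
```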

A specified JUMP_LIST is used by dataquieR::com_item_missingness() to compute the appropriate denominator for item missingness, i.e. observations for which a value of the variable is not expected by design are not considered. In the example below, the correct denominator for item missingness of N_BIRTH_0 and PREGNANT_0 is the number of females, i.e. all males have qualified jump codes in the respective variables.

    Study variable  Observations N  Sysmiss N (%)  Datavalues N (%)  Missing codes N (%)  Jumps N (%)   Measurements N (%)  GRADING
34  SMOKE_SHOP_0    3000            1681 (56.03)   1319 (43.97)      513 (17.1)           0 (0)         806 (26.87)         1
35  N_INJURIES_0    3000            320 (10.67)    2680 (89.33)      481 (16.03)          0 (0)         2199 (73.3)         1
36  N_BIRTH_0       3000            289 (9.63)     2711 (90.37)      499 (16.63)          1113 (37.1)   1099 (58.24)        1
37  INCOME_GROUP_0  3000            311 (10.37)    2689 (89.63)      515 (17.17)          0 (0)         2174 (72.47)        1
38  PREGNANT_0      3000            350 (11.67)    2650 (88.33)      519 (17.3)           1066 (35.53)  1065 (55.07)        1
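
The percentages in the Measurements column are consistent with a denominator of observations minus jumps. This reading is inferred from the table values, not taken from the dataquieR source. A short Python check with the numbers from the table (measurement_percentage is a hypothetical helper):

```python
def measurement_percentage(observations, jumps, measurements):
    """Percentage of measurements among observations expected by design."""
    expected = observations - jumps  # jump codes are excluded from the denominator
    return round(100 * measurements / expected, 2)


n_birth = measurement_percentage(3000, 1113, 1099)    # N_BIRTH_0
pregnant = measurement_percentage(3000, 1066, 1065)   # PREGNANT_0
smoke_shop = measurement_percentage(3000, 0, 806)     # SMOKE_SHOP_0, no jumps
```

Without the jump codes, N_BIRTH_0 would be evaluated against all 3000 observations and its share of measurements would appear considerably lower.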


LIMITS

Limits describe ranges to check the plausibility of measurement values (hard and soft limits) or to identify measurements outside a measurable range (detection limits). Limits may apply to study data of type float, integer, and datetime. Specifying limits can be content-driven (e.g. based on clinical information) or may depend on properties of the examination device used or the outcome under study. For example, body weight cannot be negative.

Unfortunately, the definition of limits can be ambiguous:

  • a plausibility limit of “\(\gt10\)” may imply that all values above are plausible.
  • however, this notation is also frequently used to guide decisions in eCRFs, i.e. if a value is “\(\gt10\)” then alert the user regarding an implausible value.


To avoid this ambiguity, HARD_LIMITS, SOFT_LIMITS, and DETECTION_LIMITS in the metadata are defined using interval notation. Values inside the interval are eligible/plausible/possible. The definition of intervals also follows the usual distinction between round and square brackets:

  • \((0;\:10)\): open interval, i.e. values \(>0\) and \(<10\) are inside the interval.
  • \((0;\:10]\): left-open interval, i.e. values \(>0\) and \(\le10\) are inside the interval.
  • \([0;\:10)\): right-open interval, i.e. values \(\ge0\) and \(<10\) are inside the interval.
  • \([0;\:10]\): is a closed interval, i.e. values \(\ge0\) and \(\le10\) are inside the interval.

Each side of the interval must be defined by a value of the same type as the measurement (including dates and date-times). If the range is unbounded on one side, \(-Inf\) and/or \(Inf\) must be specified. Please see the examples provided in Metadata in dataquieR.
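
The interval notation can be read mechanically. The following Python sketch parses a numeric interval string and tests whether a value lies inside it. It is an illustration of the notation, not the parser used by dataquieR; parse_interval and inside are hypothetical names, and datetime bounds are deliberately not handled.

```python
import re

# Bracket, lower bound, semicolon, upper bound, bracket.
INTERVAL = re.compile(r"^\s*([\[(])\s*([^;]+?)\s*;\s*([^;]+?)\s*([\])])\s*$")


def parse_interval(spec):
    """Return (low, high, low_closed, high_closed) for a string such as '(0;10]'."""
    match = INTERVAL.match(spec)
    if match is None:
        raise ValueError(f"not a valid interval: {spec!r}")
    left, low, high, right = match.groups()
    # float() accepts 'Inf' and '-Inf' for unbounded sides.
    return float(low), float(high), left == "[", right == "]"


def inside(value, spec):
    """True if value lies inside the interval described by spec."""
    low, high, low_closed, high_closed = parse_interval(spec)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below
```

For example, inside(10, "(0;10]") is true while inside(10, "[0;10)") is false, which mirrors the four cases listed above.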

Two types of limits may be distinguished, depending on whether the range indicates inadmissible or just improbable values.

HARD_LIMITS

HARD_LIMITS should be specified to identify inadmissible values. Inadmissible does not necessarily mean impossible. For example, while it is known that the heaviest man on Earth weighed more than 600 kg, it may be reasonable to declare values above 250 kg as inadmissible because, under the circumstances of a general-population study in Germany, it is deemed unlikely that a heavier person will arrive at the examination center.

The application of the function dataquieR::con_limit_deviations() in combination with HARD_LIMITS leads to the removal of respective values. The removal is indicated in the respective plot and provided in a message:

N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMITS and were removed.
N = 903 values in MEDICATION_0 have been above HARD_LIMITS and were removed.
N = 21 values in EDUCATION_1 have been above HARD_LIMITS and were removed.

FIG 3: Example of summary plot for limit deviations

If the removal of values outside the limit intervals is not intended the function can be used in combination with SOFT_LIMITS.

SOFT_LIMITS

The functionality of SOFT_LIMITS is similar to HARD_LIMITS. However, values outside the limits are not removed, because SOFT_LIMITS indicate improbable but not impossible measurements.

The formal setup of SOFT_LIMITS is identical to HARD_LIMITS.
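
The difference in handling can be summarized as "remove" versus "flag". The Python sketch below illustrates this with hypothetical body weights; it is not the implementation of con_limit_deviations(). For brevity the limits are passed as (low, high) tuples and treated as closed intervals. The hard limit of 250 follows the example above, while the soft limits are invented.

```python
def apply_limits(values, hard=None, soft=None):
    """Split values into kept, removed (outside hard limits) and flagged (outside soft limits)."""
    kept, removed, flagged = [], [], []
    for value in values:
        if hard is not None and not (hard[0] <= value <= hard[1]):
            removed.append(value)   # inadmissible: dropped from further analysis
            continue
        kept.append(value)
        if soft is not None and not (soft[0] <= value <= soft[1]):
            flagged.append(value)   # improbable: kept, but reported
    return kept, removed, flagged


weights = [72.5, 310.0, 181.0, 64.0]  # hypothetical body weights in kg
kept, removed, flagged = apply_limits(weights, hard=(0, 250), soft=(40, 150))
```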

DETECTION_LIMITS

The definition of DETECTION_LIMITS can be necessary if measurement devices have predefined limits of sensitivity. It is possible that measurements are indicated as being below or above the DETECTION_LIMITS. Such information should result in a different management of respective data values as they are still informative and can be used in later analysis.

Values outside detection limits are not removed.

The formal setup of DETECTION_LIMITS is identical to HARD_LIMITS.


CONTRADICTIONS

Checks for contradictions compare the values of two study data variables to detect inadmissible combinations. In contrast to the assessment of limits, only the combination of values in two variables is inadmissible, while the value of each variable on its own is admissible. Checks are performed row-wise, i.e. within an individual.

For example, the variable sex for a given participant may contain \(male\) and the variable no. of births = \(2\). Each value is admissible but the combination is not, since male participants cannot give birth.

The column CONTRADICTIONS in the metadata (DD) references pipe \(|\) separated IDs of contradiction checks such as: \(1004\: |\: 1005\: |\: 1006\). Please see for example Metadata in dataquieR. Each of these IDs is linked to a specific contradiction check, which is defined either in a spreadsheet table or by means of a ShinyApp provided by dataquieR. An example is provided in the table below.

Important note: in the table below, the columns A and B contain the labels of variables, since the columns in the study data have rather technical names. In this case the function con_contradictions must be called with the formal argument label_col defined.

ID    Function_name               A                A_levels    A_value  B             B_levels                                         B_value  Label
1001  A_less_than_B_vv            AGE_1            NA          NA       AGE_0         NA                                               NA       Age follow-up
1002  A_not_equal_B_vv            SEX_1            NA          NA       SEX_0         NA                                               NA       Sex follow-up
1003  A_less_than_B_vv            EDUCATION_1      NA          NA       EDUCATION_0   NA                                               NA       Education follow-up
1004  A_levels_and_B_levels_ll    EATING_PREFS_0   vegetarian  NA       MEAT_CONS_0   1-2d a week | 3-4d a week | 5-6d a week | daily  NA       Nutrition inconsistency vegetarian
1005  A_levels_and_B_levels_ll    EATING_PREFS_0   vegan       NA       MEAT_CONS_0   1-2d a week | 3-4d a week | 5-6d a week | daily  NA       Nutrition inconsistency vegan
1006  A_levels_and_B_levels_ll    EATING_PREFS_0   none        NA       MEAT_CONS_0   never                                            NA       Nutrition inconsistency
1007  A_levels_and_B_levels_ll    SMOKING_0        no          NA       SMOKE_SHOP_0  1-2d a week | 3-4d a week | 5-6d a week | daily  NA       Non-smokers inconsistency
1008  A_levels_and_B_levels_ll    SMOKING_0        yes         NA       SMOKE_SHOP_0  never                                            NA       Smokers inconsistency
1009  A_not_equal_B_vv            ARM_CIRC_DISC_0  NA          NA       ARM_CUFF_0    NA                                               NA       Blood pressure false cuff
1010  A_levels_and_B_gt_value_lc  PREGNANT_0       yes         NA       AGE_0         NA                                               55       Pregnancy high age
1011  A_less_than_B_vv            LAB_DT_0         NA          NA       EXAM_DT_0     NA                                               NA       LAB before MEX

In total, 10 different types of comparisons between two variables can be defined to detect contradictions. In the first row of the table above, the function name A_less_than_B_vv has been selected to ensure that the age of a study participant at the follow-up examination is never lower than the age at baseline. The related check rule is: the value in variable A is lower than the value in variable B. The suffixes _vv, _ll, and _lc are irrelevant for the user but necessary for the ShinyApp to prompt further inputs. The complete list of functions to check contradictions is provided in the Appendix under Contradiction checks.
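
To illustrate how such rules are evaluated row-wise, the Python sketch below implements three of the checks from the table (IDs 1001, 1005 and 1010) as simple predicates. It is not the implementation of con_contradictions(); the participant records are invented, and the sketch assumes that no value is missing.

```python
# Levels of MEAT_CONS_0 that contradict a vegan diet (check 1005 in the table).
MEAT_LEVELS = {"1-2d a week", "3-4d a week", "5-6d a week", "daily"}

# Each rule returns True when a row contains the contradiction.
CHECKS = {
    1001: lambda row: row["AGE_1"] < row["AGE_0"],                    # A_less_than_B_vv
    1005: lambda row: row["EATING_PREFS_0"] == "vegan"
    and row["MEAT_CONS_0"] in MEAT_LEVELS,                            # A_levels_and_B_levels_ll
    1010: lambda row: row["PREGNANT_0"] == "yes" and row["AGE_0"] > 55,  # A_levels_and_B_gt_value_lc
}


def find_contradictions(rows, checks):
    """Map each check ID to the indices of the rows that violate it."""
    return {
        check_id: [i for i, row in enumerate(rows) if rule(row)]
        for check_id, rule in checks.items()
    }


# Hypothetical participants, one dict per row of the study data.
participants = [
    {"AGE_0": 47, "AGE_1": 49, "PREGNANT_0": "no",
     "EATING_PREFS_0": "vegan", "MEAT_CONS_0": "1-2d a week"},
    {"AGE_0": 58, "AGE_1": 56, "PREGNANT_0": "yes",
     "EATING_PREFS_0": "none", "MEAT_CONS_0": "daily"},
    {"AGE_0": 30, "AGE_1": 32, "PREGNANT_0": "yes",
     "EATING_PREFS_0": "vegetarian", "MEAT_CONS_0": "never"},
]
contradictions = find_contradictions(participants, CHECKS)
```

Each value taken alone is admissible in all three records; only the combinations in the first two are flagged.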

The application of the function dataquieR::con_contradictions() provides three different outputs. One of them is a summary table of applied checks.

Check ID  Check type                  Variables A and B                       A Levels    B Levels                                   Contradictions (N)  Contradictions (%)  Grading  Label
1002      A_not_equal_B_vv            A is: SEX_1; B is: SEX_0                NA          NA                                         150                 5.00                1        Sex follow-up
1001      A_less_than_B_vv            A is: AGE_1; B is: AGE_0                NA          NA                                         150                 5.00                1        Age follow-up
1009      A_not_equal_B_vv            A is: ARM_CIRC_DISC_0; B is: ARM_CUFF_0 NA          NA                                         173                 5.77                1        Blood pressure false cuff
1003      A_less_than_B_vv            A is: EDUCATION_1; B is: EDUCATION_0    NA          NA                                         7                   0.23                0        Education follow-up
1004      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0 vegetarian  1-2d a week,3-4d a week,5-6d a week,daily  54                  1.80                1        Nutrition inconsistency vegetarian
1005      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0 vegan       1-2d a week,3-4d a week,5-6d a week,daily  19                  0.63                0        Nutrition inconsistency vegan
1006      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0 none        never                                      64                  2.13                1        Nutrition inconsistency
1007      A_levels_and_B_levels_ll    A is: SMOKING_0; B is: SMOKE_SHOP_0     no          1-2d a week,3-4d a week,5-6d a week,daily  91                  3.03                1        Non-smokers inconsistency
1008      A_levels_and_B_levels_ll    A is: SMOKING_0; B is: SMOKE_SHOP_0     yes         never                                      118                 3.93                1        Smokers inconsistency
1010      A_levels_and_B_gt_value_lc  A is: PREGNANT_0; B is: AGE_0           yes         NA                                         5                   0.17                0        Pregnancy high age


REPORT DESIGN

VARIABLE_ROLE

Usually not all variables of the study data will be subject to DQ reporting. To allow for simple filtering, different roles of variables can be defined. The number of roles is not limited. In dataquieR the following roles are defined:

  • intro: administrative variables, for example indicating the participation in an examination
  • primary: measurement variables of major importance
  • secondary: measurement variables of minor importance
  • process: measurements of the data generating process under which study data were obtained. For example, room temperature or the respective examiner.

VARIABLE_ORDER

In this column the order of the variables in a data quality report can be defined.
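
Taken together, VARIABLE_ROLE and VARIABLE_ORDER determine which variables appear in a report and in which sequence. A minimal Python sketch of this selection follows; report_variables is a hypothetical helper, not a dataquieR function, and the roles and order values assigned to the four variables are invented for illustration.

```python
def report_variables(metadata, roles):
    """Return the labels of all variables with one of the given roles, sorted by VARIABLE_ORDER."""
    selected = [row for row in metadata if row["VARIABLE_ROLE"] in roles]
    return [row["LABEL"] for row in sorted(selected, key=lambda row: row["VARIABLE_ORDER"])]


# Hypothetical excerpt of a metadata table, one dict per variable.
metadata = [
    {"VAR_NAMES": "v00003", "LABEL": "AGE_0", "VARIABLE_ROLE": "primary", "VARIABLE_ORDER": 4},
    {"VAR_NAMES": "v00000", "LABEL": "CENTER_0", "VARIABLE_ROLE": "intro", "VARIABLE_ORDER": 1},
    {"VAR_NAMES": "v00002", "LABEL": "SEX_0", "VARIABLE_ROLE": "primary", "VARIABLE_ORDER": 3},
    {"VAR_NAMES": "v00001", "LABEL": "PSEUDO_ID", "VARIABLE_ROLE": "intro", "VARIABLE_ORDER": 2},
]
primary_only = report_variables(metadata, roles={"primary"})
intro_and_primary = report_variables(metadata, roles={"intro", "primary"})
```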


Appendix

Metadata in dataquieR

No  Variable                       Stats / Values                 Freqs (% of Valid)  Valid        Missing
1   VAR_NAMES [character]          1. v00000                      1 (1.9%)            53 (100.0%)  0 (0.0%)
                                   2. v00001                      1 (1.9%)
                                   3. v00002                      1 (1.9%)
                                   4. v00003                      1 (1.9%)
                                   5. v00004                      1 (1.9%)
                                   6. v00005                      1 (1.9%)
                                   7. v00006                      1 (1.9%)
                                   8. v00007                      1 (1.9%)
                                   9. v00008                      1 (1.9%)
                                   10. v00009                     1 (1.9%)
                                   [ 43 others ]                  43 (81.1%)
2   LABEL [character]              1. AGE_0                       1 (1.9%)            53 (100.0%)  0 (0.0%)
                                   2. AGE_1                       1 (1.9%)
                                   3. AGE_GROUP_0                 1 (1.9%)
                                   4. ARM_CIRC_0                  1 (1.9%)
                                   5. ARM_CIRC_DISC_0             1 (1.9%)
                                   6. ARM_CUFF_0                  1 (1.9%)
                                   7. ASTHMA_0                    1 (1.9%)
                                   8. BSG_0                       1 (1.9%)
                                   9. CENTER_0                    1 (1.9%)
                                   10. CRP_0                      1 (1.9%)
                                   [ 43 others ]                  43 (81.1%)
3   DATA_TYPE [character]          1. datetime                    4 (7.5%)            53 (100.0%)  0 (0.0%)
                                   2. float                       6 (11.3%)
                                   3. integer                     37 (69.8%)
                                   4. string                      6 (11.3%)
4   VALUE_LABELS [character]       1. 0 = no | 1 = yes            10 (38.5%)          26 (49.1%)   27 (50.9%)
                                   2. 0 = females | 1 = males     2 (7.7%)
                                   3. 0 = never | 1 = 1-2d a we   2 (7.7%)
                                   4. 0 = pre-primary | 1 = pri   2 (7.7%)
                                   5. 1 = (-Inf,20] | 2 = (20,3   2 (7.7%)
                                   6. 0 = <10k | 1 = [10-30k) |   1 (3.8%)
                                   7. 0 = none | 1 = vegetarian   1 (3.8%)
                                   8. 1 = Berlin | 2 = Hamburg    1 (3.8%)
                                   9. A = excellent | B = good    1 (3.8%)
                                   10. single | married | divorc  1 (3.8%)
                                   [ 3 others ]                   3 (11.5%)
5   MISSING_LIST [character]       1. 99980 | 99983 | 99988 |     15 (41.7%)          36 (67.9%)   17 (32.1%)
                                   2. 99980 | 99983 | 99988 |     8 (22.2%)
                                   3. 99980 | 99988 | 99989 |     1 (2.8%)
                                   4. 99980 | 99981 | 99982 | 9   4 (11.1%)
                                   5. 99980 | 99981 | 99982 | 9   2 (5.6%)
                                   6. 99980 | 99983 | 99987 |     2 (5.6%)
                                   7. 99980 | 99987               1 (2.8%)
                                   8. 99981 | 99982               3 (8.3%)
6   JUMP_LIST [integer]            Min : 88880                    88880 : 2 (20.0%)   10 (18.9%)   43 (81.1%)
                                   Mean : 88888                   88890 : 8 (80.0%)
                                   Max : 88890
7   HARD_LIMITS [character]        1. [0;10]                      9 (27.3%)           33 (62.3%)   20 (37.7%)
                                   2. [0;1]                       5 (15.2%)
                                   3. [2018-01-01 00:00:00 CET;   4 (12.1%)
                                   4. [0;4]                       2 (6.1%)
                                   5. [0;6]                       2 (6.1%)
                                   6. [0;Inf)                     2 (6.1%)
                                   7. [1;3]                       2 (6.1%)
                                   8. [18;Inf)                    2 (6.1%)
                                   9. [0;100]                     1 (3.0%)
                                   10. [0;2]                      1 (3.0%)
                                   [ 3 others ]                   3 (9.1%)
8   DETECTION_LIMITS [character]   1. [0;265]                     2 (66.7%)           3 (5.7%)     50 (94.3%)
                                   2. [0.16;Inf)                  1 (33.3%)
9   CONTRADICTIONS [character]     1. 1001                        2 (13.3%)           15 (28.3%)   38 (71.7%)
                                   2. 1002                        2 (13.3%)
                                   3. 1003                        2 (13.3%)
                                   4. 1004 | 1005 | 1006          2 (13.3%)
                                   5. 1007 | 1008                 2 (13.3%)
                                   6. 1009                        2 (13.3%)
                                   7. 1010                        1 (6.7%)
                                   8. 1011                        2 (13.3%)
10  SOFT_LIMITS [character]        1. (0;60]                      1 (11.1%)           9 (17.0%)    44 (83.0%)
                                   2. (55;100)                    1 (11.1%)
                                   3. (90;170)                    1 (11.1%)
                                   4. [0;10]                      2 (22.2%)
                                   5. [0;5]                       1 (11.1%)
                                   6. [0.2;10)                    1 (11.1%)
                                   7. [0.2;30)                    1 (11.1%)
                                   8. [1;9]                       1 (11.1%)
11  DISTRIBUTION [character]       1. gamma                       1 (14.3%)           7 (13.2%)    46 (86.8%)
                                   2. normal                      4 (57.1%)
                                   3. uniform                     2 (28.6%)
12  DECIMALS [integer]             Mean (sd) : 0.7 (1.2)          0 : 4 (66.7%)       6 (11.3%)    47 (88.7%)
                                   min < med < max:               1 : 1 (16.7%)
                                   0 < 0 < 3                      3 : 1 (16.7%)
                                   IQR (CV) : 0.8 (1.8)
13  DATA_ENTRY_TYPE [integer]      Min : 0                        0 : 4 (66.7%)       6 (11.3%)    47 (88.7%)
                                   Mean : 0.3                     1 : 2 (33.3%)
                                   Max : 1
14  KEY_OBSERVER [character]       1. v00011                      1 (5.6%)            18 (34.0%)   35 (66.0%)
                                   2. v00012                      2 (11.1%)
                                   3. v00032                      15 (83.3%)
15  KEY_DEVICE [character]         1. v00010                      2 (66.7%)           3 (5.7%)     50 (94.3%)
                                   2. v00016                      1 (33.3%)
16  KEY_DATETIME [character]       1. v00013                      4 (66.7%)           6 (11.3%)    47 (88.7%)
                                   2. v00017                      2 (33.3%)
17  KEY_STUDY_SEGMENT [character]  1. v10000                      11 (20.8%)          53 (100.0%)  0 (0.0%)
                                   2. v20000                      11 (20.8%)
                                   3. v30000                      4 (7.5%)
                                   4. v40000                      18 (34.0%)
                                   5. v50000                      9 (17.0%)
18  VARIABLE_ROLE [character]      1. intro                       11 (20.8%)          53 (100.0%)  0 (0.0%)
                                   2. primary                     30 (56.6%)
                                   3. process                     9 (17.0%)
                                   4. secondary                   3 (5.7%)
19  VARIABLE_ORDER [integer]       Mean (sd) : 27 (15.4)          53 distinct values  53 (100.0%)  0 (0.0%)
                                   min < med < max:               (Integer sequence)
                                   1 < 27 < 53
                                   IQR (CV) : 26 (0.6)
20  LONG_LABEL [character]         1. AGE_0                       1 (1.9%)            53 (100.0%)  0 (0.0%)
                                   2. AGE_1                       1 (1.9%)
                                   3. AGE_GROUP_0                 1 (1.9%)
                                   4. ARM_CIRCUMFERENCE_0         1 (1.9%)
                                   5. ARM_CIRCUMFERENCE_DISCRET   1 (1.9%)
                                   6. ARM_USED_CUFF_0             1 (1.9%)
                                   7. ASTHMA_YESNO_0              1 (1.9%)
                                   8. BSG_0                       1 (1.9%)
                                   9. CENTER_0                    1 (1.9%)
                                   10. CRP_0                      1 (1.9%)
                                   [ 43 others ]                  43 (81.1%)
21  LOCATION_RANGE [character]     1. (100;140)                   1 (16.7%)           6 (11.3%)    47 (88.7%)
                                   2. (20;30)                     1 (16.7%)
                                   3. (60;100)                    1 (16.7%)
                                   4. [2;4)                       1 (16.7%)
                                   5. [45;55]                     2 (33.3%)
22  LOCATION_METRIC [character]    1. Mean                        5 (83.3%)           6 (11.3%)    47 (88.7%)
                                   2. Median                      1 (16.7%)
23  PROPORTION_RANGE [character]   1. (10;90)                     1 (20.0%)           5 (9.4%)     48 (90.6%)
                                   2. [15;30]                     1 (20.0%)
                                   3. [48;52]                     1 (20.0%)
                                   4. 0 in [48;52]                1 (20.0%)
                                   5. 4 in (2;10] | 5 in (5;15]   1 (20.0%)


Contradiction checks

Function notation                                                 Explanation
A \(\ne\) B                                                       Value in A is not equal to the value in B
A > B                                                             Value in A is greater than the value in B
A \(\ge\) B                                                       Value in A is greater than or equal to the value in B
A observed \(\cap\) B missing                                     A is observed and B is missing
A observed \(\cap\) B not missing                                 A is observed and B is observed
A \(\in\) {set of levels} \(\cap\) B > value                      A has one of the levels and B is greater than a value
A \(\in\) {set of levels} \(\cap\) B = value                      A has one of the levels and B is equal to a value
A \(\in\) {set of levels} \(\cap\) B < value                      A has one of the levels and B is lower than a value
A \(\in\) {set of levels} \(\cap\) B \(\in\) {set of levels}      A has one of the levels and B has one of the levels
A \(\in\) {set of levels} \(\cap\) B \(\notin\) {set of levels}   A has one of the levels and B has none of the levels


Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., et al. (2018). Shiny: Web application framework for R, 2015. R Package Version 1, 14.
Harris, P.A., Taylor, R., Thielke, R., Payne, J., Gonzalez, N., and Conde, J.G. (2009). Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381.
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. The American Statistician 70, 296–303.
Meyer, J., Ostrzinski, S., Fredrich, D., Havemann, C., Krafczyk, J., and Hoffmann, W. (2012). Efficient data management in a large-scale epidemiology research project. Computer Methods and Programs in Biomedicine 107, 425–435.
Nadkarni, P.M. (2011). Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge (Springer Science & Business Media).
Richter, A., Schössow, J., Werner, A., Schauer, B., Radke, D., Henke, J., Struckmann, S., and Schmidt, C. (2019). Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information. GMS Med Inform Biom Epidemiol 15.