This document illustrates the use of metadata for data quality (DQ) assessments. Metadata are “data that describe other data” (Nadkarni 2011). They provide information to support the correct interpretation of study data and to guide statistical analyses. The focus in this document is on metadata related to single variables, since most DQ assessments focus on this structural level of data. Examples of metadata are lists of value codes to examine reasons for incomplete data, or value labels to support interpretable reports. Some metadata are specific to certain DQ assessments while others are used across DQ implementations. This is detailed below.
For further information on metadata, please see Richter et al. 2019.
Metadata are commonly stored in so-called data dictionaries (DDs). DDs frequently contain, for example, the name of a variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). DDs should be available for the study data in each research study. However, DDs often host only a subset of the information necessary for data quality assessments. A natural consequence is to extend DDs with aspects related to data quality.
If this is not possible, metadata may also be stored in a spreadsheet-type format, for example as data frames. dataquieR uses predefined metadata provided as data frames, as described below.
The R package dataquieR uses predefined metadata in two ways:
For each variable of the study data that is named in a function call of a DQ implementation, the respective metadata are read from a data frame of metadata.
Some implementations also search for relations between variables, such as a date-time stamp that belongs to a measurement. The definition of such relations is explained in the paragraph KEY-COLUMNS.
Therefore, metadata and study data must be in 1:1 correspondence, i.e. each variable of the study data is identifiable in the metadata. The key for this mapping is the variable name, which is listed in the column VAR_NAMES in the metadata. A necessary convention regarding variable names is their uniqueness, i.e. no variable name may occur more than once. Further, to better distinguish between column names in metadata and study data, all columns of the metadata are named in upper-case letters.
The 1:1 correspondence implies that each variable name is unique.
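The correspondence between study data and metadata can be verified before any DQ assessment is run. The following is a minimal sketch in base R, not dataquieR's own check; the two small data frames are hypothetical toy data, and only the column name VAR_NAMES is taken from the text above.

```r
# Hypothetical toy study data and metadata (not the dataquieR example data)
study_data <- data.frame(v00000 = c(1, 2), v00002 = c(0, 1), v00003 = c(49, 52))
meta_data <- data.frame(
  VAR_NAMES = c("v00000", "v00002", "v00003"),
  LABEL = c("CENTER_0", "SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Variable names must be unique in the metadata
duplicated_names <- meta_data$VAR_NAMES[duplicated(meta_data$VAR_NAMES)]

# Every study variable needs a metadata row, and vice versa
missing_in_meta <- setdiff(names(study_data), meta_data$VAR_NAMES)
missing_in_data <- setdiff(meta_data$VAR_NAMES, names(study_data))

is_one_to_one <- length(duplicated_names) == 0 &&
  length(missing_in_meta) == 0 &&
  length(missing_in_data) == 0
print(is_one_to_one)
```

If `is_one_to_one` is `FALSE`, the three helper vectors show which names are duplicated or unmatched.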
Appropriate labels are a necessary precondition for readable data quality reports. Their absence does not affect the functionality of statistical implementations.
CAVEAT: A necessary convention for all labels in the current project phase is that labels are unique and short. This is necessary since reports may be corrupted by overly long labels.
Assigning labels to variables is important because variable names in the study data are rather technical and hamper interpretation. As is the case for variable names, each variable label should be unique. In addition, labels should be as short as possible to ensure a readable output.
To enhance presentation and plotting quality, a label specified in LABEL should not exceed 20 characters.
VAR_NAMES | LABEL |
---|---|
v00000 | CENTER_0 |
v00001 | PSEUDO_ID |
v00002 | SEX_0 |
v00003 | AGE_0 |
v00103 | AGE_GROUP_0 |
All implementations of dataquieR support the use of LABEL.
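The label conventions (unique, at most 20 characters) and the mapping from technical variable names to labels can be sketched in base R as follows. The metadata rows repeat the table above; the study data are hypothetical, and the code is an illustration rather than dataquieR's implementation.

```r
# Metadata rows copied from the table above
meta_data <- data.frame(
  VAR_NAMES = c("v00000", "v00001", "v00002", "v00003", "v00103"),
  LABEL = c("CENTER_0", "PSEUDO_ID", "SEX_0", "AGE_0", "AGE_GROUP_0"),
  stringsAsFactors = FALSE
)

# Labels that break the conventions: longer than 20 characters or duplicated
max_label_length <- 20
too_long <- meta_data$LABEL[nchar(meta_data$LABEL) > max_label_length]
not_unique <- meta_data$LABEL[duplicated(meta_data$LABEL)]

# Hypothetical study data with technical column names
study_data <- data.frame(v00002 = c(0, 1), v00003 = c(49, 52))

# Look up the label of each study data column via VAR_NAMES
display_names <- meta_data$LABEL[match(names(study_data), meta_data$VAR_NAMES)]
print(display_names)
```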
Under some circumstances, a short label or variable name is insufficient to provide all necessary information. The column “LONG_LABEL” can therefore be filled with self-explanatory annotations for variables. Long labels are more relevant for table output than for graphical output.
Via the label_col argument, which is available in all implementations of dataquieR, either short or long labels can be selected.
Categorical variables in the study data are often coded as integers (e.g. 0, 1). Because such numbers are not informative by themselves, labels are essential to ensure understandable reports, e.g. 0 = females and 1 = males.
To make use of VALUE_LABELS in dataquieR, the following convention has been made: all values of a study variable and their respective labels are summarized in a list, using the pipe operator \(|\) for separation. The latter is crucial for the use of DQ implementations.
To enhance presentation and plotting quality, a value label specified in VALUE_LABELS should not exceed 20 characters.
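To illustrate the pipe convention, the sketch below splits a VALUE_LABELS entry into codes and labels and applies them to coded values. The helper `parse_value_labels()` is hypothetical and written in base R; it is not a dataquieR function. The entry "0 = females | 1 = males" follows the format shown in the example metadata.

```r
# Hypothetical helper: turn "code = label | code = label" into a named vector
parse_value_labels <- function(x) {
  # Split at the pipe, then separate each part at the first "="
  parts <- trimws(strsplit(x, "|", fixed = TRUE)[[1]])
  codes <- trimws(sub("=.*$", "", parts))
  labels <- trimws(sub("^[^=]*=", "", parts))
  stats::setNames(labels, codes)
}

sex_labels <- parse_value_labels("0 = females | 1 = males")

# Apply the labels to integer-coded study data (toy values)
sex <- c(0, 1, 1, 0)
sex_factor <- factor(sex, levels = names(sex_labels), labels = unname(sex_labels))
print(sex_factor)
```

Splitting at the first "=" only keeps labels intact that themselves contain an equals sign.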
The function dataquieR::con_inadmissible_categorical() searches for all observed levels in the study data and compares them with the predefined categories in the metadata. The column NON_MATCHING denotes observed levels which have not been defined in the metadata.
No | Variables | NUM_con_rvv_icat | PCT_con_rvv_icat | GRADING | FLG_con_rvv_icat |
---|---|---|---|---|---|
8 | ARM_CUFF_0 | 0 | 0.0 | 0 | FALSE |
9 | USR_VO2_0 | 0 | 0.0 | 0 | FALSE |
10 | USR_BP_0 | 0 | 0.0 | 0 | FALSE |
11 | PART_PHYS_EXAM | 0 | 0.0 | 0 | FALSE |
12 | PART_LAB | 0 | 0.0 | 0 | FALSE |
13 | EDUCATION_0 | 0 | 0.0 | 0 | FALSE |
14 | EDUCATION_1 | 3 | 0.1 | 1 | TRUE |
15 | FAM_STAT_0 | 2389 | 79.6 | 1 | TRUE |
16 | MARRIED_0 | 0 | 0.0 | 0 | FALSE |
17 | EATING_PREFS_0 | 0 | 0.0 | 0 | FALSE |
Another application of value labels relates to the number of admissible levels of a categorical variable. If three distinct levels are observed in the data, but the value codes and value labels in the metadata (DD) reference only two levels, this implies the existence of inadmissible values.
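The comparison of observed and defined levels can be sketched in a few lines of base R. This is an illustration of the idea only, with hypothetical toy values; it does not reproduce the output of dataquieR::con_inadmissible_categorical().

```r
# Codes defined in the metadata, e.g. from VALUE_LABELS "0 = no | 1 = yes"
defined_codes <- c("0", "1")

# Hypothetical observed values; the code 2 is not defined in the metadata
observed <- c(0, 1, 1, 2, 0, NA, 2)

# Observed levels (ignoring system missings) without a definition
observed_levels <- unique(as.character(observed[!is.na(observed)]))
non_matching <- setdiff(observed_levels, defined_codes)

# Number of data values carrying an undefined level
n_inadmissible <- sum(as.character(observed) %in% non_matching)
print(non_matching)
print(n_inadmissible)
```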
In contrast to LABEL, the definition of the DATA_TYPE is crucial because the applicability of DQ implementations may depend on the data type.
The following DATA_TYPES are differentiated in dataquieR: integer, string, float, and datetime.
The list appears small compared to some electronic data capture systems (e.g. REDCap, Harris et al. 2009) or Shiny Apps (Chang et al. 2018). However, the data type should not be confused with data entry types, which can be very different, e.g. sliders or radio buttons.
Similarly, the data type is not a statistical property such as an ordinal characteristic.
The function dataquieR::pro_applicability_matrix() provides an overview of applicable DQ implementations according to the defined data type.
FIG 1: Sketch of a matrix summarizing the applicability of DQ implementations
Data often contain values which qualify an observation but are not measurements, for example codes for missing values. Figure 2 shows the use of such codes in the variable V_0101. Both measurement values and missing codes are considered data values.
FIG 2: Categorization of measurements and missing values in dataquieR
Using such codes may complicate the application of standardized routines for DQ assessment, since coded missing measurements must be correctly interpreted. For example, it must be ensured that a data value representing a missing code is not treated as a measurement value, to avoid spurious results when addressing data accuracy. Therefore, codes representing non-measurement values must be correctly identified and handled.
The R package dataquieR distinguishes two different code lists for data values: MISSING_LIST and JUMP_LIST. The conceptual difference is described in the accompanying data quality concept.
Codes specified in the MISSING_LIST indicate unexpected missingness of measurements, for example missing values due to refusals or technical problems.
The MISSING_LIST is a list of pipe-separated (\(|\)) numeric codes: \(99980\: |\: 99983\: |\: 99988\).
The DQ implementation dataquieR::com_item_missingness() examines the presence of missing codes for all variables:
FIG 2: Analysis of reasons for missingness
Codes in the JUMP_LIST indicate measurements which are missing by design. For example, if a sub-sample of a study population does not participate in a specific examination (by design), then jump codes should be used to indicate this reason for missingness.
The JUMP_LIST is a list of pipe-separated (\(|\)) numeric codes: \(88880\: |\: 88883\: |\: 88884\).
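The two code lists allow every data value to be sorted into one of four categories: system missing, missing code, jump code, or measurement. The base R sketch below uses the example code lists from the text and hypothetical data values; the helper `parse_codes()` is an assumption for illustration and not part of dataquieR.

```r
# Code lists in the pipe-separated format used in the metadata
missing_list <- "99980 | 99983 | 99988"
jump_list <- "88880 | 88883 | 88884"

# Hypothetical helper: split a pipe-separated list into numeric codes
parse_codes <- function(x) {
  as.numeric(trimws(strsplit(x, "|", fixed = TRUE)[[1]]))
}
missing_codes <- parse_codes(missing_list)
jump_codes <- parse_codes(jump_list)

# Hypothetical data values of one study variable
values <- c(12, 99980, NA, 88880, 7, 99988)

# Classify each data value
category <- ifelse(
  is.na(values), "sysmiss",
  ifelse(values %in% missing_codes, "missing code",
         ifelse(values %in% jump_codes, "jump", "measurement"))
)
print(table(category))
```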
A specified JUMP_LIST is used by dataquieR::com_item_missingness() to compute the appropriate denominator for item missingness, i.e. observations in which a value of the variable is not expected by design are not considered. In the example below, the correct denominator for item missingness of N_BIRTH_0 and PREGNANT_0 is the number of females, i.e. all males have jump codes in the respective variables.
No | Study variable | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
---|---|---|---|---|---|---|---|---|
34 | SMOKE_SHOP_0 | 3000 | 1681 (56.03) | 1319 (43.97) | 513 (17.1) | 0 (0) | 806 (26.87) | 1 |
35 | N_INJURIES_0 | 3000 | 320 (10.67) | 2680 (89.33) | 481 (16.03) | 0 (0) | 2199 (73.3) | 1 |
36 | N_BIRTH_0 | 3000 | 289 (9.63) | 2711 (90.37) | 499 (16.63) | 1113 (37.1) | 1099 (58.24) | 1 |
37 | INCOME_GROUP_0 | 3000 | 311 (10.37) | 2689 (89.63) | 515 (17.17) | 0 (0) | 2174 (72.47) | 1 |
38 | PREGNANT_0 | 3000 | 350 (11.67) | 2650 (88.33) | 519 (17.3) | 1066 (35.53) | 1065 (55.07) | 1 |
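The effect of jump codes on the denominator can be retraced with the counts from the table above. The sketch below assumes that the percentage of measurements is computed as measurements divided by observations minus jumps; this assumption reproduces the tabulated percentages for the three rows shown, but it is not taken from the dataquieR source code.

```r
# Counts copied from the table above
item_missingness <- data.frame(
  variable = c("SMOKE_SHOP_0", "N_BIRTH_0", "PREGNANT_0"),
  observations = c(3000, 3000, 3000),
  jumps = c(0, 1113, 1066),
  measurements = c(806, 1099, 1065),
  stringsAsFactors = FALSE
)

# Observations with a jump code are not expected by design,
# so they are removed from the denominator
item_missingness$denominator <-
  item_missingness$observations - item_missingness$jumps

item_missingness$pct_measurements <- round(
  100 * item_missingness$measurements / item_missingness$denominator, 2
)
print(item_missingness)
```

For N_BIRTH_0 the denominator is 3000 - 1113 = 1887, which yields 58.24 % measurements instead of the 36.63 % that would result from dividing by all 3000 observations.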
Limits describe ranges to check the plausibility of measurement values (hard, soft limits) or to identify measurements outside a measurable range (detection limits). Limits may apply to study data of type: float, integer, and date-time. Specifying limits can be content-driven (e.g. based on clinical information) or may depend on properties of the used examination device or the outcome under study. For example, body weight cannot be negative.
Unfortunately, the definition of limits can be ambiguous:
To avoid this ambiguity, HARD_LIMITS, SOFT_LIMITS, and DETECTION_LIMITS in the metadata are defined using interval notation. Values inside the interval are eligible/plausible/possible. The definition of intervals also relies on a distinct use of brackets:
Each side of the interval must be defined by a value of the same type as the measurement (including dates and date-times). If a side of the range is unbounded, \(-Inf\) and/or \(Inf\) must be specified. Please see the examples provided in Metadata in dataquieR.
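The interval notation can be made concrete with a small parser. The sketch below is written in base R and handles numeric limits only; dates and date-times are left out. It assumes the standard mathematical convention that a square bracket includes the boundary value and a round bracket excludes it, and that the two bounds are separated by a semicolon as in the example metadata (e.g. [18;Inf)). Both helper functions are hypothetical and not part of dataquieR.

```r
# Hypothetical helper: parse a numeric interval such as "[0;Inf)" or "(0;60]"
parse_interval <- function(x) {
  x <- trimws(x)
  open <- substr(x, 1, 1)
  close <- substr(x, nchar(x), nchar(x))
  stopifnot(open %in% c("[", "("), close %in% c("]", ")"))
  bounds <- trimws(strsplit(substr(x, 2, nchar(x) - 1), ";", fixed = TRUE)[[1]])
  stopifnot(length(bounds) == 2)
  list(
    lower = as.numeric(bounds[1]),   # "-Inf" and "Inf" are parsed by as.numeric()
    upper = as.numeric(bounds[2]),
    lower_closed = open == "[",      # assumed: "[" includes the lower bound
    upper_closed = close == "]"      # assumed: "]" includes the upper bound
  )
}

# Hypothetical helper: TRUE for values inside the interval
in_interval <- function(values, interval) {
  above <- if (interval$lower_closed) values >= interval$lower else values > interval$lower
  below <- if (interval$upper_closed) values <= interval$upper else values < interval$upper
  above & below
}

age_ok <- in_interval(c(-1, 0, 17, 18, 120), parse_interval("[18;Inf)"))
soft_ok <- in_interval(c(0, 60, 61), parse_interval("(0;60]"))
print(age_ok)
print(soft_ok)
```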
Two types of limits may be distinguished, depending on whether the range indicates inadmissible or merely improbable values.
HARD_LIMITS should be specified to identify inadmissible values. Inadmissible does not necessarily mean impossible. For example, while it is known that the heaviest man on Earth weighed more than 600 kg, it may be reasonable to declare values above 250 kg as inadmissible, because in a general-population study in Germany it is deemed unlikely that a heavier person will arrive at the examination center.
The application of the function dataquieR::con_limit_deviations() in combination with HARD_LIMITS leads to the removal of the respective values. The removal is indicated in the respective plot and reported in a message:
N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMITS and were removed.
N = 903 values in MEDICATION_0 have been above HARD_LIMITS and were removed.
N = 21 values in EDUCATION_1 have been above HARD_LIMITS and were removed.
FIG 3: Example of summary plot for limit deviations
If the removal of values outside the limit intervals is not intended, the function can be used in combination with SOFT_LIMITS.
The functionality of SOFT_LIMITS is similar to HARD_LIMITS. However, values outside the limits are not removed, because SOFT_LIMITS indicate improbable but not impossible measurements.
The formal setup of SOFT_LIMITS is identical to HARD_LIMITS.
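The different handling of the two limit types can be sketched as follows: values outside the hard limits are removed (set to NA), whereas values outside the soft limits are only flagged. The body weights and both limit intervals are hypothetical, and the code is a base R illustration rather than the logic of dataquieR::con_limit_deviations().

```r
# Hypothetical body weights in kg
weight <- c(72, 310, 95, 180, 250)

# Hypothetical limits: HARD_LIMITS [0;250], SOFT_LIMITS [40;150]
hard_lower <- 0
hard_upper <- 250
soft_lower <- 40
soft_upper <- 150

# Hard limits: values outside the interval are inadmissible and removed
outside_hard <- weight < hard_lower | weight > hard_upper
weight_clean <- replace(weight, outside_hard, NA)

# Soft limits: remaining values outside the interval are kept but flagged
outside_soft <- !is.na(weight_clean) &
  (weight_clean < soft_lower | weight_clean > soft_upper)

print(sum(outside_hard))   # number of removed values
print(weight_clean)
print(outside_soft)
```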
The definition of DETECTION_LIMITS can be necessary if measurement devices have predefined limits of sensitivity. Measurements may be indicated as being below or above the DETECTION_LIMITS. Such information should result in a different handling of the respective data values, as they are still informative and can be used in later analyses.
Values outside detection limits are not removed.
The formal setup of DETECTION_LIMITS is identical to HARD_LIMITS.
Checks for contradictions compare the values of two study data variables to detect inadmissible combinations. In contrast to the assessment of limits, only the combination of values in two variables is inadmissible, while the value of each variable is admissible on its own. Checks are performed row-wise, i.e. within an individual.
For example, the variable sex for a given participant may contain \(male\) and the variable no. of births = \(2\). Each value is admissible but the combination is not since male participants may not give birth.
The column CONTRADICTIONS in the metadata (DD) references pipe-separated (|) IDs of contradiction checks, such as \(1004\: |\: 1005\: |\: 1006\). Please see, for example, Metadata in dataquieR. Each of these IDs is linked to a specific contradiction check, which is defined either in a spreadsheet table or by means of a ShinyApp provided by dataquieR. An example is provided in the table below.
Important note: in the table below, the columns A and B contain the labels of variables, since the columns in the study data have rather technical names. In this case, the function con_contradictions must be called with the label_col argument specified.
ID | Function_name | A | A_levels | A_value | B | B_levels | B_value | Label |
---|---|---|---|---|---|---|---|---|
1001 | A_less_than_B_vv | AGE_1 | NA | NA | AGE_0 | NA | NA | Age follow-up |
1002 | A_not_equal_B_vv | SEX_1 | NA | NA | SEX_0 | NA | NA | Sex follow-up |
1003 | A_less_than_B_vv | EDUCATION_1 | NA | NA | EDUCATION_0 | NA | NA | Education follow-up |
1004 | A_levels_and_B_levels_ll | EATING_PREFS_0 | vegetarian | NA | MEAT_CONS_0 | 1-2d a week \| 3-4d a week \| 5-6d a week \| daily | NA | Nutrition inconsistency vegetarian |
1005 | A_levels_and_B_levels_ll | EATING_PREFS_0 | vegan | NA | MEAT_CONS_0 | 1-2d a week \| 3-4d a week \| 5-6d a week \| daily | NA | Nutrition inconsistency vegan |
1006 | A_levels_and_B_levels_ll | EATING_PREFS_0 | none | NA | MEAT_CONS_0 | never | NA | Nutrition inconsistency |
1007 | A_levels_and_B_levels_ll | SMOKING_0 | no | NA | SMOKE_SHOP_0 | 1-2d a week \| 3-4d a week \| 5-6d a week \| daily | NA | Non-smokers inconsistency |
1008 | A_levels_and_B_levels_ll | SMOKING_0 | yes | NA | SMOKE_SHOP_0 | never | NA | Smokers inconsistency |
1009 | A_not_equal_B_vv | ARM_CIRC_DISC_0 | NA | NA | ARM_CUFF_0 | NA | NA | Blood pressure false cuff |
1010 | A_levels_and_B_gt_value_lc | PREGNANT_0 | yes | NA | AGE_0 | NA | 55 | Pregnancy high age |
1011 | A_less_than_B_vv | LAB_DT_0 | NA | NA | EXAM_DT_0 | NA | NA | LAB before MEX |
In total, 10 different types of comparisons between two variables can be defined to detect contradictions. In the first row of the table above, the function name A_less_than_B_vv has been selected to ensure that the age of a study participant at the follow-up examination is never lower than the age at baseline. The related check rule flags a contradiction if the value in variable A is lower than the value in variable B. The suffixes _vv, _ll, and _lc are irrelevant for the user but necessary for the ShinyApp to prompt further inputs. The complete list of functions to check contradictions is provided in the Appendix under Contradiction checks.
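Two of the check rules from the table can be written out as row-wise comparisons. The sketch below uses hypothetical study data with labelled values for readability (real study data are typically integer coded), and it mimics the rules 1001 and 1004 in base R rather than calling dataquieR::con_contradictions().

```r
# Hypothetical study data, one row per participant
study_data <- data.frame(
  AGE_0 = c(45, 60, 33),
  AGE_1 = c(47, 58, 35),
  EATING_PREFS_0 = c("vegetarian", "none", "vegan"),
  MEAT_CONS_0 = c("daily", "daily", "never"),
  stringsAsFactors = FALSE
)

# Rule 1001 (A_less_than_B_vv): A = AGE_1, B = AGE_0
# A contradiction is present if age at follow-up is lower than age at baseline
age_contradiction <- study_data$AGE_1 < study_data$AGE_0

# Rule 1004 (A_levels_and_B_levels_ll): A = EATING_PREFS_0, B = MEAT_CONS_0
# A contradiction is present if a vegetarian reports any meat consumption
meat_levels <- c("1-2d a week", "3-4d a week", "5-6d a week", "daily")
veg_contradiction <- study_data$EATING_PREFS_0 %in% "vegetarian" &
  study_data$MEAT_CONS_0 %in% meat_levels

print(age_contradiction)
print(veg_contradiction)
```

Each rule yields one logical value per row, so counts and percentages of contradictions follow directly from `sum()` and `mean()`.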
The application of the function dataquieR::con_contradictions() provides three different outputs. One of them is a summary table of applied checks.
Check ID | Check type | Variables A and B | A Levels | B Levels | Contradictions (N) | Contradictions (%) | Grading | Label |
---|---|---|---|---|---|---|---|---|
1002 | A_not_equal_B_vv | A is: SEX_1; B is: SEX_0 | NA | NA | 150 | 5.00 | 1 | Sex follow-up |
1001 | A_less_than_B_vv | A is: AGE_1; B is: AGE_0 | NA | NA | 150 | 5.00 | 1 | Age follow-up |
1009 | A_not_equal_B_vv | A is: ARM_CIRC_DISC_0; B is: ARM_CUFF_0 | NA | NA | 173 | 5.77 | 1 | Blood pressure false cuff |
1003 | A_less_than_B_vv | A is: EDUCATION_1; B is: EDUCATION_0 | NA | NA | 7 | 0.23 | 0 | Education follow-up |
1004 | A_levels_and_B_levels_ll | A is: EATING_PREFS_0; B is: MEAT_CONS_0 | vegetarian | 1-2d a week,3-4d a week,5-6d a week,daily | 54 | 1.80 | 1 | Nutrition inconsistency vegetarian |
1005 | A_levels_and_B_levels_ll | A is: EATING_PREFS_0; B is: MEAT_CONS_0 | vegan | 1-2d a week,3-4d a week,5-6d a week,daily | 19 | 0.63 | 0 | Nutrition inconsistency vegan |
1006 | A_levels_and_B_levels_ll | A is: EATING_PREFS_0; B is: MEAT_CONS_0 | none | never | 64 | 2.13 | 1 | Nutrition inconsistency |
1007 | A_levels_and_B_levels_ll | A is: SMOKING_0; B is: SMOKE_SHOP_0 | no | 1-2d a week,3-4d a week,5-6d a week,daily | 91 | 3.03 | 1 | Non-smokers inconsistency |
1008 | A_levels_and_B_levels_ll | A is: SMOKING_0; B is: SMOKE_SHOP_0 | yes | never | 118 | 3.93 | 1 | Smokers inconsistency |
1010 | A_levels_and_B_gt_value_lc | A is: PREGNANT_0; B is: AGE_0 | yes | NA | 5 | 0.17 | 0 | Pregnancy high age |
Usually, not all variables of the study data will be subject to DQ reporting. To allow for simple filtering, different roles of variables can be defined. The number of roles is not limited. In dataquieR, the following roles are defined: intro, primary, secondary, and process.
In the column VARIABLE_ORDER, the order of the variables in a data quality report can be defined.
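Roles and order can be combined to select and arrange the variables of a report. The following base R sketch uses a hypothetical metadata excerpt; only the column names VAR_NAMES, VARIABLE_ROLE, and VARIABLE_ORDER and the role names are taken from the example metadata. Which roles enter a report is a choice made here for illustration.

```r
# Hypothetical metadata excerpt
meta_data <- data.frame(
  VAR_NAMES = c("x3", "x1", "x4", "x2"),
  VARIABLE_ROLE = c("primary", "intro", "process", "primary"),
  VARIABLE_ORDER = c(3L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

# Keep only variables with the roles of interest (assumed selection)
report_roles <- c("primary", "secondary")
report_meta <- meta_data[meta_data$VARIABLE_ROLE %in% report_roles, ]

# Arrange the remaining variables as defined in VARIABLE_ORDER
report_meta <- report_meta[order(report_meta$VARIABLE_ORDER), ]
report_vars <- report_meta$VAR_NAMES
print(report_vars)
```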
Metadata in dataquieR
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
---|---|---|---|---|---|
1 | VAR_NAMES [character] | 1. v00000 2. v00001 3. v00002 4. v00003 5. v00004 6. v00005 7. v00006 8. v00007 9. v00008 10. v00009 [ 43 others ] | 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 43 (81.1%) | 53 (100.0%) | 0 (0.0%) |
2 | LABEL [character] | 1. AGE_0 2. AGE_1 3. AGE_GROUP_0 4. ARM_CIRC_0 5. ARM_CIRC_DISC_0 6. ARM_CUFF_0 7. ASTHMA_0 8. BSG_0 9. CENTER_0 10. CRP_0 [ 43 others ] | 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 43 (81.1%) | 53 (100.0%) | 0 (0.0%) |
3 | DATA_TYPE [character] | 1. datetime 2. float 3. integer 4. string | 4 ( 7.5%) 6 (11.3%) 37 (69.8%) 6 (11.3%) | 53 (100.0%) | 0 (0.0%) |
4 | VALUE_LABELS [character] | 1. 0 = no \| 1 = yes 2. 0 = females \| 1 = males 3. 0 = never \| 1 = 1-2d a we 4. 0 = pre-primary \| 1 = pri 5. 1 = (-Inf,20] \| 2 = (20,3 6. 0 = <10k \| 1 = [10-30k) \| 7. 0 = none \| 1 = vegetarian 8. 1 = Berlin \| 2 = Hamburg 9. A = excellent \| B = good 10. single \| married \| divorc [ 3 others ] | 10 (38.5%) 2 ( 7.7%) 2 ( 7.7%) 2 ( 7.7%) 2 ( 7.7%) 1 ( 3.8%) 1 ( 3.8%) 1 ( 3.8%) 1 ( 3.8%) 1 ( 3.8%) 3 (11.5%) | 26 (49.1%) | 27 (50.9%) |
5 | MISSING_LIST [character] | 1. 99980 \| 99983 \| 99988 \| 2. 99980 \| 99983 \| 99988 \| 3. 99980 \| 99988 \| 99989 \| 4. 99980 \| 99981 \| 99982 \| 9 5. 99980 \| 99981 \| 99982 \| 9 6. 99980 \| 99983 \| 99987 \| 7. 99980 \| 99987 8. 99981 \| 99982 | 15 (41.7%) 8 (22.2%) 1 ( 2.8%) 4 (11.1%) 2 ( 5.6%) 2 ( 5.6%) 1 ( 2.8%) 3 ( 8.3%) | 36 (67.9%) | 17 (32.1%) |
6 | JUMP_LIST [integer] | Min : 88880 Mean : 88888 Max : 88890 | 88880 : 2 (20.0%) 88890 : 8 (80.0%) | 10 (18.9%) | 43 (81.1%) |
7 | HARD_LIMITS [character] | 1. [0;10] 2. [0;1] 3. [2018-01-01 00:00:00 CET; 4. [0;4] 5. [0;6] 6. [0;Inf) 7. [1;3] 8. [18;Inf) 9. [0;100] 10. [0;2] [ 3 others ] | 9 (27.3%) 5 (15.2%) 4 (12.1%) 2 ( 6.1%) 2 ( 6.1%) 2 ( 6.1%) 2 ( 6.1%) 2 ( 6.1%) 1 ( 3.0%) 1 ( 3.0%) 3 ( 9.1%) | 33 (62.3%) | 20 (37.7%) |
8 | DETECTION_LIMITS [character] | 1. [0;265] 2. [0.16;Inf) | 2 (66.7%) 1 (33.3%) | 3 (5.7%) | 50 (94.3%) |
9 | CONTRADICTIONS [character] | 1. 1001 2. 1002 3. 1003 4. 1004 \| 1005 \| 1006 5. 1007 \| 1008 6. 1009 7. 1010 8. 1011 | 2 (13.3%) 2 (13.3%) 2 (13.3%) 2 (13.3%) 2 (13.3%) 2 (13.3%) 1 ( 6.7%) 2 (13.3%) | 15 (28.3%) | 38 (71.7%) |
10 | SOFT_LIMITS [character] | 1. (0;60] 2. (55;100) 3. (90;170) 4. [0;10] 5. [0;5] 6. [0.2;10) 7. [0.2;30) 8. [1;9] | 1 (11.1%) 1 (11.1%) 1 (11.1%) 2 (22.2%) 1 (11.1%) 1 (11.1%) 1 (11.1%) 1 (11.1%) | 9 (17.0%) | 44 (83.0%) |
11 | DISTRIBUTION [character] | 1. gamma 2. normal 3. uniform | 1 (14.3%) 4 (57.1%) 2 (28.6%) | 7 (13.2%) | 46 (86.8%) |
12 | DECIMALS [integer] | Mean (sd) : 0.7 (1.2) min < med < max: 0 < 0 < 3 IQR (CV) : 0.8 (1.8) | 0 : 4 (66.7%) 1 : 1 (16.7%) 3 : 1 (16.7%) | 6 (11.3%) | 47 (88.7%) |
13 | DATA_ENTRY_TYPE [integer] | Min : 0 Mean : 0.3 Max : 1 | 0 : 4 (66.7%) 1 : 2 (33.3%) | 6 (11.3%) | 47 (88.7%) |
14 | KEY_OBSERVER [character] | 1. v00011 2. v00012 3. v00032 | 1 ( 5.6%) 2 (11.1%) 15 (83.3%) | 18 (34.0%) | 35 (66.0%) |
15 | KEY_DEVICE [character] | 1. v00010 2. v00016 | 2 (66.7%) 1 (33.3%) | 3 (5.7%) | 50 (94.3%) |
16 | KEY_DATETIME [character] | 1. v00013 2. v00017 | 4 (66.7%) 2 (33.3%) | 6 (11.3%) | 47 (88.7%) |
17 | KEY_STUDY_SEGMENT [character] | 1. v10000 2. v20000 3. v30000 4. v40000 5. v50000 | 11 (20.8%) 11 (20.8%) 4 ( 7.5%) 18 (34.0%) 9 (17.0%) | 53 (100.0%) | 0 (0.0%) |
18 | VARIABLE_ROLE [character] | 1. intro 2. primary 3. process 4. secondary | 11 (20.8%) 30 (56.6%) 9 (17.0%) 3 ( 5.7%) | 53 (100.0%) | 0 (0.0%) |
19 | VARIABLE_ORDER [integer] | Mean (sd) : 27 (15.4) min < med < max: 1 < 27 < 53 IQR (CV) : 26 (0.6) | 53 distinct values (Integer sequence) | 53 (100.0%) | 0 (0.0%) |
20 | LONG_LABEL [character] | 1. AGE_0 2. AGE_1 3. AGE_GROUP_0 4. ARM_CIRCUMFERENCE_0 5. ARM_CIRCUMFERENCE_DISCRET 6. ARM_USED_CUFF_0 7. ASTHMA_YESNO_0 8. BSG_0 9. CENTER_0 10. CRP_0 [ 43 others ] | 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 1 ( 1.9%) 43 (81.1%) | 53 (100.0%) | 0 (0.0%) |
21 | LOCATION_RANGE [character] | 1. (100;140) 2. (20;30) 3. (60;100) 4. [2;4) 5. [45;55] | 1 (16.7%) 1 (16.7%) 1 (16.7%) 1 (16.7%) 2 (33.3%) | 6 (11.3%) | 47 (88.7%) |
22 | LOCATION_METRIC [character] | 1. Mean 2. Median | 5 (83.3%) 1 (16.7%) | 6 (11.3%) | 47 (88.7%) |
23 | PROPORTION_RANGE [character] | 1. (10;90) 2. [15;30] 3. [48;52] 4. 0 in [48;52] 5. 4 in (2;10] \| 5 in (5;15] | 1 (20.0%) 1 (20.0%) 1 (20.0%) 1 (20.0%) 1 (20.0%) | 5 (9.4%) | 48 (90.6%) |
Function notation | Explanation |
---|---|
A \(\ne\) B | Value in A is not equal to value in B |
A > B | Value in A is greater than value in B |
A \(\ge\) B | Value in A is greater than or equal to value in B |
A observed \(\cap\) B missing | A is observed and B is missing |
A observed \(\cap\) B is not missing | A is observed and B is observed |
A \(\in\) {set of levels} \(\cap\) B > value | A has one of the levels and B is greater than a value |
A \(\in\) {set of levels} \(\cap\) B = value | A has one of the levels and B is equal to a value |
A \(\in\) {set of levels} \(\cap\) B < value | A has one of the levels and B is lower than a value |
A \(\in\) {set of levels} \(\cap\) B \(\in\) {set of levels} | A has one of the levels and B has one of the levels |
A \(\in\) {set of levels} \(\cap\) B \(\notin\) {set of levels} | A has one of the levels and B has none of the levels |