APPROACH

This implementation can be applied in two use cases:

  1. participation in study segments is not recorded by respective variables, e.g. a participants’ refusal to attend a specific examination is not recorded.
  2. participation in study segments is recorded by respective variables.

Use case (1) will be common in smaller studies. For the calculation of segment missingness it is assumed that study variables are nested in respective segments. This structure must be specified in the static metadata. The R-function identifies all variables within each segment and returns TRUE if all variables within a segment are missing, otherwise FALSE.

Use case (2) assumes a more complex structure of study data and meta data. The study data comprise so-called intro-variables (either TRUE/FALSE or codes for non-participation). The column KEY_STUDY_SEGMENT in the metadata is filled by variable-IDs indicating for each variable the respective intro-variable. This structure has the benefit that subsequent calculation of item missingness obtains correct denominators for the calculation of missingness rates.

Example of study data

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

This example of study data has N=3000 observations. Study data variables have abstract and non-interpretable names; appropriate labels must be mapped from the metadata.

v00000 v00001 v00002 v00003 v00004 v00005 v01003 v01002 v00103 v00006
3 LEIIX715 0 49 127 77 49 0 40-49 3.8
1 QHNKM456 0 47 114 76 47 0 40-49 1.9
1 HTAOB589 0 50 114 71 50 0 50-59 0.8
5 HNHFV585 0 48 120 65 48 0 40-49 3.8
1 UTDLS949 0 56 119 78 56 0 50-59 4.1
5 YQFGE692 1 47 133 81 47 1 40-49 9.5
1 AVAEH932 0 53 114 78 53 0 50-59 5.0
3 QDOPT378 1 48 116 86 48 1 40-49 9.6
3 BMOAK786 0 44 115 71 44 0 40-49 2.0
5 ZDKNF462 0 50 116 74 50 0 50-59 2.4

Example of metadata

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

Information corresponding to the study data is kept in the table of static metadata. An interpretable label for each variable is also attached. Besides the data type and labels of all variables further expected characteristics are stored in the metadata.

Regarding the following implementation the column KEY_STUDY_SEGMENT in the metadata is crucial. According to use-case (2) (see Approach), variable-IDs are provided in this column. However, content of this column can be string-values as well.

VAR_NAMES LABEL DATA_TYPE KEY_STUDY_SEGMENT
4 v00003 AGE_0 integer v10000
39 v00030 MEDICATION_0 integer v40000
1 v00000 CENTER_0 integer v10000
34 v00025 SMOKE_SHOP_0 integer v40000
23 v00016 DEV_NO_0 integer v30000
43 v40000 PART_INTERVIEW integer v10000
14 v00009 ARM_CIRC_0 float v20000
18 v00012 USR_BP_0 string v20000
33 v00024 SMOKING_0 integer v40000
21 v00014 CRP_0 float v30000

Required R-packages

This implementation requires only the R-package ggplot2.

R-FUNCTION

The R-function requires definition of eight arguments:

  • study_data: mandatory, the name of the data frame that contains the measurements
  • meta_data: mandatory, the name of the data frame that contains static metadata of the study data
  • group_vars: optional, the name of the observer, device or any variable alike that should be compared
  • strata_vars: optional, the name of a variable that should be used for stratification
  • label_col: optional, specifies the column name of the metadata table which contains labels for all variables in the study data
  • threshold_value: optional with default (10%), an numeric value see Implementation and use of thresholds
  • direction: optional with default (high), can be either low or high with regard to the threshold. If values above the threshold are considered critical high should be selected, low respectively. See also Implementation and use of thresholds
  • exclude_roles: optional, a concatenated character string specifying VARIABLE_ROLE(s) that can be deselected.

com_segment_missingness

com_segment_missingness <-  function(study_data, meta_data, group_vars = NULL,
                                    strata_vars = NULL, label_col,
                                    threshold_value,
                                    direction, exclude_roles = "process") {

  #########
  # STOPS #
  #########
  if (missing(threshold_value) ||
      length(threshold_value) != 1 ||
      !is.numeric(threshold_value)) {
    util_warning(
      c("threshold_value should be a single number between 0 and 100.",
      "Invalid value specified, setting to 10%%."),
      applicability_problem = TRUE)
    threshold_value <- 10
  }

  if (missing(direction)) {
    util_warning(c(
      "No specification of threshold direction found.",
      "The function interprets values higher the threshold as violations."),
      applicability_problem = TRUE)
    direction <- "high"
  }

  if (length(direction) != 1) {
    util_error(
      "Parameter %s, if not missing, should be of length 1, but not %d.",
      dQuote("direction"), length(direction),
      applicability_problem = TRUE)
  }

  if (!(direction %in% c("low", "high"))) {
    util_error(
      "Parameter %s should be either %s or %s, but not %s.",
      dQuote("direction"),
      dQuote("low"), dQuote("high"), dQuote(direction),
      applicability_problem = TRUE
    )
  }

  ####################
  # PREPS AND CHECKS #
  ####################

  # map meta to study
  util_prepare_dataframes()

  # correct variable usage
  util_correct_variable_use("strata_vars",
    allow_null = TRUE,
    need_type = "!float"
  )

  util_correct_variable_use("group_vars",
    allow_null = TRUE,
    need_type = "!float"
  )

  # browser()
  # exclude_roles = c("process", "not")

  # should some variables not be considered?
  if (VARIABLE_ROLE %in% names(meta_data)) {

    # a: not all roles specified found in metadata
    if (!(all(exclude_roles %in% meta_data[[VARIABLE_ROLE]]))) {
      if (any(exclude_roles %in% meta_data[[VARIABLE_ROLE]])) {
        util_warning(paste0(
          "Specified VARIABLE_ROLE(s): '",
          exclude_roles[!(exclude_roles %in% meta_data[[VARIABLE_ROLE]])],
          "' was not found in metadata, only: '",
          exclude_roles[exclude_roles %in% meta_data[[VARIABLE_ROLE]]],
          "' is used."
        ), applicability_problem = TRUE)

        exclude_roles <- exclude_roles[exclude_roles %in%
                                         meta_data[[VARIABLE_ROLE]]]

        which_vars_not <-
          meta_data[[label_col]][meta_data[[VARIABLE_ROLE]] %in% exclude_roles]
        if (missing(label_col)) {
          which_vars_not <-
            meta_data[[VAR_NAMES]][meta_data[[VARIABLE_ROLE]] %in%
                                     exclude_roles]
        }
        which_vars_not <- setdiff(which_vars_not, strata_vars)
        which_vars_not <- setdiff(which_vars_not, group_vars)
        if (length(intersect(names(ds1), which_vars_not)) > 0) {
          util_warning(paste0(
            "Study variables: ",
            paste(dQuote(intersect(names(ds1), which_vars_not)),
                  collapse = ", "),
            " are not considered due to their VARIABLE_ROLE."
          ), applicability_problem = TRUE)
        }
        ds1 <- ds1[, !(names(ds1) %in% which_vars_not)]
      } else {
        exclude_roles <- FALSE
        util_warning(
          c("Specified VARIABLE_ROLE(s) were not found in metadata.",
            "All variables are included here."),
          applicability_problem = TRUE)
      }

      # b: all roles are found in metadata
    } else {
      if (missing(exclude_roles)) {
        util_warning(
          c("Formal exclude_roles is used with default: all process variables",
            "are not included here."), applicability_problem = TRUE)
      }

      which_vars_not <- meta_data[[label_col]][meta_data[[VARIABLE_ROLE]] %in%
                                                 exclude_roles]
      if (missing(label_col)) {
        which_vars_not <- meta_data[[VAR_NAMES]][meta_data[[VARIABLE_ROLE]] %in%
                                                   exclude_roles]
      }
      which_vars_not <- setdiff(which_vars_not, strata_vars)
      which_vars_not <- setdiff(which_vars_not, group_vars)
      if (length(intersect(names(ds1), which_vars_not)) > 0) {
        util_warning(paste0(
          "Study variables: ", paste(dQuote(intersect(names(ds1),
                                                      which_vars_not)),
                                     collapse = ", "),
          " are not considered due to their VARIABLE_ROLE."
        ), applicability_problem = FALSE)
      }
      ds1 <- ds1[, !(names(ds1) %in% which_vars_not)]
    }
  } else {
    # since there are no roles defined exclusion is set to false
    exclude_roles <- FALSE
    util_warning(
      c("VARIABLE_ROLE has not been defined in the metadata,",
        "therefore all variables within segments are used."),
      applicability_problem = TRUE)
  }

  # Which segments?
  if (!("KEY_STUDY_SEGMENT" %in% names(meta_data))) {
    util_error("Metadata do not contain the column KEY_STUDY_SEGMENT.",
               applicability_problem = TRUE)
  }

  segs <- unique(meta_data$KEY_STUDY_SEGMENT[!(is.na(
    meta_data$KEY_STUDY_SEGMENT))])
  seg_names <- meta_data[[LONG_LABEL]][meta_data$VAR_NAMES %in% segs]

  if (length(seg_names) > 0) {
    names(segs) <- seg_names
  } else {
    names(segs) <- segs
  }

  # browser()
  # determine which vars per segment
  var_sets <- list()
  for (i in seq_along(segs)) {
    if (isFALSE(exclude_roles)) {
      meta_data_excl <- meta_data[!(meta_data$VAR_NAMES %in% segs), ]
      var_sets[[i]] <-
        meta_data_excl[[label_col]][meta_data_excl$KEY_STUDY_SEGMENT == segs[i]]
    } else {
      meta_data_excl <-
        meta_data[!(meta_data[[VARIABLE_ROLE]] %in% exclude_roles) &
                    !(meta_data$VAR_NAMES %in% segs), ]
      var_sets[[i]] <-
        meta_data_excl[[label_col]][meta_data_excl$KEY_STUDY_SEGMENT ==
                                      segs[i]][!is.na(meta_data_excl[[
                                        label_col]][
                                          meta_data_excl$KEY_STUDY_SEGMENT ==
                                            segs[i]])]
    }
  }

  # Which groups?
  if (length(group_vars) > 0) {
    # No. of group levels and labels
    ds1[[group_vars]] <- util_assign_levlabs(
      variable = ds1[[group_vars]],
      string_of_levlabs = meta_data$VALUE_LABELS[meta_data$LABEL == group_vars],
      splitchar = SPLIT_CHAR,
      assignchar = "="
    )
    gr <- unique(ds1[[group_vars]][!is.na(ds1[[group_vars]])])
    gr <- gr[order(gr)]
    # covariables for plot
    cvs <- c(group_vars, "Examinations")

    # missings in grouping variable?
    ds1 <- ds1[!is.na(ds1[[group_vars]]), ]
  } else {
    gr <- 1
    group_vars <- "Group"
    ds1$Group <- 1
    cvs <- "Examinations"
  }

  if (length(strata_vars) > 0) {
    # No. of strata levels and labels
    if (dim(ds1)[1] != dim(ds1[!is.na(ds1[[strata_vars]]), ])[1]) {
      ds1 <- ds1[!is.na(ds1[[strata_vars]]), ]
      util_warning(paste0("Some observations in ", strata_vars,
                          " are NA and were removed."),
                   applicability_problem = FALSE)
    }

    ds1[[strata_vars]] <- util_assign_levlabs(
      variable = ds1[[strata_vars]],
      string_of_levlabs = meta_data$VALUE_LABELS[meta_data$LABEL ==
                                                   strata_vars],
      splitchar = SPLIT_CHAR,
      assignchar = "="
    )
    strata <- unique(ds1[[strata_vars]])[!is.na(unique(ds1[[strata_vars]]))]
    # covariables for plot
    cvs <- c(strata_vars, group_vars, "Examinations")
  }

  # create result dataframe by factor combinations
  if (length(strata_vars) == 0) {
    res_df <- expand.grid(Group = gr, Examinations = names(segs))
    colnames(res_df) <- c(group_vars, "Examinations")
  } else {
    res_df <- expand.grid(
      Strata = strata, Group = gr, Examinations = names(segs),
      stringsAsFactors = TRUE
    )
    colnames(res_df) <- c(strata_vars, group_vars, "Examinations")
    res_df <- res_df[order(res_df[[strata_vars]], res_df[[group_vars]]), ]
  }

  ################
  # CALCULATIONS #
  ################

  myfun <- function(x) {
    all(is.na(x))
  }
  Ns <- c()
  Ms <- c()

  if (length(strata_vars) == 0) {
    for (j in seq_along(segs)) {
      for (i in seq_along(gr)) {
        Ns <- c(Ns, dim(ds1[ds1[[group_vars]] == gr[i], ])[1])
        Ms <- c(Ms, sum(apply(ds1[ds1[[group_vars]] == gr[i],
                                  c(as.character(unlist(var_sets[j]))),
                                  drop = FALSE], 1, myfun)))
      }
    }
  } else {
    for (i in seq_along(strata)) {
      for (j in seq_along(gr)) {
        for (k in seq_along(segs)) {
          Ns <- c(Ns, dim(ds1[ds1[[strata_vars]] == strata[i] &
                                ds1[[group_vars]] == gr[j], ])[1])
          Ms <- c(Ms, sum(apply(ds1[
            ds1[[strata_vars]] == strata[i] & ds1[[group_vars]] == gr[j],
            c(as.character(unlist(var_sets[k])))
          ], 1, myfun)))
        }
      }
    }
  }

  res_df$"No. of Participants" <- Ns
  res_df$"No. of missing segments" <- Ms
  res_df$"(%) of missing segments" <- round(res_df$`No. of missing segments` /
                                              res_df$`No. of Participants` *
                                              100, digits = 2)
  res_df$"(%) of missing segments" <-
    as.numeric(res_df$"(%) of missing segments")
  res_df$threshold <- threshold_value
  res_df$direction <- direction

  if (direction == "high") {
    res_df$grading <- ifelse(res_df$"(%) of missing segments" >
                               threshold_value, 1, 0)
  } else {
    res_df$grading <- ifelse(res_df$"(%) of missing segments" <
                               threshold_value, 1, 0)
  }

  # PLOT
  inversion <- ifelse(direction == "low", 1, 0)

  # order result data frame by grouping variable
  if (length(strata_vars) == 0) {
    # order result data frame by grouping variable
    res_df <- res_df[order(res_df[[group_vars]]), ]
    p <- util_heatmap_1th(
      df = res_df, cat_vars = cvs, values = "(%) of missing segments",
      right_intv = TRUE, threshold = threshold_value,
      invert = inversion
    )$SummaryPlot
  } else {
    # order result data frame by grouping variable
    res_df <- res_df[order(res_df[[strata_vars]], res_df[[group_vars]]), ]
    p <- util_heatmap_1th(
      df = res_df, cat_vars = cvs[-1], values = "(%) of missing segments",
      right_intv = TRUE, threshold = threshold_value,
      invert = inversion, strata = strata_vars
    )$SummaryPlot
  }

  suppressWarnings({
    # suppress wrong warnings: https://github.com/tidyverse/ggplot2/pull/4439/commits
    # find out size of the plot https://stackoverflow.com/a/51795017
    bp <- ggplot_build(p)
    w <- 2 * length(bp$layout$panel_params[[1]]$x$get_labels())
    if (w == 0) {
      w <- 10
    }
    w <- w + 2 +
      max(nchar(bp$layout$panel_params[[1]]$y$get_labels()),
          na.rm = TRUE)
    h <- 2 * length(bp$layout$panel_params[[1]]$y$get_labels())
    if (h == 0) {
      h <- 10
    }
    h <- h + 15

    p <- util_set_size(p, width_em = w, height_em = h)
  })

  return(list(SummaryData = res_df, SummaryPlot = p))
}

Implementation and use of thresholds

This implementation uses one threshold to discriminate critical from non-critical values. If direction is high than all values below the threshold_value are normal (displayed in dark blue in the plot and flagged with grading = 0 in the dataframe). All values above the threshold_value are considered critical. The more they deviate from the threshold the displayed color shifts to dark red. All critical values are highlighted with grading = 1 in the summary data frame. By default, highest values are always shown in dark red irrespective of the absolute deviation.

If direction is low than all values above the threshold_value are normal (displayed in dark blue, grading = 0).

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Call of the R-function

Please provide a call of the R-Function using the example data and metadata.

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    label_col = "LABEL", 
                                    threshold_value = 5, 
                                    direction = "high",
                                    exclude_roles = c("secondary", "process"))

OUTPUT

The function has two outputs, a summary dataframe and a heatmap-like graphic that highlights critical values depending on the respective threshold_value and direction.

Threshold direction: high

Output 1: Summary data

The summary data frame expands over all possible combinations of aux_variable levels and examinations identified in the metadata. The threshold_value specified by the user and the direction are added to the dataframe. Since direction is high all values above the threshold are considered critical and flagged by grading = 1.

Group Examinations No. of Participants No. of missing segments (%) of missing segments threshold direction grading
1 PARTICIPATION_STUDY 3000 0 0.00 5 high 0
1 PARTICIPATION_PHYSICAL_EXAM 3000 220 7.33 5 high 1
1 PARTICIPATION_LABORATORY 3000 173 5.77 5 high 1
1 PARTICIPATION_INTERVIEW 3000 408 13.60 5 high 1
1 PARTICIPATION_QUESTIONNAIRE 3000 136 4.53 5 high 0

Output 2: Summary plot

The summary plot frame is called using mp1$SummaryPlot:

Using further stratification

For some analyses adding new and transformed variable to the study data is necessary.

# use the month function of the lubridate package to extract month of exam date
require(lubridate)
# apply changes to copy of data
sd2 <- sd1
# indicate first/second half year
sd2$month <- month(sd2$v00013)

Static metadata of the variable must be added to the respective metadata.

MD_TMP <- prep_add_to_meta(VAR_NAMES    = "month",
                           DATA_TYPE    = "integer",
                           LABEL        = "EXAM_MONTH",
                           VALUE_LABELS = "1 = January | 2 = February | 3 = March | 
                                          4 = April | 5 = May | 6 = June | 7 = July |
                                          8 = August | 9 = September | 10 = October |
                                          11 = November | 12 = December",
                           MISSING_LIST = "",
                           meta_data    = md1)

Subsequent call of the R-function may include the new variable.

MissSegs <- com_segment_missingness(study_data = sd2, 
                                    meta_data = MD_TMP, 
                                    group_vars = "EXAM_MONTH", 
                                    label_col = "LABEL", 
                                    threshold_value = 1, 
                                    direction = "high",
                                    exclude_roles = c("secondary", "process"))

Output 2: Summary plot

The output of mp1$SummaryPlot uses now facets from ggplot(), while stratum of the new variable represents one facet:

INTERPRETATION

This indicator makes use of a simple threshold implementation which is specified by the user. By default, highest deviation from the threshold is always displayed in dark red, irrespective of the absolute deviation. Classifying an deviation to be critical is up to the user and involves qualitative interpretation.

Concept relations