APPROACH

Inadmissible numerical values can be of type integer or float. This implementation requires the definition of intervals in the metadata to examine the admissibility of numerical study data.

ALGORITHM OF THIS IMPLEMENTATION:

  1. Remove missing codes from the study data (if defined in the metadata).
  2. Interpret the variable-specific intervals supplied in the metadata.
  3. Identify measurements outside the defined limits. Two output data frames are generated:
    • one on the level of observations, flagging each deviation, and
    • a summary table for each variable.
  4. Generate a list of plots, one for each variable examined for limit deviations. The histogram-like plots indicate the respective limits as well as deviations.
  5. Remove values exceeding limits in a data frame of modified study data.
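
The steps above can be sketched on a single toy variable in base R. This is only an illustration under assumptions: the values, the missing code 99980 and the limits [0; 100] are invented, and all object names are hypothetical rather than taken from dataquieR.

```r
# Invented toy data: one variable with a missing code and limits [0; 100].
x <- c(12, 150, -3, 99980, 47, NA)
missing_codes <- c(99980)  # assumed missing code list
lower <- 0                 # assumed lower limit
upper <- 100               # assumed upper limit

# Step 1: replace missing codes by NA.
x_clean <- ifelse(x %in% missing_codes, NA, x)

# Step 3: flag each observation (1 = deviation, 0 = within limits, NA = missing).
below <- ifelse(is.na(x_clean), NA, as.integer(x_clean < lower))
above <- ifelse(is.na(x_clean), NA, as.integer(x_clean > upper))

# Step 3 (summary): number of deviations for this variable.
summary_row <- data.frame(
  Variable = "x",
  Below_N  = sum(below, na.rm = TRUE),
  Above_N  = sum(above, na.rm = TRUE)
)

# Step 5: modified data in which limit deviations are set to NA.
x_modified <- x_clean
x_modified[which(below == 1 | above == 1)] <- NA
```

Steps 2 and 4 (interpreting the interval strings and plotting) are left out here to keep the sketch short.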

Example of study data

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

This example of study data has N=3000 observations. Study data variables have abstract and non-interpretable names; appropriate labels must be mapped from the metadata.

v00000 v00001 v00002 v00003 v00004 v00005 v01003 v01002 v00103 v00006
3 LEIIX715 0 49 127 77 49 0 40-49 3.8
1 QHNKM456 0 47 114 76 47 0 40-49 1.9
1 HTAOB589 0 50 114 71 50 0 50-59 0.8
5 HNHFV585 0 48 120 65 48 0 40-49 3.8
1 UTDLS949 0 56 119 78 56 0 50-59 4.1
5 YQFGE692 1 47 133 81 47 1 40-49 9.5
1 AVAEH932 0 53 114 78 53 0 50-59 5.0
3 QDOPT378 1 48 116 86 48 1 40-49 9.6
3 BMOAK786 0 44 115 71 44 0 40-49 2.0
5 ZDKNF462 0 50 116 74 50 0 50-59 2.4

Example of metadata

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

Information corresponding to the study data is kept in the table of static metadata. An interpretable label for each variable is also attached. Besides the data type and labels of all variables, further expected characteristics are stored in the metadata.

For the following implementation, the metadata columns HARD_LIMITS, MISSING_LIST and JUMP_LIST are particularly relevant.

HARD_LIMITS have to be defined as intervals:

  • [0; 100]: any value between 0 and 100, including 0 and 100

  • (0; 100): any value between 0 and 100, not including 0 and 100

  • [0; Inf): any non-negative numerical value, i.e. 0 or greater
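
To illustrate the interval notation, the following sketch splits such a string into its bounds and inclusion flags and tests values against it. The helpers parse_interval and in_interval are hypothetical and written only for this example; dataquieR interprets the limits with its own internal utility, which may behave differently.

```r
# Hypothetical helper (not part of dataquieR): split an interval string such
# as "[0; 100]" or "(0; Inf)" into its bounds and inclusion flags.
parse_interval <- function(interval) {
  interval <- gsub("\\s", "", interval)  # drop whitespace
  n <- nchar(interval)
  open_bracket  <- substr(interval, 1, 1)
  close_bracket <- substr(interval, n, n)
  bounds <- strsplit(substr(interval, 2, n - 1), ";", fixed = TRUE)[[1]]
  list(
    lower = as.numeric(bounds[1]),
    upper = as.numeric(bounds[2]),
    lower_included = open_bracket == "[",
    upper_included = close_bracket == "]"
  )
}

# Hypothetical helper: test whether values lie inside a parsed interval.
in_interval <- function(x, iv) {
  above_lower <- if (iv$lower_included) x >= iv$lower else x > iv$lower
  below_upper <- if (iv$upper_included) x <= iv$upper else x < iv$upper
  above_lower & below_upper
}

closed_iv <- parse_interval("[0; 100]")
open_iv   <- parse_interval("(0; 100)")
in_interval(c(-1, 0, 50, 100, 101), closed_iv)  # FALSE TRUE TRUE TRUE FALSE
in_interval(c(-1, 0, 50, 100, 101), open_iv)    # FALSE FALSE TRUE FALSE FALSE
```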

This table shows the metadata defined for the example data that are required for this implementation:

VAR_NAMES LABEL MISSING_LIST JUMP_LIST HARD_LIMITS
4 v00003 AGE_0 NA NA [18;Inf)
39 v00030 MEDICATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;1]
1 v00000 CENTER_0 NA NA NA
34 v00025 SMOKE_SHOP_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;4]
23 v00016 DEV_NO_0 NA NA NA
43 v40000 PART_INTERVIEW NA NA NA
14 v00009 ARM_CIRC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [0;Inf)
18 v00012 USR_BP_0 99981 | 99982 NA NA
33 v00024 SMOKING_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;1]
21 v00014 CRP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995 NA [0;Inf)

However, this function can also be used with other columns of the metadata that contain limit definitions according to the conventions mentioned above. Currently SOFT_LIMITS and DETECTION_LIMITS are also handled by this implementation.

Required R-packages

library(ggplot2)
library(gridExtra)
library(plyr)
library(dplyr)
library(stringr)

R-FUNCTION

Please define all arguments used by the R-function:

  • resp_vars: the names of the measurement variables to be examined (integer, float, or datetime); if omitted, all variables with defined limits are used
  • label_col: the column of the metadata containing the labels, if labels should be used
  • limits: which limits should be investigated (HARD_LIMITS, SOFT_LIMITS, DETECTION_LIMITS)
  • study_data: the name of the data frame that contains the measurements
  • meta_data: the name of the data frame that contains metadata attributes of study data

CAVE:

The name of this implementation deviates from the naming of other implementations. This is motivated by the generic use of the function, which can process different types of limits, i.e. HARD_LIMITS, SOFT_LIMITS or DETECTION_LIMITS. A necessary convention is that all limits are defined in the same interval notation as shown under "Example of metadata".
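
The generic handling rests on a simple naming scheme: the prefix of the chosen limits column determines the names of the interpreted bounds and of the flag columns. The following sketch mirrors the paste0 calls of the function for SOFT_LIMITS; the variable label AGE_0 is used only as an example.

```r
limits <- "SOFT_LIMITS"

# The part before the first underscore serves as infix, here "SOFT".
infix <- unlist(strsplit(limits, "_"))[1]

# Columns holding the interpreted lower and upper bounds.
lower_col <- paste0(infix, "_LIMIT_LOW")
upper_col <- paste0(infix, "_LIMIT_UP")

# Flag columns created for a response variable labelled AGE_0.
flag_below <- paste0("AGE_0", "_below_", infix)
flag_above <- paste0("AGE_0", "_above_", infix)

c(lower_col, upper_col, flag_below, flag_above)
```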

con_limit_deviations <-  function(resp_vars = NULL, label_col, study_data,
                                 meta_data, limits = c(
                                   "HARD_LIMITS", "SOFT_LIMITS",
                                   "DETECTION_LIMITS"
                                 )) {
  rvs <- resp_vars

  infix <- unlist(strsplit(limits, "_"))[1]

  # Preps ----------------------------------------------------------------------

  # map meta to study
  util_prepare_dataframes()

  util_correct_variable_use("resp_vars",
    allow_more_than_one = TRUE,
    allow_null = TRUE,
    allow_any_obs_na = TRUE,
    need_type = "integer|float|datetime"
  )

  # which limits?
  LIMITS <- toupper(match.arg(limits))

  # variables correct?
  # util_correct_variable_use("resp_variables", role = "response_vars")

  # no variables defined?
  if (length(rvs) == 0) {
    if (all(is.na(meta_data[[LIMITS]]))) {
      util_error(paste0("No Variables with defined ", LIMITS, "."))
    } else {
      util_warning(paste0("All variables with ", LIMITS,
                          " in the metadata are used."))
      rvs <- meta_data[[label_col]][!(is.na(meta_data[[LIMITS]]))]
    }
  } else {
    # limits defined at all?
    if (all(is.na(meta_data[[LIMITS]][meta_data[[label_col]] %in% rvs]))) {
      util_error(paste0("No Variables with defined ", LIMITS, "."))
    }
    # no limits for some variables?
    rvs2 <- meta_data[[label_col]][!(is.na(meta_data[[LIMITS]])) &
                                     meta_data[[label_col]] %in% rvs]
    if (length(rvs2) < length(unique(rvs))) {
      util_warning(paste0("The variables ",
                          paste0(rvs[!(rvs %in% rvs2)], collapse = ", "),
                          " have no defined limits."))
    }
    rvs <- rvs2
  }

  datetime_vars <- vapply(
    rvs,
    function(rv) {
      meta_data[["DATA_TYPE"]][meta_data[[label_col]] == rv] ==
        DATA_TYPES$DATETIME
    },
    logical(1)
  )

  # conversion of numeric handling needs a bit more coding, since it needs
  # an origin then. in other cases (Date, character, ...),
  # origin will be ignored by as.POSIXct

  ds1[, rvs[datetime_vars]] <-
    lapply(ds1[, rvs[datetime_vars], drop = FALSE], as.POSIXct, origin =
             min(as.POSIXct(Sys.Date()), 0))


  # remove rvs with non-matching data type
  var_matches_datatype <-
    vapply(FUN.VALUE = logical(1), ds1[, rvs, drop = FALSE],
                        function(x) is.numeric(x) || inherits(x, "POSIXct"))

  if (!all(var_matches_datatype)) {
    util_warning(paste0(
      "Variables ", paste0(rvs[!var_matches_datatype], collapse = ", "),
      " are neither numeric nor datetime and will be removed from analyses."
    ))
    rvs <- rvs[var_matches_datatype]
  }

  if (length(rvs) == 0) {
    util_error("No variables left, no limit checks possible.")
  }

  # interpret limit intervals
  imdf <- util_interpret_limits(mdata = meta_data)

  fsd_list <- lapply(setNames(nm = rvs), function(rv) {
    fsd <- ds1[, rv, drop = FALSE]

    ds1[!is.finite(ds1[[rv]]), rv] <- NA

    # Extract and interpret available metadata -------------------------------
    LOWER <- paste0(infix, "_LIMIT_LOW")
    ll <- imdf[[LOWER]][imdf[[label_col]] == rv]
    ll <- ifelse(is.infinite(ll), NA, ll)
    UPPER <- paste0(infix, "_LIMIT_UP")
    lu <- imdf[[UPPER]][imdf[[label_col]] == rv]
    lu <- ifelse(is.infinite(lu), NA, lu)

    if ((datetime_vars[[rv]])) {
      ll <- as.POSIXct(ll, origin = min(as.POSIXct(Sys.Date()), 0))
      lu <- as.POSIXct(lu, origin = min(as.POSIXct(Sys.Date()), 0))
    }

    # Fill summary DFs -------------------------------------------------------
    BELOW <- paste0(rv, "_below_", infix)
    ABOVE <- paste0(rv, "_above_", infix)
    OUT <- paste0(rv, "_OUT_", infix)

    if (!(is.na(ll))) {
      fsd[[BELOW]][!(is.na(fsd[[rv]]))] <-
        ifelse(as.numeric(fsd[[rv]][!(is.na(fsd[[rv]]))]) < ll, 1, 0)
    } else {
      fsd[[BELOW]][!(is.na(fsd[[rv]]))] <- 0
    }
    if (!(is.na(lu))) {
      fsd[[ABOVE]][!(is.na(fsd[[rv]]))] <-
        ifelse(as.numeric(fsd[[rv]][!(is.na(fsd[[rv]]))]) > lu, 1, 0)
    } else {
      fsd[[ABOVE]][!(is.na(fsd[[rv]]))] <- 0
    }

    return(fsd)
  }) # end lapply fsd

  fsd <-
    do.call(cbind.data.frame, c(unname(fsd_list), list(
      stringsAsFactors = FALSE)))

  plot_list <- lapply(setNames(nm = rvs), function(rv) {

    # Fill summary DFs -------------------------------------------------------
    BELOW <- paste0(rv, "_below_", infix)
    ABOVE <- paste0(rv, "_above_", infix)
    OUT <- paste0(rv, "_OUT_", infix)

    # Combine flag for plot
    ds1[[OUT]] <- pmax(fsd[[BELOW]], fsd[[ABOVE]], na.rm = TRUE)

    ds1[!is.finite(ds1[[rv]]), rv] <- NA

    # Extract and interpret available metadata -------------------------------
    LOWER <- paste0(infix, "_LIMIT_LOW")
    ll <- imdf[[LOWER]][imdf[[label_col]] == rv]
    ll <- ifelse(is.infinite(ll), NA, ll)
    UPPER <- paste0(infix, "_LIMIT_UP")
    lu <- imdf[[UPPER]][imdf[[label_col]] == rv]
    lu <- ifelse(is.infinite(lu), NA, lu)

    if (datetime_vars[[rv]]) {
      ll <- as.POSIXct(ll, origin = min(as.POSIXct(Sys.Date()), 0))
      lu <- as.POSIXct(lu, origin = min(as.POSIXct(Sys.Date()), 0))
    }

    # Calculation of values relevant for plot area ---------------------------
    # data extrema
    max_data <- max(ds1[[rv]], na.rm = TRUE)
    min_data <- min(ds1[[rv]], na.rm = TRUE)

    ### Define bounds for graph
    minx <- min(c(min_data, ll), na.rm = TRUE)
    maxx <- max(c(max_data, lu), na.rm = TRUE)

    if (!datetime_vars[[rv]]) {
      # expand plot area
      inc <- floor(0.1 * abs(maxx - minx))
      if (inc < 1)
        inc <- 1
      minx <- minx - inc
      maxx <- maxx + inc
    }

    if (maxx == 0) {
      # expand plot area
      maxx <- (maxx + max(1, floor(0.1 * (maxx - minx))))
    }

    if (minx == 0) {
      # expand plot area
      minx <- (minx - max(1, floor(0.1 * (maxx - minx))))
    }

    # differentiate continuous from discrete variables -------------------------
    if (datetime_vars[[rv]] || !(all(ds1[[rv]] %% 1 == 0, na.rm = TRUE)) ||
      length(unique(ds1[[rv]])) > 20) {
      # continuous or integer with more than 20 values

      # Freedman-Diaconis (2 * IQR(data) / length(data)^(1/3)):
      # optimal width restricted to the data within limits!
      thedata <- ds1[[rv]]

      if (!is.na(ll)) {
        thedata <- thedata[ds1[[rv]] >= ll]
      }

      if (!is.na(lu)) {
        thedata <- thedata[ds1[[rv]] <= lu]
      }

      bw <- (2 * IQR(thedata, na.rm = TRUE) / length(thedata) ^ (1 / 3))
      if (bw == 0) bw <- 1

      # steps within hard limits
      # (rounded according modulo division to meet limits)
      dif <- as.numeric(maxx) - as.numeric(minx)
      byX <- dif / (dif %/% bw)

      # breaks must be within hard limits
      # (old: breakswithin <- seq(xlims[1], xlims[2], by = byX))
      breakswithin <- c(min_data - byX, seq(min_data, max_data, by = byX),
                        max_data + byX)

      # breaks outside plausis (always the case since minx/maxx outside limits)
      breakslower <- seq(minx, min_data, by = byX)
      breaksupper <- seq(max_data, maxx, by = byX)

      # rounding
      if (datetime_vars[[rv]]) {
        breaksX <- unique(c(breakslower, breakswithin, breaksupper))
      } else {
        breaksX <- round(unique(c(breakslower, breakswithin, breaksupper)), 3)
      }
      # if no values below/above
      breaksX <- unique(breaksX[!is.na(breaksX)])
    } else {
      breaksX <- unique(ds1[[rv]][!(is.na(ds1[[rv]]))])
    }

    breaksX <- sort(breaksX)

    if (length(unique(breaksX)) > 10000) {
      likely1 <- ds1[util_looks_like_missing(ds1[[rv]]), rv, TRUE]
      likely2 <- c(max(ds1[[rv]],
                       na.rm = TRUE),
                   min(ds1[[rv]],
                       na.rm = TRUE))
      likely <- intersect(likely1, likely2)
      if (length(likely) == 0)
        likely <- likely2
      util_warning(
        c("For %s, I have %d breaks. Did you forget to specify some missing",
          "codes (%s)? Will arbitrarily reduce the number of breaks below",
          "10000 to avoid rendering problems."),
        dQuote(rv), length(unique(breaksX)), paste0(dQuote(likely
                                                           ), collapse = " or ")
      )
      while (length(unique(breaksX)) > 10000) {
        breaksX <- breaksX[!is.na(breaksX)]
        breaksX <- c(min(breaksX), breaksX[c(TRUE, FALSE)], max(breaksX))
      }
      util_warning(
        c("For %s: arbitrarily reduced the number of breaks to",
          "%d <= 10000 to avoid rendering problems."),
        dQuote(rv), length(unique(breaksX)))
    }

    # Generate ggplot-objects for placing and annotation of lines --------------
    # lower limit

    if (is.na(ll)) {
      # Create Line and Text
      lll <- geom_vline(
        xintercept = minx, color = "#999999", alpha = 1,
        linetype = "dotted"
      )
      tll <- annotate("text",
        x = minx, y = 0,
        label = paste0(
          "?lower ",
          tolower(infix),
          " limit?"
        ),
        color = "#999999", angle = 0, vjust = 1, hjust = 0
      )
    } else {
      # Detect number of cases below lower hard limit
      below_hl <- sum(ds1[[rv]] < ll, na.rm = TRUE)
      # Create Line and Text
      if (all(util_is_integer(ds1[[rv]]), na.rm = TRUE)) {
        xll <- ll - 0.5
      } else {
        xll <- ll
      }
      lll <- geom_vline(xintercept = xll, color = "#B2182B", alpha = 1)
      tll <- annotate("text",
        x = xll, y = 0,
        label = paste0(
          "limit ",
          tolower(infix),
          " low=", ll, "; Obs < LHL: ", below_hl
        ),
        color = "#B2182B", angle = 0, vjust = 1.5, hjust = 0
      )
    }

    # Upper limit
    if (is.na(lu)) {
      # Create Line and Text
      llu <- geom_vline(
        xintercept = maxx, color = "#999999", alpha = 1,
        linetype = "dotted"
      )
      tlu <- annotate("text",
        x = maxx, y = 0,
        label = paste0(
          "?upper ",
          tolower(infix),
          " limit?"
        ),
        color = "#999999", angle = 0, vjust = 1, hjust = 0
      )
    } else {
      # Detect number of cases above upper hard limit
      above_hl <- sum(ds1[[rv]] > lu, na.rm = TRUE)
      if (all(util_is_integer(ds1[[rv]]), na.rm = TRUE)) {
        xlu <- lu + 0.5
      } else {
        xlu <- lu
      }
      # Create Line and Text
      llu <- geom_vline(xintercept = xlu, color = "#B2182B", alpha = 1)
      tlu <- annotate("text",
        x = xlu, y = 0,
        label = paste0(
          "limit ",
          tolower(infix),
          " up=", lu, "; Obs > UHL: ", above_hl
        ),
        color = "#B2182B", angle = 0, vjust = -0.5, hjust = 0
      )
    }

    # building the plot --------------------------------------------------------
    txtspec <- element_text(
      colour = "black", # size = 16,
      hjust = .5, vjust = .5, face = "plain"
    ) # angle = 0,

    out_cols <- c("#2166AC", "#B2182B")
    names(out_cols) <- c("0", "1")

    if (datetime_vars[[rv]] || !(all(ds1[[rv]] %% 1 == 0, na.rm = TRUE)) ||
      length(unique(ds1[[rv]])) > 20) {
      # continuous or integer with more than 20 values
      #          if (!all(util_is_integer(ds1[[rv]]), na.rm = TRUE) || ) {
      breaks <- unique(breaksX)
      if (!datetime_vars[[rv]]) {
        myxlim <- c(floor(minx), ceiling(maxx))
      } else {
        myxlim <- c(minx, maxx)
        breaks <- as.POSIXct(breaks, origin = min(as.POSIXct(Sys.Date()), 0))
        myxlim <- as.POSIXct(myxlim, origin = min(as.POSIXct(Sys.Date()), 0))
      }
      p <- ggplot(data = ds1, aes_string(x = ds1[[rv]], fill =
                                           factor(ds1[[OUT]]))) +
        geom_histogram(breaks = breaks) +
        scale_fill_manual(values = out_cols, guide = "none") +
        coord_flip(xlim = myxlim) +
        labs(x = "", y = paste0(rv)) +
        theme_minimal() +
        theme(
          title = txtspec,
          axis.text.x = txtspec,
          axis.text.y = txtspec,
          axis.title.x = txtspec,
          axis.title.y = txtspec
        ) +
        # add line/text for lower limit
        lll +
        tll +
        # add line/text for upper limit
        llu +
        tlu
    } else {
      p <- ggplot(ds1, aes_string(x = ds1[[rv]], fill = factor(ds1[[OUT]]))) +
        geom_bar() +
        scale_fill_manual(values = out_cols, guide = "none") +
        coord_flip(xlim = c(floor(minx), ceiling(maxx))) +
        labs(x = "", y = paste0(rv)) +
        theme_minimal() +
        theme(
          title = txtspec,
          axis.text.x = txtspec,
          axis.text.y = txtspec,
          axis.title.x = txtspec,
          axis.title.y = txtspec
        ) +
        # add line/text for lower limit
        lll +
        tll +
        # add line/text for upper limit
        llu +
        tlu
    }

    return(p)
  }) # end lapply plot_list

  # remove violations of value limits
  msdf <- ds1

  for (current_rv in rvs) {
    if (HARD_LIMITS %in% names(imdf)) {
      # values below hard limit?
      minx1 <- imdf[[HARD_LIMIT_LOW]][imdf[[label_col]] == current_rv]
      minx2 <- suppressWarnings(min(msdf[[current_rv]], na.rm = TRUE))

      if (!is.na(minx1) & minx1 > minx2) {
        n_below <- sum(msdf[[current_rv]] < minx1, na.rm = TRUE)
        msdf[[current_rv]][msdf[[current_rv]] < minx1] <- NA
        util_warning(paste0("N = ", n_below, " values in ", current_rv,
                            " have been below %s and were removed."),
                     HARD_LIMIT_LOW)
      }

      # values above hard limit?
      maxx1 <- imdf[[HARD_LIMIT_UP]][imdf[[label_col]] == current_rv]
      maxx2 <- suppressWarnings(max(msdf[[current_rv]], na.rm = TRUE))

      if (!is.na(maxx1) & maxx1 < maxx2) {
        n_above <- sum(msdf[[current_rv]] > maxx1, na.rm = TRUE)
        msdf[[current_rv]][msdf[[current_rv]] > maxx1] <- NA
        util_warning(paste0("N = ", n_above, " values in ", current_rv,
                            " have been above %s and were removed."),
                     HARD_LIMIT_UP)
      }
    }
  }

  # add Summary Table with GRADING column
  name_bel <- paste0("Below ", infix, " (N)")
  name_abo <- paste0("Above ", infix, " (N)")

  sumtab <- lapply(setNames(nm = rvs), function(rv) {
    BELOW <- paste0(rv, "_below_", infix)
    ABOVE <- paste0(rv, "_above_", infix)

    r <- list(
      rv,
      sum(fsd[[BELOW]], na.rm = TRUE),
      round(sum(fsd[[BELOW]], na.rm = TRUE) / sum(!(is.na(fsd[[rv]]))) * 100,
            digits = 2),
      sum(fsd[[ABOVE]], na.rm = TRUE),
      round(sum(fsd[[ABOVE]], na.rm = TRUE) / sum(!(is.na(fsd[[rv]]))) * 100,
            digits = 2)
    )
    r <- as.data.frame(r, stringsAsFactors = FALSE)
    colnames(r) <- c(
      "Variables",
      paste0("Below ", infix, " (N)"),
      paste0("Below ", infix, " (%)"),
      paste0("Above ", infix, " (N)"),
      paste0("Above ", infix, " (%)")
    )
    r
  })

  sumtab <- do.call(rbind.data.frame, c(sumtab, stringsAsFactors = FALSE,
                                        deparse.level = 0, make.row.names =
                                          FALSE))

  sumtab$GRADING <- ifelse((sumtab[[name_bel]] > 0) | (sumtab[[name_abo]] > 0),
                           1, 0)

  return(list(FlaggedStudyData = fsd, SummaryTable = sumtab, SummaryPlotList =
                plot_list, ModifiedStudyData = msdf))
}
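
For continuous variables, the function derives the histogram bin width with the Freedman-Diaconis rule and then adjusts the step so that the plotted range is a whole multiple of it. The following sketch reproduces that calculation on invented data. The names fd_binwidth, plot_min and plot_max are hypothetical, and unlike the function above the sketch drops NA values before counting the observations.

```r
# Freedman-Diaconis bin width: 2 * IQR(x) / n^(1/3).
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  bw <- 2 * IQR(x) / length(x)^(1 / 3)
  if (bw == 0) bw <- 1  # fall back to unit width if the IQR is zero
  bw
}

x <- 1:100            # invented data
bw <- fd_binwidth(x)  # approximately 21.33

# Assumed plot range, slightly wider than the data.
plot_min <- 0
plot_max <- 110

# Adjust the step so that the range is divided into equally wide bins.
dif <- plot_max - plot_min
step <- dif / (dif %/% bw)
breaks <- seq(plot_min, plot_max, by = step)
```

Note that the adjustment assumes the plotted range is at least one bin width wide; otherwise the integer division yields zero.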

Implementation and use of thresholds

This implementation makes no use of thresholds.

Call of the R-function

This R function takes a vector of response variables.

MyValueLimits <- con_limit_deviations(resp_vars  = c("AGE_0", "SBP_0", "DBP_0", "SEX_0"),
                                      label_col  = "LABEL",
                                      study_data = sd1,
                                      meta_data  = md1,
                                      limits     = "HARD_LIMITS")
names(MyValueLimits)

OUTPUT

For selected response variables

The function can be applied to selected variables. The output comprises two tables and plots for each selected variable. The function checks whether the respective limits are specified for each selected variable. If not, a warning is issued.

MyValueLimits <- con_limit_deviations(resp_vars  = c("AGE_0", "SBP_0", "SEX_0"),
                                      label_col  = "LABEL",
                                      study_data = sd1,
                                      meta_data  = md1,
                                      limits     = "HARD_LIMITS")
## Warning: In con_limit_deviations: The variables SEX_0 have no defined limits.
## > con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"), 
##     label_col = "LABEL", study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"), 
##     label_col = "LABEL", study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")

Output 1: FlaggedStudyData

The first table is related to the study data by a 1:1 relationship, i.e. each observation is checked for whether its value is below or above the limits.

AGE_0 AGE_0_below_HARD AGE_0_above_HARD SBP_0 SBP_0_below_HARD SBP_0_above_HARD
49 0 0 127 0 0
47 0 0 114 0 0
50 0 0 114 0 0
48 0 0 120 0 0
56 0 0 119 0 0
47 0 0 133 0 0

Output 2: SummaryTable

The second table summarizes this information for each variable.

Variables Below HARD (N) Below HARD (%) Above HARD (N) Above HARD (%) GRADING
AGE_0 0 0 0 0 0
SBP_0 0 0 0 0 0
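
How the summary follows from the observation-level flags can be sketched as below. The flagged values are invented for illustration (one value below an assumed lower limit of 18), and the column names of the sketch are simplified compared to the function's output.

```r
# Invented flagged data for one variable, one row per observation.
flags <- data.frame(
  AGE_0            = c(49, 47, 17, 50, NA),
  AGE_0_below_HARD = c(0, 0, 1, 0, NA),
  AGE_0_above_HARD = c(0, 0, 0, 0, NA)
)

n_obs   <- sum(!is.na(flags$AGE_0))  # observations with a value
n_below <- sum(flags$AGE_0_below_HARD, na.rm = TRUE)
n_above <- sum(flags$AGE_0_above_HARD, na.rm = TRUE)

summary_table <- data.frame(
  Variables = "AGE_0",
  below_n   = n_below,
  below_pct = round(n_below / n_obs * 100, digits = 2),
  above_n   = n_above,
  above_pct = round(n_above / n_obs * 100, digits = 2),
  stringsAsFactors = FALSE
)

# GRADING is 1 as soon as any deviation occurs, otherwise 0.
summary_table$GRADING <-
  ifelse(summary_table$below_n > 0 | summary_table$above_n > 0, 1, 0)
```

The percentages refer to the non-missing observations of the variable, not to all rows of the study data.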

Output 3: SummaryPlotList

The plots for each variable are either a histogram (continuous) or a barplot (discrete) and all are added to a list which is accessed via MyValueLimits$SummaryPlotList.

Output 4: ModifiedStudyData

The fourth output object is a data frame similar to the study data in which limit deviations have been removed, i.e. set to NA.

Without specification of response variables

It is not necessary to specify variables. In this case, the function looks for all numeric variables with defined limits. If the function identifies limit deviations, the respective values are removed in the data frame ModifiedStudyData.

## Warning: In con_limit_deviations: All variables with HARD_LIMITS in the metadata are used.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1, 
##     limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1, 
##     limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 3 values in EDUCATION_1 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1, 
##     limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1, 
##     limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 349 values in MEDICATION_0 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1, 
##     limits = "HARD_LIMITS")

Output 2: Summary data table 2

Variables Below HARD (N) Below HARD (%) Above HARD (N) Above HARD (%) GRADING
DBP_0 0 0 0 0.00 0
GLOBAL_HEALTH_VAS_0 0 0 0 0.00 0
ASTHMA_0 0 0 0 0.00 0
ARM_CIRC_0 0 0 0 0.00 0
ARM_CIRC_DISC_0 0 0 0 0.00 0
ARM_CUFF_0 0 0 0 0.00 0
EXAM_DT_0 0 0 0 0.00 0
CRP_0 0 0 0 0.00 0
BSG_0 0 0 0 0.00 0
LAB_DT_0 0 0 0 0.00 0
EDUCATION_0 0 0 0 0.00 0
EDUCATION_1 0 0 3 0.12 1
MARRIED_0 0 0 0 0.00 0
EATING_PREFS_0 0 0 0 0.00 0
MEAT_CONS_0 0 0 0 0.00 0
SMOKING_0 0 0 0 0.00 0
SMOKE_SHOP_0 0 0 24 2.98 1
PREGNANT_0 0 0 0 0.00 0
MEDICATION_0 0 0 349 54.45 1
AGE_0 0 0 0 0.00 0
N_ATC_CODES_0 0 0 0 0.00 0
INT_DT_0 0 0 0 0.00 0
ITEM_1_0 0 0 0 0.00 0
ITEM_2_0 0 0 0 0.00 0
ITEM_3_0 0 0 0 0.00 0
ITEM_4_0 0 0 0 0.00 0
ITEM_5_0 0 0 0 0.00 0
ITEM_6_0 0 0 0 0.00 0
ITEM_7_0 0 0 0 0.00 0
ITEM_8_0 0 0 0 0.00 0
QUEST_DT_0 0 0 0 0.00 0
AGE_1 0 0 0 0.00 0
SBP_0 0 0 0 0.00 0

Output 3: Plot

Only the 3 selected plots are displayed to reduce the size of this file. However, for each variable with limits, a plot has been generated.

For variables of type datetime

## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(resp_vars = c("QUEST_DT_0"), label_col = "LABEL", 
##     study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")

Output 2: Summary data table 2

Variables Below HARD (N) Below HARD (%) Above HARD (N) Above HARD (%) GRADING
QUEST_DT_0 0 0 0 0 0
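
For datetime variables, the comparison works in the same way once both the data and the limit are available as POSIXct values. The following sketch uses invented timestamps and an explicit UTC time zone as assumptions; it is not taken from the example data.

```r
# Assumed lower limit, corresponding to an interval such as "[2018-01-01 00:00:00;)".
lower <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")

# Invented datetime observations.
dt <- as.POSIXct(
  c("2017-12-31 23:59:59", "2018-01-01 00:00:00", "2018-03-15 10:30:00"),
  tz = "UTC"
)

# 1 = before the lower limit, 0 = on or after it.
below <- as.integer(dt < lower)
below
```

Using the same time zone for data and limits avoids shifts at the boundary.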

Output 3: Plot

INTERPRETATION

Defining HARD_LIMITS is a common issue in the data curation process. For example, values of a numeric rating scale (0 - 10) should not exceed these limits, and values outside these limits must be removed or at least verified, as they certainly represent incorrect measurements. Nevertheless, for some measurements the definition of such limits is difficult. In this case, the alternative definition of SOFT_LIMITS is recommended.
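
The difference between the two kinds of limits can be sketched as follows. The blood pressure values and both sets of limits are invented assumptions; the sketch removes only violations of the hard limits, in line with the function above, and merely labels violations of the soft limits.

```r
# Invented systolic blood pressure values (mmHg).
sbp <- c(118, 135, 260, 45, 122, 410)

hard <- c(low = 0, up = 400)   # assumed: values outside are impossible
soft <- c(low = 80, up = 200)  # assumed: values outside are implausible

status <- ifelse(
  sbp < hard[["low"]] | sbp > hard[["up"]], "outside hard limits",
  ifelse(sbp < soft[["low"]] | sbp > soft[["up"]],
         "outside soft limits", "within limits")
)

# Only violations of the hard limits are removed (set to NA).
sbp_modified <- ifelse(status == "outside hard limits", NA, sbp)
```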

Concept relations