APPROACH

A contradiction is present if an impossible combination of data values is observed in one participant. For example, if the age of a participant is recorded repeatedly, the value of age is (unfortunately) not able to decline. Most contradiction checks rest on the comparison of two variables.

Importantly, each value used in a comparison may on its own represent a possible characteristic; only the combination of the two values is considered impossible. The approach does not address implausible or inadmissible values as such.

ALGORITHM OF THIS IMPLEMENTATION:

  1. Select all variables in the data with defined contradiction rules (static metadata column CONTRADICTIONS)
  2. Remove missing codes from the study data (if defined in the metadata)
  3. Remove measurements deviating from limits defined in the metadata
  4. Assign labels to the levels of categorical variables (if applicable)
  5. Apply contradiction checks on predefined sets of variables
  6. Identify measurements fulfilling contradiction rules. Two output data frames are generated:
    • on the level of observation to flag each contradictory value combination, and
    • a summary table for each contradiction check.
  7. A summary plot illustrating the number of contradictions is generated.
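
Steps 2 and 3 can be illustrated with a minimal sketch in base R. The vector `x`, the missing codes, and the limits are made-up values chosen for illustration; the package itself derives them from the metadata columns MISSING_LIST and HARD_LIMITS.

```r
# Hypothetical values of one variable, including missing codes and
# values outside the admissible range
x <- c(1, 3, 99980, 2, 7, 99995, 0)

missing_codes <- c(99980, 99995)      # as they might appear in MISSING_LIST
hard_limits   <- c(low = 1, up = 3)   # as in a HARD_LIMITS entry [1;3]

# Step 2: replace missing codes by NA
x[x %in% missing_codes] <- NA

# Step 3: replace values outside the limits by NA
outside <- !is.na(x) & (x < hard_limits[["low"]] | x > hard_limits[["up"]])
x[outside] <- NA
x
```

Only the values 1, 3, and 2 remain; all other entries are set to NA and therefore cannot trigger a contradiction later on.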

Example of study data

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

This example of study data has N=3000 observations. Study data variables have abstract and non-interpretable names; appropriate labels must be mapped from the metadata. Nonetheless, the study comprises the following characteristics:

  • Age at baseline + age during follow-up
  • Sex + sex during follow-up
  • Education + education during follow-up
  • Eating preferences
  • Weekly meat consumption
  • Smoking
  • Shopping behavior regarding tobacco products
  • Circumference of upper arm
  • Used arm cuff for blood pressure measurement
  • Pregnancy status of women
  • Some medication

v00000 v00001 v00002 v00003 v00004 v00005 v01003 v01002 v00103 v00006
3 LEIIX715 0 49 127 77 49 0 40-49 3.8
1 QHNKM456 0 47 114 76 47 0 40-49 1.9
1 HTAOB589 0 50 114 71 50 0 50-59 0.8
5 HNHFV585 0 48 120 65 48 0 40-49 3.8
1 UTDLS949 0 56 119 78 56 0 50-59 4.1
5 YQFGE692 1 47 133 81 47 1 40-49 9.5
1 AVAEH932 0 53 114 78 53 0 50-59 5.0
3 QDOPT378 1 48 116 86 48 1 40-49 9.6
3 BMOAK786 0 44 115 71 44 0 40-49 2.0
5 ZDKNF462 0 50 116 74 50 0 50-59 2.4

Example of metadata

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

Information corresponding to the study data is kept in the table of static metadata, which also attaches an interpretable label to each variable. Besides the data type and labels of all variables, further expected characteristics are stored in the metadata.

For the following implementation, the columns CONTRADICTIONS, MISSING_LIST, VALUE_LABELS, and HARD_LIMITS in the metadata are particularly relevant.

The column CONTRADICTIONS contains only the IDs of explicit contradiction checks. The respective rules can be defined directly in the metadata, but we recommend using the associated Shiny app (Chang et al. 2018, Potter et al. 2016). See also Definition of contradictions.
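
A variable can take part in several checks, in which case its CONTRADICTIONS entry lists several IDs separated by "|". The following sketch shows how such entries can be resolved into a unique list of check IDs. The vector `contradictions` is typed in by hand from the example metadata, and the literal separator "|" is an assumption taken from that table; the function below uses the package constant SPLIT_CHAR instead.

```r
# Entries as they appear in the CONTRADICTIONS column of the example metadata
contradictions <- c("1002", "1001", "1004 | 1005 | 1006", NA, "1007 | 1008")

# Drop variables without checks, split at the separator, trim whitespace
ids <- unlist(strsplit(contradictions[!is.na(contradictions)], "|",
                       fixed = TRUE))

# Keep each check ID once, in ascending order
ids <- sort(unique(as.numeric(trimws(ids))))
ids
```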

VAR_NAMES LABEL MISSING_LIST VALUE_LABELS HARD_LIMITS CONTRADICTIONS
3 v00002 SEX_0 NA 0 = females | 1 = males NA 1002
4 v00003 AGE_0 NA NA [18;Inf) 1001
6 v01003 AGE_1 NA NA [18;Inf) 1001
7 v01002 SEX_1 NA 0 = females | 1 = males NA 1002
15 v00109 ARM_CIRC_DISC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 1 = (-Inf,20] | 2 = (20,30] | 3 = (30, Inf] [1;3] 1009
16 v00010 ARM_CUFF_0 99980 | 99987 1 = (-Inf,20] | 2 = (20,30] | 3 = (30, Inf] [1;3] 1009
19 v00013 EXAM_DT_0 NA NA [2018-01-01 00:00:00 CET;) 1011
24 v00017 LAB_DT_0 NA NA [2018-01-01 00:00:00 CET;) 1011
26 v00018 EDUCATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = pre-primary | 1 = primary | 2 = secondary | 3 = uppersecond | 4 = postsecond | 5 = tertiary | 6 = secondtertiary [0;6] 1003
27 v01018 EDUCATION_1 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = pre-primary | 1 = primary | 2 = secondary | 3 = uppersecond | 4 = postsecond | 5 = tertiary | 6 = secondtertiary [0;6] 1003
31 v00022 EATING_PREFS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = none | 1 = vegetarian | 2 = vegan [0;2] 1004 | 1005 | 1006
32 v00023 MEAT_CONS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = never | 1 = 1-2d a week | 2 = 3-4d a week | 3 = 5-6d a week | 4 = daily [0;4] 1004 | 1005 | 1006
33 v00024 SMOKING_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = no | 1 = yes [0;1] 1007 | 1008
34 v00025 SMOKE_SHOP_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = never | 1 = 1-2d a week | 2 = 3-4d a week | 3 = 5-6d a week | 4 = daily [0;4] 1007 | 1008
38 v00029 PREGNANT_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 0 = no | 1 = yes [0;1] 1010
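
The VALUE_LABELS entries follow the pattern "code = label | code = label". As a rough sketch of how such a string can be turned into a labelled factor, consider the entry of SEX_0. The raw values in `sex_raw` are hypothetical, and the code is a simplified stand-in for the package helper util_assign_levlabs used in the function below, not a copy of it.

```r
# VALUE_LABELS entry of SEX_0 as shown in the metadata table
value_labels <- "0 = females | 1 = males"

# Split into "code = label" pairs, then into codes and labels
pairs  <- trimws(strsplit(value_labels, "|", fixed = TRUE)[[1]])
parts  <- strsplit(pairs, " = ", fixed = TRUE)
codes  <- vapply(parts, `[`, character(1), 1)
labels <- vapply(parts, `[`, character(1), 2)

# Hypothetical raw values of the variable
sex_raw <- c(0, 1, 1, 0)

# Map the numeric codes to their labels
sex <- factor(sex_raw, levels = as.numeric(codes), labels = labels)
sex
```

After this step, checks can refer to readable levels such as "females" instead of numeric codes.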

Definition of contradictions

To a large extent contradictions can be defined via logical comparison of variables. Assume \(A\) and \(B\) to represent two variables in the study data. Then:

  • if \(A \gt B\) a contradiction may follow

  • if \(A\) is not missing, then \(B\) should not be observed

  • if \(A \lt 18\) then \(B \ne \:"adult"\)

Defining such comparisons is supported by a Shiny app that allows checks to be specified in a standardized manner. This requires a comprehensive table of metadata.

CAVE: For the time being, contradiction checks can only be defined between two variables.
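
The three kinds of rules listed above translate directly into vectorized logical expressions. The following is a minimal sketch with made-up data; the data frame `d` and the column `status` are hypothetical.

```r
# Hypothetical toy data
d <- data.frame(
  A      = c(17, 30, 16, NA),
  B      = c(20, 25, 10, 40),
  status = c("adult", "adult", "child", "adult"),
  stringsAsFactors = FALSE
)

# Rule 1: A > B indicates a contradiction
rule_gt <- d$A > d$B

# Rule 2: if A is observed, B should not be observed
rule_both <- !is.na(d$A) & !is.na(d$B)

# Rule 3: A < 18 combined with the status "adult"
rule_adult <- d$A < 18 & d$status == "adult"

data.frame(rule_gt, rule_both, rule_adult)
```

Note that rules 1 and 3 return NA in the last row, where A is missing: such a check cannot be evaluated for that observation.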


The contradictions specified for the example study data shipped with dataquieR are loaded as follows:

checks <- read.csv(system.file("extdata", 
                               "contradiction_checks.csv",
                               package = "dataquieR"), 
                   header = TRUE, sep = "#")

The following table shows the contradictions that were defined for this example of study data:

ID Function_name A A_levels A_value B B_levels B_value Label
1001 A_less_than_B_vv AGE_1 NA NA AGE_0 NA NA Age follow-up
1002 A_not_equal_B_vv SEX_1 NA NA SEX_0 NA NA Sex follow-up
1003 A_less_than_B_vv EDUCATION_1 NA NA EDUCATION_0 NA NA Education follow-up
1004 A_levels_and_B_levels_ll EATING_PREFS_0 vegetarian NA MEAT_CONS_0 1-2d a week | 3-4d a week | 5-6d a week | daily NA Nutrition inconsistency vegetarian
1005 A_levels_and_B_levels_ll EATING_PREFS_0 vegan NA MEAT_CONS_0 1-2d a week | 3-4d a week | 5-6d a week | daily NA Nutrition inconsistency vegan
1006 A_levels_and_B_levels_ll EATING_PREFS_0 none NA MEAT_CONS_0 never NA Nutrition inconsistency
1007 A_levels_and_B_levels_ll SMOKING_0 no NA SMOKE_SHOP_0 1-2d a week | 3-4d a week | 5-6d a week | daily NA Non-smokers inconsistency
1008 A_levels_and_B_levels_ll SMOKING_0 yes NA SMOKE_SHOP_0 never NA Smokers inconsistency
1009 A_not_equal_B_vv ARM_CIRC_DISC_0 NA NA ARM_CUFF_0 NA NA Blood pressure false cuff
1010 A_levels_and_B_gt_value_lc PREGNANT_0 yes NA AGE_0 NA 55 Pregnancy high age
1011 A_less_than_B_vv LAB_DT_0 NA NA EXAM_DT_0 NA NA LAB before MEX
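
Each row of this table fully describes one check. As an illustration, the sketch below rebuilds two of the rows by hand (the data frame `checks_toy` is hypothetical and holds only a subset of the columns) and extracts the levels of B for check 1004, in the same way the function further below parses them.

```r
# Two rows of the check table above, rebuilt by hand for illustration
checks_toy <- data.frame(
  ID            = c(1004, 1010),
  Function_name = c("A_levels_and_B_levels_ll", "A_levels_and_B_gt_value_lc"),
  A             = c("EATING_PREFS_0", "PREGNANT_0"),
  A_levels      = c("vegetarian", "yes"),
  B             = c("MEAT_CONS_0", "AGE_0"),
  B_levels      = c("1-2d a week | 3-4d a week | 5-6d a week | daily", NA),
  B_value       = c(NA, 55),
  stringsAsFactors = FALSE
)

# Select one rule by its ID
rule <- checks_toy[checks_toy$ID == 1004, ]

# Split the levels of B into a character vector
b_lev <- trimws(unlist(strsplit(rule$B_levels, "|", fixed = TRUE)))
b_lev
```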

Calculation of contradictions

The indicator uses a list of prespecified functions of logical comparisons. Each of the functions is designed to indicate a contradiction if the specified criteria are met.

The suffixes _vv, _ll, _lc are required for the ShinyApp mentioned above and have no interpretation in the context of contradictions.

A_not_equal_B_vv <- function(study_data, A, B, A_levels, B_levels, A_value, B_value) {
  X <- study_data
  grading <- ifelse(X[[A]] != X[[B]], 1, 0)
  return(grading)
}

All R-functions of logical comparisons have seven arguments:

  • study_data: the data frame containing the study data
  • A: the name of the first variable in which a contradiction may occur
  • B: the name of the second variable required to evaluate the contradiction
  • A_levels: if A is categorical, the levels of A that the check refers to
  • B_levels: if B is categorical, the levels of B that the check refers to
  • A_value: a fixed value that A is compared against, if the check uses one
  • B_value: a fixed value that B is compared against, if the check uses one (e.g. 55 in check 1010)
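
To show how other comparison types fit the same seven-argument interface, here is a sketch of two further functions. Their names carry the suffix _sketch to make clear that the bodies are assumptions modelled on A_not_equal_B_vv above, not the package source; the data frame `toy` is made up.

```r
# Sketch: contradiction if A is smaller than B
# (e.g. follow-up age below baseline age)
A_less_than_B_sketch <- function(study_data, A, B, A_levels, B_levels,
                                 A_value, B_value) {
  ifelse(study_data[[A]] < study_data[[B]], 1, 0)
}

# Sketch: contradiction if A takes one of A_levels while B takes one of
# B_levels. %in% treats missing values as non-matching, so they yield 0.
A_levels_and_B_levels_sketch <- function(study_data, A, B, A_levels, B_levels,
                                         A_value, B_value) {
  ifelse(study_data[[A]] %in% A_levels & study_data[[B]] %in% B_levels, 1, 0)
}

# Hypothetical toy data using labels from the example metadata
toy <- data.frame(
  AGE_0          = c(49, 47, 50),
  AGE_1          = c(51, 45, 52),
  EATING_PREFS_0 = c("vegan", "none", "vegetarian"),
  MEAT_CONS_0    = c("daily", "daily", "never"),
  stringsAsFactors = FALSE
)

# Unused arguments can be omitted because R evaluates arguments lazily
g_age <- A_less_than_B_sketch(toy, A = "AGE_1", B = "AGE_0")
g_veg <- A_levels_and_B_levels_sketch(
  toy, A = "EATING_PREFS_0", B = "MEAT_CONS_0",
  A_levels = "vegan",
  B_levels = c("1-2d a week", "3-4d a week", "5-6d a week", "daily")
)
```

Keeping all seven arguments in every function, even when some are unused, allows the main function to call any check with an identical argument list.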

Required R-packages

library(ggplot2)
library(dplyr)  # con_contradictions() below calls dplyr::rename()

R-FUNCTION

con_contradictions <-  function(resp_vars = NULL, study_data, meta_data,
                               label_col, threshold_value, check_table,
                               summarize_categories = FALSE) {
  rvs <- resp_vars

  # Preps ----------------------------------------------------------------------
  # labels used instead of variable names?
  if (!(missing(label_col)) && label_col != VAR_NAMES) {
    message(
      sprintf(paste("Labels of variables from %s will be used.",
                    "In this case columns A and B in check_tables must",
                    "refer to labels.", collapse = " "),
      dQuote(label_col))
    )
  } else {
    message(paste("Variable names will be used. In this case columns A",
                  "and B in check_tables must refer to variable names."))
  }

  # map meta to study
  util_prepare_dataframes()

  util_correct_variable_use("resp_vars",
    allow_more_than_one = TRUE,
    allow_null = TRUE,
    allow_any_obs_na = TRUE
  )

  # table of specified contradictions
  if (missing(check_table) || !is.data.frame(check_table)) {
    util_error(
      c("Missing check_table --",
        "cannot apply contradictions checks w/o contradiction rules"),
      applicability_problem = TRUE)
  }

  if (missing(threshold_value)) {
    threshold_value <- 0
    util_warning("No %s has been set, will use default %d",
                 dQuote("threshold_value"), threshold_value,
                 applicability_problem = FALSE)
  }

  ct <- check_table
  ct$Label <- as.character(ct$Label)

  # colors:
  cols <- c("0" = "#2166AC", "1" = "#B2182B")

  # columns:
  expected_cols <- c(
    "ID",
    "Label",
    "Function_name",
    "A",
    "B",
    "A_value",
    "B_value",
    "A_levels",
    "B_levels"
  )

  missing_cols <- !(expected_cols %in% colnames(ct))

  if (any(missing_cols)) {
    util_error(
      "Missing the following columns in the check_table: %s",
      dQuote(expected_cols[missing_cols]),
      applicability_problem = TRUE
    )
  }

  if (summarize_categories) {
    # if we want to summarize contradictions per category
    if (!("tag" %in% colnames(ct))) {
      util_error(c(
        "Cannot summarize categories of contradictions,",
        "because these are not defined in the check_table as column 'tag'."),
        applicability_problem = TRUE)
    }
    splitted_tags <- lapply(strsplit(ct$tag, SPLIT_CHAR, fixed = TRUE), trimws)
    tags <- sort(unique(unlist(splitted_tags)))
    tags <- setNames(nm = tags)
    tags_ext <- tags
    tags_ext[["all_checks"]] <- NA
    result <- lapply(tags_ext, function(atag) {
      # generate one output per category (stratified)
      if (is.na(atag)) {
        new_ct <- ct[, -which(colnames(ct) == "tag"), drop = FALSE]
      } else {
        contains_tag <- function(x, tg) {
          any(x == tg, na.rm = TRUE)
        }
        rows_matching_tag <- vapply(splitted_tags, contains_tag, tg = atag,
                                    logical(1))
        new_ct <- ct[rows_matching_tag, -which(colnames(ct) == "tag"),
                     drop = FALSE]
      }
      con_contradictions(
        resp_vars = resp_vars, study_data = study_data,
        meta_data = meta_data, label_col = label_col,
        threshold_value = threshold_value, check_table = new_ct,
        summarize_categories = FALSE
      )
    })
    rx <- lapply(tags_ext, function(atag) {
      # and summarize the contradictions per category/tag
      if (is.na(atag)) {
        sum(rowSums(result[["all_checks"]]$FlaggedStudyData[, -1, drop = FALSE],
                    na.rm = TRUE) > 0) /
          nrow(result[["all_checks"]]$FlaggedStudyData) * 100
      } else {
        sum(rowSums(result[[atag]]$FlaggedStudyData[, -1, drop = FALSE],
                    na.rm = TRUE) > 0) /
          nrow(result[[atag]]$FlaggedStudyData) * 100
      }
    })
    rx <- data.frame(
      category = names(rx),
      percent = unlist(rx),
      GRADING = ordered(ifelse(unlist(rx) > threshold_value, 1, 0))
    )
    result$SummaryData <- rx
    result$SummaryPlot <-
      ggplot(rx, aes_(x = ~category, y = ~percent, fill = ~GRADING)) +
      geom_bar(stat = "identity") +
      scale_fill_manual(values = cols, name = " ", guide = "none") +
      theme_minimal() +
      scale_y_continuous(name = "(%)", limits = (c(0, max(rx$percent) + 1))) +
      geom_hline(yintercept = threshold_value, color = "red", linetype = 2) +
      coord_flip() +
      theme(text = element_text(size = 20))

    return(result)
  } else {
    ct$A_levels <- as.character(ct$A_levels)
    ct$B_levels <- as.character(ct$B_levels)

    # check and prep meta data
    if (!(CONTRADICTIONS %in% colnames(meta_data))) {
      util_error(
        c("Missing column %s in meta data cannot apply",
          "contradictions checks w/o contradiction rules"),
        dQuote(CONTRADICTIONS),
        applicability_problem = TRUE
      )
    }

    meta_data[["CONTRADICTIONS"]] <-
      as.character(meta_data[["CONTRADICTIONS"]])

    # no variables defined?
    if (length(rvs) == 0) {
      if (all(is.na(meta_data[[CONTRADICTIONS]]))) {
        util_error(paste0("No Variables with defined CONTRADICTIONS."),
                   applicability_problem = TRUE)
      } else {
        util_warning(paste0(
          "All variables with CONTRADICTIONS in the metadata are used."),
          applicability_problem = TRUE)
        rvs <- meta_data[[label_col]][!(is.na(meta_data[[CONTRADICTIONS]]))]
        rvs <- intersect(rvs, colnames(ds1))
      }
    } else {
      # contradictions defined at all?
      if (all(is.na(meta_data[[CONTRADICTIONS]][meta_data[[label_col]] %in%
                                                rvs]))) {
        util_error(paste0("No Variables with defined CONTRADICTIONS."),
                   applicability_problem = TRUE)
      }
      # no contradictions for some variables?
      rvs2 <- meta_data[[label_col]][!(is.na(meta_data[[CONTRADICTIONS]])) &
                                       meta_data[[label_col]] %in% rvs]
      if (length(rvs2) < length(rvs)) {
        # collapse the variable names first so that one message is emitted
        util_warning(paste0("The variables ",
                            paste0(rvs[!(rvs %in% rvs2)], collapse = ", "),
                            " have no defined CONTRADICTIONS."),
                     applicability_problem = TRUE)
      }
      rvs <- rvs2
    }

    # Inadmissible values must be removed --------------------------------------
    # temporary studydata for the check of contradictions
    ds1_ll <- ds1

    # interpret limit intervals
    imdf <- util_interpret_limits(meta_data)

    for (i in seq_along(rvs)) {
      if (HARD_LIMITS %in% names(imdf)) {
        # values below hard limit?
        minx1 <- imdf[[HARD_LIMIT_LOW]][imdf[[label_col]] == rvs[[i]]]
        minx2 <- min(ds1_ll[[rvs[[i]]]], na.rm = TRUE)

        if (!is.na(minx1) & minx1 > minx2) {
          n_below <- sum(ds1_ll[[rvs[[i]]]] < minx1,
                         na.rm = TRUE)
          ds1_ll[[rvs[[i]]]][ds1_ll[[rvs[[i]]]] < minx1] <- NA
          util_warning(paste0("N = ", n_below, " values in ", rvs[[i]],
                              " have been below HARD_LIMITS and were removed."),
                       applicability_problem = FALSE)
        }

        # values above hard limit?
        maxx1 <- imdf[[HARD_LIMIT_UP]][imdf[[label_col]] == rvs[[i]]]
        maxx2 <- max(ds1_ll[[rvs[[i]]]], na.rm = TRUE)

        if (!is.na(maxx1) & maxx1 < maxx2) {
          n_above <- sum(ds1_ll[[rvs[[i]]]] > maxx1,
                         na.rm = TRUE)
          ds1_ll[[rvs[[i]]]][ds1_ll[[rvs[[i]]]] > maxx1] <- NA
          util_warning(paste0("N = ", n_above, " values in ", rvs[[i]],
                              " have been above HARD_LIMITS and were removed."),
                       applicability_problem = FALSE)
        }
      }
    }

    # Label assignment ---------------------------------------------------------

    # all labelled variables
    levlabs <- meta_data$VALUE_LABELS[meta_data[[label_col]] %in% rvs]

    # any variables without labels?
    if (any(is.na(levlabs))) {
      util_warning(paste0("Variables: ", paste0(rvs[is.na(levlabs)],
                                                collapse = ", "),
                          " have no assigned labels and levels."),
                   applicability_problem = FALSE)
    }

    # only variables with labels
    if (!all(is.na(levlabs))) {
      rvs_ll <- rvs[!is.na(levlabs)]
      levlabs <- levlabs[!is.na(levlabs)]

      for (i in seq_along(rvs_ll)) {
        ds1_ll[[rvs_ll[i]]] <- util_assign_levlabs(
          variable = ds1_ll[[rvs_ll[i]]],
          string_of_levlabs = levlabs[i],
          splitchar = SPLIT_CHAR,
          assignchar = " = ",
          ordered = TRUE
        )
      }
    }

    # select contradiction checks
    # get checks from metadata
    cl <- meta_data[[CONTRADICTIONS]][!(is.na(meta_data$CONTRADICTIONS))]

    # is list ?
    cl <- unlist(cl)

    # select unique checks
    cl <- unique(as.numeric(unlist(strsplit(as.character(cl), SPLIT_CHAR,
                                            fixed = TRUE))))

    cl <- intersect(cl, ct$ID)

    cl <- cl[order(cl)]

    summary_df1 <- data.frame(Obs = 1:dim(ds1_ll)[1])

    summary_df2 <- data.frame(
      Check_ID = rep(NA, length(cl)),
      Check_type = rep(NA, length(cl)),
      Study_variables = rep(NA, length(cl)),
      A_levels = rep(NA, length(cl)),
      B_levels = rep(NA, length(cl)),
      N = rep(NA, length(cl)),
      Percent = rep(NA, length(cl)),
      Grading = rep(NA, length(cl)),
      Label = rep(NA, length(cl))
    )

    for (i in seq_along(cl)) {
      prior_names <- names(summary_df1)

      # prepare columns and name of respective check
      summary_df1[i + 1] <- NA
      names(summary_df1) <- c(prior_names, paste0("grading_", cl[i]))

      # which check function is to be applied
      check <- paste(ct$Function_name[ct$ID == cl[i]])

      # parse levels
      a_lev <- gsub("'", "", ct$A_levels[ct$ID == cl[i]])
      a_lev <- unlist(strsplit(a_lev, SPLIT_CHAR, fixed = TRUE))
      a_lev <- trimws(a_lev)

      b_lev <- gsub("'", "", ct$B_levels[ct$ID == cl[i]])
      b_lev <- unlist(strsplit(b_lev, SPLIT_CHAR, fixed = TRUE))
      b_lev <- trimws(b_lev)
      # apply check
      summary_df1[i + 1] <-
        contradiction_functions[[check]](study_data = ds1_ll,
        A = paste(ct$A[ct$ID == cl[i]]),
        A_levels = a_lev,
        A_value = ct$A_value[ct$ID == cl[i]],
        B = paste(ct$B[ct$ID == cl[i]]),
        B_levels = b_lev,
        B_value = ct$B_value[ct$ID == cl[i]]
      )

      # summarize checks
      summary_df2[i, 1] <- cl[i]
      summary_df2[i, 2] <- check
      summary_df2[i, 3] <- paste0(
        "A is: ", ct$A[ct$ID == cl[i]], "; ",
        "B is: ", ct$B[ct$ID == cl[i]]
      )
      summary_df2[i, 4] <- paste(a_lev, collapse = SPLIT_CHAR)
      summary_df2[i, 5] <- paste(b_lev, collapse = SPLIT_CHAR)
      summary_df2[i, 6] <- sum(summary_df1[, i + 1], na.rm = TRUE)
      summary_df2[i, 7] <- sum(summary_df1[, i + 1], na.rm = TRUE) /
        dim(ds1)[1] * 100
      summary_df2[i, 8] <- ifelse(summary_df2[i, 7] > threshold_value, 1, 0)
      summary_df2[i, 9] <- ct$Label[ct$ID == cl[i]]
    }

    summary_df2$Percent <- round(summary_df2$Percent, digits = 2)

    names(summary_df2) <- c(
      "Check ID", "Check type", "Variables A and B", "A Levels",
      "B Levels", "Contradictions (N)", "Contradictions (%)",
      "Grading", "Label"
    )

    summary_df2$Grading <- ordered(summary_df2$Grading)

    x <- util_as_numeric(reorder(summary_df2[, 1], -summary_df2[, 1]))
    lbs <- as.character(reorder(summary_df2[, 9], -summary_df2[, 1]))
    # plot summary_df2
    p <- ggplot(summary_df2, aes_(x = ~x, y = ~ summary_df2[, 7], fill =
                                    ~ as.ordered(Grading))) +
      geom_bar(stat = "identity") +
      geom_text(
        y = round(summary_df2[, 7], 1) + 0.5,
        label = paste0(round(summary_df2[, 7], digits = 2), "%")
      ) +
      scale_fill_manual(values = cols, name = " ", guide = "none") +
      theme_minimal() +
      xlab("IDs of applied checks") +
      scale_y_continuous(name = "(%)",
                         limits = (c(0, max(summary_df2[, 7]) + 1))) +
      scale_x_continuous(breaks = x, sec.axis =
                           sec_axis(~., breaks = x, labels = lbs)) +
      geom_hline(yintercept = threshold_value, color = "red", linetype = 2) +
      coord_flip() +
      theme(text = element_text(size = 20))

    # create SummaryTable object
    st1 <- summary_df2
    st1$`Variables A and B` <- gsub("A is: ", "", st1$`Variables A and B`)
    st1$`Variables A and B` <- gsub("B is: ", "", st1$`Variables A and B`)
    st1$Variables <- unlist(lapply(st1$`Variables A and B`,
                                   function(x) unlist(strsplit(x, ";",
                                                               fixed =
                                                                 TRUE))[1]))
    st1$`Reference variable` <- unlist(lapply(st1$`Variables A and B`,
                                              function(x) unlist(
                                                strsplit(x, ";", fixed =
                                                           TRUE))[2]))
    st1$`Variables A and B` <- NULL
    st1 <- st1[, c(9, 10, 1:8)]
    st1 <- dplyr::rename(st1, c("GRADING" = "Grading"))

    # Output
    return(list(
      FlaggedStudyData = summary_df1,
      SummaryTable = st1,
      SummaryData = summary_df2,
      SummaryPlot = p
    ))
  }

  # Never called, just for documentation.
  return(list(
    FlaggedStudyData = summary_df1,
    SummaryTable = st1,
    SummaryData = summary_df2,
    SummaryPlot = p
  ))
}

Implementation and use of thresholds

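The implementation above uses a threshold based on percentages (0-100). A check is graded 1 if its percentage of contradictions exceeds threshold_value, and 0 otherwise. If threshold_value is not specified, the function issues a warning and uses the default of 0.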
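
The grading rule can be reproduced in isolation. In this small sketch the percentages are hypothetical; the comparison mirrors the one used inside the function.

```r
# Percentages of contradictions for three hypothetical checks
percent <- c(5.00, 0.23, 1.00)
threshold_value <- 1

# A check is graded 1 only if its percentage exceeds the threshold;
# a percentage equal to the threshold is still graded 0
grading <- ifelse(percent > threshold_value, 1, 0)
grading
```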

Call of the R-function

AnyContradictions <- con_contradictions(study_data      = sd1,
                                        meta_data       = md1,
                                        label_col       = "LABEL",
                                        check_table     = checks,
                                        threshold_value = 1)
## Labels of variables from "LABEL" will be used. In this case columns A and B in check_tables must refer to labels.
## Warning: In con_contradictions: All variables with CONTRADICTIONS in the metadata are used.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: N = 3 values in EDUCATION_1 have been above HARD_LIMITS and were removed.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMITS and were removed.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: Variables: AGE_0, AGE_1, EXAM_DT_0, LAB_DT_0 have no assigned labels and levels.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)

OUTPUT

Output 1: FlaggedStudyData

This implementation returns four objects. The data frame FlaggedStudyData flags each observation in the study data that has one or more contradictions between different variables. For each applied check, an additional column is added whose name consists of the prefix grading_ and the ID of the check. The object can be accessed via AnyContradictions$FlaggedStudyData.
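
A typical use of this output is to list the observations affected by at least one contradiction. The following sketch uses a hand-made data frame `flagged` with the same structure (an index column followed by one column per check); the values are hypothetical.

```r
# Hypothetical excerpt with the structure of FlaggedStudyData
flagged <- data.frame(
  Obs          = 1:5,
  grading_1001 = c(0, 1, 0, NA, 0),
  grading_1002 = c(0, 0, 0, 1, 1)
)

# Number of contradictions per observation (first column is the index);
# checks that could not be evaluated (NA) are ignored
n_flags <- rowSums(flagged[, -1, drop = FALSE], na.rm = TRUE)

# Observations with at least one contradiction
affected <- flagged$Obs[n_flags > 0]

# Percentage of observations affected
pct_any <- mean(n_flags > 0) * 100
```

With the real result, `flagged` would be replaced by AnyContradictions$FlaggedStudyData.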

Output 2: Summary table 1

The second output of the contradiction function is a data frame that summarizes the number of contradictions for each variable that has been examined. This object is primarily used by the dataquieR function dq_report to summarize information on all examined variables. It can be accessed via AnyContradictions$SummaryTable.

Variables Reference variable Check ID Check type A Levels B Levels Contradictions (N) Contradictions (%) GRADING Label
AGE_1 AGE_0 1001 A_less_than_B_vv NA NA 150 5.00 1 Age follow-up
SEX_1 SEX_0 1002 A_not_equal_B_vv NA NA 150 5.00 1 Sex follow-up
EDUCATION_1 EDUCATION_0 1003 A_less_than_B_vv NA NA 7 0.23 0 Education follow-up
EATING_PREFS_0 MEAT_CONS_0 1004 A_levels_and_B_levels_ll vegetarian 1-2d a week|3-4d a week|5-6d a week|daily 54 1.80 1 Nutrition inconsistency vegetarian
EATING_PREFS_0 MEAT_CONS_0 1005 A_levels_and_B_levels_ll vegan 1-2d a week|3-4d a week|5-6d a week|daily 19 0.63 0 Nutrition inconsistency vegan
EATING_PREFS_0 MEAT_CONS_0 1006 A_levels_and_B_levels_ll none never 64 2.13 1 Nutrition inconsistency


Output 3: Summary table 2

The third output summarizes this information in a similar way, organized by applied check. It can be used to provide an executive overview of the number of contradictions and can be accessed via AnyContradictions$SummaryData.

Check ID Check type Variables A and B A Levels B Levels Contradictions (N) Contradictions (%) Grading Label
1001 A_less_than_B_vv A is: AGE_1; B is: AGE_0 NA NA 150 5.00 1 Age follow-up
1002 A_not_equal_B_vv A is: SEX_1; B is: SEX_0 NA NA 150 5.00 1 Sex follow-up
1003 A_less_than_B_vv A is: EDUCATION_1; B is: EDUCATION_0 NA NA 7 0.23 0 Education follow-up
1004 A_levels_and_B_levels_ll A is: EATING_PREFS_0; B is: MEAT_CONS_0 vegetarian 1-2d a week|3-4d a week|5-6d a week|daily 54 1.80 1 Nutrition inconsistency vegetarian
1005 A_levels_and_B_levels_ll A is: EATING_PREFS_0; B is: MEAT_CONS_0 vegan 1-2d a week|3-4d a week|5-6d a week|daily 19 0.63 0 Nutrition inconsistency vegan
1006 A_levels_and_B_levels_ll A is: EATING_PREFS_0; B is: MEAT_CONS_0 none never 64 2.13 1 Nutrition inconsistency
1007 A_levels_and_B_levels_ll A is: SMOKING_0; B is: SMOKE_SHOP_0 no 1-2d a week|3-4d a week|5-6d a week|daily 91 3.03 1 Non-smokers inconsistency
1008 A_levels_and_B_levels_ll A is: SMOKING_0; B is: SMOKE_SHOP_0 yes never 118 3.93 1 Smokers inconsistency
1009 A_not_equal_B_vv A is: ARM_CIRC_DISC_0; B is: ARM_CUFF_0 NA NA 173 5.77 1 Blood pressure false cuff
1010 A_levels_and_B_gt_value_lc A is: PREGNANT_0; B is: AGE_0 yes NA 5 0.17 0 Pregnancy high age
1011 A_less_than_B_vv A is: LAB_DT_0; B is: EXAM_DT_0 NA NA 116 3.87 1 LAB before MEX
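
For reporting, it is often enough to list the checks that exceed the threshold. The sketch below copies three rows of the table above into a hand-made data frame; the simplified column names are hypothetical, since the real output uses names containing spaces such as "Contradictions (%)".

```r
# Three rows copied from the table above, with simplified column names
summary_toy <- data.frame(
  check_id = c(1001, 1003, 1009),
  percent  = c(5.00, 0.23, 5.77),
  grading  = c(1, 0, 1),
  label    = c("Age follow-up", "Education follow-up",
               "Blood pressure false cuff"),
  stringsAsFactors = FALSE
)

# Keep checks above the threshold, most frequent first
flagged_checks <- summary_toy[summary_toy$grading == 1, ]
flagged_checks <- flagged_checks[order(-flagged_checks$percent), ]
flagged_checks$label
```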

Output 4: Summary plot

The fourth output visualizes the summarized information of outputs 2 and 3.

AnyContradictions$SummaryPlot

INTERPRETATION

Any contradiction in the study data should be resolved by appropriate data curation steps.

Concept relations

References

Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., and others (2018). Shiny: Web application framework for R, 2015. R package version 1, 14.
Potter, G., Wong, J., Alcaraz, I., Chi, P., and others (2016). Web application teaching tools for statistics using R and Shiny. Technology Innovations in Statistics Education 9.