Inadmissible numerical values can be of type integer or float. This implementation requires intervals to be defined in the metadata in order to examine the admissibility of numerical study data.
ALGORITHM OF THIS IMPLEMENTATION:
Study data from the package dataquieR are loaded as shown below:
load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data
This example of study data has N=3000 observations. Study data variables have abstract and non-interpretable names; appropriate labels must be mapped from the metadata.
v00000 | v00001 | v00002 | v00003 | v00004 | v00005 | v01003 | v01002 | v00103 | v00006 |
---|---|---|---|---|---|---|---|---|---|
3 | LEIIX715 | 0 | 49 | 127 | 77 | 49 | 0 | 40-49 | 3.8 |
1 | QHNKM456 | 0 | 47 | 114 | 76 | 47 | 0 | 40-49 | 1.9 |
1 | HTAOB589 | 0 | 50 | 114 | 71 | 50 | 0 | 50-59 | 0.8 |
5 | HNHFV585 | 0 | 48 | 120 | 65 | 48 | 0 | 40-49 | 3.8 |
1 | UTDLS949 | 0 | 56 | 119 | 78 | 56 | 0 | 50-59 | 4.1 |
5 | YQFGE692 | 1 | 47 | 133 | 81 | 47 | 1 | 40-49 | 9.5 |
1 | AVAEH932 | 0 | 53 | 114 | 78 | 53 | 0 | 50-59 | 5.0 |
3 | QDOPT378 | 1 | 48 | 116 | 86 | 48 | 1 | 40-49 | 9.6 |
3 | BMOAK786 | 0 | 44 | 115 | 71 | 44 | 0 | 40-49 | 2.0 |
5 | ZDKNF462 | 0 | 50 | 116 | 74 | 50 | 0 | 50-59 | 2.4 |
Metadata from the package dataquieR are loaded as shown below:
load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data
Information corresponding to the study data is kept in the table of static metadata. An interpretable label for each variable is also attached. Besides the data type and label of each variable, further expected characteristics are stored in the metadata.
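The label mapping can be sketched with base R alone. This is a minimal illustration on toy data, not the mapping routine used by dataquieR; the objects `toy_study`, `toy_meta` and `labelled` are hypothetical, and only the pairs v00000/CENTER_0 and v00003/AGE_0 are taken from the example metadata.

```r
# Toy stand-ins for study data and metadata (hypothetical objects).
toy_study <- data.frame(v00000 = c(3, 1, 1), v00003 = c(49, 47, 50))
toy_meta <- data.frame(
  VAR_NAMES = c("v00000", "v00003"),
  LABEL = c("CENTER_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Look up the label of each study data column; keep the original name
# where the metadata has no matching entry.
idx <- match(names(toy_study), toy_meta$VAR_NAMES)
labelled <- toy_study
names(labelled) <- ifelse(is.na(idx), names(toy_study), toy_meta$LABEL[idx])
names(labelled)
```

Using `match()` keeps the column order of the study data and leaves unknown columns untouched instead of failing.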
For the following implementation, the metadata columns HARD_LIMITS as well as MISSING_LIST and JUMP_LIST are particularly relevant.
HARD_LIMITS have to be defined as intervals:
\([0; 100]\): any value between 0 and 100, including 0 and 100
\((0; 100)\): any value between 0 and 100, excluding 0 and 100
\([0; Inf)\): any non-negative numerical value, i.e. 0 or greater
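To make the interval convention concrete, the following sketch parses such a string and checks values against it. The helpers `parse_interval()` and `is_admissible()` are hypothetical and written only for this illustration; they are not the package's own parser, and they assume numeric bounds separated by a semicolon.

```r
# Hypothetical helper: parse one interval string such as "[0;100]",
# "(0;100)" or "[18;Inf)" into bounds and inclusiveness flags.
parse_interval <- function(x) {
  x <- gsub("\\s", "", x)
  pattern <- "^([\\[(])([^;]*);([^;]*)([\\])])$"
  if (!grepl(pattern, x, perl = TRUE)) stop("Not a valid interval: ", x)
  list(
    low = as.numeric(sub(pattern, "\\2", x, perl = TRUE)),
    up = as.numeric(sub(pattern, "\\3", x, perl = TRUE)),
    low_incl = startsWith(x, "["),  # "[" includes the lower bound
    up_incl = endsWith(x, "]")      # "]" includes the upper bound
  )
}

# Hypothetical helper: TRUE where a value lies inside the interval.
is_admissible <- function(value, interval) {
  lim <- parse_interval(interval)
  above_low <- if (lim$low_incl) value >= lim$low else value > lim$low
  below_up <- if (lim$up_incl) value <= lim$up else value < lim$up
  above_low & below_up
}

is_admissible(c(0, 50, 100), "[0;100]")
is_admissible(c(0, 50, 100), "(0;100)")
is_admissible(c(17, 18, 120), "[18;Inf)")
```

The sketch honours the bracket type at both ends, so 0 and 100 are admissible under `[0;100]` but not under `(0;100)`.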
This table shows the metadata of the example data that are required for this implementation:
| | VAR_NAMES | LABEL | MISSING_LIST | JUMP_LIST | HARD_LIMITS |
---|---|---|---|---|---|
4 | v00003 | AGE_0 | NA | NA | [18;Inf) |
39 | v00030 | MEDICATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;1] |
1 | v00000 | CENTER_0 | NA | NA | NA |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;4] |
23 | v00016 | DEV_NO_0 | NA | NA | NA |
43 | v40000 | PART_INTERVIEW | NA | NA | NA |
14 | v00009 | ARM_CIRC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | NA | [0;Inf) |
18 | v00012 | USR_BP_0 | 99981 \| 99982 | NA | NA |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;1] |
21 | v00014 | CRP_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99994 \| 99995 | NA | [0;Inf) |
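The MISSING_LIST entries in this table are lists of codes separated by " | ". If such codes stay in the data, a value like 99980 would count as a violation of a limit such as [0;1]. The sketch below shows one way to turn a code list into numeric codes and to replace them with NA before any limit check. `parse_codes()` is a hypothetical helper, not a dataquieR function, and the data vector is invented.

```r
# Hypothetical helper: split a pipe-separated code list into numeric codes.
parse_codes <- function(code_list) {
  if (is.na(code_list)) return(numeric(0))
  as.numeric(trimws(strsplit(code_list, "|", fixed = TRUE)[[1]]))
}

missing_codes <- parse_codes("99980 | 99983 | 99988")

# Invented measurements of a 0/1 variable that still contain missing codes.
x <- c(0, 1, 99980, 1, 99988)
x[x %in% missing_codes] <- NA
x
```

Note `fixed = TRUE` in `strsplit()`: without it, "|" would be read as the regular-expression alternation operator.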
The function can also be used with other metadata columns that contain limit definitions following the conventions mentioned above. Currently, SOFT_LIMITS and DETECTION_LIMITS are also handled by this implementation.
library(ggplot2)
library(gridExtra)
library(plyr)
library(dplyr)
library(stringr)
Please define all arguments used by the R-function:
CAVE:
The naming of the following implementation deviates from that of other implementations. This is motivated by the generic use of the function, which can process different types of limits, e.g. SOFT_LIMITS or DETECTION_LIMITS. A necessary convention is that all limits are defined in the same way as shown in the example of metadata.
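The generic naming can be illustrated with a small sketch. It assumes, as in the function source below, that the infix is the part of the limits column name before the first underscore and that the interpreted bounds are stored in columns named `<infix>_LIMIT_LOW` and `<infix>_LIMIT_UP`. The helper `limit_columns()` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: derive the names of the lower and upper bound
# columns from the name of a limits column.
limit_columns <- function(limits) {
  infix <- strsplit(limits, "_", fixed = TRUE)[[1]][1]
  c(low = paste0(infix, "_LIMIT_LOW"), up = paste0(infix, "_LIMIT_UP"))
}

limit_columns("HARD_LIMITS")
limit_columns("SOFT_LIMITS")
limit_columns("DETECTION_LIMITS")
```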
con_limit_deviations <- function(resp_vars = NULL, label_col, study_data,
meta_data, limits = c(
"HARD_LIMITS", "SOFT_LIMITS",
"DETECTION_LIMITS"
)) {
rvs <- resp_vars
infix <- unlist(strsplit(limits, "_"))[1]
# Preps ----------------------------------------------------------------------
# map meta to study
util_prepare_dataframes()
util_correct_variable_use("resp_vars",
allow_more_than_one = TRUE,
allow_null = TRUE,
allow_any_obs_na = TRUE,
need_type = "integer|float|datetime"
)
# which limits?
LIMITS <- toupper(match.arg(limits))
# variables correct?
# util_correct_variable_use("resp_variables", role = "response_vars")
# no variables defined?
if (length(rvs) == 0) {
if (all(is.na(meta_data[[LIMITS]]))) {
util_error(paste0("No Variables with defined ", LIMITS, "."))
} else {
util_warning(paste0("All variables with ", LIMITS,
" in the metadata are used."))
rvs <- meta_data[[label_col]][!(is.na(meta_data[[LIMITS]]))]
}
} else {
# limits defined at all?
if (all(is.na(meta_data[[LIMITS]][meta_data[[label_col]] %in% rvs]))) {
util_error(paste0("No Variables with defined ", LIMITS, "."))
}
# no limits for some variables?
rvs2 <- meta_data[[label_col]][!(is.na(meta_data[[LIMITS]])) &
meta_data[[label_col]] %in% rvs]
if (length(rvs2) < length(unique(rvs))) {
util_warning(paste0("The variables ",
paste0(rvs[!(rvs %in% rvs2)], collapse = ", "),
" have no defined limits."))
}
rvs <- rvs2
}
datetime_vars <- vapply(
rvs,
function(rv) {
meta_data[["DATA_TYPE"]][meta_data[[label_col]] == rv] ==
DATA_TYPES$DATETIME
},
logical(1)
)
# conversion of numeric handling needs a bit more coding, since it needs
# an origin then. in other cases (Date, character, ...),
# origin will be ignored by as.POSIXct
ds1[, rvs[datetime_vars]] <-
lapply(ds1[, rvs[datetime_vars], drop = FALSE], as.POSIXct, origin =
min(as.POSIXct(Sys.Date()), 0))
# remove rvs with non-matching data type
var_matches_datatype <-
vapply(FUN.VALUE = logical(1), ds1[, rvs, drop = FALSE],
function(x) is.numeric(x) || inherits(x, "POSIXct"))
if (!all(var_matches_datatype)) {
util_warning(paste0(
"Variables ", paste0(rvs[!var_matches_datatype], collapse = ", "),
" are neither numeric nor datetime and will be removed from analyses."
))
rvs <- rvs[var_matches_datatype]
}
if (length(rvs) == 0) {
util_error("No variables left, no limit checks possible.")
}
# interpret limit intervals
imdf <- util_interpret_limits(mdata = meta_data)
fsd_list <- lapply(setNames(nm = rvs), function(rv) {
fsd <- ds1[, rv, drop = FALSE]
ds1[!is.finite(ds1[[rv]]), rv] <- NA
# Extract and interpret available metadata -------------------------------
LOWER <- paste0(infix, "_LIMIT_LOW")
ll <- imdf[[LOWER]][imdf[[label_col]] == rv]
ll <- ifelse(is.infinite(ll), NA, ll)
UPPER <- paste0(infix, "_LIMIT_UP")
lu <- imdf[[UPPER]][imdf[[label_col]] == rv]
lu <- ifelse(is.infinite(lu), NA, lu)
if ((datetime_vars[[rv]])) {
ll <- as.POSIXct(ll, origin = min(as.POSIXct(Sys.Date()), 0))
lu <- as.POSIXct(lu, origin = min(as.POSIXct(Sys.Date()), 0))
}
# Fill summary DFs -------------------------------------------------------
BELOW <- paste0(rv, "_below_", infix)
ABOVE <- paste0(rv, "_above_", infix)
OUT <- paste0(rv, "_OUT_", infix)
if (!(is.na(ll))) {
fsd[[BELOW]][!(is.na(fsd[[rv]]))] <-
ifelse(as.numeric(fsd[[rv]][!(is.na(fsd[[rv]]))]) < ll, 1, 0)
} else {
fsd[[BELOW]][!(is.na(fsd[[rv]]))] <- 0
}
if (!(is.na(lu))) {
fsd[[ABOVE]][!(is.na(fsd[[rv]]))] <-
ifelse(as.numeric(fsd[[rv]][!(is.na(fsd[[rv]]))]) > lu, 1, 0)
} else {
fsd[[ABOVE]][!(is.na(fsd[[rv]]))] <- 0
}
return(fsd)
}) # end lapply fsd
fsd <-
do.call(cbind.data.frame, c(unname(fsd_list), list(
stringsAsFactors = FALSE)))
plot_list <- lapply(setNames(nm = rvs), function(rv) {
# Fill summary DFs -------------------------------------------------------
BELOW <- paste0(rv, "_below_", infix)
ABOVE <- paste0(rv, "_above_", infix)
OUT <- paste0(rv, "_OUT_", infix)
# Combine flag for plot
ds1[[OUT]] <- pmax(fsd[[BELOW]], fsd[[ABOVE]], na.rm = TRUE)
ds1[!is.finite(ds1[[rv]]), rv] <- NA
# Extract and interpret available metadata -------------------------------
LOWER <- paste0(infix, "_LIMIT_LOW")
ll <- imdf[[LOWER]][imdf[[label_col]] == rv]
ll <- ifelse(is.infinite(ll), NA, ll)
UPPER <- paste0(infix, "_LIMIT_UP")
lu <- imdf[[UPPER]][imdf[[label_col]] == rv]
lu <- ifelse(is.infinite(lu), NA, lu)
if (datetime_vars[[rv]]) {
ll <- as.POSIXct(ll, origin = min(as.POSIXct(Sys.Date()), 0))
lu <- as.POSIXct(lu, origin = min(as.POSIXct(Sys.Date()), 0))
}
# Calculation of values relevant for plot area ---------------------------
# data extrema
max_data <- max(ds1[[rv]], na.rm = TRUE)
min_data <- min(ds1[[rv]], na.rm = TRUE)
### Define bounds for graph
minx <- min(c(min_data, ll), na.rm = TRUE)
maxx <- max(c(max_data, lu), na.rm = TRUE)
if (!datetime_vars[[rv]]) {
# expand plot area
inc <- floor(0.1 * abs(maxx - minx))
if (inc < 1)
inc <- 1
minx <- minx - inc
maxx <- maxx + inc
}
if (maxx == 0) {
# expand plot area
maxx <- (maxx + max(1, floor(0.1 * (maxx - minx))))
}
if (minx == 0) {
# expand plot area
minx <- (minx - max(1, floor(0.1 * (maxx - minx))))
}
# differentiate continuous from discrete variables -------------------------
if (datetime_vars[[rv]] || !(all(ds1[[rv]] %% 1 == 0, na.rm = TRUE)) ||
length(unique(ds1[[rv]])) > 20) {
# continuous or integer with more than 20 values
# Freedman-Diaconis (2 * IQR(data) / length(data)^(1/3)):
# optimal width restricted to the data within limits!
thedata <- ds1[[rv]]
if (!is.na(ll)) {
thedata <- thedata[ds1[[rv]] >= ll]
}
if (!is.na(lu)) {
thedata <- thedata[ds1[[rv]] <= lu]
}
bw <- (2 * IQR(thedata, na.rm = TRUE) / length(thedata) ^ (1 / 3))
if (bw == 0) bw <- 1
# steps within hard limits
# (rounded according modulo division to meet limits)
dif <- as.numeric(maxx) - as.numeric(minx)
byX <- dif / (dif %/% bw)
# breaks must be within hard limits
# (old: breakswithin <- seq(xlims[1], xlims[2], by = byX))
breakswithin <- c(min_data - byX, seq(min_data, max_data, by = byX),
max_data + byX)
# breaks outside plausis (always the case since minx/maxx outside limits)
breakslower <- seq(minx, min_data, by = byX)
breaksupper <- seq(max_data, maxx, by = byX)
# rounding
if (datetime_vars[[rv]]) {
breaksX <- unique(c(breakslower, breakswithin, breaksupper))
} else {
breaksX <- round(unique(c(breakslower, breakswithin, breaksupper)), 3)
}
# if no values below/above
breaksX <- unique(breaksX[!is.na(breaksX)])
} else {
breaksX <- unique(ds1[[rv]][!(is.na(ds1[[rv]]))])
}
breaksX <- sort(breaksX)
if (length(unique(breaksX)) > 10000) {
likely1 <- ds1[util_looks_like_missing(ds1[[rv]]), rv, TRUE]
likely2 <- c(max(ds1[[rv]],
na.rm = TRUE),
min(ds1[[rv]],
na.rm = TRUE))
likely <- intersect(likely1, likely2)
if (length(likely) == 0)
likely <- likely2
util_warning(
c("For %s, I have %d breaks. Did you forget to specify some missing",
"codes (%s)? Will arbitrarily reduce the number of breaks below",
"10000 to avoid rendering problems."),
dQuote(rv), length(unique(breaksX)), paste0(dQuote(likely
), collapse = " or ")
)
while (length(unique(breaksX)) > 10000) {
breaksX <- breaksX[!is.na(breaksX)]
breaksX <- c(min(breaksX), breaksX[c(TRUE, FALSE)], max(breaksX))
}
util_warning(
c("For %s: arbitrarily reduced the number of breaks to",
"%d <= 10000 to avoid rendering problems."),
dQuote(rv), length(unique(breaksX)))
}
# Generate ggplot-objects for placing and annotation of lines --------------
# lower limit
if (is.na(ll)) {
# Create Line and Text
lll <- geom_vline(
xintercept = minx, color = "#999999", alpha = 1,
linetype = "dotted"
)
tll <- annotate("text",
x = minx, y = 0,
label = paste0(
"?lower ",
tolower(infix),
" limit?"
),
color = "#999999", angle = 0, vjust = 1, hjust = 0
)
} else {
# Detect number of cases below lower hard limit
below_hl <- sum(ds1[[rv]] < ll, na.rm = TRUE)
# Create Line and Text
if (all(util_is_integer(ds1[[rv]]), na.rm = TRUE)) {
xll <- ll - 0.5
} else {
xll <- ll
}
lll <- geom_vline(xintercept = xll, color = "#B2182B", alpha = 1)
tll <- annotate("text",
x = xll, y = 0,
label = paste0(
"limit ",
tolower(infix),
" low=", ll, "; Obs < LHL: ", below_hl
),
color = "#B2182B", angle = 0, vjust = 1.5, hjust = 0
)
}
# Upper limit
if (is.na(lu)) {
# Create Line and Text
llu <- geom_vline(
xintercept = maxx, color = "#999999", alpha = 1,
linetype = "dotted"
)
tlu <- annotate("text",
x = maxx, y = 0,
label = paste0(
"?upper ",
tolower(infix),
" limit?"
),
color = "#999999", angle = 0, vjust = 1, hjust = 0
)
} else {
# Detect number of cases above upper hard limit
above_hl <- sum(ds1[[rv]] > lu, na.rm = TRUE)
if (all(util_is_integer(ds1[[rv]]), na.rm = TRUE)) {
xlu <- lu + 0.5
} else {
xlu <- lu
}
# Create Line and Text
llu <- geom_vline(xintercept = xlu, color = "#B2182B", alpha = 1)
tlu <- annotate("text",
x = xlu, y = 0,
label = paste0(
"limit ",
tolower(infix),
" up=", lu, "; Obs > UHL: ", above_hl
),
color = "#B2182B", angle = 0, vjust = -0.5, hjust = 0
)
}
# building the plot --------------------------------------------------------
txtspec <- element_text(
colour = "black", # size = 16,
hjust = .5, vjust = .5, face = "plain"
) # angle = 0,
out_cols <- c("#2166AC", "#B2182B")
names(out_cols) <- c("0", "1")
if (datetime_vars[[rv]] || !(all(ds1[[rv]] %% 1 == 0, na.rm = TRUE)) ||
length(unique(ds1[[rv]])) > 20) {
# continuous or integer with more than 20 values
# if (!all(util_is_integer(ds1[[rv]]), na.rm = TRUE) || ) {
breaks <- unique(breaksX)
if (!datetime_vars[[rv]]) {
myxlim <- c(floor(minx), ceiling(maxx))
} else {
myxlim <- c(minx, maxx)
breaks <- as.POSIXct(breaks, origin = min(as.POSIXct(Sys.Date()), 0))
myxlim <- as.POSIXct(myxlim, origin = min(as.POSIXct(Sys.Date()), 0))
}
p <- ggplot(data = ds1, aes(x = ds1[[rv]], fill =
factor(ds1[[OUT]]))) +
geom_histogram(breaks = breaks) +
scale_fill_manual(values = out_cols, guide = "none") +
coord_flip(xlim = myxlim) +
labs(x = "", y = paste0(rv)) +
theme_minimal() +
theme(
title = txtspec,
axis.text.x = txtspec,
axis.text.y = txtspec,
axis.title.x = txtspec,
axis.title.y = txtspec
) +
# add line/text for lower limit
lll +
tll +
# add line/text for upper limit
llu +
tlu
} else {
p <- ggplot(ds1, aes(x = ds1[[rv]], fill = factor(ds1[[OUT]]))) +
geom_bar() +
scale_fill_manual(values = out_cols, guide = "none") +
coord_flip(xlim = c(floor(minx), ceiling(maxx))) +
labs(x = "", y = paste0(rv)) +
theme_minimal() +
theme(
title = txtspec,
axis.text.x = txtspec,
axis.text.y = txtspec,
axis.title.x = txtspec,
axis.title.y = txtspec
) +
# add line/text for lower limit
lll +
tll +
# add line/text for upper limit
llu +
tlu
}
return(p)
}) # end lapply plot_list
# remove violations of value limits
msdf <- ds1
for (current_rv in rvs) {
if (HARD_LIMITS %in% names(imdf)) {
# values below hard limit?
minx1 <- imdf[[HARD_LIMIT_LOW]][imdf[[label_col]] == current_rv]
minx2 <- suppressWarnings(min(msdf[[current_rv]], na.rm = TRUE))
if (!is.na(minx1) & minx1 > minx2) {
n_below <- sum(msdf[[current_rv]] < minx1, na.rm = TRUE)
msdf[[current_rv]][msdf[[current_rv]] < minx1] <- NA
util_warning(paste0("N = ", n_below, " values in ", current_rv,
" have been below %s and were removed."),
HARD_LIMIT_LOW)
}
# values above hard limit?
maxx1 <- imdf[[HARD_LIMIT_UP]][imdf[[label_col]] == current_rv]
maxx2 <- suppressWarnings(max(msdf[[current_rv]], na.rm = TRUE))
if (!is.na(maxx1) & maxx1 < maxx2) {
n_above <- sum(msdf[[current_rv]] > maxx1, na.rm = TRUE)
msdf[[current_rv]][msdf[[current_rv]] > maxx1] <- NA
util_warning(paste0("N = ", n_above, " values in ", current_rv,
" have been above %s and were removed."),
HARD_LIMIT_UP)
}
}
}
# add Summary Table with GRADING column
name_bel <- paste0("Below ", infix, " (N)")
name_abo <- paste0("Above ", infix, " (N)")
sumtab <- lapply(setNames(nm = rvs), function(rv) {
BELOW <- paste0(rv, "_below_", infix)
ABOVE <- paste0(rv, "_above_", infix)
r <- list(
rv,
sum(fsd[[BELOW]], na.rm = TRUE),
round(sum(fsd[[BELOW]], na.rm = TRUE) / sum(!(is.na(fsd[[rv]]))) * 100,
digits = 2),
sum(fsd[[ABOVE]], na.rm = TRUE),
round(sum(fsd[[ABOVE]], na.rm = TRUE) / sum(!(is.na(fsd[[rv]]))) * 100,
digits = 2)
)
r <- as.data.frame(r, stringsAsFactors = FALSE)
colnames(r) <- c(
"Variables",
paste0("Below ", infix, " (N)"),
paste0("Below ", infix, " (%)"),
paste0("Above ", infix, " (N)"),
paste0("Above ", infix, " (%)")
)
r
})
sumtab <- do.call(rbind.data.frame, c(sumtab, stringsAsFactors = FALSE,
deparse.level = 0, make.row.names =
FALSE))
sumtab$GRADING <- ifelse((sumtab[[name_bel]] > 0) | (sumtab[[name_abo]] > 0),
1, 0)
return(list(FlaggedStudyData = fsd, SummaryTable = sumtab, SummaryPlotList =
plot_list, ModifiedStudyData = msdf))
}
This implementation makes no use of thresholds.
This R function uses a vector of response variables.
MyValueLimits <- con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "DBP_0", "SEX_0"),
label_col = "LABEL",
study_data = sd1,
meta_data = md1,
limits = "HARD_LIMITS")
names(MyValueLimits)
The function can be applied to selected variables. The output comprises two tables and plots for each selected variable. The function checks whether the respective limits are specified for each selected variable. If not, a warning is issued.
MyValueLimits <- con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"),
label_col = "LABEL",
study_data = sd1,
meta_data = md1,
limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: The variables SEX_0 have no defined limits.
## > con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"),
## label_col = "LABEL", study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"),
## label_col = "LABEL", study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")
Output 1: FlaggedStudyData
The first table is related to the study data by a 1:1 relationship, i.e. for each observation it is checked whether the value is below or above the limits.
AGE_0 | AGE_0_below_HARD | AGE_0_above_HARD | SBP_0 | SBP_0_below_HARD | SBP_0_above_HARD |
---|---|---|---|---|---|
49 | 0 | 0 | 127 | 0 | 0 |
47 | 0 | 0 | 114 | 0 | 0 |
50 | 0 | 0 | 114 | 0 | 0 |
48 | 0 | 0 | 120 | 0 | 0 |
56 | 0 | 0 | 119 | 0 | 0 |
47 | 0 | 0 | 133 | 0 | 0 |
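The flag columns can be reproduced for a single variable with a few lines of base R. This is a simplified sketch, assuming that missing observations keep NA flags and that an undefined bound yields a flag of 0. `flag_limits()` is a hypothetical helper, and the limits 18 and 120 are chosen only for illustration.

```r
# Hypothetical helper: flag observations below or above given bounds.
# NA in low/up means that the bound is not defined.
flag_limits <- function(x, low = NA, up = NA) {
  below <- rep(NA_real_, length(x))
  above <- rep(NA_real_, length(x))
  obs <- !is.na(x)
  below[obs] <- if (is.na(low)) 0 else as.numeric(x[obs] < low)
  above[obs] <- if (is.na(up)) 0 else as.numeric(x[obs] > up)
  data.frame(value = x, below = below, above = above)
}

flags <- flag_limits(c(49, 17, NA, 130), low = 18, up = 120)
flags
```

Each row of the result corresponds to one observation, which is the 1:1 relationship described above.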
Output 2: SummaryTable
The second table summarizes this information for each variable.
Variables | Below HARD (N) | Below HARD (%) | Above HARD (N) | Above HARD (%) | GRADING |
---|---|---|---|---|---|
AGE_0 | 0 | 0 | 0 | 0 | 0 |
SBP_0 | 0 | 0 | 0 | 0 | 0 |
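How one row of such a summary can be derived from the flags is sketched below. The percentages refer to the non-missing observations, and GRADING is 1 as soon as any value lies outside the limits. `summarise_flags()` and its column names are hypothetical simplifications; the data are invented.

```r
# Hypothetical helper: summarise below/above flags for one variable.
summarise_flags <- function(name, x, below, above) {
  n_obs <- sum(!is.na(x))
  n_below <- sum(below, na.rm = TRUE)
  n_above <- sum(above, na.rm = TRUE)
  data.frame(
    Variables = name,
    below_n = n_below,
    below_pct = round(100 * n_below / n_obs, 2),
    above_n = n_above,
    above_pct = round(100 * n_above / n_obs, 2),
    GRADING = as.numeric(n_below > 0 | n_above > 0),
    stringsAsFactors = FALSE
  )
}

x <- c(49, 17, NA, 130, 50)
res <- summarise_flags("TOY_VAR", x,
                       below = c(0, 1, NA, 0, 0),
                       above = c(0, 0, NA, 1, 0))
res
```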
Output 3: SummaryPlotList
The plots for each variable are either a histogram (continuous) or a barplot (discrete) and all are added to a list which is accessed via MyValueLimits$SummaryPlotList.
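The choice between histogram and barplot follows a simple rule in the function source: a variable is treated as continuous if it contains non-integer values or more than 20 distinct values, and histogram bins are based on the Freedman-Diaconis width. The sketch below restates both ideas in isolation. `is_continuous()` and `fd_binwidth()` are hypothetical helpers; unlike the function above, they ignore datetime variables and do not restrict the data to values within the limits.

```r
# Hypothetical helper: TRUE if a numeric vector should be shown as a histogram.
is_continuous <- function(x) {
  !all(x %% 1 == 0, na.rm = TRUE) || length(unique(x)) > 20
}

# Hypothetical helper: Freedman-Diaconis bin width, 2 * IQR / n^(1/3),
# with a fallback of 1 when the IQR is zero.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  bw <- 2 * IQR(x) / length(x)^(1 / 3)
  if (bw == 0) 1 else bw
}

is_continuous(c(0, 1, 1, 0))     # integer-valued, few distinct values
is_continuous(c(3.8, 1.9, 0.8))  # non-integer values
fd_binwidth(1:27)
```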
Output 4: ModifiedStudyData
The fourth output object is a dataframe similar to the study data, but with limit deviations removed.
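Removal here means that violating values are set to NA while the observation itself is kept. A minimal sketch for one variable follows. `remove_violations()` is a hypothetical helper, the data are invented, and the bounds 0 and 4 correspond to the interval [0;4] given for SMOKE_SHOP_0 in the example metadata.

```r
# Hypothetical helper: set values outside the bounds to NA.
# NA in low/up means that the bound is not defined.
remove_violations <- function(x, low = NA, up = NA) {
  if (!is.na(low)) x[!is.na(x) & x < low] <- NA
  if (!is.na(up)) x[!is.na(x) & x > up] <- NA
  x
}

smoke_shop <- c(0, 4, 7, 2, NA, 9)
cleaned <- remove_violations(smoke_shop, low = 0, up = 4)
cleaned
```

The length of the vector is unchanged, so the result can replace the original column without disturbing the row alignment of the study data.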
It is not necessary to specify variables. In this case, the function searches for all numeric variables with defined limits. If the function identifies limit deviations, the respective values are removed in the ModifiedStudyData dataframe.
## Warning: In con_limit_deviations: All variables with HARD_LIMITS in the metadata are used.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1,
## limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1,
## limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 3 values in EDUCATION_1 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1,
## limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1,
## limits = "HARD_LIMITS")
## Warning: In con_limit_deviations: N = 349 values in MEDICATION_0 have been above HARD_LIMIT_UP and were removed.
## > con_limit_deviations(label_col = "LABEL", study_data = sd1, meta_data = md1,
## limits = "HARD_LIMITS")
Output 2: SummaryTable
Variables | Below HARD (N) | Below HARD (%) | Above HARD (N) | Above HARD (%) | GRADING |
---|---|---|---|---|---|
DBP_0 | 0 | 0 | 0 | 0.00 | 0 |
GLOBAL_HEALTH_VAS_0 | 0 | 0 | 0 | 0.00 | 0 |
ASTHMA_0 | 0 | 0 | 0 | 0.00 | 0 |
ARM_CIRC_0 | 0 | 0 | 0 | 0.00 | 0 |
ARM_CIRC_DISC_0 | 0 | 0 | 0 | 0.00 | 0 |
ARM_CUFF_0 | 0 | 0 | 0 | 0.00 | 0 |
EXAM_DT_0 | 0 | 0 | 0 | 0.00 | 0 |
CRP_0 | 0 | 0 | 0 | 0.00 | 0 |
BSG_0 | 0 | 0 | 0 | 0.00 | 0 |
LAB_DT_0 | 0 | 0 | 0 | 0.00 | 0 |
EDUCATION_0 | 0 | 0 | 0 | 0.00 | 0 |
EDUCATION_1 | 0 | 0 | 3 | 0.12 | 1 |
MARRIED_0 | 0 | 0 | 0 | 0.00 | 0 |
EATING_PREFS_0 | 0 | 0 | 0 | 0.00 | 0 |
MEAT_CONS_0 | 0 | 0 | 0 | 0.00 | 0 |
SMOKING_0 | 0 | 0 | 0 | 0.00 | 0 |
SMOKE_SHOP_0 | 0 | 0 | 24 | 2.98 | 1 |
PREGNANT_0 | 0 | 0 | 0 | 0.00 | 0 |
MEDICATION_0 | 0 | 0 | 349 | 54.45 | 1 |
AGE_0 | 0 | 0 | 0 | 0.00 | 0 |
N_ATC_CODES_0 | 0 | 0 | 0 | 0.00 | 0 |
INT_DT_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_1_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_2_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_3_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_4_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_5_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_6_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_7_0 | 0 | 0 | 0 | 0.00 | 0 |
ITEM_8_0 | 0 | 0 | 0 | 0.00 | 0 |
QUEST_DT_0 | 0 | 0 | 0 | 0.00 | 0 |
AGE_1 | 0 | 0 | 0 | 0.00 | 0 |
SBP_0 | 0 | 0 | 0 | 0.00 | 0 |
Output 3: Plot
Only the 3 selected plots are displayed to reduce the size of this file. However, for each variable with limits, a plot has been generated.
datetime
## Warning: In con_limit_deviations: Found invalid limits for 'HARD_LIMITS': "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)", "[2018-01-01 00:00:00 CET;)" -- will ignore these
## > con_limit_deviations(resp_vars = c("QUEST_DT_0"), label_col = "LABEL",
## study_data = sd1, meta_data = md1, limits = "HARD_LIMITS")
Output 2: SummaryTable
Variables | Below HARD (N) | Below HARD (%) | Above HARD (N) | Above HARD (%) | GRADING |
---|---|---|---|---|---|
QUEST_DT_0 | 0 | 0 | 0 | 0 | 0 |
Output 3: Plot
The definition of HARD_LIMITS is a common issue in the data curation process. For example, values of a numeric rating scale (0 - 10) should not exceed these limits, and values outside them must be removed or at least verified, as they are certain to be incorrect measurements. Nevertheless, there are measurements for which the definition of such limits is difficult. In this case, the alternative definition of SOFT_LIMITS is recommended.