Introduction

dataquieR provides many outputs that are ready to be integrated into a quality report. Usually, however, requirements are more specific. This documentation shows how to adjust the two most common types of output, data frames and ggplot2 graphics, to meet such requirements.

Example output of dataquieR

The basic example used in this documentation requires two objects which are mandatory for all dataquieR functions:

  • study data
  • metadata

These data are loaded from the dataquieR package.

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

The example output is generated using the dataquieR function: com_item_missingness().

tab_ex1 <- com_item_missingness(study_data = sd1,
                                meta_data = md1,
                                threshold_value = 90,
                                include_sysmiss = TRUE,
                                show_causes = FALSE)
#> Warning: In com_item_missingness: Setting suppressWarnings to its default FALSE
#> > com_item_missingness(study_data = sd1, meta_data = md1, threshold_value = 90, 
#>     include_sysmiss = TRUE, show_causes = FALSE)

This function returns three objects: SummaryTable, SummaryPlot, and ReportSummaryTable. The first is a data frame, the second a ggplot object. The following steps show how to edit these two objects.

Data frames

For the use of data frames in data quality reporting, there are two important aspects.

  1. They should be displayed in a neat and comprehensible way. Many packages exist for this purpose, e.g. xtable, kableExtra, pixiedust, huxtable and DT, each of which integrates with some of the most common output formats supported by rmarkdown/pandoc, namely html, docx, pdf, and flexdashboard. For their use, please refer to the documentation of these packages.

  2. Given the size of data frames, there must be ways to filter and/or sort them, to add or remove columns, and to rename columns. For these tasks, the tidyverse with the dplyr package is a good choice.
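As a minimal sketch of the operations named in the second point, the following uses a small hand-made data frame that only imitates the layout of a dataquieR summary table. The values and the names toy, flagged, and variable are made up for illustration and are not dataquieR output.

```r
library(dplyr)

# Toy table imitating the layout of a dataquieR summary table
# (illustrative values, not real dataquieR output).
toy <- data.frame(
  Variables = c("v1", "v2", "v3"),
  `Observations N` = c(100, 100, 100),
  GRADING = c(0, 1, 1),
  check.names = FALSE # keep the blank in the column name
)

flagged <- toy %>%
  filter(GRADING == 1) %>%         # keep only rows with GRADING equal to 1
  rename(variable = Variables) %>% # rename a column (new name = old name)
  mutate(flagged = TRUE)           # add a new column

flagged
```

The same verbs can be applied to tab_ex1$SummaryTable, since it is an ordinary data frame.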

Related to the next topic (ggplot2 graphics generated by dataquieR), the wide and long formats of tables are another relevant point. tidyr is one possible choice for transforming tables from long to wide format.

The simplest output of the data frame looks like this (only the first 10 rows are shown to reduce file size):

knitr::kable(head(tab_ex1$SummaryTable, 10))
| Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
|---|---|---|---|---|---|---|---|
| v00000 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00001 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00103 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v10000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00004 | 2940 | 299 (10.17) | 2641 (89.83) | 140 (4.76) | 0 (0) | 2501 (85.07) | 1 |
| v00005 | 2940 | 293 (9.97) | 2647 (90.03) | 163 (5.54) | 0 (0) | 2484 (84.49) | 1 |


Styling

The table above comprises information on the missing values of all variables in the study data. However, it is not the most attractive output. We may use some functionality of the kableExtra package and apply its formatting to the table using the pipe operator %>% provided by dplyr.

suppressPackageStartupMessages(library(dplyr))
library(kableExtra)
kable(tab_ex1$SummaryTable, "html") %>%
  kable_styling(bootstrap_options = c("hover"))
| Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
|---|---|---|---|---|---|---|---|
| v00000 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00001 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00103 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v10000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00004 | 2940 | 299 (10.17) | 2641 (89.83) | 140 (4.76) | 0 (0) | 2501 (85.07) | 1 |
| v00005 | 2940 | 293 (9.97) | 2647 (90.03) | 163 (5.54) | 0 (0) | 2484 (84.49) | 1 |
| v00006 | 2940 | 306 (10.41) | 2634 (89.59) | 76 (2.59) | 0 (0) | 2558 (87.01) | 1 |
| v00007 | 2940 | 287 (9.76) | 2653 (90.24) | 72 (2.45) | 0 (0) | 2581 (87.79) | 1 |
| v00008 | 2940 | 285 (9.69) | 2655 (90.31) | 120 (4.08) | 0 (0) | 2535 (86.22) | 1 |
| v00009 | 2940 | 280 (9.52) | 2660 (90.48) | 63 (2.14) | 0 (0) | 2597 (88.33) | 1 |
| v00109 | 2940 | 298 (10.14) | 2642 (89.86) | 69 (2.35) | 0 (0) | 2573 (87.52) | 1 |
| v00010 | 2940 | 296 (10.07) | 2644 (89.93) | 81 (2.76) | 0 (0) | 2563 (87.18) | 1 |
| v00011 | 2940 | 149 (5.07) | 2791 (94.93) | 69 (2.35) | 0 (0) | 2722 (92.59) | 0 |
| v00012 | 2940 | 140 (4.76) | 2800 (95.24) | 85 (2.89) | 0 (0) | 2715 (92.35) | 0 |
| v00013 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v20000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00014 | 2940 | 232 (7.89) | 2708 (92.11) | 69 (2.35) | 0 (0) | 2639 (89.76) | 1 |
| v00015 | 2940 | 242 (8.23) | 2698 (91.77) | 72 (2.45) | 0 (0) | 2626 (89.32) | 1 |
| v00016 | 2940 | 308 (10.48) | 2632 (89.52) | 0 (0) | 0 (0) | 2632 (89.52) | 1 |
| v00017 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v30000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00018 | 2940 | 148 (5.03) | 2792 (94.97) | 380 (12.93) | 0 (0) | 2412 (82.04) | 1 |
| v01018 | 2924 | 159 (5.44) | 2765 (94.56) | 416 (14.23) | 0 (0) | 2349 (80.34) | 1 |
| v00019 | 2924 | 198 (6.77) | 2726 (93.23) | 413 (14.12) | 0 (0) | 2313 (79.1) | 1 |
| v00020 | 2924 | 202 (6.91) | 2722 (93.09) | 432 (14.77) | 0 (0) | 2290 (78.32) | 1 |
| v00021 | 2924 | 236 (8.07) | 2688 (91.93) | 428 (14.64) | 0 (0) | 2260 (77.29) | 1 |
| v00022 | 2924 | 224 (7.66) | 2700 (92.34) | 448 (15.32) | 0 (0) | 2252 (77.02) | 1 |
| v00023 | 2924 | 247 (8.45) | 2677 (91.55) | 451 (15.42) | 0 (0) | 2226 (76.13) | 1 |
| v00024 | 2924 | 259 (8.86) | 2665 (91.14) | 449 (15.36) | 0 (0) | 2216 (75.79) | 1 |
| v00025 | 2924 | 1681 (57.49) | 1243 (42.51) | 513 (17.54) | 0 (0) | 730 (24.97) | 1 |
| v00026 | 2924 | 320 (10.94) | 2604 (89.06) | 481 (16.45) | 0 (0) | 2123 (72.61) | 1 |
| v00027 | 2924 | 289 (9.88) | 2635 (90.12) | 499 (17.07) | 1113 (38.06) | 1023 (56.49) | 1 |
| v00028 | 2924 | 311 (10.64) | 2613 (89.36) | 515 (17.61) | 0 (0) | 2098 (71.75) | 1 |
| v00029 | 2924 | 350 (11.97) | 2574 (88.03) | 519 (17.75) | 1066 (36.46) | 989 (53.23) | 1 |
| v00030 | 2924 | 1809 (61.87) | 1115 (38.13) | 550 (18.81) | 0 (0) | 565 (19.32) | 1 |
| v00031 | 2924 | 386 (13.2) | 2538 (86.8) | 556 (19.02) | 0 (0) | 1982 (67.78) | 1 |
| v00032 | 2924 | 382 (13.06) | 2542 (86.94) | 332 (11.35) | 0 (0) | 2210 (75.58) | 1 |
| v00033 | 2924 | 60 (2.05) | 2864 (97.95) | 0 (0) | 0 (0) | 2864 (97.95) | 0 |
| v40000 | 2924 | 60 (2.05) | 2864 (97.95) | 0 (0) | 0 (0) | 2864 (97.95) | 0 |
| v00034 | 2924 | 453 (15.49) | 2471 (84.51) | 299 (10.23) | 0 (0) | 2172 (74.28) | 1 |
| v00035 | 2864 | 479 (16.72) | 2385 (83.28) | 324 (11.31) | 0 (0) | 2061 (71.96) | 1 |
| v00036 | 2864 | 491 (17.14) | 2373 (82.86) | 325 (11.35) | 0 (0) | 2048 (71.51) | 1 |
| v00037 | 2864 | 483 (16.86) | 2381 (83.14) | 374 (13.06) | 0 (0) | 2007 (70.08) | 1 |
| v00038 | 2864 | 552 (19.27) | 2312 (80.73) | 374 (13.06) | 0 (0) | 1938 (67.67) | 1 |
| v00039 | 2864 | 563 (19.66) | 2301 (80.34) | 389 (13.58) | 0 (0) | 1912 (66.76) | 1 |
| v00040 | 2864 | 531 (18.54) | 2333 (81.46) | 401 (14) | 0 (0) | 1932 (67.46) | 1 |
| v00041 | 2864 | 560 (19.55) | 2304 (80.45) | 427 (14.91) | 0 (0) | 1877 (65.54) | 1 |
| v00042 | 2864 | 60 (2.09) | 2804 (97.91) | 0 (0) | 0 (0) | 2804 (97.91) | 0 |
| v50000 | 2864 | 60 (2.09) | 2804 (97.91) | 0 (0) | 0 (0) | 2804 (97.91) | 0 |

Paging

The table above is getting very long. Another possibility is to use paged output for data frames. To enable it, add the line df_print: paged under output in the YAML header. Simply calling the data frame then allows browsing its rows and columns. Alternatively, you may use the DT package, even as the default printer for data frames.

tab_ex1$SummaryTable

To use DT, you would have to add a chunk like the following to your R-Markdown file:

```{r include=FALSE}
library(knitr)
library(DT)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
knit_print.data.frame = function(x, ...) { knit_print(DT::datatable(x), ...) }
registerS3method("knit_print", "data.frame", knit_print.data.frame)
```

Remove columns

The column Observations N takes only three distinct values in the table above (2940, 2924 and 2864). If this detail is not needed, the column can be removed. This is achieved with the - operator inside select():

tab_ex1$SummaryTable %>%
  select(-'Observations N') 

The column Variables contains rather technical variable names that do not help to interpret the content. For this reason, all dataquieR functions have an option called label_col. The selected label can be any column in the metadata; our model suggests naming that column LABEL. For the time being, the labels must be valid in R formulas, which means that they should basically not contain characters other than letters or numbers. We plan to relax this condition.

tab_ex2 <- com_item_missingness(study_data = sd1,
                                meta_data = md1,
                                threshold_value = 90,
                                label_col = "LABEL",
                                include_sysmiss = TRUE,
                                show_causes = FALSE)
#> Warning: In com_item_missingness: Setting suppressWarnings to its default FALSE
#> > com_item_missingness(study_data = sd1, meta_data = md1, threshold_value = 90, 
#>     label_col = "LABEL", include_sysmiss = TRUE, show_causes = FALSE)

tab_ex2$SummaryTable %>%
  select(-'Observations N')

Order rows

We may also want to sort rows or columns. This can be achieved with dplyr functions as well:

tab_ex2$SummaryTable %>%
  select(-'Observations N') %>%
  arrange(desc(`Measurements N (%)`)) 

Sorting numerically by the number or percentage of measurements is currently a bit more complicated, because dataquieR returns text in these columns, so the call above sorts them as text. The percentage can be extracted as a number using the following code:

# Split each entry at the opening bracket. The result is a list of
# character vectors of length 2: the part before and the part after '('.
split_measurements_col <-
  strsplit(tab_ex2$SummaryTable$`Measurements N (%)`, # the measurement count column
           '(',         # split at the opening bracket
           fixed = TRUE # fixed string match, no pattern match
           )
# Select the second entry of each list element and unlist the result to get
# a character vector of percentages, as usual for data frame columns.
percent_part_in_col <-
  unlist(lapply(split_measurements_col, `[[`, 2))
# Remove the closing bracket and convert the characters to numbers.
sort_order <- as.numeric(sub(')', '', percent_part_in_col, fixed = TRUE))
tab_ex2$SummaryTable %>%
  select(-'Observations N') %>%
  arrange(desc(sort_order))
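The same extraction can be written more compactly with a regular expression. The following is a sketch in base R on made-up values in the same "N (%)" text format; the names measurements, n_part, pct_part, and ord are illustrative and not part of dataquieR.

```r
# Toy values in the "N (%)" text format used in the summary table
# (illustrative, not real dataquieR output).
measurements <- c("2501 (85.07)", "2940 (100)", "730 (24.97)")

# Count: drop everything from the first blank onwards.
n_part <- as.numeric(sub(" .*$", "", measurements))

# Percentage: keep only the text between the brackets.
pct_part <- as.numeric(sub("^.*\\((.*)\\)$", "\\1", measurements))

# order() returns the row indices sorted by descending percentage.
ord <- order(pct_part, decreasing = TRUE)
measurements[ord]
```

Either numeric vector can then be passed to arrange() or used to index the rows of the table directly.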

Reorder columns

Maybe the columns should be in some other order too:

tab_ex2$SummaryTable %>%
  select(-'Observations N') %>% # remove this column in a separate step; otherwise everything() in the next line would add it again, so we keep two lines.
  select(`Variables`, `Measurements N (%)`, everything()) # everything() appends all columns not yet selected.
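As an alternative sketch, dplyr (version 1.0.0 or later) also offers relocate(), which moves single columns without listing the others. The data frame toy below is made up for illustration and only imitates the column layout of the summary table.

```r
library(dplyr)

# Toy table imitating the column layout of the summary table
# (illustrative values, not real dataquieR output).
toy <- data.frame(
  Variables = c("v1", "v2"),
  `Observations N` = c(100, 100),
  `Measurements N (%)` = c("90 (90)", "80 (80)"),
  GRADING = c(0, 1),
  check.names = FALSE
)

reordered <- toy %>%
  select(-`Observations N`) %>%         # drop the unwanted column
  relocate(GRADING, .after = Variables) # move GRADING right behind Variables

names(reordered)
```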

Plots from ggplot2

The versatile ggplot2 package makes it possible to modify graphics after they have been created, to render them in vector formats, and even to extract the underlying data. It is handy for interfacing with user code. In addition, ggplot2 is built on a comprehensive concept, a grammar of graphics, which makes it highly structured and its code easy to understand. For more advice about the ggplot2 package, we kindly refer to the vignettes of that package:

browseVignettes(package = "ggplot2")
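The three capabilities mentioned above (modifying a finished plot, rendering to a vector format, and accessing the underlying data) can be sketched on a toy histogram. The data and the names toy, p_toy, p_mod, and out_file are made up for this illustration; access via $data reflects the ggplot2 versions we are aware of.

```r
library(ggplot2)

# Toy data; not related to the dataquieR example data.
set.seed(1)
toy <- data.frame(value = rnorm(200, mean = 50, sd = 10))

p_toy <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 20)

# Modify the finished plot: new axis labels and another theme.
p_mod <- p_toy + labs(x = "Measured value", y = "Count") + theme_minimal()

# Render to a vector format (PDF), here into a temporary file.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p_mod, width = 6, height = 4)

# The data the plot was built from is still stored in the object.
head(p_mod$data)
```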

The package dataquieR generates two types of ggplot-objects.

  1. Either a single summary plot called SummaryPlot or
  2. a list of plots called SummaryPlotList.

The latter is used if several plots are generated, typically one for each variable of the study data. As the handling and manipulation of a single SummaryPlot is more straightforward, we exemplify a plot list using the dataquieR function acc_distributions:

ex1 <- acc_distributions(resp_vars      = NULL, 
                         group_vars     = NULL, 
                         label_col      = "LABEL",
                         study_data     = sd1, 
                         meta_data      = md1)
#> Warning: In acc_distributions: All variables defined to be integer or float in the metadata are used
#> > acc_distributions(resp_vars = NULL, group_vars = NULL, label_col = "LABEL", 
#>     study_data = sd1, meta_data = md1)
#> Warning: In acc_distributions: Variables PART_STUDY, PART_PHYS_EXAM, PART_LAB contain only one value and will be removed from analyses.
#> > acc_distributions(resp_vars = NULL, group_vars = NULL, label_col = "LABEL", 
#>     study_data = sd1, meta_data = md1)

This yields a set of 40 figures, all of which are ggplot2 objects:

unique(unlist(lapply(ex1$SummaryPlotList, class)))
#> [1] "gg"     "ggplot"

There is a package named ggedit for editing ggplot2 objects easily. Nevertheless, the basics of doing so are discussed in the following. For more complex adjustments, we recommend ggedit.

Lists of plots

To list them all, a simple print of the ex1$SummaryPlotList can be used, but this will also print the “normal” output of printing a list, i.e. the names or numbers of all its elements. To avoid this, you can simply print each element of the list separately:

# for (i in 1:length(ex1$SummaryPlotList)) # substituted by the next row to shorten the output of this vignette:
for (i in head(seq_along(ex1$SummaryPlotList), 4)) {
  print(ex1$SummaryPlotList[[i]])
}

Of course, an apply-style iteration would be possible too, but for plotting figures, the for loop fits perfectly.
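For comparison, a sketch of the apply-style variant on a toy list of plots. The list plots stands in for a SummaryPlotList and is built from the mtcars data set shipped with R. Wrapping the call in invisible() is what prevents the list output from being printed.

```r
library(ggplot2)

# Toy list of plots standing in for a SummaryPlotList (illustrative only).
plots <- lapply(c("mpg", "hp"), function(v) {
  ggplot(mtcars, aes(x = .data[[v]])) +
    geom_histogram(bins = 10) +
    ggtitle(v)
})

# lapply() returns the list of printed plots; invisible() keeps that
# return value from being printed as a list.
invisible(lapply(plots, print))
```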

Using this code, all figures are printed one below the other. To arrange them in columns, the chunk option out.width can be handy. rmarkdown places figures side by side if the current row is not yet filled, so something like out.width=c('50%', '50%') can be used to achieve a two-column image list.
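The chunk option can also be set once for the whole document in a setup chunk. The following sketch assumes an HTML output format and additionally sets fig.show = "hold", which collects all figures of a chunk and places them together at its end; this second option is our assumption and not part of the description above.

```r
library(knitr)

# Assumed settings: each figure takes half of the available width, and all
# figures of a chunk are held back and placed together at the chunk's end.
opts_chunk$set(out.width = "50%", fig.show = "hold")
```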

Arrange plots

Another possibility to arrange lists of plots is the ggpubr package, which handles lists of ggplot2 objects:

ggpubr::ggarrange(plotlist = ex1$SummaryPlotList[1:4])

An alternative to ggpubr is the patchwork package, which provides a very intuitive way of aligning ggplot2 graphics:

library(patchwork)
p1 <- ex1$SummaryPlotList[[1]]
p2 <- ex1$SummaryPlotList[[2]]
p3 <- ex1$SummaryPlotList[[3]]

p1 | (p2 / p3)

See the patchwork vignette for more details.
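For a whole list of plots, patchwork also provides wrap_plots(), which accepts a list directly. The sketch below uses a toy list built from the mtcars data set; the names plots and combined are illustrative.

```r
library(ggplot2)
library(patchwork)

# Toy list of plots standing in for a SummaryPlotList (illustrative only).
plots <- lapply(c("mpg", "hp", "wt", "qsec"), function(v) {
  ggplot(mtcars, aes(x = .data[[v]])) +
    geom_histogram(bins = 10) +
    ggtitle(v)
})

# Arrange all plots of the list in a grid with two columns.
combined <- wrap_plots(plots, ncol = 2)
combined
```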

Plot rotation

Please note that the plot has been rotated, so the x/y coordinates may not always be intuitive in the following. There are reasons for rotating histograms that way, but one of the following examples will rotate the plot back to the more common presentation with the counts on the y-axis.

As a first example of manipulating figures, we want to add a red line. This is easily achieved with ggplot2's + operator. We use the annotate function instead of the geom_* functions to draw objects that are not directly mapped (by aes) to specific data points/samples. This avoids redundantly plotting the very same object again for each data point/sample:

library(ggplot2)
print(
  ex1$SummaryPlotList[[3]] +
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red")
)

Highlighting

Then, we may like to highlight the largest bin in red. For this, we need to access the bins calculated by geom_histogram which the ggplot_build function makes accessible for ggplot2-objects:

p <- ex1$SummaryPlotList[[3]] # choose the third figure generated by dataquieR.
x <- ggplot_build(p) # make its graphical properties accessible.
largest_bin <- which.max(x[["data"]][[1]][["count"]]) # find the largest bin.
print(x[["data"]][[1]][largest_bin, c("xmin", "xmax", "ymin", "ymax")]) # this would print out the cartesian coordinates of the largest bin.
#>     xmin  xmax ymin ymax
#> 17 50.55 51.45    0  264
# see also the helpful contribution there: https://community.rstudio.com/t/geom-histogram-max-bin-height/10026
print( # print
  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with a red-framed rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red")
)
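As a shorter route to the computed bins, ggplot2 also provides layer_data(), which returns the computed data of one layer. The sketch below uses a toy histogram; the names toy, p_toy, bins, and largest are made up for illustration.

```r
library(ggplot2)

# Toy histogram; not related to the dataquieR example data.
set.seed(1)
toy <- data.frame(value = rnorm(200, mean = 50, sd = 10))
p_toy <- ggplot(toy, aes(x = value)) + geom_histogram(bins = 20)

# Computed data of the first layer, i.e. one row per histogram bin.
bins <- layer_data(p_toy, 1)

# Coordinates of the bin with the highest count.
largest <- bins[which.max(bins$count), c("xmin", "xmax", "ymin", "ymax")]
largest
```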

Annotation

Unfortunately, the documentation of the annotate function is a bit sparse. Its geom parameter refers to existing implementations of graphics in ggplot2, all of which are prefixed with geom_. Usually, they extract their coordinates from the data using the mapping given in the aes parameter of the whole ggplot2 object or of the specific geom. A useful geom besides segment and rect is text, for actually annotating the plot:

print( # print
  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 270, vjust = -.5)
)

See the documentation of ggplot2::annotate for some examples.

Coordinates are given in the same coordinate system that is shown in the plot, so drawing a line at 100 observations is as easy as directly choosing 100 as the y coordinate:

print( # print
  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 100, yend = 100, colour = "red") + # add a second red line at 100 observations
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 270, vjust = -.5)
)

As promised above, we will now re-rotate the whole plot.

p2 <-  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 100, yend = 100, colour = "red") + # add a second red line at 100 observations
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 0, vjust = -.5)
# coord_cartesian() restores the original cartesian coordinate system, replacing
# the flipped one introduced by acc_distributions. This emits a message about
# replacing the coordinate system, which we suppress here with suppressMessages.
suppressMessages(p2 + coord_cartesian())

Note that neither ggplot2::coord_flip nor ggpubr::rotate can solve this issue. These functions are not aware of already-rotated plots, so the following will not rotate the plot back:

p2 + coord_flip()     # does not rotate the plot back, but prints:
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

p2 + ggpubr::rotate() # does not rotate the plot back, but prints:
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
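The replacement of the coordinate system can be checked on a toy plot. The sketch below inspects the coordinates element of the plot object, which is how the ggplot2 versions we are aware of store the coordinate system; the names p_flipped and p_upright are illustrative.

```r
library(ggplot2)

# Toy histogram that is flipped, like the plots from acc_distributions.
p_flipped <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  coord_flip()

# Adding a coordinate system replaces the existing one (with a message).
p_upright <- suppressMessages(p_flipped + coord_cartesian())

# Class of the coordinate system before and after the replacement.
class(p_flipped$coordinates)[1]
class(p_upright$coordinates)[1]
```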

Add new data

All functions of dataquieR use the data as they are imported, i.e. variables of the study data can be examined and used for grouping/stratification of results. All information for these variables must be attached to the metadata. In some situations, particularly during exploratory data quality reporting, it is necessary to use a newly calculated or transformed variable. Naturally, the respective information is not defined in the metadata. Without further help, this would preclude the use of such calculated or transformed variables in data quality reporting.

The need for a helper function is illustrated with the following example from com_segment_missingness():

The SummaryPlot shows the frequency of observations in which all measurements of respective study segments are missing.

Exploring the segment missingness over time would require another variable in the study data. We will generate such a variable using the lubridate package.

sd1$exq <- as.integer(lubridate::quarter(sd1$v00013))
table(sd1$exq)
#> 
#>   1   2   3   4 
#> 724 713 776 727
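If lubridate is not available, the quarter can also be derived in base R. The following sketch uses made-up dates; the names dates and quarter_base are illustrative.

```r
# Toy dates (illustrative); the date variable of the study data plays
# this role in the example above.
dates <- as.Date(c("2018-01-15", "2018-05-01", "2018-12-31"))

# $mon counts months from 0 (January) to 11 (December), so integer
# division by 3 yields 0 to 3; adding 1 gives the quarter.
quarter_base <- as.integer(as.POSIXlt(dates)$mon %/% 3L + 1L)
quarter_base
```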

Information regarding this variable is then added to a copy of the metadata (md2) using the dataquieR function prep_add_to_meta():

md2 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exq", 
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_QUARTER_0",
                                   VALUE_LABELS = "1 = 1st | 2 = 2nd | 3 = 3rd | 4 = 4th",
                                   VARIABLE_ROLE = "process",
                                   MISSING_LIST = "",
                                   meta_data = md1)
MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md2, 
                                    threshold_value = 1, 
                                    label_col = LABEL,
                                    group_vars = "EX_QUARTER_0",
                                    direction = "high",
                                    exclude_roles = "process")
#> Warning: In com_segment_missingness: Study variables: "ARM_CUFF_0", "USR_VO2_0", "USR_BP_0", "EXAM_DT_0", "DEV_NO_0", "LAB_DT_0", "USR_SOCDEM_0", "INT_DT_0", "QUEST_DT_0" are not considered due to their VARIABLE_ROLE.
#> > com_segment_missingness(study_data = sd1, meta_data = md2, threshold_value = 1, 
#>     label_col = LABEL, group_vars = "EX_QUARTER_0", direction = "high",

MissSegs$SummaryPlot
