Description

The acc_end_digits function examines the last decimal place (the end digit) of a response variable. End digits are worth checking whenever data have been transferred or edited manually, because a preference for rounding can occur.

The implementation of the acc_end_digits function is similar to the acc_shape_or_scale function, adapted from the idea of rootograms (Tukey 1977, Kleiber and Zeileis 2016). However, the emphasis is on the last decimals of the measurement variables rather than their overall distribution. In this way, the acc_end_digits function is an implementation of the Unexpected shape indicator and a descriptor for Unexpected proportions, which belong to the Unexpected distributions domain in the Accuracy dimension.

For more details, see the user’s manual and source code.

Usage and arguments

acc_end_digits(
  resp_vars = NULL,
  label_col = LABEL,
  study_data = sd1,
  meta_data = md1
)

The function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the item-level metadata.
  • resp_vars: mandatory, a character specifying the measurement variable of interest. The variable must be of float type.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.

There is no implementation of thresholds.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the acc_end_digits function, the metadata columns DATA_TYPE, MISSING_LIST, and DECIMALS (the number of decimal places) are relevant:

VAR_NAMES LABEL MISSING_LIST DATA_TYPE DECIMALS
9 v00004 SBP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 float 0
10 v00005 DBP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 float 0
11 v00006 GLOBAL_HEALTH_VAS_0 99980 | 99983 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 float 1
14 v00009 ARM_CIRC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 float 0
21 v00014 CRP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995 float 3
22 v00015 BSG_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995 float 0
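
The MISSING_LIST column stores all missing codes of a variable as a single string separated by "|". The following is a minimal sketch in base R of how such a string could be split and applied before end digits are examined. The helper parse_missing_list and the toy values are hypothetical and are not part of dataquieR, which handles missing codes internally.

```r
# Hypothetical helper (not part of dataquieR): split a MISSING_LIST string
# such as "99980 | 99981 | 99982" into a numeric vector of missing codes.
parse_missing_list <- function(missing_list) {
  codes <- strsplit(missing_list, "|", fixed = TRUE)[[1]]
  as.numeric(trimws(codes))
}

# Toy values for illustration only; not taken from the bundled study data.
crp_codes <- parse_missing_list("99980 | 99981 | 99982")
crp_raw   <- c(1.234, 0.870, 99980, 2.105, 99982)

# Replace missing codes by NA so they cannot distort the end digit counts.
crp_clean <- ifelse(crp_raw %in% crp_codes, NA, crp_raw)
print(crp_clean)
```

Without this step, a code such as 99980 would be counted as a measurement ending in 0.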


This example specifies the analysis of end digits for the variable CRP_0 (C-reactive protein):

end_digits <- acc_end_digits(
  resp_vars = "CRP_0",
  label_col = LABEL,
  study_data = sd1,
  meta_data = md1
)

The output is a list containing SummaryTable and SummaryPlot. The SummaryTable lists the response variable and indicates whether its end digits are compatible with a uniform distribution (GRADING = 0) or deviate from it (GRADING = 1). The generic function dataquieR::dq_report() needs this table to summarize the results for all examined variables.

Run end_digits$SummaryTable to see the output.

The second output, SummaryPlot, is a bar chart that indicates significant deviations from the uniform distribution. Call it with end_digits$SummaryPlot.

Interpretation

Any deviation from the assumed uniform distribution of end digits is indicated in red.

Algorithm of the implementation

  1. This implementation is restricted to variables with float data type.
  2. Remove missing codes from resp_vars (if these are defined in the metadata).
  3. Call the function acc_shape_or_scale.
  4. Extract the last decimal digit from resp_vars.
  5. Contrast the empirical versus the assumed uniform distribution in a histogram-like plot and in a summary data frame.
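
The data-handling steps above (removing missing codes, extracting end digits, and contrasting observed with expected counts) can be sketched in base R as follows. This is an illustration under stated assumptions, not the dataquieR implementation: the helper last_digit, the simulated data, and the chi-squared test are all chosen for this example, and the package may assess deviations differently.

```r
# Hypothetical helper: extract the last decimal digit, given the number of
# decimal places recorded in the metadata column DECIMALS.
last_digit <- function(x, decimals) {
  # Scale to a whole number first so that floating-point noise is rounded away.
  as.integer(round(abs(x) * 10^decimals) %% 10)
}

set.seed(1)
# Toy measurements with 3 decimals, plus two missing codes (all invented).
x <- c(round(runif(500, min = 0, max = 10), 3), 99980, 99981)
missing_codes <- c(99980, 99981)

x <- x[!(x %in% missing_codes)]        # step 2: drop missing codes
digits <- last_digit(x, decimals = 3)  # step 4: extract end digits

# Step 5: observed versus expected counts under a uniform distribution.
observed <- table(factor(digits, levels = 0:9))
summary_df <- data.frame(
  end_digit = 0:9,
  observed  = as.integer(observed),
  expected  = length(digits) / 10
)
print(summary_df)

# Illustrative test only; dataquieR may assess deviations differently.
test <- chisq.test(observed, p = rep(0.1, 10))
print(test$p.value)
```

Multiplying by 10^decimals and rounding before taking the remainder avoids misreading digits when a value such as 1.234 is not stored exactly in floating point.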

Limitations

Deviations from a uniform distribution of end digits are only informative if the response variable has a symmetric distribution. If the underlying measurement has a skewed distribution, the end digits will not follow a uniform distribution, even in the absence of any digit preference.
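
A small simulation may make this limitation concrete. The sketch below assumes a strongly right-skewed measurement that is recorded without decimals and covers only a narrow range. The distribution, its parameters, and the chi-squared test are invented for illustration. No rounding preference is simulated, yet the end digits are far from uniform.

```r
set.seed(42)
n <- 5000

# Right-skewed toy measurement recorded without decimals (invented parameters).
skewed <- floor(rexp(n, rate = 1 / 3))

# For integer-valued data, the end digit is the value modulo 10.
skewed_digits <- table(factor(skewed %% 10, levels = 0:9))
print(skewed_digits)

# Chi-squared test against a uniform distribution (illustrative choice of test).
p_skewed <- chisq.test(skewed_digits, p = rep(0.1, 10))$p.value
print(p_skewed)
```

Most simulated values fall between 0 and 9, so low end digits dominate simply because low values dominate. A flagged deviation for such a variable therefore does not by itself point to manual editing.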

References

Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. The American Statistician 70, 296–303.
Tukey, J.W. (1977). Exploratory data analysis (Addison-Wesley).