APPROACH

The following implementation examines a crude form of unit missingness, also called unit-nonresponse (Kalton and Kasprzyk 1986): an observation has unit missingness if all of its measurement variables in the study data are missing.

The function can also be applied to stratified data, in which case strata_vars must be specified.
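The definition above can be illustrated with a minimal sketch on toy data. The data frame and its column names (id, sbp, dbp) are hypothetical and are not part of the dataquieR example data; the sketch only shows the row-wise "all measurements are NA" check, not the package implementation.

```r
# Toy data: one ID column and two hypothetical measurement columns.
toy <- data.frame(
  id  = c("a", "b", "c"),
  sbp = c(120, NA, NA),
  dbp = c(80, NA, 75),
  stringsAsFactors = FALSE
)

# Only the measurement columns count towards unit missingness.
measurements <- toy[, setdiff(names(toy), "id"), drop = FALSE]

# An observation is flagged if none of its measurements is observed.
toy$unit_missing <- as.integer(rowSums(!is.na(measurements)) == 0)

print(toy)
```

Only the second observation is flagged, because the third still has one observed measurement.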

Example of study data

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

This example of study data has N=3000 observations. Study data variables have abstract and non-interpretable names; appropriate labels must be mapped from the metadata.

v00000 v00001 v00002 v00003 v00004 v00005 v01003 v01002 v00103 v00006
3 LEIIX715 0 49 127 77 49 0 40-49 3.8
1 QHNKM456 0 47 114 76 47 0 40-49 1.9
1 HTAOB589 0 50 114 71 50 0 50-59 0.8
5 HNHFV585 0 48 120 65 48 0 40-49 3.8
1 UTDLS949 0 56 119 78 56 0 50-59 4.1
5 YQFGE692 1 47 133 81 47 1 40-49 9.5
1 AVAEH932 0 53 114 78 53 0 50-59 5.0
3 QDOPT378 1 48 116 86 48 1 40-49 9.6
3 BMOAK786 0 44 115 71 44 0 40-49 2.0
5 ZDKNF462 0 50 116 74 50 0 50-59 2.4

Example of metadata

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

Information about the study data is kept in the table of static metadata, which also attaches an interpretable label to each variable. Besides the data type and labels of all variables, further expected characteristics are stored in the metadata.

However, for this implementation the use of variable labels is sufficient.

VAR_NAMES LABEL DATA_TYPE VALUE_LABELS MISSING_LIST JUMP_LIST HARD_LIMITS DETECTION_LIMITS
v00000 CENTER_0 integer 1 = Berlin | 2 = Hamburg | 3 = Leipzig | 4 = Cologne | 5 = Munich NA NA NA NA
v00001 PSEUDO_ID string NA NA NA NA NA
v00002 SEX_0 integer 0 = females | 1 = males NA NA NA NA
v00003 AGE_0 integer NA NA NA [18;Inf) NA
v00103 AGE_GROUP_0 string NA NA NA NA NA
v01003 AGE_1 integer NA NA NA [18;Inf) NA
v01002 SEX_1 integer 0 = females | 1 = males NA NA NA NA
v10000 PART_STUDY integer 0 = no | 1 = yes NA NA NA NA
v00004 SBP_0 float NA 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [80;180] [0;265]
v00005 DBP_0 float NA 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [50;Inf) [0;265]
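The mapping of labels from the metadata onto the abstract study data names might be sketched as follows. This is an illustration on toy data using base R only; the objects sd_toy, md_toy and labelled are hypothetical, and the sketch is not a description of how util_prepare_dataframes works internally.

```r
# Toy study data with abstract names, and the matching metadata rows.
sd_toy <- data.frame(v00000 = c(3L, 1L), v00002 = c(0L, 1L))
md_toy <- data.frame(
  VAR_NAMES = c("v00000", "v00002"),
  LABEL     = c("CENTER_0", "SEX_0"),
  stringsAsFactors = FALSE
)

# Look up each study data column in VAR_NAMES and take its LABEL.
idx <- match(names(sd_toy), md_toy$VAR_NAMES)
labelled <- sd_toy
# Columns without a metadata entry keep their original name.
names(labelled)[!is.na(idx)] <- md_toy$LABEL[idx[!is.na(idx)]]

print(names(labelled))
```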

Required R-packages

This implementation requires the following R-packages:

library(data.table)

R-FUNCTION

The R-function has the following arguments:

  • study_data: mandatory, the data frame that contains the measurements
  • meta_data: mandatory, the data frame that contains static metadata of the study data
  • id_vars: optional, a vector of ID-variables that should not be considered in the calculation of unit-missingness
  • strata_vars: optional, a single string or integer variable used for stratification
  • label_col: optional, the name of the metadata column that contains the labels for all variables in the study data

The implemented R-Code:

com_unit_missingness <-  function(study_data, meta_data, id_vars = NULL,
                                 strata_vars = NULL, label_col) {

  # map study and metadata
  util_prepare_dataframes()

  # correct variable usage
  util_correct_variable_use("id_vars",
    allow_more_than_one = TRUE,
    allow_null          = TRUE
  )

  util_correct_variable_use("strata_vars",
    allow_null = TRUE,
    need_type = "!float"
  )

  if (is.null(id_vars)) {
    util_warning(
      c("No ID-variables specified, all variables are",
        "considered to be measurements."),
      applicability_problem = TRUE
      )
  }

  # initialize result dataframe
  sumdf1 <- ds1

  # compute unit missingness for individuals having
  # NA in all columns (except ID-vars)
  if (!(is.null(id_vars)) || !(is.null(strata_vars))) {
    leave_out <- c()
    if (!(is.null(id_vars))) {
      util_correct_variable_use("id_vars",
        allow_na = TRUE, allow_more_than_one = TRUE,
        allow_null = TRUE, allow_all_obs_na = TRUE, allow_any_obs_na = TRUE
      )
      leave_out <- union(leave_out, id_vars)
    }
    if (!(is.null(strata_vars))) {
      util_correct_variable_use("strata_vars",
        allow_na = FALSE, allow_more_than_one = FALSE,
        allow_null = TRUE, allow_all_obs_na = FALSE, allow_any_obs_na = TRUE
      )
      leave_out <- union(leave_out, strata_vars)
    }
    # drop = FALSE keeps a data frame even if only one column remains
    sumdf1$Unit_missing <- as.integer(apply(
      ds1[, !(names(ds1) %in% leave_out), drop = FALSE], 1,
      function(x) all(is.na(x))))
  } else {
    sumdf1$Unit_missing <- as.integer(apply(ds1, 1, function(x) all(is.na(x))))
  }

  UMR <- c(
    "N" = sum(sumdf1$Unit_missing, na.rm = TRUE),
    "%" = round(sum(sumdf1$Unit_missing, na.rm = TRUE) / dim(sumdf1)[1] * 100,
                digits = 2)
  )

  # summarize for strata_vars
  if (!(is.null(strata_vars))) {
    # && short-circuits, so label_col is only used if it is not NULL
    if (!(is.null(label_col)) && !(is.na(meta_data$VALUE_LABELS[
        meta_data[[label_col]]
          == strata_vars]))) {
      lab_string <- meta_data$VALUE_LABELS[meta_data[[label_col]] ==
                                             strata_vars]
      sumdf1[[strata_vars]] <- util_assign_levlabs(sumdf1[[strata_vars]],
        string_of_levlabs = lab_string,
        splitchar = SPLIT_CHAR,
        assignchar = " = "
      )
    }

    sumdf2 <- as.data.frame.matrix(table(sumdf1[[strata_vars]],
                                         sumdf1$Unit_missing))
    if (!any(sumdf1$Unit_missing, na.rm = TRUE)) {
      sumdf2$N_UNIT_MISSINGS <- 0
    }
    colnames(sumdf2) <- c("N_OBS", "N_UNIT_MISSINGS")
    sumdf2[[strata_vars]] <- rownames(sumdf2)
    rownames(sumdf2) <- NULL
    sumdf2 <- sumdf2[, c(strata_vars, c("N_OBS", "N_UNIT_MISSINGS"))]
    sumdf2$"N_UNIT_MISSINGS_(%)" <- round(sumdf2$N_UNIT_MISSINGS /
                                            sumdf2$N_OBS * 100, digits = 2)
  }

  if (!(is.null(strata_vars))) {
    return(list(FlaggedStudyData = sumdf1, SummaryData = sumdf2))
  } else {
    return(list(FlaggedStudyData = sumdf1, SummaryData = UMR))
  }
}

Implementation and use of thresholds

This implementation does not use a threshold value.

Call of the R-function

my_unit_missings <- com_unit_missingness(study_data  = sd1, 
                                         meta_data   = md1, 
                                         id_vars     = c("CENTER_0", "PSEUDO_ID"), 
                                         strata_vars = "CENTER_0", 
                                         label_col   = "LABEL")

OUTPUT

No stratification

my_unit_missings1 <- com_unit_missingness(study_data  = sd1, 
                                          meta_data   = md1, 
                                          id_vars     = c("CENTER_0", "PSEUDO_ID"),
                                          label_col   = "LABEL")

The function delivers the following objects: FlaggedStudyData, SummaryData

The first object is a data frame of the study data with an added flag that indicates observations without any measurements at all. The second object is a vector of two elements: (1) the number of observations showing unit missingness and (2) the percentage of unit missingness.

In this example of study data, unit missingness is observed in n=60 observations, which equals 2% of the dataset.
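The arithmetic behind the two elements of the summary vector can be reproduced with a short sketch. The flag vector below is constructed artificially from the reported counts (60 of 3000) and is an assumption for illustration only; the names unit_missing and summary_vec are hypothetical.

```r
# Hypothetical flag vector: 60 flagged observations out of 3000,
# matching the counts reported for the example data.
unit_missing <- c(rep(1L, 60), rep(0L, 2940))

# Same construction as the unstratified summary in the function:
# number of flagged observations and their share of all observations.
summary_vec <- c(
  "N" = sum(unit_missing, na.rm = TRUE),
  "%" = round(sum(unit_missing, na.rm = TRUE) / length(unit_missing) * 100,
              digits = 2)
)

print(summary_vec)
```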

Stratification

In multi-center studies, for example, unit missingness can be calculated separately for each level of a discrete stratification variable:

my_unit_missings2 <- com_unit_missingness(study_data  = sd1, 
                                          meta_data   = md1, 
                                          id_vars     = c("CENTER_0", "PSEUDO_ID"), 
                                          strata_vars = "CENTER_0", 
                                          label_col   = "LABEL")

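This call provides an additional summary data frame that indicates unit missingness for each stratum. As implemented, N_OBS holds the number of observations without unit missingness in the stratum (the "0" column of the cross-tabulation), so the percentage is computed relative to that count and not to the stratum total.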

CENTER_0 N_OBS N_UNIT_MISSINGS N_UNIT_MISSINGS_(%)
Berlin 617 15 2.43
Hamburg 581 11 1.89
Leipzig 593 9 1.52
Cologne 564 13 2.30
Munich 585 12 2.05
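The stratified percentages can be reproduced from the counts in this table with a small sketch. The row-level data are reconstructed artificially from the reported counts, and the column pct_of_total is a hypothetical addition for comparison; neither is part of the function's output.

```r
# Counts taken from the summary table; the row-level data are
# reconstructed here purely for illustration.
centers   <- c("Berlin", "Hamburg", "Leipzig", "Cologne", "Munich")
n_present <- c(617, 581, 593, 564, 585)  # observations with flag 0
n_missing <- c(15, 11, 9, 13, 12)        # observations with flag 1

toy <- data.frame(
  CENTER_0 = factor(rep(rep(centers, 2), times = c(n_present, n_missing)),
                    levels = centers),
  Unit_missing = rep(c(0L, 1L), times = c(sum(n_present), sum(n_missing)))
)

# Cross-tabulate stratum by flag; the columns are named "0" and "1"
# before they are renamed.
tab <- as.data.frame.matrix(table(toy$CENTER_0, toy$Unit_missing))
colnames(tab) <- c("N_OBS", "N_UNIT_MISSINGS")

# Percentage as computed in the function: flagged relative to unflagged.
tab[["N_UNIT_MISSINGS_(%)"]] <- round(tab$N_UNIT_MISSINGS / tab$N_OBS * 100,
                                      digits = 2)

# Hypothetical comparison: flagged relative to all observations per stratum.
tab$pct_of_total <- round(tab$N_UNIT_MISSINGS /
                            (tab$N_OBS + tab$N_UNIT_MISSINGS) * 100,
                          digits = 2)

print(tab)
```

The reconstructed data contain 3000 rows in total, of which 60 are flagged, which agrees with the unstratified result.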

INTERPRETATION

This implementation calculates a crude rate of unit-missingness. This type of missingness may have several causes and is an important research outcome. For example, unit-nonresponse may be selective with respect to the targeted study population, or technical reasons such as record-linkage may cause unit-missingness.

It has to be distinguished from segment and item missingness, since the causes and mechanisms behind unit-missingness may differ from those behind the other types.

Concept relations

Kalton, G., and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology 12, 1–16.