Description

A standard tool to detect multivariate outliers is the Mahalanobis distance (Mahalanobis 1936, Filzmoser 2004). It helps to judge how plausible a measurement is given the values of other, related measurements.

In the acc_multivariate_outlier function, the Mahalanobis distance is itself treated as a univariate measure. We apply the same rules for identifying outliers as for univariate outliers:

  • the classical approach from Tukey (1977): values more than \(1.5 * IQR\) below the 1st quartile (\(Q_{25}\)) or above the 3rd quartile (\(Q_{75}\)).
  • the \(3 * SD\) approach, i.e. any measurement of the Mahalanobis distance outside the interval \(\bar{x} \pm 3 * SD\) is considered an outlier (Saleem et al. 2021).
  • the approach from Hubert and Vandervieren (2008) for skewed distributions, which is implemented in the R package robustbase.
  • a completely heuristic approach named \(\sigma\)-gap.

In this way, the acc_multivariate_outlier function is an implementation of the Multivariate outliers indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.

For more details, see the user’s manual, source code, and vignette for univariate outliers.
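The first two rules in the list above can be sketched in base R as follows. This is a minimal illustration on a made-up vector of distances (the name `md` and its values are hypothetical), not the code used inside dataquieR. The Hubert and \(\sigma\)-gap rules are left out because they need additional machinery.

```r
# Toy vector standing in for Mahalanobis distances (hypothetical values):
# twenty unremarkable distances plus one extreme value in position 21.
md <- c(seq(5, 24) / 10, 15)

# Tukey (1977): more than 1.5 * IQR below Q25 or above Q75.
q <- quantile(md, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
tukey_flag <- md < q[1] - 1.5 * iqr | md > q[2] + 1.5 * iqr

# 3SD: outside the interval mean +/- 3 * SD.
sd3_flag <- abs(md - mean(md)) > 3 * sd(md)

which(tukey_flag)
which(sd3_flag)
```

Both rules flag only the extreme value here. On real data the rules often disagree, which is why the function reports each rule separately.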

Usage and arguments

acc_multivariate_outlier(
  variable_group = NULL,
  id_vars = NULL,
  label_col = NULL,
  n_rules = 4,
  max_non_outliers_plot = NULL,
  criteria = NULL,
  study_data = sd1,
  meta_data = md1
)

The function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the study data’s metadata.
  • variable_group: mandatory, the names of the continuous measurement variables that form a group for which calculating multivariate outliers makes sense.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
  • id_vars: optional, an ID variable of the study data. If not specified, row numbers are used.
  • n_rules: optional, a number from one to four indicating the number of rules that must be violated for a value to be classified as an outlier.
  • max_non_outliers_plot: optional, the maximum number of non-outlier points to plot. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic.
  • criteria: optional, a vector with the methods to be used for detecting outliers. Currently implemented methods are tukey, 3SD, hubert and sigmagap.
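To make the role of n_rules concrete, the following base R sketch counts rule violations per observation. The flag values and the helper `is_outlier` are hypothetical. The sketch assumes that an observation is classified as an outlier when at least n_rules rules flag it, which is one reading of the argument description above and may differ from the package internals.

```r
# Hypothetical 0/1 flags for five observations under the four rules.
flags <- data.frame(
  tukey    = c(1, 1, 0, 1, 0),
  `3SD`    = c(1, 0, 0, 1, 0),
  hubert   = c(1, 0, 0, 0, 0),
  sigmagap = c(1, 0, 0, 0, 0),
  check.names = FALSE  # keep the non-syntactic column name "3SD"
)

# Assumption: outlier if at least n_rules rules are violated.
is_outlier <- function(flags, n_rules = 4) {
  unname(rowSums(flags) >= n_rules)
}

is_outlier(flags, n_rules = 4)  # strict: all four rules must agree
is_outlier(flags, n_rules = 2)  # lenient: any two rules suffice
```

Lowering n_rules can only add observations to the set of outliers, never remove any.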

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the acc_multivariate_outlier function, the columns DATA_TYPE and MISSING_LIST in the metadata are relevant:

VAR_NAMES LABEL MISSING_LIST DATA_TYPE
3 v00002 SEX_0 NA integer
4 v00003 AGE_0 NA integer
6 v01003 AGE_1 NA integer
7 v01002 SEX_1 NA integer
15 v00109 ARM_CIRC_DISC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 integer
16 v00010 ARM_CUFF_0 99980 | 99987 integer
19 v00013 EXAM_DT_0 NA datetime
24 v00017 LAB_DT_0 NA datetime
26 v00018 EDUCATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
27 v01018 EDUCATION_1 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
31 v00022 EATING_PREFS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
32 v00023 MEAT_CONS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
33 v00024 SMOKING_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
34 v00025 SMOKE_SHOP_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
38 v00029 PREGNANT_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
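As an illustration of why MISSING_LIST matters, the sketch below parses one such entry and replaces the coded values with NA before any distance is computed. The toy data frame `toy_sd` is hypothetical, and this is a simplified stand-in for what dataquieR does internally, not its actual code.

```r
# Hypothetical toy data; the codes match the ARM_CUFF_0 row above.
toy_sd <- data.frame(ARM_CUFF_0 = c(1, 2, 99980, 1, 99987, 2))
missing_list <- "99980 | 99987"

# Split the "|"-separated string into numeric missing codes.
codes <- as.numeric(trimws(strsplit(missing_list, "|", fixed = TRUE)[[1]]))

# Replace every coded missing value with NA.
toy_sd$ARM_CUFF_0[toy_sd$ARM_CUFF_0 %in% codes] <- NA

toy_sd
```

Without this step, codes such as 99980 would be treated as real measurements and would dominate any distance-based outlier rule.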


This example specifies the analyses of multivariate outliers for three variables:

mult_outlier <- acc_multivariate_outlier(
  variable_group  = c("SBP_0", "DBP_0", "AGE_0"),
  label_col  = "LABEL",
  study_data = sd1,
  meta_data  = md1
)

The summary table contains a single row for the set of variables tested for multivariate outliers. Depending on the number of rules that must be violated (the n_rules argument), the last column, GRADING, takes a value in \(\{0, 1\}\). In this example, only one observation is a multivariate outlier according to all four rules. The summary table can be displayed with mult_outlier$SummaryTable:

Variables              Tukey (N)  3SD (N)  Hubert (N)  Sigma-gap (N)  NUM_acc_ud_outlm  PCT_acc_ud_outlm  GRADING
SBP_0 | DBP_0 | AGE_0  78         32       6           1              1                 0.04              1

In addition to the SummaryTable, an object called FlaggedStudyData is returned. This object can be used to identify the observations that were flagged as multivariate outliers.

The summary plot uses five different colors to indicate how plausible it is that an observation is a multivariate outlier. Observations shown in dark red violate all four outlier rules.

mult_outlier$SummaryPlot

The FlaggedStudyData contains the original data frame with the additional columns tukey, 3SD, Hubert, and SigmaGap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.

The respective data can be accessed using:

mult_outlier$FlaggedStudyData
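A minimal sketch of how such flag columns could be used to exclude observations is shown below. The data frame `flagged` is a hypothetical stand-in for mult_outlier$FlaggedStudyData with made-up values. The flag column names follow the description above and should be checked against the actual output.

```r
# Hypothetical stand-in for mult_outlier$FlaggedStudyData.
flagged <- data.frame(
  SBP_0    = c(120, 135, 250, 118),
  DBP_0    = c(80, 85, 40, 76),
  tukey    = c(0, 1, 1, 0),
  `3SD`    = c(0, 0, 1, 0),
  Hubert   = c(0, 0, 1, 0),
  SigmaGap = c(0, 0, 1, 0),
  check.names = FALSE  # keep the non-syntactic column name "3SD"
)
flag_cols <- c("tukey", "3SD", "Hubert", "SigmaGap")

# Strict cleaning: keep only rows that no rule flagged.
clean <- flagged[rowSums(flagged[, flag_cols]) == 0, ]

# Lenient cleaning: drop only rows flagged by all four rules.
clean_lenient <- flagged[rowSums(flagged[, flag_cols]) < 4, ]

nrow(clean)
nrow(clean_lenient)
```

Which of the two variants is appropriate depends on the analysis. As noted under Interpretation, a flagged value is not necessarily implausible.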

Interpretation

An outlier according to statistical criteria does not necessarily imply an implausible measurement. It is up to the user how outliers are handled. For a more detailed discussion of the methods, see Morgenthaler (2007).

Algorithm of the implementation

  1. Implementation is restricted to variables of type float
  2. Remove missing codes from the study data (if defined in the metadata)
  3. The covariance matrix is estimated for all variables in variable_group
  4. The Mahalanobis distance of each observation is calculated: \(MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\)
  5. The four rules mentioned above are applied to this distance for each observation in the study data
  6. An output data frame is generated that flags each outlier
  7. A parallel coordinate plot indicates respective outliers
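Steps 3 to 5 can be sketched in base R with stats::mahalanobis, which returns the squared distance from the formula in step 4. The toy data and all variable names below are hypothetical, only Tukey's upper fence is applied, and the sketch assumes the rules operate on the squared distance. It illustrates the idea and is not the dataquieR implementation.

```r
set.seed(1)  # make the toy data reproducible

# Hypothetical toy data: two correlated measurements plus one planted
# observation (row 201) that breaks the relationship between them.
n <- 200
x1 <- rnorm(n, mean = 120, sd = 15)
x2 <- 0.5 * x1 + rnorm(n, mean = 20, sd = 5)
dat <- data.frame(x1 = c(x1, 120), x2 = c(x2, 140))

# Step 3: estimate the center and the covariance matrix.
mu <- colMeans(dat)
sigma <- cov(dat)

# Step 4: squared Mahalanobis distance of each observation.
md2 <- mahalanobis(dat, center = mu, cov = sigma)

# Step 5 (one rule only): Tukey's upper fence on the distances.
q <- quantile(md2, probs = c(0.25, 0.75), names = FALSE)
flag <- md2 > q[2] + 1.5 * (q[2] - q[1])

unname(which.max(md2))  # row with the largest distance
sum(flag)               # number of observations beyond the fence
```

Note that the planted observation has an unremarkable x1 value of 120. It stands out only because its x2 value is inconsistent with x1, which is exactly the situation a univariate check on x1 would miss.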

Limitations

This implementation has several limitations because it uses a heuristic approach to classify multivariate outliers. It is based on the Mahalanobis distance (Mahalanobis 1936), which provides a univariate, standardized measure of the distance from the multivariate center of the data. However, we found no recommendations on how to use these distances to classify multivariate outliers. Applying the univariate outlier rules to the Mahalanobis distance has yielded reasonable results. Nevertheless, this approach is not supported by an underlying theory.

Concept relations

Filzmoser, P. (2004). A multivariate outlier detection method.
Hubert, M., and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis 52, 5186–5201.
Mahalanobis, P.C. (1936). On the generalized distance in statistics (National Institute of Science of India).
Morgenthaler, S. (2007). A survey of robust statistics. Statistical Methods and Applications 15, 271–293.
Saleem, S., Aslam, M., and Shaukat, M.R. (2021). A review and empirical comparison of univariate outlier detection methods. Pakistan Journal of Statistics 37.
Tukey, J.W. (1977). Exploratory data analysis (Addison-Wesley).