Description

The acc_varcomp function examines the impact of so-called process variables on the measurement variables through variance-based models and intraclass correlations (ICC). This implementation is model-based. The function can be applied to variables of type float.

Note: The term ICC is more frequently used to describe the agreement between different observers, examiners or even devices. In such settings, good agreement is the goal. ICC values lie in the range $[-1; \: 1]$, and an ICC close to $1$ is desired (Koo and Li 2016, Müller and Büttner 1994).

In multi-level analysis the ICC is interpreted differently; see Snijders and Bosker (1999). In this context, the proportion of variance explained by the respective group levels indicates an influence of at least one level of the respective group_vars.

Irrespective of the terminology used, from a data quality perspective process variables should not systematically explain components of variance. Therefore, values close to $0$ are desired.
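As a point of reference, in a random-intercept model the ICC is commonly defined as the share of the total variance that is attributable to the grouping variable. The notation below ($\sigma^2_{u}$ for the variance between group levels, $\sigma^2_{e}$ for the residual variance) is introduced here for illustration and is not taken from the dataquieR documentation:

$$
\mathrm{ICC} = \frac{\sigma^2_{u}}{\sigma^2_{u} + \sigma^2_{e}}
$$

Under this definition the value lies in $[0; \: 1]$: it equals $0$ when the group levels explain none of the variance and grows as the levels of the process variable (e.g., the observers) differ more strongly from each other.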

acc_varcomp is an implementation of the Unexpected location indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.

For more details, see the user’s manual and source code.

Usage and arguments

acc_varcomp(
  resp_vars = NULL,
  group_vars = NULL,
  co_vars = NULL,
  min_obs_in_subgroup = 30,
  min_subgroups = 5,
  label_col = NULL,
  threshold_value = 0.05,
  study_data = sd1,
  meta_data = md1
)

The function has the following arguments:

study_data: mandatory, the data frame containing the measurements.
meta_data: mandatory, the data frame containing the study data’s metadata.
resp_vars: mandatory, a character specifying the measurement variable of interest. The variable must be of float type.
label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
group_vars: the variable used for grouping (e.g., observer, device, reader). Defaults to NULL for output without grouping.
co_vars: optional, a vector of covariables, e.g. age and sex for adjustment.
min_obs_in_subgroup: optional if group_vars is used. Specifies the minimum number of observations required to include a subgroup (level) of the group_vars in the analysis. Subgroups with fewer observations are excluded. The default is 30.
min_subgroups: optional if group_vars is used. Specifies the minimum number of subgroups (levels) that the group_vars must contain. If the variable defined in group_vars has fewer subgroups, it is not used for analysis. The default is 5.
threshold_value: optional, a numerical value ranging from 0 to 1. If no value is specified, the default value of 0.05 will be used.
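
The interplay of min_obs_in_subgroup and min_subgroups can be illustrated with a small base-R sketch. The helper filter_subgroups and the toy data are hypothetical and written only for this illustration; they are not part of dataquieR, and the package's internal handling may differ in detail.

```r
# Hypothetical helper (not a dataquieR function): drop small subgroups, then
# check whether enough subgroups remain for the grouping variable to be used.
filter_subgroups <- function(data, group_col,
                             min_obs_in_subgroup = 30,
                             min_subgroups = 5) {
  counts <- table(data[[group_col]])
  # Keep only levels with at least min_obs_in_subgroup observations
  keep_levels <- names(counts)[counts >= min_obs_in_subgroup]
  # Too few remaining levels: the grouping variable is not analysed
  if (length(keep_levels) < min_subgroups) {
    return(NULL)
  }
  data[as.character(data[[group_col]]) %in% keep_levels, , drop = FALSE]
}

# Toy data: observer "D" contributes only 5 observations
toy <- data.frame(
  observer = rep(c("A", "B", "C", "D"), times = c(40, 35, 30, 5)),
  value = seq_len(110)
)

filtered <- filter_subgroups(toy, "observer",
                             min_obs_in_subgroup = 30,
                             min_subgroups = 3)
print(table(filtered$observer))
```

With min_subgroups = 4 the same call would return NULL, because only three observers remain after the small subgroup has been excluded.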

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

Similar to the approach of the acc_margins function, we assume that at least one examiner does not adhere to the SOP and may influence the measurement process:

v00000	v00001	v00002	v00003	v00004	v00005	v01003	v01002	v00103	v00006
3	LEIIX715	0	49	127	77	49	0	40-49	3.8
1	QHNKM456	0	47	114	76	47	0	40-49	1.9
1	HTAOB589	0	50	114	71	50	0	50-59	0.8
5	HNHFV585	0	48	120	65	48	0	40-49	3.8
1	UTDLS949	0	56	119	78	56	0	50-59	4.1
5	YQFGE692	1	47	133	81	47	1	40-49	9.5
1	AVAEH932	0	53	114	78	53	0	50-59	5.0
3	QDOPT378	1	48	116	86	48	1	40-49	9.6
3	BMOAK786	0	44	115	71	44	0	40-49	2.0
5	ZDKNF462	0	50	116	74	50	0	50-59	2.4

For the acc_varcomp function, the columns DATA_TYPE, MISSING_LIST and HARD_LIMITS in the metadata are relevant:

	VAR_NAMES	LABEL	MISSING_LIST	DATA_TYPE	HARD_LIMITS
9	v00004	SBP_0	99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995	float	[80;180]
10	v00005	DBP_0	99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995	float	[50;Inf)
11	v00006	GLOBAL_HEALTH_VAS_0	99980 | 99983 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995	float	[0;10]
14	v00009	ARM_CIRC_0	99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995	float	[0;Inf)
21	v00014	CRP_0	99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995	float	[0;Inf)
22	v00015	BSG_0	99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995	float	[0;100]

Here, the function is applied to examine the agreement between observers (USR_BP_0) for the systolic and diastolic blood pressure variables (SBP_0 and DBP_0, respectively):

varcomp_1 <- acc_varcomp(resp_vars = c("SBP_0", "DBP_0"),
                group_vars = c("USR_BP_0"),
                co_vars = c("AGE_0", "SEX_0"),
                label_col = "LABEL",
                min_obs_in_subgroup = 20,
                min_subgroups = 3,
                study_data = sd1,
                meta_data = md1)

## Did not find any 'SCALE_LEVEL' column in item-level meta_data. Predicting it from the data -- please verify these predictions, they may be wrong and lead to functions claiming not to be reasonably applicable to a variable.

## using the same group var "USR_BP_0" for all resp_vars

names(varcomp_1)

## [1] "SummaryTable"           "SummaryData"            "ScalarValue_max_icc"   
## [4] "ScalarValue_argmax_icc"

Output: Summary table

The summary data frame is accessed via varcomp_1$SummaryTable:

Variables	Object	Model.Call	ICC_acc_ud_loc	Class.Number	Mean.Class.Size	Median.Class.Size	Min.Class.Size	Max.Class.Size	convergence.problem	GRADING
SBP_0	USR_BP_0	SBP_0 ~ AGE_0 + SEX_0 + (1 | USR_BP_0)	0.153	15	165.8	160	29	413	FALSE	1
DBP_0	USR_BP_0	DBP_0 ~ AGE_0 + SEX_0 + (1 | USR_BP_0)	0.172	15	165.0	162	28	413	FALSE	1

In addition to this table, two scalar values are returned (“ScalarValue_max_icc”, “ScalarValue_argmax_icc”), which represent the highest ICC/VC proportion and the response variable with the highest ICC/VC, respectively.
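
How the scalar values and the GRADING column relate to the ICC column can be sketched in base R. The data frame below re-enters the two ICC values from the summary table by hand. The rule "flag when the ICC exceeds threshold_value" is an assumption made for this illustration; it is consistent with the table (both ICCs exceed the default of 0.05 and both rows have GRADING 1), but the source does not state it.

```r
# Values copied by hand from the summary table above
summary_table <- data.frame(
  Variables = c("SBP_0", "DBP_0"),
  ICC_acc_ud_loc = c(0.153, 0.172)
)

threshold_value <- 0.05  # default given in the argument list

# Assumed grading rule: 1 if the ICC exceeds the threshold, 0 otherwise
summary_table$flagged <- as.integer(summary_table$ICC_acc_ud_loc > threshold_value)

# Highest ICC and the response variable attaining it
max_icc <- max(summary_table$ICC_acc_ud_loc)
argmax_icc <- as.character(
  summary_table$Variables[which.max(summary_table$ICC_acc_ud_loc)]
)

print(summary_table)
print(max_icc)
print(argmax_icc)
```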

Interpretation

The ICC, or the analysis of variance components, should be applied in combination with margins (see acc_margins). Extended tests showed that the ICC is less susceptible than margins to false-positive indications of data quality issues.

Algorithm of the implementation

Missing codes are removed from resp_vars (if defined in the metadata).
Deviations from limits, as defined in the metadata, are removed.
A linear mixed-effects model is estimated for resp_vars using co_vars and group_vars for adjustment.
An output data frame is generated for group_vars indicating the ICC.
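
The steps above can be sketched in base R on simulated data. This is a simplified illustration under stated assumptions, not the package's implementation: all data, missing codes and limits are made up here, covariate adjustment is omitted, and the linear mixed-effects model of step 3 is replaced by a one-way ANOVA (method-of-moments) estimate of the variance components so that the sketch runs without additional packages.

```r
# Simulated data; all names and values are illustrative assumptions
set.seed(42)
k <- 10                                   # number of observers
n <- 50                                   # observations per observer
observer <- factor(rep(seq_len(k), each = n))
observer_effect <- rnorm(k, mean = 0, sd = 2)
y <- 120 + observer_effect[as.integer(observer)] + rnorm(k * n, sd = 4)

# Inject two missing codes and one value outside the hard limits
y[c(3, 77)] <- 99980
y[150] <- 250
missing_codes <- c(99980, 99981)
hard_limits <- c(80, 180)

# Step 1: replace missing codes with NA
y[y %in% missing_codes] <- NA
# Step 2: remove deviations from the limits
y[!is.na(y) & (y < hard_limits[1] | y > hard_limits[2])] <- NA

keep <- !is.na(y)
y <- y[keep]
observer <- droplevels(observer[keep])

# Step 3 (swapped-in technique): one-way ANOVA instead of a mixed model
fit <- anova(lm(y ~ observer))
ms_between <- fit["observer", "Mean Sq"]
ms_within <- fit["Residuals", "Mean Sq"]

# Effective group size for unbalanced designs
n_i <- as.numeric(table(observer))
n0 <- (sum(n_i) - sum(n_i^2) / sum(n_i)) / (length(n_i) - 1)

# Between-observer variance, truncated at zero
var_between <- max((ms_between - ms_within) / n0, 0)
icc <- var_between / (var_between + ms_within)

# Step 4: output data frame for the grouping variable
result <- data.frame(Object = "observer",
                     ICC = round(icc, 3),
                     Class.Number = length(n_i))
print(result)
```

A mixed model with a random intercept per observer, as indicated by the Model.Call column in the summary table, additionally allows adjustment for covariables such as age and sex, which this ANOVA-based sketch does not.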

Limitations

A sufficient number of observations is required within each level of the group_vars. The minimum can be set with the argument min_obs_in_subgroup. Nevertheless, the estimation of the linear mixed-effects model may fail to converge when the numbers of observations are imbalanced or low.

Concept relations

Data quality Indicator Unexpected location

Koo, T.K., and Li, M.Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine 15, 155–163.

Müller, R., and Büttner, P. (1994). A critical discussion of intraclass correlation coefficients. Statistics in Medicine 13, 2465–2476.

Snijders, T., and Bosker, R. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. (Sage Publications).

R implementation of variance based models and ICC