The data quality implementations rely on the metadata (as defined in the tutorial). Updates and extensions of the metadata concept are a work in progress.
Here, we list all existing implementations of the project, with links to their respective documentation. Additional examples, alternative implementations, and guidelines for contributing code are available in the tutorials (tutorials2.html).
Any extensive data quality report requires not only study data, such as clinical measurements, but also metadata. Metadata are attributes that describe, among other things, expectations about the study data. Such expectations can be quite diverse, ranging from the number of expected observations in a data set to properties of single variables such as data type or inadmissible values. Checking observed data properties against formalized expectations is the basis of most data quality indicators.
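The idea of checking observed data against formalized expectations can be illustrated with a minimal Python sketch. This is not dataquieR code: the variable names (sbp, sex), the metadata keys (data_type, min, max, admissible), the limits, and the helper functions are all hypothetical and only show the principle under these assumptions.

```python
# Illustrative sketch only: all names and limits are hypothetical, not dataquieR's API.

# Study data: one dict per observation (e.g., clinical measurements).
study_data = [
    {"id": 1, "sbp": 120, "sex": "f"},
    {"id": 2, "sbp": 400, "sex": "m"},    # value above the expected maximum
    {"id": 3, "sbp": "n/a", "sex": "x"},  # wrong data type, inadmissible category
]

# Metadata: formalized expectations for each variable.
item_metadata = {
    "sbp": {"data_type": int, "min": 60, "max": 300},
    "sex": {"data_type": str, "admissible": {"f", "m"}},
}

EXPECTED_ROWS = 3  # expectation about the data set as a whole


def check_value(value, expectation):
    """Return a list of issue labels for one observed value."""
    if not isinstance(value, expectation["data_type"]):
        # Range and category checks are meaningless for a wrongly typed value.
        return ["data_type_mismatch"]
    issues = []
    if "min" in expectation and value < expectation["min"]:
        issues.append("below_min")
    if "max" in expectation and value > expectation["max"]:
        issues.append("above_max")
    if "admissible" in expectation and value not in expectation["admissible"]:
        issues.append("inadmissible_value")
    return issues


def check_study_data(rows, metadata, expected_rows):
    """Compare observed data with expectations and collect all findings."""
    findings = []
    if len(rows) != expected_rows:
        findings.append(("<data set>", None, "unexpected_row_count"))
    for row in rows:
        for variable, expectation in metadata.items():
            for issue in check_value(row.get(variable), expectation):
                findings.append((variable, row["id"], issue))
    return findings


findings = check_study_data(study_data, item_metadata, EXPECTED_ROWS)
for finding in findings:
    print(finding)
```

Keeping the expectations in a separate structure, rather than hard-coding them into the checks, is what allows the same checking routine to be reused across variables and studies.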
To be easily usable, such information must be organized in a structured form. For dataquieR, a spreadsheet-type structure with several tables (as briefly described below) is necessary. In addition to expectations, these tables also contain descriptions of the objects of interest, such as variable names and variable and value labels, as well as information to control the generation of output in the reports, such as the role or order of variables in a report.
The metadata tables listed below may be relevant. Among these, the item-level metadata table is the most essential.
Item-level metadata refer to descriptions and expectations about single data elements (variables/items), e.g., columns in the study data table.
The setup of item-level metadata is described in the Tutorial section.
Cross-item level metadata contains descriptions and expectations about the joint use of two or more data elements for the purpose of data quality assessments. A distinct table is necessary as there is a 1:n relationship of potential assessments to any single data element.
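Why cross-item expectations need their own table can be seen in a small Python sketch. Everything here is hypothetical (check IDs, variable names, rules, and thresholds) and does not reflect dataquieR's actual metadata format. It only shows that one variable can take part in several checks, so the checks cannot be stored as a single attribute of that variable.

```python
# Illustrative sketch only: check IDs, variables, and rules are hypothetical.

# Cross-item metadata: one row per check, each referring to two or more variables.
cross_item_metadata = [
    {"check_id": "C1", "variables": ("sex", "pregnant"),
     "rule": lambda r: not (r["sex"] == "m" and r["pregnant"] == "yes")},
    {"check_id": "C2", "variables": ("sbp", "dbp"),
     "rule": lambda r: r["sbp"] > r["dbp"]},
    {"check_id": "C3", "variables": ("age", "pregnant"),
     "rule": lambda r: not (r["pregnant"] == "yes" and r["age"] > 60)},
]

study_data = [
    {"id": 1, "sex": "f", "pregnant": "yes", "age": 30, "sbp": 120, "dbp": 80},
    {"id": 2, "sex": "m", "pregnant": "yes", "age": 45, "sbp": 130, "dbp": 85},
    {"id": 3, "sex": "f", "pregnant": "no", "age": 70, "sbp": 70, "dbp": 90},
]


def checks_by_variable(metadata):
    """Map each variable to the IDs of all checks it takes part in."""
    mapping = {}
    for check in metadata:
        for variable in check["variables"]:
            mapping.setdefault(variable, []).append(check["check_id"])
    return mapping


def run_checks(rows, metadata):
    """Return (check_id, row id) for every observation that violates a rule."""
    return [
        (check["check_id"], row["id"])
        for row in rows
        for check in metadata
        if not check["rule"](row)
    ]


usage = checks_by_variable(cross_item_metadata)
violations = run_checks(study_data, cross_item_metadata)
print(usage["pregnant"])  # this variable is used by more than one check
print(violations)
```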
The setup of cross-item level metadata will shortly be available in the Tutorial section.
Dataframe-level metadata refers to descriptions and expectations about the provided data frames.
The setup of dataframe-level metadata will shortly be available in the Tutorial section.
Segment-level metadata refers to descriptions and expectations about the provided segments (e.g., different examinations of a study).
The setup of segment-level metadata will shortly be available in the Tutorial section.
Below is a list of all dataquieR functions that can be used to trigger single aspects of a data quality assessment. Their use is recommended for rather specific applications. For standard reports, it may be more convenient to use the dq_report function.
All functions in dataquieR are linked to the underlying data quality concept as described in the table below.
The indicator functions are supported by 187 support functions. The main task of these functions is to ensure stable operation of dataquieR in the presence of potentially deficient data. This requires extensive data preprocessing steps.
In Stata, the package dqrep can be used for data quality analyses. It can be installed using the following command syntax:
net from https://packages.qihs.uni-greifswald.de/repository/stata/dqrep
net install dqrep, replace
Note: In case of issues when installing dqrep with the net command, please download this package and extract the files locally. Afterwards, they can be installed with the net command using the local folder name.
dqrep stands for “Data Quality REPorter”. This wrapper command triggers an analysis pipeline to generate data quality assessments. Assessments range from simple descriptive variable overviews to full-scale data quality reports that cover missing data, extreme values, value distributions, observer and device effects, or the time course of measurements. Reports are provided as .pdf or .docx files, which are accompanied by a data set of assessment results. Reports are highly customizable and visualize the severity and number of data quality issues. In addition, there are options for benchmarking results between examinations and studies.
There are two fundamentally different ways to run dqrep:
First, dqrep can be used to assess variables of the active dataset. While most functionalities are available, checks that depend on information that varies at the variable level (e.g., range violations) cannot be performed. Any variable used in a certain role (e.g., observervars, keyvars) must also be listed in varlist.
Second, dqrep can be used to perform checks of variables across a number of datasets that are specified in the targetfiles option. In addition, a metadata file that holds information on variables and checks can be specified using the metadatafile option. This allows for a more flexible application to variables in distinct data sets, making use of all implemented dqrep functionalities.
For more details on using dqrep, see this help file.
A Web Application for Data Monitoring in Epidemiological and Clinical Studies
Square\(^2\) is a web application in which all study data and metadata are stored in databases. The application targets a different user type, with low technical requirements on the user side. Square\(^2\) manages user rights and roles to enable assessments without direct access to the underlying study data, and direct access to study data may be prohibited altogether. Reporting is only possible for assigned subsets of the study data. From a data protection perspective, this is a major advantage for complex studies with many collaborators. All routines developed in this project are integrated, and Square\(^2\) can easily be extended by similar packages that follow dataquieR’s code and metadata format conventions.
Square\(^2\) will be made available under the AGPL-3.0.
The current version comes as a docker-stack (docker-compose.yml and images on request), which will be available from GitLab.com and Docker Hub.