V030

01-2020

Carsten Oliver Schmidt

This document provides an overview for users of Qualityreporter, a tool for automated data quality assessments based on Stata.

Qualityreporter overview

Qualityreporter provides an automatic assessment of data quality based on the statistics package Stata. It flexibly generates standard reports to cover key dimensions of data quality such as missing data, extreme values, value distributions, observer and device effects or the time course of measurements. The suite is triggered via a single Stata function call.

Installation

All necessary Stata ado files, named dq_[xx].ado, should be copied into the Stata ado folder, preferably into the personal subfolder. Stata 15 or later is required to generate reports; the pipeline will not work in earlier Stata versions.

Several user-written Stata functions must be installed before reports can be generated. These are:

  • catplot

  • tabplot

  • vioplot

  • sg100

  • colorpalette in package gr0075

In addition, an Excel file with settings, named dq_control.xlsx, must be stored in this directory. This file contains the output texts and the data quality indicator decision rules.

Calling the report functions

Data quality reports are generated by calling “dq_reportcall”. Alternatively, the abbreviated name “dqrep” may be used.

The minimally sufficient call contains only the target file, after a working folder has been specified through the “cd” command. Example: dq_reportcall, filename("filename"). This generates a brief table report on all variables in the data set.

A powerful call requires many decisions to be made, including:

  • Selection of the variables to report upon. This step requires assigning variables to standard variable categories to enable a differentiated reporting process. This is described in Section 2 on variable types.

  • In which folders is the information stored to enable reporting? This refers to:

    • Study data files (at least one Stata .dta file, this is the only mandatory file)

    • Metadata files (Excel)

    • Interpretation text files (Excel)

  • Where should newly generated information be stored? An appropriate location is important due to the potentially large number of result files. Different locations for different reports are recommended. If nothing is specified, default settings will be used in the current working directory.

  • Which output format should be chosen? This refers to:

    • technical formats, currently pdf vs docx

    • the structure of the report content as defined through report templates

    • naming of output files

  • Is the report to be generated on study data or on results of data quality reports? Reports may be generated either on original study data or on the results of previous data quality reports. The latter is of interest for comparing results across reports.

dq_reportcall may be addressed in two ways:

  1. Like a normal Stata command, specifying all required parameters. A full call is specified in 9.1., using locals as placeholders for the function parameters.

  2. Based on previously specified global parameters; the blank dq_reportcall without options starts the pipeline as well. An example call is described in 9.2. This approach allows for easier handling of parameters. CAUTION: When using this option, the macro space must be cleared beforehand using macro drop _all. Otherwise unpredictable behavior may occur.

No other function from the pipeline needs to be addressed by the user to create reports. Communication between them is handled in the background.
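The two calling styles described above can be sketched as follows. This is an illustrative sketch only: the working directory, file name, and variable names (mystudy, sbp, dbp) are invented, and the exact option syntax is assumed from the parameter list in the next section.

```stata
* Style 1: regular command call with options
cd "C:/myproject"
dq_reportcall, filename("mystudy") keyvars(sbp dbp) reportformat(pdf)

* Style 2: call based on global parameters;
* clear the macro space first to avoid unpredictable behavior
macro drop _all
global filename     "mystudy"
global keyvars      "sbp dbp"
global reportformat "pdf"
dq_reportcall
```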

Overview on input parameters for dq_reportcall command

The following table describes all parameters that can be passed to the wrapper function dq_reportcall to control analyses and related output:

DATA SOURCE AND FOLDERS
sd Source directory with the data files to be analysed. The modified data set is also stored here in a subfolder as the result directory may not be suitable because of data protection issues.
hd Name of the directory containing the Excel help files to enable report generation.
ld Name of the directory for log files. This is created automatically in the results folder.
rd Result directory for any reports in pdf or docx and Stata dta files containing all results.
gd Result directory for graphical output. Graphical output is stored in an own folder because of the large number of related files.
filename Contains the name(s) of all data files to be analysed (without the .dta suffix). If more than one name is specified, the files are merged based on the provided ID variables. Filenames must not contain blanks.
metadatafile Contains the name of the corresponding metadata file that provides additional information for improved data quality analyses. The expected format is Excel (xlsx). Please specify only the stub name.
VARIABLES
keyvars List of all primary variables for data quality assessments. For these, the most extensive computations take place. Commonly each variable of this type receives a dedicated output page with graphs.
minorvars List of secondary variables for which a briefer scope of data quality assessments should take place. Commonly each variable of this type receives only a table overview.
processvars These are variables that are predominantly related to process aspects of the examination such as examination times or ambient conditions. Typically they play no role as outcome variables.
controlvars Control variables are variables used to control in regression analyses, e.g. related to the estimation of cluster effects.
observervars Cluster variables defined by observers.
devicevars Cluster variables defined by devices.
centervars Cluster variables defined by centers.
timevars Time variables.
idvars ID variables which may be used to merge datasets or to check for duplicates.
casemissvars List of variables which indicate variables to define unit/segment missingness. The variables need to follow a hierarchical order with the first variable defining the first, and the following variables defining subsequent selection processes
casemisstype Optionally provides definitions (labels) for the `casemissvars'. There should be as many definitions as variables, and each definition must be a single word. This information should be provided to ensure a clear meaning of the respective variable.
casemisslogic Specifies the logic to identify available observations. Any logic must be provided as a single term without blanks.
REPORT FORMATTING
reportname Defines the name of the report to store results. This name should be short and concise without blanks. If it contains blanks, they will be replaced by “_”
reporttitle Defines the title of the report to be displayed in output documents.
reportsubtitle Defines the subtitle of the report to be displayed in output documents.
reportformat To select the format of report, either pdf or docx
reporttemplate Selects the template of reports, see 7.1. This defines the selection of analyses for variable type based on analyses matrices.
authors The authors of the report which appear below the reportname
replacereport Flag to replace an existing report: 0=no replacement, 1=always replace, 2=replace only pdf

maxvarlabellength The maximum length of a variable label. Abbreviated variable labels are forced to meaningfully display content.
view_interpretation Enter an empty interpretation part in the report. This will only be realized in docx files. (0=no, 1=yes (default))
view_integrity Display information on the integrity of the variables regarding existence or variable type. (0=no, 1=yes (default))
histkat Number of categories up to which a display takes place as a bar chart. (Default is 15)
varlinebreak Whether or not a page break occurs after each single variable table. (0=no, 1=yes (default))
sectionlinebreak Whether or not a page break occurs after each summary table and report section. (0=no, 1=yes (default))
clustercolorpalettes Specify a list of color palettes to be assigned to clusters. The first palette is assigned to the first cluster, the second to the second, and so on. The current default palettes are "s1 economist s2 burd s1r s2 plottig". When specifying only one color, the intensity is graded according to the number of clusters. If m palettes are specified for n clusters and n>m, the last palette is assigned to clusters m+1..n.
decimals Number of decimals to be displayed in output tables (default=2)
language Report language (d=deutsch; e=english; p=portuguese)
ANALYSIS SETTINGS
forcecalc Force new calculations instead of taking existing results. (-1= skip any calculations, generate a report from existing results; 0=take existing results and add new results, 1=calculate everything newly (default))
subgroup Defines subgroups on which analyses are conducted. A list of subgroups may be defined. The definition of each subgroup may depend on as many variables as needed, as long as the logic is provided in a Stata-readable format. All observations not belonging to the subgroup are deleted. Entering nothing or “all” leads to analyses in the entire group. If the subgroup specification is wrong or leads to too few cases, no result report is generated.
shipmissrecode Recode missings according to SHIP standard settings (default 0=no, otherwise 1=yes)
jumpto0recode Recode allowed jumps to 0 (default 0=no). Should only be used if the meaning of an allowed jump is 0 = no event
itemmisslist A list of numerical values to be treated as missing values
itemjumplist A list of numerical values to be treated as permitted jumps
extremesuppress Force exclusion of extreme cases, at present with n times the standard deviation +/- mean for lowess and reliability calculations. Enter the number of standard deviations. Default is n=3.
binaryrecode Default recoding of a categorical variable with more than 2 values into binary variable to allow meaningful lowess/icc computations, default is yes (=1)/ vs 0=no
binaryrecodelimit Number of categories up to which a recoding should take place. Default is n=8 categories.
minreportn Minimum case number to generate a report. Default is n=30.
minclustersize_icc Minimum cluster size to compute ICC values, default is 10
minclustersize_lowess Minimum number of cases to compute lowess graphs, default is n=40. Very low numbers may result in unstable results.
minevent_lowess Minimum number of events for computations in Lowess for binary outcomes, default is n=2.
problemvarreport(#) Produce an additional report containing an in-depth analysis of all variables assigned to issue category n=# or higher. Default is 0: no additional report is created.
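To illustrate how these parameters combine, the following sketch shows a more extensive call. All folder paths, file names, and variable names (exam_core, sbp, dbp, bmi, smoking, observer, exam_date, zz_nr) are hypothetical; the option names follow the table above.

```stata
* Hypothetical folder layout and variable names
dq_reportcall, filename("exam_core")                      ///
    sd("C:/dq/data") hd("C:/dq/help") rd("C:/dq/results") ///
    metadatafile("exam_core_meta")                        ///
    keyvars(sbp dbp bmi) minorvars(smoking)               ///
    observervars(observer) timevars(exam_date)            ///
    idvars(zz_nr)                                         ///
    reportname("exam_core_dq") reportformat(pdf)          ///
    reporttemplate(standard) decimals(2) language(e)
```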

Cautionary remarks

Stata has some flaws in creating pdf and Word documents. After many executions, the creation of new reports may fail even with correct function calls. In this case, when receiving unclear error messages, close and restart Stata.

It is recommended to call fewer than 100 variables if single variable outputs are demanded. Otherwise error messages may occur.

At the end of the program the entire macro space is cleared to avoid difficulties with the execution of subsequent commands.

If for some reason a report production fails, it is recommended to issue the command macro drop _all to avoid interference with earlier results in the macro space.

Text handling – Static and flexible text content

Qualityreporter allows for different options to handle texts, including a multilingual approach. Default output text blocks are handled via Excel sheets (Section 1.5.1). Text content may also be dynamic. One option is to specify via Excel a structured interpretation part (Section 1.5.2).

Management of default output texts

The Excel file “dq_control.xlsx” contains a sheet “Texts” with information on output texts and may host as many languages as necessary. These text snippets may contain anything from single words up to entire sentences.

There are two types of texts. First, static texts, which are parsed into the text as is. Second, flexible parts, which are replaced by the content of either scalars or globals. This solution works as long as the correct reference to macros is safeguarded:

  • To address scalars use the scalar name surrounded by //, e.g. //varname//

  • To address globals use the global name surrounded by %% e.g. %%varname%%

Variable text components may appear in any combination. For example a global may be called from within a global.
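As an illustration of the placeholder conventions, a hypothetical “Texts” entry could combine static text with a scalar and two global placeholders. The names varlabel, n_miss, and pct_miss are invented for this sketch and are not predefined by the tool:

```
The variable //varlabel// shows %%n_miss%% missing values (%%pct_miss%% percent).
```

Here //varlabel// would be replaced by the content of a scalar named varlabel, and %%n_miss%% and %%pct_miss%% by the contents of the corresponding globals.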

Creation of individualized text blocks

To manage report specific text blocks for the interpretation section of a report, the Excel format is used as well. Related text information is by default stored in the same folder as the metadata. If the path is provided with the filename, another location may be used.

The Excel file contains the following columns:

Text type

Header To request a header formatting. This is always followed by a line break.

Text[n] To request a formatting of normal text at level n

Linebreak If a line break is requested to separate contents

Only the name of the type is to be specified
Text content

Any variable text content. This text may contain static as well as variable content. Variable content is created addressing scalars and globals as mentioned in 1.5.1. This text is applied if the logic is true.

It is recommended to end each text block with a blank.

Logics

Optionally Stata readable logics. The logics will be used as follows:

  • If no logic is provided always display content

  • If an erroneous logic is provided suppress content and output an integrity error warning

  • If a usable logic is provided output the text content if the logic is true.

Cautionary remark:

If a logic compares a scalar using “>”, an additional condition containing “[scalar] <.” must be requested as well. Otherwise the text content will be displayed if the scalar is missing, because a missing scalar is treated as “.”, which is larger than any number.
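Following this remark, a minimal sketch of a logic that fires only for non-missing values might look as follows; the scalar name miss_pct is hypothetical:

```
miss_pct > 10 & miss_pct <.
```

Without the second term, the text would also be displayed when miss_pct is missing, since “.” is larger than any number in Stata.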

False text content Any variable text content. This text may contain static as well as variable content. Variable content is created addressing scalars and globals as mentioned in 1.5.1. This text is applied if the logic is false.
Invalid text content Any variable text content. This text may contain static as well as variable content. Variable content is created addressing scalars and globals as mentioned in 1.5.1. This text is applied if the logic is invalid. It is of less frequent use and may predominantly be used to issue warnings in case of logics failures.

The logic is used to decide on the display of text within each line.

There is also a logic to decide on the display of content across lines based on a hierarchy. Each text block is assigned a level, ranging from 1 to n. If the logic at the higher level is true, or if the higher level was displayed without a logic, the content of the lower level may be displayed. If the logic at the higher level is false or invalid, the content at the lower level is not displayed, regardless of the logic at that level. Please note:

  • A level must be assigned to each text

  • A step of exactly 1 is needed between levels (place Text2 below a Text1 block, but not Text3)

Of general importance:

Only assign in one line macros which are to be displayed under exactly the same conditions (e.g. N and % of all variables affected by an issue but NOT: N of all and N of primary variables affected by an issue). If not, empty content may result.

Variable types

Variables are differentiated into categories to facilitate their use during data quality assessments. This influences the selection of analyses based on analysis matrices as well as the display of output.

Variable overview

The following types of variables are used:

Primary variables
(keyvars, kv)
are the most important variables for data quality checks. They receive the most detailed coverage, with a single page dedicated to each variable, focussing on graphical displays. It is useful to include variables frequently used in scientific analyses. Which variables to treat as primary variables depends on the scientific interest.
Secondary variables (minorvars, mv) are less important variables for data quality checking. They receive a more superficial coverage in tables. The idea is to provide an overview without spending much additional space in the report. Yet, data quality related insights may be systematically missed if they cannot be assessed with data quality indicators.
Process variables (processvars, pv) …are variables related to the process of the assessment, such as examination times or ambient conditions. They commonly receive the same treatment as primary variables.
Control variables (controlvars, cv) …are variables to adjust for in regression analyses, for example when calculating ICCs. Otherwise, they commonly receive the same treatment as secondary variables.
Observervars …are cluster variables defined by observers. They are used to compute observer related graphics and statistics such as the ICC.
Devicevars …are cluster variables defined by devices. They are used to compute device related graphics and statistics such as the ICC.
Centervars …are cluster variables defined by centers. They are used to compute center related graphics and statistics such as the ICC.
Timevars …should name the datetime variable, ideally identifying the beginning of an examination.
idvars …are ID variables which may be used to merge datasets and to check for duplicates.
casemissvars …are variables which indicate missing units at different levels of the study.

Selecting variables

To create meaningful reports an adequate selection of variables is indispensable. This is done according to the above defined variable categories:

keyvars, minorvars, processvars, controlvars, observervars, devicevars, centervars, timevars, idvars, segmentmissvars, and unitmissvars.

Some things need to be kept in mind:

  • NOT mentioning `keyvars' leads to ALL variables being used as key variables unless they have been assigned to another variable category. The logic behind this is that not specifying at least one variable of primary interest would result in a useless report.

  • Stata 15 cannot create more than 500 tables. This limits the number of variables within one report when producing single variable output. Normally, three tables are produced per variable, plus additional overview tables, resulting in a recommended variable list of 100 or fewer. This limit is irrelevant for the overview reports.

  • Variables should be numeric. However, if a string variable is encountered, an attempt is made to convert it to numeric and the converted variable is used as such if the attempt is successful.

  • One variable is only to be named once in one category.

  • There are automated checks for duplicate entries. Some categories are “weaker” than others to avoid conflicts if a variable is mentioned in more than one category: the variable is then deleted from the weaker category. The “weakness hierarchy” (strongest first) in case of double mentioning of a variable is

    1. `idvars'

    2. `unitmissvars’

    3. `segmentmissvars’

    4. `timevars'

    5. `observervars'

    6. `devicevars'

    7. `centervars'

    8. `processvars’

    9. `keyvars’

    10. `controlvars'

    11. `minorvars’

Example: The variable “observer” has been included in the variable list `observervars' and `keyvars’. Because `keyvars’ is weaker than `observervars' it is deleted from the `keyvars’ list.

Variables to address unit and segment missingness

There is a dedicated group of variables, expressed by the “casemiss” parameters, to define unit and/or segment missingness. When specifying these indicator variables, the following conventions should be safeguarded:

  • A list of n indicator variables may be provided with casemissvars, e.g. casemissvars (studyparticipation interviewparticipation), where, in this example, studyparticipation refers to the variable indicating participation in a study and interviewparticipation refers to participation in an interview.

  • Missing indicator variables need to follow a hierarchical order with the first variable defining the first, and the following variables defining subsequent selection processes. There should be no available cases in interviewparticipation if they are unavailable at the previous level studyparticipation. The order in the parameter call must be accordingly.

  • The following information should be provided for each indicator variables:

    • Using the parameter “casemisstype”, an output-ready name should be provided for each indicator variable, using one word per variable. There should be as many definitions as variables. This information should be provided to ensure a clear meaning of the respective variable, e.g., in the example above: casemisstype(Studyparticipation Interviewparticipation)

    • Using the parameter casemisslogic, the logic to identify available observations is provided. Any logic must be provided as a single term without blanks.
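Putting the three casemiss parameters together, a hedged sketch of a call could look as follows. The variable names and the assumption that participation is coded 1 = available are invented for this example:

```stata
* Hypothetical indicators: studyparticipation precedes interviewparticipation
dq_reportcall, filename("mystudy")                           ///
    casemissvars(studyparticipation interviewparticipation)  ///
    casemisstype(Studyparticipation Interviewparticipation)  ///
    casemisslogic(studyparticipation==1 interviewparticipation==1)
```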

Some precautions need to be taken:

  • Specifying missing indicator variables has consequences for the selection of cases. After computing missing figures at each level, units without observations are deleted.

  • Specifying a subgroup selection may interfere with missing indicator variables and may lead to an inappropriate computation of missing case information (4.2).

Definition of structural elements and storage of results

Structural aspects and their use

Subgroup Leads to an exclusion of cases. This is an initial decision when calling dq_reportcall; information on which subgroup is addressed is not stored in the variables.
Observation time Leads to an exclusion of cases based on a datetime variable; may be treated as a subgroup.
Stratification Loop over levels of variable to calculate results for each level
Cluster Use cluster as a specification for analyses which require a cluster

The hierarchical dependency follows the order above.

  1. First, the subgroup selection is made, potentially including time,

  2. Second, the strata are applied within a defined subgroup and

  3. finally cluster related analyses are conducted within the strata.

Defining subgroups

A subgroup is defined by providing a logic via the subgroup option of the program call. For example, if an analysis should only be conducted for males, using the variable sex with values male=1, female=2, the parameter should be specified as follows: subgroup("sex==1").

If a report is demanded for observations collected during a limited time period, this should be specified as a subgroup option as well by selecting on a datetime variable.

Be aware: All cases not belonging to the subgroup will be deleted. However, an information in the integrity and notes tables will be provided on the number of deleted cases, old and new N.

Related to the parameter call the following information is important:

  • The definition of each subgroup may depend on as many variables as needed, as long as the logic is provided in a Stata-readable format. Different terms may be combined using & (and) or | (or); brackets are not understood

  • Each analysis term must be provided as a single string without any blanks, e.g.: age<18 is OK, whereas age < 18 will lead to a failure in the program

  • All observations not belonging to the subgroup are deleted.

  • Entering nothing or “all” leads to analyses in the entire group.

  • If the subgroup specification is wrong or leads to few cases no resultreport is generated.

  • For time variables as specified in timevars, currently only a search based on the td format is possible, meaning selection by days, not by hours or minutes. The call must be as follows, e.g.:

local subgroup "ident_mez1>td(01feb2019)"
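Several terms may be combined into one subgroup logic as long as the whole term contains no blanks. A hedged sketch, reusing the datetime variable ident_mez1 from the example above and a hypothetical sex variable coded male=1; the trailing missing-value guard follows the cautionary remark on scalar comparisons:

```stata
* Males examined after 01feb2019; a single term without blanks
local subgroup "sex==1&ident_mez1>td(01feb2019)&ident_mez1<."
dq_reportcall, filename("mystudy") subgroup("`subgroup'")
```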

Regarding the storage of information, please note:

  • Because subgrouping does not affect the naming of scalars, different information is stored in different folders.

  • The naming of these subfolders is conducted automatically according the following rules:

    • “all” is used for analyses in the entire group

    • otherwise, the subgroup term is used as the subfolder name, with logical operators replaced by characters as in Fortran (e.g. “gt” for “>”)

    • If the subgroup term exceeds 50 characters, a subgroup folder specifier sg[N] is used instead, with N assigned in ascending order

  • To effectively combine results afterwards, it is necessary to run dq_reportcall separately with different report names; the stored information may then be combined.

Subgroups may interfere with the correct computation of unit- or segment missingness, if the selection leads to the exclusion of missing cases at the segment and unit level. When specifying a subgroup there are two options:

  • Skip the missing case analyses by omitting the provision of missing indicator variables (3.1)

  • Use a selection variable that also has information for all missing cases at the unit or segment level; age or sex are frequently suitable, for example

When calculating different subgroup reports on a body of data it is important to specify different subfolders.

The subgroup term is deleted after application because it may contain variables that are later on dropped from the file.

Stratified analyses

Currently, stratified analyses are conducted using the subgroup option. Each stratum needs to be specified as a subgroup. Example: if a report is requested for males and females separately (sex: male=1, female=2), the program call should include the parameter: subgroup(sex==1 sex==2)

Note:

  • If the same variable is specified as a cluster and stratification variable, the cluster preference will dominate over the stratification

Analyses using clusters as predictors

XXXX

Information on specific analyses

Extreme values and limit conflicts

Hard limits are assessed before soft limits. Any violations are only counted once. This means that a violation for hard limits will not be counted for soft limits.

Data quality indicators

Data quality indicators define aspects under which data quality is evaluated. In contrast to descriptive or other metrics, they provide a classification of data quality. Data quality indicators are computed after all analyses have been completed, as a step entirely independent of the previous computations.

Data quality classification

Based on the help function dq_help_dqiassessment, which is called from dq_report (data quality indicator assessment part), a data quality indicator is assigned. It uses predefined information on data quality indicator settings as defined in the Excel file dq_indicators.xlsx. This file must currently be stored in a dq_help folder within the personal ado folder.

As default, one of five quality settings is assigned:

green (1) OK Nothing encountered
blue (2) Uncertain The indicator under observation is not OK but evaluation is uncertain
yellow (3) Moderate A moderate data quality issue is encountered
orange (4) Severe A severe data quality issue is encountered
red (5) Critical A severe data quality issue is encountered that may severely impair analyses.

The last category may predominantly be assigned in case of outcome related issues while processual issues are less likely to appear in this category. For example, unavailable missing data classifications indicate deficiencies in data management. However even if they are not available in 100% of missing cases data may be analysable meaningfully.

The classification takes place from critical to OK, based on rules defined in dq_indicators.xlsx. The rules are taken as is from the fields and used within an “if” statement; therefore, proper specification must be safeguarded to ensure proper functioning. A specification is always expected, with the exception of the category blue, which may or may not be specified.
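As an illustration only, the rule fields for a percentage-based indicator might look as follows. The stub name dqi_miss_PE and all thresholds are invented for this sketch, not defaults of the tool; note the “<.” guards following the cautionary remark on missing scalars:

```
red     dqi_miss_PE>50 & dqi_miss_PE<.
orange  dqi_miss_PE>25 & dqi_miss_PE<.
yellow  dqi_miss_PE>10 & dqi_miss_PE<.
blue    (optional, may be left empty)
green   dqi_miss_PE<=10
```

Since the classification proceeds from critical to OK, the categories are evaluated in this order and the first rule that holds determines the assignment.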

Conventions for analyses

As default analyses related to missing data and consistency are conducted on the original data while analyses related to accuracy are conducted on modified data.

Conventions for naming of data quality indicator stubs

  • Each data quality indicator stub must be unique.

  • The first characters should denote the type of data quality indicator.

Adding new data quality indicators

New data quality indicators may be added by expanding the Excel sheet “indicatordefinitions”. In this document a new row with respect to the new indicator of interest needs to be added.

To properly set up new data quality indicators:

  • Define the summary metric in the column “Summarytype”

    • _PE: Percentage

    • _NU: Number of occurrences

  • Define in the column “referencestub” the relevant stub linked to the DQI sheet.

  • Make sure the number in the first column (dqi_number) is unique

  • Make sure there is no row with partial entries in the stub; otherwise the program may fail

Based on this information, the correct stub and label parameters will be generated automatically in the columns dqi_parameterstub and dqi_indicatorlabel.

Furthermore all information relevant for output should be added.

Requesting data quality calculations and output

The program has an output oriented approach to requesting calculations: what is calculated is defined by which output is demanded. In turn, the output is defined by several design elements.

  1. The program call makes reference to predefined analyses matrices

  2. Analyses matrices define the association between potential calculations and variable types for which they should be conducted

  3. Table definitions which define types of output tables

Note for revised concept [not yet implemented]

The user conducts two steps in setting up a report:

  1. Selection of the requested data quality aspects (potentially with specific settings).

  2. Selection of the organization of output

Selection of data quality aspects to be covered

Calculations for each report are based on the scope of demanded data quality indicators. For this purpose there are three approaches to select this scope:

  1. Selection of a standard report. When choosing this option, a preselected set of data quality indicators is used to compose the report. This is the fastest but least flexible option.

  2. Selection of reference indicators (level 3 in the concept). This is useful when applying an individualized list of indicators while relying on default approaches to compute each indicator.

  3. Selection of indicator implementations forms. By choosing this approach precise control is exerted on how something is computed.

These three approaches may be combined. For example, a suggested standard report data quality selection is used and then fine-tuned.

Selection of output options

Core aspects of outputting results may be selected and combined, e.g.:

  • Grouping of results by indicator over data structures, e.g. variables (e.g. table view)

  • Grouping of results by data structure (e.g. variable) over indicators (e.g. combining output elements by variable on one page)

  • Grouping of variables across subgroups (e.g. analysis in the entire observation period vs. analyses in the last three months only)

Analysis matrix

The analysis matrix associates calculations and variable types. The information is stored as a vector of length n with n defined by the number of variable types.

Currently, the following vectors are defined and accessible from dq_reportcall:

view_interpretation Display the interpretation section
view_varoverview Display an overview of variables
view_percentile Display a table with percentiles
view_missing Display an overview of missing values
view_icc1 Display an ICC table based on two-level models
view_icc2 Display an ICC table based on three-level models
view_variabledescriptive Display descriptives for single variables
view_observereffect Display observer effects for single variables
view_deviceeffect Display device effects for single variables

Report templates

The content of reports is defined through the analysis matrix, based on the assumption that requested output is also to be displayed.

Layout

Several layout aspects of a data quality report may be adjusted to address varying demands.

Overall layout

Overall layout concerns the output format, pdf vs. docx, controlled via the reportformat option. If no manual edits are to be made, pdf is preferable over docx because its formatting and color handling are more elaborate.

Further options are specified below:

REPORT FORMATTING
maxvarlabellength Maximum length of a variable label. Longer labels are abbreviated so that content is still displayed meaningfully.
varlinebreak Whether or not a page break occurs after each single variable table. 0=no, 1=yes (default)
sectionlinebreak Whether or not a page break occurs after each summary table and report section. 0=no, 1=yes (default)
clustercolorpalettes Specify a list of color palettes to be assigned to cluster variables. The first palette is assigned to the first cluster, the second to the second, and so on. The current default palettes are "s1 economist s2 burd s1r s2 plottig". When only one color is specified, its intensity is graded according to the number of clusters. If m palettes are specified for n clusters and n>m, the last palette (m) is assigned to clusters m+1..n.
decimals Number of decimals to be displayed in output tables (default=2)
language Report language (d=German, e=English, p=Portuguese)
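The layout options above can be combined in a single report call. A hedged sketch follows; the working folder, file name, and chosen option values are illustrative, and the exact argument forms (quoted vs. bare) may differ in practice:

```stata
* Sketch: a layout-tuned report call (file name and values illustrative).
cd "C:\projects\dq"
dq_reportcall, filename("mystudy") reportformat("pdf") ///
    decimals(1) language("e") varlinebreak(0) sectionlinebreak(1)
```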

Report content

The second important aspect is the request of report content. This is realized through the reporttemplate option by naming the respective report type. Several templates have been implemented. “standard” creates a report covering all data quality options with detailed output on primary and process variables. “full” extends the detailed variable output to all variables in the report. In both cases, a data quality summary is provided along with an integrity log and an interpretation part, if defined. “var” resembles “full” but omits tables. “tableoverview” creates only tables and omits the single-variable part.

Note: In case of many variables (e.g. >100), tableoverview may need to be used to obtain an output.
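For a large data set, a tables-only report can be requested as follows; this is a sketch using the documented reporttemplate option, with an illustrative file name:

```stata
* Sketch: request a tables-only report for a large data set.
dq_reportcall, filename("mystudy") reporttemplate("tableoverview")
```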

Title page

Several aspects of the title page, such as the title (reporttitle), subtitle (reportsubtitle), and authors (authors), may be specified by using the respective options.

Note: The option reportname defines the name of the file to be stored, not the title. If no report title is defined, the reportname is used instead as the title of the report.
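Combining these options, a title page might be configured as sketched below; all option values are illustrative, and note that reportname sets the stored file name while reporttitle sets the displayed title:

```stata
* Sketch: title page options (all values illustrative).
* reportname = stored file name; reporttitle = displayed title.
dq_reportcall, filename("mystudy") reportname("dq_mystudy_2020") ///
    reporttitle("Data Quality Report Study X") ///
    reportsubtitle("Baseline examination") authors("C. O. Schmidt")
```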

Report types

Standard data quality reports

By default reports are generated that target data quality aspects or simply provide a descriptive overview. The many options to design these reports are described in chapter 1.

All default reports create a file that contains all results, “Resultscalars[x].dta”, which is used by the subsequent report types.

Problem variable reports

Based on the results of the standard report, “Resultscalars[x].dta”, a problem variable report provides a more detailed overview of all variables classified at or above a certain problem category. The decision criterion is currently the given problem categorization on any data quality indicator.

A problem variable report is created by setting the problemvarreport parameter to the minimum problem category required for output. For example, on the default scaling (0=no problem, 1=unclear, 2=minor, 3=moderate, 4=severe, 5=critical), specifying 3 includes all variables assigned to categories 3, 4, and 5.

The problem report will not include control variables.

A problem variable report is particularly useful for variables assigned to the secondary variable category, as these receive only a superficial treatment in the standard report. All variables in a problem variable report are treated as primary variables. This also means that a problem variable report will not generate any new information if all variables identified as problematic have already been assigned to the primary variable category.
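Following the default scaling above, a problem variable report covering categories 3 (moderate) and above could be requested as sketched here; the file name is illustrative:

```stata
* Sketch: after a standard report has produced Resultscalars[x].dta,
* request a problem variable report for categories 3 (moderate) and above.
dq_reportcall, filename("mystudy") problemvarreport(3)
```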

Overview reports

A special option is the creation of result output from already existing results as stored in “Resultscalars[x].dta” files. Essentially, these files are reloaded and, by default, data quality indicator and data quality measure output is produced grouped by a selectable variable. This option is useful, for example, to create quality reports that allow for comparisons

  • across examinations in a study that have previously been covered by different reports.

  • across examination centers

  • across studies

Choosing this option will lead to:

  • All novel computation of results is skipped; the program moves directly to the result generation section

  • A maximum problem score is computed for each cluster type (the particular cluster is ignored); this maximum is compared across groups

  • The different result datasets are appended to each other, not merged

Commands for Multi-result reporting

DATA SOURCE AND FOLDERS
resultreporting Defines whether or not to create a multi-result report. Default is 0 = no such report. If resultreporting(1) is specified, this type of report is generated
rr_ordervar Variable by which to order results
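A multi-result report built from existing result files might be requested as sketched below; "examcenter" is a hypothetical ordering variable, and whether further options (such as filename) are required is not specified here:

```stata
* Sketch: multi-result (overview) report from existing Resultscalars files.
* "examcenter" is a hypothetical variable used to order the compared results.
dq_reportcall, resultreporting(1) rr_ordervar("examcenter")
```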

Errors and potential solutions

The following list describes a selection of typical errors and potential solutions.

Program terminates

Error Solution options
Program termination during table creation with error message: “Could not create table borders”

There are two solution options related to a general Stata 15 problem (as of Jan 2020) that Stata crashes after generating too many tables:

  • The report may contain too many variables -> exclude variables

  • If this issue occurs with a shorter report (approx. <200 variables): Stata has accumulated too many tables in the session; close Stata, reopen it, and rerun the program

Stata crashes unexpectedly when outputting tables
  • Are there unrecognized special characters in the labels? If so, remove them.

Wrong results

Error Solution options
Loading of files does not function as expected
  • The file names contain blanks. Remove the blanks and run the program again
No interpretation text appears
  • The program call has been wrongly specified; check that an interpretation text file and its source folder have been passed
Wrong interpretation texts appear
  • A wrong logic has been specified; check the logics column of the interpretation file
An expected text does not appear
  • Nonexistent scalars or globals have been specified. Check the interpretation text file against the macroexport file in the logs_results folder. The expected scalar may be used as a search term.
A graph does not display expected content or appears old
  • Set forcecalc to 1, i.e. forcecalc(1), to force a new generation of graphs. If this option is set to 0, new runs of the program will not modify existing graphs.

  • The graph may be left over from previous runs and mistakenly included; delete all graphs and rerun the command.
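The forcecalc fix above can be applied directly in the report call, as sketched here with an illustrative file name:

```stata
* Sketch: force regeneration of all graphs on rerun.
dq_reportcall, filename("mystudy") forcecalc(1)
```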

A result is not displayed in the result table but appears in graphs
  • The stub may be incorrectly specified in the tablesettings ado; check it
Subgroup selection does not work with time variables
  • Is the time variable provided in an adequate format? It is best to test outside the pipeline whether a subgrouping command works
Errors related to data quality assessments occur
  • Check the integrity of the data quality indicator definition sheet in the setup Excel file