Literature overview

Category Paper Keywords
Methods Aguinis et al. Best-practice recommendations for defining, identifying, and handling outliers, 2013, Organizational Research Methods, 16(2), 270–301 quantitative research, ethics in research, outliers
Standards Public Opinion Research Standard definitions: Final dispositions of case codes and outcome rates for surveys, 2011
Methods Altman & Bland Assessing agreement between methods of measurement, 2017, Clin Chem
Tools Assenov et al. Comprehensive analysis of DNA methylation data with RnBeads, 2014, Nature Methods, 11(11), 1138 DNA methylation analysis, computational epigenetics, whole genome bisulfite sequencing, reduced representation bisulfite sequencing, epigenotyping microarrays, Illumina Infinium HumanMethylation450 assay, bioinformatics software, epigenome-wide association studies, medical epigenomics
Thresholds Bach The Freiburg visual acuity test–automatic measurement of visual acuity, 1996, Optom Vis Sci, 73(1), 49–53 visual acuity, computer test, psychometric threshold estimation
Cohort Studies Bamberg et al. Whole-body MR imaging in the German National Cohort: Rationale, design, and technical background, 2015, Radiology, 277(1), 206–220
Software Boehmke Data wrangling with R, 2016
Methods Bakar et al. A comparative study for outlier detection techniques in data mining, 2006, 2006 IEEE Conference on Cybernetics and Intelligent Systems, 1–6 data mining, clustering, outlier
Dictionary Bangia Dictionary of information technology, 2010
Documentation Bargaje Good documentation practice in clinical research, 2011, Perspectives in Clinical Research, 2(2), 59 ALCOA, documentation, source, training
Methods Barnett & Lewis Outliers in statistical data, 1994
Standards Begley & Ellis Drug development: Raise standards for preclinical cancer research, 2012, Nature, 483(7391), 531–533
Methods Bennett How can I deal with missing data in my study?, 2001, Australian and New Zealand Journal of Public Health, 25(5), 464–469
Metadata Bretherton Reference model for metadata: A strawman, 1994, Whitepaper, University of Wisconsin
Methods Brown & Forsythe Robust tests for the equality of variances, 1974, Journal of the American Statistical Association, 69(346), 364–367
Review Callahan et al. A comparison of data quality assessment checks in six data sharing networks, 2017, eGEMs (Generating Evidence & Methods to Improve Patient Outcomes), 5(1)
Review Chalmers & Glasziou Avoidable waste in the production and reporting of research evidence, 2009, Obstetrics & Gynecology, 114(6), 1341–1345
Review Chen et al. A review of data quality assessment methods for public health information systems, 2014, International Journal of Environmental Research and Public Health, 11(5), 5170–5207 data quality, information quality, data use, data collection process, evaluation, assessment, public health, population health, information systems
Software Chang et al. Shiny: Web application framework for R, 2015, 2018, R Package Version, 1(0), 14
Methods Callegaro et al. Web survey methodology, 2015
Methods Cleveland et al. Regression by local fitting: Methods, properties, and computational algorithms, 1988, Journal of Econometrics, 37(1), 87–114
Methods Cleveland & Devlin Locally weighted regression: An approach to regression analysis by local fitting, 1988, Journal of the American Statistical Association, 83(403), 596–610
Concept Couchoud et al. Renal replacement therapy registries—time for a structured data quality evaluation programme, 2013, Nephrology Dialysis Transplantation, 28(9), 2215–2220 completeness, data quality, quality assessment, RRT registry, timeliness, validity
Methods Das et al. A new method to evaluate the completeness of case ascertainment by a cancer registry, 2008, Cancer Causes & Control, 19(5), 515–525 Data quality, Cancer, Population registers, Estimation, techniques
Methods Dasu & Johnson Exploratory data mining and data cleaning, 2003
Methods Dong & Peng Principled missing data methods for researchers, 2013, SpringerPlus, 2(1), 222 Missing data, Listwise deletion, MI, FIML, EM, MAR, MCAR, MNAR
Methods Drion & others Some distribution-free tests for the difference between two empirical cumulative distribution functions, 1952, The Annals of Mathematical Statistics, 23(4), 563–574
Methods Durrleman & Simon Flexible regression models with cubic splines, 1989, Statistics in Medicine, 8(5), 551–561 Smoothing splines, Non‐parametric regression, Piecewise polynomials
Epidemiology Ebrahim & Davey Smith Commentary: Should we always deliberately be non-representative?, 2013, International Journal of Epidemiology, 42(4), 1022–1026
Concept Edwards et al. Science friction: Data, metadata, and collaboration, 2011, Social Studies of Science, 41(5), 667–690 collaboration, communication, data, metadata
Methods Fasano & Franceschini A multidimensional version of the Kolmogorov–Smirnov test, 1987, Monthly Notices of the Royal Astronomical Society, 225(1), 155–170
Methods Feinstein & Cicchetti High agreement but low kappa: I. The problems of two paradoxes, 1990, Journal of Clinical Epidemiology, 43(6), 543–549 Kappa, Concordance, Agreement, Paradox
Methods Filzmoser A multivariate outlier detection method, 2004
Standards Finnie et al. EpiJSON: A unified data-format for epidemiology, 2016, Epidemics, 15, 20–26 Outbreaks, Epidemics, Software, Databases, Communications standards
Epidemiology Fletcher et al. Clinical epidemiology: The essentials, 2012
Methods Freedman & Diaconis On the histogram as a density estimator: L2 theory, 1981, Probability Theory and Related Fields, 57(4), 453–476
Methods Golub & Van Loan Matrix computations, 1996, Johns Hopkins University Press, Baltimore and London
Methods Gonzalez-Chica et al. Test of association: Which one is the most appropriate for my study?, 2015, Anais Brasileiros de Dermatologia, 90(4), 523–528 Data analysis; Association; Epidemiology and biostatistics; Hypothesis testing; Statistical methods and procedures
Methods Grant Data visualization: Charts, maps, and interactive graphics, 2018
Software Hahsler et al. Introduction to arules-a computational environment for mining association rules and frequent item sets, 2010, 2018
Methods Hallgren Computing inter-rater reliability for observational data: An overview and tutorial, 2012, Tutorials in Quantitative Methods for Psychology, 8(1), 23 behavioral observation, coding, inter-rater agreement, intra-class correlation, kappa, reliability, tutorial
Methods Hansen et al. Enabling longitudinal data comparison using DDI, 2011 Data Documentation in Social Sciences; DDI Metadata Standard
Methods Harrell Jr Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis, 2015
Software Harris et al. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support, 2009, Journal of Biomedical Informatics, 42(2), 377–381 Medical informatics, Electronic data capture, Clinical research, Translational research
Dictionary Hartge A dictionary of epidemiology, sixth edition, 2015, Am J Epidemiol
Methods Hawkins Introduction, 1980, In Identification of outliers (pp. 1–12)
Methods Hayat et al. Statistical methods used in the public health literature and implications for training of public health professionals, 2017, PloS One, 12(6), e0179032
Software Horton & Kleinman Using R and RStudio for data management, statistical analysis, and graphics, 2015
Metadata Hoyle et al. Metadata for the longitudinal data life cycle: The role and benefit of metadata management and reuse, 2010, DDI Working Paper Series: Longitudinal Data Best Practices
Methods Hubert & Vandervieren An adjusted boxplot for skewed distributions, 2008, Computational Statistics & Data Analysis, 52(12), 5186–5201
Methods Hu & Sung Detecting pattern-based outliers, 2003, Pattern Recognition Letters, 24(16), 3059–3068 Outlier detection, Complete spatial randomness, Clustering, Regular spacing
Methods Huebner et al. A contemporary conceptual framework for initial data analysis, 2018, Observational Studies, 4, 71–192 initial data analysis, data cleaning, data screening, reporting, metadata, research plan, STRATOS Initiative
Methods Huser et al. Methods for examining data quality in healthcare integrated data repositories, 2017 Data Quality, Evaluation Methods, Visualization, Observational Research
Review Ioannidis Why most published research findings are false, 2005, PLoS Medicine, 2(8), e124
Epidemiology Ioannidis Discussion: Why an estimate of the science-wise false discovery rate and application to the top medical literature is false, 2013, Biostatistics, 15(1), 28–36
Epidemiology Ioannidis et al. Increasing value and reducing waste in research design, conduct, and analysis, 2014, The Lancet, 383(9912), 166–175
Epidemiology Jager & Leek An estimate of the science-wise false discovery rate and application to the top medical literature, 2013, Biostatistics, 15(1), 1–12
Epidemiology Jager & Leek Rejoinder: An estimate of the science-wise false discovery rate and application to the top medical literature, 2013, Biostatistics, 15(1), 39–45
Methods Joshi et al. Likert scale: Explored and explained, 2015, British Journal of Applied Science & Technology, 7(4), 396 Psychometrics, Likert scale, points on scale, analysis, education
Methods Jinyuan et al. Correlation and agreement: Overview and clarification of competing concepts and measures, 2016, Shanghai Archives of Psychiatry, 28(2), 115 concordance correlation, intraclass correlation, Kendall’s tau, non-linear association, Pearson’s correlation, Spearman’s rho
Concept Kahn et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data, 2016, eGEMs, 4(1) electronic health records, data use & quality, data completeness
Methods Kalton The treatment of missing survey data, 1986, Survey Methodology, 12, 1–16
Methods Kao & Green Analysis of variance: Is there a difference in means and what does it mean?, 2008, Journal of Surgical Research, 144(1), 158–170 research/statistics and numerical data, data interpretation/statistical, models, statistical, review
Methods Kahn et al. Quantifying clinical data quality using relative gold standards, 2010, AMIA Annual Symposium Proceedings, 2010, 356
Concept Karr et al. Data quality: A statistical perspective, 2006, Statistical Methodology, 3(2), 137–173
Methods Kalton & Kasprzyk The treatment of missing survey data, 1986, Survey Methodology, 12(1), 1–16
Concept Keller et al. The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches, 2017 designed data, administrative data, opportunity data, reproducibility, total survey error, decision theoretic framework
Methods Kleiber & Zeileis Visualizing count data regressions using rootograms, 2016, The American Statistician, 70(3), 296–303 Finite mixture, Goodness of fit, Hurdle model, Negative binomial regression, Poisson regression
Methods Koo & Li A guideline of selecting and reporting intraclass correlation coefficients for reliability research, 2016, Journal of Chiropractic Medicine, 15(2), 155–163 Reliability and validity, Research, Statistics
Methods Kullback & Leibler On information and sufficiency, 1951, The Annals of Mathematical Statistics, 22(1), 79–86
Methods Kullback Information theory and statistics, 1997
Methods Levene Robust tests for equality of variances, 1961, Contributions to Probability and Statistics. Essays in Honor of Harold Hotelling, 279–292
Concept De Lusignan et al. Key concepts to assess the readiness of data for international research: Data quality, lineage and provenance, extraction and processing errors, traceability, and curation, 2011, Yearb Med Inform, 6(1), 112–120 Medical records systems, computerized; research design; registry; records as topic; databases, genetic
Methods Lang & Little Principled missing data treatments, 2016, Prevention Science Missing data, Multiple imputation, Full information maximum likelihood, Auxiliary variables, Intent-to-treat, Statistical inference
Cohort Studies Langeheine et al. Consequences of an extended recruitment on participation in the follow‐up of a child study: Results from the German IDEFICS cohort, 2017, Paediatric and Perinatal Epidemiology, 31(1), 76–86 loss to follow‐up, late respondents, IDEFICS, paradata
Concept Lee et al. A framework for data quality assessment in clinical research datasets, 2017, AMIA Annual Symposium Proceedings, 2017, 1080
Methods Lehmann & Casella Theory of point estimation, 2006
Methods Lenth & others Least-squares means: The R package lsmeans, 2016, Journal of Statistical Software, 69(1), 1–33 least-squares means, linear models, experimental design
Concept Liaw et al. Towards an ontology for data quality in integrated chronic disease management: A realist review of the literature, 2013, International Journal of Medical Informatics, 82(1), 10–24 Realist, Research design, Chronic disease, Information system, Data quality, Ontology
Methods Lindsey Comparison of probability distributions, 1974, Journal of the Royal Statistical Society. Series B (Methodological), 38–47 likelihood inference, grouping data, goodness of fit, comparing models
Methods Lindsey & Mersch Fitting and comparing probability distributions with log linear models, 1992, Computational Statistics & Data Analysis, 13(4), 373–384 Comparison of models, Generalized linear models, Goodness of fit, Likelihood inference, Log linear models, Probability distributions, Truncated distributions
Methods Little & Rubin Statistical analysis with missing data, 2014
Methods Mayr et al. A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models, 2017, Statistical Methods in Medical Research, 26(3), 1443–1460 Measurement errors, systematic bias, random error, statistical models, permutation test, gradient boosting, regression
Methods Mahalanobis On the generalized distance in statistics, 1936
Methods Marsh & Seo A review and comparison of methods for detecting outliers in univariate data sets, 2006 boxplot; lognormal; outlier; skewed distribution
Concept McMahon & Denaxas A novel framework for assessing metadata quality in epidemiological and public health research settings, 2016, AMIA Summits on Translational Science Proceedings, 2016, 199
Review Meyer et al. Efficient data management in a large-scale epidemiology research project, 2012, Computer Methods and Programs in Biomedicine, 107(3), 425–435 Central Data Management, Electronic Data Capture, Electronic Case Report Forms, Individualized medicine, Personalized Medicine
Software Mitchell & others Data management using stata: A practical handbook, 2010
Methods Morgenthaler A survey of robust statistics, 2007, Statistical Methods and Applications, 15(3), 271–293
Methods Müller & Büttner A critical discussion of intraclass correlation coefficients, 1994, Statistics in Medicine, 13(23-24), 2465–2476
Metadata Nadkarni Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge, 2011
Cohort Studies Consortium The German National Cohort: Aims, study design and organization, 2014, European Journal of Epidemiology, 29, 371–382 Population-based cohort, Non-communicable diseases, Chronic infections, Life-style and socio-economic factors, Magnetic resonance imaging, Pre-clinical disease, Functional impairments
Methods Newsom Longitudinal structural equation modeling: A comprehensive introduction, 2015
Epidemiology Nohr & Olsen Commentary: Epidemiologists have debated representativeness for more than 40 years—has the time come to move on?, 2013, International Journal of Epidemiology, 42(4), 1016–1017 pregnancy, conflict of interest, epidemiology, adult, biometry, child, follow-up, garbage, internet, logic, shoes, sociology, time factors, infections, epidemiologic causality, statutes and laws, prenatal care, conception, epidemics, child health, birth, inference, killing, national institute of child health and human development, imputation
Concept Nonnemacher et al. Datenqualität in der medizinischen forschung, 2014
Software Potter et al. Web application teaching tools for statistics using R and Shiny, 2016, Technology Innovations in Statistics Education, 9(1)
Software Plantier et al. Biomedical engineering systems and technologies: 7th international joint conference, BIOSTEC 2014, Angers, France, 3-6, 2014, revised selected papers, 2016
Dictionary Porta A dictionary of epidemiology, 2014
Methods Press & Teukolsky Kolmogorov-Smirnov test for two-dimensional data: How to tell whether a set of (x, y) data points are consistent with a particular probability distribution, or with another data set, 1988, Computers in Physics, 2(4), 74–77
Epidemiology Prinz et al. Believe it or not: How much can we rely on published data on potential drug targets?, 2011, Nature Reviews Drug Discovery, 10(9), 712 Drug discovery
Methods Priyadarshana & Sofronov Multiple break-points detection in array CGH data via the cross-entropy method, 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2), 487–498 Break-point modelling, aCGH microarray data, stochastic optimization, CNVs, DNA copy number, Cross-Entropy
Methods Ranganathan et al. Common pitfalls in statistical analysis: Measures of agreement, 2017, Perspectives in Clinical Research, 8(4), 187 Agreement, biostatistics, concordance
Documentation Rasmussen & Blank The data documentation initiative: A preservation standard for research, 2007, Archival Science, 7(1), 55–71
Software Rossini et al. Simple parallel statistical computing in R, 2007, Journal of Computational and Graphical Statistics, 16(2), 399–420 Bootstrap, Cross-validation, Grid computing, Kriging, LAM-MPI, MPI, Message passing, Profile likelihood, PVM
Software Reineke et al. Modys–ein modulares steuerungs-und dokumentationssystem für epidemiologische studien, 2006, Medizinische Dokumentation–Wichtig Oder Nichtig
Metadata Richter et al. Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information, 2019, GMS Med Inform Biom Epidemiol, 15(1), 10.3205/mibe000202 data quality, metadata, process variables, data monitoring, health research, cohort studies
Methods R. Rigby et al. Distributions for modelling location, scale, and shape: Using GAMLSS in R, 2017, URL www.gamlss.org (last accessed 5 March 2018)
Methods R. A. Rigby & Stasinopoulos Generalized additive models for location, scale and shape, 2005, Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 507–554 Beta–binomial distribution; Box–Cox transformation; Centile estimation; Cubic smoothing splines; Generalized linear mixed model; LMS method; Negative binomial distribution; Non-normality; Nonparametric models; Overdispersion; Penalized likelihood; Random effects; Skewness and kurtosis
Omics Risch Searching for genetic determinants in the new millennium, 2000, Nature, 405(6788), 847
Epidemiology Rothman et al. Why representativeness should be avoided, 2013, International Journal of Epidemiology, 42(4), 1012–1014 ethnic, group habits, statutes and laws, public health medicine, inference, social survey
Epidemiology Rothman et al. Modern epidemiology, 2008
Epidemiology Rothwell External validity of randomised controlled trials: “To whom do the results of this trial apply?” 2005, The Lancet, 365(9453), 82–93
Software R Core Team R: A language and environment for statistical computing, 2020
Documentation Ryssevik The data documentation initiative (DDI) metadata specification, 2001
Methods Schafer & Graham Missing data: Our view of the state of the art, 2002, Psychol Methods, 7(2), 147–177
Tools C. Schmidt et al. Square2-a web application for data monitoring in epidemiological and clinical studies, 2017, Studies in Health Technology and Informatics, 235, 549–553
Concept C. O. Schmidt et al. Assessment of a data quality guideline by representatives of German epidemiologic cohort studies, 2019, MIBE, 15(1), 10.3205/mibe000203 data quality, cohort studies, data quality indicators, data monitoring
Software Schmidberger et al. State-of-the-art in parallel computing with r, 2009, Journal of Statistical Software, 47(1) R, high performance computing, parallel computing, computer cluster, multi-core systems, grid computing, benchmark
Software Signorell et al. DescTools: Tools for descriptive statistics. R package version 0.99. 18, 2016, R Foundation for Statistical Computing, Vienna, Austria
Methods Sison & Glaz Simultaneous confidence intervals and sample size determination for multinomial proportions, 1995, Journal of the American Statistical Association, 90(429), 366–369 Coverage probabilities; Multinomial distribution; Probability approximations; Simultaneous inference
Methods Snijders & Bosker Multilevel analysis: An introduction to basic and advanced multilevel modeling, 1999
Epidemiology Stang & Jöckel Avoidance of representativeness in presence of effect modification, 2014, International Journal of Epidemiology, 43(2), 630–631
Metadata Stausberg et al. Indicators of data quality: Review and requirements from the perspective of networked medical research, 2019, GMS Med Inform Biom Epidemiol, 15(1), 10.3205/mibe000199 medical research, data quality, healthcare, guidelines, analytics, informatics
Methods Sterne & Smith Sifting the evidence—what’s wrong with significance tests?, 2001, Physical Therapy, 81(8), 1464–1469
Methods Sturges The choice of a class interval, 1926, Journal of the American Statistical Association, 21(153), 65–66
Cohort Studies Teppo et al. Data quality and quality control of a population-based cancer registry: Experience in Finland, 1994, Acta Oncologica, 33(4), 365–369
Epidemiology Thygesen & Ersbøll When the entire population is the sample: Strengths and limitations in register-based epidemiology, 2014, European Journal of Epidemiology, 29(8), 551–558 Registers, Database management systems, Epidemiology, Bias, Nordic countries
Methods Tukey Exploratory data analysis, 1977
Software Van der Loo The stringdist package for approximate string matching, 2014, The R Journal, 6(1), 111–122
Concept Vardaki et al. A statistical metadata model for clinical trials’ data management, 2009, Computer Methods and Programs in Biomedicine, 95(2), 129–145 Metadata, Clinical trials, Medical research, Statistical metadata modeling, Transformations, Clinical Study Data Management, Systems, Harmonization, Quality
Documentation Vardigan et al. Data documentation initiative: Toward a standard for the social sciences, 2008, International Journal of Digital Curation, 3(1), 107–113
Cohort Studies Völzke et al. Cohort profile: The Study of Health in Pomerania, 2010, International Journal of Epidemiology, 40(2), 294–307 ultrasonography, follow-up, Germany, ships
Methods Wager et al. Model selection for penalized spline smoothing using Akaike information criteria, 2007, Australian & New Zealand Journal of Statistics, 49(2), 173–190 Penalized Spline; Model Selection; Conditional versus Marginal Inference; Variance Component Selection
Concept Wang & Strong Beyond accuracy: What data quality means to data consumers, 1996, Journal of Management Information Systems, 12(4), 5–33 data administration, data quality, database system
Concept Watts et al. Data quality assessment in context: A cognitive perspective, 2009, Decision Support Systems, 48(1), 202–211 Dual-Process Theory, Cognition, Quality Metadata, Information Quality Management, Information Quality Dimensions, Decision Support
Concept Nicole G. Weiskopf et al. A data quality assessment guideline for electronic health record data reuse, 2017, eGEMs (Generating Evidence & Methods to Improve Patient Outcomes), 5(1)
Concept Nicole G. Weiskopf et al. Defining and measuring completeness of electronic health records for secondary use, 2013, Journal of Biomedical Informatics, 46(5), 830–836 Data quality, Electronic health records, Secondary use, Completeness
Concept Nicole Gray Weiskopf & Weng Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research, 2013, Journal of the American Medical Informatics Association, 20(1), 144–151 Clinical research, clinical research informatics, data quality, electronic health records, knowledge acquisition, knowledge acquisition and knowledge management, knowledge bases, knowledge representations, methods for integration of information from disparate sources, secondary use
Standards Organization International statistical classification of diseases and related health problems, 2004
Metadata Wilson Toward releasing the metadata bottleneck, 2011, Library Resources & Technical Services, 51(1), 16–28
Software Wickham Advanced R, 2014
Software Wickham R packages: Organize, test, document, and share your code, 2015
Methods De Leeuw et al. Prevention and treatment of item nonresponse, 2003, Journal of Official Statistics, 19, 153–176 causes of missingness, data collection mode, ignorability, imputation, item nonresponse, questionnaire development, follow-up survey