Data-driven data validation

Introducing ‘validatesuggest’

Intro

Data validation is a cornerstone of data-intensive industries, such as the production of official statistics. To create and maintain high-quality statistical output, data needs to be checked before it is used in statistical processes. A number of projects have been carried out to streamline and optimize data validation processes, both within and among organisations. An important success factor for effective data validation is the design and maintenance of validation rules that cover the dynamics of the data to be checked. This has led to the definition of standardized validation rules that cover the most common use cases in official statistics. Examples are the definition of internationally agreed ‘main types of validation rules’ by Eurostat, and the set of recipes and standard functions offered in the well-known R package validate, which are documented in the online cookbook [3].

In this presentation we take another approach to rule maintenance. In addition to the knowledge of the domain specialist, we let the data speak. Properties of the data, such as type, range, distribution and correlation, can be used to derive rules that capture the essentials of the data. Since the number of rules that could potentially be derived from data is in general unbounded, we use the existing international and national standardized validation rule systems to decide which types of rules make sense. A refinement of the concept is to also take the time dimension of time series data into consideration. That way, time-dependent validation rules come within reach.

The suggested rules are expressed in a human-readable form, so that the domain specialist or rule maintainer can inspect and understand them: the data-driven concept is intended as a suggestion to the rule maintainers. Suggested rules should always be checked and interpreted before being put into production. The types of rules currently implemented in the experimental R package ‘validatesuggest’ are:

validate, van der Loo and de Jonge (2021)

Usage

library(validatesuggest)
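The data-driven idea described in the introduction can be illustrated without the package itself. The following is a minimal sketch in base R, not the validatesuggest API: the helpers `derive_rules()` and `check_rules()` and the example data are hypothetical and introduced only for illustration. The sketch assumes syntactically valid column names and derives type, range, domain and completeness rules as human-readable expressions, which can then be inspected before being applied to new data.

```r
# Hypothetical helper (not part of validatesuggest): derive simple,
# human-readable rule expressions from the observed properties of each column.
derive_rules <- function(d) {
  rules <- character(0)
  for (v in names(d)) {
    x <- d[[v]]
    if (is.numeric(x)) {
      # Type rule plus range rule based on the observed minimum and maximum.
      rng <- range(x, na.rm = TRUE)
      rules <- c(rules,
                 sprintf("is.numeric(%s)", v),
                 sprintf("%s >= %s", v, format(rng[1], digits = 15)),
                 sprintf("%s <= %s", v, format(rng[2], digits = 15)))
    } else if (is.character(x) || is.factor(x)) {
      # Domain rule: only the categories seen in the reference data are allowed.
      lev <- sort(unique(as.character(x[!is.na(x)])))
      rules <- c(rules,
                 sprintf("%s %%in%% c(%s)", v,
                         paste(sprintf('"%s"', lev), collapse = ", ")))
    }
    # Completeness rule, suggested only if the reference data has no missings.
    if (!anyNA(x)) rules <- c(rules, sprintf("!is.na(%s)", v))
  }
  rules
}

# Hypothetical helper: evaluate each rule against a data set and report
# whether all records satisfy it.
check_rules <- function(rules, d) {
  vapply(rules,
         function(r) all(eval(parse(text = r), envir = d), na.rm = TRUE),
         logical(1))
}

# Made-up reference data from which the rules are derived.
reference <- data.frame(
  age    = c(18, 35, 52, 67),
  income = c(0, 1500, 3200, 2100),
  status = c("employed", "employed", "retired", "student"),
  stringsAsFactors = FALSE
)
rules <- derive_rules(reference)
print(rules)

# New data with one out-of-range age and one unseen category.
new_data <- data.frame(
  age    = c(25, 130),
  income = c(1000, 2500),
  status = c("employed", "unknown"),
  stringsAsFactors = FALSE
)
result <- check_rules(rules, new_data)
print(names(result)[!result])
```

Because the rules are plain expressions, a rule maintainer can read, edit or discard each one before use. Rules derived this way reflect only the reference data, so a rule such as an upper bound on `age` is a suggestion to be reviewed rather than a fact about the domain.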
Loo, Mark P. J. van der, and Edwin de Jonge. 2021. “Data Validation Infrastructure for R.” Journal of Statistical Software 97 (10). https://doi.org/10.18637/jss.v097.i10.