
Synthetic data for Data Management Applications

Satish Kodali
2 min read · Jul 8, 2022


Synthetic data generation algorithms encode the statistical and structural properties of a real-world data domain and efficiently generate test data that closely matches real data, without the risk of leaking personal or confidential data into the generated test data set. This article focuses on use cases of synthetic data relevant to data management (Master Data Management and Data Quality applications) rather than on synthetic data generation for ML or other statistical models.

Generating data synthetically is safer and more efficient than data masking, in which real data is taken as the starting point and scrambled for use in less controlled environments such as test or sandbox environments. Masked data is limited to the size of the original data set, and poor masking strategies can distort the statistical and structural properties of the original data, which in turn gives data management use cases a distorted profile of the data.

For applications in the data management space, a synthetic data generator has to be able to introduce controlled noise, such as controlling the number and types of duplicates in master data, and to match the frequency of reference and data values to the frequency of those values in real data. This control is needed when generating test data to test the efficiency of an MDM record linkage algorithm, that is, to know how many of the known duplicate records are detected by the record linkage algorithm or how many of the data anomalies are…
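As a rough illustration of the two controls described above (a known number of injected duplicates, and reference values drawn to follow a target frequency), here is a minimal Python sketch. All names in it are hypothetical and not taken from any specific tool: the field names `record_id`, `entity_id`, `name` and `country`, the `country_counts` frequency table, and the adjacent-character swap used as the only duplicate type. Because the generator records which entity each row belongs to, the set of true duplicate pairs is known up front, and the output of any record linkage algorithm can be scored against it.

```python
import random
from itertools import combinations


def sample_by_frequency(value_counts, n, rng):
    """Draw n values whose relative frequencies follow value_counts."""
    values = list(value_counts)
    weights = [value_counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


def add_typo(text, rng):
    """Swap two adjacent characters to simulate a keying error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def generate_master_data(n_entities, n_duplicates, country_counts, seed=0):
    """Create n_entities clean records plus n_duplicates noisy copies."""
    rng = random.Random(seed)
    countries = sample_by_frequency(country_counts, n_entities, rng)
    records = [
        {"record_id": i, "entity_id": i,
         "name": f"Customer {i:05d}", "country": countries[i]}
        for i in range(n_entities)
    ]
    for k in range(n_duplicates):
        source = records[rng.randrange(n_entities)]
        duplicate = dict(source)  # keeps entity_id, so the true link is known
        duplicate["record_id"] = n_entities + k
        duplicate["name"] = add_typo(source["name"], rng)
        records.append(duplicate)
    return records


def true_duplicate_pairs(records):
    """Return all (smaller_id, larger_id) pairs that share an entity_id."""
    by_entity = {}
    for record in records:
        by_entity.setdefault(record["entity_id"], []).append(record["record_id"])
    pairs = set()
    for ids in by_entity.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def recall(detected_pairs, truth_pairs):
    """Fraction of known duplicate pairs that the linkage step detected."""
    if not truth_pairs:
        return 1.0
    return len(detected_pairs & truth_pairs) / len(truth_pairs)


# Hypothetical frequency table standing in for counts profiled from real data.
country_counts = {"US": 70, "DE": 20, "IN": 10}
records = generate_master_data(1000, 50, country_counts, seed=42)
truth = true_duplicate_pairs(records)
print(len(records), "records,", len(truth), "known duplicate pairs")
```

To score a matcher, pass the pairs it reports as `detected_pairs` to `recall`, with each pair ordered as `(smaller_id, larger_id)` so it lines up with `truth`. Other duplicate types (missing fields, abbreviations, reordered name parts) could be added as further noise functions alongside `add_typo`.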
