
Probabilistic Matching for Master Data Management — Part 2

Satish Kodali
4 min read · Sep 12, 2019


Motivation

Duplicate records in master data can have serious consequences in settings like health care, where electronic medical records belonging to the same patient become linked to two or more master records. Splitting a patient's medical history across master records results in duplicated tests and prescriptions, increasing health care costs and degrading care; in extreme cases, data duplication can be a matter of life or death.

Data needs to be cleansed and standardized to eliminate spurious variation within the data set and to improve the signal-to-noise ratio when extracting features such as the comparison vector, which serves as a building block of the probabilistic matching algorithm. Data standardized to a uniform format is also required for bucketing, which creates neighborhoods of data within which candidate pairs are selected for detailed matching.
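The bucketing step can be sketched as follows. This is a minimal illustration, not the article's prescribed method: the blocking key used here (first four letters of the standardized name plus birth year) and the field names are hypothetical choices.

```python
# Sketch of bucketing on a standardized blocking key. The key definition
# (first four letters of the standardized name + birth year) is a
# hypothetical choice for illustration, not prescribed by the article.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Miller", "dob": "1980-05-01"},
    {"id": 2, "name": "JohnMiller",  "dob": "1980-05-01"},
    {"id": 3, "name": "Jane Smith",  "dob": "1975-11-23"},
]

def blocking_key(rec):
    # Standardize first, then derive the key, so spurious variation
    # (case, stray spaces) does not scatter true matches across buckets.
    name = "".join(ch for ch in rec["name"].lower() if ch.isalpha())
    return (name[:4], rec["dob"][:4])

buckets = defaultdict(list)
for rec in records:
    buckets[blocking_key(rec)].append(rec)

# Candidate pairs are formed only within each bucket, not across the
# full cross product of the data set.
candidate_pairs = [
    pair for bucket in buckets.values() for pair in combinations(bucket, 2)
]
print(candidate_pairs)  # records 1 and 2 share a bucket; record 3 does not
```

Only the pairs that survive bucketing go on to detailed comparison, which is what keeps probabilistic matching tractable on large master data sets.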

Syntax Standardization

Syntax standardization is the most basic form of standardization, in which string, number, and date transformations are applied to remove syntactic variations in the values.

In the most basic case, “JohnMiller” and “john miller” result in a non-match when exact string comparison is applied. In a slightly more complicated case, “Tedd O’Maley” and “Tedd OMaley” also result in a non-match without cleansing and standardization.

Basic string transformations are:

  • Converting all strings…
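A minimal sketch of such string transformations, assuming a hypothetical `standardize` helper (the article's full transformation list is truncated, so only the examples given in the text are covered here):

```python
# Hypothetical syntax-standardization helper: normalize unicode,
# lowercase, and keep only letters and digits, so purely syntactic
# variants of the same value compare equal.
import unicodedata

def standardize(value: str) -> str:
    value = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in value.lower() if ch.isalnum())

# The pairs from the text now compare as exact matches:
print(standardize("JohnMiller") == standardize("john miller"))   # True
print(standardize("Tedd O'Maley") == standardize("Tedd OMaley")) # True
```

After this step, exact string comparison (or any downstream similarity measure) operates on canonical forms rather than on incidental formatting differences.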
