Probabilistic Matching for Master Data Management — Part 1
How many false suspect duplicates to expect
A probabilistic matching algorithm extends the search space and adds dynamism to how MDM software finds duplicate records that belong to the same entity, such as a Person, Product, or Account. As data is added and updated in the MDM system, the matching algorithm marks some record pairs as suspect duplicates (two records identified as likely representing the same entity) for auto-merging or manual review.
Before getting into the details of the probabilistic matching algorithm, it helps to have an analytical framework based on probability theory. The math explains why, even on a data set with low noise, the percentage of false positive suspect records marked for review is much higher than plain intuition would suggest.
Bayes Theorem
Bayes' theorem, together with framing the problem in terms of conditional probabilities, provides the framework we can use to estimate the percentage of true duplicates among all suspect records identified by the algorithm.
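In this setting, writing D for "the pair is a true duplicate" and F for "the algorithm flags the pair as a suspect", Bayes' theorem gives the quantity we care about:

```latex
P(D \mid F) = \frac{P(F \mid D)\,P(D)}{P(F \mid D)\,P(D) + P(F \mid \neg D)\,P(\neg D)}
```

The numerator counts flagged pairs that really are duplicates; the denominator adds in the flagged pairs that are not, which is where the surprisingly large false positive share comes from.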
Assume the scenario of finding duplicates in a low-noise data set of customer records that contains distinguishing attributes like SSN, Name, and Date of Birth, with a possible duplicate rate of 2%. Also assume the algorithm has a 99% accuracy rate in classifying a record pair as non-duplicate or possible duplicate. We now seek to understand the conditional accuracy as…
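A minimal sketch of the calculation, assuming the 2% duplicate rate is the prior and the 99% accuracy applies symmetrically to both classes (i.e., the algorithm is right 99% of the time whether the pair is a duplicate or not; the article's exact error model may differ):

```python
# Bayes' theorem applied to the duplicate-detection scenario above.
# Assumption (not stated in the text): a single 99% accuracy figure is
# used as both the true positive rate and the true negative rate.

def posterior_duplicate(prevalence: float, accuracy: float) -> float:
    """P(true duplicate | pair flagged as suspect)."""
    true_flags = accuracy * prevalence              # P(flag | dup) * P(dup)
    false_flags = (1 - accuracy) * (1 - prevalence)  # P(flag | non-dup) * P(non-dup)
    return true_flags / (true_flags + false_flags)

p = posterior_duplicate(prevalence=0.02, accuracy=0.99)
print(f"P(duplicate | flagged) = {p:.3f}")  # ~0.669
```

Even with a 99%-accurate algorithm, only about two-thirds of flagged suspects are true duplicates; roughly a third are false positives, because non-duplicate pairs vastly outnumber duplicates.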