The Search Process
Element451 finds potential duplicates by performing an initial search over name and email in the database. The search gives each pair of matched records a score based on how close the two records are.
Element451 then calculates a probability that the two records are duplicates by applying logic based on flags and bonuses from other matching fields.
If the match probability is higher than 70%, the pair is recorded as a duplicate in the database.
When one of the following fields is updated, we run that record through the deduplication process: first name, last name, email, phone, addresses, identities, and date of birth.
The initial search is based on first name, last name, and email.
For better results, we search not only for the exact first name but also for all variations of that name, including nicknames.
For email, we search only on the first part (before the @), not the @domain.com part.
The initial search returns a list of potential duplicates, each scored by how similar its name and email are to those of the record in question.
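The search keys described above can be sketched as follows. This is only an illustration: the NICKNAMES table, the lowercasing, and the function names are assumptions, since the article does not describe how Element451 stores name variations or builds its search query.

```python
# Hypothetical nickname table; the real list of name variations is not
# described in this article.
NICKNAMES = {
    "william": {"will", "bill", "billy"},
    "elizabeth": {"liz", "beth", "eliza"},
}


def name_variations(first_name: str) -> set[str]:
    """Return the name itself plus any known variations and nicknames."""
    name = first_name.strip().lower()
    variations = {name}
    for canonical, nicknames in NICKNAMES.items():
        if name == canonical or name in nicknames:
            variations |= {canonical} | nicknames
    return variations


def email_local_part(email: str) -> str:
    """Return the part of the email before the @ (assumed case-insensitive)."""
    return email.strip().lower().split("@", 1)[0]
```

With this sketch, a search for "Bill" would also look for "William", "Will", and "Billy", and "Jane.Doe@example.com" would be searched as "jane.doe".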
We apply flags and bonuses to each of the potential duplicates to determine the probability that they are the same person.
As we notice patterns in the duplicates that enter Element451, we have an opportunity to ensure that known duplicates with certain matching fields are recorded at a certain probability. We created a scoring system based on simple binary options. 1: field matches, 0: field doesn’t match. If the record and the potential duplicate have exact matches in any of these particular combinations of fields, we can be sure that they are duplicates and they will be reported with the related score.
In this system, we compare each field one-by-one and add bonuses to their initial match score. Different bonuses are applied depending on how strong of an indicator that field is.
date of birth - 20%
address - 30%
phone - 30%
email - 15%
Once we add those bonuses, we compare the new score to the maximum score in order to calculate the probability that the pair are duplicates. If the new score divided by the maximum score is higher than 0.7, we record that as a duplicate pair.
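The bonus calculation can be illustrated with a short sketch. The article does not state the scale of the initial score or how the maximum score is defined, so this example assumes the initial score runs from 0 to 1 and the maximum score is the best initial score plus every bonus. All names here are hypothetical.

```python
# Bonuses from the list above, expressed as fractions.
BONUSES = {"date_of_birth": 0.20, "address": 0.30, "phone": 0.30, "email": 0.15}
DUPLICATE_THRESHOLD = 0.7


def match_probability(initial_score: float, matches: dict[str, bool],
                      max_initial_score: float = 1.0) -> float:
    """Add the bonus for each matching field, then divide by the maximum score.

    Assumption: the maximum score is the highest possible initial score
    plus all bonuses. The real definition may differ.
    """
    bonus = sum(value for field, value in BONUSES.items()
                if matches.get(field, False))
    max_score = max_initial_score + sum(BONUSES.values())
    return (initial_score + bonus) / max_score


def is_duplicate(initial_score: float, matches: dict[str, bool]) -> bool:
    """A pair is recorded as a duplicate when the probability exceeds 0.7."""
    return match_probability(initial_score, matches) > DUPLICATE_THRESHOLD
```

Under these assumptions, a pair with an initial score of 0.9 and matching date of birth, address, and phone clears the threshold, while the same pair with only a matching email does not.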
Name: normalization and comparison.
We compare first and last names separately. If any of the first name parts from one record exists in the other record's first name parts, the first name is a match. The name as a whole is considered a match only if both the first name and the last name match.
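A minimal sketch of this comparison is shown below. It assumes names are split into parts on spaces and hyphens, compared case-insensitively, and that last names use the same part-based rule as first names. None of these details are confirmed by the article.

```python
import re


def name_parts(name: str) -> set[str]:
    """Split a name into lowercase parts on whitespace and hyphens (assumed)."""
    return {part for part in re.split(r"[\s\-]+", name.strip().lower()) if part}


def names_match(first_a: str, last_a: str, first_b: str, last_b: str) -> bool:
    """The name matches only when both first and last names share a part."""
    first_ok = bool(name_parts(first_a) & name_parts(first_b))
    last_ok = bool(name_parts(last_a) & name_parts(last_b))
    return first_ok and last_ok
```

For example, "Mary Ann Smith" and "Mary Smith" would match because they share the first name part "mary" and the last name "smith", while "Mary Smith" and "Mary Jones" would not.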
Emails from both records are stripped, so just the first parts before “@” are compared. Since these records have already passed the initial search, in this round, we are just searching for an exact match.
If either record has an “EMAIL” identity, we compare it against the other record's primary email address.
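The email check might look like the sketch below. It assumes that “EMAIL” identities are also compared by the part before the “@”, and that empty values never count as a match. The article does not say either way, and the function and parameter names are hypothetical.

```python
def local_part(email: str) -> str:
    """Return the part of the email before the @, lowercased (assumed)."""
    return email.strip().lower().split("@", 1)[0]


def emails_match(primary_a: str, primary_b: str,
                 identities_a: tuple[str, ...] = (),
                 identities_b: tuple[str, ...] = ()) -> bool:
    """Exact match on local parts, with EMAIL identities checked
    against the other record's primary email."""
    local_a, local_b = local_part(primary_a), local_part(primary_b)
    if local_a and local_a == local_b:
        return True
    # Identities of record A against the primary email of record B.
    if local_b and any(local_part(e) == local_b for e in identities_a):
        return True
    # Identities of record B against the primary email of record A.
    return bool(local_a) and any(local_part(e) == local_a for e in identities_b)
```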
Date of birth
We compare the day, month, and year of the records' dates of birth to see if they are exact matches.
Addresses: normalization and comparison.
We use a third-party API to normalize the “Street 1” field in each address.
If there are multiple addresses, we check if any of the first record's normalized addresses match any of the second record's normalized addresses exactly.
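The any-to-any address comparison can be sketched as below. The third-party normalization API is not named in this article, so normalize_street is only a local stand-in with a small, hypothetical abbreviation table. A real address service would do far more.

```python
import re

# Stand-in abbreviation table; the real normalization is done by a
# third-party API that this article does not name.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}


def normalize_street(street_1: str) -> str:
    """Lowercase, drop punctuation, and abbreviate common street suffixes."""
    words = re.sub(r"[^\w\s]", "", street_1.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def addresses_match(streets_a: list[str], streets_b: list[str]) -> bool:
    """True if any normalized address of one record equals any of the other's."""
    normalized_a = {normalize_street(s) for s in streets_a} - {""}
    normalized_b = {normalize_street(s) for s in streets_b} - {""}
    return bool(normalized_a & normalized_b)
```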
Using the clean 7-digit number only, if any of the first record's phone numbers match any of the second record's phone numbers exactly, it is a positive match.
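As an illustration of the phone check, the sketch below assumes the "clean 7-digit number" is the last seven digits left after stripping every non-digit character, and that numbers with fewer than seven digits are ignored. Both points are assumptions, not documented behavior.

```python
import re


def clean_phone(phone: str) -> str:
    """Keep digits only and return the last seven (assumed interpretation)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-7:] if len(digits) >= 7 else ""


def phones_match(phones_a: list[str], phones_b: list[str]) -> bool:
    """True if any cleaned number of one record equals any of the other's."""
    cleaned_a = {clean_phone(p) for p in phones_a} - {""}
    cleaned_b = {clean_phone(p) for p in phones_b} - {""}
    return bool(cleaned_a & cleaned_b)
```

Under this interpretation, "(864) 555-0142" and "+1 864.555.0142" would both reduce to the same seven digits and count as a match.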
If the first record and the second record are related, they will not be considered duplicates.
When the last name matches but the other fields do not, the other record is considered a potential family member.