Overview
This article outlines how Element451 searches for and scores possible duplicate records. For information on how to use the Deduplication Module, read this.
The Duplicate Identification Process
Element451 searches for and scores possible duplicates via these steps:
Updating any of the following student data fields triggers an automatic search for possible duplicates:
First Name
Last Name
Email
Phone
Addresses
Identities
Date of Birth
Element performs an initial search on the first name, last name, and email stored in the database. It searches for the exact first name as well as all variations and nicknames of that name. For email, it searches only for the email prefix, not the "@domain.com" portion. This initial search returns a list of potential duplicates, each scored by how similar its name and email are to those of the record in question.
The search scores each pair of matched records by how closely the two records match.
Element451 then calculates a probability that the two records are duplicates by applying logic based on flags and bonuses from other matching fields.
Flags: Each field is scored with a simple binary value.
1: Field matches
0: Field doesn’t match
Bonuses: Each field is compared individually, and each matching field adds a bonus to the pair's initial match score.
date of birth - 20%
address - 30%
phone - 30%
email - 15%
If the match probability is higher than 70%, the pair is recorded as a duplicate in the database and is displayed in the Deduplication Module.
Matching first and last names alone will not flag a duplicate. An additional match is needed on email, date of birth, address, or phone number.
Search Trigger
When any of the following fields is updated, Element451 initiates a duplicate search for that record:
First Name
Last Name
Email
Phone
Addresses
Identities
Date of Birth
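The trigger rule above can be sketched as a simple set check. This is only an illustration: the field keys and the function name are hypothetical stand-ins for the display labels listed above, not Element451's actual identifiers.

```python
# Hypothetical field keys standing in for the display labels in the list above.
TRIGGER_FIELDS = frozenset({
    "first_name", "last_name", "email", "phone",
    "addresses", "identities", "date_of_birth",
})


def should_search_for_duplicates(updated_fields):
    """Return True if at least one updated field is a trigger field."""
    return not TRIGGER_FIELDS.isdisjoint(updated_fields)


# An update that touches "email" would start a search; one that touches
# only a non-trigger field (here the made-up "major") would not.
print(should_search_for_duplicates({"email", "major"}))
print(should_search_for_duplicates({"major"}))
```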
Initial Search
The initial search is based on first name, last name, and email.
For better results, Element searches not only for the exact first name but also for all variations and nicknames of that name.
For email, Element only searches for the email prefix, not the "@domain.com."
The initial search returns a list of potential duplicates with scores based on similarity to the record's name and email.
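As a rough sketch of how the search terms described above might be assembled: the nickname table, the lowercasing, and all function names below are assumptions made for illustration. The article does not publish the actual list of name variations or how the search is implemented.

```python
# Hypothetical nickname table; the real list of name variations is not published.
NICKNAMES = {
    "william": {"will", "bill", "billy"},
    "elizabeth": {"liz", "beth", "eliza"},
}


def first_name_variations(first_name):
    """Return the name itself plus any known formal name and nicknames."""
    name = first_name.strip().lower()  # lowercasing is an assumption
    variations = {name}
    for formal, nicknames in NICKNAMES.items():
        if name == formal or name in nicknames:
            variations.add(formal)
            variations.update(nicknames)
    return variations


def email_prefix(email):
    """Keep only the part before "@", dropping the "@domain.com" portion."""
    return email.strip().lower().partition("@")[0]


def initial_search_terms(first_name, last_name, email):
    """Collect the terms the initial search would look for."""
    return {
        "first_names": first_name_variations(first_name),
        "last_name": last_name.strip().lower(),
        "email_prefix": email_prefix(email),
    }


terms = initial_search_terms("Bill", "Smith", "bill.smith@example.edu")
print(terms["email_prefix"])  # bill.smith
```

With this table, a record for "Bill" would also be searched under "william", "will", and "billy".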
Scoring
Element applies flags and bonuses to each potential duplicate to determine the probability that they are the same person.
Flags
As Element notices patterns in the duplicates that enter Element451, it can ensure that known duplicates with specific matching fields are recorded at a certain probability. Each field is flagged with a simple binary value:
1: field matches
0: field doesn’t match
If the record and the potential duplicate match exactly on one of these known combinations of fields, the pair is recorded as a duplicate and reported with the related score in the Deduplication Module.
Bonuses
Each field is compared individually, and each matching field adds a bonus to the initial match score. The size of the bonus depends on how strong an indicator that field is.
date of birth - 20%
address - 30%
phone - 30%
email - 15%
Calculating Probability
Once bonuses are added, Element compares the new score to the maximum score to calculate the probability that the pair is a duplicate. If the new score divided by the maximum score is higher than 0.7, the pair is recorded as a duplicate.
If the result is not higher than 0.7, the pair will not appear in the Deduplication Module.
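The calculation can be illustrated with a small sketch. The bonus values and the 0.7 threshold come from this article. Everything else is an assumption: the article does not say how the initial score is scaled or exactly how bonuses combine with it, so the sketch treats the initial score as 0 to 100 points, adds each bonus as points, and takes the maximum score to be 100 plus all bonuses.

```python
# Bonus values from the article, expressed as points.
BONUS_POINTS = {"date_of_birth": 20, "address": 30, "phone": 30, "email": 15}
THRESHOLD = 0.7  # from the article

# Assumption: the initial search score is on a 0-100 scale.
MAX_INITIAL_SCORE = 100


def match_probability(initial_score, flags):
    """Divide the bonus-adjusted score by the maximum possible score.

    flags maps a field name to 1 (field matches) or 0 (field doesn't match).
    """
    new_score = initial_score + sum(
        points for field, points in BONUS_POINTS.items()
        if flags.get(field, 0) == 1
    )
    max_score = MAX_INITIAL_SCORE + sum(BONUS_POINTS.values())
    return new_score / max_score


def is_duplicate(initial_score, flags):
    """A pair counts as a duplicate only when the probability exceeds 0.7."""
    return match_probability(initial_score, flags) > THRESHOLD


# Under these assumptions, a strong name match alone stays below the threshold,
# while the same match plus date of birth and phone clears it (140 / 195).
print(is_duplicate(100, {}))
print(is_duplicate(90, {"date_of_birth": 1, "phone": 1}))
```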
Field Matching, Normalization, and Comparison
Field | Normalization | Comparison
Person Name | | First and last names are compared separately. If any of the first name parts from one record exist in the other record's first name parts, it is a match. If both conditions are positive, the name is considered a match.
Email | Emails from both records are stripped, so just the prefix before “@” is compared. | Since these records have already passed the initial search, Element only looks for an exact match in this round.
Identity | | If either record has an “EMAIL” identity, Element compares it against the other record's primary email address.
Date of Birth | | The day, month, and year of the records' dates of birth are compared for an exact match.
Addresses | Element uses a third-party API to normalize each address's “Street 1” field. | If there are multiple addresses, Element checks if any of the first record's normalized addresses match any of the second record's normalized addresses exactly.
Phone | Only the clean 7-digit number is used. | If any of the first record's phone numbers match any of the second record's phone numbers exactly, it is a positive match.
Relationships | | If the first and second records are related, they will not be considered duplicates.
Special Cases | | When the last name matches but other fields are mismatched, it is a potential family member.
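A few of the comparisons above can be sketched as small helper functions. These are illustrative only. The function names are made up, the last-name rule is assumed to mirror the first-name rule (the article does not spell it out), and the "clean 7-digit number" is assumed to mean the last seven digits of the phone number. Address normalization is left out because it relies on a third-party API that the article does not name.

```python
import re
from datetime import date


def name_parts(name):
    """Split a name into lowercase parts, e.g. "Mary Ann" -> {"mary", "ann"}."""
    return set(name.lower().split())


def names_match(first_a, last_a, first_b, last_b):
    # First names match when the two records share at least one name part.
    first_ok = bool(name_parts(first_a) & name_parts(first_b))
    # Assumption: last names follow the same shared-part rule.
    last_ok = bool(name_parts(last_a) & name_parts(last_b))
    return first_ok and last_ok


def email_prefixes_match(email_a, email_b):
    """Compare only the part before "@", requiring an exact match."""
    prefix_a = email_a.strip().lower().partition("@")[0]
    prefix_b = email_b.strip().lower().partition("@")[0]
    return prefix_a != "" and prefix_a == prefix_b


def clean_phone(phone):
    # Assumption: the clean 7-digit number is the last seven digits.
    return re.sub(r"\D", "", phone)[-7:]


def phones_match(phones_a, phones_b):
    """Positive match if any cleaned number appears on both records."""
    cleaned_a = {c for c in map(clean_phone, phones_a) if len(c) == 7}
    cleaned_b = {c for c in map(clean_phone, phones_b) if len(c) == 7}
    return bool(cleaned_a & cleaned_b)


def birth_dates_match(dob_a, dob_b):
    """Day, month, and year must all match; a missing date never matches."""
    return dob_a is not None and dob_a == dob_b


print(names_match("Mary Ann", "Lee", "Ann", "Lee"))
print(phones_match(["(843) 555-0199"], ["+1 843 555 0199"]))
print(birth_dates_match(date(2004, 5, 17), date(2004, 5, 17)))
```

Each helper returns a boolean, which maps directly onto the 1/0 flags described in the Scoring section.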