Deduplication Logic

How Element451 determines duplicates by comparing and scoring fields.

Written by Ardis Kadiu
Updated over a week ago

Overview

This article outlines how Element451 searches for and scores possible duplicate records. For information on how to use the Deduplication Module, see the help article for that module.


The Duplicate Identification Process

Element451 searches for and scores possible duplicates via these steps:

  1. Updating any of the following student data fields triggers an automatic search for possible duplicates:

    • First Name

    • Last Name

    • Email

    • Phone

    • Addresses

    • Identities

    • Date of Birth

  2. Element451 runs an initial search of the database based on first name, last name, and email. It searches for the exact first name as well as variations of that name and its nicknames. For email, it searches only for the prefix (the part before the "@"), not the domain. This initial search returns a list of potential duplicates, each scored by how similar its name and email are to those of the record in question.

  3. The search scores each pair of matched records based on their proximity.

  4. Element451 then calculates a probability that the two records are duplicates by applying logic based on flags and bonuses from other matching fields.

    • Flags: A scoring system based on simple binary options is used.

      • 1: Field matches

      • 0: Field doesn’t match

    • Bonuses: Each field is compared individually, and every matching field adds a bonus to the pair's initial match score.

      • date of birth - 20%

      • address - 30%

      • phone - 30%

      • email - 15%

  5. If the match probability is higher than 70%, the pair is recorded as a duplicate in the database and displayed in the Deduplication Module.

Matching first and last names alone won’t flag a pair as duplicates. At least one additional match is needed: email, date of birth, address, or phone number.


Search Trigger

When specific fields are updated, Element451 initiates a duplicate search for that record.

A duplicate search is triggered by updating the following fields:

  • First Name

  • Last Name

  • Email

  • Phone

  • Addresses

  • Identities

  • Date of Birth
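The trigger rule above can be sketched as a simple membership check. This is a minimal illustration, not Element451's implementation: the field keys and the function name `triggers_duplicate_search` are hypothetical labels for the fields listed above.

```python
# Hypothetical keys standing in for the trigger fields listed in the article.
TRIGGER_FIELDS = {
    "first_name",
    "last_name",
    "email",
    "phone",
    "addresses",
    "identities",
    "date_of_birth",
}


def triggers_duplicate_search(updated_fields):
    """Return True if at least one updated field is a trigger field."""
    return not TRIGGER_FIELDS.isdisjoint(updated_fields)
```

For example, an update that touches only a non-listed field (such as a hypothetical "major" field) would not start a search, while an update to "email" would.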


Initial Search

The initial search is based on first name, last name, and email.

For better results, Element searches not only for the exact first name but also for all variations of that name and nicknames.

For email, Element searches only for the prefix (the part before the "@"), not the domain.

The initial search returns a list of potential duplicates with scores based on similarity to the record's name and email.
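As a rough sketch of how the search terms for this step might be assembled, the example below expands a first name into its variations and reduces an email to its prefix. The nickname table, the function names, and the returned dictionary shape are all assumptions made for illustration; the article does not document the actual list of name variations or the query format.

```python
# Hypothetical nickname table; the real list of variations is not documented.
NICKNAMES = {
    "william": {"will", "bill", "billy"},
    "elizabeth": {"liz", "beth", "eliza"},
}


def first_name_variants(first_name):
    """Return the exact name plus any known variations and nicknames."""
    name = first_name.strip().lower()
    variants = {name}
    variants |= NICKNAMES.get(name, set())
    # Map a nickname back to its formal name and sibling nicknames.
    for formal, nicks in NICKNAMES.items():
        if name in nicks:
            variants.add(formal)
            variants |= nicks
    return variants


def build_search_terms(first_name, last_name, email):
    """Collect the terms used by the initial search (illustrative shape)."""
    return {
        "first_names": first_name_variants(first_name),
        "last_name": last_name.strip().lower(),
        # Only the part before "@" is searched; the domain is ignored.
        "email_prefix": email.strip().partition("@")[0],
    }
```

Under these assumptions, a record for "Bill Smith" with the email "bsmith@example.edu" would be searched under "bill", "will", "billy", and "william", with the email prefix "bsmith".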


Scoring

Element applies flags and bonuses to each potential duplicate to determine the probability that they are the same person.


Flags

As Element observes patterns in the duplicates that enter Element451, it can ensure that known duplicates with specific matching fields are recorded at a certain probability. Flags use a simple binary scoring system:

1: field matches

0: field doesn’t match

If the record and the potential duplicate have exact matches in any of these particular combinations of fields, they are recorded as duplicates, and they will be reported with the related score in the Deduplication module.


Bonuses

In this system, each field is compared individually, and every matching field adds a bonus to the pair's initial match score. Different bonuses are applied depending on how strong an indicator that field is.

  • date of birth - 20%

  • address - 30%

  • phone - 30%

  • email - 15%


Calculating Probability

Once bonuses are added, Element compares the new score to the maximum score to calculate the probability that the pair is a duplicate. If the new score divided by the maximum score is higher than 0.7, the pair is recorded as a duplicate.

If the result is not higher than 0.7, the pair will not appear in the Deduplication Module.
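The calculation can be illustrated with a small sketch. The bonus percentages and the 0.7 threshold come from this article; everything else is an assumption made so the arithmetic can be shown: that the initial search score is on a 0 to 100 scale, that bonuses are added to it as percentage points, and that the maximum score is the top initial score plus all four bonuses. The actual scale and combination rule are not documented here.

```python
# Bonus weights from the article, expressed as percentage points.
BONUSES = {"date_of_birth": 20, "address": 30, "phone": 30, "email": 15}
THRESHOLD = 0.7


def match_probability(initial_score, flags, initial_max=100):
    """Divide the bonus-adjusted score by the maximum possible score.

    initial_score: score from the initial search (ASSUMED 0..initial_max).
    flags: dict mapping a bonus field to 1 (match) or 0 (no match).
    """
    new_score = initial_score + sum(
        BONUSES[field] for field, flag in flags.items() if flag == 1
    )
    # ASSUMPTION: the maximum is a perfect initial score plus every bonus.
    max_score = initial_max + sum(BONUSES.values())
    return new_score / max_score


def is_duplicate(initial_score, flags):
    """A pair is recorded only when the probability is higher than 0.7."""
    return match_probability(initial_score, flags) > THRESHOLD
```

Under these assumptions, an initial score of 90 with matching date of birth and phone gives 140 / 195, or about 0.72, which clears the threshold, while a perfect name score with no other matching field gives 100 / 195, or about 0.51, which does not.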

Fields Matching, Normalization, and Comparison

Person Name

Normalization

  1. All names are made lowercase.

  2. Leading and trailing spaces are removed.

  3. A list of name parts is made by splitting on spaces or hyphens.

Comparison

First and last names are compared separately. If any of the first name parts from one record exist in the other record's first name parts, the first name is a match, and the same check is applied to the last name parts. If both conditions are positive, the name is considered a match.
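The normalization and comparison steps for names can be sketched as follows. The function names are hypothetical, and applying the same "any shared part" rule to last names is an assumption drawn from the phrase "both conditions" above.

```python
import re


def name_parts(name):
    """Lowercase, trim, and split a name on spaces or hyphens."""
    cleaned = name.strip().lower()
    return {part for part in re.split(r"[\s-]+", cleaned) if part}


def names_match(first_a, last_a, first_b, last_b):
    """Match when first names share a part AND last names share a part."""
    first_ok = bool(name_parts(first_a) & name_parts(first_b))
    # ASSUMPTION: last names use the same shared-part rule as first names.
    last_ok = bool(name_parts(last_a) & name_parts(last_b))
    return first_ok and last_ok
```

With this sketch, "Mary-Anne Smith" and "Anne Smith Jones" match because "anne" and "smith" are shared parts, while "Mary Smith" and "Anne Smith" do not because the first names share nothing.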

Email

The domain is stripped from both records' emails so that only the prefix before “@” is compared. Because these records have already passed the initial search, this round checks only for an exact match.

Identity

If either record has an “EMAIL” identity, Element compares it against the other record's primary email address.
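A minimal sketch of the email and identity checks is shown below. The record shape (a dict with an "email" key and an "identities" list of dicts with "type" and "value" keys) is hypothetical, and comparing the identity against the full primary address rather than only its prefix is an assumption, since the article does not say which is used.

```python
def emails_match(email_a, email_b):
    """Compare only the part before "@" for an exact match."""
    prefix_a = email_a.strip().partition("@")[0]
    prefix_b = email_b.strip().partition("@")[0]
    return prefix_a == prefix_b


def identity_matches(record_a, record_b):
    """Check each record's EMAIL identities against the other's primary email.

    ASSUMPTION: records are dicts shaped like
    {"email": str, "identities": [{"type": str, "value": str}]}.
    """
    for one, other in ((record_a, record_b), (record_b, record_a)):
        primary = other.get("email")
        if primary is None:
            continue
        for identity in one.get("identities", []):
            if identity.get("type") == "EMAIL" and identity.get("value") == primary:
                return True
    return False
```

The check runs in both directions because the article says the comparison applies if either record has an “EMAIL” identity.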

Date of Birth

The day, month, and year of the records' dates of birth are compared for an exact match.

Addresses

Element uses a third-party API to normalize each address's “Street 1” field.

If there are multiple addresses, Element checks if any of the first record's normalized addresses match any of the second record's normalized addresses exactly.
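The any-to-any address comparison can be sketched like this. The third-party normalization API is not named in this article, so `normalize_street` below is only a stand-in that lowercases the text and collapses whitespace; a real address normalizer would do considerably more.

```python
def normalize_street(street_1):
    """Stand-in for the third-party normalization of the "Street 1" field."""
    return " ".join(street_1.lower().split())


def addresses_match(streets_a, streets_b):
    """Match if any normalized address of one record equals any of the other's."""
    normalized_a = {normalize_street(street) for street in streets_a}
    normalized_b = {normalize_street(street) for street in streets_b}
    return not normalized_a.isdisjoint(normalized_b)
```

Using sets keeps the comparison exact while avoiding a nested loop over every pair of addresses.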

Phone

Element compares phone numbers using only the clean 7-digit number. If any of the first record's phone numbers match any of the second record's phone numbers exactly, it is a positive match.
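The phone comparison might look like the sketch below. The article does not define how the "clean 7-digit number" is derived, so this example assumes it means removing every non-digit character and keeping the last seven digits, which drops any country or area code. Numbers with fewer than seven digits are ignored here, which is also an assumption.

```python
import re


def clean_7_digits(phone):
    """Strip non-digits and keep the last seven digits (ASSUMED rule)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-7:] if len(digits) >= 7 else None


def phones_match(phones_a, phones_b):
    """Match if any cleaned number of one record equals any of the other's."""
    cleaned_a = {clean_7_digits(phone) for phone in phones_a} - {None}
    cleaned_b = {clean_7_digits(phone) for phone in phones_b} - {None}
    return bool(cleaned_a & cleaned_b)
```

Note that under this rule two numbers with different area codes but the same last seven digits are treated as a match.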

Relationships

If the first and second records are related, they will not be considered duplicates.

Special Cases

When the last name matches but the other fields do not, the matched record is considered a potential family member.
