Deduplication Logic

How Element451 determines duplicates by comparing and scoring fields.

Written by Ardis Kadiu

Overview

This article outlines how Element451 searches for and scores possible duplicate records.


Duplicate Search Trigger

Updating any of the following student data fields triggers an automatic search for possible duplicates:

  • First Name

  • Last Name

  • Email

  • Phone

  • Addresses

  • Identities

  • Date of Birth
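As a rough illustration of the trigger, the check can be thought of as a set intersection between the fields that changed and the fields listed above. This is only a sketch: the field keys and the function name `should_search_for_duplicates` are hypothetical and are not part of Element451's API.

```python
# Hypothetical field keys that mirror the list above; the real internal
# field names are not documented in this article.
TRIGGER_FIELDS = frozenset({
    "first_name",
    "last_name",
    "email",
    "phone",
    "addresses",
    "identities",
    "date_of_birth",
})


def should_search_for_duplicates(changed_fields) -> bool:
    """Return True if at least one updated field is a trigger field."""
    return not TRIGGER_FIELDS.isdisjoint(changed_fields)
```

For example, an update that touches only a hypothetical `major` field would not start a search, while an update that touches `email` would.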


Step 1: Initial Search

The initial search looks at first name, last name, and email.

  • It searches for the exact first name and all known variations and nicknames of that name.

  • It searches only for the email prefix (the part before the "@"), not the "@domain.com" portion.

The initial search returns a list of potential duplicates with scores based on similarity to the record's name and email.
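The two search inputs described above can be sketched as follows. This is a minimal illustration, not Element451's implementation: the nickname table is a made-up sample (the actual variation list is not published), and lowercasing the email prefix is an assumption.

```python
# Hypothetical sample of a nickname table; the real list is not documented.
NICKNAMES = {
    "william": {"will", "bill", "billy"},
    "elizabeth": {"liz", "beth", "eliza"},
}


def email_prefix(email: str) -> str:
    """Return the part of the address before the '@' (lowercased by assumption)."""
    return email.strip().lower().split("@", 1)[0]


def name_variations(first_name: str) -> set[str]:
    """Return the name itself plus any known formal or nickname variations."""
    name = first_name.strip().lower()
    variations = {name}
    for formal, nicks in NICKNAMES.items():
        if name == formal or name in nicks:
            variations |= {formal} | nicks
    return variations
```

With this sketch, a search for "Bill" would also consider "William", "Will", and "Billy", and "jane.doe@school.edu" would be searched as "jane.doe".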


Step 2: Logic-Based Scoring

Element451 then calculates a probability that the two records are duplicates by applying logic based on flags and bonuses from other matching fields.

Flags

As Element identifies patterns in the duplicates that enter Element451, flags ensure that known duplicate patterns with specific matching fields are recorded at a set probability. Each field is flagged using a simple binary score:

  • Field matches: 1

  • Field does not match: 0

Bonuses

Each field is compared individually, and a bonus is added to the initial match score for every field that matches. The size of the bonus depends on how strong an indicator of a duplicate that field is.

  • Date of Birth: 20%

  • Address: 30%

  • Phone: 30%

  • Email: 15%

Matching first and last names alone will not flag duplicates. Additional matches are needed: email, date of birth, address, or phone number.
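To make the bonus values concrete, here is a small sketch that totals the bonuses for a set of matching fields. It assumes the bonuses are simply added together as percentage points, which the article implies but does not spell out; the names `BONUS_POINTS`, `bonus_total`, and `has_supporting_match` are hypothetical.

```python
# Bonus values from the list above, in percentage points.
BONUS_POINTS = {
    "date_of_birth": 20,
    "address": 30,
    "phone": 30,
    "email": 15,
}


def bonus_total(matched_fields) -> int:
    """Sum the bonuses for every matched field (additive by assumption)."""
    return sum(BONUS_POINTS[f] for f in matched_fields if f in BONUS_POINTS)


def has_supporting_match(matched_fields) -> bool:
    """A name match alone is not enough: at least one bonus field must match."""
    return any(f in BONUS_POINTS for f in matched_fields)
```

Under these assumptions, a pair that matches on date of birth and phone would receive 50 bonus points, while a pair that matches on name only would receive none and would not be flagged.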

Field Matching, Normalization, and Comparison

  • Person Name

    • All names are made lowercase.

    • Leading and trailing spaces are removed.

    • A list of name parts is made based on spaces or hyphens.

    • First and last names are compared separately. If any of the first name parts from one record exist in the other record's first name parts, the first name is a match. The same check is applied to the last name parts. If both conditions are positive, the name is considered a match.

  • Email

    • Emails from both records are stripped so that only the prefix before “@” is compared. Because these records have already passed the initial search, this round checks only for an exact match.

  • Identity

    • If either record has an “EMAIL” identity, Element compares it against the other record's primary email address.

  • Date of Birth

    • The day, month, and year of the records' dates of birth are compared for an exact match.

  • Addresses

    • Element uses a third-party API to normalize each address's “Street 1” field.

    • If there are multiple addresses, Element checks if any of the first record's normalized addresses match any of the second record's normalized addresses exactly.

  • Phone

    • Using the clean 7-digit number only, if any of the first record's phone numbers match any of the second record's phone numbers exactly, it is a positive match.

  • Relationships

    • If the first and second records are related, they will not be considered duplicates.

  • Special Cases

    • When the last names match but the other fields do not, the second record is considered a potential family member.
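The name and phone rules above can be sketched in a few lines. This is an illustration under stated assumptions, not Element451's code: the function names are hypothetical, and the "clean 7-digit number" is assumed here to mean the last seven digits after all non-digit characters are removed.

```python
import re


def name_parts(name: str) -> set[str]:
    """Lowercase, trim, and split a name on spaces or hyphens."""
    cleaned = name.strip().lower()
    return {part for part in re.split(r"[\s-]+", cleaned) if part}


def names_match(first_a: str, last_a: str, first_b: str, last_b: str) -> bool:
    """First and last names are compared separately; both must share a part."""
    first_ok = bool(name_parts(first_a) & name_parts(first_b))
    last_ok = bool(name_parts(last_a) & name_parts(last_b))
    return first_ok and last_ok


def seven_digit_number(phone: str) -> str:
    """Strip non-digits and keep the last seven digits (an assumption)."""
    return re.sub(r"\D", "", phone)[-7:]


def phones_match(phones_a, phones_b) -> bool:
    """True if any number on one record equals any number on the other."""
    a = {d for p in phones_a if len(d := seven_digit_number(p)) == 7}
    b = {d for p in phones_b if len(d := seven_digit_number(p)) == 7}
    return bool(a & b)
```

In this sketch, "Mary-Ann Smith" and "mary Smith Jones" count as a name match because they share a first name part ("mary") and a last name part ("smith"). Numbers with fewer than seven digits are ignored so that incomplete entries cannot match each other.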


Step 3: Calculating Probability

Once bonuses are added, Element451 compares the new score to the maximum score to calculate the probability that the pair is a duplicate.

If the match probability is higher than 70%, the pair is treated as a duplicate and is displayed in the Deduplication Module.
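The probability step can be pictured as a simple ratio followed by a threshold check. The sketch below assumes the probability is the score divided by the maximum score; the exact formula, the function names, and the sample numbers are illustrative assumptions. Only the 70% threshold comes from this article.

```python
DUPLICATE_THRESHOLD = 0.70  # "higher than 70%" per the article


def match_probability(score: float, max_score: float) -> float:
    """Ratio of the pair's score to the maximum score (assumed formula)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return score / max_score


def is_duplicate(probability: float) -> bool:
    """A pair is surfaced only when the probability is strictly above 70%."""
    return probability > DUPLICATE_THRESHOLD
```

For instance, a hypothetical pair scoring 80 out of a maximum of 100 would have a probability of 0.8 and would appear in the Deduplication Module, while a pair at exactly 0.7 would not.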


Selection of a Master Record

When Element451 finds a possible duplicate, it performs a "smart" selection to determine which record should become the new "master" record. First, Element looks for recent activity, such as:

  • Logging in

  • Opening an email

  • Filling out a form

  • Registering for an event

Then, Element will prioritize the record without hard email bounces or unsubscribe milestones.
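One way to picture this selection is as a sort over the candidate records. The sketch below is an interpretation, not the actual algorithm: it assumes the most recent activity wins first and that a clean email history (no hard bounce, no unsubscribe) breaks ties. The `Record` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Record:
    record_id: str
    last_activity: Optional[datetime]  # login, email open, form, event, etc.
    has_hard_bounce: bool = False
    has_unsubscribed: bool = False


def choose_master(records: list[Record]) -> Record:
    """Pick the record with the most recent activity, then the cleanest email."""
    def priority(rec: Record):
        has_activity = rec.last_activity is not None
        activity_time = rec.last_activity or datetime.min
        clean_email = not (rec.has_hard_bounce or rec.has_unsubscribed)
        return (has_activity, activity_time, clean_email)

    return max(records, key=priority)
```

If two records share the same last activity time, this sketch prefers the one without a hard bounce or unsubscribe. How Element451 weighs activity against email status in practice is not specified in this article.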
