The Silent Duplication Problem in Data Vault 2.0 Satellites

When implementing Data Vault modeling, ensuring data integrity in satellites is critical. Satellites store historical descriptive attribute data for hubs or links, and each row is uniquely identified by the combination of a hash key (hk) and a load timestamp (load_dts). Under normal circumstances, inserting multiple records with the same hash key and load timestamp in a single load is not possible, as this would violate the primary key constraint.
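
For reference, a minimal sketch of what such a satellite table might look like in PostgreSQL is shown below. The table and column names match the load query discussed later in this article, while the data types and the exact DDL are assumptions made for illustration.

-- Assumed DDL sketch of the satellite used in the examples in this article.
-- The composite primary key on (hk, load_dts) is what prevents two rows with
-- the same hash key from being inserted with the same load timestamp.
CREATE TABLE dv.sh_employee_crm_info (
    hk          CHAR(32)     NOT NULL,  -- hash key of the parent hub
    load_dts    TIMESTAMP    NOT NULL,  -- load timestamp (UTC)
    src         VARCHAR(50)  NOT NULL,  -- record source
    hdiff       CHAR(32)     NOT NULL,  -- hash of the descriptive attributes
    location    VARCHAR(100),
    first_name  VARCHAR(100),
    last_name   VARCHAR(100),
    PRIMARY KEY (hk, load_dts)
);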

The silent duplicate record issue

Despite constraints, a subtle issue arises when multiple rows in staging tables (Level 2 staging) contain the same business key (resulting in identical hash keys) but different hash diffs (hdiff). A typical load query, recommended by Dan Linstedt and Michael Olschimke in the book Building a Scalable Data Warehouse with Data Vault 2.0, might look like this:

INSERT INTO dv.sh_employee_crm_info
SELECT DISTINCT stg.a_hk AS hk,
       (now()) AT TIME ZONE 'utc' AS load_dts,
       stg.src,
       stg.b_hdiff AS hdiff,
       stg.location,
       stg.first_name,
       stg.last_name
FROM stg.crm_employees_l2 stg
-- Join each staging row to the most recent satellite record for its hash key
LEFT JOIN dv.sh_employee_crm_info sat ON stg.a_hk = sat.hk AND sat.load_dts = (
    SELECT MAX(z.load_dts)
    FROM dv.sh_employee_crm_info z
    WHERE sat.hk = z.hk
)
-- Insert rows whose hash key is new or whose hashdiff differs from the latest entry
WHERE sat.hk IS NULL OR stg.b_hdiff <> sat.hdiff;

The above query can unintentionally alternate between inserting and excluding different records when multiple distinct hashdiff entries exist for a single hash key.

This behavior is due to the WHERE condition in the query, which compares the current hashdiff with the last inserted one. This condition is necessary to avoid inserting a new record when no descriptive attribute has changed.
However, when two distinct hashdiffs exist in staging for the same hash key, the record whose hashdiff matches the most recently inserted satellite entry is filtered out by the condition, and only the other record is inserted. As a result, each load flips which record gets inserted, and because only one row per hash key is inserted at a time, the unique constraint on (hash key, load timestamp) is never violated.
Importantly, this issue occurs only if one record is initially loaded alone, and the second appears in a subsequent load. If both records appear simultaneously in staging for a new hash key, the insertion will fail immediately due to a duplicate key constraint violation.

Example of problematic data in staging:

a_hk  | b_hdiff | location | first_name | last_name
------|---------|----------|------------|----------
HK123 | HDIFF1  | Brussels | Alice      | Smith
HK123 | HDIFF2  | Brussels | Alice      | Smythe

In this example, two distinct records share the same hash key (HK123) but differ in their hashdiff (HDIFF1 and HDIFF2). Because the load query compares each staging row’s hashdiff only against the most recently inserted satellite record, the two records are alternately inserted and excluded: once one record has been inserted, the other now differs from the current satellite entry and is inserted in the next load, at which point the originally inserted record is excluded again. This cycle repeats with each subsequent load.

Example of resulting data in satellite:

hk    | load_dts            | hdiff  | location | first_name | last_name
------|---------------------|--------|----------|------------|----------
HK123 | 2024-04-01 10:00:00 | HDIFF1 | Brussels | Alice      | Smith
HK123 | 2024-04-01 11:00:00 | HDIFF2 | Brussels | Alice      | Smythe
HK123 | 2024-04-01 12:00:00 | HDIFF1 | Brussels | Alice      | Smith
HK123 | 2024-04-01 13:00:00 | HDIFF2 | Brussels | Alice      | Smythe

Risks associated with duplicate record issues

Without any controls to detect or prevent this issue, the loading process will silently insert new records repeatedly for the same hash key with every load. When performing multiple data loads per day, especially with millions of records, this can significantly and unnecessarily inflate the database size. This silent issue not only increases storage costs but also complicates data analysis and reporting by creating redundant historical entries.

Identifying causes

Understanding the root causes behind duplicate records is essential for effectively addressing the issue. Two main causes may lead to such a scenario:

  1. Incorrect analysis and granularity:

    • The satellite’s granularity may not align with the intended business granularity. For example, if multiple valid descriptive records exist per business key (e.g., multiple phone numbers or addresses), this should be modeled using a multi-active satellite rather than a standard satellite (a sketch of this structure follows this list).

    • The issue can also arise when the business key is not well-defined during analysis, resulting in a non-unique business key identifying multiple records.

  2. Source system issues:

    • If duplicates appear unexpectedly, the source system may incorrectly generate duplicate rows. This warrants contacting data stewards or system administrators to investigate and resolve issues at the data origin.
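
For illustration, a multi-active satellite extends the primary key with an additional key attribute so that several active rows can legitimately share the same hash key and load timestamp. The sketch below follows the phone-number example from the list above; the table name (dv.mas_employee_phone), its columns, and the data types are hypothetical.

-- Assumed DDL sketch of a multi-active satellite for employee phone numbers.
-- The extra key attribute (a subsequence number here) is part of the primary key,
-- so several active rows per hash key and load timestamp can coexist without
-- violating the constraint.
CREATE TABLE dv.mas_employee_phone (
    hk           CHAR(32)    NOT NULL,  -- hash key of the parent hub
    load_dts     TIMESTAMP   NOT NULL,  -- load timestamp (UTC)
    sub_seq      INTEGER     NOT NULL,  -- distinguishes the multiple active rows
    src          VARCHAR(50) NOT NULL,  -- record source
    hdiff        CHAR(32)    NOT NULL,  -- hash of the descriptive attributes
    phone_number VARCHAR(30),
    PRIMARY KEY (hk, load_dts, sub_seq)
);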

Recommendations for prevention

Preventing duplicate record issues in Data Vault satellites requires proactive measures to detect, address, and manage data quality risks. Implementing the following best practices can significantly enhance data integrity and avoid complications:

  • Implement automated data quality checks: Create automated processes at the staging level (L2) that identify multiple distinct hashdiff values per business key or hash key before data reaches the satellites (see the SQL sketch after this list). However, be cautious about performance overhead and implement these checks efficiently.

  • Periodically monitor the number of records per hash key in each satellite: An abnormally high number of records, compared to the expected rate of change for that data, can be an early indicator of duplication issues or improperly defined business keys; the sketch after this list also includes an example of such a monitoring query.

  • Revisit the Data Vault model: Ensure your Data Vault model accurately reflects real-world business scenarios. If multiple descriptive values for a single business key are valid and common, employ multi-active satellites, which are designed explicitly for handling multiple active records per key.

  • Engage data stewards: Maintain open communication with data stewards. Data quality issues originating at the source should be addressed early to avoid propagating incorrect or duplicate data downstream.
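
As a concrete starting point, the first two checks above can be expressed as plain SQL against the staging and satellite tables used earlier in this article. This is a minimal sketch, not a complete data quality framework: the threshold in the second query is an assumption and should be tuned to your own load frequency and expected rate of change.

-- Check 1 (staging, L2): business keys arriving with more than one distinct
-- hashdiff in the same batch; these are the rows that start the alternation cycle.
SELECT stg.a_hk,
       COUNT(DISTINCT stg.b_hdiff) AS distinct_hdiffs
FROM stg.crm_employees_l2 stg
GROUP BY stg.a_hk
HAVING COUNT(DISTINCT stg.b_hdiff) > 1;

-- Check 2 (satellite): hash keys with an abnormally high number of history rows.
-- The threshold of 100 is an arbitrary example value.
SELECT sat.hk,
       COUNT(*) AS history_rows
FROM dv.sh_employee_crm_info sat
GROUP BY sat.hk
HAVING COUNT(*) > 100
ORDER BY history_rows DESC;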

By proactively managing these factors, you can maintain high-quality data integrity and efficiency in your Data Vault projects, preventing silent issues that degrade your data warehouse performance and accuracy.


Want to see how beVault can help?

If you’re looking for a solution that detects, prevents, and manages satellite duplication issues seamlessly, request a free demo of beVault. Our automation platform ensures Data Vault compliance and keeps your warehouse clean and audit-ready, without manual intervention.