Loading the Truth Set

This page covers loading the Senzing truth set demo data, taking a snapshot of the entity resolution results, and running an audit against the expected outcomes.

Step 1: Install Senzing and load the truth set data

Choose an installation method and follow the instructions through the data loading step. The quickstart guides cover installing Senzing, configuring data sources, and loading the truth set data files (customers.jsonl, reference.jsonl, and watchlist.jsonl).

Step 2: Download the audit key files

The truth set key files define which records are expected to resolve to the same entity. They are stored in the Senzing Truth Sets repository on GitHub, alongside the truth set data:

wget https://raw.githubusercontent.com/Senzing/truth-sets/main/truthsets/demo/actual_truthset_key.csv
wget https://raw.githubusercontent.com/Senzing/truth-sets/main/truthsets/demo/alternate_truthset_key.csv

Key file descriptions

actual_truthset_key.csv
  Maps record IDs to expected entity groupings. Each row pairs a data source and record ID with a cluster ID representing the “true” entity. When two records share the same cluster ID, they should resolve to the same entity. This file represents the definitive correct groupings.

alternate_truthset_key.csv
  Simulated results from a legacy or competing algorithm, used to demonstrate ER auditing. The audit compares the current snapshot against this file to identify where the two approaches agree and differ.

Key file format

Both key files use a simple CSV format with three columns:

CLUSTER_ID
  The entity/cluster identifier. Records sharing the same CLUSTER_ID were resolved to the same entity. These values do not need to match Senzing entity IDs.
RECORD_ID
  The source record identifier.
DATA_SOURCE
  The data source the record came from. Required when multiple data sources are present.
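The format can be consumed directly with standard CSV tooling. Here is a minimal Python sketch that loads a key file and groups records by CLUSTER_ID (the helper name load_truth_key is illustrative, not part of any Senzing API):

```python
import csv
from collections import defaultdict

def load_truth_key(path):
    """Group (DATA_SOURCE, RECORD_ID) pairs by CLUSTER_ID from a key CSV."""
    clusters = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            clusters[row["CLUSTER_ID"]].add((row["DATA_SOURCE"], row["RECORD_ID"]))
    return dict(clusters)
```

Records that end up in the same set are expected to resolve to the same entity.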

Alternate key design

The alternate key was designed with a different matching philosophy than Senzing’s defaults, introducing two deliberate differences:

  • More aggressive name matching: The alternate algorithm treats close name variants with matching date of birth as definitive matches, even when Senzing considers them only possible matches. This reflects an approach that prioritizes recall over precision for name similarity.
  • No employer-based matching: The alternate algorithm does not use employer as a matching feature. Where Senzing merges records based on name and employer, the alternate algorithm keeps them separate. This is common in algorithms that view employer as too volatile or unreliable for identity resolution.

These differences produce the specific MERGE and SPLIT discrepancies shown in the audit results.
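Conceptually, a MERGE or SPLIT discrepancy comes from comparing which record pairs each clustering merges. The sketch below illustrates that idea only; it is not sz_audit's implementation, and the function names and record labels are invented:

```python
from collections import defaultdict
from itertools import combinations

def shared_pairs(assignment):
    """All unordered record pairs placed in the same cluster."""
    clusters = defaultdict(list)
    for record, cluster_id in assignment.items():
        clusters[cluster_id].append(record)
    return {frozenset(pair)
            for members in clusters.values()
            for pair in combinations(sorted(members), 2)}

def merge_split_discrepancies(snapshot, key):
    """MERGE: pairs merged only in the snapshot; SPLIT: pairs merged only in the key."""
    snap, truth = shared_pairs(snapshot), shared_pairs(key)
    return {"MERGE": snap - truth, "SPLIT": truth - snap}
```

For example, if the snapshot merges records A and B (say, on name and employer) while the key instead merges B and C, the pair (A, B) surfaces as a MERGE discrepancy and (B, C) as a SPLIT discrepancy.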

Creating a custom key

To generate an alternate key from another ER system, query its results database for the cluster-to-record mapping:

SELECT cluster_id, record_id, data_source
FROM entity_resolution_results
ORDER BY cluster_id, data_source, record_id;

If all records come from a single data source, the DATA_SOURCE column can be omitted.
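Wrapping that query in a small export script might look like the following sketch, assuming a SQLite results database with the table and column names used above (adjust the connection and query for your actual ER system):

```python
import csv
import sqlite3

def export_key(db_path, out_path):
    """Export a cluster-to-record mapping as an alternate key CSV (SQLite assumed)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT cluster_id, record_id, data_source "
        "FROM entity_resolution_results "
        "ORDER BY cluster_id, data_source, record_id"
    )
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["CLUSTER_ID", "RECORD_ID", "DATA_SOURCE"])
        writer.writerows(rows)
    con.close()
```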

To learn more about creating truth sets from real data, see How to create an entity resolution truth set.

Step 3: Take a snapshot

sz_snapshot exports all resolved entities and generates summary reports about the entity resolution results. Run it with the following flags:

sz_snapshot -QAo truthset_snapshot

-Q
  Quiet mode. Suppresses the interactive display and runs to completion.
-A
  Append audit data. Includes the detailed match data needed by sz_audit.
-o truthset_snapshot
  Output prefix. Files are written with this prefix (e.g., truthset_snapshot.csv, truthset_snapshot.json).

This produces several output files:

truthset_snapshot.json
  Full snapshot data in JSON format, including all report details.
truthset_snapshot.csv
  Entity-to-record mapping in CSV format, used as input to sz_audit.

For details on interpreting snapshot reports, see Snapshot Analysis.

Step 4: Perform an audit

sz_audit compares the snapshot results against the expected outcomes defined in the truth set key file. This indicates how accurately Senzing resolved entities compared to the known ground truth.

sz_audit -n truthset_snapshot.csv -p alternate_truthset_key.csv -o truthset_audit

-n truthset_snapshot.csv
  The snapshot CSV file (new/actual results from Senzing).
-p alternate_truthset_key.csv
  The previous/expected results (the truth set key file).
-o truthset_audit
  Output prefix for audit report files.

This produces several output files:

truthset_audit.json
  Full audit results in JSON format, including precision, recall, and F1 scores.
truthset_audit.csv
  Detailed per-entity audit results.
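The precision, recall, and F1 values in the audit output follow the standard definitions. A minimal sketch in terms of pair-level counts (the function name and its inputs are illustrative):

```python
def audit_metrics(tp, fp, fn):
    """Standard precision/recall/F1 from pair-level counts.

    tp: record pairs merged in both the snapshot and the key
    fp: pairs merged only in the snapshot (MERGE discrepancies)
    fn: pairs merged only in the key (SPLIT discrepancies)
    """
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For instance, 8 agreeing pairs with 2 MERGE and 2 SPLIT discrepancies yield precision, recall, and F1 of 0.8 each.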

For details on interpreting audit metrics, see Auditing.

Next steps

With the data loaded and reports generated, the results are ready for exploration.

If you have any questions, contact Senzing Support. Support is 100% FREE!