Loading the Truth Set

This page covers loading the Senzing truth set demo data, taking a snapshot of the entity resolution results, and running an audit against the expected outcomes.

Step 1: Install Senzing and load the truth set data

Choose an installation method and follow the instructions through the data loading step. The quickstart guides cover installing Senzing, configuring data sources, and loading the truth set data files (customers.jsonl, reference.jsonl, and watchlist.jsonl).

Step 2: Download the audit key files

The truth set key files define which records are expected to resolve to the same entity. They are stored in the Senzing Truth Sets repository on GitHub, alongside the truth set data:

wget https://raw.githubusercontent.com/Senzing/truth-sets/main/truthsets/demo/actual_truthset_key.csv
wget https://raw.githubusercontent.com/Senzing/truth-sets/main/truthsets/demo/alternate_truthset_key.csv

Key file descriptions

actual_truthset_key.csv
  Maps record IDs to expected entity groupings. Each row pairs a data source and record ID with a cluster ID representing the “true” entity. When two records share the same cluster ID, they should resolve to the same entity. This file represents the definitive correct groupings.

alternate_truthset_key.csv
  Simulated results from a legacy or competing algorithm, used to demonstrate ER auditing. The audit compares the current snapshot against this file to identify where the two approaches agree and differ.

Key file format

Both key files use a simple CSV format with three columns:

CLUSTER_ID
  The entity/cluster identifier. Records sharing the same CLUSTER_ID were resolved to the same entity. These values do not need to match Senzing entity IDs.
RECORD_ID
  The source record identifier.
DATA_SOURCE
  The data source the record came from. Required when multiple data sources are present.
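The format can be consumed directly with standard CSV tooling. Here is a minimal Python sketch that loads a key file and groups records by CLUSTER_ID (the helper name load_truth_key is illustrative, not part of any Senzing API):

```python
import csv
from collections import defaultdict

def load_truth_key(path):
    """Group (DATA_SOURCE, RECORD_ID) pairs by CLUSTER_ID from a key CSV."""
    clusters = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            clusters[row["CLUSTER_ID"]].add((row["DATA_SOURCE"], row["RECORD_ID"]))
    return dict(clusters)
```

Records that end up in the same set are expected to resolve to the same entity.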

Alternate key design

The alternate key was designed with a different matching philosophy than Senzing’s defaults, introducing two deliberate differences:

  • More aggressive name matching: The alternate algorithm treats close name variants with matching date of birth as definitive matches, even when Senzing considers them only possible matches. This reflects an approach that prioritizes recall over precision for name similarity.
  • No employer-based matching: The alternate algorithm does not use employer as a matching feature. Where Senzing merges records based on name and employer, the alternate algorithm keeps them separate. This is common in algorithms that view employer as too volatile or unreliable for identity resolution.

These differences produce the specific MERGE and SPLIT discrepancies shown in the audit results.
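Conceptually, a MERGE or SPLIT discrepancy comes from comparing which record pairs each clustering merges. The sketch below illustrates that idea only; it is not sz_audit's implementation, and the function names and record labels are invented:

```python
from collections import defaultdict
from itertools import combinations

def shared_pairs(assignment):
    """All unordered record pairs placed in the same cluster."""
    clusters = defaultdict(list)
    for record, cluster_id in assignment.items():
        clusters[cluster_id].append(record)
    return {frozenset(pair)
            for members in clusters.values()
            for pair in combinations(sorted(members), 2)}

def merge_split_discrepancies(snapshot, key):
    """MERGE: pairs merged only in the snapshot; SPLIT: pairs merged only in the key."""
    snap, truth = shared_pairs(snapshot), shared_pairs(key)
    return {"MERGE": snap - truth, "SPLIT": truth - snap}
```

For example, if the snapshot merges records A and B (say, on name and employer) while the key instead merges B and C, the pair (A, B) surfaces as a MERGE discrepancy and (B, C) as a SPLIT discrepancy.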

Creating a custom key

To generate an alternate key from another ER system, query its results database for the cluster-to-record mapping:

SELECT cluster_id, record_id, data_source
FROM entity_resolution_results
ORDER BY cluster_id, data_source, record_id;

If all records come from a single data source, the DATA_SOURCE column can be omitted.
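Wrapping that query in a small export script might look like the following sketch, assuming a SQLite results database with the table and column names used above (adjust the connection and query for your actual ER system):

```python
import csv
import sqlite3

def export_key(db_path, out_path):
    """Export a cluster-to-record mapping as an alternate key CSV (SQLite assumed)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT cluster_id, record_id, data_source "
        "FROM entity_resolution_results "
        "ORDER BY cluster_id, data_source, record_id"
    )
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["CLUSTER_ID", "RECORD_ID", "DATA_SOURCE"])
        writer.writerows(rows)
    con.close()
```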

To learn more about creating truth sets from real data, see How to create an entity resolution truth set.

Step 3: Take a snapshot

sz_snapshot exports all resolved entities and generates summary reports about the entity resolution results. Run it with the following flags:

sz_snapshot -QAo truthset_snapshot

-Q
  Quiet mode. Suppresses the interactive display and runs to completion.
-A
  Append audit data. Includes the detailed match data needed by sz_audit.
-o truthset_snapshot
  Output prefix. Files are written with this prefix (e.g., truthset_snapshot.csv, truthset_snapshot.json).

This produces several output files:

truthset_snapshot.json
  Full snapshot data in JSON format, including all report details.
truthset_snapshot.csv
  Entity-to-record mapping in CSV format, used as input to sz_audit.

For details on interpreting snapshot reports, see Snapshot Analysis.

Step 4: Perform an audit

sz_audit compares the snapshot results against the expected outcomes defined in the truth set key file. This indicates how accurately Senzing resolved entities compared to the known ground truth.

sz_audit -n truthset_snapshot.csv -p alternate_truthset_key.csv -o truthset_audit

-n truthset_snapshot.csv
  The snapshot CSV file (new/actual results from Senzing).
-p alternate_truthset_key.csv
  The previous/expected results (the truth set key file).
-o truthset_audit
  Output prefix for audit report files.

This produces several output files:

truthset_audit.json
  Full audit results in JSON format, including precision, recall, and F1 scores.
truthset_audit.csv
  Detailed per-entity audit results.
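The precision, recall, and F1 values in the audit output follow the standard definitions. A minimal sketch in terms of pair-level counts (the function name and its inputs are illustrative):

```python
def audit_metrics(tp, fp, fn):
    """Standard precision/recall/F1 from pair-level counts.

    tp: record pairs merged in both the snapshot and the key
    fp: pairs merged only in the snapshot (MERGE discrepancies)
    fn: pairs merged only in the key (SPLIT discrepancies)
    """
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For instance, 8 agreeing pairs with 2 MERGE and 2 SPLIT discrepancies yield precision, recall, and F1 of 0.8 each.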

For details on interpreting audit metrics, see Auditing.

Next steps

With the data loaded and reports generated, the results are ready for exploration.

If you have any questions, contact Senzing Support. Support is 100% FREE!