How to Build a Multi-Source Data Harmonization Platform in BigQuery (with CoreSignal Enrichment)

Build a Multi-Source Data Harmonization Platform in BigQuery (with CoreSignal Enrichment) With Twopir Consulting

Asset managers and wealth-focused organizations routinely receive business-critical datasets in cycles. These include biannual consultant datasets, distribution lists, event and conference data, CRM exports, and research files. While the data is available, it often arrives with inconsistent schemas, varying identifiers, and uneven quality.

This inconsistency makes it difficult for teams to understand relationships, identify opportunities, and confidently act on the data. Without a standardized foundation, reporting becomes unreliable and scaling data usage across cycles becomes increasingly complex.

This page explains a proven BigQuery-based approach to harmonizing multi-source data across cycles such as MMD, BOB, BrightScope, Salesforce, PNI, and ICR. The approach standardizes data into a layered warehouse and enriches company profiles using CoreSignal, while addressing data quality, rate limits, and ongoing source onboarding.

What Is Data Harmonization?

Data harmonization is the process of converting fragmented, inconsistent datasets into a single, trusted data foundation.

In this platform, harmonization includes:

Ingesting multiple biannual and ad-hoc datasets into a consistent structure.
Resolving identity across sources where the same company appears under different names or identifiers.
Creating analytics-ready Gold tables that marketing and distribution teams can rely on.
Creating analytics-ready Gold tables (dimensions and facts) that marketing and distribution teams can trust
Enriching company attributes such as employee counts and revenue signals, to improve prioritization.

The ultimate objective is to turn disconnected spreadsheets into a durable, extensible dataset that can be queried, reported, and reused reliably every cycle.

Why Use a Bronze–Silver–Gold Medallion Architecture in BigQuery?

A layered warehouse architecture provides structure, scalability, and long-term reliability for multi-source data environments.

This architecture allows organizations to preserve raw data, standardize it incrementally, and expose only trusted outputs for analytics and decision-making.

Architecture Overview: Bronze → Silver → Gold (Plus Master Layer)

Bronze Layer (Raw Data Landing):

Each incoming dataset is ingested into cycle-stamped raw tables, such as:

raw_mmd_20250331
raw_<source>_<cycle_date>
Future-cycle equivalents like raw_mmd_20250930

Purpose: To preserve the original delivered format for replay, auditing, and controlled transformation.

Master Layer (Identity and Geographic Standardization):

The master layer establishes a single source of truth for company identity and geography.

Core tables include:

master_mapping_company for canonical company identity resolution
us_states_regions_abbrev for state, state ID, and regional enrichment

Purpose: To maintain stable company identifiers and standardized geographic attributes across all downstream layers.

Silver Layer (Standardized Sources and Enrichment History):

The Silver layer standardizes each source into a consistent schema while preserving historical cycles.

Key outputs include:

Per-cycle standardized tables such as mmd_company_<cycle_date>
Equivalent tables for PNI and ICR, including combined outputs like pni_dbdc_<cycle_date>
Cross-cycle enrichment history tables, such as mmd_enriched and pni_enriched

Purpose: To normalize source data while supporting append-friendly, cycle-over-cycle analysis.

Gold Layer (Curated Dimensions and Facts):

The Gold layer delivers clean, analytics-ready datasets.

Core outputs include:

dim_company as the primary company dimension

Enrichment fact tables include:

fact_company_employees_monthly
fact_company_employees_country_monthly (US-only)
fact_company_income_statements
Fact_company_growth

CoreSignal logging table:

coresignal_company_raw, an insert-only log of all enrichment attempts

Purpose: To provide a trusted data model for reporting, segmentation, and downstream use cases.

End-to-End Dataflow Example (MMD Cycle):

Stage 1: Ingest, Map, Consolidate, and Export for Cleanup:

The process begins by ingesting raw data and establishing identity resolution:

Raw files are uploaded to GCS and ingested into Bronze tables
Master mapping assigns or generates unique company identifiers (EINs)
Geographic attributes are enriched using standardized reference tables
Consolidated views are created to identify records requiring cleanup
Auto-resolvable records are updated deterministically
Consolidated data is exported for business review and cleanup

This stage ensures that identity and structure issues are resolved before data is promoted downstream.

Stage 2: Cleaned Ingest, Gold Updates, and Enrichment History:

Once the cleaned data is provided:

Cleaned consolidated files are ingested into the Silver layer
Master mapping is finalized using cleaned identifiers and company names
Gold dimensions are updated to reflect finalized identities
Per-source standardized outputs are rebuilt
Enrichment history tables are appended for long-term tracking
Gold and master backups are exported for audit and recovery

This stage ensures consistency, durability, and repeatability across cycles.

CoreSignal Enrichment at Scale

Company enrichment is handled asynchronously to support scale and reliability.

The enrichment process:

Selects target companies from BigQuery
Queues enrichment tasks per company
Executes enrichment requests using the company EIN and website
Logs every enrichment attempt, including successes and misses
Writes employee counts, country-level metrics, and income data into fact tables

All enrichment results are tracked historically to support trend analysis and reruns without duplication.

Proven Checklist for Reliable Data Harmonization:

To ensure consistent results across cycles:

Confirm cycle dates and active sources before each pipeline run
Ensure all processes are idempotent to avoid duplication
Apply strict rate-limiting with retries and backoff
Log every enrichment attempt, including “not found” results
Isolate schema drift handling within the Silver layer
Preserve raw cycle snapshots for audit and replay

Results You Can Expect:

By implementing this BigQuery-based data harmonization platform, organizations can expect:

A single, trusted view of companies across all sources and cycles
Improved data quality and identity consistency
Reliable, analytics-ready datasets for marketing and distribution teams
Scalable enrichment of company attributes over time
A durable foundation that supports future data sources and growth initiatives.

Conclusion:

A multi-source data harmonization platform built on BigQuery enables asset managers and wealth-focused organizations to move beyond fragmented spreadsheets and inconsistent reporting. By combining a Bronze–Silver–Gold architecture with robust identity resolution and scalable CoreSignal enrichment, organizations gain a trusted, extensible data foundation.

This approach transforms cyclical, disconnected data into a long-term strategic asset that supports analytics, segmentation, and informed decision-making across every cycle.