Build a Multi-Source Data Harmonization Platform in BigQuery (with CoreSignal Enrichment) With Twopir Consulting
Asset managers and wealth-focused organizations routinely receive business-critical datasets in cycles. These include biannual consultant datasets, distribution lists, event and conference data, CRM exports, and research files. While the data is available, it often arrives with inconsistent schemas, varying identifiers, and uneven quality.
This inconsistency makes it difficult for teams to understand relationships, identify opportunities, and confidently act on the data. Without a standardized foundation, reporting becomes unreliable and scaling data usage across cycles becomes increasingly complex.
This page explains a proven BigQuery-based approach to harmonizing multi-source data across cycles such as MMD, BOB, BrightScope, Salesforce, PNI, and ICR. The approach standardizes data into a layered warehouse and enriches company profiles using CoreSignal, while addressing data quality, rate limits, and ongoing source onboarding.
What Is Data Harmonization?
Data harmonization is the process of converting fragmented, inconsistent datasets into a single, trusted data foundation.
In this platform, harmonization includes:
- Ingesting multiple biannual and ad-hoc datasets into a consistent structure.
- Resolving identity across sources where the same company appears under different names or identifiers.
- Creating analytics-ready Gold tables that marketing and distribution teams can rely on.
- Creating analytics-ready Gold tables (dimensions and facts) that marketing and distribution teams can trust
- Enriching company attributes such as employee counts and revenue signals, to improve prioritization.
The ultimate objective is to turn disconnected spreadsheets into a durable, extensible dataset that can be queried, reported, and reused reliably every cycle.
Why Use a Bronze–Silver–Gold Medallion Architecture in BigQuery?
A layered warehouse architecture provides structure, scalability, and long-term reliability for multi-source data environments.
This architecture allows organizations to preserve raw data, standardize it incrementally, and expose only trusted outputs for analytics and decision-making.
Architecture Overview: Bronze → Silver → Gold (Plus Master Layer)
Bronze Layer (Raw Data Landing):
Each incoming dataset is ingested into cycle-stamped raw tables, such as:
- raw_mmd_20250331
- raw_<source>_<cycle_date>
- Future-cycle equivalents like raw_mmd_20250930
Purpose: To preserve the original delivered format for replay, auditing, and controlled transformation.
Master Layer (Identity and Geographic Standardization):
The master layer establishes a single source of truth for company identity and geography.
Core tables include:
- master_mapping_company for canonical company identity resolution
- us_states_regions_abbrev for state, state ID, and regional enrichment
Purpose: To maintain stable company identifiers and standardized geographic attributes across all downstream layers.
Silver Layer (Standardized Sources and Enrichment History):
The Silver layer standardizes each source into a consistent schema while preserving historical cycles.
Key outputs include:
- Per-cycle standardized tables such as mmd_company_<cycle_date>
- Equivalent tables for PNI and ICR, including combined outputs like pni_dbdc_<cycle_date>
- Cross-cycle enrichment history tables, such as mmd_enriched and pni_enriched
Purpose: To normalize source data while supporting append-friendly, cycle-over-cycle analysis.
Gold Layer (Curated Dimensions and Facts):
The Gold layer delivers clean, analytics-ready datasets.
Core outputs include:
- dim_company as the primary company dimension
Enrichment fact tables include:
- fact_company_employees_monthly
- fact_company_employees_country_monthly (US-only)
- fact_company_income_statements
- Fact_company_growth
CoreSignal logging table:
- coresignal_company_raw, an insert-only log of all enrichment attempts
Purpose: To provide a trusted data model for reporting, segmentation, and downstream use cases.
End-to-End Dataflow Example (MMD Cycle):
Stage 1: Ingest, Map, Consolidate, and Export for Cleanup:
The process begins by ingesting raw data and establishing identity resolution:
- Raw files are uploaded to GCS and ingested into Bronze tables
- Master mapping assigns or generates unique company identifiers (EINs)
- Geographic attributes are enriched using standardized reference tables
- Consolidated views are created to identify records requiring cleanup
- Auto-resolvable records are updated deterministically
- Consolidated data is exported for business review and cleanup
This stage ensures that identity and structure issues are resolved before data is promoted downstream.
Stage 2: Cleaned Ingest, Gold Updates, and Enrichment History:
Once the cleaned data is provided:
- Cleaned consolidated files are ingested into the Silver layer
- Master mapping is finalized using cleaned identifiers and company names
- Gold dimensions are updated to reflect finalized identities
- Per-source standardized outputs are rebuilt
- Enrichment history tables are appended for long-term tracking
- Gold and master backups are exported for audit and recovery
This stage ensures consistency, durability, and repeatability across cycles.
CoreSignal Enrichment at Scale
Company enrichment is handled asynchronously to support scale and reliability.
The enrichment process:
- Selects target companies from BigQuery
- Queues enrichment tasks per company
- Executes enrichment requests using the company EIN and website
- Logs every enrichment attempt, including successes and misses
- Writes employee counts, country-level metrics, and income data into fact tables
All enrichment results are tracked historically to support trend analysis and reruns without duplication.
Proven Checklist for Reliable Data Harmonization:
To ensure consistent results across cycles:
- Confirm cycle dates and active sources before each pipeline run
- Ensure all processes are idempotent to avoid duplication
- Apply strict rate-limiting with retries and backoff
- Log every enrichment attempt, including “not found” results
- Isolate schema drift handling within the Silver layer
- Preserve raw cycle snapshots for audit and replay
Results You Can Expect:
By implementing this BigQuery-based data harmonization platform, organizations can expect:
- A single, trusted view of companies across all sources and cycles
- Improved data quality and identity consistency
- Reliable, analytics-ready datasets for marketing and distribution teams
- Scalable enrichment of company attributes over time
- A durable foundation that supports future data sources and growth initiatives.
Conclusion:
A multi-source data harmonization platform built on BigQuery enables asset managers and wealth-focused organizations to move beyond fragmented spreadsheets and inconsistent reporting. By combining a Bronze–Silver–Gold architecture with robust identity resolution and scalable CoreSignal enrichment, organizations gain a trusted, extensible data foundation.
This approach transforms cyclical, disconnected data into a long-term strategic asset that supports analytics, segmentation, and informed decision-making across every cycle.