Introduction:
Asset managers and wealth-focused organizations often receive critical business data in recurring cycles—biannual consultant datasets, distribution lists, event and conference files, CRM exports, and research data. The challenge isn’t access to information, but fragmentation. Each source arrives with different schemas, identifiers, and varying data quality, making it difficult for distribution and marketing teams to consistently identify opportunities, understand consultant and client relationships, and drive growth with confidence.
This blog walks through a proven BigQuery-based Medallion architecture for harmonizing multi-source data (MMD, BOB, BrightScope, Salesforce, PNI, ICR) across delivery cycles. By standardizing data into layered Bronze → Silver → Gold zones and enriching company profiles at scale using CoreSignal, this pattern enables reliable analytics while addressing rate limits, data quality challenges, and future source onboarding.
What is Data Harmonization?
In this architecture, data harmonization goes beyond simple ETL. It means:
- Ingesting multi-cycle files (biannual and ad-hoc) into a consistent, replayable structure
- Resolving identity across sources, where the same company appears under different names, IDs, or formats
- Producing analytics-ready Gold tables that marketing and distribution teams can confidently use for segmentation, targeting, and reporting
- Enriching company profiles with employee counts, revenue signals, and growth metrics to improve prioritization
The end goal is a foundational dataset that survives multiple cycles, supports historical analysis, and continuously improves as new data arrives.
Architecture (Bronze → Silver → Gold + Master):
Bronze layer (raw landing):
Each source is ingested into cycle-stamped raw tables, for example:
- raw_mmd_20250331, raw_<source>_<cycle_date>, etc.
- Next cycle equivalents like raw_mmd_20250930, etc.
Purpose: preserve the original delivered shape per cycle for replay, auditing, and controlled transformation.
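As a concrete illustration, here is a minimal ingestion sketch using the google-cloud-bigquery Python client. The project and dataset names are placeholders, and the real pipeline wraps this logic in the ingestion_sp stored procedure described later; this only shows the load pattern.

```python
from google.cloud import bigquery

client = bigquery.Client()

source = "mmd"
cycle_date = "20250331"

# Cycle-stamped raw table: preserves the delivered shape for replay/audit.
table_id = f"my-project.bronze.raw_{source}_{cycle_date}"  # placeholder project/dataset
uri = f"gs://allspring_data/bronze/{source}/cycle_{cycle_date}/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # assumption: infer the schema as delivered
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # reruns stay idempotent
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```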
Master layer (identity + geographic truth):
Core tables:
- master_mapping_company (canonical company identity resolution)
- us_states_regions_abbrev (state/state_id/region enrichment)
Purpose: create/maintain stable identifiers and standardized geography used across all downstream layers.
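The internals of build_master_mapping_sp aren't shown in this post, but conceptually it is an upsert into master_mapping_company keyed on a normalized company identity. A hedged sketch, assuming a naive name-based key and illustrative column names (only master_mapping_company and New_EIN come from the pipeline itself):

```python
from google.cloud import bigquery

client = bigquery.Client()
cycle_date = "20250331"

# Assumption: company_key / source_company_name are illustrative columns.
merge_sql = f"""
MERGE `master.master_mapping_company` AS m
USING (
  SELECT DISTINCT
    UPPER(TRIM(company_name)) AS company_key,  -- naive normalization for illustration
    company_name,
    ein
  FROM `bronze.raw_mmd_{cycle_date}`
) AS src
ON m.company_key = src.company_key
WHEN NOT MATCHED THEN
  INSERT (company_key, source_company_name, New_EIN)
  VALUES (src.company_key, src.company_name,
          COALESCE(src.ein, GENERATE_UUID()))  -- generate an identifier where required
"""
client.query(merge_sql).result()
```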
Silver layer (standardized per source + enrichment history):
Per-cycle standardized company tables:
- mmd_company_<cycle_date>
- PNI and ICR equivalents (plus combined PNI outputs like pni_dbdc_<cycle_date>)
Cross-cycle enrichment tables:
- <source>_enriched (e.g. mmd_enriched, pni_enriched, etc.)
Purpose: normalize each source into a consistent schema and keep an append-friendly history across cycles.
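The append step can be as simple as stamping the cycle snapshot and inserting it into the history table. A sketch, assuming mmd_enriched shares the snapshot's schema plus a cycle_date column:

```python
from google.cloud import bigquery

client = bigquery.Client()
cycle_date = "20250331"

# Assumption: mmd_enriched = snapshot schema + cycle_date column.
append_sql = f"""
INSERT INTO `silver.mmd_enriched`
SELECT s.*, '{cycle_date}' AS cycle_date  -- stamp each row with its cycle
FROM `silver.mmd_company_{cycle_date}` AS s
"""
client.query(append_sql).result()
```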
Gold layer (curated dims/facts + CoreSignal facts):
Core analytics outputs:
- dim_company
- Enrichment facts such as:
  - fact_company_employees_monthly
  - fact_company_employees_country_monthly (US-only)
  - fact_company_income_statements
  - fact_company_growth
- CoreSignal logging:
  - coresignal_company_raw (insert-only attempt log)
Purpose: a clean, reliable model for reporting, segmentation, and downstream integrations.
End-to-End Dataflow (Example: MMD Cycle)
Below is a simplified walkthrough of the process, designed for repeatability across sources and cycles. Illustrative code sketches follow each stage.
Stage 1: Ingest → Master mapping → Consolidate → Export for Cleanup
1. Upload raw file to GCS
- Example: allspring_data/bronze/mmd/cycle_20250331
2. Ingest raw → Bronze table
- Procedure: ingestion_sp / call_ingestion_sp
- Output: raw_mmd_20250331
3. Build master mapping (identity resolution)
- Procedure: build_master_mapping_sp / call_build_master_mapping_sp
- Output: updates master_mapping_company
- Notes: assigns a unique company identifier (EIN), generating it where required.
4. Enrich geographic details
- Query: enrich_geographic_details
- Output: enriches State / StateId / Region in master mapping.
5. Create consolidated tables for manual standardization
- Queries:
- create_consolidated_tables → creates only “needs-cleaning” records (missing New_EIN/New_Company)
- create_consolidate_inc_all_data → creates a full consolidated snapshot
- Output: consolidated_company_20250331
6. Update consolidated company (auto-fill where possible)
- Query: update_consolidated_company
- Output: updates records that can be resolved deterministically.
7. Export consolidated to GCS for business cleanup
- Query: export_consolidated_tables
- Output path: allspring_data/silver/cycle_20250331/consolidated/consolidated_company.csv
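Since each step above is a stored procedure or saved query, Stage 1 can be driven by a thin orchestration loop. A minimal sketch, assuming each procedure lives in an `ops` dataset and accepts the cycle date (the real signatures aren't shown in this post):

```python
from google.cloud import bigquery

client = bigquery.Client()
cycle_date = "20250331"

# Assumption: dataset name and parameter shapes are placeholders.
stage_1 = [
    f"CALL `ops.ingestion_sp`('{cycle_date}')",
    f"CALL `ops.build_master_mapping_sp`('{cycle_date}')",
    # enrich_geographic_details, create_consolidated_tables,
    # update_consolidated_company, export_consolidated_tables follow the same pattern
]

for sql in stage_1:
    client.query(sql).result()  # run sequentially; each step depends on the previous
```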
Stage 2: Ingest cleaned file → Enrich master → Update gold → Build enriched history
1. Upload cleaned consolidated CSV to GCS
- Path: allspring_data/silver/cycle_20250331/cleaned_consolidated/company
2. Ingest cleaned consolidated → Silver
- Output: cleaned_consolidated_company_20250331
3. Enrich master mapping using cleaned values
- Query: enrich_master_mapping_company
- Output: master now has the finalized New_EIN and New_Company.
4. Update gold dimensions
- Query: update_dim_company
- Output: dim_company updated with newly finalized company identities.
5. Build enriched per-source outputs + append to history
- Procedure: raw_to_enrich_sp / call_enrich_and_append_sp
- Outputs:
- mmd_company_<cycle> (cycle snapshot)
- mmd_enriched (cross-cycle append table)
6. Export key backups
- Query: export_dim_and_master
- Outputs:
- allspring_data/gold/<cycle>/dim_company.csv
- allspring_data/master/<cycle>/master_mapping_company.csv
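The backup export maps naturally to BigQuery's EXPORT DATA statement. A sketch of what export_dim_and_master might run; note that EXPORT DATA requires a wildcard in the URI, so the actual file naming may differ slightly from the paths above:

```python
from google.cloud import bigquery

client = bigquery.Client()
cycle_date = "20250331"

export_sql = f"""
EXPORT DATA OPTIONS (
  uri = 'gs://allspring_data/gold/{cycle_date}/dim_company_*.csv',  -- wildcard is required
  format = 'CSV',
  overwrite = TRUE,
  header = TRUE
) AS
SELECT * FROM `gold.dim_company`
"""
client.query(export_sql).result()
```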
CoreSignal Enrichment (Scale + Rate Limits)
CoreSignal enrichment is handled asynchronously, so it can scale safely:
- A BigQuery Notebook selects target companies and enqueues one task per company into Cloud Tasks.
- Each Cloud Task invokes a Cloud Run service (or Cloud Function) with the company's EIN + website, as sketched after this list.
- The function calls the CoreSignal enrich endpoint, parses the response, and writes results into BigQuery.
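A sketch of the fan-out step using the google-cloud-tasks client. The project, location, queue, and endpoint URL are placeholders; the queue's dispatch settings (see the checklist below) enforce the rate limit:

```python
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "coresignal-enrich")  # placeholders

def enqueue(ein: str, website: str) -> None:
    """Create one Cloud Task per company, targeting the enrichment endpoint."""
    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url="https://example-enrich-service.a.run.app/enrich",  # placeholder endpoint
            headers={"Content-Type": "application/json"},
            body=json.dumps({"ein": ein, "website": website}).encode(),
        )
    )
    client.create_task(parent=parent, task=task)

enqueue("12-3456789", "https://example.com")
```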
BigQuery tables used (an upsert sketch follows the list):
- coresignal_company_raw: insert-only log of every attempt (status codes like 200/404/409, timestamps, found/not found).
- fact_company_employees_monthly: upsert monthly employee totals.
- fact_company_employees_country_monthly: upsert monthly employee counts by country (US-only retained).
- fact_company_income_statements: upsert revenue and other income statement metrics.
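The upsert semantics map to a BigQuery MERGE. A hedged sketch for fact_company_employees_monthly, assuming the parsed API response is staged first and using illustrative column names (only the fact table name comes from the pipeline):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumption: staging table and column names are illustrative.
upsert_sql = """
MERGE `gold.fact_company_employees_monthly` AS f
USING `staging.coresignal_employees_monthly` AS s
ON f.ein = s.ein AND f.month = s.month
WHEN MATCHED THEN
  UPDATE SET f.employee_count = s.employee_count
WHEN NOT MATCHED THEN
  INSERT (ein, month, employee_count)
  VALUES (s.ein, s.month, s.employee_count)
"""
client.query(upsert_sql).result()
```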
Twopir's proven checklist:
- Confirm cycle date + active sources are set correctly in pipelines before execution.
- Keep every run idempotent: rerunning a cycle should not duplicate facts or regenerate identities.
- Use a strict rate-limit strategy in Cloud Tasks (steady RPS, retries with backoff) to avoid API throttling.
- Log every enrichment attempt (including 404 “not found”) so reruns can skip known misses (see the sketch after this checklist).
- Treat schema drift as expected: isolate per-source standardization in Silver, and keep Gold stable.
- Always preserve raw cycle snapshots for replay and audit.
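To make the "skip known misses" rule concrete, candidate selection can exclude any company whose prior attempt already resolved to a success or a definitive 404. A sketch with assumed column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumption: status_code column; only enqueue companies with no definitive outcome yet.
candidates_sql = """
SELECT d.ein, d.website
FROM `gold.dim_company` AS d
WHERE NOT EXISTS (
  SELECT 1
  FROM `gold.coresignal_company_raw` AS r
  WHERE r.ein = d.ein
    AND r.status_code IN (200, 404)  -- already succeeded, or a known miss
)
"""
for row in client.query(candidates_sql).result():
    print(row.ein, row.website)  # in practice, enqueue via Cloud Tasks here
```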
Conclusion:
This BigQuery harmonization pattern transforms recurring, inconsistent data deliveries into a long-lived growth platform. By separating raw ingestion, identity resolution, standardization, enrichment, and analytics, teams gain:
- Consistent company identity across all sources
- Historical continuity across cycles
- Enriched insights that improve targeting and prioritization
- A scalable foundation that welcomes new data sources without disruption
For asset managers and wealth organizations, this isn’t just data engineering—it’s infrastructure for sustainable distribution growth.
Contact Us
We would love to hear from you! Please feel free to send us a message via the form.