Building a scalable BigQuery data foundation and CoreSignal enrichment for a mid-sized asset management company

The Client

An Asset Manager That Had Outgrown Its Spreadsheets

AMF

Mid-Sized Asset Management Firm

Identity confidential · Full GCP + BigQuery Data Foundation Engagement

IndustryAsset Management / FinTech

RegionUnited States — Multi-office

Team Size8–12 analysts, 3 data engineers

Volume2 analytics cycles per year, 40M+ enrichment records

Stack BeforeCSV exports + Excel pipelines + manual CoreSignal pulls

EngagementDiscovery → Architecture → Build → Training → Handoff

This firm had built a workable analytics process on spreadsheets and biannual data cleanup sprints. But as AUM grew and the entity universe expanded, the manual overhead had become a structural liability — analysts spending the first week of every cycle cleaning data instead of analysing it.

By the time they engaged us, they had tried to build an internal BigQuery pipeline that stalled at the schema design stage. They needed a team that understood both the data engineering side and the asset management use case — not a generic cloud consultancy learning on their project.

The Problem

Four Data Gaps That Were Costing Analyst Capacity

The surface complaint was "our data is slow." Underneath that were four distinct structural failures, each with a direct cost to analytical output and decision quality.

Biannual Cycle Start Was a Cleanup Sprint, Not Analysis

Every analytics cycle began with a week of data normalization — deduplication, schema reconciliation, re-linking entities broken by upstream changes. Analysts were junior data engineers by default for the first 20% of every cycle.

CoreSignal Enrichment Was Pulled Manually Per Analyst

Each analyst ran their own CoreSignal pulls — different entity lists, different field sets, no shared enrichment layer. The same company was enriched multiple times per cycle with different results. Reconciliation consumed hours that should have been analysis.

No Audit Trail — Compliance Reviews Were Manual Reconstructions

When compliance needed to trace a decision back to source data, analysts reconstructed the lineage manually from email threads and file timestamps. Each review took 3–5 days of analyst time and carried material risk of gap or error.

Query Performance Degraded as Data Volume Grew

As the entity universe expanded, ad-hoc BigQuery queries on raw tables took 4–8 minutes to return. Analysts avoided running exploratory queries because the feedback loop was too slow. Data was technically available but practically unused.

The Solution

How We Structured the Engagement

We've built data foundations on Google Cloud for FinTech, asset management, and SaaS firms across the US, UK, and Australia. Our approach is always schema-first — before we write a single pipeline, we map the full entity model and enrichment dependencies. Then we build the layered architecture that supports it, not what looks clean in a diagram.

Four Phases · 10-Week Delivery

We divided the engagement into four structured phases, each validated before advancing. Every pipeline was tested against production-volume data in a staging environment before cutover.

Phase 01 — Discovery & Architecture Design Phase 02 — Bronze / Silver / Gold Schema Build Phase 03 — CoreSignal Enrichment Pipeline Phase 04 — Query Optimisation, Docs & Training

Phase 01

Discovery & Architecture Design

Before any build, we spent two weeks mapping the full data flow — from upstream CSV sources through CoreSignal enrichment to downstream analyst consumption. We documented every entity, every field dependency, and every compliance requirement. That map became the schema blueprint.

We also audited the failed internal BigQuery build to understand exactly where the schema design had stalled — and built the new architecture to avoid those constraints from the ground up.

Discovery & Architecture Design — Gold Layer Schema

🗺

End-to-End Entity Mapping

Every data source, enrichment dependency, and downstream consumer documented before any build began. Schema blueprint finalised and signed off.

🔍

Failed Pipeline Audit

Prior internal BigQuery build reviewed in full — what could be salvaged, what required a clean rebuild. No assumptions carried forward.

📐

Architecture Blueprint Delivered

Bronze / Silver / Gold layer design, partitioning strategy, and CoreSignal enrichment integration architecture finalised before build commenced.

Phase 02

Bronze / Silver / Gold Schema Build

We built the three-layer schema architecture in BigQuery using Dataform — raw ingestion in Bronze, standardised and deduplicated in Silver, analytics-ready in Gold. Every transformation is version-controlled, tested, and documented. No "it worked last time" pipelines.

Partitioning by date and clustering by entity type reduced query costs by over 80% and brought average query time from 6 minutes to under 60 seconds on the production dataset.

Bronze / Silver / Gold Schema Architecture in BigQuery

🏗

Three-Layer Schema with Dataform

Bronze (raw), Silver (clean), Gold (analytics-ready) — every transformation version-controlled and tested before promotion.

⚡

Partitioning & Clustering

Partitioned by ingestion date, clustered by entity type. Average query time dropped from 6 minutes to under 60 seconds on production volume.

🔄

Automated Ingestion Pipelines

Cloud Composer DAGs replace all manual CSV uploads. Pipelines run on schedule with alerting on any upstream source failure.

Phase 03

CoreSignal Enrichment Pipeline

We built a centralised enrichment layer that pulls CoreSignal firmographic, headcount, and funding data on a scheduled basis — replacing the per-analyst manual pulls that had been producing inconsistent results across the team.

Every entity in the Gold layer is now enriched from a single authoritative source. Enrichment jobs run nightly, with delta detection so only changed records trigger re-enrichment. No analyst touches CoreSignal manually. The full enrichment history is preserved in the Silver layer for audit purposes.

CoreSignal Enrichment Pipeline Architecture

🔗

Centralised Enrichment Layer

Single CoreSignal pipeline replaces per-analyst manual pulls. Every entity enriched from one authoritative source — no reconciliation required.

🔄

Delta Detection & Nightly Jobs

Only changed records trigger re-enrichment. Cloud Composer DAG runs nightly with alerting on API failures or enrichment gaps.

📜

Full Enrichment History in Silver

Every enrichment version timestamped and preserved. Compliance can trace any Gold record back to its source pull without manual reconstruction.

Phase 04

Query Optimisation, Runbooks & Training

With the schema and enrichment pipeline live, we spent the final phase on query performance tuning, cost controls, and making the system self-sufficient. Partition pruning and column clustering strategies brought BigQuery processing costs down by over 60% at production scale.

We delivered full operational runbooks for every pipeline procedure. Role-based training sessions enabled the internal analytics team to run future cycles — and troubleshoot independently — without external support. Cycle start dropped from a two-week cleanup sprint to a three-day validation run followed by immediate analysis.

Query Optimisation & Cost Control — Before vs After

⚡

Query & Cost Optimisation

Partition pruning and column clustering strategies applied to control BigQuery processing costs as data volume scales across future cycles.

📖

Runbooks & Documentation

Full operational runbooks for every pipeline procedure — role-based training sessions enabling the internal analytics team to run future cycles without external support.

🎓

Role-Based Training

Analyst, engineer, and compliance tracks. Teams validated on the system before handoff — no black-box pipelines left in production.

Impact & Outcomes

What Changed — In Numbers and In Practice

The metric that resonated most with the team wasn't the percentage reduction in manual work — it was the shift in what analysts were doing with their time. Before this platform, a biannual cycle started with a cleanup sprint that consumed the first week. After go-live, cycle start meant running a notebook, validating a quality dashboard, and moving straight to analysis. That's the compounding return on a good data foundation.

80%

Reduction in manual data preparation time per cycle

6×

Query performance improvement post-partitioning & clustering

3 days

Cycle start-to-analysis — down from a 2-week cleanup sprint

61%

BigQuery cost reduction per analytics cycle at production scale

✓

Single unified BigQuery foundation replaced 4 disconnected CSV pipelines and 3 per-analyst CoreSignal workflows

✓

Zero manual CSV uploads after go-live — all ingestion automated via Cloud Composer DAGs with failure alerting

✓

Complete enrichment audit trail across all entities — compliance traces any Gold record back to source without manual reconstruction

✓

Analysts validated full independence — running pipeline maintenance and future cycle starts without external support

✓

Cost controls and partition strategy in place — pipeline economics improve as data volume scales, not degrade

"Before this platform, our cycle started with a cleanup sprint that consumed the first week. After go-live, it meant running a notebook and moving straight to analysis on day one."

— Twopir Project Lead · Asset Management Data Foundation Engagement · 2025

Frequently Asked Questions

What Asset Management Firms Ask Before Working With Us

Why BigQuery over alternatives like Snowflake or Redshift for asset management analytics?

BigQuery's serverless model eliminates infrastructure management and scales automatically with query volume — well-suited for biannual high-intensity analytics cycles where you'd otherwise over-provision for peak. Combined with Dataform for pipeline orchestration and the GCP ecosystem for enrichment integrations, it offers a full-stack solution without additional tooling overhead. The per-query cost model also aligns well with irregular cycle patterns common in asset management.

How do you handle CoreSignal API rate limits and enrichment reliability?

We build enrichment pipelines with delta detection — only changed records trigger API calls, which dramatically reduces volume. We also implement retry logic, dead-letter queues for failed enrichments, and nightly health-check alerts. The Silver layer preserves full enrichment history, so a transient API failure doesn't corrupt downstream Gold records — it simply delays the next update cycle.

Can you work with a partially built internal BigQuery pipeline?

Yes — we start every engagement with a schema and pipeline audit. We document what's working, what introduces downstream risk, and what needs a clean rebuild. We've completed partial-build rescues for firms in FinTech, asset management, and SaaS — the pattern is consistent enough that we have a standard remediation framework for it.

What compliance documentation do you produce as part of the engagement?

We deliver full data lineage documentation, pipeline runbooks, schema change logs, and enrichment audit trails as standard. Every transformation in the Gold layer is traceable back to its Bronze source with timestamped versions preserved in Silver. For firms with regulatory audit requirements, we can extend this to include formal data governance documentation aligned to specific frameworks.

What makes Twopir different from a local Google Cloud partner?

Most GCP partners have general cloud infrastructure expertise but limited depth in financial data modelling and enrichment pipeline architecture. We have 12+ years and 200+ implementations — with focused practices in FinTech data foundations, asset management analytics, and CRM enrichment (Salesforce, HubSpot, Propertybase). We hold Salesforce Gold Partner status and work across US, UK, AU, and UAE time zones. We don't learn the use case on your project.