Every quantitative trading operation eventually faces the same bottleneck: data. Not the absence of data — there is more market data available today than at any point in history. The bottleneck is getting that data into a consistent, reliable, research-ready format.
When you are pulling from 50+ sources, each with its own API, authentication scheme, rate limits, schema conventions, update frequency, and failure mode, the complexity compounds quickly. A research request that should take 2 hours of work takes 2 days because you are debugging a data pipeline instead of running analysis.
QData is our answer to this problem. It is a unified market data system that underpins everything we do — from strategy research to live trading signal generation to risk monitoring.
The Scale
QData currently covers:
- 500+ cryptocurrency symbols across Binance (spot and futures) and Bybit (perpetuals)
- 5,000+ equities across NYSE, NASDAQ, Toronto Stock Exchange, London Stock Exchange, Hong Kong Stock Exchange, and Nigerian Stock Exchange (NGX)
- Major forex pairs via Dukascopy tick and minute-level data
- Commodities: Gold, silver, platinum, palladium, crude oil, natural gas
- Macro data: Interest rates, CPI, GDP, unemployment, money supply across the covered countries
- Timeframes: 1-minute, 5-minute, 15-minute, 1-hour, 4-hour, daily, weekly
The data spans multiple years of history for most symbols. At any given time, QData manages several terabytes of compressed market data, with approximately 40–50 GB of new data ingested per month across all sources.
Data Sources and the Collection Strategy
The 50+ sources break down into several categories, each requiring different collection approaches:
Exchange APIs (crypto): Binance and Bybit provide REST and WebSocket APIs with clean OHLCV data and historical depth. These are the most reliable sources in the system.
Broker APIs (equities): Yahoo Finance (primary), Stooq (backup for US/European equities), and proprietary scrapers for markets with limited API access.
Central bank and statistics APIs: FRED for US macro data, Bank of England API, Statistics Canada, Central Bank of Nigeria API.
Dukascopy (forex): Provides some of the cleanest long-history forex data available, going back over a decade at tick-level granularity. Access requires browser automation due to CAPTCHA challenges.
NGX (Nigerian Stock Exchange): The most technically challenging source. The NGX website uses dynamic JavaScript rendering, anti-scraping measures, and inconsistent data formats. We built a dedicated multi-layer scraper that handles authentication, session management, and data normalization.
Storage: Apache Parquet with Zstd Compression
We evaluated several storage formats — CSV, HDF5, SQLite, and Parquet — and chose Apache Parquet with Zstd compression for the following reasons:
Columnar access patterns match research workflows. Quantitative research almost always accesses data column-by-column ("give me all close prices for this symbol from 2020 to present") rather than row-by-row ("give me all OHLCV data for this specific timestamp"). Parquet's columnar storage layout makes column access extremely efficient — often 10x faster than CSV for analytical queries.
Compression achieves 3-5x reduction vs CSV. Financial time series compress exceptionally well in columnar format because adjacent values in a column are similar (today's close price is typically within a few percent of yesterday's). Zstd compression achieves 3–5x file size reduction with minimal decompression overhead. Our typical daily OHLCV file for 500 symbols at 1-hour granularity compresses from approximately 25 MB (CSV) to 6 MB (Parquet+Zstd).
Embedded schema prevents silent corruption. Every Parquet file contains its column names, data types, and metadata. This eliminates an entire class of bugs related to column misalignment (did column 3 used to be "volume" or "close"?), type confusion, and silent data corruption that plague CSV-based workflows.
Partition-friendly design. We partition data by symbol and time period (quarterly for 1-minute data, yearly for daily data). Parquet's native support for partitioned datasets makes querying a single symbol's history efficient without scanning the entire dataset.
The Delisted Coin Problem: Survivorship Bias in Data Infrastructure
One of the most subtle challenges in crypto data infrastructure is handling delisted assets. Getting this wrong silently introduces survivorship bias into every backtest that uses the data.
Binance retains historical data for delisted coins. If a coin is removed from Binance, you can still access its full historical OHLCV data through the API. This is the responsible approach and makes survivorship-bias-free backtesting possible.
Bybit purges data for delisted perpetual contracts. When Bybit delists a perpetual contract, the historical data becomes unavailable through their standard API. If you were not already storing this data locally before the delisting, it is gone.
The practical implication: any backtest that relies purely on current Bybit listings for its universe is automatically survivorship-biased. Every coin that failed, crashed 95%, got delisted, or was deemed too illiquid — all of those coins are excluded from the "available" universe, and they are disproportionately the assets that would have lost money during the backtest period.
Our approach to this problem:
- We snapshot all available data from all exchanges daily, not on-demand. When a coin is delisted, our daily snapshot preserves the final state of the historical data
- All symbols are tagged with their listing date and delisting date (if applicable)
- Our backtest framework uses the tagging to correctly determine which symbols were tradeable at each point in time — a strategy evaluated in January 2023 will only use symbols that were actively listed in January 2023
- Delisted symbols are included in random universe sampling for survivorship bias audits
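The point-in-time universe logic in the third bullet can be sketched as follows. The field names (`listed_at`, `delisted_at`) and the example listings are illustrative assumptions, not QData's actual schema.

```python
# Hedged sketch of point-in-time universe selection. Field names are assumed;
# the key idea is that delisted symbols remain selectable for historical dates.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    symbol: str
    listed_at: date
    delisted_at: Optional[date]  # None while still listed


def tradeable_universe(listings: list[Listing], as_of: date) -> set[str]:
    """Symbols actively listed on `as_of`. A symbol delisted later still
    appears for earlier dates, which is what prevents survivorship bias."""
    return {
        l.symbol
        for l in listings
        if l.listed_at <= as_of
        and (l.delisted_at is None or as_of < l.delisted_at)
    }
```

A backtest evaluated at January 2022 would therefore see a later-delisted symbol as tradeable, while the same query for 2023 would exclude it.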
This infrastructure is what enables the survivorship bias audits we described in a previous post.
Multi-Market Coverage: Six Countries
QData's multi-market scope is not primarily for diversification (though that is a benefit). It exists because global factor analysis — understanding how value, momentum, and quality factors behave across different market structures — requires consistent data across markets.
Nigerian Equities (NGX) deserve particular mention. The Nigerian Stock Exchange is a frontier market with substantially different characteristics from developed markets: lower correlation with global factors, calendar effects tied to the Nigerian fiscal cycle, different liquidity dynamics, and unique sector exposures (banking, oil, cement, telecommunications). We built the NGX data pipeline specifically because no commercial data provider covers it with sufficient quality for systematic research.
The NGX pipeline uses a multi-stage approach:
1. A Selenium-based scraper handles JavaScript rendering and session management
2. Data normalization converts NGX's non-standard date formats and price conventions to our internal schema
3. A quality check validates that daily returns do not exceed physically plausible bounds (a filter tuned differently for a frontier market than for NYSE)
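The normalization stage can be illustrated with date handling. The specific NGX date variants below are hypothetical examples of the kind of inconsistency such a pipeline must absorb, not the exchange's documented formats.

```python
# Illustrative date normalization for an inconsistent source. The format
# variants here are assumptions, stand-ins for NGX's real quirks.
from datetime import date, datetime

ASSUMED_DATE_FORMATS = ("%d-%b-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_source_date(raw: str) -> date:
    """Try each known date variant and return a normalized date object."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # not this variant; try the next
    raise ValueError(f"Unrecognized source date: {raw!r}")
```

Centralizing this in one function means every downstream consumer sees a single canonical date type, regardless of which page layout the scraper encountered.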
Near Real-Time Pipeline for Live Trading
For live trading operations, data freshness is a first-order constraint. QData maintains a near-real-time pipeline:
- Crypto data: Updated with approximately 1-minute latency via WebSocket connections to Binance and Bybit. The WebSocket streams OHLCV data, order book snapshots, and trade flow data simultaneously
- Equity data: Updated every 15 minutes during market hours via API polling with fallback to 5-minute polling during high-volatility sessions
- Forex data: Updated every 5 minutes via Dukascopy API
The pipeline is designed for graceful degradation. If a WebSocket connection drops, the system reconnects automatically and backfills any missed candles using the REST API. If an API source is unavailable for more than 15 minutes, consumers are notified via a staleness flag attached to the data. If a primary source fails completely, the system fails over to a secondary source with a data quality warning.
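The backfill step after a dropped WebSocket connection can be sketched like this. The candle representation and the injected `fetch_rest` callable are assumptions standing in for the exchange's REST kline endpoint.

```python
# Sketch of gap backfill after a WebSocket drop. Candles are keyed by a
# minute-resolution timestamp; `fetch_rest` stands in for a REST kline call.
from typing import Callable


def backfill_gaps(
    candles: dict[int, tuple],           # minute timestamp -> OHLCV tuple
    start: int,
    end: int,
    fetch_rest: Callable[[int], tuple],  # fetch one candle by timestamp
) -> dict[int, tuple]:
    """Fill any minutes missing from [start, end] inclusive."""
    for ts in range(start, end + 1):
        if ts not in candles:
            candles[ts] = fetch_rest(ts)  # REST call replaces the lost candle
    return candles
```

In practice a real implementation would batch the REST requests rather than fetching candle-by-candle, but the detection logic is the same: compare the expected timestamp grid against what arrived over the stream.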
Automated CAPTCHA Solving
Several data sources — particularly Dukascopy and some NGX pages — use CAPTCHA challenges to prevent automated access. Since our pipeline runs on a headless server, human CAPTCHA solving is not practical at scale.
Our solution uses Xvfb (X Virtual Framebuffer) to create a virtual display environment on the server. A full Chromium browser, rather than a headless one, runs inside this virtual display, giving it a realistic browser fingerprint. Combined with a CAPTCHA-solving service for challenges that require visual recognition, and carefully tuned request timing to avoid rate-limit triggers, this setup keeps Dukascopy collection running unattended.
It is not elegant engineering. But it is reliable. The Dukascopy pipeline has maintained 99%+ uptime for over a year of continuous operation.
Data Quality Framework
Raw data is never directly trusted. Every data point entering QData passes through a multi-stage quality framework:
Completeness check: Are all expected bars present for a given symbol and timeframe? Missing bars are flagged. If a backup source is available, QData attempts to fill the gap. If not, the gap is recorded with metadata noting its origin.
Range check: Are values within physically plausible bounds? A Bitcoin price of $0 or $10 million triggers an immediate alert. Bounds are symbol-specific (crypto has wider acceptable ranges than blue-chip equities).
OHLC consistency check: Is High >= both Open and Close, and Low <= both Open and Close? Violations indicate data corruption or ingestion errors and are flagged immediately.
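The OHLC consistency rule is simple enough to state directly in code. This is a minimal sketch; the bar representation (a dict per bar) is an assumption.

```python
# Minimal version of the OHLC consistency check: High must bound the bar
# from above and Low from below. Bar representation is illustrative.
def ohlc_violations(bars: list[dict]) -> list[int]:
    """Return indices of bars whose High/Low bounds are inconsistent."""
    bad = []
    for i, b in enumerate(bars):
        if b["high"] < max(b["open"], b["close"]) or \
           b["low"] > min(b["open"], b["close"]):
            bad.append(i)
    return bad
```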
Cross-source validation: For assets covered by multiple sources (BTC from both Binance and Bybit, for example), prices are compared. Deviations beyond normal exchange spread differences trigger investigation. This catches silent API errors that would otherwise propagate into the data store.
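A minimal version of the cross-source comparison follows. The 50-basis-point tolerance is an assumed placeholder for "normal exchange spread differences", which in practice would vary by asset and market conditions.

```python
# Sketch of cross-source price validation. The default tolerance is an
# assumption; real thresholds would be calibrated per asset.
def cross_source_deviation(price_a: float, price_b: float) -> float:
    """Relative deviation between two sources, measured against the midpoint."""
    mid = (price_a + price_b) / 2
    return abs(price_a - price_b) / mid


def needs_investigation(price_a: float, price_b: float,
                        tol: float = 0.005) -> bool:
    """Flag when the deviation exceeds normal spread differences."""
    return cross_source_deviation(price_a, price_b) > tol
```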
Staleness check: Each data point carries an ingestion timestamp. Consumers can query the freshness of any data and handle stale data appropriately in their pipelines.
Anomaly detection: A rolling z-score check on daily returns flags outlier returns for manual review. Not all outliers are errors — sharp market moves are real — but genuine data errors often appear as extreme outliers.
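The rolling z-score check can be sketched as below. The window length and threshold are illustrative values, not QData's tuned parameters.

```python
# Rolling z-score outlier flag on daily returns. Window and threshold
# are illustrative; production values would be tuned per market.
import statistics


def zscore_outliers(returns: list[float], window: int = 20,
                    z: float = 4.0) -> list[int]:
    """Indices of returns more than `z` standard deviations from the
    rolling mean of the preceding `window` observations."""
    flagged = []
    for i in range(window, len(returns)):
        hist = returns[i - window:i]
        mu = statistics.fmean(hist)
        sd = statistics.stdev(hist)
        if sd > 0 and abs(returns[i] - mu) / sd > z:
            flagged.append(i)
    return flagged
```

Flagged indices go to manual review rather than automatic rejection, since a genuine crash and a corrupted data point look identical to this check.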
The Cron Automation Layer
The batch data pipeline runs on approximately 30 scheduled cron jobs at various frequencies:
- Every 1 minute: Crypto WebSocket health check and reconnect if needed
- Every 15 minutes: Equity data refresh during market hours
- Every hour: NGX data pull during Nigerian market hours
- Daily at market close: End-of-day equity data collection, macro data update
- Weekly: Full universe sync to catch any new listings or delistings
The scheduler maintains an audit log of every cron job execution: start time, end time, records processed, errors encountered, and data quality issues flagged. This log feeds a monitoring dashboard that highlights any pipeline degradation before it impacts research or trading.
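A per-run audit record of the kind described might look like the sketch below. The field names mirror the log contents the text lists but are otherwise assumed.

```python
# Sketch of a cron-run audit record; field names follow the post's list
# (start/end time, records processed, errors, quality flags) but the
# structure itself is an illustrative assumption.
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CronRunLog:
    job_name: str
    started_at: datetime
    finished_at: datetime
    records_processed: int
    errors: list[str] = field(default_factory=list)
    quality_flags: list[str] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        """Wall-clock duration of the run, for dashboard trend lines."""
        return (self.finished_at - self.started_at).total_seconds()
```

A monitoring dashboard can then aggregate these records, for example alerting when a job's duration or error count drifts from its historical baseline.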
Takeaways
- A unified data layer is the single highest-leverage infrastructure investment a quantitative operation can make — it eliminates a class of problems rather than solving them repeatedly
- Parquet with Zstd compression achieves 3-5x size reduction over CSV with better schema enforcement, faster analytical queries, and partition-friendly design
- The delisted coin problem silently introduces survivorship bias — data infrastructure must preserve historical listings at the moment of delisting to prevent this
- Multi-market coverage enables global factor research that single-market data cannot support
- Near real-time data with 1-minute latency on crypto enables intraday risk management and systematic signal generation on short timeframes
- CAPTCHA solving with Xvfb is operationally unglamorous but achieves 99%+ pipeline uptime
- Data quality is a continuous process, not a one-time validation — automated checks at every stage are essential