Overview

The Historical Market Cap dataset provides a point-in-time record of equity size through historical market capitalization and shares outstanding.

Rather than relying on present-day values or retrospectively adjusted data, this dataset reflects market cap as it was observable on each historical date.

Market capitalization is a foundational input in many systematic strategies — from universe construction and factor research to risk modeling and portfolio weighting. However, accurate historical market cap data is difficult to source and easy to misuse. Naïve approaches often introduce survivorship bias, forward-looking revisions, or implicit assumptions about corporate actions that were not known at the time.

This dataset is designed to address those issues directly.


What the Dataset Represents

Each record in the Historical Market Cap dataset represents:

  • A single equity ticker

  • A specific calendar date

  • Shares outstanding as of that date

  • The corresponding market capitalization as observed at that time

Both values are provided directly, allowing users to work with historical market cap data without requiring additional joins or reconstruction steps.

The data reflects what was available on each date; it is not revised or restated using future information. If a company’s shares outstanding or market capitalization changed due to issuance, buybacks, or other corporate actions, those changes appear in the dataset only from the date they became observable.
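As an illustration of the record shape and the point-in-time behavior described above, here is a minimal sketch in Python. The class name, field names, and sample values are assumptions made for this example, not the dataset's actual schema (which is documented in a later section). The lookup returns the most recent record dated on or before the query date, so a later share issuance does not leak into earlier dates.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MarketCapRecord:
    # Hypothetical field names, chosen for illustration only.
    ticker: str
    as_of: date
    shares_outstanding: float
    market_cap: float


def latest_known(
    records: list[MarketCapRecord], ticker: str, query_date: date
) -> Optional[MarketCapRecord]:
    """Return the most recent record for `ticker` dated on or before `query_date`."""
    history = sorted(
        (r for r in records if r.ticker == ticker), key=lambda r: r.as_of
    )
    # bisect_right gives the count of records with as_of <= query_date.
    idx = bisect_right([r.as_of for r in history], query_date)
    return history[idx - 1] if idx > 0 else None


# Illustrative values only: a share issuance becomes observable on 2020-06-01.
records = [
    MarketCapRecord("ABC", date(2020, 1, 2), 1_000_000, 50_000_000.0),
    MarketCapRecord("ABC", date(2020, 6, 1), 1_200_000, 66_000_000.0),
]

before_issuance = latest_known(records, "ABC", date(2020, 3, 15))
after_issuance = latest_known(records, "ABC", date(2020, 6, 1))
before_coverage = latest_known(records, "ABC", date(2019, 12, 31))
```

A query dated before the first record returns `None` rather than a backfilled value, which mirrors the intent that nothing is visible before it was observable.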


Why Historical Market Cap Is Hard

Historical market capitalization is deceptively complex.

Many commonly used datasets:

  • Backfill shares outstanding using modern values

  • Revise historical market cap figures based on future filings

  • Omit delisted or short-lived securities

  • Provide no visibility into when coverage actually begins

These practices can silently distort research results, especially when building size-based universes, liquidity filters, or long-horizon backtests.

The Historical Market Cap dataset is constructed with point-in-time correctness as the primary constraint, ensuring that users can reason about equity size exactly as it would have appeared historically.


Designed for Survivorship-Safe Research

A core use case of this dataset is survivorship-bias-free universe construction.

The dataset provides:

  • Explicit coverage start dates for each ticker

  • Historical market cap and shares outstanding by date

  • No reliance on present-day index constituents

With these, users can construct equity universes that reflect what was actually investable at a given moment, rather than what survived into the present.
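A minimal sketch of that idea in Python, using in-memory dictionaries in place of real API responses. The tickers, dates, values, and the `universe_on` helper are hypothetical and exist only for this example. A ticker is included on a date only if its coverage has started and a market cap was observed on that date, regardless of whether the company still trades today.

```python
from datetime import date

# Hypothetical coverage start dates per ticker (illustrative values only).
coverage_start = {
    "AAA": date(2005, 1, 3),
    "BBB": date(2012, 5, 18),
    "CCC": date(2001, 3, 1),
}

# Hypothetical observations: (ticker, date) -> market cap in USD.
market_cap = {
    ("AAA", date(2010, 6, 30)): 2.5e9,
    ("CCC", date(2010, 6, 30)): 4.0e8,  # delisted later, but present on this date
    ("AAA", date(2015, 6, 30)): 3.1e9,
    ("BBB", date(2015, 6, 30)): 9.0e8,
}


def universe_on(as_of, min_cap=0.0):
    """Tickers with coverage started and an observed market cap >= min_cap on `as_of`."""
    members = []
    for ticker, start in coverage_start.items():
        if start > as_of:
            continue  # coverage had not begun yet on this date
        cap = market_cap.get((ticker, as_of))
        if cap is not None and cap >= min_cap:
            members.append(ticker)
    return sorted(members)


universe_2010 = universe_on(date(2010, 6, 30))
universe_2015_large = universe_on(date(2015, 6, 30), min_cap=1e9)
```

In this toy data, `CCC` belongs to the 2010 universe even though it has no record in 2015, while `BBB` is excluded in 2010 because its coverage starts in 2012.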

This is particularly important for:

  • Small-cap and micro-cap research

  • Long-horizon factor studies

  • Corporate action–driven strategies

  • Backtests sensitive to universe composition


Practical Scope and Intent

This dataset is intended to be used programmatically.

It is optimized for:

  • Systematic research workflows

  • Batch querying by ticker or date

  • Integration into production trading pipelines

Guardrails are intentionally built into the API to encourage correct usage patterns and prevent accidental full-table scans. These constraints are documented in later sections and are designed to reflect how practitioners typically interact with historical market cap data.


How This Fits into a Research Stack

The Historical Market Cap dataset is not a standalone signal. It is a foundational input that complements price, volume, and event-based data.

Typical integrations include:

  • Filtering universes by historical size thresholds

  • Weighting portfolios by market cap at rebalance time

  • Conditioning strategies on size regimes

  • Combining with event or factor data for downstream modeling
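The first two integrations can be sketched together in a few lines of Python. This is an illustrative example under stated assumptions: the `cap_weights` helper, the tickers, and the market cap values are hypothetical, and the input dictionary stands in for market caps retrieved as of the rebalance date.

```python
def cap_weights(caps, min_cap=0.0):
    """Market-cap weights over tickers whose cap is at least `min_cap`.

    `caps` maps ticker -> market cap as observed on the rebalance date.
    Returns an empty dict if no ticker passes the size filter.
    """
    eligible = {t: c for t, c in caps.items() if c >= min_cap}
    total = sum(eligible.values())
    if total <= 0:
        return {}
    return {t: c / total for t, c in eligible.items()}


# Hypothetical market caps (USD) observed at rebalance time.
caps_at_rebalance = {"AAA": 6.0e9, "BBB": 3.0e9, "CCC": 1.0e9, "DDD": 2.0e8}

# Drop names below a 500M size threshold, then weight the rest by market cap.
weights = cap_weights(caps_at_rebalance, min_cap=5.0e8)
```

Because the threshold and the weights both use values observed at the rebalance date, neither step depends on information from later dates.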

Used correctly, historical market cap becomes an enabling layer for more sophisticated and defensible research.


What’s Next

The following sections walk through:

  • How to use this dataset in practice, with an emphasis on survivorship-safe universe construction

  • The data schema and API design, including discovery of available tickers and correct query patterns
