
Methodology

How the raw on-chain data becomes the tables you see in this dataset. For the academic motivation and full empirical results, see the companion paper.

Sample window

Start: 2022-11-11 (Polymarket's CTF Exchange launch)
End: 2026-03-29
Timezone: UTC throughout

Source data

The pipeline starts from raw OrderFilled events emitted by the CTF Exchange contract on Polygon, plus a small number of supporting contracts for resolution and fee changes. Every cleaning step is derived from these events; no off-chain Polymarket API is required.

End-user reconciliation

The maker/taker addresses in a raw OrderFilled event are typically operator contracts acting on behalf of users, not the user wallets themselves. The reconciliation step:

  1. Loads the ProxyDeployed / SafeDeployed events that bind an operator address to its end-user owner.
  2. Walks each fill and substitutes the operator address with the end-user wallet on both sides.
  3. Skips fills where neither side resolves to an end user (rare — contract-to-contract transfers that aren't user-facing).

The output is the trades table — every row is attributed to a real end-user wallet.
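
The three steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the dictionary keys (`proxy`, `owner`, `maker`, `taker`) are hypothetical field names, and the choice to leave an unresolved side as its raw address when the other side does resolve is an assumption.

```python
def build_owner_map(deploy_events):
    """Map each operator (proxy/safe) address to its end-user owner.

    Field names 'proxy' and 'owner' are hypothetical; addresses are
    lowercased so that checksum casing does not break lookups.
    """
    return {e["proxy"].lower(): e["owner"].lower() for e in deploy_events}


def reconcile_fills(fills, operator_to_owner):
    """Substitute operator addresses with end-user wallets on both sides."""
    trades = []
    for fill in fills:
        maker = fill["maker"].lower()
        taker = fill["taker"].lower()
        maker_owner = operator_to_owner.get(maker)
        taker_owner = operator_to_owner.get(taker)
        # Skip fills where neither side resolves to an end user.
        if maker_owner is None and taker_owner is None:
            continue
        # Assumption: an unresolved side keeps its raw address.
        trades.append({**fill,
                       "maker": maker_owner or maker,
                       "taker": taker_owner or taker})
    return trades
```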

Categories

Polymarket exposes a coarse "category" tag in its UI but does not include it in the on-chain data. We train a TF-IDF + Logistic Regression classifier on Kalshi market questions (which do carry categorical labels) and apply it to every Polymarket market title + description. The output category for each market is one of:

Sports · Crypto · Finance · Politics · Tech · Culture · Weather

Markets that the classifier cannot confidently assign are labeled Untagged. They are kept in markets, events, and trades, but excluded from per-category PnL tables.
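
One way to read "cannot confidently assign" is a cutoff on the classifier's top class probability. The sketch below illustrates that reading only; the cutoff value `MIN_CONFIDENCE` and the function name are hypothetical, and the rule the pipeline actually uses may differ.

```python
CATEGORIES = ["Sports", "Crypto", "Finance", "Politics", "Tech", "Culture", "Weather"]
MIN_CONFIDENCE = 0.5  # hypothetical cutoff, not taken from the dataset


def assign_category(probabilities, min_confidence=MIN_CONFIDENCE):
    """Return the most probable category, or 'Untagged' below the cutoff.

    probabilities: dict mapping category name -> predicted probability.
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] < min_confidence:
        return "Untagged"
    return best
```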

PnL computation

For each user, on each day, we compute:

pnl(t) = portfolio_value(t) + usdc_balance(t)

where:

  • usdc_balance(t) is the user's accumulated USDC: deposits − withdrawals − cash spent on fills + cash received from fills + payouts from market resolution. We reconstruct this from the trade stream; no on-chain USDC balance query is performed.
  • portfolio_value(t) marks open conditional-token positions to the day's closing mid price.

PnL is mark-to-market, not realized — open positions at the sample end are valued at their last observed mid.
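
A minimal sketch of the formula above for one user on one day. The cash-flow labels (`deposit`, `withdrawal`, `buy`, `sell`, `payout`) and the input shapes are hypothetical, chosen to mirror the terms in the definition of usdc_balance(t).

```python
# Sign of each cash-flow kind, following the usdc_balance definition above.
CASH_SIGN = {"deposit": 1, "withdrawal": -1, "buy": -1, "sell": 1, "payout": 1}


def usdc_balance(cash_flows):
    """cash_flows: iterable of (kind, usdc_amount) up to and including day t."""
    return sum(CASH_SIGN[kind] * amount for kind, amount in cash_flows)


def portfolio_value(positions, mid_prices):
    """Mark open positions (token -> shares) to the day's closing mid."""
    return sum(shares * mid_prices[token] for token, shares in positions.items())


def pnl(cash_flows, positions, mid_prices):
    return portfolio_value(positions, mid_prices) + usdc_balance(cash_flows)
```

For example, a user who buys 100 shares at 0.40 and is marked at a mid of 0.55 has a USDC balance of −40, a portfolio value of 55, and a mark-to-market PnL of 15.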

Storage: delta encoding

Positions and PnL only change on days a user trades or a market they hold settles. We store only those rows (the "sparse" tables). To reconstruct a dense panel, forward-fill — see the forward-fill recipe.
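
As an illustration of the idea (not a substitute for the recipe referenced above), a sparse series can be expanded to one row per calendar day by carrying the last stored value forward. The function name and input shape are hypothetical; days before the first stored row are left as `None`.

```python
from datetime import date, timedelta


def forward_fill(sparse_rows, start, end):
    """Expand a sparse {date: value} series to every day in [start, end]."""
    dense = []
    last = None  # no value is known before the first stored row
    day = start
    while day <= end:
        if day in sparse_rows:
            last = sparse_rows[day]
        dense.append((day, last))
        day += timedelta(days=1)
    return dense
```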

Variants

The user_pnl_summary table carries four methodological variants:

| Variant | What it changes |
| --- | --- |
| Base | Default: all trades, all markets, no spread adjustment. |
| Resolved | Restricts to markets that have settled (winner not null) by the sample end. Eliminates mark-to-market valuation uncertainty. |
| Spread-adjusted | Charges each fill an additional half-spread (0.005 by default, i.e. half the 1¢ tick). Approximates a no-free-lunch market without rebate. |
| Spread-adjusted + Resolved | Both adjustments applied. The cleanest read on "did this user actually make money?" |

The no_fee variant used internally by the paper (markets where taker_base_fee is null or 0) is not shipped — it's a paper-internal robustness check.
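
To make the spread adjustment concrete, the sketch below charges the half-spread per share by worsening each fill's execution price. That per-share interpretation and the function name are assumptions for illustration; only the 0.005 default comes from the variant description.

```python
HALF_SPREAD = 0.005  # default: half the 1-cent tick


def spread_adjusted_cash(side, price, size, half_spread=HALF_SPREAD):
    """Signed USDC cash flow of a fill after charging half_spread per share.

    Assumption: buys pay price + half_spread, sells receive price - half_spread.
    """
    if side == "buy":
        return -(price + half_spread) * size
    if side == "sell":
        return (price - half_spread) * size
    raise ValueError(f"unknown side: {side!r}")
```

Under this reading, buying 100 shares at 0.40 costs 40.50 instead of 40.00, and selling them at the same price returns 39.50.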

Wash trading

We detect (but do not filter) suspected wash trading via counterparty Herfindahl–Hirschman Index (HHI). The HHI for user u is

$$ \text{HHI}_u = \sum_c s_{u,c}^2 $$

where $s_{u,c}$ is the share of user u's dollar volume that counter-traded against user c. A user trading exclusively against one other wallet has HHI = 1; a user trading evenly against many wallets has HHI → 0.

user_features.counterparty_hhi exposes this metric. The paper flags users with HHI ≥ 0.5 and ≥ 100 trades as suspected wash traders. We ship the metric so you can apply your own threshold; the trades, pnl_*, and user_features tables are not filtered.
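
A minimal sketch of the metric and the paper's flag, assuming each trade is a dict with hypothetical keys `maker`, `taker`, and `usd` (dollar volume). Returning `None` for a user with no volume is a choice made here, not something the dataset specifies.

```python
from collections import defaultdict


def counterparty_hhi(trades, user):
    """Sum of squared counterparty volume shares for one user."""
    volume = defaultdict(float)  # counterparty -> dollar volume against user
    for t in trades:
        if t["maker"] == user:
            volume[t["taker"]] += t["usd"]
        elif t["taker"] == user:
            volume[t["maker"]] += t["usd"]
    total = sum(volume.values())
    if total == 0:
        return None  # no volume: HHI is undefined
    return sum((v / total) ** 2 for v in volume.values())


def is_suspected_wash_trader(hhi, n_trades, hhi_min=0.5, trades_min=100):
    """Apply the paper's thresholds: HHI >= 0.5 and at least 100 trades."""
    return hhi is not None and hhi >= hhi_min and n_trades >= trades_min
```

A user who trades only against one wallet scores 1, while a user who splits volume evenly across four counterparties scores 4 × 0.25² = 0.25.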

Open interest

The ohlcv_1d.open_interest column is the sum of strictly positive user positions per (token, day). It can exceed that day's traded volume, because positions opened on earlier days and still held count toward open interest.
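
The aggregation can be sketched as below, assuming a dense list of end-of-day positions as hypothetical `(user, token, day, quantity)` tuples. Zero and negative quantities are excluded, matching the "strictly positive" rule.

```python
from collections import defaultdict


def open_interest(positions):
    """Sum strictly positive positions per (token, day)."""
    oi = defaultdict(float)
    for _user, token, day, quantity in positions:
        if quantity > 0:  # zero and negative positions do not count
            oi[(token, day)] += quantity
    return dict(oi)
```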

What's intentionally excluded

  • Pre-CTF Exchange data (Polymarket's earlier AMM-based markets). Out of scope for the reconciliation pipeline.
  • Order book snapshots / quotes / depth. Computed by a separate microstructure pipeline maintained by the authors but not yet packaged for public distribution.
  • L2 / off-chain quote streams. Mid prices used for PnL marking come from realized trades on-chain only.