Methodology

This page describes how the raw on-chain data becomes the tables you see in this dataset. For the academic motivation and full empirical results, see the companion paper.

Sample window

Start: 2022-11-11 (Polymarket's CTF Exchange launch)
End: 2026-03-29
Timezone: UTC throughout

Source data

The pipeline starts from raw OrderFilled events emitted by the CTF Exchange contract on Polygon, plus a small number of supporting contracts for resolution and fee changes. All cleaning is derived from these events; no off-chain Polymarket API is required.

End-user reconciliation

The maker/taker addresses in a raw OrderFilled event are typically operator contracts acting on behalf of users, not the user wallets themselves. The reconciliation step:

  1. Loads the ProxyDeployed / SafeDeployed events that bind an operator address to its end-user owner.
  2. Walks each fill and substitutes the operator address with the end-user wallet on both sides.
  3. Skips fills where neither side resolves to an end user (rare — contract-to-contract transfers that aren't user-facing).

The output is the trades table — every row is attributed to a real end-user wallet.
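The three steps above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the `OWNER_OF` map, the addresses, and the fill fields are hypothetical, and keeping the raw address on a side that does not resolve is an assumption.

```python
from typing import Optional

# Hypothetical operator -> owner map, as would be built from
# ProxyDeployed / SafeDeployed events (addresses are made up).
OWNER_OF = {
    "0xproxyA": "0xalice",
    "0xsafeB": "0xbob",
}

def reconcile(fill: dict, owner_of: dict) -> Optional[dict]:
    """Replace operator addresses with end-user wallets on both sides.

    Returns None when neither side resolves to an end user.
    """
    maker = owner_of.get(fill["maker"])
    taker = owner_of.get(fill["taker"])
    if maker is None and taker is None:
        return None  # contract-to-contract fill, skipped
    # Assumption: a side that does not resolve keeps its raw address.
    return {
        **fill,
        "maker": maker or fill["maker"],
        "taker": taker or fill["taker"],
    }

fills = [
    {"maker": "0xproxyA", "taker": "0xsafeB", "size": 10},
    {"maker": "0xrouter", "taker": "0xvault", "size": 5},  # neither side resolves
]
trades = [t for f in fills if (t := reconcile(f, OWNER_OF)) is not None]
```

Here the second fill is dropped because neither address appears in the map, so `trades` holds a single row attributed to the two end-user wallets.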

Categories

Polymarket exposes a coarse "category" tag in its UI but does not include it in the on-chain data. We train a TF-IDF + Logistic Regression classifier on Kalshi market questions (which do carry categorical labels) and apply it to every Polymarket market title + description. The output category for each market is one of:

Sports · Crypto · Finance · Politics · Tech · Culture · Weather

Markets that the classifier cannot confidently assign are labeled Untagged. They are kept in markets, events, and trades, but excluded from per-category PnL tables.
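One way the Untagged fallback could work is a cutoff on the classifier's top class probability. The sketch below assumes that design; `MIN_CONFIDENCE` and its value are hypothetical and not taken from the pipeline.

```python
CATEGORIES = ["Sports", "Crypto", "Finance", "Politics", "Tech", "Culture", "Weather"]
MIN_CONFIDENCE = 0.5  # hypothetical cutoff, not the pipeline's actual value

def assign_category(probs: dict, min_confidence: float = MIN_CONFIDENCE) -> str:
    """Return the most likely category, or "Untagged" if it is not confident enough."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_confidence else "Untagged"

def per_category_rows(rows: list) -> list:
    """Drop Untagged markets, as done for the per-category PnL tables."""
    return [r for r in rows if r["category"] != "Untagged"]
```

For example, a market scored 0.8 for Sports is labeled Sports, while one whose best score is 0.4 falls back to Untagged.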

PnL computation

For each user, on each day, we compute:

pnl(t) = portfolio_value(t) + usdc_balance(t)

where:

  • usdc_balance(t) is the user's accumulated USDC: deposits − withdrawals − cash spent on fills + cash received from fills + payouts from market resolution. We reconstruct this from the trade stream; no on-chain USDC balance query is performed.
  • portfolio_value(t) marks open conditional-token positions to the day's closing mid price.

PnL is mark-to-market, not realized — open positions at the sample end are valued at their last observed mid.
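As a worked example of the formula above, the sketch below computes one user-day. The function and argument names are illustrative only, and the numbers are made up.

```python
def usdc_balance(deposits, withdrawals, cash_spent, cash_received, payouts):
    """Accumulated USDC, following the definition of usdc_balance(t)."""
    return deposits - withdrawals - cash_spent + cash_received + payouts

def portfolio_value(positions, closing_mid):
    """Mark each open position (token -> quantity) to the day's closing mid."""
    return sum(qty * closing_mid[token] for token, qty in positions.items())

def pnl(positions, closing_mid, balance):
    return portfolio_value(positions, closing_mid) + balance

# A user buys 100 YES shares at 0.50 (50 USDC spent); the day closes at a mid of 0.75.
balance = usdc_balance(0.0, 0.0, cash_spent=50.0, cash_received=0.0, payouts=0.0)
day_pnl = pnl({"YES": 100.0}, {"YES": 0.75}, balance)
```

The balance is −50 and the position is worth 75, so the mark-to-market PnL for the day is 25 even though nothing has been realized.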

Storage: delta encoding

Positions and PnL only change on days a user trades or a market they hold settles. We store only those rows (the "sparse" tables). To reconstruct a dense panel, forward-fill — see the forward-fill recipe.
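For readers without the recipe at hand, a minimal standard-library sketch of forward-filling is shown here. It assumes each sparse row stores the level on that day (not an increment) and that a user starts from 0 before their first row; both are assumptions, and the dates and values are made up.

```python
from datetime import date, timedelta

def forward_fill(sparse: dict, start: date, end: date, initial: float = 0.0) -> dict:
    """Expand {day: value} rows into one value per calendar day in [start, end]."""
    dense, last = {}, initial
    day = start
    while day <= end:
        last = sparse.get(day, last)  # carry the previous value on days with no row
        dense[day] = last
        day += timedelta(days=1)
    return dense

sparse_pnl = {date(2024, 1, 1): 10.0, date(2024, 1, 4): 12.5}
dense_pnl = forward_fill(sparse_pnl, date(2024, 1, 1), date(2024, 1, 5))
```

The two stored rows expand to five daily values: 10.0 for 1 to 3 January and 12.5 for 4 and 5 January.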

Variants

The user_pnl_summary table carries five methodological variants:

| Variant | What it changes |
| --- | --- |
| Base | Default: all trades, all markets, no spread adjustment. |
| Resolved | Restricts to markets that have settled (winner not null) by the sample end. Eliminates mark-to-market valuation uncertainty. |
| No-fee | Restricts to markets without taker fees (those predating Polymarket's Q4 2024 fee introduction). Equivalent to markets.has_fee = False. |
| Spread-adjusted | Charges each fill an additional half-spread (0.005 by default, half the 1¢ tick). Approximates a no-free-lunch market without rebate. |
| Spread-adjusted + Resolved | The spread adjustment applied to the Resolved subset. The cleanest read on "did this user actually make money?" |

The sparse panels also ship a Resolved + No-fee intersection (pnl_daily_resolved_no_fee and pnl_category_daily_resolved_no_fee). That variant is not exposed on user_pnl_summary because the paper uses the strict intersection only as an appendix specification.
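To make the spread adjustment concrete, the sketch below charges the half-spread per share on the fill price. How the pipeline actually applies the charge is not specified here, so treating it as a worse price for both buyers and sellers is an assumption, and `fill_cash_flow` is a hypothetical helper.

```python
HALF_SPREAD = 0.005  # default half-spread: half of the 1-cent tick

def fill_cash_flow(side: str, price: float, size: float, half_spread: float = 0.0) -> float:
    """Signed USDC cash flow of one fill from the user's point of view.

    Assumption: buyers pay price + half_spread per share and
    sellers receive price - half_spread per share.
    """
    if side == "buy":
        return -(price + half_spread) * size
    if side == "sell":
        return (price - half_spread) * size
    raise ValueError(f"unknown side: {side!r}")

base_buy = fill_cash_flow("buy", 0.5, 100)                   # Base variant
adjusted_buy = fill_cash_flow("buy", 0.5, 100, HALF_SPREAD)  # Spread-adjusted variant
```

Under this assumption, buying 100 shares at 0.50 costs about 0.50 USDC more in the Spread-adjusted variant than in Base.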

Variant semantics (v1.1)

Starting in v1.1, the Resolved and No-fee filters are applied to the underlying trades when positions are built, not at PnL aggregation time. A trade on a market outside the variant's subset contributes nothing to that variant's positions, USDC balance, or eventual settlement, even if the user later sold back to flat. This makes each variant's terminal PnL structurally zero-sum within its market subset (modulo platform-collected fees and non-user counterparties). Users whose trading is entirely outside a variant's subset have no PnL under that variant; their columns for it in user_pnl_summary are encoded as 0.0 rather than left missing, for backwards compatibility.
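The difference between filtering at the trade level and filtering at aggregation time can be illustrated with a small sketch. The field names (`user`, `market`, `signed_size`) and market IDs are hypothetical.

```python
from collections import defaultdict

def build_positions(trades, allowed_markets=None):
    """Net position per (user, market); trades outside the subset are dropped first."""
    positions = defaultdict(float)
    for t in trades:
        if allowed_markets is not None and t["market"] not in allowed_markets:
            continue  # contributes nothing to this variant
        positions[(t["user"], t["market"])] += t["signed_size"]
    return dict(positions)

trades = [
    {"user": "u1", "market": "m_resolved", "signed_size": 10.0},
    {"user": "u1", "market": "m_open", "signed_size": 5.0},
    {"user": "u2", "market": "m_open", "signed_size": -5.0},
]
base = build_positions(trades)
resolved = build_positions(trades, allowed_markets={"m_resolved"})
```

In this example user u2 trades only in an unresolved market, so u2 appears in the Base positions but not at all in the Resolved variant.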

Wash trading

We detect (but do not filter) suspected wash trading via counterparty Herfindahl–Hirschman Index (HHI). The HHI for user u is

$$ \text{HHI}_u = \sum_c s_{u,c}^2 $$

where $s_{u,c}$ is the share of user u's dollar volume that counter-traded against user c. A user trading exclusively against one other wallet has HHI = 1; a user trading evenly against many wallets has HHI → 0.

user_features.counterparty_hhi exposes this metric. The paper flags users with HHI ≥ 0.5 and ≥ 100 trades as suspected wash traders. We ship the metric so you can apply your own threshold; the trades, pnl_*, and user_features tables are not filtered.
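A minimal sketch of the HHI computation and the paper's flagging rule follows. The input layout (dollar volume keyed by counterparty) and the function names are illustrative, and returning 0.0 for a user with no volume is an assumption.

```python
def counterparty_hhi(volume_by_counterparty: dict) -> float:
    """Sum of squared dollar-volume shares across a user's counterparties."""
    total = sum(volume_by_counterparty.values())
    if total == 0:
        return 0.0  # assumption: no volume means no concentration
    return sum((v / total) ** 2 for v in volume_by_counterparty.values())

def is_suspected_wash_trader(hhi: float, n_trades: int,
                             hhi_min: float = 0.5, trades_min: int = 100) -> bool:
    """Thresholds default to the paper's rule: HHI >= 0.5 and >= 100 trades."""
    return hhi >= hhi_min and n_trades >= trades_min

single = counterparty_hhi({"0xaaa": 1000.0})
even_four = counterparty_hhi({"a": 25.0, "b": 25.0, "c": 25.0, "d": 25.0})
```

A user trading against a single wallet gets HHI = 1, while volume split evenly across four counterparties gives 0.25. Pass different `hhi_min` and `trades_min` values to apply your own threshold.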

Open interest

The ohlcv_1d.open_interest column is the sum of strictly positive user positions per (token, day). It can exceed a day's total volume, because positions opened on earlier days and still held are counted.
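The aggregation can be sketched as below. The row layout `(token, day, user, quantity)` is a hypothetical stand-in for the positions table, and the values are made up.

```python
from collections import defaultdict

def open_interest(position_rows):
    """Sum strictly positive positions per (token, day)."""
    oi = defaultdict(float)
    for token, day, _user, qty in position_rows:
        if qty > 0:  # zero and negative positions are ignored
            oi[(token, day)] += qty
    return dict(oi)

rows = [
    ("YES", "2024-01-01", "u1", 10.0),
    ("YES", "2024-01-01", "u2", 5.0),
    ("YES", "2024-01-01", "u3", 0.0),
    ("YES", "2024-01-01", "u4", -3.0),
]
oi = open_interest(rows)
```

Only the two positive positions are counted, so open interest for the token on that day is 15.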

What's intentionally excluded

  • Pre-CTF Exchange data (Polymarket's original Augur-based deployment). Out of scope for the reconciliation pipeline.
  • Order book snapshots / quotes / depth. Computed by a separate microstructure pipeline maintained by the authors but not yet packaged for public distribution.
  • L2 / off-chain quote streams. Mid prices used for PnL marking come from realized trades on-chain only.