# Methodology
How the raw on-chain data becomes the tables you see in this dataset. For the academic motivation and full empirical results, see the companion paper.
## Sample window
| Field | Value |
|---|---|
| Start | 2022-11-11 (Polymarket's CTF Exchange launch) |
| End | 2026-03-29 |
| Timezone | UTC throughout |
## Source data
The pipeline starts from raw OrderFilled events emitted by the
CTF Exchange contract on
Polygon (and a small number of supporting contracts for resolution and
fee changes). All cleaning operates on these events; no off-chain
Polymarket API is required.
## End-user reconciliation
The maker/taker addresses in a raw OrderFilled event are typically
operator contracts acting on behalf of users, not the user wallets
themselves. The reconciliation step:
- Loads the `ProxyDeployed`/`SafeDeployed` events that bind an operator address to its end-user owner.
- Walks each fill and substitutes the operator address with the end-user wallet on both sides.
- Skips fills where neither side resolves to an end user (rare: contract-to-contract transfers that aren't user-facing).
The output is the trades table — every row is attributed to a real
end-user wallet.
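
The reconciliation steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `OWNER_OF` mapping, the dictionary field names, and the addresses are hypothetical, and keeping the raw address for a side that does not resolve is an assumption.

```python
from typing import Optional

# Hypothetical operator -> owner map, as would be built from
# ProxyDeployed / SafeDeployed events (addresses are made up).
OWNER_OF = {
    "0xproxyA": "0xalice",
    "0xsafeB": "0xbob",
}

def reconcile_fill(fill: dict, owner_of: dict) -> Optional[dict]:
    """Replace operator addresses with end-user wallets on both sides.

    Returns None when neither side resolves to an end user.
    """
    maker = owner_of.get(fill["maker"])
    taker = owner_of.get(fill["taker"])
    if maker is None and taker is None:
        return None  # contract-to-contract fill, skipped
    # Assumption: a side that does not resolve keeps its raw address.
    return {**fill, "maker": maker or fill["maker"], "taker": taker or fill["taker"]}

fills = [
    {"maker": "0xproxyA", "taker": "0xsafeB", "usdc": 10.0},
    {"maker": "0xproxyA", "taker": "0xrouter", "usdc": 5.0},
    {"maker": "0xrouter", "taker": "0xvault", "usdc": 7.0},
]
# Keep only fills where at least one side resolved.
trades = [r for f in fills if (r := reconcile_fill(f, OWNER_OF)) is not None]
```

Here the third fill is dropped because neither address appears in the mapping.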
## Categories
Polymarket exposes a coarse "category" tag in its UI but does not include it in the on-chain data. We train a TF-IDF + Logistic Regression classifier on Kalshi market questions (which do carry categorical labels) and apply it to every Polymarket market title + description. The output category for each market is one of:
Markets that the classifier cannot confidently assign are labeled
Untagged. They are kept in markets, events, and trades, but
excluded from per-category PnL tables.
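
The fallback to Untagged can be illustrated with a small thresholding sketch. It assumes the classifier yields a probability per label; the 0.5 cutoff and the placeholder label names are assumptions for illustration, not values taken from the dataset.

```python
UNTAGGED = "Untagged"

def assign_category(probs: dict, min_confidence: float = 0.5) -> str:
    """Return the most probable label, or Untagged below the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= min_confidence else UNTAGGED

# Placeholder labels; the real category names are not reproduced here.
confident = assign_category({"label_a": 0.82, "label_b": 0.18})
ambiguous = assign_category({"label_a": 0.40, "label_b": 0.35, "label_c": 0.25})
```

With these inputs, `confident` receives `label_a` and `ambiguous` falls back to `Untagged`.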
## PnL computation
For each user, on each day, we compute:
where:
- `usdc_balance(t)` is the user's accumulated USDC: deposits − withdrawals − cash spent on fills + cash received from fills + payouts from market resolution. We reconstruct this from the trade stream; no on-chain USDC balance query is performed.
- `portfolio_value(t)` marks open conditional-token positions to the day's closing mid price.
PnL is mark-to-market, not realized — open positions at the sample end are valued at their last observed mid.
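
As a rough illustration of the two components, the sketch below replays a user's fills and payouts and then marks the remaining position to a closing mid. The event tuple layout and token name are hypothetical, deposits and withdrawals are left out for brevity, and combining the components by simple addition is an assumption.

```python
from collections import defaultdict

def mark_to_market(events, closing_mid):
    """Return (usdc_balance, portfolio_value) after replaying events.

    events: list of (kind, token, shares, usdc) tuples, where kind is
    "buy", "sell", or "payout". closing_mid maps token -> closing mid.
    """
    usdc = 0.0
    position = defaultdict(float)
    for kind, token, shares, amount in events:
        if kind == "buy":       # cash spent on a fill
            usdc -= amount
            position[token] += shares
        elif kind == "sell":    # cash received from a fill
            usdc += amount
            position[token] -= shares
        elif kind == "payout":  # settlement closes the position
            usdc += amount
            position[token] = 0.0
    # Value only positions that are still open.
    portfolio = sum(q * closing_mid[t] for t, q in position.items() if q != 0.0)
    return usdc, portfolio

events = [
    ("buy", "YES-1", 100.0, 40.0),   # 100 shares at 0.40
    ("sell", "YES-1", 50.0, 30.0),   # 50 shares at 0.60
]
usdc_balance, portfolio_value = mark_to_market(events, {"YES-1": 0.70})
# Assumption: daily PnL is the sum of the two components.
pnl = usdc_balance + portfolio_value
```

In this example the user is down 10 USDC in cash but still holds 50 shares marked at 0.70, so the marked PnL is positive even though nothing has settled.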
## Storage: delta encoding
Positions and PnL only change on days a user trades or a market they hold settles. We store only those rows (the "sparse" tables). To reconstruct a dense panel, forward-fill — see the forward-fill recipe.
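
A minimal forward-fill sketch using only the standard library is shown below. It is not the recipe referenced above; it assumes each stored row holds a level (for example cumulative PnL) rather than an increment, and that days before the first stored row take a default of 0.0.

```python
from datetime import date, timedelta

def forward_fill(sparse: dict, start: date, end: date, default: float = 0.0) -> dict:
    """Expand {date: value} rows into one value per calendar day."""
    dense, last = {}, default
    day = start
    while day <= end:
        last = sparse.get(day, last)  # carry the previous value forward
        dense[day] = last
        day += timedelta(days=1)
    return dense

# Two stored rows become a five-day dense panel.
sparse_pnl = {date(2024, 1, 1): 5.0, date(2024, 1, 4): -2.0}
dense_pnl = forward_fill(sparse_pnl, date(2024, 1, 1), date(2024, 1, 5))
```

The days without a stored row (January 2, 3, and 5) inherit the most recent stored value.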
## Variants
The user_pnl_summary table carries five methodological variants:
| Variant | What it changes |
|---|---|
| Base | Default — all trades, all markets, no spread adjustment |
| Resolved | Restricts to markets that have settled (winner not null) by the sample end. Eliminates mark-to-market valuation uncertainty. |
| No-fee | Restricts to markets without taker fees (those that predate Polymarket's Q4 2024 fee introduction). Equivalent to markets.has_fee = False. |
| Spread-adjusted | Charges each fill an additional half-spread (0.005 by default — half the 1¢ tick). Approximates a no-free-lunch market without rebate. |
| Spread-adjusted + Resolved | Both filters applied. The cleanest read on "did this user actually make money?" |
The sparse panels also ship a Resolved + No-fee intersection
(pnl_daily_resolved_no_fee and pnl_category_daily_resolved_no_fee).
That variant is not exposed on user_pnl_summary because the strict
intersection is the appendix-only specification in the paper.
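
To make the Spread-adjusted row in the table above concrete, here is a small sketch of how a half-spread charge could change the cash flow of a fill. The function name is hypothetical, and applying the charge per share (worsening the price on the user's side) is an assumption about how the adjustment is implemented.

```python
HALF_SPREAD = 0.005  # default from the table: half of the 1-cent tick

def spread_adjusted_cash(side: str, price: float, shares: float,
                         half_spread: float = HALF_SPREAD) -> float:
    """Signed USDC cash flow of one fill after the half-spread charge.

    Assumption: the charge is applied per share, worsening the price
    for whichever side the user is on.
    """
    if side == "buy":
        return -(price + half_spread) * shares
    if side == "sell":
        return (price - half_spread) * shares
    raise ValueError(f"unknown side: {side!r}")

buy_cash = spread_adjusted_cash("buy", 0.40, 100.0)    # pays 0.405 per share
sell_cash = spread_adjusted_cash("sell", 0.60, 100.0)  # receives 0.595 per share
```

Under this assumption a 100-share round trip costs 1 USDC more than it would without the adjustment.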
Variant semantics (v1.1)
Starting in v1.1, the Resolved and No-fee filters are
applied to the underlying trades when positions are built — not
at PnL aggregation time. A trade on a market outside the variant's
subset contributes nothing to that variant's positions, USDC
balance, or eventual settlement, even if the user later sold back to
flat. This makes each variant's terminal PnL structurally zero-sum
within its market subset (modulo platform-collected fees and
non-user counterparties). Users whose trading is entirely outside a
variant's subset have no positions under that variant; their columns in
user_pnl_summary are set to 0.0 rather than null, for backwards compatibility.
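
The trade-level filtering described above might look like the following sketch. The `winner` and `has_fee` fields follow the variant definitions; the `market_id` key, the function name, and the sample rows are hypothetical.

```python
def variant_trades(trades, markets, resolved_only=False, no_fee_only=False):
    """Keep only trades whose market belongs to the variant's subset."""
    def in_subset(market_id):
        m = markets[market_id]
        if resolved_only and m["winner"] is None:
            return False  # market has not settled
        if no_fee_only and m["has_fee"]:
            return False  # market charges taker fees
        return True
    return [t for t in trades if in_subset(t["market_id"])]

markets = {
    "m1": {"winner": "YES", "has_fee": False},
    "m2": {"winner": None, "has_fee": True},
}
all_trades = [
    {"user": "0xalice", "market_id": "m1", "usdc": 10.0},
    {"user": "0xalice", "market_id": "m2", "usdc": 4.0},
]
# Positions for the Resolved variant would be built from this subset only.
resolved = variant_trades(all_trades, markets, resolved_only=True)
```

Because the filter runs before positions are built, the trade on `m2` never enters the Resolved variant's positions or USDC balance.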
## Wash trading
We detect (but do not filter) suspected wash trading via counterparty
Herfindahl–Hirschman Index (HHI). The HHI for user u is
$$ \text{HHI}_u = \sum_c s_{u,c}^2 $$
where $s_{u,c}$ is the share of user u's dollar volume that
counter-traded against user c. A user trading exclusively against one
other wallet has HHI = 1; a user trading evenly against many wallets has
HHI → 0.
user_features.counterparty_hhi exposes this metric. The paper flags
users with HHI ≥ 0.5 and ≥ 100 trades as suspected wash traders. We ship
the metric so you can apply your own threshold; the trades, pnl_*,
and user_features tables are not filtered.
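
A minimal sketch of the counterparty HHI and of the flag described above follows. The `(maker, taker, usdc)` tuple layout and the function names are hypothetical; the default thresholds are the ones quoted from the paper.

```python
from collections import defaultdict

def counterparty_hhi(fills, user):
    """Sum of squared counterparty dollar-volume shares for one user."""
    volume = defaultdict(float)
    for maker, taker, usdc in fills:
        if maker == user:
            volume[taker] += usdc
        elif taker == user:
            volume[maker] += usdc
    total = sum(volume.values())
    if total == 0:
        return None  # user has no volume; HHI is undefined
    return sum((v / total) ** 2 for v in volume.values())

def is_suspected_wash_trader(hhi, n_trades, min_hhi=0.5, min_trades=100):
    """Apply the HHI and trade-count thresholds; both are adjustable."""
    return hhi is not None and hhi >= min_hhi and n_trades >= min_trades

fills = [("a", "b", 50.0), ("a", "c", 50.0), ("d", "b", 10.0)]
hhi_a = counterparty_hhi(fills, "a")  # volume split evenly across two wallets
hhi_d = counterparty_hhi(fills, "d")  # all volume against a single wallet
```

Passing different `min_hhi` or `min_trades` values is one way to apply your own threshold to `user_features.counterparty_hhi`.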
## Open interest
The ohlcv_1d.open_interest column is the sum of strictly-positive user
positions per (token, day). It can be larger than total volume in a given
day if positions are held across days.
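
The definition above can be sketched as a simple aggregation. The `(token, day, user, shares)` row layout and the sample values are hypothetical.

```python
from collections import defaultdict

def open_interest(positions):
    """Sum strictly positive positions per (token, day).

    positions: iterable of (token, day, user, shares) rows.
    """
    oi = defaultdict(float)
    for token, day, _user, shares in positions:
        if shares > 0:  # zero and negative rows are ignored
            oi[(token, day)] += shares
    return dict(oi)

rows = [
    ("YES-1", "2024-01-01", "0xalice", 100.0),
    ("YES-1", "2024-01-01", "0xbob", 0.0),
    ("YES-1", "2024-01-01", "0xcarol", 40.0),
    ("YES-1", "2024-01-02", "0xalice", 100.0),
    ("YES-1", "2024-01-02", "0xdave", -5.0),
]
oi = open_interest(rows)
```

On the second day the carried-over position still counts toward open interest even if no trade occurred that day, which is how open interest can exceed daily volume.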
## What's intentionally excluded
- Pre-CTF Exchange data (Polymarket's original Augur-based deployment). Out of scope for the reconciliation pipeline.
- Order book snapshots / quotes / depth. Computed by a separate microstructure pipeline maintained by the authors but not yet packaged for public distribution.
- L2 / off-chain quote streams. Mid prices used for PnL marking come from realized trades on-chain only.