Methodology
How the raw on-chain data becomes the tables you see in this dataset. For the academic motivation and full empirical results, see the companion paper.
Sample window
| Field | Value |
|---|---|
| Start | 2022-11-11 (Polymarket's CTF Exchange launch) |
| End | 2026-03-29 |
| Timezone | UTC throughout |
Source data
The pipeline starts from raw OrderFilled events emitted by the
CTF Exchange contract on
Polygon (and a small number of supporting contracts for resolution and
fee changes). All cleaning happens off these events; no off-chain
Polymarket API is required.
End-user reconciliation
The maker/taker addresses in a raw OrderFilled event are typically
operator contracts acting on behalf of users, not the user wallets
themselves. The reconciliation step:
- Loads the ProxyDeployed/SafeDeployed events that bind an operator address to its end-user owner.
- Walks each fill and substitutes the operator address with the end-user wallet on both sides.
- Skips fills where neither side resolves to an end user (rare — contract-to-contract transfers that aren't user-facing).
The output is the trades table — every row is attributed to a real
end-user wallet.
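
The reconciliation steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Fill` record, the `owner_of` mapping (assumed to be built beforehand from the ProxyDeployed/SafeDeployed events), and the choice to keep the raw address when only one side resolves are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Fill:
    """Hypothetical, simplified view of one OrderFilled event."""
    maker: str
    taker: str
    usdc: float


def reconcile(fills: Iterable[Fill], owner_of: dict[str, str]) -> Iterator[Fill]:
    """Yield fills with operator addresses replaced by end-user wallets.

    owner_of maps operator address -> end-user wallet (assumed input).
    """
    for fill in fills:
        maker_owner = owner_of.get(fill.maker)
        taker_owner = owner_of.get(fill.taker)
        # Skip fills where neither side resolves to an end user.
        if maker_owner is None and taker_owner is None:
            continue
        # Substitute every side that resolves; an unresolved side keeps
        # its raw address (an assumption of this sketch).
        yield replace(
            fill,
            maker=maker_owner or fill.maker,
            taker=taker_owner or fill.taker,
        )
```

Keeping the mapping as a plain dictionary makes the substitution a constant-time lookup per side, so the walk over fills stays linear in the number of events.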
Categories
Polymarket exposes a coarse "category" tag in its UI but does not include it in the on-chain data. We train a TF-IDF + Logistic Regression classifier on Kalshi market questions (which do carry categorical labels) and apply it to every Polymarket market title + description. The output category for each market is one of:
Markets that the classifier cannot confidently assign are labeled
Untagged. They are kept in markets, events, and trades, but
excluded from per-category PnL tables.
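
The fallback to Untagged can be pictured as a confidence threshold on the classifier's class probabilities. The sketch below assumes exactly that; the threshold value, the function name, and the placeholder category names are hypothetical and are not taken from the pipeline.

```python
def assign_category(probs: dict[str, float], min_confidence: float = 0.5) -> str:
    """Return the most probable category, or "Untagged" if it is not confident.

    probs maps category name -> predicted probability for one market.
    min_confidence is an assumed cutoff, not the dataset's actual value.
    """
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= min_confidence else "Untagged"


def per_category_rows(rows: list[dict]) -> list[dict]:
    """Keep only rows with a real category, as the per-category PnL tables do."""
    return [r for r in rows if r["category"] != "Untagged"]
```

For example, `assign_category({"category_a": 0.4, "category_b": 0.35, "category_c": 0.25})` falls below the assumed cutoff and yields `"Untagged"`.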
PnL computation
For each user, on each day, we compute:
where:
- usdc_balance(t) is the user's accumulated USDC: deposits − withdrawals − cash spent on fills + cash received from fills + payouts from market resolution. We reconstruct this from the trade stream; no on-chain USDC balance query is performed.
- portfolio_value(t) marks open conditional-token positions to the day's closing mid price.
PnL is mark-to-market, not realized — open positions at the sample end are valued at their last observed mid.
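
As a rough illustration of the two components, the sketch below rebuilds a cash balance and a marked portfolio value from one user's fills up to day t. The `UserFill` record and the sign convention are hypothetical, deposits and withdrawals are left out for brevity, and treating the day's mark-to-market figure as the sum of the two components is an assumption of this example.

```python
from collections import defaultdict
from typing import NamedTuple


class UserFill(NamedTuple):
    """Hypothetical per-user fill: shares > 0 is a buy, shares < 0 is a sell."""
    token: str
    shares: float
    price: float  # USDC per share


def pnl_components(fills, closing_mid, payouts=0.0):
    """Return (usdc_balance, portfolio_value) for one user on one day.

    fills: all of the user's fills up to and including day t.
    closing_mid: token -> closing mid price on day t.
    payouts: USDC received from market resolution so far.
    """
    usdc_balance = payouts
    positions = defaultdict(float)
    for f in fills:
        usdc_balance -= f.shares * f.price  # buys spend cash, sells receive it
        positions[f.token] += f.shares
    # Mark every open position to the day's closing mid.
    portfolio_value = sum(
        qty * closing_mid[token] for token, qty in positions.items() if qty != 0
    )
    return usdc_balance, portfolio_value


def mark_to_market_total(fills, closing_mid, payouts=0.0):
    """Assumed combination: cash plus marked positions."""
    usdc_balance, portfolio_value = pnl_components(fills, closing_mid, payouts)
    return usdc_balance + portfolio_value
```

Buying 100 shares at 0.25 and selling 40 at 0.50 leaves a cash balance of −5 and 60 open shares; at a closing mid of 0.75 those shares are worth 45, so the combined figure under this assumption is 40.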
Storage: delta encoding
Positions and PnL only change on days a user trades or a market they hold settles. We store only those rows (the "sparse" tables). To reconstruct a dense panel, forward-fill — see the forward-fill recipe.
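
A dense daily series can be rebuilt from the sparse rows with a simple carry-forward. The sketch below is standard-library only and assumes each sparse row stores the level of the value on the day it changed (not an increment), and that the value before the first row is zero; both are assumptions about the layout rather than documented facts.

```python
from datetime import date, timedelta


def forward_fill(
    sparse: dict[date, float],
    start: date,
    end: date,
    initial: float = 0.0,
) -> dict[date, float]:
    """Expand change-day rows into one value per calendar day in [start, end]."""
    dense: dict[date, float] = {}
    last = initial
    day = start
    while day <= end:
        # Use the stored value on change days, otherwise carry the last one.
        last = sparse.get(day, last)
        dense[day] = last
        day += timedelta(days=1)
    return dense
```

In pandas, the same effect is usually obtained by reindexing the sparse series onto a full date range and calling `ffill()`.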
Variants
The user_pnl_summary table carries four methodological variants:
| Variant | What it changes |
|---|---|
| Base | Default — all trades, all markets, no spread adjustment |
| Resolved | Restricts to markets that have settled (winner not null) by the sample end. Eliminates mark-to-market valuation uncertainty. |
| Spread-adjusted | Charges each fill an additional half-spread (0.005 by default — half the 1¢ tick). Approximates a no-free-lunch market without rebate. |
| Spread-adjusted + Resolved | Both filters applied. The cleanest read on "did this user actually make money?" |
The no_fee variant used internally by the paper (markets where
taker_base_fee is null or 0) is not shipped — it's a paper-internal
robustness check.
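
To make the spread adjustment concrete, here is one way a half-spread charge could be applied to a single fill. The function name and the per-share interpretation of the 0.005 charge are assumptions of this sketch; the shipped variant may apply the charge differently.

```python
def spread_adjusted_cash_flow(
    shares: float,
    price: float,
    half_spread: float = 0.005,
) -> float:
    """Signed USDC cash flow of one fill after a per-share half-spread charge.

    shares > 0 is a buy, shares < 0 is a sell (hypothetical convention).
    """
    raw = -shares * price                # buys spend cash, sells receive it
    penalty = abs(shares) * half_spread  # charged on buys and sells alike
    return raw - penalty
```

Under this reading, a buy of 100 shares at 0.50 costs 50.50 instead of 50.00, and a sale of 100 shares at 0.50 brings in 49.50 instead of 50.00.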
Wash trading
We detect (but do not filter) suspected wash trading via counterparty
Herfindahl–Hirschman Index (HHI). The HHI for user u is
$$ \text{HHI}_u = \sum_c s_{u,c}^2 $$
where $s_{u,c}$ is the share of user u's dollar volume that
counter-traded against user c. A user trading exclusively against one
other wallet has HHI = 1; a user trading evenly against many wallets has
HHI → 0.
user_features.counterparty_hhi exposes this metric. The paper flags
users with HHI ≥ 0.5 and ≥ 100 trades as suspected wash traders. We ship
the metric so you can apply your own threshold; the trades, pnl_*,
and user_features tables are not filtered.
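
The metric and the paper's flag are straightforward to recompute. The sketch below takes one user's dollar volume per counterparty as input; the function names are hypothetical, and returning 0.0 for a user with no volume is an assumption of the example.

```python
def counterparty_hhi(volume_by_counterparty: dict[str, float]) -> float:
    """Sum of squared volume shares across one user's counterparties."""
    total = sum(volume_by_counterparty.values())
    if total <= 0:
        return 0.0  # assumption: no volume means no concentration
    return sum((v / total) ** 2 for v in volume_by_counterparty.values())


def is_suspected_wash_trader(
    hhi: float,
    n_trades: int,
    hhi_min: float = 0.5,
    trades_min: int = 100,
) -> bool:
    """Apply the paper's thresholds: HHI >= 0.5 and >= 100 trades."""
    return hhi >= hhi_min and n_trades >= trades_min
```

A user whose volume is split evenly across four counterparties has an HHI of 0.25 and is not flagged at the default thresholds, however many trades they make.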
Open interest
The ohlcv_1d.open_interest column is the sum of strictly-positive user
positions per (token, day). It can be larger than total volume in a given
day if positions are held across days.
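
The definition can be sketched as a grouped sum over end-of-day positions. The row layout `(token, day, user, shares)` is a hypothetical input shape chosen for the example, not the schema of any shipped table.

```python
from collections import defaultdict


def open_interest(positions):
    """Sum strictly positive user positions per (token, day).

    positions: iterable of (token, day, user, shares) end-of-day rows.
    """
    oi = defaultdict(float)
    for token, day, _user, shares in positions:
        if shares > 0:  # zero and negative positions are ignored
            oi[(token, day)] += shares
    return dict(oi)
```

Because the sum runs over holdings rather than fills, a user who bought 100 shares last week still contributes 100 to today's open interest even if only 10 shares traded today.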
What's intentionally excluded
- Pre-CTF Exchange data (Polymarket's original Augur-based deployment). Out of scope for the reconciliation pipeline.
- Order book snapshots / quotes / depth. Computed by a separate microstructure pipeline maintained by the authors but not yet packaged for public distribution.
- L2 / off-chain quote streams. Mid prices used for PnL marking come from realized trades on-chain only.