About
Disclaimer
This is an independent academic research dataset. The authors are not affiliated with, endorsed by, or sponsored by Polymarket. "Polymarket" is a trademark of its respective owner; it is referenced here only to identify the source platform of the underlying public on-chain data.
Citation
If you use this dataset, please cite the companion paper:
@unpublished{akey2026prediction,
title = {Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket},
author = {Akey, Pat and Gr{\'e}goire, Vincent and Harvie, Nicolas and Martineau, Charles},
note = {Working Paper},
year = {2026},
url = {https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6443103}
}
Authors
- Pat Akey — ESSEC Business School, CEPR, ECGI
- Vincent Grégoire — HEC Montréal
- Nicolas Harvie — Rotman School of Management, University of Toronto
- Charles Martineau — Rotman School of Management & UTSC Management, University of Toronto
License
The processed data is released under CC-BY 4.0 — use, modify, and redistribute freely (including commercially); please cite the paper.
Scope. CC-BY 4.0 covers the authors' contribution: cleaning, end-user reconciliation, category classification, computed PnL, behavioral feature engineering, OHLCV aggregation, and the layout of the parquet files in this release.
It does not cover fields that originate from the Polymarket API — market question text, descriptions, slugs, raw platform tags, and lifecycle timestamps. Those fields are included for convenience and reproducibility but remain subject to Polymarket's terms of use. If you plan to redistribute those fields (e.g., re-publish a derivative dataset), consult Polymarket's terms directly.
No warranty. As CC-BY 4.0 §5 states, this dataset is provided "AS IS" without warranty of any kind, express or implied.
Source code
The processing pipeline that produced this dataset is maintained in a private research repository. Researchers needing access for reproducibility purposes can contact the authors via the Discussions tab on the dataset page or by email.
The full column-level schema for every subset in this release is documented in the Datasets section of this site — those pages are the canonical data dictionary.
Reporting issues
Please open a thread in the Discussions tab on the dataset page.
Acknowledgments
This research was undertaken, in part, thanks to funding from the Canada Research Chairs Program, the Social Sciences and Humanities Research Council, Rotman FinHub, and the Ontario Securities Commission.
Changelog
v1.2 (2026-06-14)
Removed 16 maker/taker terminal-PnL columns (pnl_maker_total,
pnl_taker_total, and their per-category siblings) from
user_pnl_summary. These had shipped undocumented in v1.1, and the
per-category role columns did not recombine to the base per-category
totals for users who acted as both maker and taker (≈17.5% of users), so
they have been withdrawn. The five documented variants, their per-category
breakdowns, and the portfolio_value / usdc_balance decomposition are
unchanged, and no other table is affected. A user's maker-vs-taker
activity remains fully recoverable from the trades table (each row
carries maker_address and taker_address) — see
Recipes.
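As a rough illustration of that recovery, the sketch below counts how often each address appears as maker and as taker. It assumes trades rows are available as dictionaries; only maker_address and taker_address come from the schema described here, while the helper name `role_counts` and the toy addresses are hypothetical.

```python
from collections import Counter


def role_counts(trades):
    """Count, per address, how many trades it made as maker and as taker."""
    maker = Counter()
    taker = Counter()
    for row in trades:
        maker[row["maker_address"]] += 1
        taker[row["taker_address"]] += 1
    # Counter returns 0 for addresses never seen in a given role.
    users = set(maker) | set(taker)
    return {u: {"maker": maker[u], "taker": taker[u]} for u in users}


# Toy rows with made-up addresses; real rows would come from the trades table.
toy_trades = [
    {"maker_address": "0xA", "taker_address": "0xB"},
    {"maker_address": "0xA", "taker_address": "0xC"},
    {"maker_address": "0xB", "taker_address": "0xA"},
]

counts = role_counts(toy_trades)
# Addresses that acted in both roles at least once.
both_roles = sorted(u for u, c in counts.items() if c["maker"] and c["taker"])
```

Counting trades is only one possible notion of "activity"; weighting by traded volume would follow the same pattern with a sum instead of a count.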
v1.1 (2026-06-01)
Corrected PnL panels and per-category breakdowns. Every PnL-related table
(pnl_daily, the three pnl_daily_* variants, pnl_category_daily and
its three variants, pnl_change_daily, pnl_change_monthly, and
user_pnl_summary) was rebuilt from corrected position snapshots. Sample
period and the other panels (markets, events, predictions, trades,
user_features, ohlcv_*) are unchanged from v1.0.
What changed and why:
- Phantom-settlement bug fixed. A zero-position row was being dropped before settlement logic ran, which caused round-trip trades on already-resolved markets (buy + sell back within the same day) to produce a ghost USDC settlement for a position the user no longer held. The fix moves the zero-row filter to after settlement decisions. Effect on pnl_daily (base): aggregate user-population terminal PnL moved from $114,798 to exactly $0 (zero-sum restored); the median user's terminal PnL moved by $0.0007; top-5 winners' PnL is bit-identical to 6 decimals.
- Analysis variants now filter at the market level. The pnl_daily_resolved, pnl_daily_no_fee, pnl_daily_resolved_no_fee variants (and their pnl_category_daily_* counterparts) previously applied the variant predicate only at PnL aggregation time, which produced asymmetric USDC leakage of $1–5 B per variant. The variants are now built from positions that exclude trades on markets outside the variant's subset, so each variant is structurally zero-sum within its subset (modulo platform-collected fees and non-user counterparties). Effect: each variant's aggregate terminal PnL moved by +$1 B to +$4.7 B, from large negative leakage totals down to ≈ $0.
- Sample composition. A small number of users (≤ 0.05 % of the 2.48 M user base) drop in or out of the panel depending on which variant subset they touch. For the variants, user counts drop by 54 k–127 k as users who only traded outside the variant's market subset now correctly disappear from the panel.
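The zero-sum property mentioned above can be checked with a simple aggregate. The following is a minimal sketch, assuming one terminal PnL value per user has already been extracted from a panel; the function name `is_zero_sum`, the one-cent tolerance, and the toy numbers are illustrative assumptions, not part of the release. For the variant panels, a small residual from fees and non-user counterparties may be expected, so a looser tolerance could be appropriate.

```python
import math


def is_zero_sum(terminal_pnls, tolerance=0.01):
    """Return (ok, total): whether summed terminal PnL is within tolerance of 0."""
    # fsum limits floating-point drift when adding many small values.
    total = math.fsum(terminal_pnls)
    return abs(total) <= tolerance, total


# Toy panel (made-up values): winners and losers offset exactly.
clean = [120.5, -20.25, -100.25]
ok_clean, total_clean = is_zero_sum(clean)

# Same panel plus a spurious 15.0 credit, mimicking a ghost settlement.
leaky = clean + [15.0]
ok_leaky, total_leaky = is_zero_sum(leaky)
```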
The headline statistics that appear in the accompanying paper round to
the same values as in v1.0 — top 1 % capture 76.5 % of profits, top
0.1 % capture 51.2 %, ~70 % of users lose money — but the underlying
panels are now internally consistent and pass a whole-cache
maker_pnl + taker_pnl = base_pnl invariant across all
2.26 billion daily snapshots.
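For readers who want to compute comparable concentration figures, here is a hedged sketch. It assumes "share of profits" means the fraction of total positive terminal PnL earned by the top-ranked users; the paper's exact definition may differ, and the function names and toy values are hypothetical.

```python
import math


def top_share_of_profits(pnls, top_fraction):
    """Share of total positive PnL earned by the top `top_fraction` of users."""
    ranked = sorted(pnls, reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))
    total_profit = math.fsum(p for p in ranked if p > 0)
    top_profit = math.fsum(p for p in ranked[:k] if p > 0)
    return top_profit / total_profit if total_profit else 0.0


def losing_fraction(pnls):
    """Fraction of users whose terminal PnL is strictly negative."""
    return sum(1 for p in pnls if p < 0) / len(pnls)


# Toy zero-sum population of 10 users (made-up values).
toy_pnls = [900.0, 60.0, 40.0, -100.0, -150.0,
            -200.0, -50.0, -250.0, -125.0, -125.0]

share = top_share_of_profits(toy_pnls, 0.10)  # top 1 of 10 users
losers = losing_fraction(toy_pnls)
```

On this toy population the top user earns 900 of the 1,000 in total profits, and 7 of 10 users lose money.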
v1.0 (2026-05-19)
Initial release. Sample 2022-11-11 → 2026-03-29.