Polymarket Users¶
Get the dataset
The data lives on Hugging Face: huggingface.co/datasets/vgregoire/polymarket-users
Trading activity, profits, and behavioral features for every user on Polymarket, the largest on-chain prediction market.
The dataset covers 2022-11-11 → 2026-03-29 and includes every reconciled end-user trade, daily mark-to-market PnL (sparse delta-encoded), wide end-of-sample PnL summaries, and ~83 behavioral features per user. It is built from on-chain CTF Exchange events on Polygon.
Companion paper
Akey, P., Grégoire, V., Harvie, N., & Martineau, C. (2026). Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket. Working Paper. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6443103
@unpublished{akey2026prediction,
title = {Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket},
author = {Akey, Pat and Gr{\'e}goire, Vincent and Harvie, Nicolas and Martineau, Charles},
note = {Working Paper},
year = {2026},
url = {https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6443103}
}
Not affiliated with Polymarket
This is an independent academic research dataset. The authors are not affiliated with, endorsed by, or sponsored by Polymarket. "Polymarket" is a trademark of its respective owner; it is referenced here only to identify the source platform of the underlying public on-chain data.
Time convention¶
All timestamps in the dataset are in UTC, and all daily / monthly bucketing is anchored at midnight UTC. Two different labelling conventions are used across panels — both consistent within themselves but pointing at different things, so it pays to know which is which.
Event panels — natural timestamps¶
trades, ohlcv_1d, ohlcv_1h, ohlcv_5m.
Rows are timestamped at the moment the underlying event occurred (block
timestamp for trades) or at the start of the bar (OHLCV). A row with
timestamp = 2025-06-15 12:00 UTC is activity during the calendar day
2025-06-15. Hive partitions follow the same logic: trades/year=2025/month=06/day=15/...
holds trades that happened on 2025-06-15.
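As a minimal sketch of the partition logic above, the helper below maps an event timestamp to its Hive partition directory. The name `event_partition_path` is hypothetical, and zero-padding of `day` is an assumption extrapolated from the `month=06` in the example path.

```python
from datetime import datetime, timedelta, timezone

def event_partition_path(panel: str, ts: datetime) -> str:
    """Hive partition directory for an event-panel row (hypothetical helper)."""
    if ts.tzinfo is None:
        raise ValueError("pass a timezone-aware timestamp")
    # Event panels are bucketed by the UTC calendar day of the event itself.
    utc = ts.astimezone(timezone.utc)
    # Assumption: month and day are both zero-padded to two digits.
    return f"{panel}/year={utc.year}/month={utc.month:02d}/day={utc.day:02d}"

noon_utc = datetime(2025, 6, 15, 12, 0, tzinfo=timezone.utc)
print(event_partition_path("trades", noon_utc))
# trades/year=2025/month=06/day=15

# A local evening timestamp can fall on the next UTC day.
evening_est = datetime(2025, 6, 15, 22, 0, tzinfo=timezone(timedelta(hours=-5)))
print(event_partition_path("trades", evening_est))
# trades/year=2025/month=06/day=16
```

The resulting directory could then be handed to a Parquet reader such as `pl.scan_parquet`, relative to wherever the dataset was downloaded.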
Snapshot panels — +1 day right-boundary convention¶
pnl_daily, pnl_daily_resolved, pnl_daily_no_fee,
pnl_daily_resolved_no_fee, pnl_category_daily and its three variants,
pnl_change_daily, pnl_change_monthly.
The snapshot_time, day, and month columns are labelled with the
right boundary of the period they summarize. A label X 00:00 UTC
means "state or change accumulated up to (but not including) that
boundary" — equivalently, the close of day X − 1. The Hive partition
path always matches the column value (day=15 ↔ 2025-06-15 00:00 UTC).
| Column | A label X 00:00 UTC means… |
|---|---|
| pnl_daily.snapshot_time | State at the close of day X − 1 |
| pnl_category_daily*.snapshot_time | Same, broken out by category |
| pnl_change_daily.day | Change accumulated during day X − 1 |
| pnl_change_monthly.month | Sum of pnl_change_daily rows whose day falls in calendar month X (those daily values are themselves +1-shifted) |
The convention is chosen for compatibility with Polars'
DataFrame.join_asof against a daily price grid keyed at midnight UTC:
the asof key 2025-06-15 00:00 UTC pairs the opening price of June 15
with the user's terminal PnL from the previous trading day.
Worked example¶
A user makes their first ever trade at 2025-06-15 18:18 UTC.
- pnl_daily: first snapshot row at snapshot_time = 2025-06-16 00:00 UTC (Hive partition day=16), holding PnL accumulated by the close of June 15. No row at snapshot_time = 2025-06-15 00:00 UTC (at that moment they had not yet traded).
- pnl_change_daily: first non-zero delta at day = 2025-06-16 00:00 UTC (Hive partition day=16), the change accumulated during June 15. A row may exist at day = 2025-06-15 with pnl_change = 0.0.
In other words: the calendar day on which trading actually happened is
day − 1 of the Hive partition / column value in every PnL panel.
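The same arithmetic can be checked in a few lines of standard-library Python. This is only a sketch; `first_pnl_label` is a hypothetical helper name, not part of the dataset tooling.

```python
from datetime import datetime, timedelta, timezone

def first_pnl_label(trade_ts: datetime) -> datetime:
    """Right-boundary label (midnight UTC) for the day a trade happened on."""
    utc = trade_ts.astimezone(timezone.utc)
    day_start = utc.replace(hour=0, minute=0, second=0, microsecond=0)
    # PnL panels label a day by its right boundary: the following midnight.
    return day_start + timedelta(days=1)

trade = datetime(2025, 6, 15, 18, 18, tzinfo=timezone.utc)
label = first_pnl_label(trade)
print(label.isoformat())  # 2025-06-16T00:00:00+00:00

# Going the other way: the trading day is the label minus one day.
trading_day = (label - timedelta(days=1)).date()
print(trading_day)  # 2025-06-15
```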
At a glance¶
| Dataset | Layout | One row per | Description |
|---|---|---|---|
| markets | single parquet | market | Cleaned market metadata + category + resolution |
| events | single parquet | event | Cleaned event metadata + tag-based category |
| predictions | single parquet | conditional token | Token → market / outcome lookup |
| user_features | single parquet | user | ~83 behavioral features per user |
| user_pnl_summary | single parquet | user | Terminal PnL × 4 variants × {total + 7 categories} + base-variant portfolio/cash decomposition |
| pnl_daily | Hive year=/month=/day= | (user, day-where-changed) | Sparse delta-encoded daily PnL |
| pnl_daily_{resolved,no_fee,resolved_no_fee} | Hive year=/month=/day= | (user, day-where-changed) | Variant filters of pnl_daily |
| pnl_category_daily | Hive year=/month=/day= | (user, category, day-where-changed) | Sparse delta-encoded daily PnL per category |
| pnl_change_daily | Hive year=/month=/day= | (user, day) | Daily PnL change (delta) per user |
| pnl_change_monthly | single parquet | (user, month) | Monthly PnL change per user |
| trades | Hive year=/month=/day= | trade | All reconciled end-user trades |
| ohlcv_1d | single parquet | (token, day) | Daily OHLCV bars + open interest |
| ohlcv_1h | Hive year=/month=/day= | (token, hour) | Hourly OHLCV bars |
| ohlcv_5m | Hive year=/month=/day= | (token, 5-min bucket) | 5-minute OHLCV bars |
Quick start¶
import polars as pl
# One row per user with full-sample PnL across four variants
users = pl.read_parquet("user_pnl_summary.parquet")
users.sort("pnl_total", descending=True).head(10)
Or via the Hugging Face datasets library (install it first, for example with pip install datasets):
from datasets import load_dataset
ds = load_dataset("vgregoire/polymarket-users", "user_pnl_summary")
See the Datasets section for per-table schemas, the Recipes page for common queries (forward-filling sparse PnL, joining trades with categories, etc.), and the Methodology page for how PnL is computed and what wash trading detection does (and doesn't do).