Polymarket Users

Get the dataset

The data lives on Hugging Face: huggingface.co/datasets/vgregoire/polymarket-users

Trading activity, profits, and behavioral features for every user on Polymarket, the largest on-chain prediction market.

The dataset covers 2022-11-11 → 2026-03-29 and includes every reconciled end-user trade, daily mark-to-market PnL (sparse delta-encoded), wide end-of-sample PnL summaries, and ~83 behavioral features per user. It is built from on-chain CTF Exchange events on Polygon.

Companion paper

Akey, P., Grégoire, V., Harvie, N., & Martineau, C. (2026). Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket. Working Paper. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6443103

@unpublished{akey2026prediction,
  title  = {Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket},
  author = {Akey, Pat and Gr{\'e}goire, Vincent and Harvie, Nicolas and Martineau, Charles},
  note   = {Working Paper},
  year   = {2026},
  url    = {https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6443103}
}

Not affiliated with Polymarket

This is an independent academic research dataset. The authors are not affiliated with, endorsed by, or sponsored by Polymarket. "Polymarket" is a trademark of its respective owner; it is referenced here only to identify the source platform of the underlying public on-chain data.

Time convention

All timestamps in the dataset are in UTC, and all daily / monthly bucketing is anchored at midnight UTC. Two different labelling conventions are used across panels — both consistent within themselves but pointing at different things, so it pays to know which is which.

Event panels — natural timestamps

trades, ohlcv_1d, ohlcv_1h, ohlcv_5m.

Rows are timestamped at the moment the underlying event occurred (block timestamp for trades) or at the start of the bar (OHLCV). A row with timestamp = 2025-06-15 12:00 UTC is activity during the calendar day 2025-06-15. Hive partitions follow the same logic: trades/year=2025/month=06/day=15/... holds trades that happened on 2025-06-15.
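As a rough illustration of the event-panel layout, the sketch below maps a timezone-aware timestamp to the Hive partition prefix it would fall under. The helper name `event_partition` is hypothetical, and zero-padding of `day=` is an assumption extrapolated from the `month=06` example above.

```python
from datetime import datetime, timedelta, timezone


def event_partition(ts: datetime, table: str = "trades") -> str:
    """Return the Hive partition prefix for an event timestamp.

    Hypothetical helper: the path shape follows the example in the text;
    zero-padded day values are an assumption.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts_utc = ts.astimezone(timezone.utc)  # partitions are anchored at midnight UTC
    return f"{table}/year={ts_utc:%Y}/month={ts_utc:%m}/day={ts_utc:%d}/"


noon_utc = datetime(2025, 6, 15, 12, 0, tzinfo=timezone.utc)
# 22:00 at UTC-5 is already 03:00 UTC on the next calendar day
late_local = datetime(2025, 6, 15, 22, 0, tzinfo=timezone(timedelta(hours=-5)))

print(event_partition(noon_utc))
print(event_partition(late_local))
```

The second call shows why converting to UTC first matters: a late-evening local timestamp can belong to the next day's partition.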

Snapshot panels — +1 day right-boundary convention

pnl_daily, pnl_daily_resolved, pnl_daily_no_fee, pnl_daily_resolved_no_fee, pnl_category_daily and its three variants, pnl_change_daily, pnl_change_monthly.

The snapshot_time, day, and month columns are labelled with the right boundary of the period they summarize. A label X 00:00 UTC means "state or change accumulated up to (but not including) that boundary", which is equivalently the close of day X − 1. The Hive partition path always matches the column value (for example, partition day=15 corresponds to the column value 2025-06-15 00:00 UTC).

Column | A label X 00:00 UTC means…
pnl_daily.snapshot_time | State at the close of day X − 1
pnl_category_daily*.snapshot_time | Same, broken out by category
pnl_change_daily.day | Change accumulated during day X − 1
pnl_change_monthly.month | Sum of pnl_change_daily rows whose day falls in calendar month X (those daily values are themselves +1-shifted)
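For the daily snapshot panels, the right-boundary label can be turned back into the interval of activity it summarizes. The following is a minimal sketch; `activity_window` is a hypothetical helper name, not part of the dataset tooling.

```python
from datetime import datetime, timedelta, timezone


def activity_window(label: datetime) -> tuple[datetime, datetime]:
    """Half-open UTC interval [start, end) summarized by a daily snapshot label.

    The label is the right boundary, so the window is the 24 hours before it.
    """
    return label - timedelta(days=1), label


start, end = activity_window(datetime(2025, 6, 16, tzinfo=timezone.utc))
print(start, "->", end)
```

Here a label of 2025-06-16 00:00 UTC corresponds to activity on the calendar day 2025-06-15.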

The convention is chosen for compatibility with Polars' DataFrame.join_asof against a daily price grid keyed at midnight UTC: the asof key 2025-06-15 00:00 UTC pairs the opening price of June 15 with the user's terminal PnL from the previous trading day.

Worked example

A user makes their first ever trade at 2025-06-15 18:18 UTC.

  • pnl_daily: first snapshot row at snapshot_time = 2025-06-16 00:00 UTC (Hive partition day=16) — PnL accumulated by the close of June 15. No row at snapshot_time = 2025-06-15 00:00 UTC (at that moment they had not yet traded).
  • pnl_change_daily: first non-zero delta at day = 2025-06-16 00:00 UTC (Hive partition day=16) — change accumulated during June 15. A row may exist at day = 2025-06-15 with pnl_change = 0.0.

In other words: the calendar day on which trading actually happened is day − 1 of the Hive partition / column value in every PnL panel.
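The worked example can be reproduced with a few lines of date arithmetic. This is a sketch under the conventions described above; both helper names are hypothetical, and the zero-padded partition format is assumed from the `month=06` example earlier on this page.

```python
from datetime import datetime, timedelta, timezone


def first_snapshot_label(trade_ts: datetime) -> datetime:
    """Right boundary (next midnight UTC) of the UTC day containing the trade."""
    ts_utc = trade_ts.astimezone(timezone.utc)
    midnight = ts_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def snapshot_partition(label: datetime, table: str = "pnl_daily") -> str:
    """Hive partition prefix for a snapshot label (path format is an assumption)."""
    return f"{table}/year={label:%Y}/month={label:%m}/day={label:%d}/"


trade = datetime(2025, 6, 15, 18, 18, tzinfo=timezone.utc)
label = first_snapshot_label(trade)
partition = snapshot_partition(label)
print(label, partition)
```

A trade at exactly 00:00 UTC belongs to the day that starts at that instant, so its label is still the following midnight.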

At a glance

Dataset | Layout | Rows per | One-liner
markets | single parquet | market | Cleaned market metadata + category + resolution
events | single parquet | event | Cleaned event metadata + tag-based category
predictions | single parquet | conditional token | Token → market / outcome lookup
user_features | single parquet | user | ~83 behavioral features per user
user_pnl_summary | single parquet | user | Terminal PnL × 4 variants × {total + 7 categories} + base-variant portfolio/cash decomposition
pnl_daily | Hive year=/month=/day= | (user, day-where-changed) | Sparse delta-encoded daily PnL
pnl_daily_{resolved,no_fee,resolved_no_fee} | Hive year=/month=/day= | (user, day-where-changed) | Variant filters of pnl_daily
pnl_category_daily | Hive year=/month=/day= | (user, category, day-where-changed) | Sparse delta-encoded daily PnL per category
pnl_change_daily | Hive year=/month=/day= | (user, day) | Daily PnL change (delta) per user
pnl_change_monthly | single parquet | (user, month) | Monthly PnL change per user
trades | Hive year=/month=/day= | trade | All reconciled end-user trades
ohlcv_1d | single parquet | (token, day) | Daily OHLCV bars + open interest
ohlcv_1h | Hive year=/month=/day= | (token, hour) | Hourly OHLCV bars
ohlcv_5m | Hive year=/month=/day= | (token, 5-min bucket) | 5-minute OHLCV bars
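Because the pnl_daily panels hold a row only for days where a user's PnL changed, a dense daily series has to be rebuilt by carrying the last known value forward. The sketch below shows the idea on toy data in plain Python. It assumes each stored row holds the PnL level as of its label and that the requested range starts at or before the user's first row; `forward_fill` is a hypothetical helper, and the Recipes page remains the reference for doing this on the real files.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def forward_fill(
    sparse: Dict[date, float], start: date, end: date
) -> Dict[date, Optional[float]]:
    """Expand sparse (label -> PnL level) rows to one value per day in [start, end].

    Days before the first stored row stay None (the user had not traded yet).
    """
    dense: Dict[date, Optional[float]] = {}
    last: Optional[float] = None
    day = start
    while day <= end:
        if day in sparse:
            last = sparse[day]  # a new row replaces the carried value
        dense[day] = last
        day += timedelta(days=1)
    return dense


# Toy rows: PnL changed on two days only (values are made up)
sparse = {date(2025, 6, 16): 12.5, date(2025, 6, 19): 9.0}
dense = forward_fill(sparse, date(2025, 6, 15), date(2025, 6, 20))
for day, value in dense.items():
    print(day, value)
```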

Quick start

import polars as pl

# One row per user with full-sample PnL across four variants
users = pl.read_parquet("user_pnl_summary.parquet")
users.sort("pnl_total", descending=True).head(10)

Or via the Hugging Face datasets library — install it first:

pip install datasets
# or with uv:
uv add datasets

from datasets import load_dataset

ds = load_dataset("vgregoire/polymarket-users", "user_pnl_summary")

See the Datasets section for per-table schemas, the Recipes page for common queries (forward-filling sparse PnL, joining trades with categories, etc.), and the Methodology page for how PnL is computed and what wash trading detection does (and doesn't do).