Polymarket Users Dataset

`predictions`

Per-conditional-token lookup table with one row per token. A binary market has one Yes token and one No token; a multi-outcome market has one token per outcome. Use this table to join `trades` (which carries `prediction_id`) back to market-level metadata.

Layout

| Path | Format |
| --- | --- |
| `predictions.parquet` | Single parquet file |

Load

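# Option 1: read the local parquet file with polars
import polars as pl
predictions = pl.read_parquet("predictions.parquet")

# Option 2: load the "predictions" configuration with the Hugging Face datasets library
from datasets import load_dataset
predictions = load_dataset("vgregoire/polymarket-users", "predictions")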

Schema

| Column | Type | Description |
| --- | --- | --- |
| `prediction_id` | `str` | The conditional token id (matches `trades.prediction_id`) |
| `market_id` | `int64` | Parent market identifier (matches `markets.market_id`) |
| `outcome` | `str` | Outcome label (e.g., `"Yes"`, `"Trump"`, team name, …) |
| `outcome_idx` | `int` | Zero-based outcome index within the market |
| `n_outcomes` | `int` | Total number of outcome tokens in the parent market |
| `negative_token` | `str` | The complementary token id (the other side, for two-outcome markets) |
| `winner` | `bool` | Whether this outcome resolved as the winner (nullable until resolution) |

Common joins

import polars as pl

# Scan all three tables lazily so the query reads only what it needs
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
predictions = pl.scan_parquet("predictions.parquet")
markets = pl.scan_parquet("markets.parquet")

# Trade-level outcome label + parent market category
enriched = (
    trades
    .drop("market_id")  # use the int64 market_id from predictions instead
    .join(predictions, on="prediction_id", how="left")
    .join(markets.select("market_id", "category"), on="market_id", how="left")
    .collect()
)