Recipes¶
Common queries against the dataset. All examples use polars; they translate directly to pandas with minor syntactic tweaks.
Top winners and losers¶
import polars as pl
users = pl.read_parquet("user_pnl_summary.parquet")
# Top 100 base-variant winners
top = users.sort("pnl_total", descending=True).head(100)
# Top 100 losers
bottom = users.sort("pnl_total").head(100)
Forward-fill sparse daily PnL to a dense daily series¶
The pnl_daily panel only contains rows where PnL changed. A dense
panel — every user × every day — can be reconstructed with an asof
join against a daily date grid.
import polars as pl
from datetime import date
sparse = pl.scan_parquet("pnl_daily/**/*.parquet", hive_partitioning=False)
# Daily grid for the sample period
grid = pl.LazyFrame({
"snapshot_time": pl.date_range(
date(2022, 11, 11), date(2026, 3, 29), "1d", eager=True
)
})
users = sparse.select("user_address").unique()
dense = (
users.join(grid, how="cross")
.sort("user_address", "snapshot_time")
.join_asof(
sparse.sort("user_address", "snapshot_time"),
on="snapshot_time",
by="user_address",
)
.with_columns(pl.col("pnl").fill_null(0.0))
.collect()
)
Same pattern works for pnl_category_daily — add category to the by=
key and explode the user list to (user, category) pairs that actually
appear in the data.
Spread cost on resolved markets only¶
users = pl.read_parquet("user_pnl_summary.parquet")
spread_cost = users.select(
"user_address",
(pl.col("pnl_resolved_total") - pl.col("pnl_spread_adj_resolved_total"))
.alias("spread_cost_resolved"),
).sort("spread_cost_resolved", descending=True)
Daily trade volume by category¶
import polars as pl
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
daily_volume = (
trades
.with_columns(pl.col("timestamp").dt.date().alias("date"))
.group_by(["date", "category"])
.agg(
(pl.col("price") * pl.col("quantity")).sum().alias("volume_usd"),
pl.len().alias("n_trades"),
)
.sort("date", "category")
.collect()
)
Join trades with market metadata¶
import polars as pl
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
markets = pl.scan_parquet("markets.parquet")
# trades.market_id is stored as a string, markets.market_id as int64 — cast to match.
enriched = (
trades
.with_columns(pl.col("market_id").cast(pl.Int64))
.join(
markets.select("market_id", "close_time", "taker_base_fee", "has_fee"),
on="market_id",
how="left",
)
.collect()
)
The per-trade resolution flag lives on trades.winner directly, so the
join above is only needed when you want market lifecycle or fee metadata.
Realized vs unrealized at the end of the sample¶
user_pnl_summary carries portfolio_value (mark-to-market value of
open positions ≈ unrealized) and usdc_balance (cash account ≈ realized
PnL + initial deposits) for the base variant. Their sum equals
pnl_total.
summary = pl.read_parquet("user_pnl_summary.parquet").select(
"user_address",
pl.col("usdc_balance").alias("realized_proxy"),
pl.col("portfolio_value").alias("unrealized_at_mark"),
pl.col("pnl_total").alias("total_pnl"),
)
Filter out suspected wash traders¶
The dataset includes suspected wash traders so you can study them. To
drop them, threshold on counterparty_hhi from user_features (the
paper uses 0.5 with a min-trade filter):