Recipes¶
Common queries against the dataset. All examples use polars; they translate directly to pandas with minor syntactic tweaks.
Top winners and losers¶
import polars as pl
users = pl.read_parquet("user_pnl_summary.parquet")
# Top 100 base-variant winners
top = users.sort("pnl_total", descending=True).head(100)
# Top 100 losers
bottom = users.sort("pnl_total").head(100)
Forward-fill sparse daily PnL to a dense daily series¶
The pnl_daily panel only contains rows where PnL changed. A dense
panel — every user × every day — can be reconstructed with an asof
join against a daily date grid.
import polars as pl
from datetime import date
sparse = pl.scan_parquet("pnl_daily/**/*.parquet", hive_partitioning=False)
# Daily grid for the sample period
grid = pl.LazyFrame({
"snapshot_time": pl.date_range(
date(2022, 11, 11), date(2026, 3, 29), "1d", eager=True
)
})
users = sparse.select("user_address").unique()
dense = (
users.join(grid, how="cross")
.sort("user_address", "snapshot_time")
.join_asof(
sparse.sort("user_address", "snapshot_time"),
on="snapshot_time",
by="user_address",
)
.with_columns(pl.col("pnl").fill_null(0.0))
.collect()
)
Same pattern works for pnl_category_daily — add category to the by=
key and explode the user list to (user, category) pairs that actually
appear in the data.
Spread cost on resolved markets only¶
users = pl.read_parquet("user_pnl_summary.parquet")
spread_cost = users.select(
"user_address",
(pl.col("pnl_resolved_total") - pl.col("pnl_spread_adj_resolved_total"))
.alias("spread_cost_resolved"),
).sort("spread_cost_resolved", descending=True)
Daily trade volume by category¶
import polars as pl
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
daily_volume = (
trades
.with_columns(pl.col("timestamp").dt.date().alias("date"))
.group_by(["date", "category"])
.agg(
(pl.col("price") * pl.col("quantity")).sum().alias("volume_usd"),
pl.len().alias("n_trades"),
)
.sort("date", "category")
.collect()
)
Classify user-months as maker-heavy or taker-heavy¶
Every trade carries both a maker_address and a taker_address, so a
user's maker-vs-taker activity is fully recoverable from trades — no
separate column is needed. For each (user, month) compare the USDC
volume the user transacted as maker against the volume as taker and label
the month by whichever side dominates (ties are excluded from both
groups). This is the partition used for the paper's maker-heavy vs
taker-heavy transition exhibits.
import polars as pl
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
vol = (pl.col("price") * pl.col("quantity")).alias("usd_volume")
month = pl.col("timestamp").dt.truncate("1mo").alias("month")
maker_vol = (
trades.select(pl.col("maker_address").alias("user_address"), month, vol)
.group_by(["user_address", "month"])
.agg(pl.col("usd_volume").sum().alias("maker_vol"))
)
taker_vol = (
trades.select(pl.col("taker_address").alias("user_address"), month, vol)
.group_by(["user_address", "month"])
.agg(pl.col("usd_volume").sum().alias("taker_vol"))
)
dominance = (
maker_vol.join(taker_vol, on=["user_address", "month"], how="full", coalesce=True)
.with_columns(
pl.col("maker_vol").fill_null(0.0),
pl.col("taker_vol").fill_null(0.0),
)
.with_columns(
pl.when(pl.col("maker_vol") > pl.col("taker_vol"))
.then(pl.lit("maker_heavy"))
.when(pl.col("taker_vol") > pl.col("maker_vol"))
.then(pl.lit("taker_heavy"))
.otherwise(None) # exact ties excluded
.alias("dominance")
)
.filter(pl.col("dominance").is_not_null())
.collect()
)
The same maker_address / taker_address columns let you rebuild any
finer role-decomposed series (e.g. maker-only vs taker-only PnL over time)
by replaying the trade stream through your own position/mark-to-market
reconstruction; the published PnL panels are computed over a user's
combined (both-side) activity.
Join trades with market metadata¶
import polars as pl
trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
markets = pl.scan_parquet("markets.parquet")
# trades.market_id is stored as a string, markets.market_id as int64 — cast to match.
enriched = (
trades
.with_columns(pl.col("market_id").cast(pl.Int64))
.join(
markets.select("market_id", "close_time", "taker_base_fee", "has_fee"),
on="market_id",
how="left",
)
.collect()
)
The per-trade resolution flag lives on trades.winner directly, so the
join above is only needed when you want market lifecycle or fee metadata.
Realized vs unrealized at the end of the sample¶
user_pnl_summary carries portfolio_value (mark-to-market value of
open positions ≈ unrealized) and usdc_balance (cash account ≈ realized
PnL + initial deposits) for the base variant. Their sum equals
pnl_total.
summary = pl.read_parquet("user_pnl_summary.parquet").select(
"user_address",
pl.col("usdc_balance").alias("realized_proxy"),
pl.col("portfolio_value").alias("unrealized_at_mark"),
pl.col("pnl_total").alias("total_pnl"),
)
Filter out suspected wash traders¶
The dataset includes suspected wash traders so you can study them. To
drop them, threshold on counterparty_hhi from user_features (the
paper uses 0.5 with a min-trade filter):