Skip to content

Recipes

Common queries against the dataset. All examples use polars; they translate directly to pandas with minor syntactic tweaks.

Top winners and losers

import polars as pl

users = pl.read_parquet("user_pnl_summary.parquet")

# Top 100 base-variant winners
top = users.sort("pnl_total", descending=True).head(100)

# Top 100 losers
bottom = users.sort("pnl_total").head(100)

Forward-fill sparse daily PnL to a dense daily series

The pnl_daily panel only contains rows where PnL changed. A dense panel — every user × every day — can be reconstructed with an asof join against a daily date grid.

import polars as pl
from datetime import date

sparse = pl.scan_parquet("pnl_daily/**/*.parquet", hive_partitioning=False)

# Daily grid for the sample period
grid = pl.LazyFrame({
    "snapshot_time": pl.date_range(
        date(2022, 11, 11), date(2026, 3, 29), "1d", eager=True
    )
})

users = sparse.select("user_address").unique()

dense = (
    users.join(grid, how="cross")
    .sort("user_address", "snapshot_time")
    .join_asof(
        sparse.sort("user_address", "snapshot_time"),
        on="snapshot_time",
        by="user_address",
    )
    .with_columns(pl.col("pnl").fill_null(0.0))
    .collect()
)

Same pattern works for pnl_category_daily — add category to the by= key and explode the user list to (user, category) pairs that actually appear in the data.

Spread cost on resolved markets only

users = pl.read_parquet("user_pnl_summary.parquet")

spread_cost = users.select(
    "user_address",
    (pl.col("pnl_resolved_total") - pl.col("pnl_spread_adj_resolved_total"))
    .alias("spread_cost_resolved"),
).sort("spread_cost_resolved", descending=True)

Daily trade volume by category

import polars as pl

trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)

daily_volume = (
    trades
    .with_columns(pl.col("timestamp").dt.date().alias("date"))
    .group_by(["date", "category"])
    .agg(
        (pl.col("price") * pl.col("quantity")).sum().alias("volume_usd"),
        pl.len().alias("n_trades"),
    )
    .sort("date", "category")
    .collect()
)

Classify user-months as maker-heavy or taker-heavy

Every trade carries both a maker_address and a taker_address, so a user's maker-vs-taker activity is fully recoverable from trades — no separate column is needed. For each (user, month) compare the USDC volume the user transacted as maker against the volume as taker and label the month by whichever side dominates (ties are excluded from both groups). This is the partition used for the paper's maker-heavy vs taker-heavy transition exhibits.

import polars as pl

trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)

vol = (pl.col("price") * pl.col("quantity")).alias("usd_volume")
month = pl.col("timestamp").dt.truncate("1mo").alias("month")

maker_vol = (
    trades.select(pl.col("maker_address").alias("user_address"), month, vol)
    .group_by(["user_address", "month"])
    .agg(pl.col("usd_volume").sum().alias("maker_vol"))
)
taker_vol = (
    trades.select(pl.col("taker_address").alias("user_address"), month, vol)
    .group_by(["user_address", "month"])
    .agg(pl.col("usd_volume").sum().alias("taker_vol"))
)

dominance = (
    maker_vol.join(taker_vol, on=["user_address", "month"], how="full", coalesce=True)
    .with_columns(
        pl.col("maker_vol").fill_null(0.0),
        pl.col("taker_vol").fill_null(0.0),
    )
    .with_columns(
        pl.when(pl.col("maker_vol") > pl.col("taker_vol"))
        .then(pl.lit("maker_heavy"))
        .when(pl.col("taker_vol") > pl.col("maker_vol"))
        .then(pl.lit("taker_heavy"))
        .otherwise(None)  # exact ties excluded
        .alias("dominance")
    )
    .filter(pl.col("dominance").is_not_null())
    .collect()
)

The same maker_address / taker_address columns let you rebuild any finer role-decomposed series (e.g. maker-only vs taker-only PnL over time) by replaying the trade stream through your own position/mark-to-market reconstruction; the published PnL panels are computed over a user's combined (both-side) activity.

Join trades with market metadata

import polars as pl

trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
markets = pl.scan_parquet("markets.parquet")

# trades.market_id is stored as a string, markets.market_id as int64 — cast to match.
enriched = (
    trades
    .with_columns(pl.col("market_id").cast(pl.Int64))
    .join(
        markets.select("market_id", "close_time", "taker_base_fee", "has_fee"),
        on="market_id",
        how="left",
    )
    .collect()
)

The per-trade resolution flag lives on trades.winner directly, so the join above is only needed when you want market lifecycle or fee metadata.

Realized vs unrealized at the end of the sample

user_pnl_summary carries portfolio_value (mark-to-market value of open positions ≈ unrealized) and usdc_balance (cash account ≈ realized PnL + initial deposits) for the base variant. Their sum equals pnl_total.

summary = pl.read_parquet("user_pnl_summary.parquet").select(
    "user_address",
    pl.col("usdc_balance").alias("realized_proxy"),
    pl.col("portfolio_value").alias("unrealized_at_mark"),
    pl.col("pnl_total").alias("total_pnl"),
)

Filter out suspected wash traders

The dataset includes suspected wash traders so you can study them. To drop them, threshold on counterparty_hhi from user_features (the paper uses 0.5 with a min-trade filter):

import polars as pl

features = pl.read_parquet("user_features.parquet")
users = pl.read_parquet("user_pnl_summary.parquet")

clean = (
    features.filter(
        (pl.col("n_trades") >= 100) & (pl.col("counterparty_hhi") < 0.5)
    )
    .join(users, on="user_address", how="left")
)