Skip to content

Recipes

Common queries against the dataset. All examples use polars; they translate directly to pandas with minor syntactic tweaks.

Top winners and losers

import polars as pl

users = pl.read_parquet("user_pnl_summary.parquet")

# Top 100 base-variant winners
top = users.sort("pnl_total", descending=True).head(100)

# Top 100 losers
bottom = users.sort("pnl_total").head(100)

Forward-fill sparse daily PnL to a dense daily series

The pnl_daily panel only contains rows where PnL changed. A dense panel — every user × every day — can be reconstructed with an asof join against a daily date grid.

import polars as pl
from datetime import date

sparse = pl.scan_parquet("pnl_daily/**/*.parquet", hive_partitioning=False)

# Daily grid for the sample period
grid = pl.LazyFrame({
    "snapshot_time": pl.date_range(
        date(2022, 11, 11), date(2026, 3, 29), "1d", eager=True
    )
})

users = sparse.select("user_address").unique()

dense = (
    users.join(grid, how="cross")
    .sort("user_address", "snapshot_time")
    .join_asof(
        sparse.sort("user_address", "snapshot_time"),
        on="snapshot_time",
        by="user_address",
    )
    .with_columns(pl.col("pnl").fill_null(0.0))
    .collect()
)

Same pattern works for pnl_category_daily — add category to the by= key and explode the user list to (user, category) pairs that actually appear in the data.

Spread cost on resolved markets only

users = pl.read_parquet("user_pnl_summary.parquet")

spread_cost = users.select(
    "user_address",
    (pl.col("pnl_resolved_total") - pl.col("pnl_spread_adj_resolved_total"))
    .alias("spread_cost_resolved"),
).sort("spread_cost_resolved", descending=True)

Daily trade volume by category

import polars as pl

trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)

daily_volume = (
    trades
    .with_columns(pl.col("timestamp").dt.date().alias("date"))
    .group_by(["date", "category"])
    .agg(
        (pl.col("price") * pl.col("quantity")).sum().alias("volume_usd"),
        pl.len().alias("n_trades"),
    )
    .sort("date", "category")
    .collect()
)

Join trades with market metadata

import polars as pl

trades = pl.scan_parquet("trades/**/*.parquet", hive_partitioning=False)
markets = pl.scan_parquet("markets.parquet")

# trades.market_id is stored as a string, markets.market_id as int64 — cast to match.
enriched = (
    trades
    .with_columns(pl.col("market_id").cast(pl.Int64))
    .join(
        markets.select("market_id", "close_time", "taker_base_fee", "has_fee"),
        on="market_id",
        how="left",
    )
    .collect()
)

The per-trade resolution flag lives on trades.winner directly, so the join above is only needed when you want market lifecycle or fee metadata.

Realized vs unrealized at the end of the sample

user_pnl_summary carries portfolio_value (mark-to-market value of open positions ≈ unrealized) and usdc_balance (cash account ≈ realized PnL + initial deposits) for the base variant. Their sum equals pnl_total.

summary = pl.read_parquet("user_pnl_summary.parquet").select(
    "user_address",
    pl.col("usdc_balance").alias("realized_proxy"),
    pl.col("portfolio_value").alias("unrealized_at_mark"),
    pl.col("pnl_total").alias("total_pnl"),
)

Filter out suspected wash traders

The dataset includes suspected wash traders so you can study them. To drop them, threshold on counterparty_hhi from user_features (the paper uses 0.5 with a min-trade filter):

import polars as pl

features = pl.read_parquet("user_features.parquet")
users = pl.read_parquet("user_pnl_summary.parquet")

clean = (
    features.filter(
        (pl.col("n_trades") >= 100) & (pl.col("counterparty_hhi") < 0.5)
    )
    .join(users, on="user_address", how="left")
)