Extracting Specific Symbols
This notebook demonstrates how to create a new ITCH file containing only data for specific symbols of interest.
Filtering large ITCH files to specific symbols can significantly reduce file size and processing time for analysis focused on particular securities. It is also useful for parallel processing, where each process can handle a subset of symbols.
from pathlib import Path
from meatpy.itch50 import ITCH50MessageReader, ITCH50Writer
# Define paths
data_dir = Path("data")
input_file = data_dir / "S081321-v50.txt.gz"
output_file = data_dir / "S081321-v50-AAPL-SPY.itch50.gz"
# Symbols we want to extract
target_symbols = ["AAPL", "SPY"]
input_size_gb = input_file.stat().st_size / (1024**3)
print(f"Input file size: {input_size_gb:.2f} GB")
Input file size: 4.55 GB
# This takes about 10 minutes on a MacBook Pro M3 Max
message_count = 0
with ITCH50MessageReader(input_file) as reader:
    with ITCH50Writer(output_file, symbols=target_symbols) as writer:
        for message in reader:
            message_count += 1
            writer.process_message(message)
print(f"Total messages processed: {message_count:,}")
Total messages processed: 367,986,583
new_message_count = 0
with ITCH50MessageReader(output_file) as reader:
    for message in reader:
        new_message_count += 1
print(f"Total messages in filtered file: {new_message_count:,}")
output_size_gb = output_file.stat().st_size / (1024**3)
print(f"Output file size: {output_size_gb:.2f} GB")
Total messages in filtered file: 4,503,791
Output file size: 0.13 GB
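To put the printed figures in proportion, the short calculation below compares the filtered file with the original. It is only a sketch: `reduction_summary` is a hypothetical helper, not part of meatpy, and the numbers are copied by hand from the cell outputs above. Both files are gzip-compressed, so the size ratio reflects compressed bytes rather than raw message bytes.

```python
def reduction_summary(total_messages, kept_messages, input_gb, output_gb):
    """Return the share of messages and of (compressed) file size that was kept."""
    return {
        "message_share": kept_messages / total_messages,
        "size_share": output_gb / input_gb,
    }


# Figures copied from the outputs printed earlier in this notebook
summary = reduction_summary(
    total_messages=367_986_583,
    kept_messages=4_503_791,
    input_gb=4.55,
    output_gb=0.13,
)

print(f"Messages kept: {summary['message_share']:.2%}")          # 1.22%
print(f"Compressed size kept: {summary['size_share']:.2%}")      # 2.86%
```

The size share is larger than the message share, which is expected: the filtered file keeps some messages beyond those counted per symbol, and compression ratios differ between the two files.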
Key Points
- Processing Speed: Smaller filtered files process much faster for subsequent analysis. If your analysis only requires data for a few symbols, filtering out the rest can save significant time for downstream tasks.
- Output Format: The output is a valid ITCH 5.0 file that can be processed by any ITCH-compatible tool.
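As a rough illustration of what "ITCH-compatible" can mean in practice, the sketch below counts message types without using meatpy at all. It assumes the file uses the common length-prefixed layout, where each message is preceded by a 2-byte big-endian length and the first payload byte is the message type. That layout is an assumption about the file, not something this notebook confirms, and the helper names are hypothetical. The demo runs on synthetic bytes, not real market data.

```python
import struct
from collections import Counter
from io import BytesIO


def iter_framed_messages(stream):
    """Yield raw payloads from a stream of 2-byte big-endian length-prefixed messages.

    Assumption: this framing matches the file being read.
    """
    while True:
        header = stream.read(2)
        if len(header) == 0:
            return  # clean end of stream
        if len(header) < 2:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">H", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated message payload")
        yield payload


def count_message_types(stream):
    """Count messages by their first payload byte (the message type)."""
    return Counter(chr(p[0]) for p in iter_framed_messages(stream) if p)


def frame(payload):
    """Prefix a payload with its 2-byte big-endian length (used for the demo only)."""
    return struct.pack(">H", len(payload)) + payload


# Synthetic demo stream: one 'S' message and two 'A' messages with dummy bodies
demo = BytesIO(frame(b"S" + b"\x00" * 11) + frame(b"A" + b"\x00" * 35) * 2)
counts = count_message_types(demo)
print(counts)
```

For a gzip-compressed file, the same function could be given the object returned by `gzip.open(path, "rb")` from the standard library.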
Performance Tips
- Early Filtering: Filter as early as possible in your data pipeline to reduce downstream processing time.
- Multiple Symbols: You can filter for multiple symbols in a single pass.
- Memory Usage: The ITCH50Writer buffers data efficiently to minimize memory usage during filtering
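For the parallel-processing use case mentioned in the introduction, one possible approach is to split the symbol list into groups and give each group its own output file. The sketch below covers only that planning step. `partition_symbols` and `output_path_for` are hypothetical helpers, and the extra tickers are illustrative. Each group could then be passed as `symbols=` to an `ITCH50Writer`, as in the filtering cell above.

```python
from pathlib import Path


def partition_symbols(symbols, n_groups):
    """Split symbols into up to n_groups round-robin groups (empty groups dropped)."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    unique = sorted(set(symbols))  # deduplicate and make the split deterministic
    groups = [unique[i::n_groups] for i in range(n_groups)]
    return [group for group in groups if group]


def output_path_for(data_dir, stem, group):
    """Build an output file name that lists the symbols in the group."""
    return Path(data_dir) / f"{stem}-{'-'.join(group)}.itch50.gz"


# Illustrative symbol list; only AAPL and SPY appear in the notebook itself
groups = partition_symbols(["AAPL", "SPY", "MSFT", "QQQ", "TSLA"], n_groups=2)
for group in groups:
    # Each (group, path) pair could be handed to a separate worker process
    print(group, output_path_for("data", "S081321-v50", group))
```

Note that every worker would still need to read the full input file, so it may be cheaper to filter all symbols of interest in one pass first and split the much smaller result afterwards.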
Next Steps
With your filtered file, you can now:
- Process order book data much faster
- Generate snapshots at regular intervals
- Calculate trading metrics and statistics
- Create visualizations and reports