Fast CSV Reader/Writer

High-performance CSV processing with multithreaded writing and memory mapping

Fast CSV Reader/Writer

The PyneCore CSV Reader/Writer system provides a high-performance solution for handling CSV data files, with a particular focus on OHLCV (Open, High, Low, Close, Volume) market data. While not as specialized as the binary OHLCV format, this system offers optimized CSV processing with features like multithreaded writing and memory mapping for reading.

Overview

CSV (Comma-Separated Values) is a universal format for tabular data exchange. While simple in concept, high-performance CSV processing presents several challenges:

I/O operations can be a significant bottleneck, especially when writing large datasets
String parsing and formatting can be computationally expensive
CSV files often require sequential processing, limiting random access capabilities

The PyneCore CSV system addresses these challenges through:

Multithreaded Writing: Background thread for non-blocking I/O operations
Buffer Management: Efficient buffer handling to minimize system calls
Memory Mapping: Fast file access for reading operations
Format Auto-Detection: Automatic detection of CSV dialect and headers
Flexible Data Types: Support for various data formats including OHLCV structures

CSVWriter: Multithreaded Performance

The CSVWriter class implements a high-performance CSV writer that leverages a background thread for I/O operations. This approach allows the main thread to continue processing while data is written asynchronously.

Key Features

Background Thread Processing: All I/O operations run in a separate thread
Command Queue: Thread-safe queue for communication between threads
Buffer Management: Efficient buffer handling with configurable sizes
Various Data Formats: Support for tuple data, dictionaries, and OHLCV records
Automatic Headers: Header generation based on data structure
Configurable Formatting: Custom float formatting and timestamp conversion

Architecture

The writer uses a producer-consumer pattern:

The main thread (producer) adds write commands to a thread-safe queue
A background worker thread (consumer) processes these commands
Data is accumulated in an internal buffer and flushed when:
- The buffer reaches a threshold size
- A timeout occurs with no new data
- The writer is closed

This approach minimizes the impact of I/O operations on application performance, particularly important for real-time data processing.

CSVReader: Memory Mapped Reading

The CSVReader class provides an efficient way to read CSV files, optimized for OHLCV data but flexible enough for general use. It leverages memory mapping for improved performance.

Key Features

Memory Mapping: Fast access to file data through the OS’s virtual memory system
Format Auto-Detection: Automatic detection of CSV dialect and headers
Flexible Data Processing: Support for various column mappings and data types
Extra Fields Support: Handling of additional columns beyond OHLCV data
Timestamp Parsing: Automatic conversion of various timestamp formats
NA Value Support: Special handling for NA/NaN values

Usage Examples

Basic CSV Writing

  from pynecore.core.csv_file import CSVWriter
from pathlib import Path

# Create a CSV writer
with CSVWriter(Path("example.csv"),
               headers=["timestamp", "value1", "value2"],
               timestamp_as_iso=True) as writer:

    # Write raw tuple data
    writer.write(1609459200, 42.5, 100.0)

    # Write dictionary data
    writer.write_dict({
        "timestamp": 1609459260,
        "value1": 43.2,
        "value2": 101.5
    })

Writing OHLCV Data

  from pynecore.core.csv_file import CSVWriter
from pynecore.types.ohlcv import OHLCV
from pathlib import Path

# Create a CSV writer for OHLCV data
with CSVWriter(Path("market_data.csv"),
               timestamp_as_iso=True,
               float_fmt='.2f') as writer:

    # Write OHLCV records
    writer.write_ohlcv(OHLCV(
        timestamp=1609459200,
        open=100.0,
        high=110.0,
        low=90.0,
        close=105.0,
        volume=1000.0,
        extra_fields={"indicator1": 42.5, "signal": "buy"}
    ))

Performance-Tuned CSV Writing

  from pynecore.core.csv_file import CSVWriter
from pathlib import Path

# Create a high-performance CSV writer
with CSVWriter(
    Path("large_dataset.csv"),
    buffer_size=65536,      # 64KB buffer
    queue_size=10000,       # Large command queue
    float_fmt='.4g',        # Compact float format
    idle_time=0.1           # Longer idle time before flush
) as writer:

    # Write large volumes of data
    for i in range(100000):
        writer.write(i, i * 2.5, i % 100)

Reading CSV Data

  from pynecore.core.csv_file import CSVReader
from pathlib import Path

# Read a CSV file
with CSVReader(Path("market_data.csv")) as reader:
    # Iterate through all records
    for candle in reader:
        print(f"Time: {candle.timestamp}, Close: {candle.close}")

        # Access extra fields
        if candle.extra_fields and "indicator1" in candle.extra_fields:
            print(f"Indicator: {candle.extra_fields['indicator1']}")

Reading Specific Time Ranges

  from pynecore.core.csv_file import CSVReader
from pathlib import Path

# Read a specific time range
with CSVReader(Path("market_data.csv")) as reader:
    start_time = 1609459200  # Unix timestamp
    end_time = 1609459800    # Unix timestamp

    for candle in reader.read_from(start_time, end_time):
        print(f"Time: {candle.timestamp}, Close: {candle.close}")

Performance Optimization

Writer Optimizations

The CSVWriter employs several optimization techniques:

Threaded I/O Operations: I/O is moved to a background thread to avoid blocking the main application
Buffer Management: Smart buffer management minimizes system calls
Timeout-Based Flushing: Buffers are flushed after an idle period, balancing throughput and latency
Efficient String Formatting: Custom float formatting for optimized string conversion
Batched Operations: Multiple records are processed together before flushing

Reader Optimizations

The CSVReader is optimized through:

Memory Mapping: Data is accessed directly through the OS’s virtual memory
Format Auto-Detection: The CSV dialect is automatically detected for optimal parsing
Type Conversion: Efficient parsing and conversion of values to appropriate types
Sequential Access Patterns: Optimized for the sequential nature of CSV files

Technical Details

CSVWriter Internals

The CSVWriter uses a command-based architecture:

Each write operation generates a command (tuple, dict, or OHLCV)
Commands are placed in a thread-safe queue
A worker thread processes commands from the queue
The worker accumulates data in a string buffer
The buffer is flushed to disk when it reaches a threshold size or after an idle period

This design provides several advantages:

Non-blocking writes for the main application thread
Batched I/O operations for improved throughput
Graceful handling of high-volume data streams

CSVReader Internals

The CSVReader leverages Python’s built-in CSV parsing with additional optimizations:

Memory mapping provides efficient access to the file data
The CSV dialect (delimiter, quoting, etc.) is automatically detected
Headers are parsed and mapped to fields, with case-insensitive matching
A position system enables reading specific records or timestamp ranges
Values are converted to appropriate types (timestamps, floats, etc.)

Thread Safety

The CSVWriter is designed with thread safety in mind:

A thread-safe queue manages communication between threads
Critical operations are protected by a lock
Error handling ensures that worker thread exceptions are propagated
Clean shutdown is guaranteed even in error conditions

Choosing Between Binary OHLCV and CSV

When working with financial data in PyneCore, you have two main options:

Binary OHLCV Format (ohlcv_file.py)

Pros: Maximum performance, compact storage, direct random access
Cons: Specialized format, less human-readable, fixed schema

CSV Format (csv_file.py)

Pros: Universal compatibility, human-readable, flexible schema
Cons: Larger file size, slower access, primarily sequential

Selection Guidelines

Use the Binary OHLCV format for:
- High-performance backtesting
- Systems requiring frequent random access
- Long-term data storage
Use the CSV format for:
- Data interchange with other systems
- Human inspection and editing
- Flexible schema requirements
- When additional fields beyond OHLCV are needed

Both systems are designed for performance while maintaining pure Python implementation, aligning with the PyneCore project vision.

Conclusion

The PyneCore CSV Reader/Writer system provides a high-performance solution for CSV processing, optimized for financial data but flexible enough for general use. By leveraging multithreaded writing and memory-mapped reading, it achieves excellent performance while remaining a pure Python implementation.

The system demonstrates how thoughtful architecture and performance optimization techniques can overcome traditional bottlenecks in Python I/O processing. For applications where CSV compatibility is important but performance cannot be sacrificed, this system offers an ideal balance.

OHLCV Reader/Writer

Ultra-fast OHLCV data storage …

Contributing

Guide for contributing to …

Fast CSV Reader/Writer

Fast CSV Reader/Writer link

Overview link

CSVWriter: Multithreaded Performance link

Key Features link

Architecture link

CSVReader: Memory Mapped Reading link

Key Features link

Usage Examples link

Basic CSV Writing link

Writing OHLCV Data link

Performance-Tuned CSV Writing link

Reading CSV Data link

Reading Specific Time Ranges link

Performance Optimization link

Writer Optimizations link

Reader Optimizations link

Technical Details link

CSVWriter Internals link

CSVReader Internals link

Thread Safety link

Choosing Between Binary OHLCV and CSV link

Binary OHLCV Format (ohlcv_file.py) link

CSV Format (csv_file.py) link

Selection Guidelines link

Conclusion link

Fast CSV Reader/Writer

Overview

CSVWriter: Multithreaded Performance

Key Features

Architecture

CSVReader: Memory Mapped Reading

Key Features

Usage Examples

Basic CSV Writing

Writing OHLCV Data

Performance-Tuned CSV Writing

Reading CSV Data

Reading Specific Time Ranges

Performance Optimization

Writer Optimizations

Reader Optimizations

Technical Details

CSVWriter Internals

CSVReader Internals

Thread Safety

Choosing Between Binary OHLCV and CSV

Binary OHLCV Format (ohlcv_file.py)

CSV Format (csv_file.py)

Selection Guidelines

Conclusion