Performance Guide
When working with civic transparency data at scale, these performance characteristics and optimizations can help you build efficient applications.
Benchmark Results
Performance measured on Windows with Python 3.11.9 (your results may vary based on hardware and Python version):
Validation Performance
| Model Type | Records/sec | Time per Record | Use Case |
|---|---|---|---|
| ProvenanceTag | ~160,000 | 0.006 ms | Post metadata validation |
| Series (minimal) | ~25,000 | 0.040 ms | Simple time series |
| Series (100 points) | ~259 | 3.861 ms | Complex time series |
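Numbers like these can be reproduced with a simple timing loop; here is a minimal sketch (it assumes `sample_tag` is a dict that passes ProvenanceTag validation, which depends on your data):

```python
import time

from ci.transparency.types import ProvenanceTag

def measure_validation_rate(sample_tag: dict, iterations: int = 50_000) -> float:
    """Return approximate ProvenanceTag validations per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        ProvenanceTag.model_validate(sample_tag)
    elapsed = time.perf_counter() - start
    return iterations / elapsed

# e.g. print(f"{measure_validation_rate(sample_tag):,.0f} records/sec")
```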
JSON Serialization Performance
| Model Type | Pydantic JSON | stdlib json | Speedup |
|---|---|---|---|
| ProvenanceTag | ~651,000/sec | ~269,000/sec | 2.4x |
| Series (minimal) | ~228,000/sec | ~91,000/sec | 2.5x |
| Series (100 points) | ~3,531/sec | ~1,511/sec | 2.3x |
Key Insight: Pydantic's `model_dump_json()` is consistently ~2.4x faster than `model_dump()` + `json.dumps()`.
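To check the difference yourself, a quick comparison along these lines works (a sketch; `series` is assumed to be any already-validated Series instance):

```python
import json
import timeit

# Assumes `series` is an already-validated Series instance
n = 10_000
fast = timeit.timeit(lambda: series.model_dump_json(), number=n)
slow = timeit.timeit(lambda: json.dumps(series.model_dump(mode="json")), number=n)

print(f"model_dump_json():           {n / fast:,.0f} ops/sec")
print(f"model_dump() + json.dumps(): {n / slow:,.0f} ops/sec")
print(f"Speedup: {slow / fast:.1f}x")
```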
Memory Usage
| Model Type | Memory per Object | JSON Size | Efficiency |
|---|---|---|---|
| ProvenanceTag | ~1,084 bytes | 207 bytes | 5.2x overhead |
| Series (minimal) | ~7,127 bytes | 502 bytes | 14.2x overhead |
| Series (100 points) | ~660,159 bytes | 40,796 bytes | 16.2x overhead |
Memory overhead includes Python object structures, Pydantic metadata, and validation caching.
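If you want to check these figures for your own payloads, here is a rough sketch using the standard library's `tracemalloc` (it assumes `sample_data` validates as a ProvenanceTag):

```python
import tracemalloc

from ci.transparency.types import ProvenanceTag

def memory_per_object(sample_data: dict, count: int = 1_000) -> tuple[float, int]:
    """Return (approx. bytes per model instance, JSON size in bytes)."""
    tracemalloc.start()
    objects = [ProvenanceTag.model_validate(sample_data) for _ in range(count)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    json_size = len(objects[0].model_dump_json().encode())
    return peak / count, json_size
```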
Performance Patterns
1. Complexity Scaling
- ProvenanceTag: Simple enum validation is very fast
- Series: Scales linearly with number of data points
- Nested validation: Each point requires full coordinate validation
2. Serialization vs Validation
- Serialization is faster: JSON generation is roughly 4-14x faster than validation (per the tables above)
- Pydantic optimized: Built-in JSON serializer beats stdlib significantly
- Memory efficient: JSON representation is much more compact
3. Privacy-Preserving Design Impact
The privacy-by-design approach actually helps performance (illustrated below):
- Bucketed enums (like `acct_age_bucket`) validate faster than ranges
- Categorical data requires less processing than continuous values
- Aggregated structures reduce individual record complexity
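For illustration, here are two stand-alone Pydantic models (hypothetical field values, not part of this package) that contrast validating a categorical bucket with validating a raw continuous value; timing both with `timeit` on your hardware shows the relative cost:

```python
from typing import Literal

from pydantic import BaseModel, Field

class BucketedAccount(BaseModel):
    # Categorical: membership check against a small fixed set
    acct_age_bucket: Literal["0-30d", "31-365d", "1-3y", "3y+"]  # hypothetical bucket labels

class RawAccount(BaseModel):
    # Continuous: integer parsing plus range constraints
    acct_age_days: int = Field(ge=0, le=36_500)

BucketedAccount.model_validate({"acct_age_bucket": "1-3y"})
RawAccount.model_validate({"acct_age_days": 742})
```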
Optimization Strategies
High-Throughput Validation
For processing large volumes of civic data:
```python
from typing import Any, Dict, List
import logging

from pydantic import ValidationError

from ci.transparency.types import ProvenanceTag

def process_provenance_batch(data_list: List[Dict[str, Any]]) -> List[ProvenanceTag]:
    """Process ~160K records/second efficiently."""
    results = []
    errors = 0
    for data in data_list:
        try:
            tag = ProvenanceTag.model_validate(data)
            results.append(tag)
        except ValidationError as e:
            errors += 1
            if errors < 10:  # Log only the first few errors
                logging.warning(f"Validation failed: {e}")
    return results

# Expected throughput: ~160,000 ProvenanceTags/second
```
Efficient Series Processing
For time series data with many points:
```python
from ci.transparency.types import Series

def process_series_stream(data_stream):
    """Stream processing for memory efficiency."""
    for series_data in data_stream:
        # Validate once per series (~259/second for 100-point series)
        series = Series.model_validate(series_data)

        # Process and immediately serialize to save memory
        result = series.model_dump_json()  # ~3,531/second
        yield result
        # series goes out of scope, freeing ~660KB
```
JSON Optimization
Choose serialization method based on your needs:
```python
# Fastest: use Pydantic's built-in JSON serializer (recommended)
json_str = series.model_dump_json()  # ~228K/sec for a minimal series

# Alternative: for custom JSON formatting
import orjson  # install separately for even better performance

data = series.model_dump(mode="json")  # convert enums to plain values
json_bytes = orjson.dumps(data)  # potentially faster than stdlib json
```
Memory Management
For processing large datasets:
```python
import gc
from typing import Any, Dict, Iterator

from ci.transparency.types import ProvenanceTag

def memory_efficient_processing(large_dataset: Iterator[Dict[str, Any]]) -> Iterator[ProvenanceTag]:
    """Process data without loading everything into memory."""
    batch_size = 1000
    batch = []
    processed = 0

    for record in large_dataset:
        batch.append(record)
        if len(batch) >= batch_size:
            # Validate the batch, yield results, and clear memory
            yield from (ProvenanceTag.model_validate(data) for data in batch)
            processed += len(batch)
            batch.clear()

            # Optional: force garbage collection in long-running processes
            if processed % 10_000 == 0:
                gc.collect()

    # Validate any remaining records in the final partial batch
    yield from (ProvenanceTag.model_validate(data) for data in batch)
```
Production Considerations
Database Integration
Based on the memory usage figures above, consider your storage strategy:
```python
import gzip

from ci.transparency.types import ProvenanceTag, Series

# For ProvenanceTag (~1KB each): can safely keep thousands in memory
provenance_cache = {}  # OK to cache frequently accessed tags

# For Series (7KB-660KB each): stream processing recommended
def store_series_efficiently(series: Series) -> None:
    # Store as compressed JSON rather than keeping objects in memory
    json_data = series.model_dump_json()
    compressed = gzip.compress(json_data.encode())
    database.store(compressed)  # `database` stands in for your storage backend
```
API Design
Design your APIs based on these performance characteristics:
```python
from typing import List

from fastapi import FastAPI

from ci.transparency.types import ProvenanceTag, Series

app = FastAPI()

@app.post("/provenance/batch")
async def upload_provenance_batch(tags: List[ProvenanceTag]):
    # Can handle large batches efficiently (~160K/sec validation)
    return {"processed": len(tags)}

@app.post("/series")
async def upload_series(series: Series):
    # Individual series upload (validation cost depends on point count)
    point_count = len(series.points)
    if point_count > 1000:
        # Consider async processing for very large series
        return {"status": "queued", "points": point_count}
    return {"status": "processed", "points": point_count}
```
When Performance Matters
High-Performance Scenarios
- Real-time civic monitoring: ProvenanceTag validation at 160K/sec supports live analysis (see the quick estimate below)
- Batch processing: Can process millions of records efficiently
- API endpoints: Fast enough for responsive web applications
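As a back-of-the-envelope check (using a hypothetical traffic figure), the headroom at these rates is easy to estimate:

```python
# Hypothetical load: 5,000 incoming posts/second, each carrying one ProvenanceTag
posts_per_second = 5_000
validation_rate = 160_000  # ProvenanceTag validations/second, from the benchmarks above

core_utilization = posts_per_second / validation_rate
print(f"Validation uses ~{core_utilization:.1%} of one core")  # ~3.1%
```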
Optimization Not Needed
- Typical civic research: These speeds far exceed most analytical workloads
- Small datasets: Optimization overhead not worth it for <10K records
- Prototype development: Focus on correctness first, optimize later
Monitoring Performance
Add performance monitoring to your applications:
```python
import time
import logging

class PerformanceMonitor:
    def __init__(self, name: str):
        self.name = name
        self.start_time = None

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, *args):
        duration = time.perf_counter() - self.start_time
        logging.info(f"{self.name} took {duration:.3f}s")

# Usage
with PerformanceMonitor("ProvenanceTag validation"):
    tags = [ProvenanceTag.model_validate(data) for data in batch]
```
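A small extension of the same idea (a sketch, assuming you know the record count up front) also reports throughput:

```python
class ThroughputMonitor(PerformanceMonitor):
    """Hypothetical helper that logs records/second alongside duration."""

    def __init__(self, name: str, record_count: int):
        super().__init__(name)
        self.record_count = record_count

    def __exit__(self, *args):
        duration = time.perf_counter() - self.start_time
        rate = self.record_count / duration if duration > 0 else float("inf")
        logging.info(f"{self.name}: {duration:.3f}s ({rate:,.0f} records/sec)")

# Usage
with ThroughputMonitor("ProvenanceTag validation", len(batch)):
    tags = [ProvenanceTag.model_validate(data) for data in batch]
```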
Summary
The civic transparency types deliver high performance for privacy-preserving data processing:
- Production-ready speeds: 160K+ validations/second for metadata
- Efficient serialization: Built-in JSON optimization
- Predictable scaling: Performance scales with data complexity
- Memory conscious: Reasonable overhead for rich validation
The privacy-by-design architecture (bucketed enums, aggregated data) improves performance compared to handling raw, detailed data structures.