
Performance Guide

When working with civic transparency data at scale, these performance characteristics and optimizations can help you build efficient applications.

Benchmark Results

Performance was measured on Windows with Python 3.11.9 (your results may vary with hardware and Python version):

Validation Performance

| Model Type          | Records/sec | Time per Record | Use Case                 |
|---------------------|-------------|-----------------|--------------------------|
| ProvenanceTag       | ~160,000    | 0.006 ms        | Post metadata validation |
| Series (minimal)    | ~25,000     | 0.040 ms        | Simple time series       |
| Series (100 points) | ~259        | 3.861 ms        | Complex time series      |

JSON Serialization Performance

| Model Type          | Pydantic JSON | stdlib json  | Speedup |
|---------------------|---------------|--------------|---------|
| ProvenanceTag       | ~651,000/sec  | ~269,000/sec | 2.4x    |
| Series (minimal)    | ~228,000/sec  | ~91,000/sec  | 2.5x    |
| Series (100 points) | ~3,531/sec    | ~1,511/sec   | 2.3x    |

Key Insight: Pydantic's model_dump_json() is consistently ~2.4x faster than model_dump() + json.dumps().
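
To reproduce this comparison on your own hardware, a minimal timing sketch looks like this (it assumes `tag` is an already-validated ProvenanceTag instance; absolute numbers will vary by machine):

import json
import timeit

runs = 10_000

# Assumes `tag` is an already-validated ProvenanceTag instance
pydantic_secs = timeit.timeit(lambda: tag.model_dump_json(), number=runs)
stdlib_secs = timeit.timeit(lambda: json.dumps(tag.model_dump(mode="json")), number=runs)

print(f"model_dump_json():         {runs / pydantic_secs:,.0f} records/sec")
print(f"model_dump() + json.dumps: {runs / stdlib_secs:,.0f} records/sec")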

Memory Usage

| Model Type          | Memory per Object | JSON Size    | Overhead vs JSON |
|---------------------|-------------------|--------------|------------------|
| ProvenanceTag       | ~1,084 bytes      | 207 bytes    | ~5.2x            |
| Series (minimal)    | ~7,127 bytes      | 502 bytes    | ~14.2x           |
| Series (100 points) | ~660,159 bytes    | 40,796 bytes | ~16.2x           |

Memory overhead includes Python object structures, Pydantic metadata, and validation caching.
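
One rough way to approximate these per-object numbers yourself is to trace allocations while validating a batch. This sketch uses tracemalloc and assumes `sample` is a dict that validates as a ProvenanceTag; shared allocations make the result approximate, so it will not match the table exactly:

import tracemalloc

from ci.transparency.types import ProvenanceTag

# Assumes `sample` is a dict that validates as a ProvenanceTag
tracemalloc.start()
objects = [ProvenanceTag.model_validate(sample) for _ in range(1_000)]
current_bytes, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"~{current_bytes / len(objects):,.0f} bytes per ProvenanceTag (approximate)")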

Performance Patterns

1. Complexity Scaling

  • ProvenanceTag: Simple enum validation is very fast
  • Series: Scales linearly with number of data points
  • Nested validation: Each point requires full coordinate validation

2. Serialization vs Validation

  • Serialization is faster: JSON generation runs roughly 4-14x faster than validation for the models above
  • Pydantic optimized: Built-in JSON serializer beats stdlib significantly
  • Memory efficient: JSON representation is much more compact

3. Privacy-Preserving Design Impact

The privacy-by-design approach actually helps performance:

  • Bucketed enums (like acct_age_bucket) validate faster than ranges
  • Categorical data requires less processing than continuous values
  • Aggregated structures reduce individual record complexity
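
If you want to compare categorical and continuous validation for yourself, here is a toy harness; the two models and the bucket labels are illustrative, not part of the library or the spec:

import timeit
from typing import Literal

from pydantic import BaseModel, Field

class BucketedAge(BaseModel):
    # Illustrative bucket labels, not the actual spec enum values
    acct_age_bucket: Literal["0-7d", "8-30d", "1-6m", "6m+"]

class ContinuousAge(BaseModel):
    # Continuous value with range constraints, for comparison
    acct_age_days: float = Field(ge=0, le=36500)

runs = 50_000
bucketed_secs = timeit.timeit(
    lambda: BucketedAge.model_validate({"acct_age_bucket": "0-7d"}), number=runs)
continuous_secs = timeit.timeit(
    lambda: ContinuousAge.model_validate({"acct_age_days": 12.5}), number=runs)

print(f"bucketed:   {runs / bucketed_secs:,.0f} validations/sec")
print(f"continuous: {runs / continuous_secs:,.0f} validations/sec")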

Optimization Strategies

High-Throughput Validation

For processing large volumes of civic data:

from ci.transparency.types import ProvenanceTag
from pydantic import ValidationError
from typing import List, Dict, Any
import logging

def process_provenance_batch(data_list: List[Dict[str, Any]]) -> List[ProvenanceTag]:
    """Process ~160K records/second efficiently."""
    results = []
    errors = 0

    for data in data_list:
        try:
            tag = ProvenanceTag.model_validate(data)
            results.append(tag)
        except ValidationError as e:
            errors += 1
            if errors < 10:  # Log first few errors
                logging.warning(f"Validation failed: {e}")

    return results

# Expected throughput: ~160,000 ProvenanceTags/second

Efficient Series Processing

For time series data with many points:

from ci.transparency.types import Series

def process_series_stream(data_stream):
    """Stream processing for memory efficiency."""
    for series_data in data_stream:
        # Validate once per series (~259/second for 100-point series)
        series = Series.model_validate(series_data)

        # Process and immediately serialize to save memory
        result = series.model_dump_json()  # ~3,531/second
        yield result

        # series goes out of scope, freeing ~660KB

JSON Optimization

Choose serialization method based on your needs:

# Fastest: Use Pydantic's built-in JSON (recommended)
json_str = series.model_dump_json()  # ~228K/sec for minimal series

# Alternative: For custom JSON formatting
import orjson  # Install separately for even better performance

data = series.model_dump(mode='json')  # Convert enums to values
json_bytes = orjson.dumps(data)       # Potentially faster than stdlib

Memory Management

For processing large datasets:

import gc
from typing import Iterator

from ci.transparency.types import ProvenanceTag

def memory_efficient_processing(large_dataset: Iterator[dict]):
    """Process records in batches without loading everything into memory."""
    batch_size = 1000
    batch = []
    processed = 0

    for record in large_dataset:
        batch.append(record)

        if len(batch) >= batch_size:
            # Validate the batch, then release the raw records
            results = [ProvenanceTag.model_validate(data) for data in batch]
            processed += len(results)
            batch.clear()

            # Yield validated tags to the caller
            yield from results

            # Optional: force garbage collection for long-running processes
            if processed % 10_000 == 0:
                gc.collect()

    # Don't drop the final partial batch
    if batch:
        yield from [ProvenanceTag.model_validate(data) for data in batch]

Production Considerations

Database Integration

Based on the memory usage, consider your storage strategy:

import gzip

from ci.transparency.types import Series

# For ProvenanceTag (~1 KB each): can safely keep thousands in memory
provenance_cache = {}  # OK to cache frequently accessed tags

# For Series (7 KB-660 KB each): stream processing recommended
def store_series_efficiently(series: Series):
    # Store as compressed JSON rather than keeping objects in memory
    json_data = series.model_dump_json()
    compressed = gzip.compress(json_data.encode())
    database.store(compressed)  # `database` is a placeholder for your storage layer
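
Reading the data back is the mirror image; a minimal sketch, reusing the gzip and Series imports above and assuming the compressed bytes came from store_series_efficiently:

def load_series(compressed: bytes) -> Series:
    """Decompress and re-validate a Series stored by store_series_efficiently."""
    json_data = gzip.decompress(compressed).decode()
    # model_validate_json parses and validates in one pass (Pydantic v2)
    return Series.model_validate_json(json_data)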

API Design

Design your APIs based on these performance characteristics:

from typing import List

from fastapi import FastAPI
from ci.transparency.types import ProvenanceTag, Series

app = FastAPI()

@app.post("/provenance/batch")
async def upload_provenance_batch(tags: List[ProvenanceTag]):
    # Can handle large batches efficiently (~160K/sec validation)
    return {"processed": len(tags)}

@app.post("/series")
async def upload_series(series: Series):
    # Individual series upload (validation cost depends on point count)
    point_count = len(series.points)
    if point_count > 1000:
        # Consider async processing for very large series
        return {"status": "queued", "points": point_count}
    return {"status": "processed", "points": point_count}

When Performance Matters

High-Performance Scenarios

  • Real-time civic monitoring: ProvenanceTag validation at 160K/sec supports live analysis
  • Batch processing: Can process millions of records efficiently
  • API endpoints: Fast enough for responsive web applications

Optimization Not Needed

  • Typical civic research: These speeds far exceed most analytical workloads
  • Small datasets: Optimization overhead not worth it for <10K records
  • Prototype development: Focus on correctness first, optimize later

Monitoring Performance

Add performance monitoring to your applications:

import time
import logging

class PerformanceMonitor:
    def __init__(self, name: str):
        self.name = name
        self.start_time = None

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, *args):
        duration = time.perf_counter() - self.start_time
        logging.info(f"{self.name} took {duration:.3f}s")

# Usage
with PerformanceMonitor("ProvenanceTag validation"):
    tags = [ProvenanceTag.model_validate(data) for data in batch]
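
To report throughput in the same records-per-second units as the tables above, capture the duration directly; this sketch assumes `batch` is the same list of raw dicts used in the snippet above:

start = time.perf_counter()
tags = [ProvenanceTag.model_validate(data) for data in batch]
elapsed = time.perf_counter() - start

rate = len(tags) / elapsed if elapsed > 0 else float("inf")
logging.info(f"Validated {len(tags)} ProvenanceTags at {rate:,.0f} records/sec")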

Summary

The civic transparency types deliver high performance for privacy-preserving data processing:

  • Production-ready speeds: 160K+ validations/second for metadata
  • Efficient serialization: Built-in JSON optimization
  • Predictable scaling: Performance scales with data complexity
  • Memory conscious: Reasonable overhead for rich validation

The privacy-by-design architecture (bucketed enums, aggregated data) improves performance compared to handling raw, detailed data structures.