Usage Guide

This guide covers the basic workflows for generating synthetic transparency data and converting it for analysis.

Installation

pip install civic-transparency-sim

Basic Workflow

1. Generate Synthetic Data

Baseline (Organic) World:

ct-sdk generate --world A --topic-id baseline --out world_A.jsonl

Influenced World with Parameters:

ct-sdk generate --world B --topic-id influenced --out world_B.jsonl \
  --seed 4343 --dup-mult 1.35 --burst-minutes 3 --reply-nudge -0.10
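When generation is scripted from Python rather than typed at a shell, the same command can be assembled as an argument list. The sketch below uses only the flags shown in this guide. The helper name build_generate_cmd is hypothetical and is not part of the package.

```python
import shlex

def build_generate_cmd(world, topic_id, out, seed=None, dup_mult=None,
                       burst_minutes=None, reply_nudge=None):
    """Assemble a `ct-sdk generate` argument list (hypothetical helper)."""
    cmd = ["ct-sdk", "generate", "--world", world,
           "--topic-id", topic_id, "--out", out]
    # Optional flags are appended only when a value was supplied.
    optional = {
        "--seed": seed,
        "--dup-mult": dup_mult,
        "--burst-minutes": burst_minutes,
        "--reply-nudge": reply_nudge,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

cmd_b = build_generate_cmd("B", "influenced", "world_B.jsonl",
                           seed=4343, dup_mult=1.35,
                           burst_minutes=3, reply_nudge=-0.10)
# Print the shell-quoted form for inspection (note -0.10 is rendered as -0.1).
print(shlex.join(cmd_b))
# To actually run it: subprocess.run(cmd_b, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues with file names that contain spaces.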

2. Convert to Database

ct-sdk convert --jsonl world_A.jsonl --duck world_A.duckdb --schema schema/schema.sql
ct-sdk convert --jsonl world_B.jsonl --duck world_B.duckdb --schema schema/schema.sql

3. Analyze with Direct Scripts

# Individual world analysis
python -m scripts_py.plot_quick --duck world_A.duckdb --outdir plots/world_A

# Comparative analysis
python -m scripts_py.plot_compare_ducks --ducks world_A.duckdb world_B.duckdb --outdir plots/compare_AB

Generation Parameters

Core Parameters

  • --world: World identifier (e.g., "A", "B", "baseline")
  • --topic-id: Topic identifier for content clustering
  • --out: Output JSONL file path
  • --windows: Number of time windows (default: 12)
  • --step-minutes: Minutes per window (default: 10)
  • --seed: Random seed for reproducibility (default: 4242)
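With the defaults above, a run covers 12 windows of 10 minutes each, or 120 minutes in total. The sketch below lays out such a schedule. The start time and the half-open window convention are assumptions made for illustration, not documented behavior of the simulator.

```python
from datetime import datetime, timedelta, timezone

def window_bounds(start, windows=12, step_minutes=10):
    """Return (window_start, window_end) pairs for consecutive windows."""
    step = timedelta(minutes=step_minutes)
    return [(start + i * step, start + (i + 1) * step) for i in range(windows)]

# Arbitrary start time chosen for the example (an assumption).
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
bounds = window_bounds(start)
total_minutes = (bounds[-1][1] - bounds[0][0]).total_seconds() / 60
print(len(bounds), total_minutes)  # 12 120.0
```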

Influence Parameters

Optional parameters that modify generation behavior:

  • --dup-mult: Duplicate multiplier (amplifies content duplication)
  • --burst-minutes: Micro-burst duration (coordinated activity spikes)
  • --reply-nudge: Reply proportion adjustment (positive/negative shift)

When any influence parameter is provided, generation automatically switches to the influenced algorithm.
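One way to picture this switch is a check for whether any influence value was supplied. The function below is a hypothetical illustration of that rule, not the package's code. It assumes an explicitly supplied value of zero still counts as provided.

```python
def pick_generator(dup_mult=None, burst_minutes=None, reply_nudge=None):
    """Return which generation path would apply (hypothetical sketch)."""
    influence = (dup_mult, burst_minutes, reply_nudge)
    # Compare against None so that an explicit 0 or 0.0 still counts as provided.
    if any(value is not None for value in influence):
        return "influenced"
    return "baseline"

print(pick_generator())                  # baseline
print(pick_generator(reply_nudge=-0.10)) # influenced
```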

Data Format

Each line of a generated JSONL file is one window aggregation record containing:

  • Temporal data: Window start/end times
  • Activity metrics: Message counts, unique hashes, duplicate rates
  • Content fingerprints: SimHash and MinHash signatures
  • Clustering data: Top hash frequencies, concentration measures
  • Behavioral patterns: Type mix (post/reply/retweet), burst scores
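To inspect a generated file without the SDK, the records can be read with the standard library. This minimal sketch does not assume any field names. It only loads one JSON object per line and reports which keys appear. The helper names are hypothetical.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Load one JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records

def field_names(records):
    """Return the sorted union of keys across all records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)
```

Calling read_jsonl("world_A.jsonl") and then field_names on the result lists the fields actually present in a generated file.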

Database Schema

The DuckDB schema includes:

  • Primary table: events with window-level aggregations
  • Columns for all core metrics and metadata
  • JSON columns for complex data (top_hashes, time_histogram)
  • Proper typing for timestamps and numeric values

Reproducibility

Generation is deterministic, and each output records what produced it:

  • Seed-based: The same seed with the same parameters produces identical output
  • Version tracking: Metadata includes package versions
  • Parameter logging: All settings preserved in output
  • Schema versioning: Database structures documented
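The seed-based property can be illustrated with Python's standard random module. This is a toy stand-in for the simulator, not its actual generator. The function toy_counts and its value range are invented for the example.

```python
import random

def toy_counts(seed, windows=12):
    """Draw one pseudo-random message count per window from a seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global state is untouched
    return [rng.randint(50, 150) for _ in range(windows)]

first = toy_counts(4242)
second = toy_counts(4242)
# The same seed and parameters give the same sequence on every call.
print(first == second)  # True
```

Using a private random.Random instance rather than the module-level functions keeps results independent of any other code that draws random numbers.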

Example Seeds

Standard seeds for common scenarios:

  • Baseline organic: 4242
  • Light influence: 4343
  • Custom scenarios: Use any integer

Programmatic Usage

from ci.transparency.sdk import WindowAgg
from ci.transparency.sim.metrics import herfindahl, cv_of_bins

# Load and analyze data programmatically
# (See Type Reference for detailed API documentation)
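The imported names suggest a Herfindahl concentration index and a coefficient of variation over histogram bins. The package's own signatures are not shown here, so the sketch below gives self-contained stand-ins based on the textbook definitions. The function names, the use of population standard deviation, and the zero-handling are all assumptions.

```python
from statistics import mean, pstdev

def herfindahl_index(counts):
    """Sum of squared shares: 1/n for an even split, 1.0 for a single item."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: empty or all-zero input has no concentration
    return sum((c / total) ** 2 for c in counts)

def coefficient_of_variation(bins):
    """Population standard deviation divided by the mean of the bins."""
    if not bins:
        return 0.0
    m = mean(bins)
    if m == 0:
        return 0.0  # assumption: all-zero bins are treated as perfectly even
    return pstdev(bins) / m

# Made-up inputs for illustration only.
print(round(herfindahl_index([40, 30, 20, 10]), 4))  # 0.3
print(round(coefficient_of_variation([5, 15]), 4))   # 0.5
```

Higher values of either measure indicate activity concentrated in fewer hashes or fewer time bins.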