Usage Guide
This guide covers the basic workflows for generating synthetic transparency data and converting it for analysis.
Installation
pip install civic-transparency-sim
Basic Workflow
1. Generate Synthetic Data
Baseline (Organic) World:
ct-sdk generate --world A --topic-id baseline --out world_A.jsonl
Influenced World with Parameters:
ct-sdk generate --world B --topic-id influenced --out world_B.jsonl \
--seed 4343 --dup-mult 1.35 --burst-minutes 3 --reply-nudge -0.10
2. Convert to Database
ct-sdk convert --jsonl world_A.jsonl --duck world_A.duckdb --schema schema/schema.sql
ct-sdk convert --jsonl world_B.jsonl --duck world_B.duckdb --schema schema/schema.sql
3. Analyze with Direct Scripts
# Individual world analysis
python -m scripts_py.plot_quick --duck world_A.duckdb --outdir plots/world_A
# Comparative analysis
python -m scripts_py.plot_compare_ducks --ducks world_A.duckdb world_B.duckdb --outdir plots/compare_AB
Generation Parameters
Core Parameters
--world
: World identifier (e.g., "A", "B", "baseline")--topic-id
: Topic identifier for content clustering--out
: Output JSONL file path--windows
: Number of time windows (default: 12)--step-minutes
: Minutes per window (default: 10)--seed
: Random seed for reproducibility (default: 4242)
Influence Parameters
Optional parameters that modify generation behavior:
--dup-mult
: Duplicate multiplier (amplifies content duplication)--burst-minutes
: Micro-burst duration (coordinated activity spikes)--reply-nudge
: Reply proportion adjustment (positive/negative shift)
When any influence parameters are provided, the system automatically uses the influenced generation algorithm.
Data Format
Generated JSONL files contain window aggregation records with:
- Temporal data: Window start/end times
- Activity metrics: Message counts, unique hashes, duplicate rates
- Content fingerprints: SimHash and MinHash signatures
- Clustering data: Top hash frequencies, concentration measures
- Behavioral patterns: Type mix (post/reply/retweet), burst scores
Database Schema
The DuckDB schema includes:
- Primary table:
events
with window-level aggregations - Columns for all core metrics and metadata
- JSON columns for complex data (top_hashes, time_histogram)
- Proper typing for timestamps and numeric values
Reproducibility
All generation is deterministic:
- Seed-based: Same seed produces identical output
- Version tracking: Metadata includes package versions
- Parameter logging: All settings preserved in output
- Schema versioning: Database structures documented
Example Seeds
Standard seeds for common scenarios:
- Baseline organic:
4242
- Light influence:
4343
- Custom scenarios: Use any integer
Programmatic Usage
from ci.transparency.sdk import WindowAgg
from ci.transparency.sim.metrics import herfindahl, cv_of_bins
# Load and analyze data programmatically
# (See Type Reference for detailed API documentation)