Schema Reference

This document describes the database schema used for storing transparency simulation data in DuckDB.

Privacy

This schema stores only aggregated metrics and hash-based identifiers. Hashes are one-way digests and cannot be reversed to recover the original content. No raw user data or message text is recorded.

Events Table

The primary table events stores window-level aggregations of activity data.

Schema Definition

CREATE TABLE events (
    world_id TEXT,
    topic_id TEXT,
    window_start TIMESTAMP,
    window_end TIMESTAMP,
    n_messages INTEGER,
    n_unique_hashes INTEGER,
    dup_rate REAL,
    top_hashes TEXT,  -- JSON array
    hash_concentration REAL,
    burst_score REAL,
    type_post REAL,
    type_reply REAL,
    type_retweet REAL,
    time_histogram TEXT  -- JSON array
);

Column Descriptions

Identifiers:

- world_id: World identifier (e.g., "A", "B")
- topic_id: Topic/content cluster identifier

Temporal:

- window_start: Window start timestamp (ISO format)
- window_end: Window end timestamp (ISO format)

Activity Metrics:

- n_messages: Total messages in window
- n_unique_hashes: Number of unique content hashes
- dup_rate: Duplication rate (0.0 to 1.0)

Content Analysis:

- top_hashes: JSON array of top content clusters with counts
- hash_concentration: Herfindahl index (sum of squared shares) of the hash distribution

Temporal Patterns:

- burst_score: Coefficient of variation (standard deviation over mean) of minute-level activity
- time_histogram: JSON array of message counts per minute

Content Types:

- type_post: Proportion of original posts (0.0 to 1.0)
- type_reply: Proportion of replies (0.0 to 1.0)
- type_retweet: Proportion of retweets/shares (0.0 to 1.0)
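
Putting the columns together, a single row might be inserted like this (all values are illustrative and chosen to be internally consistent, not taken from a real run):

INSERT INTO events VALUES (
    'A',                    -- world_id
    't-001',                -- topic_id (illustrative)
    '2025-01-01 12:00:00',  -- window_start
    '2025-01-01 12:10:00',  -- window_end
    39,                     -- n_messages
    12,                     -- n_unique_hashes
    0.69,                   -- dup_rate (illustrative)
    '[{"hash": {"id": {"algo": "opaque", "value": "h1"}}, "count": 8}]',  -- top_hashes
    0.31,                   -- hash_concentration (illustrative)
    1.4,                    -- burst_score (illustrative)
    0.5,                    -- type_post
    0.3,                    -- type_reply
    0.2,                    -- type_retweet
    '[12, 9, 7, 5, 3, 2, 1, 0, 0, 0]'  -- time_histogram: sums to n_messages
);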

JSON Column Formats

top_hashes contains an array of objects:

[
  {
    "hash": {
      "id": {
        "algo": "opaque",
        "value": "h1"
      }
    },
    "count": 8
  }
]
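
To expand the whole array rather than reading single elements, one option is DuckDB's json_transform, which converts the JSON text into a typed list that unnest can flatten into rows. This is a sketch: the struct layout mirrors the format above, and the alias h is arbitrary.

SELECT
    world_id,
    topic_id,
    h['hash']['id']['value'] AS hash_id,
    h['count'] AS hash_count
FROM (
    SELECT
        world_id,
        topic_id,
        -- Convert the JSON array into a typed list of structs, then unnest it
        unnest(json_transform(top_hashes,
            '[{"hash": {"id": {"algo": "VARCHAR", "value": "VARCHAR"}}, "count": "INTEGER"}]')) AS h
    FROM events
);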

time_histogram contains minute-level counts:

[12, 9, 7, 5, 3, 2, 1, 0, 0, 0]
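
Since burst_score is defined above as the coefficient of variation of minute-level activity (standard deviation over mean), it can be re-derived from this histogram. A sketch, assuming the pipeline uses the population standard deviation:

SELECT
    world_id,
    window_start,
    -- Coefficient of variation; NULLIF guards against empty windows
    stddev_pop(cnt) / NULLIF(avg(cnt), 0) AS recomputed_burst_score
FROM (
    SELECT
        world_id,
        window_start,
        unnest(json_transform(time_histogram, '["INTEGER"]')) AS cnt
    FROM events
)
GROUP BY world_id, window_start;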

Schema Variants

Student Schema (schema.sql)

A basic schema for educational use, covering the essential columns with clear documentation.

Loading Data

Data is loaded via the conversion utility:

ct-sdk convert --jsonl data.jsonl --duck data.duckdb --schema schema/schema.sql

The loader:

1. Creates the table if it doesn't exist
2. Clears existing data
3. Inserts new records with proper type conversion
4. Validates JSON column formats (see the sketch below)
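
To re-run a check like step 4 by hand, DuckDB's json_valid function can flag malformed rows after loading. A minimal sketch:

SELECT count(*) AS bad_json_rows
FROM events
WHERE NOT json_valid(top_hashes)
   OR NOT json_valid(time_histogram);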

Query Examples

Basic aggregations:

SELECT
    world_id,
    AVG(dup_rate) AS avg_dup_rate,
    AVG(hash_concentration) AS avg_concentration,
    AVG(burst_score) AS avg_burst_score
FROM events
GROUP BY world_id;

Time series analysis:

SELECT
    window_start,
    n_messages,
    dup_rate,
    burst_score
FROM events
WHERE world_id = 'A'
ORDER BY window_start;

Content type analysis:

SELECT
    world_id,
    AVG(type_post) AS avg_post_share,
    AVG(type_reply) AS avg_reply_share,
    AVG(type_retweet) AS avg_retweet_share
FROM events
GROUP BY world_id;

JSON data extraction:

-- Extract top hash information (DuckDB JSON functions)
SELECT
    world_id,
    topic_id,
    json_extract_string(top_hashes, '$[0].hash.id.value') AS top_hash_id,
    json_extract(top_hashes, '$[0].count') AS top_hash_count
FROM events;

Data Types and Constraints

- Timestamps: Stored as TIMESTAMP, typically UTC
- Rates: REAL values between 0.0 and 1.0
- Counts: Non-negative INTEGER values
- JSON: Valid JSON strings for complex data
- Identifiers: TEXT with application-specific formats
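
These ranges are conventions of the data; the schema above does not declare them as constraints. If enforcement at the database level is wanted, a variant with CHECK constraints could look like this (a sketch, not the shipped schema; the table name events_validated is illustrative):

CREATE TABLE events_validated (
    world_id TEXT NOT NULL,
    topic_id TEXT NOT NULL,
    window_start TIMESTAMP NOT NULL,
    window_end TIMESTAMP NOT NULL CHECK (window_end >= window_start),
    n_messages INTEGER CHECK (n_messages >= 0),
    n_unique_hashes INTEGER CHECK (n_unique_hashes >= 0),
    dup_rate REAL CHECK (dup_rate BETWEEN 0.0 AND 1.0),
    top_hashes TEXT CHECK (json_valid(top_hashes)),
    hash_concentration REAL CHECK (hash_concentration BETWEEN 0.0 AND 1.0),
    burst_score REAL CHECK (burst_score >= 0.0),
    type_post REAL CHECK (type_post BETWEEN 0.0 AND 1.0),
    type_reply REAL CHECK (type_reply BETWEEN 0.0 AND 1.0),
    type_retweet REAL CHECK (type_retweet BETWEEN 0.0 AND 1.0),
    time_histogram TEXT CHECK (json_valid(time_histogram))
);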

Performance Considerations

Indexing: Consider indexes on world_id, topic_id, and window_start for common queries.
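
For example (the index name is illustrative):

CREATE INDEX idx_events_world_window ON events (world_id, window_start);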

JSON Processing: DuckDB provides efficient JSON functions for extracting specific fields from JSON columns.

Time Series: Window-based partitioning can improve performance for time series queries.
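
One way to approximate this in DuckDB is a Hive-partitioned Parquet export, splitting the data by day (a sketch; the output path is illustrative):

COPY (
    -- Derive a date column to partition on
    SELECT *, CAST(window_start AS DATE) AS window_date
    FROM events
) TO 'events_by_day' (FORMAT PARQUET, PARTITION_BY (window_date));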

Aggregations: Pre-computed aggregations are stored at the window level to optimize analysis queries.