ID Management
ID management types provide consistent identifier schemes across the transparency simulation system.
Overview
The ID management module defines standardized approaches for creating, validating, and using identifiers throughout the system. This ensures consistency and enables cross-referencing between different components.
Core Types
HashId
Basic hash identifier with type information.
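from dataclasses import dataclass

@dataclass
class HashId:
    type: str   # hash algorithm or identifier type, e.g. "sha256"
    value: str  # the identifier string itself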
Fields:
- type: Hash algorithm or identifier type
- value: The actual identifier string
Common Types:
- "opaque": Synthetic identifiers for testing
- "sha256": SHA-256 cryptographic hashes
- "md5": MD5 hashes (legacy support)
- "simhash": SimHash signatures
- "uuid": UUID identifiers
WorldId
Identifier for simulation worlds.
from dataclasses import dataclass

@dataclass
class WorldId:
    value: str

    def __post_init__(self):
        # Reject empty strings so every world has a usable key.
        if not self.value:
            raise ValueError("World ID cannot be empty")
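As a brief illustration of the empty-value check, the sketch below wraps construction in a helper that returns None instead of raising. WorldId is redefined locally so the snippet runs on its own, and `try_make_world` is a hypothetical helper, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorldId:
    value: str

    def __post_init__(self):
        if not self.value:
            raise ValueError("World ID cannot be empty")


def try_make_world(value: str) -> Optional[WorldId]:
    """Hypothetical helper: return a WorldId, or None if the value is rejected."""
    try:
        return WorldId(value)
    except ValueError:
        return None


print(try_make_world("experiment_A"))  # a WorldId instance
print(try_make_world(""))              # None, because the value is empty
```

Returning None is only one option; callers that prefer to fail fast can construct WorldId directly and let the ValueError propagate.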
TopicId
Identifier for content clusters, assigned without inspecting the content itself. A TopicId is a deterministic key derived from content identifiers or fingerprints (e.g., SimHash or MinHash via LSH).
from dataclasses import dataclass

@dataclass(frozen=True)
class TopicId:
    algo: str   # e.g., "simhash64-lsh", "minhash-lsh", "sha256", "opaque-topic"
    value: str  # canonical cluster key for that algo

    def __str__(self) -> str:
        return f"{self.algo}:{self.value}"
Examples (strings stored in DB):
- simhash64-lsh:9f3a5c10aa55ee77
- minhash-lsh:AbC1_2xY... (base64url if not hex)
- opaque-topic:t_001
Note: TopicId is not a human-readable label; it is a stable, privacy-preserving cluster key.
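Because the stored form is the `algo:value` string, code reading from the database may need to turn it back into a TopicId. The following is a minimal sketch of one way to do that; `parse_topic_id` is a hypothetical function, and it assumes algorithm names never contain a colon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicId:
    algo: str
    value: str

    def __str__(self) -> str:
        return f"{self.algo}:{self.value}"


def parse_topic_id(text: str) -> TopicId:
    """Hypothetical inverse of str(TopicId): split on the first colon only."""
    algo, sep, value = text.partition(":")
    if not sep or not algo or not value:
        raise ValueError(f"Expected 'algo:value', got {text!r}")
    return TopicId(algo, value)


topic = parse_topic_id("simhash64-lsh:9f3a5c10aa55ee77")
print(topic.algo, topic.value)
```

Splitting on the first colon only keeps values that themselves contain colons intact, at the cost of forbidding colons in the algorithm name.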
Usage Examples
Basic Identifiers
from ci.transparency.sdk import HashId, WorldId, TopicId
# Hash identifiers
content_hash = HashId("sha256", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
opaque_hash = HashId("opaque", "test_content_123")
# World and topic identifiers
world = WorldId("experiment_A")
topic = TopicId("simhash64-lsh", "9f3a5c10aa55ee77")  # both algo and value are required
Identifier Validation
def validate_hash_id(hash_id: HashId) -> bool:
    """Validate hash ID format."""
    if hash_id.type == "sha256":
        return len(hash_id.value) == 64 and all(c in "0123456789abcdef" for c in hash_id.value.lower())
    elif hash_id.type == "md5":
        return len(hash_id.value) == 32 and all(c in "0123456789abcdef" for c in hash_id.value.lower())
    elif hash_id.type == "opaque":
        return bool(hash_id.value)  # Any non-empty string
    return False
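The validator above returns False for the "uuid" type listed under Common Types. One possible extension, shown here as a sketch, delegates to the standard library's `uuid.UUID` parser. `is_valid_uuid_hash_id` is a hypothetical name, and HashId is redefined locally so the snippet is self-contained.

```python
import uuid
from dataclasses import dataclass


@dataclass
class HashId:
    type: str
    value: str


def is_valid_uuid_hash_id(hash_id: HashId) -> bool:
    """Hypothetical check: True if type is "uuid" and the value parses as a UUID."""
    if hash_id.type != "uuid":
        return False
    try:
        uuid.UUID(hash_id.value)
    except ValueError:
        return False
    return True


print(is_valid_uuid_hash_id(HashId("uuid", str(uuid.uuid4()))))
print(is_valid_uuid_hash_id(HashId("uuid", "not-a-uuid")))
```

Note that `uuid.UUID` is lenient: it also accepts values without hyphens, wrapped in braces, or prefixed with `urn:uuid:`. A stricter check would compare the value against `str(uuid.UUID(value))`.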
Identifier Generation
import hashlib
import uuid

def generate_content_hash(content: str) -> HashId:
    """Generate SHA-256 hash for content."""
    hash_value = hashlib.sha256(content.encode()).hexdigest()
    return HashId("sha256", hash_value)

def generate_world_id() -> WorldId:
    """Generate unique world identifier."""
    return WorldId(f"world_{uuid.uuid4().hex[:8]}")
Identifier Patterns
Hierarchical Naming
# Use hierarchical patterns for organization
experiment_world = WorldId("exp2025_baseline_A")
topic_cluster = TopicId("opaque-topic", "t_001")  # topic keys are not human-readable labels
Timestamped Identifiers
from datetime import datetime, timezone

def timestamped_world_id(prefix: str) -> WorldId:
    """Create timestamped world identifier."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return WorldId(f"{prefix}_{timestamp}")
Content-Based Identifiers
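import hashlib

def content_topic_id(content_sample: str) -> TopicId:
    """Generate topic ID based on content characteristics."""
    # Simplified example - in practice would use NLP
    keywords = content_sample.lower().split()[:3]
    # Hash the keywords so the stored key does not expose raw content.
    digest = hashlib.sha256("_".join(keywords).encode()).hexdigest()
    return TopicId("sha256", digest)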
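The keyword example above is only a stand-in. As a rough sketch of what a fingerprint-based key could look like, the following computes an unweighted 64-bit SimHash over a list of tokens and formats it as 16 hex characters. `simhash64` and `simhash_topic_id` are hypothetical names, not SDK functions; hashing tokens with truncated SHA-256 is an assumption; and the LSH bucketing step that would group near-identical fingerprints is omitted, so the full fingerprint stands in for the cluster key.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TopicId:
    algo: str
    value: str


def simhash64(tokens: Iterable[str]) -> int:
    """Compute an unweighted 64-bit SimHash over the given tokens."""
    counts = [0] * 64
    for token in tokens:
        # Assumption: the first 8 bytes of SHA-256 serve as the 64-bit token hash.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        # A bit is set when more tokens voted 1 than 0 at that position.
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def simhash_topic_id(tokens: Iterable[str]) -> TopicId:
    """Hypothetical helper: wrap the fingerprint as a zero-padded hex TopicId."""
    return TopicId("simhash64-lsh", f"{simhash64(tokens):016x}")


print(simhash_topic_id(["fp_001", "fp_002", "fp_003"]))
```

The same token list always yields the same key, which is the determinism property TopicId relies on.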
Serialization
Identifiers serialize to simple string values or structured objects:
{
  "hash_id": {
    "type": "sha256",
    "value": "e3b0c44298fc..."
  },
  "world_id": "experiment_A",
  "topic_id": "simhash64-lsh:9f3a5c10aa55ee77"
}
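One way such a payload might be produced is sketched below. It assumes the HashId object is serialized with its dataclass field names (`type`, `value`), WorldId as its bare value, and TopicId in its `algo:value` string form. `serialize_ids` is a hypothetical helper, and the classes are redefined locally to keep the snippet runnable.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HashId:
    type: str
    value: str


@dataclass
class WorldId:
    value: str


@dataclass(frozen=True)
class TopicId:
    algo: str
    value: str

    def __str__(self) -> str:
        return f"{self.algo}:{self.value}"


def serialize_ids(hash_id: HashId, world_id: WorldId, topic_id: TopicId) -> str:
    """Hypothetical helper: structured object for HashId, plain strings otherwise."""
    payload = {
        "hash_id": asdict(hash_id),
        "world_id": world_id.value,
        "topic_id": str(topic_id),
    }
    return json.dumps(payload, sort_keys=True)


print(serialize_ids(
    HashId("opaque", "test_content_123"),
    WorldId("experiment_A"),
    TopicId("opaque-topic", "t_001"),
))
```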
Validation Utilities
Format Checking
class IdValidator:
    @staticmethod
    def is_valid_world_id(world_id: str) -> bool:
        """Check if world ID format is valid."""
        return bool(world_id and len(world_id) <= 64 and world_id.replace("_", "").replace("-", "").isalnum())

    @staticmethod
    def is_valid_topic_id(topic_id: str) -> bool:
        """Check if topic ID format is valid."""
        return bool(topic_id and len(topic_id) <= 128)
Uniqueness Checking
class IdRegistry:
    def __init__(self):
        self.used_world_ids = set()
        self.used_topic_ids = set()

    def register_world_id(self, world_id: WorldId) -> bool:
        """Register world ID and check uniqueness."""
        if world_id.value in self.used_world_ids:
            return False
        self.used_world_ids.add(world_id.value)
        return True
Best Practices
Naming Conventions
- Use descriptive prefixes for different identifier types
- Include version numbers for evolving experiments
- Use consistent separators (underscores recommended)
- Avoid spaces and special characters
Collision Avoidance
- Include timestamps for time-sensitive identifiers
- Use random components for high-uniqueness requirements
- Validate uniqueness before using identifiers
- Maintain registries for active identifier spaces
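The collision-avoidance points above can be combined into a small retry loop: generate a candidate with a random component, check it against a registry, and try again on a clash. This is a sketch under assumptions; `new_unique_world_id`, its `prefix` and `max_attempts` parameters are hypothetical, and the registry is a trimmed local copy of the IdRegistry idea.

```python
import uuid
from dataclasses import dataclass


@dataclass
class WorldId:
    value: str


class IdRegistry:
    def __init__(self):
        self.used_world_ids = set()

    def register_world_id(self, world_id: WorldId) -> bool:
        """Return True if the ID was new and is now registered."""
        if world_id.value in self.used_world_ids:
            return False
        self.used_world_ids.add(world_id.value)
        return True


def new_unique_world_id(registry: IdRegistry, prefix: str = "world",
                        max_attempts: int = 5) -> WorldId:
    """Hypothetical helper: retry random candidates until one registers."""
    for _ in range(max_attempts):
        candidate = WorldId(f"{prefix}_{uuid.uuid4().hex[:8]}")
        if registry.register_world_id(candidate):
            return candidate
    raise RuntimeError(f"No unique world ID found after {max_attempts} attempts")


registry = IdRegistry()
print(new_unique_world_id(registry, prefix="exp2025"))
```

Bounding the number of attempts keeps a saturated identifier space from turning into an infinite loop.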
Documentation
- Document identifier schemes in experiment metadata
- Include identifier generation rules in configuration
- Provide examples of valid identifiers
- Explain identifier meaning and scope
Related Types
- ContentHash - Uses HashId for content identification
- WindowAgg - Uses world_id and topic_id fields
Database Storage
Identifiers map to database columns:
- Simple string storage for most identifier types
- JSON storage preserves type information for HashId
- Indexes on identifier columns for efficient queries
See Schema Reference for database identifier handling.
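For orientation only, here is a sketch of those three storage points using an in-memory SQLite database. The table name `example_ids`, its columns, and the index name are all hypothetical; the actual schema is defined in the Schema Reference and may differ.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: plain TEXT for world/topic IDs, JSON text for HashId.
conn.execute(
    """
    CREATE TABLE example_ids (
        world_id TEXT NOT NULL,
        topic_id TEXT NOT NULL,
        content_hash TEXT NOT NULL
    )
    """
)
# Index the identifier columns used for lookups.
conn.execute("CREATE INDEX idx_example_ids_world_topic ON example_ids (world_id, topic_id)")

hash_json = json.dumps({
    "type": "sha256",
    "value": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
})
conn.execute(
    "INSERT INTO example_ids (world_id, topic_id, content_hash) VALUES (?, ?, ?)",
    ("experiment_A", "simhash64-lsh:9f3a5c10aa55ee77", hash_json),
)

row = conn.execute(
    "SELECT content_hash FROM example_ids WHERE world_id = ? AND topic_id = ?",
    ("experiment_A", "simhash64-lsh:9f3a5c10aa55ee77"),
).fetchone()
hash_info = json.loads(row[0])  # type information survives the round trip
print(hash_info["type"])
```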