
Content Hashing

Content hashing types provide structures for content identification without requiring access to raw content.

Overview

The hash core defines types for representing content via identifiers and fingerprints. This supports privacy-preserving analysis and interoperable APIs.

Core Types

HashId {#hashid}

Represents a typed, canonical content identifier.

from dataclasses import dataclass

@dataclass(frozen=True)
class HashId:
    algo: str   # e.g., "sha256", "blake3", "opaque", "simhash64"
    value: str  # canonical string for that algo (e.g., lowercase hex)

Fields:

- algo: Hash algorithm label (e.g., "sha256", "opaque", "simhash64")
- value: Canonical hash value for that algorithm (e.g., lowercase hex; a simhash64 value is 16 hex characters with no 0x prefix)

Usage:

from ci.transparency.sdk import HashId

# Opaque identifier (privacy-preserving)
hash_id = HashId("opaque", "h1")

# Cryptographic hash (lowercase hex)
crypto_hash = HashId("sha256", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")

# Similarity hash (64-bit, 16 hex chars)
sim64 = HashId("simhash64", "9f3a5c10aa55ee77")

Note: Multi-value fingerprints like MinHash are not represented as HashId. They live in Digests.
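
The canonical-form rule above (lowercase hex, no 0x prefix, fixed width per algorithm) can be enforced before a HashId is built. The following is a minimal sketch, not part of the SDK: `make_hash_id` and the `HEX_LENGTHS` table are hypothetical names, and the HashId class is redefined locally so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashId:
    # Local stand-in mirroring the definition above.
    algo: str
    value: str


# Assumed table of expected hex widths; not taken from the SDK.
HEX_LENGTHS = {"sha256": 64, "md5": 32, "simhash64": 16}


def make_hash_id(algo: str, raw: str) -> HashId:
    """Hypothetical helper: canonicalize `raw` for `algo` and build a HashId."""
    if algo not in HEX_LENGTHS:
        # Opaque or unknown algorithms are passed through unchanged.
        return HashId(algo, raw)
    value = raw.strip().lower()
    if value.startswith("0x"):
        value = value[2:]
    is_hex = all(c in "0123456789abcdef" for c in value)
    if len(value) != HEX_LENGTHS[algo] or not is_hex:
        raise ValueError(f"not a canonical {algo} value: {raw!r}")
    return HashId(algo, value)


sim64 = make_hash_id("simhash64", "0x9F3A5C10AA55EE77")
```

Normalizing at construction time means two spellings of the same digest compare equal afterwards, which matters once identifiers are used as dictionary keys or joined across datasets.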

ContentHash {#contenthash}

Wrapper for a HashId.

from dataclasses import dataclass

@dataclass(frozen=True)
class ContentHash:
    id: HashId

Usage:

from ci.transparency.sdk import ContentHash, HashId

content = ContentHash(HashId("opaque", "content_123"))

TopHash {#tophash}

Represents a content item (by ContentHash) with its frequency.

from dataclasses import dataclass

@dataclass(frozen=True)
class TopHash:
    hash: ContentHash
    count: int

Fields:

- hash: Content identifier
- count: Frequency of this item in the dataset/window

Usage:

from ci.transparency.sdk.types import TopHash, ContentHash, HashId

top_content = TopHash(
    hash=ContentHash(HashId("opaque", "popular_content")),
    count=42
)

Design Principles

Privacy-First

Content is represented by hashes and cluster keys, not raw text.

Type Safety

Algorithms are explicitly labeled via algo to prevent mixing types.
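
As an illustration of why the algo label matters, the sketch below shows that two identifiers with the same value string but different algo labels do not compare equal, and that a frozen dataclass rejects later mutation. The HashId class is redefined locally as a stand-in for the SDK type.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class HashId:
    # Local stand-in mirroring the definition in Core Types.
    algo: str
    value: str


digest = "d41d8cd98f00b204e9800998ecf8427e"
as_md5 = HashId("md5", digest)
as_opaque = HashId("opaque", digest)

# Same value string, different algo: the identifiers are distinct.
distinct = as_md5 != as_opaque

# Frozen dataclasses are hashable, so both survive as separate set members.
unique_ids = {as_md5, as_opaque}

# Attempting to relabel an existing identifier raises FrozenInstanceError.
try:
    as_md5.algo = "sha256"
    mutated = True
except FrozenInstanceError:
    mutated = False
```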

Frequency Tracking

TopHash couples a content identifier with counts for aggregation.
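
One way to produce such pairs is to count observed identifiers and keep the most frequent ones. This is a sketch under stated assumptions: `top_hashes` is a hypothetical helper rather than an SDK function, and the three types are redefined locally so the example is self-contained.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HashId:
    algo: str
    value: str


@dataclass(frozen=True)
class ContentHash:
    id: HashId


@dataclass(frozen=True)
class TopHash:
    hash: ContentHash
    count: int


def top_hashes(observed: Iterable[HashId], limit: int) -> List[TopHash]:
    """Hypothetical helper: count identifiers and keep the `limit` most frequent."""
    counts = Counter(observed)
    return [TopHash(ContentHash(h), n) for h, n in counts.most_common(limit)]


seen = [HashId("opaque", "h1"), HashId("opaque", "h2"), HashId("opaque", "h1")]
top = top_hashes(seen, limit=1)
```

Counting on the whole HashId (algo and value together) rather than on the value string alone keeps identifiers from different algorithms in separate buckets.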

Hash Types

Opaque Hashes

Synthetic/testing identifiers (no content required):

HashId("opaque", "h1")
HashId("opaque", "test_content_abc")

Cryptographic Hashes

Standard one-way hashes:

HashId("sha256", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
HashId("md5", "d41d8cd98f00b204e9800998ecf8427e")
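
For reference, the two example values above are the digests of empty input. A minimal sketch of deriving such identifiers with Python's standard `hashlib` follows; `sha256_id` and `md5_id` are hypothetical helper names, and HashId is redefined locally.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashId:
    # Local stand-in mirroring the definition in Core Types.
    algo: str
    value: str


def sha256_id(content: bytes) -> HashId:
    """Hypothetical helper: hexdigest() already returns lowercase hex."""
    return HashId("sha256", hashlib.sha256(content).hexdigest())


def md5_id(content: bytes) -> HashId:
    """Hypothetical helper for the md5 label used above."""
    return HashId("md5", hashlib.md5(content).hexdigest())


empty_sha256 = sha256_id(b"")
empty_md5 = md5_id(b"")
```

Only the resulting HashId needs to leave the process that saw the raw bytes, which is what keeps downstream analysis independent of the content itself.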

Similarity Hashes

Locality-sensitive single-value identifiers (64-bit SimHash):

HashId("simhash64", "9f3a5c10aa55ee77")

For MinHash (multi-value), see Digests.
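
SimHash values are usually compared by Hamming distance: the fewer bits that differ, the more similar the underlying content. The sketch below assumes the 16-hex-character simhash64 form described in Core Types; `hamming_distance` is a hypothetical helper and HashId is redefined locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashId:
    # Local stand-in mirroring the definition in Core Types.
    algo: str
    value: str


def hamming_distance(a: HashId, b: HashId) -> int:
    """Hypothetical helper: number of differing bits between two simhash64 ids."""
    if a.algo != "simhash64" or b.algo != "simhash64":
        raise ValueError("both identifiers must use the simhash64 algo")
    # XOR leaves a 1 bit wherever the two fingerprints disagree.
    return bin(int(a.value, 16) ^ int(b.value, 16)).count("1")


left = HashId("simhash64", "9f3a5c10aa55ee77")
right = HashId("simhash64", "9f3a5c10aa55ee76")
distance = hamming_distance(left, right)
```

Checking the algo label first is the practical payoff of the typed identifier: a Hamming distance between a SimHash and a SHA-256 digest would be meaningless.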

Usage Patterns

Content Identification

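# Build a map from content id -> count.
# Assumes `window` is an aggregate (e.g., a WindowAgg) exposing a list of TopHash as `top_hashes`.
content_map = {}
for top in window.top_hashes:
    content_map[top.hash.id.value] = top.count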

Data Structure Creation

from ci.transparency.sdk.types import TopHash, ContentHash, HashId

top_hashes = [
    TopHash(ContentHash(HashId("opaque", f"content_{i}")), count=10 - i)
    for i in range(5)
]

Serialization

Content hashes serialize to nested JSON:

{
  "hash": {
    "id": {
      "algo": "opaque",
      "value": "h1"
    }
  },
  "count": 8
}
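
The nested shape above matches what `dataclasses.asdict` produces for these types. The following round-trip sketch uses only the standard library; the SDK's own serialization utilities may differ, and `top_hash_from_dict` is a hypothetical helper. The types are redefined locally so the example runs on its own.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HashId:
    algo: str
    value: str


@dataclass(frozen=True)
class ContentHash:
    id: HashId


@dataclass(frozen=True)
class TopHash:
    hash: ContentHash
    count: int


def top_hash_from_dict(data: dict) -> TopHash:
    """Hypothetical helper: rebuild a TopHash from the nested JSON shape."""
    ident = data["hash"]["id"]
    return TopHash(ContentHash(HashId(ident["algo"], ident["value"])), data["count"])


original = TopHash(ContentHash(HashId("opaque", "h1")), 8)
payload = json.dumps(asdict(original))  # asdict recurses into nested dataclasses
restored = top_hash_from_dict(json.loads(payload))
```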

Related:

  • WindowAgg - Contains lists of TopHash objects
  • Digests - Content fingerprinting with similarity hashes
  • I/O Schema - Serialization utilities

Database Storage

Store TopHash inside JSON columns for flexibility:

- Preserves algorithm and canonical value
- Enables JSON extraction of fields for queries
- Keeps counts alongside identifiers

See Schema Reference for examples.
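
As a rough illustration of the JSON-column approach, the sketch below stores the serialized form in SQLite and filters on the algorithm with `json_extract`. The table and column names are hypothetical, the choice of SQLite is an assumption made for the example, and it requires a SQLite build with the JSON functions enabled (the default in recent releases).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one JSON document per TopHash.
conn.execute("CREATE TABLE top_hashes (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")

rows = [
    {"hash": {"id": {"algo": "opaque", "value": "h1"}}, "count": 8},
    {"hash": {"id": {"algo": "md5", "value": "d41d8cd98f00b204e9800998ecf8427e"}}, "count": 3},
]
conn.executemany(
    "INSERT INTO top_hashes (payload) VALUES (?)",
    [(json.dumps(row),) for row in rows],
)

# Extract the value and count for every identifier with a given algo.
query = """
SELECT json_extract(payload, '$.hash.id.value'),
       json_extract(payload, '$.count')
FROM top_hashes
WHERE json_extract(payload, '$.hash.id.algo') = ?
"""
opaque_rows = conn.execute(query, ("opaque",)).fetchall()
conn.close()
```

Because the algo label travels inside the stored document, queries can filter by algorithm without a separate column, at the cost of extracting it at query time.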