Illinois Identity Dataset (Chicago Contracts)¶
This repository provides a curated, publication-ready subset of the Chicago
Contracts dataset, formatted for use in Civic Interconnect identity
experiments.
Its primary purpose is to support evaluation of the SNFEI entity-identity
algorithm across heterogeneous civic data sources.
The data and metadata in this package allow Civic Interconnect users to:
- Reproduce a fixed snapshot of a large public dataset.
- Evaluate the stability of legal name normalization rules.
- Measure false-split and false-merge rates for entity identity clustering.
- Test adapter implementations against real, messy, high-value data.
- Integrate provenance and citation into record envelopes (CEP and CEE).
What This Package Contains¶
data/
raw/
chicago_contracts_2025-12-11.csv # full snapshot from Chicago Open Data
data-out/
chicago_contracts_vendors_sample_20k.csv
index.json # machine-readable dataset registry
manifest.json # provenance and generation metadata
The data-out/ directory contains versioned, stable artifacts intended for:
- SNFEI testing
- adapter development
- entity-normalization experiments
- cross-domain identity resolution research
- training examples for documentation
The raw file under data/raw/ is preserved as-is to support:
- reproducibility
- hash checking
- independent recomputation of derived samples
Why Chicago Contracts?¶
Chicago's procurement dataset contains:
- hundreds of thousands of records
- many repeated vendors
- variations in capitalization, punctuation, abbreviations, and corporate suffixes
- misspellings and data-entry inconsistencies
- rich attributes (department, amounts, contract type, dates, vendor identifiers)
This makes it an ideal stress-test corpus for:
- organization-name canonicalization
- SNFEI projection stability
- cross-year cluster consistency
- partial or malformed vendor identifiers
How the 20k Identity Sample Was Generated¶
The sample in data-out/chicago_contracts_vendors_sample_20k.csv was produced using:
uv run python src/civic_data_identity_us_il/make_chicago_identity_sample.py --overwrite
The script:
- Loads the raw CSV snapshot.
- Selects the vendor-name column and related fields.
- Deduplicates, cleans, and slices a reproducible subset of 20,000 rows.
- Writes both:
- the derived sample
- a PROV-style manifest documenting inputs, outputs, timestamp, and generator
Both index.json and manifest.json are generated automatically.
All transformations are deterministic.
Provenance & Reproducibility¶
This dataset provides two layers of provenance, aligned with CEP's CEE model:
1. Manifest Provenance (data-out/manifest.json)¶
Describes:
- exact raw input file(s)
- checksum-stable paths
- software and script used
- generation timestamp
- license and citation of the upstream data source
2. PROV-YAML (provenance/chicago_contracts.prov.yaml)¶
Describes:
- Entity → Activity → Agent
- how the sample file was produced
- claims suitable for embedding in CEP record envelopes
Citation¶
If you use this dataset in research, testing, documentation, or software:
City of Chicago. "Contracts Dataset." Accessed via
https://data.cityofchicago.org/Administration-Finance/Contracts/rsxa-ify5.
Civic Interconnect. "Illinois Identity Dataset (Chicago Contracts)." v0.1.0.
https://github.com/civic-interconnect/civic-data-identity-us-il.
A machine-readable citation is also embedded in:
CITATION.cffdata-out/index.json
Status and Versioning¶
This repository follows:
- Semantic Versioning for published dataset package versions
- Keep a Changelog conventions
- setuptools_scm for automated tagging
Relationship to Civic Interconnect¶
This dataset is not part of CEP itself.
It is a companion dataset supporting:
- SNFEI design validation
- Adapter testing
- Text normalization tuning
- Reference examples for CEP documentation
- Realistic cross-domain identity experiments
Contributing¶
Contributions welcome, including:
- additional identity benchmarks
- domain-specific normalization tests
- enhancements to the sampling script
- improved provenance annotations
Please open issues or PRs on GitHub.
License: Upstream Data¶
The raw Chicago Contracts dataset is released by the City of Chicago under the Chicago Open Data Portal Terms of Use, which functionally place the data in the public domain.
Source: https://data.cityofchicago.org/Administration-Finance/Contracts/rsxa-ify5
License: This Repository¶
All code, scripts, derived datasets, and documentation in this repository
(civic-data-identity-us-il) are released under the MIT License.
This allows:
- reuse in academic, civic-technology, and commercial settings
- embedding into the broader Civic Interconnect ecosystem
- distribution through PyPI or other registries
The MIT license used here is intentionally compatible with:
- Apache-2.0 (the license for Civic Interconnect core repositories), and
- other permissive licenses commonly used for data-oriented packages.
Nothing in this repository overrides the upstream public-domain data rights. Generated artifacts combine MIT-licensed code with public data and are therefore MIT-licensed.