API Reference
Core Modules
Configuration
civic_interconnect.paperkit.config
Configuration module for paper kit metadata handling.
This module provides:
- TypedDict definitions for asset and metadata configuration
- Functions to load and normalize metadata from YAML files
- Default file extension configurations for allowed assets
File: src/civic_interconnect/paperkit/config.py
DirectAssetTD
Bases: TypedDict
TypedDict for direct asset configuration.
Attributes:
| Name | Type | Description | 
|---|---|---|
| url | str | The URL of the asset. | 
| filename | NotRequired[str] | Optional filename for the asset. | 
| checksum | NotRequired[str] | Optional checksum for the asset. | 
EntryMetaTD
Bases: TypedDict
TypedDict for entry metadata configuration.
Attributes:
| Name | Type | Description | 
|---|---|---|
| notes | NotRequired[str] | Optional notes about the entry. | 
| out_dir | NotRequired[str] | Optional output directory for the entry. | 
| assets | NotRequired[list[AssetTD]] | Optional list of assets associated with the entry. | 
PageAssetTD
Bases: TypedDict
TypedDict for page-based asset configuration.
Attributes:
| Name | Type | Description | 
|---|---|---|
| page_url | str | The URL of the page to scrape for assets. | 
| allow_ext | NotRequired[list[str]] | Optional list of allowed file extensions. | 
| href_regex | NotRequired[str] | Optional regex pattern to match href attributes. | 
| limit | NotRequired[int] | Optional limit on number of assets to collect. | 
| base_url | NotRequired[str] | Optional base URL for relative links. | 
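To make the three shapes concrete, the sketch below builds one entry as a plain dictionary, with one direct asset and one page asset, using only the keys listed in the tables above. The URLs, the file names, and the helper `is_page_asset` are hypothetical, and telling the two asset kinds apart by the presence of `page_url` is an assumption of this sketch rather than documented behavior.

```python
from typing import Any

# One entry in the shape described by EntryMetaTD (all values are made up).
entry: dict[str, Any] = {
    "notes": "Annual report and its appendix files",
    "out_dir": "reports/2023",
    "assets": [
        # DirectAssetTD: a single file at a known URL.
        {"url": "https://example.org/files/report.pdf", "filename": "report.pdf"},
        # PageAssetTD: a page to scan for links to matching files.
        {
            "page_url": "https://example.org/appendix/",
            "allow_ext": [".pdf", ".csv"],
            "limit": 5,
        },
    ],
}


def is_page_asset(asset: dict[str, Any]) -> bool:
    """Return True when the asset carries a page_url key (hypothetical helper)."""
    return "page_url" in asset


kinds = ["page" if is_page_asset(a) else "direct" for a in entry["assets"]]
print(kinds)
```

Only `url` and `page_url` are required in their respective shapes; every other key may be left out.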
load_meta
load_meta(meta_path: Path) -> MetaTD
Load metadata from a YAML file.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| meta_path | Path | Path to the YAML metadata file. | required | 
Returns:
| Type | Description | 
|---|---|
| MetaTD | Dictionary containing the loaded and normalized metadata entries. | 
Raises:
| Type | Description | 
|---|---|
| ValueError | If the YAML file does not contain a mapping of bibkeys. | 
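The ValueError case can be illustrated without touching YAML at all. The sketch below assumes the file has already been parsed into a Python object (for example by a YAML loader) and only checks that the top level is a mapping keyed by bibkey. `check_meta_mapping` is a hypothetical helper, not the library's function, and the normalization that `load_meta` performs is not reproduced here.

```python
from collections.abc import Mapping
from typing import Any


def check_meta_mapping(raw: Any) -> dict[str, Any]:
    """Return raw as a dict of bibkey -> entry, or raise ValueError."""
    if not isinstance(raw, Mapping):
        raise ValueError("metadata must be a mapping of bibkeys to entries")
    return dict(raw)


# Already-parsed content with one bibkey (values are made up).
parsed = {"smith2020": {"assets": [{"url": "https://example.org/a.pdf"}]}}
meta = check_meta_mapping(parsed)

try:
    check_meta_mapping(["smith2020"])  # a list, not a mapping
    rejected = False
except ValueError:
    rejected = True

print(list(meta), rejected)
```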
Bibliography
civic_interconnect.paperkit.bib
Bibliography handling utilities for the paper kit.
This module provides functionality for loading and processing BibTeX files:
- BibEntry: TypedDict for bibliography entries
- BibDatabaseLike: Protocol for bibliography database objects
- load_bib_keys: Function to extract citation keys from BibTeX files
File: src/civic_interconnect/paperkit/bib.py
BibDatabaseLike
Bases: Protocol
Protocol for bibliography database objects.
This protocol defines the interface for bibliography database objects that contain a list of bibliography entries and support attribute access.
Attributes:
| Name | Type | Description | 
|---|---|---|
| entries | List[BibEntry] | A list of bibliography entries from the database. | 
Methods:
| Name | Description | 
|---|---|
| __getattr__ | Provide access to additional attributes on the database object. | 
__getattr__
__getattr__(name: str) -> object
Provide access to additional attributes on the database object.
BibEntry
Bases: TypedDict
A bibliography entry from a BibTeX file.
Attributes:
| Name | Type | Description | 
|---|---|---|
| ID | str | The citation key/identifier for the bibliography entry. | 
load_bib_keys
load_bib_keys(bib_path: Path) -> list[str]
Load citation keys from a BibTeX file.
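For intuition about what a citation key is, the sketch below pulls keys out of BibTeX text with a regular expression. This is a simplification under stated assumptions: it works on a string rather than a path, `extract_keys` is a hypothetical name, and the library itself may well rely on a BibTeX parser (the `BibDatabaseLike` protocol suggests so) rather than a regex.

```python
import re

# Matches "@type{key," at the start of an entry and captures type and key.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")

# Block types that look like entries but carry no citation key.
NON_ENTRIES = {"comment", "string", "preamble"}


def extract_keys(bibtex: str) -> list[str]:
    """Return citation keys in order of appearance (hypothetical helper)."""
    return [
        key
        for kind, key in ENTRY_RE.findall(bibtex)
        if kind.lower() not in NON_ENTRIES
    ]


sample = """
@article{smith2020,
  title = {An Example},
}
@book{doe_2019, title = {Another Example}}
"""
print(extract_keys(sample))
```

A regex like this will miss unusual formatting, which is one reason a dedicated parser is usually the safer choice for real files.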
Orchestration
civic_interconnect.paperkit.orchestrate
Orchestration module for downloading and managing assets linked to bibliography entries.
This module provides:
- DownloadRecord and Summary dataclasses for tracking downloads
- Functions to guess filenames, run the download process, and handle asset scraping
File: src/civic_interconnect/paperkit/orchestrate.py
DownloadRecord (dataclass)
Represents a record of downloaded assets for a bibliography entry.
Attributes:
| Name | Type | Description | 
|---|---|---|
| bibkey | str | The bibliography key associated with the entry. | 
| paths | list[Path] | List of file paths to successfully downloaded assets. | 
| errors | list[str] | List of error messages encountered during download. | 
Summary (dataclass)
Summary of the download process for bibliography entries.
Attributes:
| Name | Type | Description | 
|---|---|---|
| processed | list[DownloadRecord] | List of records for processed entries. | 
| skipped | list[str] | List of keys that were skipped. | 
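The two records can be pictured as plain dataclasses. The sketch below mirrors the documented field names and types; the empty-list defaults and the sample values are assumptions, and the real classes may define defaults or extra behavior differently.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DownloadRecord:
    bibkey: str
    paths: list[Path] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


@dataclass
class Summary:
    processed: list[DownloadRecord] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)


summary = Summary()
summary.processed.append(
    DownloadRecord("smith2020", paths=[Path("out/smith2020/report.pdf")])
)
summary.processed.append(DownloadRecord("doe2019", errors=["HTTP 404"]))
summary.skipped.append("lee2021")  # hypothetical key that was not processed

# Entries that were processed but hit at least one error.
failed = [rec.bibkey for rec in summary.processed if rec.errors]
print(failed)
```

Keeping errors on each record, rather than raising, lets a caller report every failure after the whole run has finished.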
guess_filename_from_url
guess_filename_from_url(url: str) -> str
Guess a safe filename from a URL.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| url | str | The URL from which to extract the filename. | required | 
Returns:
| Type | Description | 
|---|---|
| str | A sanitized filename derived from the URL. | 
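One plausible way to derive a filename is to take the last segment of the URL path and replace unsafe characters. The sketch below does that with the standard library; the name `guess_name`, the fallback `download`, and the exact character set are assumptions, so the library's function may return different names for the same URL.

```python
import re
from urllib.parse import unquote, urlparse


def guess_name(url: str, fallback: str = "download") -> str:
    """Derive a conservative filename from a URL path (illustrative only)."""
    path = urlparse(url).path
    # Last path segment, with percent-escapes such as %20 decoded.
    last = unquote(path.rstrip("/").rsplit("/", 1)[-1])
    # Keep letters, digits, dot, dash, underscore; replace everything else.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", last).strip("._")
    return cleaned or fallback


print(guess_name("https://example.org/docs/Annual%20Report.pdf?v=2"))
print(guess_name("https://example.org/"))
```

The query string is ignored because `urlparse` separates it from the path before the last segment is taken.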
run
run(
    bib_path: Path,
    meta_path: Path,
    out_root: Path,
    client: Any,
) -> Summary
Orchestrate the download of assets for bibliography entries.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| bib_path | Path | Path to the bibliography file. | required | 
| meta_path | Path | Path to the metadata file. | required | 
| out_root | Path | Root directory for output files. | required | 
| client | Any | HTTP client for downloading files. | required | 
Returns:
| Type | Description | 
|---|---|
| Summary | Summary of processed entries and any errors encountered. | 
HTTP Client
civic_interconnect.paperkit.http_client
HTTP client wrapper for making GET requests with retries and logging.
This module provides the HttpClient dataclass for robust HTTP GET requests, including configurable timeout, retries, backoff, and user-agent.
File: src/civic_interconnect/paperkit/http_client.py
HttpClient (dataclass)
HTTP client for making GET requests with retries, backoff, and custom user-agent.
Attributes:
| Name | Type | Description | 
|---|---|---|
| session | Session | The requests session used for HTTP requests. | 
| timeout | int | Timeout for each request in seconds. | 
| retries | int | Number of retry attempts for failed requests. | 
| backoff_seconds | int | Base seconds to wait between retries (multiplied by attempt number). | 
| user_agent | str | User-Agent header for requests. | 
get
get(url: str) -> requests.Response
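Perform an HTTP GET request with retries and backoff. As described for the backoff_seconds attribute, the wait between retries is the base value multiplied by the attempt number, so the delay grows with each attempt.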
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| url | str | The URL to send the GET request to. | required | 
Returns:
| Type | Description | 
|---|---|
| Response | The HTTP response object. | 
Raises:
| Type | Description | 
|---|---|
| Exception | If all retry attempts fail, the last exception is raised. | 
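The retry behavior can be sketched independently of any HTTP library. The example below retries a callable and waits `backoff_seconds * attempt` between tries, matching the attribute description above. `get_with_retries`, the `fetch` and `sleep` parameters, and the reading of `retries` as the total number of attempts are all assumptions of this sketch.

```python
import time
from typing import Callable, Optional


def get_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    retries: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url) up to `retries` times, re-raising the last error."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc
            if attempt < retries:
                sleep(backoff_seconds * attempt)  # wait grows per attempt
    assert last_error is not None
    raise last_error


calls: list[str] = []
waits: list[float] = []


def flaky(url: str) -> bytes:
    """Fail twice, then succeed (stand-in for a real request)."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return b"ok"


# Record the waits instead of sleeping so the example runs instantly.
body = get_with_retries(flaky, "https://example.org/a.pdf", sleep=waits.append)
print(body, waits)
```

Passing `sleep` as a parameter is a testing convenience of this sketch; it is not part of the documented `HttpClient` interface.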
Download
civic_interconnect.paperkit.download
File download utilities with checksum validation and safe filename handling.
This module provides:
- ensure_dir: Create directories recursively if they don't exist
- safe_filename: Convert strings to filesystem-safe filenames
- sha256_file: Calculate SHA256 hash of a file
- write_bytes: Write bytes to a file with directory creation
- download_file: Download files with optional checksum verification
File: src/civic_interconnect/paperkit/download.py
download_file
download_file(
    client: Any,
    url: str,
    out_path: Path,
    checksum: str | None = None,
) -> Path
Download a file from a URL, save it to a path, and optionally verify its checksum.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| client | Any | HTTP client with a .get(url) method returning a response with .content. | required | 
| url | str | The URL to download the file from. | required | 
| out_path | Path | The path to save the downloaded file. | required | 
| checksum | str \| None | Optional SHA256 checksum to verify the downloaded file. | None | 
Returns:
| Type | Description | 
|---|---|
| Path | The path to the saved file. | 
Raises:
| Type | Description | 
|---|---|
| ValueError | If the checksum does not match. | 
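The documented contract (a client with `.get(url)` returning an object with `.content`, plus an optional SHA256 check) is small enough to sketch end to end. In the example below, `FakeClient`, `FakeResponse`, and `fetch_to_path` are hypothetical stand-ins so that nothing touches the network. Whether the real function hashes before or after writing, and what it does with a file that fails the check, is not stated in this reference; the sketch verifies first and writes only on success.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeResponse:
    content: bytes


class FakeClient:
    """Stand-in client: returns fixed bytes for any URL."""

    def get(self, url: str) -> FakeResponse:
        return FakeResponse(content=b"hello paperkit")


def fetch_to_path(
    client, url: str, out_path: Path, checksum: Optional[str] = None
) -> Path:
    data = client.get(url).content
    if checksum is not None:
        actual = hashlib.sha256(data).hexdigest()
        if actual != checksum.lower():
            raise ValueError(f"checksum mismatch for {url}: got {actual}")
    # Create missing parent directories, then write the verified bytes.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(data)
    return out_path


expected = hashlib.sha256(b"hello paperkit").hexdigest()
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "a" / "f.txt"
    saved = fetch_to_path(FakeClient(), "https://example.org/f.txt", target, expected)
    saved_bytes = saved.read_bytes()
    try:
        fetch_to_path(FakeClient(), "https://example.org/f.txt", Path(tmp) / "bad.txt", "0" * 64)
        mismatch_raised = False
    except ValueError:
        mismatch_raised = True

print(saved_bytes, mismatch_raised)
```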
ensure_dir
ensure_dir(p: Path) -> None
Create the directory at the given path, including any necessary parent directories.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| p | Path | The directory path to create. | required | 
safe_filename
safe_filename(name: str) -> str
Convert a string to a filesystem-safe filename.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| name | str | The original filename or string. | required | 
Returns:
| Type | Description | 
|---|---|
| str | A sanitized, filesystem-safe filename. | 
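The reference does not say which characters count as unsafe, so the following is only one reasonable interpretation. The sketch keeps a conservative whitelist and replaces everything else; `to_safe_name`, the whitelist, the `file` fallback, and the length cap are all assumptions.

```python
import re


def to_safe_name(name: str, max_length: int = 255) -> str:
    """Replace characters outside a conservative whitelist (illustrative)."""
    # Collapse every run of disallowed characters into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip())
    # Avoid names that are empty or consist only of dots or underscores.
    cleaned = cleaned.strip("._") or "file"
    return cleaned[:max_length]


print(to_safe_name("my report: final/v2?.pdf"))
print(to_safe_name("../../etc/passwd"))
```

Stripping leading dots and replacing path separators means that an input such as `../../etc/passwd` cannot escape the output directory.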
sha256_file
sha256_file(path: Path) -> str
Calculate the SHA256 hash of a file.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| path | Path | The path to the file to hash. | required | 
Returns:
| Type | Description | 
|---|---|
| str | The SHA256 hexadecimal digest of the file. | 
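A common way to hash a file without loading it fully into memory is to feed it to `hashlib` in chunks. The sketch below shows that pattern; `hash_file` and the 64 KiB chunk size are assumptions, and the library's implementation may read the file differently.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc" * 100_000)
    file_digest = hash_file(sample)

# Hashing the same bytes in memory must give the same digest.
memory_digest = hashlib.sha256(b"abc" * 100_000).hexdigest()
print(file_digest == memory_digest)
```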
write_bytes
write_bytes(path: Path, content: bytes) -> None
Write bytes to a file, creating parent directories if necessary.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| path | Path | The file path to write to. | required | 
| content | bytes | The bytes content to write. | required | 
Returns:
| Type | Description | 
|---|---|
| None |  | 
Web Scraping
civic_interconnect.paperkit.scrape
Functions for extracting and filtering links from HTML documents.
This module provides utilities to parse HTML, extract anchor links, filter them by extension and regular expression, and log the results.
File: src/civic_interconnect/paperkit/scrape.py
extract_links
extract_links(
    html: str,
    base_url: str,
    allow_ext: list[str],
    href_regex: str | None,
) -> list[str]
Extract and filter anchor links from an HTML document.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| html | str | The HTML content to parse. | required | 
| base_url | str | The base URL to resolve relative links. | required | 
| allow_ext | list[str] | List of allowed file extensions (e.g., ['.pdf', '.html']). | required | 
| href_regex | str \| None | Optional regular expression to further filter hrefs. | required | 
Returns:
| Type | Description | 
|---|---|
| list[str] | List of filtered, absolute URLs extracted from the HTML. | 
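The filtering steps described above (collect anchors, resolve against the base URL, filter by extension, then by regex) can be sketched with the standard library alone. `find_links` and `_AnchorCollector` are hypothetical names, and several details are assumptions: the regex is matched against the raw href, the extension is compared against the URL path without its query string, and duplicates are dropped. The library may use a different HTML parser and different rules.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(
    html: str, base_url: str, allow_ext: list[str], href_regex: Optional[str]
) -> list[str]:
    collector = _AnchorCollector()
    collector.feed(html)
    pattern = re.compile(href_regex) if href_regex else None
    allowed = tuple(ext.lower() for ext in allow_ext)
    links: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Compare the extension against the URL path, ignoring any query string.
        if not urlparse(absolute).path.lower().endswith(allowed):
            continue
        if pattern and not pattern.search(href):
            continue
        if absolute not in links:  # keep first occurrence, preserve order
            links.append(absolute)
    return links


page = """
<a href="files/report.pdf">Report</a>
<a href="/data/table.csv?download=1">Table</a>
<a href="about.html">About</a>
<a href="files/report.pdf">Report again</a>
"""
links = find_links(page, "https://example.org/docs/", [".pdf", ".csv"], None)
pdf_only = find_links(page, "https://example.org/docs/", [".pdf", ".csv"], r"report")
print(links)
print(pdf_only)
```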
CLI
civic_interconnect.paperkit.cli
Command-line interface for the paperkit tool.
This module provides the main CLI entry point for fetching public data for bibliography references, including argument parsing and orchestration of the fetch process.
File: src/civic_interconnect/paperkit/cli.py
main
main() -> int
Run the paperkit CLI.
Logging
civic_interconnect.paperkit.log
Logging utilities for the civic_interconnect.paperkit module.
Provides a library-wide logger and optional configuration for console output.
File: src/civic_interconnect/paperkit/log.py
configure
configure(level: str = 'INFO') -> None
Configure basic console output for logging.
Only used by the CLI or by applications that explicitly opt in.
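The opt-in pattern described here is a common one for libraries: stay silent by default and attach a console handler only when asked. The sketch below shows that pattern with the standard `logging` module. The logger name `paperkit_example`, the function `configure_console`, and the message format are assumptions and do not reflect the library's actual logger name or output.

```python
import logging

# Library-wide logger; the name is an assumption for this sketch.
logger = logging.getLogger("paperkit_example")
# A NullHandler keeps the library silent until the application opts in.
logger.addHandler(logging.NullHandler())


def configure_console(level: str = "INFO") -> None:
    """Attach one console handler to the example logger (opt-in)."""
    # Avoid stacking duplicate handlers when called more than once.
    if not any(type(h) is logging.StreamHandler for h in logger.handlers):
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Unknown level names fall back to INFO.
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))


configure_console("debug")
configure_console("debug")  # second call adds no extra handler
logger.debug("console logging is on")
```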