Practical ingestion patterns for common data sources.
Examples should demonstrate ContractForge capabilities without hiding recurring behavior behind notebook workarounds. If a pattern is broadly useful, prefer a declarative contract or a reusable template.
Public CSV over HTTP
Uses http_file, explicit CSV options, source completeness and overwrite mode.
NASA EONET REST API
Uses rest_api raw response mode and transform.shape.parse_json to model nested arrays.
Azure Blob / ADLS files
Uses object storage paths, schemas, partitioned folders, recursive lookup and multiple file formats.
RDS/Aurora JDBC
Uses JDBC partitioning, pushdown, RDS IAM token generation and hash diff writes.
Native connector handoff
Uses Databricks-native connectors, Lakehouse Federation, vendor replication or marketplace tools to land specialized sources first, then hands the landed table or files to ContractForge.
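The hash diff writes mentioned for the RDS/Aurora JDBC pattern can be illustrated with a minimal, framework-independent sketch. It is not ContractForge's implementation: the function names `row_hash` and `hash_diff`, the choice of SHA-256 and the in-memory lists of dictionaries are all assumptions made for illustration. The idea is to hash the non-key columns of each row and compare that hash with the one stored for the same key in the target.

```python
import hashlib
import json


def row_hash(row, key_columns):
    """Hash every non-key column in a stable, sorted order (illustrative helper)."""
    payload = {name: row[name] for name in sorted(row) if name not in key_columns}
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def hash_diff(target_rows, source_rows, key_columns):
    """Classify source rows as insert, update or unchanged against the target."""

    def key_of(row):
        return tuple(row[name] for name in key_columns)

    # Hypothetical in-memory stand-in for the hashes stored with the target table.
    target_hashes = {key_of(row): row_hash(row, key_columns) for row in target_rows}

    result = {"insert": [], "update": [], "unchanged": []}
    for row in source_rows:
        existing = target_hashes.get(key_of(row))
        if existing is None:
            result["insert"].append(row)
        elif existing != row_hash(row, key_columns):
            result["update"].append(row)
        else:
            result["unchanged"].append(row)
    return result
```

Only the rows classified as insert or update would need to be written, which is what makes repeated loads of an unchanged source cheap.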
Generate a starter project
contractforge init --bundle --layer bronze --target-schema raw --table b_orders --mode append
contractforge validate-project contracts/
Real-world examples in the repository
The repository includes deterministic examples derived from connector validation work. They are intended to look like production handoff assets rather than minimal snippets.
| Example | What it demonstrates |
|---|---|
| examples/real-world/rest-nested-json-shape | REST raw payload ingestion, transform.shape.parse_json, array handling, SCD hash-diff and quality warnings. |
| examples/real-world/usgs-earthquake-geojson-shape | Public USGS GeoJSON ingestion with schema_ref, array-of-struct explosion, coordinate extraction, derived metrics, SCD hash-diff, gold marts and governance split. |
| examples/real-world/large-known-dataset-tpch | Known Databricks TPCH sample source with 29.9M observed rows, a 250k-row validated run, SQL/table connectors, hash-diff upsert, gold aggregation and control-table evidence. |
| examples/real-world/autoloader-available-now | Auto Loader available-now ingestion, checkpoints, schema location and stream control-table evidence. |
| examples/real-world/jdbc-postgres-scd1 | JDBC reads, partitioning, watermark, current-state upsert and quarantine-capable quality rules. |
| examples/real-world/object-storage-multiformat | Provider-neutral object storage, CSV/JSON/Parquet/Avro landing folders, explicit schemas and governance separation. |
| examples/real-world/supabase-jdbc-medallion | Real Databricks/AWS medallion parity project with shared connection YAML, logical table refs, JDBC partitioning, quality quarantine, DAB execution, AWS S3 artifacts and Glue job deployment. Snowflake direct JDBC extraction remains review-required unless the source is pre-staged into a Snowflake-supported source. |
Validate all real-world examples:
PYTHONPATH=src python examples/real-world/scripts/validate_real_world.py
Scenario catalog
These scenarios are intended to show production-shaped ingestion patterns, not isolated syntax snippets. Each scenario should make source access, schema handling, transform behavior, quality evidence and runtime assumptions visible.
| Scenario | Typical runtime | What it demonstrates |
|---|---|---|
| Public CSV over HTTP | Databricks serverless | http_file, explicit CSV options, overwrite, source metadata and a serverless-safe download path. |
| Multi-format object storage | Databricks serverless or classic | Provider-neutral object_storage, governed paths, CSV/JSON/Parquet/Avro folders, recursive lookup and source-format metadata. |
| Nested REST JSON | Databricks serverless or classic | rest_api, raw payload preservation, transform.shape.parse_json, array explosion and schema separation between source and transform. |
| Small-file folder stress | Serverless or classic | Many files, explicit schema, recursive lookup, glob/regex selection, row counts and control-table observability. |
| Azure Blob SAS files | Azure Databricks classic cluster | Direct object storage credentials, recursive folders, schemas, multiple file formats and cluster-level storage configuration. |
| Azure External Location | Databricks serverless | Governed storage access without direct SAS in the contract, path reads through Unity Catalog and source metadata. |
| AWS S3 External Location | AWS Databricks serverless | Unity Catalog storage governance, S3 paths through external locations and serverless-compatible access. |
| AWS S3 direct credentials | Classic cluster | S3A credential setup, access key/session token behavior and cross-cloud object storage constraints. |
| Auto Loader on Blob/ADLS | Databricks | cloudFiles, checkpoints, available_now, child run metrics and stream aggregation. |
| Supabase/Postgres JDBC | Databricks and AWS Glue/Iceberg; Snowflake after pre-stage or native source binding | JDBC driver, pushdown, partitioned reads, watermarking, current-state/hash diff patterns and platform parity through logical refs. |
| RDS/Aurora IAM | Classic cluster | JDBC auth.type=rds_iam, botocore/default credential chain and network prerequisite diagnosis. |
| Large known dataset TPCH | Azure Databricks classic cluster | Known public sample source, 250k-row processing, SQL/table connectors, hash-diff upsert, gold aggregation and runtime evidence suitable for dashboard screenshots. |
| Native platform handoff | Serverless or classic | Specialized sources are landed or federated by Databricks/native tooling and then processed by ContractForge as table, sql, files or object storage. |
| SFTP partner drops | Classic or serverless native connection | Driver-download versus Databricks-native SFTP connection strategy, staging paths, file limits and host-key policy. |
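The partitioned JDBC reads listed in the catalog rest on splitting a numeric column into ranges that can be read in parallel. The sketch below shows that idea in plain Python. It follows the general approach of Spark's `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` JDBC options, but it is a simplified illustration and not ContractForge or Spark code; the function name `partition_predicates` is hypothetical.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Build one WHERE predicate per partition for an integer column.

    The bounds only decide the stride. Rows below `lower` fall into the
    first partition and rows above `upper` into the last, so no row is lost.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    if num_partitions == 1:
        return ["1 = 1"]  # a single partition reads everything

    stride = max((upper - lower) // num_partitions, 1)
    boundaries = [lower + stride * i for i in range(1, num_partitions)]

    predicates = []
    for index in range(num_partitions):
        low = boundaries[index - 1] if index > 0 else None
        high = boundaries[index] if index < num_partitions - 1 else None
        if low is None:
            # First partition also collects NULL values of the column.
            predicates.append(f"{column} < {high} OR {column} IS NULL")
        elif high is None:
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {high}")
    return predicates
```

For example, `partition_predicates("order_id", 0, 100, 4)` yields four non-overlapping ranges with boundaries at 25, 50 and 75. Skewed key distributions still produce uneven partitions, so the bounds are worth checking against the actual data.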
Evidence expected from examples
An example is considered useful when it produces evidence that a user can inspect after execution:
| Evidence | Where to inspect it | Why it matters |
|---|---|---|
| Run status and row counts | ctrl_ingestion_runs | Confirms whether data movement and write accounting match expectations. |
| Source metadata | source payload in run records | Shows which connector, path/query and redacted options were used. |
| Quality status | ctrl_ingestion_quality and run summary | Proves quality gates were executed and not hidden in notebook code. |
| Quarantine details | ctrl_ingestion_quarantine | Shows isolated bad rows when rules are row-level and quarantine-capable. |
| Stream aggregation | ctrl_ingestion_streams | Confirms available-now stream parent/child run metrics and micro-batch totals. |
| Schema changes | ctrl_ingestion_schema_changes | Shows additive changes, rejected drift and type widening decisions. |
The same evidence model should be used when adapting examples to private sources.
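As a rough illustration of inspecting run evidence, the sketch below reconciles row counts for run records that have already been fetched from ctrl_ingestion_runs into dictionaries. The column names (`run_id`, `status`, `rows_read`, `rows_written`, `rows_quarantined`) and the `SUCCEEDED` status value are assumptions, not the documented control-table schema, and the equality check only suits append-style loads where every source row is either written or quarantined.

```python
def accounting_issues(runs):
    """Return human-readable findings for runs whose evidence does not add up.

    `runs` is a list of dicts using hypothetical column names; adapt them
    to the actual control-table schema before relying on this check.
    """
    issues = []
    for run in runs:
        run_id = run["run_id"]
        if run["status"] != "SUCCEEDED":
            issues.append(f"{run_id}: status is {run['status']}")
            continue
        accounted = run["rows_written"] + run["rows_quarantined"]
        if accounted != run["rows_read"]:
            issues.append(
                f"{run_id}: read {run['rows_read']} rows but accounted for {accounted}"
            )
    return issues
```

An empty result means every inspected run succeeded and its written plus quarantined rows match what was read. Merge-style modes would need a different reconciliation rule.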
Template families
Templates provide opinionated starting points for common connector, transform and write-mode combinations.
contractforge templates list
contractforge templates wizard --layer silver --source jdbc --mode hash_diff_upsert
contractforge templates write jdbc_hash_diff --output contracts/silver/s_orders
Small files and partitioned object storage
Use explicit schemas, recursive lookup and optional regex filters for advanced file selection.
REST raw payloads
Keep API retrieval in the connector and model nested payloads through transform.shape.
JDBC incremental
Combine watermark, pushdown, partitioned reads, quality gates and deterministic merge semantics.
Auto Loader available-now
Use stream control tables to verify micro-batch counts and row totals.
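The REST raw payloads family above keeps the response as an unparsed string and shapes it afterwards. The following plain-Python sketch shows what that shaping step amounts to conceptually: parse the raw JSON, then emit one flat row per element of a nested array. It does not use or reproduce the transform.shape API. The payload layout (`events`, `geometry`, `coordinates`) and the function name `explode_events` are hypothetical.

```python
import json


def explode_events(raw_payload):
    """Parse a raw JSON string and return one flat row per nested geometry point."""
    document = json.loads(raw_payload)
    rows = []
    for event in document.get("events", []):
        # Events with an empty geometry array produce no rows,
        # similar to an inner explode rather than an outer one.
        for point in event.get("geometry", []):
            longitude, latitude = point["coordinates"]
            rows.append(
                {
                    "event_id": event["id"],
                    "title": event["title"],
                    "observed_at": point["date"],
                    "longitude": longitude,
                    "latitude": latitude,
                }
            )
    return rows
```

Keeping the raw string intact until this step means the original response can be re-shaped later if the nested structure changes.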
Example design rule
Examples must not hide framework gaps behind ad-hoc Spark code. If a real source requires behavior that other sources are likely to need as well, prefer improving the library or documenting a reusable pattern.
Design principle
Reusable source behavior belongs in connectors, reusable shaping belongs in transform.shape, and source-specific business logic belongs in project code or downstream transformations.