Practical ingestion patterns for common data sources.

Examples should demonstrate ContractForge capabilities without hiding recurring behavior behind notebook workarounds. If a pattern is broadly useful, prefer a declarative contract or a reusable template.

Public CSV over HTTP

Uses `http_file`, explicit CSV options, source completeness and overwrite mode.

NASA EONET REST API

Uses `rest_api` raw response mode and `transform.shape.parse_json` to model nested arrays.
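To make "modeling nested arrays" concrete, here is a minimal plain-Python sketch of the underlying idea: keep the raw response as a string, parse it, and emit one flat row per nested array element. It does not call ContractForge or `transform.shape.parse_json`. The payload shape (an `events` array whose items carry a `geometry` array) and the function name `explode_events` are assumptions made for illustration only.

```python
import json

# Hypothetical raw payload with a nested array per event (illustrative shape only).
RAW_RESPONSE = """
{
  "events": [
    {"id": "E1", "title": "Wildfire", "geometry": [
      {"date": "2024-01-01", "coordinates": [10.0, 20.0]},
      {"date": "2024-01-02", "coordinates": [10.5, 20.5]}
    ]},
    {"id": "E2", "title": "Storm", "geometry": []}
  ]
}
"""


def explode_events(raw: str) -> list[dict]:
    """Parse the raw string and emit one flat row per (event, geometry) pair."""
    payload = json.loads(raw)
    rows = []
    for event in payload.get("events", []):
        for point in event.get("geometry", []):
            longitude, latitude = point["coordinates"]
            rows.append({
                "event_id": event["id"],
                "title": event["title"],
                "observed_on": point["date"],
                "longitude": longitude,
                "latitude": latitude,
            })
    return rows


rows = explode_events(RAW_RESPONSE)
```

In this sketch an event with an empty array produces no rows, which mirrors an inner explode. Whether empty arrays should instead yield a row with null geometry is a modeling decision to make explicitly.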

Azure Blob / ADLS files

Uses object storage paths, schemas, partitioned folders, recursive lookup and multiple file formats.

RDS/Aurora JDBC

Uses JDBC partitioning, pushdown, RDS IAM token generation and hash diff writes.
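As a rough illustration of what a hash diff write decides, the sketch below hashes the tracked columns of each row and compares source against target by key. It is plain Python, not the ContractForge implementation; the function names `row_hash` and `hash_diff`, the separator bytes and the sample columns are all assumptions.

```python
import hashlib


def row_hash(row: dict, columns: list[str]) -> str:
    """Hash the tracked columns in a fixed order so equal rows hash equally."""
    # A sentinel keeps None distinct from the empty string.
    parts = ["\x00" if row.get(c) is None else str(row[c]) for c in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def hash_diff(source: list[dict], target: list[dict], key: str, columns: list[str]):
    """Classify source rows as inserts, updates or unchanged relative to target."""
    target_hashes = {r[key]: row_hash(r, columns) for r in target}
    inserts, updates, unchanged = [], [], []
    for row in source:
        existing = target_hashes.get(row[key])
        if existing is None:
            inserts.append(row)          # key not present in target
        elif existing != row_hash(row, columns):
            updates.append(row)          # key present, tracked columns changed
        else:
            unchanged.append(row)        # nothing to write
    return inserts, updates, unchanged


target = [
    {"id": 1, "status": "open", "amount": 10},
    {"id": 2, "status": "open", "amount": 20},
]
source = [
    {"id": 1, "status": "open", "amount": 10},    # same tracked values
    {"id": 2, "status": "closed", "amount": 20},  # status changed
    {"id": 3, "status": "open", "amount": 30},    # new key
]
inserts, updates, unchanged = hash_diff(source, target, key="id", columns=["status", "amount"])
```

Only the rows in `inserts` and `updates` would need to be written, which is the point of comparing hashes instead of rewriting every row.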

Native connector handoff

Uses Databricks-native connectors, Lakehouse Federation, vendor replication or marketplace tools to land specialized sources first, then hands the landed table or files to ContractForge.

Generate a starter project

contractforge init --bundle --layer bronze --target-schema raw --table b_orders --mode append
contractforge validate-project contracts/

Real-world examples in the repository

The repository includes deterministic examples derived from connector validation work. They are intended to look like production handoff assets rather than minimal snippets.

Example	What it demonstrates
`examples/real-world/rest-nested-json-shape`	REST raw payload ingestion, `transform.shape.parse_json`, array handling, SCD hash-diff and quality warnings.
`examples/real-world/usgs-earthquake-geojson-shape`	Public USGS GeoJSON ingestion with `schema_ref`, array-of-struct explosion, coordinate extraction, derived metrics, SCD hash-diff, gold marts and governance split.
`examples/real-world/large-known-dataset-tpch`	Known Databricks TPCH sample source with 29.9M observed rows, a 250k-row validated run, SQL/table connectors, hash-diff upsert, gold aggregation and control-table evidence.
`examples/real-world/autoloader-available-now`	Auto Loader available-now ingestion, checkpoints, schema location and stream control-table evidence.
`examples/real-world/jdbc-postgres-scd1`	JDBC reads, partitioning, watermark, current-state upsert and quarantine-capable quality rules.
`examples/real-world/object-storage-multiformat`	Provider-neutral object storage, CSV/JSON/Parquet/Avro landing folders, explicit schemas and governance separation.
`examples/real-world/supabase-jdbc-medallion`	Real Databricks/AWS medallion parity project with shared connection YAML, logical table refs, JDBC partitioning, quality quarantine, DAB execution, AWS S3 artifacts and Glue job deployment. Snowflake direct JDBC extraction remains review-required unless the source is pre-staged into a Snowflake-supported source.

Validate all real-world examples:

PYTHONPATH=src python examples/real-world/scripts/validate_real_world.py

Scenario catalog

These scenarios are intended to show production-shaped ingestion patterns, not isolated syntax snippets. Each scenario should make source access, schema handling, transform behavior, quality evidence and runtime assumptions visible.

Scenario	Typical runtime	What it demonstrates
Public CSV over HTTP	Databricks serverless	`http_file`, explicit CSV options, overwrite, source metadata and a serverless-safe download path.
Multi-format object storage	Databricks serverless or classic	Provider-neutral `object_storage`, governed paths, CSV/JSON/Parquet/Avro folders, recursive lookup and source-format metadata.
Nested REST JSON	Databricks serverless or classic	`rest_api`, raw payload preservation, `transform.shape.parse_json`, array explosion and schema separation between source and transform.
Small-file folder stress	Serverless or classic	Many files, explicit schema, recursive lookup, glob/regex selection, row counts and control-table observability.
Azure Blob SAS files	Azure Databricks classic cluster	Direct object storage credentials, recursive folders, schemas, multiple file formats and cluster-level storage configuration.
Azure External Location	Databricks serverless	Governed storage access without direct SAS in the contract, path reads through Unity Catalog and source metadata.
AWS S3 External Location	AWS Databricks serverless	Unity Catalog storage governance, S3 paths through external locations and serverless-compatible access.
AWS S3 direct credentials	Classic cluster	S3A credential setup, access key/session token behavior and cross-cloud object storage constraints.
Auto Loader on Blob/ADLS	Databricks	`cloudFiles`, checkpoints, `available_now`, child run metrics and stream aggregation.
Supabase/Postgres JDBC	Databricks and AWS Glue/Iceberg; Snowflake after pre-stage or native source binding	JDBC driver, pushdown, partitioned reads, watermarking, current-state/hash diff patterns and platform parity through logical refs.
RDS/Aurora IAM	Classic cluster	JDBC `auth.type=rds_iam`, botocore/default credential chain and network prerequisite diagnosis.
Large known dataset TPCH	Azure Databricks classic cluster	Known public sample source, 250k-row processing, SQL/table connectors, hash-diff upsert, gold aggregation and runtime evidence suitable for dashboard screenshots.
Native platform handoff	Serverless or classic	Specialized sources are landed or federated by Databricks/native tooling and then processed by ContractForge as `table`, `sql`, files or object storage.
SFTP partner drops	Classic or serverless native connection	Driver-download versus Databricks-native SFTP connection strategy, staging paths, file limits and host-key policy.

Evidence expected from examples

An example is considered useful when it produces evidence that a user can inspect after execution:

Evidence	Where to inspect it	Why it matters
Run status and row counts	`ctrl_ingestion_runs`	Confirms whether data movement and write accounting match expectations.
Source metadata	`source` payload in run records	Shows which connector, path/query and redacted options were used.
Quality status	`ctrl_ingestion_quality` and run summary	Proves quality gates were executed and not hidden in notebook code.
Quarantine details	`ctrl_ingestion_quarantine`	Shows isolated bad rows when rules are row-level and quarantine-capable.
Stream aggregation	`ctrl_ingestion_streams`	Confirms available-now stream parent/child run metrics and micro-batch totals.
Schema changes	`ctrl_ingestion_schema_changes`	Shows additive changes, rejected drift and type widening decisions.

The same evidence model should be used when adapting examples to private sources.
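The quarantine evidence above depends on rules that can be evaluated per row. A minimal sketch of that split, in plain Python and independent of ContractForge, might look like the following. The rule names, the `failed_rules` field and the function `split_rows` are hypothetical and do not describe the actual layout of `ctrl_ingestion_quarantine`.

```python
from typing import Callable

Rule = Callable[[dict], bool]

# Hypothetical row-level rules: each returns True when the row passes.
RULES: dict[str, Rule] = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
}


def split_rows(rows: list[dict], rules: dict[str, Rule]):
    """Return (valid, quarantined); quarantined rows record which rules failed."""
    valid, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantined.append({**row, "failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 3, "amount": -1.0},
]
valid, quarantined = split_rows(rows, RULES)
```

Recording the failed rule names next to each isolated row is what lets a reviewer inspect why a row was held back without rerunning the job.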

Template families

Templates provide opinionated starting points for common connector, transform and write-mode combinations.

contractforge templates list
contractforge templates wizard --layer silver --source jdbc --mode hash_diff_upsert
contractforge templates write jdbc_hash_diff --output contracts/silver/s_orders

Files

Small files and partitioned object storage

Use explicit schemas, recursive lookup and optional regex filters for advanced file selection.
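The following plain-Python sketch shows one way recursive lookup combined with a regex filter can behave on a partitioned folder layout. It uses only the standard library against a temporary local directory; the function name `select_files` and the `dt=` folder naming are assumptions, and this is not how ContractForge resolves object storage paths.

```python
import re
import tempfile
from pathlib import Path


def select_files(root: Path, pattern: str) -> list[str]:
    """Recursively list files under root whose relative POSIX path matches pattern."""
    regex = re.compile(pattern)
    matches = []
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            if regex.search(relative):
                matches.append(relative)
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a small partitioned layout, including a marker file that should be skipped.
    for name in [
        "dt=2024-01-01/part-0.csv",
        "dt=2024-01-02/part-0.csv",
        "dt=2024-01-02/_SUCCESS",
    ]:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("id,amount\n1,10\n")

    selected = select_files(root, r"^dt=2024-01-02/.*\.csv$")
```

Matching on the relative path rather than the file name alone lets a single pattern select both the partition folder and the file extension.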

APIs

REST raw payloads

Keep API retrieval in the connector and model nested payloads through `transform.shape`.

Databases

JDBC incremental

Combine watermark, pushdown, partitioned reads, quality gates and deterministic merge semantics.
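To show how a watermark filter and partitioned reads can combine into pushed-down queries, here is a small sketch that builds one SQL statement per partition. It is similar in spirit to range partitioning on a numeric column but is not the ContractForge or Spark implementation. The function names, table name and column names are assumptions, and the string interpolation is for illustration with trusted configuration values only.

```python
def partition_predicates(column: str, lower: int, upper: int, num_partitions: int) -> list[str]:
    """Split the range into contiguous predicates; the outer ranges are open-ended."""
    if num_partitions < 1 or upper <= lower:
        raise ValueError("need num_partitions >= 1 and upper > lower")
    if num_partitions == 1:
        return ["1 = 1"]  # single partition reads everything
    stride = max((upper - lower) // num_partitions, 1)
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    # First partition also picks up NULLs so no row is silently dropped.
    predicates = [f"{column} < {bounds[0]} OR {column} IS NULL"]
    for low, high in zip(bounds, bounds[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    predicates.append(f"{column} >= {bounds[-1]}")
    return predicates


def incremental_queries(table: str, watermark_column: str, last_watermark: str,
                        partition_column: str, lower: int, upper: int,
                        num_partitions: int) -> list[str]:
    """Build one query per partition, each filtered by the watermark."""
    return [
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{last_watermark}' AND ({predicate})"
        for predicate in partition_predicates(partition_column, lower, upper, num_partitions)
    ]


queries = incremental_queries(
    table="public.orders",            # hypothetical table
    watermark_column="updated_at",    # hypothetical column
    last_watermark="2024-01-01T00:00:00",
    partition_column="id",
    lower=0,
    upper=100,
    num_partitions=4,
)
```

Because both the watermark and the partition range are expressed in the `WHERE` clause, the database does the filtering and each reader pulls a disjoint slice of the changed rows.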

Streaming

Auto Loader available-now

Use stream control tables to verify micro-batch counts and row totals.
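A verification of this kind can be sketched as a reconciliation between a parent run record and its micro-batch records. The example below is plain Python over in-memory dictionaries; the field names `batch_count` and `rows_written` and the function `check_stream_totals` are hypothetical and are not the actual schema of the stream control tables.

```python
def check_stream_totals(parent: dict, batches: list[dict]) -> list[str]:
    """Return a list of human-readable mismatches; empty means the totals reconcile."""
    problems = []
    if parent["batch_count"] != len(batches):
        problems.append(
            f"batch count mismatch: parent={parent['batch_count']} observed={len(batches)}"
        )
    observed_rows = sum(b["rows_written"] for b in batches)
    if parent["rows_written"] != observed_rows:
        problems.append(
            f"row total mismatch: parent={parent['rows_written']} observed={observed_rows}"
        )
    return problems


# Hypothetical records as they might be read back from control tables.
parent_run = {"run_id": "r-1", "batch_count": 3, "rows_written": 600}
batches = [
    {"batch_id": 0, "rows_written": 250},
    {"batch_id": 1, "rows_written": 250},
    {"batch_id": 2, "rows_written": 100},
]

problems = check_stream_totals(parent_run, batches)
missing_batch_problems = check_stream_totals(parent_run, batches[:2])
```

Returning every mismatch rather than stopping at the first makes it easier to tell a missing micro-batch from a row-count discrepancy within a batch.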

Example design rule

Examples must not hide framework gaps behind ad-hoc Spark code. If a real source needs behavior that other sources are also likely to need, improve the library or document a reusable pattern instead.

Design principle

Reusable source behavior belongs in connectors, reusable shaping belongs in `transform.shape`, and source-specific business logic belongs in project code or downstream transformations.
