Practical ingestion patterns for common data sources.

Examples should demonstrate ContractForge capabilities without hiding recurring behavior behind notebook workarounds. If a pattern is broadly useful, prefer a declarative contract or a reusable template.

Public CSV over HTTP

Uses http_file, explicit CSV options, source completeness and overwrite mode.

NASA EONET REST API

Uses rest_api raw response mode and transform.shape.parse_json to model nested arrays.

Azure Blob / ADLS files

Uses object storage paths, schemas, partitioned folders, recursive lookup and multiple file formats.

RDS/Aurora JDBC

Uses JDBC partitioning, pushdown, RDS IAM token generation and hash diff writes.
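As an illustration of what JDBC partitioning means in practice, the following is a minimal sketch that splits a numeric key range into non-overlapping range predicates, one per parallel read. The function name `partition_predicates` and the column `order_id` are hypothetical; this is not ContractForge's API, and it only approximates how Spark's partitioned JDBC reads derive their ranges.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Build one SQL predicate per partition for an integer column.

    The first range is open below (and keeps NULLs) and the last range is
    open above, so rows outside [lower, upper) are still read exactly once.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if num_partitions == 1:
        return ["1=1"]

    stride = max((upper - lower) // num_partitions, 1)
    # Interior boundaries between consecutive partitions.
    bounds = [lower + i * stride for i in range(1, num_partitions)]

    predicates = [f"{column} < {bounds[0]} OR {column} IS NULL"]
    for low, high in zip(bounds, bounds[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    predicates.append(f"{column} >= {bounds[-1]}")
    return predicates


predicates = partition_predicates("order_id", 0, 100, 4)
for predicate in predicates:
    print(predicate)
```

The bounds only control how the work is divided; they do not filter rows, which is why the outer ranges are left open.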

Native connector handoff

Uses Databricks-native connectors, Lakehouse Federation, vendor replication or marketplace tools to land specialized sources first, then hands the landed table or files to ContractForge.

Generate a starter project

contractforge init --bundle --layer bronze --target-schema raw --table b_orders --mode append
contractforge validate-project contracts/

Real-world examples in the repository

The repository includes deterministic examples derived from connector validation work. They are intended to look like production handoff assets rather than minimal snippets.

| Example | What it demonstrates |
| --- | --- |
| examples/real-world/rest-nested-json-shape | REST raw payload ingestion, transform.shape.parse_json, array handling, SCD hash-diff and quality warnings. |
| examples/real-world/usgs-earthquake-geojson-shape | Public USGS GeoJSON ingestion with schema_ref, array-of-struct explosion, coordinate extraction, derived metrics, SCD hash-diff, gold marts and governance split. |
| examples/real-world/large-known-dataset-tpch | Known Databricks TPCH sample source with 29.9M observed rows, a 250k-row validated run, SQL/table connectors, hash-diff upsert, gold aggregation and control-table evidence. |
| examples/real-world/autoloader-available-now | Auto Loader available-now ingestion, checkpoints, schema location and stream control-table evidence. |
| examples/real-world/jdbc-postgres-scd1 | JDBC reads, partitioning, watermark, current-state upsert and quarantine-capable quality rules. |
| examples/real-world/object-storage-multiformat | Provider-neutral object storage, CSV/JSON/Parquet/Avro landing folders, explicit schemas and governance separation. |
| examples/real-world/supabase-jdbc-medallion | Real Databricks/AWS medallion parity project with shared connection YAML, logical table refs, JDBC partitioning, quality quarantine, DAB execution, AWS S3 artifacts and Glue job deployment. Snowflake direct JDBC extraction remains review-required unless the source is pre-staged into a Snowflake-supported source. |

Validate all real-world examples:

PYTHONPATH=src python examples/real-world/scripts/validate_real_world.py

Scenario catalog

These scenarios are intended to show production-shaped ingestion patterns, not isolated syntax snippets. Each scenario should make source access, schema handling, transform behavior, quality evidence and runtime assumptions visible.

| Scenario | Typical runtime | What it demonstrates |
| --- | --- | --- |
| Public CSV over HTTP | Databricks serverless | http_file, explicit CSV options, overwrite, source metadata and a serverless-safe download path. |
| Multi-format object storage | Databricks serverless or classic | Provider-neutral object_storage, governed paths, CSV/JSON/Parquet/Avro folders, recursive lookup and source-format metadata. |
| Nested REST JSON | Databricks serverless or classic | rest_api, raw payload preservation, transform.shape.parse_json, array explosion and schema separation between source and transform. |
| Small-file folder stress | Serverless or classic | Many files, explicit schema, recursive lookup, glob/regex selection, row counts and control-table observability. |
| Azure Blob SAS files | Azure Databricks classic cluster | Direct object storage credentials, recursive folders, schemas, multiple file formats and cluster-level storage configuration. |
| Azure External Location | Databricks serverless | Governed storage access without direct SAS in the contract, path reads through Unity Catalog and source metadata. |
| AWS S3 External Location | AWS Databricks serverless | Unity Catalog storage governance, S3 paths through external locations and serverless-compatible access. |
| AWS S3 direct credentials | Classic cluster | S3A credential setup, access key/session token behavior and cross-cloud object storage constraints. |
| Auto Loader on Blob/ADLS | Databricks | cloudFiles, checkpoints, available_now, child run metrics and stream aggregation. |
| Supabase/Postgres JDBC | Databricks and AWS Glue/Iceberg; Snowflake after pre-stage or native source binding | JDBC driver, pushdown, partitioned reads, watermarking, current-state/hash diff patterns and platform parity through logical refs. |
| RDS/Aurora IAM | Classic cluster | JDBC auth.type=rds_iam, botocore/default credential chain and network prerequisite diagnosis. |
| Large known dataset TPCH | Azure Databricks classic cluster | Known public sample source, 250k-row processing, SQL/table connectors, hash-diff upsert, gold aggregation and runtime evidence suitable for dashboard screenshots. |
| Native platform handoff | Serverless or classic | Specialized sources are landed or federated by Databricks/native tooling and then processed by ContractForge as table, sql, files or object storage. |
| SFTP partner drops | Classic or serverless native connection | Driver-download versus Databricks-native SFTP connection strategy, staging paths, file limits and host-key policy. |

Evidence expected from examples

An example is considered useful when it produces evidence that a user can inspect after execution:

| Evidence | Where to inspect it | Why it matters |
| --- | --- | --- |
| Run status and row counts | ctrl_ingestion_runs | Confirms whether data movement and write accounting match expectations. |
| Source metadata | source payload in run records | Shows which connector, path/query and redacted options were used. |
| Quality status | ctrl_ingestion_quality and run summary | Proves quality gates were executed and not hidden in notebook code. |
| Quarantine details | ctrl_ingestion_quarantine | Shows isolated bad rows when rules are row-level and quarantine-capable. |
| Stream aggregation | ctrl_ingestion_streams | Confirms available-now stream parent/child run metrics and micro-batch totals. |
| Schema changes | ctrl_ingestion_schema_changes | Shows additive changes, rejected drift and type widening decisions. |

The same evidence model should be used when adapting examples to private sources.
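To make the quarantine evidence concrete, here is a minimal sketch of a row-level quality split, assuming each rule is a named predicate evaluated per row. The function `split_quarantine`, the `_failed_rules` field and the rule names are hypothetical illustrations, not the layout of ctrl_ingestion_quarantine.

```python
def split_quarantine(rows, rules):
    """Separate rows that pass every rule from rows that fail at least one.

    Quarantined rows keep their original values and record which rules
    failed, so they can be inspected after the run.
    """
    valid, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantined.append({**row, "_failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantined


# Hypothetical row-level rules for an orders feed.
rules = {
    "amount_non_negative": lambda row: row["amount"] >= 0,
    "customer_present": lambda row: bool(row.get("customer_id")),
}
rows = [
    {"order_id": 1, "amount": 10, "customer_id": "c1"},
    {"order_id": 2, "amount": -5, "customer_id": "c2"},
    {"order_id": 3, "amount": 7, "customer_id": None},
]

valid, quarantined = split_quarantine(rows, rules)
print(len(valid), "valid,", len(quarantined), "quarantined")
```

Only row-level rules can isolate individual rows this way; a dataset-level rule such as a minimum row count has no single row to quarantine.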

Template families

Templates provide opinionated starting points for common connector, transform and write-mode combinations.

contractforge templates list
contractforge templates wizard --layer silver --source jdbc --mode hash_diff_upsert
contractforge templates write jdbc_hash_diff --output contracts/silver/s_orders

Files

Small files and partitioned object storage

Use explicit schemas, recursive lookup and optional regex filters for advanced file selection.
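The selection behavior described here can be sketched in plain Python: walk a folder tree recursively, then keep only the files whose relative path matches a regular expression. The function `select_files` and the `dt=YYYY-MM-DD` folder layout are assumptions made for the illustration, not ContractForge options.

```python
import re
import tempfile
from pathlib import Path


def select_files(root, pattern):
    """Return sorted relative paths of files under root that match pattern."""
    regex = re.compile(pattern)
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and regex.search(path.relative_to(root).as_posix())
    )


# Build a small hypothetical landing folder to run the selection against.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "dt=2024-01-01/part-0.csv",
        "dt=2024-01-02/part-0.csv",
        "dt=2024-01-02/_SUCCESS",
        "archive/old.json",
    ):
        target = Path(tmp) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("id\n1\n")

    # Keep only CSV files inside date-partitioned folders.
    selected = select_files(tmp, r"^dt=\d{4}-\d{2}-\d{2}/.*\.csv$")

print(selected)
```

Matching on the relative path rather than the file name lets one pattern filter on both the partition folder and the extension, and it skips marker files such as `_SUCCESS`.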

APIs

REST raw payloads

Keep API retrieval in the connector and model nested payloads through transform.shape.
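The shaping step can be pictured as two operations on the preserved raw payload: parse the JSON string, then emit one row per element of a nested array. The sketch below assumes a made-up payload with an `events` list whose items carry a `geometry` array; the field names and the function `explode_events` are hypothetical and do not describe transform.shape itself.

```python
import json

# Hypothetical raw response body, kept as a string by the connector.
RAW_PAYLOAD = json.dumps({
    "events": [
        {
            "id": "e1",
            "title": "Storm",
            "geometry": [
                {"date": "2024-01-01", "coordinates": [1.0, 2.0]},
                {"date": "2024-01-02", "coordinates": [1.5, 2.5]},
            ],
        },
        {"id": "e2", "title": "Fire", "geometry": []},
    ]
})


def explode_events(raw_payload):
    """Parse the raw JSON and return one flat row per geometry point."""
    document = json.loads(raw_payload)
    rows = []
    for event in document.get("events", []):
        for position, point in enumerate(event.get("geometry", [])):
            longitude, latitude = point["coordinates"]
            rows.append({
                "event_id": event["id"],
                "title": event["title"],
                "position": position,
                "observed_on": point["date"],
                "longitude": longitude,
                "latitude": latitude,
            })
    return rows


rows = explode_events(RAW_PAYLOAD)
for row in rows:
    print(row)
```

In this sketch an event with an empty array produces no rows, which mirrors an inner explode; whether empty arrays should instead yield a row with null child fields is a modeling decision worth making explicit.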

Databases

JDBC incremental

Combine watermark, pushdown, partitioned reads, quality gates and deterministic merge semantics.
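A rough sketch of how a watermark and a hash diff fit together, assuming the target keeps one stored hash per business key: rows newer than the last watermark are hashed over their tracked columns and then classified as insert, update or unchanged. The helper names (`newer_than`, `row_hash`, `plan_hash_diff`) and the column names are hypothetical, and the hashing scheme is an assumption rather than the one ContractForge uses.

```python
import hashlib
import json
from datetime import datetime


def newer_than(rows, column, watermark):
    """Keep rows whose ISO-8601 timestamp is strictly after the watermark."""
    limit = datetime.fromisoformat(watermark)
    return [r for r in rows if datetime.fromisoformat(r[column]) > limit]


def row_hash(row, columns):
    """Hash the tracked columns in a fixed order so the result is repeatable."""
    payload = json.dumps([row.get(c) for c in columns], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_hash_diff(incoming, target_hashes, key, columns):
    """Classify incoming keys by comparing their hash with the stored hash."""
    plan = {"insert": [], "update": [], "unchanged": []}
    # Sort by key so the plan is the same regardless of arrival order.
    for row in sorted(incoming, key=lambda r: r[key]):
        stored = target_hashes.get(row[key])
        if stored is None:
            plan["insert"].append(row[key])
        elif stored != row_hash(row, columns):
            plan["update"].append(row[key])
        else:
            plan["unchanged"].append(row[key])
    return plan


columns = ["status", "amount"]
target_hashes = {
    1: row_hash({"status": "open", "amount": 10}, columns),
    2: row_hash({"status": "open", "amount": 20}, columns),
}
incoming = [
    {"order_id": 3, "status": "open", "amount": 5, "updated_at": "2024-03-02T00:00:00"},
    {"order_id": 2, "status": "paid", "amount": 20, "updated_at": "2024-03-01T12:00:00"},
    {"order_id": 1, "status": "open", "amount": 10, "updated_at": "2024-03-01T09:00:00"},
    {"order_id": 4, "status": "open", "amount": 1, "updated_at": "2024-02-01T00:00:00"},
]

fresh = newer_than(incoming, "updated_at", "2024-03-01T00:00:00")
plan = plan_hash_diff(fresh, target_hashes, "order_id", columns)
print(plan)
```

Hashing only the tracked columns means a change to `updated_at` alone does not trigger an update, which keeps reruns over the same source state idempotent.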

Streaming

Auto Loader available-now

Use stream control tables to verify micro-batch counts and row totals.
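The verification described above amounts to reconciling a parent run record against the sum of its micro-batches. The sketch below assumes simplified records with `micro_batches` and `rows_written` fields; these names are hypothetical and are not the actual columns of ctrl_ingestion_streams.

```python
def check_stream(parent, batches):
    """Return a list of mismatches between a parent run and its batches."""
    observed = {
        "micro_batches": len(batches),
        "rows_written": sum(batch["rows_written"] for batch in batches),
    }
    return [
        f"{field}: parent={parent[field]}, batches={value}"
        for field, value in observed.items()
        if parent[field] != value
    ]


# Hypothetical child records for one available-now run.
batches = [
    {"batch_id": 0, "rows_written": 120},
    {"batch_id": 1, "rows_written": 80},
]

consistent = check_stream({"micro_batches": 2, "rows_written": 200}, batches)
mismatched = check_stream({"micro_batches": 2, "rows_written": 210}, batches)
print(consistent)
print(mismatched)
```

An empty list means the parent totals agree with the micro-batch records; any entry points at the field to investigate.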

Example design rule

Examples must not hide framework gaps behind ad-hoc Spark code. If a real source needs behavior that other sources are likely to need as well, prefer improving the library or documenting a reusable pattern.

Design principle

Reusable source behavior belongs in connectors, reusable shaping belongs in transform.shape, and source-specific business logic belongs in project code or downstream transformations.