
Files

stable

Use path-based file connectors for paths that Spark can read directly: local test paths, DBFS-compatible paths, Unity Catalog volumes, external locations and mounted cloud paths.

Use this connector when authentication is already handled by the runtime. Use the cloud-specific connectors (s3, azure_blob, gcs) when the contract must declare cloud-specific credentials or path semantics.

When to use it

Runtime-managed access

Spark-readable paths

Use this connector when the Databricks runtime already knows how to read the path through a Volume, external location, mounted path, DBFS-compatible path or local test path.

Multi-format landing

Direct Spark file formats

Use the same contract shape for the direct file connectors: CSV, JSON, Parquet, ORC, Delta and text. For Avro, XML, JSONL and NDJSON, use the object_storage connector or a provider connector and set source.format, provided the runtime supports the format.

Governed storage

Volumes and external locations

Prefer governed paths when access is controlled by Unity Catalog and credentials should not appear in ingestion contracts.

Cloud-specific semantics

Use provider connectors when needed

Use s3, azure_blob or gcs when the contract must declare provider-specific authentication, path semantics or runtime guidance.

Supported formats

| Format | Connector value | Notes |
| --- | --- | --- |
| CSV | csv | Use explicit header, delimiter, quote and schema options for stable ingestion. |
| JSON | json | Multiline JSON should be explicit with Spark reader options. |
| JSON Lines | source.format=jsonl or ndjson with object_storage/provider connector | Recommended for large event files. |
| Parquet | parquet | Preferred for typed lakehouse files. |
| Avro | source.format=avro with object_storage/provider connector | Requires Avro support in the runtime. |
| ORC | orc | Supported when Spark runtime includes ORC. |
| Delta | delta | Reads Delta paths directly. |
| Text | text | Useful for raw landing or custom parsing. |
| XML | source.format=xml with object_storage/provider connector | Requires the Spark XML package or DBR support depending on runtime. |
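The two JSON rows in the table can be sketched as contracts. This is an illustration under stated assumptions, not a confirmed contract: it assumes that options is forwarded to the Spark reader (so Spark's multiLine option applies) and that jsonl uses the same connector and format keys as the Avro and XML examples on this page. Both paths are hypothetical.

```yaml
# Multiline JSON
# Assumption: options are forwarded to the Spark JSON reader.
source:
  type: connector
  connector: json
  path: /Volumes/main/landing/orders_multiline_json/  # hypothetical path
  options:
    multiLine: true  # Spark reader option names are case-insensitive

---

# JSON Lines
# Assumption: same shape as the Avro and XML examples.
source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/order_events/  # hypothetical path
  format: jsonl
```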

Folder read

source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders
  options:
    recursiveFileLookup: true

CSV with explicit schema

source:
  type: connector
  connector: csv
  path: /Volumes/main/landing/orders_csv/
  schema: "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
  options:
    header: true
    mode: FAILFAST
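The format table also recommends making the delimiter and quote character explicit for CSV. A minimal sketch of that variant follows, assuming options is passed through to the Spark CSV reader. The semicolon delimiter is only an example value, not something this page prescribes.

```yaml
source:
  type: connector
  connector: csv
  path: /Volumes/main/landing/orders_csv/
  schema: "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
  options:
    header: true
    delimiter: ";"   # example value; Spark also accepts "sep" for this option
    quote: "\""      # double quote, Spark's default, stated explicitly
    mode: FAILFAST   # fail the read on malformed rows
```

Stating these values in the contract keeps the read stable if a producer changes its export settings, because the mismatch surfaces as a FAILFAST error rather than silently shifted columns.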

JSON and shape

Keep JSON parsing and structural normalization explicit. Source schema describes how Spark reads the file; transform.shape describes how ContractForge prepares business columns.

source:
  type: connector
  connector: json
  path: /Volumes/main/landing/orders_json/
  schema: "payload STRING, ingest_ts TIMESTAMP"

transform:
  shape:
    parse_json:
      payload:
        schema: "STRUCT<order_id: STRING, items: ARRAY<STRUCT<sku: STRING, qty: INT>>>"
        target: parsed
    flatten:
      - path: parsed
        prefix: ""

Avro, Parquet and XML

# Parquet

source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders_parquet/
  options:
    recursiveFileLookup: true

---

# Avro

source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_avro/
  format: avro

---

# XML

source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_xml/
  format: xml
  options:
    rowTag: order

Reader options

| Option | Use case |
| --- | --- |
| recursiveFileLookup | Read files under nested folders without relying on partition discovery. |
| pathGlobFilter | Restrict file names using Spark glob patterns. |
| modifiedBefore / modifiedAfter | Spark file filtering when supported by the runtime. |
| mergeSchema | Parquet/Delta schema evolution reads. Use carefully on large folders. |
| multiline | Multiline JSON or CSV records. |
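Several of these options can be combined in one contract. The sketch below assumes options is forwarded unchanged to the Spark reader and that the runtime supports modification-time filtering for batch reads. The glob pattern and timestamp are illustrative values only.

```yaml
source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders_parquet/
  options:
    recursiveFileLookup: true             # descend into nested folders
    pathGlobFilter: "*.parquet"           # match on file name only, not the full path
    modifiedAfter: "2024-01-01T00:00:00"  # illustrative; Spark expects YYYY-MM-DDTHH:mm:ss
```

These options select files by name and modification time. Row-level conditions still belong in filter_expression.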

For advanced filename matching that cannot be represented by Spark glob patterns, use a staging process or a future connector-specific filter. Do not encode business filtering in file name selection when it belongs in filter_expression.