Files
Use path-based file connectors for paths that Spark can read directly: local test paths, DBFS-compatible paths, Unity Catalog volumes, external locations and mounted cloud paths.
Use them when authentication is already handled by the runtime. Use the
cloud-specific connectors (s3, azure_blob, gcs) when the contract must declare
cloud-specific credentials or path semantics.
When to use it
Spark-readable paths
Use this connector when the Databricks runtime already knows how to read the path through a Volume, external location, mounted path, DBFS-compatible path or local test path.
Direct Spark file formats
Use the same contract shape for direct file connectors such as CSV, JSON,
Parquet, ORC, Delta and text. Use object-storage style connectors with
source.format for Avro, XML, JSONL and NDJSON when the runtime supports them.
Volumes and external locations
Prefer governed paths when access is controlled by Unity Catalog and credentials should not appear in ingestion contracts.
Use provider connectors when needed
Use s3, azure_blob or gcs when the contract must declare provider-specific
authentication, path semantics or runtime guidance.
Supported formats
| Format | Connector value | Notes |
|---|---|---|
| CSV | csv | Use explicit header, delimiter, quote and schema options for stable ingestion. |
| JSON | json | Declare multiline JSON explicitly through Spark reader options. |
| JSON Lines | source.format=jsonl or ndjson with object_storage/provider connector | Recommended for large event files. |
| Parquet | parquet | Preferred for typed lakehouse files. |
| Avro | source.format=avro with object_storage/provider connector | Requires Avro support in the runtime. |
| ORC | orc | Supported when the Spark runtime includes ORC. |
| Delta | delta | Reads Delta paths directly. |
| Text | text | Useful for raw landing or custom parsing. |
| XML | source.format=xml with object_storage/provider connector | Requires the Spark XML package or DBR support depending on runtime. |
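The split in the table between formats with their own connector value and formats routed through object_storage can be sketched as a small helper. This is a minimal illustration, not part of ContractForge: build_file_source and the two set names are hypothetical, and the sketch always picks object_storage rather than a provider connector.

```python
# Formats the table lists with their own connector value.
DIRECT_CONNECTORS = {"csv", "json", "parquet", "orc", "delta", "text"}
# Formats the table routes through object_storage plus a format field.
OBJECT_STORAGE_FORMATS = {"avro", "xml", "jsonl", "ndjson"}


def build_file_source(fmt, path, options=None):
    """Return a source mapping for fmt (hypothetical helper, not a ContractForge API)."""
    fmt = fmt.lower()
    if fmt in DIRECT_CONNECTORS:
        source = {"type": "connector", "connector": fmt, "path": path}
    elif fmt in OBJECT_STORAGE_FORMATS:
        source = {
            "type": "connector",
            "connector": "object_storage",
            "path": path,
            "format": fmt,
        }
    else:
        raise ValueError(f"unsupported file format: {fmt!r}")
    if options:
        source["options"] = dict(options)  # copy so the caller's dict is not shared
    return source


print(build_file_source("parquet", "/Volumes/main/landing/orders", {"recursiveFileLookup": True}))
print(build_file_source("avro", "/Volumes/main/landing/orders_avro/"))
```

Keeping the routing in one place means a contract author only has to name the format; the connector choice follows from the table.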
Folder read
- YAML
- Python
source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders
  options:
    recursiveFileLookup: true
source = {
    "type": "connector",
    "connector": "parquet",
    "path": "/Volumes/main/landing/orders",
    "options": {"recursiveFileLookup": True},
}
CSV with explicit schema
- YAML
- Python
source:
  type: connector
  connector: csv
  path: /Volumes/main/landing/orders_csv/
  schema: "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
  options:
    header: true
    mode: FAILFAST
source = {
    "type": "connector",
    "connector": "csv",
    "path": "/Volumes/main/landing/orders_csv/",
    "schema": "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP",
    "options": {"header": True, "mode": "FAILFAST"},
}
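Because an explicit schema is typically applied to CSV columns by position, a quick pre-flight check that a sample file's header matches the declared column names can catch reordered columns early. The sketch below uses only the standard library; ddl_column_names and header_matches_schema are hypothetical helpers, and the DDL parsing assumes a flat schema with no nested or parameterized types such as STRUCT<...> or DECIMAL(10,2).

```python
import csv
import io


def ddl_column_names(ddl):
    """Column names from a flat DDL string such as 'a STRING, b DOUBLE'.

    Flat schemas only: types that contain commas would need a real parser.
    """
    return [part.split()[0] for part in ddl.split(",") if part.strip()]


def header_matches_schema(sample_text, ddl, delimiter=","):
    """Compare the first CSV row of sample_text with the declared column names."""
    header = next(csv.reader(io.StringIO(sample_text), delimiter=delimiter))
    return [name.strip() for name in header] == ddl_column_names(ddl)


schema = "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
sample = "order_id,customer_id,amount,updated_at\no-1,c-1,10.5,2024-01-01T00:00:00\n"
print(header_matches_schema(sample, schema))
```

A check like this would run against a small sample before ingestion; it does not replace FAILFAST, which still guards against malformed rows at read time.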
JSON and shape
Keep JSON parsing and structural normalization explicit. The source schema
describes how Spark reads the file; transform.shape describes how ContractForge
prepares business columns.
- YAML
- Python
source:
  type: connector
  connector: json
  path: /Volumes/main/landing/orders_json/
  schema: "payload STRING, ingest_ts TIMESTAMP"
transform:
  shape:
    parse_json:
      payload:
        schema: "STRUCT<order_id: STRING, items: ARRAY<STRUCT<sku: STRING, qty: INT>>>"
        target: parsed
    flatten:
      - path: parsed
        prefix: ""
source = {
    "type": "connector",
    "connector": "json",
    "path": "/Volumes/main/landing/orders_json/",
    "schema": "payload STRING, ingest_ts TIMESTAMP",
}
transform = {
    "shape": {
        "parse_json": {
            "payload": {
                "schema": "STRUCT<order_id: STRING, items: ARRAY<STRUCT<sku: STRING, qty: INT>>>",
                "target": "parsed",
            }
        },
        "flatten": [{"path": "parsed", "prefix": ""}],
    }
}
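To make the two stages concrete, the plain-Python sketch below mimics one plausible reading of parse_json followed by flatten on a single row. It is an illustration only: parse_and_flatten is hypothetical, and it assumes that the top-level fields of the parsed struct become columns named prefix + field, that arrays stay nested, and that the raw payload column is dropped. ContractForge's actual behavior on those points is not stated here.

```python
import json


def parse_and_flatten(row, column="payload", prefix=""):
    """Parse the JSON string in column and lift its top-level fields into the row."""
    parsed = json.loads(row[column])
    # Assumption: the raw JSON column is not kept once it has been parsed.
    flat = {key: value for key, value in row.items() if key != column}
    for field, value in parsed.items():
        flat[f"{prefix}{field}"] = value  # nested arrays stay nested
    return flat


row = {
    "payload": '{"order_id": "o-1", "items": [{"sku": "a", "qty": 2}]}',
    "ingest_ts": "2024-01-01T00:00:00",
}
print(parse_and_flatten(row))
```

The source schema only had to declare payload as a string; the business columns (order_id, items) appear after the shape step.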
Avro, Parquet and XML
- YAML
- Python
# Parquet
source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders_parquet/
  options:
    recursiveFileLookup: true
---
# Avro
source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_avro/
  format: avro
---
# XML
source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_xml/
  format: xml
  options:
    rowTag: order
parquet_source = {
    "type": "connector",
    "connector": "parquet",
    "path": "/Volumes/main/landing/orders_parquet/",
    "options": {"recursiveFileLookup": True},
}
avro_source = {
    "type": "connector",
    "connector": "object_storage",
    "path": "/Volumes/main/landing/orders_avro/",
    "format": "avro",
}
xml_source = {
    "type": "connector",
    "connector": "object_storage",
    "path": "/Volumes/main/landing/orders_xml/",
    "format": "xml",
    "options": {"rowTag": "order"},
}
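For orientation, a source mapping like the ones above corresponds roughly to a chain of Spark DataFrameReader calls. The sketch below is an assumption about that correspondence, not ContractForge's implementation: read_source is a hypothetical name, and the rule that an explicit format field takes precedence over the connector name is inferred from the examples.

```python
def read_source(spark, source):
    """Load a file source with a Spark session (sketch, not ContractForge's code)."""
    # Direct connectors name the format themselves; object_storage carries it in "format".
    fmt = source.get("format") or source["connector"]
    reader = spark.read.format(fmt)
    if "schema" in source:
        # DataFrameReader.schema accepts a DDL string such as "a STRING, b DOUBLE".
        reader = reader.schema(source["schema"])
    if source.get("options"):
        reader = reader.options(**source["options"])
    return reader.load(source["path"])
```

In a notebook where a spark session already exists, `read_source(spark, xml_source)` would read with format xml and rowTag order, provided the runtime supports XML.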
Reader options
| Option | Use case |
|---|---|
| recursiveFileLookup | Read files under nested folders without relying on partition discovery. |
| pathGlobFilter | Restrict file names using Spark glob patterns. |
| modifiedBefore / modifiedAfter | Filter files by modification time, when supported by the runtime. |
| mergeSchema | Parquet/Delta schema evolution reads. Use carefully on large folders. |
| multiline | Multiline JSON or CSV records. |
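The file-selection options can be assembled programmatically before they are placed under options in a source. The sketch below is one way to do that; file_filter_options is a hypothetical helper, and the timestamp format (YYYY-MM-DDTHH:mm:ss) follows what Spark documents for modifiedBefore and modifiedAfter.

```python
from datetime import datetime

SPARK_TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def file_filter_options(glob=None, modified_after=None, modified_before=None):
    """Build a reader options mapping from optional file filters."""
    options = {}
    if glob:
        options["pathGlobFilter"] = glob
    if modified_after:
        options["modifiedAfter"] = modified_after.strftime(SPARK_TIMESTAMP_FORMAT)
    if modified_before:
        options["modifiedBefore"] = modified_before.strftime(SPARK_TIMESTAMP_FORMAT)
    return options


print(file_filter_options("*.parquet", modified_after=datetime(2024, 1, 1)))
```

Only the filters that are actually set end up in the mapping, so the result can be merged into an existing options block without overwriting unrelated keys.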
For advanced filename matching that cannot be represented by Spark glob
patterns, use a staging process or a future connector-specific filter. Do not
encode business filtering in file name selection when it belongs in
filter_expression.
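As an illustration of the staging approach, the sketch below selects file names with a regular expression for a rule that a glob cannot express: a digits-only suffix of any length. select_for_staging and the file names are hypothetical, and copying the selected files into a staging folder is left out.

```python
import re


def select_for_staging(file_names, pattern):
    """Return the names that fully match pattern, in their original order."""
    regex = re.compile(pattern)
    return [name for name in file_names if regex.fullmatch(name)]


names = ["orders_7.csv", "orders_2024.csv", "orders_1_retry.csv"]
print(select_for_staging(names, r"orders_\d+\.csv"))
```

A glob such as orders_[0-9]*.csv would also accept orders_1_retry.csv, which is why this selection happens in a staging step rather than in pathGlobFilter.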