Files

stable

Use path-based file connectors for paths that Spark can read directly: local test paths, DBFS-compatible paths, Unity Catalog volumes, external locations and mounted cloud paths.

Use them when authentication is already handled by the runtime. Use the cloud-specific connectors (s3, azure_blob, gcs) when the contract must declare cloud-specific credentials or path semantics.

When to use it

Runtime-managed access

Spark-readable paths

Use this connector when the Databricks runtime already knows how to read the path through a Volume, external location, mounted path, DBFS-compatible path or local test path.

Multi-format landing

Direct Spark file formats

Use the same contract shape for direct file connectors such as CSV, JSON, Parquet, ORC, Delta and text. For Avro, XML, JSONL and NDJSON, use the object_storage connector or a provider connector with source.format, when the runtime supports those formats.

Governed storage

Volumes and external locations

Prefer governed paths when access is controlled by Unity Catalog and credentials should not appear in ingestion contracts.

Cloud-specific semantics

Use provider connectors when needed

Use s3, azure_blob or gcs when the contract must declare provider-specific authentication, path semantics or runtime guidance.

Supported formats

Format	Connector value	Notes
CSV	`csv`	Use explicit header, delimiter, quote and schema options for stable ingestion.
JSON	`json`	Declare multiline JSON explicitly through the `multiline` reader option.
JSON Lines	`source.format=jsonl` or `ndjson` with `object_storage`/provider connector	Recommended for large event files.
Parquet	`parquet`	Preferred for typed lakehouse files.
Avro	`source.format=avro` with `object_storage`/provider connector	Requires Avro support in the runtime.
ORC	`orc`	Supported when the Spark runtime includes ORC.
Delta	`delta`	Reads Delta paths directly.
Text	`text`	Useful for raw landing or custom parsing.
XML	`source.format=xml` with `object_storage`/provider connector	Requires either the spark-xml package or built-in Databricks Runtime XML support, depending on the runtime.
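
The split in the table between direct connector values and formats routed through `object_storage` can be sketched as a small helper. This is only an illustration: `build_source`, `DIRECT_CONNECTORS` and `OBJECT_STORAGE_FORMATS` are hypothetical names, not part of ContractForge, and the sketch always picks `object_storage` rather than a provider connector.

```python
# Formats the table lists as direct connector values.
DIRECT_CONNECTORS = {"csv", "json", "parquet", "orc", "delta", "text"}
# Formats the table routes through object_storage (or a provider connector).
OBJECT_STORAGE_FORMATS = {"jsonl", "ndjson", "avro", "xml"}


def build_source(fmt, path, options=None):
    """Hypothetical helper: build a source mapping for a file format."""
    fmt = fmt.lower()
    if fmt in DIRECT_CONNECTORS:
        source = {"type": "connector", "connector": fmt, "path": path}
    elif fmt in OBJECT_STORAGE_FORMATS:
        source = {
            "type": "connector",
            "connector": "object_storage",
            "path": path,
            "format": fmt,
        }
    else:
        raise ValueError(f"unsupported file format: {fmt!r}")
    if options:
        # Copy so the caller's mapping is not shared with the contract.
        source["options"] = dict(options)
    return source
```

Calling `build_source("avro", "/Volumes/main/landing/orders_avro/")` under these assumptions yields the same shape as the Avro example further down this page.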

Folder read

YAML, followed by the Python equivalent:

source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders
  options:
    recursiveFileLookup: true

source = {
    "type": "connector",
    "connector": "parquet",
    "path": "/Volumes/main/landing/orders",
    "options": {"recursiveFileLookup": True},
}

CSV with explicit schema

YAML, followed by the Python equivalent:

source:
  type: connector
  connector: csv
  path: /Volumes/main/landing/orders_csv/
  schema: "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
  options:
    header: true
    mode: FAILFAST

source = {
    "type": "connector",
    "connector": "csv",
    "path": "/Volumes/main/landing/orders_csv/",
    "schema": "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP",
    "options": {"header": True, "mode": "FAILFAST"},
}
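
With `mode: FAILFAST`, Spark raises an error on a malformed record instead of silently producing nulls, so it can be worth sanity-checking the schema string before a run. The following is a minimal sketch of splitting a DDL schema string into column names and types; `split_ddl_columns` is a hypothetical helper, not a ContractForge or Spark function, and it does not handle backtick-quoted column names that contain spaces.

```python
def split_ddl_columns(ddl):
    """Split a DDL schema string into (name, type) pairs.

    Commas inside <...> or (...) belong to a nested or parameterized
    type, so only top-level commas separate columns.
    """
    columns, current, depth = [], [], 0
    for ch in ddl:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch == "," and depth == 0:
            columns.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        columns.append(tail)

    pairs = []
    for col in columns:
        # The first space separates the column name from its type.
        name, _, col_type = col.partition(" ")
        pairs.append((name, col_type.strip()))
    return pairs


schema = "order_id STRING, customer_id STRING, amount DOUBLE, updated_at TIMESTAMP"
column_names = [name for name, _ in split_ddl_columns(schema)]
```

The resulting `column_names` list can then be compared against the header row of a sample file before the contract is deployed.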

JSON and shape

Keep JSON parsing and structural normalization explicit. The source schema describes how Spark reads the file; transform.shape describes how ContractForge prepares business columns.

YAML, followed by the Python equivalent:

source:
  type: connector
  connector: json
  path: /Volumes/main/landing/orders_json/
  schema: "payload STRING, ingest_ts TIMESTAMP"

transform:
  shape:
    parse_json:
      payload:
        schema: "STRUCT<order_id: STRING, items: ARRAY<STRUCT<sku: STRING, qty: INT>>>"
        target: parsed
    flatten:
      - path: parsed
        prefix: ""

source = {
    "type": "connector",
    "connector": "json",
    "path": "/Volumes/main/landing/orders_json/",
    "schema": "payload STRING, ingest_ts TIMESTAMP",
}

transform = {
    "shape": {
        "parse_json": {
            "payload": {
                "schema": "STRUCT<order_id: STRING, items: ARRAY<STRUCT<sku: STRING, qty: INT>>>",
                "target": "parsed",
            }
        },
        "flatten": [{"path": "parsed", "prefix": ""}],
    }
}
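
To make the intent of `parse_json` and `flatten` concrete, here is a plain-Python illustration of one plausible reading of the contract above, applied to a single row. It is not ContractForge code, and it assumes that flattening promotes the fields of `parsed` to top-level columns, applies the prefix to each name, and drops the intermediate `parsed` column; the actual behavior may differ. The sample row is made up for illustration.

```python
import json


def parse_and_flatten(row, column="payload", target="parsed", prefix=""):
    """Illustrative only: parse a JSON string column, then flatten it."""
    out = dict(row)
    # parse_json: decode the string column into a nested structure.
    out[target] = json.loads(out[column])
    # flatten (assumed semantics): promote each field of the parsed
    # struct to a top-level column and drop the intermediate column.
    parsed = out.pop(target)
    for key, value in parsed.items():
        out[f"{prefix}{key}"] = value
    return out


sample_row = {
    "payload": '{"order_id": "A1", "items": [{"sku": "X", "qty": 2}]}',
    "ingest_ts": "2024-01-01T00:00:00",
}
flat_row = parse_and_flatten(sample_row)
```

Under these assumptions `flat_row` carries `order_id` and `items` next to the original `payload` and `ingest_ts` columns, while `items` stays an array of structs because only the `parsed` path is flattened.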

Avro, Parquet and XML

YAML, followed by the Python equivalents:

# Parquet

source:
  type: connector
  connector: parquet
  path: /Volumes/main/landing/orders_parquet/
  options:
    recursiveFileLookup: true

---

# Avro

source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_avro/
  format: avro

---

# XML

source:
  type: connector
  connector: object_storage
  path: /Volumes/main/landing/orders_xml/
  format: xml
  options:
    rowTag: order

parquet_source = {
    "type": "connector",
    "connector": "parquet",
    "path": "/Volumes/main/landing/orders_parquet/",
    "options": {"recursiveFileLookup": True},
}

avro_source = {
    "type": "connector",
    "connector": "object_storage",
    "path": "/Volumes/main/landing/orders_avro/",
    "format": "avro",
}

xml_source = {
    "type": "connector",
    "connector": "object_storage",
    "path": "/Volumes/main/landing/orders_xml/",
    "format": "xml",
    "options": {"rowTag": "order"},
}

Reader options

Option	Use case
`recursiveFileLookup`	Read files under nested folders without relying on partition discovery.
`pathGlobFilter`	Restrict file names using Spark glob patterns.
`modifiedBefore` / `modifiedAfter`	Select files by modification timestamp, when supported by the runtime.
`mergeSchema`	Parquet/Delta schema evolution reads. Use carefully on large folders.
`multiline`	Multiline JSON or CSV records.

For advanced filename matching that cannot be represented by Spark glob patterns, use a staging process or a future connector-specific filter. Do not encode business filtering in file name selection when it belongs in filter_expression.
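
As a sketch of how the options in the table might combine in one source, the mapping below restricts a folder read by file name and modification time. The glob pattern and timestamp are illustrative values, not recommendations, and `filtered_source` is a hypothetical variable name.

```python
filtered_source = {
    "type": "connector",
    "connector": "parquet",
    "path": "/Volumes/main/landing/orders",
    "options": {
        # Walk nested folders without relying on partition discovery.
        "recursiveFileLookup": True,
        # Keep only files whose names match this Spark glob pattern.
        "pathGlobFilter": "*.parquet",
        # Illustrative cutoff: skip files last modified before this time.
        "modifiedAfter": "2024-01-01T00:00:00",
    },
}
```

These options select files only; row-level business rules still belong in filter_expression.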
