HTTP file

validated

Use http_file for bounded files served through HTTP(S), especially when the Spark runtime cannot read arbitrary HTTPS paths directly. The connector downloads the file on the driver, materializes it for Spark and preserves source metadata in control tables.

When to use it

Direct file download (public datasets)
Use this for bounded CSV, JSON, JSONL, NDJSON or text files exposed over HTTP(S).

Spark cannot read HTTPS (runtime fallback)
The driver downloads the file and then hands a materialized local/staged file to Spark.

Controlled payloads (small and medium files)
Use explicit timeout, retry and size limits. Move large recurring feeds to object storage.

No business parsing (connector boundary)
Use transform.shape or downstream contracts for nested payload normalization.

Runtime requirements

Requirement	Details
Driver egress	The Databricks driver or local Spark process must reach the HTTP(S) endpoint.
Payload limits	Set explicit timeout, retry and size limits (timeout_seconds, retry_attempts, max_bytes) so that downloads stay bounded.
Format support	Spark must support the declared file format after the file is materialized.
Authentication	Use secret-backed headers or URLs for authenticated files.

Basic example

YAML

source:
  type: connector
  connector: http_file
  path: https://example.org/data/orders.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "order_id STRING, order_ts TIMESTAMP, amount DOUBLE"
    source_complete: true
  limits:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
    max_bytes: 52428800
    max_records: 100000

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_http_orders

layer: bronze
mode: overwrite

Python

from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/data/orders.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "order_id STRING, order_ts TIMESTAMP, amount DOUBLE",
            "source_complete": True,
        },
        "limits": {
            "timeout_seconds": 60,
            "retry_attempts": 3,
            "retry_backoff_seconds": 2,
            "max_bytes": 52_428_800,
            "max_records": 100_000,
        },
    },
    catalog="contractforge",
    target_schema="bronze_examples",
    target_table="b_http_orders",
    layer="bronze",
    mode="overwrite",
)
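
The max_bytes value in this example, 52428800, is 50 MiB expressed in bytes. The sketch below derives it from a human-readable size instead of hard-coding the number. Note that mib_to_bytes is a hypothetical helper written for illustration and is not part of contractforge; the limits keys are the ones shown in the example above.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


# Hypothetical helper usage: build the same limits mapping as the example.
limits = {
    "timeout_seconds": 60,
    "retry_attempts": 3,
    "retry_backoff_seconds": 2,
    "max_bytes": mib_to_bytes(50),
    "max_records": 100_000,
}

print(limits["max_bytes"])  # 52428800
```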

Supported formats

Supported formats are csv, json, jsonl, ndjson and text. Aliases http_csv, http_json and http_text are available when you want the connector name to imply the file format.
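
One way to picture the aliases is as a fixed mapping from alias name to an implied format. The sketch below illustrates that reading only: resolve_connector, the two lookup tables and the rule that a conflicting format raises an error are all assumptions made for this example, not the connector's actual implementation.

```python
from typing import Optional, Tuple

# Hypothetical illustration; names and conflict handling are assumptions.
SUPPORTED_FORMATS = {"csv", "json", "jsonl", "ndjson", "text"}
ALIAS_FORMATS = {"http_csv": "csv", "http_json": "json", "http_text": "text"}


def resolve_connector(connector: str, fmt: Optional[str] = None) -> Tuple[str, str]:
    """Return (base_connector, format) for http_file or one of its aliases."""
    if connector in ALIAS_FORMATS:
        implied = ALIAS_FORMATS[connector]
        # Assumed behavior: an explicit format must agree with the alias.
        if fmt is not None and fmt != implied:
            raise ValueError(f"{connector} implies format {implied!r}, got {fmt!r}")
        return "http_file", implied
    if connector != "http_file":
        raise ValueError(f"unknown connector: {connector!r}")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return "http_file", fmt
```

Under these assumptions, resolve_connector("http_csv") and resolve_connector("http_file", "csv") describe the same source.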

Operational guidance

Use this connector for small and medium files, not unbounded API extraction.
Set explicit byte, timeout and retry limits because the driver performs the download.
For recurring high-volume feeds, land files in object storage and use Auto Loader or a file connector.
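
To make the reasoning behind these limits concrete, here is a minimal sketch of a driver-side bounded download using only the Python standard library. It is not the connector's implementation: download_bounded, PayloadTooLarge and the linear backoff rule are assumptions for illustration, and the parameter names simply mirror the limits keys used in the examples on this page.

```python
import time
import urllib.request
from pathlib import Path


class PayloadTooLarge(Exception):
    """Raised when the response body exceeds max_bytes."""


def download_bounded(url, dest, *, timeout_seconds=60, retry_attempts=3,
                     retry_backoff_seconds=2, max_bytes=52_428_800,
                     chunk_size=64 * 1024):
    """Stream url to dest, enforcing a timeout, retries and a byte cap.

    Returns the number of bytes written.
    """
    dest = Path(dest)
    last_error = None
    for attempt in range(1, retry_attempts + 1):
        try:
            written = 0
            with urllib.request.urlopen(url, timeout=timeout_seconds) as resp, \
                    dest.open("wb") as out:
                # Read in chunks so the payload is never held in memory whole.
                while True:
                    chunk = resp.read(chunk_size)
                    if not chunk:
                        break
                    written += len(chunk)
                    if written > max_bytes:
                        raise PayloadTooLarge(
                            f"payload exceeds max_bytes={max_bytes}")
                    out.write(chunk)
            return written
        except PayloadTooLarge:
            # An oversized payload is not retried; drop the partial file.
            dest.unlink(missing_ok=True)
            raise
        except OSError as exc:
            # URLError and socket timeouts are OSError subclasses.
            last_error = exc
            if attempt < retry_attempts:
                # Assumed policy: linear backoff between attempts.
                time.sleep(retry_backoff_seconds * attempt)
    raise RuntimeError(
        f"download failed after {retry_attempts} attempts") from last_error
```

The byte cap is checked while streaming, so an oversized file fails early instead of exhausting driver memory or disk, which is the main risk of performing the download on the driver.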

Public dataset pattern

This pattern is useful for complete public extracts that are easier to download as one bounded file than to access through a Spark filesystem reader.

YAML

source:
  type: connector
  connector: http_file
  path: https://example.org/public/covid.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "submission_date DATE, state STRING, tot_cases BIGINT"
    source_complete: true
  limits:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
    max_bytes: 52428800
    max_records: 100000

target:
  catalog: contractforge
  schema: bronze_health
  table: b_public_covid

layer: bronze
mode: overwrite

Python

from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/public/covid.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "submission_date DATE, state STRING, tot_cases BIGINT",
            "source_complete": True,
        },
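        "limits": {
            "timeout_seconds": 60,
            "retry_attempts": 3,
            "retry_backoff_seconds": 2,
            "max_bytes": 52_428_800,
            "max_records": 100_000,
        },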
    },
    catalog="contractforge",
    target_schema="bronze_health",
    target_table="b_public_covid",
    layer="bronze",
    mode="overwrite",
)

Use bounded files only

For high-volume recurring feeds, land files in object storage and use Auto Loader or a file connector.

Operational validation

SELECT
  run_id,
  status,
  source_connector,
  source_format,
  source_path,
  rows_read,
  rows_written,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector IN ('http_file', 'http_csv', 'http_json', 'http_text')
ORDER BY started_at_utc DESC;

Common issues

Symptom	Likely cause	Action
DNS or connection failure	The driver cannot reach the host.	Validate egress, DNS, proxy and endpoint allowlists from the runtime.
Timeout	File is too large or endpoint is slow.	Increase limits cautiously or move the feed to object storage.
Unexpected schema	File format changed or schema inference was used.	Declare explicit schema and reader options.
Memory pressure	Payload is too large for driver-side download.	Use object storage ingestion or Auto Loader for large feeds.
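
For the DNS or connection failure row, a quick preflight run from the same runtime can help separate name resolution problems from blocked connections. The sketch below uses only the standard library; probe_endpoint is a hypothetical helper, not part of the connector, and it opens a direct TCP connection, so it does not exercise an HTTP proxy if the runtime requires one.

```python
import socket
from urllib.parse import urlsplit


def probe_endpoint(url, timeout_seconds=5.0):
    """Check DNS resolution and TCP reachability for an HTTP(S) URL."""
    parts = urlsplit(url)
    host = parts.hostname
    # Fall back to the scheme's default port when the URL has none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    result = {"host": host, "port": port,
              "dns_ok": False, "tcp_ok": False, "error": None}
    try:
        # Step 1: can the runtime resolve the host name?
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns_ok"] = True
        # Step 2: can it open a TCP connection to the port?
        with socket.create_connection((host, port), timeout=timeout_seconds):
            result["tcp_ok"] = True
    except OSError as exc:
        # socket.gaierror, timeouts and refused connections all land here.
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result
```

A result with dns_ok false points at DNS configuration, while dns_ok true with tcp_ok false points at egress rules, firewalls or endpoint allowlists.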

How this connector fits the contract

Keep extraction concerns in source, structural normalization in transform, validation in quality_rules and target semantics in mode. This separation keeps examples portable and prevents connector-specific runtime fixes from becoming hidden business logic.
