Direct file download
Use this for bounded CSV, JSON, JSONL, NDJSON or text files exposed over HTTP(S).
Connector
Use http_file for bounded files served through HTTP(S), especially when the Spark runtime cannot read arbitrary HTTPS paths directly. The connector downloads the file on the driver, materializes it for Spark and preserves source metadata in control tables.
The driver downloads the file and then hands a materialized local/staged file to Spark.
Use explicit timeout, retry and size limits. Move large recurring feeds to object storage.
Use transform.shape or downstream contracts for nested payload normalization.
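The timeout, retry and size limits mentioned above can be illustrated with a standard-library sketch of a bounded driver-side download. This is not the connector's implementation: `download_bounded`, `PayloadTooLarge` and the `max_bytes` cap are hypothetical names, the keyword names only mirror the `limits` keys used on this page, and treating `retry_attempts` as the total number of attempts is an assumption.

```python
import time
import urllib.request
from pathlib import Path


class PayloadTooLarge(Exception):
    """Raised when the response exceeds the configured byte limit."""


def download_bounded(
    url,
    dest,
    *,
    timeout_seconds=60,
    retry_attempts=3,          # assumed to mean total attempts
    retry_backoff_seconds=2,
    max_bytes=256 * 1024 * 1024,  # hypothetical cap, not a documented option
    opener=urllib.request.urlopen,
    sleep=time.sleep,
):
    """Stream url to dest with a timeout, bounded retries and a size cap."""
    dest = Path(dest)
    last_error = None
    for attempt in range(1, retry_attempts + 1):
        try:
            written = 0
            with opener(url, timeout=timeout_seconds) as response, open(dest, "wb") as out:
                while True:
                    chunk = response.read(64 * 1024)
                    if not chunk:
                        break
                    written += len(chunk)
                    if written > max_bytes:
                        raise PayloadTooLarge(f"{url} exceeds {max_bytes} bytes")
                    out.write(chunk)
            return written
        except PayloadTooLarge:
            # Retrying will not make the file smaller, so fail immediately.
            dest.unlink(missing_ok=True)
            raise
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            last_error = exc
            dest.unlink(missing_ok=True)
            if attempt < retry_attempts:
                sleep(retry_backoff_seconds * attempt)  # linear backoff
    raise RuntimeError(f"download failed after {retry_attempts} attempts") from last_error
```

The `opener` and `sleep` parameters exist only so the function can be exercised without network access. Streaming in chunks keeps driver memory flat, and the size cap turns an unbounded download into an explicit failure.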
| Requirement | Details |
|---|---|
| Driver egress | The Databricks driver or local Spark process must reach the HTTP(S) endpoint. |
| Payload limits | Configure timeout/retry and avoid unbounded downloads. |
| Format support | Spark must support the declared file format after the file is materialized. |
| Authentication | Use secret-backed headers or URLs for authenticated files. |
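As a rough illustration of the authentication requirement, the sketch below builds a header from a secret held outside the contract instead of hard-coding it. Everything here is an assumption: the helper names, the `ORDERS_API_TOKEN` variable and the Bearer scheme are hypothetical, and this page does not show which key `http_file` uses to accept headers. On Databricks the token would more typically come from a secret scope than from an environment variable.

```python
import os


def auth_headers_from_env(var_name="ORDERS_API_TOKEN"):
    """Build an Authorization header from a secret held in the environment.

    The variable name and the Bearer scheme are illustrative assumptions.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"secret {var_name} is not set")
    return {"Authorization": f"Bearer {token}"}


def redact(headers):
    """Return a copy of headers that is safe to write to logs."""
    return {
        name: "***" if name.lower() == "authorization" else value
        for name, value in headers.items()
    }
```

Keeping the lookup in one helper means the secret never appears in the contract file, and `redact` gives a safe form for run logs.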
source:
  type: connector
  connector: http_file
  path: https://example.org/data/orders.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "order_id STRING, order_ts TIMESTAMP, amount DOUBLE"
    source_complete: true
  limits:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
target:
  catalog: contractforge
  schema: bronze_examples
  table: b_http_orders
layer: bronze
mode: scd0_overwrite
from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/data/orders.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "order_id STRING, order_ts TIMESTAMP, amount DOUBLE",
            "source_complete": True,
        },
        "limits": {
            "timeout_seconds": 60,
            "retry_attempts": 3,
            "retry_backoff_seconds": 2,
        },
    },
    catalog="contractforge",
    target_schema="bronze_examples",
    target_table="b_http_orders",
    layer="bronze",
    mode="scd0_overwrite",
)
Supported formats are csv, json, jsonl, ndjson and text. Aliases http_csv, http_json and http_text are available when you want the connector name to imply the file format.
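One way to picture the aliases is as shorthand that expands to `http_file` plus a format. The sketch below is illustrative only: `resolve_http_source` and the alias table are hypothetical, the real resolution happens inside the connector, and mapping `http_json` to `json` (rather than `jsonl` or `ndjson`) is an assumption.

```python
# Hypothetical alias table; the actual mapping lives inside the connector.
ALIAS_FORMATS = {"http_csv": "csv", "http_json": "json", "http_text": "text"}
SUPPORTED_FORMATS = {"csv", "json", "jsonl", "ndjson", "text"}


def resolve_http_source(source):
    """Return a copy of source with any alias expanded to http_file + format."""
    resolved = dict(source)
    connector = resolved.get("connector")
    if connector in ALIAS_FORMATS:
        # An explicit format wins; otherwise the alias supplies it.
        resolved.setdefault("format", ALIAS_FORMATS[connector])
        resolved["connector"] = "http_file"
    if resolved.get("connector") != "http_file":
        raise ValueError(f"not an HTTP file connector: {connector!r}")
    if resolved.get("format") not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {resolved.get('format')!r}")
    return resolved
```

Under these assumptions, a source declared with `"connector": "http_csv"` and no `format` resolves to the same shape as the explicit `http_file` example with `"format": "csv"`.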
The http_file pattern suits complete public extracts that are easier to download as one bounded file than to read through a Spark filesystem reader.
source:
  type: connector
  connector: http_file
  path: https://example.org/public/covid.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "submission_date DATE, state STRING, tot_cases BIGINT"
    source_complete: true
  limits:
    timeout_seconds: 60
target:
  catalog: contractforge
  schema: bronze_health
  table: b_public_covid
layer: bronze
mode: scd0_overwrite
from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/public/covid.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "submission_date DATE, state STRING, tot_cases BIGINT",
            "source_complete": True,
        },
        "limits": {"timeout_seconds": 60},
    },
    catalog="contractforge",
    target_schema="bronze_health",
    target_table="b_public_covid",
    layer="bronze",
    mode="scd0_overwrite",
)
For high-volume recurring feeds, land files in object storage and use Auto Loader or a file connector.
SELECT
  run_id,
  status,
  source_connector,
  source_format,
  source_path,
  rows_read,
  rows_written,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector IN ('http_file', 'http_csv', 'http_json', 'http_text')
ORDER BY started_at_utc DESC;
| Symptom | Likely cause | Action |
|---|---|---|
| DNS or connection failure | The driver cannot reach the host. | Validate egress, DNS, proxy and endpoint allowlists from the runtime. |
| Timeout | The file is too large or the endpoint is slow. | Increase limits cautiously or move the feed to object storage. |
| Unexpected schema | File format changed or schema inference was used. | Declare explicit schema and reader options. |
| Memory pressure | Payload is too large for driver-side download. | Use object storage ingestion or Auto Loader for large feeds. |
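For the DNS and connection failures in the table above, a quick check run from the driver can separate name resolution problems from blocked egress. The sketch below is a hypothetical helper (`probe_endpoint` is not part of contractforge) that uses only the standard library. It opens a direct TCP connection, so it does not exercise an HTTP proxy if the runtime routes traffic through one.

```python
import socket
from urllib.parse import urlsplit


def probe_endpoint(url, timeout_seconds=5):
    """Check DNS resolution and TCP reachability for url from this process."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    result = {"host": host, "port": port, "dns_ok": False, "tcp_ok": False, "error": None}
    try:
        # Step 1: can the runtime resolve the host name at all?
        socket.getaddrinfo(host, port)
        result["dns_ok"] = True
        # Step 2: can it open a TCP connection to the resolved address?
        with socket.create_connection((host, port), timeout=timeout_seconds):
            result["tcp_ok"] = True
    except OSError as exc:
        # socket.gaierror and connection errors are OSError subclasses.
        result["error"] = str(exc)
    return result
```

A result with `dns_ok` false points at DNS configuration, while `dns_ok` true with `tcp_ok` false points at egress rules, firewalls or endpoint allowlists.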
Keep extraction concerns in source, structural normalization in transform, validation in quality_rules and target semantics in mode. This separation keeps examples portable and prevents connector-specific workarounds from becoming hidden business logic.