dataorc-utils — Lake¶
Filesystem utilities for reading and writing to the Data Lake in Databricks pipelines.
Overview¶
The lake module provides a unified interface for file operations on Azure Data Lake Storage,
abstracting away the differences between local development and Databricks runtime environments.
Key design principle: The module is path-agnostic. It performs pure I/O operations
without assuming any specific mounting conventions. Path normalization (e.g., dls:// → /mnt/...)
is the responsibility of your pipeline code.
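For example, a pipeline might translate its own dls:// URIs into FUSE paths before handing them to the lake module. The sketch below is illustrative only: the helper name and the /dbfs/mnt/<storage-name> mount convention are assumptions, not part of dataorc-utils.
# Hypothetical helper, not shipped with dataorc-utils.
# Assumes containers are mounted under /dbfs/mnt/<storage-name>/.
def normalize_lake_path(uri: str, storage_name: str = "datalakestore") -> str:
    if uri.startswith("dls://"):
        return "/dbfs/mnt/" + storage_name + "/" + uri[len("dls://"):]
    return uri
base_path = normalize_lake_path("dls://bronze/sales/orders")
# -> "/dbfs/mnt/datalakestore/bronze/sales/orders"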
Quick start¶
from dataorc_utils.lake import LakeFileSystem
# Initialize with a base path
fs = LakeFileSystem(base_path="/dbfs/mnt/datalakestore/bronze/sales/orders")
# Write and read text files
fs.write_text("metadata.txt", "Pipeline run: 2026-02-02")
content = fs.read_text("metadata.txt")
# Write and read JSON files
fs.write_json("config.json", {"version": 1, "status": "complete"})
config = fs.read_json("config.json")
# Check existence and delete
if fs.exists("old_file.txt"):
    fs.delete("old_file.txt")
API Reference¶
LakeFileSystem¶
The main class for all file operations.
from dataorc_utils.lake import LakeFileSystem
fs = LakeFileSystem(base_path="/dbfs/mnt/datalakestore/bronze")
Constructor¶
| Parameter | Type | Description |
|---|---|---|
| `base_path` | `str \| None` | Optional base path prepended to all operations. Should be an absolute path valid for the runtime environment. |
Methods¶
Text Operations¶
| Method | Returns | Description |
|---|---|---|
| `read_text(path)` | `str \| None` | Read a text file. Returns None if the file doesn't exist. |
| `write_text(path, content)` | `None` | Write a text file. Creates parent directories if needed. |
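For instance, writing to a nested relative path creates the intermediate directories, and reading a missing file returns None instead of raising (the file names below are illustrative):
fs = LakeFileSystem(base_path="/dbfs/mnt/datalakestore/bronze/sales/orders")
# Intermediate directories such as _reports/2026/ are created automatically
fs.write_text("_reports/2026/summary.txt", "42 files loaded")
missing = fs.read_text("does_not_exist.txt")  # None, no exception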
JSON Operations¶
| Method | Returns | Description |
|---|---|---|
| `read_json(path)` | `dict \| None` | Read and parse a JSON file. Returns None if the file doesn't exist or parsing fails. |
| `write_json(path, data, indent=2)` | `None` | Write a dictionary as JSON. Creates parent directories if needed. |
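For example, the indent parameter controls pretty-printing, and a missing or unparseable file simply comes back as None (the empty-dict fallback below is an illustrative pattern, not library behavior):
fs.write_json("config.json", {"version": 1, "status": "complete"}, indent=4)
config = fs.read_json("config.json") or {}  # fall back to an empty dict if unreadable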
File and Directory Operations¶
| Method | Returns | Description |
|---|---|---|
| `exists(path)` | `bool` | Check if a file or directory exists. |
| `delete(path)` | `bool` | Delete a file. Returns True if deleted, False if it didn't exist. |
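Because delete reports whether anything was removed, it can be called without checking exists first; the file name below is illustrative:
if fs.delete("_tmp/staging.txt"):
    print("removed stale staging file")
else:
    print("nothing to clean up")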
Usage in Pipelines¶
With CorePipelineConfig¶
The lake module integrates naturally with CorePipelineConfig:
from dataorc_utils.config import PipelineParameterManager
from dataorc_utils.lake import LakeFileSystem
# Build config as usual
mgr = PipelineParameterManager()
infra = mgr.prepare_infrastructure(["datalake_name"])
cfg = mgr.build_core_config(infra, domain="sales", product="orders", table_name="lines")
# Use lake paths from config
fs = LakeFileSystem(base_path=cfg.get_lake_path("bronze"))
# Now all operations are relative to the bronze path
fs.write_json("_metadata/run_info.json", {
"pipeline": "orders_ingestion",
"timestamp": "2026-02-02T10:00:00Z",
"records_processed": 1500
})
Path Handling¶
The module does not perform path normalization. Your pipeline code is responsible for providing correct absolute paths for the runtime environment.
On Databricks, paths that go through the FUSE mount should include the /dbfs/ prefix:
# Correct - includes /dbfs/ prefix
fs = LakeFileSystem(base_path="/dbfs/mnt/datalakestore/bronze")
# Or use absolute paths directly
fs = LakeFileSystem()
fs.write_text("/dbfs/mnt/datalakestore/bronze/file.txt", "content")
Error Handling¶
The module returns None for missing files rather than raising exceptions:
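# Illustrative pattern (the state file name is assumed): treat None as "not yet written"
state = fs.read_json("_metadata/last_run.json")
if state is None:
    state = {"last_run": None}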