Datalake Framework¶
This python framework provides Data Engineering features for managing and organizing datasets in a Cloud datalake.
Datacatalog driven operations
Cloud storage abstraction.
Monitoring
Getting started¶
Install the framework with PIP
pip install datalake-framework
The main class in the framework is datalake.Datalake. It gets its configuration with a dict
- catalog_url
the URL for the Datacatalog API
- monitoring
see Monitoring
The Datacatalog API provides most of the parameters like the cloud provider and storage identifiers.
For example:
from datalake import Datalake
config = {
"catalog_url": "http://catalog.datalake.svc:8080",
"monitoring": {
"class": "NoMonitor",
"params": {
"quiet": False
}
}
}
dlk = Datalake(config)
# Fetch the dataset specs for a catalog entry
my_entry = dlk.get_entry("my-entry")
# Download a file from a storage bucket
dlk.download("silver", my_entry["_key"], "/local/path/my-file.csv")