Helpers
- class datalake.helpers.DatasetBuilder(datalake, key, path=None, lang='en_US', date_formats=None, ciphered=False)
Creates a new CSV file according to a Datalake’s standards
- Parameters
datalake (Datalake) – a datalake instance
key (str) – the catalog entry identifier for which a dataset is created
path (str) – the path to a local file to create. Defaults to an auto-generated temporary file.
lang (str) – a locale to use for casting numbers
date_formats (list(str)) – a list of datetime formats to use for casting dates and times. Defaults to ISO 8601 formats.
ciphered (bool) – indicates whether the dataset has pseudonymized values or not. Defaults to False.
Example
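Building a dataset with dicts:
    from datalake.helpers import DatasetBuilder

    # dlk is an existing Datalake instance
    with DatasetBuilder(dlk, "my-entry") as dsb:
        for i in range(10):
            dsb.add_dict({
                "id": i,
                "column_s": "lorem ipsum",
                "column_i": 1234.56,
                "column_d": "2022-04-01",
            })
    # Upload the generated local file to the store
    dlk.upload(dsb.path, "storename", "my-entry")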
- add_dict(row)
Appends a row to the dataset
- Parameters
row (dict) – a row in key/value pairs
- add_sequence(row)
Appends a row to the dataset
- Parameters
row (list) – a row sequence of values
- property path
The path to the local file
- property row_count
The number of rows appended to the dataset
- class datalake.helpers.DatasetReader(datalake, store, key, path_params=None, ciphered=False)
Reads a CSV dataset from a store.
No file is downloaded and data is streamed when using the iterators.
- Parameters
datalake (Datalake) – a datalake instance
store (str) – the name of the store to read the dataset from
key (str) – the catalog entry identifier for which a dataset is read
path_params (dict) – the entry path placeholders used to find a specific dataset
ciphered (bool) – indicates whether the dataset has pseudonymized values or not. Defaults to False.
Example
Count the number of rows in a dataset:
    from datalake.helpers import DatasetReader

    # dlk is an existing Datalake instance
    dsr = DatasetReader(dlk, "storename", "my-entry")
    count = 0
    for item in dsr.iter_list():
        count += 1
    print(f"Found {count} rows")
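The streaming behaviour described above can be illustrated without a datalake at all. The following is only a sketch using the standard library: io.StringIO stands in for the remote dataset, and iter_rows is a hypothetical helper, not part of datalake.helpers. How DatasetReader streams data internally, and whether iter_list() yields a header row, is not stated here, so the sample contains data rows only.

```python
import csv
import io


def iter_rows(stream):
    """Lazily yield one parsed CSV row (a list of strings) at a time."""
    yield from csv.reader(stream)


# io.StringIO stands in for a remote, streamed dataset (assumption).
remote = io.StringIO(
    "0,lorem ipsum,1234.56,2022-04-01\n"
    "1,lorem ipsum,1234.56,2022-04-01\n"
)

# Rows are consumed one by one; the full dataset is never held in a list.
count = sum(1 for _ in iter_rows(remote))
print(f"Found {count} rows")
```

Because the rows come from a generator, memory use stays flat regardless of the dataset size, which is the practical benefit of iterating instead of downloading.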
- class datalake.helpers.StandardDialect
CSV format according to RFC 4180
- delimiter = ','
- doublequote = True
- escapechar = None
- lineterminator = '\n'
- quotechar = '"'
- quoting = 0
- skipinitialspace = False
- strict = True
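To see what these attribute values mean in practice, here is a minimal sketch built on the standard csv module. Rfc4180LikeDialect is a hypothetical stand-in that copies the values listed above; it is not the library's StandardDialect class, and write_rows and read_rows are illustrative helpers only.

```python
import csv
import io


class Rfc4180LikeDialect(csv.Dialect):
    """Hypothetical stand-in copying the attribute values listed above."""

    delimiter = ","
    doublequote = True
    escapechar = None
    lineterminator = "\n"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # numeric value 0, as listed above
    skipinitialspace = False
    strict = True


def write_rows(rows):
    """Serialize rows to a CSV string using the dialect above."""
    buffer = io.StringIO()
    csv.writer(buffer, dialect=Rfc4180LikeDialect).writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse a CSV string back into a list of rows."""
    return list(csv.reader(io.StringIO(text), dialect=Rfc4180LikeDialect))


sample = [["id", "comment"], ["1", 'says "hi", then leaves']]
encoded = write_rows(sample)
print(encoded)
```

With minimal quoting, only fields containing the delimiter, the quote character, or a line break are quoted, and doublequote = True means an embedded quote is written as two quotes rather than being escaped.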
- datalake.helpers.cast_date(d, formats=['YYYY-MM-DD'])
Casts a string to a date according to a set of format strings
- Parameters
d (str) – a value to cast
formats (list(str)) – a set of formats used to try to cast the string as a date. See also the Supported formats
- Returns
the value formatted with ISO 8601 date format
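The idea of trying several formats in turn can be sketched with the standard library alone. Note the assumptions: cast_date_sketch is a hypothetical function, it uses strptime directives such as %Y-%m-%d rather than the token syntax shown in the signature above, and raising ValueError when nothing matches is a choice made for this sketch, since the failure behaviour of cast_date is not documented here.

```python
from datetime import datetime


def cast_date_sketch(value, formats=("%Y-%m-%d",)):
    """Try each format in order; return the first match as an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    # Assumption: failing loudly when no format matches.
    raise ValueError(f"{value!r} matches none of the formats {list(formats)}")


print(cast_date_sketch("2022-04-01"))
print(cast_date_sketch("01/04/2022", formats=("%Y-%m-%d", "%d/%m/%Y")))
```

Both calls print 2022-04-01. The order of the formats matters when an input could match more than one of them, for example day-first versus month-first patterns.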
- datalake.helpers.cast_datetime(d, formats=['YYYY-MM-DDTHH:mm:ss.SSSZZ'])
Casts a string to a datetime according to a set of format strings
- Parameters
d (str) – a value to cast
formats (list(str)) – a set of formats used to try to cast the string as a datetime. See also the Supported formats
- Returns
the value formatted with ISO 8601 datetime format
- datalake.helpers.cast_float(x, lang='en_US')
Casts a string to a decimal number according to a locale
- Parameters
x (str) – a value to cast
lang (str) – the locale used to interpret the string.
- Returns
the value cast as float
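Why the locale matters is easiest to see with the same number written in different conventions. The sketch below is purely illustrative: SEPARATORS is a small hand-written table and cast_float_sketch is a hypothetical function, neither taken from datalake.helpers. In particular, a plain space is used as the French grouping separator for simplicity, whereas real locale data may use a non-breaking space.

```python
# Hypothetical table: locale -> (grouping separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "fr_FR": (" ", ","),  # simplification, see the note above
}


def cast_float_sketch(text, lang="en_US"):
    """Strip the grouping separator, normalize the decimal separator, then parse."""
    group_sep, decimal_sep = SEPARATORS[lang]
    normalized = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(normalized)


print(cast_float_sketch("1,234.56"))                # en_US convention
print(cast_float_sketch("1.234,56", lang="de_DE"))  # de_DE convention
print(cast_float_sketch("1 234,56", lang="fr_FR"))  # fr_FR convention
```

All three calls print 1234.56. Parsing "1.234,56" with the wrong locale would either fail or silently produce a different number, which is why the lang parameter has to match the source data.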
- datalake.helpers.cast_integer(x, lang='en_US')
Casts a string to an integer according to a locale
- Parameters
x (str) – a value to cast
lang (str) – the locale used to interpret the string
- Returns
the value cast as int
- datalake.helpers.cast_time(d, formats=['HH:mm:ss.SSSZZ'])
Casts a string to a time according to a set of format strings
- Parameters
d (str) – a value to cast
formats (list(str)) – a set of formats used to try to cast the string as a time. See also the Supported formats
- Returns
the value formatted with ISO 8601 time format