Helpers

class datalake.helpers.DatasetBuilder(datalake, key, path=None, lang='en_US', date_formats=None, ciphered=False)[source]

Creates a new CSV file that conforms to the Datalake's standards.

Parameters
  • datalake (Datalake) – a datalake instance

  • key (str) – the catalog entry identifier for which a dataset is created

  • path (str) – the path of the local file to create. Defaults to an auto-generated temporary file

  • lang (str) – the locale used to cast numbers. Defaults to en_US

  • date_formats (list(str)) – a list of datetime formats used to cast dates and times. Defaults to ISO 8601 formats

  • ciphered (bool) – Indicates whether the dataset has pseudonymized values or not. Defaults to False

Example

Build a dataset from dicts:

with DatasetBuilder(dlk, "my-entry") as dsb:
    for i in range(10):
        dsb.add_dict({
            "id": i,
            "column_s" : "lorem ipsum",
            "column_i": 1234.56,
            "column_d": "2022-04-01",
        })
dlk.upload(dsb.path, "storename", "my-entry")

add_dict(row)[source]

Appends a row to the dataset

Parameters

row (dict) – a row as key/value pairs

add_sequence(row)[source]

Appends a row to the dataset

Parameters

row (list) – a row as a sequence of values

new_dict()[source]

Returns an empty row as a dict

property path

The path to the local file

property row_count

The number of rows appended to the dataset

class datalake.helpers.DatasetReader(datalake, store, key, path_params=None, ciphered=False)[source]

Reads a CSV dataset from a store.

When the iterators are used, no file is downloaded: the data is streamed.

Parameters
  • datalake (Datalake) – a datalake instance

  • store (str) – the name of the store to read the dataset from

  • key (str) – the catalog entry identifier for which a dataset is read

  • path_params (dict) – the entry path placeholders used to find a specific dataset

  • ciphered (bool) – Indicates whether the dataset has pseudonymized values or not. Defaults to False

Example

Count the number of rows in a dataset:

dsr = DatasetReader(dlk, "storename", "my-entry")
count = 0
for item in dsr.iter_list():
    count += 1
print(f"Found {count} rows")

dataframe()[source]

Returns a pandas DataFrame

Raises

RuntimeError – if the pandas package is not installed

iter_dict()[source]

Returns an iterator that yields each row as a dict

iter_list()[source]

Returns an iterator that yields each row as a list
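
The difference between the two iterators is only the shape of each row. The following is a minimal sketch of that difference using the standard csv module on an in-memory string, not the library's own code. The function names iter_list_sketch and iter_dict_sketch are hypothetical, skipping the header row in the list variant is an assumption, and values are left as strings here whereas the real reader may cast them.

```python
import csv
import io

# Sample CSV content standing in for a dataset streamed from a store.
RAW = "id,column_s\n0,lorem ipsum\n1,dolor sit\n"


def iter_list_sketch(stream):
    """Yield each data row as a list of values (header skipped, by assumption)."""
    reader = csv.reader(stream)
    next(reader, None)  # consume the header line
    yield from reader


def iter_dict_sketch(stream):
    """Yield each data row as a dict keyed by the header names."""
    yield from csv.DictReader(stream)


for row in iter_list_sketch(io.StringIO(RAW)):
    print(row)
for row in iter_dict_sketch(io.StringIO(RAW)):
    print(row)
```

Because both functions are generators, rows are produced one at a time and the whole dataset never has to be held in memory.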

class datalake.helpers.StandardDialect[source]

CSV format according to RFC 4180

delimiter = ','
doublequote = True
escapechar = None
lineterminator = '\n'
quotechar = '"'
quoting = 0
skipinitialspace = False
strict = True
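
To see what these attributes mean in practice, here is a minimal sketch that assumes the dialect behaves like a csv.Dialect subclass from the standard library. StandardDialectSketch is a local stand-in that copies the values listed above; it is not imported from datalake, and rows_to_csv is a hypothetical helper.

```python
import csv
import io


class StandardDialectSketch(csv.Dialect):
    """Local stand-in mirroring the attribute values listed above."""

    delimiter = ","
    doublequote = True
    escapechar = None
    lineterminator = "\n"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # csv.QUOTE_MINIMAL is the constant 0
    skipinitialspace = False
    strict = True


def rows_to_csv(rows):
    """Serialize rows to a CSV string using the stand-in dialect."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, dialect=StandardDialectSketch)
    writer.writerows(rows)
    return buffer.getvalue()


# A field containing the delimiter and the quote character is quoted,
# and embedded quotes are doubled because doublequote is True.
sample = rows_to_csv([["id", "comment"], [1, 'says "hi", then leaves']])
print(sample)
```

With minimal quoting, only fields that contain the delimiter, the quote character, or a line break are wrapped in quotes. Note that RFC 4180 itself delimits records with CRLF, while the line terminator listed here is '\n'.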

datalake.helpers.cast_date(d, formats=['YYYY-MM-DD'])[source]

Casts a string as a date according to a set of format strings

Parameters
  • d (str) – a value to cast

  • formats (list(str)) – a set of formats to try when casting the string as a date. See also the supported formats

Returns

the value formatted with ISO 8601 date format
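
The idea of trying several formats and returning an ISO 8601 date can be sketched with the standard library alone. This is an illustration, not the library's implementation: cast_date_sketch is a hypothetical name, it takes strptime directives such as %Y-%m-%d rather than the token style (YYYY-MM-DD) shown in the signature above, and raising ValueError when nothing matches is an assumption.

```python
from datetime import datetime


def cast_date_sketch(value, formats=("%Y-%m-%d",)):
    """Return value as an ISO 8601 date string, trying each format in order."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"{value!r} matches none of the formats {list(formats)}")


# The first format fails for this input, the second one matches.
print(cast_date_sketch("01/04/2022", formats=("%Y-%m-%d", "%d/%m/%Y")))  # 2022-04-01
```

Listing the most specific or most common format first keeps ambiguous inputs, such as day-first versus month-first dates, from being matched by the wrong pattern.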

datalake.helpers.cast_datetime(d, formats=['YYYY-MM-DDTHH:mm:ss.SSSZZ'])[source]

Casts a string as a datetime according to a set of format strings

Parameters
  • d (str) – a value to cast

  • formats (list(str)) – a set of formats to try when casting the string as a datetime. See also the supported formats

Returns

the value formatted with ISO 8601 datetime format

datalake.helpers.cast_float(x, lang='en_US')[source]

Casts a string as a decimal number according to a locale

Parameters
  • x (str) – a value to cast

  • lang (str) – the locale used to interpret the string

Returns

the value cast as a float
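
Casting "according to a locale" means that the grouping and decimal separators depend on the lang argument. The sketch below shows the idea with a small hand-written table. It is not how the library resolves locales: SEPARATORS and cast_float_sketch are hypothetical names, and only two locales are covered.

```python
# Hypothetical separator table: locale -> (grouping separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def cast_float_sketch(text, lang="en_US"):
    """Parse a localized number string into a float."""
    group, decimal = SEPARATORS[lang]
    # Drop grouping separators first, then normalize the decimal separator.
    normalized = text.strip().replace(group, "").replace(decimal, ".")
    return float(normalized)


print(cast_float_sketch("1,234.56"))                # 1234.56
print(cast_float_sketch("1.234,56", lang="de_DE"))  # 1234.56
```

The same string can therefore yield different values under different locales, which is why the lang argument matters when loading data exported from localized tools.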

datalake.helpers.cast_integer(x, lang='en_US')[source]

Casts a string as an integer according to a locale

Parameters
  • x (str) – a value to cast

  • lang (str) – the locale used to interpret the string

Returns

the value cast as an int

datalake.helpers.cast_time(d, formats=['HH:mm:ss.SSSZZ'])[source]

Casts a string as a time according to a set of format strings

Parameters
  • d (str) – a value to cast

  • formats (list(str)) – a set of formats to try when casting the string as a time. See also the supported formats

Returns

the value formatted with ISO 8601 time format