Collect

class datalake_ingestion.collect.Collector(datalake, collect_config)

Runs the main ingestion workflow.

Parameters
  • datalake (datalake.Datalake) – a datalake framework instance

  • collect_config (list(dict)) – a collect configuration

identify(path)

Searches the collect configuration for an entry that matches the given file path.

Parameters

path (str) – the file path to identify

Returns

The matching configuration entry (dict) if one is found, None otherwise. The values captured from the path are stored in the dict under the pattern_extract key.
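The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: it assumes each configuration entry carries a regular expression under a hypothetical "pattern" key and that captured values come from named groups. Only the pattern_extract key is taken from the documentation; the "preprocessor" key and the standalone function form are likewise assumptions.

```python
import re

# Hypothetical collect configuration. The "pattern" and "preprocessor" keys
# are assumptions made for this sketch, not the documented schema.
COLLECT_CONFIG = [
    {"pattern": r"^sales/(?P<year>\d{4})/(?P<name>[^/]+)\.csv$", "preprocessor": "csv"},
    {"pattern": r"^logs/(?P<host>[^/]+)/.+\.log$", "preprocessor": "text"},
]


def identify(collect_config, path):
    """Return a copy of the first entry whose pattern matches path, or None."""
    for entry in collect_config:
        match = re.match(entry["pattern"], path)
        if match:
            # Copy the entry so the shared configuration is not mutated,
            # then attach the captured values under "pattern_extract".
            return {**entry, "pattern_extract": match.groupdict()}
    return None


found = identify(COLLECT_CONFIG, "sales/2023/orders.csv")
missing = identify(COLLECT_CONFIG, "unknown/file.bin")
```

In this sketch, found["pattern_extract"] holds the year and name captured from the path, while missing is None because no entry matches.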

process(storage, path)

Identifies the file path and runs the preprocessor.

Also builds a Measurement and sends it to the telemetry backend.

Parameters

datalake_ingestion.collect.validate_config(cfg)

Validates that the given configuration conforms to the schema.

Parameters

cfg (dict) – a configuration to test

Raises

jsonschema.exceptions.ValidationError – when configuration is invalid
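As an illustration of the behaviour described above, the following sketch validates a dict against a JSON Schema with the jsonschema package and catches the documented exception. The schema shown is hypothetical (the real schema used by validate_config is not given here), and is_valid is a helper introduced only for this example.

```python
from jsonschema import validate
from jsonschema.exceptions import ValidationError

# Hypothetical schema for a single configuration entry; the actual schema
# checked by validate_config is not documented in this section.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {"pattern": {"type": "string"}},
    "required": ["pattern"],
}


def is_valid(cfg):
    """Return True if cfg conforms to EXAMPLE_SCHEMA, False otherwise."""
    try:
        validate(instance=cfg, schema=EXAMPLE_SCHEMA)
    except ValidationError:
        # Raised when the instance does not conform to the schema.
        return False
    return True


good = is_valid({"pattern": r"^sales/.+\.csv$"})
bad = is_valid({"pattern": 42})
```

Callers of validate_config would presumably wrap the call in the same kind of try/except block if they want to report an invalid configuration rather than let the exception propagate.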