split_fileset

coffea.dataset_tools.split_fileset(fileset, strategy=None, datasets=None, percentage=None, treename=None)

Split a fileset into partial filesets so that a partial result can still be obtained if one or more of them fail during processing.

Each returned element is a partial fileset (a unique combination of files), not one of the usual coffea row-range chunks.

Both fileset schemas accepted by coffea.processor.Runner are supported. For list-format datasets ({dataset: [path, ...]} or {dataset: {"files": [...], ...}} without an inner "treename" field), the treename keyword must be supplied; it is folded into each resulting chunk so that the chunks are self-contained and usable as cache keys via hash_fileset(). File paths are sorted before being sliced into bins, so the chunk composition is deterministic regardless of input dict insertion order.
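
The effect of sorting and of folding in the tree name can be illustrated with a minimal sketch. This is not coffea's implementation: the helper name normalize_files and the {path: treename} output layout are assumptions made purely for illustration.

```python
def normalize_files(files, treename=None):
    """Return a {path: treename} mapping with paths in sorted order.

    Hypothetical helper for illustration; not part of coffea.
    """
    if isinstance(files, dict):
        # Dict-format input already carries a tree name for every path.
        return {path: files[path] for path in sorted(files)}
    if treename is None:
        # List-format input has no tree name of its own.
        raise ValueError("treename is required for list-format files")
    # Attach the supplied tree name to every path, in sorted order.
    return {path: treename for path in sorted(files)}


# Two different input orders give the same ordered result.
first = normalize_files(["b.root", "a.root", "c.root"], treename="Events")
second = normalize_files(["c.root", "a.root", "b.root"], treename="Events")
assert list(first) == list(second) == ["a.root", "b.root", "c.root"]
```

Because the paths are ordered before any slicing happens, the same set of files always lands in the same bins, which is what makes a content-based cache key meaningful.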

Parameters:

fileset (dict) – A fileset of the form {dataset: [file, ...]} or {dataset: {"files": {path: treename, ...} | [path, ...], ...}}.
strategy (str or None, default None) – "by_dataset" puts each dataset in its own chunk; None keeps all datasets together.
datasets (list, tuple, callable or None, default None) – Restrict splitting to a subset of datasets. If callable, it is applied to each dataset name and must return a truthy value to include it.
percentage (int or None, default None) – An integer that divides 100 evenly (e.g. 10, 20, 25, 50). Each chunk receives this percentage of each dataset’s files.
treename (str or None, default None) – Tree name to attach to list-format datasets so the resulting chunks are self-contained. Required when any dataset uses list-format files without its own "treename" field.
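
The constraints on percentage and datasets described above can be sketched as follows. Both helpers are hypothetical and only mirror the documented rules; the 1 to 100 range check in particular is an assumption, and coffea's own validation may differ.

```python
def check_percentage(percentage):
    """Return the number of bins implied by percentage.

    Hypothetical helper; the range check is an assumption.
    """
    if not isinstance(percentage, int) or not 0 < percentage <= 100:
        raise ValueError("percentage must be an integer between 1 and 100")
    if 100 % percentage != 0:
        raise ValueError("percentage must divide 100 evenly")
    return 100 // percentage


def select_datasets(names, datasets=None):
    """Pick dataset names using a list, tuple, callable or None.

    Hypothetical helper; it only shows which names would be selected.
    """
    if datasets is None:
        return list(names)
    if callable(datasets):
        # A callable is applied to each name; truthy results are kept.
        return [name for name in names if datasets(name)]
    return [name for name in names if name in datasets]


names = ["ttbar", "data_2018", "data_2017"]
assert check_percentage(25) == 4
assert select_datasets(names, lambda n: n.startswith("data")) == ["data_2018", "data_2017"]
assert select_datasets(names, ("ttbar",)) == ["ttbar"]
```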

Returns:

out – The partial filesets. The behaviour depends on the arguments:

strategy="by_dataset" alone: one chunk per dataset.
percentage=p alone: 100/p chunks, each containing p percent of every dataset’s files (mixed chunks).
strategy="by_dataset" with percentage=p: N_datasets * (100/p) chunks, each drawn from a single dataset (not mixed).
datasets combined with any of the above restricts splitting to the selected datasets.

Return type:

list of dict
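
The chunk counts listed under Returns, and the slicing of sorted files into bins, can be worked through with a short sketch. It is an illustration under stated assumptions, not the library's code: both function names are hypothetical, the way leftover files are spread when the file count is not a multiple of the bin count is an assumption, and the case where neither strategy nor percentage is given is not described above, so the sketch rejects it.

```python
def expected_chunk_count(n_datasets, strategy=None, percentage=None):
    """Number of partial filesets for the three documented cases."""
    bins = 100 // percentage if percentage is not None else None
    if strategy == "by_dataset" and bins is None:
        return n_datasets            # one chunk per dataset
    if strategy is None and bins is not None:
        return bins                  # mixed chunks
    if strategy == "by_dataset" and bins is not None:
        return n_datasets * bins     # per-dataset chunks, not mixed
    raise ValueError("combination not covered by the documented cases")


def slice_into_bins(files, percentage):
    """Sort paths, then cut them into 100/percentage contiguous bins.

    The boundary arithmetic is an assumption chosen so no file is lost.
    """
    ordered = sorted(files)
    bins = 100 // percentage
    n = len(ordered)
    return [ordered[i * n // bins:(i + 1) * n // bins] for i in range(bins)]


assert expected_chunk_count(3, strategy="by_dataset") == 3
assert expected_chunk_count(3, percentage=25) == 4
assert expected_chunk_count(3, strategy="by_dataset", percentage=25) == 12
assert slice_into_bins(["d", "a", "c", "b"], 50) == [["a", "b"], ["c", "d"]]
```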