split_fileset
- coffea.dataset_tools.split_fileset(fileset, strategy=None, datasets=None, percentage=None, treename=None)
Split a fileset into partial filesets so that a partial result can still be obtained if one or more of them fail during processing.
Each returned element is a partial fileset (a unique combination of files), not one of the usual coffea row-range chunks.
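The motivation can be illustrated with a minimal sketch in which each partial fileset is processed independently and a failure in one does not discard the others. The helper names `run_partials` and `count_files`, and the file names, are hypothetical and are not part of coffea; `partials` stands in for a list such as the one returned by `split_fileset`.

```python
def run_partials(partials, process_chunk):
    """Process each partial fileset on its own and keep whatever succeeds."""
    results, failed = [], []
    for index, partial in enumerate(partials):
        try:
            results.append(process_chunk(partial))
        except Exception as exc:
            # A failing partial fileset is recorded, not fatal.
            failed.append((index, exc))
    return results, failed


def count_files(partial):
    """Hypothetical stand-in for real processing: fails on a 'bad' file."""
    paths = [path for files in partial.values() for path in files]
    if any("bad" in path for path in paths):
        raise RuntimeError("unreadable file")
    return len(paths)


# Hand-written partial filesets in list format (illustrative only).
partials = [
    {"ttbar": ["a.root", "b.root"]},
    {"ttbar": ["bad.root"]},
    {"ttbar": ["c.root"]},
]
results, failed = run_partials(partials, count_files)
failed_indices = [index for index, _ in failed]
```

Here the second partial fileset fails while the other two still contribute to `results`.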
Both fileset schemas accepted by coffea.processor.Runner are supported. For list-format datasets ({dataset: [path, ...]} or {dataset: {"files": [...], }} without an inner "treename" field), the treename keyword must be supplied; it is folded into each resulting chunk so the chunks are self-contained and usable as cache keys via hash_fileset(). File paths are sorted before being sliced into bins, so the chunk composition is deterministic regardless of input dict insertion order.
- Parameters:
  - fileset (dict) – A fileset of the form {dataset: [file, ...]} or {dataset: {"files": {path: treename, ...} | [path, ...], ...}}.
  - strategy (str or None, default None) – "by_dataset" puts each dataset in its own chunk; None keeps all datasets together.
  - datasets (list, tuple, callable or None, default None) – Restrict splitting to a subset of datasets. If callable, it is applied to each dataset name and must return a truthy value to include it.
  - percentage (int or None, default None) – An integer that divides 100 evenly (e.g. 10, 20, 25, 50). Each chunk receives this percentage of each dataset’s files.
  - treename (str or None, default None) – Tree name to attach to list-format datasets so the resulting chunks are self-contained. Required when any dataset uses list-format files without its own "treename" field.
- Returns:
  out – The partial filesets. The behaviour depends on the arguments:
    - strategy="by_dataset" alone: one chunk per dataset.
    - percentage=p alone: 100/p chunks, each containing p percent of every dataset’s files (mixed chunks).
    - strategy="by_dataset" with percentage=p: N_datasets * (100/p) chunks, not mixed.
    - datasets combined with any of the above restricts splitting to the selected datasets.
- Return type:
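
The chunk counts listed under Returns can be checked with a small pure-Python sketch. This is an illustrative re-implementation of the documented behaviour for list-format filesets only, not coffea's actual code: the name `sketch_split` is hypothetical, the `datasets` and `treename` arguments are ignored, and the exact placement of bin edges is an assumption.

```python
def sketch_split(fileset, strategy=None, percentage=None):
    """Mimic the documented chunk counts for {dataset: [path, ...]} filesets."""
    if percentage is not None and 100 % percentage != 0:
        raise ValueError("percentage must divide 100 evenly")
    n_bins = 100 // percentage if percentage is not None else 1

    def bins(paths):
        # Sort first so the result does not depend on input order.
        ordered = sorted(paths)
        n = len(ordered)
        # Assumption: bin edges are spread as evenly as possible.
        return [ordered[i * n // n_bins:(i + 1) * n // n_bins]
                for i in range(n_bins)]

    per_dataset = {name: bins(paths) for name, paths in fileset.items()}
    if strategy == "by_dataset":
        # One dataset per chunk: N_datasets * n_bins chunks, not mixed.
        return [{name: chunk}
                for name in sorted(per_dataset)
                for chunk in per_dataset[name]]
    # Mixed chunks: bin i of every dataset goes into chunk i.
    return [{name: per_dataset[name][i] for name in per_dataset}
            for i in range(n_bins)]


# Illustrative file names; any list-format fileset works.
fileset = {"ttbar": ["t4", "t1", "t3", "t2"], "wjets": ["w2", "w1"]}

by_dataset = sketch_split(fileset, strategy="by_dataset")
mixed = sketch_split(fileset, percentage=50)
both = sketch_split(fileset, strategy="by_dataset", percentage=50)

# Same content with a different dict insertion order and file order.
reordered = {"wjets": ["w1", "w2"], "ttbar": ["t1", "t2", "t3", "t4"]}
mixed_reordered = sketch_split(reordered, percentage=50)
```

With two datasets and `percentage=50`, the three calls yield 2, 2 and 4 chunks respectively, and `mixed_reordered` compares equal to `mixed` because paths are sorted before slicing.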