preprocess
- coffea.dataset_tools.preprocess(fileset: DataGroupSpec | dict, step_size: None | int = None, align_clusters: bool = False, recalculate_steps: bool = False, files_per_batch: int = 1, skip_bad_files: bool = False, file_exceptions: Exception | Warning | tuple[Exception | Warning] = (OSError,), save_form: bool = False, scheduler: None | Callable | str = None, uproot_options: dict = {}, step_size_safety_factor: float = 0.5, allow_empty_datasets: bool = False) -> tuple[DataGroupSpec, DataGroupSpec] | tuple[dict, dict]
Given a list of normalized file and object paths (defined in uproot), determine the steps for each file according to the supplied processing options.
- Parameters:
fileset (DataGroupSpec | dict) – The set of datasets whose files will be preprocessed.
step_size (int or None, default None) – If specified, the size of the steps to make when analyzing the input files.
align_clusters (bool, default False) – Round steps to the cluster size in a ROOT file when chunks are specified. This reduces data transfer during analysis.
recalculate_steps (bool, default False) – If steps are present in the input normalized files, force the recalculation of those steps, instead of only recalculating the steps if the uuid has changed.
files_per_batch (int, default 1) – The number of files to preprocess in a single batch. Large values result in fewer dask tasks, but each task has to do more work.
skip_bad_files (bool, default False) – Instead of failing, catch the exceptions specified by file_exceptions and return null data for the affected files.
file_exceptions (Exception or Warning or tuple[Exception or Warning], default (OSError,)) – The exceptions to catch when skipping bad files. FileNotFoundError is a subclass of OSError, so it is covered by the default.
save_form (bool, default False) – Extract the form of the TTree from each file in each dataset, creating the union of the forms over the dataset.
scheduler (None or Callable or str, default None) – Specifies the scheduler that dask should use to execute the preprocessing task graph.
uproot_options (dict, default {}) – Options to pass to get_steps for opening files with uproot.
step_size_safety_factor (float, default 0.5) – When using align_clusters, if a resulting step is larger than step_size by this factor, warn the user that the resulting steps may be highly irregular.
allow_empty_datasets (bool, default False) – When a dataset query comes back completely empty, this is normally considered a processing error. Set this argument to True to turn the error into a warning and allow incomplete filesets to be returned.
- Returns:
out_available (DataGroupSpec | dict) – The subset of files in each dataset that were successfully preprocessed, organized by dataset.
out_updated (DataGroupSpec | dict) – The original set of datasets, including files that were not accessible, updated to include the result of preprocessing where available.
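The following is a minimal sketch of how a call might be assembled, not a definitive recipe. The dataset name, file names, tree name "Events", and the helpers build_fileset and run_preprocess are hypothetical and exist only for illustration. The plain-dict layout (dataset name mapping to a "files" dict of path to tree name) is an assumption about the dict form of fileset; the keyword arguments are the ones listed in the signature above.

```python
# Hypothetical helper: builds the assumed plain-dict fileset layout,
# dataset name -> {"files": {file path: tree name}}.
def build_fileset(dataset, paths, treename="Events"):
    return {dataset: {"files": {path: treename for path in paths}}}


# Keyword arguments taken from the documented signature; the values are
# illustrative choices, not recommendations.
PREPROCESS_OPTIONS = {
    "step_size": 100_000,           # target number of entries per step
    "align_clusters": False,        # do not round steps to cluster size
    "files_per_batch": 1,           # one file per dask task
    "skip_bad_files": True,         # catch file_exceptions instead of failing
    "file_exceptions": (OSError,),  # also covers FileNotFoundError
    "save_form": False,
}


# Hypothetical wrapper: coffea is imported lazily so the rest of this sketch
# can be inspected without coffea installed.
def run_preprocess(fileset, **overrides):
    from coffea.dataset_tools import preprocess

    options = {**PREPROCESS_OPTIONS, **overrides}
    out_available, out_updated = preprocess(fileset, **options)
    return out_available, out_updated


# Placeholder dataset and file names; replace with real, readable ROOT files.
fileset = build_fileset(
    "placeholder_dataset",
    ["placeholder_1.root", "placeholder_2.root"],
)
```

Calling run_preprocess(fileset) requires coffea to be installed and the listed files to be readable. With skip_bad_files=True, files that raise one of file_exceptions should be missing from the first return value while remaining in the second.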