preprocess

coffea.dataset_tools.preprocess(fileset: coffea.dataset_tools.filespec.DataGroupSpec | dict, step_size: None | int = None, align_clusters: bool = False, recalculate_steps: bool = False, files_per_batch: int = 1, skip_bad_files: bool = False, file_exceptions: Exception | Warning | tuple[Exception | Warning] = (OSError,), save_form: bool = False, scheduler: None | collections.abc.Callable | str = None, uproot_options: dict = {}, step_size_safety_factor: float = 0.5, allow_empty_datasets: bool = False) -> tuple[DataGroupSpec, DataGroupSpec] | tuple[dict, dict]

Given a list of normalized file and object paths (defined in uproot), determine the steps for each file according to the supplied processing options.

Parameters:
  • fileset (DataGroupSpec | dict) – The set of datasets whose files will be preprocessed.

  • step_size (int or None, default None) – If specified, the size of the steps to make when analyzing the input files.

  • align_clusters (bool, default False) – Round steps to the cluster size in a ROOT file when chunks are specified. This reduces data transfer during analysis.

  • recalculate_steps (bool, default False) – If steps are already present in the normalized input files, force their recalculation instead of recalculating them only when the file UUID has changed.

  • files_per_batch (int, default 1) – The number of files to preprocess in a single batch. Larger values result in fewer dask tasks, but each task has to do more work.

  • skip_bad_files (bool, default False) – Instead of failing, catch exceptions specified by file_exceptions and return null data.

  • file_exceptions (Exception or Warning or tuple[Exception or Warning], default (OSError,)) – Which exceptions to catch when skipping bad files. FileNotFoundError is a subclass of OSError, so the default catches it as well.

  • save_form (bool, default False) – Extract the form of the TTree from each file in each dataset, creating the union of the forms over the dataset.

  • scheduler (None or Callable or str, default None) – Specifies the scheduler that dask should use to execute the preprocessing task graph.

  • uproot_options (dict, default {}) – Options to pass to get_steps for opening files with uproot.

  • step_size_safety_factor (float, default 0.5) – When using align_clusters, warn the user that the resulting steps may be highly irregular if any resulting step is larger than step_size by this factor.

  • allow_empty_datasets (bool, default False) – When a dataset query comes back completely empty, this is normally treated as a processing error. Set this argument to True to downgrade the error to a warning and allow incomplete filesets to be returned.
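
To make the role of step_size concrete, the following is a minimal sketch of splitting a file's entries into [start, stop) steps. The helper make_steps is hypothetical and is not coffea's implementation: it ignores align_clusters and step_size_safety_factor, and the step boundaries that preprocess actually produces may differ.

```python
def make_steps(num_entries: int, step_size: int) -> list[list[int]]:
    """Split num_entries into consecutive [start, stop] pairs of at most step_size entries.

    Hypothetical illustration only; not the algorithm used by coffea.
    """
    if step_size <= 0:
        raise ValueError("step_size must be a positive integer")
    # The last step is truncated so that it never runs past the end of the file.
    return [
        [start, min(start + step_size, num_entries)]
        for start in range(0, num_entries, step_size)
    ]


if __name__ == "__main__":
    # A file with 250 entries split with step_size=100.
    print(make_steps(250, 100))
```

With align_clusters=True, the boundaries would instead be rounded to cluster edges in the file, which is why individual steps can end up larger or smaller than step_size.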

Returns:

  • out_available (DataGroupSpec | dict) – The subset of files in each dataset that were successfully preprocessed, organized by dataset.

  • out_updated (DataGroupSpec | dict) – The original set of datasets including files that were not accessible, updated to include the result of preprocessing where available.
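
The sketch below shows one way to call preprocess on a dict fileset and then compare the two return values. The dataset names, file paths, and the "Events" object path are placeholders, and the helper unavailable_files is hypothetical; it assumes the dict form of the results keeps the {dataset: {"files": {path: ...}}} layout of the input.

```python
def build_fileset() -> dict:
    # Placeholder datasets and paths; replace with real files and object paths.
    return {
        "ZJets": {"files": {"zjets_1.root": "Events", "zjets_2.root": "Events"}},
        "Data": {"files": {"data_1.root": "Events"}},
    }


def run_preprocess(fileset: dict):
    # Imported lazily so this module can be loaded without coffea installed.
    from coffea.dataset_tools import preprocess

    return preprocess(
        fileset,
        step_size=100_000,
        align_clusters=False,
        files_per_batch=1,
        skip_bad_files=True,          # catch file_exceptions instead of failing
        file_exceptions=(OSError,),   # same as the documented default
    )


def unavailable_files(out_available: dict, out_updated: dict) -> dict:
    """List, per dataset, the files in out_updated that are absent from out_available.

    Hypothetical helper; assumes both arguments are plain dicts.
    """
    missing = {}
    for dataset, spec in out_updated.items():
        good = set(out_available.get(dataset, {}).get("files", {}))
        bad = sorted(set(spec["files"]) - good)
        if bad:
            missing[dataset] = bad
    return missing


if __name__ == "__main__":
    out_available, out_updated = run_preprocess(build_fileset())
    print(unavailable_files(out_available, out_updated))
```

Keeping out_updated around is useful for retrying later: files that were inaccessible remain listed there, while out_available contains only what can be processed now.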