DatasetSpec

class coffea.dataset_tools.DatasetSpec(*, files: PreprocessedFiles | InputFiles, metadata: dict[Hashable, Any] = {}, format: str | None = None, compressed_form: str | None = None, did: str | None = None)

Bases: BaseModel

Attributes Summary

`form`
`joinable`	Identify DatasetSpec criteria to be pre-joined for typetracing (necessary) and column-joining (sufficient).
`model_config`	Configuration for the model; should be a dictionary conforming to pydantic's `ConfigDict` (`pydantic.config.ConfigDict`).
`num_entries`	Compute the total number of entries across all files, if available.
`num_selected_entries`	Compute the total number of selected entries across all files (calculated from steps), if available.
`steps`	Get the steps per dataset file, if available.

Methods Summary

`filter_files`([filter_name, filter_callable])	Filter files by a regex pattern on the file names (`filter_name`) or by a callable applied to each file spec (`filter_callable`).
`limit_files`(max_files)	Limit the number of files.
`limit_steps`(max_steps[, per_file])	Limit the steps.
`post_validate`()
`preprocess_data`(data)
`set_check_format`()	Set and/or validate the format if manually specified.

Attributes Documentation

joinable: Identify DatasetSpec criteria to be pre-joined for typetracing (necessary) and column-joining (sufficient).

model_config: ClassVar[ConfigDict] = {}: Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict (pydantic.config.ConfigDict).

num_entries: Compute the total number of entries across all files, if available.

num_selected_entries: Compute the total number of selected entries across all files (calculated from steps), if available.
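
The two entry counts can be illustrated with a minimal sketch in plain Python. It does not call coffea; the dictionary layout (a `num_entries` count and a list of `[start, stop]` steps per file), the function names, and the choice to return `None` when a value is unavailable are all assumptions made for illustration.

```python
from typing import Optional

def num_entries_sketch(files: dict) -> Optional[int]:
    """Sum per-file entry counts; None if any file lacks a count (assumed behavior)."""
    counts = [spec.get("num_entries") for spec in files.values()]
    if any(count is None for count in counts):
        return None
    return sum(counts)

def num_selected_entries_sketch(files: dict) -> Optional[int]:
    """Sum the widths of all [start, stop] steps; None if any file has no steps."""
    total = 0
    for spec in files.values():
        steps = spec.get("steps")
        if steps is None:
            return None
        total += sum(stop - start for start, stop in steps)
    return total

# Hypothetical file names and values, for illustration only.
files = {
    "a.root": {"num_entries": 100, "steps": [[0, 50], [50, 100]]},
    "b.root": {"num_entries": 40, "steps": [[0, 20]]},
}
total_entries = num_entries_sketch(files)
selected_entries = num_selected_entries_sketch(files)
```

In this sketch the selected count is smaller than the total because the steps of `b.root` cover only half of that file.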

Methods Documentation

filter_files(filter_name: str | None = None, filter_callable: Callable[[CoffeaROOTFileSpec | CoffeaParquetFileSpec | CoffeaROOTFileSpecOptional | CoffeaParquetFileSpecOptional], bool] | None = None) → Self: Filter files by a regex pattern on the file names (filter_name) or by a callable applied to each file spec (filter_callable).
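
The filtering semantics can be sketched on a plain dictionary, without coffea. The function name, the file names, and the dictionary-based file specs below are hypothetical. Two behaviors are assumptions rather than documented facts: the pattern is allowed to match anywhere in the name (`re.search`), and when both filters are given a file must pass both.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for a dataset's file mapping: file name -> file spec.
FileMap = dict[str, dict]

def filter_files_sketch(
    files: FileMap,
    filter_name: Optional[str] = None,
    filter_callable: Optional[Callable[[dict], bool]] = None,
) -> FileMap:
    """Keep files that pass the name pattern and the callable, when given."""
    pattern = re.compile(filter_name) if filter_name is not None else None
    kept = {}
    for name, spec in files.items():
        # Assumption: the pattern may match anywhere in the file name.
        if pattern is not None and not pattern.search(name):
            continue
        # Assumption: the callable receives the per-file spec and returns a bool.
        if filter_callable is not None and not filter_callable(spec):
            continue
        kept[name] = spec
    return kept

files = {
    "root://site-a//store/data/file_1.root": {"object_path": "Events", "num_entries": 100},
    "root://site-b//store/data/file_2.root": {"object_path": "Events", "num_entries": 0},
}
by_name = filter_files_sketch(files, filter_name=r"site-a")
non_empty = filter_files_sketch(files, filter_callable=lambda spec: spec["num_entries"] > 0)
```

Here both calls keep only `file_1.root`: the first by its name, the second because `file_2.root` has no entries.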

limit_files(max_files: int | slice | None) → Self: Limit the number of files.
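
A minimal sketch of how an `int | slice | None` limit could be applied to an ordered file mapping. It is plain Python rather than coffea code, and it assumes that an integer keeps the first N files, a slice selects by insertion order, and `None` means no limit. The function name and file names are hypothetical.

```python
from typing import Union

def limit_files_sketch(files: dict, max_files: Union[int, slice, None]) -> dict:
    """Return a copy of `files` restricted by an int, a slice, or None."""
    if max_files is None:
        # Assumption: None leaves the file list unchanged.
        return dict(files)
    # Normalize an integer N to slice(N), i.e. "the first N files".
    window = max_files if isinstance(max_files, slice) else slice(max_files)
    return dict(list(files.items())[window])

files = {f"file_{i}.root": {"object_path": "Events"} for i in range(4)}
first_two = limit_files_sketch(files, 2)
middle_two = limit_files_sketch(files, slice(1, 3))
unlimited = limit_files_sketch(files, None)
```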

limit_steps(max_steps: int | slice, per_file: bool = False) → Self: Limit the steps. Pass per_file=True to limit the steps of each file separately; otherwise the limit applies cumulatively across all files.
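
The difference between the cumulative and per-file modes can be shown with a small sketch. It operates on a hypothetical mapping from file name to a list of `[start, stop]` steps instead of a real DatasetSpec, and it assumes that files left without any steps are dropped in the cumulative case.

```python
from typing import Union

# Hypothetical layout: file name -> list of [start, stop] entry ranges.
Steps = dict[str, list[list[int]]]

def limit_steps_sketch(
    steps_by_file: Steps, max_steps: Union[int, slice], per_file: bool = False
) -> Steps:
    """Limit steps per file, or cumulatively across all files in order."""
    window = max_steps if isinstance(max_steps, slice) else slice(max_steps)
    if per_file:
        # Apply the same window to each file's own list of steps.
        return {name: steps[window] for name, steps in steps_by_file.items()}
    # Cumulative: flatten to (file, step) pairs, apply the window once, regroup.
    flat = [(name, step) for name, steps in steps_by_file.items() for step in steps]
    limited: Steps = {}
    for name, step in flat[window]:
        limited.setdefault(name, []).append(step)
    return limited

steps = {
    "a.root": [[0, 10], [10, 20], [20, 30]],
    "b.root": [[0, 10], [10, 20]],
}
cumulative = limit_steps_sketch(steps, 2)
each_file = limit_steps_sketch(steps, 2, per_file=True)
```

With a limit of 2, the cumulative mode keeps only the first two steps of `a.root`, while the per-file mode keeps two steps from each file.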

set_check_format() → bool: Set and/or validate the format if manually specified.