API Reference Guide

Coffea: a column object framework for effective analysis.

When executing

import coffea

a subset of the full coffea package is imported into the Python environment. Some packages must be imported explicitly, to avoid importing unnecessary or heavy dependencies. The packages available in the coffea namespace are listed below, followed by documentation for some of the coffea packages that need to be imported explicitly.
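As a minimal sketch of what an explicit import looks like, the snippet below imports two submodules by their full dotted path. The helper import_optional is a hypothetical convenience (not part of coffea) that returns None when a module or one of its dependencies is missing, so the snippet also runs in an environment without coffea installed.

```python
import importlib


def import_optional(name):
    """Import a module by dotted name; return None if it or a dependency is missing.

    Hypothetical helper for illustration only, not part of coffea.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Modules outside the top-level namespace are imported by their full dotted path.
dataset_query = import_optional("coffea.dataset_tools.dataset_query")
rucio_utils = import_optional("coffea.dataset_tools.rucio_utils")

if dataset_query is None:
    print("coffea.dataset_tools.dataset_query is not importable here")
```

In an environment where coffea and its Rucio dependencies are installed, a plain `from coffea.dataset_tools import rucio_utils` would presumably serve the same purpose.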

In coffea Namespace

coffea.analysis_tools

Tools of general use for columnar analysis

coffea.btag_tools

BTag tools: CMS analysis-level b-tagging corrections and uncertainties

coffea.dataset_tools

coffea.jetmet_tools

JetMET tools: CMS analysis-level jet corrections and uncertainties

coffea.lookup_tools

Lookup tools

coffea.lumi_tools

Tools to parse CMS luminosity non-event data

coffea.ml_tools

Tools to interface with various ML inference services

coffea.nanoevents

NanoEvents and helpers

coffea.nanoevents.methods.base

Basic NanoEvents and NanoCollection mixins

coffea.nanoevents.methods.candidate

Physics object candidate mixin

coffea.nanoevents.methods.nanoaod

Mixins for the CMS NanoAOD schema

coffea.nanoevents.methods.vector

2D, 3D, and Lorentz vector class mixins

coffea.processor

A framework for analysis scale-out

coffea.util

Utility functions

Not in coffea Namespace

Here is documentation for some of the packages that are not automatically imported on a call to import coffea.

This page contains documentation for parts of the coffea.dataset_tools package that are not included in the coffea namespace. That is, they must be explicitly imported.

class coffea.dataset_tools.dataset_query.DataDiscoveryCLI

Simplifies dataset query, replicas, filters, and uproot preprocessing with Dask. It can be accessed in a Python script or interpreter via this class, or from the command line (as in python -m coffea.dataset_tools.dataset_query --help).

do_allowlist_sites(sites=None)

Restrict the grid sites available for replicas query only to the requested list

Parameters:

sites (list[str] or None, default None) – The sites to allow the replicas query to look at. If passing in a list, elements of the list are sites. If passing in None, the prompt requires a single string containing a comma-separated listing.

do_blocklist_sites(sites=None)

Exclude grid sites from the available sites for replicas query

Parameters:

sites (list[str] or None, default None) – The sites to prevent the replicas query from looking at. If passing in a list, elements of the list are sites. If passing in None, the prompt requires a single string containing a comma-separated listing.

do_list_replicas()

Print the selected file replicas for the selected dataset

do_list_selected()

Print a list of the selected datasets

do_login(proxy=None)

Log in to the rucio client. Optionally, a specific proxy file can be passed to the command. If the proxy file is not specified, voms-proxy-info is used to find the currently active one.

do_preprocess(output_file=None, step_size=None, align_to_clusters=None, scheduler_url=None, recalculate_steps=None, files_per_batch=None, file_exceptions=(OSError,), save_form=None, uproot_options={}, step_size_safety_factor=0.5, allow_empty_datasets=False)

Perform preprocessing for concrete fileset extraction into a file, compressed with gzip.

Parameters:
  • output_file (str or None, default None) – Target prefix for the generated *_available.json.gz and *_all.json.gz files.

  • step_size (int or None, default None) – Chunk size (number of events) to process per step.

  • align_to_clusters (bool or None, default None) – Whether to align step boundaries to ROOT cluster boundaries. Mirrors the align_clusters argument of coffea.dataset_tools.preprocess.

  • scheduler_url (str or None, default None) – Dask scheduler URL on which to run preprocessing.

  • recalculate_steps (bool or None, default None) – Recompute step definitions even if cached values are present.

  • files_per_batch (int or None, default None) – Number of files to send to each preprocessing task.

  • file_exceptions (tuple[type[BaseException], ], default (OSError,)) – Exceptions that should trigger file skipping instead of aborting.

  • save_form (bool or None, default None) – Persist the Awkward form extracted during preprocessing alongside the output.

  • uproot_options (dict, default {}) – Keyword arguments forwarded to uproot when opening files.

  • step_size_safety_factor (float, default 0.5) – Multiplicative safety factor applied when estimating step sizes.

  • allow_empty_datasets (bool, default False) – Whether to keep datasets that produce zero valid chunks.
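Since the output is described as gzip-compressed files named *_available.json.gz and *_all.json.gz, reading one back can be sketched with the standard library alone. The fileset content below is a hypothetical placeholder rather than the real structure written by do_preprocess, and load_fileset is an illustrative helper name.

```python
import gzip
import json
import os
import tempfile


def load_fileset(path):
    """Read a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical placeholder content; the real layout is defined by coffea.
placeholder = {"my_dataset": {"files": {"file.root": "Events"}}}

with tempfile.TemporaryDirectory() as tmp:
    # "fileset" plays the role of the output_file prefix.
    path = os.path.join(tmp, "fileset_available.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(placeholder, handle)
    restored = load_fileset(path)

print(restored)
```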

do_query(query=None)

Look for datasets with * wildcards (like in DAS)

Parameters:

query (str or None, default None) – The query to pass to rucio. If None, will prompt the user for an input.

do_query_results()

List the results of the last dataset query

do_regex_sites(regex=None)

Select sites with a regex for replica queries, e.g. “T[123]_(FR|IT|BE|CH|DE)_\w+”

Parameters:

regex (str or None, default None) – Sites to use for replica queries, described with a regex string.
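To illustrate how such a pattern narrows a list of sites, here is a small sketch using Python's re module. It assumes the trailing token of the example pattern is the word class `\w+`. The site names are illustrative, and the use of fullmatch is an assumption, since the matching function coffea applies internally is not stated here.

```python
import re

# Example pattern from the description, written as a raw string.
site_pattern = re.compile(r"T[123]_(FR|IT|BE|CH|DE)_\w+")

# Illustrative site names only.
sites = ["T2_DE_DESY", "T2_US_Nebraska", "T1_IT_CNAF", "T3_CH_PSI", "T0_CH_CERN"]

# Keep only the sites whose full name matches the pattern.
selected = [site for site in sites if site_pattern.fullmatch(site)]
print(selected)
```

Here the tier digit must be 1, 2, or 3 and the country code must be one of the listed alternatives, so the US site and the T0 site are dropped.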

do_replicas(mode=None, selection=None)

Query Rucio for replicas.

Parameters:
  • mode (str or None, default None) –

    Selection strategy for preferred sites. Options:
    • None: ask the user about the mode

    • round-robin: take files randomly from available sites

    • choose: ask the user to choose from a list of sites

    • first: take the first site from the rucio query

  • selection (str or None, default None) – Indices (or the literal "all") identifying datasets on which to run the replica query.
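The difference between the first and round-robin strategies can be sketched as follows. This is not coffea's implementation: pick_replicas is a hypothetical helper, and it only mirrors the descriptions above (first takes the first site listed, round-robin picks a site at random for each file).

```python
import random


def pick_replicas(files_to_sites, mode, rng=None):
    """Choose one site per file according to a simplified strategy.

    Hypothetical helper; the real selection logic lives inside coffea.
    """
    rng = rng or random.Random()
    chosen = {}
    for filename, sites in files_to_sites.items():
        if not sites:
            raise ValueError(f"no site available for {filename}")
        if mode == "first":
            chosen[filename] = sites[0]
        elif mode == "round-robin":
            # Described as taking files randomly from the available sites.
            chosen[filename] = rng.choice(sites)
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return chosen


# Illustrative file and site names.
files = {"a.root": ["T2_DE_DESY", "T1_IT_CNAF"], "b.root": ["T3_CH_PSI"]}
print(pick_replicas(files, "first"))
print(pick_replicas(files, "round-robin", rng=random.Random(0)))
```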

do_save(filename=None)

Save the replica information in YAML format

Parameters:

filename (str or None, default None) – The name of the file to save the information into.

do_select(selection=None, metadata=None)

Select datasets from the list of query results. Input a list of indices, optionally including ranges such as 4-6, or “all”.

Parameters:
  • selection (str or None, default None) – Space-delimited indices corresponding to selected datasets. Can include ranges (like "4-6") or the literal "all".

  • metadata (dict[Hashable, Any] or None, default None) – Metadata to store in association with the selected datasets.
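The selection string format can be illustrated with a small parser. This is a sketch under stated assumptions: parse_selection is a hypothetical helper, indices are assumed to be 1-based, ranges are assumed to be inclusive, and out-of-range indices are silently dropped. None of these choices is confirmed by the description above.

```python
def parse_selection(selection, n_results):
    """Expand a selection string such as "1 4-6" or "all" into a list of indices.

    Hypothetical helper; assumes 1-based indices and inclusive ranges.
    """
    if selection.strip() == "all":
        return list(range(1, n_results + 1))
    indices = []
    for token in selection.split():
        if "-" in token:
            low, high = token.split("-", 1)
            indices.extend(range(int(low), int(high) + 1))
        else:
            indices.append(int(token))
    # Drop anything that does not point at an existing query result.
    return [index for index in indices if 1 <= index <= n_results]


print(parse_selection("1 4-6", 10))
print(parse_selection("all", 3))
```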

do_sites_filters(ask_clear=True)

Show the active sites filters (allowed, disallowed, and regex) and ask to clear them

Parameters:

ask_clear (bool, default True) – If True, ask the user via prompt if allow, disallow, and regex filters should be cleared.

load_dataset_definition(dataset_definition, query_results_strategy='all', replicas_strategy='round-robin')

Initialize the DataDiscoveryCLI by querying the set of datasets defined in dataset_definition, then selecting results and replicas according to the given options.

Parameters:
  • dataset_definition (dict[str, dict[Hashable, Any]]) – Mapping from dataset query string to metadata to attach to the selection.

  • query_results_strategy (str, default "all") – How to decide which datasets to select. If “manual”, the user is prompted for a selection.

  • replicas_strategy (str, default "round-robin") –

    Options are:
    • “round-robin”: select randomly from the available sites for each file

    • “choose”: filter the sites with a list of indices for all the files

    • “first”: take the first result returned by rucio

    • “manual”: prompt for a manual decision dataset by dataset

Returns:

out_replicas – An uproot-readable fileset. At this point, the fileset is not fully preprocessed, but this can be done with do_preprocess().

Return type:

FilesetSpecOptional
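A possible shape for dataset_definition and a call sequence is sketched below. The dataset query string and metadata keys are hypothetical placeholders. Constructing DataDiscoveryCLI without arguments is an assumption, and running build_fileset presumably requires coffea, Rucio access, and a valid grid proxy, so the function is only defined here and never called.

```python
# Hypothetical query string and metadata keys, for illustration only.
dataset_definition = {
    "/HypotheticalPrimary*/HypotheticalCampaign*/NANOAODSIM": {
        "xsec": 1.0,
        "isMC": True,
    },
}


def build_fileset(definition):
    """Query datasets and replicas for every entry in the definition.

    Imports coffea lazily so that this sketch can be loaded without it.
    Assumes DataDiscoveryCLI() takes no required constructor arguments.
    """
    from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

    ddc = DataDiscoveryCLI()
    # Optional: restrict replica queries to sites matching a regex.
    ddc.do_regex_sites(r"T[123]_(FR|IT|BE|CH|DE)_\w+")
    return ddc.load_dataset_definition(
        definition,
        query_results_strategy="all",
        replicas_strategy="round-robin",
    )
```

The returned fileset is not fully preprocessed at this point; as noted above, do_preprocess() can be run afterwards.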

coffea.dataset_tools.dataset_query.print_dataset_query(query: str, dataset_list: dict[str, dict[str, list[str]]], console: rich.console.Console = <console width=80 None>, selected: list[str] = []) → None

Pretty-print the results of a rucio query in a table.

Parameters:
  • query (str) – The query given to rucio

  • dataset_list (dict[str, dict[str, list[str]]]) – The second output of a call to query_dataset with tree=True

  • console (Console) – A Console object to print to

  • selected (list[str], default []) – A list of selected datasets

coffea.dataset_tools.rucio_utils.get_dataset_files_replicas(dataset, allowlist_sites=None, blocklist_sites=None, regex_sites=None, mode='full', partial_allowed=False, client=None, scope='cms')

This function queries the Rucio server to get information about the location of all the replicas of the files in a CMS dataset.

The sites can be filtered in 3 different ways:
  • allowlist_sites: list of sites to select from. If the file is not found there, raise an Exception.

  • blocklist_sites: list of sites to avoid. If no site remains for a file, raise an Exception.

  • regex_sites: regular expression to restrict the list of sites.

The fileset returned by the function is controlled by the mode parameter:
  • “full”: returns the full set of replicas and sites (passing the filtering parameters)

  • “first”: returns the first replica found for each file

  • “best”: to be implemented (ServiceX..)

  • “roundrobin”: tries to distribute the replicas over different sites

Parameters:
  • dataset (str) – The dataset to search for.

  • allowlist_sites (list or None) – List of sites to select from. If the file is not found there, raise an Exception.

  • blocklist_sites (list or None) – List of sites to avoid. If no site remains for a file, raise an Exception.

  • regex_sites (list or None) – Regular expression to restrict the list of sites.

  • mode (str, default "full") – One of “full”, “first”, “best”, or “roundrobin”. Behavior of each described above.

  • client (rucio.client.Client or None, optional) – The rucio client to use. If not provided, one will be generated for you.

  • partial_allowed (bool, default False) – If False, throws an exception if any file in the dataset cannot be found. If True, will find as many files from the dataset as it can.

  • scope (str, default "cms") – The scope for rucio to search through.

Returns:

  • files (list) – Depending on mode. For "full" this is the list of replicas per file; for "first" it contains only the first replica per file.

  • sites (list) – Depending on mode. For "full" this is the list of sites where each file replica is available; for "first" it contains the site of the first replica.

  • sites_counts (dict) – Metadata counting the coverage of the dataset by site.
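The three site filters described above can be sketched for a single file as follows. This is a simplified illustration, not the library's code: filter_sites is a hypothetical helper, the order in which the filters are applied is an assumption, and regex_sites is treated as a single pattern string.

```python
import re


def filter_sites(sites, allowlist_sites=None, blocklist_sites=None, regex_sites=None):
    """Apply allowlist, blocklist, and regex filters to one file's sites.

    Hypothetical helper; raises if no site survives the filtering.
    """
    kept = list(sites)
    if allowlist_sites is not None:
        kept = [site for site in kept if site in allowlist_sites]
    if blocklist_sites is not None:
        kept = [site for site in kept if site not in blocklist_sites]
    if regex_sites is not None:
        pattern = re.compile(regex_sites)
        kept = [site for site in kept if pattern.search(site)]
    if not kept:
        raise RuntimeError("no site left after filtering")
    return kept


# Illustrative site names.
available = ["T2_DE_DESY", "T2_US_Nebraska", "T1_IT_CNAF"]
print(filter_sites(available, blocklist_sites=["T2_US_Nebraska"]))
print(filter_sites(available, regex_sites=r"T[12]_(DE|IT)_\w+"))
```

With partial_allowed=True the real function is described as tolerating files that cannot be found; this sketch always raises instead.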

coffea.dataset_tools.rucio_utils.get_proxy_path() → str

Checks if the VOMS proxy exists and if it is valid for at least 1 hour. If it exists, returns its path.

coffea.dataset_tools.rucio_utils.get_rucio_client(proxy=None) → Client

Open a client to the CMS rucio server using x509 proxy.

Parameters:

proxy (str, optional) – Proxy file to use. If not given, voms-proxy-info is used to get the currently active one.

Returns:

Rucio client connected with the resolved proxy credentials.

Return type:

Client

coffea.dataset_tools.rucio_utils.get_xrootd_sites_map()

The mapping between RSE (sites) and the xrootd prefix rules is read from /cvmfs/cms/cern.ch/SITECONF/*site*/storage.json.

This function returns the list of xrootd prefix rules for each site.

coffea.dataset_tools.rucio_utils.query_dataset(query: str, client=None, tree: bool = False, datatype='container', scope='cms')

This function uses the rucio client to query for containers or datasets.

Parameters:
  • query (str) – Pattern passed to rucio list_dids.

  • client (rucio.client.Client or None, optional) – Client instance to use. If omitted, a new client is created.

  • tree (bool, default False) – If True, return a mapping grouped by dataset components alongside the list.

  • datatype (str, default "container") – Rucio type to query: "container" (CMS dataset) or "dataset" (CMS block).

  • scope (str, default "cms") – Rucio scope to operate in.

Returns:

When tree is False, returns the matched dataset names. Otherwise returns a tuple of the flat list and a nested dictionary grouping the names by their components.

Return type:

list[str] or tuple[list[str], dict[str, dict[str, list[str]]]]
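The nested return type suggests a two-level grouping of dataset names. The sketch below shows one way such a tree could be built, assuming names of the form /primary/processed/tier and grouping by the first two components. Both the helper name group_by_components and the choice of grouping levels are assumptions, not the confirmed behavior of query_dataset.

```python
from collections import defaultdict


def group_by_components(names):
    """Group "/primary/processed/tier" names into a nested dictionary.

    Hypothetical helper; the grouping levels are an assumption.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for name in names:
        primary, processed, tier = name.strip("/").split("/")
        tree[primary][processed].append(tier)
    # Convert to plain dictionaries for a cleaner result.
    return {primary: dict(inner) for primary, inner in tree.items()}


# Illustrative dataset names.
names = ["/A/x-v1/NANOAOD", "/A/x-v2/NANOAOD", "/B/y/NANOAODSIM"]
grouped = group_by_components(names)
print(grouped)
```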