API Reference Guide
Coffea: a column object framework for effective analysis.
When executing
import coffea
a subset of the full coffea package is imported into the python environment.
Some packages must be imported explicitly, to avoid importing unnecessary
and/or heavy dependencies. The packages available in the coffea namespace are
listed below, followed by documentation for some of the coffea packages that
need to be imported explicitly.
In coffea Namespace
Tools of general use for columnar analysis
BTag tools: CMS analysis-level b-tagging corrections and uncertainties
JetMET tools: CMS analysis-level jet corrections and uncertainties
Lookup tools
Tools to parse CMS luminosity non-event data
Tools to interface with various ML inference services
NanoEvents and helpers
Basic NanoEvents and NanoCollection mixins
Physics object candidate mixin
Mixins for the CMS NanoAOD schema
2D, 3D, and Lorentz vector class mixins
A framework for analysis scale-out
Utility functions
Not in coffea Namespace
Here is documentation for some of the packages that are not automatically
imported on a call to import coffea.
This page contains documentation for parts of the coffea.dataset_tools
package that are not included in the coffea namespace. That is, they
must be explicitly imported.
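As a minimal sketch of what "explicitly imported" means in practice, the snippet below names the submodule in full rather than relying on `import coffea`. The helper `import_explicitly` is hypothetical (it is not part of coffea) and only guards against environments where coffea or one of its dependencies is not installed.

```python
import importlib


def import_explicitly(module_name):
    """Return the imported module, or None if it (or a dependency) is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# The submodule is requested by its full dotted name, not via `import coffea`.
dataset_query = import_explicitly("coffea.dataset_tools.dataset_query")
if dataset_query is None:
    print("coffea.dataset_tools.dataset_query is not importable in this environment")
else:
    print("loaded", dataset_query.__name__)
```

In a regular analysis script the equivalent one-liner would simply be `from coffea.dataset_tools.dataset_query import DataDiscoveryCLI`.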
- class coffea.dataset_tools.dataset_query.DataDiscoveryCLI
Simplifies dataset queries, replica lookup, site filters, and uproot preprocessing with Dask. It can be accessed in a Python script or interpreter via this class, or from the command line (as in python -m coffea.dataset_tools.dataset_query --help).
- do_allowlist_sites(sites=None)
Restrict the grid sites available for the replica query to the requested list.
- do_blocklist_sites(sites=None)
Exclude grid sites from the sites available for the replica query.
- do_login(proxy=None)
Log in to the rucio client. Optionally, a specific proxy file can be passed to the command. If the proxy file is not specified, voms-proxy-info is used.
- do_preprocess(output_file=None, step_size=None, align_to_clusters=None, scheduler_url=None, recalculate_steps=None, files_per_batch=None, file_exceptions=(OSError,), save_form=None, uproot_options={}, step_size_safety_factor=0.5, allow_empty_datasets=False)
Perform preprocessing for concrete fileset extraction into a file, compressed with gzip.
- Parameters:
output_file (str or None, default None) – Target prefix for the generated *_available.json.gz and *_all.json.gz files.
step_size (int or None, default None) – Chunk size (number of events) to process per step.
align_to_clusters (bool or None, default None) – Whether to align step boundaries to ROOT cluster boundaries. Mirrors the align_clusters argument of coffea.dataset_tools.preprocess.
scheduler_url (str or None, default None) – Dask scheduler URL on which to run preprocessing.
recalculate_steps (bool or None, default None) – Recompute step definitions even if cached values are present.
files_per_batch (int or None, default None) – Number of files to send to each preprocessing task.
file_exceptions (tuple[type[BaseException],], default (OSError,)) – Exceptions that should trigger file skipping instead of aborting.
save_form (bool or None, default None) – Persist the Awkward form extracted during preprocessing alongside the output.
uproot_options (dict, default {}) – Keyword arguments forwarded to uproot when opening files.
step_size_safety_factor (float, default 0.5) – Multiplicative safety factor applied when estimating step sizes.
allow_empty_datasets (bool, default False) – Whether to keep datasets that produce zero valid chunks.
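Since the output is described as gzip-compressed files ending in .json.gz, it may help to see how such a file can be read back with the standard library. This is a sketch under assumptions: the file name `example_available.json.gz` and the content of `example_fileset` are hypothetical stand-ins, not the exact structure written by do_preprocess.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json_gz(path, payload):
    """Serialize payload as JSON into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)


def read_json_gz(path):
    """Load a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical stand-in for a preprocessed fileset.
example_fileset = {
    "ExampleDataset": {
        "files": {"root://example.invalid//store/file_1.root": "Events"},
        "metadata": {"isMC": True},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_available.json.gz"
    write_json_gz(target, example_fileset)
    restored = read_json_gz(target)

print(restored == example_fileset)
```

Only `read_json_gz` is needed to inspect a file that already exists; the writer is included so the example runs on its own.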
- do_regex_sites(regex=None)
Select sites with a regex for replica queries, e.g. "T[123]_(FR|IT|BE|CH|DE)_\w+".
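To illustrate what a site regex of this kind selects, here is a small standalone sketch using Python's `re` module. It assumes the trailing part of the example pattern is meant as `\w+`, and that the pattern is matched from the start of the site name; `select_sites` is a hypothetical helper and not coffea's implementation.

```python
import re


def select_sites(sites, pattern):
    """Keep the site names that match the pattern from their first character."""
    compiled = re.compile(pattern)
    return [site for site in sites if compiled.match(site)]


# Example site names used purely for illustration.
example_sites = ["T2_DE_DESY", "T2_US_MIT", "T1_FR_CCIN2P3", "T3_CH_PSI"]
selected = select_sites(example_sites, r"T[123]_(FR|IT|BE|CH|DE)_\w+")
print(selected)
```

The US site is dropped because its country code is not in the alternation; the tier digit and the site suffix are left unconstrained.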
- do_replicas(mode=None, selection=None)
Query Rucio for replicas.
- Parameters:
mode (str or None, default None) – Selection strategy for preferred sites. Options:
None: ask the user about the mode
round-robin: take files randomly from available sites
choose: ask the user to choose from a list of sites
first: take the first site from the rucio query
selection (str or None, default None) – Indices (or the literal "all") identifying datasets on which to run the replica query.
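The difference between the "first" and "round-robin" strategies can be sketched in a few lines. This is an illustration only: `pick_replica` and the `replicas` mapping are hypothetical, and the random choice follows the description of round-robin given above rather than the library's source.

```python
import random


def pick_replica(sites, mode, rng=None):
    """Choose one site for a file according to the requested strategy."""
    if not sites:
        raise ValueError("no site available for this file")
    if mode == "first":
        return sites[0]
    if mode == "round-robin":
        if rng is None:
            rng = random.Random()
        return rng.choice(sites)
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical replica listing: file name -> sites holding a copy.
replicas = {
    "file_a.root": ["T2_DE_DESY", "T2_US_MIT"],
    "file_b.root": ["T1_FR_CCIN2P3"],
}

rng = random.Random(0)  # seeded so repeated runs pick the same sites
first_choice = {name: pick_replica(sites, "first") for name, sites in replicas.items()}
spread_choice = {name: pick_replica(sites, "round-robin", rng) for name, sites in replicas.items()}
print(first_choice)
```

The interactive modes (None and choose) are omitted because they depend on user input.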
- do_select(selection=None, metadata=None)
Select datasets from the list of query results. Input a list of indices, optionally including ranges such as 4-6, or "all".
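A selection string mixing single indices and ranges could be expanded along the following lines. The details here are assumptions made for the sketch: whitespace-separated tokens, 1-based indices, and inclusive ranges. `parse_selection` is a hypothetical helper, not the parser used by do_select.

```python
def parse_selection(text, n_items):
    """Expand a selection such as "1 3 4-6" or "all" into 1-based indices."""
    text = text.strip()
    if text == "all":
        return list(range(1, n_items + 1))
    indices = []
    for token in text.split():
        if "-" in token:
            # A range like "4-6" is treated as inclusive on both ends.
            start, stop = (int(part) for part in token.split("-", 1))
            indices.extend(range(start, stop + 1))
        else:
            indices.append(int(token))
    out_of_range = [index for index in indices if not 1 <= index <= n_items]
    if out_of_range:
        raise ValueError(f"indices out of range: {out_of_range}")
    return indices


print(parse_selection("1 3 4-6", n_items=8))
print(parse_selection("all", n_items=3))
```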
- do_sites_filters(ask_clear=True)
Show the active site filters (allowed, disallowed, and regex) and ask whether to clear them.
- load_dataset_definition(dataset_definition, query_results_strategy='all', replicas_strategy='round-robin')
Initialize the DataDiscoveryCLI by querying a set of datasets defined in dataset_definition, then select results and replicas following the options.
- Parameters:
dataset_definition (dict[str, dict[Hashable, Any]]) – Mapping from dataset query string to metadata to attach to the selection.
query_results_strategy (str, default "all") – How to decide which datasets to select. If "manual", the user will be prompted for a selection.
replicas_strategy (str, default "round-robin") – Options are:
"round-robin": select randomly from the available sites for each file
"choose": filter the sites with a list of indices for all the files
"first": take the first result returned by rucio
"manual": prompt for a manual decision dataset by dataset
- Returns:
out_replicas – An uproot-readable fileset. At this point, the fileset is not fully preprocessed, but this can be done with do_preprocess().
- Return type:
FilesetSpecOptional
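The shape expected for dataset_definition, a mapping from query string to a metadata dictionary, can be shown with a small example. The query strings and metadata keys below are placeholders chosen for illustration, and `check_dataset_definition` is a hypothetical helper that only verifies the documented type.

```python
def check_dataset_definition(definition):
    """Verify the documented shape: dict[str, dict[Hashable, Any]]."""
    for query, metadata in definition.items():
        if not isinstance(query, str):
            raise TypeError(f"query must be a string, got {type(query).__name__}")
        if not isinstance(metadata, dict):
            raise TypeError(f"metadata for {query!r} must be a dict")
    return True


# Placeholder queries and metadata; real values depend on the analysis.
dataset_definition = {
    "/ExamplePrimaryA*/ExampleCampaign*/NANOAODSIM": {
        "short_name": "SampleA",
        "metadata": {"xsec": 1.0, "isMC": True},
    },
    "/ExamplePrimaryB/ExampleRun*/NANOAOD": {
        "short_name": "SampleB",
        "metadata": {"isMC": False},
    },
}

print(check_dataset_definition(dataset_definition))

# With coffea and rucio access available, the call would look like this
# (assuming the class can be constructed without arguments):
#   cli = DataDiscoveryCLI()
#   fileset = cli.load_dataset_definition(
#       dataset_definition,
#       query_results_strategy="all",
#       replicas_strategy="round-robin",
#   )
```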
- coffea.dataset_tools.dataset_query.print_dataset_query(query: str, dataset_list: dict[str, dict[str, list[str]]], console: rich.console.Console = <console width=80 None>, selected: list[str] = []) -> None
Pretty-print the results of a rucio query in a table.
- Parameters:
query (str) – The query given to rucio.
dataset_list (dict[str, dict[str, list[str]]]) – The second output of a call to query_dataset with tree=True.
console (Console) – A Console object to print to.
selected (list[str], default []) – A list of selected datasets.
- coffea.dataset_tools.rucio_utils.get_dataset_files_replicas(dataset, allowlist_sites=None, blocklist_sites=None, regex_sites=None, mode='full', partial_allowed=False, client=None, scope='cms')
This function queries the Rucio server to get information about the location of all the replicas of the files in a CMS dataset.
The sites can be filtered in 3 different ways:
  - allowlist_sites: list of sites to select from. If the file is not found there, raise an Exception.
  - blocklist_sites: list of sites to avoid. If no site is left for a file, raise an Exception.
  - regex_sites: regular expression to restrict the list of sites.
The fileset returned by the function is controlled by the mode parameter:
  - "full": returns the full set of replicas and sites (passing the filtering parameters)
  - "first": returns the first replica found for each file
  - "best": to be implemented (ServiceX..)
  - "roundrobin": try to distribute the replicas over different sites
- Parameters:
dataset (str) – The dataset to search for.
allowlist_sites (list or None) – List of sites to select from. If the file is not found there, raise an Exception.
blocklist_sites (list or None) – List of sites to avoid. If no site is left for a file, raise an Exception.
regex_sites (list or None) – Regular expression to restrict the list of sites.
mode (str, default "full") – One of "full", "first", "best", or "roundrobin". Behavior of each described above.
client (rucio.client.Client or None, optional) – The rucio client to use. If not provided, one will be generated for you.
partial_allowed (bool, default False) – If False, throws an exception if any file in the dataset cannot be found. If True, will find as many files from the dataset as it can.
scope (str, default "cms") – The scope for rucio to search through.
- Returns:
files (list) – Depending on mode. For "full" this is the list of replicas per file; for "first" it contains only the first replica per file.
sites (list) – Depending on mode. For "full" this is the list of sites where each file replica is available; for "first" it contains the site of the first replica.
sites_counts (dict) – Metadata counting the coverage of the dataset by site.
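The filtering rules and the per-site coverage count described for this function can be sketched without a Rucio connection. This is not the library's code: `filter_sites` is hypothetical, it applies the three filters one after another (the real function may treat them as alternatives), and the replica listing is made up for the example.

```python
import re
from collections import Counter


def filter_sites(file_sites, allowlist=None, blocklist=None, regex=None):
    """Return the sites holding one file that pass the requested filters."""
    sites = list(file_sites)
    if allowlist:
        sites = [site for site in sites if site in allowlist]
        if not sites:
            raise LookupError("file not found on any allowlisted site")
    if blocklist:
        sites = [site for site in sites if site not in blocklist]
        if not sites:
            raise LookupError("no site left after applying the blocklist")
    if regex:
        compiled = re.compile(regex)
        sites = [site for site in sites if compiled.search(site)]
    return sites


# Made-up replica listing: file name -> sites holding a copy.
replicas = {
    "file_a.root": ["T2_DE_DESY", "T2_US_MIT"],
    "file_b.root": ["T2_DE_DESY", "T1_FR_CCIN2P3"],
}

filtered = {name: filter_sites(sites, blocklist=["T2_US_MIT"]) for name, sites in replicas.items()}
# Count how many files each surviving site can serve.
sites_counts = Counter(site for sites in filtered.values() for site in sites)
print(dict(sites_counts))
```

A count like this makes it easy to spot sites that cover only a small part of a dataset before choosing an allowlist.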
- coffea.dataset_tools.rucio_utils.get_proxy_path() -> str
Checks if the VOMS proxy exists and if it is valid for at least 1 hour. If it exists, returns its path.
- coffea.dataset_tools.rucio_utils.get_rucio_client(proxy=None) -> Client
Open a client to the CMS rucio server using an x509 proxy.
- Parameters:
proxy (str, optional) – Use the provided proxy file if given; otherwise use voms-proxy-info to get the currently active one.
- Returns:
Rucio client connected with the resolved proxy credentials.
- Return type:
Client
- coffea.dataset_tools.rucio_utils.get_xrootd_sites_map()
The mapping between RSE (sites) and the xrootd prefix rules is read from /cvmfs/cms/cern.ch/SITECONF/*site*/storage.json. This function returns the list of xrootd prefix rules for each site.
- coffea.dataset_tools.rucio_utils.query_dataset(query: str, client=None, tree: bool = False, datatype='container', scope='cms')
This function uses the rucio client to query for containers or datasets.
- Parameters:
query (str) – Pattern passed to rucio list_dids.
client (rucio.client.Client or None, optional) – Client instance to use. If omitted, a new client is created.
tree (bool, default False) – If True, return a mapping grouped by dataset components alongside the list.
datatype (str, default "container") – Rucio type to query: "container" (CMS dataset) or "dataset" (CMS block).
scope (str, default "cms") – Rucio scope to operate in.
- Returns:
When tree is False, returns the matched dataset names. Otherwise returns a tuple of the flat list and a nested dictionary grouping the names by their components.
- Return type:
list[str] or tuple[list[str], dict[str, dict[str, list[str]]]]
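To make the nested return type concrete, the sketch below groups dataset names of the form /primary/conditions/tier into a dict[str, dict[str, list[str]]]. The choice of grouping keys is an assumption based on the return type, `group_by_components` is a hypothetical helper, and the dataset names are invented.

```python
from collections import defaultdict


def group_by_components(names):
    """Group "/primary/conditions/tier" names as {primary: {conditions: [tier, ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for name in names:
        primary, conditions, tier = name.strip("/").split("/")
        tree[primary][conditions].append(tier)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {primary: dict(inner) for primary, inner in tree.items()}


# Invented dataset names used only for illustration.
names = [
    "/ExampleA/Campaign1-v1/NANOAODSIM",
    "/ExampleA/Campaign2-v1/NANOAODSIM",
    "/ExampleB/Campaign1-v1/NANOAOD",
]
grouped = group_by_components(names)
print(grouped)
```

A structure of this shape is what print_dataset_query takes as its dataset_list argument.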