Dataset Tools
This page contains documentation for parts of the coffea.dataset_tools
package that are not included in the coffea namespace. That is, they
must be explicitly imported.
- class coffea.dataset_tools.dataset_query.DataDiscoveryCLI[source]
Simplifies dataset queries, replica selection, site filtering, and uproot preprocessing with Dask. It can be used from a Python script or interpreter via this class, or from the command line (as in python -m coffea.dataset_tools.dataset_query --help).
- do_allowlist_sites(sites=None)[source]
Restrict the grid sites available for replica queries to the requested list
- do_blocklist_sites(sites=None)[source]
Exclude grid sites from the available sites for replicas query
- do_login(proxy=None)[source]
Log in to the rucio client. Optionally, a specific proxy file can be passed to the command. If no proxy file is specified, voms-proxy-info is used.
- do_preprocess(output_file=None, step_size=None, align_to_clusters=None, scheduler_url=None, recalculate_steps=None, files_per_batch=None, file_exceptions=(<class 'OSError'>, ), save_form=None, uproot_options={}, step_size_safety_factor=0.5, allow_empty_datasets=False)[source]
Perform preprocessing for concrete fileset extraction into a file, compressed with gzip.
- Parameters:
output_file (str | None, default None) – The name of the file to write the preprocessed fileset into
step_size (int | None, default None) – The chunk size for file splitting
align_to_clusters (bool | None, default None) – Whether or not to round to the cluster size in a ROOT file. See the align_clusters parameter in coffea.dataset_tools.preprocess.
scheduler_url (str | None, default None) – Dask scheduler URL where the preprocessing should take place
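The following is a minimal sketch of reading a gzip-compressed file back into Python. It assumes the content is JSON, which this page does not state; the file name, the fileset layout, and the helper names `write_gzipped_json` and `read_gzipped_json` are hypothetical and exist only for this illustration.

```python
import gzip
import json
import os
import tempfile


def write_gzipped_json(path, payload):
    """Write a JSON-serializable object to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)


def read_gzipped_json(path):
    """Read a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical fileset content, for illustration only (not real output).
example_fileset = {
    "ExampleDataset": {
        "files": {
            "root://example.invalid//store/file_1.root": {
                "object_path": "Events",
                "steps": [[0, 50000], [50000, 100000]],
            }
        }
    }
}

# Round-trip through a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "fileset_example.json.gz")
    write_gzipped_json(path, example_fileset)
    restored = read_gzipped_json(path)
```

If the file written by do_preprocess() uses a different serialization, only `read_gzipped_json` would need to change.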
- do_query(query=None)[source]
Look for datasets with * wildcards (like in DAS)
- Parameters:
query (str | None, default None) – The query to pass to rucio. If None, will prompt the user for an input.
- do_regex_sites(regex=None)[source]
Select sites with a regex for replica queries, e.g. “T[123]_(FR|IT|BE|CH|DE)_\w+”
- Parameters:
regex (str | None, default None) – Sites to use for replica queries, described with a regex string.
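As a rough illustration of how such a pattern narrows a list of sites, the sketch below applies the example regex with Python's `re` module. The site names are made up, `select_sites` is a hypothetical helper, and the use of `fullmatch` is an assumption: this page does not say which matching function the package applies.

```python
import re

# Raw string, so that \w reaches the regex engine unchanged.
SITE_PATTERN = r"T[123]_(FR|IT|BE|CH|DE)_\w+"


def select_sites(sites, pattern):
    """Return the sites whose full name matches the pattern (assumed semantics)."""
    compiled = re.compile(pattern)
    return [site for site in sites if compiled.fullmatch(site)]


# Made-up site names, for illustration only.
example_sites = [
    "T1_DE_KIT_Disk",
    "T2_CH_CERN",
    "T2_US_Nebraska",
    "T3_IT_Trieste",
    "T0_CH_CERN_Disk",
]

selected = select_sites(example_sites, SITE_PATTERN)
```

Writing the pattern as a raw string matters here: without the `r` prefix, Python may warn about the `\w` escape in an ordinary string literal.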
- do_replicas(mode=None, selection=None)[source]
Query Rucio for replicas.
- Parameters:
mode (str, default None) – One of the following:
None: ask the user about the mode
round-robin: take files randomly from available sites
choose: ask the user to choose from a list of sites
first: take the first site from the rucio query
selection (str, default None) – List of indices, or ‘all’ to select all the selected datasets for the replicas query
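The difference between the first and round-robin modes can be sketched in a few lines. This is not the package's implementation: `pick_replica` is a hypothetical helper, the replica tuples are made up, and the random choice simply follows the description of round-robin given above.

```python
import random


def pick_replica(replicas, mode, rng=None):
    """Pick one (site, url) replica for a file according to the mode."""
    if not replicas:
        raise ValueError("no replicas available for this file")
    if mode == "first":
        # Take whatever the query returned first.
        return replicas[0]
    if mode == "round-robin":
        # Described above as a random pick among the available sites.
        rng = rng or random.Random()
        return rng.choice(replicas)
    raise ValueError(f"unsupported mode in this sketch: {mode!r}")


# Made-up replicas for one file, for illustration only.
replicas = [
    ("T2_CH_CERN", "root://a.example.invalid//store/file_1.root"),
    ("T2_DE_DESY", "root://b.example.invalid//store/file_1.root"),
]

first_choice = pick_replica(replicas, "first")
random_choice = pick_replica(replicas, "round-robin", rng=random.Random(0))
```

The choose mode is left out because it depends on interactive input from the user.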
- do_save(filename=None)[source]
Save the replica information in yaml format
- Parameters:
filename (str | None, default None) – The name of the file to save the information into
- do_select(selection=None, metadata=None)[source]
Select datasets from the list of query results. Input a list of indices, optionally with ranges such as 4-6, or “all”.
- Parameters:
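To make the selection syntax concrete, here is a minimal sketch of a parser for strings such as "1 4-6" or "all". It is an illustration only: `parse_selection` is a hypothetical helper, and the 1-based indexing, whitespace separation, and inclusive ranges are assumptions not confirmed by this page.

```python
def parse_selection(selection, n_items):
    """Expand a selection string into a list of 1-based indices (assumed convention)."""
    text = selection.strip()
    if text.lower() == "all":
        return list(range(1, n_items + 1))

    indices = []
    for token in text.split():
        if "-" in token:
            # A range such as 4-6 is treated as inclusive on both ends.
            start, stop = (int(part) for part in token.split("-", 1))
            candidates = range(start, stop + 1)
        else:
            candidates = [int(token)]
        for index in candidates:
            if not 1 <= index <= n_items:
                raise ValueError(f"index {index} is outside 1..{n_items}")
            if index not in indices:
                indices.append(index)
    return indices


some = parse_selection("1 4-6", n_items=8)
everything = parse_selection("all", n_items=3)
```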
- do_sites_filters(ask_clear=True)[source]
Show the active sites filters (allowed, disallowed, and regex) and ask to clear them
- Parameters:
ask_clear (bool, default True) – If True, ask the user via prompt if allow, disallow, and regex filters should be cleared.
- load_dataset_definition(dataset_definition, query_results_strategy='all', replicas_strategy='round-robin')[source]
Initialize the DataDiscoveryCLI by querying the set of datasets defined in dataset_definition, then selecting results and replicas according to the given options.
- Parameters:
dataset_definition (Dict[str,Dict[Hashable,Any]]) – Keys are dataset queries (i.e., something that can be passed to do_query())
query_results_strategy (str, default "all") – How to decide which datasets to select. If “manual”, user will be prompted for selection
replicas_strategy (str, default "round-robin") – Options are:
“round-robin”: select randomly from the available sites for each file
“choose”: filter the sites with a list of indices for all the files
“first”: take the first result returned by rucio
“manual”: prompt the user for a manual decision, dataset by dataset
- Returns:
out_replicas – An uproot-readable fileset. At this point, the fileset is not fully preprocessed, but this can be done with do_preprocess().
- Return type:
FilesetSpecOptional
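A possible shape for dataset_definition, together with a call using the signatures listed on this page, is sketched below. The dataset queries are placeholders, and the inner keys "short_name" and "metadata" are assumptions, since this page only specifies Dict[Hashable, Any] for the values. `build_fileset` is a hypothetical wrapper; it is defined but not called because it needs coffea, a reachable Rucio server, and a valid VOMS proxy.

```python
# Placeholder queries with * wildcards; the inner keys are assumed, not documented here.
dataset_definition = {
    "/ExamplePrimaryDataset/ExampleCampaign*/NANOAODSIM": {
        "short_name": "ExampleMC",
        "metadata": {"xsec": 100.0, "isMC": True},
    },
    "/ExampleData/ExampleRun*/NANOAOD": {
        "short_name": "ExampleData",
        "metadata": {"isMC": False},
    },
}


def build_fileset(definition):
    """Query rucio for every dataset in the definition and pick replicas.

    Imported lazily so this module loads even where coffea is not installed.
    """
    from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

    ddc = DataDiscoveryCLI()
    # Optional: restrict replicas to sites matching a regex before querying.
    ddc.do_regex_sites(r"T[123]_(FR|IT|BE|CH|DE)_\w+")
    return ddc.load_dataset_definition(
        definition,
        query_results_strategy="all",
        replicas_strategy="round-robin",
    )
```

With "all" and "round-robin" the call needs no interactive input, which makes this combination the natural one for scripts.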
- coffea.dataset_tools.dataset_query.print_dataset_query(query: str, dataset_list: Dict[str, Dict[str, list[str]]], console: Console, selected: list[str] = []) → None[source]
Pretty-print the results of a rucio query in a table.
- coffea.dataset_tools.rucio_utils.get_dataset_files_replicas(dataset, allowlist_sites=None, blocklist_sites=None, regex_sites=None, mode='full', partial_allowed=False, client=None, scope='cms')[source]
This function queries the Rucio server to get information about the location of all the replicas of the files in a CMS dataset.
The sites can be filtered in 3 different ways:
allowlist_sites: list of sites to select from. If the file is not found there, raise an Exception.
blocklist_sites: list of sites to avoid. If the file has no site left, raise an Exception.
regex_sites: regex expression to restrict the list of sites.
The fileset returned by the function is controlled by the mode parameter:
“full”: returns the full set of replicas and sites (passing the filtering parameters)
“first”: returns the first replica found for each file
“best”: to be implemented (ServiceX..)
“roundrobin”: try to distribute the replicas over different sites
- Parameters:
dataset (str) – The dataset to search for.
allowlist_sites (list) – List of sites to select from. If the file is not found there, raise an Exception.
blocklist_sites (list) – List of sites to avoid. If the file has no site left, raise an Exception.
regex_sites (list) – Regex expression to restrict the list of sites.
mode (str, default "full") – One of “full”, “first”, “best”, or “roundrobin”. Behavior of each described above.
client (rucio Client, optional) – The rucio client to use. If not provided, one will be generated for you.
partial_allowed (bool, default False) – If False, throws an exception if any file in the dataset cannot be found. If True, will find as many files from the dataset as it can.
scope (rucio scope, "cms") – The scope for rucio to search through.
- Returns:
files (list) – Depends on the mode option. If mode=="full", returns the complete list of replicas for each file in the dataset. If mode=="first", returns only the first replica for each file.
sites (list) – Depends on the mode option. If mode=="full", returns the list of sites where the file replica is available for each file in the dataset. If mode=="first", returns a list of sites for the first replica of each file.
sites_counts (dict) – Metadata counting the coverage of the dataset by site
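The allowlist and blocklist semantics and the per-site coverage count can be sketched without a Rucio connection. This is an illustration of the behavior described above, not the package's code: `filter_file_sites` and `count_site_coverage` are hypothetical helpers, the site lists are made up, and the choice of RuntimeError is an assumption because this page only says "raise an Exception".

```python
from collections import Counter


def filter_file_sites(sites, allowlist_sites=None, blocklist_sites=None):
    """Apply allowlist, then blocklist, to the sites hosting one file."""
    kept = list(sites)
    if allowlist_sites is not None:
        kept = [site for site in kept if site in allowlist_sites]
    if blocklist_sites is not None:
        kept = [site for site in kept if site not in blocklist_sites]
    if not kept:
        # Exception type chosen for this sketch only.
        raise RuntimeError("no site left for this file after filtering")
    return kept


def count_site_coverage(sites_per_file):
    """Count how many files of the dataset each site hosts."""
    return Counter(site for sites in sites_per_file for site in sites)


# Made-up replica locations for three files.
sites_per_file = [
    ["T2_CH_CERN", "T2_DE_DESY"],
    ["T2_DE_DESY"],
    ["T2_DE_DESY", "T2_US_Nebraska"],
]

filtered = [
    filter_file_sites(sites, blocklist_sites=["T2_US_Nebraska"])
    for sites in sites_per_file
]
coverage = count_site_coverage(filtered)
```

A site whose count equals the number of files hosts the whole dataset, which is what makes the count useful for choosing a site.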
- coffea.dataset_tools.rucio_utils.get_proxy_path() → str[source]
Checks whether the VOMS proxy exists and is valid for at least 1 hour. If it exists, returns its path
- coffea.dataset_tools.rucio_utils.get_rucio_client(proxy=None) → Client[source]
Open a client to the CMS rucio server using x509 proxy.
- Parameters:
proxy (str, optional) – Use the provided proxy file if given; otherwise use voms-proxy-info to get the currently active one.
- Returns:
nativeClient – Rucio client
- Return type:
rucio.Client
- coffea.dataset_tools.rucio_utils.get_xrootd_sites_map()[source]
The mapping between RSE (sites) and the xrootd prefix rules is read from /cvmfs/cms/cern.ch/SITECONF/*site*/storage.json. This function returns the list of xrootd prefix rules for each site.
- coffea.dataset_tools.rucio_utils.query_dataset(query: str, client=None, tree: bool = False, datatype='container', scope='cms')[source]
This function uses the rucio client to query for containers or datasets.
- Parameters:
query (str) – Query to filter datasets / containers with the rucio list_dids function
client (rucio Client) – The rucio client to use. If not provided, one will be generated for you
tree (bool, default False) – If True, return the results splitting the dataset name in parts
datatype (str, default "container") – Options are “container”, “dataset”. rucio terminology. “Container”==CMS dataset. “Dataset” == CMS block.
scope (str, default "cms") – Rucio instance
- Returns:
List of containers/datasets. If tree==True, returns the list of datasets and also a dictionary decomposing the dataset names into the first part and a list of available second parts.
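One way to picture the tree output is to group dataset names by their first path component. The sketch below is an assumption about that structure rather than the function's actual return value: `decompose_dataset_names` is a hypothetical helper, the dataset names are placeholders, and the real nesting may be deeper.

```python
from collections import defaultdict


def decompose_dataset_names(names):
    """Group /first/second/... names by their first part (assumed layout)."""
    tree = defaultdict(list)
    for name in names:
        parts = name.strip("/").split("/")
        first, rest = parts[0], "/".join(parts[1:])
        tree[first].append(rest)
    return dict(tree)


# Placeholder names, for illustration only.
example_names = [
    "/PrimaryA/Campaign1-v1/NANOAODSIM",
    "/PrimaryA/Campaign2-v1/NANOAODSIM",
    "/PrimaryB/Campaign1-v1/NANOAOD",
]

name_tree = decompose_dataset_names(example_names)
```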