Dataset Tools

This page contains documentation for parts of the coffea.dataset_tools package that are not included in the coffea namespace. That is, they must be explicitly imported.

class coffea.dataset_tools.dataset_query.DataDiscoveryCLI[source]

Simplifies dataset query, replicas, filters, and uproot preprocessing with Dask. It can be accessed in a Python script or interpreter via this class, or from the command line (as in python -m coffea.dataset_tools.dataset_query --help).

do_allowlist_sites(sites=None)[source]

Restrict the grid sites available for the replicas query to the requested list.

Parameters:

sites (list[str] | None, default None) – The sites to allow the replicas query to look at. If passing in a list, elements of the list are sites. If passing in None, the prompt requires a single string containing a comma-separated listing.
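When sites is None, the prompt expects a single comma-separated string. A minimal sketch of turning such a string into the list form, assuming that surrounding whitespace and empty entries should be dropped (parse_site_list is a hypothetical helper, not part of coffea, and the site names are only examples):

```python
def parse_site_list(raw: str) -> list[str]:
    """Split a comma-separated string of site names into a clean list."""
    # Strip whitespace around each entry and skip empty pieces, such as
    # the one produced by a trailing comma.
    return [site.strip() for site in raw.split(",") if site.strip()]


sites = parse_site_list("T2_DE_DESY, T2_CH_CERN,T1_US_FNAL,")
print(sites)
```

The resulting list has the same shape as the list that can be passed directly as the sites argument.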

do_blocklist_sites(sites=None)[source]

Exclude grid sites from the available sites for replicas query

Parameters:

sites (list[str] | None, default None) – The sites to prevent the replicas query from looking at. If passing in a list, elements of the list are sites. If passing in None, the prompt requires a single string containing a comma-separated listing.

do_list_replicas()[source]

Print the selected file replicas for the selected dataset

do_list_selected()[source]

Print a list of the selected datasets

do_login(proxy=None)[source]

Log in to the Rucio client. Optionally, a specific proxy file can be passed to the command. If no proxy file is specified, voms-proxy-info is used to find the currently active one.

do_preprocess(output_file=None, step_size=None, align_to_clusters=None, scheduler_url=None, recalculate_steps=None, files_per_batch=None, file_exceptions=(OSError,), save_form=None, uproot_options={}, step_size_safety_factor=0.5, allow_empty_datasets=False)[source]

Perform preprocessing for concrete fileset extraction into a file, compressed with gzip.

Parameters:
  • output_file (str | None, default None) – The name of the file to write the preprocessed file into

  • step_size (int | None, default None) – The chunk size for file splitting

  • align_to_clusters (bool | None, default None) – Whether or not to round to the cluster size in a ROOT file. See the align_clusters parameter in coffea.dataset_tools.preprocess.

  • scheduler_url (str | None, default None) – Dask scheduler URL where the preprocessing should take place
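To illustrate what a step_size and a gzip-compressed output imply, here is a minimal sketch using only the standard library. It assumes the output is gzip-compressed JSON and invents a simple layout for it; the source only states that the file is compressed with gzip, so the format, the layout, and the make_steps helper are all assumptions. It also ignores align_to_clusters and step_size_safety_factor.

```python
import gzip
import json
import os
import tempfile


def make_steps(num_entries: int, step_size: int) -> list[list[int]]:
    """Split [0, num_entries) into [start, stop] chunks of at most step_size entries."""
    return [
        [start, min(start + step_size, num_entries)]
        for start in range(0, num_entries, step_size)
    ]


# Hypothetical content: one file with 250 entries split into steps of 100.
payload = {"example_dataset": {"files": {"file.root": {"steps": make_steps(250, 100)}}}}

# Write the structure as gzip-compressed JSON (assumed format).
path = os.path.join(tempfile.mkdtemp(), "fileset.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(payload, handle)

# Read it back the same way.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored["example_dataset"]["files"]["file.root"]["steps"])
```

The last chunk is shorter than step_size whenever the number of entries is not an exact multiple of it.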

do_query(query=None)[source]

Look for datasets with * wildcards (like in DAS)

Parameters:

query (str | None, default None) – The query to pass to rucio. If None, will prompt the user for an input.
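The effect of a * wildcard can be illustrated locally with the standard library. This is only a sketch of the matching idea: the real matching is done by Rucio, and the dataset names below are invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical dataset names, invented for illustration only.
names = [
    "/SampleA_13TeV/Campaign2018-v1/NANOAODSIM",
    "/SampleA_13TeV/Campaign2018-v1/MINIAODSIM",
    "/SampleB_13TeV/Campaign2017-v2/NANOAODSIM",
]

# Each * stands for any run of characters.
query = "/SampleA_*/*/NANOAODSIM"
matches = [name for name in names if fnmatchcase(name, query)]
print(matches)
```

Only the first name matches here, because the second has a different data tier and the third a different sample name.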

do_query_results()[source]

List the results of the last dataset query

do_regex_sites(regex=None)[source]

Select sites with a regex for replica queries, e.g. “T[123]_(FR|IT|BE|CH|DE)_\w+”

Parameters:

regex (str | None, default None) – Sites to use for replica queries, described with a regex string.
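A short sketch of how such a regex narrows a list of sites, using Python's re module. It assumes the example pattern is meant to end in \w+ and that the pattern has to match the whole site name; how coffea applies the pattern internally is not stated here. The site names are examples only.

```python
import re

pattern = re.compile(r"T[123]_(FR|IT|BE|CH|DE)_\w+")

# Example site names, used here only for illustration.
sites = ["T2_DE_DESY", "T2_CH_CERN", "T1_US_FNAL", "T3_IT_Trieste", "T0_CH_CERN"]

# Assumption: the pattern has to match the whole site name.
selected = [site for site in sites if pattern.fullmatch(site)]
print(selected)
```

T1_US_FNAL is dropped because US is not among the listed countries, and T0_CH_CERN because tier 0 is not in [123].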

do_replicas(mode=None, selection=None)[source]

Query Rucio for replicas.

Parameters:
  • mode (str, default None) –

    One of the following:
    • None: ask the user about the mode

    • round-robin: take files randomly from available sites

    • choose: ask the user to choose from a list of sites

    • first: take the first site from the rucio query

  • selection (str, default None) – list of indices or ‘all’ to select all the selected datasets for replicas query
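The difference between the first and round-robin modes can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the coffea implementation: pick_sites and the replica listing are hypothetical, and round-robin is modelled as a random choice per file, as described above.

```python
import random

# Hypothetical replica listing: file name -> sites holding a copy.
replicas = {
    "file_1.root": ["T2_DE_DESY", "T2_CH_CERN"],
    "file_2.root": ["T2_CH_CERN"],
    "file_3.root": ["T1_US_FNAL", "T2_DE_DESY", "T2_CH_CERN"],
}


def pick_sites(replicas, mode, rng=None):
    """Return one site per file according to the chosen mode."""
    rng = rng or random.Random()
    chosen = {}
    for name, sites in replicas.items():
        if mode == "first":
            # Always take the first site in the listing.
            chosen[name] = sites[0]
        elif mode == "round-robin":
            # Take a random site among those holding the file.
            chosen[name] = rng.choice(sites)
        else:
            raise ValueError(f"unsupported mode: {mode!r}")
    return chosen


print(pick_sites(replicas, "first"))
print(pick_sites(replicas, "round-robin", rng=random.Random(0)))
```

Passing a seeded random.Random makes the random choice reproducible, which can help when comparing runs.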

do_save(filename=None)[source]

Save the replica information in yaml format

Parameters:

filename (str | None, default None) – The name of the file to save the information into

do_select(selection=None, metadata=None)[source]

Select the datasets from the list of query results. Input a list of indices, which may include ranges such as 4-6, or “all”.

Parameters:
  • selection (list[str] | None, default None) – A list of indices corresponding to selected datasets. Should be a string, with indices separated by spaces. Can include ranges (like “4-6”) or “all”.

  • metadata (dict[Hashable,Any], default None) – Metadata to store in association with the selected datasets.
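A selection string such as “1 3 4-6” can be expanded into individual indices as in the sketch below. It assumes that ranges include both ends and that “all” means every index from 1 up to the number of query results; parse_selection is a hypothetical helper, and the indexing convention used by coffea may differ.

```python
def parse_selection(selection: str, num_results: int) -> list[int]:
    """Expand a selection string such as "1 3 4-6" or "all" into a list of indices."""
    if selection.strip() == "all":
        # Assumption: indices start at 1.
        return list(range(1, num_results + 1))
    indices = []
    for token in selection.split():
        if "-" in token:
            start, stop = token.split("-")
            # Assumption: both ends of the range are included.
            indices.extend(range(int(start), int(stop) + 1))
        else:
            indices.append(int(token))
    return indices


print(parse_selection("1 3 4-6", num_results=8))
print(parse_selection("all", num_results=3))
```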

do_sites_filters(ask_clear=True)[source]

Show the active sites filters (allowed, disallowed, and regex) and ask to clear them

Parameters:

ask_clear (bool, default True) – If True, ask the user via prompt if allow, disallow, and regex filters should be cleared.

load_dataset_definition(dataset_definition, query_results_strategy='all', replicas_strategy='round-robin')[source]

Initialize the DataDiscoveryCLI by querying the set of datasets defined in dataset_definition, then selecting results and replicas according to the given options.

Parameters:
  • dataset_definition (Dict[str,Dict[Hashable,Any]]) – Keys are dataset queries (i.e., something that can be passed to do_query())

  • query_results_strategy (str, default "all") – How to decide which datasets to select. If “manual”, user will be prompted for selection

  • replicas_strategy (str, default "round-robin") –

    Options are:
    • “round-robin”: select randomly from the available sites for each file

    • “choose”: filter the sites with a list of indices for all the files

    • “first”: take the first result returned by rucio

    • “manual”: prompt for a manual decision dataset by dataset

Returns:

out_replicas – An uproot-readable fileset. At this point, the fileset is not fully preprocessed, but this can be done with do_preprocess().

Return type:

FilesetSpecOptional
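A sketch of how this method might be driven from a script, using only the signatures documented on this page. The query strings and the content of each value dictionary are placeholders, and the "metadata" key is an assumption about the expected layout. Actually calling build_fileset requires coffea, access to Rucio, and a valid VOMS proxy, so the function is defined here but not called.

```python
# Hypothetical definition: the query strings and the metadata are placeholders.
dataset_definition = {
    "/SampleA_*/Campaign2018*/NANOAODSIM": {"metadata": {"is_mc": True}},
    "/SampleB_*/Campaign2018*/NANOAODSIM": {"metadata": {"is_mc": True}},
}


def build_fileset(definition):
    """Query Rucio for every entry in definition and return the resulting fileset."""
    # Imported lazily so that this sketch can be loaded without coffea installed.
    from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

    ddc = DataDiscoveryCLI()
    # Optional: restrict the replica queries to a subset of sites.
    ddc.do_regex_sites(r"T[123]_(FR|IT|BE|CH|DE)_\w+")
    return ddc.load_dataset_definition(
        definition,
        query_results_strategy="all",
        replicas_strategy="round-robin",
    )
```

With the non-interactive strategies “all” and “round-robin”, no prompt should be needed, which suits batch scripts; “manual” is intended for interactive use.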

coffea.dataset_tools.dataset_query.print_dataset_query(query: str, dataset_list: Dict[str, Dict[str, list[str]]], console: Console, selected: list[str] = []) → None[source]

Pretty-print the results of a rucio query in a table.

Parameters:
  • query (str) – The query given to rucio

  • dataset_list (dict[str, dict[str,list[str]]]) – The second output of a call to query_dataset with tree=True

  • console (Console) – A Console object to print to

  • selected (list[str], default []) – A list of selected datasets

coffea.dataset_tools.rucio_utils.get_dataset_files_replicas(dataset, allowlist_sites=None, blocklist_sites=None, regex_sites=None, mode='full', partial_allowed=False, client=None, scope='cms')[source]

This function queries the Rucio server to get information about the location of all the replicas of the files in a CMS dataset.

The sites can be filtered in 3 different ways:
  • allowlist_sites: list of sites to select from. If the file is not found there, raise an Exception.

  • blocklist_sites: list of sites to avoid. If no site is left for the file, raise an Exception.

  • regex_sites: regular expression to restrict the list of sites.

The fileset returned by the function is controlled by the mode parameter:
  • “full”: returns the full set of replicas and sites (passing the filtering parameters)

  • “first”: returns the first replica found for each file

  • “best”: to be implemented (ServiceX..)

  • “roundrobin”: try to distribute the replicas over different sites

Parameters:
  • dataset (str) – The dataset to search for.

  • allowlist_sites (list) – List of sites to select from. If the file is not found there, raise an Exception.

  • blocklist_sites (list) – List of sites to avoid. If no site is left for the file, raise an Exception.

  • regex_sites (list) – Regular expression to restrict the list of sites.

  • mode (str, default "full") – One of “full”, “first”, “best”, or “roundrobin”. Behavior of each described above.

  • client (rucio Client, optional) – The rucio client to use. If not provided, one will be generated for you.

  • partial_allowed (bool, default False) – If False, throws an exception if any file in the dataset cannot be found. If True, will find as many files from the dataset as it can.

  • scope (rucio scope, default "cms") – The scope for rucio to search through.

Returns:

  • files (list) – depending on the mode option:
    • If mode=="full", returns the complete list of replicas for each file in the dataset

    • If mode=="first", returns only the first replica for each file.

  • sites (list) – depending on the mode option:
    • If mode=="full", returns the list of sites where the file replica is available for each file in the dataset

    • If mode=="first", returns a list of sites for the first replica of each file.

  • sites_counts (dict) – Metadata counting the coverage of the dataset by site
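The filtering rules, the partial_allowed switch, and the per-site counts can be sketched with plain Python. This is an illustration under stated assumptions, not the library's code: filter_sites, select_replicas, and the replica listing are hypothetical, and the counts are modelled as the number of files each site can serve.

```python
from collections import Counter


def filter_sites(sites, allowlist=None, blocklist=None):
    """Apply an allowlist and a blocklist to the sites holding one file."""
    kept = list(sites)
    if allowlist is not None:
        kept = [site for site in kept if site in allowlist]
    if blocklist is not None:
        kept = [site for site in kept if site not in blocklist]
    return kept


def select_replicas(replicas, allowlist=None, blocklist=None, partial_allowed=False):
    """Filter every file's sites and count how many files each site can serve."""
    selected = {}
    for name, sites in replicas.items():
        kept = filter_sites(sites, allowlist, blocklist)
        if not kept:
            if partial_allowed:
                continue  # skip files with no usable replica
            raise RuntimeError(f"no site left for {name}")
        selected[name] = kept
    counts = Counter(site for kept in selected.values() for site in kept)
    return selected, dict(counts)


# Hypothetical replica listing: file name -> sites holding a copy.
replicas = {
    "file_1.root": ["T2_DE_DESY", "T2_CH_CERN"],
    "file_2.root": ["T1_US_FNAL"],
}

selected, counts = select_replicas(
    replicas, blocklist=["T1_US_FNAL"], partial_allowed=True
)
print(selected)
print(counts)
```

With partial_allowed=False, the same call raises an error, because file_2.root has no site left after the blocklist is applied.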

coffea.dataset_tools.rucio_utils.get_proxy_path() → str[source]

Checks whether the VOMS proxy exists and is valid for at least 1 hour. If so, returns its path.
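A check of this kind can be sketched with subprocess. The sketch assumes that the voms-proxy-info command accepts the -exists, -hours, and -path options, as in standard VOMS client installations. The helper name find_valid_proxy is hypothetical, and returning None when no usable proxy is found is a choice made for this sketch; the library function may signal that case differently.

```python
import subprocess


def find_valid_proxy(min_hours: int = 1):
    """Return the proxy path if a proxy valid for min_hours exists, else None."""
    try:
        # Exit status is non-zero when no proxy exists or it expires too soon.
        subprocess.run(
            ["voms-proxy-info", "-exists", "-hours", str(min_hours)],
            check=True,
            capture_output=True,
        )
        # Ask for the location of the proxy file.
        result = subprocess.run(
            ["voms-proxy-info", "-path"],
            check=True,
            capture_output=True,
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # voms-proxy-info is not installed, or there is no usable proxy.
        return None
    return result.stdout.strip()


proxy = find_valid_proxy()
print(proxy if proxy else "no valid proxy found")
```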

coffea.dataset_tools.rucio_utils.get_rucio_client(proxy=None) → Client[source]

Open a client to the CMS rucio server using an x509 proxy.

Parameters:

proxy (str, optional) – Use the provided proxy file if given; otherwise, use voms-proxy-info to get the currently active one.

Returns:

nativeClient – Rucio client

Return type:

rucio.Client

coffea.dataset_tools.rucio_utils.get_xrootd_sites_map()[source]

The mapping between RSE (sites) and the xrootd prefix rules is read from /cvmfs/cms.cern.ch/SITECONF/*site*/storage.json.

This function returns the list of xrootd prefix rules for each site.

coffea.dataset_tools.rucio_utils.query_dataset(query: str, client=None, tree: bool = False, datatype='container', scope='cms')[source]

This function uses the rucio client to query for containers or datasets.

Parameters:
  • query (str) – Query to filter datasets / containers with the rucio list_dids functions

  • client (rucio Client) – The rucio client to use. If not provided, one will be generated for you

  • tree (bool, default False) – If True, return the results splitting the dataset name in parts

  • datatype (str, default "container") – Either “container” or “dataset”, in rucio terminology: a rucio container is a CMS dataset, and a rucio dataset is a CMS block.

  • scope (str, default "cms") – Rucio instance

Returns:

  • List of containers/datasets. If tree==True, returns the list of datasets and also a dictionary decomposing the dataset names into the first, common part and a list of the available second parts.
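The decomposition returned with tree=True can be pictured with the sketch below. It assumes the nested layout suggested by the dataset_list type of print_dataset_query, dict[str, dict[str, list[str]]], and names of the form /primary/processed/tier; build_tree and the dataset names are hypothetical.

```python
from collections import defaultdict


def build_tree(dataset_names):
    """Group names of the form /primary/processed/tier by their parts."""
    tree = defaultdict(lambda: defaultdict(list))
    for name in dataset_names:
        primary, processed, tier = name.strip("/").split("/")
        tree[primary][processed].append(tier)
    # Convert to plain dicts for easier printing and comparison.
    return {primary: dict(rest) for primary, rest in tree.items()}


# Hypothetical dataset names, invented for illustration only.
names = [
    "/SampleA_13TeV/Campaign2018-v1/NANOAODSIM",
    "/SampleA_13TeV/Campaign2018-v1/MINIAODSIM",
    "/SampleA_13TeV/Campaign2017-v2/NANOAODSIM",
]

tree = build_tree(names)
print(tree)
```

Grouping by the shared first part keeps a long list of query results readable, since each sample name appears only once.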