{ "cells": [ { "cell_type": "markdown", "id": "52b09612", "metadata": {}, "source": [ "# Dataset specification with FileSpec classes\n", "\n", "This is a rendered copy of [filespec.ipynb](https://github.com/scikit-hep/coffea/blob/master/binder/filespec.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Ffilespec.ipynb)\n", "\n", "This notebook provides a comprehensive guide to using the new Pydantic-based FileSpec classes in Coffea's dataset tools. These classes provide type-safe, validated data structures for managing file specifications, datasets, and filesets in high-energy physics data analysis workflows.\n", "\n", "## Overview\n", "\n", "The FileSpec system provides:\n", "- **Type-safe data structures** with automatic validation\n", "- **Automatic format detection** for ROOT and Parquet files\n", "- **Seamless integration** with existing Coffea functions\n", "- **JSON serialization/deserialization** for data persistence\n", "- **Automatic promotion** between optional and concrete specifications\n", "\n", "## Table of Contents\n", "\n", "1. [Basic File Specifications](#basic-file-specifications)\n", "2. [InputFiles, PreprocessedFiles](#coffea-file-dict)\n", "3. [Dataset Specifications](#dataset-specifications)\n", "4. [Fileset Specifications](#fileset-specifications)\n", "5. [Integration with Preprocessing](#integration-with-preprocessing)\n", "6. [Integration with apply_to_fileset](#integration-with-apply_to_fileset)\n", "7. [Dataset Manipulation Functions](#dataset-manipulation-functions)\n", "8. [Advanced Usage Examples](#advanced-usage-examples)\n", "9. 
[Migration from Legacy Formats](#migration-from-legacy-formats)" ] }, { "cell_type": "code", "execution_count": 1, "id": "f6c75a47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FileSpec classes and dataset tools imported successfully!\n" ] } ], "source": [ "# Import necessary libraries\n", "from pydantic import ValidationError\n", "import rich\n", "import dask\n", "\n", "\n", "# Import the FileSpec classes and dataset tools\n", "from coffea.dataset_tools import (\n", " # FileSpec classes\n", " ROOTFileSpec,\n", " ParquetFileSpec,\n", " CoffeaROOTFileSpec,\n", " CoffeaROOTFileSpecOptional,\n", " CoffeaParquetFileSpec,\n", " CoffeaParquetFileSpecOptional,\n", " InputFiles,\n", " DatasetSpec,\n", " DataGroupSpec,\n", " \n", " # Dataset manipulation functions\n", " preprocess,\n", " apply_to_fileset,\n", " max_chunks,\n", " max_chunks_per_file,\n", " slice_chunks,\n", " slice_files,\n", " max_files,\n", " filter_files,\n", "\n", " # ModelFactory utility class\n", " ModelFactory,\n", ")\n", "from coffea.nanoevents import NanoAODSchema\n", "from coffea.processor.test_items import NanoEventsProcessor\n", "\n", "print(\"FileSpec classes and dataset tools imported successfully!\")" ] }, { "cell_type": "markdown", "id": "04306ba3", "metadata": {}, "source": [ "## 1. 
Basic File Specifications\n", "\n", "The FileSpec system provides several classes for representing individual file specifications:\n", "\n", "### File Specification Hierarchy\n", "\n", "- **ROOTFileSpec**: Basic specification for ROOT files\n", "- **ParquetFileSpec**: Basic specification for Parquet files \n", "- **CoffeaROOTFileSpecOptional**: ROOT files with optional metadata\n", "- **CoffeaROOTFileSpec**: ROOT files with complete metadata (required)\n", "- **CoffeaParquetFileSpecOptional**: Parquet files with optional metadata\n", "- **CoffeaParquetFileSpec**: Parquet files with complete metadata (required)" ] }, { "cell_type": "code", "execution_count": 2, "id": "2ecaf55b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Basic ROOTFileSpec ===\n", "Basic ROOT spec:\n" ] }, { "data": { "text/html": [ "
ROOTFileSpec(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " num_selected_entries=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: root\n", "Steps: None\n", "\n", "ROOT spec with steps:\n" ] }, { "data": { "text/html": [ "
ROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 1000], [1000, 2000], [2000, 3000]],\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " num_selected_entries=3000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m2000\u001b[0m, \u001b[1;36m3000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m3000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.1 Basic ROOTFileSpec for ROOT files\n", "print(\"=== Basic ROOTFileSpec ===\")\n", "\n", "# Minimal ROOT file specification\n", "uproot_spec = ROOTFileSpec(object_path=\"Events\")\n", "print(\"Basic ROOT spec:\")\n", "rich.print(uproot_spec)\n", "print(f\"Format: {uproot_spec.format}\")\n", "print(f\"Steps: {uproot_spec.steps}\")\n", "\n", "# ROOT file specification with steps\n", "uproot_spec_with_steps = ROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000], [2000, 3000]]\n", ")\n", "print(\"\\nROOT spec with steps:\")\n", "rich.print(uproot_spec_with_steps)" ] }, { "cell_type": "code", "execution_count": 3, "id": "d59e690c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Basic ParquetFileSpec ===\n", "Basic Parquet spec:\n" ] }, { "data": { "text/html": [ "
ParquetFileSpec(\n", " object_path=None,\n", " steps=None,\n", " num_entries=None,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " num_selected_entries=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: parquet\n", "Object path (always None): None\n", "\n", "Parquet spec with steps:\n" ] }, { "data": { "text/html": [ "
ParquetFileSpec(\n", " object_path=None,\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=None,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " num_selected_entries=10000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.2 Basic ParquetFileSpec for Parquet files\n", "print(\"=== Basic ParquetFileSpec ===\")\n", "\n", "# Minimal Parquet file specification\n", "parquet_spec = ParquetFileSpec()\n", "print(\"Basic Parquet spec:\")\n", "rich.print(parquet_spec)\n", "print(f\"Format: {parquet_spec.format}\")\n", "print(f\"Object path (always None): {parquet_spec.object_path}\")\n", "\n", "# Parquet file specification with steps\n", "parquet_spec_with_steps = ParquetFileSpec(\n", " steps=[[0, 5000], [5000, 10000]]\n", ")\n", "print(\"\\nParquet spec with steps:\")\n", "rich.print(parquet_spec_with_steps)" ] }, { "cell_type": "code", "execution_count": 4, "id": "85d98abf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CoffeaROOTFileSpecOptional ===\n", "Optional ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Partial ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=1000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Complete optional ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='12345678-90ab-cdef-1234-567890abcdef',\n", " num_selected_entries=2000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'12345678-90ab-cdef-1234-567890abcdef'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.3 CoffeaROOTFileSpecOptional - ROOT files with optional metadata\n", "print(\"=== CoffeaROOTFileSpecOptional ===\")\n", "\n", "# Optional specification with minimal data\n", "coffea_uproot_optional = CoffeaROOTFileSpecOptional(object_path=\"Events\")\n", "print(\"Optional ROOT spec:\")\n", "rich.print(coffea_uproot_optional)\n", "\n", "# Optional specification with some metadata\n", "coffea_uproot_partial = CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\",\n", " steps=[[0, 1000]],\n", " num_entries=1000\n", ")\n", "print(\"Partial ROOT spec:\")\n", "rich.print(coffea_uproot_partial)\n", "\n", "# Optional specification with all metadata\n", "coffea_uproot_complete = CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " 
uuid=\"12345678-90ab-cdef-1234-567890abcdef\"\n", ")\n", "print(\"Complete optional ROOT spec:\")\n", "rich.print(coffea_uproot_complete)" ] }, { "cell_type": "code", "execution_count": 5, "id": "38ee2f96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CoffeaROOTFileSpec ===\n", "Complete required ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='12345678-90ab-cdef-1234-567890abcdef',\n", " num_selected_entries=2000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'12345678-90ab-cdef-1234-567890abcdef'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Expected validation error for incomplete spec: 3 errors\n" ] }, { "data": { "text/html": [ "
3 validation errors for CoffeaROOTFileSpec\n", "steps\n", " Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n", " For further information visit https://errors.pydantic.dev/2.11/v/missing\n", "num_entries\n", " Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n", " For further information visit https://errors.pydantic.dev/2.11/v/missing\n", "uuid\n", " Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n", " For further information visit https://errors.pydantic.dev/2.11/v/missing\n", "\n" ], "text/plain": [ "\u001b[1;36m3\u001b[0m validation errors for CoffeaROOTFileSpec\n", "steps\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n", "num_entries\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n", "uuid\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.4 
CoffeaROOTFileSpec - ROOT files with required metadata\n", "print(\"=== CoffeaROOTFileSpec ===\")\n", "\n", "# Complete specification (all fields required)\n", "try:\n", " coffea_uproot_required = CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " uuid=\"12345678-90ab-cdef-1234-567890abcdef\"\n", " )\n", " print(\"Complete required ROOT spec:\")\n", " rich.print(coffea_uproot_required)\n", "except ValidationError as e:\n", " print(f\"Validation error: {e}\")\n", "\n", "# Attempt to create incomplete specification (should fail)\n", "try:\n", " incomplete_spec = CoffeaROOTFileSpec(object_path=\"Events\")\n", " print(\"This shouldn't print - validation should fail!\")\n", "except ValidationError as e:\n", " print(f\"Expected validation error for incomplete spec: {e.error_count()} errors\")\n", " rich.print(e)" ] }, { "cell_type": "code", "execution_count": 6, "id": "694daed3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CoffeaParquetFileSpec ===\n", "Optional Parquet spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpecOptional(\n", " object_path=None,\n", " steps=[[0, 5000]],\n", " num_entries=5000,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " uuid='parquet-uuid-example',\n", " num_selected_entries=5000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m5000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'parquet-uuid-example'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m5000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Required Parquet spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpec(\n", " object_path=None,\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=10000,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " uuid='parquet-uuid-complete',\n", " num_selected_entries=10000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'parquet-uuid-complete'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.5 Parquet specifications (similar pattern)\n", "print(\"=== CoffeaParquetFileSpec ===\")\n", "\n", "# Optional Parquet specification\n", "parquet_optional = CoffeaParquetFileSpecOptional(\n", " steps=[[0, 5000]],\n", " num_entries=5000,\n", " uuid=\"parquet-uuid-example\"\n", ")\n", "print(\"Optional Parquet spec:\")\n", "rich.print(parquet_optional)\n", "\n", "# Required Parquet specification\n", "parquet_required = CoffeaParquetFileSpec(\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=10000,\n", " uuid=\"parquet-uuid-complete\"\n", ")\n", "print(\"Required Parquet spec:\")\n", "rich.print(parquet_required)" ] }, { "cell_type": "markdown", "id": "3f36f852", "metadata": {}, "source": [ "## 2. 
InputFiles Specification\n", "\n", "The `InputFiles` class is a dictionary-like container for any mixture of CoffeaFileSpec classes, both ROOT/Parquet and concrete/Optional. `PreprocessedFiles` is the specific subtype permitting only concrete FileSpecs. Both automatically handle:\n", "\n", "- **Format detection**: Automatically identifies whether files are ROOT or Parquet by testing the key (filename)\n", "- **Dictionary-like interface**: Easy access to files using standard dict methods\n", "- **FileSpec promotion**: Automatically tries to upcast CoffeaROOTFileSpecOptional and CoffeaParquetFileSpecOptional to their concrete classes once the necessary fields have been set post-initialization (such as when they are preprocessed)\n", "- **FileSpec-wide format**: Provides the `format` computed property to determine which format(s) are present.\n", "\n", "An `InputFiles` or `PreprocessedFiles` instance forms the \"files\" subfield of the `DatasetSpec` class. Notably, unlike the FileSpec classes, it doesn't require keyword arguments in the constructor; simply pass in a regular dictionary of `{\"filename1\": dict|FileSpec, ..., \"filenameN\": dict|FileSpec}`." ] }, { "cell_type": "code", "execution_count": 7, "id": "3cfd0d6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== InputFiles ===\n", "InputFiles:\n" ] }, { "data": { "text/html": [ "
InputFiles(\n", " root={\n", " 'file1.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 10]],\n", " num_entries=10,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid1',\n", " num_selected_entries=10\n", " ),\n", " 'file1.parquet': CoffeaParquetFileSpec(\n", " object_path=None,\n", " steps=[[0, 100]],\n", " num_entries=100,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid2',\n", " num_selected_entries=100\n", " ),\n", " 'file2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[10, 20]],\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=10\n", " )\n", " }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'file1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file1.parquet'\u001b[0m: \u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format(s): root|parquet\n", "Number of files: 3\n", "Iterating over file dict:\n", "File: file1.root\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 10]],\n", " num_entries=10,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid1',\n", " num_selected_entries=10\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "File: file1.parquet\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpec(\n", " object_path=None,\n", " steps=[[0, 100]],\n", " num_entries=100,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid2',\n", " num_selected_entries=100\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "File: file2.root\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[10, 20]],\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=10\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Accessing 'file1.root': object_path='Events' steps=[[0, 10]] num_entries=10 format='root' lfn=None pfn=None uuid='uuid1' num_selected_entries=10\n", "=== Modifying a file spec in the dict ===\n", "Keys in filedict: ['file1.root', 'file1.parquet', 'file2.root', 'file3.root']\n" ] } ], "source": [ "# 2.1 Create an InputFiles\n", "print(\"=== InputFiles ===\")\n", "\n", "# using a dictionary of CoffeaROOTFileSpec(Optional) and CoffeaParquetFileSpec(Optional)\n", "dict_of_filespecs = {\n", " \"file1.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\", steps=[[0, 10]], num_entries=10, uuid=\"uuid1\"\n", " ),\n", " \"file1.parquet\": CoffeaParquetFileSpec(\n", " steps=[[0, 100]], num_entries=100, uuid=\"uuid2\"\n", " ),\n", " \"file2.root\": CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\", steps=[[10, 20]], num_entries=None, uuid=None\n", " ),\n", "}\n", "\n", "filedict_from_filespecs = InputFiles(dict_of_filespecs)\n", "\n", 
"print(\"InputFiles:\")\n", "rich.print(filedict_from_filespecs)\n", "\n", "# computed property: format\n", "print(f\"Detected format(s): {filedict_from_filespecs.format}\")\n", "print(f\"Number of files: {len(filedict_from_filespecs)}\")\n", "\n", "# Iteration over the file dict\n", "print(\"Iterating over file dict:\")\n", "for fname, spec in filedict_from_filespecs.items():\n", " print(f\"File: {fname}\\nSpec:\")\n", " rich.print(spec)\n", " \n", "# __getitem__, __setitem__ access\n", "print(f\"Accessing 'file1.root': {filedict_from_filespecs['file1.root']}\")\n", "\n", "print(\"=== Modifying a file spec in the dict ===\")\n", "filedict_from_filespecs[\"file2.root\"].num_entries = 20\n", "\n", "filedict_from_filespecs[\"file3.root\"] = CoffeaROOTFileSpec(\n", " object_path=\"Events\", steps=[[0, 30]], num_entries=30, uuid=\"uuid3\"\n", ")\n", "\n", "# show keys\n", "print(f\"Keys in filedict: {list(filedict_from_filespecs.keys())}\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "b2db2dcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== InputFiles from pure dictionary ===\n", "InputFiles from pure dictionary:\n" ] }, { "data": { "text/html": [ "
InputFiles(\n", " root={\n", " 'file1.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 10]],\n", " num_entries=10,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid1',\n", " num_selected_entries=10\n", " ),\n", " 'file1.parquet': CoffeaParquetFileSpec(\n", " object_path=None,\n", " steps=[[0, 100]],\n", " num_entries=100,\n", " format='parquet',\n", " lfn=None,\n", " pfn=None,\n", " uuid='uuid2',\n", " num_selected_entries=100\n", " ),\n", " 'file2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[10, 20]],\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=10\n", " )\n", " }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'file1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file1.parquet'\u001b[0m: \u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 2.2 Create an InputFiles from pure dictionary\n", "print(\"=== InputFiles from pure dictionary ===\")\n", "\n", "dict_of_dicts = {\n", " \"file1.root\": {\n", " \"object_path\": \"Events\", \n", " \"steps\": [[0, 10]], \n", " \"num_entries\": 10, \n", " \"uuid\": \"uuid1\"\n", " },\n", " \"file1.parquet\": {\n", " \"steps\": [[0, 100]], \n", " \"num_entries\": 100, \n", " \"uuid\": \"uuid2\"\n", " },\n", " \"file2.root\": {\n", " \"object_path\": \"Events\", \n", " \"steps\": [[10, 20]], \n", " \"num_entries\": None, \n", " \"uuid\": None\n", " },\n", "\n", "}\n", "\n", "filedict_from_pure_dict = InputFiles(dict_of_dicts)\n", "print(\"InputFiles from pure dictionary:\")\n", "rich.print(filedict_from_pure_dict)" ] }, { "cell_type": "markdown", "id": "5f18f95b", "metadata": {}, "source": [ "## 3. 
Dataset Specifications\n", "\n", "The `DatasetSpec` class represents a collection of files that form a logical dataset. It automatically handles:\n", "\n", "- **Format detection**: Automatically identifies if files are ROOT or Parquet\n", "- **File validation**: Ensures all files in a dataset are compatible\n", "- **Metadata management**: Stores dataset-level metadata and forms\n", "- **Dictionary-like interface**: Easy access to files using standard dict methods" ] }, { "cell_type": "code", "execution_count": 9, "id": "b31c13f5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec Creation ===\n", "Simple ROOT dataset:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'data_file_1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'data_file_2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'data_file_3.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={'sample_type': 'data', 'year': 2023},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'data_file_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_file_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_file_3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'sample_type'\u001b[0m: \u001b[32m'data'\u001b[0m, \u001b[32m'year'\u001b[0m: \u001b[1;36m2023\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format: root\n", "Number of files: 3\n", "Metadata: {'sample_type': 'data', 'year': 2023}\n" ] } ], "source": [ "# 3.1 Creating DatasetSpec from file dictionaries\n", "print(\"=== DatasetSpec Creation ===\")\n", "\n", "# Create a dataset from a simple file dictionary (ROOT files)\n", "root_dataset_simple = DatasetSpec(\n", " files={\n", " \"data_file_1.root\": \"Events\",\n", " \"data_file_2.root\": \"Events\",\n", 
" \"data_file_3.root\": \"Events\"\n", " },\n", " metadata={\"sample_type\": \"data\", \"year\": 2023}\n", ")\n", "\n", "print(\"Simple ROOT dataset:\")\n", "rich.print(root_dataset_simple)\n", "print(f\"Detected format: {root_dataset_simple.format}\")\n", "print(f\"Number of files: {len(root_dataset_simple.files)}\")\n", "print(f\"Metadata: {root_dataset_simple.metadata}\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "76aa5992", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec with Complete Specifications ===\n", "Complete dataset:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'processed_data_1.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='file1-uuid',\n", " num_selected_entries=2000\n", " ),\n", " 'processed_data_2.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 1500], [1500, 3000]],\n", " num_entries=3000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='file2-uuid',\n", " num_selected_entries=3000\n", " )\n", " }\n", " ),\n", " metadata={'processing_version': 'v2.1', 'cross_section': 1.23},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'processed_data_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'file1-uuid'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'processed_data_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " 
\u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1500\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1500\u001b[0m, \u001b[1;36m3000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m3000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'file2-uuid'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m3000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'processing_version'\u001b[0m: \u001b[32m'v2.1'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m1.23\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format: root\n", "Number of files: 2\n", "Ready for column-joining: False\n" ] } ], "source": [ "# 3.2 Creating DatasetSpec with complete file specifications\n", "print(\"=== DatasetSpec with Complete Specifications ===\")\n", "\n", "# Create individual file specifications\n", "file1_spec = CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " uuid=\"file1-uuid\"\n", ")\n", "\n", "file2_spec = CoffeaROOTFileSpec(\n", " object_path=\"Events\", \n", " steps=[[0, 1500], [1500, 3000]],\n", " num_entries=3000,\n", " uuid=\"file2-uuid\"\n", ")\n", "\n", "# Create dataset with complete specifications\n", "complete_dataset = DatasetSpec(\n", " files=InputFiles({\n", " \"processed_data_1.root\": 
file1_spec,\n", " \"processed_data_2.root\": file2_spec\n", " }),\n", " metadata={\"processing_version\": \"v2.1\", \"cross_section\": 1.23},\n", ")\n", "\n", "print(\"Complete dataset:\")\n", "rich.print(complete_dataset)\n", "print(f\"Detected format: {complete_dataset.format}\")\n", "print(f\"Number of files: {len(complete_dataset.files)}\")\n", "print(f\"Ready for column-joining: {complete_dataset.joinable}\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "6af14880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Mixed Format Datasets ===\n", "Validation error for mixed format dataset: 1 validation error for DatasetSpec\n", " Value error, format: format must be one of {'root', 'parquet'} [type=value_error, input_value={'files': {'data.root': C...selected_entries=2000)}}, input_type=dict]\n", " For further information visit https://errors.pydantic.dev/2.11/v/value_error\n" ] } ], "source": [ "# 3.3 Mixed format handling\n", "print(\"=== Mixed Format Datasets ===\")\n", "\n", "# Try to create a dataset with both ROOT and Parquet files\n", "try:\n", " mixed_dataset = DatasetSpec(\n", " files={\n", " \"data.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " uuid=\"root-uuid\"\n", " ),\n", " \"data.parquet\": CoffeaParquetFileSpec(\n", " steps=[[0, 2000]],\n", " num_entries=2000,\n", " uuid=\"parquet-uuid\"\n", " )\n", " }\n", " )\n", "\n", " print(\"Mixed format dataset:\")\n", " rich.print(mixed_dataset)\n", " print(f\"Detected format: {mixed_dataset.format}\")\n", "except ValidationError as e:\n", " print(f\"Validation error for mixed format dataset: {e}\")\n", "\n", "# If you need a mixed format, file an issue in the coffea GitHub repository requesting the feature, with an example of your use case!"
] }, { "cell_type": "code", "execution_count": 12, "id": "98564dea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec from File Lists ===\n", "Dataset from list:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'simulation_1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'simulation_2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'root://simulation_3.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'simulation_4.root.1': CoffeaROOTFileSpecOptional(\n", " object_path='AuxiliaryData',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={'sample_type': 'simulation', 'process': 'ttbar'},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'simulation_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " 
\u001b[1m)\u001b[0m,\n", " \u001b[32m'simulation_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'root://simulation_3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'simulation_4.root.1'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'AuxiliaryData'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'sample_type'\u001b[0m: \u001b[32m'simulation'\u001b[0m, \u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Files: ['simulation_1.root', 'simulation_2.root', 'root://simulation_3.root', 'simulation_4.root.1']\n" ] } ], "source": [ "# 3.4 DatasetSpec from file lists\n", "print(\"=== DatasetSpec from File Lists ===\")\n", "\n", "# Create dataset from a list of file:object_path strings\n", "dataset_from_list = DatasetSpec(\n", " files=[\n", " \"simulation_1.root:Events\",\n", " \"simulation_2.root:Events\", \n", " \"root://simulation_3.root:Events\",\n", " \"simulation_4.root.1:AuxiliaryData\",\n", " ],\n", " metadata={\"sample_type\": \"simulation\", \"process\": \"ttbar\"}\n", ")\n", "\n", "print(\"Dataset from list:\")\n", "rich.print(dataset_from_list)\n", "print(f\"Files: {list(dataset_from_list.files.keys())}\")" ] }, { "cell_type": "markdown", "id": "762e909c", "metadata": {}, "source": [ "## 4. Fileset Specifications\n", "\n", "The `DataGroupSpec` class represents a collection of datasets, typically used for analysis workflows. 
It provides:\n", "\n", "- **Multi-dataset management**: Handle multiple physics processes/samples\n", "- **JSON serialization**: Save and load complete analysis configurations\n", "- **Dictionary interface**: Access datasets by name\n", "- **Validation**: Ensure all datasets are properly specified" ] }, { "cell_type": "code", "execution_count": 13, "id": "7109b611", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DataGroupSpec Creation ===\n", "Analysis fileset:\n" ] }, { "data": { "text/html": [ "
DataGroupSpec(\n", " root={\n", " 'ttbar_simulation': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'ttbar_1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={'process': 'ttbar', 'cross_section': 831.8},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " ),\n", " 'single_top': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'singletop_1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'singletop_2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={'process': 'single_top', 'cross_section': 136.02},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " ),\n", " 'data': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'data_2023A.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'data_2023B.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={'is_data': True, 'era': 
'2023'},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " )\n", " }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_simulation'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m\u001b[1m}\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'single_top'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'singletop_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'singletop_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'single_top'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m136.02\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " 
\u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'data_2023A.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_2023B.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'is_data'\u001b[0m: \u001b[3;92mTrue\u001b[0m, \u001b[32m'era'\u001b[0m: \u001b[32m'2023'\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of datasets: 3\n", "Dataset names: ['ttbar_simulation', 'single_top', 'data']\n" ] } ], "source": [ "# 4.1 Creating DataGroupSpec\n", "print(\"=== DataGroupSpec Creation ===\")\n", "\n", "# Create a fileset with multiple datasets\n", "analysis_fileset = DataGroupSpec({\n", " \"ttbar_simulation\": DatasetSpec(\n", " files={\n", " \"ttbar_1.root\": \"Events\",\n", " \"ttbar_2.root\": \"Events\"\n", " },\n", " metadata={\"process\": \"ttbar\", \"cross_section\": 831.8}\n", " ),\n", " \n", " \"single_top\": DatasetSpec(\n", " files={\n", " \"singletop_1.root\": \"Events\",\n", " \"singletop_2.root\": \"Events\"\n", " },\n", " metadata={\"process\": \"single_top\", \"cross_section\": 136.02}\n", " ),\n", " \n", " \"data\": DatasetSpec(\n", " files={\n", " \"data_2023A.root\": \"Events\", \n", " \"data_2023B.root\": \"Events\"\n", " },\n", " metadata={\"is_data\": True, \"era\": \"2023\"}\n", " )\n", "})\n", "\n", "print(\"Analysis fileset:\")\n", "rich.print(analysis_fileset)\n", "print(f\"Number of datasets: {len(analysis_fileset)}\")\n", "print(f\"Dataset names: {list(analysis_fileset.keys())}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "376a6765", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Fileset Access and Manipulation ===\n", "TTbar dataset: {'process': 'ttbar', 'cross_section': 831.8}\n", "\n", "Dataset summary:\n", " ttbar_simulation: 2 files, process=ttbar\n", " single_top: 2 files, process=single_top\n", " data: 2 files, process=unknown\n", "\n", "After adding WJets: 4 datasets\n" ] } ], "source": [ "# 4.2 Accessing and manipulating filesets\n", "print(\"=== Fileset Access and Manipulation ===\")\n", "\n", "# Access individual datasets\n", "ttbar_dataset = 
analysis_fileset[\"ttbar_simulation\"]\n", "print(f\"TTbar dataset: {ttbar_dataset.metadata}\")\n", "\n", "# Iterate over datasets\n", "print(\"\\nDataset summary:\")\n", "for dataset_name, dataset in analysis_fileset.items():\n", " num_files = len(dataset.files)\n", " process = dataset.metadata.get(\"process\", \"unknown\")\n", " print(f\" {dataset_name}: {num_files} files, process={process}\")\n", "\n", "# Add a new dataset\n", "analysis_fileset[\"wjets\"] = DatasetSpec(\n", " files={\"wjets_1.root\": \"Events\"},\n", " metadata={\"process\": \"wjets\", \"cross_section\": 61526.7}\n", ")\n", "\n", "print(f\"\\nAfter adding WJets: {len(analysis_fileset)} datasets\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "fc3b6e90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== JSON Serialization ===\n", "Fileset JSON (first 500 characters):\n", "{\n", " \"ttbar_simulation\": {\n", " \"files\": {\n", " \"ttbar_1.root\": {\n", " \"object_path\": \"Events\",\n", " \"steps\": null,\n", " \"num_entries\": null,\n", " \"format\": \"root\",\n", " \"lfn\": null,\n", " \"pfn\": null,\n", " \"uuid\": null,\n", " \"num_selected_entries\": null\n", " },\n", " \"ttbar_2.root\": {\n", " \"object_path\": \"Events\",\n", " \"steps\": null,\n", " \"num_entries\": null,\n", " \"format\": \"root\",\n", " \"lfn\": null,\n", " \"pfn\": null,\n", " \"uuid\": null,\n", " \"num_se...\n", "\n", "Restored fileset has 4 datasets\n", "Dataset names match: True\n" ] } ], "source": [ "# 4.3 JSON serialization and deserialization\n", "print(\"=== JSON Serialization ===\")\n", "\n", "# Serialize fileset to JSON\n", "fileset_json = analysis_fileset.model_dump_json(indent=2)\n", "print(\"Fileset JSON (first 500 characters):\")\n", "print(fileset_json[:500] + \"...\" if len(fileset_json) > 500 else fileset_json)\n", "\n", "# Deserialize from JSON\n", "restored_fileset = DataGroupSpec.model_validate_json(fileset_json)\n", "print(f\"\\nRestored fileset 
has {len(restored_fileset)} datasets\")\n", "print(f\"Dataset names match: {set(analysis_fileset.keys()) == set(restored_fileset.keys())}\")" ] }, { "cell_type": "markdown", "id": "69c1d204", "metadata": {}, "source": [ "## 5. Integration with Preprocessing\n", "\n", "The `preprocess` function works directly with FileSpec classes and promotes Optional types to concrete types for the dataset files it accesses successfully. It can:\n", "\n", "- **Calculate file steps**: Automatically determine optimal chunking\n", "- **Extract metadata**: Get file UUIDs, entry counts, and schemas (using `save_form=True`)\n", "- **Generate forms**: Create Awkward Array forms for type checking\n", "- **Handle errors**: Skip bad files and report issues" ] }, { "cell_type": "code", "execution_count": 16, "id": "8d360bf8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Preprocessing with FileSpec ===\n" ] }, { "data": { "text/html": [ "
DataGroupSpec(\n", " root={\n", " 'ZJets': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': \n", "CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " ),\n", " 'Data': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': \n", "CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'nano_dimuon_not_there.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " metadata={},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " )\n", " }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \n", "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \n", "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'nano_dimuon_not_there.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " 
\u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Preprocessing the fileset...\n", "Fileset after preprocessing (excluding compressed_form string):\n" ] }, { "data": { "text/html": [ "
{\n", " 'ZJets': {\n", " 'files': {\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': {\n", " 'object_path': 'Events',\n", " 'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n", " 'num_entries': 40,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': 'a9490124-3648-11ea-89e9-f5b55c90beef',\n", " 'num_selected_entries': 40\n", " }\n", " },\n", " 'metadata': {},\n", " 'format': 'root',\n", " 'did': None\n", " },\n", " 'Data': {\n", " 'files': {\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': {\n", " 'object_path': 'Events',\n", " 'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n", " 'num_entries': 40,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': 'a210a3f8-3648-11ea-a29f-f5b55c90beef',\n", " 'num_selected_entries': 40\n", " }\n", " },\n", " 'metadata': {},\n", " 'format': 'root',\n", " 'did': None\n", " }\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " 
\u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a9490124-3648-11ea-89e9-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a210a3f8-3648-11ea-a29f-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: 
\u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Updated files (including inaccessible ones):\n" ] }, { "data": { "text/html": [ "
{\n", " 'ZJets': {\n", " 'files': {\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': {\n", " 'object_path': 'Events',\n", " 'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n", " 'num_entries': 40,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': 'a9490124-3648-11ea-89e9-f5b55c90beef',\n", " 'num_selected_entries': 40\n", " }\n", " },\n", " 'metadata': {},\n", " 'format': 'root',\n", " 'did': None\n", " },\n", " 'Data': {\n", " 'files': {\n", " 'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': {\n", " 'object_path': 'Events',\n", " 'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n", " 'num_entries': 40,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': 'a210a3f8-3648-11ea-a29f-f5b55c90beef',\n", " 'num_selected_entries': 40\n", " },\n", " 'nano_dimuon_not_there.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " }\n", " },\n", " 'metadata': {},\n", " 'format': 'root',\n", " 'did': None\n", " }\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, 
\u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a9490124-3648-11ea-89e9-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: 
\u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a210a3f8-3648-11ea-a29f-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'nano_dimuon_not_there.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 5.1 Basic preprocessing with FileSpec\n", "print(\"=== Preprocessing with FileSpec ===\")\n", "\n", "# Note: This is a demonstration - in practice you'd use real file paths\n", "demo_fileset = DataGroupSpec({\n", " \"ZJets\": {\"files\": [\"https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root:Events\"]},\n", " \"Data\": {\"files\": [\n", " \"https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root:Events\",\n", " \"nano_dimuon_not_there.root:Events\",\n", " ]},\n", "})\n", "rich.print(demo_fileset)\n", "\n", "print(\"Preprocessing the fileset...\")\n", "dataset_runnable, dataset_updated = preprocess(\n", " demo_fileset,\n", " step_size=7,\n", " align_clusters=False,\n", " 
files_per_batch=10,\n", " skip_bad_files=True,\n", " save_form=True,\n", ")\n", "print(\"Fileset after preprocessing (excluding compressed_form string):\")\n", "rich.print({k: v.model_dump(exclude=\"compressed_form\") for k, v in dataset_runnable.items()})\n", "\n", "print(\"Updated files (including inaccessible ones):\")\n", "rich.print({k: v.model_dump(exclude=\"compressed_form\") for k, v in dataset_updated.items()})\n", "#rich.print({dname: {k: v for k, v in dataset_updated[dname].files.items() if k not in dataset_runnable[dname].files} for dname in dataset_updated})\n" ] }, { "cell_type": "markdown", "id": "6b974346", "metadata": {}, "source": [ "## 6. Integration with apply_to_fileset\n", "\n", "The `apply_to_fileset` function processes datasets using FileSpec classes:" ] }, { "cell_type": "code", "execution_count": 17, "id": "39a43fa8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_electronIdx => Electron\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_genPartIdx => GenPart\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_photonIdx => Photon\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for FatJet_genJetAK8Idx => GenJetAK8\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
{\n", " 'ZJets': {\n", " 'mass': Hist(\n", " StrCategory(['ZJets'], growth=True, name='dataset', label='Primary dataset'),\n", " Regular(30000, 0.25, 300, name='mass', label='$m_{\\\\mu\\\\mu}$ [GeV]'),\n", " storage=Double()) # Sum: 6.0,\n", " 'pt': Hist(\n", " StrCategory(['ZJets'], growth=True, name='dataset', label='Primary dataset'),\n", " Regular(30000, 0.24, 300, name='pt', label='$p_{T}$ [GeV]'),\n", " storage=Double()) # Sum: 18.0,\n", " 'cutflow': {'ZJets_pt': np.int64(18), 'ZJets_mass': np.int64(6)},\n", " 'worker': set()\n", " },\n", " 'Data': {\n", " 'mass': Hist(\n", " StrCategory(['Data'], growth=True, name='dataset', label='Primary dataset'),\n", " Regular(30000, 0.25, 300, name='mass', label='$m_{\\\\mu\\\\mu}$ [GeV]'),\n", " storage=Double()) # Sum: 66.0,\n", " 'pt': Hist(\n", " StrCategory(['Data'], growth=True, name='dataset', label='Primary dataset'),\n", " Regular(30000, 0.24, 300, name='pt', label='$p_{T}$ [GeV]'),\n", " storage=Double()) # Sum: 84.0,\n", " 'cutflow': {'Data_pt': np.int64(84), 'Data_mass': np.int64(66)},\n", " 'worker': set()\n", " }\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'mass'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'ZJets'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.25\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'mass'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$m_\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\\\\mu\\\\mu\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ 
\u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m6.0\u001b[0m,\n", " \u001b[32m'pt'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'ZJets'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.24\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'pt'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$p_\u001b[0m\u001b[32m{\u001b[0m\u001b[32mT\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m18.0\u001b[0m,\n", " \u001b[32m'cutflow'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'ZJets_pt'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m18\u001b[0m\u001b[1m)\u001b[0m, \u001b[32m'ZJets_mass'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m6\u001b[0m\u001b[1m)\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'worker'\u001b[0m: \u001b[1;35mset\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'mass'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'Data'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, 
\u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.25\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'mass'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$m_\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\\\\mu\\\\mu\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m66.0\u001b[0m,\n", " \u001b[32m'pt'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'Data'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.24\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'pt'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$p_\u001b[0m\u001b[32m{\u001b[0m\u001b[32mT\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m84.0\u001b[0m,\n", " \u001b[32m'cutflow'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'Data_pt'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m84\u001b[0m\u001b[1m)\u001b[0m, \u001b[32m'Data_mass'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m66\u001b[0m\u001b[1m)\u001b[0m\u001b[1m}\u001b[0m,\n", " 
\u001b[32m'worker'\u001b[0m: \u001b[1;35mset\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 6.1 Example processor for demonstration\n", "to_compute = apply_to_fileset(\n", " NanoEventsProcessor(),\n", " dataset_runnable,\n", " schemaclass=NanoAODSchema,\n", ")\n", "out = dask.compute(to_compute)[0]\n", "rich.print(out)\n" ] }, { "cell_type": "markdown", "id": "d31e7ee4", "metadata": {}, "source": [ "## 7. Dataset Manipulation Functions\n", "\n", "Coffea provides powerful functions for manipulating FileSpec-based datasets:\n", "\n", "- **max_chunks**: Limit processing to first N chunks per dataset\n", "- **max_chunks_per_file**: Limit processing to first N chunks per file\n", "- **slice_chunks**: Select specific chunk ranges \n", "- **max_files**: Limit number of files per dataset\n", "- **slice_files**: Select specific file ranges\n", "- **filter_files**: Remove files based on criteria" ] }, { "cell_type": "code", "execution_count": 18, "id": "e7919395", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Chunk-based Manipulations ===\n", "Original dataset: 3 files\n", "After max_chunks(5):\n", " Total chunks: 5\n", "After max_chunks_per_file(2):\n", " file_0.root: 2 chunks\n", " file_1.root: 2 chunks\n", " file_2.root: 2 chunks\n" ] } ], "source": [ "# 7.1 Chunk-based manipulations\n", "print(\"=== Chunk-based Manipulations ===\")\n", "\n", "# Create a sample fileset for demonstration\n", "sample_fileset = DataGroupSpec({\n", " \"large_dataset\": DatasetSpec(\n", " files={\n", " f\"file_{i}.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[j*1000, (j+1)*1000] for j in range(10)], # 10 chunks per file\n", " num_entries=10000,\n", " uuid=f\"uuid-{i}\"\n", " )\n", " for i in range(3) # 3 files\n", " },\n", " metadata={\"total_files\": 3}\n", " )\n", "})\n", "\n", "print(f\"Original 
dataset: {len(sample_fileset['large_dataset'].files)} files\")\n", "\n", "# Limit to first 5 chunks total per dataset\n", "limited_chunks = max_chunks(sample_fileset, maxchunks=5)\n", "print(\"After max_chunks(5):\")\n", "total_chunks = sum(len(f.steps) for f in limited_chunks['large_dataset'].files.values())\n", "print(f\" Total chunks: {total_chunks}\")\n", "\n", "# Limit to first 2 chunks per file\n", "limited_per_file = max_chunks_per_file(sample_fileset, maxchunks=2)\n", "print(\"After max_chunks_per_file(2):\")\n", "for fname, fspec in limited_per_file['large_dataset'].files.items():\n", " print(f\" {fname}: {len(fspec.steps)} chunks\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "d860f60a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Advanced Chunk Slicing ===\n", "Middle chunks (5:15):\n", " Total chunks: 10\n", "Every other chunk (::2):\n", " Total chunks: 15\n", "First 3 chunks per file:\n", " file_0.root: 3 chunks\n", " file_1.root: 3 chunks\n", " file_2.root: 3 chunks\n" ] } ], "source": [ "# 7.2 Advanced chunk slicing\n", "print(\"=== Advanced Chunk Slicing ===\")\n", "\n", "# Slice specific chunk ranges\n", "middle_chunks = slice_chunks(sample_fileset, slice(5, 15))\n", "print(\"Middle chunks (5:15):\")\n", "total_chunks = sum(len(f.steps) for f in middle_chunks['large_dataset'].files.values())\n", "print(f\" Total chunks: {total_chunks}\")\n", "\n", "# Slice every other chunk\n", "every_other = slice_chunks(sample_fileset, slice(None, None, 2))\n", "print(\"Every other chunk (::2):\")\n", "total_chunks = sum(len(f.steps) for f in every_other['large_dataset'].files.values())\n", "print(f\" Total chunks: {total_chunks}\")\n", "\n", "# Slice per file vs per dataset\n", "per_file_slice = slice_chunks(sample_fileset, slice(3), bydataset=False)\n", "print(\"First 3 chunks per file:\")\n", "for fname, fspec in per_file_slice['large_dataset'].files.items():\n", " print(f\" {fname}: {len(fspec.steps)} 
chunks\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "01044461", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== File-based Manipulations ===\n", "After max_files(2): 2 files\n", "First two files: 2 files\n", "File names: ['file_0.root', 'file_1.root']\n", "Last file: ['file_2.root']\n" ] } ], "source": [ "# 7.3 File-based manipulations\n", "print(\"=== File-based Manipulations ===\")\n", "\n", "# Limit number of files\n", "limited_files = max_files(sample_fileset, maxfiles=2)\n", "print(f\"After max_files(2): {len(limited_files['large_dataset'].files)} files\")\n", "\n", "# Slice specific files\n", "first_two_files = slice_files(sample_fileset, slice(2))\n", "print(f\"First two files: {len(first_two_files['large_dataset'].files)} files\")\n", "print(f\"File names: {list(first_two_files['large_dataset'].files.keys())}\")\n", "\n", "# Last file only\n", "last_file = slice_files(sample_fileset, slice(-1, None))\n", "print(f\"Last file: {list(last_file['large_dataset'].files.keys())}\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "ee931c3e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== File Filtering ===\n", "Before filtering: 3 files\n", "After filtering: 2 files\n", "Remaining files: ['good_file_1.root', 'good_file_2.root']\n" ] } ], "source": [ "# 7.4 Filtering files\n", "print(\"=== File Filtering ===\")\n", "\n", "# Create a sample with some empty files for filtering\n", "fileset_with_empty = DataGroupSpec({\n", " \"mixed_dataset\": DatasetSpec(\n", " files={\n", " \"good_file_1.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " uuid=\"good-1\"\n", " ),\n", " \"empty_file.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\", \n", " steps=[[0, 0]], # Empty file\n", " num_entries=0,\n", " uuid=\"empty\"\n", " ),\n", " \"good_file_2.root\": CoffeaROOTFileSpec(\n", " 
object_path=\"Events\",\n", " steps=[[0, 2000]],\n", " num_entries=2000,\n", " uuid=\"good-2\"\n", " )\n", " }\n", " )\n", "})\n", "\n", "print(f\"Before filtering: {len(fileset_with_empty['mixed_dataset'].files)} files\")\n", "\n", "# Filter out empty files\n", "filtered_fileset = filter_files(fileset_with_empty)\n", "print(f\"After filtering: {len(filtered_fileset['mixed_dataset'].files)} files\")\n", "print(f\"Remaining files: {list(filtered_fileset['mixed_dataset'].files.keys())}\")" ] }, { "cell_type": "markdown", "id": "a2f8d515", "metadata": {}, "source": [ "## 8. Advanced Usage Examples\n", "\n", "This section demonstrates advanced patterns and best practices for using FileSpec classes in real-world scenarios." ] }, { "cell_type": "code", "execution_count": 22, "id": "bc8ca29a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Complex Analysis Fileset ===\n", "Full analysis fileset: 9 datasets\n", "Signal datasets: 3\n", "Background datasets: 2\n", "Data datasets: 4\n" ] } ], "source": [ "# 8.1 Building complex analysis filesets\n", "print(\"=== Complex Analysis Fileset ===\")\n", "\n", "def build_analysis_fileset():\n", " \"\"\"Build a comprehensive analysis fileset\"\"\"\n", " \n", " # Signal samples\n", " signal_samples = {}\n", " for mass in [125, 200, 300]:\n", " signal_samples[f\"higgs_m{mass}\"] = DatasetSpec(\n", " files={\n", " f\"higgs_m{mass}_part{i}.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[j*5000, (j+1)*5000] for j in range(20)],\n", " num_entries=100000,\n", " uuid=f\"higgs-{mass}-{i}\"\n", " )\n", " for i in range(3)\n", " },\n", " metadata={\n", " \"process\": \"higgs\",\n", " \"mass\": mass,\n", " \"cross_section\": 48.58 if mass == 125 else 10.0,\n", " \"is_signal\": True\n", " }\n", " )\n", " \n", " # Background samples\n", " background_samples = {\n", " \"ttbar\": DatasetSpec(\n", " files={\n", " f\"ttbar_part{i}.root\": \"Events\" for i in range(10)\n", " },\n", " 
metadata={\"process\": \"ttbar\", \"cross_section\": 831.8, \"is_signal\": False}\n", " ),\n", " \"wjets\": DatasetSpec(\n", " files={\n", " f\"wjets_part{i}.root\": \"Events\" for i in range(15)\n", " },\n", " metadata={\"process\": \"wjets\", \"cross_section\": 61526.7, \"is_signal\": False}\n", " )\n", " }\n", " \n", " # Data samples\n", " data_samples = {\n", " f\"data_{era}\": DatasetSpec(\n", " files={\n", " f\"data_{era}_part{i}.root\": \"Events\" for i in range(5)\n", " },\n", " metadata={\"is_data\": True, \"era\": era, \"luminosity\": 41.5}\n", " )\n", " for era in [\"2022A\", \"2022B\", \"2022C\", \"2022D\"]\n", " }\n", " \n", " # Combine all samples\n", " all_samples = {}\n", " all_samples.update(signal_samples)\n", " all_samples.update(background_samples)\n", " all_samples.update(data_samples)\n", " \n", " return DataGroupSpec(all_samples)\n", "\n", "# Build the fileset\n", "full_analysis = build_analysis_fileset()\n", "print(f\"Full analysis fileset: {len(full_analysis)} datasets\")\n", "\n", "# Categorize datasets\n", "signal_datasets = [name for name, ds in full_analysis.items() \n", " if ds.metadata.get(\"is_signal\", False)]\n", "background_datasets = [name for name, ds in full_analysis.items() \n", " if not ds.metadata.get(\"is_data\", False) and not ds.metadata.get(\"is_signal\", False)]\n", "data_datasets = [name for name, ds in full_analysis.items() \n", " if ds.metadata.get(\"is_data\", False)]\n", "\n", "print(f\"Signal datasets: {len(signal_datasets)}\")\n", "print(f\"Background datasets: {len(background_datasets)}\")\n", "print(f\"Data datasets: {len(data_datasets)}\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "a6237760", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Conditional Processing ===\n", "Signal-only fileset: 3 datasets\n", "2022 data fileset: 4 datasets\n", "Test subset: 3 datasets with limited files\n" ] } ], "source": [ "# 8.2 Conditional processing and dataset 
selection\n", "print(\"=== Conditional Processing ===\")\n", "\n", "def select_datasets_by_criteria(fileset: DataGroupSpec, **criteria) -> DataGroupSpec:\n", " \"\"\"Select datasets matching specific criteria\"\"\"\n", " selected = {}\n", " \n", " for name, dataset in fileset.items():\n", " match = True\n", " for key, value in criteria.items():\n", " if dataset.metadata.get(key) != value:\n", " match = False\n", " break\n", " \n", " if match:\n", " selected[name] = dataset\n", " \n", " return DataGroupSpec(selected)\n", "\n", "# Select only signal datasets\n", "signal_only = select_datasets_by_criteria(full_analysis, is_signal=True)\n", "print(f\"Signal-only fileset: {len(signal_only)} datasets\")\n", "\n", "# Select 2022 data only\n", "data_2022 = select_datasets_by_criteria(full_analysis, is_data=True)\n", "data_2022_filtered = DataGroupSpec({\n", " name: ds for name, ds in data_2022.items() \n", " if \"2022\" in name\n", "})\n", "print(f\"2022 data fileset: {len(data_2022_filtered)} datasets\")\n", "\n", "# Create a test subset with limited files\n", "test_subset = DataGroupSpec({\n", " name: max_files(DataGroupSpec({name: ds}), maxfiles=2)[name]\n", " for name, ds in full_analysis.items()\n", " if name in signal_datasets[:2] + background_datasets[:1]\n", "})\n", "print(f\"Test subset: {len(test_subset)} datasets with limited files\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "60f065ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Error Handling and Validation ===\n", "Fileset validation results:\n", " total_datasets: 9\n", " total_files: 54\n", " format_distribution: {'root': 9}\n" ] } ], "source": [ "# 8.3 Error handling and validation\n", "print(\"=== Error Handling and Validation ===\")\n", "\n", "def validate_fileset(fileset: DataGroupSpec) -> dict:\n", " \"\"\"Validate a fileset and return diagnostic information\"\"\"\n", " diagnostics = {\n", " \"total_datasets\": len(fileset),\n", " 
\"total_files\": 0,\n", " \"empty_datasets\": [],\n", " \"format_distribution\": {},\n", " \"metadata_issues\": []\n", " }\n", " \n", " for name, dataset in fileset.items():\n", " # Count files\n", " num_files = len(dataset.files)\n", " diagnostics[\"total_files\"] += num_files\n", " \n", " # Check for empty datasets\n", " if num_files == 0:\n", " diagnostics[\"empty_datasets\"].append(name)\n", " \n", " # Track format distribution\n", " fmt = dataset.format\n", " diagnostics[\"format_distribution\"][fmt] = diagnostics[\"format_distribution\"].get(fmt, 0) + 1\n", " \n", " # Check metadata\n", " if not dataset.metadata:\n", " diagnostics[\"metadata_issues\"].append(f\"{name}: No metadata\")\n", " elif \"process\" not in dataset.metadata and not dataset.metadata.get(\"is_data\", False):\n", " diagnostics[\"metadata_issues\"].append(f\"{name}: Missing process information\")\n", " \n", " return diagnostics\n", "\n", "# Validate our analysis fileset\n", "validation_results = validate_fileset(full_analysis)\n", "print(\"Fileset validation results:\")\n", "for key, value in validation_results.items():\n", " if isinstance(value, list) and len(value) == 0:\n", " continue\n", " print(f\" {key}: {value}\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "f22002f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Performance Optimization ===\n", " higgs_m125: chunk_slicing, 1 files\n", " higgs_m200: chunk_slicing, 1 files\n", "Optimized subset: 2 datasets\n" ] }, { "data": { "text/html": [ "
DataGroupSpec(\n", " root={\n", " 'higgs_m125': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'higgs_m125_part0.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=100000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='higgs-125-0',\n", " num_selected_entries=10000\n", " )\n", " }\n", " ),\n", " metadata={'process': 'higgs', 'mass': 125, 'cross_section': 48.58, 'is_signal': True},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " ),\n", " 'higgs_m200': DatasetSpec(\n", " files=InputFiles(\n", " root={\n", " 'higgs_m200_part0.root': CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=100000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='higgs-200-0',\n", " num_selected_entries=10000\n", " )\n", " }\n", " ),\n", " metadata={'process': 'higgs', 'mass': 200, 'cross_section': 10.0, 'is_signal': True},\n", " format='root',\n", " compressed_form=None,\n", " did=None\n", " )\n", " }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'higgs_m125'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'higgs_m125_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " 
\u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'higgs-125-0'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'higgs'\u001b[0m, \u001b[32m'mass'\u001b[0m: \u001b[1;36m125\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m48.58\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;92mTrue\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'higgs_m200'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'higgs_m200_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'higgs-200-0'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'higgs'\u001b[0m, \u001b[32m'mass'\u001b[0m: \u001b[1;36m200\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m10.0\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;92mTrue\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 8.4 Performance optimization strategies\n", "print(\"=== Performance Optimization ===\")\n", "\n", "def optimize_fileset_for_processing(fileset: DataGroupSpec, target_chunk_size: int = 100000) -> DataGroupSpec:\n", " \"\"\"Optimize fileset for processing performance\"\"\"\n", " \n", " optimized = {}\n", " \n", " for name, dataset in fileset.items():\n", " # Calculate total events and files\n", " total_events = sum(f.num_entries for f in dataset.files.values() \n", " if hasattr(f, 'num_entries') and f.num_entries)\n", " num_files = len(dataset.files)\n", " \n", " if total_events == 0:\n", " # Skip empty datasets\n", " continue\n", " \n", " # Determine optimal chunking strategy\n", " if total_events < target_chunk_size:\n", " # Small dataset - process as single chunk per file\n", " chunk_strategy = \"single_chunk_per_file\"\n", " optimized_dataset = dataset\n", " elif num_files < 5:\n", " # Few large files - use chunk slicing\n", " chunk_strategy = \"chunk_slicing\"\n", " max_chunks_total = max(1, total_events // target_chunk_size)\n", " optimized_dataset = max_chunks(DataGroupSpec({name: dataset}), \n", " maxchunks=max_chunks_total)[name]\n", " else:\n", " # Many files - limit files and chunks per file\n", " chunk_strategy = \"file_and_chunk_limiting\"\n", " max_files_count = min(num_files, 20) # Limit to 20 files\n", " temp_fileset = 
max_files(DataGroupSpec({name: dataset}), maxfiles=max_files_count)\n", " optimized_dataset = max_chunks_per_file(temp_fileset, maxchunks=5)[name]\n", " \n", " optimized[name] = optimized_dataset\n", " print(f\" {name}: {chunk_strategy}, {len(optimized_dataset.files)} files\")\n", " \n", " return DataGroupSpec(optimized)\n", "\n", "# Optimize our test subset\n", "optimized_subset = optimize_fileset_for_processing(test_subset)\n", "print(f\"Optimized subset: {len(optimized_subset)} datasets\")\n", "rich.print(optimized_subset)" ] }, { "cell_type": "markdown", "id": "79847d77", "metadata": {}, "source": [ "## 9. Migration from Legacy Formats\n", "\n", "This section shows how to migrate from legacy dictionary-based filesets to the explicit FileSpec classes, and introduces the `ModelFactory` conversion utility.\n", "\n", "In most cases, a well-formed legacy fileset (a purely nested dictionary) can be converted simply by passing it to the `DataGroupSpec` constructor.\n", "\n", "Note that `DataGroupSpec`, `InputFiles`, and `PreprocessedFiles` behave like dictionaries and expect a single dictionary as input, whereas the other FileSpec classes expect keyword arguments. A dictionary passed explicitly to one of those constructors should therefore be unpacked with the `**some_dict` syntax."
] }, { "cell_type": "code", "execution_count": 26, "id": "b5c82c85", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Legacy Format Migration ===\n", "Legacy fileset structure:\n", " ttbar: 2 files\n", " data: 1 files\n" ] } ], "source": [ "# 9.1 Legacy format examples\n", "print(\"=== Legacy Format Migration ===\")\n", "\n", "# Legacy dictionary format (old style)\n", "legacy_fileset = {\n", " \"ttbar\": {\n", " \"files\": {\n", " \"ttbar_1.root\": \"Events\",\n", " \"ttbar_2.root\": \"Events\"\n", " }\n", " },\n", " \"data\": {\n", " \"files\": {\n", " \"data_1.root\": {\n", " \"object_path\": \"Events\",\n", " \"steps\": [[0, 1000], [1000, 2000]],\n", " \"num_entries\": 2000,\n", " \"uuid\": \"legacy-uuid\"\n", " }\n", " }\n", " }\n", "}\n", "\n", "print(\"Legacy fileset structure:\")\n", "for name, content in legacy_fileset.items():\n", " print(f\" {name}: {len(content['files'])} files\")" ] }, { "cell_type": "markdown", "id": "5c27340c", "metadata": {}, "source": [ "## ModelFactory\n", "\n", "The `ModelFactory` class provides a few utility methods for manipulating the pydantic FileSpec classes. They serve largely as examples: a few format-related helpers (which are called internally during validation and instantiation of the classes) plus conversion functions with simple logic for manipulating the FileSpec classes.\n", "\n", "- **dict_to_ROOTFileSpec**: Tries to convert the dictionary to a concrete `CoffeaROOTFileSpec` and, failing that, falls back to the Optional type\n", "- **dict_to_parquetfilespec**: Tries to convert the dictionary to a concrete `CoffeaParquetFileSpec` and, failing that, falls back to the Optional type\n", "- **filespec_to_dict**: Inverse function to convert FileSpec to dictionaries.
Thanks to pydantic functionality, merely calls `.model_dump()` on the class\n", "- **dict_to_datasetspec**: Tries to convert the dictionary to a `DatasetSpec` using its constructor.\n", "- **datasetspec_to_dict**: If `coerce_filespec_to_dict` is True (the default), calls `.model_dump()` to convert the whole model hierarchy to dictionaries. If False, only the outermost `DatasetSpec` layer is converted, leaving a dictionary of pydantic models and elementary Python types; this is the result of calling `dict(datasetspec)` instead of `.model_dump()`\n", "- **valid_format**: Ensures the format(s) are in the list of formats supported for coffea processing\n", "- **attempt_promotion**: Accepts a FileSpec, `DatasetSpec`, or `DataGroupSpec` and tries to promote any (nested) types within it to concrete classes. This can effectively be emulated by calling the pydantic class constructor on the output of the original model's `.model_dump()` method, passing that output directly for dictionary-like models or unpacked as `**inputs` for the others" ] }, { "cell_type": "code", "execution_count": 27, "id": "f6f4842c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Converting legacy formats via ModelFactory ===\n", "The methods dict_to_ROOTFileSpec and dict_to_parquetfilespec are deprecated, as their functionality is covered by the pydantic models directly.\n", "For converting pydantic classes to dictionaries, the function datasetspec_to_dict demonstrates the two methods: model_dump() and dict(AModel).\n", "With the former, the entire model hierarchy is converted to dictionaries, while with the latter only the top-level model is converted, leaving nested models intact.\n", "DatasetSpec to pure dictionary (with coerce_filespec_to_dict=True):\n" ] }, { "data": { "text/html": [ "
{\n", " 'files': {\n", " 'ttbar_part0.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part1.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part2.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part3.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part4.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part5.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part6.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part7.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part8.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 
'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part9.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " }\n", " },\n", " 'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n", " 'format': 'root',\n", " 'compressed_form': None,\n", " 'did': None\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " 
\u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", 
" \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " 
\u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Accomplished via model_dump():\n" ] }, { "data": { "text/html": [ "
{\n", " 'files': {\n", " 'ttbar_part0.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part1.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part2.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part3.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part4.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part5.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part6.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part7.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part8.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 
'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " },\n", " 'ttbar_part9.root': {\n", " 'object_path': 'Events',\n", " 'steps': None,\n", " 'num_entries': None,\n", " 'format': 'root',\n", " 'lfn': None,\n", " 'pfn': None,\n", " 'uuid': None,\n", " 'num_selected_entries': None\n", " }\n", " },\n", " 'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n", " 'format': 'root',\n", " 'compressed_form': None,\n", " 'did': None\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " 
\u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", 
" \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " 
\u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DatasetSpec to top-level dictionary (with coerce_filespec_to_dict=False):\n" ] }, { "data": { "text/html": [ "
{\n", " 'files': InputFiles(\n", " root={\n", " 'ttbar_part0.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part3.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part4.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part5.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part6.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part7.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part8.root': CoffeaROOTFileSpecOptional(\n", " 
object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part9.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " 'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n", " 'format': 'root',\n", " 'compressed_form': None,\n", " 'did': None\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " 
\u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Now the partial conversion via dict():\n" ] }, { "data": { "text/html": [ "
{\n", " 'files': InputFiles(\n", " root={\n", " 'ttbar_part0.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part1.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part2.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part3.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part4.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part5.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part6.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part7.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part8.root': CoffeaROOTFileSpecOptional(\n", " 
object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " ),\n", " 'ttbar_part9.root': CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=None,\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=None\n", " )\n", " }\n", " ),\n", " 'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n", " 'format': 'root',\n", " 'compressed_form': None,\n", " 'did': None\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " 
\u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 10.1 Converting formats\n", "print(\"=== Converting legacy formats via ModelFactory ===\")\n", "\n", "print(\"The methods dict_to_ROOTFileSpec and dict_to_parquetfilespec are deprecated, as their functionality is covered by the pydantic models directly.\")\n", 
"\n", "print(\"For converting pydantic classes to dictionaries, the function datasetspec_to_dict demonstrates the two methods: model_dump() and dict(AModel).\")\n", "print(\"With the former, the entire model hierarchy is converted to dictionaries, while with the latter only the top-level model is converted, leaving nested models intact.\")\n", "\n", "pure_dictionary = ModelFactory.datasetspec_to_dict(full_analysis['ttbar'], coerce_filespec_to_dict=True)\n", "print(\"DatasetSpec to pure dictionary (with coerce_filespec_to_dict=True):\")\n", "rich.print(pure_dictionary)\n", "\n", "print(\"Accomplished via model_dump():\")\n", "rich.print(full_analysis['ttbar'].model_dump())\n", "\n", "mixed_dictionary = ModelFactory.datasetspec_to_dict(full_analysis['ttbar'], coerce_filespec_to_dict=False)\n", "print(\"DatasetSpec to top-level dictionary (with coerce_filespec_to_dict=False):\")\n", "rich.print(mixed_dictionary)\n", "\n", "print(\"Now the partial conversion via dict():\")\n", "rich.print(dict(full_analysis['ttbar']))\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "aeb145d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Promoting Specs to Concrete Types ===\n", "ModelFactory.attempt_promotion can be used to update Spcs after parameters have been set, fulfilling the requirements of the non-Optional variants.\n", "Starting with CoffeaROOTFileSpecOptional:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[0, 1000]],\n", " num_entries=None,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid=None,\n", " num_selected_entries=1000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "After setting num_entries and uuid:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n", " object_path='Events',\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='promote-me',\n", " num_selected_entries=1000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'promote-me'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "After promotion to CoffeaROOTFileSpec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n", " object_path='Events',\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " format='root',\n", " lfn=None,\n", " pfn=None,\n", " uuid='promote-me',\n", " num_selected_entries=1000\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'promote-me'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 10.3 Promoting Specs to concrete types\n", "print(\"=== Promoting Specs to Concrete Types ===\")\n", "print(\"ModelFactory.attempt_promotion can be used to update Specs after parameters have been set, fulfilling the requirements of the non-Optional variants.\")\n", "starting_spec = CoffeaROOTFileSpecOptional(object_path=\"Events\", steps=[[0, 1000]])\n", "print(\"Starting with CoffeaROOTFileSpecOptional:\")\n", "rich.print(starting_spec)\n", "starting_spec.num_entries = 1000\n", "starting_spec.uuid = \"promote-me\"\n", "print(\"After setting num_entries and uuid:\")\n", "rich.print(starting_spec)\n", "promoted_spec = ModelFactory.attempt_promotion(starting_spec)\n", "print(\"After promotion to CoffeaROOTFileSpec:\")\n", "rich.print(promoted_spec)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.0" } }, "nbformat": 4, "nbformat_minor": 5 }