{ "cells": [ { "cell_type": "markdown", "id": "52b09612", "metadata": {}, "source": [ "# Dataset specification with FileSpec classes\n", "\n", "This is a rendered copy of [filespec.ipynb](https://github.com/scikit-hep/coffea/blob/master/binder/filespec.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Ffilespec.ipynb)\n", "\n", "This notebook provides a comprehensive guide to using the new Pydantic-based FileSpec classes in Coffea's dataset tools. These classes provide type-safe, validated data structures for managing file specifications, datasets, and filesets in high-energy physics data analysis workflows.\n", "\n", "## Overview\n", "\n", "The FileSpec system provides:\n", "- **Type-safe data structures** with automatic validation\n", "- **Automatic format detection** for ROOT and Parquet files\n", "- **Seamless integration** with existing Coffea functions\n", "- **JSON serialization/deserialization** for data persistence\n", "- **Automatic promotion** between optional and concrete specifications\n", "\n", "## Table of Contents\n", "\n", "1. [Basic File Specifications](#basic-file-specifications)\n", "2. [InputFiles, PreprocessedFiles](#coffea-file-dict)\n", "3. [Dataset Specifications](#dataset-specifications)\n", "4. [Fileset Specifications](#fileset-specifications)\n", "5. [Integration with Preprocessing](#integration-with-preprocessing)\n", "6. [Integration with apply_to_fileset](#integration-with-apply_to_fileset)\n", "7. [Dataset Manipulation Functions](#dataset-manipulation-functions)\n", "8. [Advanced Usage Examples](#advanced-usage-examples)\n", "9. 
[Migration from Legacy Formats](#migration-from-legacy-formats)" ] }, { "cell_type": "code", "execution_count": 1, "id": "f6c75a47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FileSpec classes and dataset tools imported successfully!\n" ] } ], "source": [ "# Import necessary libraries\n", "from pydantic import ValidationError\n", "import rich\n", "import dask\n", "\n", "\n", "# Import the FileSpec classes and dataset tools\n", "from coffea.dataset_tools import (\n", " # FileSpec classes\n", " ROOTFileSpec,\n", " ParquetFileSpec,\n", " CoffeaROOTFileSpec,\n", " CoffeaROOTFileSpecOptional,\n", " CoffeaParquetFileSpec,\n", " CoffeaParquetFileSpecOptional,\n", " InputFiles,\n", " DatasetSpec,\n", " DataGroupSpec,\n", " \n", " # Dataset manipulation functions\n", " preprocess,\n", " apply_to_fileset,\n", " max_chunks,\n", " max_chunks_per_file,\n", " slice_chunks,\n", " slice_files,\n", " max_files,\n", " filter_files,\n", "\n", " # ModelFactory utility class\n", " ModelFactory,\n", ")\n", "from coffea.nanoevents import NanoAODSchema\n", "from coffea.processor.test_items import NanoEventsProcessor\n", "\n", "print(\"FileSpec classes and dataset tools imported successfully!\")" ] }, { "cell_type": "markdown", "id": "04306ba3", "metadata": {}, "source": [ "## 1. 
Basic File Specifications\n", "\n", "The FileSpec system provides several classes for representing individual file specifications:\n", "\n", "### File Specification Hierarchy\n", "\n", "- **ROOTFileSpec**: Basic specification for ROOT files\n", "- **ParquetFileSpec**: Basic specification for Parquet files \n", "- **CoffeaROOTFileSpecOptional**: ROOT files with optional metadata\n", "- **CoffeaROOTFileSpec**: ROOT files with complete metadata (required)\n", "- **CoffeaParquetFileSpecOptional**: Parquet files with optional metadata\n", "- **CoffeaParquetFileSpec**: Parquet files with complete metadata (required)" ] }, { "cell_type": "code", "execution_count": 2, "id": "2ecaf55b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Basic ROOTFileSpec ===\n", "Basic ROOT spec:\n" ] }, { "data": { "text/html": [ "
ROOTFileSpec(\n",
       "    object_path='Events',\n",
       "    steps=None,\n",
       "    num_entries=None,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    num_selected_entries=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: root\n", "Steps: None\n", "\n", "ROOT spec with steps:\n" ] }, { "data": { "text/html": [ "
ROOTFileSpec(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000], [1000, 2000], [2000, 3000]],\n",
       "    num_entries=None,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    num_selected_entries=3000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m2000\u001b[0m, \u001b[1;36m3000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m3000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.1 Basic ROOTFileSpec for ROOT files\n", "print(\"=== Basic ROOTFileSpec ===\")\n", "\n", "# Minimal ROOT file specification\n", "uproot_spec = ROOTFileSpec(object_path=\"Events\")\n", "print(\"Basic ROOT spec:\")\n", "rich.print(uproot_spec)\n", "print(f\"Format: {uproot_spec.format}\")\n", "print(f\"Steps: {uproot_spec.steps}\")\n", "\n", "# ROOT file specification with steps\n", "uproot_spec_with_steps = ROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000], [2000, 3000]]\n", ")\n", "print(\"\\nROOT spec with steps:\")\n", "rich.print(uproot_spec_with_steps)" ] }, { "cell_type": "code", "execution_count": 3, "id": "d59e690c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Basic ParquetFileSpec ===\n", "Basic Parquet spec:\n" ] }, { "data": { "text/html": [ "
ParquetFileSpec(\n",
       "    object_path=None,\n",
       "    steps=None,\n",
       "    num_entries=None,\n",
       "    format='parquet',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    num_selected_entries=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: parquet\n", "Object path (always None): None\n", "\n", "Parquet spec with steps:\n" ] }, { "data": { "text/html": [ "
ParquetFileSpec(\n",
       "    object_path=None,\n",
       "    steps=[[0, 5000], [5000, 10000]],\n",
       "    num_entries=None,\n",
       "    format='parquet',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    num_selected_entries=10000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.2 Basic ParquetFileSpec for Parquet files\n", "print(\"=== Basic ParquetFileSpec ===\")\n", "\n", "# Minimal Parquet file specification\n", "parquet_spec = ParquetFileSpec()\n", "print(\"Basic Parquet spec:\")\n", "rich.print(parquet_spec)\n", "print(f\"Format: {parquet_spec.format}\")\n", "print(f\"Object path (always None): {parquet_spec.object_path}\")\n", "\n", "# Parquet file specification with steps\n", "parquet_spec_with_steps = ParquetFileSpec(\n", " steps=[[0, 5000], [5000, 10000]]\n", ")\n", "print(\"\\nParquet spec with steps:\")\n", "rich.print(parquet_spec_with_steps)" ] }, { "cell_type": "code", "execution_count": 4, "id": "85d98abf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CoffeaROOTFileSpecOptional ===\n", "Optional ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=None,\n",
       "    num_entries=None,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid=None,\n",
       "    num_selected_entries=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Partial ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000]],\n",
       "    num_entries=1000,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid=None,\n",
       "    num_selected_entries=1000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Complete optional ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000], [1000, 2000]],\n",
       "    num_entries=2000,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='12345678-90ab-cdef-1234-567890abcdef',\n",
       "    num_selected_entries=2000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'12345678-90ab-cdef-1234-567890abcdef'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.3 CoffeaROOTFileSpecOptional - ROOT files with optional metadata\n", "print(\"=== CoffeaROOTFileSpecOptional ===\")\n", "\n", "# Optional specification with minimal data\n", "coffea_uproot_optional = CoffeaROOTFileSpecOptional(object_path=\"Events\")\n", "print(\"Optional ROOT spec:\")\n", "rich.print(coffea_uproot_optional)\n", "\n", "# Optional specification with some metadata\n", "coffea_uproot_partial = CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\",\n", " steps=[[0, 1000]],\n", " num_entries=1000\n", ")\n", "print(\"Partial ROOT spec:\")\n", "rich.print(coffea_uproot_partial)\n", "\n", "# Optional specification with all metadata\n", "coffea_uproot_complete = CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " uuid=\"12345678-90ab-cdef-1234-567890abcdef\"\n", ")\n", "print(\"Complete optional ROOT spec:\")\n", "rich.print(coffea_uproot_complete)" ] }, { "cell_type": "code", "execution_count": 5, "id": "38ee2f96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "=== CoffeaROOTFileSpec ===\n", "Complete required ROOT spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000], [1000, 2000]],\n",
       "    num_entries=2000,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='12345678-90ab-cdef-1234-567890abcdef',\n",
       "    num_selected_entries=2000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'12345678-90ab-cdef-1234-567890abcdef'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Expected validation error for incomplete spec: 3 errors\n" ] }, { "data": { "text/html": [ "
3 validation errors for CoffeaROOTFileSpec\n",
       "steps\n",
       "  Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n",
       "    For further information visit https://errors.pydantic.dev/2.11/v/missing\n",
       "num_entries\n",
       "  Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n",
       "    For further information visit https://errors.pydantic.dev/2.11/v/missing\n",
       "uuid\n",
       "  Field required [type=missing, input_value={'object_path': 'Events'}, input_type=dict]\n",
       "    For further information visit https://errors.pydantic.dev/2.11/v/missing\n",
       "
\n" ], "text/plain": [ "\u001b[1;36m3\u001b[0m validation errors for CoffeaROOTFileSpec\n", "steps\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n", "num_entries\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n", "uuid\n", " Field required \u001b[1m[\u001b[0m\u001b[33mtype\u001b[0m=\u001b[35mmissing\u001b[0m, \u001b[33minput_value\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m\u001b[1m}\u001b[0m, \u001b[33minput_type\u001b[0m=\u001b[35mdict\u001b[0m\u001b[1m]\u001b[0m\n", " For further information visit \u001b[4;94mhttps://errors.pydantic.dev/2.11/v/missing\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.4 CoffeaROOTFileSpec - ROOT files with required metadata\n", "print(\"=== CoffeaROOTFileSpec ===\")\n", "\n", "# Complete specification (all fields required)\n", "try:\n", " coffea_uproot_required = CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " uuid=\"12345678-90ab-cdef-1234-567890abcdef\"\n", " )\n", " print(\"Complete required ROOT spec:\")\n", " rich.print(coffea_uproot_required)\n", "except ValidationError as e:\n", " print(f\"Validation error: {e}\")\n", "\n", "# Attempt to create incomplete specification (should fail)\n", 
"try:\n", " incomplete_spec = CoffeaROOTFileSpec(object_path=\"Events\")\n", " print(\"This shouldn't print - validation should fail!\")\n", "except ValidationError as e:\n", " print(f\"Expected validation error for incomplete spec: {e.error_count()} errors\")\n", " rich.print(e)" ] }, { "cell_type": "code", "execution_count": 6, "id": "694daed3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CoffeaParquetFileSpec ===\n", "Optional Parquet spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpecOptional(\n",
       "    object_path=None,\n",
       "    steps=[[0, 5000]],\n",
       "    num_entries=5000,\n",
       "    format='parquet',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='parquet-uuid-example',\n",
       "    num_selected_entries=5000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m5000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'parquet-uuid-example'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m5000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Required Parquet spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpec(\n",
       "    object_path=None,\n",
       "    steps=[[0, 5000], [5000, 10000]],\n",
       "    num_entries=10000,\n",
       "    format='parquet',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='parquet-uuid-complete',\n",
       "    num_selected_entries=10000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'parquet-uuid-complete'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 1.5 Parquet specifications (similar pattern)\n", "print(\"=== CoffeaParquetFileSpec ===\")\n", "\n", "# Optional Parquet specification\n", "parquet_optional = CoffeaParquetFileSpecOptional(\n", " steps=[[0, 5000]],\n", " num_entries=5000,\n", " uuid=\"parquet-uuid-example\"\n", ")\n", "print(\"Optional Parquet spec:\")\n", "rich.print(parquet_optional)\n", "\n", "# Required Parquet specification\n", "parquet_required = CoffeaParquetFileSpec(\n", " steps=[[0, 5000], [5000, 10000]],\n", " num_entries=10000,\n", " uuid=\"parquet-uuid-complete\"\n", ")\n", "print(\"Required Parquet spec:\")\n", "rich.print(parquet_required)" ] }, { "cell_type": "markdown", "id": "3f36f852", "metadata": {}, "source": [ "## 2. InputFiles Specification\n", "\n", "The `InputFiles` class is a dictionary-like container for any mixture of CoffeaFileSpec classes, both ROOT/Parquet and concrete/Optional. `PreprocessedFiles` is the specific subtype permitting only concrete FileSpecs. 
They automatically handle:\n", "\n", "- **Format detection**: Automatically identifies whether files are ROOT or Parquet by testing the key (filename)\n", "- **Dictionary-like interface**: Easy access to files using standard dict methods\n", "- **FileSpec promotion**: Automatically tries to upcast CoffeaROOTFileSpecOptional and CoffeaParquetFileSpecOptional to their concrete classes once the necessary fields have been set after initialization (such as when they are preprocessed)\n", "- **FileSpec-wide format**: Provides the `format` computed property to determine which format(s) are present.\n", "\n", "An `InputFiles` or `PreprocessedFiles` instance forms the \"files\" subfield of the `DatasetSpec` class. Unlike the FileSpec classes, it does not require keyword arguments in the constructor; simply pass in a regular dictionary of the form `{\"filename1\": dict|FileSpec, ..., \"filenameN\": dict|FileSpec}`." ] }, { "cell_type": "code", "execution_count": 7, "id": "3cfd0d6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== InputFiles ===\n", "InputFiles:\n" ] }, { "data": { "text/html": [ "
InputFiles(\n",
       "    root={\n",
       "        'file1.root': CoffeaROOTFileSpec(\n",
       "            object_path='Events',\n",
       "            steps=[[0, 10]],\n",
       "            num_entries=10,\n",
       "            format='root',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid='uuid1',\n",
       "            num_selected_entries=10\n",
       "        ),\n",
       "        'file1.parquet': CoffeaParquetFileSpec(\n",
       "            object_path=None,\n",
       "            steps=[[0, 100]],\n",
       "            num_entries=100,\n",
       "            format='parquet',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid='uuid2',\n",
       "            num_selected_entries=100\n",
       "        ),\n",
       "        'file2.root': CoffeaROOTFileSpecOptional(\n",
       "            object_path='Events',\n",
       "            steps=[[10, 20]],\n",
       "            num_entries=None,\n",
       "            format='root',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid=None,\n",
       "            num_selected_entries=10\n",
       "        )\n",
       "    }\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'file1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file1.parquet'\u001b[0m: \u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format(s): root|parquet\n", "Number of files: 3\n", "Iterating over file dict:\n", "File: file1.root\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 10]],\n",
       "    num_entries=10,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='uuid1',\n",
       "    num_selected_entries=10\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "File: file1.parquet\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaParquetFileSpec(\n",
       "    object_path=None,\n",
       "    steps=[[0, 100]],\n",
       "    num_entries=100,\n",
       "    format='parquet',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='uuid2',\n",
       "    num_selected_entries=100\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "File: file2.root\n", "Spec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=[[10, 20]],\n",
       "    num_entries=None,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid=None,\n",
       "    num_selected_entries=10\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Accessing 'file1.root': object_path='Events' steps=[[0, 10]] num_entries=10 format='root' lfn=None pfn=None uuid='uuid1' num_selected_entries=10\n", "=== Modifying a file spec in the dict ===\n", "Keys in filedict: ['file1.root', 'file1.parquet', 'file2.root', 'file3.root']\n" ] } ], "source": [ "# 2.1 Create an InputFiles\n", "print(\"=== InputFiles ===\")\n", "\n", "# using a dictionary of CoffeaROOTFileSpec(Optional) and CoffeaParquetFileSpec(Optional)\n", "dict_of_filespecs = {\n", " \"file1.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\", steps=[[0, 10]], num_entries=10, uuid=\"uuid1\"\n", " ),\n", " \"file1.parquet\": CoffeaParquetFileSpec(\n", " steps=[[0, 100]], num_entries=100, uuid=\"uuid2\"\n", " ),\n", " \"file2.root\": CoffeaROOTFileSpecOptional(\n", " object_path=\"Events\", steps=[[10, 20]], num_entries=None, uuid=None\n", " ),\n", "}\n", "\n", "filedict_from_filespecs = InputFiles(dict_of_filespecs)\n", "\n", "print(\"InputFiles:\")\n", "rich.print(filedict_from_filespecs)\n", "\n", "# computed property: format\n", "print(f\"Detected format(s): {filedict_from_filespecs.format}\")\n", "print(f\"Number of files: 
{len(filedict_from_filespecs)}\")\n", "\n", "# Iteration over the file dict\n", "print(\"Iterating over file dict:\")\n", "for fname, spec in filedict_from_filespecs.items():\n", " print(f\"File: {fname}\\nSpec:\")\n", " rich.print(spec)\n", " \n", "# __getitem__, __setitem__ access\n", "print(f\"Accessing 'file1.root': {filedict_from_filespecs['file1.root']}\")\n", "\n", "print(\"=== Modifying a file spec in the dict ===\")\n", "filedict_from_filespecs[\"file2.root\"].num_entries = 20\n", "\n", "filedict_from_filespecs[\"file3.root\"] = CoffeaROOTFileSpec(\n", " object_path=\"Events\", steps=[[0, 30]], num_entries=30, uuid=\"uuid3\"\n", ")\n", "\n", "# show keys\n", "print(f\"Keys in filedict: {list(filedict_from_filespecs.keys())}\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "b2db2dcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== InputFiles from pure dictionary ===\n", "InputFiles from pure dictionary:\n" ] }, { "data": { "text/html": [ "
InputFiles(\n",
       "    root={\n",
       "        'file1.root': CoffeaROOTFileSpec(\n",
       "            object_path='Events',\n",
       "            steps=[[0, 10]],\n",
       "            num_entries=10,\n",
       "            format='root',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid='uuid1',\n",
       "            num_selected_entries=10\n",
       "        ),\n",
       "        'file1.parquet': CoffeaParquetFileSpec(\n",
       "            object_path=None,\n",
       "            steps=[[0, 100]],\n",
       "            num_entries=100,\n",
       "            format='parquet',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid='uuid2',\n",
       "            num_selected_entries=100\n",
       "        ),\n",
       "        'file2.root': CoffeaROOTFileSpecOptional(\n",
       "            object_path='Events',\n",
       "            steps=[[10, 20]],\n",
       "            num_entries=None,\n",
       "            format='root',\n",
       "            lfn=None,\n",
       "            pfn=None,\n",
       "            uuid=None,\n",
       "            num_selected_entries=10\n",
       "        )\n",
       "    }\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'file1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m10\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m10\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid1'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file1.parquet'\u001b[0m: \u001b[1;35mCoffeaParquetFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m100\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'parquet'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'uuid2'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m100\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'file2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m10\u001b[0m, \u001b[1;36m20\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 2.2 Create an InputFiles from a pure dictionary\n", "print(\"=== InputFiles from pure dictionary ===\")\n", "\n", "dict_of_dicts = {\n", " \"file1.root\": {\n", " \"object_path\": \"Events\", \n", " \"steps\": [[0, 10]], \n", " \"num_entries\": 10, \n", " \"uuid\": \"uuid1\"\n", " },\n", " \"file1.parquet\": {\n", " \"steps\": [[0, 100]], \n", " \"num_entries\": 100, \n", " \"uuid\": \"uuid2\"\n", " },\n", " \"file2.root\": {\n", " \"object_path\": \"Events\", \n", " \"steps\": [[10, 20]], \n", " \"num_entries\": None, \n", " \"uuid\": None\n", " },\n", "\n", "}\n", "\n", "filedict_from_pure_dict = InputFiles(dict_of_dicts)\n", "print(\"InputFiles from pure dictionary:\")\n", "rich.print(filedict_from_pure_dict)" ] }, { "cell_type": "markdown", "id": "5f18f95b", "metadata": {}, "source": [ "## 3. Dataset Specifications\n", "\n", "The `DatasetSpec` class represents a collection of files that form a logical dataset. It automatically handles:\n", "\n", "- **Format detection**: Automatically identifies whether files are ROOT or Parquet\n", "- **File validation**: Ensures all files in a dataset are compatible\n", "- **Metadata management**: Stores dataset-level metadata and forms\n", "- **Dictionary-like interface**: Easy access to files using standard dict methods" ] }, { "cell_type": "code", "execution_count": 9, "id": "b31c13f5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec Creation ===\n", "Simple ROOT dataset:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n",
       "    files=InputFiles(\n",
       "        root={\n",
       "            'data_file_1.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'data_file_2.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'data_file_3.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            )\n",
       "        }\n",
       "    ),\n",
       "    metadata={'sample_type': 'data', 'year': 2023},\n",
       "    format='root',\n",
       "    compressed_form=None,\n",
       "    did=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'data_file_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_file_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_file_3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'sample_type'\u001b[0m: \u001b[32m'data'\u001b[0m, \u001b[32m'year'\u001b[0m: \u001b[1;36m2023\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format: root\n", "Number of files: 3\n", "Metadata: {'sample_type': 'data', 'year': 2023}\n" ] } ], "source": [ "# 3.1 Creating DatasetSpec from file dictionaries\n", "print(\"=== DatasetSpec Creation ===\")\n", "\n", "# Create a dataset from a simple file dictionary (ROOT files)\n", "root_dataset_simple = DatasetSpec(\n", " files={\n", " \"data_file_1.root\": \"Events\",\n", " \"data_file_2.root\": \"Events\",\n", " \"data_file_3.root\": \"Events\"\n", " },\n", " metadata={\"sample_type\": \"data\", \"year\": 2023}\n", ")\n", "\n", "print(\"Simple ROOT dataset:\")\n", "rich.print(root_dataset_simple)\n", "print(f\"Detected format: {root_dataset_simple.format}\")\n", "print(f\"Number of files: {len(root_dataset_simple.files)}\")\n", "print(f\"Metadata: {root_dataset_simple.metadata}\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "76aa5992", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec with Complete Specifications ===\n", "Complete dataset:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n",
       "    files=InputFiles(\n",
       "        root={\n",
       "            'processed_data_1.root': CoffeaROOTFileSpec(\n",
       "                object_path='Events',\n",
       "                steps=[[0, 1000], [1000, 2000]],\n",
       "                num_entries=2000,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid='file1-uuid',\n",
       "                num_selected_entries=2000\n",
       "            ),\n",
       "            'processed_data_2.root': CoffeaROOTFileSpec(\n",
       "                object_path='Events',\n",
       "                steps=[[0, 1500], [1500, 3000]],\n",
       "                num_entries=3000,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid='file2-uuid',\n",
       "                num_selected_entries=3000\n",
       "            )\n",
       "        }\n",
       "    ),\n",
       "    metadata={'processing_version': 'v2.1', 'cross_section': 1.23},\n",
       "    format='root',\n",
       "    compressed_form=None,\n",
       "    did=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'processed_data_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1000\u001b[0m, \u001b[1;36m2000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m2000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'file1-uuid'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m2000\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'processed_data_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1500\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m1500\u001b[0m, \u001b[1;36m3000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m3000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'file2-uuid'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m3000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'processing_version'\u001b[0m: 
\u001b[32m'v2.1'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m1.23\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Detected format: root\n", "Number of files: 2\n", "Ready for column-joining: False\n" ] } ], "source": [ "# 3.2 Creating DatasetSpec with complete file specifications\n", "print(\"=== DatasetSpec with Complete Specifications ===\")\n", "\n", "# Create individual file specifications\n", "file1_spec = CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000], [1000, 2000]],\n", " num_entries=2000,\n", " uuid=\"file1-uuid\"\n", ")\n", "\n", "file2_spec = CoffeaROOTFileSpec(\n", " object_path=\"Events\", \n", " steps=[[0, 1500], [1500, 3000]],\n", " num_entries=3000,\n", " uuid=\"file2-uuid\"\n", ")\n", "\n", "# Create dataset with complete specifications\n", "complete_dataset = DatasetSpec(\n", " files=InputFiles({\n", " \"processed_data_1.root\": file1_spec,\n", " \"processed_data_2.root\": file2_spec\n", " }),\n", " metadata={\"processing_version\": \"v2.1\", \"cross_section\": 1.23},\n", ")\n", "\n", "print(\"Complete dataset:\")\n", "rich.print(complete_dataset)\n", "print(f\"Detected format: {complete_dataset.format}\")\n", "print(f\"Number of files: {len(complete_dataset.files)}\")\n", "print(f\"Ready for column-joining: {complete_dataset.joinable}\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "6af14880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Mixed Format Datasets ===\n", "Validation error for mixed format dataset: 1 validation error for DatasetSpec\n", " Value error, format: format must be one of {'root', 'parquet'} [type=value_error, input_value={'files': 
{'data.root': C...selected_entries=2000)}}, input_type=dict]\n", " For further information visit https://errors.pydantic.dev/2.11/v/value_error\n" ] } ], "source": [ "# 3.3 Mixed format handling\n", "print(\"=== Mixed Format Datasets ===\")\n", "\n", "# Try to create a dataset with both ROOT and Parquet files\n", "try:\n", " mixed_dataset = DatasetSpec(\n", " files={\n", " \"data.root\": CoffeaROOTFileSpec(\n", " object_path=\"Events\",\n", " steps=[[0, 1000]],\n", " num_entries=1000,\n", " uuid=\"root-uuid\"\n", " ),\n", " \"data.parquet\": CoffeaParquetFileSpec(\n", " steps=[[0, 2000]],\n", " num_entries=2000,\n", " uuid=\"parquet-uuid\"\n", " )\n", " }\n", " )\n", "\n", " print(\"Mixed format dataset:\")\n", " rich.print(mixed_dataset)\n", " print(f\"Detected format: {mixed_dataset.format}\")\n", "except ValidationError as e:\n", " print(f\"Validation error for mixed format dataset: {e}\")\n", "\n", "# Mixed-format datasets are rejected by validation. If you need one, file an issue in the coffea GitHub repository requesting the feature, with an example of your use case." ] }, { "cell_type": "code", "execution_count": 12, "id": "98564dea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DatasetSpec from File Lists ===\n", "Dataset from list:\n" ] }, { "data": { "text/html": [ "
DatasetSpec(\n",
       "    files=InputFiles(\n",
       "        root={\n",
       "            'simulation_1.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'simulation_2.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'root://simulation_3.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'simulation_4.root.1': CoffeaROOTFileSpecOptional(\n",
       "                object_path='AuxiliaryData',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            )\n",
       "        }\n",
       "    ),\n",
       "    metadata={'sample_type': 'simulation', 'process': 'ttbar'},\n",
       "    format='root',\n",
       "    compressed_form=None,\n",
       "    did=None\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'simulation_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'simulation_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'root://simulation_3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'simulation_4.root.1'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'AuxiliaryData'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'sample_type'\u001b[0m: \u001b[32m'simulation'\u001b[0m, \u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Files: ['simulation_1.root', 'simulation_2.root', 'root://simulation_3.root', 'simulation_4.root.1']\n" ] } ], "source": [ "# 3.4 DatasetSpec from file lists\n", "print(\"=== DatasetSpec from File Lists ===\")\n", "\n", "# Create dataset from a list of file:object_path strings\n", "dataset_from_list = DatasetSpec(\n", " files=[\n", " \"simulation_1.root:Events\",\n", " \"simulation_2.root:Events\", \n", " \"root://simulation_3.root:Events\",\n", " \"simulation_4.root.1:AuxiliaryData\",\n", " ],\n", " metadata={\"sample_type\": \"simulation\", \"process\": \"ttbar\"}\n", ")\n", "\n", "print(\"Dataset from list:\")\n", 
"rich.print(dataset_from_list)\n", "print(f\"Files: {list(dataset_from_list.files.keys())}\")" ] }, { "cell_type": "markdown", "id": "762e909c", "metadata": {}, "source": [ "## 4. Fileset Specifications\n", "\n", "The `DataGroupSpec` class represents a collection of datasets, typically used for analysis workflows. It provides:\n", "\n", "- **Multiple datasets management**: Handle multiple physics processes/samples\n", "- **JSON serialization**: Save and load complete analysis configurations \n", "- **Dictionary interface**: Access datasets by name\n", "- **Validation**: Ensure all datasets are properly specified" ] }, { "cell_type": "code", "execution_count": 13, "id": "7109b611", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== DataGroupSpec Creation ===\n", "Analysis fileset:\n" ] }, { "data": { "text/html": [ "
DataGroupSpec(\n",
       "    root={\n",
       "        'ttbar_simulation': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'ttbar_1.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    ),\n",
       "                    'ttbar_2.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={'process': 'ttbar', 'cross_section': 831.8},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        ),\n",
       "        'single_top': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'singletop_1.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    ),\n",
       "                    'singletop_2.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={'process': 'single_top', 'cross_section': 136.02},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        ),\n",
       "        'data': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'data_2023A.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    ),\n",
       "                    'data_2023B.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={'is_data': True, 'era': '2023'},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        )\n",
       "    }\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_simulation'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'single_top'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'singletop_1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'singletop_2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'single_top'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m136.02\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " 
\u001b[1m)\u001b[0m,\n", " \u001b[32m'data'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'data_2023A.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'data_2023B.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'is_data'\u001b[0m: \u001b[3;92mTrue\u001b[0m, \u001b[32m'era'\u001b[0m: \u001b[32m'2023'\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, 
"metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of datasets: 3\n", "Dataset names: ['ttbar_simulation', 'single_top', 'data']\n" ] } ], "source": [ "# 4.1 Creating DataGroupSpec\n", "print(\"=== DataGroupSpec Creation ===\")\n", "\n", "# Create a fileset with multiple datasets\n", "analysis_fileset = DataGroupSpec({\n", " \"ttbar_simulation\": DatasetSpec(\n", " files={\n", " \"ttbar_1.root\": \"Events\",\n", " \"ttbar_2.root\": \"Events\"\n", " },\n", " metadata={\"process\": \"ttbar\", \"cross_section\": 831.8}\n", " ),\n", " \n", " \"single_top\": DatasetSpec(\n", " files={\n", " \"singletop_1.root\": \"Events\",\n", " \"singletop_2.root\": \"Events\"\n", " },\n", " metadata={\"process\": \"single_top\", \"cross_section\": 136.02}\n", " ),\n", " \n", " \"data\": DatasetSpec(\n", " files={\n", " \"data_2023A.root\": \"Events\", \n", " \"data_2023B.root\": \"Events\"\n", " },\n", " metadata={\"is_data\": True, \"era\": \"2023\"}\n", " )\n", "})\n", "\n", "print(\"Analysis fileset:\")\n", "rich.print(analysis_fileset)\n", "print(f\"Number of datasets: {len(analysis_fileset)}\")\n", "print(f\"Dataset names: {list(analysis_fileset.keys())}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "376a6765", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Fileset Access and Manipulation ===\n", "TTbar dataset: {'process': 'ttbar', 'cross_section': 831.8}\n", "\n", "Dataset summary:\n", " ttbar_simulation: 2 files, process=ttbar\n", " single_top: 2 files, process=single_top\n", " data: 2 files, process=unknown\n", "\n", "After adding WJets: 4 datasets\n" ] } ], "source": [ "# 4.2 Accessing and manipulating filesets\n", "print(\"=== Fileset Access and Manipulation ===\")\n", "\n", "# Access individual datasets\n", "ttbar_dataset = analysis_fileset[\"ttbar_simulation\"]\n", "print(f\"TTbar dataset: {ttbar_dataset.metadata}\")\n", "\n", "# Iterate over datasets\n", 
"print(\"\\nDataset summary:\")\n", "for dataset_name, dataset in analysis_fileset.items():\n", " num_files = len(dataset.files)\n", " process = dataset.metadata.get(\"process\", \"unknown\")\n", " print(f\" {dataset_name}: {num_files} files, process={process}\")\n", "\n", "# Add a new dataset\n", "analysis_fileset[\"wjets\"] = DatasetSpec(\n", " files={\"wjets_1.root\": \"Events\"},\n", " metadata={\"process\": \"wjets\", \"cross_section\": 61526.7}\n", ")\n", "\n", "print(f\"\\nAfter adding WJets: {len(analysis_fileset)} datasets\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "fc3b6e90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== JSON Serialization ===\n", "Fileset JSON (first 500 characters):\n", "{\n", " \"ttbar_simulation\": {\n", " \"files\": {\n", " \"ttbar_1.root\": {\n", " \"object_path\": \"Events\",\n", " \"steps\": null,\n", " \"num_entries\": null,\n", " \"format\": \"root\",\n", " \"lfn\": null,\n", " \"pfn\": null,\n", " \"uuid\": null,\n", " \"num_selected_entries\": null\n", " },\n", " \"ttbar_2.root\": {\n", " \"object_path\": \"Events\",\n", " \"steps\": null,\n", " \"num_entries\": null,\n", " \"format\": \"root\",\n", " \"lfn\": null,\n", " \"pfn\": null,\n", " \"uuid\": null,\n", " \"num_se...\n", "\n", "Restored fileset has 4 datasets\n", "Dataset names match: True\n" ] } ], "source": [ "# 4.3 JSON serialization and deserialization\n", "print(\"=== JSON Serialization ===\")\n", "\n", "# Serialize fileset to JSON\n", "fileset_json = analysis_fileset.model_dump_json(indent=2)\n", "print(\"Fileset JSON (first 500 characters):\")\n", "print(fileset_json[:500] + \"...\" if len(fileset_json) > 500 else fileset_json)\n", "\n", "# Deserialize from JSON\n", "restored_fileset = DataGroupSpec.model_validate_json(fileset_json)\n", "print(f\"\\nRestored fileset has {len(restored_fileset)} datasets\")\n", "print(f\"Dataset names match: {set(analysis_fileset.keys()) == 
set(restored_fileset.keys())}\")" ] }, { "cell_type": "markdown", "id": "69c1d204", "metadata": {}, "source": [ "## 5. Integration with Preprocessing\n", "\n", "The `preprocess` function accepts FileSpec classes directly and promotes Optional types to concrete types for every file it accesses successfully. During preprocessing it can:\n", "\n", "- **Calculate file steps**: Automatically determine chunk boundaries (controlled by `step_size`)\n", "- **Extract metadata**: Get file UUIDs, entry counts, and schemas (using `save_form=True`)\n", "- **Generate forms**: Create Awkward Array forms for type checking\n", "- **Handle errors**: Skip bad files (using `skip_bad_files=True`) and report issues" ] }, { "cell_type": "code", "execution_count": 16, "id": "8d360bf8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Preprocessing with FileSpec ===\n" ] }, { "data": { "text/html": [ "
DataGroupSpec(\n",
       "    root={\n",
       "        'ZJets': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': \n",
       "CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        ),\n",
       "        'Data': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': \n",
       "CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    ),\n",
       "                    'nano_dimuon_not_there.root': CoffeaROOTFileSpecOptional(\n",
       "                        object_path='Events',\n",
       "                        steps=None,\n",
       "                        num_entries=None,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid=None,\n",
       "                        num_selected_entries=None\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        )\n",
       "    }\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \n", "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \n", "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'nano_dimuon_not_there.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Preprocessing the fileset...\n", "Fileset after preprocessing (excluding compressed_form string):\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'ZJets': {\n",
       "        'files': {\n",
       "            'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': {\n",
       "                'object_path': 'Events',\n",
       "                'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n",
       "                'num_entries': 40,\n",
       "                'format': 'root',\n",
       "                'lfn': None,\n",
       "                'pfn': None,\n",
       "                'uuid': 'a9490124-3648-11ea-89e9-f5b55c90beef',\n",
       "                'num_selected_entries': 40\n",
       "            }\n",
       "        },\n",
       "        'metadata': {},\n",
       "        'format': 'root',\n",
       "        'did': None\n",
       "    },\n",
       "    'Data': {\n",
       "        'files': {\n",
       "            'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': {\n",
       "                'object_path': 'Events',\n",
       "                'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n",
       "                'num_entries': 40,\n",
       "                'format': 'root',\n",
       "                'lfn': None,\n",
       "                'pfn': None,\n",
       "                'uuid': 'a210a3f8-3648-11ea-a29f-f5b55c90beef',\n",
       "                'num_selected_entries': 40\n",
       "            }\n",
       "        },\n",
       "        'metadata': {},\n",
       "        'format': 'root',\n",
       "        'did': None\n",
       "    }\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a9490124-3648-11ea-89e9-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: 
\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a210a3f8-3648-11ea-a29f-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Updated files (including inaccessible ones):\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'ZJets': {\n",
       "        'files': {\n",
       "            'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root': {\n",
       "                'object_path': 'Events',\n",
       "                'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n",
       "                'num_entries': 40,\n",
       "                'format': 'root',\n",
       "                'lfn': None,\n",
       "                'pfn': None,\n",
       "                'uuid': 'a9490124-3648-11ea-89e9-f5b55c90beef',\n",
       "                'num_selected_entries': 40\n",
       "            }\n",
       "        },\n",
       "        'metadata': {},\n",
       "        'format': 'root',\n",
       "        'did': None\n",
       "    },\n",
       "    'Data': {\n",
       "        'files': {\n",
       "            'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root': {\n",
       "                'object_path': 'Events',\n",
       "                'steps': [[0, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 40]],\n",
       "                'num_entries': 40,\n",
       "                'format': 'root',\n",
       "                'lfn': None,\n",
       "                'pfn': None,\n",
       "                'uuid': 'a210a3f8-3648-11ea-a29f-f5b55c90beef',\n",
       "                'num_selected_entries': 40\n",
       "            },\n",
       "            'nano_dimuon_not_there.root': {\n",
       "                'object_path': 'Events',\n",
       "                'steps': None,\n",
       "                'num_entries': None,\n",
       "                'format': 'root',\n",
       "                'lfn': None,\n",
       "                'pfn': None,\n",
       "                'uuid': None,\n",
       "                'num_selected_entries': None\n",
       "            }\n",
       "        },\n",
       "        'metadata': {},\n",
       "        'format': 'root',\n",
       "        'did': None\n",
       "    }\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a9490124-3648-11ea-89e9-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: 
\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m7\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m14\u001b[0m, \u001b[1;36m21\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m21\u001b[0m, \u001b[1;36m28\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m28\u001b[0m, \u001b[1;36m35\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m35\u001b[0m, \u001b[1;36m40\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[1;36m40\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[32m'a210a3f8-3648-11ea-a29f-f5b55c90beef'\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[1;36m40\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'nano_dimuon_not_there.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 5.1 Basic preprocessing with 
FileSpec\n", "print(\"=== Preprocessing with FileSpec ===\")\n", "\n", "# Note: nano_dimuon_not_there.root is deliberately missing, to show how skip_bad_files handles inaccessible files\n", "demo_fileset = DataGroupSpec({\n", " \"ZJets\": {\"files\": [\"https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dy.root:Events\"]},\n", " \"Data\": {\"files\": [\n", " \"https://raw.githubusercontent.com/scikit-hep/coffea/master/tests/samples/nano_dimuon.root:Events\",\n", " \"nano_dimuon_not_there.root:Events\",\n", " ]},\n", "})\n", "rich.print(demo_fileset)\n", "\n", "print(\"Preprocessing the fileset...\")\n", "dataset_runnable, dataset_updated = preprocess(\n", " demo_fileset,\n", " step_size=7,\n", " align_clusters=False,\n", " files_per_batch=10,\n", " skip_bad_files=True,\n", " save_form=True,\n", ")\n", "print(\"Fileset after preprocessing (excluding compressed_form string):\")\n", "rich.print({k: v.model_dump(exclude={\"compressed_form\"}) for k, v in dataset_runnable.items()})\n", "\n", "print(\"Updated files (including inaccessible ones):\")\n", "rich.print({k: v.model_dump(exclude={\"compressed_form\"}) for k, v in dataset_updated.items()})\n", "# Optionally show only the files that failed preprocessing:\n", "# rich.print({dname: {k: v for k, v in dataset_updated[dname].files.items() if k not in dataset_runnable[dname].files} for dname in dataset_updated})\n" ] }, { "cell_type": "markdown", "id": "6b974346", "metadata": {}, "source": [ "## 6. 
Integration with apply_to_fileset\n", "\n", "The `apply_to_fileset` function takes the runnable fileset returned by `preprocess` and builds a task graph that is then executed with `dask.compute`:" ] }, { "cell_type": "code", "execution_count": 17, "id": "39a43fa8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_electronIdx => Electron\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_genPartIdx => GenPart\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_photonIdx => Photon\n", " warnings.warn(\n", "/Users/nmangane/scikit-hep-dev-4/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for FatJet_genJetAK8Idx => GenJetAK8\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'ZJets': {\n",
       "        'mass': Hist(\n",
       "  StrCategory(['ZJets'], growth=True, name='dataset', label='Primary dataset'),\n",
       "  Regular(30000, 0.25, 300, name='mass', label='$m_{\\\\mu\\\\mu}$ [GeV]'),\n",
       "  storage=Double()) # Sum: 6.0,\n",
       "        'pt': Hist(\n",
       "  StrCategory(['ZJets'], growth=True, name='dataset', label='Primary dataset'),\n",
       "  Regular(30000, 0.24, 300, name='pt', label='$p_{T}$ [GeV]'),\n",
       "  storage=Double()) # Sum: 18.0,\n",
       "        'cutflow': {'ZJets_pt': np.int64(18), 'ZJets_mass': np.int64(6)},\n",
       "        'worker': set()\n",
       "    },\n",
       "    'Data': {\n",
       "        'mass': Hist(\n",
       "  StrCategory(['Data'], growth=True, name='dataset', label='Primary dataset'),\n",
       "  Regular(30000, 0.25, 300, name='mass', label='$m_{\\\\mu\\\\mu}$ [GeV]'),\n",
       "  storage=Double()) # Sum: 66.0,\n",
       "        'pt': Hist(\n",
       "  StrCategory(['Data'], growth=True, name='dataset', label='Primary dataset'),\n",
       "  Regular(30000, 0.24, 300, name='pt', label='$p_{T}$ [GeV]'),\n",
       "  storage=Double()) # Sum: 84.0,\n",
       "        'cutflow': {'Data_pt': np.int64(84), 'Data_mass': np.int64(66)},\n",
       "        'worker': set()\n",
       "    }\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'ZJets'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'mass'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'ZJets'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.25\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'mass'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$m_\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\\\\mu\\\\mu\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m6.0\u001b[0m,\n", " \u001b[32m'pt'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'ZJets'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.24\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'pt'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$p_\u001b[0m\u001b[32m{\u001b[0m\u001b[32mT\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: 
\u001b[1;36m18.0\u001b[0m,\n", " \u001b[32m'cutflow'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'ZJets_pt'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m18\u001b[0m\u001b[1m)\u001b[0m, \u001b[32m'ZJets_mass'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m6\u001b[0m\u001b[1m)\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'worker'\u001b[0m: \u001b[1;35mset\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'Data'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'mass'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'Data'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.25\u001b[0m, \u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'mass'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$m_\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\\\\mu\\\\mu\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m66.0\u001b[0m,\n", " \u001b[32m'pt'\u001b[0m: \u001b[1;35mHist\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[1;35mStrCategory\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'Data'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mgrowth\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'dataset'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'Primary dataset'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1;35mRegular\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m30000\u001b[0m, \u001b[1;36m0.24\u001b[0m, 
\u001b[1;36m300\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'pt'\u001b[0m, \u001b[33mlabel\u001b[0m=\u001b[32m'$p_\u001b[0m\u001b[32m{\u001b[0m\u001b[32mT\u001b[0m\u001b[32m}\u001b[0m\u001b[32m$ \u001b[0m\u001b[32m[\u001b[0m\u001b[32mGeV\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[33mstorage\u001b[0m=\u001b[1;35mDouble\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # Sum: \u001b[1;36m84.0\u001b[0m,\n", " \u001b[32m'cutflow'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'Data_pt'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m84\u001b[0m\u001b[1m)\u001b[0m, \u001b[32m'Data_mass'\u001b[0m: \u001b[1;35mnp.int64\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m66\u001b[0m\u001b[1m)\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'worker'\u001b[0m: \u001b[1;35mset\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 6.1 Example processor for demonstration\n", "to_compute = apply_to_fileset(\n", " NanoEventsProcessor(),\n", " dataset_runnable,\n", " schemaclass=NanoAODSchema,\n", ")\n", "out = dask.compute(to_compute)[0]\n", "rich.print(out)\n" ] }, { "cell_type": "markdown", "id": "d31e7ee4", "metadata": {}, "source": [ "## 7. 
Dataset Manipulation Functions\n",
    "\n",
    "Coffea provides the following functions for manipulating FileSpec-based datasets:\n",
    "\n",
    "- **max_chunks**: Limit processing to the first N chunks per dataset\n",
    "- **max_chunks_per_file**: Limit processing to the first N chunks per file\n",
    "- **slice_chunks**: Select chunk ranges with a Python `slice`, per dataset or per file\n",
    "- **max_files**: Limit the number of files per dataset\n",
    "- **slice_files**: Select file ranges with a Python `slice`\n",
    "- **filter_files**: Remove files based on a filter criterion, such as dropping empty files"
 ] }, { "cell_type": "code", "execution_count": 18, "id": "e7919395", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Chunk-based Manipulations ===\n",
    "Original dataset: 3 files\n",
    "After max_chunks(5):\n",
    " Total chunks: 5\n",
    "After max_chunks_per_file(2):\n",
    " file_0.root: 2 chunks\n",
    " file_1.root: 2 chunks\n",
    " file_2.root: 2 chunks\n"
 ] } ], "source": [
    "# 7.1 Chunk-based manipulations\n",
    "print(\"=== Chunk-based Manipulations ===\")\n",
    "\n",
    "# Create a sample fileset for demonstration\n",
    "sample_fileset = DataGroupSpec({\n",
    "    \"large_dataset\": DatasetSpec(\n",
    "        files={\n",
    "            f\"file_{i}.root\": CoffeaROOTFileSpec(\n",
    "                object_path=\"Events\",\n",
    "                steps=[[j*1000, (j+1)*1000] for j in range(10)],  # 10 chunks per file\n",
    "                num_entries=10000,\n",
    "                uuid=f\"uuid-{i}\"\n",
    "            )\n",
    "            for i in range(3)  # 3 files\n",
    "        },\n",
    "        metadata={\"total_files\": 3}\n",
    "    )\n",
    "})\n",
    "\n",
    "print(f\"Original dataset: {len(sample_fileset['large_dataset'].files)} files\")\n",
    "\n",
    "# Limit to first 5 chunks total per dataset\n",
    "limited_chunks = max_chunks(sample_fileset, maxchunks=5)\n",
    "print(\"After max_chunks(5):\")\n",
    "total_chunks = sum(len(f.steps) for f in limited_chunks['large_dataset'].files.values())\n",
    "print(f\" Total chunks: {total_chunks}\")\n",
    "\n",
    "# Limit to first 2 chunks per file\n",
    "limited_per_file = max_chunks_per_file(sample_fileset, maxchunks=2)\n",
    "print(\"After max_chunks_per_file(2):\")\n",
    "for fname, 
fspec in limited_per_file['large_dataset'].files.items():\n",
    "    print(f\" {fname}: {len(fspec.steps)} chunks\")"
 ] }, { "cell_type": "code", "execution_count": 19, "id": "d860f60a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Advanced Chunk Slicing ===\n",
    "Middle chunks (5:15):\n",
    " Total chunks: 10\n",
    "Every other chunk (::2):\n",
    " Total chunks: 15\n",
    "First 3 chunks per file:\n",
    " file_0.root: 3 chunks\n",
    " file_1.root: 3 chunks\n",
    " file_2.root: 3 chunks\n"
 ] } ], "source": [
    "# 7.2 Advanced chunk slicing\n",
    "print(\"=== Advanced Chunk Slicing ===\")\n",
    "\n",
    "# Slice specific chunk ranges\n",
    "middle_chunks = slice_chunks(sample_fileset, slice(5, 15))\n",
    "print(\"Middle chunks (5:15):\")\n",
    "total_chunks = sum(len(f.steps) for f in middle_chunks['large_dataset'].files.values())\n",
    "print(f\" Total chunks: {total_chunks}\")\n",
    "\n",
    "# Slice every other chunk\n",
    "every_other = slice_chunks(sample_fileset, slice(None, None, 2))\n",
    "print(\"Every other chunk (::2):\")\n",
    "total_chunks = sum(len(f.steps) for f in every_other['large_dataset'].files.values())\n",
    "print(f\" Total chunks: {total_chunks}\")\n",
    "\n",
    "# Slice per file rather than per dataset\n",
    "per_file_slice = slice_chunks(sample_fileset, slice(3), bydataset=False)\n",
    "print(\"First 3 chunks per file:\")\n",
    "for fname, fspec in per_file_slice['large_dataset'].files.items():\n",
    "    print(f\" {fname}: {len(fspec.steps)} chunks\")"
 ] }, { "cell_type": "code", "execution_count": 20, "id": "01044461", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== File-based Manipulations ===\n",
    "After max_files(2): 2 files\n",
    "First two files: 2 files\n",
    "File names: ['file_0.root', 'file_1.root']\n",
    "Last file: ['file_2.root']\n"
 ] } ], "source": [
    "# 7.3 File-based manipulations\n",
    "print(\"=== File-based Manipulations ===\")\n",
    "\n",
    "# Limit number of files\n",
    "limited_files = max_files(sample_fileset, 
maxfiles=2)\n",
    "print(f\"After max_files(2): {len(limited_files['large_dataset'].files)} files\")\n",
    "\n",
    "# Slice specific files\n",
    "first_two_files = slice_files(sample_fileset, slice(2))\n",
    "print(f\"First two files: {len(first_two_files['large_dataset'].files)} files\")\n",
    "print(f\"File names: {list(first_two_files['large_dataset'].files.keys())}\")\n",
    "\n",
    "# Last file only\n",
    "last_file = slice_files(sample_fileset, slice(-1, None))\n",
    "print(f\"Last file: {list(last_file['large_dataset'].files.keys())}\")"
 ] }, { "cell_type": "code", "execution_count": 21, "id": "ee931c3e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== File Filtering ===\n",
    "Before filtering: 3 files\n",
    "After filtering: 2 files\n",
    "Remaining files: ['good_file_1.root', 'good_file_2.root']\n"
 ] } ], "source": [
    "# 7.4 Filtering files\n",
    "print(\"=== File Filtering ===\")\n",
    "\n",
    "# Create a sample with some empty files for filtering\n",
    "fileset_with_empty = DataGroupSpec({\n",
    "    \"mixed_dataset\": DatasetSpec(\n",
    "        files={\n",
    "            \"good_file_1.root\": CoffeaROOTFileSpec(\n",
    "                object_path=\"Events\",\n",
    "                steps=[[0, 1000]],\n",
    "                num_entries=1000,\n",
    "                uuid=\"good-1\"\n",
    "            ),\n",
    "            \"empty_file.root\": CoffeaROOTFileSpec(\n",
    "                object_path=\"Events\",\n",
    "                steps=[[0, 0]],  # Empty file\n",
    "                num_entries=0,\n",
    "                uuid=\"empty\"\n",
    "            ),\n",
    "            \"good_file_2.root\": CoffeaROOTFileSpec(\n",
    "                object_path=\"Events\",\n",
    "                steps=[[0, 2000]],\n",
    "                num_entries=2000,\n",
    "                uuid=\"good-2\"\n",
    "            )\n",
    "        }\n",
    "    )\n",
    "})\n",
    "\n",
    "print(f\"Before filtering: {len(fileset_with_empty['mixed_dataset'].files)} files\")\n",
    "\n",
    "# Filter out empty files\n",
    "filtered_fileset = filter_files(fileset_with_empty)\n",
    "print(f\"After filtering: {len(filtered_fileset['mixed_dataset'].files)} files\")\n",
    "print(f\"Remaining files: {list(filtered_fileset['mixed_dataset'].files.keys())}\")"
 ] }, { "cell_type": "markdown", "id": "a2f8d515", "metadata": {}, "source": [
    "## 8. Advanced Usage Examples\n",
    "\n",
    "This section demonstrates patterns for using FileSpec classes in realistic analyses: building a multi-sample fileset, selecting datasets by metadata, validating a fileset, and trimming it for faster processing."
 ] }, { "cell_type": "code", "execution_count": 22, "id": "bc8ca29a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Complex Analysis Fileset ===\n",
    "Full analysis fileset: 9 datasets\n",
    "Signal datasets: 3\n",
    "Background datasets: 2\n",
    "Data datasets: 4\n"
 ] } ], "source": [
    "# 8.1 Building complex analysis filesets\n",
    "print(\"=== Complex Analysis Fileset ===\")\n",
    "\n",
    "def build_analysis_fileset():\n",
    "    \"\"\"Build a comprehensive analysis fileset\"\"\"\n",
    "\n",
    "    # Signal samples\n",
    "    signal_samples = {}\n",
    "    for mass in [125, 200, 300]:\n",
    "        signal_samples[f\"higgs_m{mass}\"] = DatasetSpec(\n",
    "            files={\n",
    "                f\"higgs_m{mass}_part{i}.root\": CoffeaROOTFileSpec(\n",
    "                    object_path=\"Events\",\n",
    "                    steps=[[j*5000, (j+1)*5000] for j in range(20)],\n",
    "                    num_entries=100000,\n",
    "                    uuid=f\"higgs-{mass}-{i}\"\n",
    "                )\n",
    "                for i in range(3)\n",
    "            },\n",
    "            metadata={\n",
    "                \"process\": \"higgs\",\n",
    "                \"mass\": mass,\n",
    "                \"cross_section\": 48.58 if mass == 125 else 10.0,\n",
    "                \"is_signal\": True\n",
    "            }\n",
    "        )\n",
    "\n",
    "    # Background samples\n",
    "    background_samples = {\n",
    "        \"ttbar\": DatasetSpec(\n",
    "            files={\n",
    "                f\"ttbar_part{i}.root\": \"Events\" for i in range(10)\n",
    "            },\n",
    "            metadata={\"process\": \"ttbar\", \"cross_section\": 831.8, \"is_signal\": False}\n",
    "        ),\n",
    "        \"wjets\": DatasetSpec(\n",
    "            files={\n",
    "                f\"wjets_part{i}.root\": \"Events\" for i in range(15)\n",
    "            },\n",
    "            metadata={\"process\": \"wjets\", \"cross_section\": 61526.7, \"is_signal\": False}\n",
    "        )\n",
    "    }\n",
    "\n",
    "    # Data samples\n",
    "    data_samples = {\n",
    "        f\"data_{era}\": DatasetSpec(\n",
    "            files={\n",
    "                f\"data_{era}_part{i}.root\": \"Events\" for i in range(5)\n",
    "            },\n",
    "            metadata={\"is_data\": True, \"era\": era, \"luminosity\": 41.5}\n",
    "        )\n",
    "        for era in [\"2022A\", \"2022B\", \"2022C\", \"2022D\"]\n",
    "    }\n",
    "\n",
    "    # Combine all samples\n",
    "    all_samples = {}\n",
    "    all_samples.update(signal_samples)\n",
    "    all_samples.update(background_samples)\n",
    "    all_samples.update(data_samples)\n",
    "\n",
    "    return DataGroupSpec(all_samples)\n",
    "\n",
    "# Build the fileset\n",
    "full_analysis = build_analysis_fileset()\n",
    "print(f\"Full analysis fileset: {len(full_analysis)} datasets\")\n",
    "\n",
    "# Categorize datasets\n",
    "signal_datasets = [name for name, ds in full_analysis.items()\n",
    "                   if ds.metadata.get(\"is_signal\", False)]\n",
    "background_datasets = [name for name, ds in full_analysis.items()\n",
    "                       if not ds.metadata.get(\"is_data\", False) and not ds.metadata.get(\"is_signal\", False)]\n",
    "data_datasets = [name for name, ds in full_analysis.items()\n",
    "                 if ds.metadata.get(\"is_data\", False)]\n",
    "\n",
    "print(f\"Signal datasets: {len(signal_datasets)}\")\n",
    "print(f\"Background datasets: {len(background_datasets)}\")\n",
    "print(f\"Data datasets: {len(data_datasets)}\")"
 ] }, { "cell_type": "code", "execution_count": 23, "id": "a6237760", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Conditional Processing ===\n",
    "Signal-only fileset: 3 datasets\n",
    "2022 data fileset: 4 datasets\n",
    "Test subset: 3 datasets with limited files\n"
 ] } ], "source": [
    "# 8.2 Conditional processing and dataset selection\n",
    "print(\"=== Conditional Processing ===\")\n",
    "\n",
    "def select_datasets_by_criteria(fileset: DataGroupSpec, **criteria) -> DataGroupSpec:\n",
    "    \"\"\"Select datasets whose metadata matches all given key/value criteria\"\"\"\n",
    "    selected = {}\n",
    "\n",
    "    for name, dataset in fileset.items():\n",
    "        match = True\n",
    "        for key, value in criteria.items():\n",
    "            if dataset.metadata.get(key) != value:\n",
    "                match = False\n",
    "                break\n",
    "\n",
    "        if match:\n",
    "            selected[name] = dataset\n",
    "\n",
    "    return DataGroupSpec(selected)\n",
    "\n",
    "# Select 
only signal datasets\n",
    "signal_only = select_datasets_by_criteria(full_analysis, is_signal=True)\n",
    "print(f\"Signal-only fileset: {len(signal_only)} datasets\")\n",
    "\n",
    "# Select 2022 data only\n",
    "data_2022 = select_datasets_by_criteria(full_analysis, is_data=True)\n",
    "data_2022_filtered = DataGroupSpec({\n",
    "    name: ds for name, ds in data_2022.items()\n",
    "    if \"2022\" in name\n",
    "})\n",
    "print(f\"2022 data fileset: {len(data_2022_filtered)} datasets\")\n",
    "\n",
    "# Create a test subset with limited files\n",
    "test_subset = DataGroupSpec({\n",
    "    name: max_files(DataGroupSpec({name: ds}), maxfiles=2)[name]\n",
    "    for name, ds in full_analysis.items()\n",
    "    if name in signal_datasets[:2] + background_datasets[:1]\n",
    "})\n",
    "print(f\"Test subset: {len(test_subset)} datasets with limited files\")"
 ] }, { "cell_type": "code", "execution_count": 24, "id": "60f065ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Error Handling and Validation ===\n",
    "Fileset validation results:\n",
    " total_datasets: 9\n",
    " total_files: 54\n",
    " format_distribution: {'root': 9}\n"
 ] } ], "source": [
    "# 8.3 Error handling and validation\n",
    "print(\"=== Error Handling and Validation ===\")\n",
    "\n",
    "def validate_fileset(fileset: DataGroupSpec) -> dict:\n",
    "    \"\"\"Validate a fileset and return diagnostic information\"\"\"\n",
    "    diagnostics = {\n",
    "        \"total_datasets\": len(fileset),\n",
    "        \"total_files\": 0,\n",
    "        \"empty_datasets\": [],\n",
    "        \"format_distribution\": {},\n",
    "        \"metadata_issues\": []\n",
    "    }\n",
    "\n",
    "    for name, dataset in fileset.items():\n",
    "        # Count files\n",
    "        num_files = len(dataset.files)\n",
    "        diagnostics[\"total_files\"] += num_files\n",
    "\n",
    "        # Check for empty datasets\n",
    "        if num_files == 0:\n",
    "            diagnostics[\"empty_datasets\"].append(name)\n",
    "\n",
    "        # Track format distribution\n",
    "        fmt = dataset.format\n",
    "        diagnostics[\"format_distribution\"][fmt] = diagnostics[\"format_distribution\"].get(fmt, 0) + 1\n",
    "\n",
    "        # Check metadata\n",
    "        if not dataset.metadata:\n",
    "            diagnostics[\"metadata_issues\"].append(f\"{name}: No metadata\")\n",
    "        elif \"process\" not in dataset.metadata and not dataset.metadata.get(\"is_data\", False):\n",
    "            diagnostics[\"metadata_issues\"].append(f\"{name}: Missing process information\")\n",
    "\n",
    "    return diagnostics\n",
    "\n",
    "# Validate our analysis fileset\n",
    "validation_results = validate_fileset(full_analysis)\n",
    "print(\"Fileset validation results:\")\n",
    "for key, value in validation_results.items():\n",
    "    if isinstance(value, list) and len(value) == 0:\n",
    "        continue\n",
    "    print(f\" {key}: {value}\")"
 ] }, { "cell_type": "code", "execution_count": 25, "id": "f22002f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Performance Optimization ===\n",
    " higgs_m125: chunk_slicing, 1 files\n",
    " higgs_m200: chunk_slicing, 1 files\n",
    "Optimized subset: 2 datasets\n"
 ] }, { "data": { "text/html": [ "
DataGroupSpec(\n",
       "    root={\n",
       "        'higgs_m125': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'higgs_m125_part0.root': CoffeaROOTFileSpec(\n",
       "                        object_path='Events',\n",
       "                        steps=[[0, 5000], [5000, 10000]],\n",
       "                        num_entries=100000,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid='higgs-125-0',\n",
       "                        num_selected_entries=10000\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={'process': 'higgs', 'mass': 125, 'cross_section': 48.58, 'is_signal': True},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        ),\n",
       "        'higgs_m200': DatasetSpec(\n",
       "            files=InputFiles(\n",
       "                root={\n",
       "                    'higgs_m200_part0.root': CoffeaROOTFileSpec(\n",
       "                        object_path='Events',\n",
       "                        steps=[[0, 5000], [5000, 10000]],\n",
       "                        num_entries=100000,\n",
       "                        format='root',\n",
       "                        lfn=None,\n",
       "                        pfn=None,\n",
       "                        uuid='higgs-200-0',\n",
       "                        num_selected_entries=10000\n",
       "                    )\n",
       "                }\n",
       "            ),\n",
       "            metadata={'process': 'higgs', 'mass': 200, 'cross_section': 10.0, 'is_signal': True},\n",
       "            format='root',\n",
       "            compressed_form=None,\n",
       "            did=None\n",
       "        )\n",
       "    }\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mDataGroupSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'higgs_m125'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'higgs_m125_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'higgs-125-0'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'higgs'\u001b[0m, \u001b[32m'mass'\u001b[0m: \u001b[1;36m125\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m48.58\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;92mTrue\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'higgs_m200'\u001b[0m: \u001b[1;35mDatasetSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mfiles\u001b[0m=\u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " 
\u001b[32m'higgs_m200_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m5000\u001b[0m\u001b[1m]\u001b[0m, \u001b[1m[\u001b[0m\u001b[1;36m5000\u001b[0m, \u001b[1;36m10000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m100000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'higgs-200-0'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m10000\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[33mmetadata\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'higgs'\u001b[0m, \u001b[32m'mass'\u001b[0m: \u001b[1;36m200\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m10.0\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;92mTrue\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mcompressed_form\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mdid\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
    "# 8.4 Performance optimization strategies\n",
    "print(\"=== Performance Optimization ===\")\n",
    "\n",
    "def optimize_fileset_for_processing(fileset: DataGroupSpec, target_chunk_size: int = 100000) -> DataGroupSpec:\n",
    "    \"\"\"Optimize fileset for processing performance\"\"\"\n",
    "\n",
    "    optimized = {}\n",
    "\n",
    "    for name, dataset in fileset.items():\n",
    "        # Calculate total events and files\n",
    "        total_events = sum(f.num_entries for f in dataset.files.values()\n",
    "                           if hasattr(f, 'num_entries') and f.num_entries)\n",
    "        num_files = len(dataset.files)\n",
    "\n",
    "        if total_events == 0:\n",
    "            # Skip datasets with no known entries (empty or not yet preprocessed)\n",
    "            continue\n",
    "\n",
    "        # Determine optimal chunking strategy\n",
    "        if total_events < target_chunk_size:\n",
    "            # Small dataset - keep it unchanged\n",
    "            chunk_strategy = \"single_chunk_per_file\"\n",
    "            optimized_dataset = dataset\n",
    "        elif num_files < 5:\n",
    "            # Few large files - use chunk slicing\n",
    "            chunk_strategy = \"chunk_slicing\"\n",
    "            max_chunks_total = max(1, total_events // target_chunk_size)\n",
    "            optimized_dataset = max_chunks(DataGroupSpec({name: dataset}),\n",
    "                                           maxchunks=max_chunks_total)[name]\n",
    "        else:\n",
    "            # Many files - limit files and chunks per file\n",
    "            chunk_strategy = \"file_and_chunk_limiting\"\n",
    "            max_files_count = min(num_files, 20)  # Limit to 20 files\n",
    "            temp_fileset = max_files(DataGroupSpec({name: dataset}), maxfiles=max_files_count)\n",
    "            optimized_dataset = max_chunks_per_file(temp_fileset, maxchunks=5)[name]\n",
    "\n",
    "        optimized[name] = optimized_dataset\n",
    "        print(f\" {name}: {chunk_strategy}, {len(optimized_dataset.files)} files\")\n",
    "\n",
    "    return DataGroupSpec(optimized)\n",
    "\n",
    "# Optimize our test subset\n",
    "optimized_subset = optimize_fileset_for_processing(test_subset)\n",
    "print(f\"Optimized subset: {len(optimized_subset)} datasets\")\n",
    "rich.print(optimized_subset)"
 ] }, { "cell_type": "markdown", "id": "79847d77", "metadata": {}, "source": [ "## 9. 
Migration from Legacy Formats\n",
    "\n",
    "This section shows how to migrate from legacy dictionary-based filesets to the explicit FileSpec classes, and introduces the `ModelFactory` conversion utility.\n",
    "\n",
    "In most cases, a well-formed legacy fileset (a purely nested dictionary) can be converted simply by passing it to the `DataGroupSpec` constructor.\n",
    "\n",
    "Note that `DataGroupSpec`, `InputFiles`, and `PreprocessedFiles` behave like dictionaries and expect a dictionary as input. The other FileSpec classes expect keyword arguments, so a dictionary passed to their constructors must be unpacked with the `**some_dict` syntax."
 ] }, { "cell_type": "code", "execution_count": 26, "id": "b5c82c85", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "=== Legacy Format Migration ===\n",
    "Legacy fileset structure:\n",
    " ttbar: 2 files\n",
    " data: 1 files\n"
 ] } ], "source": [
    "# 9.1 Legacy format examples\n",
    "print(\"=== Legacy Format Migration ===\")\n",
    "\n",
    "# Legacy dictionary format (old style)\n",
    "legacy_fileset = {\n",
    "    \"ttbar\": {\n",
    "        \"files\": {\n",
    "            \"ttbar_1.root\": \"Events\",\n",
    "            \"ttbar_2.root\": \"Events\"\n",
    "        }\n",
    "    },\n",
    "    \"data\": {\n",
    "        \"files\": {\n",
    "            \"data_1.root\": {\n",
    "                \"object_path\": \"Events\",\n",
    "                \"steps\": [[0, 1000], [1000, 2000]],\n",
    "                \"num_entries\": 2000,\n",
    "                \"uuid\": \"legacy-uuid\"\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "print(\"Legacy fileset structure:\")\n",
    "for name, content in legacy_fileset.items():\n",
    "    print(f\" {name}: {len(content['files'])} files\")"
 ] }, { "cell_type": "markdown", "id": "5c27340c", "metadata": {}, "source": [
    "## ModelFactory\n",
    "\n",
    "The `ModelFactory` class provides a few utility methods for manipulating the pydantic FileSpec classes. 
They serve largely as examples: a few format utilities (called internally during validation and instantiation of the classes) plus conversion functions with simple logic for manipulating the FileSpec classes.\n",
    "\n",
    "- **dict_to_ROOTFileSpec**: Tries to convert a dictionary to a concrete `CoffeaROOTFileSpec`; if that fails, falls back to the Optional type\n",
    "- **dict_to_parquetfilespec**: Tries to convert a dictionary to a concrete `CoffeaParquetFileSpec`; if that fails, falls back to the Optional type\n",
    "- **filespec_to_dict**: The inverse conversion, from a FileSpec to a dictionary. Thanks to pydantic, it merely calls `.model_dump()` on the instance\n",
    "- **dict_to_datasetspec**: Tries to convert a dictionary to a `DatasetSpec` using the constructor\n",
    "- **datasetspec_to_dict**: If `coerce_filespec_to_dict` is True (the default), calls `.model_dump()` to convert the entire hierarchy to dictionaries. If False, only the outermost `DatasetSpec` is converted, leaving a dictionary of pydantic models and elementary Python types, which is the result of calling `dict(datasetspec)` instead of `.model_dump()`\n",
    "- **valid_format**: Ensures the format(s) are in the list supported for coffea processing\n",
    "- **attempt_promotion**: Accepts any FileSpec, DatasetSpec, or DataGroupSpec and tries to promote any (nested) types within it to concrete classes. 
This can be emulated by calling the pydantic class constructor on the output of the original model's `.model_dump()` method, passing the dictionary directly for dictionary-like models or unpacking it with `**` for the others" ] }, { "cell_type": "code", "execution_count": 27, "id": "f6f4842c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Converting legacy formats via ModelFactory ===\n", "The methods dict_to_ROOTFileSpec and dict_to_parquetfilespec are deprecated, as their functionality is covered by the pydantic models directly.\n", "For converting pydantic classes to dictionaries, the function datasetspec_to_dict demonstrates the two methods: model_dump() and dict(AModel).\n", "With the former, the entire model hierarchy is converted to dictionaries, while with the latter only the top-level model is converted, leaving nested models intact.\n", "DatasetSpec to pure dictionary (with coerce_filespec_to_dict=True):\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'files': {\n",
       "        'ttbar_part0.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part1.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part2.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part3.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part4.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part5.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part6.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part7.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part8.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part9.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        }\n",
       "    },\n",
       "    'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n",
       "    'format': 'root',\n",
       "    'compressed_form': None,\n",
       "    'did': None\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " 
\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: 
\u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: 
\u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Accomplished via model_dump():\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'files': {\n",
       "        'ttbar_part0.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part1.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part2.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part3.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part4.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part5.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part6.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part7.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part8.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        },\n",
       "        'ttbar_part9.root': {\n",
       "            'object_path': 'Events',\n",
       "            'steps': None,\n",
       "            'num_entries': None,\n",
       "            'format': 'root',\n",
       "            'lfn': None,\n",
       "            'pfn': None,\n",
       "            'uuid': None,\n",
       "            'num_selected_entries': None\n",
       "        }\n",
       "    },\n",
       "    'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n",
       "    'format': 'root',\n",
       "    'compressed_form': None,\n",
       "    'did': None\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " 
\u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: 
\u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1m{\u001b[0m\n", " \u001b[32m'object_path'\u001b[0m: \u001b[32m'Events'\u001b[0m,\n", " \u001b[32m'steps'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'lfn'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'pfn'\u001b[0m: 
\u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'uuid'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'num_selected_entries'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DatasetSpec to top-level dictionary (with coerce_filespec_to_dict=False):\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'files': InputFiles(\n",
       "        root={\n",
       "            'ttbar_part0.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part1.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part2.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part3.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part4.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part5.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part6.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part7.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part8.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part9.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            )\n",
       "        }\n",
       "    ),\n",
       "    'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n",
       "    'format': 'root',\n",
       "    'compressed_form': None,\n",
       "    'did': None\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " 
\u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Now the partial conversion via dict():\n" ] }, { "data": { "text/html": [ "
{\n",
       "    'files': InputFiles(\n",
       "        root={\n",
       "            'ttbar_part0.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part1.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part2.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part3.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part4.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part5.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part6.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part7.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part8.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            ),\n",
       "            'ttbar_part9.root': CoffeaROOTFileSpecOptional(\n",
       "                object_path='Events',\n",
       "                steps=None,\n",
       "                num_entries=None,\n",
       "                format='root',\n",
       "                lfn=None,\n",
       "                pfn=None,\n",
       "                uuid=None,\n",
       "                num_selected_entries=None\n",
       "            )\n",
       "        }\n",
       "    ),\n",
       "    'metadata': {'process': 'ttbar', 'cross_section': 831.8, 'is_signal': False},\n",
       "    'format': 'root',\n",
       "    'compressed_form': None,\n",
       "    'did': None\n",
       "}\n",
       "
\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'files'\u001b[0m: \u001b[1;35mInputFiles\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mroot\u001b[0m=\u001b[1m{\u001b[0m\n", " \u001b[32m'ttbar_part0.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part1.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part2.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " 
\u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part3.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part4.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part5.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " 
\u001b[32m'ttbar_part6.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part7.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part8.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'ttbar_part9.root'\u001b[0m: \u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " 
\u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[3;35mNone\u001b[0m\n", " \u001b[1m)\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m)\u001b[0m,\n", " \u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'process'\u001b[0m: \u001b[32m'ttbar'\u001b[0m, \u001b[32m'cross_section'\u001b[0m: \u001b[1;36m831.8\u001b[0m, \u001b[32m'is_signal'\u001b[0m: \u001b[3;91mFalse\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'format'\u001b[0m: \u001b[32m'root'\u001b[0m,\n", " \u001b[32m'compressed_form'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'did'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 10.1 Converting formats\n", "print(\"=== Converting legacy formats via ModelFactory ===\")\n", "\n", "print(\"The methods dict_to_ROOTFileSpec and dict_to_parquetfilespec are deprecated, as their functionality is covered by the pydantic models directly.\")\n", "\n", "print(\"For converting pydantic classes to dictionaries, the function datasetspec_to_dict demonstrates the two methods: model_dump() and dict(AModel).\")\n", "print(\"With the former, the entire model hierarchy is converted to dictionaries, while with the latter only the top-level model is converted, leaving nested models intact.\")\n", "\n", "pure_dictionary = ModelFactory.datasetspec_to_dict(full_analysis['ttbar'], coerce_filespec_to_dict=True)\n", "print(\"DatasetSpec to pure dictionary (with coerce_filespec_to_dict=True):\")\n", "rich.print(pure_dictionary)\n", "\n", "print(\"Accomplished via 
model_dump():\")\n", "rich.print(full_analysis['ttbar'].model_dump())\n", "\n", "mixed_dictionary = ModelFactory.datasetspec_to_dict(full_analysis['ttbar'], coerce_filespec_to_dict=False)\n", "print(\"DatasetSpec to top-level dictionary (with coerce_filespec_to_dict=False):\")\n", "rich.print(mixed_dictionary)\n", "\n", "print(\"Now the partial conversion via dict():\")\n", "rich.print(dict(full_analysis['ttbar']))\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "aeb145d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Promoting Specs to Concrete Types ===\n", "ModelFactory.attempt_promotion can be used to update Spcs after parameters have been set, fulfilling the requirements of the non-Optional variants.\n", "Starting with CoffeaROOTFileSpecOptional:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000]],\n",
       "    num_entries=None,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid=None,\n",
       "    num_selected_entries=1000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "After setting num_entries and uuid:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpecOptional(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000]],\n",
       "    num_entries=1000,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='promote-me',\n",
       "    num_selected_entries=1000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpecOptional\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'promote-me'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "After promotion to CoffeaROOTFileSpec:\n" ] }, { "data": { "text/html": [ "
CoffeaROOTFileSpec(\n",
       "    object_path='Events',\n",
       "    steps=[[0, 1000]],\n",
       "    num_entries=1000,\n",
       "    format='root',\n",
       "    lfn=None,\n",
       "    pfn=None,\n",
       "    uuid='promote-me',\n",
       "    num_selected_entries=1000\n",
       ")\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mCoffeaROOTFileSpec\u001b[0m\u001b[1m(\u001b[0m\n", " \u001b[33mobject_path\u001b[0m=\u001b[32m'Events'\u001b[0m,\n", " \u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1000\u001b[0m\u001b[1m]\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[33mnum_entries\u001b[0m=\u001b[1;36m1000\u001b[0m,\n", " \u001b[33mformat\u001b[0m=\u001b[32m'root'\u001b[0m,\n", " \u001b[33mlfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33mpfn\u001b[0m=\u001b[3;35mNone\u001b[0m,\n", " \u001b[33muuid\u001b[0m=\u001b[32m'promote-me'\u001b[0m,\n", " \u001b[33mnum_selected_entries\u001b[0m=\u001b[1;36m1000\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 10.3 Promoting Specs to concrete types\n", "print(\"=== Promoting Specs to Concrete Types ===\")\n", "print(\"ModelFactory.attempt_promotion can be used to update Specs after parameters have been set, fulfilling the requirements of the non-Optional variants.\")\n", "starting_spec = CoffeaROOTFileSpecOptional(object_path=\"Events\", steps=[[0, 1000]])\n", "print(\"Starting with CoffeaROOTFileSpecOptional:\")\n", "rich.print(starting_spec)\n", "starting_spec.num_entries = 1000\n", "starting_spec.uuid = \"promote-me\"\n", "print(\"After setting num_entries and uuid:\")\n", "rich.print(starting_spec)\n", "promoted_spec = ModelFactory.attempt_promotion(starting_spec)\n", "print(\"After promotion to CoffeaROOTFileSpec:\")\n", "rich.print(promoted_spec)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.0" } }, "nbformat": 4, "nbformat_minor": 5 }