---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
---

# Coffea concepts

This page explains concepts and terminology used within the coffea package. It is intended to provide a high-level overview, while details can be found in other sections of the documentation.

(columnar-analysis)=
## Columnar analysis

Columnar analysis is a paradigm for writing an analysis application. It is best described in contrast to the traditional paradigm in high-energy particle physics (HEP), the event loop. In an event loop, the analysis operates row-wise on the input data (in HEP, one row usually corresponds to one reconstructed particle collision event). Each row is a structure containing several fields, such as the properties of the visible outgoing particles that were reconstructed in a collision event. The analysis code manipulates this structure to output either derived quantities or summary statistics in the form of histograms. In contrast, columnar analysis operates on individual columns of data spanning a *chunk* (partition, batch) of rows, applying [array programming](https://en.wikipedia.org/wiki/Array_programming) primitives in turn to compute derived quantities and summary statistics. Array programming is widely used within the [scientific Python ecosystem](https://www.scipy.org/about.html), supported by the [NumPy](https://numpy.org/) library. However, although the existing scientific Python stack is fully capable of analyzing rectangular arrays (i.e. arrays with no variable-length dimensions), HEP data is very irregular, and manipulating it can become awkward without first generalizing the array structure a bit. The [awkward](https://awkward-array.org) package does this, extending array programming capabilities to the complexity of HEP data.
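
The contrast between the two paradigms can be illustrated with a small sketch. It uses only NumPy and invented toy values, and it stores the variable-length muon collection as one flat array plus a per-event count, which is a simplified, hand-rolled stand-in for the jagged arrays that awkward provides. The names `count_loop`, `count_columnar`, `muon_pt`, and `n_muons` are hypothetical and are not part of coffea.

```python
import numpy as np

# Toy data (values are made up): three events with 2, 0, and 3 muons.
events = [
    {"muon_pt": [50.0, 30.0]},
    {"muon_pt": []},
    {"muon_pt": [12.0, 45.0, 8.0]},
]


def count_loop(events, threshold=20.0):
    """Event loop: visit each row and count muons above the threshold."""
    counts = []
    for event in events:
        n_pass = 0
        for pt in event["muon_pt"]:
            if pt > threshold:
                n_pass += 1
        counts.append(n_pass)
    return counts


# Columnar layout of the same data: one flat column of muon pT values
# plus the number of muons in each event.
muon_pt = np.array([50.0, 30.0, 12.0, 45.0, 8.0])
n_muons = np.array([2, 0, 3])


def count_columnar(muon_pt, n_muons, threshold=20.0):
    """Columnar: apply array primitives to whole columns at once."""
    # Label every muon with the index of the event it belongs to.
    event_index = np.repeat(np.arange(len(n_muons)), n_muons)
    # Evaluate the selection on the entire column in one operation.
    passes = muon_pt > threshold
    # Count the passing muons per event, keeping events with zero muons.
    return np.bincount(event_index[passes], minlength=len(n_muons))


print(count_loop(events))
print(count_columnar(muon_pt, n_muons))
```

Both functions compute the same per-event counts. With awkward, the bookkeeping done here by `np.repeat` and `np.bincount` is handled by the jagged array itself, so the same count can usually be written as a single reduction such as `ak.sum(muon_pt > 20.0, axis=1)`.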

:::{figure} ../images/columnar.png
:width: 70 %
:align: center
:::

(processor)=
## Coffea processor

In almost all HEP analyses, each row corresponds to an independent event, and it is exceptionally rare to need to compute inter-row derived quantities. This makes horizontal scale-out straightforward: each chunk of rows can be processed independently. Coffea wraps this pattern with the {class}`coffea.processor.ProcessorABC`, which defines a `process` method returning an accumulator. The {class}`coffea.processor.Runner` helper bundles the dataset chunking, NanoEvents creation, and reduction of per-chunk results so that you can focus on analysis code.

A processor instance can be executed with the same interface regardless of the executor in use:

```python
from coffea import processor
from coffea.nanoevents import NanoAODSchema

# Assume ``my_processor`` is an instance of a subclass of ProcessorABC.
fileset = {
    "ZJets": {"treename": "Events", "files": ["/data/nano_dy.root"]},
    "Data": {"treename": "Events", "files": ["/data/nano_dimuon.root"]},
}

runner = processor.Runner(
    executor=processor.FuturesExecutor(workers=4, status=True),
    schema=NanoAODSchema,
)

result = runner(
    fileset,
    processor_instance=my_processor,
    treename="Events",
)
```

Changing the executor is all that is required to scale from a laptop to a cluster. See {doc}`../user_guide/executors` for a practical overview.

(scale-out)=
## Scale-out

Often, the computation requirements of a HEP data analysis exceed the resources of a single thread of execution. To facilitate parallelization and allow the user to access more compute resources, coffea ships several executors that all implement the same interface. The local options cover quick iteration and debugging, while the distributed options connect to clusters and grid-style resources. Switching between them does not require changes to the processor itself.
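
The idea that the per-chunk work stays fixed while the execution strategy changes can be sketched with the standard library alone. This is a toy illustration of the pattern, not coffea's implementation: `process`, `merge`, `run_iterative`, and `run_threaded` are hypothetical names, the chunks are plain lists of invented numbers, and a thread pool stands in for whatever backend a real executor would use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def process(chunk):
    """Toy stand-in for a processor's ``process`` method: summarize one chunk."""
    return {
        "n_events": len(chunk),
        "n_selected": sum(1 for value in chunk if value > 20.0),
    }


def merge(left, right):
    """Combine two per-chunk results by adding matching entries."""
    return {key: left[key] + right[key] for key in left}


def run_iterative(chunks):
    """Process chunks one after another in the calling thread."""
    return reduce(merge, (process(chunk) for chunk in chunks))


def run_threaded(chunks, workers=2):
    """Process chunks concurrently; ``process`` itself is unchanged."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce(merge, pool.map(process, chunks))


# Three chunks of made-up values.
chunks = [[50.0, 12.0, 33.0], [8.0], [45.0, 21.0]]

print(run_iterative(chunks))
print(run_threaded(chunks))
```

Because each chunk is processed independently and the per-chunk results are merged by addition, both runners return the same totals. That independence is what lets the execution backend be swapped without touching the analysis code.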

(#local-executors)=
### Local executors

Coffea provides two executors for running on a single machine:

- `IterativeExecutor`: processes chunks sequentially in one Python thread. This is ideal for debugging and validation because it has the fewest moving parts.
- `FuturesExecutor`: uses `concurrent.futures` to fan out work to multiple local workers. By default it creates a process pool, and you can pass `pool` or `workers` to fine-tune the level of parallelism.

You can swap between these executors by adjusting the `executor` argument passed to {class}`~coffea.processor.Runner`.

(#distributed-executors)=
### Distributed executors

Coffea supports three distributed schedulers out of the box:

- {class}`~coffea.processor.DaskExecutor` integrates with a running [Dask Distributed](https://distributed.dask.org/en/latest/) cluster via a `distributed.Client`.
- {class}`~coffea.processor.ParslExecutor` uses [Parsl](http://parsl-project.org/) to target a wide range of HPC and batch backends.
- {class}`~coffea.processor.TaskVineExecutor` leverages [TaskVine](https://cctools.readthedocs.io/en/latest/taskvine/) for opportunistic and heterogeneous workers.

Each executor shares the same `Runner` interface, making it easy to start locally and later connect to a remote resource manager.