Coffea concepts

This page explains concepts and terminology used within the coffea package. It is intended to provide a high-level overview, while details can be found in other sections of the documentation.

Columnar analysis

Columnar analysis is a paradigm for writing analysis applications that is best described in contrast to the traditional paradigm in high-energy particle physics (HEP) of using an event loop. In an event loop, the analysis operates row-wise on the input data (in HEP, one row usually corresponds to one reconstructed particle collision event). Each row is a structure containing several fields, such as the properties of the visible outgoing particles that were reconstructed in a collision event. The analysis code manipulates this structure to output derived quantities or summary statistics in the form of histograms. In contrast, columnar analysis operates on individual columns of data spanning a chunk (partition, batch) of rows, using array programming primitives in turn to compute derived quantities and summary statistics. Array programming is widely used within the scientific python ecosystem, supported by the numpy library. However, although the existing scientific python stack is fully capable of analyzing rectangular arrays (i.e. arrays with no variable-length dimensions), HEP data is highly irregular, and manipulating it can become awkward without first generalizing array structure a bit. The awkward package does exactly this, extending array programming capabilities to the complexity of HEP data.

[Figure: illustration of event-loop versus columnar data access (columnar.png)]
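As a rough illustration, the snippet below applies columnar operations to jagged per-event data using awkward; the field names and values are invented for the example:

    import awkward as ak

    # Jagged (variable-length) per-event data, as is typical in HEP:
    # each inner list holds the transverse momenta of the muons in one event.
    muon_pt = ak.Array([
        [43.1, 21.7],        # event with two muons
        [],                  # event with no muons
        [78.5, 30.2, 11.9],  # event with three muons
    ])

    # Whole-column operations replace the explicit event loop:
    selected = muon_pt[muon_pt > 20.0]  # per-muon selection, keeping event structure
    leading = ak.max(muon_pt, axis=1)   # leading-muon pT per event (None if empty)
    n_muons = ak.num(muon_pt)           # muon multiplicity per event

Each operation acts on the whole column at once, so no per-event python loop is ever written.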

Coffea processor

In almost all HEP analyses, each row corresponds to an independent event, and it is exceptionally rare to need to compute inter-row derived quantities. Due to this, horizontal scale-out is almost trivial: each chunk of rows can be operated on independently. Further, if the output of an analysis is restricted to reducible accumulators such as histograms (abstracted by dask, dask-awkward, and dask-histogram), then outputs can even be merged via tree reduction. The ProcessorABC class is an abstraction that encapsulates analysis code so that it can be easily scaled out, leaving the delivery of input columns and the reduction of output accumulators to the coffea framework. It is not an absolute requirement, however, merely a useful organizational framework.
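For illustration, here is a minimal sketch of such a processor that fills a single histogram accumulator, assuming NanoAOD-style events with a Muon.pt field; the histogram name and binning are arbitrary choices for the example:

    import dask_awkward as dak
    import hist.dask as hda
    from coffea import processor

    class MyProcessor(processor.ProcessorABC):
        """Fill a muon pT histogram for each chunk of events."""

        def process(self, events):
            # Build a lazily-filled, reducible accumulator (a dask-histogram)
            h = hda.Hist.new.Reg(60, 0, 300, name="pt", label="Muon pT [GeV]").Double()
            h.fill(pt=dak.flatten(events.Muon.pt))
            return {"muon_pt": h}

        def postprocess(self, accumulator):
            # Runs once after all per-chunk outputs have been merged
            return accumulator

The framework calls process on each chunk of input columns and merges the returned accumulators, so the analysis code itself never needs to know how the work was distributed.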

Scale-out

Often, the computation requirements of a HEP data analysis exceed the resources of a single thread of execution. To facilitate parallelization and allow the user to access more compute resources, coffea employs dask, dask-distributed, and taskvine to ease the transition from a local analysis on a small set of test data to a full-scale analysis. Dask provides local schedulers and distributed schedulers, and taskvine supplies an additional distributed scheduler.

Local executors

Currently, four local schedulers exist: sync, threads, processes, and a local distributed client. The sync scheduler simply processes each chunk of an input dataset in turn, using the current python thread. The threads scheduler uses python multithreading to spawn multiple python threads that process chunks in parallel on the machine. The processes scheduler and the local distributed client use multiprocessing to spawn multiple python processes that process chunks in parallel on the machine. Process-based parallelism tends to be more efficient when using uproot, since it avoids performance limitations due to the CPython global interpreter lock.
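As a rough sketch, the scheduler can be selected at compute time; here a toy dask array stands in for a real coffea task graph:

    import dask.array as da
    from distributed import Client

    # Toy task graph standing in for the lazy output of a coffea analysis
    result = da.ones((1000, 1000)).sum()

    out = result.compute(scheduler="sync")       # current thread; easiest to debug
    out = result.compute(scheduler="threads")    # thread pool within one process
    out = result.compute(scheduler="processes")  # pool of worker processes

    # Local distributed client: a dask.distributed cluster on this machine
    client = Client(n_workers=4, threads_per_worker=1)
    out = result.compute()  # the client registers itself as the default scheduler
    client.close()

The same analysis code runs under any of the four options, which is what makes scaling up from a quick test to a parallel run straightforward.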

Distributed executors

Currently, coffea supports two types of distributed schedulers:

  • the dask distributed scheduler, accessed via the distributed.Client entry point,

  • and the taskvine distributed scheduler.

These schedulers use their respective underlying libraries to distribute processing tasks over multiple machines.
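For example, connecting a distributed.Client to an existing dask cluster makes it the default scheduler for subsequent computations; the scheduler address below is a placeholder:

    from distributed import Client

    # Address of an existing dask scheduler (placeholder value)
    client = Client("tcp://scheduler.example.org:8786")

    # Any subsequent .compute() call now runs on the remote cluster

Apart from this connection step, the analysis code is identical to the local case.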