{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading data with coffea NanoEvents\n", "\n", "This is a rendered copy of [nanoevents.ipynb](https://github.com/CoffeaTeam/coffea/blob/master/binder/nanoevents.ipynb). You can run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Fnanoevents.ipynb).\n", "\n", "NanoEvents is a coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross-references, and nested objects, all lazily accessed$^\\dagger$ from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.html#classes), which are community-supplied for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.lazy](https://uproot4.readthedocs.io/en/latest/uproot4.behaviors.TBranch.lazy.html) methods. Currently available schemas include:\n", "\n", " - `BaseSchema`, which provides a simple representation of the input TTree, where each branch is available verbatim as `events.branch_name`, effectively the same behavior as `uproot.lazy`. Any branches that uproot supports at \"full speed\" (i.e. 
that are fully split and either flat or single-jagged) can be read by this schema;\n", " - `NanoAODSchema`, which is optimized to provide all methods and cross-references in CMS NanoAOD format;\n", " - `PFNanoAODSchema`, which builds a double-jagged particle flow candidate collection `events.jet.constituents` from compatible PFNanoAOD input files;\n", " - `TreeMakerSchema`, which is designed to read TTrees made by [TreeMaker](https://github.com/TreeMaker/TreeMaker), an alternative CMS nTuplization format;\n", " - `PHYSLITESchema`, for the ATLAS DAOD_PHYSLITE derivation, a compact centrally-produced data format similar to CMS NanoAOD; and\n", " - `DelphesSchema`, for reading Delphes fast simulation [nTuples](https://cp3.irmp.ucl.ac.be/projects/delphes/wiki/WorkBook/RootTreeDescription).\n", "\n", "We welcome contributions of new schemas and can assist with their design.\n", "\n", "$^*$ Vector methods are currently made possible via the [coffea vector](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.methods.vector.html) methods mixin class structure. In a future version of coffea, they will instead be provided by the dedicated scikit-hep [vector](https://vector.readthedocs.io/en/latest/) library, which provides a richer feature set. The coffea vector methods predate the release of the vector library.\n", "\n", "$^\\dagger$ _Lazy_ access refers to fetching only the needed data from the (possibly remote) file when a sub-array is first accessed. The sub-array is then _materialized_, and subsequent access of the sub-array uses a cached value in memory. As such, fully materializing a `NanoEvents` object may require a significant amount of memory.\n", "\n", "\n", "In this demo, we will use NanoEvents to read a small CMS NanoAOD sample. 
The events object can be instantiated as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import awkward as ak\n", "from coffea.nanoevents import NanoEventsFactory, NanoAODSchema\n", "\n", "NanoAODSchema.warn_missing_crossrefs = False\n", "\n", "fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n", "events = NanoEventsFactory.from_root(\n", " {fname: \"Events\"},\n", " schemaclass=NanoAODSchema,\n", " metadata={\"dataset\": \"DYJets\"},\n", ").events()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the factory constructor, we also pass the desired schema version for this file (the latest version of NanoAOD can be built with `schemaclass=NanoAODSchema`) and some extra metadata that we can later access with `events.metadata`. In a later example, we will show how to set up this metadata in coffea processors where the `events` object is pre-created for you. See the [from_root](https://coffeateam.github.io/coffea/api/coffea.nanoevents.NanoEventsFactory.html#coffea.nanoevents.NanoEventsFactory.from_root) class method for all optional arguments.\n", "\n", "The `events` object is an awkward array, which at its top level is a record array with one record for each \"collection\", where a collection is a grouping of fields (TBranches) based on the naming conventions of [NanoAODSchema](https://coffeateam.github.io/coffea/api/coffea.nanoevents.NanoAODSchema.html). For example, in the file we opened, the branches:\n", "```\n", "Generator_binvar\n", "Generator_scalePDF\n", "Generator_weight\n", "Generator_x1\n", "Generator_x2\n", "Generator_xpdf1\n", "Generator_xpdf2\n", "Generator_id1\n", "Generator_id2\n", "```\n", "are grouped into one sub-record named `Generator`, which can be accessed using either getitem or getattr syntax, i.e. `events[\"Generator\"]` or `events.Generator`. For example:" 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[1,\n", " -1,\n", " -1,\n", " 21,\n", " 21,\n", " 4,\n", " 2,\n", " -2,\n", " 2,\n", " 1,\n", " ...,\n", " 1,\n", " -2,\n", " 2,\n", " 1,\n", " 2,\n", " -2,\n", " -1,\n", " 2,\n", " 1]\n", "--------------------------------------------------------------\n", "type: 40 * int32[parameters={"__doc__": "id of first parton"}]" ], "text/plain": [ "
[[217, 670, 258],\n", " [34.5, 98.3, 1.16e+03, 38.1, 20.4, 29.7],\n", " [306, 62.8, 74.1, 769, 11.2],\n", " [170, 117, 29.3, 45.9],\n", " [101, 117, 129, 15.6],\n", " [63.1, 37.2, 33.7, 36.2],\n", " [303, 50.5, 1.29e+03, 278],\n", " [615, 282, 2.11e+03],\n", " [195, 47.6],\n", " [95, 44.6, 223, 318, 30, 108, 62.9],\n", " ...,\n", " [41.6, 36.7, 78.9, 13],\n", " [1.51e+03, 1.23e+03],\n", " [152, 160, 777, 27.1, 346, 65.1, 37.9, 27.2, 16.3],\n", " [35.4, 20.4],\n", " [20.1, 16.2],\n", " [34],\n", " [553, 283],\n", " [771, 452, 16],\n", " [76.9]]\n", "----------------------------------------------------\n", "type: 40 * var * float32" ], "text/plain": [ "
[[],\n", " [3.13],\n", " [3.45, 2.18],\n", " [1.58, 3.76],\n", " [],\n", " [0.053],\n", " [0.0748],\n", " [],\n", " [],\n", " [1.82],\n", " ...,\n", " [0.00115],\n", " [],\n", " [0.0149],\n", " [],\n", " [0.0308],\n", " [],\n", " [0.0858],\n", " [],\n", " []]\n", "------------------------\n", "type: 40 * var * float32" ], "text/plain": [ "
[None,\n", " 3.13,\n", " 2.18,\n", " 1.58,\n", " None,\n", " 0.053,\n", " 0.0748,\n", " None,\n", " None,\n", " 1.82,\n", " ...,\n", " 0.00115,\n", " None,\n", " 0.0149,\n", " None,\n", " 0.0308,\n", " None,\n", " 0.0858,\n", " None,\n", " None]\n", "-------------------\n", "type: 40 * ?float32" ], "text/plain": [ "
[[],\n", " [-11],\n", " [-11, 11],\n", " [22, None],\n", " [],\n", " [None],\n", " [None],\n", " [],\n", " [],\n", " [11],\n", " ...,\n", " [11],\n", " [],\n", " [11],\n", " [],\n", " [-11],\n", " [],\n", " [None],\n", " [],\n", " []]\n", "---------------------------------------------------------\n", "type: 40 * var * ?int32[parameters={"__doc__": "PDG id"}]" ], "text/plain": [ "
[[84.4, 29.4],\n", " [31.1],\n", " [53.4, 81.9],\n", " [29.2],\n", " [17.5],\n", " [65.9, 47.8],\n", " [58.5, 44.7],\n", " [50.2, 45.2],\n", " [33.3, 25.9],\n", " [None],\n", " [26.1],\n", " [25.8]]\n", "-------------------------------------------------------\n", "type: 12 * var * ?float32[parameters={"__doc__": "pt"}]" ], "text/plain": [ "
[[None, None, 1, 1, 23, 23, 23, 23, ..., 15, -15, -15, -15, -15, -15, 111, 111],\n", " [None, None, -1, 23, 23, 23, 23, ..., -11, None, None, None, None, None, 433],\n", " [None, None, -1, -1, 23, 23, 23, 23, ..., -423, -1, -1, -421, -421, 111, 111],\n", " [None, None, 21, 21, 23, -1, 23, 23, ..., -15, -15, -15, -15, -15, 111, 111],\n", " [None, None, 21, 21, 23, 23, 23, 23, ..., 13, 13, -13, 1, None, None, 2, 2],\n", " [None, None, 4, 23, 23, 23, 23, 23, ..., -15, -15, -15, 15, 15, 15, 423, 311],\n", " [None, None, 2, 2, 2, 23, 23, 2, 23, ..., -13, 13, 2, 2, 2, 2, 111, 111, 111],\n", " [None, None, -2, -2, 23, 21, 21, 23, 23, ..., 21, 21, 21, 21, None, 423, 2, 2],\n", " [None, None, 2, 23, 23, 23, 23, 23, ..., -15, -15, -15, -15, 111, 111, 311],\n", " [None, None, 1, 1, 1, 23, 21, 23, 23, 23, ..., -411, 21, 21, 1, 1, 1, 1, 3, 3],\n", " ...,\n", " [None, None, 1, 23, 23, 23, 23, 23, ..., -15, -15, -15, -15, 1, 1, 111, 111],\n", " [None, None, -2, 23, 23, 23, 23, 23, 13, 13, -13, -13, -13, 13],\n", " [None, None, 2, 2, 2, 23, 2, 23, ..., None, -413, 413, 413, 2, 2, -421, -421],\n", " [None, None, 1, 23, 23, 23, 23, 23, ..., -15, 15, 15, 15, -15, -15, -15, -15],\n", " [None, None, 2, 2, 23, 23, 21, 23, ..., 15, 15, 15, -15, -15, -15, 111, 111],\n", " [None, None, -2, 23, 23, None, 23, 23, ..., -15, -15, -15, 423, 4, 4, 3, 3],\n", " [None, None, -1, 23, 23, 23],\n", " [None, None, 2, 23, 23, 23, 23, -11, -11, 11],\n", " [None, None, 1, 1, 23, 23, 23, 23, ..., -15, -15, -15, 111, 111, 111, 111]]\n", "--------------------------------------------------------------------------------\n", "type: 40 * var * ?int32[parameters={"__doc__": "PDG id"}]" ], "text/plain": [ "
[[None, None, [23, 21], ..., [-16, 111, ..., 211, -211], [22, 22], [22, 22]],\n", " [None, None, [23], [23], [23], [23], ..., None, None, None, None, None, [431]],\n", " [None, None, [23, -1], [23, -1], [23], ..., [13, -14], [13, -14], [22], [22]],\n", " [None, None, [23, -1], ..., [-16, 111, ..., 211, -211], [22, 22], [22, 22]],\n", " [None, None, [23, 1], [23, 1], [23], ..., None, None, [11, -11], [11, -11]],\n", " [None, None, [23], [23], ..., [16, 13, -14], [16, 13, -14], [421], [310]],\n", " [None, None, [23, 2, 2], [23, 2, 2], ..., [...], [22], [11, -11], [11, -11]],\n", " [None, None, [23, 21], [23, 21], [23], ..., None, [421], [11, -11], [11, -11]],\n", " [None, None, [23], [23], ..., [-16, 111, ..., 311], [22, 22], [22, 22], [310]],\n", " [None, None, [23, 21, 21], [23, ...], ..., [11, -11], [11, -11], [11, -11]],\n", " ...,\n", " [None, None, [23], [23], [23], ..., [11, -11], [11, -11], [22, 22], [22, 22]],\n", " [None, None, [23], [23], [23], ..., [...], [-13], [-13, 22], [-13, 22], [13]],\n", " [None, None, [23, 2, 21], ..., [2, 21, ..., 11, -11], [13, -14], [13, -14]],\n", " [None, None, [23], ..., [...], [-16, 211, 211, -211], [-16, 211, 211, -211]],\n", " [None, None, [23, 21], [23, 21], ..., [-16, -11, 12], [22, 22], [22, 22]],\n", " [None, None, [23], [23], ..., [423, -421, 11, -11], [11, -11], [11, -11]],\n", " [None, None, [23], [23], [-13, 13], [-13, 13]],\n", " [None, None, [23], [23], [23], ..., [-11, 11], [-11, 22], [-11, 22], [11]],\n", " [None, None, [23, 21], [23, 21], ..., [22, ...], [22, 22], [22, 22], [22, 22]]]\n", "--------------------------------------------------------------------------------\n", "type: 40 * var * option[var * ?int32[parameters={"__doc__": "PDG id"}]]" ], "text/plain": [ "
[[],\n", " [23, 23],\n", " [23, 23],\n", " [],\n", " [],\n", " [],\n", " [],\n", " [23, 23],\n", " [],\n", " [23, 23],\n", " ...,\n", " [],\n", " [],\n", " [23, 23],\n", " [],\n", " [],\n", " [],\n", " [],\n", " [23, 23],\n", " []]\n", "---------------------------------------------------------\n", "type: 40 * var * ?int32[parameters={"__doc__": "PDG id"}]" ], "text/plain": [ "
[94.6,\n", " 87.6,\n", " 88,\n", " 90.4,\n", " 89.1,\n", " 31.6]\n", "-----------------\n", "type: 6 * float32" ], "text/plain": [ "
[94.6,\n", " 87.6,\n", " 88,\n", " 90.4,\n", " 89.1,\n", " 31.6]\n", "-----------------\n", "type: 6 * float32" ], "text/plain": [ "
[-15,\n", " 15]\n", "--------------------------------------------------\n", "type: 2 * ?int32[parameters={"__doc__": "PDG id"}]" ], "text/plain": [ "
[[],\n", " [121],\n", " [],\n", " [],\n", " [],\n", " []]\n", "-----------------------\n", "type: 6 * var * float32" ], "text/plain": [ "