Reading data with coffea NanoEvents

This is a rendered copy of nanoevents.ipynb. You can optionally run it interactively on binder at this link

NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS NanoAOD format) into a single awkward array with appropriate object methods (such as Lorentz vector methods\(^*\)), cross references, and nested objects, all lazily accessed\(^\dagger\) from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via schema objects, which are community-supplied for various source file types. These schema objects allow a richer interpretation of the file contents than the uproot.lazy methods. Currently available schemas include:

BaseSchema, which provides a simple representation of the input TTree, where each branch is available verbatim as events.branch_name, effectively the same behavior as uproot.lazy. Any branches that uproot supports at “full speed” (i.e. that are fully split and either flat or single-jagged) can be read by this schema;
NanoAODSchema, which is optimized to provide all methods and cross-references in CMS NanoAOD format;
PFNanoAODSchema, which builds a double-jagged particle flow candidate colllection events.jet.constituents from compatible PFNanoAOD input files;
TreeMakerSchema which is designed to read TTrees made by TreeMaker, an alternative CMS nTuplization format;
PHYSLITESchema, for the ATLAS DAOD_PHYSLITE derivation, a compact centrally-produced data format similar to CMS NanoAOD; and
DelphesSchema, for reading Delphes fast simulation nTuples.

We welcome contributions for new schemas, and can assist with the design of them.

\(^*\) Vector methods are currently made possible via the coffea vector methods mixin class structure. In a future version of coffea, they will instead be provided by the dedicated scikit-hep vector library, which provides a more rich feature set. The coffea vector methods predate the release of the vector library.

\(^\dagger\) Lazy access refers to only fetching the needed data from the (possibly remote) file when a sub-array is first accessed. The sub-array is then materialized and subsequent access of the sub-array uses a cached value in memory. As such, fully materializing a NanoEvents object may require a significant amount of memory.

In this demo, we will use NanoEvents to read a small CMS NanoAOD sample. The events object can be instantiated as follows:

[1]:

import awkward as ak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

fname = "https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root"
events = NanoEventsFactory.from_root(
    fname,
    schemaclass=NanoAODSchema.v6,
    metadata={"dataset": "DYJets"},
).events()

In the factory constructor, we also pass the desired schema version (the latest version of NanoAOD can be built with schemaclass=NanoAODSchema) for this file and some extra metadata that we can later access with events.metadata. In a later example, we will show how to set up this metadata in coffea processors where the events object is pre-created for you. Consider looking at the from_root class method to see all optional arguments.

The events object is an awkward array, which at its top level is a record array with one record for each “collection”, where a collection is a grouping of fields (TBranches) based on the naming conventions of NanoAODSchema. For example, in the file we opened, the branches:

Generator_binvar
Generator_scalePDF
Generator_weight
Generator_x1
Generator_x2
Generator_xpdf1
Generator_xpdf2
Generator_id1
Generator_id2

are grouped into one sub-record named Generator which can be accessed using either getitem or getattr syntax, i.e. events["Generator"] or events.Generator. e.g.

[2]:

events.Generator.id1

[2]:

<Array [1, -1, -1, 21, 21, ... 2, -2, -1, 2, 1] type='40 * int32[parameters={"__...'>

[3]:

# all names can be listed with:
events.Generator.fields

[3]:

['binvar', 'scalePDF', 'weight', 'x1', 'x2', 'xpdf1', 'xpdf2', 'id1', 'id2']

In CMS NanoAOD, each TBranch has a self-documenting help string embedded in the title field, which is carried into the NanoEvents, e.g. executing the following cell should produce a help pop-up:

Type:            Array
String form:     [1, -1, -1, 21, 21, 4, 2, -2, 2, 1, 3, 1, ... -1, -1, 1, -2, 2, 1, 2, -2, -1, 2, 1]
Length:          40
File:            ~/src/awkward-1.0/awkward1/highlevel.py
Docstring:       id of first parton
Class docstring: ...

where the Docstring shows information about the content of this array.

[4]:

events.Generator.id1?

Based on a collection’s name or contents, some collections acquire additional methods, which are extra features exposed by the code in the mixin classes of the coffea.nanoevents.methods modules. For example, although events.GenJet has the fields:

[5]:

events.GenJet.fields

[5]:

['eta', 'mass', 'phi', 'pt', 'partonFlavour', 'hadronFlavour']

we can access additional attributes associated to each generated jet by virtue of the fact that they can be interpreted as Lorentz vectors:

[6]:

events.GenJet.energy

[6]:

<Array [[217, 670, 258], ... 16], [76.9]] type='40 * var * float32'>

We can call more complex methods, like computing the distance \(\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi ^2}\) between two LorentzVector objects:

[7]:

# find distance between leading jet and all electrons in each event
dr = events.Jet[:, 0].delta_r(events.Electron)
dr

[7]:

<Array [[], [3.13], [3.45, ... 0.0858], [], []] type='40 * var * float32'>

[8]:

# find minimum distance
ak.min(dr, axis=1)

[8]:

<Array [None, 3.13, 2.18, ... None, None] type='40 * ?float32'>

[9]:

# a convenience method for this operation on all jets is available
events.Jet.nearest(events.Electron)

[9]:

<ElectronArray [[None, None, None, ... [None, None]] type='40 * var * ?electron'>

The assignment of methods classes to collections is done inside the schema object during the initial creation of the array, governed by the awkward array’s __record__ parameter and the associated behavior. See ak.behavior for a more detailed explanation of array behaviors.

Additional methods provide convenience functions for interpreting some branches, e.g. CMS NanoAOD packs several jet identification flag bits into a single integer, jetId. By implementing the bit-twiddling in the Jet mixin, the analsyis code becomes more clear:

[10]:

print(events.Jet.jetId)
print(events.Jet.isTight)

[[6, 6, 6, 6, 6], [6, 2, 6, 6, 6, 6, 6, 0], ... 6], [6], [6, 6, 0, 6, 6, 6], [6, 6]]
[[True, True, True, True, True], [True, ... False, True, True, True], [True, True]]

We can also define convenience functions to unpack and apply some mask to a set of flags, e.g. for generated particles:

[11]:

print(f"Raw status flags: {events.GenPart.statusFlags}")
events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])

Raw status flags: [[10625, 27009, 4481, 22913, 257, 257, ... 13884, 13884, 12876, 12876, 12876, 12876]]

[11]:

<Array [[True, True, False, ... False, False]] type='40 * var * bool'>

CMS NanoAOD also contains pre-computed cross-references for some types of collections. For example, there is a TBranch Electron_genPartIdx which indexes the GenPart collection per event to give the matched generated particle, and -1 if no match is found. NanoEvents transforms these indices into an awkward indexed array pointing to the collection, so that one can directly access the matched particle using getattr syntax:

[12]:

events.Electron.matched_gen.pdgId

[12]:

<Array [[], [-11], [-11, ... [None], [], []] type='40 * var * ?int32[parameters=...'>

[13]:

events.Muon[ak.num(events.Muon)>0].matched_jet.pt

[13]:

<Array [[84.4, 29.4], [31.1, ... 26.1], [25.8]] type='12 * var * ?float32[parame...'>

For generated particles, the parent index is similarly mapped:

[14]:

events.GenPart.parent.pdgId

[14]:

<Array [[None, None, 1, 1, ... 111, 111, 111]] type='40 * var * ?int32[parameter...'>

In addition, using the parent index, a helper method computes the inverse mapping, namely, children. As such, one can find particle siblings with:

[15]:

events.GenPart.parent.children.pdgId
# notice this is a doubly-jagged array

[15]:

<Array [[None, None, [23, 21, ... [22, 22]]] type='40 * var * option[var * ?int3...'>

Since often one wants to shortcut repeated particles in a decay sequence, a helper method distinctParent is also available. Here we use it to find the parent particle ID for all prompt electrons:

[16]:

events.GenPart[
    (abs(events.GenPart.pdgId) == 11)
    & events.GenPart.hasFlags(['isPrompt', 'isLastCopy'])
].distinctParent.pdgId

[16]:

<Array [[], [23, 23], [23, ... [23, 23], []] type='40 * var * ?int32[parameters=...'>

Events can be filtered like any other awkward array using boolean fancy-indexing

[17]:

mmevents = events[ak.num(events.Muon) == 2]
zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]
zmm.mass

[17]:

<Array [94.6, 87.6, 88, 90.4, 89.1, 31.6] type='6 * float32'>

[18]:

# a convenience method is available to sum vectors along an axis:
mmevents.Muon.sum(axis=1).mass

[18]:

<Array [94.6, 87.6, 88, 90.4, 89.1, 31.6] type='6 * float32'>

As expected for this sample, most of the dimuon events have a pair invariant mass close to that of a Z boson. But what about the last event? Let’s take a look at the generator information:

[19]:

print(mmevents[-1].Muon.matched_gen.pdgId)
print(mmevents[-1].Muon.matched_gen.hasFlags(["isPrompt"]))

[-13, 13]
[False, False]

So they are real generated muons, but they are not prompt (i.e. from the initial decay of a heavy resonance)

Let’s look at their parent particles:

[20]:

mmevents[-1].Muon.matched_gen.parent.pdgId

[20]:

<Array [-15, 15] type='2 * ?int32[parameters={"__doc__": "PDG id"}]'>

aha! They are muons coming from tau lepton decays, and hence a fair amount of the Z mass is carried away by the neutrinos:

[21]:

print(mmevents[-1].Muon.matched_gen.sum().mass)
print(mmevents[-1].Muon.matched_gen.parent.sum().mass)

31.265271292167853
91.68363205830444

One can assign new variables to the arrays, with some caveats:

Assignment must use setitem (events["path", "to", "name"] = value)
Assignment to a sliced events won’t be accessible from the original variable
New variables are not visible from cross-references

[22]:

mmevents["Electron", "myvariable"] = mmevents.Electron.pt + zmm.mass
mmevents.Electron.myvariable

[22]:

<Array [[], [121], [], [], [], []] type='6 * var * float32'>