Analysis tools and accumulators

This is a rendered copy of accumulators.ipynb. You can optionally run it interactively on binder at this link

Now that we know how to access data with NanoEvents and apply corrections, let’s go through some useful columnar analysis tools and idioms for building accumulators, namely, the eventual output of a coffea processor that has natural “merge” semantics. The most familiar type of accumulator is the histogram, but other types are useful in certain contexts.

We’ll use our small sample file to demonstrate the utilities, although it won’t be very interesting to analyze

[1]:

import numpy as np
import awkward as ak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

fname = "https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root"
events = NanoEventsFactory.from_root(
    fname,
    schemaclass=NanoAODSchema.v6,
    metadata={"dataset": "DYJets"},
).events()

To generate some mock systematics, we’ll use one of the scale factors from the applying_corrections notebook (note you will have to at least execute the cell that downloads test data in that notebook for this to work)

[2]:

from coffea.lookup_tools import extractor

ext = extractor()
ext.add_weight_sets(["* * data/testSF2d.histo.root"])
ext.finalize()
evaluator = ext.make_evaluator()
evaluator.keys()

[2]:

dict_keys(['scalefactors_Tight_Electron', 'scalefactors_Tight_Electron_error'])

Weights

This is a container for event weights and associated systematic shifts, which helps track the product of the weights (i.e. the total event weight to be used for filling histograms) as well as systematic variations to that product. Here we demo its use by constructing an event weight consisting of the generator weight, the $\alpha_s$ uncertainty variation, and the electron ID scale factor with its associated systematic.

[3]:

from coffea.analysis_tools import Weights

weights = Weights(len(events))

weights.add("genWeight", events.genWeight)

weights.add(
    "alphaS",
    # in NanoAOD, the generator weights are already stored with respect to nominal
    weight=np.ones(len(events)),
    # 31 => alphas(MZ)=0.1165 central value; 32 => alphas(MZ)=0.1195
    # per https://lhapdfsets.web.cern.ch/current/PDF4LHC15_nnlo_30_pdfas/PDF4LHC15_nnlo_30_pdfas.info
    # which was found by looking up the LHA ID in events.LHEPdfWeight.__doc__
    weightUp=events.LHEPdfWeight[:, 32],
    weightDown=events.LHEPdfWeight[:, 31],
)

eleSF = evaluator["scalefactors_Tight_Electron"](events.Electron.eta, events.Electron.pt)
eleSFerror = evaluator["scalefactors_Tight_Electron_error"](events.Electron.eta, events.Electron.pt)
weights.add(
    "eleSF",
    # the event weight is the product of the per-electron weights
    # note, in a real analysis we would first have to select electrons of interest
    weight=ak.prod(eleSF, axis=1),
    weightUp=ak.prod(eleSF + eleSFerror, axis=1),
)

A WeightStatistics object tracks the smallest and largest weights seen per type, as well as some other summary statistics. It is kept internally and can be accessed via weights.weightStatistics. This object is addable, so it can be used in an accumulator.

[4]:

weights.weightStatistics

[4]:

{'genWeight': WeightStatistics(sumw=578762.4375, sumw2=27678371840.0, minw=-26331.201171875, maxw=26331.201171875, n=40),
 'alphaS': WeightStatistics(sumw=40.0, sumw2=40.0, minw=1.0, maxw=1.0, n=40),
 'eleSF': WeightStatistics(sumw=38.26972579956055, sumw2=36.81547546386719, minw=0.6674144268035889, maxw=1.0077519416809082, n=40)}

Then the total event weight is available via

[5]:

weights.weight()

[5]:

array([ 26331.20117188,  23933.4196682 ,  24385.06528643,  17224.10489525,
        26331.20117188,  26535.31910775, -23236.21447283,  26067.890625  ,
        26331.20117188,  25105.68057538,  26331.20117188,  26061.82814944,
        26061.82814944, -26331.20117188,  26331.20117188,  24421.85661211,
       -24854.62516629,  26331.20117188, -19978.69273076,  26331.20117188,
        22168.68846612, -25134.17258657,  26331.20117188,  26331.20117188,
        26331.20117188,  24770.33051391,  26331.20117188,  22732.29949574,
        26331.20117188,  24421.85661211,  26331.20117188,  25857.1530546 ,
        26331.20117188, -23903.50101615,  26331.20117188, -24770.33051391,
        26331.20117188, -24906.05914783,  26331.20117188, -26331.20117188])

And the total event weight with a given variation is available via

[6]:

weights.weight("eleSFUp")

[6]:

array([ 26331.20117188,  24397.53583916,  25294.0252539 ,  17997.89462374,
        26331.20117188,  27253.27428405, -23502.68164558,  26067.890625  ,
        26331.20117188,  25230.22218679,  26331.20117188,  26779.78332574,
        26779.78332574, -26331.20117188,  26331.20117188,  24885.97278307,
       -24977.92136851,  26331.20117188, -20983.93483174,  26331.20117188,
        22734.27398292, -25869.18763094,  26331.20117188,  26331.20117188,
        26331.20117188,  25448.12088952,  26331.20117188,  22998.76666849,
        26331.20117188,  24885.97278307,  26331.20117188,  26254.27086417,
        26331.20117188, -24141.51091823,  26331.20117188, -25448.12088952,
        26331.20117188, -25583.84952343,  26331.20117188, -26331.20117188])

all variations tracked by the weights object are available via

[7]:

weights.variations

[7]:

{'alphaSDown', 'alphaSUp', 'eleSFDown', 'eleSFUp'}

PackedSelection

This class can store several boolean arrays in a memory-efficient mannner and evaluate arbitrary combinations of boolean requirements in an CPU-efficient way. Supported inputs include 1D numpy or awkward arrays. This makes it a good tool to form analysis signal and control regions, and to implement cutflow or “N-1” plots.

Below we create a packed selection with some typical selections for a Z+jets study, to be used later to form same-sign and opposite-sign $ee$ and $\mu\mu$ event categories/regions.

[8]:

from coffea.analysis_tools import PackedSelection

selection = PackedSelection()

selection.add("twoElectron", ak.num(events.Electron) == 2)
selection.add("eleOppSign", ak.sum(events.Electron.charge, axis=1) == 0)
selection.add("noElectron", ak.num(events.Electron) == 0)

selection.add("twoMuon", ak.num(events.Muon) == 2)
selection.add("muOppSign", ak.sum(events.Muon.charge, axis=1) == 0)
selection.add("noMuon", ak.num(events.Muon) == 0)


selection.add(
    "leadPt20",
    # assuming one of `twoElectron` or `twoMuon` is imposed, this implies at least one is above threshold
    ak.any(events.Electron.pt >= 20.0, axis=1) | ak.any(events.Muon.pt >= 20.0, axis=1)
)

print(selection.names)

['twoElectron', 'eleOppSign', 'noElectron', 'twoMuon', 'muOppSign', 'noMuon', 'leadPt20']

To evaluate a boolean mask (e.g. to filter events) we can use the selection.all(*names) function, which will compute the AND of all listed boolean selections

[9]:

selection.all("twoElectron", "noMuon", "leadPt20")

[9]:

array([False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

We can also be more specific and require that a specific set of selections have a given value (with the unspecified ones allowed to be either true or false) using selection.require

[10]:

selection.require(twoElectron=True, noMuon=True, eleOppSign=False)

[10]:

array([False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

Using the python syntax for passing an arguments variable, we can easily implement a “N-1” style selection

[11]:

allCuts = {"twoElectron", "noMuon", "leadPt20"}
for cut in allCuts:
    nev = selection.all(*(allCuts - {cut})).sum()
    print(f"Events passing all cuts, ignoring {cut}: {nev}")

nev = selection.all(*allCuts).sum()
print(f"Events passing all cuts: {nev}")

Events passing all cuts, ignoring noMuon: 3
Events passing all cuts, ignoring twoElectron: 10
Events passing all cuts, ignoring leadPt20: 5
Events passing all cuts: 3

Accumulator semantics

Prior to coffea 0.7.2, accumulators were more explicit in that they inherited from AccumulatorABC. Such accumulators are still supported, but now the requirements for an acuumulator output of a processor have been relaxed to the protocol:

class Addable(Protocol):
    def __add__(self: T, other: T) -> T:
        ...


Accumulatable = Union[Addable, MutableSet, MutableMapping]

Essentially, any item that can be added, any set, or any dictionary of the above, can be used as an accumulator:

[12]:

from coffea.processor import accumulate

accumulate([
    {"num": 0, "val": 3.1},
    {"num": 2, "val": 4.0, "thing": {"a", "b"}},
    {"num": 2, "val": 4.0, "thing": {"a", "c"}},
])

[12]:

{'num': 4, 'val': 11.1, 'thing': {'a', 'b', 'c'}}

Since hist histograms are addable, they can be used as well:

[13]:

import hist

def makehist():
    return hist.Hist.new.Reg(10, 0, 1).Double()

h = accumulate([
    makehist().fill([0.1, 0.1, 0.3]),
    makehist().fill([0.1, 0.3, 0.5]),
])
print(h.values())
h

[0. 3. 0. 2. 0. 1. 0. 0. 0. 0.]

[13]:

Regular(10, 0, 1, label='Axis 0')

Double() Σ=6.0

Bringing it together

Let’s build an output accumulator that stores, per dataset: - the sum of weights for the events processed, to use for later luminosity-normalizing the yields; - a histogram of the dilepton invariant mass, with category axes for various selection regions of interest and systematics; and - the weight statistics, for debugging purposes

[14]:

regions = {
    "ee": {"twoElectron": True, "noMuon": True, "leadPt20": True, "eleOppSign": True},
    "eeSS": {"twoElectron": True, "noMuon": True, "leadPt20": True, "eleOppSign": False},
    "mm": {"twoMuon": True, "noElectron": True, "leadPt20": True, "muOppSign": True},
    "mmSS": {"twoMuon": True, "noElectron": True, "leadPt20": True, "muOppSign": False},
}

masshist = (
    hist.Hist.new
    .StrCat(regions.keys(), name="region")
    .StrCat(["nominal"] + list(weights.variations), name="systematic")
    .Reg(60, 60, 120, name="mass", label="$m_{ll}$ [GeV]")
    .Weight()
)

for region, cuts in regions.items():
    goodevent = selection.require(**cuts)

    if region.startswith("ee"):
        mass = events.Electron[goodevent].sum().mass
    elif region.startswith("mm"):
        mass = events.Muon[goodevent].sum().mass

    masshist.fill(
        region=region,
        systematic="nominal",
        mass=mass,
        weight=weights.weight()[goodevent],
    )
    for syst in weights.variations:
        masshist.fill(
            region=region,
            systematic=syst,
            mass=mass,
            weight=weights.weight(syst)[goodevent],
        )

out = {
    events.metadata["dataset"]: {
        "sumw": ak.sum(events.genWeight),
        "mass": masshist,
        "weightStats": weights.weightStatistics,
    }
}

out

[14]:

{'DYJets': {'sumw': 578762.44,
  'mass': Hist(
    StrCategory(['ee', 'eeSS', 'mm', 'mmSS'], name='region', label='region'),
    StrCategory(['nominal', 'alphaSUp', 'eleSFUp', 'alphaSDown', 'eleSFDown'], name='systematic', label='systematic'),
    Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]'),
    storage=Weight()) # Sum: WeightedSum(value=621215, variance=2.20829e+10),
  'weightStats': {'genWeight': WeightStatistics(sumw=578762.4375, sumw2=27678371840.0, minw=-26331.201171875, maxw=26331.201171875, n=40),
   'alphaS': WeightStatistics(sumw=40.0, sumw2=40.0, minw=1.0, maxw=1.0, n=40),
   'eleSF': WeightStatistics(sumw=38.26972579956055, sumw2=36.81547546386719, minw=0.6674144268035889, maxw=1.0077519416809082, n=40)}}}

The cell below demonstrates that indeed this output is accumulatable. So if we were to return such a result from a coffea processor, we would be all ready to process thousands of files!

[15]:

accumulate([out, out])

[15]:

{'DYJets': {'mass': Hist(
    StrCategory(['ee', 'eeSS', 'mm', 'mmSS'], name='region', label='region'),
    StrCategory(['nominal', 'alphaSUp', 'eleSFUp', 'alphaSDown', 'eleSFDown'], name='systematic', label='systematic'),
    Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]'),
    storage=Weight()) # Sum: WeightedSum(value=1.24243e+06, variance=4.41659e+10),
  'weightStats': {'genWeight': WeightStatistics(sumw=1157524.875, sumw2=55356743680.0, minw=-26331.201171875, maxw=26331.201171875, n=80),
   'alphaS': WeightStatistics(sumw=80.0, sumw2=80.0, minw=1.0, maxw=1.0, n=80),
   'eleSF': WeightStatistics(sumw=76.5394515991211, sumw2=73.63095092773438, minw=0.6674144268035889, maxw=1.0077519416809082, n=80)},
  'sumw': 1157524.9}}

The mass histogram itself is not very interesting with only 40 input events, however

[16]:

out["DYJets"]["mass"][sum, "nominal", :]

[16]:

Regular(60, 60, 120, name='mass', label='$m_{ll}$ [GeV]')

Weight() Σ=WeightedSum(value=126744, variance=4.49114e+09)