run_spark_job
- coffea.processor.run_spark_job(fileset, processor_instance, executor, executor_args={}, spark=None, partitionsize=200000, thread_workers=16)[source]
A wrapper to submit spark jobs.
A convenience wrapper to submit jobs over a fileset, which is a dictionary of dataset: [file list] entries. It presently supports reading parquet files converted from ROOT. For more customized processing, e.g. to read other objects from the files and pass them into data frames, one can write a similar function in user code.
- Parameters:
fileset (dict) – dictionary {dataset: [file, file], }
processor_instance (ProcessorABC) – an instance of a class deriving from ProcessorABC
Note
The processor instance must define all the columns in data and MC that it reads via its columns property.
executor – anything that inherits from SparkExecutor, such as spark_executor. In general, a function that takes three arguments (items, function, accumulator) and performs an action equivalent to: for item in items: accumulator += function(item)
executor_args – arguments to pass to the creation of a spark session
spark – an optional, already created spark instance; if None, an ephemeral spark instance is created using a config
partitionsize – partition size to aim for (coalesce only, since repartition is too expensive)
thread_workers – how many spark jobs to let fly in parallel during processing steps
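For orientation, below is a minimal sketch of how run_spark_job might be called. The parquet paths and MyProcessor are placeholders: the processor follows the older accumulator-based ProcessorABC interface and declares the columns it reads, spark=None requests an ephemeral local spark session, and the spark_executor import path reflects the coffea.processor.spark layout in recent releases; adjust names to your setup.

    import coffea.processor as processor
    from coffea.processor.spark.spark_executor import spark_executor

    # Placeholder processor using the accumulator-based ProcessorABC interface;
    # any subclass works as long as it declares every column it reads.
    class MyProcessor(processor.ProcessorABC):
        @property
        def columns(self):
            # Columns to load from the parquet files, for both data and MC
            return ['Muon_pt', 'Muon_eta']

        @property
        def accumulator(self):
            return processor.dict_accumulator(
                {'entries': processor.value_accumulator(int)}
            )

        def process(self, df):
            out = self.accumulator.identity()
            out['entries'] += len(df['Muon_pt'])
            return out

        def postprocess(self, accumulator):
            return accumulator

    # fileset maps dataset names to lists of parquet files (paths are placeholders)
    fileset = {
        'ZJets': ['zjets_0.parquet', 'zjets_1.parquet'],
        'Data': ['data_0.parquet'],
    }

    # spark=None asks run_spark_job to build an ephemeral local spark session;
    # pass an existing SparkSession instead to reuse one you manage yourself.
    output = processor.run_spark_job(
        fileset,
        processor_instance=MyProcessor(),
        executor=spark_executor,      # satisfies the SparkExecutor requirement
        spark=None,
        partitionsize=200000,
        thread_workers=4,
    )

The returned output is the accumulated result of processor.process over all files in the fileset.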