preprocess

coffea.dataset_tools.preprocess(fileset: Dict[str, DatasetSpecOptional], step_size: None | int = None, align_clusters: bool = False, recalculate_steps: bool = False, files_per_batch: int = 1, skip_bad_files: bool = False, file_exceptions: Exception | Warning | tuple[Exception | Warning] = (OSError,), save_form: bool = False, scheduler: None | Callable | str = None, uproot_options: dict = {}, step_size_safety_factor: float = 0.5) → tuple[Dict[str, DatasetSpec], Dict[str, DatasetSpecOptional]]

Given a list of normalized file and object paths (defined in uproot), determine the steps for each file according to the supplied processing options.

Parameters:
  • fileset (FilesetSpecOptional) – The set of datasets whose files will be preprocessed.

  • step_size (int | None, default None) – If specified, the size of the steps to make when analyzing the input files.

  • align_clusters (bool, default False) – Round to the cluster size in a ROOT file when chunks are specified. Reduces data transfer in analysis.

  • recalculate_steps (bool, default False) – If steps are present in the input normalized files, force their recalculation instead of recalculating them only when the uuid has changed.

  • skip_bad_files (bool, default False) – Instead of failing, catch exceptions specified by file_exceptions and return null data.

  • file_exceptions (Exception | Warning | tuple[Exception | Warning], default (OSError,)) – What exceptions to catch when skipping bad files.

  • save_form (bool, default False) – Extract the form of the TTree from each file in each dataset, creating the union of the forms over the dataset.

  • scheduler (None | Callable | str, default None) – Specifies the scheduler that dask should use to execute the preprocessing task graph.

  • uproot_options (dict, default {}) – Options to pass to get_steps for opening files with uproot.

  • step_size_safety_factor (float, default 0.5) – When using align_clusters, if a resulting step is larger than step_size by this factor, warn the user that the resulting steps may be highly irregular.
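
To make the interplay of step_size, align_clusters and step_size_safety_factor concrete, here is a minimal pure-Python sketch of how steps could be derived. It is an illustration only, not coffea's implementation: the helper names (uniform_steps, cluster_aligned_steps), the greedy alignment rule, and the reading of "larger than step_size by this factor" as (1 + factor) * step_size are all assumptions.

```python
import warnings


def uniform_steps(num_entries, step_size):
    """Split [0, num_entries) into consecutive [start, stop) pairs
    holding at most step_size entries each."""
    return [
        [start, min(start + step_size, num_entries)]
        for start in range(0, num_entries, step_size)
    ]


def cluster_aligned_steps(cluster_offsets, step_size, safety_factor=0.5):
    """Greedy sketch: close a step at the first cluster boundary that lies
    at least step_size entries past the start of the current step.

    cluster_offsets is a sorted list of entry offsets marking cluster
    boundaries, including the first (0) and last (num_entries) offsets.
    """
    edges = [cluster_offsets[0]]
    for offset in cluster_offsets[1:]:
        if offset - edges[-1] >= step_size:
            edges.append(offset)
    # Always finish on the last boundary so that no entries are dropped.
    if edges[-1] != cluster_offsets[-1]:
        edges.append(cluster_offsets[-1])
    steps = [[lo, hi] for lo, hi in zip(edges[:-1], edges[1:])]

    # Assumed interpretation of the safety factor: warn when any step
    # holds more than (1 + safety_factor) * step_size entries.
    limit = (1 + safety_factor) * step_size
    oversized = [s for s in steps if s[1] - s[0] > limit]
    if oversized:
        warnings.warn(
            f"{len(oversized)} step(s) exceed {limit:g} entries; "
            "resulting steps may be highly irregular"
        )
    return steps
```

With cluster boundaries every 3 entries and step_size=5, the aligned steps end on multiples of 3 rather than of 5, so a reader never has to decompress a cluster only to use part of it.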

Returns:

  • out_available (FilesetSpec) – The subset of files in each dataset that were successfully preprocessed, organized by dataset.

  • out_updated (FilesetSpecOptional) – The original set of datasets including files that were not accessible, updated to include the result of preprocessing where available.
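
A short usage sketch may help tie the parameters and return values together. The dataset name, file path, and tree name below are hypothetical, and the fileset layout (dataset name mapping to a "files" dictionary of file path to object path) is an assumption about the expected input rather than something stated above. The helper inaccessible_files is likewise hypothetical: it compares the two returned filesets, assuming each dataset entry carries a "files" mapping.

```python
def run_preprocess(fileset, step_size=100_000):
    """Call preprocess and return (out_available, out_updated)."""
    # Imported lazily so this sketch can be loaded without coffea installed.
    from coffea.dataset_tools import preprocess

    return preprocess(
        fileset,
        step_size=step_size,
        align_clusters=False,
        skip_bad_files=True,         # return null data instead of raising
        file_exceptions=(OSError,),  # exceptions treated as "bad file"
    )


def inaccessible_files(out_available, out_updated):
    """Return {dataset: [file, ...]} for files listed in out_updated
    that are absent from out_available."""
    missing = {}
    for name, spec in out_updated.items():
        good = out_available.get(name, {}).get("files", {})
        bad = [path for path in spec["files"] if path not in good]
        if bad:
            missing[name] = bad
    return missing


# Hypothetical input: one dataset with one file containing a tree "Events".
example_fileset = {
    "my_dataset": {
        "files": {"path/to/events.root": "Events"},
    }
}

# Usage (requires coffea and a readable file at the path above):
# out_available, out_updated = run_preprocess(example_fileset)
# print(inaccessible_files(out_available, out_updated))
```

Keeping out_updated around makes it possible to retry only the files that failed, while out_available can be handed directly to the analysis step.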