scripts.data.preprocessing.generateDataset.generateDatasetParallel

scripts.data.preprocessing.generateDataset.generateDatasetParallel(in_folder, out_folder, cutoff=50, num_cores=1, update=True, coords_only=False, dummy_terms=None)

Parallelize dataGen over a list of files.

Parameters:
  • in_folder (str) – Path to input directory in proper structure

  • out_folder (str) – Path to the output folder

  • cutoff (int) – Max number of TERMs to featurize

  • num_cores (int) – Number of processes to parallelize with

  • update (bool) – Whether or not to overwrite existing files

  • coords_only (bool) – Whether to use only backbone-derived features

  • dummy_terms (str or None) – Method by which to incorporate dummy TERMs. Options are 'replace', which replaces TERM features with those derived from a dummy TERM, and 'include', which adds the dummy TERM to the mined TERM matches.
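
Example (a minimal sketch, not taken from the package itself): the snippet below shows one way a call might be assembled from the parameters listed above. The helper make_call_kwargs, the constant VALID_DUMMY_TERMS, and the paths "data/raw" and "data/features" are hypothetical and exist only for illustration. The import path is inferred from the qualified name at the top of this page and assumes the repository root is on the Python path. Whether the function itself validates its arguments is not stated here, so the checks are done by the helper.

```python
from typing import Any, Dict, Optional

# Values documented for dummy_terms in the parameter list above.
VALID_DUMMY_TERMS = (None, "replace", "include")


def make_call_kwargs(in_folder: str, out_folder: str, num_cores: int = 1,
                     dummy_terms: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper (not part of the package): check and collect arguments."""
    if dummy_terms not in VALID_DUMMY_TERMS:
        raise ValueError(
            f"dummy_terms must be one of {VALID_DUMMY_TERMS}, got {dummy_terms!r}"
        )
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return {
        "in_folder": in_folder,
        "out_folder": out_folder,
        "cutoff": 50,          # documented default
        "num_cores": num_cores,
        "update": True,        # documented default
        "coords_only": False,  # documented default
        "dummy_terms": dummy_terms,
    }


# Hypothetical paths; substitute a real input directory and output folder.
kwargs = make_call_kwargs("data/raw", "data/features",
                          num_cores=4, dummy_terms="include")

if __name__ == "__main__":
    try:
        # Assumed import path, derived from the qualified name of the function.
        from scripts.data.preprocessing.generateDataset import generateDatasetParallel
    except ImportError:
        print("Package not importable here; the call would use:", kwargs)
    else:
        generateDatasetParallel(**kwargs)
```

Collecting the arguments in a dictionary first keeps the documented defaults visible in one place and lets an invalid dummy_terms value fail before any worker processes are started.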