scripts.data.preprocessing.generateDataset.generateDatasetParallel

scripts.data.preprocessing.generateDataset.generateDatasetParallel(in_folder, out_folder, cutoff=50, num_cores=1, update=True, coords_only=False, dummy_terms=None)

Parallelize dataGen over a list of files.

Parameters:

in_folder (str) – Path to input directory in proper structure
out_folder (str) – Path to the output folder
cutoff (int) – Max number of TERMs to featurize
num_cores (int) – Number of processes to parallelize with
update (bool) – Whether or not to overwrite existing files
coords_only (bool) – Whether to use only backbone-derived features
dummy_terms (str or None) – Method by which to incorporate dummy TERMs. Options are 'replace', which replaces TERM features with those derived from a dummy TERM, and 'include', which adds the dummy TERM to the mined TERM matches.
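
Example:

The parameters above could be assembled and checked before the call, roughly as in the sketch below. The helpers build_kwargs and run are hypothetical and not part of this module. Treating None as "no dummy TERMs" and requiring positive cutoff and num_cores are assumptions drawn from the parameter descriptions, not documented behavior, and the folder paths are placeholders.

```python
# Values documented for dummy_terms. Assumption: None means "do not use dummy TERMs".
VALID_DUMMY_TERMS = (None, "replace", "include")


def build_kwargs(in_folder, out_folder, cutoff=50, num_cores=1,
                 update=True, coords_only=False, dummy_terms=None):
    """Hypothetical helper: check the arguments and return them as a dict."""
    if dummy_terms not in VALID_DUMMY_TERMS:
        raise ValueError(
            f"dummy_terms must be one of {VALID_DUMMY_TERMS}, got {dummy_terms!r}"
        )
    # Assumption: both counts must be positive for the call to be meaningful.
    if cutoff < 1:
        raise ValueError("cutoff must be a positive integer")
    if num_cores < 1:
        raise ValueError("num_cores must be a positive integer")
    return {
        "in_folder": in_folder,
        "out_folder": out_folder,
        "cutoff": cutoff,
        "num_cores": num_cores,
        "update": update,
        "coords_only": coords_only,
        "dummy_terms": dummy_terms,
    }


def run(in_folder, out_folder, **options):
    """Hypothetical wrapper: validate the arguments, then call generateDatasetParallel."""
    kwargs = build_kwargs(in_folder, out_folder, **options)
    # Imported here so build_kwargs stays usable without the package on the path.
    from scripts.data.preprocessing.generateDataset import generateDatasetParallel
    return generateDatasetParallel(**kwargs)
```

A call such as run("path/to/in_folder", "path/to/out_folder", num_cores=4, dummy_terms="replace") would then featurize the input folder with four processes, using the documented 'replace' option for dummy TERMs.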