terminator.data.data.TERMLazyDataset
- class terminator.data.data.TERMLazyDataset(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)
Bases: Dataset
TERM Dataset that loads all feature files into a PyTorch Dataset-like structure.
Unlike TERMDataset, this loads only the feature filenames, not the features themselves.
- Variables:
dataset (list) – list of tuples containing feature filenames, TERM length, and sequence length
shuffle_idx (list) – array of indices for the dataset, for shuffling
- __init__(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)
Initializes current TERM dataset by reading in feature files.
Reads in all feature files from the given directory, using multiprocessing with the provided number of processes. Stores the feature filenames, the TERM length, and the sequence length as a tuple representing the data. Can read from PDB ids or file paths directly. Uses the given protein length as a cutoff.
- Parameters:
in_folder (str) – path to directory containing feature files generated by scripts/data/preprocessing/generateDataset.py
pdb_ids (list, optional) – list of pdbs from in_folder to include in the dataset
min_protein_len (int, default=30) – minimum length of a protein in the dataset
num_processes (int, default=32) – number of processes to use during dataloading
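The initialization described above (collect filenames and lengths in parallel, then apply the length cutoff) can be sketched roughly as follows. This is not the library's implementation. The function names `read_lengths` and `build_lazy_index`, the `<in_folder>/<pdb_id>/<pdb_id>.features` layout, and the `.length` sidecar file holding the TERM length and sequence length are all assumptions made for illustration, as is treating the cutoff as inclusive.

```python
import multiprocessing as mp
import os


def read_lengths(job):
    """Return (feature_path, term_len, seq_len), or None if the entry is skipped."""
    in_folder, pdb_id, min_protein_len = job
    base = os.path.join(in_folder, pdb_id, pdb_id)
    # Assumed sidecar file containing "<term_len> <seq_len>"; the real layout may differ.
    length_path = base + ".length"
    if not os.path.isfile(length_path):
        return None
    with open(length_path) as fh:
        term_len, seq_len = (int(tok) for tok in fh.read().split()[:2])
    # Drop proteins shorter than the cutoff (assumed inclusive lower bound).
    if seq_len < min_protein_len:
        return None
    # Only the filename is stored; the features are not read here.
    return (base + ".features", term_len, seq_len)


def build_lazy_index(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32):
    """Build a list of (feature filename, TERM length, sequence length) tuples."""
    if pdb_ids is None:
        # With no explicit IDs, treat every subdirectory as one entry.
        pdb_ids = sorted(
            d for d in os.listdir(in_folder)
            if os.path.isdir(os.path.join(in_folder, d))
        )
    jobs = [(in_folder, pdb_id, min_protein_len) for pdb_id in pdb_ids]
    if num_processes > 1:
        # Fan the small per-entry reads out over a pool of worker processes.
        with mp.Pool(num_processes) as pool:
            results = pool.map(read_lengths, jobs)
    else:
        results = [read_lengths(job) for job in jobs]
    return [r for r in results if r is not None]
```

Because only small length files are read, the resulting index stays cheap to hold in memory even for large datasets; the feature files themselves would be opened later, when a batch is assembled. On platforms that start worker processes with spawn, the call with `num_processes > 1` should sit under an `if __name__ == "__main__":` guard.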
Methods
__init__(in_folder[, pdb_ids, ...]) – Initializes current TERM dataset by reading in feature files.
shuffle() – Shuffle the dataset
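One way the `dataset` and `shuffle_idx` variables could work together is shown below: shuffling permutes only the index list, and lookups go through that list, so the stored tuples never move. The class name `LazyIndex` and its `__getitem__` indirection are hypothetical and are not taken from the library.

```python
import random


class LazyIndex:
    """Hypothetical stand-in illustrating index-based shuffling."""

    def __init__(self, dataset):
        # Each entry is a (feature filename, TERM length, sequence length) tuple.
        self.dataset = list(dataset)
        self.shuffle_idx = list(range(len(self.dataset)))

    def shuffle(self):
        # Permute the indices in place; the underlying entries are untouched.
        random.shuffle(self.shuffle_idx)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Resolve the position through the shuffled index.
        return self.dataset[self.shuffle_idx[idx]]
```

Keeping the permutation separate from the data makes reshuffling between epochs cheap, since only a list of integers is reordered.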
Attributes
functions