terminator.data.data.TERMLazyDataset

class terminator.data.data.TERMLazyDataset(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)

Bases: Dataset

TERM dataset that indexes feature files in a PyTorch Dataset-like structure.

Unlike TERMDataset, this class stores only the feature filenames, not the features themselves.

Variables:
  • dataset (list) – list of tuples, each containing a feature filename, TERM length, and sequence length

  • shuffle_idx (list) – list of indices into dataset, used for shuffling

__init__(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)

Initializes the TERM dataset by reading in feature files.

Reads in all feature files from the given directory, using multiprocessing with the provided number of processes. Each entry is stored as a tuple of the feature filename, the TERM length, and the sequence length. Can read from PDB IDs or file paths directly. Proteins shorter than min_protein_len are excluded.

Parameters:
  • in_folder (str) – path to directory containing feature files generated by scripts/data/preprocessing/generateDataset.py

  • pdb_ids (list, optional) – list of PDB IDs from in_folder to include in the dataset

  • min_protein_len (int, default=30) – minimum length of a protein in the dataset

  • num_processes (int, default=32) – number of processes to use during data loading
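
The constructor behavior described above can be illustrated with a minimal, dependency-free sketch. This is not the TERMinator implementation: the class name LazyIndex, the helper functions, and the on-disk layout (one sub-folder per PDB ID holding a two-line `.length` file) are all hypothetical, and a thread pool stands in for the process pool so the example runs anywhere. It also assumes the length cutoff is applied to the sequence length.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_lengths(in_folder, pdb_id):
    """Return (feature_path, term_len, seq_len) for one entry.

    Assumed (hypothetical) layout: <in_folder>/<id>/<id>.length holds the
    TERM length on line 1 and the sequence length on line 2.
    """
    base = os.path.join(in_folder, pdb_id, pdb_id)
    with open(base + ".length") as fp:
        term_len = int(fp.readline())
        seq_len = int(fp.readline())
    return base + ".features", term_len, seq_len


class LazyIndex:
    """Toy stand-in for a lazy dataset: keeps filenames and lengths only."""

    def __init__(self, in_folder, pdb_ids=None, min_protein_len=30, num_workers=4):
        if pdb_ids is None:
            # No explicit IDs given: treat every sub-folder as one entry.
            pdb_ids = sorted(os.listdir(in_folder))
        # Thread pool used here as a simple stand-in for multiprocessing.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            records = list(pool.map(lambda p: read_lengths(in_folder, p), pdb_ids))
        # Assumption: the cutoff applies to the sequence length.
        self.dataset = [r for r in records if r[2] >= min_protein_len]
        self.shuffle_idx = list(range(len(self.dataset)))

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[self.shuffle_idx[idx]]


def make_demo_folder(entries):
    """Write a throwaway folder in the assumed layout, for demonstration."""
    root = tempfile.mkdtemp()
    for pdb_id, (term_len, seq_len) in entries.items():
        os.makedirs(os.path.join(root, pdb_id))
        with open(os.path.join(root, pdb_id, pdb_id + ".length"), "w") as fp:
            fp.write(f"{term_len}\n{seq_len}\n")
    return root


demo_root = make_demo_folder({"1abc": (120, 45), "2xyz": (40, 12), "3def": (300, 90)})
index = LazyIndex(demo_root, min_protein_len=30)
```

Because only small length records are read up front, the index stays cheap to build and hold in memory; the feature files themselves would be opened later, when a batch is actually needed.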

Methods

__init__(in_folder[, pdb_ids, ...])

Initializes the TERM dataset by reading in feature files.

shuffle()

Shuffle the dataset.

Attributes

functions

shuffle()

Shuffle the dataset.
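
One plausible way to shuffle without touching the stored records is to permute an index list, which matches the role the shuffle_idx variable is described as having. The sketch below assumes that design; the class name ShuffleableIndex and the sample records are hypothetical, and the actual method may differ.

```python
import random


class ShuffleableIndex:
    """Toy container that shuffles by permuting indices, not records."""

    def __init__(self, records):
        self.dataset = list(records)
        self.shuffle_idx = list(range(len(self.dataset)))

    def shuffle(self):
        # Permute the index list in place; self.dataset keeps its order.
        random.shuffle(self.shuffle_idx)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Indirect lookup through the (possibly shuffled) index list.
        return self.dataset[self.shuffle_idx[idx]]


# Hypothetical records: (feature filename, TERM length, sequence length).
records = [
    ("1abc.features", 120, 45),
    ("3def.features", 300, 90),
    ("4ghi.features", 210, 64),
    ("5jkl.features", 95, 33),
]

random.seed(0)  # fixed seed so repeated runs give the same order
idx = ShuffleableIndex(records)
idx.shuffle()
shuffled_view = [idx[i] for i in range(len(idx))]
```

Keeping the records fixed and shuffling only the indices means the original order can be recovered at any time by resetting shuffle_idx.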