terminator.data.data.TERMDataset

class terminator.data.data.TERMDataset(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)

Bases: Dataset

TERM dataset that loads all feature files into a PyTorch Dataset-like structure.

Variables:
  • dataset (list) – list of tuples containing features, TERM length, and sequence length

  • shuffle_idx (list) – list of indices into the dataset, used for shuffling

__init__(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32)

Initializes current TERM dataset by reading in feature files.

Reads in all feature files from the given directory, using multiprocessing with the given number of processes. Each entry is stored as a tuple of the features, the TERM length, and the sequence length. Can read from PDB ids or file paths directly. Proteins shorter than min_protein_len are excluded.

Parameters:
  • in_folder (str) – path to directory containing feature files generated by scripts/data/preprocessing/generateDataset.py

  • pdb_ids (list, optional) – list of PDB ids from in_folder to include in the dataset

  • min_protein_len (int, default=30) – minimum length of a protein in the dataset

  • num_processes (int, default=32) – number of processes to use during data loading
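
The loading behavior described above can be illustrated with a small standalone sketch. This is not the TERMDataset implementation. The helper names (_load_features, load_term_features), the "<pdb_id>.features" file naming, the pickle format, the "seq_len" and "term_len" keys, the inclusive length cutoff, and the serial fallback for num_processes=1 are all assumptions made for illustration.

```python
import os
import pickle
from multiprocessing import Pool


def _load_features(path):
    """Load one pickled feature dict (hypothetical on-disk format)."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


def load_term_features(in_folder, pdb_ids=None, min_protein_len=30, num_processes=32):
    """Return a list of (features, term_len, seq_len) tuples."""
    if pdb_ids is None:
        # Assumption: every "*.features" file in the folder is one protein.
        paths = sorted(
            os.path.join(in_folder, name)
            for name in os.listdir(in_folder)
            if name.endswith(".features")
        )
    else:
        # Assumption: each PDB id maps to "<in_folder>/<pdb_id>.features".
        paths = [os.path.join(in_folder, f"{pdb}.features") for pdb in pdb_ids]

    if num_processes > 1:
        # Read the files in parallel with a pool of worker processes.
        with Pool(num_processes) as pool:
            loaded = pool.map(_load_features, paths)
    else:
        # Serial fallback, convenient for debugging and small folders.
        loaded = [_load_features(path) for path in paths]

    dataset = []
    for features in loaded:
        seq_len = features["seq_len"]    # hypothetical key
        term_len = features["term_len"]  # hypothetical key
        # Assumption: the cutoff is inclusive (keep seq_len >= min_protein_len).
        if seq_len >= min_protein_len:
            dataset.append((features, term_len, seq_len))
    return dataset
```

When the parallel path is run as a script on a platform that spawns worker processes, the call should sit under an `if __name__ == "__main__":` guard.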

Methods

__init__(in_folder[, pdb_ids, ...])

Initializes current TERM dataset by reading in feature files.

shuffle()

Shuffle the current dataset.

Attributes

functions

shuffle()

Shuffle the current dataset.
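
One common way to combine a stored list with a separate shuffle_idx list is to shuffle only the indices and route item access through them. The toy class below sketches that pattern. It is a hypothetical stand-in, and whether TERMDataset's shuffle() and __getitem__ work this way is an assumption, not something this page states.

```python
import random


class IndexShuffledList:
    """Toy stand-in for index-based shuffling (not the TERMDataset code)."""

    def __init__(self, items):
        self.dataset = list(items)
        # One index per stored item, initially in storage order.
        self.shuffle_idx = list(range(len(self.dataset)))

    def shuffle(self):
        # Permute only the indices; the stored items never move.
        random.shuffle(self.shuffle_idx)

    def __len__(self):
        return len(self.shuffle_idx)

    def __getitem__(self, idx):
        # Look up the item through the (possibly shuffled) index list.
        return self.dataset[self.shuffle_idx[idx]]
```

Shuffling the indices rather than the data keeps the underlying list in a stable order, and any entry can still be reached through dataset directly.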