scripts.data.preprocessing.generateDataset

Generate feature files for TERMinator.

Usage:
python generateDataset.py \
    --in_folder <input_folder> \
    --out_folder <output_folder> \
    [--cutoff <matches_cutoff>] \
    [-n <num_processes>] \
    [-u] \
    [--coords_only] \
    [--dummy_terms [None, 'replace', 'include']]

--in_folder <input_folder> should be structured as <input_folder>/<pdb_id>/<pdb_id>.<ext>. Full feature generation requires both a .dat and a .red.pdb file for each structure; with --coords_only, only the .red.pdb file is required. If you use scripts/data/preprocessing/cleanStructs.py, this structure is built automatically.
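Before a long run it can help to confirm that every structure folder has the files it needs. The following is a minimal sketch, not part of generateDataset.py: the helper name missing_inputs is hypothetical, and it assumes only the <input_folder>/<pdb_id>/<pdb_id>.<ext> layout described above.

```python
from pathlib import Path


def missing_inputs(in_folder, coords_only=False):
    """Return {pdb_id: [missing extensions]} for incomplete structure folders.

    Hypothetical helper; assumes <in_folder>/<pdb_id>/<pdb_id>.<ext>.
    """
    # --coords_only needs only the cleaned backbone; full runs also need .dat
    required = [".red.pdb"] if coords_only else [".dat", ".red.pdb"]
    problems = {}
    for entry in sorted(Path(in_folder).iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files at the top level
        pdb_id = entry.name
        missing = [
            ext for ext in required
            if not (entry / f"{pdb_id}{ext}").is_file()
        ]
        if missing:
            problems[pdb_id] = missing
    return problems
```

An empty dictionary means every folder is complete for the chosen mode.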

--out_folder <output_folder> will be structured as <output_folder>/<pdb_id>/<pdb_id>.<ext>, where <ext> includes .features, which specifies protein and TERM features, and .length, which contains two integers. The first integer specifies the number of TERM residues in the protein, while the second integer specifies the sequence length of the protein.
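A .length file can be read back with a few lines of Python. This is a sketch under an assumption the text does not confirm: that the two integers are stored as plain text separated by whitespace or newlines. The function name read_length_file is hypothetical.

```python
from pathlib import Path


def read_length_file(path):
    """Return (num_term_residues, sequence_length) from a .length file.

    Assumes two whitespace- or newline-separated integers in plain text.
    """
    tokens = Path(path).read_text().split()
    if len(tokens) != 2:
        raise ValueError(f"expected 2 integers in {path}, found {len(tokens)}")
    term_len, seq_len = (int(token) for token in tokens)
    return term_len, seq_len
```

If the files turn out to use a different encoding, only the parsing line needs to change.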

--cutoff <matches_cutoff> restricts the number of matches featurized to the top <matches_cutoff>, ranked by increasing RMSD. Defaults to 50.
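The effect of the cutoff can be illustrated with a short sketch. The match representation here, (rmsd, sequence) tuples, is a hypothetical stand-in and not the script's internal data structure.

```python
def top_matches(matches, cutoff=50):
    """Keep the `cutoff` matches with the lowest RMSD.

    `matches` is a list of (rmsd, sequence) tuples (hypothetical layout).
    """
    # Sort by increasing RMSD, then truncate to the best `cutoff` entries
    return sorted(matches, key=lambda match: match[0])[:cutoff]


example = [(0.9, "AGL"), (0.2, "AGV"), (0.5, "SGL")]
best_two = top_matches(example, cutoff=2)
```

With cutoff=2, the match with RMSD 0.9 is dropped and the remaining two are returned in order of increasing RMSD.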

-n <num_processes> specifies how many processes to use while processing. Defaults to 1.

-u is an optional flag that, if specified, forces rewriting of existing feature files.

--coords_only is an optional flag that, if specified, generates only backbone-derived features. This mode does not require prior TERM mining, but it does require that you clean the backbone using scripts/data/preprocessing/cleanStructs.py.

--dummy_terms specifies how dummy TERMs are incorporated into features. A dummy TERM is a construct with a single TERM match that has a degenerate X sequence and structural features derived from the target structure. By default, it is set to None, meaning no dummy TERMs are used. If set to 'replace', only the dummy TERM is included. If set to 'include', the first match is set to the dummy TERM match and the remaining TERMs are those parsed from the .dat file.

See python generateDataset.py --help for more info.

Functions

dataGen(file, out_folder, cutoff, ...)

Wrapper function for parallelization which deals with paths and other args.

generateDatasetParallel(in_folder, out_folder)

Parallelize dataGen over a list of files.
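The wrapper-plus-pool pattern these two functions describe can be sketched as follows. This is an illustration only, assuming the standard library's multiprocessing.Pool; whether generateDataset.py uses this exact mechanism is not stated here. process_one and process_all are hypothetical stand-ins for dataGen and generateDatasetParallel, and process_one only builds the output path it would write.

```python
from functools import partial
from multiprocessing import Pool


def process_one(file, out_folder, cutoff):
    """Hypothetical stand-in for dataGen: handle paths for a single structure."""
    # A real worker would featurize the structure; this returns the target path
    return f"{out_folder}/{file}/{file}.features"


def process_all(files, out_folder, cutoff=50, num_processes=1):
    """Hypothetical stand-in for generateDatasetParallel."""
    # Bind the shared arguments so the pool only has to pass the file name
    worker = partial(process_one, out_folder=out_folder, cutoff=cutoff)
    with Pool(num_processes) as pool:
        return pool.map(worker, files)
```

Binding the shared arguments with functools.partial keeps the per-file worker picklable, which multiprocessing requires. On platforms that start workers with spawn, process_all should be called from under an if __name__ == "__main__": guard.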