Architecture Design Process

Basic structure

The file structure of the project as a whole was largely determined by the requirements for building Python packages using setuptools. All of the metadata needs to be in the base directory, with all of the code inside of folders for each package. Rather than listing each one in setup.cfg, we relied on setuptools's automatic package discovery, which means each folder used as a package needs an __init__.py file inside of it. We used a flat layout, since there was only one root package.

CLI files

To the furthest extent possible we tried to isolate the CLI-specific code from the code that actually does the work of the project, so it could be easily extended to support other input and output formats. setup.cfg lists the two entry points as compare and search, which go to the main functions in compare.py and search.py.

Inside of those files, argparse is used to process the input. To reduce any duplicated code, two parent parsers are definied in __init__.py, and the other parsers inherit them. One of them (dist_opt_parser) enables optional flags to specify which distance metric will be used, and the other (output_opt_parser) allows users to choose between different forms of output.

The current options allow output to be printed to stdout (default) or saved to a file, and files can either include all of the human-readable text or just the values in csv format. For files that already exist, users have the option to either overwrite that data or to append the new information to the end of it.

search.py also uses subparsers, so the first argument after search selects which parser to use. One of them does latent space searches, and inherits both of the parent parsers, and the other handles sequence searches and only inherits output_opt_parser. A third subparser was added to print the full list of protein family names, but it doesn't accept any arguments and functions more like a flag when used at the command line. Each subparser goes to a different function after it's used, so there's no need to guess which one the users used. Both latent space searches and sequence searches can accept any number of filenames, which it appends to a list stored in the argparse namespace.

compare.py only uses a single parser in the file, but it still inherits both of the parent parsers. Since users can give either a protein family name or a filename as input, it first tries treating it as a protein family name, but if there isn't a family with that name it's treated as a filename. It requires exactly two family / file names to be given as arguments, which it stores in a list in the argparse namespace. Since it requires two arguments, in order to list the known family names, users need to type list names, which it then checks for before calling the comparison function.

Classes

The classes were designed to function independently of anything related to argparse, with a series of helper functions bridging the gaps between the two sides. LSVectors store the name of a latent space and its data, and there are functions to load the data if it wasn't already given when the object was created. CompareLS and SearchLS only take LSVectors and scipy distance function objects as parameters, so it doesn't matter where the data inside of them came from.

The results returned from the comparisons and searches are in format-agnostic CompareLSOutput, SearchLSOutput, and SearchSQOutput objects, which have methods for different ways to output the results. output_results.py uses the argparse arguments to determine which one should be called, and since the method names are the same in all of the output classes it doesn't matter which one it's operating on.