Semantic protein indexing
The main idea of our approach is to learn a mapping of protein domain sequences into a vector space that captures their “semantic similarity”, i.e. closeness in the semantic space should reflect homology relationships between sequences.
In order to learn an embedding of protein sequences into a semantic space, we need to define (i) a feature representation for proteins, (ii) a training signal that determines whether a given pair of training sequences are similar and should be pushed together by the algorithm, or dissimilar and should be pulled apart, and (iii) an algorithm that learns an appropriate embedding.
Let us denote the set of proteins in the database as $\mathcal{D} = \{x_1, \ldots, x_m\} \subset \mathcal{X}$ and a query protein as $q \in \mathcal{X}$, where $\mathcal{X}$ is the set of all possible sequences of amino acids. We then choose a feature map $\Phi : \mathcal{X} \to \mathbb{R}^m$ to represent proteins as vectors. This map is necessary so that we can perform geometric operations on proteins. We use the following representation for a protein $x$:

$$\Phi(x) = \bigl(\phi_1(x), \ldots, \phi_m(x)\bigr),$$

where $\phi_i(x)$ is the E-value returned by a surrogate protein alignment algorithm, such as PSI-BLAST, suitably transformed. Following Rankprop, we use the following transformation:

$$\phi_i(x) = \frac{\exp\bigl(-E_i(x)/\sigma\bigr)}{\sum_{j=1}^{m} \exp\bigl(-E_j(x)/\sigma\bigr)},$$

where $E_i(x)$ is the PSI-BLAST E-value assigned to protein $x_i$ given query $x$, and where we set the parameter $\sigma = 100$. This transformation yields a stochastic connectivity matrix; i.e., the value $\phi_i(x)$ can be interpreted as the probability that a random walk on the protein similarity network will choose to move from protein $x$ to protein $x_i$. Note that, because most protein pairs exhibit no detectable similarity according to an algorithm such as PSI-BLAST, most feature values are zero. (Specifically, PSI-BLAST assigns a large maximal E-value to all database sequences for which no homology to the query is detected, and the exponential transfer function converts these values to zero.) The sparseness of the feature vectors will be important for computational reasons.
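To make the feature construction concrete, the following is a minimal Python sketch, not the authors' implementation; it assumes the PSI-BLAST hits for one query have already been parsed into a dictionary of E-values, and the names `evalues` and `feature_vector` are illustrative.

```python
import math

def feature_vector(evalues, sigma=100.0):
    """Transform PSI-BLAST E-values {db_index: E-value} for one query
    into a sparse feature vector {db_index: phi_i}.

    Proteins with no detectable homology are absent from `evalues`
    (their transformed values would underflow to zero anyway), so the
    vector stays sparse.
    """
    raw = {i: math.exp(-e / sigma) for i, e in evalues.items()}
    raw = {i: v for i, v in raw.items() if v > 0.0}  # drop underflowed hits
    total = sum(raw.values())
    # Normalize so entries act like random-walk transition probabilities.
    return {i: v / total for i, v in raw.items()} if total > 0.0 else {}
```

For example, `feature_vector({3: 1e-30, 17: 0.05})` yields non-zero weights only at database indices 3 and 17.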
Next, we again use a surrogate protein alignment algorithm, this time as a teacher, to provide a noisy training signal. We construct a training set of tuples $(q, p^+, p^-)$, where each tuple contains a query $q$, a related protein $p^+$ and an unrelated (or lower ranked) protein $p^-$. The tuples themselves are collected by running PSI-BLAST in an all-versus-all fashion over the database of proteins. Taking any given protein $q$ as the query, we consider any protein with an E-value lower than 0.1 to be a similar protein (an instance of $p^+$); in the current implementation, instances of $p^-$ are chosen randomly from all training examples and with high probability will be dissimilar to $q$. We can then, in principle, construct all possible combinations (tuples), from which we sample randomly during online training.
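As an illustration of the sampling scheme, the sketch below draws one tuple, assuming the all-versus-all PSI-BLAST results are available as a nested dictionary; `hits_by_query` and `sample_tuple` are illustrative names, and the 0.1 E-value cutoff follows the text.

```python
import random

def sample_tuple(hits_by_query, database_ids, evalue_cutoff=0.1):
    """Draw one training tuple (q, p_plus, p_minus).

    hits_by_query: {query_id: {hit_id: E-value}} from all-vs-all PSI-BLAST.
    p_plus is a hit with E-value below the cutoff; p_minus is drawn
    uniformly from the database, so it is dissimilar to q with high
    probability.
    """
    q = random.choice(list(hits_by_query))
    positives = [h for h, e in hits_by_query[q].items()
                 if e < evalue_cutoff and h != q]
    if not positives:
        return None  # caller should retry with another random query
    return q, random.choice(positives), random.choice(database_ids)
```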
Given the feature vectors and the training tuples, our aim is to learn a feature embedding that performs well for protein ranking and classification tasks. We will learn an embedding function of the form

$$f(x) = W\Phi(x),$$

where $W$ is an $n \times m$ matrix, resulting in an embedding $f(x) \in \mathbb{R}^n$, where $n$ is chosen to be low dimensional (e.g., $n \ll m$). The learning procedure consists of finding a matrix $W$ such that similar proteins have close proximity in the embedding space. Specifically, we would like to choose $W$ such that, for all tuples $(q, p^+, p^-)$, $p^+$ should be ranked higher than $p^-$, relative to an appropriate distance measure $D$ in the embedding space. We define this distance measure using the 1-norm (which is defined as $\|z\|_1 = \sum_i |z_i|$):

$$D\bigl(f(q), f(p)\bigr) = \bigl\|f(q) - f(p)\bigr\|_1.$$

After training, given a query protein $q$, we will rank the database using the ranking score

$$D\bigl(f(q), f(x)\bigr), \quad x \in \mathcal{D},$$

where we consider smaller values of $D$ to be more highly ranked.
The training objective employs the margin ranking loss, which has been used successfully in the field of information retrieval to rank documents given a query. That is, we minimize:

$$\sum_{(q, p^+, p^-)} \max\Bigl(0,\; 1 + D\bigl(f(q), f(p^+)\bigr) - D\bigl(f(q), f(p^-)\bigr)\Bigr). \tag{1}$$

This loss encourages $D(f(q), f(p^+))$ to be smaller than $D(f(q), f(p^-))$ until a margin constraint of 1 is satisfied. Intuitively, the algorithm tries to push $f(q)$ and $f(p^+)$ together while pulling $f(q)$ and $f(p^-)$ apart, until the difference in distances achieves a margin of 1. For an equivalent formulation, we can introduce a slack variable $\xi_{qp^+p^-}$ for each tuple $(q, p^+, p^-)$ and enforce the constraints

$$D\bigl(f(q), f(p^-)\bigr) - D\bigl(f(q), f(p^+)\bigr) \geq 1 - \xi_{qp^+p^-}, \qquad \xi_{qp^+p^-} \geq 0,$$

for all tuples while minimizing the objective function

$$\sum_{(q, p^+, p^-)} \xi_{qp^+p^-}.$$
This optimization problem is solved using stochastic gradient descent: iteratively, one picks a random tuple $(q, p^+, p^-)$ and, if the margin constraint is violated for that tuple, makes a gradient step as follows:

$$W \leftarrow W + \lambda\,\Bigl[\mathrm{sgn}\bigl(f(q) - f(p^-)\bigr)\bigl(\Phi(q) - \Phi(p^-)\bigr)^{\top} - \mathrm{sgn}\bigl(f(q) - f(p^+)\bigr)\bigl(\Phi(q) - \Phi(p^+)\bigr)^{\top}\Bigr],$$

where $\mathrm{sgn}(z)$ denotes that the sign function is applied componentwise to the vector $z$ to yield a vector of $\pm 1$ values, and $\lambda$ is the learning rate. Pseudocode for training the ProtEmbed embedding is given in Algorithm 1 in Text S1.
One can exploit the sparsity of $\Phi(x)$ when calculating these updates to make them computationally cheap. To train our model, we choose the (fixed) learning rate $\lambda$ that minimizes the training error, i.e. the loss defined by equation (1). We initialize the matrix $W$ randomly using a normal distribution with mean zero and standard deviation one. Overall, stochastic training is highly scalable and easy to implement for our model, and learning can scale to millions of proteins.
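To make the update rule concrete, here is a minimal NumPy sketch of one stochastic gradient step; it uses dense vectors for clarity, whereas an efficient implementation would exploit the sparsity of $\Phi$ as noted above, and all names are illustrative rather than taken from Algorithm 1.

```python
import numpy as np

def sgd_step(W, phi_q, phi_pos, phi_neg, lam=0.1, margin=1.0):
    """One stochastic gradient step of the margin ranking loss.

    W:      (n, m) embedding matrix, initialized from N(0, 1).
    phi_*:  (m,) feature vectors for the query, positive and negative.
    lam:    fixed learning rate (tuned to minimize the training loss).
    """
    f_q, f_pos, f_neg = W @ phi_q, W @ phi_pos, W @ phi_neg
    d_pos = np.abs(f_q - f_pos).sum()  # D(f(q), f(p+)), the 1-norm
    d_neg = np.abs(f_q - f_neg).sum()  # D(f(q), f(p-))
    if margin + d_pos - d_neg > 0:     # margin violated: take a step
        W += lam * (np.outer(np.sign(f_q - f_neg), phi_q - phi_neg)
                    - np.outer(np.sign(f_q - f_pos), phi_q - phi_pos))
    return W
```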
After training, we precompute the embedding $f(x) = W\Phi(x)$ for every protein $x$ in the database. At test time, given a query protein $q$, we compute its linear embedding $f(q)$ once. Then we are left with only $O(n)$ operations per protein in the database to perform when retrieving results for that query.
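Retrieval then reduces to 1-norm distance computations against the precomputed embeddings, as in the following sketch (illustrative names again):

```python
import numpy as np

def rank_database(W, phi_query, db_embeddings):
    """Rank database proteins for one query.

    db_embeddings: (N, n) array whose rows are the precomputed f(x).
    Returns database indices sorted by ascending distance (smaller
    score = higher rank), costing O(n) operations per database protein.
    """
    f_q = W @ phi_query                           # embed the query once
    scores = np.abs(db_embeddings - f_q).sum(axis=1)
    return np.argsort(scores)
```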
Adding information about protein structure
In general, recognizing remote homology relationships among protein structures is easier than recognizing remote homologies based only on protein sequences. Although structural information is available for only a subset of the proteins in the database, we would like to ensure that our embedding captures this structural information in addition to the sequence-based information provided by PSI-BLAST. We consider two sources of structural information: (1) category labels for a given protein and (2) similarity scores between pairs of proteins. For the category labels, we use the Structural Classification of Proteins (SCOP). For pairwise similarity scores, we use pairwise structure alignments of known 3D structures computed by MAMMOTH.
We incorporate this auxiliary information using the framework of multitask learning: in addition to the main embedding task, we simultaneously learn models to solve additional tasks using appropriate subsets of the training data. The tasks share internal representations learned by the algorithm, in this case the embedding function $f(x) = W\Phi(x)$. In particular, we pose an auxiliary classification task using SCOP categories, and we pose an auxiliary ranking task using either SCOP category relationships or MAMMOTH similarities. In all cases, the multitask objective function is simply the sum of the original ProtEmbed objective function and that of the auxiliary task. We consider these two task types in turn.
For auxiliary data in the form of a class label $y(x)$, we train an auxiliary classification task that is multitasked with the original ProtEmbed objective, sharing the same embedding space. For each fold and superfamily class $i$ we create a vector $c_i \in \mathbb{R}^n$; together, these vectors can be thought of as a set of class centroids. We then would like to satisfy the constraints:

$$D\bigl(f(p), c_i\bigr) < D\bigl(f(p'), c_i\bigr) \quad \text{for all } p \text{ in class } i \text{ and all } p' \text{ not in class } i.$$

That is, proteins belonging to some class should be closer to that class centroid than proteins that do not belong to that class. We train this model using the margin ranking loss as before, and multitask this problem with the original objective using the following updates, applied whenever the margin constraint is violated:

$$W \leftarrow W + \lambda\,\Bigl[\mathrm{sgn}\bigl(f(p') - Cy(p)\bigr)\Phi(p')^{\top} - \mathrm{sgn}\bigl(f(p) - Cy(p)\bigr)\Phi(p)^{\top}\Bigr],$$
$$C \leftarrow C + \lambda\,\Bigl[\mathrm{sgn}\bigl(f(p) - Cy(p)\bigr) - \mathrm{sgn}\bigl(f(p') - Cy(p)\bigr)\Bigr]\,y(p)^{\top},$$

where $C$ is a matrix containing the centroid vectors $c_i$ as columns, and $y(p)$ is the bit vector, of length equal to the total number of fold and superfamily classes, whose two non-zero entries are placed at the indices for the fold and superfamily of the labeled training example $p$ (with $p'$ a randomly chosen example from outside those classes). Pseudocode for training the ProtEmbed embedding with class-based auxiliary data is given in Algorithm 2 in Text S1.
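The sketch below shows one such multitasked update, under the assumption that the subgradients of the margin ranking loss are taken with respect to both $W$ and $C$ exactly as in the main task; the pairing of a labeled example `p` with a random out-of-class negative is illustrative.

```python
import numpy as np

def class_sgd_step(W, C, phi_p, y_p, phi_neg, lam=0.1, margin=1.0):
    """One step of the auxiliary classification task.

    C:       (n, c) matrix whose columns are the class centroids c_i.
    y_p:     (c,) bit vector with ones at p's fold and superfamily indices.
    phi_p:   feature vector of a labeled example; phi_neg, that of a
             random example outside p's classes.
    """
    cent = C @ y_p                  # sum of fold and superfamily centroids
    f_p, f_neg = W @ phi_p, W @ phi_neg
    if margin + np.abs(f_p - cent).sum() - np.abs(f_neg - cent).sum() > 0:
        s_pos, s_neg = np.sign(f_p - cent), np.sign(f_neg - cent)
        W += lam * (np.outer(s_neg, phi_neg) - np.outer(s_pos, phi_p))
        C += lam * np.outer(s_pos - s_neg, y_p)
    return W, C
```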
For auxiliary data in the form of similarity scores between pairs of proteins, we simply add more ranking constraints to the set of tuples. That is, we consider additional tuples of the form $(q, p^+, p^-)$, where $q$ and $p^+$ are similar SCOP proteins based on auxiliary data (i.e., a similarity score comparing these proteins is above a cutoff value), while $p^-$ is chosen at random from all of SCOP and with high probability will be structurally dissimilar to $q$. Then we require these additional tuples to satisfy constraints of the form

$$D\bigl(f(q), f(p^-)\bigr) - D\bigl(f(q), f(p^+)\bigr) \geq 1 - \xi_{qp^+p^-},$$

analogous to the constraints in the main optimization problem. Two examples of the use of such auxiliary constraints are given by SCOP superfamily labels and by MAMMOTH. For SCOP labels, if two proteins are in the same superfamily, we say they are similar. For MAMMOTH, we choose a cutoff value of 2.0, and a pair of proteins whose structural alignment scores above this cutoff is deemed similar. Pseudocode for training the ProtEmbed embedding with ranking-based auxiliary data is given in Algorithm 3 in Text S1.
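A possible way to assemble these additional tuples from structure alignment scores is sketched below; `mammoth_scores` and the generator itself are illustrative, with the 2.0 cutoff taken from the text.

```python
import random

def structural_tuples(mammoth_scores, scop_ids, cutoff=2.0):
    """Yield auxiliary tuples (q, p_plus, p_minus) from pairwise scores.

    mammoth_scores: {(id_a, id_b): score} for aligned SCOP structure pairs.
    Pairs scoring above the cutoff are deemed similar; each negative is
    drawn at random from all of SCOP.
    """
    for (a, b), score in mammoth_scores.items():
        if score > cutoff:
            yield a, b, random.choice(scop_ids)
            yield b, a, random.choice(scop_ids)
```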
For labeled data (namely, proteins with structural category labels and 3D structures from which to compute pairwise similarity scores), we used proteins from the SCOP v1.59 protein database. We used ASTRAL to filter these sequences so that no two sequences share greater than 95% identity. This filtering resulted in 7329 sequences. Our test set consists of 97 proteins selected at random from these SCOP sequences. These test sequences were excluded entirely from the training data.
For unlabeled data, i.e. protein domain sequences without category labels or structural information, we used sequences from the ADDA domain database, version 4. This database contains 3,854,803 single-domain sequences. We removed from the database sequences composed entirely of the ambiguity code "X," sequences shorter than 6 amino acids, and sequences longer than 10,000 amino acids. We then randomly selected sequences from the remaining sequences until we had picked 3% of the original sequences. This left us with an unlabeled single-domain database of 115,644 sequences.
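The filtering and sampling step is straightforward; a sketch is given below, where the use of Biopython's `Bio.SeqIO` and the FASTA file name are assumptions.

```python
import random
from Bio import SeqIO  # assumes the ADDA sequences are in FASTA format

def filter_and_sample(fasta_path, fraction=0.03, seed=0):
    """Apply the ambiguity and length filters, then sample a fraction
    of the original number of sequences from what remains."""
    records = list(SeqIO.parse(fasta_path, "fasta"))
    kept = []
    for rec in records:
        seq = str(rec.seq).upper()
        if set(seq) == {"X"}:             # composed entirely of "X"
            continue
        if not 6 <= len(seq) <= 10000:    # length limits from the text
            continue
        kept.append(rec)
    random.seed(seed)
    return random.sample(kept, int(fraction * len(records)))
```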
We ran PSI-BLAST version 2.2.8 on the combined SCOP+ADDA database using the default parameters, allowing a maximum of 6 iterations. For a second and more powerful pairwise sequence similarity method, based on HMM-HMM comparisons, we also ran HHSearch version 1.5.0 using default parameters. HHPred/HHSearch is considered a leading method for remote homology detection. When searching for homologs of the test set domains, we added the HHSearch options "-realign -mact 0," which perform a local Viterbi search followed by maximum accuracy (MAC) realignment of the proteins globally on a local posterior probability matrix. Similarly, MAMMOTH was run with its default settings.
We first trained embeddings on SCOP+ADDA (with SCOP test sequences held out), using PSI-BLAST or HHSearch as the pairwise sequence comparison method to serve as "teacher" for producing $(q, p^+, p^-)$ tuples. In this setting, we did not make use of the category labels or structural information for the SCOP training examples. We then trained embeddings using ADDA as unlabeled data and SCOP as labeled data, where the labeled data was used in (i) an auxiliary classification task based on SCOP category labels or (ii) an auxiliary ranking task based either on SCOP category relationships or on MAMMOTH similarity scores.