The goal of the Mosaic Vaccine Designer is to use natural sequences to generate a small number of "mosaic" sequences that include maximal diversity of potential T-cell epitopes from the natural sequences. The resulting mosaic proteins in the proposed vaccine cocktail resemble real proteins from the input set of natural viral proteins (the 'training set'), but are assembled from fragments of the natural proteins using a genetic algorithm (a computational optimization method). This method was first applied to HIV, but can be readily applied to other variable pathogens.
A mosaic protein or peptide is an artificial recombinant protein designed from a set of reference protein sequences so that every constituent peptide (k-mer) is found some place in the set of input proteins. (For common usage, k-mer lengths correspond to T-cell epitope lengths.) Mosaics differ from other artificial recombinants, such as "peptide beads-on-a-string" or consensus sequences, because for the chosen value of k amino acids, there are no non-natural k-mers. Values of k between 9 and 12 are typically chosen because that is the size of epitopes recognized by cytotoxic T-cells.
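The definition above can be checked mechanically. The following is a minimal sketch, not the tool's own code: the function names and the toy sequences are made up for illustration. It lists the k-mers of a candidate and tests whether each one occurs somewhere in the input set.

```python
def kmers(seq, k):
    """Return every overlapping k-mer (substring of length k) in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_mosaic(candidate, natural_seqs, k):
    """True if every k-mer of candidate occurs somewhere in the natural set."""
    allowed = {km for seq in natural_seqs for km in kmers(seq, k)}
    return all(km in allowed for km in kmers(candidate, k))


# Toy sequences, invented for illustration (not real viral proteins).
naturals = ["ACDEFGHIK", "LCDEYGHIR"]

# A recombinant of the two inputs: every 3-mer is natural.
print(is_mosaic("ACDEYGHIR", naturals, 3))  # True

# A single substitution creates 3-mers such as "CDW" that no input contains.
print(is_mosaic("ACDWFGHIK", naturals, 3))  # False
```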
Mosaic proteins are generally vaccine candidates. Shorter mosaic peptides can be used to assay T-cell specificity in ELISpot assays. Mosaic proteins will give equal or better coverage of potential epitopes (k-mers) than is possible from the same number of natural proteins or from a consensus sequence. Mosaics will generally have k-mer coverage superior to consensus sequences, and similar to that of peptide beads-on-a-string. However, because mosaics have only natural k-mers in natural local contexts, intracellular processing of mosaics for MHC presentation is likely to more closely resemble processing of viral proteins in natural infection.
The tool uses a genetic algorithm to optimize a vaccine cocktail, which contains one sequence from each of several populations of artificial recombinants. Each population is taken in turn for optimization, in which new recombinants are generated and tested to see if they would improve the cocktail. Recombinants that have any non-natural k-mer (one that does not exist in any of the input sequences) are rejected from the population, and a replacement is tried until a valid recombinant is generated. The cocktail is scored based on its coverage of k-mers in the natural sequence input. There is a best sequence from each population in the cocktail at all times. If a new recombinant proves superior to its population's representative in the cocktail, the new sequence will replace the old one. Populations are optimized in turn until a stopping criterion is reached (typically, lack of improvement for a certain number of optimization cycles). For a more detailed explanation of the overall method, see Fig. 2 in Fischer et al. 2007.
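The loop described above can be sketched in a few lines. This is a deliberately simplified illustration, not the published algorithm: it assumes equal-length toy sequences so that a single-point crossover is trivial, it weights coverage by k-mer occurrence count, and every name and default value (`optimize`, `stall_limit`, `pop_size`, and so on) is hypothetical.

```python
import random
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(cocktail, kmer_counts, k):
    """Fraction of natural k-mer occurrences matched by any cocktail member."""
    covered = {km for seq in cocktail for km in kmers(seq, k)}
    hit = sum(n for km, n in kmer_counts.items() if km in covered)
    return hit / sum(kmer_counts.values())


def recombine(a, b, rng):
    """Single-point crossover (assumes a and b have equal length)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def optimize(naturals, k=3, cocktail_size=2, pop_size=10, stall_limit=20, seed=1):
    rng = random.Random(seed)
    kmer_counts = Counter(km for s in naturals for km in kmers(s, k))
    # One population per cocktail slot, initialized from natural sequences.
    pops = [[rng.choice(naturals) for _ in range(pop_size)]
            for _ in range(cocktail_size)]
    cocktail = [pop[0] for pop in pops]
    best = coverage(cocktail, kmer_counts, k)
    stalled = 0
    while stalled < stall_limit:
        improved = False
        for i, pop in enumerate(pops):
            # Reject children with any non-natural k-mer and try again.
            while True:
                child = recombine(rng.choice(pop), rng.choice(pop + naturals), rng)
                if all(km in kmer_counts for km in kmers(child, k)):
                    break
            pop[rng.randrange(pop_size)] = child
            # Keep the child only if it improves the whole cocktail's score.
            trial = cocktail[:i] + [child] + cocktail[i + 1:]
            score = coverage(trial, kmer_counts, k)
            if score > best:
                cocktail, best, improved = trial, score, True
        stalled = 0 if improved else stalled + 1
    return cocktail, best


# Toy input, invented for illustration.
naturals = ["ACDEFGHIK", "LCDEYGHIR", "ACDEYGHIK", "LCDEFGHIR"]
cocktail, score = optimize(naturals)
print(cocktail, round(score, 3))
```

The sketch keeps the three features the description emphasizes: one population per cocktail member, rejection of any recombinant containing a non-natural k-mer, and replacement of a cocktail member only when the cocktail score improves.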
Start with any collection of natural sequences from a protein of interest. The number of sequences used may be anywhere from dozens to thousands. Using a larger number of sequences is generally better, but the ideal number depends on the diversity among them: the more diverse the input sequences, the more of them are needed. It is important to avoid over-representation of a group of closely related sequences.
To run the Mosaic tool, the input proteins do not need to be aligned. Results from Mosaic are alignment-independent. However, subsequent analysis with Posicover will require that the sequences be aligned. Mosaic accepts most common sequence formats.
Larger sets of proteins will take longer to find an optimal set of mosaic protein(s). Sets of proteins that are biased toward a particular viral strain or subtype will produce mosaics that are likewise biased toward that particular strain. Such biases are sometimes unavoidable, and there are several ways to manage these biases, as discussed in the options below.
"Create mosaic sequence cocktail"
This function creates mosaic(s). It runs the genetic algorithm to generate a cocktail of synthetic peptides with best possible coverage.
"Pick the best natural sequences"
This function selects the unmodified input sequence(s) with best coverage from among the input sequences. The number of sequences selected is determined by "Cocktail Size".
"See the coverage distribution of natural sequences"
This function examines the coverage provided by each unmodified input sequence and gives the coverage score of each sequence.
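To make the two natural-sequence functions concrete, here is a small sketch under stated assumptions: coverage is taken to be the fraction of k-mer occurrences in the input that a sequence (or set of sequences) contains, and the best set is found by exhaustive search, which is only feasible for toy inputs. The function names are hypothetical and the tool's own scoring and search may differ.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(seqs, kmer_counts, k):
    """Fraction of natural k-mer occurrences found in any of seqs."""
    covered = {km for s in seqs for km in kmers(s, k)}
    hit = sum(n for km, n in kmer_counts.items() if km in covered)
    return hit / sum(kmer_counts.values())


def per_sequence_coverage(naturals, k):
    """Coverage score of each distinct input sequence taken alone."""
    counts = Counter(km for s in naturals for km in kmers(s, k))
    return {s: coverage([s], counts, k) for s in naturals}


def best_natural_set(naturals, k, cocktail_size):
    """Exhaustively pick the cocktail_size inputs with the highest joint coverage."""
    counts = Counter(km for s in naturals for km in kmers(s, k))
    return max(combinations(naturals, cocktail_size),
               key=lambda combo: coverage(combo, counts, k))


# Toy input, invented for illustration: "AAAB" is over-represented.
naturals = ["AAAB", "AAAB", "AAAC"]
print(per_sequence_coverage(naturals, 3))
print(best_natural_set(naturals, 3, 1))
print(best_natural_set(naturals, 3, 2))
```

With this toy input the duplicated sequence scores higher on its own, which also illustrates how over-represented inputs bias the result.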
The Cocktail Size is the number of sequences (full-length protein antigens) desired for the vaccine sequence cocktail. When the cocktail size is 1, the tool will derive a single mosaic protein. When the cocktail size is n, the tool will give n proteins. A larger number of proteins will give better overall coverage, but a smaller number of proteins is more practical for production of an actual vaccine. The number of input sequences should be much larger than the requested cocktail size.
Choose the length of the k-mers being selected. In general, 9 works well. Lengths greater than 9 reduce the possible number of non-natural epitopes (e.g., if 10 is chosen, all 8-mers and 9-mers will, of necessity, be natural, while if 8 is chosen, it is possible to have non-natural 9-mers and 10-mers). However, larger values of k leave fewer possible recombination sites, and thus reduce the number of possible optimal solutions.
On the basis of empirical evidence with HIV, HCV, and Filoviruses, decreasing k in the Mosaic tool only slightly increases the number of non-natural peptides when k-mers with lengths >k are considered. Thus there is usually little benefit from running Mosaic with k>9.
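The relationship between k and longer non-natural peptides can be seen on a toy example. The sketch below uses invented sequences and a hypothetical helper, `non_natural`, that lists the j-mers of a candidate that are absent from the input set. A candidate that is valid for a given k is automatically valid for all shorter lengths, but may contain non-natural peptides at longer lengths.

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_natural(candidate, naturals, j):
    """List the j-mers of candidate that occur in no natural sequence."""
    natural = {km for s in naturals for km in kmers(s, j)}
    return [km for km in kmers(candidate, j) if km not in natural]


# Toy sequences, invented for illustration.
naturals = ["ACDEFGHIK", "LCDEYGHIR"]
candidate = "ACDEYGHIR"  # a recombinant of the two inputs

for j in range(3, 7):
    print(j, non_natural(candidate, naturals, j))
# 3 []
# 4 []
# 5 ['ACDEY']
# 6 ['ACDEYG']
```

Here the candidate would be accepted as a mosaic for k = 3 or k = 4, yet it carries one non-natural 5-mer and one non-natural 6-mer spanning the recombination point.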
Please note: It is important to choose the Rare Threshold carefully. If it is too high, the job will stall. If it is too low, the resulting mosaics will be sub-optimal. The appropriate value depends on the number of sequences in the input and the diversity of the protein.
To obtain a vaccine that has the fewest possible infrequently occurring epitopes, increase the Rare Threshold. The genetic algorithm will prohibit any k-mer with an occurrence count below the threshold from occurring in the recombinant populations, and hence in the cocktail. In addition, only k-mers that occur at least as many times as the Rare Threshold will be scored by the algorithm. For example, when the Rare Threshold is 3, the algorithm requires at least 3 copies of a given k-mer in the input for that k-mer to be considered in optimization or included in the cocktail.
If the Rare Threshold is 1, every k-mer that exists in the input set counts in the score. If the Rare Threshold is set too high, the algorithm cannot work because it will be impossible for it to generate recombinant sequences without rare k-mers. To some extent, your choice of Rare Threshold will depend on the total number of input sequences; larger sets will tolerate a higher threshold. For small or diverse data sets, the Rare Threshold may need to be set to 1 to avoid having the algorithm "stall".
To minimize stalling problems for novice users, the default value is 1. To achieve optimal mosaics, you may need to increase it. As a rule of thumb, you should increase it if your input set is large (>~50 sequences) and from a conserved protein. A typical value for large, conserved sets would be 3.
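One way to get a feel for a candidate Rare Threshold before submitting a job is to count k-mers in your own input. The sketch below is illustrative only: it assumes the threshold is applied to the total number of occurrences of each k-mer across the input, and the function names are hypothetical. If no input sequence is free of rare k-mers at a given threshold, a stall is likely.

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allowed_kmers(naturals, k, rare_threshold):
    """k-mers that occur at least rare_threshold times in the input."""
    counts = Counter(km for s in naturals for km in kmers(s, k))
    return {km for km, n in counts.items() if n >= rare_threshold}


def sequences_without_rare(naturals, k, rare_threshold):
    """Input sequences made up entirely of allowed (non-rare) k-mers."""
    allowed = allowed_kmers(naturals, k, rare_threshold)
    return [s for s in naturals if all(km in allowed for km in kmers(s, k))]


# Toy input, invented for illustration: three identical sequences and one variant.
naturals = ["ACDEF", "ACDEF", "ACDEF", "ACDEY"]

for threshold in (1, 3, 4):
    usable = sequences_without_rare(naturals, 3, threshold)
    print(threshold, sorted(allowed_kmers(naturals, 3, threshold)), len(usable))
```

In this toy set a threshold of 3 excludes only the variant's unique k-mer, while a threshold of 4 leaves no input sequence usable at all, which is the situation the stall warning describes.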
It is not necessary to provide a Fixed Sequence in order to run the tool. Providing a fixed sequence, however, is one way that you can address the problem of biases in the input set. Fixed sequences can be used to better optimize the mosaic proteins in a cocktail.
For example, suppose you had only 10 sequences available for clade X, but 1000 sequences available for clade Y and 1000 for clade Z. It would be beneficial to run Mosaic to give one protein (cocktail size=1) for clade X alone, and then use the resulting mosaic protein as a fixed sequence in a second Mosaic run that included all 2010 sequences.
This method provides an alternative to optimizing each individual clade separately. Depending on details of the data, such as the number of sequences available from each clade and the phylogenetic distances between the clades, using a fixed sequence may reach a better optimum.
This option limits the total run time of the optimizer. If you choose "Allow to run only once", the tool will only go through a single iteration of the algorithm, rather than re-initializing the optimizer pools while retaining the best mosaic proteins from the previous iteration. Typically, this should be used only for testing: a single cycle may not generate near-optimal mosaics, and the automatic stopping criterion is fairly effective. The following examples show how the rate of cocktail improvement slows as the runtime increases.
The population size is the number of sequences in the optimization pools. For small input sets, population size could be a small multiple of the number of input sequences. For large input sets, population size may be smaller than the number of input sequences (see examples above). Highly variable input sequence sets may benefit from a larger population size, but excessive population size will slow optimization. We have typically used population sizes in the 50-500 range.
New sequences are generated and evaluated as part of the population optimization process. The cycle is the number of sequential attempts made to generate improved sequences in each population before moving to the next population. For a more detailed explanation, see Figure 2C of Fischer et al. 2007.
The Stall Factor controls how long the program will continue to try to optimize after it reaches a stall.
New sequences are generated during population optimization by recombination between two parent sequences. One parent is drawn from the recombinant population; the other is chosen from either the recombinant population or the original set of natural sequences. The internal crossover probability is the probability that the second parent will be chosen from the recombinant population. It is not clear what effect this has on the final results.
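The parent-selection rule can be written as a short sketch. The function and parameter names below are hypothetical, and uniform random choice within each pool is an assumption; the tool may select parents differently.

```python
import random


def pick_parents(population, naturals, internal_crossover_prob, rng):
    """Choose two parents for one recombination event.

    The first parent always comes from the recombinant population. The second
    comes from the same population with probability internal_crossover_prob,
    and from the natural input sequences otherwise.
    """
    first = rng.choice(population)
    if rng.random() < internal_crossover_prob:
        second = rng.choice(population)
    else:
        second = rng.choice(naturals)
    return first, second


# Toy pools, invented for illustration.
rng = random.Random(0)
population = ["ACDEYGHIR", "LCDEFGHIK"]
naturals = ["ACDEFGHIK", "LCDEYGHIR"]
print(pick_parents(population, naturals, 0.5, rng))
```

Setting the probability to 1 makes every crossover internal to the population, while 0 always recombines with a natural sequence.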
If defined, this will limit the number of optimization rounds that will be applied to each population. The default value of 0 means iterate until the Stall Factor is reached.
When Random Seed is set to zero (default), the seed is selected at random. Thus the results of each run may differ slightly. In the Mosaic results, the seed will be reported. Entering this seed number in a subsequent run will allow you to repeat the exact same run.
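The seed behavior follows a common pattern, sketched here with Python's standard `random` module. This illustrates the idea only; the helper name `make_rng` and the range of generated seeds are assumptions, not details of the tool.

```python
import random


def make_rng(seed=0):
    """Return (seed_used, generator); a seed of 0 means pick one at random."""
    if seed == 0:
        seed = random.SystemRandom().randrange(1, 2**31)
    return seed, random.Random(seed)


# First run: no seed given, so one is chosen and reported.
seed_used, rng = make_rng(0)
first_run = [rng.random() for _ in range(3)]

# Second run: entering the reported seed reproduces the same random draws.
_, rng_again = make_rng(seed_used)
second_run = [rng_again.random() for _ in range(3)]

print(seed_used, first_run == second_run)
```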
After obtaining results from the Mosaic tool, you will have two options to assess the quality of your putative vaccine: Epicover and Posicover. Ideally, use both assessment tools. If you ran the Mosaic tool several times to test various input options, the Epicover and Posicover tools will show graphically which options gave better results.
Epicover (Epitope Coverage Assessment Tool) will give an overall view of how well your mosaic peptide(s) cover the naturally-occurring k-mers in your set of proteins.
Posicover (Positional Epitope Coverage Assessment Tool) will show, position-by-position, how well your mosaic peptide(s) cover the k-mers in your natural set of proteins. This will allow you to see if there are regions of the protein that are not well covered, or to compare the coverage for various clades.