HIV sequence database

Epigraph Explanation


This tool generates Epigraph sequences based on an input population of diverse sequences. Epigraph designs can be used for vaccine or reagent design.

Like Mosaic proteins [Fischer et al., Nat Med. 2007 13(1):100-6], Epigraphs are artificial proteins that combine to maximize coverage of the potential T cell epitopes (PTEs) in a diverse population of protein sequences. The basic concept for Mosaic and Epigraph is similar, but Epigraph can reach a solution faster. Because of its speed, Epigraph readily allows additional explorations, such as optimization on imperfect matches.

The basic input of the Epigraph design program is a diverse set of sequences that is representative of a viral population of interest. The output is a user-specified number of artificial but intact sequences that, for that number of sequences, provide optimal epitope coverage.

Getting started: try the Sample Input. If you need help selecting your own set of sequences, or selecting the best parameters to use, please contact us.




Sample input

The sample input set made available for exploring these tools includes 189 HIV-1 B clade Gag p24 sequences; p24 is one of the most conserved regions of the HIV proteome. The set represents sequences sampled in the USA between 2006 and 2011.

Algorithm: Aligned vs. Unaligned

The default for generating Epigraphs is to optimize PTE coverage based on unaligned (but related) protein sequences that are representative of infections in the vaccine target population. The Epigraphs reconstructed from unaligned proteins will still resemble natural proteins, and can be readily aligned to the natural proteins in the input set provided. They are reconstructed by tiling together overlapping PTEs, while ensuring that protein assembly continues from the beginning to the end of the protein.

If an aligned set is provided as input and the “Unaligned” option is selected, the Epigraph run will begin by automatically stripping gaps. If the aligned option is selected, the positions in the alignment will be factored in, the graph will be acyclic, and the resulting Epigraphs will often provide slightly less PTE coverage than the unaligned optimal coverage scores. The use of aligned proteins enables the advanced option of optimizing on inexact matches (e.g. 8/9 or 7/9); the unaligned version requires optimization on perfect matches (e.g. 9/9).

Epitope length

The underlying premise for this tool is that all 9mers in a protein set are “potential T cell epitopes” (PTEs). The default length is 9 as this is the most common length of HLA class I presented optimal CD8+ T cell epitopes. Epitopes range between 8-12 amino acids in length, and optimizing on 9-mers still provides excellent coverage of other epitope lengths (Theiler et al. 2016). Nine amino acids is also the core length of class II epitopes, although they tend to be longer overall.
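
To make the premise concrete, here is a minimal sketch of how the PTEs of a single sequence can be enumerated with a sliding window. The function name `extract_ptes` and the toy fragment are illustrative assumptions, not part of the Epigraph code.

```python
def extract_ptes(sequence, k=9):
    """Return the set of distinct k-mers (PTEs) in one protein sequence."""
    seq = sequence.replace("-", "")  # strip alignment gaps, if any
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical 13-residue fragment, used only for illustration.
ptes = extract_ptes("PIVQNLQGQMVHQ")
print(len(ptes))
print(sorted(ptes))
```

A sequence of length L contains at most L - k + 1 distinct PTEs; a sequence shorter than k contains none.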

# of seqs in vaccine pool

The number of sequences is the number of output Epigraphs. If one generates a single epigraph, it will be similar to a consensus, but slightly improved from an immunological perspective, as local co-variation patterns are preserved. For a simplistic example of this, consider two variable positions within a 9-mer in an alignment. If the most common amino acid in one of them is found 50% of the time, and the most common amino acid in the other is found 40% of the time, but they are mutually exclusive, an epigraph protein will not combine the two, while a consensus sequence would, creating a 9-mer that did not exist in nature.
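
The co-variation example above can be reproduced with a toy data set. This is a sketch under stated assumptions: the ten 9-mers are invented, and the "epigraph-like" choice is simply the most frequent natural 9-mer, which is what maximizing coverage reduces to when there is only one 9-mer window.

```python
from collections import Counter

# Ten hypothetical 9-mers from one alignment window. Position 3 is K in 50%
# of them, position 7 is R in 40%, and K and R never occur together.
window = (
    ["GPKEPFSDY"] * 3 + ["GPKEPFTDY"] * 2   # K at position 3, no R at position 7
    + ["GPQEPFRDY"] * 2 + ["GPEEPFRDY"] * 2  # R at position 7, no K at position 3
    + ["GPQEPFTDY"]                          # neither
)

def consensus(aligned):
    """Most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

cons = consensus(window)
most_common_natural = Counter(window).most_common(1)[0][0]

# Print each candidate and whether it occurs in the input set.
print(cons, cons in window)
print(most_common_natural, most_common_natural in window)
```

The consensus joins K and R into a 9-mer that matches no input sequence, while the most frequent natural 9-mer matches three of the ten.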

More commonly, 2 or 3 epigraph solutions are used, as these are practical numbers for vaccine design. The epigraphs will be complementary to each other, as they are designed to be used in combination to maximize PTE coverage of a population.

As the number of epigraphs is increased, population coverage of diversity is increased, but with diminishing returns, as increasingly rare epitopes are brought into the proteins.

Pad sequences

Your sequences must all be the same length to use Epigraph. If you have some sequences that end early, they will be filled out to identical lengths with dashes (-).
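
A minimal sketch of the padding step, assuming sequences are held as plain strings; the helper name `pad_sequences` is hypothetical and is not the tool's own routine.

```python
def pad_sequences(seqs, pad_char="-"):
    """Right-pad every sequence with dashes to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, pad_char) for s in seqs]

padded = pad_sequences(["MGARASVL", "MGARAS", "MGAR"])
print(padded)
```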

# of trials

For rapid exploratory runs, the number of trials should be set to 1. Such a run may ultimately be the best run; it will start with the best single sequence, and build on it. For final and more in-depth searching, try setting the number of trials to 10 or higher (up to thousands, depending on your patience), and iterative refinements to 10. If you are creating a pair of Epigraphs, the coverage of the single Epigraph may diminish if by doing so the coverage of the pair can be enhanced. If you are producing 2 or more Epigraphs, your first Epigraph may not be as good as with the single trial, but the combination should improve.

For polyvalent vaccines, one can sometimes find antigens with better coverage by doing multiple trials with random initial sequences, followed by iterative refinement. The first trial uses the “episensus” (i.e., the best monovalent epigraph solution) as the first sequence. If there are more trials, they are initialized with random initial sequences. A consequence of this is that, when multiple trials are conducted, the sequence with the best coverage may not be the first one listed in the output.

Random seed

Some aspects of the epigraph algorithm involve random choices; this means that multiple runs with the same input data can produce (usually very slightly) different results. If a random seed is specified, then two runs with the same random seed and the same input data should produce the same output.

Iterative refinement step

For polyvalent vaccines, new antigens are added sequentially, by optimizing complementary coverage (i.e., coverage of epitopes that are not already covered by the other antigens that are already in the vaccine). After reaching the desired number of antigens in a vaccine design, further improvement in coverage is often achieved by deleting one of the antigens, and replacing it with the antigen that optimizes complementary coverage with respect to the other antigens. This can be done for multiple antigens, though there is usually little or no gain after the number of refinement steps reaches a few times the number of antigens in the vaccine.

A value of 0 is fastest: the first antigen is the best single antigen, the second is the best complementary antigen, and so on. A value of 10 runs the iterative refinement until convergence or until 10 iterations are reached; it almost always converges within 10 iterations.
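
The sequential addition and refinement loop can be sketched as follows. This is a deliberately simplified illustration, not the Epigraph algorithm: it picks antigens from a fixed list of candidate sequences instead of building new sequences as paths through the epitope graph, and all names (`design`, `coverage`, `n_refine`) are assumptions.

```python
from collections import Counter

def kmers(seq, k=9):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def coverage(vaccine, counts, k=9):
    """Sum of population counts over the distinct PTEs present in the vaccine."""
    covered = set().union(*(kmers(v, k) for v in vaccine))
    return sum(counts[e] for e in covered)

def design(population, candidates, n_antigens, n_refine=10, k=9):
    # Each PTE is counted once per population sequence that contains it.
    counts = Counter(e for s in population for e in kmers(s, k))
    vaccine = []
    for _ in range(n_antigens):  # sequential addition by complementary coverage
        best = max(candidates, key=lambda c: coverage(vaccine + [c], counts, k))
        vaccine.append(best)
    for _ in range(n_refine):    # refinement: delete one antigen, re-optimize it
        changed = False
        for i in range(len(vaccine)):
            others = vaccine[:i] + vaccine[i + 1:]
            best = max(candidates, key=lambda c: coverage(others + [c], counts, k))
            if coverage(others + [best], counts, k) > coverage(vaccine, counts, k):
                vaccine[i] = best
                changed = True
        if not changed:          # converged before the iteration limit
            break
    return vaccine, coverage(vaccine, counts, k)

# Toy population with 3-mers standing in for 9-mers.
population = ["ACDEF", "ACDEF", "KLMNP", "ACDNP"]
vaccine, score = design(population, population, n_antigens=2, k=3)
print(vaccine, score)
```

Setting `n_refine=0` in this sketch corresponds to the fastest setting described above, where each antigen is simply the best complement to those chosen before it.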

Vaccine seq names

The automatic epigraph names look like this: EG-0, EG-1... If you are testing many parameters, it can be helpful to use more informative names.

Exclude Rare epitopes

The rare epitope threshold is the minimum number of times an epitope must exist in the input set for inclusion in the Epigraph. It is desirable to exclude rare epitopes that might be type-specific. If the rare epitope number is set to 0, then rare 9-mers that occur only once in the input sequence set may be present in the final Epigraphs. Such a 9-mer may be a sequencing artifact, or contain a lethal mutation. Thus it is advisable to incorporate only 9-mers that are repeated at least a few times in the data. So, unless the input sequence set is small or has been carefully selected to contain only "good" epitopes, we recommend the default value of 2. If you are creating epigraphs for highly variable proteins that span regions where 9-mers are never repeated between sequences (e.g., HIV-1 Env hypervariable regions, or the mucin-like domain in the filovirus GP protein), it may be necessary to use a value of 0 to completely span the protein.

When using epigraph on a small number of input sequences, an epitope that appears only once or twice will not necessarily be very rare, and a smaller threshold may be preferred; even a value of zero (i.e., un-check the Exclude Rare Epitopes option) would be reasonable in this case. Since epigraph runs are much faster when the number of sequences is small, you may want to try runs with several different values of the threshold, and based on the coverage percentage and on your own judgment of the output epigraph sequences, select the run that is most appropriate for your purposes.
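
As a rough illustration of what the threshold does to the set of usable epitopes, the sketch below counts each PTE once per sequence and keeps those seen at least `min_count` times. Whether the tool applies the threshold as "at least" or "more than" is an assumption here, as are the function names.

```python
from collections import Counter

def pte_counts(seqs, k=9):
    """Number of sequences in which each PTE appears (once per sequence)."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def exclude_rare(counts, min_count=2):
    """Keep only PTEs found in at least min_count sequences (assumed convention)."""
    return {e: c for e, c in counts.items() if c >= min_count}

# Toy input with 3-mers standing in for 9-mers.
counts = pte_counts(["ACDEF", "ACDEF", "KLMNP", "ACDNP"], k=3)
for threshold in (0, 2, 3):
    print(threshold, len(exclude_rare(counts, threshold)))
```

Trying several thresholds in this way shows how quickly the pool of candidate epitopes shrinks on a small input set.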


The tolerance is the number of amino acids that can be un-matched while still counting as an imperfect match. If you select “Aligned sequences”, you can reset the “tolerance” to optimize on the coverage of imperfect matches (e.g. 8/9 or 7/9) rather than perfect matches, which is the default setting. It can be useful to explore these options in a situation where the input sequences are very diverse and few 9-mers are ever perfectly matched.
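
A small sketch of what an imperfect match means: two 9-mers match within a tolerance of t if they differ at no more than t positions. The helper names are hypothetical, and this sketch ignores alignment position, which the aligned algorithm factors in.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length epitopes differ."""
    if len(a) != len(b):
        raise ValueError("epitopes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def is_covered(pte, vaccine_ptes, tolerance=0):
    """True if some vaccine PTE matches pte with at most `tolerance` mismatches."""
    return any(mismatches(pte, v) <= tolerance for v in vaccine_ptes)

vaccine_ptes = {"SLYNTVATL"}
print(is_covered("SLFNTVATL", vaccine_ptes, tolerance=0))  # requires a 9/9 match
print(is_covered("SLFNTVATL", vaccine_ptes, tolerance=1))  # accepts an 8/9 match
```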

Exact match bonus

If you decide to optimize on imperfect matches (8/9), the resulting epigraphs can have a dramatic drop in exact matches if this value is set to 0; this is because exact matches are ignored during the optimization of the imperfect match coverage score. The exact match bonus option can be used to tune and balance optimal coverage of exact and inexact matches; if you’re willing to take a small loss of exact PTE coverage to enhance 8/9 coverage, a good starting value for this parameter is 0.1. As larger values are selected, perfect matches are weighted increasingly heavily.

Fixed initial vaccine sequences

If you want to use a particular sequence, and have Epigraph build a complementary set around it, you can fix that sequence here.

Suggested initial vaccine sequences

This input sequence allows a user to input a sequence that will be improved upon in iterative cycles. An example of when you might use this is if you wished to iterate between Epigraphs and mosaic designs.

Evaluate antigen coverage

This provides coverage statistics of the input data by the Epigraph vaccine that is created in the design run. See the stand-alone tool description below for details, and use the stand-alone tool to compare this vaccine to other vaccine options or other populations of sequences.

Coverage Distribution of Natural Sequences

This graphic provides context to see how the Epigraph design compares to combinations of natural sequences in terms of PTE coverage of the input population. See stand-alone tool below.

Compute coverage of a range of frequencies of the rarest epitope

This provides coverage results for your Epigraph solution over a range of rare-epitope thresholds. This setting lets you explore the coverage cost of excluding rare epitopes, and will provide context for selecting the number to fill in to the rare epitope exclusion box. For the example graph below (2 Epigraphs designed on a global full-proteome Hep B set), 70 is a good value.

Example: Two epigraphs on the Global Hep B. A cut off of 70 works well.

Antigen Coverage Evaluation

This tool, Eval_coverage, builds on our earlier Epicover tool, adding several new options. These two antigen coverage tools provide slightly different views of average PTE coverage across proteins in an alignment, but are very similar, and do not require that proteins be aligned for analysis. Our mosaic Posicover provides alternative views, considering PTEs position by position moving across a protein alignment (first considering the PTE at positions 1-9, then 2-10, then 3-11...), and may also be useful to consider.

Antigen Coverage Evaluation allows a user to look at the PTE coverage for a particular vaccine antigen set over all proteins in the set, or over subgroups of proteins. HIV sequences in the database are typically named starting with dot-separated fields (e.g. B.US.2007 would be a B clade sequence isolated in the USA in 2007). A user can readily divide coverage into groups of sequences delineated by fields, to determine whether coverage patterns differ in different clades, different countries, or over time.
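
Grouping by name fields can be sketched as below, assuming names whose leading fields are separated by dots as in the B.US.2007 example. The trailing identifiers in the sample names and the helper `group_by_field` are hypothetical.

```python
from collections import defaultdict

def group_by_field(names, field_index, sep="."):
    """Group sequence names by one dot-separated field (0 = clade, 1 = country, 2 = year)."""
    groups = defaultdict(list)
    for name in names:
        fields = name.split(sep)
        key = fields[field_index] if field_index < len(fields) else "unknown"
        groups[key].append(name)
    return dict(groups)

names = ["B.US.2007.seqA", "B.US.2010.seqB", "C.ZA.2007.seqC"]
print(group_by_field(names, 0))  # by clade
print(group_by_field(names, 2))  # by sampling year
```

Coverage can then be computed separately for the sequences in each group.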

The Evaluation tool will show the coverage of your vaccine. This tool is similar to the Epicover tool; the differences are explained in detail in Epigraph Evaluation vs. Epicover.

# of seqs to read from antigens/vaccine

If you have, say, a pair of Epigraphs or several sequences in a file, and just want to look at the coverage of the first sequence in the set, choose 1.

Counts or fractions

Epigraph antigen coverage provides two alternative views. One can calculate and show graphically the count of matched epitopes, or the fraction. The fraction of matched epitopes is equal to the count of all PTEs in a population that are contained in the Epigraph vaccine set (either perfectly matched, or off-by-one or off-by-two, etc.) divided by the total number of PTEs in the population being evaluated.

# of bootstraps

Bootstraps can be performed to determine how stable the values of epitope coverage are. 100 or 1000 bootstraps would be reasonable values. The sequences in the test population are resampled with replacement the specified number of times, and the coverage is re-evaluated for each newly resampled population. The standard deviation of the resulting distribution of coverage values is calculated and provided along with the observed coverage.
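
The resampling procedure can be sketched as follows. This is an illustration under assumptions, not the tool's code: coverage is computed on perfect matches only, and the function names and the fixed `seed` argument are invented for the example.

```python
import random
import statistics

def kmers(seq, k=9):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def fractional_coverage(population, vaccine_ptes, k=9):
    """Fraction of population PTEs (counted once per sequence) found in the vaccine."""
    covered = total = 0
    for s in population:
        ptes = kmers(s, k)
        total += len(ptes)
        covered += len(ptes & vaccine_ptes)
    return covered / total if total else 0.0

def bootstrap_coverage(population, vaccine_ptes, n_boot=100, k=9, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = fractional_coverage(population, vaccine_ptes, k)
    samples = []
    for _ in range(n_boot):
        # Resample sequences with replacement, keeping the population size.
        resampled = rng.choices(population, k=len(population))
        samples.append(fractional_coverage(resampled, vaccine_ptes, k))
    return observed, statistics.stdev(samples)

# Toy population with 3-mers standing in for 9-mers.
population = ["ACDEF", "ACDEF", "KLMNP"]
observed, sd = bootstrap_coverage(population, kmers("ACDEF", 3), n_boot=100, k=3)
print(observed, sd)
```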

Bad epitopes in target sequences

The user can decide whether or not to count "bad" PTEs; these are PTEs that include problem characters, such as: premature stop codons (* or $), frameshifts (#), or undetermined amino acids (X).

More precisely, counts and fractions are calculated in the following way:

Let us write h(n,e) as equal to 1, if PTE e appears in the n'th sequence of the target population, and it is zero otherwise.

If a PTE appears more than once we still set h(n,e)=1. Further, define c(e) as the "count" of PTE e; it is the integer number of sequences from the target population in which the PTE appears.

In particular,

c(e) = Sum_n h(n,e)

For an Epigraph vaccine, let V be the set of distinct PTEs that appear in at least one of the antigen sequences.

If a PTE appears in more than one vaccine antigen sequence, or if it appears more than once in a given antigen because it is a direct repeat, it still is included only once in the vaccine set V.

Then, C(V), the total number of PTEs in a population that are covered by the vaccine V, written as an integer, is given by:

C(V) = Sum_{e in V} c(e) = Sum_{e in V} Sum_n h(n,e)

To obtain fractional coverage, the integer coverage is divided by a denominator that can be defined in one of two ways:

  1. D = Sum_{all valid e} c(e), if PTEs containing problematic characters are excluded (default)
  2. D = Sum_{all e} c(e), if all PTEs are included.

The fractional coverage for a vaccine V is given by C(V)/D.
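
The definitions of c(e), C(V) and D translate directly into a few lines of code. The sketch below assumes perfect matches only and uses the problem characters listed above; the function name `eval_coverage_sketch` is hypothetical and this is not the Eval_coverage implementation.

```python
from collections import Counter

BAD_CHARS = set("*$#X")  # stop codons, frameshifts, undetermined amino acids

def kmers(seq, k=9):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def eval_coverage_sketch(population, vaccine, k=9, include_bad=False):
    # c(e): number of population sequences containing PTE e. h(n,e) is 0 or 1,
    # so each sequence contributes a PTE at most once.
    c = Counter()
    for seq in population:
        c.update(kmers(seq, k))
    valid = {e for e in c if not (set(e) & BAD_CHARS)}
    # V: distinct PTEs in the vaccine; "bad" PTEs never count in the numerator.
    V = set().union(*(kmers(a, k) for a in vaccine)) & valid
    C = sum(c[e] for e in V)
    D = sum(c.values()) if include_bad else sum(c[e] for e in valid)
    return C, (C / D if D else 0.0)

# Toy population with 3-mers standing in for 9-mers; one sequence contains an X.
population = ["ACDEF", "ACXEF", "ACDEF"]
print(eval_coverage_sketch(population, ["ACDEF"], k=3))
print(eval_coverage_sketch(population, ["ACDEF"], k=3, include_bad=True))
```

Including the "bad" PTEs enlarges only the denominator, so the fraction can only stay the same or drop.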

Our older Epicover tool has fewer options, and its coverage is calculated slightly differently. It produces only a fractional coverage, which is similar, but not identical, to the fractional estimate obtained here if one includes “bad” epitopes. The old Epicover defines the fractional coverage of the n'th sequence as:

C_n = [ Sum_{e in V} h(n,e) / Sum_{all e} h(n,e) ]

Note that the denominator is defined to be the number of distinct epitopes in the n'th sequence. Epicover then defines the overall coverage to be average of this value over all sequences:

C'(V) = (1/N) Sum_n C_n

If all sequences are the same length and therefore have the same number of distinct PTEs, then the two fractions, C(V)/D and C'(V), are identical. In practice, they are very nearly so, but slightly different values are obtained from the two codes.
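
The difference between the two averages is easiest to see on a toy case. The sketch below implements both fractions for perfect matches with all PTEs included in the denominator; the function names are illustrative assumptions and the sequences are invented.

```python
def kmers(seq, k=9):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pooled_fraction(population, vaccine, k=9):
    """C(V)/D: every PTE occurrence in the population gets equal weight."""
    V = set().union(*(kmers(a, k) for a in vaccine))
    covered = sum(len(kmers(s, k) & V) for s in population)
    total = sum(len(kmers(s, k)) for s in population)
    return covered / total

def per_sequence_fraction(population, vaccine, k=9):
    """C'(V): average of per-sequence fractions, so every sequence gets equal weight."""
    V = set().union(*(kmers(a, k) for a in vaccine))
    fractions = [len(kmers(s, k) & V) / len(kmers(s, k))
                 for s in population if len(s) >= k]  # skip sequences with no PTEs
    return sum(fractions) / len(fractions)

# 3-mers stand in for 9-mers.
same_length = ["ACDEF", "KLMNP"]
mixed_length = ["ACDEF", "KLMNPQR"]
for pop in (same_length, mixed_length):
    print(pooled_fraction(pop, ["ACDEF"], k=3), per_sequence_fraction(pop, ["ACDEF"], k=3))
```

With equal-length sequences the two values agree; when one sequence carries more distinct PTEs than another, the pooled fraction weights it more heavily.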

To summarize the differences between the old and new code:

  1. Epicover gives sequences equal weight, Eval_coverage gives PTEs equal weight. In practice, this distinction is minor.
  2. Epicover uses all epitopes (good and bad) in the denominator. In its default mode Eval_coverage includes only valid epitopes in the denominator. In this mode, Eval_coverage tends to give higher values for fractional coverage.
  3. Neither code ever uses “bad” epitopes in the numerator. When Eval_coverage is set to include bad epitopes, it will, like Epicover, use all epitopes in the denominator, and the two tools will report very similar scores.

Report on subsets of target population sequences

Group sequences by: there are 3 ways to specify the groupings.

Coverage Distribution of Natural Sequences

We had included this tool in the mosaic design package, along with the option to select the combinations of natural sequences from a population that provide the best epitope coverage of that population using natural protein vaccine antigens. It attempts to address what you might expect in terms of cross-reactivity of immune responses if you used natural strains for vaccines because of convenience or availability, rather than rationally selecting or designing them. As a part of that tool, we developed a graphic output that showed the coverage of 100, or 1000, randomly selected sets of sequences from among the natural strains. This tool is included in the Epigraph suite, even though it just reproduces the graphic in the context of the epigraph code, because it provides an interesting comparison to Epigraph output.

Exclude Rare Epitopes

This tool allows you to exclude rare epitopes from the Epigraph solution, and to set the threshold for defining "rare". If such rare PTEs are immunogenic and are included in a vaccine, they are likely to elicit highly type-specific immune responses. Every natural HIV protein, even a conserved protein like Gag, contains many rare PTEs, some of which are completely unique in the entire database (on average, there are 18 such unique PTEs in every natural HIV Gag protein, and 130 in every natural Env protein). We hypothesize, but have not shown, that such atypical PTEs are more likely to be problematic in the context of other HIV proteins than commonly observed PTEs, which are clearly tolerated in many contexts, and a priori it may be advantageous to exclude them to the extent possible. (Regions like the hypervariable regions of HIV-1 Env are distinct in every person, and in such cases rare PTEs must be included to make an intact epigraph or mosaic spanning these regions.)

There can be a coverage cost when rare epitopes are excluded. This is explored graphically so the user can make an informed decision when weighing the coverage cost against the minimum number of times a PTE must be observed in the study sequences to be included in the epigraphs.

If rare epitopes are excluded, even just 2, the number of nodes in the graph can be greatly reduced, and the simplified Epigraph runs faster. This can be useful when exploring the impact of other options.

The mosaic tool allowed a user to exclude up to 3 rare epitopes, but through this new tool we have learned that often a much higher threshold for exclusion can be used with very minimal costs to coverage.

Coverage threshold

If you specify a coverage threshold ("cut"), the tool will give you the vaccine with the largest rare counts (nodelowcounts, i.e., f_o) value that has a coverage within "cut" of the coverage obtained when f_o=0. If you don't specify cut, then cut=0 by default.
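
The selection rule can be sketched as below, assuming the coverage has already been computed for each candidate f_o value. The dictionary of coverages is invented for illustration, the units of "cut" are assumed to match those of the coverage values, and `pick_threshold` is a hypothetical name.

```python
def pick_threshold(coverage_by_fo, cut=0.0):
    """Largest f_o whose coverage is within `cut` of the coverage at f_o = 0."""
    baseline = coverage_by_fo[0]
    acceptable = [fo for fo, cov in coverage_by_fo.items() if cov >= baseline - cut]
    return max(acceptable)

# Hypothetical coverage fractions for a range of rare-epitope thresholds.
coverage_by_fo = {0: 0.800, 2: 0.800, 10: 0.796, 70: 0.792, 200: 0.700}
print(pick_threshold(coverage_by_fo))            # cut = 0: no loss of coverage allowed
print(pick_threshold(coverage_by_fo, cut=0.01))  # allow a loss of up to 0.01
```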

Characterize PTEs

This tool summarizes the frequencies of all PTEs in the input population, the essential information that informs the graph. A table is produced for an unaligned input set of protein sequences that includes each unique PTE in the set, and a tally of the number of sequences in which it was found. A given PTE may be found more than once in a sequence, but it is only counted once. A histogram is generated. White vertical bars indicate the number of PTEs that appear in n sequences, as a function of n. The rarest PTEs, which are found only once, are plotted on the left (n=1), while the most conserved, which are present in every sequence, are on the right (n=N, the total number of sequences). Blue bars are a cumulative histogram: the number of PTEs that appear in n or more sequences. Note that the vertical axis is log-scale.
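
The two histograms can be computed as in the following sketch; it produces the bar heights only, without plotting, and the function name `pte_histogram` is an assumption.

```python
from collections import Counter

def pte_histogram(seqs, k=9):
    """Return (exact, cumulative) lists indexed by n - 1 for n = 1..N sequences."""
    counts = Counter()
    for s in seqs:
        # A PTE repeated within one sequence is counted once for that sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    by_n = Counter(counts.values())            # by_n[n] = PTEs found in exactly n sequences
    exact = [by_n.get(n, 0) for n in range(1, len(seqs) + 1)]
    cumulative, running = [], 0
    for value in reversed(exact):              # accumulate from n = N down to n = 1
        running += value
        cumulative.append(running)
    cumulative.reverse()
    return exact, cumulative

# Toy input with 3-mers standing in for 9-mers.
exact, cumulative = pte_histogram(["ACDEF", "ACDEF", "KLMNP", "ACDNP"], k=3)
print(exact)       # white bars: PTEs in exactly n sequences
print(cumulative)  # blue bars: PTEs in n or more sequences
```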

A second table can be generated for an aligned set. Each PTE is grouped according to its starting position in the alignment; counts are summarized in that context. The most common PTE starting at each position heads the set, and three columns are provided: the position number, the unique PTE sequence and the number of times it was observed in the data.

Design: Tailored Therapeutic Vaccine


This tool was generated to answer a very specific question. If a therapeutic T-cell vaccine were going to be delivered to HIV infected individuals (say a conserved vaccine or a vaccine delivered in a CMV vector), and HIV sequences could be made available from the people who were to be treated, could that information be effectively utilized? It is not feasible to make a vaccine to match every individual’s virus in a large sequence group. But it may be feasible to manufacture a set of 6 to 10 vaccine antigens, and then choose the 2 or 3 among them that best match an individual’s infecting strain. There are two central factors one might consider: the number of PTEs in the individual that are matched by the vaccine, and the number of PTEs that are in the vaccine but not in the individual and that might trigger diversionary responses.

This design tool will optimize the Epigraph design of a set of vaccine antigens for manufacture, given the constraints of the problem.

We found that the best solutions resulted when the user starts with a fixed single Epigraph sequence that is optimized to cover the target population, so we provide an opportunity for a fixed user sequence, and recommend that this be a single Epigraph.

K-means iterations

K-means is an iterative algorithm, and this option specifies how many iterations are run. The default value is 10, and we don’t expect significant improvement for larger values.



Over the history of the clustering algorithm runs, the run with the minimum score is preserved.

Antigen sequences

The output contains the antigens designed for manufacture, and a summary table. Each antigen is the Epigraph “centroid” of one of a user-specified number of clusters of sequences, which are grouped according to PTE similarities.

Summary Table

To assess the potential for coverage of natural sequences in the sample population, each sequence is singled out and treated as a test case. If 6 antigens are intended for manufacture, and the 2 among these 6 that best cover the subject’s population of PTEs will be selected for use as a therapeutic vaccine, we can then ask what fraction of the subject’s PTEs is covered by the vaccine, and how many extra epitopes are “wasted” in the vaccine because they are not detected in the subject. We do not yet understand how to weigh these two factors, but both may be important for success. We calculate these two numbers for each sequence in the input population, after determining the two best among the manufactured antigens for treating that subject. For both coverage and extras, the average and standard deviation over the population are given.
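
The per-subject calculation can be sketched as follows. This is an illustration under assumptions: antigens are chosen by highest coverage, with fewer extras used only to break ties (the text above leaves the weighting open), every subject sequence is assumed to be at least k residues long, and all names are hypothetical.

```python
from itertools import combinations
import statistics

def kmers(seq, k=9):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_subset(subject_seq, antigens, n_select=2, k=9):
    """Pick the n_select antigens that best cover one subject's PTEs."""
    subject = kmers(subject_seq, k)
    best = None
    for combo in combinations(range(len(antigens)), n_select):
        vaccine = set().union(*(kmers(antigens[i], k) for i in combo))
        cov = len(subject & vaccine) / len(subject)  # fraction of subject PTEs covered
        extras = len(vaccine - subject)              # vaccine PTEs absent from the subject
        if best is None or (cov, -extras) > (best[1], -best[2]):
            best = (combo, cov, extras)
    return best

def summary_table(population, antigens, n_select=2, k=9):
    rows = [best_subset(s, antigens, n_select, k) for s in population]
    covs = [r[1] for r in rows]
    extras = [r[2] for r in rows]
    return {
        "coverage_mean": statistics.mean(covs), "coverage_sd": statistics.stdev(covs),
        "extras_mean": statistics.mean(extras), "extras_sd": statistics.stdev(extras),
    }

# Toy antigens and subjects, with 3-mers standing in for 9-mers.
antigens = ["ACDEF", "KLMNP", "ACDNP"]
print(best_subset("ACDEF", antigens, n_select=2, k=3))
print(summary_table(["ACDEF", "KLMNP"], antigens, n_select=2, k=3))
```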

Evaluation: Tailored Therapeutic Vaccine


James Theiler, Hyejin Yoon, Karina Yusim, Louis J. Picker, Klaus Frueh, and Bette Korber. Epigraph: A Vaccine Design Tool Applied to an HIV Therapeutic Vaccine and a Pan-Filovirus Vaccine. Sci Rep. 2016 Oct 5;6:33987. doi: 10.1038/srep33987. PMID: 27703185.

last modified: Tue Feb 1 16:10 2022


Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
