HIV Sequence Database



Epigraph Evaluation vs. Epicover

Some (too detailed?) remarks on eval_coverage.py versus epicover.pl

Let h(n,e) = 1 if epitope e appears in the n'th sequence of the
target population, and h(n,e) = 0 otherwise.  If an epitope appears
more than once in a sequence, we still set h(n,e) = 1.

Further, define f(e) as the "frequency" of epitope e; it is the
integer number of sequences from the target population in which the
epitope appears. In particular,

  f(e) = Sum_n h(n,e)
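
As a minimal sketch of these two definitions, the Python below counts
f(e) over a toy population.  It assumes that an "epitope" is simply a
length-k substring of a sequence; the function names, the value of k,
and the toy sequences are all hypothetical and are not taken from
eval_coverage.py.

```python
from collections import Counter

def epitopes(seq, k):
    """Distinct length-k substrings of seq (assumed stand-in for epitopes)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def frequencies(population, k):
    """f(e): number of population sequences that contain epitope e."""
    f = Counter()
    for seq in population:
        # epitopes() returns a set, so an epitope repeated within one
        # sequence is counted once; this is the h(n,e) = 1 convention.
        f.update(epitopes(seq, k))
    return f

population = ["ABCDE", "ABCDX", "BCDEF"]  # toy sequences, not real data
f = frequencies(population, k=3)
print(f["BCD"], f["ABC"], f["CDX"])  # 3 2 1
```

Because each sequence contributes a set, f(e) can never exceed the
number of sequences in the population.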

For a vaccine, let V be the set of distinct epitopes that appear in at
least one of the antigen sequences.  If an epitope appears in more
than one antigen sequence, or if it appears more than once in a given
antigen, it still is included only once in the vaccine set V.

Then the coverage, written as an integer, for vaccine V, is given by

  C(V) = Sum_{e in V} f(e) = Sum_{e in V} Sum_n h(n,e)

For eval_coverage.py with the "--usecounts=True" option, this is the
coverage that is produced.
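
The integer coverage can be sketched in a few lines of Python.  As an
assumption for illustration, epitopes are taken to be length-k
substrings, and the names below (kmers, integer_coverage) are
hypothetical; this is not the code inside eval_coverage.py.

```python
def kmers(seq, k):
    """Distinct length-k substrings of seq (assumed stand-in for epitopes)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def integer_coverage(antigens, population, k):
    """C(V) = Sum_{e in V} Sum_n h(n,e) for the vaccine set V."""
    vaccine = set()
    for antigen in antigens:
        # Set union: an epitope shared by several antigens enters V once.
        vaccine |= kmers(antigen, k)
    # For each sequence n, count the distinct epitopes that are also in V.
    return sum(len(kmers(seq, k) & vaccine) for seq in population)

population = ["ABCDE", "ABCDX", "BCDEF"]  # toy data
antigens = ["ABCD", "BCDE"]               # V = {ABC, BCD, CDE}
coverage = integer_coverage(antigens, population, k=3)
print(coverage)  # 7
```

Summing over sequences first, as here, gives the same total as summing
f(e) over the epitopes in V, since both are the same double sum.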

To obtain fractional coverage, the integer coverage is divided by a
denominator which can be defined in one of two ways:

   D = Sum_{all valid e} f(e)   if "--usebadepi=False" (default)
   D = Sum_{all e} f(e)         if "--usebadepi=True" 

A bad, or non-valid, epitope is, for instance, one that includes "$",
"#", or "X" in its string.

The fractional coverage for a vaccine V is given by C(V)/D.
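
The two denominators can be compared with a short Python sketch.  It
assumes that epitopes are length-k substrings and that "$", "#", and
"X" are the only characters that mark an epitope as bad (the text
gives them only as examples).  The function names and the use_bad_epi
parameter are hypothetical stand-ins for the --usebadepi option, not
the actual eval_coverage.py interface.

```python
BAD_CHARS = set("$#X")  # assumed to be the complete list of bad characters

def kmers(seq, k):
    """Distinct length-k substrings of seq (assumed stand-in for epitopes)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_valid(epitope):
    """True if the epitope contains none of the bad characters."""
    return not (BAD_CHARS & set(epitope))

def fractional_coverage(antigens, population, k, use_bad_epi=False):
    """C(V)/D, with D over valid epitopes (default) or over all epitopes."""
    vaccine = set()
    for antigen in antigens:
        # Bad epitopes are kept out of V, so they never reach the numerator.
        vaccine |= {e for e in kmers(antigen, k) if is_valid(e)}
    numerator = 0
    denominator = 0
    for seq in population:
        eps = kmers(seq, k)
        numerator += len(eps & vaccine)
        if use_bad_epi:
            denominator += len(eps)
        else:
            denominator += sum(1 for e in eps if is_valid(e))
    return numerator / denominator  # assumes the population has epitopes

population = ["ABCDE", "ABCDX", "BCDEF"]  # toy data; "CDX" is a bad epitope
antigens = ["ABCD", "BCDE"]
default_mode = fractional_coverage(antigens, population, k=3)
with_bad = fractional_coverage(antigens, population, k=3, use_bad_epi=True)
print(default_mode, with_bad)  # 0.875 (7/8) and about 0.778 (7/9)
```

Counting bad epitopes in D can only enlarge the denominator, so the
default mode never reports a lower fraction than the other mode.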

epicover.pl produces only a fractional coverage, and it is a value
that is usually very close to what is produced by eval_coverage.py in
its --usebadepi=True mode.  But it is slightly different.  

epicover.pl defines 

     C_n = [ Sum_{e in V} h(n,e) / Sum_{all e} h(n,e) ]

as the fractional coverage of the n'th sequence.  Note that the
denominator depends on (in fact, is defined to be) the number of
distinct epitopes in the n'th sequence.

epicover.pl then defines the overall coverage to be the average of
this value over all N sequences in the target population; i.e.:
     
  C'(V) = (1/N) Sum_n C_n

If all sequences had the same number of distinct epitopes, then the
two fractions, C(V)/D and C'(V), would [in the --usebadepi=True case,
where D counts all epitopes] be identical. In practice, they are very
nearly so.
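
The effect of the two weightings can be checked on toy data.  The
sketch below assumes epitopes are length-k substrings and uses
sequences with no bad epitopes, so the choice of denominator does not
matter here.  The function names are hypothetical; neither function is
taken from epicover.pl or eval_coverage.py.

```python
import math

def kmers(seq, k):
    """Distinct length-k substrings of seq (assumed stand-in for epitopes)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pooled_fraction(vaccine, population, k):
    """C(V)/D: pool the epitope hits over all sequences, then divide."""
    hits = sum(len(kmers(s, k) & vaccine) for s in population)
    total = sum(len(kmers(s, k)) for s in population)
    return hits / total

def per_sequence_average(vaccine, population, k):
    """C'(V): average the per-sequence fractions C_n over the sequences."""
    # Assumes every sequence has at least one epitope (length >= k).
    fractions = [len(kmers(s, k) & vaccine) / len(kmers(s, k))
                 for s in population]
    return sum(fractions) / len(fractions)

vaccine = {"ABC", "BCD", "CDE"}            # toy vaccine set V
equal = ["ABCDE", "ABCDF", "BCDEF"]        # 3 distinct epitopes each
unequal = ["ABCDE", "BCDEFGHI"]            # 3 and 6 distinct epitopes

print(pooled_fraction(vaccine, equal, 3),
      per_sequence_average(vaccine, equal, 3))    # both about 0.778
print(pooled_fraction(vaccine, unequal, 3),
      per_sequence_average(vaccine, unequal, 3))  # about 0.556 vs 0.667
```

In the second population, the short sequence is fully covered and
counts for half of the per-sequence average, but it contributes only
three of the nine epitopes to the pooled fraction.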

To summarize:

1. epicover.pl gives sequences equal weight; eval_coverage.py gives
epitopes equal weight. In practice, this distinction is minor.

2. epicover.pl uses all epitopes (good and bad) in the denominator; in
its default mode (with option --usebadepi=False), eval_coverage.py
uses only valid epitopes in the denominator. In this mode,
eval_coverage.py tends to give higher values for fractional coverage.

3. But with --usebadepi=True, eval_coverage.py will, like epicover.pl,
use all epitopes in the denominator, and will report very similar
scores.  (Neither ever uses bad epitopes in the numerator.)

last modified: Tue Mar 1 14:18 2016

