HIV Databases
HIV sequence database

Shannon Entropy Readme File


Shannon entropy is a simple quantitative measure of uncertainty in a data set. One qualitative way to think about it in terms of sequences is this: if a sample set is drawn from a large population, the Shannon entropy reflects how hard it would be to guess which amino acids will appear in the next sequence you take from the population, based on your previous sampling.

Imagine, for example, that you were interested in a particular position where mutations can confer drug resistance. Knowing the frequencies of the different amino acids at that position in resistant and susceptible populations would enable you to calculate the Shannon entropies, which reflect how well you could guess the amino acid at that position in an unknown sample drawn from each population. You might be able to narrow down or define drug resistance sites in complex genomes by finding positions in proteins that are "certain" in drug-susceptible populations (low entropy) but uncertain in drug-resistant populations (significantly higher entropy). Even if the consensus amino acid were the same in both sets, sites could be identified that tend to vary more in resistant viruses.

When this uncertainty measure is used to quantify sequence variability in a column of a sequence alignment, it incorporates both the frequencies of the symbols (for example, a column that is 50% A and 50% T has a higher Shannon entropy than a column that is 90% A and 10% T) and the number of possibilities (a column that is 90% A, 5% T and 5% G has a higher Shannon entropy than a column that is 90% A and 10% T). An invariant column has a Shannon entropy of zero. The maximum Shannon entropy depends on the number of discrete values the variable can take; for example, for DNA the possible values are A, C, G, and T, and the entropy is at its maximum when they are present at equal frequencies, 25% each.
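The comparisons above can be checked with a few lines of Python. This is a minimal sketch, not the code behind the web tool: the function name `column_entropy` is hypothetical, and it assumes every character in the column (including any gap character) counts as an ordinary symbol.

```python
import math
from collections import Counter

def column_entropy(column, base=math.e):
    """Shannon entropy of one alignment column, given as a string of symbols.

    Assumption: every character is treated as a symbol; the readme does not
    say how the real tool handles gaps or ambiguity codes.
    """
    total = len(column)
    return sum(-(n / total) * math.log(n / total, base)
               for n in Counter(column).values())

# The columns discussed in the text, built as 100-sequence toy columns.
half_half = column_entropy("A" * 50 + "T" * 50)
ninety_ten = column_entropy("A" * 90 + "T" * 10)
three_way = column_entropy("A" * 90 + "T" * 5 + "G" * 5)
invariant = column_entropy("A" * 100)
uniform_dna = column_entropy("ACGT" * 25)

for name, value in [("50/50", half_half), ("90/10", ninety_ten),
                    ("90/5/5", three_way), ("invariant", invariant),
                    ("uniform DNA", uniform_dna)]:
    print(f"{name}: {value:.3f}")
```

The default base is e to match the natural logarithm used by the tool; passing `base=2` gives the same ranking of columns on a scale of bits.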

In this simple application, we consider each column in a sequence alignment independently, and we use the Shannon entropy to assign each column a score that reflects its variability. We do not take into consideration the phylogenetic history of the sequences, nor do we look for patterns of co-variation. This application is potentially useful when you are simply trying to assess the diversity in a population in a cross-sectional sense; for biological applications where you need to understand the selective pressure on a site, a phylogenetic method that compares synonymous and non-synonymous substitution rates would be preferred.
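Scoring each column independently amounts to transposing the alignment and applying the entropy to every column in turn. The sketch below assumes the alignment is a plain list of equal-length strings; the names `alignment_entropies` and `toy_alignment` are illustrative, not part of the tool.

```python
import math
from collections import Counter

def column_entropy(column):
    """Natural-log Shannon entropy of one column (any sequence of symbols)."""
    total = len(column)
    return sum(-(n / total) * math.log(n / total)
               for n in Counter(column).values())

def alignment_entropies(sequences):
    """One entropy score per alignment column, each column scored on its own."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) yields the columns of the alignment one at a time.
    return [column_entropy(column) for column in zip(*sequences)]

# Toy alignment: columns 1 and 2 are invariant, columns 3 and 4 are split 2/2.
toy_alignment = ["ACGT", "ACGA", "ACTA", "ACTT"]
scores = alignment_entropies(toy_alignment)
print([round(score, 3) for score in scores])
```

Because no column's score depends on any other column, nothing here captures co-variation or phylogenetic history, which is the limitation the paragraph above describes.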


There are two applications we have used in our HIV studies. The first is a comparison of alignments representing two kinds of data. For example, we have compared blood-derived HIV Envelope sequences to brain-derived sequences, and found evidence for sites that were more variable in blood than in brain [4]. We have also found that different HIV clades have different levels of variability at different sites [3]. To obtain statistical confidence, we use the Monte Carlo randomization strategy described below.

The second application has been to compare the variability of sequence positions with immunologically important regions. Here the Shannon entropy of each position is calculated and compared with some other biological property that has been characterized for that position. For example, we have shown that the number of distinct, known, cytotoxic T-lymphocyte (CTL) epitopes that span a position inversely correlates with the variability of that position [6]. In this class of application, you have one alignment you are studying, and you compare its entropy scores with another score of biological interest, in our case CTL epitope density.

In a variation of this application, we have used average entropy scores calculated for each overlapping peptide used for scanning proteins for T-cell reactivity, to provide a score for each peptide tested. A typical set of peptides would, say, be 15 amino acids long, and immunological reactivity would be assessed by moving along through the protein with each consecutive peptide overlapping by 11 amino acids. Consistent with the findings of Yusim et al. [6], we also found that the average entropy of a peptide is inversely correlated with how many people are capable of making an immune response that recognizes that peptide [2].

We have a tool on our website to help design such T-cell peptide reagents (PeptGen), and our entropy tool can be used in conjunction with PeptGen to assign an average entropy score to each peptide during the process of designing a panel of reagents.
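The peptide scoring described above can be sketched as a sliding-window average over per-position entropies. The function below is an illustration only: `peptide_average_entropies` is a hypothetical name, start positions are 0-based, and it simply stops when a full-length peptide no longer fits, which may differ from how PeptGen handles the end of a protein.

```python
def peptide_average_entropies(entropies, length=15, overlap=11):
    """Average per-position entropy for each overlapping peptide.

    Consecutive peptides overlap by `overlap` positions, so each new
    peptide starts `length - overlap` positions after the previous one.
    Returns a list of (start_index, average_entropy) pairs.
    """
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the peptide length")
    scores = []
    for start in range(0, len(entropies) - length + 1, step):
        window = entropies[start:start + length]
        scores.append((start, sum(window) / length))
    return scores

# Toy per-position entropies: a conserved stretch followed by a variable one.
toy_entropies = [0.0] * 15 + [1.0] * 8
for start, average in peptide_average_entropies(toy_entropies):
    print(f"peptide starting at index {start}: mean entropy {average:.3f}")
```

With 15-mers overlapping by 11, each peptide starts four positions after the previous one, so a single variable position contributes to the score of up to four consecutive peptides.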


This section briefly defines Shannon entropy. A more in-depth description can be found in many texts and web sites (for example, [5], and

Let $ X$ be a discrete random variable (bases if you are considering nucleotides, amino acids if you are considering proteins), taking a finite number of possible values $ x_1, x_2, ..., x_n$ with probabilities $ p_1, p_2, ..., p_n$ such that $ p_i \ge 0$ for $ i = 1, 2, ..., n$ and $ \sum_{i=1}^{n}p_i = 1$.
The Shannon entropy is
$ H_n(p_1,p_2, ...,p_n) = - \sum_{i=1}^{n}p_i\log_{b}p_i$
where the base $ b$ of the logarithm is typically 2, Euler's number $ e$, or 10, and terms with $ p_i = 0$ are taken to contribute zero. Our code uses the natural logarithm, $ \log_e$.
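
As a quick worked check of the definition (a sketch using the natural logarithm, with rounded values), take a DNA column with all four bases at equal frequency, $ p_i = 1/4$:
$ H_4 = -\sum_{i=1}^{4}\frac{1}{4}\ln\frac{1}{4} = \ln 4 \approx 1.386$
This is the maximum for a four-letter alphabet. In general the maximum for $ n$ equally frequent values is $ \ln n$, which is about $ 2.996$ for the 20 amino acids. For the two-letter columns discussed earlier,
$ H_2(0.5, 0.5) = \ln 2 \approx 0.693, \quad H_2(0.9, 0.1) = -(0.9\ln 0.9 + 0.1\ln 0.1) \approx 0.325$
Changing the base only rescales the score: since $ \log_b p = \ln p / \ln b$, dividing a natural-log entropy by $ \ln 2$ expresses it in bits (for example, $ \ln 4 / \ln 2 = 2$).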

Statistical Confidence Through Randomization

We use a Monte Carlo randomization strategy to determine whether there is more variability in one sequence set than in another [1]. There are two strong notes of caution here. The first is that in using a sequence alignment, you are doing many tests, essentially one for each position. A correction for multiple tests is in order, and one strategy we have used is to include only sites that vary, and then apply a Bonferroni correction for multiple tests to set our threshold of significance. The second is that our program, based on alignments, does not take into account the phylogenetic history of events that might lead to the observed diversity. The phylogenetic history of a set of sequences might suggest either more or fewer substitution events giving rise to the observed variation. For example, a minimum of two substitutions is needed to give rise to the observed variation of 2 A's and 2 T's in the context of phylogeny A (the first tree below), while only one substitution is required to account for the variation in phylogeny B (the second tree):


Phylogeny A:

                      ----- A
                ----- |
                |     ----- T
                |     ----- A
                ----- |
                      ----- T

Phylogeny B:

                      ----- T
                ----- |
                |     ----- T
                |     ----- A
                ----- |
                      ----- A

Our strategy reflects only observed diversity, and the measure obtained will treat both scenarios A and B above as equivalent. In some situations, this would be the most appropriate measure (for example, if you were asking questions pertaining to how much diversity currently exists in a population for an immunological survey). Additionally, basal branches within HIV subtypes tend to be very short, and the branching orders not well defined; in this scenario a simplification of a star phylogeny with all sequences radiating from an ancestral node is not a bad assumption, and under this assumption the sites are varying independently. In many situations, however, the context of the phylogenetic tree might provide further useful insights, and this tool should be used with appropriate caution.
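To make the contrast between the two trees concrete, the sketch below counts the minimum number of substitutions on each tree with Fitch's small-parsimony algorithm and compares that with the column entropy. Fitch's algorithm is brought in here purely for illustration; it is not part of the entropy tool, and the nested-tuple tree encoding and function names are assumptions of this example.

```python
import math
from collections import Counter

def column_entropy(column):
    """Natural-log Shannon entropy of one column of symbols."""
    total = len(column)
    return sum(-(n / total) * math.log(n / total)
               for n in Counter(column).values())

def fitch_substitutions(tree):
    """Return (possible ancestral states, minimum substitutions) for a tree.

    A tree is either a leaf (a one-letter string) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_cost = fitch_substitutions(tree[0])
    right_states, right_cost = fitch_substitutions(tree[1])
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a state: no extra substitution.
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed on this part of the tree.
    return left_states | right_states, left_cost + right_cost + 1

def leaves(tree):
    """Leaf states of a tree, read from top to bottom."""
    return tree if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

phylogeny_a = (("A", "T"), ("A", "T"))
phylogeny_b = (("T", "T"), ("A", "A"))

for name, tree in [("A", phylogeny_a), ("B", phylogeny_b)]:
    _, substitutions = fitch_substitutions(tree)
    entropy = column_entropy(leaves(tree))
    print(f"phylogeny {name}: {substitutions} substitution(s), entropy {entropy:.3f}")
```

Both trees carry two A's and two T's at their tips, so the entropy score is identical even though the parsimony counts differ.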

The statistic our site provides can be used when comparing two aligned data sets to each other. The input data is randomized to create as many random data sets as the user finds desirable; the program can take a few hours to run if large numbers of randomizations are done (>100,000). For example, you could select 1000 randomizations, and this would allow you to detect situations where the entropy in one data set was greater than in the other with p < 0.001. You get a complete listing of the entropy difference between the first set and the second set at each position, and the number of times, among the 1000 randomized data sets, that an entropy difference greater than or equal to the difference in the real data was observed. A summary highlights the sites for which no randomized data set gave an entropy difference as large as the one in the real data. You can also choose a cutoff for the number of random data sets that give an entropy difference greater than or equal to the observed difference for the real data. For example, you might choose 5, and then you would get a summary of those sites for which no more than 5 of the 1000 randomizations gave an entropy difference greater than or equal to the entropy difference in the real data; your estimated p-value for those sites would be <= 0.005.
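The bookkeeping behind these summaries is simple enough to sketch. The example below turns per-site tallies into estimated p-values, applies a cutoff, and computes a Bonferroni threshold over the variable sites as mentioned earlier. All function names and the tallies themselves are made up for illustration; the tool's actual output format is not reproduced here.

```python
def estimated_p_values(exceed_counts, n_randomizations):
    """Fraction of randomized data sets that matched or beat the real difference.

    A tally of 0 is best read as p < 1 / n_randomizations, not as p = 0.
    """
    return [count / n_randomizations for count in exceed_counts]

def sites_within_cutoff(exceed_counts, cutoff=0):
    """1-based positions whose tally is at or below the chosen cutoff."""
    return [position
            for position, count in enumerate(exceed_counts, start=1)
            if count <= cutoff]

def bonferroni_threshold(alpha, n_variable_sites):
    """Per-site significance threshold when n_variable_sites tests are made."""
    return alpha / n_variable_sites

# Hypothetical tallies for five sites after 1000 randomizations.
exceed_counts = [0, 5, 37, 1000, 2]
p_values = estimated_p_values(exceed_counts, 1000)
never_exceeded = sites_within_cutoff(exceed_counts)           # cutoff of 0
within_five = sites_within_cutoff(exceed_counts, cutoff=5)    # p <= 0.005
threshold = bonferroni_threshold(0.05, 50)                    # 50 variable sites

print(p_values, never_exceeded, within_five, threshold)
```

Note that the number of randomizations limits the smallest p-value that can be resolved. With 50 variable sites and an overall level of 0.05, the Bonferroni threshold is about 0.001, so 1000 randomizations are barely enough and a larger number would be a safer choice.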

The randomizations are done by first combining the two sets of data being compared, and then reconstructing two data sets of the same sizes as the originals by randomly selecting sequences from the pooled data. For example, one might want to identify sites in HIV that tend to be conserved at transmission. Imagine comparing 33 HIV sequences obtained from acute infection with 77 sequences obtained from people who had been infected for several years. First you would calculate the entropy at each position in each of the real data sets, and subtract one from the other to define an entropy difference for each position. Next you would make one random data set of 33 sequences and one of 77 sequences, drawing from the pooled data. You then calculate the difference in entropy at each position between the two randomized sets, and repeat this process the number of times requested by the user. The number of times the absolute value of the entropy difference in the random data at a position is greater than that of the real data at the same position is tallied, and used as the measure of significance.
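The procedure just described can be sketched end to end in Python. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, sequences are drawn with replacement from the pool, and a randomized difference is tallied when its absolute value is greater than or equal to the real one (the readme describes the comparison both as "greater than" and as "greater than or equal to").

```python
import math
import random
from collections import Counter

def column_entropy(column):
    """Natural-log Shannon entropy of one column of symbols."""
    total = len(column)
    return sum(-(n / total) * math.log(n / total)
               for n in Counter(column).values())

def entropy_differences(set_one, set_two):
    """Entropy of set_one minus entropy of set_two at each position."""
    return [column_entropy(a) - column_entropy(b)
            for a, b in zip(zip(*set_one), zip(*set_two))]

def randomization_counts(set_one, set_two, n_randomizations, rng):
    """Tally, per position, how often a randomized difference is at least
    as large in absolute value as the observed difference."""
    observed = entropy_differences(set_one, set_two)
    pooled = list(set_one) + list(set_two)
    counts = [0] * len(observed)
    for _ in range(n_randomizations):
        # Rebuild two sets of the original sizes from the pooled sequences.
        random_one = rng.choices(pooled, k=len(set_one))
        random_two = rng.choices(pooled, k=len(set_two))
        randomized = entropy_differences(random_one, random_two)
        for i, (real, rand) in enumerate(zip(observed, randomized)):
            if abs(rand) >= abs(real):
                counts[i] += 1
    return observed, counts

# Toy data: 33 "acute" sequences invariant at both positions, and 77
# "chronic" sequences that vary at the second position only.
acute = ["AC"] * 33
chronic = ["AC"] * 40 + ["AT"] * 37
observed, counts = randomization_counts(acute, chronic, 200, random.Random(0))
print(observed, counts)
```

With the "greater than or equal to" rule, an invariant position is tallied in every randomization, which is one practical reason to restrict the analysis to sites that vary before applying a multiple-test correction.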

We have incorporated two methods of randomization. The first was described above: the sequences used to generate the randomized data sets are pulled each time from the intact pooled data (110 sequences in the above example). This is called "random with replacement" because when you take a sequence out to build one of the random data sets, it is replaced in the pooled sequence set for the next selection; thus some sequences might be represented several times, and some not at all, in a given randomized set. The alternative method we refer to as "random with no replacement". This method removes a sequence from the pooled data set when it is selected for the random data sets, so each sequence that was in the original pool is represented only once in the random data sets, and the method essentially re-shuffles the data.
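The difference between the two methods comes down to how the pooled sequences are drawn. A minimal sketch using Python's standard `random` module follows; the function names are hypothetical and the sequences are stand-in labels.

```python
import random

def resample_with_replacement(pooled, size_one, size_two, rng):
    """Draw both sets from the intact pool; a sequence may recur or be absent."""
    return rng.choices(pooled, k=size_one), rng.choices(pooled, k=size_two)

def resample_without_replacement(pooled, size_one, size_two, rng):
    """Shuffle the pool and split it; every sequence is used exactly once."""
    if size_one + size_two != len(pooled):
        raise ValueError("set sizes must add up to the size of the pool")
    shuffled = rng.sample(pooled, k=len(pooled))
    return shuffled[:size_one], shuffled[size_one:]

# Stand-in labels for the 33 + 77 = 110 pooled sequences of the example.
pooled = [f"seq{i}" for i in range(110)]
rng = random.Random(1)

with_one, with_two = resample_with_replacement(pooled, 33, 77, rng)
without_one, without_two = resample_without_replacement(pooled, 33, 77, rng)

print(len(set(with_one + with_two)), "distinct sequences drawn with replacement")
print(len(set(without_one + without_two)), "distinct sequences drawn without replacement")
```

Drawing without replacement is a permutation of the group labels, while drawing with replacement is closer in spirit to a bootstrap of the pooled data.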


References

  1. Efron B and R Tibshirani. Statistical data analysis in the computer age. Science 253:390-395 (1991).
  2. Frahm N, B Korber, C Adams, J Szinger, R Draenert, M Addo, M Feeney, K Yusim, K Sango, N Brown, D SenGupta, A Piechocka-Trocha, T Simonis, F Marincola, A Wurcel, D Stone, C J Russell, P Adolf, D Cohen, T Roach, A St John, A Khatri, K Davies, J Mullins, P Goulder, B Walker, and C Brander. Consistent CTL targeting of immunodominant regions in HIV across multiple ethnicities. J Virol. 78(5):2187-2200 (2004).
  3. Gaschen B, J Taylor, K Yusim, F Gao, V Novitsky, B Haynes, B Foley, T Bhattacharya, and BT Korber. Diversity considerations in HIV-1 vaccine selection. Science 296(5577):2354-60 (2002).
  4. Korber BT, K Kuntsman, B Patterson, M Furtado, M McEvilly, R Levy, and S Wolinsky. Genetic differences between blood- and brain-derived viral sequences from human immunodeficiency virus type 1-infected patients: evidence of conserved elements in the V3 region of the envelope protein of brain-derived sequences. J Virol. 68:7467-81 (1994).
  5. Reza FM. An Introduction to Information Theory. Dover Publications, Inc. NY (1994).
  6. Yusim K, C Kesmir, MM Addo, M Altfeld, B Gaschen, A Chigaev, V Detours, and BT Korber. Clustering patterns of CTL epitopes in HIV-1 proteins reveal imprints of immune evasion on HIV-1 global variation. J Virol. 76:8757-68 (2002).
last modified: Tue Jul 12 13:33 2016

Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
