Poisson-Fitter analyzes genetic data from homogeneous sequences, such as HIV or HCV sequences from a single patient early in infection. The maximum genetic diversity from the consensus should not exceed 10%. (For sets of sequences with >10% diversity, we suggest using BEAST or other tools.) Poisson-Fitter performs statistical tests on Hamming Distance (HD) frequency distributions: it computes the best fitting Poisson distribution through Maximum Likelihood, performs a χ2 Goodness of Fit (GOF) test, and tests for Star-Phylogeny.
Giorgi EE, Funkhouser B, Athreya G, Perelson AS, Korber BT, Bhattacharya T. Estimating time since infection in early homogeneous HIV-1 samples using a Poisson model. BMC Bioinformatics 2010 Oct 25;11:532. PMID: 20973976
Giorgi EE and Bhattacharya T. A note on two-sample tests for comparing intra-individual genetic sequence diversity between populations. Biometrics December 2012; 68:4. PMID: 23004569
The tool accepts a single input file. This file may contain alignments from multiple patient sets, to spare the user from uploading each file separately. The alignment file can be in any of the Common Sequence Formats. Large-scale sequence data can be accepted in the format provided by ElimDupes (see details below). The latter is recommended for large alignments with many duplicates.
The following are REQUIREMENTS for the input file:
Example of valid input format:
>CH40.CONSENSUS
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ1
AGCCTGCATGGGATGGACGACCCAGA
>CH40.SEQ2
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ3
AGCCAGCATGGGATGGACGACCCAGA
>SUMA.CONSENSUS
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
>SUMA.SEQ1
TCCAGACCCGACAGGCCCAAA---AAAAAAGTAGACAA
>SUMA.SEQ2
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
This input file contains two alignments, each from a different patient and different genomic region. Separate result files will be generated for CH40 and for SUMA. The sample ID need not be in every sequence name, but it is required that the user provides an ID in the consensus sequence.
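As an illustration of how a multi-sample file like the one above can be split into per-sample alignments, here is a minimal Python sketch. It is not the tool's own code. It assumes that any record whose name ends in CONSENSUS starts a new alignment and that the text before ".CONSENSUS" is the sample ID; the function name split_alignments is hypothetical.

```python
def split_alignments(fasta_text):
    """Group FASTA records into per-sample alignments.

    Assumption (not taken from the tool's source): a record whose name ends
    in 'CONSENSUS' opens a new alignment, and the part of the name before
    the final '.' is used as the sample ID.
    """
    alignments = {}   # sample ID -> list of [name, sequence]
    current = None    # sample ID of the alignment being filled
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name.upper().endswith("CONSENSUS"):
                current = name.rsplit(".", 1)[0]
                alignments[current] = []
            if current is None:
                raise ValueError("the first sequence must be a consensus")
            alignments[current].append([name, ""])
        else:
            if current is None:
                raise ValueError("sequence data found before any header")
            # Sequences may span several lines; append to the last record.
            alignments[current][-1][1] += line
    return {sid: [(n, s) for n, s in recs] for sid, recs in alignments.items()}


example = """>CH40.CONSENSUS
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ1
AGCCTGCATGGGATGGACGACCCAGA
>SUMA.CONSENSUS
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
>SUMA.SEQ1
TCCAGACCCGACAGGCCCAAA---AAAAAAGTAGACAA
"""
by_sample = split_alignments(example)
print(list(by_sample))  # sample IDs in file order
```

Because grouping is positional, sequences that follow a consensus are assigned to that sample even if their own names do not carry the sample ID.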
Deep sequencing detection: Poisson-Fitter can handle big alignments such as the ones obtained through deep sequencing. The program automatically reformats the alignments using the same conventions and formatting as provided by the tool ElimDupes (see below for a more detailed explanation). The user can also opt to have the input already formatted as a large-scale input file, in which case the option "Large-scale formatted" should be checked. If the option is not checked, the program will ignore the formatting and the output will not be accurate.
Formatting: The standard format we chose for deep sequencing input (Fischer et al. 2010 PLoS ONE 5(8):e12303), implemented by the tool ElimDupes, is the following: every unique sequence is represented only once, and its name should be of the type XXX.xxx_yyy (XXX_yyy is an acceptable alternative form), where XXX is an arbitrary string (for example the sample ID), xxx is the unique identifier of the sequence, and yyy is the multiplicity of that sequence, i.e. how many sequences identical to it are represented in the alignment. The user can input the alignment already in this format. If the input is NOT formatted this way and the number of sequences exceeds 150, the program will automatically format the sample in the manner explained above, and a link to the formatted files will be provided in the output page.
Note: If the original input is in large-scale format, it still needs a consensus sequence. The first sequence should be the consensus, and it should have the label CONSENSUS without the multiplicity tag (in other words, do not include the tag "_yyy" in the consensus name; sequences identical to the consensus should be included separately).
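The naming convention can be illustrated with a short Python sketch. This is not ElimDupes itself: the helper names parse_multiplicity and collapse_duplicates are hypothetical, and the running integer used as the unique identifier is an assumption made for the example.

```python
import re
from collections import Counter

# A trailing "_<digits>" is read as the multiplicity tag.
_TAG = re.compile(r"^(?P<label>.+)_(?P<count>\d+)$")


def parse_multiplicity(name):
    """Return (label, multiplicity) for names like 'XXX.xxx_yyy' or 'XXX_yyy'.

    Names without a trailing '_<digits>' tag (e.g. the consensus) are given
    multiplicity 1.
    """
    m = _TAG.match(name)
    if m is None:
        return name, 1
    return m.group("label"), int(m.group("count"))


def collapse_duplicates(records, sample_id):
    """Collapse identical sequences into one record each.

    records: list of (name, sequence) with the consensus first.
    The consensus is kept as-is, without a multiplicity tag; sequences
    identical to it are still counted as a separate record.
    Assumption: unique identifiers are running integers starting at 1.
    """
    consensus, rest = records[0], records[1:]
    counts = Counter(seq for _, seq in rest)  # keeps first-seen order
    out = [consensus]
    for i, (seq, n) in enumerate(counts.items(), start=1):
        out.append((f"{sample_id}.{i}_{n}", seq))
    return out


print(parse_multiplicity("CH40.7_12"))
```

Reading the tag back with parse_multiplicity lets downstream counts weight each unique sequence by the number of copies it stands for.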
The program does not handle ambiguous IUPAC codes. If these are present in the alignment, the user can check the option "eliminate ambiguity codes", and all sequences where the characters are found will be eliminated from the analysis. If there are ambiguity codes, but the user does not check this option, the program will generate an error message and halt execution.
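A pre-check along these lines can be run locally before uploading. The sketch below is only an approximation of the option described above: it assumes that the accepted characters are A, C, G, T and the gap symbol "-", and the function name drop_ambiguous is hypothetical.

```python
# Assumption: only unambiguous bases and alignment gaps are accepted.
ALLOWED = set("ACGT-")


def drop_ambiguous(records):
    """Split records into those without and with ambiguity codes.

    records: list of (name, sequence).
    Returns (kept_records, names_of_dropped_records).
    """
    kept, dropped = [], []
    for name, seq in records:
        if set(seq.upper()) <= ALLOWED:
            kept.append((name, seq))
        else:
            dropped.append(name)
    return kept, dropped


kept, dropped = drop_ambiguous([
    ("CH40.SEQ1", "AGCCTG"),
    ("CH40.SEQ2", "AGCCRG"),  # R is an IUPAC ambiguity code (A or G)
    ("CH40.SEQ3", "AGC--G"),
])
print(dropped)
```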
The default mutation rate is set to 2.16e-5, based on Mansky and Temin 1995 (J Virol 69:5087-5094). This is the default value for HIV. Users can change this value if they are working with a different virus or with a set of alignments for which a different mutation rate is reasonable. Increasing the mutation rate will result in a shorter estimate of the time since infection, whereas a smaller mutation rate will result in longer time estimates. The value must be between 0 and 10e-3.
If desired, the tool also checks for hypermutation enrichment and, if found, removes positions and/or sequences presenting the APOBEC3G/F signature. Because APOBEC mutations happen at a faster rate than random mutations, they can bias the Poisson fit. We recommend first testing the sample with NO APOBEC correction. However, if some early samples show strong divergence from a Poisson, it is advisable to check whether or not this divergence is due to APOBEC enrichment. We provide two ways to do so in the options. After uploading the input file, the user can choose to have all positions found in an APOBEC context removed by checking the box next to "APOBEC positions" (APOBEC positions will be removed in all samples if the first radio button is chosen, or otherwise only in the samples that are statistically significantly enriched for APOBEC), to have all sequences found to be significantly enriched for APOBEC (tested via the Hypermut tool) removed, or both. The tool will then analyze the additional, APOBEC-removed files, as well as the original file, for easy comparison.
When a p-value is specified, APOBEC enrichment is tested using the Hypermut tool as follows: when the "remove APOBEC positions" option is checked, the program will create a "compressed file" (an alignment with two sequences, the consensus and a second sequence with all mutations found in the full alignment) and test it through the Hypermut tool. If the p-value is less than the user-input p-value, then a new alignment with all positions in APOBEC context removed will be created and analyzed along with the original fasta file. When the "remove APOBEC sequences" option is checked, the program will test each sequence in the alignment using the Hypermut tool. Any sequence with a p-value less than the user-input p-value will be removed and a new alignment created and analyzed along with the original fasta file.
The Poisson-Fitter provides a null model of an early HIV-1 infection initiated by a single genetic strain, prior to the onset of host-driven selection. This assumption is based on the occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmissions (Wolinsky et al. 1992 Science 255:1134-1137; Derdeyn et al. 2004 Science 303:2019-2022; Delwart et al. 2002 AIDS 16:189-195; Zhang et al. 1993 J Virol 67:3345-3356). This results in a majority of new infections being homogeneous, i.e., initiated by a single genetic strain (Keele et al. 2008 Proc Natl Acad Sci 105:7552-7). Furthermore, the viral population grows exponentially during the early phases of infection, prior to the onset of the host immune response. In this simple setting, the Poisson Fitter provides a tool for estimating evolutionary and demographic parameters.
The null model provided by the tool has a two-fold use: it can be used as a basis of comparison to determine whether or not the sample did indeed originate from a single genetic founder; and, for those samples that do meet this assumption, the tool can provide an estimate of the time since the Most Recent Common Ancestor (MRCA) and a test for star-phylogeny evolution.
Time since the MRCA is estimated based on a model of exponential growth and random accumulation of mutations. These conditions hold early in the infection, when the population is expanding and selective pressure from the host has not yet started. Under this scenario, the pairwise Hamming Distance (HD, the number of bases at which any two sequences differ) frequency counts are expected to follow a Poisson distribution with main parameter given by (from maximizing the log-likelihood function)
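For a Poisson distribution, the maximum likelihood estimate of the parameter is the sample mean, so if Y_k is the number of sequence pairs at Hamming distance k, the estimate is lambda = sum_k(k * Y_k) / sum_k(Y_k). The Python sketch below illustrates this general fact on a toy alignment; it is not the tool's implementation. As an assumption, every differing character (including a gap against a base) is counted as one difference, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import exp, factorial


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def hd_counts(seqs):
    """Frequency counts Y_k of pairwise Hamming distances."""
    return Counter(hamming(a, b) for a, b in combinations(seqs, 2))


def poisson_mle(counts):
    """Maximum likelihood Poisson parameter: the mean of the observed HDs."""
    total = sum(counts.values())
    return sum(k * n for k, n in counts.items()) / total


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson with parameter lam."""
    return exp(-lam) * lam ** k / factorial(k)


seqs = ["AGCCAG", "AGCCTG", "AGCCAG", "GGCCAG"]
counts = hd_counts(seqs)
lam = poisson_mle(counts)
n_pairs = sum(counts.values())
for k in range(max(counts) + 1):
    # Observed count next to the count expected under the fitted Poisson.
    print(k, counts.get(k, 0), n_pairs * poisson_pmf(k, lam))
```

Comparing the observed and expected columns gives a quick visual impression of how well the Poisson describes the sample.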
Compressed files (when "remove APOBEC positions" is selected): For each input alignment, a fasta file is created with the consensus sequence followed by a summary sequence that displays all mutations found across the entire sample. This type of output is useful when testing for overall APOBEC enrichment (see below).
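The idea of a summary sequence can be sketched as follows. This is an illustration only: how the tool resolves positions where several different bases occur is not stated here, so the sketch assumes that the first variant encountered is kept and that gaps are not treated as mutations. The function name compress_alignment is hypothetical.

```python
def compress_alignment(consensus, seqs):
    """Build one summary sequence carrying every mutated position.

    Assumptions (not confirmed by the tool's documentation): if several
    different bases occur at a position, the first one seen is kept, and
    gap characters are not treated as mutations.
    """
    summary = list(consensus)
    for seq in seqs:
        if len(seq) != len(consensus):
            raise ValueError("sequences must be aligned to the consensus")
        for i, (c, s) in enumerate(zip(consensus, seq)):
            if s != c and s != "-" and summary[i] == c:
                summary[i] = s
    return "".join(summary)


consensus = "AGCCAG"
sample = ["AGCCTG", "AGCCAG", "GGCCAG", "AGC-AG"]
print(compress_alignment(consensus, sample))
```

The resulting two-sequence alignment (consensus plus summary) is the kind of input on which a single overall enrichment test can be run.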
Hypermut results: If the user chooses to remove APOBEC positions and/or APOBEC enriched sequences, each alignment is tested for APOBEC enrichment through the tool Hypermut, and p-values below the threshold selected by the user are displayed in the following format:
The user can decide how to correct for hypermutation. The
NOTE: If the user selects a certain hypermutation p-value threshold (first option), then new, APOBEC-removed files will be created only if the Hypermut tool detects APOBEC enrichment with a p-value below the threshold. If the user wishes to remove APOBEC in all samples regardless of whether hypermutation is significant or not, then the option "remove APOBEC in all samples" should be checked. In either case both corrected and non-corrected files will be analyzed by the program, so that the user can evaluate the effect of removing APOBEC from the alignment. Links to the APOBEC-cleaned files are provided in the output for the user to download.
Log Likelihood - Estimated Parameters: This is a table with all the computed statistics for each alignment. It can be viewed by clicking on the link in the output page, and can be downloaded as a text file by clicking on the link "View As Data File" next to the table title. The table contains the following columns:
SAMPLE: unique alignment identifier, taken from the name of the consensus sequence.
LAMBDA: parameter of the Poisson distribution that best fits the HD frequency counts.
ST.DEV.: standard deviation of the parameter estimated above.
NSEQ: number of sequences in the alignment.
NBASES: number of bases in the alignment.
LengthYvec0, SumYvec0: These two columns display the total number of cells with a mutation from the consensus and the total number of cells. The two numbers are useful if the user wishes to calculate Bayesian credible intervals rather than standard confidence intervals.
meanHD: mean intersequence Hamming Distance (should be equal to or very close to Lambda).
maxHD: maximum intersequence Hamming Distance.
DAYS(CI): number of days since MRCA with 95% Confidence Intervals.
Chi2: χ2 statistic from Goodness of Fit test.
DF: degrees of freedom on the χ2 statistic.
GOF_PVAL: p-value from the Goodness of Fit test (low p-values indicate divergence from a Poisson). NOTE: sometimes the fit is so poor (for example when the maximum Hamming Distance is too high to be compatible with a homogeneous infection) that the GOF test fails and NA values are returned. The user should take that as an indication of divergence from a Poisson distribution.
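To make the Chi2 and DF columns more concrete, the following Python sketch computes a textbook χ2 goodness-of-fit statistic for HD counts against a fitted Poisson. It is not the tool's implementation: the binning rule (one bin per HD value up to the observed maximum, plus one tail bin) is an assumption, as is the function name chi2_gof. Real analyses often pool bins with small expected counts, which this sketch does not do.

```python
from math import exp, factorial


def chi2_gof(counts, lam):
    """Chi-square goodness-of-fit statistic against a Poisson(lam).

    counts: dict mapping Hamming distance k to its observed frequency.
    Returns (chi2, df). Assumed binning: k = 0..max observed, plus a tail
    bin for everything above, whose observed count is zero.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    n = sum(counts.values())
    kmax = max(counts)
    chi2 = 0.0
    cumulative = 0.0
    for k in range(kmax + 1):
        p = exp(-lam) * lam ** k / factorial(k)
        cumulative += p
        expected = n * p
        chi2 += (counts.get(k, 0) - expected) ** 2 / expected
    # Tail bin: observed 0, so (0 - E)^2 / E reduces to E.
    chi2 += max(0.0, n * (1.0 - cumulative))
    bins = kmax + 2
    # One degree of freedom is lost to the total count and one to the
    # parameter lam estimated from the same data.
    df = bins - 1 - 1
    return chi2, df


counts = {0: 1, 1: 3, 2: 2}
lam = sum(k * n for k, n in counts.items()) / sum(counts.values())
chi2, df = chi2_gof(counts, lam)
print(chi2, df)
```

A p-value would then come from the upper tail of a χ2 distribution with df degrees of freedom, for example via a statistics library.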
Convolution Estimates: This is a text file to be used as an internal check for star-phylogeny. As the program runs, for each alignment the following data is appended to the file:
Figure files: For each alignment, the program outputs two figures. The first one plots the histogram of the pairwise HD frequency counts and the best fitting Poisson distribution (in red, as a continuous line for better visualization). The following is an example:
This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.
Please reference this article when using Poisson Fitter: