


PeptGen Explanation

Given an amino acid (AA) sequence, this site generates and displays shorter peptide fragments of the sequence. Here is what part of the output might look like:

MENRWSVMIVWQVDRMRIRTWKSLVKHHMYVSGKARGWFYRHHYESPHPRISSEVHIGPGDQTLVITTYWGLH
MENRWSVMIVWQVDR (15)
     SVMIVWQVDRMRIR (14)
         VWQVDRMRIRTWKSL (15)
              RMRIRTWKSLVKHHM (15)
                   TWKSLVKHHMYVSGK (15)
                        VKHHMYVSGKARGWF (15)
                             YVSGKARGWFYRHHY (15)
                                  ARGWFYRHHYESPH (14)
                                      FYRHHYESPHPRI (13)
                                         HHYESPHPRISSEVH (15)
                                              PHPRISSEVHIGPGD (15) All C-term AAs forbidden
                                                   SSEVHIGPGDQTLVI (15)
                                                        IGPGDQTLVITTYW (14)
                                                            DQTLVITTYWGLH (13)

The first line is the submitted AA sequence, while the peptides generated by the program form the stairstep pattern below. The submitted protein sequence may have spaces and newline characters within it, but these (and a few other "funny characters") will be removed. The number in parentheses following each peptide records the length of the peptide. In this particular example, the user had specified that the program attempt to construct peptides of 15 AAs.

Forbidden C-term Amino Acids

If you use this option, some of your peptides may be shorter or longer than the selected length. Users can specify amino acids they would prefer not to have at the end of their peptides. In particular, Philip Goulder requested this option because he did not want the peptides he was designing to end in any of the amino acids GPEDQNTSC, as these are rarely found at the C-terminal position of CTL epitopes. The program therefore automatically extends or shortens particular peptides to avoid user-specified amino acids. If the field is left blank, all peptides will be the same length.

Forbidden N-term Amino Acids

In addition to forbidden C-term residues, the user may specify forbidden N-term AAs. For example, some users have found peptides beginning with Q (glutamine) difficult to synthesize, so Q can be excluded as a forbidden N-term AA. This ensures that the program will not generate any peptides that begin with Q. If a Q is present at the position where the program "wants" to start a peptide, it will move the start leftwards one position at a time until an allowed N-term AA is found.
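The leftward search can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the program's actual code; the function name adjust_start and the case-insensitive comparison are assumptions.

```python
def adjust_start(seq, start, forbidden_n="Q"):
    """Move `start` leftwards until seq[start] is an allowed N-term AA.

    Hypothetical helper: stops at position 0 even if that AA is forbidden.
    """
    while start > 0 and seq[start].upper() in forbidden_n:
        start -= 1
    return start


# Fragment of the Env example used further down this page (space removed).
env = "aRqllsGIvqQQnnLLrAieaQQhllqLTvwGiKQL"
print(adjust_start(env, 9))   # position 9 is a q, so the start moves left
```

Note that moving the start leftwards makes the overlap with the previous peptide larger than the requested value.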

To see how the forbidden C-term rule works, look at the second peptide generated above (SVMIVWQVDRMRIR): it is only 14 AAs long because the fifteenth AA is a T, a member of the forbidden set. The program reads the AA at the target length (15). If the fifteenth AA is forbidden, it looks at the fourteenth AA. If that one is forbidden too, it looks at the thirteenth and finally the twelfth, searching for an allowed AA. The first position at which it finds a nonforbidden AA determines the length of the peptide. How far the program will shorten the peptide in its search for an allowed AA is specified by the user. In this example the "shorten by" parameter was set to 3. If all the AAs from 15 down to 12 are forbidden, the program begins to lengthen the peptide beyond the "ideal" of 15, one amino acid at a time, until it finds an allowed AA. It will add amino acids to the peptide up to the limit set by the "lengthen by" parameter (2 in this case). If all AAs between 12 and 17 are forbidden, the 15-mer is used even though it ends in a forbidden AA. Such peptides are marked with the words "All C-term AAs forbidden!". (There is one example of this rare occurrence in the output reproduced above.) The number of times such "forbidden" peptides are generated is reported at the end of your output. In the peptides, only the forbidden AAs near the C-terminus are printed in bold and underlined.
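The search order described in this paragraph (ideal length, then shorter, then longer) can be sketched as follows. This is a rough illustration under the parameters of the example, not the program's real implementation; choose_length and its argument names are made up for this sketch, and the handling of the end of the sequence is an assumption.

```python
def choose_length(seq, start, ideal=15, shorten=3, lengthen=2,
                  forbidden_c="GPEDQNTSC"):
    """Return (length, all_forbidden) for the peptide beginning at `start`."""
    remaining = len(seq) - start
    # Ideal length first, then one shorter at a time, then one longer at a time.
    candidates = [ideal - k for k in range(shorten + 1)]
    candidates += [ideal + k for k in range(1, lengthen + 1)]
    for length in candidates:
        last = start + length - 1          # index of the C-terminal AA
        if 0 < length <= remaining and seq[last].upper() not in forbidden_c:
            return length, False
    # Nothing allowed was found: fall back to the ideal length and flag it.
    return min(ideal, remaining), True


protein = ("MENRWSVMIVWQVDRMRIRTWKSLVKHHMYVSGKARGWFYRHHYESPHPRISSEVHIGPGD"
           "QTLVITTYWGLH")
length, flagged = choose_length(protein, 5)
print(protein[5:5 + length], length, flagged)
```

With the example protein, a start at position 5 (counting from 0) yields the 14-mer SVMIVWQVDRMRIR, and a start at position 46 yields the flagged 15-mer PHPRISSEVHIGPGD.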

Peptide Overlap

What determines the offset, or indentation, between one peptide and the next, i.e., the "width" of each stairstep? This is set by yet another user-specified parameter, "Overlap peptide by", which for the example being discussed has been set to 10. This means that two consecutive peptides will have ten AAs in common. If a short peptide has been generated (say 13 AAs long), then to maintain the overlap of 10, the subsequent peptide will be "indented" only 3 places.

MENRWQVMIVWQV (13)
   ||||||||||
   RWQVMIVWQVDRMRI (15)
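In other words, the indentation of the next peptide is the length of the current peptide minus the overlap. A tiny sketch of that arithmetic (the function name next_start is hypothetical):

```python
def next_start(start, length, overlap=10):
    """Start position of the peptide that follows one at `start` of `length`."""
    return start + length - overlap


seq = "MENRWQVMIVWQVDRMRI"
first = seq[0:13]                      # the 13-mer from the example above
second_start = next_start(0, 13)       # indented by 13 - 10 = 3 places
shared = seq[second_start:13]          # the AAs the two peptides have in common
print(first, second_start, shared, len(shared))
```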
   

The practical result of specifying an overlap parameter is that n-mers of length = (overlap + 1) will be represented exactly once. Here is how this works for 2 peptides.

MENRWQVMIVWQVDR (15) First peptide

MENRWQVMIVW     |
 ENRWQVMIVWQ    |
  NRWQVMIVWQV   |- 11-mers contained within first peptide
   RWQVMIVWQVD  |
    WQVMIVWQVDR |

     QVMIVWQVDRMRIR (14) Second peptide

     QVMIVWQVDRM    |
      VMIVWQVDRMR   |- 11-mers contained within second peptide
       MIVWQVDRMRI  |
        IVWQVDRMRIR |

         VWQVDRMRIRTWKSL (15) etc. .... 

The reason for implementing this parameter is that in immunological studies it is usually the shorter peptides (9 to 11-mers) that are of most interest, but the cost and labor of generating every single one of these is prohibitive. As a compromise, longer peptides are synthesised in such a way as to ensure that shorter peptides of a given length (e.g. 11-mers) will be represented at least once in the longer peptides. The algorithm this program uses ensures that this will be the case.
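If you want to convince yourself of this for a given peptide set, you can count how often each 11-mer window falls entirely inside a peptide. The sketch below is a hypothetical check, not part of PeptGen; it takes peptides as (start, length) pairs with positions counted from 0.

```python
def kmer_coverage(seq_len, peptides, overlap=10):
    """Count, for every (overlap + 1)-mer window, how many peptides contain it."""
    k = overlap + 1
    counts = [0] * (seq_len - k + 1)
    for start, length in peptides:
        # Windows starting at start .. start + length - k lie inside this peptide.
        for w in range(start, start + length - k + 1):
            counts[w] += 1
    return counts


# (start, length) of the 14 peptides in the stairstep output at the top of the page.
peptides = [(0, 15), (5, 14), (9, 15), (14, 15), (19, 15), (24, 15), (29, 15),
            (34, 14), (38, 13), (41, 15), (46, 15), (51, 15), (56, 14), (60, 13)]
counts = kmer_coverage(73, peptides)
print(min(counts), max(counts))
```

A minimum count of 0 would mean that some 11-mer is missing from the peptide set.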

Proline Rule

The "Proline rule" says that, no matter what, a peptide may not end in a proline residue. In this sense proline may be thought of as ultra-forbidden. We have seen that it is possible for all the AAs near the C-terminus to be forbidden. In this case, the program selects the peptide of "ideal length" (the 15-mer in the examples above) even though it ends with a forbidden AA. But if the proline rule is being observed and the 15th AA is proline, then the 14-mer will be chosen as the peptide. Under the proline rule the algorithm will shorten the ideal peptide and then lengthen it, searching for a non-proline AA. If all AAs at the C-term are prolines, this is reported and the 15-mer is used.
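One way to picture the proline rule is as a second pass that runs only after every candidate C-term AA has turned out to be forbidden. The sketch below is an interpretation of the paragraph above, not the program's code; the name proline_fallback and the order of the search (shorten first, then lengthen) are taken from the description, everything else is an assumption.

```python
def proline_fallback(seq, start, ideal=15, shorten=3, lengthen=2):
    """Pick a length whose C-terminal AA is not proline.

    Returns (length, all_proline). Meant to be called only when every
    candidate C-terminal AA is already known to be forbidden.
    """
    remaining = len(seq) - start
    candidates = [ideal - k for k in range(shorten + 1)]
    candidates += [ideal + k for k in range(1, lengthen + 1)]
    for length in candidates:
        if 0 < length <= remaining and seq[start + length - 1].upper() != "P":
            return length, False
    # Every candidate position is a proline: report it and keep the ideal length.
    return min(ideal, remaining), True


# 15th AA is P and 14th is G (forbidden, but not proline): the 14-mer is chosen.
print(proline_fallback("A" * 13 + "GPQT", 0))
```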

Calculate Hydropathy

If this option is selected, the hydropathy index of each generated peptide is calculated and shown [in square brackets]. It is computed by assigning the Kyte-Doolittle hydropathy index to each AA in the peptide and taking the average over the peptide (Kyte, J. and Doolittle, R.F., A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 157:105-132 (1982)).
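As a minimal sketch of that calculation: the table below holds the published Kyte-Doolittle values, and the function simply averages them. The name mean_hydropathy is made up here, and how the site rounds or formats the result is not specified on this page.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Average Kyte-Doolittle index over all AAs of the peptide."""
    return sum(KD[aa] for aa in peptide.upper()) / len(peptide)


print(round(mean_hydropathy("MENRWSVMIVWQVDR"), 2))
```

Positive values indicate a predominantly hydrophobic peptide, negative values a hydrophilic one.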

Color Amino Acids

In addition to computing the hydropathy index of peptides, you may color each residue according to its Kyte-Doolittle hydropathy class, as shown here.

I,V,L Most hydrophobic
F,C,M,A
G,T,S,W,Y,P
K,H,N,Q,D,E
R Most hydrophilic
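The five classes above can be written as a simple lookup. The class numbering (0 for most hydrophobic, 4 for most hydrophilic) and the function name are choices made for this sketch; the actual colors used on the page are not reproduced here.

```python
# Classes in the order listed above, most hydrophobic first.
HYDROPATHY_CLASSES = ["IVL", "FCMA", "GTSWYP", "KHNQDE", "R"]


def hydropathy_class(aa):
    """Return 0 (most hydrophobic) to 4 (most hydrophilic) for one amino acid."""
    for index, members in enumerate(HYDROPATHY_CLASSES):
        if aa.upper() in members:
            return index
    raise ValueError(f"not one of the 20 standard amino acids: {aa!r}")


print([hydropathy_class(aa) for aa in "MENRW"])
```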

 

Failure to generate peptides

When severe constraints are applied to a protein, the algorithm cannot always find an allowed peptide. Here is an actual example from the Env protein of HIV-1 that, incidentally, shows both how the algorithm works and how it can fail.

The following constraints apply:

Ideal peptide length = 15
C-term forbidden = "GPEDQNTSC"
N-term forbidden = "Q"   <- Note Q is forbidden!
Lengthen by = 2
Shorten by = 3
Overlap = 10

aRqll sGIvqQQnnLLrAieaQQhllqLTvwGiKQL ... the protein

aRqllsGIvqQQnnL (15)           first pept (normal)
     sGIvqQQnnLLrAie           second candidate pept shifted right 5
     sGIvqQQnnLLrAi (14)       truncate at C-term (e forbidden)
         qQQnnLLrAieaQQh       third candidate pept (q at N-term) shifted right 4
        vqQQnnLLrAieaQQ        third pept second attempt move left 1 character
        vqQQnnLLrAiea (13)     truncate at C-term (2 Q forbidden)  note this !!!!
           QnnLLrAieaQQhll     fourth pept shifted only 3 right because of 2 Q removed above
          QQnnLLrAieaQQhl      fourth pept 2nd attempt forbidden N-term Q
         qQQnnLLrAieaQQh       fourth pept 3rd attempt forbidden N-term q
        vqQQnnLLrAieaQQ        fourth pept 4th attempt
        vqQQnnLLrAiea (13)     truncate at C-term (2 Q forbidden)  same as third peptide!!!!
                                     etc. ad infinitum ...

The result is an endless loop that generates the same peptide repeatedly. Should this type of situation develop, the program will shift forward four (rather than three) characters after generating the second peptide. This will probably get things unwedged, but it means one 11-mer will not be represented. If the program really cannot find a solution it will quit, and you should try supplying less stringent requirements.
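The loop arises when the leftward N-term search drags the next start all the way back to the start of the peptide just generated. A hypothetical check for that condition (not the program's own code, and it does not attempt the recovery step) might look like this:

```python
def is_stuck(seq, prev_start, prev_len, overlap=10, forbidden_n="Q"):
    """True if the next peptide would start at or before the previous one."""
    start = prev_start + prev_len - overlap
    # Same leftward search as for any forbidden N-term AA.
    while start > 0 and seq[start].upper() in forbidden_n:
        start -= 1
    return start <= prev_start


env = "aRqllsGIvqQQnnLLrAieaQQhllqLTvwGiKQL"
# Third peptide in the example: starts at position 8 (the v), 13 AAs long.
print(is_stuck(env, 8, 13))
```

For the third peptide of the example the check reports a loop, because the candidate starts at positions 11, 10 and 9 are all Q and the search ends up back at position 8.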

Submitting an alignment

Instead of submitting a single raw amino acid input sequence, the user may also submit an alignment.

The output of an alignment submission looks something like:

MENRWSVMIVWQ.VDRMRIRTWK   ... etc
MENRWSVMIVWQ.VDRMRI (18)
--Q.--------Q-----L
--Q.--------Q------
        IVWQVDRMRIRTWK (14)
        ---------L----
        --------------

Note that the first sequence is taken as the "master" sequence, against which the other two sequences are compared. Identical amino acids are shown as dashes, while gaps appear as "." characters.
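The display convention can be sketched as a column-by-column comparison. This is an illustration only; the function name and the assumption that gaps arrive as "-" characters in the submitted alignment are not taken from the program.

```python
def mask_against_master(master, other, gap="-"):
    """Render `other` relative to `master`: '-' if identical, '.' for a gap."""
    shown = []
    for m, o in zip(master, other):
        if o == gap:
            shown.append(".")      # gap in this sequence
        elif o == m:
            shown.append("-")      # same AA as the master sequence
        else:
            shown.append(o)        # differing AA is printed as is
    return "".join(shown)


master = "MENRWSVMIVWQ-VDRMRI"
other = "MEQ-WSVMIVWQQVDRMRL"
print(mask_against_master(master, other))
```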

Simple Output

In the simple output option, peptides are listed without indentation, coloring, extraneous comments or formatting. Simple output is available in two formats, "classic" (the old format) and "new".

"Classic" simple format. If an alignment has been submitted then blocks of peptides are delineated by a blank line. There is a good probability that any block of peptides in the alignment may contain duplicate peptides. The user can choose to eliminate duplicate peptides or flag them with a "!" character as shown in the example below. An alignment submission may contain gaps. Gaps may be saved or eliminated. The number to the left of the peptide is the sequence number within a block of peptides.

Duplicates flagged            Duplicates removed
Gaps preserved                Gaps eliminated

1 MENRWSVMIVWQ-VDR 1-1-1      1 MENRWSVMIVWQVDR  1-1-1
2 MENRWSVMIVWQQVDR 1-2-1      2 MENRWSVMIVWQQVDR 1-2-1
3 MENRWSVMIVWQRVDR 1-3-1      3 MENRWSVMIVWQRVDR 1-3-1
4 MENRWSVMIVWQ-VDR 1-1-2!
                              1 SVMIVWQQVDRMRIR 2-1-1
1 SVMIVWQQVDRMRIR 2-1-1       2 SVMIVWQDRMRIR   2-2-1
2 SVMIVWQ-VDRMRIR 2-2-1       4 SVMIVWQVDRMRGR  2-3-1
3 SVMIVWQ-VDRMRIR 2-2-2!
4 SVMIVWQ-VDRMRGR 2-3-1

Each peptide is assigned a peptide ID, written to the right of the peptide, which consists of three numbers separated by dashes. In an alignment the first number is the peptide block number, the second number is the peptide "type", the third number is the occurrence order in the block, and the "!" is a handy visual tag for duplicates. The combination of the three numbers uniquely identifies each peptide in the entire set generated. Here is a simple alignment of "3-mers" to help you understand the ID number.

1 ABC 1-1-1
2 XYZ 1-2-1    2 means this is the 2nd peptide type in the block
3 ABC 1-1-2 !  2 means this is the 2nd occurrence of peptide type 1 in the block
4 XYZ 1-2-2 !
5 CCC 1-3-1    3 means this is the 3rd peptide type in the block
6 ABC 1-1-3 !
7 XYZ 1-2-3 !
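The numbering scheme in this small example can be reproduced with two dictionaries, one for the peptide "type" and one for the occurrence count. This is a sketch of the scheme as described, with a made-up function name, not the program's own code.

```python
def classic_ids(block_no, peptides):
    """Return 'block-type-occurrence' IDs, with ' !' appended to duplicates."""
    type_of = {}      # peptide sequence -> type number within the block
    seen = {}         # peptide sequence -> occurrences so far
    ids = []
    for pep in peptides:
        # A new sequence gets the next unused type number.
        ptype = type_of.setdefault(pep, len(type_of) + 1)
        seen[pep] = seen.get(pep, 0) + 1
        flag = " !" if seen[pep] > 1 else ""
        ids.append(f"{block_no}-{ptype}-{seen[pep]}{flag}")
    return ids


block = ["ABC", "XYZ", "ABC", "XYZ", "CCC", "ABC", "XYZ"]
for number, (pep, pid) in enumerate(zip(block, classic_ids(1, block)), start=1):
    print(number, pep, pid)
```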

If the submission is a single protein sequence, not an alignment, the peptides are simply numbered serially.

"New" simple format. An alignment is submitted. The results look like:

1 NAKSIIVQLNETVEI 1_1&4.2
2 NAKGIIVQLSETVEI 1_2.1
3 NAKVIIVQLNESVEI 1_3&13&15.3
4 NAKIIIVQLNESVEI 1_5&6&8&9&10&11&12&14&18&20.10
5 NESVEI 1_7&16&17&19.4
?
6 IIVQLNETVEIDCTR 2_1.1
7 IIVQLSETVEIDCTR 2_2.1
8 IIVQLNESVEINCTR 2_3&5&6&8&9&10&11&12&13&14&15&18&20.13
9 IIVQLNETVEINCTR 2_4.1
10 NESVEINCTR 2_7&16&17&19.4
?
11 LNETVEIDCTRPNNNTR 3_1.1

In the first peptide

1 NAKSIIVQLNETVEI 1_1&4.2

"1" is a serial number that means it is the first unique peptide sequence,

NAKSIIVQLNETVEI

is its sequence

1_

means that it is in the first block of overlapping peptides.

1&4

means that this exact 15-mer was found in sequences 1 and 4 of your alignment. A key that relates sequence name to alignment position is printed at the top of your output.

.2

means that two sequences matched this epitope (sequences 1 and 4).
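If you need to process the "new" format in a script, the ID can be split on its three separators. The helper below is a hypothetical sketch based only on the layout explained above.

```python
def parse_new_id(tag):
    """Split an ID such as '1_1&4.2' into (block, sequence numbers, count)."""
    block, rest = tag.split("_", 1)
    sequences, count = rest.rsplit(".", 1)
    return int(block), [int(s) for s in sequences.split("&")], int(count)


block, sequences, count = parse_new_id("1_1&4.2")
print(block, sequences, count)
```

The final count should always equal the number of sequence numbers listed before it, which makes a handy consistency check.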


last modified: Fri Jan 18 09:30 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
