Given an amino acid (AA) sequence, this site generates and displays shorter peptide fragments of the sequence. Here is what part of the output might look like:
MENRWSVMIVWQVDRMRIRTWKSLVKHHMYVSGKARGWFYRHHYESPHPRISSEVHIGPGDQTLVITTYWGLH
MENRWSVMIVWQVDR (15)
     SVMIVWQVDRMRIR (14)
         VWQVDRMRIRTWKSL (15)
              RMRIRTWKSLVKHHM (15)
                   TWKSLVKHHMYVSGK (15)
                        VKHHMYVSGKARGWF (15)
                             YVSGKARGWFYRHHY (15)
                                  ARGWFYRHHYESPH (14)
                                      FYRHHYESPHPRI (13)
                                         HHYESPHPRISSEVH (15)
                                              PHPRISSEVHIGPGD (15) All C-term AAs forbidden
                                                   SSEVHIGPGDQTLVI (15)
                                                        IGPGDQTLVITTYW (14)
                                                            DQTLVITTYWGLH (13)
The first line is the submitted AA sequence, while the peptides generated by the program form the stairstep pattern below. The submitted protein sequence may have spaces and newline characters within it, but these (and a few other "funny characters") will be removed. The number in parentheses following each peptide records the length of the peptide. In this particular example, the user had specified that the program attempt to construct peptides of 15 AAs.
If you use this option, sometimes your peptides will be shorter (or
longer) than the selected length. This is because users can specify
amino acids they would prefer not to have at the end of their
peptides; in particular, Philip Goulder requested this option as he
did not want the peptides he was designing to end in any of the amino
acids GPEDQNTSC, as these amino acids are rarely found in the
C-terminal positions of CTL epitopes. The program will therefore
automatically extend or shorten particular peptides to avoid
user-specified amino acids. If the field is left blank all peptides
will be the same length.
In addition to forbidden C-term residues, the user may specify
forbidden N-term AAs. For example, some users have found peptides
that begin with Q (glutamine) difficult to synthesize, so
Q can be excluded as an N-term forbidden AA. This
ensures that the program will not generate any peptides that begin
with Q. If a Q is present at the position where
the program "wants" to start a peptide, it will move the start
leftwards by one space at a time until an allowed N-term AA is
found.
As you can see, the second peptide generated above (SVMIVWQVDRMRIR)
is only 14 AAs long because the fifteenth AA is a T, a member of
the forbidden set. The program reads the AA at the target length
(15). If the fifteenth AA is a forbidden AA then it looks at the
fourteenth AA. If it too is forbidden it looks at the thirteenth
and finally the twelfth, searching for an allowed AA. The first
position at which it finds a nonforbidden AA determines the length
of the peptide. How much the program will shorten the peptide in its
search for an allowed AA is specified by the user. In this example
the "shorten by" parameter was set to 3. If
all the AAs from 15 down to 12 are forbidden, then the program
begins to lengthen the peptide beyond the "ideal" of 15, one amino
acid at a time, until it finds an allowed AA. It will add amino acids
to the peptide up to the limit set by the "lengthen by" parameter (2 in this case). If all
AAs between 12 and 17 are forbidden, then the 15-mer is used
even though it ends in a forbidden AA. Such peptides are marked with
the words "All C-term AAs forbidden!". (There is one example of
this rare occurrence in the output reproduced above.) The number of
times such "forbidden" peptides are generated is reported at the end
of your output. In the peptides only the forbidden AAs near the
C-terminus are printed in bold and underlined.
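The search order described above can be written as a short Python sketch. It only illustrates the rules as stated on this page and is not PeptGen's actual source: the function name `choose_length` and its argument names are made up for this example, and skipping candidate lengths that would run past the end of the sequence is an assumption.

```python
def choose_length(seq, start, ideal=15, shorten_by=3, lengthen_by=2,
                  forbidden_c="GPEDQNTSC"):
    """Pick the length of the peptide that begins at index `start` (0-based).

    Returns (length, all_forbidden). `all_forbidden` is True when every
    candidate C-terminal AA was forbidden and the ideal length is used anyway.
    """
    # Try the ideal length first, then shorter lengths, then longer ones.
    shorter = list(range(ideal, ideal - shorten_by - 1, -1))
    longer = list(range(ideal + 1, ideal + lengthen_by + 1))
    for length in shorter + longer:
        end = start + length          # index just past the C-terminal AA
        if length < 1 or end > len(seq):
            continue                  # assumption: skip candidates past the end
        if seq[end - 1].upper() not in forbidden_c:
            return length, False
    # Nothing allowed was found: fall back to the ideal length and flag it.
    return min(ideal, len(seq) - start), True


seq = ("MENRWSVMIVWQVDRMRIRTWKSLVKHHMYVSGKARGWFYRHHYESPH"
       "PRISSEVHIGPGDQTLVITTYWGLH")
print(choose_length(seq, 5))    # second peptide of the example output
print(choose_length(seq, 46))   # the "All C-term AAs forbidden" peptide
```

With the example sequence, the first call gives (14, False) and the second gives (15, True), matching the 14-mer and the flagged 15-mer in the output shown earlier. Passing an empty `forbidden_c` always returns the ideal length.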
What determines the offset, or indentation between one peptide and the next, i.e., the "width" of each stairstep? It is set by yet another user-specified parameter, "Overlap peptide by", which for the example being discussed has been set to 10. This means that two consecutive peptides will have ten AAs in common. If a short peptide has been generated (say 13 AAs long), then to maintain the overlap of 10, the subsequent peptide will be "indented" only 3 places.
MENRWQVMIVWQV (13)
   ||||||||||
   RWQVMIVWQVDRMRI (15)
The practical result of specifying an overlap parameter is that n-mers of length = (overlap + 1) will be represented exactly once. Here is how this works for 2 peptides.
MENRWQVMIVWQVDR (15)        First peptide
MENRWQVMIVW           |
 ENRWQVMIVWQ          |
  NRWQVMIVWQV         |- 11-mers contained within first peptide
   RWQVMIVWQVD        |
    WQVMIVWQVDR       |
     QVMIVWQVDRMRIR (14)    Second peptide
     QVMIVWQVDRM      |
      VMIVWQVDRMR     |- 11-mers contained within second peptide
       MIVWQVDRMRI    |
        IVWQVDRMRIR   |
         VWQVDRMRIRTWKSL (15)   etc. ....
The reason for implementing this parameter is that in immunological studies it is usually the shorter peptides (9 to 11-mers) that are of most interest, but the cost and labor of generating every single one of these is prohibitive. As a compromise, longer peptides are synthesised in such a way as to ensure that shorter peptides of a given length (e.g. 11-mers) will be represented at least once in the longer peptides. The algorithm this program uses ensures that this will be the case.
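The coverage property can be checked with a few lines of Python. This is an illustrative sketch, not part of PeptGen: the function names `peptide_starts` and `kmer_coverage` are hypothetical, and it assumes every peptide starts exactly `overlap` AAs before the previous one ends (no N-term adjustment).

```python
from collections import Counter


def peptide_starts(lengths, overlap=10):
    """Start index of each peptide, given the peptide lengths in order."""
    starts = [0]
    for length in lengths[:-1]:
        # The next peptide begins `overlap` AAs before the current one ends.
        starts.append(starts[-1] + length - overlap)
    return starts


def kmer_coverage(lengths, overlap=10):
    """Count how many peptides contain the (overlap + 1)-mer at each position."""
    k = overlap + 1
    counts = Counter()
    for start, length in zip(peptide_starts(lengths, overlap), lengths):
        # A peptide of this length contains (length - k + 1) k-mers.
        for offset in range(length - k + 1):
            counts[start + offset] += 1
    return counts


lengths = [15, 14, 15]              # a 15-mer, a 14-mer and a 15-mer
print(peptide_starts(lengths))      # [0, 5, 9]
print(kmer_coverage(lengths))
```

For peptides of 15, 14 and 15 AAs with an overlap of 10, every 11-mer starting at positions 0 through 13 is counted exactly once, whatever the individual peptide lengths are.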
The "Proline rule" says that, no matter what, a peptide may not end in a proline residue. In this sense proline may be thought of as ultra-forbidden. We have seen that it is possible that all the AAs near the C-terminus are forbidden. In this case, the program selects the peptide of "ideal length" (15-mer in the examples above) even though it ends with a forbidden AA. But if the proline rule is being observed and the 15th AA is proline, then the 14-mer will be chosen as the peptide. Under the proline rule the algorithm will shorten the ideal peptide and then lengthen it, searching for non-proline AAs. If all AAs at the C-term are prolines, this is reported and the 15-mer is used.
The hydropathy index of each peptide generated will be calculated and shown [in square brackets] if this option is selected. It is done by assigning the Kyte-Doolittle hydropathy index to each AA in the peptide and calculating the average for the peptide (Kyte, J. and Doolittle, R.F., A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 157:105-132 (1982)).
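As an illustration of the averaging step, here is a minimal Python sketch using the published Kyte-Doolittle scale. The function name `mean_hydropathy` is hypothetical, and how PeptGen rounds or formats the value it prints is not stated on this page.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Average Kyte-Doolittle index over all AAs of the peptide."""
    # Raises KeyError for characters that are not one of the 20 standard AAs.
    values = [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]
    return sum(values) / len(values)


print(round(mean_hydropathy("MENRWSVMIVWQVDR"), 2))
```

Positive averages indicate a predominantly hydrophobic peptide, negative averages a predominantly hydrophilic one.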
In addition to computing the hydropathy index of peptides you may color each residue according to its Kyte-Doolittle hydropathy class as shown here.
It is possible that, when severe constraints on generating a peptide are applied to a protein, the algorithm cannot find an allowed peptide. An actual example from the Env protein of HIV-1, which illustrates both the algorithm and its failure, is described here.
The following constraints apply:
Ideal Peptide Length = 15
C-term forbidden = "GPEDQNTSC";
N-term forbidden = "Q";    <- Note Q is forbidden!
lengthen = 2; shorten = 3; overlap = 10;

aRqllsGIvqQQnnLLrAieaQQhllqLTvwGiKQL ...   the protein
aRqllsGIvqQQnnL (15)                       first pept (normal)
     sGIvqQQnnLLrAie                       second candidate pept shifted right 5
     sGIvqQQnnLLrAi (14)                   truncate at C-term (e forbidden)
         qQQnnLLrAieaQQh                   third candidate pept (q at N-term) shifted right 4
        vqQQnnLLrAieaQQ                    third pept second attempt move left 1 character
        vqQQnnLLrAiea (13)                 truncate at C-term (2 Q forbidden) note this !!!!
           QnnLLrAieaQQhll                 fourth pept shifted only 3 right because of 2 Q removed above
          QQnnLLrAieaQQhl                  fourth pept 2nd attempt forbidden N-term Q
         qQQnnLLrAieaQQh                   fourth pept 3rd attempt forbidden N-term q
        vqQQnnLLrAieaQQ                    fourth pept 4th attempt
        vqQQnnLLrAiea (13)                 truncate at C-term (2 Q forbidden) same as third peptide!!!!
        etc. ad infinitum ...
This results in looping over and over generating the same peptide repeatedly. Should this type of situation develop, the program will shift forward four (rather than three) characters after generating the second peptide. This will probably get things unwedged, but it means one 11-mer will not be represented. If the program really can't find a solution it will quit and you should try supplying less stringent requirements.
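The cause of the loop is easy to see if the N-term rule is written out. The sketch below is an illustration only, assuming the left shift works exactly as described on this page; the function name `shift_start_left` is made up for the example.

```python
def shift_start_left(seq, start, forbidden_n="Q"):
    """Move `start` left one position at a time until the AA there is allowed."""
    while start > 0 and seq[start].upper() in forbidden_n:
        start -= 1
    return start


env = "aRqllsGIvqQQnnLLrAieaQQhllqLTvwGiKQL"
third = shift_start_left(env, 9)     # candidate start of the third peptide
fourth = shift_start_left(env, 11)   # candidate start of the fourth peptide
print(third, fourth)                 # 8 8
```

Both candidate starts slide back to the same position (the "v" at index 8), so the fourth peptide is identical to the third and no progress is made along the protein.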
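Instead of submitting a single raw amino acid input sequence, the user may also submit an alignment, either as raw aligned sequences (first example below) or in fasta format (second example below).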
MENRWSVMIVWQ-VDRMRIRTWKSLVKHHMYVSKGKARGWFYRHH
MEQR-SVMIVWQQVDRMRLRTWKSLVKH-MYVSKGKARGWFYRHH
MEQR-SVMIVWQQVDRMRIRTWKSLVKH-MYVSKGKARGWFYRHH
Sequences should all be the same length and should not contain any extraneous name or identification information. Separate sequences by a carriage return, and be sure to check the box labeled "Aligned sequences" or PeptGen will think the three sequences are actually one sequence.
>seq name 1
MENRWSVMIVWQ-VDRMRIRTWK
SLVKHHMYVSKGKARGWFYRHH
>seq name 2
MEQR-SVMIVWQQVDRMRLRTWK
SLVKH-MYVSKGKARGWFYRHH
>seq name 3
MEQR-SVMIVWQQVDRMRIRTWK
SLVKH-MYVSKGKARGWFYRHH
In fasta format the name of each sequence occurs on a line starting with the ">" character. The sequence associated with that name follows on one or more lines until the next ">name" line occurs. PeptGen expects any gaps in the alignment to be shown as dash ("-") characters.
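If you want to check an alignment before submitting it, the fasta layout described above can be read with a few lines of Python. This is a minimal sketch for illustration, not the parser PeptGen uses; the function name `parse_fasta` is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from fasta-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])   # start a new record
        elif records:
            records[-1][1] += line                   # sequence may span lines
    return [(name, seq) for name, seq in records]


example = """>seq name 1
MENRWSVMIVWQ-VDRMRIRTWK
SLVKHHMYVSKGKARGWFYRHH
>seq name 2
MEQR-SVMIVWQQVDRMRLRTWK
SLVKH-MYVSKGKARGWFYRHH
"""
records = parse_fasta(example)
# Aligned sequences should all have the same length, gaps included.
print({len(seq) for _, seq in records})
```

If the printed set contains more than one length, the sequences are not properly aligned.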
The output of the above submission looks something like:
MENRWSVMIVWQ.VDRMRIRTWK ... etc
MENRWSVMIVWQ.VDRMRI (18)
--Q.--------Q-----L
--Q.--------Q------
        IVWQVDRMRIRTWK (14)
        ---------L----
        --------------
Note that the first sequence is taken as the "master" sequence, against which the other two sequences are compared. Identical amino acids will be shown as dashes, while gaps appear as "." characters.
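The display convention can be sketched in Python as follows. This is only an illustration of the rule stated above (dash for identity, "." for a gap, the residue itself for a difference); the function names `show_master` and `show_variant` are invented for the example, and the real program's handling of edge cases may differ.

```python
def show_master(master):
    """Master line: residues as-is, alignment gaps ('-') shown as '.'."""
    return master.replace("-", ".")


def show_variant(master, other):
    """Variant line: '-' where identical to master, '.' for a gap."""
    out = []
    for m, o in zip(master, other):
        if o == "-":
            out.append(".")      # gap in the variant sequence
        elif o == m:
            out.append("-")      # same residue as the master
        else:
            out.append(o)        # differing residue is printed
    return "".join(out)


print(show_master("MENRW-VD"))               # MENRW.VD
print(show_variant("MENRW-VD", "MEQR-QVD"))  # --Q-.Q--
```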
In the simple output option peptides are listed without indentation, coloring, extraneous comments or formatting. Simple output is available in 2 formats, "classic" (the old format) and "new".
"Classic" simple format. If an alignment has been submitted then blocks of peptides are delineated by a blank line. There is a good probability that any block of peptides in the alignment may contain duplicate peptides. The user can choose to eliminate duplicate peptides or flag them with a "!" character as shown in the example below. An alignment submission may contain gaps. Gaps may be saved or eliminated. The number to the left of the peptide is the sequence number within a block of peptides.
   Duplicates flagged                Duplicates removed
   Gaps preserved                    Gaps eliminated

1  MENRWSVMIVWQ-VDR  1-1-1        1  MENRWSVMIVWQVDR   1-1-1
2  MENRWSVMIVWQQVDR  1-2-1        2  MENRWSVMIVWQQVDR  1-2-1
3  MENRWSVMIVWQRVDR  1-3-1        3  MENRWSVMIVWQRVDR  1-3-1
4  MENRWSVMIVWQ-VDR  1-1-2!

1  SVMIVWQQVDRMRIR   2-1-1        1  SVMIVWQQVDRMRIR   2-1-1
2  SVMIVWQ-VDRMRIR   2-2-1        2  SVMIVWQDRMRIR     2-2-1
3  SVMIVWQ-VDRMRIR   2-2-2!       4  SVMIVWQVDRMRGR    2-3-1
4  SVMIVWQ-VDRMRGR   2-3-1
Each peptide is assigned a peptide ID, written to the right of the peptide, which consists of three numbers separated by dashes. In an alignment the first number is the peptide block number, the second number is the peptide "type", the third number is the occurrence order in the block, and the "!" is a handy visual tag for duplicates. The combination of the three numbers uniquely identifies each peptide in the entire set generated. Here is a simple alignment of "3-mers" to help you understand the ID number.
1 ABC 1-1-1
2 XYZ 1-2-1     2 means this is the 2nd peptide type in the block
3 ABC 1-1-2 !   2 means this is the 2nd occurrence of peptide type 1 in the block
4 XYZ 1-2-2 !
5 CCC 1-3-1     3 means this is the 3rd peptide type in the block
6 ABC 1-1-3 !
7 XYZ 1-2-3 !
If the submission is a single protein sequence, not an alignment, the peptides are simply numbered serially.
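The block-type-occurrence numbering used for alignments can be reproduced with a small Python sketch. It is an illustration of the ID scheme as described here, not PeptGen's code; the function name `classic_ids` is hypothetical.

```python
def classic_ids(block_number, peptides):
    """Assign 'block-type-occurrence' IDs to the peptides of one block."""
    type_of = {}       # peptide sequence -> type number within the block
    occurrences = {}   # peptide sequence -> how many times seen so far
    ids = []
    for peptide in peptides:
        if peptide not in type_of:
            type_of[peptide] = len(type_of) + 1   # a new peptide type
        occurrences[peptide] = occurrences.get(peptide, 0) + 1
        flag = "!" if occurrences[peptide] > 1 else ""   # "!" marks duplicates
        ids.append(f"{block_number}-{type_of[peptide]}-{occurrences[peptide]}{flag}")
    return ids


block = ["ABC", "XYZ", "ABC", "XYZ", "CCC", "ABC", "XYZ"]
for number, (peptide, pid) in enumerate(zip(block, classic_ids(1, block)), 1):
    print(number, peptide, pid)
```

Run on the "3-mer" alignment above, this produces the same seven IDs, from 1-1-1 through 1-2-3!.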
"New" simple format. An alignment is submitted. The results look like:
1 NAKSIIVQLNETVEI 1_1&4.2
2 NAKGIIVQLSETVEI 1_2.1
3 NAKVIIVQLNESVEI 1_3&13&15.3
4 NAKIIIVQLNESVEI 1_5&6&8&9&10&11&12&14&18&20.10
5 NESVEI 1_7&16&17&19.4 ?
6 IIVQLNETVEIDCTR 2_1.1
7 IIVQLSETVEIDCTR 2_2.1
8 IIVQLNESVEINCTR 2_3&5&6&8&9&10&11&12&13&14&15&18&20.13
9 IIVQLNETVEINCTR 2_4.1
10 NESVEINCTR 2_7&16&17&19.4 ?
11 LNETVEIDCTRPNNNTR 3_1.1
In the first peptide
1 NAKSIIVQLNETVEI 1_1&4.2
"1" is a serial number that means it is the first unique peptide sequence,
"NAKSIIVQLNETVEI" is its sequence,
"1_" means that it is in the first block of overlapping peptides,
"1&4" means that this exact 15-mer was found in sequences 1 & 4 of your alignment. A key that relates sequence name to alignment position is printed at the top of your output.
".2" means that two sequences matched this epitope (sequences 1 and 4).
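If you post-process the "new" simple output yourself, the ID field can be split into its parts as in the following Python sketch. The function name `parse_new_id` is hypothetical, and the sketch assumes the ID always has the form block, underscore, "&"-separated sequence numbers, dot, count, as in the examples above.

```python
def parse_new_id(peptide_id):
    """Split an ID such as '1_1&4.2' into (block, sequence numbers, count)."""
    block, rest = peptide_id.split("_", 1)
    sequences, count = rest.rsplit(".", 1)
    return int(block), [int(n) for n in sequences.split("&")], int(count)


block, sequences, count = parse_new_id("1_1&4.2")
print(block, sequences, count)     # 1 [1, 4] 2
# The trailing count should equal the number of sequences listed.
print(len(sequences) == count)     # True
```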