HIV sequence database

ElimDupes Explanation


There are various ways of defining "duplicateness" in two sequences.

1. The strongest definition would be the case in which two sequences match exactly as in:

     ACCCTGATTAGC   seq1
     ACCCTGATTAGC   seq2

2. Slightly less strong than perfect match is the situation in which the sequences match in all respects except the case of the letters:

     ACCCTGATTAGC   seq1
     aCCCtGATTaGC   seq2

3. A third consideration is the case of gaps and other non-letter or "extraneous" characters. With gaps removed, the two sequences below are duplicates.

     ACCCTGATTAGC       seq1
     ACCCT----GATTAGC   seq2

4. Fourth, there is the case of one sequence that matches part of another:

     ACCCTGATTAGC   seq1
         TGAT       seq2

5. A final consideration is the similarity of sequences. In the example below, 8 of the 10 bases of seq2 match the base at the same position in seq1. Thus, the two sequences are said to be 80% similar.

     ACCCTGATTA   seq1
     ACCCGTATTA   seq2


The tool accepts:

Option Summary
Option Details

Remove extraneous characters from sequences

'No' (default) means that gaps and other non-letter characters will not be removed and thus will be included in the comparisons. In this case, the following two sequences will not be considered duplicates:

     ACCCTGATTAGC       seq1
     ACCCT----GATTAGC   seq2

If this option is changed to 'Yes', the gaps will be removed from seq2 and the two sequences will be treated as duplicates.
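The effect of this option can be illustrated with a minimal Python sketch. It assumes that "extraneous" means any character that is not a letter; the function name `strip_extraneous` is hypothetical and is not part of ElimDupes.

```python
import re

def strip_extraneous(seq: str) -> str:
    """Drop every character that is not an ASCII letter (gaps, digits, spaces).

    Assumption: this is what "extraneous" means; the tool's exact rule
    is not documented here.
    """
    return re.sub(r"[^A-Za-z]", "", seq)

seq1 = "ACCCTGATTAGC"
seq2 = "ACCCT----GATTAGC"

# Option set to 'No': gaps take part in the comparison.
print(seq1 == seq2)                                      # False
# Option set to 'Yes': gaps are removed before comparing.
print(strip_extraneous(seq1) == strip_extraneous(seq2))  # True
```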

Make all letters uppercase

'Yes' (default) converts all characters to upper case. With this setting the following two sequences will be treated as duplicates:

     ACCCTGATTAGC   seq1
     aCCCtGATTaGC   seq2

If this option is changed to 'No' the above two sequences will not be considered duplicates.

Consider subsequences as duplicates

'Yes' (default) means that a shorter sequence that is contained within a longer sequence will be considered a duplicate. For example, consider the two sequences:

     ACCCTGATTAGC       seq1
     ACCCT-------       seq2

If gaps are removed (Remove extraneous characters set to 'Yes') then the sequences become:

     ACCCTGATTAGC       seq1
     ACCCT              seq2

If Consider subsequences as duplicates = 'Yes', then seq2 will be considered a duplicate of seq1, otherwise not.
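A minimal Python sketch of the subsequence test follows. It assumes that "contained within" means the shorter sequence occurs as one contiguous run of the longer one; the function name `is_subsequence_duplicate` is hypothetical.

```python
def is_subsequence_duplicate(shorter: str, longer: str) -> bool:
    """True if `shorter` occurs as a contiguous run inside `longer`.

    Assumption: containment means an exact, contiguous substring match.
    """
    return len(shorter) <= len(longer) and shorter in longer

seq1 = "ACCCTGATTAGC"
seq2 = "ACCCT-------".replace("-", "")  # gaps removed first, leaving "ACCCT"

print(is_subsequence_duplicate(seq2, seq1))    # True
print(is_subsequence_duplicate("TGAT", seq1))  # True
print(is_subsequence_duplicate("TGAA", seq1))  # False
```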

Restore original sequences in output

'Yes' (default) means the downloadable output file will contain the original sequences in their unchanged form, rather than the form altered by tool options such as changing case or stripping gaps.

Eliminate sequences more similar than...

In the example below, 8 of the 10 bases of seq2 match the base at the same position in seq1. Thus, the two sequences are said to be 80% similar. If this option is set to 79% or less, these two sequences will be treated as duplicates. If the option is set to 80% or higher, then these sequences will not be considered duplicates.

     ACCCTGATTA   seq1
     ACCCGTATTA   seq2

Analyze input by groups

This option performs the analysis and produces files of unique sequences by group. A "group" is defined by the first N characters of the sequence name. For example, if your sequence set is based on samples taken at specific points in time for a given patient, then your labels might be something like:


If you enter 6 in the "Analyze input by groups" box, then ElimDupes will group the sequences by the first 6 characters of their names and treat each group separately.

Note that if you choose to create a file of unique sequences with _count added... the resulting file will contain the unique sequences for all groups, with a blank line between groups. This allows you to easily cut and paste the entire results, or just the results for a given group.
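The grouping rule described above can be sketched in Python. The function name `group_by_prefix` is hypothetical, and the sample names are borrowed from the summary table further down this page, where a prefix length of 2 separates "A1" from "A3".

```python
from collections import defaultdict

def group_by_prefix(names, n):
    """Group sequence names by their first n characters, keeping input order."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:n]].append(name)
    return dict(groups)

names = ["A1_seq1", "A1_seq2", "A3_seq1", "A3_seq2"]
groups = group_by_prefix(names, 2)
print(groups)
# {'A1': ['A1_seq1', 'A1_seq2'], 'A3': ['A3_seq1', 'A3_seq2']}
```

Each group would then be analyzed for duplicates on its own, independently of the other groups.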

Create File of unique sequences with _count added...

This option will create an additional file of unique sequences where the number of occurrences (count) of a given sequence is appended to the sequence name.
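As a rough illustration, the Python sketch below collapses exact duplicates and appends the occurrence count to the name. It assumes that the first name seen for each distinct sequence is the one kept, which this page does not state; the function name `unique_with_counts` and the sample records are hypothetical.

```python
from collections import Counter

def unique_with_counts(records):
    """Collapse exact duplicates; return (name_count, sequence) pairs.

    records: list of (name, sequence) tuples.
    Assumption: the first name seen for each distinct sequence is kept.
    """
    counts = Counter(seq for _, seq in records)
    first_name = {}
    for name, seq in records:
        first_name.setdefault(seq, name)
    return [(f"{first_name[seq]}_{counts[seq]}", seq) for seq in first_name]

records = [("s1", "ACGT"), ("s2", "ACGT"), ("s3", "TTGA")]  # hypothetical input
print(unique_with_counts(records))
# [('s1_2', 'ACGT'), ('s3_1', 'TTGA')]
```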

Include rank in sequence names

In addition to marking sequence names with the occurrence count, you can opt to mark them with both the rank and the count. Ranks are assigned by count, so the sequence with the highest count has a rank of 1. When this option is selected, ".rank_count" is appended to the end of each sequence name.

Note that if "Analyze input by groups" is selected, the counts (and ranks, if chosen) will be reset at the beginning of each group. The output for all groups will be combined in a single file with a blank line between groups.
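The ".rank_count" naming can be sketched in Python as follows. How ElimDupes ranks sequences with equal counts is not stated here, so the sketch assumes that ties keep their input order and still receive distinct ranks. The function name `add_rank_and_count` and the sample names are hypothetical.

```python
def add_rank_and_count(counts):
    """Append ".rank_count" to each name; the highest count gets rank 1.

    counts: dict mapping sequence name -> occurrence count.
    Assumption: ties keep their input order (Python's sort is stable).
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{name}.{rank}_{count}"
        for rank, (name, count) in enumerate(ordered, start=1)
    ]

print(add_rank_and_count({"seqA": 3, "seqB": 12, "seqC": 1}))
# ['seqB.1_12', 'seqA.2_3', 'seqC.3_1']
```

With "Analyze input by groups" selected, a function like this would be applied to each group separately, so that ranks restart at 1 in every group.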

This option is helpful for handling deep sequencing data, reducing the sequences to unique forms with their counts and ranks. Sometimes these files need to be trimmed after alignment. Trimming the ends can make more sequences identical, so the file can be reduced further. For example:




If the user trimmed the last 2 bases, and re-entered the alignment with .rank_count (as above), it would give:


Sequence names end in '_nn'

If your sequence names already carry occurrence counts encoded as a trailing '_nn', check this box. If you don't check this box, the program will assume that any '_nn' is simply part of the sequence name.

The reason this is necessary is that some accession numbers (NC_123456) contain '_nn', so this option allows the program to tell the difference.
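The ambiguity can be shown with a short Python sketch. It assumes that '_nn' means an underscore followed by digits at the very end of the name, and that a name without a count stands for a single occurrence. The function name `split_count` and the name "seq7_25" are hypothetical.

```python
import re

# Assumption: a count suffix is "_" followed by digits at the end of the name.
_COUNT_SUFFIX = re.compile(r"^(.*)_(\d+)$")

def split_count(name, names_end_in_count):
    """Return (base_name, count); the count defaults to 1 when none is parsed."""
    if names_end_in_count:
        match = _COUNT_SUFFIX.match(name)
        if match:
            return match.group(1), int(match.group(2))
    return name, 1

print(split_count("seq7_25", True))     # ('seq7', 25)
print(split_count("NC_123456", False))  # ('NC_123456', 1)
# With the box checked, an accession number would be misread as a count:
print(split_count("NC_123456", True))   # ('NC', 123456)
```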


1----- First, ElimDupes displays the option settings for this run:

Options used:
Remove extraneous characters from sequences: true
Make all letters uppercase: true
Consider subsequences as duplicates: true
Use original sequences in output: true
Create a file of unique sequences with _count: true
Add rank to unique sequences with count (.rank_count format): true

2----- Next, ElimDupes displays links to View and Download the file with _count (and optional rank) added, if selected:

Unique sequences with rank and count appended (.rank_count):      View    Download

3----- Next, ElimDupes displays the analysis. Note that if "Analyze input by groups" is selected, this section will repeat for each group.

Unique sequences file:                     View    Download

Duplicate (eliminated) sequences file:     View    Download

Tab-delimited summary table below:                 Download

Unique             Number of   Duplicate
sequences         duplicates   sequences
A3_seq1                    2   A3_seq2, A3_seq4
A1_seq1                    3   A1_seq2, A1_seq3, A1_seq4
Total unique seqs = 3
Total duplicate seqs = 5

last modified: Tue Jun 16 12:37 2015


Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
