HIV Databases
HIV sequence database

Format Converter Explanation

Input formats recognized

In most cases the format of the input file is detected automatically. If you get an error message stating that your format cannot be recognized, specify the input format explicitly instead of choosing "Automatic". If you still receive this error, double-check the details of your format, or try removing all blank spaces from your sequence names.
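As an illustration of the last suggestion, the following is a minimal Python sketch that cleans blank spaces out of the name lines of a fasta file before submission. The function name `sanitize_fasta_names` and the choice of an underscore as the replacement character are assumptions made for this example; they are not part of the Format Converter tool.

```python
def sanitize_fasta_names(text: str, replacement: str = "_") -> str:
    """Replace runs of whitespace inside fasta name lines with `replacement`.

    Hypothetical helper for illustration; pass replacement="" to remove
    the blank spaces entirely instead of substituting them.
    """
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            # split() with no arguments collapses any run of spaces or tabs
            line = ">" + replacement.join(name.split())
        cleaned.append(line)
    return "\n".join(cleaned)


example = ">sample 01 env\nACGTACGT\n>sample 02 env\nACGTTCGT"
print(sanitize_fasta_names(example))
```

Only lines starting with ">" are changed, so the sequence data itself is passed through untouched.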

Input formats accepted are:

For descriptions of some common sequence formats, see Common Sequence Formats.

Output formats generated

Available output formats are listed below. GenBank, EMBL, MacVector, and BLAST are not supported as output formats.

File extensions used by the Format Converter tool were chosen to reflect the generated output:

Sequence format                 File extension
Output aligned                  .outali
GDE Flat                        .gdeflat
MEGA interleaved                .megai
MEGA sequential                 .megas
Nexus interleaved               .nexusi
Nexus sequential                .nexuss
Phylip standard interleaved     .phylipi
Phylip standard sequential      .phylips
Phylip relaxed interleaved      .rphylipi
Phylip relaxed sequential       .rphylips

Some of these file extensions are non-standard. You might have to change the file extension to its standard form (e.g., '.nxs' for Nexus) if you are using the generated output file with other software.
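The renaming step could be scripted along the following lines. This is only a sketch: apart from '.nxs' for Nexus, which is mentioned above, the target extensions in the mapping ('.phy', '.meg') are assumptions about common usage, and the function name `standard_name` is hypothetical. Check what your downstream software actually expects.

```python
from pathlib import Path

# Converter extension -> commonly used extension.
# Only ".nxs" comes from the text above; the others are assumed conventions.
STANDARD_EXTENSIONS = {
    ".nexusi": ".nxs",
    ".nexuss": ".nxs",
    ".phylipi": ".phy",
    ".phylips": ".phy",
    ".rphylipi": ".phy",
    ".rphylips": ".phy",
    ".megai": ".meg",
    ".megas": ".meg",
}


def standard_name(path: str) -> str:
    """Return `path` with its extension swapped for the assumed standard one.

    Paths with an extension that is not in the mapping are returned unchanged.
    """
    p = Path(path)
    new_suffix = STANDARD_EXTENSIONS.get(p.suffix.lower())
    return str(p.with_suffix(new_suffix)) if new_suffix else str(p)


print(standard_name("alignment.nexusi"))
print(standard_name("alignment.fasta"))
```

Note that the interleaved and sequential variants map to the same extension here, so renaming both outputs of one alignment into the same directory would make the names collide.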

Molecule type (nucleotide or amino acid)

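By default, your molecule type is determined automatically. However, for very short peptide sequences this detection can fail if the count of characters such as 'A', which occur in both the nucleotide and the amino acid alphabets, is high. In that case, set the molecule type explicitly instead of relying on automatic detection.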
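To see why short peptides are a problem, consider a simple composition-based guess. The sketch below is not the Format Converter's actual detection method, which is not described here; the function name, the nucleotide character set, and the 0.9 threshold are all assumptions chosen for illustration.

```python
# Characters treated as nucleotide-like in this sketch (an assumption).
NUCLEOTIDE_CHARS = set("ACGTUN-")


def guess_molecule_type(seq: str, threshold: float = 0.9) -> str:
    """Guess 'nucleotide' or 'amino acid' from character composition.

    Illustrative heuristic only: a sequence is called nucleotide when at
    least `threshold` of its characters are nucleotide-like.
    """
    residues = [c for c in seq.upper() if not c.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in NUCLEOTIDE_CHARS for c in residues)
    if nucleotide_like / len(residues) >= threshold:
        return "nucleotide"
    return "amino acid"


print(guess_molecule_type("ATGGCGTTAGCC"))  # nucleotide
print(guess_molecule_type("MKVLAAGIVGLW"))  # amino acid
print(guess_molecule_type("GATTACA"))       # nucleotide, though it is also a valid peptide
```

The last call shows the ambiguity: every letter of a short peptide such as GATTACA is also a nucleotide code, so no composition-based rule can tell the two readings apart.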

Enforce sequence name uniqueness

There are two situations where you may want to select this option. First, duplicated sequence names may cause problems in subsequent analyses; selecting this option ensures that your names are unique. Second, some sequence formats limit the number of characters in names, so unique names may be truncated into non-unique names unless you check this option. In particular, standard Phylip and SLX limit the number of characters in names.
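The second situation can be sketched in Python as follows. The renaming scheme (appending `_1`, `_2`, ... within the length limit) is an assumption made for this example and is not necessarily how the Format Converter makes names unique; the function name `unique_truncated_names` is likewise hypothetical.

```python
def unique_truncated_names(names, max_len=10):
    """Truncate names to `max_len` characters while keeping them unique.

    Illustrative scheme: when a truncated name collides with an earlier
    one, a numeric suffix replaces its last characters.
    """
    seen = set()
    result = []
    for name in names:
        candidate = name[:max_len]
        counter = 1
        while candidate in seen:
            suffix = f"_{counter}"
            # Shorten the stem so that stem + suffix still fits in max_len
            candidate = name[: max_len - len(suffix)] + suffix
            counter += 1
        seen.add(candidate)
        result.append(candidate)
    return result


names = ["HIV1_patient_A_env", "HIV1_patient_B_env", "HIV1_patient_A_env"]
print([n[:10] for n in names])        # plain truncation: all three collide
print(unique_truncated_names(names))  # truncation with uniqueness enforced
```

With plain truncation to 10 characters all three names become "HIV1_patie"; the sketch keeps them distinguishable while respecting the limit.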

Convert GenBank to GFF3

This translation option is provided specifically to convert the information from GenBank format files into GFF3 format. Unlike other translation options, this conversion retains the annotated data from the GenBank file, not just the name and sequence. If this option is selected, all other options are ignored.

Notes about specific formats


The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.

When this sequence is converted to a non-raw format it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence, so a raw input of two lines produces fasta output containing two sequences rather than one. Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
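The line-by-line interpretation of Raw input can be sketched as below. The name "seq1" for the first sequence is taken from the text above; numbering later sequences "seq2", "seq3", and so on, and skipping blank lines, are assumptions of this example rather than documented behavior of the tool.

```python
def raw_to_fasta(raw: str) -> str:
    """Convert raw input to fasta, treating every non-empty line as one sequence.

    Names after "seq1" (seq2, seq3, ...) are an assumed numbering scheme.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    records = []
    for index, sequence in enumerate(lines, start=1):
        records.append(f">seq{index}\n{sequence}")
    return "\n".join(records)


wrapped = "ACGTACGT\nTTGACCAA"
print(raw_to_fasta(wrapped))                     # two records, one per line

# Joining the lines first yields the single sequence that was intended
single_line = "".join(wrapped.split())
print(raw_to_fasta(single_line))                 # one record
```

Removing all whitespace before submission, as in the last two lines, is one way to make sure a wrapped sequence is read as a single record.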


Phylip files must begin with a line that looks like:

3  78  i
which shows the number of sequences in the file (3), the number of characters in each sequence (78), and then the letter "i" or "s" which indicates "interleaved" or "sequential". The i or s letters are optional.
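A minimal Python sketch of reading such a first line is shown below. The class and function names are hypothetical, and accepting an upper-case "I" or "S" is an assumption of this example.

```python
from typing import NamedTuple, Optional


class PhylipHeader(NamedTuple):
    n_sequences: int
    n_characters: int
    layout: Optional[str]  # "interleaved", "sequential", or None if omitted


def parse_phylip_header(line: str) -> PhylipHeader:
    """Parse a Phylip first line such as '3  78  i'."""
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError(f"not a Phylip header: {line!r}")
    n_sequences, n_characters = int(fields[0]), int(fields[1])
    layout = None
    if len(fields) == 3:
        flag = fields[2].lower()  # case-insensitivity is an assumption
        if flag not in ("i", "s"):
            raise ValueError(f"unexpected layout flag: {fields[2]!r}")
        layout = "interleaved" if flag == "i" else "sequential"
    return PhylipHeader(n_sequences, n_characters, layout)


print(parse_phylip_header("3  78  i"))
print(parse_phylip_header("3 78"))
```

Because the layout letter is optional, the sketch returns `None` for it when only the two counts are present.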

Standard phylip files have a limitation of 10 characters in the sequence names. For this reason, we also provide relaxed phylip options that will preserve the full length of your sequence names.

Phylip relaxed

The relaxed Phylip format is unique to the Format Converter tool. It is called 'relaxed' because it generates a Phylip-formatted file in which sequence names can be longer than 10 characters. Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, except that sequence names are not truncated to 10 characters. Instead, sequence names are left as they are and padded with whitespace to the length of the longest sequence name in the submitted data set. This ensures proper display of the aligned sequences in the interleaved format and consistent sequence name lengths for both interleaved and sequential formats.
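The padding rule can be illustrated with a short Python sketch that writes a sequential relaxed file. The function name is hypothetical, and the two spaces placed between the padded name and the sequence, as well as the header layout, are assumptions of this example; the exact spacing produced by the Format Converter may differ.

```python
def write_relaxed_phylip_sequential(records):
    """Format (name, sequence) pairs as sequential relaxed Phylip text.

    Names are kept in full and right-padded with spaces to the length
    of the longest name, so every sequence starts in the same column.
    """
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        # Two separating spaces after the padded name are an assumption
        lines.append(f"{name.ljust(width)}  {seq}")
    return "\n".join(lines)


records = [("short", "ACGT"), ("a_much_longer_name", "ACGA")]
print(write_relaxed_phylip_sequential(records))
```

Padding to the longest name, rather than to a fixed width of 10, is what lets long names survive while the sequence columns stay aligned.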


The format converter program deals with only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as table, will lose all the associated information except the sequence name and the sequence. For example, a Nexus file whose data block begins:

Begin data;
	Dimensions ntax=3 nchar=79;
	Format datatype=dna gap=-;
would produce a Fasta file in which the datatype (dna), number of taxa, etc., are not represented; only the names and sequences are kept.


Sequence names in SLX are limited to 32 characters. Any sequence names longer than that will be truncated in the format conversion process, which can result in non-unique sequence names in the generated output. If you need to preserve the uniqueness of your sequence names please use the check box labeled 'Enforce sequence name uniqueness' in the 'Options' panel.

Alternative tools


last modified: Mon Apr 6 14:29 2015

Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC. All rights reserved.