HIV sequence database

Format Converter Explanation

Input formats recognized by format converter:

GenBank, GenBank Raw (sequence only from a GenBank flat file), EMBL, Table, Fasta, Mase (= IG, Intelligenetics), NEXUS interleaved, NEXUS sequential, MEGA interleaved, MEGA sequential, Stockholm, Clustal, BLAST, RSF, Phylip interleaved, Phylip sequential, MSF, GCG, GDE, GDEFlat, Raw, SLX and MacVector.

For descriptions of some common sequence formats, see Common Sequence Formats.

Output formats producible by format converter:

In addition to the formats listed above, relaxed Phylip interleaved, relaxed Phylip sequential, comma-separated (CSV), and Pretty-print are supported as output formats. GenBank, EMBL, MacVector, and BLAST are supported as input only, not as output.

Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, with one exception: in the relaxed format, sequence names are not truncated to 10 characters. Instead, sequence names are kept in full and padded with whitespace based on the longest sequence name in the submitted data set. This keeps the aligned sequences properly lined up in the interleaved format and gives consistent sequence name lengths in both the interleaved and sequential formats.
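
The difference between the two naming rules can be sketched in a few lines of Python. This is only an illustration, not the converter's own code: the function name, the example names, and the choice of one extra space after the longest name are assumptions.

```python
def format_names(names, relaxed=True):
    """Return sequence names padded for a Phylip-style alignment block."""
    if relaxed:
        # Relaxed: keep full names and pad to the longest name.
        # The single extra separating space is an assumption.
        width = max(len(name) for name in names) + 1
        return [name.ljust(width) for name in names]
    # Standard: every name occupies exactly 10 characters.
    return [name[:10].ljust(10) for name in names]


# Hypothetical sequence names, used for illustration only.
names = ["sample_alpha_2001", "sample_alpha_2002", "ref"]
relaxed_names = format_names(names, relaxed=True)
standard_names = format_names(names, relaxed=False)
```

With these hypothetical names, the standard rule turns the first two names into the same 10-character string, which is one practical reason to prefer the relaxed format when names are long.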

File extensions used by the Format Converter tool were chosen to reflect the generated output and are:

Sequence Format                File Extension
Output aligned                 .outali
GDE Flat                       .gdeflat
MEGA interleaved               .megai
MEGA sequential                .megas
Nexus interleaved              .nexusi
Nexus sequential               .nexuss
Phylip standard interleaved    .phylipi
Phylip standard sequential     .phylips
Phylip relaxed interleaved     .rphylipi
Phylip relaxed sequential      .rphylips

You will notice that some of the file extensions are non-standard. You might have to change the file extension to its standard form (e.g. 'nxs' for Nexus) if you are using the generated output file with other software for downstream analysis.
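
If many files need renaming, a small script can do it. The sketch below is only a suggestion: the 'nxs' mapping comes from the note above, while the '.phy' mapping for Phylip is an assumption about common practice and should be checked against the downstream program. The function and dictionary names are hypothetical.

```python
from pathlib import Path

STANDARD_EXTENSION = {
    ".nexusi": ".nxs",    # from the note above
    ".nexuss": ".nxs",
    ".phylipi": ".phy",   # assumption: verify with the downstream tool
    ".phylips": ".phy",   # assumption
}


def standard_name(filename):
    """Return the file name with a more conventional extension, if known."""
    path = Path(filename)
    new_suffix = STANDARD_EXTENSION.get(path.suffix.lower())
    if new_suffix is None:
        return filename  # unknown extension: leave the name unchanged
    return str(path.with_suffix(new_suffix))
```

To rename a file on disk, the result can be passed to `Path(filename).rename(...)`.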

Sequence base

By default, the sequence base (nucleotide or amino acid) of the user's input data is determined automatically. However, we have observed that this detection can fail for very short peptide sequences if they contain a high proportion of characters such as 'A' that occur in both the nucleotide and the amino acid alphabets.
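
The following Python sketch shows why such a failure can happen. It is not the converter's actual detection method, which is not described here; it assumes a simple composition-based heuristic with a hypothetical threshold of 90% nucleotide-like characters.

```python
# Characters treated as nucleotide-like in this sketch (an assumption).
NUCLEOTIDE_CHARS = set("ACGTUN-")


def guess_sequence_base(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'amino acid' from character composition."""
    residues = [c for c in sequence.upper() if not c.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in NUCLEOTIDE_CHARS for c in residues)
    fraction = nucleotide_like / len(residues)
    return "nucleotide" if fraction >= threshold else "amino acid"


dna_guess = guess_sequence_base("ATGGCGTACCTGA")
protein_guess = guess_sequence_base("MKVLWAALLVTFLAGCQA")
# A short peptide (Ala-Ala-Gly-Thr-Cys-Ala) built only from letters
# that are also nucleotide codes is indistinguishable from DNA here.
short_peptide_guess = guess_sequence_base("AAGTCA")
```

Under this heuristic the short peptide is reported as nucleotide, because every one of its letters is also a valid nucleotide code.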

Notes on sequence names

The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.

When this sequence is converted to a non-raw format it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence, so multi-line Raw input produces fasta output with one entry per input line.
Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
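
The line-by-line rule can be illustrated with a short Python sketch. Only the name "seq1" is documented above; the names "seq2", "seq3", and so on for later lines, and the skipping of blank lines, are assumptions made for this example.

```python
def raw_to_fasta(raw_text):
    """Convert Raw input to fasta, treating each line as one sequence."""
    # Skipping blank lines is an assumption of this sketch.
    sequences = [line.strip() for line in raw_text.splitlines() if line.strip()]
    # Names beyond "seq1" follow an assumed seqN numbering.
    return "".join(
        f">seq{index}\n{sequence}\n"
        for index, sequence in enumerate(sequences, start=1)
    )


two_lines = raw_to_fasta("ACGTACGT\nTTGACCAA\n")
one_line = raw_to_fasta("ACGTACGTTTGACCAA")
```

Here `two_lines` contains two fasta entries, while `one_line` contains a single entry holding the full sequence.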

Phylip files must begin with a line that looks like

3  78  i
that shows the number of sequences in the file (3), the number of characters in each sequence (78), and then the letter "i" or "s", which indicates whether the file is "interleaved" or "sequential" respectively. The format converter requires the i or s letter.

The format converter program deals with only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as table, will lose all the associated information except the sequence name and the sequence. Converting a Nexus file like:
Begin data;
	Dimensions ntax=3 nchar=79;
	Format datatype=dna gap=-;
to fasta format would produce a file that contains only the names and sequences. The datatype (dna), number of taxa, etc. are not represented in the fasta file.
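
As a rough illustration of this information loss, the Python sketch below reduces a small Nexus data block to fasta. It is not the converter's implementation. The example file is hypothetical, and the parser assumes a sequential matrix with one taxon per line and the closing semicolon on its own line.

```python
# Hypothetical Nexus input, for illustration only.
EXAMPLE_NEXUS = "\n".join([
    "#NEXUS",
    "Begin data;",
    "    Dimensions ntax=3 nchar=8;",
    "    Format datatype=dna gap=-;",
    "    Matrix",
    "taxon_a ACGT-ACG",
    "taxon_b ACGTTACG",
    "taxon_c ACG--ACG",
    "    ;",
    "End;",
])


def nexus_to_fasta(nexus_text):
    """Keep only names and sequences from a simple sequential Nexus file."""
    records = []
    in_matrix = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if not in_matrix:
            # Everything before the Matrix keyword (dimensions,
            # datatype, gap symbol) is ignored.
            in_matrix = stripped.lower() == "matrix"
            continue
        if stripped.startswith(";"):
            break  # end of the matrix
        if stripped:
            name, sequence = stripped.split(None, 1)
            records.append((name, sequence.replace(" ", "")))
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)


fasta_text = nexus_to_fasta(EXAMPLE_NEXUS)
```

The resulting `fasta_text` holds three entries and nothing from the Dimensions or Format lines.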

Alternative tools


last modified: Wed Jan 21 15:13 2015

Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
