HIV sequence database



Format Converter Explanation

Input formats recognized by format converter:

GenBank, GenBank Raw (sequence only from a GenBank flat file), EMBL, Table, Fasta, Mase (= IG, Intelligenetics), NEXUS interleaved, NEXUS sequential, MEGA interleaved, MEGA sequential, Stockholm, Clustal, BLAST, RSF, Phylip interleaved, Phylip sequential, MSF, GCG, GDE, GDEFlat, Raw, SLX and MacVector.

For descriptions of some common sequence formats, see Common Sequence Formats.


Output formats producible by format converter:

In addition to the formats listed above, relaxed Phylip interleaved, relaxed Phylip sequential, comma-separated (CSV), and Pretty-print are supported as output formats. GenBank, EMBL, MacVector, and BLAST are not supported as output formats.

Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, with one exception: in the relaxed format, sequence names are not truncated to 10 characters. Instead, sequence names are left as they are and padded with spaces to the length of the longest sequence name in the submitted data set. This ensures proper display of the aligned sequences in the interleaved format and consistent sequence name lengths in both the interleaved and sequential formats.
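The difference in name handling can be illustrated with a short Python sketch. This is not the converter's own code; the function name `phylip_names` and the sample names are hypothetical, and padding short names to 10 characters in the standard case follows the usual Phylip convention rather than anything stated here.

```python
def phylip_names(names, relaxed=False):
    """Return the name field that precedes each sequence (illustrative only)."""
    if relaxed:
        # Relaxed: keep names whole and pad to the longest name in the data set.
        width = max(len(name) for name in names)
        return [name.ljust(width) for name in names]
    # Standard: truncate to 10 characters (shorter names are padded to 10).
    return [name[:10].ljust(10) for name in names]


# Hypothetical sequence names, chosen only to show the effect.
names = ["sample_with_long_name", "short"]
for field in phylip_names(names, relaxed=False):
    print(repr(field))
for field in phylip_names(names, relaxed=True):
    print(repr(field))
```

In the standard case the first name is cut to `sample_wit`; in the relaxed case both names are kept whole and occupy the same width.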


File extensions used by the Format Converter tool were chosen to reflect the generated output and are:

Sequence Format                 File Extension
Output aligned                  .outali
Clustal                         .clustal
CSV                             .csv
Fasta                           .fasta
GCG                             .gcg
GDE Flat                        .gdeflat
GDE                             .gde
MASE                            .mase
MEGA interleaved                .megai
MEGA sequential                 .megas
MSF                             .msf
Nexus interleaved               .nexusi
Nexus sequential                .nexuss
Phylip standard interleaved     .phylipi
Phylip standard sequential      .phylips
Phylip relaxed interleaved      .rphylipi
Phylip relaxed sequential       .rphylips
PIR                             .pir
Pretty                          .pretty
Raw                             .raw
RSF                             .rsf
SLX                             .slx
Stockholm                       .stockholm
Table                           .table

You will notice that some of the file extensions are non-standard. You might have to change the file extension to its standard form (e.g. 'nxs' for Nexus) if you are using the generated output file with other software for downstream analysis.
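If many output files need renaming, the new names can be worked out with a few lines of Python. The sketch below is only an illustration: the helper name `standard_name` is hypothetical, and the mapping contains only the Nexus case mentioned above; any further entries would have to match what your downstream software expects.

```python
from pathlib import Path

# Only the Nexus mapping comes from the text above; extend as needed.
STANDARD_SUFFIX = {".nexusi": ".nxs", ".nexuss": ".nxs"}


def standard_name(path):
    """Return the path with a conventional suffix, or unchanged if unknown."""
    p = Path(path)
    return p.with_suffix(STANDARD_SUFFIX.get(p.suffix, p.suffix))


print(standard_name("alignment.nexusi"))  # alignment.nxs
```

The function only computes the new name; the file itself could then be renamed with `Path.rename`.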

Sequence Alphabet

By default, the sequence alphabet (nucleotide or amino acid) of the user's input data is determined automatically. However, we have observed that this detection can fail for very short peptide sequences when characters that occur in both alphabets, such as 'A', make up a large share of the sequence.
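To see why short peptides are a problem, consider a simple frequency-based guess. The sketch below is not the converter's actual detection method, which is not described here; the function `guess_alphabet`, the character set, and the 0.9 threshold are all assumptions made for illustration.

```python
# Characters treated as nucleotide symbols in this illustration (an assumption).
NUCLEOTIDE_CHARS = set("ACGTUN-")


def guess_alphabet(seq, threshold=0.9):
    """Guess the alphabet from the share of nucleotide-like characters."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    hits = sum(ch in NUCLEOTIDE_CHARS for ch in seq)
    return "nucleotide" if hits / len(seq) >= threshold else "amino acid"


print(guess_alphabet("ACATGTGCGCGCGATTATCTATCGATGCTACGTA"))  # nucleotide
print(guess_alphabet("MKVLAAGA"))                            # amino acid
print(guess_alphabet("AAGCA"))                               # nucleotide
```

The last input could equally be a five-residue peptide (Ala-Ala-Gly-Cys-Ala), but every character is also a valid nucleotide code, so a guess based on character counts alone classifies it as nucleotide.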

Notes on sequence names

The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.

ACATGTGCGCGCGATTATCTATCGATGCTACGTA
When this sequence is converted to a non-raw format, it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence. Thus, the input
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
GCATGTGCACGCGATTATCTACCGATGCTACTTA
would produce the following fasta output:
>seq1
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
>seq2
GCATGTGCACGCGATTATCTACCGATGCTACTTA
Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
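The line-by-line naming rule described above can be sketched in Python as follows. This is a minimal illustration, not the converter's implementation; the function name `raw_to_fasta` is hypothetical, and skipping blank lines is an assumption.

```python
def raw_to_fasta(raw_text):
    """Treat every non-empty line as one sequence named seq1, seq2, ..."""
    # Skipping blank lines is an assumption, not documented behaviour.
    sequences = [line.strip() for line in raw_text.splitlines() if line.strip()]
    out = []
    for number, seq in enumerate(sequences, start=1):
        out.append(f">seq{number}")
        out.append(seq)
    return "\n".join(out)


raw = (
    "ACATGTGCGCGCGATTATCTATCGATGCTACGTA\n"
    "GCATGTGCACGCGATTATCTACCGATGCTACTTA\n"
)
print(raw_to_fasta(raw))
```

A single sequence that is wrapped over two lines would come out as "seq1" and "seq2", which is why a single raw sequence has to stay on one line.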

Phylip files must begin with a line that looks like

3  78  i
that shows the number of sequences in the file (3), the number of characters in each sequence (78), and then the letter "i" or "s", which indicates whether the file is "interleaved" or "sequential", respectively. The format converter requires the "i" or "s" letter.

The format converter program deals with only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as Table, will lose all the associated information except the sequence name and the sequence. Converting a Nexus file like:
#NEXUS
Begin data;
	Dimensions ntax=3 nchar=79;
	Format datatype=dna gap=-;
	Matrix
4axED43xco GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
2bxMD2b2x1 CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
2bxMD2b9x1 CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
;
End;
to fasta format would produce the following file:
>4axED43xco
GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
>2bxMD2b2x1
CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
>2bxMD2b9x1
CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
The datatype (dna), number of taxa, etc. are not represented in the fasta file, only the names and sequences.
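The loss of everything but names and sequences can be illustrated with a small Python sketch that reads only the Matrix block of a Nexus file like the one above. It is not the converter's code; the function name `nexus_matrix_to_fasta` is hypothetical, and it assumes a sequential (non-interleaved) matrix with one taxon per line and names that contain no whitespace.

```python
def nexus_matrix_to_fasta(nexus_text):
    """Keep only the names and sequences found in the Matrix block."""
    records = []
    in_matrix = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if not in_matrix:
            # Dimensions, Format, etc. are skipped and therefore lost.
            in_matrix = stripped.lower() == "matrix"
            continue
        if stripped.startswith(";"):
            break  # end of the matrix
        if stripped:
            name, seq = stripped.split(None, 1)
            records.append((name, seq.replace(" ", "")))
    return "\n".join(f">{name}\n{seq}" for name, seq in records)


# Shortened, made-up example in the same layout as the file above.
example = """#NEXUS
Begin data;
	Dimensions ntax=2 nchar=8;
	Format datatype=dna gap=-;
	Matrix
taxonA GGAG-CCC
taxonB CGCTGTTG
;
End;
"""
print(nexus_matrix_to_fasta(example))
```

Nothing from the Dimensions or Format lines reaches the output, which mirrors the behaviour described above.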


last modified: Wed Jun 18 13:44 2014


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC. All rights reserved.
