HIV sequence database



Format Converter Explanation

Input formats recognized

The format of the input file can be automatically detected in most cases. If you get an error message stating that your format cannot be recognized, try specifying the input format instead of choosing "Automatic". If you still receive this error, double check the details of your format, or try removing all blank spaces from your sequence names.

Input formats accepted are:

For descriptions of some common sequence formats, see Common Sequence Formats.

Output formats generated

Available output formats are listed below. GenBank, EMBL, MacVector, and BLAST are not supported.

File extensions used by the Format Converter tool were chosen to reflect the generated output:

Sequence format               File extension
Output aligned                .outali
Clustal                       .clustal
CSV                           .csv
Fasta                         .fasta
GCG                           .gcg
GDE Flat                      .gdeflat
GDE                           .gde
MASE                          .mase
MEGA interleaved              .megai
MEGA sequential               .megas
MSF                           .msf
Nexus interleaved             .nexusi
Nexus sequential              .nexuss
Phylip standard interleaved   .phylipi
Phylip standard sequential    .phylips
Phylip relaxed interleaved    .rphylipi
Phylip relaxed sequential     .rphylips
PIR                           .pir
Pretty                        .pretty
Raw                           .raw
RSF                           .rsf
SLX                           .slx
Stockholm                     .stockholm
Table                         .table

Some of these file extensions are non-standard. If you use the generated output file with other software, you might have to change the file extension to its standard form (e.g., '.nxs' for Nexus).
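
As a rough illustration of renaming output files, the following Python sketch swaps a Format Converter extension for a more widely used one. The function name and the mapping are hypothetical; only '.nxs' for Nexus comes from the text above, and the other entries are assumptions that you should adjust to whatever your downstream software expects.

```python
from pathlib import Path

# Hypothetical mapping from Format Converter extensions to extensions
# commonly used by other software. Only '.nxs' is taken from the text;
# the remaining entries are assumptions.
STANDARD_EXTENSIONS = {
    ".nexusi": ".nxs",
    ".nexuss": ".nxs",
    ".phylipi": ".phy",  # assumption
    ".phylips": ".phy",  # assumption
    ".clustal": ".aln",  # assumption
}


def standard_name(filename: str) -> str:
    """Return filename with its extension replaced by a more common one.

    Filenames whose extension is not in the mapping are returned unchanged.
    """
    path = Path(filename)
    new_suffix = STANDARD_EXTENSIONS.get(path.suffix.lower())
    if new_suffix is None:
        return filename
    return str(path.with_suffix(new_suffix))
```

For example, standard_name("alignment.nexusi") gives "alignment.nxs", while a name such as "alignment.fasta" is left as it is.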

Molecule type (nucleotide or amino acid)

By default, your molecule type is determined automatically. However, for very short peptide sequences this detection can fail if the sequence contains a high proportion of characters such as 'A', which occur in both the nucleotide and the amino acid alphabets.
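
To see why short peptides can be misclassified, consider a simple composition-based guess. The sketch below is not the Format Converter's actual algorithm, which is not described here; the function name, the character set, and the 0.9 threshold are all assumptions chosen for illustration.

```python
# Characters treated as nucleotide codes in this sketch (an assumption;
# a real tool may also accept the full set of IUPAC ambiguity codes).
NUCLEOTIDE_CHARS = set("ACGTUN-")


def guess_molecule_type(sequence: str, threshold: float = 0.9) -> str:
    """Guess 'nucleotide' or 'amino acid' from character composition."""
    residues = [c for c in sequence.upper() if not c.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    # Fraction of characters that could be read as nucleotide codes.
    fraction = sum(c in NUCLEOTIDE_CHARS for c in residues) / len(residues)
    return "nucleotide" if fraction >= threshold else "amino acid"
```

With this heuristic, a peptide such as MKVLAAGIW is recognized as amino acid, but a short peptide made up only of G, A, T, and C residues (for example GATTACA) is indistinguishable from DNA and is reported as nucleotide.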

Enforce sequence name uniqueness

There are two situations in which you may want to select this option. First, if any of your sequence names might be duplicated, the duplicates can cause problems in subsequent analyses; selecting this option ensures that all names are unique. Second, some sequence formats limit the number of characters in names, so names that were unique may be truncated into non-unique names unless you check this option. In particular, standard Phylip and SLX limit the number of characters in names.
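
One way to picture what enforcing uniqueness involves is the sketch below. It is only an illustration under assumed rules (a numeric suffix such as "_1" is appended to repeated names); the Format Converter's own renaming scheme is not documented here, and the function name is hypothetical.

```python
def make_unique(names, width=None):
    """Return a list of unique names, optionally truncated to `width` characters.

    Repeated names receive a numeric suffix (an assumed scheme). When a
    width limit applies, the base name is shortened so that the suffixed
    name still fits within the limit.
    """
    seen = set()
    result = []
    for name in names:
        base = name if width is None else name[:width]
        candidate = base
        counter = 1
        while candidate in seen:
            suffix = f"_{counter}"
            keep = len(base) if width is None else max(0, width - len(suffix))
            candidate = base[:keep] + suffix
            counter += 1
        seen.add(candidate)
        result.append(candidate)
    return result
```

For example, with a 10-character limit the names HIV1_isolate_A and HIV1_isolate_B both truncate to HIV1_isola; this sketch keeps the first as HIV1_isola and renames the second to HIV1_iso_1.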

Convert GenBank to GFF3

This translation option is provided specifically to convert the information in GenBank format files into GFF3 format. Unlike other translation options, this conversion retains the annotated data from the GenBank file, not just the name and sequence. If this option is selected, all other options are ignored.

Notes about specific formats

Raw

The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.

ACATGTGCGCGCGATTATCTATCGATGCTACGTA
When this sequence is converted to a non-raw format it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence. Thus, the input
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
GCATGTGCACGCGATTATCTACCGATGCTACTTA
would produce the following fasta output:
>seq1
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
>seq2
GCATGTGCACGCGATTATCTACCGATGCTACTTA
Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
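
The line-per-sequence rule described above can be sketched in Python as follows. The function name is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behavior of the tool.

```python
def raw_to_fasta(raw_text: str) -> str:
    """Convert Raw input to Fasta, treating each line as one sequence.

    Sequences are named seq1, seq2, ... in input order. Blank lines are
    skipped (an assumption of this sketch).
    """
    sequences = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "".join(
        f">seq{number}\n{sequence}\n"
        for number, sequence in enumerate(sequences, start=1)
    )
```

Passing the two-line input shown above yields the records seq1 and seq2, which is why a single sequence that has been wrapped over several lines would be split into several records.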

Phylip

Phylip files must begin with a line that looks like:

3  78  i
which shows the number of sequences in the file (3), the number of characters in each sequence (78), and then the letter "i" or "s" which indicates "interleaved" or "sequential". The i or s letters are optional.

Standard phylip files have a limitation of 10 characters in the sequence names. For this reason, we also provide relaxed phylip options that will preserve the full length of your sequence names.
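
If you want to check a Phylip header line before submitting a file, a small parser along these lines may help. It is a minimal sketch based only on the description above; the function name and error messages are hypothetical.

```python
def parse_phylip_header(line: str):
    """Parse a Phylip header such as '3  78  i'.

    Returns (number_of_sequences, sequence_length, layout), where layout
    is 'interleaved', 'sequential', or None when the optional letter is
    absent.
    """
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields, got {len(fields)}")
    n_sequences, n_characters = int(fields[0]), int(fields[1])
    layout = None
    if len(fields) == 3:
        flag = fields[2].lower()
        if flag not in ("i", "s"):
            raise ValueError(f"unexpected layout flag: {fields[2]!r}")
        layout = "interleaved" if flag == "i" else "sequential"
    return n_sequences, n_characters, layout
```

Comparing the returned counts with the number and length of the sequences actually present in the file is a quick way to catch a mismatched header.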

Phylip relaxed

The relaxed Phylip format is unique to the Format Converter tool. It is called 'relaxed' because it generates a Phylip formatted file in which sequence names can be longer than 10 characters. Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, except that sequence names are not truncated to 10 characters. Instead, sequence names are left as they are and padded with spaces to the length of the longest sequence name in the submitted data set. This ensures proper display of the aligned sequences in the interleaved format and consistent sequence name lengths for both interleaved and sequential formats.
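
The padding rule can be illustrated with a small writer for the sequential layout. This is a sketch under assumptions: the function name is hypothetical, and the exact spacing of the header line and between name and sequence in the tool's real output may differ.

```python
def write_relaxed_phylip(records):
    """Write (name, sequence) pairs as relaxed sequential Phylip text.

    Names are not truncated; each is padded with spaces to the length of
    the longest name so that all sequences start in the same column.
    """
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    name_width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, sequence in records:
        # One separating space after the padded name (an assumption).
        lines.append(f"{name.ljust(name_width)} {sequence}")
    return "\n".join(lines) + "\n"
```

Because every name occupies the same width, a short name and a much longer one are followed by sequences that line up column for column.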

Nexus

The Format Converter deals with only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as Table, will lose all the associated information except the sequence name and the sequence. For example, this Nexus file:

#NEXUS
Begin data;
	Dimensions ntax=3 nchar=79;
	Format datatype=dna gap=-;
	Matrix
4axED43xco GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
2bxMD2b2x1 CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
2bxMD2b9x1 CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
;
End;
would produce the following Fasta file:
>4axED43xco
GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
>2bxMD2b2x1
CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
>2bxMD2b9x1
CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG

The datatype (dna), number of taxa, etc., are not represented in the Fasta file, only the names and sequences.
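
The reduction to names and sequences can be sketched as below. This is a deliberately simplified reader, not the tool's parser: it assumes a Nexus file laid out like the example above, with one taxon per line in the Matrix block and the closing ';' on its own line, and the function name is hypothetical.

```python
def nexus_matrix_to_fasta(nexus_text: str) -> str:
    """Extract names and sequences from a simple Nexus Matrix block.

    Everything outside the matrix (dimensions, datatype, and so on) is
    discarded, mirroring the loss of information described in the text.
    """
    records = []
    in_matrix = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if not in_matrix:
            # Skip all lines until the 'Matrix' keyword is reached.
            if stripped.lower() == "matrix":
                in_matrix = True
            continue
        if stripped.startswith(";"):
            break  # end of the matrix block
        if not stripped:
            continue
        name, sequence = stripped.split(None, 1)
        records.append((name, sequence.replace(" ", "")))
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)
```

Interleaved matrices, quoted names, and comments in square brackets are not handled by this sketch.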

SLX

Sequence names in SLX are limited to 32 characters. Any sequence names longer than that will be truncated in the format conversion process, which can result in non-unique sequence names in the generated output. If you need to preserve the uniqueness of your sequence names please use the check box labeled 'Enforce sequence name uniqueness' in the 'Options' panel.
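
Before converting to SLX, it can be useful to check whether any of your names would collide after truncation. The following sketch uses the 32-character limit stated above; the function name is hypothetical.

```python
from collections import defaultdict

SLX_NAME_LIMIT = 32  # limit stated in the text above


def truncation_collisions(names, limit=SLX_NAME_LIMIT):
    """Group names that become identical when cut to `limit` characters.

    Returns a dict mapping each truncated name to the list of original
    names that share it; names that stay unique are omitted.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}
```

An empty result means truncation alone will not produce duplicate names in the output.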

Alternative tools

 

last modified: Mon Apr 6 14:29 2015


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
