


Format Converter Explanation

Input

Both nucleotide and protein sequences are accepted. By default, the molecule type is determined automatically. (For short sequences, this detection can fail when characters that occur in both the nucleotide and amino acid alphabets, such as 'A', make up a large share of the sequence.)
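
The tool's actual detection method is not described on this page. As an illustration only, the sketch below shows one common kind of heuristic and why it can misjudge short inputs. The function name `guess_molecule_type`, the letter set, and the 0.9 threshold are all assumptions made for this example.

```python
# Hypothetical heuristic, NOT the Format Converter's documented method:
# call a sequence "nucleotide" when most of its letters are nucleotide letters.
NUCLEOTIDE_LETTERS = set("ACGTUN")  # assumed letter set


def guess_molecule_type(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the share of nucleotide letters."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    share = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if share >= threshold else "protein"
```

With a heuristic of this kind, a short peptide made up only of residues such as G, A, T and C is indistinguishable from DNA, which is why specifying the molecule type by hand can be necessary.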

The format of your input file will be detected automatically in most cases. If you get an error message stating that your format cannot be recognized, try specifying the input format instead of choosing "Automatic". If you still receive this error, double-check the details of your format, or try removing all blank spaces from your sequence names.

Input formats accepted are:

For descriptions of some common sequence formats, see Common Sequence Formats.

Output

Available output formats are listed below. GenBank, EMBL, MacVector, and BLAST are not supported.

File extensions assigned by this tool reflect the generated output.
Some of these file extensions are non-standard. If you are using the file as input in other software, you may need to change the file extension to its standard form (e.g., '.nxs' for Nexus).

Sequence output format          File extension
Output aligned                  .outali
Clustal                         .clustal
CSV                             .csv
Fasta                           .fasta
GCG                             .gcg
GDE Flat                        .gdeflat
GDE                             .gde
MASE                            .mase
MEGA interleaved                .megai
MEGA sequential                 .megas
MSF                             .msf
Nexus interleaved               .nexusi
Nexus sequential                .nexuss
Phylip standard interleaved     .phylipi
Phylip standard sequential      .phylips
Phylip relaxed interleaved      .rphylipi
Phylip relaxed sequential       .rphylips
PIR                             .pir
Pretty                          .pretty
Raw                             .raw
RSF                             .rsf
SLX                             .slx
Stockholm                       .stockholm
Table                           .table
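
If downstream software insists on a standard extension, the renaming can be scripted. The following is a minimal sketch: the function name `standard_name` and the mapping are hypothetical, and only the '.nxs' extension for Nexus comes from the text above. Add further entries to suit the software you use.

```python
from pathlib import PurePath

# Hypothetical mapping from this tool's extensions to standard ones.
# Only '.nxs' for Nexus is taken from the text; extend as needed.
STANDARD_EXTENSIONS = {
    ".nexusi": ".nxs",
    ".nexuss": ".nxs",
}


def standard_name(filename):
    """Return the file name with a standard extension, if one is mapped."""
    path = PurePath(filename)
    new_suffix = STANDARD_EXTENSIONS.get(path.suffix.lower())
    if new_suffix is None:
        return filename  # no mapping known, so leave the name unchanged
    return str(path.with_suffix(new_suffix))
```

The function only computes the new name; use it together with your usual file renaming command.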

Enforce sequence name uniqueness

There are two situations in which you may want to select this option. First, duplicated sequence names in your input may cause problems in other software. Second, some sequence formats limit the number of characters in names, so distinct names may be truncated into identical ones unless this option is checked. In particular, standard Phylip and SLX limit the length of names.

When selected, this option will change the sequence names, as in these examples:

Input names    Output names
SeqName        SeqName.a
SeqName        SeqName.b
SeqName        SeqName.c

Input names                             Output names
ReallyReallyReallyLongSequenceName_1    ReallyReallyReallyLongSequenceNa
ReallyReallyReallyLongSequenceName_2    ReallyReallyReallyLongSequenceNb
ReallyReallyReallyLongSequenceName_3    ReallyReallyReallyLongSequenceNc
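
The exact renaming rules are not spelled out on this page. The sketch below is only an approximation that reproduces the two examples above: duplicates get a '.a', '.b', ... suffix, and when a length limit leaves no room for the suffix, the last character is replaced by the letter instead. The function name `enforce_unique` and the `max_len` parameter are hypothetical.

```python
from collections import Counter
from string import ascii_lowercase


def enforce_unique(names, max_len=None):
    """Make names unique, optionally truncating them to max_len first.

    Approximation inferred from the examples above, not the tool's code.
    Supports at most 26 copies of any one name.
    """
    truncated = [n[:max_len] if max_len else n for n in names]
    totals = Counter(truncated)  # how often each (truncated) name occurs
    seen = Counter()             # copies of each name handled so far
    result = []
    for name in truncated:
        if totals[name] == 1:
            result.append(name)  # already unique, leave untouched
            continue
        letter = ascii_lowercase[seen[name]]
        seen[name] += 1
        if max_len and len(name) + 2 > max_len:
            # No room for ".x": overwrite the final allowed character.
            result.append(name[:max_len - 1] + letter)
        else:
            result.append(f"{name}.{letter}")
    return result
```

For instance, calling `enforce_unique` on three copies of "SeqName" yields the first table, and calling it with `max_len=32` on the three long names yields the second. The sketch does not check whether a generated name collides with another name already present in the input.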

Convert GenBank to GFF3

This option is provided specifically to convert the information in GenBank format files into GFF3 format. Unlike the other format conversions in this tool, this conversion retains the annotation data from the GenBank file, not just the name and sequence. If this option is selected, all other options are ignored.

For testing purposes, a sample GenBank format file is available for download.

Remove IUPAC characters

Some tools cannot handle IUPAC ambiguity codes in nucleotide sequences. This option replaces any character other than ACGTU with a dash character "-". This option is relevant only for nucleotide sequences.
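
The same replacement can be applied locally with a regular expression. This is a minimal sketch of the rule stated above; the function name `remove_iupac` is hypothetical, and treating lowercase a, c, g, t and u as valid bases is an assumption, since the text does not say how case is handled.

```python
import re

# Any character other than A, C, G, T or U.
# Assumption: lowercase bases are also kept (case-insensitive match).
_NON_ACGTU = re.compile(r"[^ACGTU]", re.IGNORECASE)


def remove_iupac(sequence):
    """Replace every character other than ACGTU with a dash."""
    return _NON_ACGTU.sub("-", sequence)
```

Because existing gap characters are themselves "other than ACGTU", they are replaced by a dash as well, which leaves dashes unchanged and preserves the sequence length.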

Notes about specific formats

Raw

The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.

ACATGTGCGCGCGATTATCTATCGATGCTACGTA
When this sequence is converted to a non-raw format it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence. Thus, the input
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
GCATGTGCACGCGATTATCTACCGATGCTACTTA
would produce the following Fasta output:
>seq1
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
>seq2
GCATGTGCACGCGATTATCTACCGATGCTACTTA
Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
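
The line-by-line behavior described above can be sketched as follows. The function name `raw_to_fasta` is hypothetical, and skipping blank lines is an assumption; the page does not say how the tool treats them.

```python
def raw_to_fasta(raw_text):
    """Convert Raw input to Fasta, one sequence per non-blank line.

    Sequences are named seq1, seq2, ... in input order.
    Assumption: blank lines are skipped rather than treated as sequences.
    """
    sequences = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "".join(
        f">seq{number}\n{sequence}\n"
        for number, sequence in enumerate(sequences, start=1)
    )
```

A sequence that has been wrapped over several lines therefore comes out as several short sequences, which is the pitfall this section warns about.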

Phylip

Phylip files must begin with a line that looks like:

3  78  i
This line gives the number of sequences in the file (3), the number of characters in each sequence (78), and an optional letter, "i" or "s", which indicates an interleaved or sequential layout.
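
To check a header line before submitting a file, a small parser such as the following can help. It is a sketch of the layout described above; the function name `parse_phylip_header` and its return shape are hypothetical.

```python
def parse_phylip_header(line):
    """Parse a Phylip header such as '3  78  i'.

    Returns (number_of_sequences, characters_per_sequence, layout), where
    layout is 'interleaved', 'sequential', or None if no letter is given.
    """
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields, got {len(fields)}")
    n_sequences, n_characters = int(fields[0]), int(fields[1])
    layout = None
    if len(fields) == 3:
        flag = fields[2].lower()
        if flag not in ("i", "s"):
            raise ValueError(f"unexpected layout letter: {fields[2]!r}")
        layout = "interleaved" if flag == "i" else "sequential"
    return n_sequences, n_characters, layout
```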

Standard Phylip files limit sequence names to 10 characters. For this reason, we also provide relaxed Phylip options that preserve the full length of your sequence names.

Phylip relaxed

The relaxed Phylip format is unique to the Format Converter tool. It is called 'relaxed' because sequence names can be longer than 10 characters. Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, except that sequence names are not truncated. Instead, names are left as they are and padded with spaces to the length of the longest sequence name in the submitted data set. This keeps the aligned sequences properly lined up in the interleaved format and gives a consistent name field width in both the interleaved and sequential formats.
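
The padding rule can be illustrated with a short writer for the sequential variant. This is a sketch only: the function name `write_relaxed_phylip` is hypothetical, and the single space between the padded name and the sequence, as well as the exact header spacing, are assumptions rather than the tool's documented output.

```python
def write_relaxed_phylip(records):
    """Write (name, sequence) pairs as relaxed sequential Phylip text.

    Names are padded with spaces to the length of the longest name,
    so every sequence starts in the same column.
    """
    if not records:
        raise ValueError("no sequences given")
    n_characters = len(records[0][1])
    if any(len(sequence) != n_characters for _, sequence in records):
        raise ValueError("all sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {n_characters}"]
    for name, sequence in records:
        # Assumed layout: padded name, one separating space, sequence.
        lines.append(f"{name.ljust(width)} {sequence}")
    return "\n".join(lines) + "\n"
```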

Nexus

The Format Converter handles only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as Table, will lose all associated information except the sequence name and the sequence. For example, this Nexus file:

#NEXUS
Begin data;
	Dimensions ntax=3 nchar=79;
	Format datatype=dna gap=-;
	Matrix
4axED43xco GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
2bxMD2b2x1 CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
2bxMD2b9x1 CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
;
End;
would produce the following Fasta file:
>4axED43xco
GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
>2bxMD2b2x1
CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
>2bxMD2b9x1
CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG

The datatype (dna), number of taxa, etc., are not represented in the Fasta file, only the names and sequences.
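
The loss of everything except names and sequences can be seen in a minimal extraction sketch. It handles only simple files like the example above and is not a general Nexus parser: it assumes a sequential matrix with one sequence per line, names without spaces or quotes, and no comments. The function name `nexus_matrix_to_fasta` is hypothetical.

```python
def nexus_matrix_to_fasta(nexus_text):
    """Extract names and sequences from a simple Nexus data matrix.

    Everything outside the Matrix block (dimensions, datatype, ...) is
    ignored, so only names and sequences reach the Fasta output.
    """
    records = []
    in_matrix = False
    for raw_line in nexus_text.splitlines():
        line = raw_line.strip()
        if not in_matrix:
            # Skip header lines until the 'Matrix' keyword is reached.
            in_matrix = line.lower() == "matrix"
            continue
        if line.startswith(";"):
            break  # a semicolon ends the matrix
        if line:
            name, sequence = line.split(None, 1)
            records.append((name, sequence.replace(" ", "")))
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)
```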

SLX

Sequence names in SLX are limited to 32 characters. Any sequence names longer than that will be truncated in the format conversion process, which can result in non-unique sequence names in the generated output. If you need to preserve the uniqueness of your sequence names please use the check box labeled 'Enforce sequence name uniqueness' in the 'Options' panel.
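
Before converting to SLX, you can check whether truncation would make any of your names collide. The sketch below groups names by their first 32 characters; the function name `find_truncation_collisions` is hypothetical.

```python
from collections import defaultdict

SLX_NAME_LIMIT = 32  # name length limit stated above


def find_truncation_collisions(names, limit=SLX_NAME_LIMIT):
    """Map each truncated name to the original names that collapse to it.

    Only truncated names shared by two or more originals are returned.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}
```

An empty result means truncation alone will not produce duplicate names. A non-empty result lists the names for which the uniqueness option is needed.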

Alternative tools

 

last modified: Wed Apr 20 11:42 2016


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved
