Both nucleotide and protein sequences are accepted. By default, the molecule type is determined automatically. (For short sequences, this detection can sometimes fail when characters such as 'A', which is valid in both nucleotide and protein sequences, make up a large share of the sequence.)
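The caveat about short sequences can be illustrated with a minimal sketch of a frequency-based guess. This is not the tool's actual detection method, which is not documented here; the function name `guess_molecule_type`, the character set, and the 0.9 threshold are all assumptions made for illustration.

```python
# Hypothetical heuristic, NOT the converter's real algorithm.
NUCLEOTIDE_CHARS = set("ACGTUN")


def guess_molecule_type(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the share of nucleotide-like letters."""
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(1 for ch in residues if ch in NUCLEOTIDE_CHARS)
    fraction = nucleotide_like / len(residues)
    return "nucleotide" if fraction >= threshold else "protein"


print(guess_molecule_type("ACATGTGCGCGCGATTATCTATCGATGCTACGTA"))
print(guess_molecule_type("MKVLAAGIVW"))
# A short peptide made only of A, G, and C residues is indistinguishable
# from DNA under this kind of heuristic.
print(guess_molecule_type("AAGAC"))
```

Under these assumptions the last call reports "nucleotide" even if the input was meant as a peptide, which is the kind of failure that specifying the molecule type explicitly avoids.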
The format of your input file will be automatically detected in most cases. If you get an error message stating that your format cannot be recognized, try specifying the input format instead of choosing "Automatic". If you still receive this error, double check the details of your format, or try removing all blank spaces from your sequence names.
Input formats accepted are:
For descriptions of some common sequence formats, see Common Sequence Formats.
Available output formats are listed below. GenBank, EMBL, MacVector, and BLAST are not supported.
File extensions assigned by this tool reflect the generated output.
Some of these file extensions are non-standard. If you are using the file as input in other software, you may need to change the file extension to its standard form (e.g., '.nxs' for Nexus).
|Sequence output format||File extension|
|Phylip standard interleaved||.phylipi|
|Phylip standard sequential||.phylips|
|Phylip relaxed interleaved||.rphylipi|
|Phylip relaxed sequential||.rphylips|
There are two situations in which you may want to select this option. First, if any of your sequence names might be duplicated, the duplicates can cause problems in other software. Second, some sequence formats limit the number of characters in names, so unique names may be truncated into non-unique names unless you check this option. In particular, standard Phylip and SLX limit the number of characters in names.
When selected, this option will change the sequence names, as in these examples:
|Input names||Output names|
This translation option is provided specifically to convert the information in GenBank format files into GFF3 format. Unlike other format translations in this tool, this conversion retains the annotated data from the GenBank file, not just the name and sequence. If this option is selected, all other options are ignored.
For testing purposes, click here to download a sample GenBank format file.
Some tools cannot handle IUPAC ambiguity codes in nucleotide sequences. This option replaces any character other than A, C, G, T, or U with an "N". This option is relevant only for nucleotide sequences.
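As a rough illustration of this replacement, the sketch below masks every letter other than A, C, G, T, or U. It is not the tool's implementation: the function name `mask_ambiguity_codes` is hypothetical, and leaving gap characters and letter case untouched is an assumption, since the behavior for those is not stated here.

```python
def mask_ambiguity_codes(sequence):
    """Replace every letter other than A, C, G, T, U with 'N'.

    Assumption: non-letter characters such as the gap '-' are left as they are,
    and lowercase a/c/g/t/u are treated like their uppercase forms.
    """
    keep = set("ACGTU")
    return "".join(
        "N" if ch.isalpha() and ch.upper() not in keep else ch
        for ch in sequence
    )


print(mask_ambiguity_codes("ACGTRYKM-U"))  # prints ACGTNNNN-U
```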
The "Raw" format consists of pure sequence, either nucleotides or one-letter amino acids.
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
When this sequence is converted to a non-raw format, it will be given the name "seq1". If Raw input consists of multiple lines, each line is interpreted as a separate sequence. Thus, the input
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
GCATGTGCACGCGATTATCTACCGATGCTACTTA
would produce the following Fasta output:
>seq1
ACATGTGCGCGCGATTATCTATCGATGCTACGTA
>seq2
GCATGTGCACGCGATTATCTACCGATGCTACTTA
Therefore, if you are submitting a single raw sequence, be sure it is on a single line.
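The line-per-sequence rule for Raw input can be sketched as follows. The function name `raw_to_fasta` is hypothetical, and skipping blank lines is an assumption; only the "seq1", "seq2", ... naming comes from the description above.

```python
def raw_to_fasta(raw_text):
    """Treat each non-empty line of Raw input as one sequence named seq1, seq2, ..."""
    # Assumption: blank lines are ignored rather than treated as empty sequences.
    sequences = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "".join(
        f">seq{index}\n{sequence}\n"
        for index, sequence in enumerate(sequences, start=1)
    )


raw_input_text = (
    "ACATGTGCGCGCGATTATCTATCGATGCTACGTA\n"
    "GCATGTGCACGCGATTATCTACCGATGCTACTTA\n"
)
print(raw_to_fasta(raw_input_text), end="")
```

A single sequence that was wrapped over two lines would come out of this sketch as two records, which is why a single raw sequence needs to sit on one line.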
Phylip files must begin with a line that looks like:
3 78 i
which shows the number of sequences in the file (3), the number of characters in each sequence (78), and then the letter "i" or "s", which indicates "interleaved" or "sequential". The "i" or "s" letter is optional.
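The header line described above can be read with a few lines of code. This is a minimal sketch; the function name `parse_phylip_header` and the choice to return `None` when the layout letter is absent are assumptions, not part of the tool.

```python
def parse_phylip_header(line):
    """Parse a Phylip header such as '3 78 i' into (n_sequences, n_characters, layout)."""
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields in Phylip header, got {len(fields)}")
    n_sequences, n_characters = int(fields[0]), int(fields[1])
    layout = None  # the layout letter is optional
    if len(fields) == 3:
        flag = fields[2].lower()
        if flag not in ("i", "s"):
            raise ValueError(f"unknown layout flag: {fields[2]!r}")
        layout = "interleaved" if flag == "i" else "sequential"
    return n_sequences, n_characters, layout


print(parse_phylip_header("3 78 i"))  # prints (3, 78, 'interleaved')
```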
Standard Phylip files limit sequence names to 10 characters. For this reason, we also provide relaxed Phylip options that preserve the full length of your sequence names.
The relaxed Phylip format is unique to the Format Converter tool. It is called 'relaxed' because it generates a Phylip-formatted file in which sequence names can be longer than 10 characters. Relaxed Phylip (sequential and interleaved) produces the same output as standard Phylip, except that sequence names are not truncated to 10 characters. Instead, sequence names are left as they are and padded with spaces to the length of the longest sequence name in the submitted data set. This ensures proper display of the aligned sequences in the interleaved format and consistent sequence name lengths in both the interleaved and sequential formats.
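The padding rule can be sketched for the sequential case as below. This is an illustration only: the function name `write_relaxed_phylip_sequential`, the single space between the padded name and the sequence, and the header without a layout letter are assumptions, and the tool's exact spacing may differ.

```python
def write_relaxed_phylip_sequential(records):
    """Write (name, sequence) pairs with names padded to the longest name.

    Assumptions: one sequence per line, one space after the padded name,
    and a header of the form '<n_sequences> <n_characters>'.
    """
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, sequence in records:
        # Names are never truncated, only padded on the right.
        lines.append(f"{name.ljust(width)} {sequence}")
    return "\n".join(lines) + "\n"


example = [("short", "ACGT"), ("a_much_longer_name", "AC-T")]
print(write_relaxed_phylip_sequential(example), end="")
```

Because every name occupies the same width, the sequences start in the same column on every line.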
The format converter program deals with only two essential data items: the sequence and the sequence name. Thus, a complicated file format such as Nexus, when converted to a simpler format such as Table, will lose all associated information except the sequence name and the sequence. For example, this Nexus file:
#NEXUS
Begin data;
Dimensions ntax=3 nchar=79;
Format datatype=dna gap=-;
Matrix
4axED43xco GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
2bxMD2b2x1 CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
2bxMD2b9x1 CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
;
End;
would produce the following Fasta file:
>4axED43xco
GGAGGCCCTACCTCAAGTAGTGACGCCCTACCTCCCGTTGGCTGTTTCCTCTTGCGTAGAACGCTACTTTCGGGCAACC
>2bxMD2b2x1
CGCTGTTGATCACCAAATCGGAGGGCACCTA-----GGAACACAGCTCCTCATGGATCGAGAGTACTTTCTAACCGTGA
>2bxMD2b9x1
CGCTGCCAAATACCGAGTCGGAAGGCATCTACGGTTGAGACACGGCTCCCCATGAACCGAGGGTATTTCCTAACCGTGG
The datatype (dna), number of taxa, etc., are not represented in the Fasta file, only the names and sequences.
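The loss of everything except names and sequences can be seen in a small sketch that reads only the Matrix block of a Nexus file. It is a simplification and not the converter's parser: the function name `nexus_matrix_to_fasta` is hypothetical, and it assumes a non-interleaved matrix with one "name sequence" pair per line, the keyword "Matrix" on its own line, and a terminating ";" at the start of a line.

```python
def nexus_matrix_to_fasta(nexus_text):
    """Extract name/sequence pairs from a simple Nexus Matrix block as Fasta text."""
    records = []
    in_matrix = False
    for raw_line in nexus_text.splitlines():
        line = raw_line.strip()
        if not in_matrix:
            # Dimensions, Format, and other commands are skipped entirely.
            if line.lower() == "matrix":
                in_matrix = True
            continue
        if line.startswith(";"):
            break  # end of the Matrix block
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"cannot split matrix line into name and sequence: {line!r}")
        name, sequence = parts
        records.append((name, sequence.replace(" ", "")))
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)


nexus_example = """#NEXUS
Begin data;
Dimensions ntax=2 nchar=8;
Format datatype=dna gap=-;
Matrix
taxonA ACGT-ACG
taxonB ACGTTACG
;
End;
"""
print(nexus_matrix_to_fasta(nexus_example), end="")
```

Nothing from the Dimensions or Format lines reaches the output, which mirrors the behavior described above.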
Sequence names in SLX are limited to 32 characters. Any longer sequence name will be truncated during format conversion, which can result in non-unique sequence names in the generated output. If you need to preserve the uniqueness of your sequence names, please use the check box labeled 'Enforce sequence name uniqueness' in the 'Options' panel.
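Before converting, it can help to check whether truncation would make any names collide. The sketch below is a hypothetical helper, not part of the tool: `find_truncation_collisions` groups names by their first `limit` characters and reports the groups with more than one member. The limit of 32 for SLX and 10 for standard Phylip comes from the text above; the example names are made up.

```python
from collections import defaultdict


def find_truncation_collisions(names, limit):
    """Return {truncated_name: [original names]} for names that collide after truncation."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["sample_A_0001", "sample_A_0002", "sample_B"]
print(find_truncation_collisions(names, limit=10))
print(find_truncation_collisions(names, limit=32))
```

With a limit of 10 the first two names both shorten to "sample_A_0", while with a limit of 32 no collisions are reported for these names.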