To format HIV sequences for submission to GenBank and the other International Consortium Databases (EMBL & DDBJ), the HIV Sequence Database provides a quality-control and GenBank submission tool. The process is outlined below.
Note: You may choose to run the QC analysis (Steps 1-3) without preparing the sequences for submission to GenBank. No one will look at your sequences or QC results, unless you ask us for assistance. If you are not doing the GenBank submission process, ignore the instructions for making a spreadsheet of annotation data.
Please prepare your data in two files: a sequence file and a comma-delimited (.csv) file containing sequence annotation data (see example files). If you are doing only the QC analysis, not GenBank deposit, you need only the sequence file.
The QC tool accepts only HIV-1 Fasta-formatted nucleotide sequences, aligned or unaligned. (Note that the GenBank Entry Generation tool can also handle HIV-2 and SIV sequences: see organism information, below.)
The Fasta format uses a "greater-than" sign (>) to indicate the start of each sequence record. The sequence name should follow this symbol. Sequence names should not contain any space characters; please use an underscore character (_) to denote a space. If prepared via word-processing software, sequence data should be saved in plain-text format (ASCII encoding).
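The naming rules above can be checked before upload. The following is a minimal sketch in Python, not part of the QC tool itself; the function names check_fasta and fix_name are hypothetical, and the checks cover only the rules stated here (a name after each ">", no spaces in names, ASCII-only text).

```python
def check_fasta(text):
    """Return a list of (line_number, message) problems found in Fasta text."""
    problems = []
    names = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Sequence data should be plain ASCII text.
        if not line.isascii():
            problems.append((lineno, "non-ASCII character"))
        # A ">" marks the start of a sequence record; the name follows it.
        if line.startswith(">"):
            name = line[1:].rstrip()
            if not name:
                problems.append((lineno, "missing sequence name"))
            elif any(ch.isspace() for ch in name):
                problems.append((lineno, "sequence name contains a space"))
            names.append(name)
    if not names:
        problems.append((0, "no '>' header lines found"))
    return problems


def fix_name(name):
    """Replace each run of whitespace in a sequence name with one underscore."""
    return "_".join(name.split())
```

An empty list from check_fasta means none of these particular problems were found; it does not guarantee that the QC tool will accept the file.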
To export an Excel file to comma-delimited format, go to File > Save As and select "CSV (Comma delimited) (*.csv)".
Each row in the comma delimited (.csv) file should correspond to a sequence in the Fasta file. The "header row" (top row) should name the data that appear in each column (e.g., "Sequence name", "subtype", "sampling date", etc.).
One column, preferably the first column, should contain the name of the sequence exactly as it appears in the Fasta file. Any differences in sequence names will lead to serious errors.
The order of rows for sequences in the two files need not match. Annotation data are associated with sequences by matching sequence names, rather than the order in the files.
Each column should contain annotation data, such as the viral subtype, patient code, viral load, sample date, sample country, etc. Individual cells may be left blank if the data are unknown for a particular sequence.
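Because annotation is attached to sequences by name, it can help to compare the names in the two files before submitting. The sketch below is one possible way to do that in Python; the function names and the name_column parameter are hypothetical, and it assumes the CSV file has a single header row. Names are compared exactly, without trimming, so a stray space counts as a mismatch.

```python
import csv
import io


def fasta_names(fasta_text):
    """Collect the text following '>' on each Fasta header line."""
    return [line[1:] for line in fasta_text.splitlines() if line.startswith(">")]


def compare_names(fasta_text, csv_text, name_column=0):
    """Return (names only in the Fasta file, names only in the CSV file)."""
    in_fasta = set(fasta_names(fasta_text))
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip the header row; ignore rows too short to hold a name.
    in_csv = {row[name_column] for row in rows[1:] if len(row) > name_column}
    return sorted(in_fasta - in_csv), sorted(in_csv - in_fasta)
```

Two empty lists mean every sequence has an annotation row and every annotation row has a sequence, regardless of row order.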
The QC tool accepts only HIV-1 sequences, but the GenBank Entry Generation tool can accept HIV-1, HIV-2, or SIV sequences. Prior to starting the preparation for GenBank deposit, you will be asked to select the organism of your sequences. All sequences for a single run must be the same organism. The tool does not support SHIV sequences at this time. If you want to use the tool for deposit of SHIV sequences, please contact us. It may be possible to devise a workaround on a case-by-case basis.
Go to the form for sequence submission: QC tool. Enter your sequence set and email address. Click on the 'Submit' button to initiate the analysis. On submission, you will receive a confirmation email message.
You will receive another message from email@example.com upon completion of the QC analysis.
Note: Because this tool compiles results from several other tools, it is slow. Please be patient; it may take several hours to produce results from large data sets.
Below is an example of a set of QC results. Each result is a link to more details.
When you are confident that sequences appear as you expect, check the checkboxes for the sequences you want to submit to GenBank. Click on the 'Create GenBank entry' button to continue.
The GenBank Entry Generation Tool will format sequences in ASN.1 (a.k.a. SQN or Sequin) format. GenBank's Sequin tool can also do this. The advantage of using the online service provided here is that feature annotation (which depicts gene locations along the sequence) will be added automatically from GeneCutter results. Click the "Begin GenBank Entry Generation" button to proceed.
Please review the column mapping carefully. This step maps columns of annotation to different regions of the GenBank entry, and great confusion will result if the fields are mapped inappropriately. The first several rows of data from the CSV file will be listed. Compare the name at the head of each column with the data below it. If a column name does not match its data, change the mapping by clicking where it says "change it here".
Click the 'Submit' button when you are ready to add annotation to the sequences.
After processing, you should receive an email message that contains two links: one to a zipped set of GenBank flatfiles ("gb.zip") and one to a sequin-formatted file ("all.sequin").
The flatfiles are human readable plain text and can be read with a text editor or word processor. They approximate how the entries will appear after GenBank accession. The sequin file is formatted for submission to GenBank. GenBank does not accept sequence submissions in flatfile format. You can open the sequin files with GenBank's Sequin software or with a text editor, should you choose.
Please unzip and review at least one of the flatfiles. Please do not edit the flatfiles if you encounter errors, because the changes will not be reflected in the sequin files. You must correct mistakes that appear in the flatfiles by repeating the process from Step 4. In case of insurmountable difficulty, please contact firstname.lastname@example.org with a detailed description of the problem.
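If you prefer to inspect the flatfiles without unzipping them by hand, the following Python sketch reads the first lines of each file in the archive. The function name preview_flatfiles is hypothetical, and the sketch assumes only that the downloaded archive is a standard zip file of plain-text flatfiles; it reads the files and changes nothing.

```python
import zipfile


def preview_flatfiles(zip_source, max_lines=20):
    """Return {member name: first max_lines lines} for each file in the archive.

    zip_source may be a path such as "gb.zip" or an open binary file object.
    """
    previews = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Flatfiles are plain text; replace any unexpected bytes.
            text = archive.read(info).decode("ascii", errors="replace")
            previews[info.filename] = "\n".join(text.splitlines()[:max_lines])
    return previews
```

For example, calling preview_flatfiles("gb.zip") on the downloaded file and printing each value shows the top of every entry for review.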
Download the sequin (sqn) file and save it on your computer's drive, noting where you save it. Then, with a web browser, open this page: Sequin MacroSend. At the bottom, locate the form entry called "File(s)". Press the "Browse" button, and locate the sequin file you created and saved above. Note: While GenBank does also accept email submissions of sequin files, some mail software can corrupt the file contents by modifying the file encoding, and thereby cause errors that are difficult to diagnose. The web submission form does not suffer from this problem.
Specify your contact information (First/last name, email, other optional fields). Note: Correspondence regarding these sequences and subsequent updates or corrections should be sent from the individual whose contact information is provided here. In the subject line, you should specify how many of what kind of sequences are in the sqn file (e.g., '186 HIV-1 env sequences').
In the comments field, you can enter a message that functions as a cover letter for the GenBank curatorial staff. You should specify whether to embargo the data, pending manuscript acceptance, or disclose them immediately upon review and acceptance. You might also mention whether or not the sequences should be available as a PopSet.
Upon completion, click on the 'Submit' button, and look for a confirmation that the file has been received.
You will receive an automated acknowledgement from GenBank via email. You may also receive clarifying questions from a GenBank curator. You will receive a message containing accession numbers upon acceptance.