HIV sequence database



QC/GenBank Tool Explanation

Overview

To format HIV sequences for submission to GenBank and the other International Consortium Databases (EMBL & DDBJ), the HIV Sequence Database provides a quality-control and GenBank submission tool. The process is outlined below.

Note: You may choose to run the QC analysis (Steps 1-3) without preparing the sequences for submission to GenBank. No one will look at your sequences or QC results, unless you ask us for assistance. If you are not doing the GenBank submission process, ignore the instructions for making a spreadsheet of annotation data.

[Figure: flowchart of the QC tool and GenBank submission process]

Step 1: Prepare files

Please prepare your data in two files: a sequence file and a comma-delimited (.csv) file containing sequence annotation data (see example files).

Sequence file:

The QC tool accepts Fasta-formatted HIV-1 nucleotide sequences, aligned or unaligned.

The Fasta format uses a "greater-than" sign (>) to indicate the start of each sequence record. The sequence name should follow this symbol. Sequence names should not contain any space characters; please use an underscore character (_) to denote a space. If prepared via word-processing software, sequence data should be saved in plain-text format (ASCII encoding).
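The naming rules above can be checked before upload. The following is a minimal sketch, not part of the QC tool itself; the function names check_fasta_names and underscore_names are hypothetical helpers introduced here for illustration. It flags non-ASCII characters and sequence names that contain whitespace, and offers a simple fix that replaces whitespace in names with underscores.

```python
import re


def check_fasta_names(fasta_text):
    """Return a list of (line_number, problem) tuples for a Fasta string.

    Hypothetical helper: checks only the rules stated in this tutorial
    (ASCII plain text, no whitespace in sequence names).
    """
    problems = []
    for lineno, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.isascii():
            problems.append((lineno, "non-ASCII character"))
        if line.startswith(">") and re.search(r"\s", line[1:]):
            problems.append((lineno, "name contains whitespace"))
    return problems


def underscore_names(fasta_text):
    """Replace runs of whitespace in sequence names with one underscore."""
    fixed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # split() also drops leading and trailing whitespace in the name
            line = ">" + "_".join(line[1:].split())
        fixed.append(line)
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    sample = ">seq 1\nACGT\n>seq_2\nACGT\n"
    print(check_fasta_names(sample))
    print(underscore_names(sample), end="")
```

If you rename sequences this way, remember to apply the same change to the names in the annotation file, since the two files are matched by sequence name.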

Limitations:

Annotation file:

To export a file in Excel to comma-delimited format, go to File > Save As and select "CSV (Comma delimited) (*.csv)".

Each row in the comma-delimited (.csv) file should correspond to a sequence in the Fasta file. The "header row" (top row) should name the data that appear in each column (e.g., "Sequence name", "subtype", "sampling date", etc.).

One column, preferably the first column, should contain the name of the sequence exactly as it appears in the Fasta file. Any differences in sequence names will lead to serious errors.

The order of rows for sequences in the two files need not match. Annotation data are associated with sequences by matching sequence names, rather than the order in the files.
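Because a name mismatch between the two files leads to serious errors, it can help to compare the names yourself before submitting. The sketch below is one way to do that under stated assumptions: the sequence name is in the first CSV column (name_column=0), the first CSV row is the header row, and compare_names is a hypothetical helper, not a function of the QC tool. Names are compared exactly, including case.

```python
import csv
import io


def fasta_names(fasta_text):
    """Return sequence names exactly as written after each '>' sign."""
    return [line[1:] for line in fasta_text.splitlines() if line.startswith(">")]


def compare_names(fasta_text, csv_text, name_column=0):
    """Return (names only in the Fasta file, names only in the CSV file).

    Row order is ignored, mirroring the matching-by-name rule above.
    """
    in_fasta = set(fasta_names(fasta_text))
    rows = list(csv.reader(io.StringIO(csv_text)))
    # rows[0] is assumed to be the header row and is skipped
    in_csv = {row[name_column] for row in rows[1:] if len(row) > name_column}
    return sorted(in_fasta - in_csv), sorted(in_csv - in_fasta)


if __name__ == "__main__":
    fasta = ">A1\nACGT\n>B2\nACGT\n"
    annotation = "Sequence name,subtype\nB2,C\na1,B\n"
    only_fasta, only_csv = compare_names(fasta, annotation)
    print("In Fasta but not CSV:", only_fasta)
    print("In CSV but not Fasta:", only_csv)
```

In this made-up example the name "a1" in the CSV file does not match "A1" in the Fasta file, so both are reported. Two empty lists mean every sequence has exactly one matching name in the other file.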

Each column should contain annotation data, such as the viral subtype, patient code, viral load, sample date, sample country, etc. Individual cells may be left blank if the data are unknown for a particular sequence.

Details about supported annotation and the requisite format are given in Data Field Help. The CSV example shows what your CSV annotation file should look like.
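If your annotation lives in a script or database rather than a spreadsheet, the file can also be written programmatically. The following is a rough sketch using Python's standard csv module. The column names are taken from the header-row example above and the values are invented for illustration; the field names and value formats the tool actually supports are those listed in Data Field Help. The helper name write_annotation_csv is hypothetical.

```python
import csv
import io


def write_annotation_csv(records, columns):
    """Return CSV text with a header row and one row per sequence.

    Keys missing from a record become blank cells (restval=""), which
    matches the rule that unknown data may be left blank.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative column names and values only; see Data Field Help.
    columns = ["Sequence name", "subtype", "sampling date"]
    records = [
        {"Sequence name": "A1", "subtype": "B", "sampling date": "2010"},
        {"Sequence name": "B2", "subtype": "C"},  # sampling date unknown
    ]
    print(write_annotation_csv(records, columns), end="")
```

To save the result, write the returned text to a file with a .csv extension, for example with open("annotation.csv", "w", newline="").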

Step 2: Run Sequences Through QC Review

Go to the form for sequence submission: QC tool. Enter your sequence set and email address. Click on the 'Submit' button to initiate the analysis. On submission, you will receive a confirmation email message.

You will receive another message from seq-info@lanl.gov upon completion of the QC analysis.

Note: Because this tool compiles results from several other tools, it is slow. Please be patient; it may take several hours to produce results from large data sets.

Step 3: Carefully Review QC Analysis Results

Below is an example of a set of QC results. Each result is a link to more details.

[Figure: screenshot of QC results]

When you are confident that sequences appear as you expect, check the checkboxes for the sequences you want to submit to GenBank. Click on the 'Create GenBank entry' button to continue.

Step 4: Add Annotation and Format Entry for GenBank Submission

The GenBank Entry Generation Tool will format sequences in ASN.1 (a.k.a. SQN or Sequin) format. GenBank's Sequin tool can also do this. The advantage of using the online service provided here is that feature annotation (which depicts gene locations along the sequence) will be added automatically from GeneCutter results. Click the "Begin GenBank Entry Generation" button to proceed.

Please review the column mapping carefully. This step maps columns of annotation to different regions of the GenBank entry, and great confusion will result if the fields are mapped inappropriately. The first several rows of data from the CSV file will be listed. Compare the name at the head of each column with the data below it. If a column name does not match its data, change the mapping by clicking where it says "change it here".

Click the 'Submit' button when you are ready to add annotation to the sequences.

Step 5: Submit SQN File to GenBank

After processing, you should receive an email message that contains two links: one to a zipped set of GenBank flatfiles ("gb.zip") and one to a sequin-formatted file ("all.sequin").

The flatfiles are human-readable plain text and can be read with a text editor or word processor. They approximate how the entries will appear after GenBank accession. The sequin file is formatted for submission to GenBank; GenBank does not accept sequence submissions in flatfile format. You can open the sequin file with GenBank's Sequin software or with a text editor, should you choose.

Please unzip and review at least one of the flatfiles. Please do not edit the flatfiles if you encounter errors, because the changes will not be reflected in the sequin file. You must correct mistakes that appear in the flatfiles by repeating the process from Step 4. In case of insurmountable difficulty, please contact seq-info@lanl.gov with a detailed description of the problem.
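If you prefer to inspect a flatfile without unpacking the archive by hand, a short script can read one directly. This is a minimal sketch that assumes only that "gb.zip" is an ordinary zip archive of plain-text files; the names of the files inside it are not documented here, so the script simply takes the first one. The helper name first_flatfile is hypothetical. It reads the archive without modifying it, in keeping with the advice not to edit the flatfiles.

```python
import zipfile


def first_flatfile(zip_source, max_chars=2000):
    """Return (member_name, text) for the first file in a zip archive.

    zip_source may be a path such as "gb.zip" or a binary file object.
    Only the first max_chars characters are returned, for quick review.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Skip directory entries, which end with a slash
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if not names:
            raise ValueError("archive contains no files")
        text = archive.read(names[0]).decode("ascii", errors="replace")
    return names[0], text[:max_chars]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        name, text = first_flatfile(sys.argv[1])
        print("Showing the start of", name)
        print(text)
```

Run it with the path to the downloaded archive as the only argument, for example python review_flatfile.py gb.zip, where the script file name is whatever you saved it as.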

Download the sequin (sqn) file and save it on your computer's drive, noting where you save it. Then, with a web browser, open this page: Sequin MacroSend. At the bottom, locate the form entry called "File(s)". Press the "Browse" button, and locate the sequin file you created and saved above. Note: While GenBank does also accept email submissions of sequin files, some mail software can corrupt the file contents by modifying the file encoding, and thereby cause errors that are difficult to diagnose. The web submission form does not suffer this problem.

Specify your contact information (First/last name, email, other optional fields). Note: Correspondence regarding these sequences and subsequent updates or corrections should be sent from the individual whose contact information is provided here. In the subject line, you should specify how many of what kind of sequences are in the sqn file (e.g., '186 HIV-1 env sequences').

In the comments field, you can enter a message that functions as a cover letter for the GenBank curatorial staff. You should specify whether to embargo the data, pending manuscript acceptance, or disclose them immediately upon review and acceptance. You might also mention whether or not the sequences should be available as a PopSet.

Upon completion, click on the 'Submit' button, and look for a confirmation that the file has been received.

Step 6: Respond to Any Inquiries from GenBank Curators

You will receive an automated acknowledgement from GenBank via email. You may also receive clarifying questions from a GenBank curator. You will receive a message containing accession numbers upon acceptance.

 


 

Related Links:

Data Fields Accepted in Annotation File
Sequence Quality Control Tutorial
Search Interface Help describes database fields
Data Dictionary defines database fields

 

last modified: Tue Mar 5 10:39 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
