The Aspergillus Genome Database: Sequence Documentation


This page provides information about the DNA and protein sequences in AspGD, including their sources, how to access them, and further explanation of some sequence-related issues.


Information about multiple Aspergillus species in AspGD

The Sybil comparative visualization tool in AspGD provides searching and browsing of orthologs and syntenic regions in ten sequenced Aspergillus genomes among the following species: A. nidulans, A. fumigatus, A. flavus, A. oryzae, A. niger, A. clavatus, A. terreus, and Neosartorya fischeri. AspGD also provides sequences for download for these sequenced genomes.

The AspGD GBrowse Genome browser supports navigation and exploration of the A. nidulans FGSC A4 and A. fumigatus Af293 genomes and annotated features.

Initially, AspGD curation is focused on the A. nidulans literature, because A. nidulans serves as a genetic model for the other Aspergilli, and it is the most well-represented of these species in the published experimental literature. In the future, we will expand the manual curation process to include information about other Aspergillus species, and will be adding gene-based information, including Locus Summary pages, for the non-nidulans Aspergilli.

PLEASE NOTE: The A. nidulans FGSC A4 sequence file names, as well as chromosome identifiers within the files, were updated in July and August 2010 to include the name of the species and strain. This change was necessary to accommodate multiple Aspergillus and Aspergillus-related species and strains at AspGD.

Sources of sequence-based information in AspGD

Aspergillus nidulans FGSC A4:
Sequenced by the Broad Institute. Sequence and annotation originally provided by CADRE on 26-Aug-2009. Updates to chromosomal sequence and to structural annotation of genes and other features are described on the Chromosome History page for each chromosome and the Locus History page of each feature that has been updated.

Aspergillus fumigatus A1163:
Sequenced by JCVI. Sequence and annotation downloaded from GenBank on 24-Feb-2009.

Aspergillus oryzae RIB40/ATCC 42149:
Sequenced by NITE, AIST, A. oryzae consortium. Sequence and annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009. Gene models subsequently refined at AspGD by PASA alignment of A. oryzae sequence from GenBank: ESTs (strain A1560), and CDSs (strains RIB40, ATCC 20386, LMTC 2.14, GX0015, NFRI1599, VTCC-F-187, AS 3.4382 DLFCC-39).

Aspergillus flavus NRRL 3357:
Sequenced by JCVI. Sequence and annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009. Gene models subsequently refined at AspGD by PASA alignment of A. oryzae ESTs.

Neosartorya fischeri NRRL 181:
Sequenced by JCVI. Sequence and annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009.

Aspergillus clavatus NRRL 1:
Sequenced by JCVI. Sequence and GFF annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009.

Aspergillus fumigatus Af293:
Sequenced by JCVI/Sanger Institute/Institut Pasteur. Sequence and annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009.

Aspergillus niger ATCC 1015:
Sequenced by JGI. Sequence and GFF annotation downloaded from the JGI v3.0 download page on 27-Apr-2009. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Gene models subsequently refined at AspGD by PASA alignment of A. niger ATCC 1015 ESTs.

Aspergillus terreus NIH2624:
Sequenced by the Broad Institute. Sequence and GFF annotation downloaded from the Broad Institute Aspergillus download page on 23-Feb-2009.

Aspergillus niger CBS 513.88:
Sequenced by Gene Alliance for DSM. Sequence and GFF annotation downloaded from GenBank on 24-Feb-2009. Gene models subsequently refined at AspGD by PASA alignment of A. niger ATCC 1015 ESTs.

Accessing Sequences in AspGD

From the Locus Summary Page:

The "Retrieve Sequences" pull-down menu, which is located on the Resources sidebar on the right-hand side of each Locus Summary Page, retrieves the Genomic DNA (with introns) including or excluding UTR sequences; the Exon Sequences (without introns) including or excluding UTR sequences; the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF translation (predicted protein sequence).

From the AspGD Sequence Retrieval Tool:

To access the Sequence Retrieval Tool (also called Get Sequence, or Gene/Sequence Resources), use the link under Search Options on the left-hand sidebar of the AspGD Home Page or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.

By Bulk Download

You may download gzip compressed sequence files in bulk from the AspGD Sequence Download Page; a variety of file options exist for retrieval of data. There is a link to this page under Download Data on the left-hand sidebar of the AspGD Home Page.
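As a minimal sketch of working with a downloaded file, the following Python reads a gzip-compressed FASTA file using only the standard library. It assumes the file is plain FASTA text inside the gzip container; the function names are hypothetical and are not part of any AspGD tool.

```python
import gzip
from typing import Dict, IO, Iterator, Tuple


def read_fasta(handle: IO[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_gzipped_fasta(path: str) -> Dict[str, str]:
    """Map the first word of each FASTA header to its sequence."""
    sequences = {}
    # "rt" opens the gzip file in text mode, decompressing on the fly.
    with gzip.open(path, "rt") as handle:
        for header, seq in read_fasta(handle):
            words = header.split()
            key = words[0] if words else header
            sequences[key] = seq
    return sequences
```

Keying on the first word of the header is a convenience for lookups by name; the full header line can be kept instead if the description text is needed.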

You may also retrieve sequence information for any set of genes (either specified by a list of gene names, or by selecting a region of a chromosome) using the Batch Download Tool.

From the GBrowse Genome Browser:

To view the nucleotide sequence of a gene using GBrowse, begin by zooming in on the gene in the browser, which is described in detail on the GBrowse Help Documentation page. GBrowse may be accessed using the "Chromosomal Location" or the GBrowse map thumbnail views on each Locus page, or by using the "Genome Browser" links displayed on each BLAST result page. You may use GBrowse to search by gene name simply by typing the name into the Landmark or Region search box and clicking on Search. To view the DNA sequence of the region displayed in the browser (which is now your gene of interest), select Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis." The difference between these two formats is that the decorated FASTA file format highlights ORFs contained within the sequence, which is convenient when viewing a large sequence file. The non-decorated sequence file can be displayed in any of several different formats. Each file format is configurable; select the file format from the pull-down menu and then click on the Configure button to select configuration options. Click on the button marked "Go" to view the sequence.

To view any amount of nucleotide sequence of the region upstream or downstream of a gene, you can use the browser to display a specific region relative to the ORF start site and then ask to download this sequence. For example, if you want the sequence of the 1.5 kb region upstream of orf_name, enter "orf_name:-1500..-1" into the Landmark or Region search box and click on Search. Now use the Download Sequence File or Download Decorated FASTA File option to get the nucleotide sequence of the region.
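The relative-coordinate string described above can be built programmatically when many regions are needed. This is a minimal sketch in Python; the helper names are hypothetical, and the only format assumed is the "name:start..end" pattern shown in the example above.

```python
def flanking_region(feature_name: str, start: int, end: int) -> str:
    """Build a 'name:start..end' string with coordinates relative to the feature."""
    if start > end:
        raise ValueError("start must not be greater than end")
    return f"{feature_name}:{start}..{end}"


def upstream_region(feature_name: str, length: int) -> str:
    """Region covering `length` bases immediately upstream of the feature start."""
    if length < 1:
        raise ValueError("length must be a positive number of bases")
    # Negative coordinates count back from the start site, ending at -1.
    return flanking_region(feature_name, -length, -1)


print(upstream_region("orf_name", 1500))  # prints orf_name:-1500..-1
```

The resulting string is what would be pasted into the Landmark or Region search box.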

To view the predicted protein sequence (orf translation) of an ORF in GBrowse, first type "orf_name" into the Landmark or Region search box and click on Search to zoom in on this ORF. Now select Download Protein Sequence File. Click on the button marked "Go" to view the sequence. The protein sequence file format is configurable; select Download Protein Sequence File from the pull-down menu and then click on the Configure button to select configuration options.

The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface. To begin exploring in GBrowse now, use this link to see a region of Chromosome III as an example.

Using BLAST (Basic Local Alignment Search Tool):

You may use the AspGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in AspGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "Genome Browser" hyperlink above each set of HSPs on the BLAST results page opens the GBrowse genome browser, with the HSP displayed in the browser window. GBrowse may be used to further explore the region containing the match: to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences, to view the 6-frame translation of the region, and to view restriction sites. (For a description of GBrowse features, please see our GBrowse documentation). If applicable, links are provided to directly download/view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.

Classification of Open Reading Frames in AspGD

A. nidulans open reading frames are classified in AspGD either as Verified, meaning that there is experimental evidence for the existence of a gene product (as defined by the ORF having curated Gene Ontology terms with experimental evidence codes, i.e., evidence codes other than IEA, ISS, RCA, ISA, ISM, ISO, NAS), or as Uncharacterized, meaning that no experimental evidence currently exists but that the ORF is likely to represent a biologically significant gene. These classifications are displayed on the Locus Summary page of each ORF, and may be changed in the future as new experimental evidence becomes available. A third classification, Dubious (meaning that the ORF is unlikely to represent a biologically significant gene), is not currently in use for A. nidulans ORFs.
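The Verified/Uncharacterized rule can be expressed as a short check on the evidence codes attached to an ORF's curated Gene Ontology annotations. The sketch below is illustrative only: it mirrors the rule as stated above, the function name is hypothetical, and it is not the procedure AspGD itself runs.

```python
from typing import Iterable

# Evidence codes that the text above lists as NOT counting as experimental.
NON_EXPERIMENTAL_CODES = frozenset(
    {"IEA", "ISS", "RCA", "ISA", "ISM", "ISO", "NAS"}
)


def classify_orf(evidence_codes: Iterable[str]) -> str:
    """Return 'Verified' if any GO evidence code is outside the excluded set."""
    has_experimental = any(
        code.strip().upper() not in NON_EXPERIMENTAL_CODES
        for code in evidence_codes
    )
    return "Verified" if has_experimental else "Uncharacterized"
```

An ORF with no curated GO annotations at all falls through to Uncharacterized here, which is an assumption of this sketch; the Dubious class is not modeled because it is not in use for A. nidulans.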

UTR sequences in AspGD

Untranslated regions (UTRs) are segments of the transcribed DNA located before and after the coding region (CDS). Some UTRs contain introns, which are removed during splicing. The UTR sequences retained in the mature mRNA are not translated into the protein. The UTRs play important roles in mRNA stability and processing, and in regulation of translation.

Annotation of the UTR sequences in AspGD is derived from version 4 of the A. nidulans genome annotation by the Eurofungbase community.

AspGD contains the following types of UTR sequences:

A. nidulans genome assemblies and updates

Version 5

The Version 5 release (s04-m02-r01) is the result of merging the gene models in Version 4 (the previous AspGD release, s04-m01-r01) with the Broad Institute's 2009 release of new A. nidulans annotation.

The Broad 2009 release of the annotation was based on Version 3 as a starting point, rather than Version 4 (the Eurofungbase version of the annotation), which was current at the time. Since the Broad annotations do not include any of the changes introduced by the Eurofung annotation effort, the merge was done in such a way as to try to retain the Version 4 annotation for any gene model known to have been manually curated, but to adopt the new Broad annotation for any gene whose predicted polypeptide has not changed between the Version 2 release, the Version 3 release (which incorporates experimental evidence for gene model annotation), and the Eurofung Version 4 release (which incorporates manual curation of a subset of the gene models).

The entire set of changes proposed in the Broad Institute reannotation, including the gene model changes that were not included in Version 5 of the A. nidulans annotation, is available for download (please see the GFF file subdirectory), and can be viewed and compared against the reference annotation using the "Historic Tracks" feature of the AspGD Genome Browser. In addition, the Broad annotation has been submitted to GenBank and is available at the Broad Institute web site.

The Broad assigned new gene identifiers, which begin with an "ANID" prefix, in contrast to the "AN" prefix used in previous versions of the annotation. In most (but not all) cases, the numerical part of the gene identifier remains the same between the old AN and new ANID identifier. The Broad provides the following file of gene identifiers on their web site:
http://www.broadinstitute.org/annotation/genome/aspergillus_group/MultiDownloads.html
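When comparing lists of old and new identifiers, it can help to test whether an AN name and an ANID name share the same numerical part. The following Python sketch assumes that AN identifiers are "AN" plus digits (optionally followed by a ".N" annotation-version suffix) and that ANID identifiers are "ANID_" plus digits; the function name is hypothetical. Because the numerical parts do not always agree, a True result is only a hint and the Broad mapping file remains the authority.

```python
import re

# Assumed identifier shapes (see the lead-in): AN1234, AN1234.4, ANID_01234.
_AN_PATTERN = re.compile(r"AN(\d+)(?:\.\d+)?")
_ANID_PATTERN = re.compile(r"ANID_(\d+)")


def numerically_corresponding(an_id: str, anid_id: str) -> bool:
    """True if the AN and ANID identifiers carry the same number."""
    an_match = _AN_PATTERN.fullmatch(an_id.strip())
    anid_match = _ANID_PATTERN.fullmatch(anid_id.strip())
    if an_match is None or anid_match is None:
        raise ValueError("expected an AN identifier and an ANID identifier")
    # Compare as integers so that differences in zero padding are ignored.
    return int(an_match.group(1)) == int(anid_match.group(1))
```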

To merge the Version 4 annotation with the set from the Broad, the Broad gene models were first mapped to the chromosome scaffolds used in the Version 4 release and then compared with the Version 4 annotations. For each gene in the Version 4 annotation, either the original Version 4 gene model OR the corresponding Broad gene model was placed in the Version 5 release. The newly-added genes in the Broad release were evaluated subsequently. Note that genes deleted from the Broad release have NOT been marked as "Deleted" in the initial Version 5 release, pending a future review of the status of these genes.

The procedure and criteria used for deciding which gene model to choose for the Version 5 annotation are summarized as follows:

Notes about gene names and aliases:

The AN identifier is used as the primary systematic name in AspGD. The ANID identifiers created by the Broad Institute and the CADRE ANIA identifiers have been added as gene aliases. We have added ANID aliases to the corresponding genes in cases where the Broad annotations were evaluated and rejected from Version 5, as well as in the cases where the Broad updates were included in Version 5. The ANIG identifiers originally proposed by the Broad are not added as aliases, except in the case of the mitochondrial genes, in which the ANIG identifiers are already used publicly in files available on their web site, and have therefore been added as gene aliases in AspGD. AN identifiers that numerically correspond to the ANIG identifiers have been added as the primary systematic names for the mitochondrial genes in AspGD. During processing of the complex "merge," "split," and "cluster" annotation change events at AspGD, the names and aliases associated with each of the old gene models from Version 4 are added as aliases for the new gene models in Version 5 that result from each of these annotation change events.

There were a small number of "gene name collisions" in which the numerical part of the ANID identifier assigned by the Broad was already in use for a different gene; that is, there was an existing gene called ANXXXXX, and the Broad annotation includes a new gene called ANID_XXXXX, so that the numerical parts of the IDs clash. For the gene that was already called ANXXXXX, the Broad has assigned an identifier ANID_YYYYY. These cases have been addressed in Version 5 as follows:

1. The gene that was already called ANXXXXX is given an AN identifier that numerically corresponds to the new ANID identifier (ANID_YYYYY) assigned by the Broad, and this AN identifier (ANYYYYY) is designated as the primary systematic name. The gene is not given an ANID identifier that corresponds to the original AN identifier (ANID_XXXXX, which the Broad has assigned to a different gene). All other names are retained as aliases.

2. The new gene, ANID_XXXXX, is given an AN identifier that numerically corresponds to the new ANID identifier (ANID_XXXXX), and this AN identifier (ANXXXXX) is designated as the primary systematic name. ANID_XXXXX is retained as a gene alias.

Please note: The primary systematic name of the new gene, ANXXXXX, is also an alias of the gene that was originally called ANXXXXX. A search for the name "ANXXXXX" will retrieve both genes.

Version 4

- Eurofungbase community annotation and improvements to the assembly

The primary goal of the Eurofungbase annotation effort was to increase the numbers of A. nidulans proteins with informative functional assignments. Experts in various aspects of fungal and, particularly, A. nidulans biology were invited to participate in an ongoing annotation effort, which started with a jamboree in Autumn 2007. Prior to this initial meeting, the version 3 protein sequences were subjected to a series of computational analyses intended to provide evidence for protein function. Over the course of the functional annotation effort, 2,626 genes were reviewed and edited. Through this concerted effort, the percentage of A. nidulans gene products with an informative name increased from approximately 3% to 19%. For the remaining un-reviewed genes, product names were provided by transfer of information from A. fumigatus or A. niger orthologues, further increasing the proportion of gene products with informative names to 58%. As an indication of the changes arising from this project, the annotation version was incremented to 4 for all locus identifiers (e.g., AN****.4).

During the course of functional annotation, consortium members with expertise in specific gene families were able to identify gene structures that were likely to be incorrect. They then submitted either virtual cDNA sequences representing the corrected gene structure, or protein sequences corresponding to corrected genes. PASA was used to correct genes based on the submitted cDNA sequences, processing 74 gene structure updates, including 7 additional putative pseudogenes. Genewise (Birney et al., 2004) was used to instantiate gene structures based on the submitted protein sequences. These structures were reviewed manually and used to update 95 genes, including 3 additional putative pseudogenes, and to create five new gene models. The final gene count for the version 4 annotation is 10,605, which includes 63 putative pseudogenes. Two families of genes were significantly improved through these efforts. 39 of 119 cytochrome P450 genes were updated, as were 96 of the 342 predicted Zn2Cys6 transcription factors.

The initial Broad Institute assembly comprised 173 contigs, linked by end-sequenced BAC and fosmid bridges to form 16 supercontigs, corresponding to the 16 chromosome arms. A further 75 contigs were unplaced. Using BLAST (Altschul et al., 1990) and Bl2seq (Tatusova and Madden, 1999), it was found that many of the Broad contigs overlapped, so that 58 previously unassigned contigs could be incorporated into supercontigs, and the number of unsequenced gaps within supercontigs was reduced from 157 to 71. In five further cases, gaps are bridged by independently cloned sequences or retrotransposons with matching target-site duplications. In summary, the new assembly has 248 contigs, of which 231 contigs are assigned to 17 supercontigs (with 66 unsequenced gaps) that are mapped to eight linkage groups (approx. 30.5 Mb). These supercontigs are assigned to the linkage groups, which are labeled I to VIII.

- Additional updates made to Version 4 at AspGD

A. nidulans tRNA genes were predicted at AspGD in August, 2009, using the tRNAscan-SE algorithm (Lowe and Eddy, 1997). The tRNA gene names appear in the following format: a lower-case t (for "tRNA"), followed by the one-letter abbreviation of the amino acid with which it is charged, followed by the anticodon (in parentheses), followed by an integer. For example, "tA(AGC)1" is an alanyl tRNA with an AGC anticodon. The tRNAscan-SE algorithm predicts that there are 175 tRNA genes in version 4 of the A. nidulans genome. The algorithm also predicts 5 pseudo-tRNA genes, which are noted as "pseudogene" in the AspGD database.
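The tRNA naming format can be split into its parts with a regular expression. This is a minimal sketch in Python under the assumption that the anticodon is always three letters drawn from A, C, G, T, and U; the type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class TrnaName(NamedTuple):
    amino_acid: str  # one-letter amino acid abbreviation
    anticodon: str   # three-letter anticodon
    index: int       # trailing integer


# Lower-case t, amino acid letter, anticodon in parentheses, then an integer.
_TRNA_PATTERN = re.compile(r"t([A-Z])\(([ACGTU]{3})\)(\d+)")


def parse_trna_name(name: str) -> TrnaName:
    """Split a tRNA gene name such as 'tA(AGC)1' into its components."""
    match = _TRNA_PATTERN.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a tRNA gene name in the expected format: {name!r}")
    amino_acid, anticodon, index = match.groups()
    return TrnaName(amino_acid, anticodon, int(index))


print(parse_trna_name("tA(AGC)1"))
```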

Pseudogenes and untranslated regions (UTRs) from the original Eurofungbase Community Annotation were not included among the data initially loaded into the AspGD database. In December, 2009, after additional review, pseudogenes and annotated UTRs were loaded into AspGD.

Updates to two gene models (kipA/AN8286 and teaR/AN4214) were made based on manual curation of the scientific literature by AspGD and correspondence with the authors. AN10697 was merged with AN10679, and AN1820 was updated, based on information provided by CADRE. A set of 34 gene models were added to the annotation from a set of 86 genes that were not originally mapped from the contig-based annotation to the chromosome sequences. Based on additional PASA-based mapping conducted at AspGD, these 34 could be matched to a chromosomal location (31 of them on Chromosome 3). Ten pairs of overlapping genes were merged after manual review to confirm that the gene models were either identical, or one was completely subsumed by the other. Information about every update is displayed on the AspGD Locus History page for each of the relevant genes, and also on the appropriate AspGD Chromosome History page.

Version 3

In summer 2005, NIAID requested that the annotation group at TIGR revisit the gene structure annotation of A. nidulans in advance of microarray design being planned by the NIAID-funded Pathogen Functional Genomics Resource Center (PFGRC). This re-annotation effort focused on the automated incorporation of EST data into the existing gene models, and the manual review and correction of merged loci. A total of 32,931 EST and cDNA sequences compiled from GenBank and provided by C. d'Enfert and G. Goldman were aligned to the genome, and compared to the existing annotation using the PASA pipeline (Haas et al., 2003). These EST sequences collapsed to 8,690 unique assemblies and were used to perform automated gene structure updates of 1,146 genes. In addition, over 2,000 genes that could not be computationally resolved with the current gene structure were manually reviewed and corrected on the basis of either protein homology or EST data. A total of 494 loci were split into two or more distinct loci, and 214 new gene models were added. In addition, 426 gene models originally predicted by the Broad Institute, but excluded from the earlier release because they did not meet minimum length criteria, were also incorporated. The final gene set consisted of 10,701 protein-coding gene predictions, with 4,263 genes completely consistent with EST alignments. Locus identifiers were retained in all cases of one-to-one mapping (9,447), whether the sequence changed or not, but the version number was incremented to 3 for all genes. Since the gene number was now over 10,000, there was a need to create new locus identifiers with 5 numeric digits after the AN prefix (e.g. AN10002.3). Functional annotation was supplied by the Broad Institute.

Version 2

The original public genome annotation of A. nidulans consists of 9,541 protein-coding gene predictions (Galagan et al., 2005). Each gene was assigned a unique locus identifier with the prefix AN followed by a four digit number between 0001 and 9541 and appended with the annotation version number 2 (e.g. AN0001.2). As of January, 2010, this is the version of the A. nidulans annotation represented at GenBank, linked to accession AACD00000000, submitted in January, 2004. Comparative analysis with A. fumigatus and A. oryzae suggested that there were many neighboring loci inappropriately merged (Wortman et al., 2006). Functional annotation was applied to gene products only when they exhibited high-identity matches to previously published, experimentally characterised, proteins within the fungal kingdom. This resulted in putative function assignments for approximately 3% of the predicted proteins.
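Versioned locus identifiers such as AN0001.2 can be separated into the locus number and the annotation version. The sketch below is a hedged illustration in Python: it assumes four or five digits after the AN prefix, as described for Versions 2 and 3, and the type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class LocusId(NamedTuple):
    digits: str              # numeric part, kept as text to preserve zero padding
    annotation_version: int  # number after the dot


# "AN", four or five digits, a dot, then the annotation version number.
_LOCUS_PATTERN = re.compile(r"AN(\d{4,5})\.(\d+)")


def parse_locus_id(text: str) -> LocusId:
    """Split an identifier such as 'AN0001.2' into digits and version."""
    match = _LOCUS_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a versioned AN locus identifier: {text!r}")
    return LocusId(match.group(1), int(match.group(2)))
```

The digits are returned as a string so that the identifier can be rebuilt exactly, including leading zeros.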

Version 1

Version 1 of the A. nidulans genome data was internal to the Broad Institute and not released widely.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol. 1990 Oct 5;215(3):403-10.

Lowe TM, Eddy SR. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.

Tatusova TA, Madden TL. FEMS Microbiol Lett. 1999 May 15;174(2):247-50.

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.

Birney E, Clamp M, Durbin R. Genome Res. 2004 May;14(5):988-95.

Galagan JE, Calvo SE, Cuomo C, Ma LJ, Wortman JR, Batzoglou S, Lee SI, Batürkmen M, Spevak CC, Clutterbuck J, Kapitonov V, Jurka J, Scazzocchio C, Farman M, Butler J, Purcell S, Harris S, Braus GH, Draht O, Busch S, D'Enfert C, Bouchier C, Goldman GH, Bell-Pedersen D, Griffiths-Jones S, Doonan JH, Yu J, Vienken K, Pain A, Freitag M, Selker EU, Archer DB, Peñalva MA, Oakley BR, Momany M, Tanaka T, Kumagai T, Asai K, Machida M, Nierman WC, Denning DW, Caddick M, Hynes M, Paoletti M, Fischer R, Miller B, Dyer P, Sachs MS, Osmani SA, Birren BW. Nature. 2005 Dec 22;438(7071):1105-15.

Wortman JR, Gilsenan JM, Joardar V, Deegan J, Clutterbuck J, Andersen MR, Archer D, Bencina M, Braus G, Coutinho P, von Döhren H, Doonan J, Driessen AJ, Durek P, Espeso E, Fekete E, Flipphi M, Estrada CG, Geysens S, Goldman G, de Groot PW, Hansen K, Harris SD, Heinekamp T, Helmstaedt K, Henrissat B, Hofmann G, Homan T, Horio T, Horiuchi H, James S, Jones M, Karaffa L, Karányi Z, Kato M, Keller N, Kelly DE, Kiel JA, Kim JM, van der Klei IJ, Klis FM, Kovalchuk A, Krasevec N, Kubicek CP, Liu B, Maccabe A, Meyer V, Mirabito P, Miskei M, Mos M, Mullins J, Nelson DR, Nielsen J, Oakley BR, Osmani SA, Pakula T, Paszewski A, Paulsen I, Pilsyk S, Pócsi I, Punt PJ, Ram AF, Ren Q, Robellet X, Robson G, Seiboth B, van Solingen P, Specht T, Sun J, Taheri-Talesh N, Takeshita N, Ussery D, vanKuyk PA, Visser H, van de Vondervoort PJ, de Vries RP, Walton J, Xiang X, Xiong Y, Zeng AP, Brandt BW, Cornell MJ, van den Hondel CA, Visser J, Oliver SG, Turner G. The 2008 update of the Aspergillus nidulans genome annotation: a community effort. Fungal Genet Biol. 2009 Mar;46 Suppl 1:S2-13.

Version Tracking for the A. nidulans Chromosomal Sequence and Genome Annotation

The version designation appears in the name of each of the relevant sequence files that are available at AspGD, so the exact source of the sequence data is always clear. This version system was implemented in AspGD on February 5, 2010.

Version designations appear in the following format:

sXX-mYY-rZZ

where XX, YY, and ZZ are zero-padded integers.

XX is incremented when there is any change to the underlying genomic (i.e., chromosome) sequence.

YY is incremented when there is any change to the coordinates of any feature annotated in the genome (e.g., any change in location or boundary, or addition or removal of a feature from the annotation). YY is reset to "01" when XX is incremented (when a sequence change is made).

ZZ is incremented in response to curatorial changes that affect information that appears in the GFF file, specifically gene names, gene aliases, gene IDs, gene descriptions, feature types (e.g., gene or pseudogene), and ORF classifications or qualifiers (e.g., Verified, Uncharacterized, Deleted, Merged). The file will be checked on a weekly basis, as well as any time that the GFF file is regenerated manually, to see if changes have occurred that warrant a change in the ZZ number. ZZ is reset to "01" when XX or YY is incremented (when a sequence change is made, or when the coordinates of any feature are updated).

As a hypothetical example to illustrate the way changes to the version designation are made, say that we start with s05-m01-r01 as the current version. When the next weekly file check is performed, the new file contains curatorial updates to gene names in the database, but no new changes to the structural annotation or to the sequence itself, so the new version designation becomes s05-m01-r02. Subsequently, the chromosomal coordinates of a gene are changed, based on curation of a paper that provides evidence for updating the gene model, and consequently the new version designation becomes s05-m02-r01. Later, a change to correct a sequencing error is made, and the new version designation becomes s06-m01-r01. Please feel free to contact us with any questions. Information about every update to the chromosome sequence and/or chromosomal location of any gene (or other annotated feature) is displayed on the AspGD Locus History page for each of the relevant genes, and also on the appropriate AspGD Chromosome History page.
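The increment and reset rules for the version designation can be sketched in a few lines of Python. This is an illustration of the rules as described above, not AspGD's own code; the class and method names are hypothetical, and the field names simply reuse the XX, YY, and ZZ placeholders.

```python
import re
from typing import NamedTuple

_VERSION_PATTERN = re.compile(r"s(\d{2,})-m(\d{2,})-r(\d{2,})")


class Version(NamedTuple):
    xx: int  # changes with the underlying chromosome sequence
    yy: int  # changes with feature coordinates
    zz: int  # changes with curatorial information in the GFF file

    @classmethod
    def parse(cls, text: str) -> "Version":
        match = _VERSION_PATTERN.fullmatch(text.strip())
        if match is None:
            raise ValueError(f"not an sXX-mYY-rZZ designation: {text!r}")
        return cls(*(int(part) for part in match.groups()))

    def __str__(self) -> str:
        # Each component is zero-padded to at least two digits.
        return f"s{self.xx:02d}-m{self.yy:02d}-r{self.zz:02d}"

    def after_sequence_change(self) -> "Version":
        # XX is incremented; YY and ZZ are reset to 01.
        return Version(self.xx + 1, 1, 1)

    def after_coordinate_change(self) -> "Version":
        # YY is incremented; ZZ is reset to 01.
        return Version(self.xx, self.yy + 1, 1)

    def after_curatorial_change(self) -> "Version":
        # Only ZZ is incremented.
        return Version(self.xx, self.yy, self.zz + 1)


v = Version.parse("s05-m01-r01")
v = v.after_curatorial_change()
print(v)  # s05-m01-r02
v = v.after_coordinate_change()
print(v)  # s05-m02-r01
v = v.after_sequence_change()
print(v)  # s06-m01-r01
```

The three printed values follow the same sequence as the hypothetical example in the text.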
