
It is easy to link to WormBase, or to extract information from it for data mining purposes. You can link to HTML pages, text-only dumps, and XML pages.
All objects in WormBase have a name and a class. The name is a unique identifier which is usually, but not invariably, human-readable. The class describes the type of the object, such as "Sequence" or "Protein".
For example, the cell ADAR has a name of "ADAR" and a class of "Cell".
For historical reasons, the classes of objects are not always what you expect. The class of a predicted gene such as ZK154.1 is "Predicted_Gene", which is unsurprising, but it is less obvious that the named gene zyg-1 has a class of "Locus". The table at the bottom of this page lists some of the common classes.
If you are not sure, you can learn the name and class of a particular WormBase object by performing a search for the object. Once you find and display it, look at the URL at the top of the page. The URL will contain the arguments name= and class=. These are the name and class of the object.
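As a rough illustration of reading those two arguments out of a URL, here is a minimal Python sketch. The function name `name_and_class` is made up for this example. It splits the query string by hand because WormBase URLs separate arguments with ";" rather than "&".

```python
import re
from urllib.parse import urlsplit, unquote

def name_and_class(url):
    """Return (name, class) parsed from a WormBase object URL.

    Hypothetical helper for illustration; not part of WormBase.
    """
    query = urlsplit(url).query
    params = {}
    # WormBase separates arguments with ';'. Accept '&' as well, just in case.
    for pair in re.split(r"[;&]", query):
        if "=" in pair:
            key, value = pair.split("=", 1)
            params[key] = unquote(value)
    return params.get("name"), params.get("class")

print(name_and_class(
    "http://www.wormbase.org/db/get?name=F59E12.2;class=Predicted_Gene"))
# ('F59E12.2', 'Predicted_Gene')
```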
To link to a WormBase web page describing the object, create a link to http://www.wormbase.org/db/get?name=X;class=Y, where X and Y are replaced with the name and class of the object you wish to retrieve. For example:
<a href="http://www.wormbase.org/db/get?name=F59E12.2;class=Predicted_Gene">F59E12.2</a>
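If you generate many such links from a script, the pattern can be wrapped in a small helper. The following Python sketch is only an illustration: the names `object_url` and `object_link` are invented here, and the optional `path` argument assumes that other WormBase paths taking the same name= and class= arguments are called in the same way.

```python
from html import escape
from urllib.parse import quote

BASE = "http://www.wormbase.org"

def object_url(name, cls, path="/db/get"):
    """Build a WormBase URL for an object with the given name and class."""
    # Keep ':' and wildcard characters readable; escape anything else unusual.
    return f"{BASE}{path}?name={quote(name, safe=':*?')};class={quote(cls)}"

def object_link(name, cls, path="/db/get"):
    """Wrap the URL in an HTML anchor that uses the object name as link text."""
    url = object_url(name, cls, path)
    return f'<a href="{escape(url, quote=True)}">{escape(name)}</a>'

print(object_link("F59E12.2", "Predicted_Gene"))
# <a href="http://www.wormbase.org/db/get?name=F59E12.2;class=Predicted_Gene">F59E12.2</a>
```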
To link to an XML dump of the object, create a link to
http://www.wormbase.org/db/misc/xml?name=X;class=Y, where
X and Y are again replaced with the object name and class. For
example:
<a href="http://www.wormbase.org/db/misc/xml?name=WBGene00006988;class=Gene">WBGene00006988</a>
To link to a text-only representation of the object, create a link to
http://www.wormbase.org/db/misc/text?name=X;class=Y,
where X and Y are again replaced with the object name and class. For
example:
<a href="http://www.wormbase.org/db/misc/text?name=WBGene00006988;class=Gene">WBGene00006988</a>
If you know the contig name, you can link to an image of the C. elegans
physical map by creating a link to
http://www.wormbase.org/db/misc/epic?name=X;class=Map,
where X is replaced with the contig of interest. To link to a
particular clone on the map, link to:
http://www.wormbase.org/db/misc/epic?name=X;class=Clone,
where X is replaced with the name of the clone. For example:
<a href="http://www.wormbase.org/db/misc/epic?name=F59E12;class=Clone">F59E12</a>
To link to a region of the genome in the genome browser, create a URL like this one:
<a href="http://www.wormbase.org/db/seq/gbrowse?source=wormbase;name=F59E12.2">F59E12.2</a>
The "name=" argument can be filled with almost any landmark, including predicted gene names, locus names, cosmid names, and chromosome names. You can also combine a landmark with a range, such as II:1000..20000, which shows chromosome II from positions 1000 to 20,000, inclusive.
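The landmark and range syntax can be assembled programmatically. The Python sketch below is an illustration only; `gbrowse_url` is a hypothetical helper name, and the URL layout is copied from the example above.

```python
from urllib.parse import quote

def gbrowse_url(landmark, start=None, stop=None, source="wormbase"):
    """Build a genome browser URL for a landmark, optionally with a range."""
    name = landmark
    if start is not None and stop is not None:
        if start > stop:
            raise ValueError("start must not be greater than stop")
        # Range syntax from the text: landmark:start..stop (inclusive)
        name = f"{landmark}:{start}..{stop}"
    return ("http://www.wormbase.org/db/seq/gbrowse"
            f"?source={source};name={quote(name, safe=':')}")

print(gbrowse_url("II", 1000, 20000))
# http://www.wormbase.org/db/seq/gbrowse?source=wormbase;name=II:1000..20000
```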
You can create an inline image of a portion of the genome using a URL like this one:
<img src="http://www.wormbase.org/db/seq/gbrowse_img?source=wormbase;name=mec-3;width=650">
This will produce a 650-pixel-wide image of the region around mec-3.
There are many more options. You can turn on tracks, add your own tracks, and much more. See the gbrowse_img help page for details.
To link to the genetic map of a chromosome, create a link like this one:
<a href="http://www.wormbase.org/db/misc/epic?class=Map;name=III">III</a>
To link to a region of the map defined by a centimorgan interval, add the map_start and map_stop arguments:
<a href="http://www.wormbase.org/db/misc/epic?class=Map;name=III;map_start=3;map_stop=4">III:3..4</a>
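A short Python sketch of the same idea, assuming the URL layout shown in the examples above. The function name `genetic_map_url` is invented for illustration.

```python
def genetic_map_url(chromosome, map_start=None, map_stop=None):
    """Build a genetic map URL, optionally limited to a centimorgan interval."""
    url = f"http://www.wormbase.org/db/misc/epic?class=Map;name={chromosome}"
    # Add the interval only when both ends are given.
    if map_start is not None and map_stop is not None:
        url += f";map_start={map_start};map_stop={map_stop}"
    return url

print(genetic_map_url("III", 3, 4))
# http://www.wormbase.org/db/misc/epic?class=Map;name=III;map_start=3;map_stop=4
```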
To link to an ACEDB representation of the object suitable for loading
into a local ACEDB database, create a link to
http://www.wormbase.org/db/misc/acedb?name=X;class=Y,
where X and Y are again replaced with the object name and class. For
example:
<a href="http://www.wormbase.org/db/misc/acedb?name=F59E12.2;class=Predicted_Gene">F59E12.2</a>
You can fetch multiple objects by using wildcards in the name, where "*" matches any sequence of characters and "?" matches exactly one character. For example, to retrieve all RME cells in XML format, use:
<a href="http://www.wormbase.org/db/misc/xml?name=RME*;class=Cell">All RME Cells</a>
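To get a feel for what the two wildcards select, the following Python sketch applies the same two rules to a small sample list of cell names using the standard `fnmatch` module. This is only a local illustration. The real matching is done by the WormBase server, and `fnmatch` also gives special meaning to square brackets, which is not claimed here for WormBase.

```python
from fnmatch import fnmatchcase

# A small sample of cell names, chosen for illustration only.
cells = ["RMED", "RMEL", "RMER", "RMEV", "RMDL", "ADAR"]

def select(pattern, names):
    """Return the names matched by a pattern containing '*' and '?'."""
    return [n for n in names if fnmatchcase(n, pattern)]

print(select("RME*", cells))   # ['RMED', 'RMEL', 'RMER', 'RMEV']
print(select("RM?L", cells))   # ['RMEL', 'RMDL']
```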
The data model for any WormBase object can be viewed by choosing the "Schema" menu item from the yellow navigation bar at the top of any object display page. You can search for particular data models using the simple search on the front page, and selecting "Model" as the type of object you are searching for.
WormBase uses ACEDB data models. Their format is described in detail at acedb.org.
WormBase can be queried from the command line using AcePerl, which lets you write sophisticated Perl scripts to mine WormBase. See the AcePerl pages for information on downloading, installing and using this software.
Network access to the C. elegans database is available at the following site:
| Location | Host | Port |
|---|---|---|
| Cold Spring Harbor Laboratory | aceserver.cshl.org | 2005 |
Be aware that you will be sharing this server with other people. If it seems slow, it may be because others are using it. Wait a while and try again.
The BioPerl library provides a simple relational schema and database access layer for querying genomic features. This is the access method of choice to use for mining WormBase for:
You will need to install your own local database. Currently WormBase does not provide world access to the MySQL database of genome features (although this may change in the future). You must:
After downloading the C. elegans FASTA files, you should combine them into a single file to facilitate loading. This is most easily done like this:
% gunzip -c CHROMOSOME_*.dna.gz EST_Elegans.dna.gz > CElegans.fa
You do not have to do this with C. briggsae, because its DNA data is already combined into one file.
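If gunzip is not available, the same concatenation can be sketched in Python. The file names below are the ones used in the command above, and `combine_fasta` is a hypothetical helper name. The sketch simply decompresses each input in turn and appends it to one output file.

```python
import glob
import gzip
import shutil

def combine_fasta(inputs, out_path):
    """Decompress each gzipped FASTA file and append it to out_path."""
    with open(out_path, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)

if __name__ == "__main__":
    # Same inputs as the shell command; files that are absent are skipped.
    files = sorted(glob.glob("CHROMOSOME_*.dna.gz")) + glob.glob("EST_Elegans.dna.gz")
    if files:
        combine_fasta(files, "CElegans.fa")
```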
Once these are downloaded, create appropriately-named databases using the MySQL or PostgreSQL administrators' tools. You may create separate databases for each of the genomes (recommended) or put them together in the same database. Assuming that you have created two MySQL databases, one named "elegans" and the other named "briggsae", you will use the bulk_load_gff.pl tool to load them from the downloaded files:
% bulk_load_gff.pl -c -d elegans -fasta CElegans.fa elegansWSXXXX.gff.gz
% bulk_load_gff.pl -c -d briggsae -fasta briggsae_25.fa.gz briggsae_25.gff.gz
The bulk_load_gff.pl program comes with BioPerl. Look for it in the subdirectory scripts/Bio-DB-GFF.
If you are using PostgreSQL, use either "load_gff.pl", which works but is very slow, or "pg_bulk_load_gff.pl", which is fast and optimized for PostgreSQL. Both have the same command-line syntax.
Once you've got the database loaded, you can write scripts to mine the data. For example, this script will find all named (3-letter) genes that are contained within the intron of another named or predicted gene and print out 100 bp upstream from their 5' end:
#!/usr/bin/perl
# find 3-letter named genes that are contained within the intron of another gene
use strict;
use Bio::DB::GFF;
my $db = Bio::DB::GFF->new('elegans');
my $intron_stream = $db->get_seq_stream('intron:curated');
while (my $intron = $intron_stream->next_feature) {
    my @contained_genes = $intron->contained_features('gene:curated') or next;
    for my $gene (@contained_genes) {
        # 100 bp upstream - position 0 is 1 bp to left of translational start
        my $upstream = $gene->subseq(-99,0);
        print $gene->name,"\t",$upstream->dna,"\n";
    }
}
The output starts like this:
fkh-6 acctccgtcttcacagttccgagaccccgccctcactcttagcttctgcataatccgttgtctcatttgacaccccctaccataaaaaaatacaataatc
kin-31 aaaaaaaaatcgattttatcaaaaaacaatttatttcacatttttgtataactgacactcgtcagaattgtaaaaaccattaatttcatcgttgcattaa
...
To fetch all gene models and print their coding regions and UTRs, use the following script:
#!/usr/bin/perl
use strict;
use Bio::DB::GFF;
my $db = Bio::DB::GFF->new(-dsn         => 'elegans',
                           -aggregators => 'gene_model{coding_exon,5_UTR,3_UTR/CDS}');
my $gene_stream = $db->get_seq_stream('gene_model:curated');
while (my $gene = $gene_stream->next_seq) {
    print $gene->name,"\n";
    for my $part ($gene->get_SeqFeatures) {
        print "\t",join("\t",$part->method,$part->start,$part->end),"\n";
    }
    print "\n";
}
This will produce output like:
2L52.1
    coding_exon II 1867 1911 1
    coding_exon II 2506 2694 1
    coding_exon II 2738 2888 1
    coding_exon II 2931 3036 1
    coding_exon II 3406 3552 1
    coding_exon II 3802 3984 1
    coding_exon II 4201 4663 1

2RSSE.1
    5_UTR II 15268097 15268367 1
    coding_exon II 15268368 15268441 1
    coding_exon II 15269346 15269681 1
    coding_exon II 15269747 15269918 1
    coding_exon II 15270683 15270860 1
    coding_exon II 15272930 15273201 1
See the documentation for the BioPerl Bio::DB::GFF class for details. Note the use of the aggregator "gene_model{coding_exon,5_UTR,3_UTR/CDS}", which says to aggregate parts of type coding_exon, 5_UTR, 3_UTR and CDS into a single feature of type "gene_model."
The following table lists common WormBase classes.
| Object | Name | Class | Notes |
|---|---|---|---|
| A Predicted Gene | A cosmid dot name, such as F59E12.2 | Predicted_Gene | A class name of "Sequence" is also recognized. |
| A Named Gene | A three-letter name, such as zyg-1 | Locus | |
| A Genbank Accession Number (protein or nucleotide) | The accession number | Accession_Number | This will retrieve an Accession_Number object, which is a list of WormBase names that correspond to that accession number. |
| A Protein | A wormpep accession number, preceded by "WP:", as in WP:CE28571 | Protein | |
| A Clone | The cosmid or YAC name, for example F59E12 | Clone | |
| Genomic Sequence | The cosmid or YAC name from which the sequence was made, for example F59E12 | Genomic_Sequence | A class name of "Sequence" is also recognized. |
| A GenePair | The Research Genetics name, preceded by the prefix "sjj_" (for Steve Jones, who designed the pairs). For example, sjj_F59E12.2. | PCR_Product | |
| A Cell | A cell name, such as ADAR; notice that paired cells, such as ADAR and ADAL are treated separately | Cell | |
| A Protein Family | The accession number, preceded by the database name. Examples include the Interpro family INTERPRO:IPR000039, the Prosite motif PS:PS00041, and the PFAM family PFAM:PF00352 | Motif | |
| An RNAi Experiment | The accession number of the experiment. This is typically the name of the overlapping gene preceded by the laboratory prefix. For example, an RNAi experiment that covers gene F59E12.2 and was performed in Julie Ahringer's lab (JA) will have the name JA:F59E12.2 | RNAi | |
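When generating links from a script, the naming rules in the table can be captured in a small lookup. The Python sketch below covers only a few rows, and the keys (such as "genepair") and the function name `wormbase_name_and_class` are invented for illustration. The class names and prefixes are taken from the table.

```python
# Maps an informal object kind to (WormBase class, required name prefix).
CLASS_RULES = {
    "predicted_gene": ("Predicted_Gene", ""),
    "named_gene": ("Locus", ""),
    "protein": ("Protein", "WP:"),
    "clone": ("Clone", ""),
    "genepair": ("PCR_Product", "sjj_"),
    "cell": ("Cell", ""),
}

def wormbase_name_and_class(kind, identifier):
    """Return (name, class), adding the prefix from the table if it is missing."""
    cls, prefix = CLASS_RULES[kind]
    name = identifier if identifier.startswith(prefix) else prefix + identifier
    return name, cls

print(wormbase_name_and_class("protein", "CE28571"))    # ('WP:CE28571', 'Protein')
print(wormbase_name_and_class("genepair", "F59E12.2"))  # ('sjj_F59E12.2', 'PCR_Product')
```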