GeneSetDB Help Page
GeneSetDB provides two analysis functions: gene / gene set search and gene enrichment analysis. By using the gene / gene set search function, users can obtain gene sets that include the gene(s) or biological term(s) they are interested in. By using the gene enrichment analysis function, users can upload their own gene list and assess the list's statistical overrepresentation in gene sets. The gene enrichment analysis can be conducted by passing in all values via URL. This makes it possible to use a script to conduct the analysis.
1. Gene set definition and database summary
DB name | Gene set name | Component | Gene set # | Unique Gene # (Entrez ID) | Note |
Biocarta | Pathway Name | Genes (or proteins) composing a pathway | 313 | 1383 | Source from bioDBnet |
EHMN | Pathway Name | Genes (or proteins) composing a pathway | 67 | 2233 | |
HumanCyc | Pathway Name | Genes (or proteins) composing a pathway | 299 | 944 | version 16.0 |
INOH | Pathway Name | Genes (or proteins) composing a pathway | 88 | 1497 | Release 4.0 |
NetPath | Pathway Name | Genes (or proteins) composing a pathway | 25 | 1628 | |
PID | Pathway Name | Genes (or proteins) composing a pathway | 222 | 2552 | |
Reactome | Pathway Name | Genes (or proteins) composing a pathway | 1199 | 5846 | Version 40 |
SMPDB | Pathway Name | Genes (or proteins) composing a pathway | 399 | 860 | |
WikiPathways | Pathway Name | Genes (or proteins) composing a pathway | 171 | 3671 | Analysis collection pathways |
CancerGenes | Query terms or annotation list | Genes with an annotation | 27 | 3161 | |
KEGG Disease | Disease name | Disease or marker genes related to a disease | 925 | 2102 | |
HPO | Phenotype name | Genes representing a phenotype | 6327 | 1854 | |
MethCancerDB | Cancer type and methylation status | Hypermethylated or hypomethylated cancer-related genes | 40 | 812 | |
MethyCancer | Cancer type | Annotated or candidate cancer-related genes | 366 | 1695 | |
MPO | Phenotype name | Genes (or proteins) composing a pathway | 7314 | 6562 | |
SIDER | MedDRA side effect name (lowest level term) | Proteins targeted by a drug with a side effect | 626 | 3683 | No placebo data. Frequency score >= 0.1. Protein-drug interaction data are taken from STITCH 3 |
CTD | Drug name | Genes or proteins interacting with a drug | 5274 | 14867 | |
DrugBank | Drug name | Drug targets | 4171 | 2042 | |
MATADOR | Drug name | Proteins directly or indirectly interacting with a chemical | 723 | 1845 | |
STITCH | Drug name | Proteins with a known or predicted interaction with a chemical | 120275 | 8619 | Protein-chemical interactions with high confidence (>= 0.7) in STITCH 3 |
T3DB | Toxin name | Toxin targets | 2716 | 1081 | |
MicroCosm Targets | miRNA name | Predicted genes targeted by miRNA | 851 | 17183 | P-org by RNAhybrid < 0.05 |
miRTarBase | miRNA name | Experimentally validated miRNA target genes | 286 | 1715 | Release 2.5 |
TFactS | Transcription Factor (TF) | TF targeting genes | 342 | 2564 | TF targeting genes (SignLess) with annotation from PAZAR, NFIregulomeDB, TRED and TRRD |
Rel/NF-kappaB target genes | Rel/NF-kappaB | Rel/NF-kappaB targeting genes | 1 | 135 | Human genes with checked binding sites or putative target genes |
Gene Ontology | Ontology name | Genes with an ontology | 12464 | 17978 | |
2. Gene / Gene Set Search
2.1. Query format
Users can search for a gene or gene set name (biological term) of interest, such as "FOXL2" (gene name), "lung cancer" (disease name) or "imatinib" (drug name).
2.2. Output format
GeneSetDB shows the following five columns as the result of a gene / gene set search.
- Class : Subclass name of gene set (e.g. Drug/Chemical)
- Set Name : Name of gene set (e.g. Lipoprotein lipase)
- Source DB : Source DB of gene set (e.g. Reactome)
- Gene # : The number of genes in gene set (e.g. 15)
- Gene Names : Gene names in gene set (e.g. PPARA)
Set Name is linked to the original database. As for the "Class", GeneSetDB classifies each database into one of five subclasses based on the kind of database, as below:
Pathway: Biocarta, EHMN, HumanCyc, INOH, NetPath, PID, Reactome, SMPDB, WikiPathways
Disease/Phenotype: CancerGenes, KEGG Disease, HPO, MethCancerDB, MethyCancer, MPO, SIDER
Drug/Chemical: CTD, DrugBank, MATADOR, STITCH, T3DB
Gene Regulation: MicroCosm Targets, miRTarBase, Rel/NF-kappaB target genes, TFactS
Gene Ontology: Gene Ontology
Users can download the result as a tab delimited text file by clicking the "Download" button.
2.3. Search settings
Two tick boxes are provided to enable three different settings for the gene/gene set search.
2.3.1 Setting one ("Exact match" tick box is selected)
This setting searches the database for exact matches. For example, if the search term is "Breast cancer", only exact matches are returned.
2.3.2 Setting two ("Simple pattern search" tick box is selected)
If the "Simple pattern search" tick box is selected, the search term is separated into individual words so that "Breast cancer" becomes "Breast" and "cancer". All entries that match the search terms, or search terms with a prefix or suffix are returned. In this example, "Breast cancer" could return results such as "Abnormality of the breasts" and "colon cancer". This setting enables the user to search for his/her favourite gene symbols together with a drug or GO pathway at the same time.
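The behaviour described above can be sketched roughly as follows. This is only an illustration, assuming a plain case-insensitive substring match for each word of the query; the class and method names (`SimplePatternSearch`, `matches`, `search`) are hypothetical, and the matching rule actually used by GeneSetDB may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimplePatternSearch {
    // Returns true if any word of the query occurs inside the entry,
    // ignoring case (assumed behaviour, not GeneSetDB's actual code).
    static boolean matches(String query, String entry) {
        String haystack = entry.toLowerCase(Locale.ROOT);
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && haystack.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Collects every entry that matches at least one word of the query.
    static List<String> search(String query, List<String> entries) {
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            if (matches(query, entry)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Abnormality of the breasts", "colon cancer", "imatinib");
        System.out.println(search("Breast cancer", names));
    }
}
```

With the three example names in `main`, the two entries containing "breast" or "cancer" are kept and "imatinib" is dropped.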
2.3.3 Setting three ("Simple fuzzy search" tick box is selected)
This setting is very similar to the second setting. The difference is that minor spelling mistakes can be accommodated. If the user mistakenly enters "Breast cacer" (note the missing n) as the search term, all terms with "cancer" will still come up. Similar terms such as "carcinoma", "camera" and "breath" will also be found.
Please note that these settings are not case sensitive.
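This page does not say how the fuzzy search decides that two words are similar. One common way to tolerate small typing mistakes is the Levenshtein edit distance, sketched below. The class name `FuzzyMatchSketch`, the method names and the `maxEdits` threshold are all assumptions made for illustration and are not taken from GeneSetDB.

```java
class FuzzyMatchSketch {
    // Classic Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive comparison that accepts up to maxEdits typing mistakes.
    static boolean fuzzyEquals(String typed, String candidate, int maxEdits) {
        return editDistance(typed.toLowerCase(), candidate.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("cacer", "cancer"));
        System.out.println(fuzzyEquals("Cacer", "cancer", 1));
    }
}
```

A larger threshold accepts more distant words, which is one possible explanation for loosely related terms appearing in fuzzy results.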
3. Enrichment Analysis
3.1. Query format
Gene List
Users paste a gene list into the text box or upload a gene list text file using the file upload control. GeneSetDB currently supports three higher organisms: human, mouse and rat.
Input ID
GeneSetDB accepts the following identifier types for the gene list.
- Gene Symbol
- Entrez Gene ID
- Ensembl Gene ID
- UniProt/Swiss-Prot ID
- Affymetrix Probe Set ID (Human: U133 plus2, U133Av2, U133A, U133B, U133+2, U95Av2, Exon 1.0 ST, Gene 1.0 ST, Mouse: MOE430A, MOE430B, MOE430, MOE430Av2, U74Av2, U74Bv2, U74A, U74B, Exon 1.0 ST, Gene 1.0 ST, Rat: RAE230v2, RAE230A, RAE230B, U34A, U34B, Tox U34, Exon 1.0 ST, Gene 1.0 ST)
- Agilent Probe Set ID (Human, Mouse and Rat Whole Genome)
- CodeLink Probe Set ID (Human, Mouse and Rat Whole Genome)
- Illumina Probe Set ID (Human: WG6v1, WG6v2, HT12v3, HT12v4, Mouse: WG6v1, WG6v2, Rat: v1)
Users should select one identifier type from this list.
Choose DB
Users can select databases to be used in the enrichment analysis. Gene sets with fewer than 10 or more than 500 genes are not used. The reference (background) gene set consists of the gene IDs with at least one annotation in the selected gene set(s) (e.g. Subclass Pathway or DrugBank).
FDR (False Discovery Rate)
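The size filter and the background definition above could be sketched as follows. This is a minimal sketch under stated assumptions: gene sets are held as a map from set name to a set of gene IDs, and the background is simply the union of whatever sets are passed in. The class and method names are hypothetical, and the page does not say whether the background is built before or after the size filter is applied.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class GeneSetFilterSketch {
    // Keep only gene sets with 10 to 500 genes, as stated in the text.
    static Map<String, Set<String>> usableSets(Map<String, Set<String>> geneSets) {
        Map<String, Set<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : geneSets.entrySet()) {
            int size = e.getValue().size();
            if (size >= 10 && size <= 500) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Background: every gene ID annotated in at least one of the given sets.
    static Set<String> background(Map<String, Set<String>> selectedSets) {
        Set<String> all = new TreeSet<>();
        for (Set<String> genes : selectedSets.values()) {
            all.addAll(genes);
        }
        return all;
    }

    public static void main(String[] args) {
        // Made-up gene IDs, for illustration only.
        Set<String> tiny = new HashSet<>(Set.of("1", "2", "3"));
        Set<String> medium = new HashSet<>();
        for (int i = 0; i < 12; i++) {
            medium.add("g" + i);
        }
        Map<String, Set<String>> sets = new LinkedHashMap<>();
        sets.put("tiny set", tiny);
        sets.put("medium set", medium);
        System.out.println(usableSets(sets).keySet());
        System.out.println(background(sets).size());
    }
}
```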
The statistical significance (p-value) of the query gene list's enrichment for each of the gene sets within GeneSetDB is calculated using the hypergeometric distribution. The p-value is adjusted for multiple testing using the Benjamini and Hochberg procedure and is reported as the false discovery rate (FDR). FDR represents the expected proportion of incorrectly rejected null hypotheses. In lay terms, within the gene sets identified as significantly enriched for the query gene list, the FDR can be thought of as the expected proportion of false positives. Users type into a dialogue box the FDR (proportion of false positives) they are prepared to accept in their results.
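The two statistical steps named above can be illustrated with a short sketch. It assumes the usual one-sided (upper-tail) form of the hypergeometric test for overrepresentation, with N genes in the background, K of them in the gene set, n in the query list and k in both. The page does not spell out these details, so the tail choice, the class name `EnrichmentSketch` and the method names are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

class EnrichmentSketch {
    // log of the binomial coefficient C(n, k), computed as a sum of logs.
    static double logChoose(int n, int k) {
        if (k < 0 || k > n) {
            return Double.NEGATIVE_INFINITY;
        }
        double s = 0.0;
        for (int i = 1; i <= k; i++) {
            s += Math.log(n - k + i) - Math.log(i);
        }
        return s;
    }

    // Upper-tail hypergeometric p-value: probability of an overlap of k or more.
    static double hypergeomPValue(int N, int K, int n, int k) {
        double denom = logChoose(N, n);
        double p = 0.0;
        for (int i = Math.max(k, 0); i <= Math.min(K, n); i++) {
            p += Math.exp(logChoose(K, i) + logChoose(N - K, n - i) - denom);
        }
        return Math.min(1.0, p);
    }

    // Benjamini-Hochberg adjusted p-values, returned in the input order.
    static double[] benjaminiHochberg(double[] p) {
        int m = p.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> p[i]));
        double[] adjusted = new double[m];
        double running = 1.0;
        // Walk from the largest p-value down, enforcing monotonicity.
        for (int rank = m; rank >= 1; rank--) {
            int idx = order[rank - 1];
            running = Math.min(running, p[idx] * m / rank);
            adjusted[idx] = running;
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        System.out.println(hypergeomPValue(10000, 50, 200, 8));
        System.out.println(Arrays.toString(benjaminiHochberg(new double[] {0.01, 0.04, 0.03, 0.005})));
    }
}
```

A gene set would then be reported when its adjusted value is at or below the FDR threshold entered by the user.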
3.2. Output format
The result of enrichment analysis shows the following table.
- Class : Subclass name of gene set
- Set Name : Name of gene set
- Source DB : Source database of gene set
- Gene # : The number of genes in a gene set of GeneSetDB
- Gene # with Set: Set Name: The number of genes in a gene set of GeneSetDB and in the user's gene list
- Gene # without Set: Set Name: The number of genes "not" in a gene set of GeneSetDB and in the user's gene list
- p-value: p-value calculated by hypergeometric distribution
- FDR: False discovery rate, i.e. the p-value corrected by the Benjamini and Hochberg method
Set Name is hyperlinked to the original database, as in the gene / gene set search. Users can download the result as a tab delimited text file by clicking the "Download" button.
4. Calling GeneSetDB enrichment analysis using a script
4.1. Specifying the request URL
Bioinformaticians may wish to access GeneSetDB automatically through a script. If you do this, please use a 20 second delay between queries to avoid overloading our server. Please note that submitting jobs with a huge gene list and using many or all databases may cause the server to stop, and the output may be empty. The heatmap of overlapping proportions of gene sets is not provided if you use GeneSetDB this way. The output is the tab delimited txt file described in 2.2.
The URL for this type of request is in form of:
http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=<list of genes>&id=<gene identifier>&db=<database>&fdr=<false discovery rate>
- list of genes: A list of gene identifiers separated by commas. For example: foxl1,foxl2,... Please do not query with more than 800 genes.
- gene identifier: The type of identifier used in the gene list above. For example: symbol. Section 4.3 contains a list of supported identifiers.
- database: This specifies which database(s) to query. As with the list of genes in the first part of the URL, this is given as a comma separated list. For example: Pathway,STITCH,... For a list of available choices see section 4.3.
- false discovery rate: A value between 0 and 1 to filter the output according to the desired FDR control. For example: 0.05.
The finished example request URL would be:
http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05
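As an illustration, the request URL can be assembled from its four parts with a small helper like the one below. The class `GeneSetDbUrl` and its `build` method are hypothetical names, not part of GeneSetDB. The sketch checks the 800 gene limit and the 0 to 1 range for the FDR; rejecting an empty gene list is an extra assumption. No URL encoding is applied, so it assumes identifiers and database codes that are safe to place in a query string as they are.

```java
import java.util.List;

class GeneSetDbUrl {
    static final String BASE = "http://genesetdb.auckland.ac.nz/HyperTestRemote.php";

    // Joins the parameters into a request URL of the form described above.
    static String build(List<String> genes, String idType, List<String> databases, double fdr) {
        if (genes.isEmpty() || genes.size() > 800) {
            throw new IllegalArgumentException("gene list must contain 1 to 800 entries");
        }
        if (fdr < 0.0 || fdr > 1.0) {
            throw new IllegalArgumentException("fdr must be between 0 and 1");
        }
        return BASE
                + "?genelist=" + String.join(",", genes)
                + "&id=" + idType
                + "&db=" + String.join(",", databases)
                + "&fdr=" + fdr;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("foxl1", "foxl2"), "symbol", List.of("All"), 0.05));
    }
}
```

Called with the values used in this section, the helper reproduces the example request URL.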
4.2. Example code
Some sample code you can try in R. This code was written in R 2.14.2 on Ubuntu Linux; earlier versions of R might also work. (We confirmed that the R code works on Linux and Mac, and the Java code on Linux and Windows.)
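# Request URL (see section 4.1) and the file the result is saved to
query.string <- "http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05"
filedir <- file.path(getwd(), "genesetDB.txt")
# Download the tab delimited result; method "wget" requires wget to be installed
download.file(query.string, filedir, method = "wget", cacheOK = FALSE)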
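Some Java code you could try.
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class GeneSetDBDownload {
    public static void main(String[] args) {
        String url = "http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05";
        // try-with-resources closes the channel and the output file, even on failure
        try (ReadableByteChannel rbc = Channels.newChannel(new URL(url).openStream());
             FileOutputStream fos = new FileOutputStream("genesetDB.txt")) {
            // Copy up to 16 MB (1 << 24 bytes) of the response into genesetDB.txt
            fos.getChannel().transferFrom(rbc, 0, 1 << 24);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}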
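When several queries are sent from one script, the 20 second delay requested in section 4.1 can be built into the loop. The sketch below is one possible way to do this. The class `PoliteBatchSketch`, the method `runAll` and the `fetch` parameter are hypothetical names; `fetch` stands for whatever download routine you use, so the sketch itself makes no network calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PoliteBatchSketch {
    // Applies fetch to each URL in turn, pausing delayMillis between
    // consecutive requests (no pause before the first one).
    static List<String> runAll(List<String> urls, long delayMillis, Function<String, String> fetch) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop the batch.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            results.add(fetch.apply(urls.get(i)));
        }
        return results;
    }
}
```

A call such as `runAll(urls, 20_000L, myDownloader)` would then wait 20 seconds between requests, where `myDownloader` is a placeholder for your own function that downloads one URL and returns its content.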
4.3. Supported identifier and database codes
Supported Identifier |
As shown in GeneSetDB | To use in query |
Symbol | symbol |
Gene ID Human | geneidhs |
Gene ID Mouse | geneidmm |
Gene ID Rat | geneidrn |
Affy HG-U133 Plus2 | affyhgu133p2 |
Affy HG-U133A 2.0 | affyhgu133a2 |
Affy HG-U133A | affyhgu133a |
Affy HG-U133B | affyhgu133b |
Affy HG-U95A | affyhgu95a |
Affy HG-U95Av2 | affyhgu95av2 |
Affy HG Exon 1.0 ST | affyhgexst |
Affy HG Gene 1.0 ST | affyhggnst |
Affy MOE430A | affymoe430a |
Affy MOE430B | affymoe430b |
Affy MOE430 2.0 | affymoe4302 |
Affy MOE430A 2.0 | affymoe430a2 |
Affy MG-U74Av2 | affymgu74av2 |
Affy MG-U74Bv2 | affymgu74bv2 |
Affy MG-U74A | affymgu74a |
Affy MG-U74B | affymgu74b |
Affy MG Exon 1.0 ST | affymgexst |
Affy MG Gene 1.0 ST | affymggnst |
Affy RAE230A | affyrae230a |
Affy RAE230B | affyrae230b |
Affy RAE230 2.0 | affyrae2302 |
Affy RG-U34A | affyrgu34a |
Affy RG-U34B | affyrgu34b |
Affy Rat Tox U34 | affyrgtu34 |
Affy RG Exon 1.0 ST | affyrgexst |
Affy RG Gene 1.0 ST | affyrggnst |
Agilent Hs WholeGenome | agilwghs |
Agilent Mm WholeGenome | agilwgmm |
Agilent Rn WholeGenome | agilwgrn |
CodeLink Hs WholeGenome | clwghs |
CodeLink Mm WholeGenome | clwgmm |
CodeLink Rn WholeGenome | clwgrn |
Illumina Hs WG6 v1 | illuwg6v1hs |
Illumina Hs WG6 v2 | illuwg6v2hs |
Illumina Hs HT12 v3 | illuht12v3hs |
Illumina Hs HT12 v4 | illuht12v4hs |
Illumina Mm WG6 v1 | illuwg6v1mm |
Illumina Mm WG6 v2 | illuwg6v2mm |
Illumina Rn v1 | illuv1rn |
Ensembl Gene Hs | ensemblghs |
Ensembl Gene Mm | ensemblgmm |
Ensembl Gene Rn | ensemblgrn |
UniProt/Swiss-Prot Hs | upsphs |
UniProt/Swiss-Prot Mm | upspmm |
UniProt/Swiss-Prot Rn | upsprn |
|
Supported Databases |
As shown in GeneSetDB | To use in query |
All | All |
Subclass Pathway | Pathway |
Subclass Disease/Phenotype | Disease/Phenotype |
Subclass Drug/Chemical | Drug/Chemical |
Subclass Gene Regulation | GeneRegulation |
Subclass Gene Ontology | GO |
Reactome | Reactome |
NetPath | NetPath |
HumanCyc | HumanCyc |
WikiPathways | WikiPathways |
INOH | INOH |
EHMN | EHMN |
Biocarta | Biocarta |
DrugBank | DrugBank |
MATADOR | MATADOR |
CTD | CTD |
STITCH | STITCH |
T3DB | T3DB |
SMPDB | SMPDB |
SIDER | SIDER |
HPO | HPO |
MPO | MPO |
KEGG Disease | KEGGDisease |
MethyCancer | MethyCancer |
MethCancerDB | MethCancerDB |
CancerGenes | CancerGenes |
TFactS | TFactS |
miRTarBase | miRTarBase |
MicroCosm Targets | MicroCosmTargets |
Rel/NF-kappaB target genes | Rel/NF-kappaB |
GO Biological Process | GO_BP |
GO Molecular Function | GO_MF |
GO Cellular Component | GO_CC |
|