GeneSetDB Help Page

GeneSetDB provides two analysis functions: gene / gene set search and gene enrichment analysis. By using the gene / gene set search function, users can obtain gene sets that include the gene(s) or biological term(s) they are interested in. By using the gene enrichment analysis function, users can upload their own gene list and assess the list's statistical overrepresentation in gene sets. The gene enrichment analysis can be conducted by passing in all values via URL. This makes it possible to use a script to conduct the analysis.

1. Gene set definition and database summary

DB name | Gene set name | Component | Gene set # | Unique Gene # (Entrez ID) | Note
Biocarta | Pathway name | Genes (or proteins) composing a pathway | 313 | 1383 | Source from bioDBnet
EHMN | Pathway name | Genes (or proteins) composing a pathway | 67 | 2233 | -
HumanCyc | Pathway name | Genes (or proteins) composing a pathway | 299 | 944 | Version 16.0
INOH | Pathway name | Genes (or proteins) composing a pathway | 88 | 1497 | Release 4.0
NetPath | Pathway name | Genes (or proteins) composing a pathway | 25 | 1628 | -
PID | Pathway name | Genes (or proteins) composing a pathway | 222 | 2552 | -
Reactome | Pathway name | Genes (or proteins) composing a pathway | 1199 | 5846 | Version 40
SMPDB | Pathway name | Genes (or proteins) composing a pathway | 399 | 860 | -
WikiPathways | Pathway name | Genes (or proteins) composing a pathway | 171 | 3671 | Analysis collection pathways
CancerGenes | Query terms or annotation list | Genes with an annotation | 27 | 3161 | -
KEGG Disease | Disease name | Disease or marker genes related to a disease | 925 | 2102 | -
HPO | Phenotype name | Genes representing a phenotype | 6327 | 1854 | -
MethCancerDB | Cancer type and methylation status | Hypermethylated or hypomethylated cancer-related genes | 40 | 812 | -
MethyCancer | Cancer type | Annotated or candidate cancer-related genes | 366 | 1695 | -
MPO | Phenotype name | Genes (or proteins) composing a pathway | 7314 | 6562 | -
SIDER | MedDRA side effect name (lowest level term) | Proteins targeted by a drug with a side effect | 626 | 3683 | No placebo data. Frequency score >= 0.1. Protein-drug interaction data taken from STITCH 3
CTD | Drug name | Genes or proteins interacting with a drug | 5274 | 14867 | -
DrugBank | Drug name | Drug targets | 4171 | 2042 | -
MATADOR | Drug name | Proteins interacting directly or indirectly with a chemical | 723 | 1845 | -
STITCH | Drug name | Proteins with a known or predicted interaction with a chemical | 120275 | 8619 | Protein-chemical interactions with high confidence (>= 0.7) in STITCH 3
T3DB | Toxin name | Toxin targets | 2716 | 1081 | -
MicroCosm Targets | miRNA name | Predicted genes targeted by miRNA | 851 | 17183 | P-org by RNAhybrid < 0.05
miRTarBase | miRNA name | miRNA target genes with experimental validation | 286 | 1715 | Release 2.5
TFactS | Transcription factor (TF) | TF target genes | 342 | 2564 | TF target genes (SignLess) with annotation from PAZAR, NFIregulomeDB, TRED and TRRD
Rel/NF-kappaB target genes | Rel/NF-kappaB | Rel/NF-kappaB target genes | 1 | 135 | Human genes with checked binding sites or putative target genes
Gene Ontology | Ontology name | Genes with an ontology | 12464 | 17978 | -

2. Gene / Gene Set Search

2.1. Query format

Users can search for a gene or gene set name (i.e. a biological term) of interest, such as "FOXL2" (gene name), "lung cancer" (disease name) or "imatinib" (drug name).

2.2. Output format

GeneSetDB shows five columns as the result of a gene / gene set search.
Set Name is linked to the original database. For the "Class" column, GeneSetDB classifies each database into one of five subclasses according to the kind of database, as follows:
Pathway: Biocarta, EHMN, HumanCyc, INOH, NetPath, PID, Reactome, SMPDB, WikiPathways
Disease/Phenotype: CancerGenes, KEGG Disease, HPO, MethCancerDB, MethyCancer, MPO, SIDER
Drug/Chemical: CTD, DrugBank, MATADOR, STITCH, T3DB
Gene Regulation: MicroCosm Targets, miRTarBase, Rel/NF-kappaB target genes, TFactS
Gene Ontology: Gene Ontology

Users can download the results as a tab-delimited text file by clicking the "Download" button.

2.3. Search settings

Two tick boxes are provided to enable three different settings for the gene/gene set search.

2.3.1 Setting one ("Exact match" tick box is selected)

This setting searches the database for exact matches. For example, if the search term is "Breast cancer", only exact matches are returned.

2.3.2 Setting two ("Simple pattern search" tick box is selected)

If the "Simple pattern search" tick box is selected, the search term is separated into individual words, so that "Breast cancer" becomes "Breast" and "cancer". All entries that match one of the search terms, or a search term with a prefix or suffix, are returned. In this example, "Breast cancer" could return results such as "Abnormality of the breasts" and "colon cancer". This setting enables users to search for their favourite gene symbols together with a drug or GO pathway at the same time.
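The behaviour described above can be illustrated with a minimal Java sketch. It assumes that "a search term with a prefix or suffix" means a case-insensitive substring match and that an entry is returned as soon as one word matches. The class and method names are hypothetical, and this is not GeneSetDB's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not GeneSetDB's actual search code.
final class SimplePatternSearch {

    // Returns the entries that contain at least one word of the query,
    // ignoring case, so "Breast" also matches "breasts".
    static List<String> search(String query, List<String> entries) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            String lower = entry.toLowerCase(Locale.ROOT);
            for (String word : words) {
                if (!word.isEmpty() && lower.contains(word)) {
                    hits.add(entry);
                    break; // one matching word is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> entries = List.of(
                "Abnormality of the breasts", "colon cancer", "imatinib");
        // Prints the two entries that contain "breast" or "cancer".
        System.out.println(search("Breast cancer", entries));
    }
}
```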

2.3.3 Setting three ("Simple fuzzy search" tick box is selected)

This setting is very similar to the second setting. The difference is that minor spelling mistakes can be accommodated. If the user mistakenly enters "Breast cacer" (note the missing n) as the search term, all terms with "cancer" will still come up. Similar terms such as "carcinoma", "camera" and "breath" will also be found.
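One common way to tolerate minor spelling mistakes is to compare words by edit distance. The sketch below shows that idea only; this page does not state which algorithm GeneSetDB uses, so the Levenshtein distance, the word-by-word comparison, the maxEdits parameter and all names are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical illustration only; the algorithm GeneSetDB uses is not stated.
final class SimpleFuzzySearch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if any word of the entry is within maxEdits edits of any query word.
    static boolean fuzzyMatches(String query, String entry, int maxEdits) {
        String[] queryWords = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] entryWords = entry.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (String q : queryWords) {
            for (String w : entryWords) {
                if (editDistance(q, w) <= maxEdits) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "cacer" is one insertion away from "cancer", so this entry matches.
        System.out.println(fuzzyMatches("Breast cacer", "colon cancer", 1));
    }
}
```

A larger maxEdits makes the search more permissive, which is one way loosely related words can end up in the results.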

Please note that these settings are not case sensitive.

3. Enrichment Analysis

3.1. Query format

Gene List
Users can paste a gene list into the input window or upload a gene list text file directly using the file browser. GeneSetDB currently supports three higher organisms: human, mouse and rat.

Input ID
GeneSetDB accepts several identifier types for the gene list.
Users should select the one that matches their list.

Choose DB
Users can select the databases to be used in the enrichment analysis. Gene sets with fewer than 10 or more than 500 genes are not used. The reference (background) gene set consists of the Gene IDs with at least one annotation in the selected gene set(s) (e.g. Subclass Pathway or DrugBank).

FDR (False Discovery Rate)
The statistical significance (p-value) of the query gene list's enrichment for each of the gene sets within GeneSetDB is calculated using the hypergeometric distribution. The p-values are adjusted for multiple testing using the Benjamini and Hochberg procedure and reported as the false discovery rate (FDR). The FDR represents the expected proportion of incorrectly rejected null hypotheses. In lay terms, within the gene sets identified as significantly enriched for the query gene list, the FDR can be thought of as the expected proportion of false positives. Users type into a dialogue box the FDR (proportion of false positives) they are prepared to accept in their results.
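The calculation described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not GeneSetDB's own code: it assumes a one-sided (upper-tail) hypergeometric test, which is the usual choice for overrepresentation, and the class name, method names and example numbers are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not GeneSetDB's actual implementation.
final class EnrichmentStats {

    // log(n!) for n = 0..max, built by cumulative summation.
    static double[] logFactorials(int max) {
        double[] lf = new double[max + 1];
        for (int i = 1; i <= max; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    // log of the binomial coefficient C(n, k).
    static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    // P(X >= overlap) when listSize genes are drawn from a background of
    // backgroundSize genes, setSize of which belong to the gene set.
    static double hypergeometricPValue(int backgroundSize, int setSize,
                                       int listSize, int overlap) {
        double[] lf = logFactorials(backgroundSize);
        double logDenominator = logChoose(lf, backgroundSize, listSize);
        int maxOverlap = Math.min(setSize, listSize);
        int minOverlap = Math.max(overlap,
                Math.max(0, listSize - (backgroundSize - setSize)));
        double p = 0.0;
        for (int x = minOverlap; x <= maxOverlap; x++) {
            p += Math.exp(logChoose(lf, setSize, x)
                    + logChoose(lf, backgroundSize - setSize, listSize - x)
                    - logDenominator);
        }
        return Math.min(1.0, p);
    }

    // Benjamini-Hochberg adjusted p-values, returned in the input order.
    static double[] benjaminiHochberg(double[] p) {
        int m = p.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> p[i]));
        double[] adjusted = new double[m];
        double runningMin = 1.0;
        // Walk from the largest p-value down, enforcing monotonicity.
        for (int rank = m; rank >= 1; rank--) {
            int idx = order[rank - 1];
            runningMin = Math.min(runningMin, p[idx] * m / rank);
            adjusted[idx] = runningMin;
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // Made-up numbers: 5000 background genes, a gene set of 100 genes,
        // a query list of 200 genes, 12 of which fall in the gene set.
        double p = hypergeometricPValue(5000, 100, 200, 12);
        System.out.println("p-value: " + p);
        double[] adjusted = benjaminiHochberg(new double[] {p, 0.2, 0.5});
        System.out.println("adjusted: " + Arrays.toString(adjusted));
    }
}
```

With values like these, a gene set would be reported when its adjusted value is at or below the FDR the user entered.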

3.2. Output format

The result of the enrichment analysis is shown as a table.
As in the gene / gene set search, Set Name is hyperlinked to the original database. Users can download the results as a tab-delimited text file by clicking the "Download" button.

4. Calling GeneSetDB enrichment analysis using a script

4.1. Specifying the request URL

Bioinformaticians may wish to access GeneSetDB automatically through a script. If you do so, please leave a 20-second delay between queries to avoid overloading our server. Please note that submitting jobs with a very large gene list and many or all databases may cause the server to stop, and the output may be empty. The heatmap of overlapping proportions of gene sets is not provided when GeneSetDB is used this way. The output is the tab-delimited text file described in 2.2.

The URL for this type of request takes the form: <list of genes>&id=<gene identifier>&db=<database>&fdr=<false discovery rate>

The finished example request URL would be:,foxl2&id=symbol&db=All&fdr=0.05
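The request format and the 20-second delay described in this section can be combined in a small Java helper, sketched below. The part of the URL that precedes the gene list is not reproduced on this page, so it is passed in as a placeholder (requestPrefix) that you must supply yourself. The comma-separated gene list, the class and method names and the example genes are assumptions made for illustration; only the id, db and fdr parameters come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; not part of GeneSetDB.
final class GeneSetDbBatch {

    // Delay between queries requested in section 4.1 (20 seconds).
    static final long DELAY_MILLIS = 20_000L;

    // Appends the gene list and the id, db and fdr parameters to the prefix.
    // requestPrefix is everything that comes before the gene list (assumption:
    // the caller knows it; it is not given here). Values are not URL-encoded.
    static String buildUrl(String requestPrefix, List<String> genes,
                           String id, String db, String fdr) {
        return requestPrefix + String.join(",", genes)
                + "&id=" + id + "&db=" + db + "&fdr=" + fdr;
    }

    // Passes each URL to fetch, sleeping delayMillis between consecutive calls.
    static void runAll(List<String> urls, Consumer<String> fetch,
                       long delayMillis) throws InterruptedException {
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                Thread.sleep(delayMillis);
            }
            fetch.accept(urls.get(i));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String requestPrefix = "<request prefix>"; // placeholder, fill in yourself
        List<String> urls = new ArrayList<>();
        urls.add(buildUrl(requestPrefix, List.of("tp53", "foxl2"), "symbol", "All", "0.05"));
        urls.add(buildUrl(requestPrefix, List.of("brca1", "brca2"), "symbol", "Pathway", "0.05"));
        // Dry run: only prints the URLs, so no delay is needed here.
        // Pass DELAY_MILLIS instead of 0 when fetch contacts the server.
        runAll(urls, System.out::println, 0L);
    }
}
```

Keeping the delay as a parameter makes the loop easy to try out without waiting, while DELAY_MILLIS records the value to use for real queries.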

4.2. Example code

Some sample code you can try in R. This code was written in R 2.14.2 on Ubuntu Linux; earlier versions of R might also work. (We confirmed that the R code works on Linux/Mac and the Java code on Linux and Windows.)


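Some Java code you could try.

  import java.io.*; import java.net.*; import java.nio.channels.*;
  try {
    String url=",foxl2&id=symbol&db=All&fdr=0.05";
    URL myURL=new URL(url);
    ReadableByteChannel rbc=Channels.newChannel(myURL.openStream());
    FileOutputStream fos=new FileOutputStream("genesetDB.txt");
    // Copy up to 1 << 24 bytes (16 MB) of the response into the file
    fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    fos.close();
  } catch (MalformedURLException e) {
    e.printStackTrace();
  } catch (IOException e) {
    e.printStackTrace();
  }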

4.3. Supported identifier and database codes

Supported Identifiers
As shown in GeneSetDB | To use in query
Gene ID Human | geneidhs
Gene ID Mouse | geneidmm
Gene ID Rat | geneidrn
Affy HG-U133 Plus2 | affyhgu133p2
Affy HG-U133A 2.0 | affyhgu133a2
Affy HG-U133A | affyhgu133a
Affy HG-U133B | affyhgu133b
Affy HG-U95A | affyhgu95a
Affy HG-U95Av2 | affyhgu95av2
Affy HG Exon 1.0 ST | affyhgexst
Affy HG Gene 1.0 ST | affyhggnst
Affy MOE430A | affymoe430a
Affy MOE430B | affymoe430b
Affy MOE430 2.0 | affymoe4302
Affy MOE430A 2.0 | affymoe430a2
Affy MG-U74Av2 | affymgu74av2
Affy MG-U74Bv2 | affymgu74bv2
Affy MG-U74A | affymgu74a
Affy MG-U74B | affymgu74b
Affy MG Exon 1.0 ST | affymgexst
Affy MG Gene 1.0 ST | affymggnst
Affy RAE230A | affyrae230a
Affy RAE230B | affyrae230b
Affy RAE230 2.0 | affyrae2302
Affy RG-U34A | affyrgu34a
Affy RG-U34B | affyrgu34b
Affy Rat Tox U34 | affyrgtu34
Affy RG Exon 1.0 ST | affyrgexst
Affy RG Gene 1.0 ST | affyrggnst
Agilent Hs WholeGenome | agilwghs
Agilent Mm WholeGenome | agilwgmm
Agilent Rn WholeGenome | agilwgrn
CodeLink Hs WholeGenome | clwghs
CodeLink Mm WholeGenome | clwgmm
CodeLink Rn WholeGenome | clwgrn
Illumina Hs WG6 v1 | illuwg6v1hs
Illumina Hs WG6 v2 | illuwg6v2hs
Illumina Hs HT12 v3 | illuht12v3hs
Illumina Hs HT12 v4 | illuht12v4hs
Illumina Mm WG6 v1 | illuwg6v1mm
Illumina Mm WG6 v2 | illuwg6v2mm
Illumina Rn v1 | illuv1rn
Ensembl Gene Hs | ensemblghs
Ensembl Gene Mm | ensemblgmm
Ensembl Gene Rn | ensemblgrn
UniProt/Swiss-Prot Hs | upsphs
UniProt/Swiss-Prot Mm | upspmm
UniProt/Swiss-Prot Rn | upsprn

Supported Databases
As shown in GeneSetDB | To use in query
Subclass Pathway | Pathway
Subclass Disease/Phenotype | Disease/Phenotype
Subclass Drug/Chemical | Drug/Chemical
Subclass Gene Regulation | GeneRegulation
Subclass Gene Ontology | GO
KEGG Disease | KEGGDisease
MicroCosm Targets | MicroCosmTargets
Rel/NF-kappaB target genes | Rel/NF-kappaB
GO Biological Process | GO_BP
GO Molecular Function | GO_MF
GO Cellular Component | GO_CC