GeneSetDB Help Page

GeneSetDB provides two analysis functions: gene / gene set search and gene enrichment analysis. By using the gene / gene set search function, users can obtain gene sets that include the gene(s) or biological term(s) they are interested in. By using the gene enrichment analysis function, users can upload their own gene list and assess the list's statistical overrepresentation in gene sets. The gene enrichment analysis can be conducted by passing in all values via URL. This makes it possible to use a script to conduct the analysis.

1. Gene set definition and database summary

DB name	Gene set name	Component	Gene set #	Unique Gene # (Entrez ID)	Note
Biocarta	Pathway Name	Genes (or proteins) composing a pathway	313	1383	Source from bioDBnet
EHMN	Pathway Name	Genes (or proteins) composing a pathway	67	2233
HumanCyc	Pathway Name	Genes (or proteins) composing a pathway	299	944	version 16.0
INOH	Pathway Name	Genes (or proteins) composing a pathway	88	1497	Release 4.0
NetPath	Pathway Name	Genes (or proteins) composing a pathway	25	1628
PID	Pathway Name	Genes (or proteins) composing a pathway	222	2552
Reactome	Pathway Name	Genes (or proteins) composing a pathway	1199	5846	Version 40
SMPDB	Pathway Name	Genes (or proteins) composing a pathway	399	860
WikiPathways	Pathway Name	Genes (or proteins) composing a pathway	171	3671	Analysis collection pathways
CancerGenes	Query terms or annotation list	genes with an annotation	27	3161
KEGG Disease	Disease name	Disease or marker genes related to a disease	925	2102
HPO	Phenotype name	Genes representing a phenotype	6327	1854
MethCancerDB	Cancer type and Methylation status	Hypermethylated or hypomethylated cancer related genes	40	812
MethyCancer	Cancer type	Annotated or candidate cancer-related genes	366	1695
MPO	Phenotype name	Genes representing a phenotype	7314	6562
SIDER	MedDRA side effect name (lowest level term)	Proteins targeted by a drug with a side effect	626	3683	No placebo data. Frequency score >= 0.1. Protein-drug interaction data are taken from STITCH3
CTD	Drug Name	Genes or proteins interacting with a drug	5274	14867
DrugBank	Drug name	Drug targets	4171	2042
MATADOR	Drug name	Proteins interacting directly or indirectly with a chemical	723	1845
STITCH	Drug name	Proteins with a known or predicted interaction with a chemical	120275	8619	Protein-chemical interactions with high confidence (>= 0.7) in STITCH3
T3DB	Toxin name	Toxin targets	2716	1081
MicroCosm Targets	miRNA name	Predicted genes targeted by miRNA	851	17183	P-org by RNAhybrid < 0.05
miRTarBase	miRNA name	miRNA target genes with experimental validation	286	1715	Release 2.5
TFactS	Transcriptional Factor (TF)	TF targeting genes	342	2564	TF targeting genes (SignLess) with annotation from PAZAR, NFIregulomeDB, TRED and TRRD
Rel/NF-kappaB target genes	Rel/NF-kappaB	Rel/NF-kappaB targeting genes	1	135	Human genes with checked binding sites or putative target genes
Gene Ontology	Ontology name	Genes with an ontology	12464	17978

2. Gene / Gene Set Search

2.1. Query format

Users can search for a gene or gene set name (= biological term) of interest, such as "FOXL2" (gene name), "lung cancer" (disease name) or "imatinib" (drug name).

2.2. Output format

GeneSetDB shows the following five columns as the result of a gene / gene set search.

Class : Subclass name of gene set (e.g. Drug/Chemical)
Set Name : Name of gene set (e.g. Lipoprotein lipase)
Source DB : Source DB of gene set (e.g. Reactome)
Gene # : The number of genes in gene set (e.g. 15)
Gene Names : Gene names in gene set (e.g. PPARA)

Set Name is linked to the original database. For "Class", GeneSetDB assigns each database to one of five subclasses according to the kind of database, as listed below:
Pathway: Biocarta, EHMN, HumanCyc, INOH, NetPath, PID, Reactome, SMPDB, WikiPathways
Disease/Phenotype: CancerGenes, KEGG Disease, HPO, MethCancerDB, MethyCancer, MPO, SIDER
Drug/Chemical: CTD, DrugBank, MATADOR, STITCH, T3DB
Gene Regulation: MicroCosm Targets, miRTarBase, Rel/NF-kappaB target genes, TFactS
Gene Ontology: Gene Ontology

Users can download the results as a tab-delimited text file by clicking the "Download" button.

2.3. Search settings

Two tick boxes are provided to enable three different settings for the gene/gene set search.

2.3.1 Setting one ("Exact match" tick box is selected)

This setting searches the database for exact matches. For example, if the search term is "Breast cancer", only exact matches are returned.

2.3.2 Setting two ("Simple pattern search" tick box is selected)

If the "Simple pattern search" tick box is selected, the search term is separated into individual words so that "Breast cancer" becomes "Breast" and "cancer". All entries that match the search terms, or the search terms with a prefix or suffix, are returned. In this example, "Breast cancer" could return results such as "Abnormality of the breasts" and "colon cancer". This setting enables users to search for gene symbols of interest together with a drug or GO pathway at the same time.
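One way to read the behaviour described above is as a case-insensitive substring match on each word of the query. The sketch below illustrates that reading only; it is not GeneSetDB's actual search code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; GeneSetDB's real search implementation is not documented here.
final class SimplePatternSearch {

  // Returns every set name that contains at least one word of the query,
  // ignoring case, so "Breast" also matches "breasts".
  static List<String> search(String query, List<String> setNames) {
    String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
    List<String> hits = new ArrayList<>();
    for (String name : setNames) {
      String lower = name.toLowerCase(Locale.ROOT);
      for (String word : words) {
        if (!word.isEmpty() && lower.contains(word)) {
          hits.add(name);
          break; // one matching word is enough
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> names =
        List.of("Abnormality of the breasts", "colon cancer", "Lipoprotein lipase");
    // prints [Abnormality of the breasts, colon cancer]
    System.out.println(search("Breast cancer", names));
  }
}
```

Matching on any single word is what makes the result list broad: a query that mixes a gene symbol and a drug name returns sets that mention either one.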

2.3.3 Setting three ("Simple fuzzy search" tick box is selected)

This setting is very similar to the second setting. The difference is that minor spelling mistakes can be accommodated. If the user mistakenly enters "Breast cacer" (note the missing n) as the search term, all terms with "cancer" will still come up. Similar terms such as "carcinoma", "camera" and "breath" will also be found.

Please note that these settings are not case sensitive.

3. Enrichment Analysis

3.1. Query format

Gene List
Users paste a gene list into the text window or upload a gene list text file directly using the file upload control. GeneSetDB currently supports three higher organisms: human, mouse and rat.

Input ID
GeneSetDB accepts the following identifier types in the gene list.

Gene Symbol
Entrez Gene ID
Ensembl Gene ID
UniProt/Swiss-Prot ID
Affymetrix Probe Set ID (Human: U133 plus2, U133Av2, U133A, U133B, U133+2, U95Av2, Exon 1.0 ST, Gene 1.0 ST, Mouse: MOE430A, MOE430B, MOE430, MOE430Av2, U74Av2, U74Bv2, U74A, U74B, Exon 1.0 ST, Gene 1.0 ST, Rat: RAE230v2, RAE230A, RAE230B, U34A, U34B, Tox U34, Exon 1.0 ST, Gene 1.0 ST)
Agilent Probe Set ID (Human, Mouse and Rat Whole Genome)
CodeLink Probe Set ID (Human, Mouse and Rat Whole Genome)
Illumina Probe Set ID (Human: WG6v1, WG6v2, HT12v3, HT12v4, Mouse: WG6v1, WG6v2, Rat: v1)

Users should select one identifier type from this list.

Choose DB
Users can select the databases to be used in the enrichment analysis. Gene sets with fewer than 10 or more than 500 genes are not used. The reference (background) gene set consists of the gene IDs with at least one annotation in the selected gene set(s) (e.g. Subclass Pathway or DrugBank).
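The two rules above can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not GeneSetDB's own code. It assumes gene sets are held as a map from set name to gene IDs, and it leaves open whether the background is built before or after the size filter, since that is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating the size filter and the background gene set.
final class GeneSetFilter {
  static final int MIN_SIZE = 10;   // sets with fewer genes are not used
  static final int MAX_SIZE = 500;  // sets with more genes are not used

  // Keeps only gene sets with 10 to 500 genes (inclusive).
  static Map<String, Set<String>> usableSets(Map<String, Set<String>> sets) {
    Map<String, Set<String>> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Set<String>> e : sets.entrySet()) {
      int size = e.getValue().size();
      if (size >= MIN_SIZE && size <= MAX_SIZE) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  // Background: every gene ID annotated in at least one of the given gene sets.
  static Set<String> background(Map<String, Set<String>> sets) {
    Set<String> all = new TreeSet<>();
    for (Set<String> genes : sets.values()) {
      all.addAll(genes);
    }
    return all;
  }
}
```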

FDR (False Discovery Rate)
The statistical significance (p-value) of the query gene list's enrichment for each of the gene sets within GeneSetDB is calculated using the hypergeometric distribution. The p-values are adjusted for multiple testing using the Benjamini and Hochberg procedure and reported as the false discovery rate (FDR). The FDR represents the expected proportion of incorrectly rejected null hypotheses. In lay terms, among the gene sets identified as significantly enriched for the query gene list, the FDR can be thought of as the expected proportion of false positives. Users enter in a dialogue box the FDR (proportion of false positives) they are prepared to accept in their results.
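As an illustration of the calculation described above, the following sketch computes a hypergeometric p-value and Benjamini-Hochberg adjusted values. It assumes the enrichment p-value is the upper tail P(X >= overlap), the usual choice for over-representation, which this page does not state explicitly. The class and method names are hypothetical and this is not GeneSetDB's own code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of a hypergeometric test with Benjamini-Hochberg correction.
final class EnrichmentStats {

  // log(n!) for n = 0..max, built by summing logarithms.
  static double[] logFactorials(int max) {
    double[] lf = new double[max + 1];
    for (int i = 1; i <= max; i++) lf[i] = lf[i - 1] + Math.log(i);
    return lf;
  }

  static double logChoose(double[] lf, int n, int k) {
    return lf[n] - lf[k] - lf[n - k];
  }

  // P(X >= overlap) when listSize genes are drawn from a background that
  // contains setSize genes belonging to the gene set.
  static double upperTail(int background, int setSize, int listSize, int overlap) {
    double[] lf = logFactorials(background);
    double denom = logChoose(lf, background, listSize);
    int lo = Math.max(overlap, Math.max(0, listSize - (background - setSize)));
    int hi = Math.min(setSize, listSize);
    double p = 0.0;
    for (int i = lo; i <= hi; i++) {
      p += Math.exp(logChoose(lf, setSize, i)
          + logChoose(lf, background - setSize, listSize - i) - denom);
    }
    return Math.min(1.0, p);
  }

  // Benjamini-Hochberg adjusted p-values, returned in the input order.
  static double[] benjaminiHochberg(double[] p) {
    int m = p.length;
    Integer[] order = new Integer[m];
    for (int i = 0; i < m; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(i -> p[i]));
    double[] adjusted = new double[m];
    double running = 1.0;
    // Walk from the largest p-value down, enforcing monotonicity.
    for (int rank = m; rank >= 1; rank--) {
      int idx = order[rank - 1];
      running = Math.min(running, p[idx] * m / rank);
      adjusted[idx] = running;
    }
    return adjusted;
  }
}
```

With such a helper, a gene set would be reported when its adjusted value is at or below the FDR threshold entered by the user.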

3.2. Output format

The result of the enrichment analysis is shown as a table with the following columns.

Class : Subclass name of gene set
Set Name : Name of gene set
Source DB : Source database of gene set
Gene # : The number of genes in a gene set of GeneSetDB
Gene # with Set: Set Name: The number of genes in a gene set of GeneSetDB and in the user's gene list
Gene # without Set: Set Name: The number of genes "not" in a gene set of GeneSetDB and in the user's gene list
p-value: p-value calculated by hypergeometric distribution
FDR: False discovery rate, i.e. the p-value corrected by the Benjamini and Hochberg method

As in the gene / gene set search, Set Name is hyperlinked to the original database. Users can download the results as a tab-delimited text file by clicking the "Download" button.

4. Calling GeneSetDB enrichment analysis using a script

4.1. Specifying the request URL

Bioinformaticians may wish to access GeneSetDB automatically through a script. If you do this, please leave a 20 second delay between queries to avoid overloading our server. Please note that submitting jobs with a huge gene list and many or all databases may cause the server to stop, and the output may be empty. The heatmap of overlapping proportions of gene sets is not provided if you use GeneSetDB this way. The output is the tab-delimited text file described in 3.2.

The URL for this type of request has the form:

http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=<list of genes>&id=<gene identifier>&db=<database>&fdr=<false discovery rate>

list of genes: A list of gene identifiers separated by commas. For example: foxl1,foxl2,... Please do not query with more than 800 genes.
gene identifier: The type of identifier used in the gene list above. For example: symbol. Section 4.3 contains a list of supported identifiers.
database: This specifies which database(s) to query. Like the list of genes in the first part of the URL, this is given as a comma-separated list. For example: Pathway,STITCH,... For a list of available choices see section 4.3.
false discovery rate: A value between 0 and 1 to filter the output according to the desired FDR control. For example: 0.05.

The finished example request URL would be:

http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05
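The request URL can also be assembled programmatically from its four parts. The following is a minimal sketch with hypothetical class and method names. It enforces the 800-gene limit and the 0 to 1 range for the FDR, and it applies no URL encoding, which is an assumption: this page does not say whether codes such as Disease/Phenotype need to be encoded.

```java
import java.util.List;

// Hypothetical helper that assembles a HyperTestRemote.php request URL.
final class GeneSetDbUrl {
  static final String ENDPOINT = "http://genesetdb.auckland.ac.nz/HyperTestRemote.php";
  static final int MAX_GENES = 800; // limit stated in section 4.1

  static String build(List<String> genes, String idType, List<String> databases, double fdr) {
    if (genes.isEmpty() || genes.size() > MAX_GENES) {
      throw new IllegalArgumentException("gene list must contain 1 to " + MAX_GENES + " genes");
    }
    if (!(fdr >= 0.0 && fdr <= 1.0)) {
      throw new IllegalArgumentException("fdr must be between 0 and 1");
    }
    // Genes and databases are both passed as comma-separated lists.
    return ENDPOINT
        + "?genelist=" + String.join(",", genes)
        + "&id=" + idType
        + "&db=" + String.join(",", databases)
        + "&fdr=" + fdr;
  }

  public static void main(String[] args) {
    // Rebuilds the example request shown above.
    System.out.println(build(List.of("foxl1", "foxl2"), "symbol", List.of("All"), 0.05));
  }
}
```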

4.2. Example code

Some sample code you can try in R. This code was written in R 2.14.2 on Ubuntu Linux. Earlier versions of R might be OK. (We confirmed that the R code works on Linux/Mac and the Java code on Linux and Windows.)

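# Request URL: gene list, identifier type, database(s) and FDR cut-off
query.string <- "http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05"
# Save the tab-delimited result in the current working directory
filedir <- file.path(getwd(), "genesetDB.txt")
# method = "wget" requires wget to be installed on the system
download.file(query.string, filedir, method = "wget", cacheOK = FALSE)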

Some Java code you could try.

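import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class GeneSetDBDownload {
  public static void main(String[] args) {
    String url="http://genesetdb.auckland.ac.nz/HyperTestRemote.php?genelist=foxl1,foxl2&id=symbol&db=All&fdr=0.05";
    // try-with-resources closes both the download channel and the output file
    try (ReadableByteChannel rbc=Channels.newChannel(new URL(url).openStream());
         FileOutputStream fos=new FileOutputStream("genesetDB.txt")) {
      // copy up to 16 MB (1 << 24 bytes) of the response into genesetDB.txt
      fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    } catch (IOException e) {
      // also covers MalformedURLException, which is a subclass of IOException
      e.printStackTrace();
    }
  }
}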
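When several queries are sent in a row, the 20 second pause requested in section 4.1 can be built into the loop. The sketch below is one possible way to do this, using hypothetical names. The download step is passed in as a function so that any code which fetches one URL, such as the code above, can be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper that spaces out consecutive GeneSetDB queries.
final class PoliteBatch {
  static final long DELAY_MILLIS = 20_000L; // 20 second pause between queries

  // Applies fetch to each URL in turn, sleeping before every query except the first.
  static <R> List<R> run(List<String> urls, Function<String, R> fetch, long delayMillis) {
    List<R> results = new ArrayList<>();
    for (int i = 0; i < urls.size(); i++) {
      if (i > 0) {
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // keep the interrupt flag set
          throw new IllegalStateException("batch interrupted", e);
        }
      }
      results.add(fetch.apply(urls.get(i)));
    }
    return results;
  }
}
```

A call such as PoliteBatch.run(urls, myDownload, PoliteBatch.DELAY_MILLIS) would then process the list one query at a time, where myDownload stands for whatever function performs a single request.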

4.3. Supported identifier and database codes

Supported Identifier

As shown in GeneSetDB	To use in query
Symbol	symbol
Gene ID Human	geneidhs
Gene ID Mouse	geneidmm
Gene ID Rat	geneidrn
Affy HG-U133 Plus2	affyhgu133p2
Affy HG-U133A 2.0	affyhgu133a2
Affy HG-U133A	affyhgu133a
Affy HG-U133B	affyhgu133b
Affy HG-U95A	affyhgu95a
Affy HG-U95Av2	affyhgu95av2
Affy HG Exon 1.0 ST	affyhgexst
Affy HG Gene 1.0 ST	affyhggnst
Affy MOE430A	affymoe430a
Affy MOE430B	affymoe430b
Affy MOE430 2.0	affymoe4302
Affy MOE430A 2.0	affymoe430a2
Affy MG-U74Av2	affymgu74av2
Affy MG-U74Bv2	affymgu74bv2
Affy MG-U74A	affymgu74a
Affy MG-U74B	affymgu74b
Affy MG Exon 1.0 ST	affymgexst
Affy MG Gene 1.0 ST	affymggnst
Affy RAE230A	affyrae230a
Affy RAE230B	affyrae230b
Affy RAE230 2.0	affyrae2302
Affy RG-U34A	affyrgu34a
Affy RG-U34B	affyrgu34b
Affy Rat Tox U34	affyrgtu34
Affy RG Exon 1.0 ST	affyrgexst
Affy RG Gene 1.0 ST	affyrggnst
Agilent Hs WholeGenome	agilwghs
Agilent Mm WholeGenome	agilwgmm
Agilent Rn WholeGenome	agilwgrn
CodeLink Hs WholeGenome	clwghs
CodeLink Mm WholeGenome	clwgmm
CodeLink Rn WholeGenome	clwgrn
Illumina Hs WG6 v1	illuwg6v1hs
Illumina Hs WG6 v2	illuwg6v2hs
Illumina Hs HT12 v3	illuht12v3hs
Illumina Hs HT12 v4	illuht12v4hs
Illumina Mm WG6 v1	illuwg6v1mm
Illumina Mm WG6 v2	illuwg6v2mm
Illumina Rn v1	illuv1rn
Ensembl Gene Hs	ensemblghs
Ensembl Gene Mm	ensemblgmm
Ensembl Gene Rn	ensemblgrn
UniProt/Swiss-Prot Hs	upsphs
UniProt/Swiss-Prot Mm	upspmm
UniProt/Swiss-Prot Rn	upsprn

Supported Databases

As shown in GeneSetDB	To use in query
All	All
Subclass Pathway	Pathway
Subclass Disease/Phenotype	Disease/Phenotype
Subclass Drug/Chemical	Drug/Chemical
Subclass Gene Regulation	GeneRegulation
Subclass Gene Ontology	GO
Reactome	Reactome
NetPath	NetPath
HumanCyc	HumanCyc
WikiPathways	WikiPathways
INOH	INOH
EHMN	EHMN
Biocarta	Biocarta
DrugBank	DrugBank
MATADOR	MATADOR
CTD	CTD
STITCH	STITCH
T3DB	T3DB
SMPDB	SMPDB
SIDER	SIDER
HPO	HPO
MPO	MPO
KEGG Disease	KEGGDisease
MethyCancer	MethyCancer
MethCancerDB	MethCancerDB
CancerGenes	CancerGenes
TFactS	TFactS
miRTarBase	miRTarBase
MicroCosm Targets	MicroCosmTargets
Rel/NF-kappaB target genes	Rel/NF-kappaB
GO Biological Process	GO_BP
GO Molecular Function	GO_MF
GO Cellular Component	GO_CC