Method

The genome was scanned in windows of 500 bp (shifted by 250 bp) for each motif (PWM) in several motif sets. Each window was scored for each motif using one of four methods: three HMM-based scores for motif clustering, and a traditional log likelihood ratio (LLR) score that counts strong matches to the motif. A window's score was converted into an empirical "p-value" by comparing it to the scores that the same method assigned to all windows in the genome, or to a subset of them (see Content). For each motif, a set of "target windows" was defined based on a threshold. These target windows were then assigned to the genes whose region they fall within (see Region); these genes are called the motif's "target genes". Finally, for each cluster of genes (positive set), a hypergeometric p-value was calculated for its enrichment in the motif's target genes with respect to the given universe.
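
As an illustration of the final step, the enrichment test can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the interface: it assumes the reported p-value is the one-sided upper tail of the hypergeometric distribution, and the function and argument names are hypothetical (they mirror the #Universe, #Pos-set, #Hits and #Pos-Hits columns described under the results).

```python
from math import comb

def hypergeom_enrichment_pvalue(universe_size, pos_set_size, hits, pos_hits):
    """Probability of seeing at least `pos_hits` target genes in a positive
    set of `pos_set_size` genes drawn from a universe of `universe_size`
    genes that contains `hits` target genes (assumed upper-tail test)."""
    upper = min(pos_set_size, hits)
    total = comb(universe_size, pos_set_size)
    tail = sum(
        comb(hits, k) * comb(universe_size - hits, pos_set_size - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / total

# Toy numbers, not real data: 10 genes, 4 in the positive set,
# 5 target genes, 3 of them in the positive set.
p = hypergeom_enrichment_pvalue(10, 4, 5, 3)
```

Under the same assumption, the test for the genes outside the positive set could reuse this function with the complementary counts.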

Using the Interface and the Results:

Select or enter the positive set(s) and the universe. Configure the options for a chosen set of parameter settings, or choose to run Method Comparisons. Under Reporting, select the maximum p-value that you wish to be printed. Click the Calculate button.

The Calculate button will produce a summary page that shows links to the results as they are calculated. It will also display links to the lists of genes in each set. If you chose to run method comparisons, this will display 48 configurations of options for each positive gene set. Each new result will display when the calculations are complete. It may take a couple of hours to complete the method comparisons.

On the summary page there are links to HTML files and .txt files for each configuration of options. In the HTML pages, summary statistics about the TF set and gene sets are printed at the top. The columns of the results table are:
The first two columns are the name and logo of the motif.
"Array Activity" (the third column) is a classification of the transcription factor corresponding to the motif, if it showed differential expression in microarray experiments.
#Universe is the total number of genes that our analysis considers.
#Pos-set is the size of the positive gene set under consideration.
#Hits is the size of the motif target gene set.
#Pos-Hits is the size (and list) of the intersection of the positive gene set and the motif target set.
Pval is the hypergeometric p-value of the size of the above intersection.
Neg-Pval is the test for enrichment in the genes outside the positive set.

The .txt files contain the same information, but also have four additional columns. Pos_Set is the name of the positive set. Neg_Set is the name of the universe set. TF_set describes the set of motifs used. Method is a concatenation of three of the selected options, in the form [Method]:[Content]:[Region].
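
When post-processing the .txt files, the Method column can be split back into its three options. The following is a small sketch; the function name is hypothetical, and the example value simply combines option labels listed under OPTIONS.

```python
def parse_method_field(value):
    """Split a '[Method]:[Content]:[Region]' string into its three parts."""
    method, content, region = value.split(":")
    return {"method": method, "content": content, "region": region}

config = parse_method_field("swan:gc:territory")
```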

There will be hundreds of result links on the summary page for method comparisons, so additional links were created to access the information in one location. At the end of each positive gene set, there is a link to the concatenation of all .txt files related to that gene set. At the end of all gene sets is a concatenation of all .txt files. This file is a single location which contains all the data from all the runs.

One important point to remember when analyzing the results is the repetition of motifs. The interface operates on each motif set separately, and some motifs are included in multiple sets, so the same motif can appear more than once in the combined results. Be careful to filter out the repeats before calculating q-values.
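
One possible way to do this filtering is sketched below. It is only an illustration under stated assumptions: the rows are taken to be (motif name, p-value) pairs for a single positive set and a single configuration, the first occurrence of a repeated motif is kept, and q-values are computed with the Benjamini-Hochberg procedure, which is a choice made for this example and not something the interface prescribes.

```python
def dedupe_and_bh(rows):
    """rows: iterable of (motif_name, p_value) pairs.
    Keeps the first occurrence of each motif and returns a dict
    mapping motif name to its Benjamini-Hochberg q-value."""
    first_seen = {}
    for motif, p in rows:
        first_seen.setdefault(motif, p)  # later repeats are ignored

    ranked = sorted(first_seen.items(), key=lambda item: item[1])
    m = len(ranked)
    qvalues = {}
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        motif, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        qvalues[motif] = running_min
    return qvalues

# Toy rows, not real results; motif "A" appears in two motif sets.
q = dedupe_and_bh([("A", 0.01), ("B", 0.04), ("A", 0.01), ("C", 0.03)])
```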

If you have questions, suggestions, or notice bugs, please contact Charles Blatti, blatti@illinois.edu.

OPTIONS

Positive Set:

At least one positive set must be selected for the calculation to proceed. The first text box is designed to accept multiple gene sets in the following format:

>SetA
ENSTGUG00000002866
ENSTGUG00000008891
ENSTGUG00000002132
ENSTGUG00000003101
>SetB
ENSTGUG00000003005
ENSTGUG00000009577
ENSTGUG00000011637
ENSTGUG00000000884
>SetC
ENSTGUG00000011080
ENSTGUG00000002930

The set names will be recorded in the final results, so unique and descriptive names are important. If the >set_name line is missing, the gene set will receive a name containing a random number. The positive gene set can also be selected from the Prepared Gene Sets positive set drop down boxes. If used, this selection will supersede the text box above.
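
If you prepare the input programmatically, it can help to check it with a small parser first. The sketch below reads the format shown above; it is not the parser used by the interface, and the "set_<number>" fallback name is a hypothetical stand-in for the random-number name mentioned above.

```python
import random

def parse_gene_sets(text):
    """Parse '>SetName' headers followed by one gene ID per line.
    Returns a dict mapping set name to a list of gene IDs."""
    sets = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            sets.setdefault(current, [])
        else:
            if current is None:
                # Assumed fallback for genes listed before any header.
                current = f"set_{random.randint(0, 999999)}"
                sets[current] = []
            sets[current].append(line)
    return sets

example = ">SetA\nENSTGUG00000002866\nENSTGUG00000008891\n>SetC\nENSTGUG00000011080\n"
gene_sets = parse_gene_sets(example)
```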

Universe:

If no universe set is entered or selected, the set of ~15,000 genes from the full gene model will be used. You can enter your own universe set in the second text box using the same format as above. The universe text box does not process multiple sets.

>Universe
ENSTGUG00000002867
ENSTGUG00000008898
ENSTGUG00000002139
ENSTGUG00000003100

In the Prepared Gene Sets universe set, the "ens" gene set is a set of more than 9,000 genes provided to us by the Jarvis lab; we believe it contains all of the ENSEMBL gene models that are represented on their microarray. The "universe" set is the concatenation of their other set clusters. Again, if used, this drop down list will supersede the default and any entered set.

Motif Set:

Genome-wide scans have been completed on multiple sets of motifs. The available sets are from the JASPAR and TRANSFAC databases.
"all_jaspar" is a set containing 104 motifs
"hstransfac_selected" is a set containing 25 motifs
"hstransfac_selected.2" is a set containing 23 motifs
"fox_p2" only contains the FOX_P2 motif

Method:

There are four available motif scanning methods. Three of them are HMM-based scoring techniques extended from the Stubb method. The basic idea behind Stubb is to score a fixed-length (500 bp) window for the presence of one or more matches to the motif, whether weak or strong. It has been demonstrated that scoring short "regions" rather than individual sites better mirrors the thermodynamic nature of the protein-DNA interaction and adds statistical power. The three HMM-based methods are "stubb_fixed", "swan", and "swan_wt0".

The remaining method is the more traditional site log likelihood ratio method, which finds individual binding sites genome-wide, typically by requiring a very strong match to the motif. This method is labeled "sllr".

Content:

A window score is converted to an empirical p-value for a particular method in one of two ways: by comparing it to the scores of all windows in the genome ("all"), or by comparing it to the scores of windows of similar G/C content ("gc").
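
The difference between the two settings can be sketched as follows. This is an illustration only: it assumes that higher scores are stronger, that the empirical p-value is the fraction of background windows scoring at least as high, and that "similar G/C content" means falling into the same equal-width G/C bin. The bin count and all names are hypothetical.

```python
from bisect import bisect_left

def empirical_pvalues(scores, gc_fractions=None, n_bins=10):
    """Empirical p-value per window: the fraction of background windows
    with a score >= the window's score. Without gc_fractions the
    background is all windows ("all"); with gc_fractions it is the
    windows in the same G/C bin ("gc")."""
    if gc_fractions is None:
        groups = [0] * len(scores)
    else:
        groups = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]

    background = {}
    for score, group in zip(scores, groups):
        background.setdefault(group, []).append(score)
    for group_scores in background.values():
        group_scores.sort()

    pvalues = []
    for score, group in zip(scores, groups):
        group_scores = background[group]
        n_at_least = len(group_scores) - bisect_left(group_scores, score)
        pvalues.append(n_at_least / len(group_scores))
    return pvalues

# Toy scores, not real data.
p_all = empirical_pvalues([1, 2, 3, 4])
p_gc = empirical_pvalues([1, 2, 3, 4], gc_fractions=[0.1, 0.1, 0.9, 0.9], n_bins=2)
```

In the toy example the window with score 2 ranks poorly against all windows but is the best window within its own G/C bin, so its p-value improves under "gc".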

Region:

There are several options to map windows to their neighboring genes.
1) All windows are mapped to the nearest gene start site ("nearest").
2) After mapping windows to the nearest gene start site, the window must fall within the 5 Kbp upstream of the start site or be discarded ("5up").
3) After mapping windows to the nearest gene start site, the window must fall within the 5 Kbp upstream or 2 Kbp downstream of the start site or be discarded ("5up2down").
4) All windows are assigned to the appropriate gene territories ("territory"). A gene's territory includes (i) the 5 Kbp upstream region, (ii) the entire gene itself, (iii) the region upstream of the gene, up to half the distance to the next gene, and (iv) the region downstream of the gene, up to half the distance to the next gene. Note that the territories of neighboring genes may overlap if their 5 Kbp upstream regions overlap.
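
The first three options can be sketched as below. This is a simplified illustration, not the mapping code used by the interface: it assumes a single chromosome, represents each window by its midpoint, and treats "upstream" relative to the gene's strand. The function name and the (gene_id, start_site, strand) tuples are hypothetical, and the "territory" option is not covered.

```python
def assign_window(window_mid, genes, region="nearest"):
    """Map a window midpoint to a gene ID, or None if discarded.
    genes: list of (gene_id, start_site, strand), strand is +1 or -1."""
    gene_id, start_site, strand = min(
        genes, key=lambda gene: abs(window_mid - gene[1])
    )
    # Negative offset = upstream of the start site on the gene's strand.
    offset = (window_mid - start_site) * strand
    if region == "nearest":
        return gene_id
    if region == "5up":
        return gene_id if -5000 <= offset <= 0 else None
    if region == "5up2down":
        return gene_id if -5000 <= offset <= 2000 else None
    raise ValueError(f"unsupported region option: {region}")

# Toy coordinates, not real annotations.
genes = [("g1", 10000, 1), ("g2", 30000, -1)]
upstream_hit = assign_window(8000, genes, "5up")
downstream_hit = assign_window(11000, genes, "5up")
```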

Threshold:

The top-scoring windows and their respective genes are collected until X distinct genes have been designated as the motif's target genes. The value of X is calculated from the top 1% of all windows.
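
The collection step can be sketched as follows. This is a minimal sketch under stated assumptions: windows are ranked by empirical p-value (smallest first), windows that were discarded during region mapping carry no gene, and X is passed in directly as max_genes because its exact derivation from the top 1% of windows is not reproduced here. All names are hypothetical.

```python
def select_target_genes(window_hits, max_genes):
    """window_hits: list of (p_value, gene_id) pairs, one per window.
    Returns up to max_genes distinct gene IDs, taken from the
    best-scoring windows first."""
    targets = []
    seen = set()
    for p_value, gene_id in sorted(window_hits, key=lambda hit: hit[0]):
        if len(targets) >= max_genes:
            break
        if gene_id is None or gene_id in seen:
            continue  # skip discarded windows and already collected genes
        seen.add(gene_id)
        targets.append(gene_id)
    return targets

# Toy windows, not real data.
hits = [(0.05, "g3"), (0.001, "g1"), (0.002, "g1"), (0.01, "g2"), (0.2, "g4")]
target_genes = select_target_genes(hits, max_genes=2)
```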