January 26, 2011

How to assess the statistical significance of the overlap between two sets of genomic regions


To assess the statistical significance of the overlap between two sets of genomic regions, a number of different statistical methods could be used.  One could use a chi-square test, comparing the total size of the overlap between the two sets of genomic regions to the size of the overlap that would be expected at random, based on the total size of the genome and the total size of the genomic regions in each set.  When using this method, it is probably best to assume a total genome size that is smaller than the actual genome size, since some regions of the genome are unmappable.  For mouse and human, we assume that the total size of the genome is 2x10^9 bp.  This method accurately models the expected overlap between two datasets if they were truly randomly associated.  Unfortunately, because the genome is so large, this method will call the overlap between two datasets statistically significant even if it is not much larger than would be expected at random.  One way to improve this method is to consider the genome in bins of 100-500 bp, rather than one base at a time.  This reduces the overall n in the chi-square calculation and is a reasonable adjustment, since the resolution of ChIP-Seq is in the range of 100-500 bp for most datasets.
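As a rough illustration of the chi-square approach, the sketch below builds the 2x2 table of genome units inside or outside each set and computes the statistic with 1 degree of freedom.  The function name, the absence of a continuity correction, and the example region sizes are assumptions made for this sketch; only the 2x10^9 bp genome size and the 500 bp bin size come from the text.

```python
import math

def chi_square_overlap(n_units, size_a, size_b, overlap):
    """Chi-square test (1 degree of freedom, no continuity correction).

    All arguments are counts of genome units (single bases or bins):
    n_units is the assumed genome size, size_a and size_b are the units
    covered by each set, and overlap is the units covered by both.
    Returns (expected_overlap, chi_square, p_value).
    """
    both = overlap
    a_only = size_a - overlap
    b_only = size_b - overlap
    neither = n_units - size_a - size_b + overlap
    # Overlap expected if the two sets were placed independently.
    expected = size_a * size_b / n_units
    numerator = n_units * (both * neither - a_only * b_only) ** 2
    denominator = size_a * (n_units - size_a) * size_b * (n_units - size_b)
    chi_square = numerator / denominator
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return expected, chi_square, p_value

# Hypothetical example: two sets of 10 Mb each that share 60 kb.
per_base = chi_square_overlap(2_000_000_000, 10_000_000, 10_000_000, 60_000)
# The same data counted in 500 bp bins (every count divided by 500).
per_bin = chi_square_overlap(4_000_000, 20_000, 20_000, 120)
print("per base (expected, chi-square, p):", per_base)
print("per bin  (expected, chi-square, p):", per_bin)
```

With these made-up numbers the overlap is only 1.2 times the expected value, yet the per-base statistic is 500 times larger than the per-bin statistic, which is the effect that binning is meant to temper.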
Another way to assess the statistical significance of the overlap between two sets of genomic regions is to use a Monte Carlo method.  In these methods, thousands or millions of simulated datasets are created and the overlaps between the simulated datasets are tabulated.  The statistical significance of the actual overlap is then estimated based on the frequency of an overlap at least that large occurring in the simulated datasets.  We used a Monte Carlo method to assess the statistical significance of the overlap between two sets of genomic regions in this manuscript.  To create the simulated datasets, we shifted each of the genomic regions in one of the datasets a random distance between -2,000 bp and +2,000 bp from its original position.
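The shifting procedure can be sketched as follows.  This is a simplified illustration, not the code used for the manuscript: it assumes regions on a single chromosome stored as half-open (start, end) tuples, sums the overlap over all pairs of regions, and estimates the p-value as (hits + 1) / (simulations + 1).  The function names, the default of 1,000 simulations, and the example regions are all hypothetical.

```python
import random

def overlap_bp(regions_a, regions_b):
    """Sum of pairwise overlaps (in bp) between two lists of (start, end)."""
    total = 0
    for a_start, a_end in regions_a:
        for b_start, b_end in regions_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total

def shift_regions(regions, max_shift, rng):
    """Move every region by its own random offset in [-max_shift, +max_shift]."""
    shifted = []
    for start, end in regions:
        offset = rng.randint(-max_shift, max_shift)
        shifted.append((start + offset, end + offset))
    return shifted

def monte_carlo_p(regions_a, regions_b, n_sim=1000, max_shift=2000, seed=0):
    """Return (observed_overlap, estimated_p_value)."""
    rng = random.Random(seed)
    observed = overlap_bp(regions_a, regions_b)
    hits = 0
    for _ in range(n_sim):
        simulated = overlap_bp(shift_regions(regions_a, max_shift, rng), regions_b)
        if simulated >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate from ever being exactly zero.
    return observed, (hits + 1) / (n_sim + 1)

# Hypothetical data: twenty 200 bp regions that coincide exactly in both sets.
regions_a = [(10_000 * k, 10_000 * k + 200) for k in range(1, 21)]
regions_b = list(regions_a)
observed, p_value = monte_carlo_p(regions_a, regions_b)
print("observed overlap (bp):", observed, "estimated p:", p_value)
```

The pairwise overlap count is quadratic in the number of regions, so a real analysis would need a sorted sweep or an interval index; the smallest p-value this estimate can report is 1 / (simulations + 1).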
We noted that, if two sets of genomic regions tend to occur in the same place, then their overlap will be statistically significant.  This will be true whether the regions occur together 10% of the time or 100% of the time.  Thus, the test for statistical significance is not very good at describing the degree of overlap between two datasets.  Consequently, in each instance that we assessed the statistical significance of the overlap between two sets of genomic regions, we also calculated the fold enrichment of the overlap between the two datasets.  This was calculated as the ratio of the total size of the actual overlap between the two sets of regions to the overlap between the two datasets that would be expected at random.  In the manuscript, when assessing the overlap between two sets of genomic regions, we reported both the p-value, as calculated using the Monte Carlo method described above, and the fold enrichment between the two datasets.
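A minimal sketch of the fold enrichment calculation, assuming the expected overlap is taken as (size of set A x size of set B) / genome size, as in the chi-square discussion.  The function name and the example sizes are hypothetical.

```python
def fold_enrichment(observed_overlap, size_a, size_b, genome_size=2_000_000_000):
    """Observed overlap divided by the overlap expected at random (all in bp)."""
    expected_overlap = size_a * size_b / genome_size
    return observed_overlap / expected_overlap

# Two hypothetical pairs of 10 Mb datasets.  Both overlaps could be called
# significant, but they describe very different degrees of co-occurrence.
weak = fold_enrichment(60_000, 10_000_000, 10_000_000)
strong = fold_enrichment(5_000_000, 10_000_000, 10_000_000)
print("weak overlap fold enrichment:  ", weak)
print("strong overlap fold enrichment:", strong)
```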


January 23, 2011

randomness



Data Pre-processing

Introduction

Image Processing

Image Quantification

MAS 5.0

Image → Gene Transformation

MAS 5.0
Affymetrix Statistical Algorithms Reference Guide
https://www.affymetrix.com/support/technical/technotes/statistical_reference_guide.pdf

Normalization / Scaling

Linear Normalization

Quantile Normalization

Log Ratio Normalization

Annotation

Introduction

In order to effectively analyze microarray data, it is critical for investigators to have access to complete and up-to-date annotation of the genes on the array.  At the Microarray Resource we get our annotation information from two primary sources, though there are a few others that are worth mentioning.

NetAffx

Affymetrix maintains the NetAffx [Link] database containing information about the genes that are represented on their GeneChip microarrays.  This is the best first source of information about Affymetrix probe sets because each probe set has a unique page in the NetAffx database containing a broad range of information, including gene and probe sequences, links to other databases, and functional descriptions of the genes.

Incyte Proteome Database

The Incyte Proteome BioKnowledge Library [Link] is now available for access by all current Boston University and Boston University Medical Center faculty, staff, and students.  This is an excellent database for finding information about genes from microarray experiments.  It is well curated and provides PubMed links for all references.  This database is indexed by gene symbol.

Other Databases (NCBI etc.)

There are a number of other databases that can provide valuable information about genes from microarray experiments:
- GenBank
- SGD (yeast)
- Gene Ontology

Identifying Differentially Expressed Genes

Introduction

With microarray data, biology researchers want to identify genes differentially expressed under different growth conditions or different treatments, to cluster genes according to their expression pattern, and to differentiate samples in pharmaceutical or clinical studies.

Fold Change

The most straightforward method of identifying differentially regulated genes in a microarray experiment is by fold change.  Fold change is the multiple by which the expression of a gene changed between two experimental groups.

Fold change can be reported using various scales that each convey the same information.  For example, a four-fold decrease and a four-fold increase can be written as:
Ratio: ¼, 4
Linear: -4, 4
Log base 2: -2, 2
Log base 10: -0.6, 0.6
Fold change is usually calculated using the mean of a set of measurements within an experimental group, but it can also be calculated using the geometric mean, particularly if the original measurements were not converted to logarithmic scale.
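The conversions between these scales can be sketched in a few lines.  The function name and the returned dictionary layout are assumptions made for this illustration.

```python
import math

def fold_change_scales(ratio):
    """Express one expression ratio (group B / group A) on each scale."""
    # On the signed linear scale a decrease is reported as a negative number.
    linear = ratio if ratio >= 1 else -1.0 / ratio
    return {
        "ratio": ratio,
        "linear": linear,
        "log2": math.log2(ratio),
        "log10": math.log10(ratio),
    }

for ratio in (0.25, 4.0):
    print(fold_change_scales(ratio))
```

The logarithmic scales are symmetric around zero, which is why a four-fold decrease and a four-fold increase differ only in sign there.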

While Fold Change is an important descriptor of the behavior of a gene's expression between two experimental groups, it does not tell the whole story.  For example, take the expression of one gene measured 4 times in each of two experimental groups.

Group A:         100, 200, 200, 300                  Mean = 200
Group B:         100, 100, 200, 2800                Mean = 800

Fold Change = 4

According to Fold Change, this is a differentially regulated gene, but we can see that the difference is driven by a single measurement in Group B (2800), so this gene is not likely a good candidate for further investigation.  Consequently, Fold Change should not be used as a first pass method for identifying differentially expressed genes.
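To see how much the result depends on that one measurement, the sketch below recomputes the fold change for Group A and Group B with three different summaries.  The helper name is hypothetical; the median and geometric mean are shown only for comparison.

```python
import statistics

group_a = [100, 200, 200, 300]
group_b = [100, 100, 200, 2800]

def fold_change(group_b, group_a, summary=statistics.mean):
    """Ratio of the two group summaries (B over A)."""
    return summary(group_b) / summary(group_a)

mean_fc = fold_change(group_b, group_a)                              # 4.0
median_fc = fold_change(group_b, group_a, statistics.median)         # 0.75
geometric_fc = fold_change(group_b, group_a, statistics.geometric_mean)

print("mean-based fold change:     ", mean_fc)
print("median-based fold change:   ", median_fc)
print("geometric-mean fold change: ", geometric_fc)
```

The mean-based value of 4 disappears once the summary is less sensitive to a single extreme value, which is the weakness of fold change described above.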

Statistical Significance

A better method for identifying differentially regulated genes is provided by statistics.  Analysis of Variance (ANOVA) is a technique that assesses whether a set of measurements from two or more experimental groups indicates, given observed variance, that the groups are different.  For microarrays the measurements are the expression levels of one gene and the groups correspond to the experimental sample groups.  ANOVA is used to identify genes that are differentially expressed in a manner that is reproducible across multiple measurements within each experimental group.

An ANOVA score is calculated by comparing the variance observed between the sample group means to the variance observed within the groups.  If the between group variance is high relative to the within group variance this indicates differential expression.  The result of an ANOVA is a probability, p, that an observed difference between groups could have been produced by chance if the groups were in fact the same.
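The comparison of between-group and within-group variance can be sketched as the F statistic of a one-way ANOVA.  This illustration stops at the F statistic: converting it to a p-value requires the F distribution with the returned degrees of freedom, which is not in the Python standard library.  The function name and the second example dataset are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the measurements around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# The noisy gene from the fold change example: F is below 1.
noisy = one_way_anova_f([[100, 200, 200, 300], [100, 100, 200, 2800]])
# A hypothetical gene with the same four-fold change but tight replicates.
tight = one_way_anova_f([[100, 110, 90, 100], [400, 410, 390, 400]])
print("noisy gene (F, df1, df2):", noisy)
print("tight gene (F, df1, df2):", tight)
```

Both genes have a fold change of 4, but only the second has a between-group variance that is large relative to the within-group variance.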

Following the use of ANOVA to calculate a p-value for each gene, it is useful to choose a p-value cut-off, below which genes will be considered differentially expressed and above which they will not.  This cut-off will be arbitrary, but its choice should be made with an understanding of the trade-offs between sensitivity and selectivity that are inherent to choosing a significance cut-off.  In general, choosing a lower significance cut-off will result in fewer genes being identified as differentially expressed, but a smaller portion of those that are selected will be false-positives.  Choosing a higher significance cut-off will result in more genes being identified as differentially expressed, but a greater portion of those will be false-positives.  At any significance cut-off it is possible to estimate the associated false-positive and false-negative rates.  This allows an informed choice of the significance cut-off.

ANOVA can take a few different forms depending on the experimental design.  The most basic type of ANOVA is a one-way ANOVA.  In a one-way ANOVA, the sample groups are stratified along a single experimental variable.  The simplest one-way ANOVA, with two sample groups, is equivalent to a t-test.  When more than two groups are compared, a low p-value indicates that at least one of the groups differs from the others, but it does not identify which one.

Two-way ANOVA differs from one-way ANOVA in that it generates a p-value for each of the two primary experimental axes, as well as a p-value for the interaction between the two axes.

Multiple Hypothesis Testing

Correction of significance results for multiple hypothesis testing is an important concern in microarray data analysis.  It is common to use a p-value cut-off of 0.05.  In a microarray experiment in which 20,000 genes are measured, even if no genes are truly differentially expressed, 1,000 genes can be expected to meet the p < 0.05 significance cut-off by chance alone.  Furthermore, in the same 20,000 gene experiment with no changed genes, one unchanged gene would be expected to have a p-value as low as 0.00005.

A statistical test, like ANOVA, applied to microarray data tells you the probability that the observations made about a single gene could have been made if the null hypothesis, that the gene is not differentially expressed, were true.  When applied to normally distributed random data, p-values will be evenly distributed between 0 and 1.  Thus, when looking at a single gene, a very low p-value is a significant finding, but as you increase the number of genes observed, the chance of finding a single very low p-value increases.

Take a fictitious microarray data set with 20,000 genes, none of which are differentially expressed between the experimental groups.  We will use a p-value cut-off of 0.05 to identify differentially regulated genes.  If we look at any one gene from our fictitious data set, which we know is not differentially expressed, there is a 1 in 20 chance of it having a p-value less than 0.05.  Our gene-wise false-positive rate, at this level of sensitivity, is 5%.  So, if we were to use a microarray to observe the expression of a single gene, we could use a p-value cut-off of 0.05 and control false positives at a rate of 5%.

If we use a statistical test and a p-value cut-off of p < 0.05 to identify differentially expressed genes from our fictitious microarray experiment, our gene-wise false positive rate is still 5%.  Five percent of 20,000 genes is 1,000 genes that were not actually differentially expressed but would be identified as significant at this level of sensitivity.  Furthermore, it would be expected that one unchanged gene would have a p-value as low as 0.00005.  Testing as many hypotheses as there are genes on a microarray gives plenty of chances to make a mistake.
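The arithmetic behind the fictitious experiment can be checked with a small simulation.  The sketch draws one uniformly distributed p-value per unchanged gene; the function names and the random seed are assumptions made for this illustration.

```python
import random

def expected_false_positives(n_genes, p_cutoff):
    """Unchanged genes expected to fall below the cut-off by chance alone."""
    return n_genes * p_cutoff

def simulate_null_experiment(n_genes, p_cutoff, seed=0):
    """Simulate an experiment in which no gene is differentially expressed.

    Returns (number of genes with p < p_cutoff, smallest p-value observed).
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]
    n_below = sum(1 for p in p_values if p < p_cutoff)
    return n_below, min(p_values)

print("expected below 0.05:   ", expected_false_positives(20_000, 0.05))
print("expected below 0.00005:", expected_false_positives(20_000, 0.00005))
n_below, smallest = simulate_null_experiment(20_000, 0.05)
print("simulated below 0.05:  ", n_below, "smallest p:", smallest)
```

The simulated count will vary from seed to seed but stays close to 1,000, and the smallest p-value is typically on the order of 1 / 20,000.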

There are a few different methods for dealing with multiple hypothesis testing in significance analysis of microarray data.  The Bonferroni correction multiplies the p-value observed for each hypothesis by the number of hypotheses being tested.  The Bonferroni correction is usually overly stringent for microarray data analysis.  If we use a Bonferroni corrected p-value cut-off of 0.05 on a real microarray data set, no matter how many genes meet the significance cut-off, there will be at most a 5% chance that even a single false-positive is among them.  If we identify 100 genes that are differentially expressed in an experiment, we would likely be willing to accept a few false-positives among the 100.  The Bonferroni criterion, that there is only a 5% chance that a single false-positive is among the 100, is more control of false-positives than is usually necessary.  Increasing selectivity using the Bonferroni correction reduces sensitivity, so fewer differentially regulated genes will be identified.
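A minimal sketch of the Bonferroni correction, written both as an adjustment of the p-values and as the equivalent adjustment of the cut-off.  The function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

def bonferroni_cutoff(alpha, n_tests):
    """Raw p-value a gene must beat for its corrected p-value to be below alpha."""
    return alpha / n_tests

print(bonferroni_adjust([0.00001, 0.01, 0.5]))
print("raw cut-off for 20,000 genes:", bonferroni_cutoff(0.05, 20_000))
```

For 20,000 genes, a corrected cut-off of 0.05 corresponds to a raw p-value of 0.0000025, which shows how much sensitivity is given up.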

Another method for treating the multiple hypothesis problem makes more sense for microarray experiments.  The False Discovery Rate (FDR) correction of Benjamini and Hochberg estimates the proportion of false-positives among the genes that meet a significance cut-off.  The FDR is the quotient of the number of unchanged genes expected at a given significance cut-off over the number of genes detected at that significance cut-off.

The assumption that unchanged genes would have p-values evenly distributed between 0 and 1 can be used to estimate the number of false-positives expected at a given significance cut-off.  The number of false-positives expected at a given significance cut-off will be equal to the number of unchanged genes (or the number of genes on the microarray) times the p-value of the significance cut-off.
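The quotient described above, and the Benjamini and Hochberg adjusted p-values that follow from it, can be sketched as below.  The sketch uses the total number of genes in place of the unknown number of unchanged genes, which makes the estimate conservative; the function names and example p-values are hypothetical.

```python
def estimated_fdr(p_values, p_cutoff):
    """Expected false positives divided by genes detected at the cut-off."""
    n_genes = len(p_values)
    n_detected = sum(1 for p in p_values if p <= p_cutoff)
    if n_detected == 0:
        return 0.0
    return min(1.0, n_genes * p_cutoff / n_detected)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n_genes = len(p_values)
    order = sorted(range(n_genes), key=lambda i: p_values[i])
    adjusted = [0.0] * n_genes
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n_genes, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * n_genes / rank)
        adjusted[gene] = running_min
    return adjusted

# Hypothetical experiment: 2,000 of 20,000 genes have p = 0.01.
p_values = [0.01] * 2_000 + [0.9] * 18_000
print("estimated FDR at p <= 0.05:", estimated_fdr(p_values, 0.05))
print(benjamini_hochberg([0.9, 0.001, 0.5]))
```

In the hypothetical experiment, 1,000 false-positives are expected among 2,000 detected genes, so the estimated FDR is 50%.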

Estimating the number of changed and unchanged genes

Based on two assumptions, it is possible to estimate the number of changed and unchanged genes in a microarray data set.  The first assumption is that unchanged genes will have p-values evenly distributed between 0 and 1.  The second assumption is that changed genes will not have p-values greater than a certain p-value threshold.

If there are no changed genes with p greater than the threshold, then all of the genes with p greater than the threshold are unchanged.  If the unchanged genes have evenly distributed p-values, then the density of unchanged genes above the threshold will be the same as the density of unchanged genes below the threshold.  So, we calculate the density of unchanged genes above the threshold, and integrate this constant density from p equals 0 to 1.  The result is the estimated number of unchanged genes, and the remaining genes are estimated to be changed.
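This estimate can be sketched as follows.  The threshold of 0.5, the function name, and the synthetic p-values are assumptions made for the illustration.

```python
def estimate_unchanged(p_values, threshold=0.5):
    """Return (estimated unchanged genes, estimated changed genes)."""
    n_genes = len(p_values)
    n_above = sum(1 for p in p_values if p > threshold)
    # Genes per unit of p above the threshold (assumed to be all unchanged).
    density = n_above / (1.0 - threshold)
    # Integrating a constant density from p = 0 to p = 1 multiplies it by 1.
    n_unchanged = min(float(n_genes), density * 1.0)
    return n_unchanged, n_genes - n_unchanged

# Synthetic data: 8,000 unchanged genes with evenly spaced p-values and
# 2,000 changed genes that all have p = 0.001.
unchanged = [(i + 0.5) / 8_000 for i in range(8_000)]
changed = [0.001] * 2_000
print(estimate_unchanged(unchanged + changed))
```

The estimate is only as good as the second assumption: if some changed genes have p-values above the threshold, the number of unchanged genes is overestimated.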

Advanced Analysis Techniques

Principal Components Analysis

Technique

Principal Components Analysis is a mathematical transformation that can be applied to microarray data sets, allowing data compression and dimensionality reduction; it can be used for feature extraction and data classification.  The primary objective is to transform the data into a new space where data analysis, and in particular discrimination between classes, is easier.  According to this technique, the first principal component of a data set is the direction along which there is the largest variance over all samples.  The rationale is that the direction of maximum variation is the most likely to contain information about class discrimination.

To classify a test vector, the Euclidean distance between the test vector and each of the given data sets is calculated, and the test vector is assigned to the set at the minimum distance.
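As an illustration of "the direction along which there is largest variance", the sketch below finds the first principal component of a small data matrix by power iteration on its covariance matrix.  Power iteration, the function name, the iteration count, and the toy data are choices made for this sketch; a real analysis would normally use a linear algebra library.

```python
import math

def first_principal_component(rows, n_iter=200):
    """rows: list of samples, each a list of feature values.

    Returns (unit direction vector, variance along that direction).
    Assumes the data are not constant and that the start vector is not
    orthogonal to the first principal component.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, variance

# Toy data: four samples that lie exactly on the line y = 2x.
direction, variance = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print("first principal component:", direction, "variance:", variance)
```

For the toy data all of the variance lies along the direction (1, 2), so the first principal component alone describes the samples completely.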

Looking at samples

Looking at genes

Clustering

Hierarchical Clustering

Gene Clustering

Sample Clustering

K-Means Clustering

Biological Data Mining and Pathway Analysis

EASE

GenMapp and MappFinder

Visualization

Introduction

Visualizations are often associated with the presentation of microarray data.

Heat Map

The most common of these visualizations is the heat map.

Volcano Plot

In a Volcano Plot, the fold change and significance for each gene are displayed as a scatter plot.  Both fold change and significance are generally plotted in log scale.  The spots take a characteristic volcano form because absolute fold change is correlated with significance.
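The coordinates of a volcano plot can be computed as below.  The sketch assumes log base 2 for fold change and negative log base 10 for the p-value, which is a common convention but not one stated in the text; the function name, the default cut-offs, and the example genes are hypothetical, and the plotting itself is left to whatever charting tool is at hand.

```python
import math

def volcano_points(genes, fc_cutoff=2.0, p_cutoff=0.05):
    """genes: iterable of (name, expression ratio, p-value).

    Returns a list of (name, x, y, passes), where x is the log2 fold change,
    y is -log10(p), and passes is True when both cut-offs are met.
    """
    points = []
    for name, ratio, p_value in genes:
        x = math.log2(ratio)
        y = -math.log10(p_value)
        passes = abs(x) >= math.log2(fc_cutoff) and p_value < p_cutoff
        points.append((name, x, y, passes))
    return points

example = [("gene1", 4.0, 0.001), ("gene2", 1.1, 0.5), ("gene3", 0.25, 0.01)]
for point in volcano_points(example):
    print(point)
```

Genes that pass both cut-offs fall in the upper left and upper right corners of the plot.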

Volcano plots can be used to demonstrate fold change and significance cut-offs.
Picture here
Volcano plots are also an excellent way to visualize the changes that occur in a group of genes.
Picture here

Talk about making volcano plots comparing more than two groups?

Pathway Visualization

GenMapp


Other Crap

 

For Microarray Resource customers, ANOVA is implemented within Microsoft Excel.

Assumptions/Limitations of ANOVA

 
