MAS 5.0

Image → Gene Transformation

MAS 5.0

Affymetrix Statistical Algorithms Reference Guide

https://www.affymetrix.com/support/technical/technotes/statistical_reference_guide.pdf

In order to effectively analyze microarray data, it is critical for investigators to have access to complete and up-to-date annotation of the genes on the array. At the Microarray Resource we get our annotation information from two primary sources, though there are a few others that are worth mentioning.

NetAffx

Affymetrix maintains the NetAffx [Link] database, which contains information about the genes represented on their GeneChip microarrays. This is the best first source of information about Affymetrix probe sets because each probe set has a unique page in the NetAffx database containing a broad range of information, including gene and probe sequences, links to other databases, and functional descriptions of the genes.

Incyte Proteome Database

The Incyte Proteome BioKnowledge Library [Link] is now available for access by all current Boston University and Boston University Medical Center faculty, staff, and students. This is an excellent database for finding information about genes from microarray experiments. It is well curated and provides PubMed links for all references. This database is indexed by gene symbol.

Other Databases (NCBI etc.)

There are a number of other databases that can provide valuable information about genes from microarray experiments:

- GenBank

- SGD (yeast)

- Gene Ontology

Identifying Differentially Expressed Genes

Introduction

With microarray data, biology researchers want to identify genes differentially expressed under different growth conditions or different treatments, to cluster genes according to their expression pattern, and to differentiate samples in pharmaceutical or clinical studies.

Fold Change

The most straightforward method of identifying differentially regulated genes in a microarray experiment is by fold change. Fold change is the multiple by which the expression of a gene changed between two experimental groups.

Fold change can be reported using various scales that each convey the same information. For example, a four-fold decrease and a four-fold increase are written as:

Ratio: ¼, 4

Linear: -4, 4

Log base 2: -2, 2

Log base 10: -0.6, 0.6

Fold Change is usually calculated using the mean of a set of measurements within an experimental group, but it can also be calculated using the geometric mean, particularly if the original measurements were not converted to logarithmic scale.

While Fold Change is an important descriptor of the behavior of a gene's expression between two experimental groups, it does not tell the whole story. For example, take the expression of one gene measured 4 times in each of two experimental groups.

Group A: 100, 200, 200, 300 (Mean = 200)

Group B: 100, 100, 200, 2800 (Mean = 800)

Fold Change (B over A) = 800 / 200 = 4

According to Fold Change this is a differentially regulated gene, yet the difference between the group means comes from a single measurement (2800) in Group B, so this is not likely a good candidate for further investigation. Consequently, Fold Change should not be used as a first-pass method for identifying differentially expressed genes.
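The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names `fold_change` and `signed_linear` are made up for this example, and the convention of writing ratios below 1 as negative numbers on the linear scale is inferred from the scale list above.

```python
import math
from statistics import mean, geometric_mean

# The example measurements from the text.
group_a = [100, 200, 200, 300]
group_b = [100, 100, 200, 2800]

def fold_change(treated, control, summary=mean):
    """Ratio of the summarized expression in `treated` over `control`."""
    return summary(treated) / summary(control)

def signed_linear(ratio):
    """Signed linear scale: a ratio of 0.25 becomes -4, a ratio of 4 stays 4."""
    return ratio if ratio >= 1 else -1 / ratio

ratio = fold_change(group_b, group_a)          # 800 / 200
print("ratio:", ratio)
print("linear:", signed_linear(ratio), "and", signed_linear(1 / ratio))
print("log2:", math.log2(ratio), "log10:", round(math.log10(ratio), 3))

# The same comparison using geometric means, which are less affected
# by the single large measurement in Group B.
geo_ratio = fold_change(group_b, group_a, summary=geometric_mean)
print("geometric-mean ratio:", round(geo_ratio, 2))
```

With the geometric mean the ratio drops from 4 to roughly 1.5, which matches the intuition that one outlying measurement is responsible for most of the apparent change.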

Statistical Significance

A better method for identifying differentially regulated genes is provided by statistics. Analysis of Variance (ANOVA) is a technique that assesses whether a set of measurements from two or more experimental groups indicates, given observed variance, that the groups are different. For microarrays the measurements are the expression levels of one gene and the groups correspond to the experimental sample groups. ANOVA is used to identify genes that are differentially expressed in a manner that is reproducible across multiple measurements within each experimental group.

An ANOVA score is calculated by comparing the variance observed between the sample group means to the variance observed within the groups. If the between-group variance is high relative to the within-group variance, this indicates differential expression. The result of an ANOVA is a probability, p, that an observed difference between groups could have been produced by chance if the groups were in fact the same.
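As a rough sketch of the comparison described above, the following computes the one-way ANOVA score (the F statistic) as between-group variance divided by within-group variance. The function name `one_way_anova_f` and the third group, `group_c`, are invented for illustration. Converting the score into a p-value requires the F distribution, which is left out here to keep the example dependency-free.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of measurements

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each measurement around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

group_a = [100, 200, 200, 300]
group_b = [100, 100, 200, 2800]   # mean 800, driven by one measurement
group_c = [700, 800, 800, 900]    # hypothetical: mean 800, tightly grouped

f_noisy = one_way_anova_f(group_a, group_b)
f_tight = one_way_anova_f(group_a, group_c)
print(round(f_noisy, 3), round(f_tight, 1))
```

Both comparisons have a fold change of 4, but the score is below 1 for the noisy group and above 100 for the tightly grouped one, so only the second would be called reproducible.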

After using ANOVA to calculate a p-value for each gene, it is useful to choose a p-value cut-off, below which genes will be considered differentially expressed, and above which genes will not be considered differentially expressed. This cut-off will be arbitrary, but its choice should be made with an understanding of the trade-offs between sensitivity and selectivity that are inherent to choosing a significance cut-off. In general, choosing a lower significance cut-off will result in fewer genes being identified as differentially expressed, but a smaller portion of those that are selected will be false-positives. Choosing a higher significance cut-off will result in more genes being identified as differentially expressed, but a greater portion of those will be false-positives. At any significance cut-off it is possible to estimate the associated false-positive and false-negative rates. This allows an informed choice of the significance cut-off.

ANOVA can take a few different forms depending on the experimental design. The most basic type of ANOVA is a one-way ANOVA. In a one-way ANOVA, the sample groups are stratified along a single experimental variable. The simplest one-way ANOVA, with two sample groups, is equivalent to a t-test. The result of an ANOVA comparing more than two groups is the probability that differences this large among the group means would arise by chance if all of the groups were in fact the same. A low value indicates that at least one of the groups differs from the rest, but not which one.

Two-way ANOVA differs from one-way ANOVA in that it generates a p-value for each of the two primary experimental axes as well as a p-value for the interaction between the two axes.

Multiple Hypothesis Testing

Correction of significance results for multiple hypothesis testing is an important concern in microarray data analysis. It is common to use a p-value cut-off of 0.05. In a microarray experiment in which 20,000 genes are measured, even if no genes are truly differentially expressed, 1,000 genes can be expected to meet the p < 0.05 significance cut-off by chance alone. Furthermore, in the same 20,000 gene experiment with no changed genes, one unchanged gene would be expected to have a p-value as low as 0.00005.

A statistical test, like ANOVA, applied to microarray data tells you the probability that the observations made about a single gene could have been made if the null hypothesis, that the gene is not changed, were true. When applied to normally distributed random data, p-values will be evenly distributed between 0 and 1. Thus, when looking at a single gene, a very low p-value is a significant finding, but as you increase the number of genes observed, the chance of finding a single very low p-value increases.

Take a fictitious microarray data set with 20,000 genes, none of which are differentially expressed between the experimental groups. We will use a p-value cut-off of 0.05 to identify differentially regulated genes. If we look at any one gene from our fictitious data set, which we know is not differentially expressed, there is a 1 in 20 chance of it having a p-value less than 0.05. Our gene-wise false-positive rate, at this level of sensitivity, is 5%. So, if we were to use a microarray to observe the expression of a single gene, we could use a p-value cut-off of 0.05 and control false-positives at a rate of 5%.

If we use a statistical test and a p-value cut-off of p < 0.05 to identify differentially expressed genes from our fictitious microarray experiment, our gene-wise false-positive rate is still 5%. Five percent of 20,000 genes is 1,000 genes that were not actually differentially expressed but would be identified as significant at this level of sensitivity. Furthermore, it would be expected that one unchanged gene would have a p-value as low as 0.00005. Testing as many hypotheses as there are genes on a microarray gives plenty of chances to make a mistake.
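The fictitious experiment can be simulated directly. The sketch below assumes only what the text states, that p-values for unchanged genes are evenly distributed between 0 and 1. The random seed is an arbitrary choice that makes the run repeatable.

```python
import random

random.seed(0)  # arbitrary fixed seed so the sketch is repeatable

n_genes = 20_000
cutoff = 0.05

# Every gene is unchanged, so each p-value is a uniform draw from [0, 1).
null_p_values = [random.random() for _ in range(n_genes)]

false_positives = sum(p < cutoff for p in null_p_values)
expected = n_genes * cutoff
smallest = min(null_p_values)

print("genes below the cut-off:", false_positives)
print("expected by chance:", expected)
print("smallest p-value seen:", smallest)
```

The count of genes below the cut-off lands near 1,000 even though nothing in the simulated data is differentially expressed, and the smallest p-value is on the order of 1 divided by the number of genes.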

There are a few different methods for dealing with multiple hypothesis testing in significance analysis of microarray data. The Bonferroni correction multiplies the p-value observed for each hypothesis by the number of hypotheses being tested. The Bonferroni correction is usually overly stringent for microarray data analysis. If we use a Bonferroni-corrected p-value cut-off of 0.05 on a real microarray data set, then no matter how many genes meet the significance cut-off, there will be at most a 5% chance that even a single false-positive is among them. If we identify 100 genes that are differentially expressed in an experiment, we would likely be willing to accept a few false-positives among the 100. The Bonferroni criterion, that there is only a 5% chance that a single false-positive is among the 100, is more control of false-positives than is usually necessary. Increasing selectivity using the Bonferroni correction reduces sensitivity, so fewer differentially regulated genes will be identified.
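A minimal sketch of the Bonferroni correction as described above: each p-value is multiplied by the number of hypotheses tested and capped at 1. The five p-values are made-up illustrations, and `n_tests` is set to the 20,000 genes of the running example rather than the length of the short list.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw p-values for five genes taken from a 20,000-gene array.
raw = [0.000001, 0.00002, 0.004, 0.03, 0.6]

adjusted = bonferroni(raw, n_tests=20_000)
significant = [p_adj < 0.05 for p_adj in adjusted]

for p, p_adj, keep in zip(raw, adjusted, significant):
    print(f"raw p = {p:<8g} corrected p = {p_adj:<6g} significant: {keep}")
```

Only the first gene survives the correction. A gene with a raw p-value of 0.00002, which would look convincing on its own, is rejected, which illustrates the loss of sensitivity.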

Another method for treating the multiple hypothesis problem makes more sense for microarray experiments. The False Discovery Rate (FDR) correction of Benjamini and Hochberg estimates the gene-wise false-positive rate among the genes at a significance cut-off. The FDR is the quotient of the number of unchanged genes expected at a given significance cut-off over the number of genes detected at that significance cut-off.

The assumption that unchanged genes would have p-values evenly distributed between 0 and 1 can be used to estimate the number of false-positives expected at a given significance cut-off. The number of false-positives expected at a given significance cut-off will be equal to the number of unchanged genes (or the number of genes on the microarray) times the p-value of the significance cut-off.
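The quotient described above can be sketched as follows. This is a simplified estimate built from the two sentences in the text, not a full implementation of the Benjamini and Hochberg procedure. The function name `estimated_fdr` and the constructed data set (900 unchanged genes with evenly spaced p-values plus 100 changed genes with p = 0.0005) are assumptions made for illustration.

```python
def estimated_fdr(p_values, cutoff, n_unchanged=None):
    """Expected false-positives at `cutoff` divided by genes detected at `cutoff`."""
    if n_unchanged is None:
        # Conservative default from the text: treat every gene on the array as unchanged.
        n_unchanged = len(p_values)
    detected = sum(p < cutoff for p in p_values)
    if detected == 0:
        return 0.0
    expected_false = n_unchanged * cutoff
    return min(1.0, expected_false / detected)

# Hypothetical data: 900 unchanged genes with evenly spaced p-values,
# plus 100 changed genes that all have a small p-value.
unchanged = [(i + 0.5) / 900 for i in range(900)]
changed = [0.0005] * 100
p_values = unchanged + changed

fdr_all = estimated_fdr(p_values, cutoff=0.01)
fdr_known = estimated_fdr(p_values, cutoff=0.01, n_unchanged=900)
print(round(fdr_all, 4), round(fdr_known, 4))
```

At a cut-off of 0.01, 109 genes are detected, of which 9 are unchanged. Using the full array size gives an estimate of 10 / 109, slightly above the true proportion of 9 / 109, which is what supplying the number of unchanged genes recovers.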

Estimating the number of changed and unchanged genes

Based on two assumptions, it is possible to estimate the number of changed and unchanged genes in a microarray data set. The first assumption is that unchanged genes will have p-values evenly distributed between 0 and 1. The second assumption is that changed genes will not have p-values greater than a certain p-value threshold.

If there are no changed genes with p greater than the threshold, then all of the genes with p greater than the threshold are unchanged. If the unchanged genes have evenly distributed p-values, then the density of unchanged genes above the threshold will be the same as the density of unchanged genes below the threshold. So, we calculate the density of unchanged genes above the threshold, and integrate this constant density from p equals 0 to 1. The result is the estimated number of unchanged genes, and the remaining genes are estimated to be changed.
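A sketch of this estimate follows, under the two stated assumptions. The threshold of 0.5 is an arbitrary choice for illustration (the text does not name a value), and the function name `estimate_unchanged` and the constructed p-values are hypothetical.

```python
def estimate_unchanged(p_values, threshold=0.5):
    """Estimate (unchanged, changed) gene counts from the p-values above `threshold`."""
    above = sum(p > threshold for p in p_values)
    # Genes per unit of p-value in the region assumed to hold only unchanged genes.
    density = above / (1.0 - threshold)
    # Integrating a constant density from p = 0 to p = 1 multiplies it by 1.
    n_unchanged = min(float(len(p_values)), density * 1.0)
    return n_unchanged, len(p_values) - n_unchanged

# Hypothetical data: 900 unchanged genes with evenly spaced p-values,
# plus 100 changed genes that all fall well below the threshold.
p_values = [(i + 0.5) / 900 for i in range(900)] + [0.0005] * 100

n_unchanged, n_changed = estimate_unchanged(p_values)
print(n_unchanged, n_changed)
```

On this constructed data set, 450 genes lie above the threshold, giving a density of 900 genes per unit of p-value and an estimate of 900 unchanged and 100 changed genes. On real data the estimate depends on how well the chosen threshold satisfies the second assumption.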

Advanced Analysis Techniques

Principal Components Analysis

Technique

Principal Components Analysis is a mathematical transformation that can be applied to microarray data sets, allowing data compression and dimensionality reduction. It can be used for feature extraction and data classification. The primary objective is to transform the data into a new space where data analysis, and in particular discrimination between classes, is easier. According to this technique, the first principal component of a data set is the direction along which there is the largest variance over all samples. The reasoning is that the direction along which there is maximum variation is the most likely to contain the information about class discrimination.

For classification, the Euclidean distance is calculated between a test vector and each of the given data sets, and the test vector is assigned to the set at the minimum distance.
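To make the idea of a direction of largest variance concrete, here is a small dependency-free sketch. It finds the first principal component by power iteration on the covariance matrix, which is one of several ways to compute it and is chosen here only because it needs no libraries. The function name, the two-measurement toy data, and the seed are all assumptions for illustration. The sketch also assumes the data have nonzero variance and that the starting vector is not perpendicular to the answer.

```python
import math
import random

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component.

    rows: one list of measurements per sample.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]

    # Sample covariance matrix of the measurements.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1)
            for b in range(dims)] for a in range(dims)]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the direction of largest variance.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Coordinate of each sample along the component.
    scores = [sum(c[j] * v[j] for j in range(dims)) for c in centered]
    return v, scores

random.seed(1)  # arbitrary seed for repeatability
# Toy data: the second measurement is about twice the first, plus small noise.
samples = [[t, 2 * t + random.gauss(0, 0.1)] for t in range(10)]

pc1, scores = first_principal_component(samples)
print([round(x, 2) for x in pc1])
```

For this toy data the component points close to the direction (1, 2) scaled to unit length, so a single score per sample captures nearly all of the variation in the two measurements.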

Looking at samples

Looking at genes

Clustering

Hierarchical Clustering

Gene Clustering

Sample Clustering

K-Means Clustering

Biological Data Mining and Pathway Analysis

EASE

GenMapp and MappFinder

Visualization

Introduction

Visualizations are often associated with the presentation of microarray data.

Heat Map

The most common of these visualizations is the heat map.

Volcano Plot

In a Volcano Plot, the fold change and significance for each gene are displayed as a scatter plot. Both fold change and significance are generally plotted in log scale. The spots take a characteristic volcano form because absolute fold change is correlated with significance.

Volcano plots can be used to demonstrate fold change and significance cut-offs.
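The coordinates of a volcano plot can be computed as in the sketch below. It assumes the common convention of log base 2 for fold change and negative log base 10 for the p-value, since the text says only that both are plotted in log scale. The gene names, values, and the cut-offs of 2-fold and p < 0.05 are hypothetical.

```python
import math

# Hypothetical genes: (name, fold-change ratio, p-value). Illustrative values only.
genes = [
    ("gene_1", 4.0, 0.0001),
    ("gene_2", 0.25, 0.001),
    ("gene_3", 1.1, 0.40),
    ("gene_4", 3.0, 0.20),
]

FOLD_CUTOFF = 2.0   # assumed fold change cut-off (2-fold up or down)
P_CUTOFF = 0.05     # assumed significance cut-off

def volcano_point(ratio, p_value):
    """x = log2 fold change, y = -log10 p-value (higher means more significant)."""
    return math.log2(ratio), -math.log10(p_value)

def passes_cutoffs(ratio, p_value):
    x, _ = volcano_point(ratio, p_value)
    return abs(x) >= math.log2(FOLD_CUTOFF) and p_value < P_CUTOFF

points = {name: volcano_point(r, p) for name, r, p in genes}
selected = [name for name, r, p in genes if passes_cutoffs(r, p)]

for name, (x, y) in points.items():
    print(f"{name}: x = {x:+.2f}, y = {y:.2f}")
print("selected:", selected)
```

On the plot, the fold change cut-off becomes a pair of vertical lines and the significance cut-off a horizontal line. Genes in the upper left and upper right regions pass both, while a gene like `gene_4` has a large fold change but falls below the significance line.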

Picture here

Volcano plots are also an excellent way to visualize the changes that occur in a group of genes.

Picture here

Talk about making volcano plots comparing more than two groups?

Pathway Visualization

GenMapp

Other Crap

For Microarray Resource customers, ANOVA is implemented within Microsoft Excel.

Assumptions/Limitations of ANOVA

January 26, 2011

January 23, 2011
