January 23, 2011

Microarray data analysis methods


Data normalization



To make meaningful comparisons of a gene’s signal intensities between multiple microarrays, the signal intensities from each chip must be expressed on the same scale. This is accomplished through data normalization.

The linear normalization provided by the Affymetrix software package MAS 5.0 equilibrates the overall intensity levels of a group of chips, but it does not normalize intensity-dependent differences between chips. Such differences are dramatic on only a small fraction of arrays and might result from the arrays being scanned such that some of the intensity values fall outside the linear range of the detector. Whatever the cause, we have found that a normalization scheme that corrects for intensity-dependent differences between chips gives a more accurate measure of signal intensity than linear normalization does. Consequently, we have employed a quantile method to normalize Affymetrix GeneChip microarrays.

The quantile normalization method adjusts the signal intensities on each chip as follows. Within each array, the signal intensity of each gene is ranked. (example) If a set of genes is tied, each is given the mean of the ranks that the tied genes span. (example) Across the arrays, the mean signal intensity for the genes of each rank is calculated. For every gene on each array, the signal intensity of the gene is then replaced by the mean signal intensity for the genes of that rank across all of the arrays.
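The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for our analysis. It assumes that ranks run from lowest to highest intensity and that tied genes receive the average of the rank means over the ranks the tie spans, which is one possible reading of the tie rule. The function name and the toy intensities are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per chip).

    All arrays must list the same genes in the same order.
    Assumption: ranks are ascending; ties share the average of the
    rank means over the ranks they span.
    """
    n_genes = len(arrays[0])
    n_arrays = len(arrays)

    # Mean intensity at each rank across all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / n_arrays for r in range(n_genes)
    ]

    normalized = []
    for a in arrays:
        # Gene indices ordered from lowest to highest intensity.
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        i = 0
        while i < n_genes:
            # Extend j to cover the run of genes tied with the gene at rank i.
            j = i
            while j + 1 < n_genes and a[order[j + 1]] == a[order[i]]:
                j += 1
            tied_mean = sum(rank_means[i:j + 1]) / (j - i + 1)
            for k in range(i, j + 1):
                out[order[k]] = tied_mean
            i = j + 1
        normalized.append(out)
    return normalized


# Toy data: three chips, four genes (hypothetical values).
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],  # genes 0 and 2 are tied
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(chips)
```

In this toy data set, the first and third chips end up with the same set of values after normalization, and the two tied genes on the second chip receive the same value.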

The effect of this normalization method is to make the distribution of signal intensities on each array identical.  The differences in the expression level of a particular gene between arrays are a result of where the measurement is ranked within each array.


Data filtering


The Affymetrix U133A and U133B arrays are capable of detecting the expression of a large fraction of the genes in the human genome. Because we do not expect every gene in the genome to be expressed in endothelial cells, we sought to remove from our dataset both those genes that are not expressed in our samples and those that might be expressed but whose expression levels could not be reliably quantitated by the Affymetrix system.

To accomplish this task, we took advantage of the fact that the probe set for each gene on the Affymetrix arrays contains eleven perfect match (PM) probes and an equal number of mismatch (MM) probes. Unlike hybridization to the perfect match probes, hybridization to the mismatch probes is not gene-specific. The ratio of PM to MM hybridization therefore indicates whether the PM hybridization is likely to be the result of gene-specific hybridization or of experimental noise. This concept has been mathematically formalized, and algorithms for estimating the probability that PM hybridization results from gene-specific hybridization are provided in Affymetrix’s Microarray Suite 5.0 (MAS) software package. Through the use of adjustable cutoffs, MAS can further reduce these probabilities to a trinary “Present”, “Absent” or “Marginal” call, indicating high, low, or intermediate probability of gene-specific hybridization respectively.
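The cutoff step can be illustrated with a short Python sketch. It is not the MAS implementation. It assumes that MAS reports a detection p-value for each probe set, where a small value favors gene-specific hybridization, and that two adjustable cutoffs separate the three calls. The function name, the parameter names `alpha1` and `alpha2`, and the values 0.04 and 0.06 are assumptions for illustration and should be checked against the Affymetrix documentation.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Reduce a detection p-value to a three-way call.

    Assumption: smaller p-values indicate gene-specific hybridization.
    alpha1 and alpha2 are illustrative cutoffs, not confirmed defaults.
    """
    if p_value < alpha1:
        return "Present"
    if p_value < alpha2:
        return "Marginal"
    return "Absent"


# Hypothetical p-values for three probe sets.
calls = [detection_call(p) for p in (0.001, 0.05, 0.5)]
```

Raising or lowering the two cutoffs changes how many probe sets fall into each category, which is why the calls depend on the settings chosen.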

We used the Affymetrix-recommended settings in MAS to call whether gene-specific hybridization was detected for each probe set on each array. We then eliminated from our data set those probe sets for which gene-specific hybridization was “Absent” on every array. These eliminated genes include both those that are not expressed in our endothelial cell line and those that cannot be detected due to technical limitations. In our data set we included hybridization-intensity data from probe sets that had at least one “Present” call, as we hypothesized that our different treatment conditions might cause expression of some genes to be induced from undetectable levels.
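A minimal Python sketch of this filter follows. It implements the elimination rule as stated, dropping a probe set only when it is “Absent” on every array, so a probe set with only “Marginal” and “Absent” calls is kept. The function name, the probe set identifiers, and the calls are hypothetical.

```python
def filter_probe_sets(calls_by_probe_set):
    """Keep probe sets that are not called "Absent" on every array.

    calls_by_probe_set maps a probe set ID to its list of detection
    calls, one call per array.
    """
    return {
        probe_set: calls
        for probe_set, calls in calls_by_probe_set.items()
        if any(call != "Absent" for call in calls)
    }


# Hypothetical calls for three probe sets across three arrays.
example_calls = {
    "probe_set_A": ["Absent", "Absent", "Absent"],
    "probe_set_B": ["Absent", "Present", "Absent"],
    "probe_set_C": ["Marginal", "Absent", "Absent"],
}
kept = filter_probe_sets(example_calls)
removed = sorted(set(example_calls) - set(kept))
```

Here only `probe_set_A` is removed. A stricter filter that requires at least one “Present” call would also remove `probe_set_C`.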

This filtering scheme eliminated xxxx genes from our data set (Table 1).


