January 26, 2011

How to assess the statistical significance of the overlap between two sets of genomic regions


A number of different statistical methods can be used to assess the statistical significance of the overlap between two sets of genomic regions.  One could use a chi-square test, and compare the total size of the overlap between the two sets of genomic regions to the size of the overlap that would be expected at random, based on the total size of the genome and the total size of the genomic regions in each set.  When using this method, it is probably best to assume a total genome size that is smaller than the actual genome size, since some regions of the genome are unmappable.  For mouse and human, we assume that the total size of the genome is 2x10^9 bp.  This method accurately models the overlap that would be expected between two datasets if they were truly randomly associated.  Unfortunately, since the genome is so large, this method will call the overlap between two datasets statistically significant even if it is not much larger than would be expected at random.  One way to improve this method is to consider the genome in bins 100-500 bp in size, rather than one base at a time.  This reduces the overall n in the chi-square calculation and is a reasonable adjustment, since the resolution of ChIP-Seq is in the range of 100-500 bp for most datasets.
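As a rough illustration of the binned chi-square approach described above, the following Python sketch builds a 2x2 table of bins (covered by both sets, by one set only, or by neither) and computes the expected overlap, the chi-square statistic, and a p-value.  The function name, the 200 bp default bin size, and the use of an uncorrected chi-square test with one degree of freedom are assumptions made for this example, not details taken from the manuscript.

```python
import math

def chi_square_overlap(bins_a, bins_b, bins_both,
                       genome_size=2_000_000_000, bin_size=200):
    """Chi-square test on a 2x2 table of genome bins.

    bins_a, bins_b: number of bins covered by each set of regions.
    bins_both: number of bins covered by both sets.
    genome_size and bin_size are illustrative defaults (assumptions).
    """
    n = genome_size // bin_size              # total number of bins
    expected = bins_a * bins_b / n           # overlap expected at random

    # Observed 2x2 table: rows = in A / not in A, columns = in B / not in B.
    o11 = bins_both
    o12 = bins_a - bins_both
    o21 = bins_b - bins_both
    o22 = n - bins_a - bins_b + bins_both

    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    if min(row1, row2, col1, col2) <= 0:
        raise ValueError("each set must cover some, but not all, bins")

    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (row1 * row2 * col1 * col2)

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

expected, chi2, p_value = chi_square_overlap(1000, 1000, 50)
print(expected, chi2, p_value)
```

Lowering bin_size toward 1 increases n, which is how the per-base version of the test ends up calling even modest overlaps significant.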
Another way to assess the statistical significance of the overlap between two sets of genomic regions is to use a Monte Carlo method.  In these methods, thousands or millions of simulated datasets are created and the overlaps between the simulated datasets are tabulated.  The statistical significance of the actual overlap is then estimated from the frequency with which an overlap at least that large occurs in the simulated datasets.  We used a Monte Carlo method to assess the statistical significance of the overlap between two sets of genomic regions in this manuscript.  To create the simulated datasets, we shifted each of the genomic regions in one of the datasets a random distance between -2,000 bp and +2,000 bp from its original position.
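The shifting procedure above might be sketched in Python as follows.  This is a simplified illustration rather than the code used for the manuscript: it assumes all regions lie on a single chromosome, that regions are half-open (start, end) pairs, and that regions within a set do not overlap one another.  The function names are hypothetical, and the add-one correction in the p-value (which keeps the estimate from being exactly zero) is a choice made for this example.

```python
import random

def total_overlap(regions_a, regions_b):
    """Total number of bp shared between two lists of (start, end) regions."""
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in regions_a
               for s2, e2 in regions_b)

def shift_regions(regions, max_shift, rng):
    """Shift each region independently by a random offset in [-max_shift, max_shift]."""
    shifted = []
    for start, end in regions:
        offset = rng.randint(-max_shift, max_shift)
        shifted.append((start + offset, end + offset))
    return shifted

def monte_carlo_p(regions_a, regions_b, n_sim=1000, max_shift=2000, seed=0):
    """Estimate how often shifted copies of regions_a overlap regions_b
    at least as much as the original regions_a does."""
    rng = random.Random(seed)
    observed = total_overlap(regions_a, regions_b)
    hits = 0
    for _ in range(n_sim):
        simulated = shift_regions(regions_a, max_shift, rng)
        if total_overlap(simulated, regions_b) >= observed:
            hits += 1
    # Add-one correction (an assumption of this sketch, not from the text).
    return observed, (hits + 1) / (n_sim + 1)

# Twenty hypothetical 200 bp regions that coincide exactly in both sets.
set_a = [(10_000 * i, 10_000 * i + 200) for i in range(1, 21)]
set_b = list(set_a)
print(monte_carlo_p(set_a, set_b))
```

The pairwise loop in total_overlap is quadratic in the number of regions; for real datasets a sorted sweep or an interval tree would be a more practical choice.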
We noted that if two sets of genomic regions tend to occur in the same place, then their overlap will be statistically significant.  This will be true whether the regions occur together 10% of the time or 100% of the time.  Thus, the test for statistical significance is not very good at describing the degree of overlap between two datasets.  Consequently, in each instance that we assessed the statistical significance of the overlap between two sets of genomic regions, we also calculated the fold enrichment of the overlap between the two datasets.  This was calculated as the ratio of the total size of the actual overlap between the two sets of regions to the size of the overlap between the two datasets that would be expected at random.  In the manuscript, when assessing the overlap between two sets of genomic regions, we reported both the p-value, as calculated using the Monte Carlo method described above, and the fold enrichment between the two datasets.
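A minimal Python sketch of the fold enrichment calculation is shown below.  It assumes that the overlap expected at random is the product of the two sets' total sizes divided by the genome size, which follows the description of the expected overlap given for the chi-square approach; the function name and the example numbers are hypothetical.

```python
def fold_enrichment(overlap_bp, total_bp_a, total_bp_b,
                    genome_size=2_000_000_000):
    """Observed overlap divided by the overlap expected at random.

    Assumes expected overlap = total_bp_a * total_bp_b / genome_size.
    """
    expected_bp = total_bp_a * total_bp_b / genome_size
    if expected_bp == 0:
        raise ValueError("both sets must cover at least one bp")
    return overlap_bp / expected_bp

# Hypothetical example: sets covering 1 Mb and 2 Mb that share 50 kb.
print(fold_enrichment(50_000, 1_000_000, 2_000_000))  # prints 50.0
```

Unlike the p-value, this number distinguishes between sets that coincide occasionally and sets that coincide almost everywhere.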

