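Bonferroni correction: if n independent hypotheses are tested simultaneously on the same data set, the significance level used for each hypothesis should be 1/n of the level that would be used if only one hypothesis were tested.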

Introduction

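For example, to test two independent hypotheses on the same data set at the usual significance level of 0.05, each of the two hypotheses should be tested at the stricter level of 0.025, that is, 0.05 * (1/2). The method was developed by Carlo Emilio Bonferroni, hence the name Bonferroni correction.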
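As a minimal sketch in R, the per-test threshold can be computed directly, and the same decision can be reached by adjusting the p-values with `p.adjust`. The variable names and the two raw p-values below are illustrative assumptions, not values from the text.

```r
# Illustrative settings (assumptions): overall level and number of tests
alpha <- 0.05
n <- 2

# Bonferroni per-test threshold: alpha / n
threshold <- alpha / n

# Hypothetical raw p-values for the two tests
p <- c(0.03, 0.01)

# Decision 1: compare raw p-values with the stricter threshold
reject_by_threshold <- p <= threshold

# Decision 2: multiply p-values by n (capped at 1) and compare with alpha
p_bonf <- p.adjust(p, method = "bonferroni")
reject_by_adjust <- p_bonf <= alpha

threshold
reject_by_threshold
reject_by_adjust
```

Both routes flag only the second test here: 0.03 would pass an uncorrected 0.05 cutoff but fails the corrected one.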

The rationale is that when many hypotheses are tested on the same data set, roughly one in every 20 can reach the 0.05 significance level purely by chance, even when no real effect exists.
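The "one in 20" argument can be made concrete with a short calculation. The sketch below assumes independent tests in which every null hypothesis is true; the variable names are illustrative.

```r
alpha <- 0.05
n_tests <- c(1, 2, 20)

# Expected number of false positives among n true null hypotheses
expected_false_pos <- n_tests * alpha

# Probability of at least one false positive with no correction
# (assumes the tests are independent)
fwer_uncorrected <- 1 - (1 - alpha)^n_tests

# Same probability when each test uses the Bonferroni level alpha / n
fwer_bonferroni <- 1 - (1 - alpha / n_tests)^n_tests

data.frame(n_tests, expected_false_pos, fwer_uncorrected, fwer_bonferroni)
```

With 20 uncorrected tests the chance of at least one spurious "significant" result is about 64%, while the Bonferroni level keeps it just under 5%.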

Original Wikipedia text

Bonferroni correction

Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested.

For example, to test two independent hypotheses on the same data at 0.05 significance level, instead of using a p value threshold of 0.05, one would use a stricter threshold of 0.025.

The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data, where 1 out of every 20 hypothesis-tests will appear to be significant at the α = 0.05 level purely due to chance. It was developed by Carlo Emilio Bonferroni.

A less restrictive criterion is the rough false discovery rate, giving (3/4) * 0.05 = 0.0375 for n = 2 and (21/40) * 0.05 = 0.02625 for n = 20.
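The two figures quoted above are consistent with a threshold of (n + 1) / (2n) times the significance level. That general form is inferred here from the two examples and is an assumption, not something the quoted text states; the function name is hypothetical.

```r
# Assumed general form inferred from the two quoted examples:
# threshold = (n + 1) / (2 * n) * alpha
rough_fdr_threshold <- function(n, alpha = 0.05) {
  (n + 1) / (2 * n) * alpha
}

n <- c(2, 20)
rough <- rough_fdr_threshold(n)   # reproduces 0.0375 and 0.02625
bonferroni <- 0.05 / n            # Bonferroni thresholds for comparison

data.frame(n, rough, bonferroni)
```

For both values of n the rough criterion is looser than the Bonferroni threshold, which is what "less restrictive" refers to.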

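Data analysis frequently runs into the multiple testing problem. In 1995 Benjamini proposed a method that handles it by limiting the proportion of false positives among the selected results; in statistical terms, this amounts to controlling the FDR at no more than 5%.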

According to the theorem Benjamini proved in his paper, the steps for controlling the FDR are actually very simple.

Suppose there are m candidate genes in total, and that their p-values sorted in ascending order are p(1), p(2), ..., p(m).

The False Discovery Rate (FDR) of a set of predictions is the expected proportion of false predictions in the set. For example, if the algorithm returns 100 genes with a false discovery rate of 0.3, then we should expect 70 of them to be correct.

The FDR is very different from a p-value, and a much higher FDR can be tolerated than a p-value. In the example above, a set of 100 predictions of which 70 are correct might be very useful, especially if there are thousands of genes on the array, most of which are not differentially expressed. In contrast, a p-value of 0.3 is generally unacceptable in any circumstance, whereas an FDR of 0.5 or even higher might be quite meaningful.

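The FDR error control method, proposed by Benjamini in 1995, determines the p-value threshold by controlling the FDR (False Discovery Rate). Suppose you select R genes as differentially expressed, of which S are truly differentially expressed and the other V are not, i.e. they are false positives. In practice we want the error proportion Q = V/R not to exceed, on average, some preset value (for example 0.05); in statistical terms, this is equivalent to controlling the FDR at no more than 5%.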
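The quantities R, S, V and Q can be illustrated on a toy example. The labels below are made up for illustration; in real data the true status of each gene is unknown, which is why the FDR is controlled as an expectation of Q and not observed directly.

```r
# Hypothetical ground truth: TRUE = gene is truly differentially expressed
truth    <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
# Hypothetical calls: TRUE = gene was selected as differentially expressed
selected <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

R <- sum(selected)             # number of selected genes
S <- sum(selected & truth)     # selected and truly differentially expressed
V <- sum(selected & !truth)    # selected but not differentially expressed

# False discovery proportion; taken to be 0 when nothing is selected
Q <- if (R > 0) V / R else 0

c(R = R, S = S, V = V, Q = Q)
```

Here one of the four selected genes is a false positive, so Q = 0.25 for this particular selection.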

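Sort the p-values of all candidate genes in ascending order. To keep the FDR at or below q, find the largest positive integer i such that p(i) <= (i*q)/m, then select the genes corresponding to p(1), p(2), ..., p(i) as differentially expressed; this guarantees statistically that the FDR does not exceed q. The corresponding FDR-adjusted p-value is computed as:

adjusted p-value(i) = p(i) * length(p) / rank(p)

Here length(p) is m and rank(p) gives the rank i of each p-value. R's p.adjust additionally takes a running minimum from the largest p-value downward, so that the adjusted values never decrease as the rank increases.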
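The step-up rule can be sketched in a few lines of R. The function names `bh_select` and `bh_adjust` and the example p-values are hypothetical, introduced only for illustration; `bh_adjust` includes the running-minimum step so that it can be compared with `p.adjust`.

```r
# Step-up rule: find the largest i with p(i) <= i * q / m and
# select the genes with the i smallest p-values
bh_select <- function(p, q = 0.05) {
  m <- length(p)
  ord <- order(p)                       # indices that sort p ascending
  passed <- which(p[ord] <= seq_len(m) * q / m)
  selected <- rep(FALSE, m)
  if (length(passed) > 0) {
    i_max <- max(passed)                # largest rank meeting the inequality
    selected[ord[seq_len(i_max)]] <- TRUE
  }
  selected
}

# Adjusted p-values: p(i) * m / i, followed by a running minimum
# taken from the largest rank downward, capped at 1
bh_adjust <- function(p) {
  m <- length(p)
  ord <- order(p, decreasing = TRUE)
  adj <- pmin(1, cummin(p[ord] * m / (m:1)))
  adj[order(ord)]                       # back to the original order
}

p <- c(0.025, 0.60, 0.028, 0.005, 0.50)  # hypothetical p-values
bh_select(p, q = 0.05)
bh_adjust(p)
```

In this example the second-smallest p-value (0.025) misses its own cutoff of 0.02, but it is still selected because the third-smallest (0.028) meets its cutoff of 0.03; that is the effect of taking the largest i. Selecting genes whose adjusted p-value is at most q gives the same set.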

References

1. Audic, S. and Claverie, J. M. (1997). The significance of digital gene expression profiles. Genome Res. 7(10): 986-995.

2. Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29: 1165-1188.

For the calculation, see the p.adjust function in the R statistical software:

> p <- c(0.0003, 0.0001, 0.02)
> p
[1] 3e-04 1e-04 2e-02
> # FDR-adjusted p-values from p.adjust
> p.adjust(p, method = "fdr", n = length(p))
[1] 0.00045 0.00030 0.02000
> # same values from the formula p(i) * m / rank
> p * length(p) / rank(p)
[1] 0.00045 0.00030 0.02000
> length(p)
[1] 3
> rank(p)
[1] 2 1 3
> sort(p)
[1] 1e-04 3e-04 2e-02
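In the session above the simple formula and p.adjust agree because the scaled values already increase with rank. As a hedged illustration with made-up p-values (an assumption, not data from the source), the two can differ when that is not the case:

```r
p <- c(0.01, 0.02, 0.021)               # hypothetical p-values

# Formula from the text, with no monotonicity step
raw_scaled <- p * length(p) / rank(p)

# "fdr" is an alias for "BH" in p.adjust
bh <- p.adjust(p, method = "BH")

raw_scaled
bh
```

Here the formula gives 0.03, 0.03 and 0.021, while p.adjust returns 0.021 for all three, since a smaller p-value is never given a larger adjusted value than a bigger one.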