Kristoffer Magnusson just (March 2014) created a beautiful interactive visualisation, with harsh accompanying text, which is here: Understanding Statistical Power and Significance Testing. I took the opportunity to get on top of the controversies around significance testing, which have been discussed quite prominently on Twitter lately, using as my informant the prominent research methodologist Will Lowe. Below are the short notes that help me; if you don't like them, here's a longer, more detailed post by another expert that you might prefer.
There are really two different ways to check your experiments, which everyone tends to conflate but which are in fact completely at odds.
- Neyman & Pearson approach: When you have real control over the experimental context, e.g. large clinical trials, you pick an alpha threshold (how likely it is that you find a positive result that isn't really there). This alpha and your estimate (probably from a pilot study) of the effect size (d on Magnusson's figure) allow you to determine the N you need for a particular power. Power = 1 - beta, where beta is how likely it is that you fail to detect a positive result that is there, so power is the chance you will detect a difference if it does exist. If alpha is small (low likelihood of being overly optimistic) then, all else being equal, power goes down (you are more likely to miss a real result). But increasing N increases the power without affecting alpha. Similarly, if the effect size is bigger than you thought, the power goes up (without increasing N or alpha), but the converse is also true: a smaller effect than expected means less power. Alpha is unaffected by effect size. Neyman & Pearson are about inductive decisions. After the experiment is run, the scientists or doctors get no further look in; the answer is handed to them. Notes:
  - If it's true that psychology experiments (which usually use an unholy mush of Fisher and Neyman & Pearson) generally have power around 40% (as Gigerenzer says), then replication is actually easier for false results than true ones. There were some blog posts about this last year, but I can't find them in 2015, so I'm not sure this is a real problem.
  - It doesn't make sense to talk about two tails in N&P; you are making a comparison to a particular hypothesis concerning the real world, and the experiment should be set up to only be on one side of the "random" hypothesis. Otherwise the math gets nasty.
- Fisher approach: When you can't control that many aspects of your experiment, you stop worrying about one of the errors (type II, the one that power is about). This means you stop trying to figure out where nature might be, and you focus only on making sure of where it isn't, that is, on what could be generated by noise. This is null hypothesis testing. Fisher has a model of noise (vs. signal), and when the data don't look like his noise model he says there's some signal. It's an exploratory check that you aren't just looking at randomness. So we again pick a threshold for what we are willing to publish, but a p value is a real-valued thing and can be reported at any level. It is not alpha. Fisher is about inductive inference, allowing scientists as experts to decide what is going on once they see that something is going on. For Fisher, statistics is just doing a safety check. So in Magnusson's interactive figure, if you are thinking about p you should totally disregard the right-hand hypothesis (all the blue stuff). P values only tell you when the tail of the noise has gotten small enough to think something is going on. Two tails are fine in this case, because noise is symmetric.
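A minimal sketch of the relationships in the two approaches above, assuming a two-sample z-test under a normal approximation with equal group sizes. That choice of test is my assumption, not necessarily what Magnusson's figure computes, and the function names are made up for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power_one_sided(d, n_per_group, alpha=0.05):
    """Neyman & Pearson: approximate power of a one-sided two-sample z-test."""
    # The rejection threshold depends on alpha only, not on d or N.
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Where the test statistic is centred if the effect d is real.
    shift = d * sqrt(n_per_group / 2)
    return 1 - STD_NORMAL.cdf(z_crit - shift)


def n_per_group_for_power(d, power=0.8, alpha=0.05):
    """Neyman & Pearson: N per group needed to reach the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def p_value_two_sided(z):
    """Fisher: p value from the noise model alone; no alternative is used."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    print(n_per_group_for_power(d=0.5))           # N per group for 80% power
    print(power_one_sided(d=0.5, n_per_group=50))
    print(power_one_sided(d=0.5, n_per_group=50, alpha=0.01))  # smaller alpha, less power
    print(p_value_two_sided(1.96))
```

Under these assumptions an effect size of d = 0.5 needs about 50 per group for 80% power at alpha = 0.05. Note that `p_value_two_sided` never sees d or the power: it only asks how far out in the tails of the noise the observed statistic sits.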
if only 10 percent of the effects that psychologists search for are real, but all positive results are published, then setting a p-value of .05 would result in more than one-third of all positive findings reported in psychology journals being false positives.
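The arithmetic behind that kind of claim can be sketched as follows. The quote doesn't state a power, so the 0.8 and 0.4 used here are assumptions (0.4 being the Gigerenzer figure mentioned above), and `false_positive_share` is a hypothetical helper name.

```python
def false_positive_share(prior_real, alpha, power):
    """Share of published positives that are false, if every positive is published."""
    false_pos = (1 - prior_real) * alpha   # null effects that pass the threshold anyway
    true_pos = prior_real * power          # real effects that are actually detected
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    # 10% of searched-for effects are real, threshold of .05
    print(false_positive_share(prior_real=0.10, alpha=0.05, power=0.8))  # assumed power
    print(false_positive_share(prior_real=0.10, alpha=0.05, power=0.4))  # assumed power
```

With 80% power the share comes out at 36%, which matches "more than one-third"; with 40% power it rises to roughly 53%.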