Simpson's paradox


Simpson's paradox is a statistical paradox first described by E. H. Simpson in 1951, in which two sets of data each support a certain hypothesis when considered separately, but support the opposite hypothesis when combined.

As an example, suppose we have two people, Ann and Bob. In the first test, Ann and Bob are let loose on Wikipedia, and Ann edits 10 articles, improving 1 of them, while Bob edits 100 articles, improving 30 of them. In the second test, Ann and Bob are again let loose on Wikipedia, and this time Ann edits 100 articles, improving 60 of them, while Bob edits 10 articles, improving 9 of them. So in the first test, Ann improved 10% of the articles she edited, while Bob's success rate was 30%, and in the second test Ann managed 60% while Bob achieved 90%. Both sets of data therefore support the hypothesis that Bob's edits are more likely to be beneficial than Ann's. But if we combine the two sets of data we see that Ann and Bob each edited 110 articles, and that Ann improved 61 while Bob improved only 39. The combined data therefore support the opposite hypothesis, that Ann's edits are more likely to be beneficial than Bob's. The reversal arises because the two tests carry unequal weight for each person: most of Ann's edits were made in the second test, where both success rates were high, while most of Bob's were made in the first test, where both were low.
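
The arithmetic in the example above can be checked with a short script. This is only a sketch: the counts are taken from the example, but the names `results`, `rate`, `per_test_rates`, and `combined_rate` are made up for illustration.

```python
from fractions import Fraction

# (improved, edited) counts for each person in each test, from the example above.
results = {
    "Ann": {"test 1": (1, 10), "test 2": (60, 100)},
    "Bob": {"test 1": (30, 100), "test 2": (9, 10)},
}


def rate(improved, edited):
    """Success rate as an exact fraction."""
    return Fraction(improved, edited)


def per_test_rates(person):
    """Success rate of one person in each test taken separately."""
    return {test: rate(*counts) for test, counts in results[person].items()}


def combined_rate(person):
    """Success rate of one person after pooling the counts from both tests."""
    improved = sum(i for i, _ in results[person].values())
    edited = sum(e for _, e in results[person].values())
    return rate(improved, edited)


for person in results:
    for test, r in per_test_rates(person).items():
        print(f"{person} {test}: {float(r):.0%}")
    print(f"{person} combined: {float(combined_rate(person)):.1%}")
```

Bob has the higher rate in each test taken separately, yet the pooled rates come out at 61/110 (about 55%) for Ann and 39/110 (about 35%) for Bob. The pooled rate adds the counts together rather than averaging the two percentages, so each test contributes in proportion to the number of articles edited in it.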