
Investigate Hypothesis Testing Methods through Python Programming

This post resumes the exploration of Data Science from Scratch by Joel Grus, focusing this time on a straightforward coin-tossing scenario. It aims to bridge the gap between theory, such as the Central Limit Theorem and hypothesis testing, and practice, using both explanations and code.


Consider a series of 1,000 flips of a biased coin, one that lands on heads 55% of the time. For a two-sided test of fairness, the probability of a type-2 error (failing to reject the null hypothesis when it is actually false) works out to about 11.3%.

The null hypothesis in this case was that the coin was fair, with a 50% probability of landing on heads. The alternative hypothesis, or "alt" hypothesis, was that the probability was not equal to 0.5.

To test for a bias toward heads, a one-sided test can be performed: the null hypothesis is rejected only if the number of heads is significantly higher than 500. At a 5% significance level, the cutoff works out to roughly 526 heads, so any count above that gives a p-value below 5% and leads to rejection. (A value like 529.5 arises from a continuity correction: the probability of observing 530 or more heads is approximated by the probability of exceeding 529.5 under the normal curve.)

However, if the number of heads were 540, the 95% confidence interval for the true probability of heads would be (0.5091, 0.5709). Since this interval does not contain 0.500, the null hypothesis that this is a fair coin would be rejected.

Note, however, that even when the coin is fair, 5% of the time a test with 1,000 coin flips will produce a count outside the range 469-531 and incorrectly reject the null hypothesis. This is the type-1 error (false positive) rate of the test, not evidence that the coin is biased.

Confidence intervals offer another way to decide whether to reject the null hypothesis. For a coin that comes up heads 530 times out of 1,000 flips, the 95% confidence interval is (0.4991, 0.5609). Since this interval contains 0.500, the probability of heads for a fair coin, we do not reject the null.
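As a minimal sketch (not the book's exact code), the confidence interval for 530 heads in 1,000 flips can be computed from the normal approximation:

```python
import math

n, heads = 1000, 530
p_hat = heads / n                            # estimated probability of heads
sigma = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the estimate

# 95% interval: p_hat plus or minus 1.96 standard errors
lo, hi = p_hat - 1.96 * sigma, p_hat + 1.96 * sigma
print(round(lo, 4), round(hi, 4))   # -> 0.4991 0.5609
```

Because 0.5 sits inside the interval, the data are consistent with a fair coin.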

When you sum independent Bernoulli(p) trials, you get a Binomial(n, p) random variable. Given a null hypothesis of p = 0.5 and an alternative of p ≠ 0.5, the mean of the binomial count of heads in 1,000 coin flips is n·p = 500 under the null.
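One way to see this sum-of-trials view is a short simulation; `bernoulli_trial` and `binomial` here are hypothetical helper names, not code taken from the book:

```python
import random

def bernoulli_trial(p: float) -> int:
    """Return 1 (heads) with probability p, else 0 (tails)."""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """A Binomial(n, p) draw as the sum of n Bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))

random.seed(0)                 # reproducible draw
heads = binomial(1000, 0.5)    # one simulated count of heads in 1,000 flips
print(heads)                   # some value near the mean of 500
```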

The standard deviation of this binomial random variable is sqrt(1000 × 0.5 × 0.5) ≈ 15.8114. Under the normal approximation, the probability that the count falls between 490 and 520 is approximately 63%. To increase the power of the test, the rejection region can be adjusted so that all of the 5% significance sits in the upper tail, which lowers the cutoff there and reduces the probability of a type-2 error when the coin is biased toward heads.
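Those numbers can be checked with the normal approximation; this sketch builds the normal CDF from `math.erf`:

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

mu = 1000 * 0.5                        # 500 heads expected under the null
sigma = math.sqrt(1000 * 0.5 * 0.5)    # ~15.8114

# probability that the count of heads lands between 490 and 520
prob = normal_cdf(520, mu, sigma) - normal_cdf(490, mu, sigma)
print(round(prob, 2))   # -> 0.63
```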

The power of the one-sided test with an upper bound of 526 is 93.6%. This means that if the coin is truly biased, with p = 0.55, there is a 93.6% chance that the test will correctly reject the null hypothesis.
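A sketch of that power calculation, using z ≈ 1.645 for the one-sided 5% cutoff (an approximation rather than the book's inverse-CDF search):

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

mu_0, sigma_0 = 500.0, math.sqrt(1000 * 0.5 * 0.5)    # null: fair coin
mu_1, sigma_1 = 550.0, math.sqrt(1000 * 0.55 * 0.45)  # truth: p = 0.55

# one-sided 5% cutoff: all the significance sits in the upper tail
hi = mu_0 + 1.645 * sigma_0              # ~526

# type-2 error: the biased coin's count still falls below the cutoff
type_2_probability = normal_cdf(hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(round(power, 3))   # -> 0.936
```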

Each coin flip is a Bernoulli trial, an experiment with two outcomes: success (probability p) and failure (probability 1 − p). The central limit theorem implies that as the number of independent Bernoulli trials n gets large, the Binomial(n, p) distribution approaches a normal distribution. This makes it easier to perform statistical tests and calculate probabilities.

P-values represent another way of deciding whether to reject the null hypothesis: compute the probability, assuming the null hypothesis is true, of seeing a value at least as extreme as the one observed. For a fair coin, the two-sided p-value of observing 530 heads in 1,000 flips is approximately 6.2%, which is above 5%, so we would not reject the null.
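A sketch of the p-value computation, with a continuity correction (evaluating at 529.5, since "530 or more heads" maps to the region above 529.5 on the continuous curve):

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

mu, sigma = 500.0, math.sqrt(1000 * 0.5 * 0.5)

# two-sided p-value of 530 heads; P(X >= 530) is approximated by
# P(X > 529.5) under the normal curve (continuity correction)
p_value = 2 * (1 - normal_cdf(529.5, mu, sigma))
print(round(p_value, 3))   # -> 0.062
```

Since 0.062 > 0.05, we fail to reject the null at the 5% level.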

For a two-sided test with 1,000 coin flips at a 5% significance level, the power of the test (its probability of avoiding a type-2 error, or false negative) against the alternative p = 0.55 is 88.7%. This means that if the coin is biased with p = 0.55, there is an 88.7% chance that the test will correctly reject the null hypothesis.
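A sketch of the two-sided power calculation, under the same normal approximations as above:

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

mu_0, sigma_0 = 500.0, math.sqrt(1000 * 0.5 * 0.5)    # null: fair coin
mu_1, sigma_1 = 550.0, math.sqrt(1000 * 0.55 * 0.45)  # truth: p = 0.55

# two-sided 5% rejection region under the null: roughly (469, 531)
lo = mu_0 - 1.96 * sigma_0
hi = mu_0 + 1.96 * sigma_0

# type-2 error: the biased coin's count still falls inside (lo, hi)
type_2_probability = (normal_cdf(hi, mu_1, sigma_1)
                      - normal_cdf(lo, mu_1, sigma_1))
power = 1 - type_2_probability
print(round(power, 3))   # -> 0.887
```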

In conclusion, the statistical analysis of 1,000 coin flips provides insight into how the fairness of a coin can be assessed. When the coin is genuinely biased, with a 55% probability of landing on heads, the tests above detect it most of the time. The results demonstrate the usefulness of statistical tests and confidence intervals in analysing data and making informed decisions.

"Data Science from Scratch" by Joel Grus is a valuable resource for those interested in learning more about these statistical concepts.
