class: center, middle, inverse, title-slide # Power Analysis Primer ### Matt Schuelke ### 2018-10-31 --- # An (Un)Motivating Example Imagine you are a health-care provider for Eileen - she complains of chronic low back pain and arthritis - she wants to know if she should take glucosamine - you read a recent research article suggesting no significant glucosamine effect - based on that prominant paper you do not recommend glucosamine What if the article conclusions are wrong? By wrong I do NOT mean that the results were - misinterpreted by the researchers - reported incorrectly In fact, everything may have been done correctly - and the results may still be wrong - worse, we may never know the results were wrong --- # Court Room Analogy Situation - No one sure who robbed bank because of blindfold - Evidence is collected such as height, clothing, and get-away vehicle - Find suspect who seems congruent with evidence Null and alternative hypotheses are: - `\(H_o:\)` suspect is innocent - `\(H_a:\)` suspect is not innocent (guilty) We use evidence to suggest "beyond a shadow of a doubt" null is false What is a good level of surity to convince us to reject innocence? - 50% half the time, an innocent person will be convicted - 20% 1 in 5 innocent people will be convicted - 5% 1 in 20 innocent people will be convicted - 1% 1 in 100 innocent people will be convicted --- # Court Room Analogy Four possible situations. Two good. Two bad.  Which error is more "costly" and should be guarded against more closely? --- # Possible Errors in Testing More generally in null hypothesis significance testing (NHST):  As in the court example - two hypotheses - two possible truth states - two correct decisions - two incorrect decisions Because we never really know the true state of the world, our decisions can be erronous and we might not ever know it. --- # Pop Quiz 1 ## Question 1 If the hypotheses in the Eileen paper were: - `\(H_o:\)` Glucosamine does NOT have an effect - `\(H_a:\)` Glucosmaine does have an effect What type of error might the authors have committed? ## Question 2 The publication industry typically prefers significant findings. What are the implications for scientific inquiry if one commits a - Type I Error? - Type II Error? --- # Type I Error We ordered a shipment of rats described as being able to run a standard maze in 33 seconds. Upon observing a few in our maze, they seemed slow. Thus we collected completion times for our sample in order to test: `\(H_o:\)` population mean maze completion time was less than or equal to 33 seconds `\(H_a:\)` population mean maze completion time was greater than 33 seconds .pull-left[ By putting `\(\alpha\)` in the upper tail, we have specified what would be unusual results under a true null hypothesis. We might then compute a test statistic describing the central tendency of our sample and place it on this figure. If the test statistic were to the left, that would not be enough evidence to warrent rejecting our null hypothesis. If the test statistic were to the right, our sample would be very unusual if the null is actually true. ] .pull-right[  ] --- # Correctly Retaining the Null One goal of NHST is to try and improve our changes of making correct decisions. One such correct decision is to retain the null when it is actually true. This is directly related to our choice of `\(\alpha\)`. In fact, it is `\(1 - \alpha\)`. .pull-left[ `\(\alpha = 0.05\)`  ] .pull-right[ `\(\alpha = 0.01\)`  ] --- # Type II Error and Power Control of Type I Errors is pretty straightforward. We have designed NHST to directly control such errors by directly choosing our upper allowable limit before conducing our tests. Control of Type II errors is more complicated and dictated be a number of moving parts. Here is an image showing just *one* possible location of an alternative distribution if the alternative is true. .img-center-50[  ] --- # Type II Error and Power If we could know exactly where the alternative distribution is located on the number line, we could look at how the critical value is cutting the alternative distribution into two parts. .img-center-50[  ] Here the large green area represents the probability of a Type II Error: failing to reject the null when it is the alternative which is true. --- # Type II Error and Power Now let's show a larger difference between the null and alternative distributions. .img-center-50[  ] Here the null distribution and red area attributed to `\(\alpha\)` did not change. However, the green area attributed to `\(\beta\)` got smaller. The area under the alternative distribution that is not green, represents a correct decision: to reject the null hypothesis when it is false. The probability represented by this unshaded area is 1 - `\(\beta\)` and has a special name: POWER! --- # Type II Error and Power .center[  ] We want to have a very good probability of rejecting the null hypothesis when it is false because this will keep Type II Errors to a minimum. Maintaining power is a balancing act - too little: unable to detect practically interesting differences - too much: tiny, arbitrary differences could be detected as significant We want just enough power for our test statistics to be significant when they encounter the smallest differences that are practially noteworthy. The judgement about what is practially noteworthy is not a statistical issue. --- # Factors Influencing Power:<br>#1) Effect Size In the rat shipment example, the effect size could be visualized as the distance between the null and alternative distributions. Many ways to define effect size - magnitude of the impact of an independent variable on a dependnet variable - strength of an observed relationship Effect size is *positively* related to power. - larger effect sizes imbue greater power - smaller effect sizes imbue lower power --- # Factors Influencing Power:<br>#2) Sample Size The factor that is most easily changed by researchers Has a huge effect on power Like effect size, sample size is *positively* related to power - large sample sizes give greater power - smaller sample sizes give lower power Sometimes researchers talk about calculating power, but they often wish to calculte the number of observations required to give them the power they want. That is, they want a decent probability of finding significance with the least amount of resources. --- # Factors Influencing Power:<br>#3) Directional Hypotheses The choice of directional hypotheses can make it easier to find statistical significance Important caveat: power only goes up if the results turn out in the predicted direction .pull-left[ Non-directional Hypotheses  ] .pull-right[ Directional Hypotheses  ] --- # Pop Quiz 2 In the glucosamine study, the authors stated "We estimated that 250 patients should be enrolled based on a clinically important difference of 3 [points on our outcome instrument] with 80% power, a 2-sided significance level of 0.05, and adding 20% for possible droputs". ## Question 1 What is the probability of a Type I error the researchers are willing to have? Type II? ## Question 2 Suppose we wish to replicate these findings, but think an effect of 8 points would be clinically meaningful. All else being the same, would we require a larger or smaller sample size to achieve the same level of power? --- # Factors Influencing Power:<br>#4) Significance Level Significance level is *positively* related to power - larger alpha lowers the critical value and thus increases power - smaller alpha increases the critical value and thus lowers power  --- # Pop Quiz 3 In the glucosamine study, Wilkens et al. (2010) said, "... 250 patients should be enrolled based on a clinically important difference of 3 with 80% power, a 2-sided significance level of 0.05, and adding 20% for possible dropouts" (p. 47). ## Question 1 What would happen to power if the researchers lowered `\(\alpha\)` or changed to a one-tailed test? --- # Factors Influencing Power:<br>#5) Variability Effect size can be thought of as signal to noise ratio Using the rat shipment example - if the alternative were true (sample came from slow population) - a big difference between the sample and hypothesized population means would be the signal But we have to account for sampling variability as the noise - different samples have different obserations - different sets of observations have different sample means Luckily the Central Limit Theorem tell us that under many situations - the sampling distribution of the sample mean is normally distributed - and the standard devaition of this distribution is `$$\sigma/\sqrt{N}$$` --- # Factors Influencing Power:<br>#5) Variability There are two ways to make the signal to noise ratio big - increase the signal - decrease the noise We can increase the signal by - using treatments with larger effects - or selecting larger effects to study We can decrease the noise - using a larger sample - control extraneous variables Variability is *negatively* related to power - greater variability reduces power - less variability increases power --- # Pop Quiz 4 In the glucosamine study, Wilkens et al. (2010) described inclusion and exclusion criteria for participants. For example, patients were accepted if they were 25 years or older and excluded if they had used glucosamine previously. ## Question 1 In terms of signal and noise, what effect would the inclusion and exclusion criteria probably have on the power in this study? --- # Summary We almost never know the true state of the world This allows for two types of errors: - Type I: rejecting the null when it is true - Type II: failing to reject the null when it is false Null Hypothesis Significance Testing is designed to control - costly Type I errors directly by setting `\(\alpha\)` - Type II errors via a constalation of factors related to its compliment: power Factors positively related to power - Effect size - Sample size - Directional Hypotheses - Significance Level Factors negatively related to power - Variability