class: center, middle, inverse, title-slide

# Best guesses and uncertainty
## Lecture 2
### Dr Milan Valášek
### 05 February 2021

---
<script type="text/javascript">
<!-- message to display at the bottom of slides when ?live=true -->
const slideMessage = "<span>Ask questions at </span><a href = 'pollev.com/milanvalasek890'>pollev.com/milanvalasek890</a>"
plotsToPanels()
live()
</script>

## Today

- Point estimates vs interval estimates
- Confidence intervals
- *t*-distribution

---
## What stats is about (yet again)

- We want to know about the world (population)
- We can only get data from samples
- We calculate statistics on samples and use them to *estimate* the values in the population
- Statistics is all about *making inferences about populations based on samples*
- If we could measure the entire population, we wouldn't need stats!

---
## Point estimates

- You've heard of the sample mean, median, mode
- These are all point estimates: single numbers that are our best guesses about corresponding *population parameters*
- Sample measures of spread (<i>SD</i>, variance `\(s^2\)`, <i>etc.</i>) are also point estimates
- Even relationships between variables can be expressed using point estimates

---
## Point estimates

.pull-left[
*r* = −.07

![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

.pull-right[
*r* = .752

![](index_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

---
## Accuracy and uncertainty

- Sample mean `\(\bar{x}\)` is the best estimate `\(\hat{\mu}\)` of the population mean, but the means of almost all samples differ from the population mean `\(\mu\)`
- The same is true for *any* point estimate
- *SE* of the mean expresses the uncertainty about our estimate of the population mean
- *SE* can be calculated for other point estimates, not just the mean
- We can quantify uncertainty around point estimates using **interval estimates**

---
## Interval estimates

- In addition to estimating a single value, we can also estimate an interval around it
- <i>e.g.,</i> mean = 4.13
with an interval from −0.2 to 8.46
- Interval estimates communicate the uncertainty around point estimates
- There are different kinds of interval estimates
- Important: **confidence intervals**

---
## Confidence interval

- We can use *SE* and the sampling distribution to calculate a confidence interval (CI) with a certain *coverage*, <i>e.g.,</i> 90%, 95%, 99%...
- For a 95% CI, 95% of the intervals constructed around estimates from repeated samples will contain the value of the population parameter
- Let’s see an example

---
## Confidence interval

- Population of circles of different sizes

<img src="assets/ci_01.png" height="350px">

---
## Confidence interval

- Sample from population, estimate mean size

<img src="assets/ci_02.png" height="350px">

---
## Confidence interval

- Calculate the 95% CI around the mean

<img src="assets/ci_03.png" height="350px">

---
## Confidence interval

- Lather, rinse, repeat...

<img class="gif" src="assets/ci_03.png" gif="assets/ci_small.gif" height="350px">

---
## Confidence interval

- ~5% don't contain population mean = 95% coverage

<img src="assets/ci_04.png" height="350px">

---
## How is it made?

- Easy if we know the sampling distribution of the mean
- 95% of the sampling distribution lies within ±1.96 <i>SE</i> of its centre
- 95% CI around the estimated population mean is mean ±1.96 <i>SE</i>

---
## How is it made?

- Sampling distribution of the mean is approximately normal (as per [CLT](../../01/handout/#the-central-limit-theorem))
- Middle 95% of the sample means lie within ±1.96 <i>SE</i>
- We use the same 1.96 <i>SE</i> to construct the 95% CI around the mean

<img class="gif" src="assets/ci_constr.png" gif="assets/ci_constr01_small.gif assets/ci_constr02_small.gif assets/ci_constr03_small.gif" height="350px">

---
exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---
## How is it made?

- Sampling distribution is, however, not known!
- It can be approximated using the *t*-distribution, the sample *SD* (*s*), and *N*

---
## <i>t</i>-distribution

- Symmetrical, centred around 0
- Its shape changes based on **degrees of freedom**
- "Fat-tailed" when <i>df</i> = 1; identical to standard normal when <i>df</i> = `\(\infty\)`

<img src="assets/t.png" height="300px">

---
## <i>t</i>-distribution

- As the shape changes, so do the proportions (unlike with the normal distribution)
- In the standard normal, the middle 95% of values lie within ±1.96
- In the *t*-distribution, this critical value changes based on <i>df</i>

<img class="gif" src="assets/t_01.png" gif="assets/t_02.png assets/t_03.png" height="300px">

---
## <i>t</i>-distribution

- *t*-distribution pops up in many situations
- Always has to do with **estimating a sampling distribution from a finite sample**
- How we calculate the number of <i>df</i> changes based on context
- Often has to do with *N*, number of estimated parameters, or both
- In the case of the sampling distribution of the mean, <i>df = N</i> − 1

---
## Back to CI

- 95% CI around the estimated population mean is mean ±1.96 <i>SE</i> **if we know the exact shape of the sampling distribution**
- We don't know the shape, so we approximate it using the *t*-distribution
- We need to replace the 1.96 with the appropriate critical value for a given number of <i>df</i>
- For <i>N</i> = 30, <i>t</i><sub>crit</sub>(<i>df</i>=29) = 2.05

```
## [1] 2.04523
```

---
exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---
## Back to CI

- 95% CI around the mean for a sample of 30 is `\(\bar{x} \pm 2.05\times SE\)`
- `\(\widehat{SE}=\frac{s}{\sqrt{N}}\)`
- `\(95\%\ CI = \bar{x}\pm2.05\times \frac{s}{\sqrt{N}}\)`
- To construct a 95% CI around our estimated mean, all we need is
  - Estimated mean (<i>i.e.</i> sample mean, because `\(\hat{\mu}=\bar{x}\)`)
  - Sample *SD* (*s*)
  - *N*
  - Critical value for a *t*-distribution with <i>N</i> − 1 <i>df</i>

---
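## Back to CI

The four ingredients listed above can be put together in a few lines of code. This is only a sketch under stated assumptions, not the code behind these slides: it uses plain Python with the standard library, the helper names (`t_pdf`, `t_cdf`, `t_crit`, `mean_ci`) and the sample values are made up for illustration, and the critical value is found numerically by integrating the *t* density and searching for the cut-off that leaves 2.5% in each tail.

```python
import math
import statistics


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t] (steps is even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_crit(coverage, df):
    """Critical value enclosing the middle `coverage` of the t-distribution."""
    target = 0.5 + coverage / 2      # e.g. 0.975 for a 95% CI
    lo, hi = 0.0, 1000.0
    for _ in range(60):              # bisection: the CDF increases with t
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_ci(sample, coverage=0.95):
    """CI around the sample mean: mean ± t_crit × s / sqrt(N)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # stdev divides by N - 1
    margin = t_crit(coverage, df=n - 1) * se
    return mean - margin, mean + margin


print(round(t_crit(0.95, df=29), 5))      # 2.04523, the critical value quoted above
sample = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]   # made-up data, N = 10
print(mean_ci(sample))                    # lower and upper limit of the 95% CI
```

With a larger <i>df</i> the value returned by the hypothetical `t_crit` approaches 1.96, which is why the normal-based interval is a reasonable shortcut only for large samples.

---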
## CIs are useful

- Width of the interval tells us how much we can expect the mean of a different sample of the same size to vary from the one we got
- There's an x% chance that any given x% CI contains the true population mean
- **CAVEAT:** That's not the same as saying that there's an x% chance that the population mean lies within our x% CI!
- CIs can be calculated for *any point estimate*, not just the mean!

---
## Remember this?

.pull-left[
*r* = −.07

![](index_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

.pull-right[
*r* = .752

![](index_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---
## Remember this?

.pull-left[
*r* = −.07; 95% CI [−.263, .128]

![](index_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

.pull-right[
*r* = .752; 95% CI [.652, .827]

![](index_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

---
exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---
## Take-home message

- Our aim is to *estimate unknown population characteristics* based on samples
- *Point estimate* is the best guess about a given population characteristic (parameter)
- Estimation is inherently *uncertain*
  - We cannot say with 100% certainty that our estimate is truly equal to the population parameter
- *Confidence intervals* express this uncertainty
  - The wider they are, the more uncertainty there is
  - They have arbitrary *coverage* (often 50%, 90%, 95%, 99%)
- CIs are constructed using the *sampling distribution*
  - The true sampling distribution is unknown, but we can approximate it using the *t*-distribution with given *degrees of freedom*
- CIs can be constructed for *any point estimate*
- For a 95% CI, there is a 95% chance that any given CI contains the true population parameter

---
class: last-slide weekend
background-image: url("assets/end.jpg")
background-size: cover

# Have a lovely weekend :)