
Fundamentals of statistical testing

Lecture 1

Dr Milan Valášek

29 January 2021


Overview

  • Recap on distributions

  • More about the normal distribution

  • Sampling

  • Sampling distribution

  • Standard error

  • Central Limit Theorem


Objectives

After this lecture you will understand

  • that there exist mathematical functions that describe different distributions

  • what makes the normal distribution normal and what its properties are

  • how random fluctuations affect sampling and parameter estimates

  • the function of the sampling distribution and the standard error

  • the Central Limit Theorem

 

With this knowledge you'll build a solid foundation for understanding all the statistics we will be learning in this programme!


It's all Greek to me!

  • μ is the population mean

  • x̄ is the sample mean

  • μ̂ is the estimate of the population mean

  • Same with SD: σ, s, and σ̂

  • Greek is for populations, Latin is for samples, hat is for population estimates
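
A minimal R sketch (not on the original slide) of how the three layers relate; the numbers are made up for illustration:

mu <- 100                           # population mean, usually unknown
x <- rnorm(20, mean = mu, sd = 15)  # a sample from that population
x_bar <- mean(x)                    # sample mean
mu_hat <- x_bar                     # the sample mean doubles as our estimate of mu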


Recap on distributions

  • Numerically speaking, the number of observations for each value of a variable
  • Which values occur more often and which less often
  • The shape formed by the bars of a bar chart/histogram

# simulate eye colour (categorical) and age (continuous) for 555 people
df <- tibble(eye_col = sample(c("Brown", "Blue", "Green", "Gray"), 555,
                              replace = T, prob = c(.55, .39, .04, .02)),
             age = rnorm(length(eye_col), 20, .65))

# bar chart of the categorical variable
p1 <- df %>%
  ggplot(aes(x = eye_col)) +
  geom_bar(fill = c("skyblue4", "chocolate4", "slategray", "olivedrab"), colour = NA) +
  labs(x = "Eye colour", y = "Count")

# histogram of the continuous variable, with a density line overlaid
p2 <- df %>%
  ggplot(aes(x = age)) +
  geom_histogram() +
  stat_density(aes(y = ..density.. * 80), geom = "line", color = theme_col, lwd = 1) +
  labs(x = "Age (years)", y = "Count")

plot_grid(p1, p2)

Known distributions

  • Some shapes are "algebraically tractable", e.g., there is a maths formula to draw the line
  • We can use them for statistics

# density curves of four known distributions, rescaled to fit one plot
df <- tibble(x = seq(0, 10, length.out = 100),
             norm = dnorm(scale(x), sd = .5),
             chi = dchisq(x, df = 2) * 2,
             t = dt(scale(x), 5, .5),
             beta = (dbeta(x / 10, .5, .5) / 4) - .15)
cols <- c("#E69F00", "#56B4E9", "#009E73", "#CC79A7")
df %>%
  ggplot(aes(x = x)) +
  geom_line(aes(y = norm), color = cols[1], lwd = 1) +
  geom_line(aes(y = chi), color = cols[2], lwd = 1) +
  geom_line(aes(y = t), color = cols[3], lwd = 1) +
  geom_line(aes(y = beta), color = cols[4], lwd = 1) +
  labs(x = "x", y = "Density")

The normal distribution

  • AKA the Gaussian distribution or the bell curve

  • The one you need to understand

  • Symmetrical and bell-shaped

  • Not every symmetrical bell-shaped distribution is normal!

  • It's also about the proportions

    • The normal distribution has fixed proportions and is a function of two parameters: μ (the mean) and σ (the standard deviation, SD)

The normal distribution

  • Peak/centre of the distribution is its mean (also its mode and median)
  • Changing the mean (centring) shifts the curve left/right
  • SD determines the steepness of the curve (small σ = steep curve)
  • Changing the SD is also known as scaling; see the sketch below
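
A small sketch of centring and scaling (an addition, not from the slide), reusing the deck's ggplot style; solid = standard normal, dashed = mean shifted to 2, dotted = SD doubled:

tibble(x = seq(-6, 6, length.out = 200)) %>%
  ggplot(aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), lwd = 1) +
  stat_function(fun = dnorm, args = list(mean = 2, sd = 1), lwd = 1, lty = 2) +  # centring: shifts right
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), lwd = 1, lty = 3) +  # scaling: flattens
  labs(x = "x", y = "Density")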

Area below the normal curve

  • No matter the particular shape of the given normal distribution, the proportions with respect to SD are the same
    • ∼68.2% of the area below the curve is within ±1 SD from the mean
    • ∼95.4% of the area below the curve is within ±2 SD from the mean
    • ∼99.7% of the area below the curve is within ±3 SD from the mean
  • We can calculate the proportion of the area between any two points (see the pnorm() check below)

quantiles <- tibble(x1 = -(1:3), x2 = 1:3, y = c(.21, .12, .03))
tibble(x = seq(-4, 4, by = .1), y = dnorm(x, 0, 1)) %>%
  ggplot(aes(x, y)) +
  geom_density(stat = "identity", color = default_col, fill = default_col) +
  geom_density(data = ~ subset(.x, x >= quantiles$x1[3] & x <= quantiles$x2[3]),
               stat = "identity", color = default_col, fill = bg_col) +
  geom_density(data = ~ subset(.x, x >= quantiles$x1[2] & x <= quantiles$x2[2]),
               stat = "identity", color = NA, fill = second_col) +
  geom_density(data = ~ subset(.x, x >= quantiles$x1[1] & x <= quantiles$x2[1]),
               stat = "identity", color = NA, fill = theme_col) +
  geom_segment(data = quantiles,
               aes(x = x1, xend = x2, y = y, yend = y),
               arrow = arrow(length = unit(0.2, "cm"), angle = 15,
                             type = "closed", ends = "both"),
               color = c(bg_col, default_col, default_col)) +
  geom_line(data = tibble(x = rep(c(-1, 1), 2), y = rep(c(.12, .03), each = 2)),
            aes(x, y, group = y), color = bg_col) +
  geom_line(data = tibble(x = rep(c(-(2:3), 2:3), each = 2),
                          y = as.vector(rbind(0, rep(quantiles$y[-1] + .005, 2)))),
            aes(x, y, group = x), lty = 2, color = default_col) +
  annotate("text", x = rep(0, 3), y = quantiles$y + .05,
           label = c("68.2%", "95.4%", "99.7%"), color = bg_col) +
  labs(x = "z-score", y = "Density") +
  scale_x_continuous(breaks = -4:4)
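
The three shaded proportions can be checked directly with pnorm(), which returns the area under the curve to the left of a point (this check is an addition, not part of the slide):

pnorm(1) - pnorm(-1)  # area within ±1 SD
## [1] 0.6826895
pnorm(2) - pnorm(-2)  # area within ±2 SD
## [1] 0.9544997
pnorm(3) - pnorm(-3)  # area within ±3 SD
## [1] 0.9973002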

Area below the normal curve

  • Say we want to know the number of SDs from the mean beyond which lie the outer 5% of the distribution

quantiles <- qnorm(.025) * c(-1, 1)
tibble(x = sort(c(quantiles, seq(-4, 4, by = .1))), y = dnorm(x, 0, 1)) %>%
  ggplot(aes(x, y)) +
  geom_line(color = default_col) +
  geom_density(data = ~ subset(.x, x >= quantiles[1]),
               stat = "identity", color = NA, fill = second_col) +
  geom_density(data = ~ subset(.x, x <= quantiles[2]),
               stat = "identity", color = NA, fill = second_col) +
  geom_line(data = tibble(x = rep(quantiles, each = 2), y = c(0, .15, 0, .15)),
            aes(x, y, group = x), lty = 2, color = default_col) +
  geom_segment(data = tibble(x = quantiles, xend = c(4, -4),
                             y = c(.15, .15), yend = c(.15, .15)),
               aes(x = x, xend = xend, y = y, yend = yend),
               arrow = arrow(length = unit(0.2, "cm"), angle = 15, type = "closed"),
               color = default_col) +
  annotate("text", x = c(-2.8, 2.8), y = .18, label = "2.5%", color = default_col) +
  labs(x = "z-score", y = "Density")
qnorm(p = .025, mean = 0, sd = 1) # lower cut-off
## [1] -1.959964
qnorm(p = .975, mean = 0, sd = 1) # upper cut-off
## [1] 1.959964

Critical values

  • If SD is known, we can calculate the cut-off point (critical value) for any proportion of normally distributed data
qnorm(p = .005, mean = 0, sd = 1) # lowest .5%
## [1] -2.575829
qnorm(p = .995, mean = 0, sd = 1) # highest .5%
## [1] 2.575829
# most extreme 40% / bulk 60%
qnorm(p = .2, mean = 0, sd = 1)
## [1] -0.8416212
qnorm(p = .8, mean = 0, sd = 1)
## [1] 0.8416212
  • Other known distributions have different cut-offs but the principle is the same
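
For instance (an added illustration, not on the slide), the same logic with the t and χ² distributions:

qt(p = .975, df = 20)    # upper 2.5% cut-off of a t-distribution with 20 df
## [1] 2.085963
qchisq(p = .95, df = 2)  # upper 5% cut-off of a chi-squared distribution with 2 df
## [1] 5.991465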

Sampling from distributions

  • Collecting data on a variable = randomly sampling from a distribution

  • The underlying distribution is often assumed to be normal

  • Some variables might come from other distributions

    • Reaction times: log-normal distribution

    • Number of annual casualties due to horse kicks: Poisson distribution

    • Passes/fails on an exam: binomial distribution
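
A quick sketch of drawing from each of these in R (an addition; the parameter values are arbitrary):

rlnorm(n = 5, meanlog = 6, sdlog = .5)  # e.g., reaction times in ms
rpois(n = 5, lambda = .7)               # e.g., casualties per year
rbinom(n = 5, size = 1, prob = .8)      # e.g., pass (1) / fail (0)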


Sampling from distributions

  • Samples from the same population differ from one another
# draw a sample of 6 from a normally distributed
# population with mean 100 and SD 15
rnorm(n = 6, mean = 100, sd = 15)
## [1] 101.61958  80.95560  89.62080  96.04378 106.40106  86.21514
# repeat
rnorm(6, 100, 15)
## [1]  80.31573 107.63193  85.82520  99.95288  93.55956  74.73945

Sampling from distributions

  • Statistics (x̄, s, etc.) of two samples will be different
  • Sample statistic (e.g., x̄) will likely differ from the population parameter (e.g., μ)
sample1 <- rnorm(50, 100, 15)
sample2 <- rnorm(50, 100, 15)
mean(sample1)
## [1] 98.56429
mean(sample2)
## [1] 105.4175

Sampling from distributions

  • Statistics (x̄, s, etc.) of two samples will be different

  • Sample statistic (e.g., x̄) will likely differ from the population parameter (e.g., μ)

p1 <- ggplot(NULL, aes(x = sample1)) +
  geom_histogram(bins = 15) +
  geom_vline(xintercept = mean(sample1), color = second_col, lwd = 1) +
  labs(x = "x", y = "Frequency") + ylim(0, 8)
p2 <- ggplot(NULL, aes(x = sample2)) +
  geom_histogram(bins = 15) +
  geom_vline(xintercept = mean(sample2), color = second_col, lwd = 1) +
  labs(x = "x", y = "") + ylim(0, 8)
plot_grid(p1, p2)

Sampling distribution

  • If we took all possible samples of a given size (say N = 50) from the population and each time calculated x̄, the means would have their own distribution

  • This is the sampling distribution of the mean

    • Approximately normal

    • Centred around the true population mean, μ

  • Every statistic has its own sampling distribution (not all normal though!)


Sampling distribution

x_bar <- replicate(100000, mean(rnorm(50, 100, 15)))
mean(x_bar)
## [1] 99.99395

ggplot(NULL, aes(x_bar)) +
  geom_histogram(bins = 51) +
  geom_vline(xintercept = mean(x_bar), col = second_col, lwd = 1) +
  labs(x = "Sample mean", y = "Frequency")

Standard error

  • Standard deviation of the sampling distribution is the standard error
sd(x_bar)
## [1] 2.122072
  • Sampling distribution of the mean is approximately normal: ~68.2% of means of samples of size 50 from this population will be within ±2.12 of the true mean
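
A quick empirical check of that claim (an addition), reusing the simulated x_bar from the previous slide:

# proportion of simulated sample means within ±1 SE of mu = 100;
# comes out at roughly .68
mean(abs(x_bar - 100) <= sd(x_bar))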

Standard error

  • Standard error can be estimated from any one of the samples: SE = SD/√N
samp <- rnorm(50, 100, 15)
sd(samp)/sqrt(length(samp))
## [1] 1.872102
# underestimate compared to actual SE
sd(x_bar)
## [1] 2.122072
  • If ~68.2% of sample means lie within ±1.87, then there's a ~68.2% probability that x̄ will be within ±1.87 of μ
mean(samp)
## [1] 98.22903

Standard error

  • SE is calculated using N: there's a relationship between the two

tibble(x = 30:500, y = sd(samp) / sqrt(30:500)) %>%
  ggplot(aes(x, y)) +
  geom_line(lwd = 1, color = second_col) +
  labs(x = "N", y = "SE") +
  annotate("text", x = 400, y = 2.5,
           label = bquote(sigma == .(round(sd(samp), 1))),
           color = default_col, size = 6)

Standard error

  • That is why larger samples are more reliable!
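
Concretely (a worked example, not from the slide): with σ = 15, the SE shrinks with the square root of N:

15 / sqrt(c(10, 100, 1000))  # SE for N = 10, 100, 1000
## [1] 4.7434165 1.5000000 0.4743416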


Standard error

  • Allows us to gauge the resampling accuracy of a parameter estimate (e.g., μ̂) in a sample

  • The smaller the SE, the more confident we can be that the parameter estimate (μ̂) in our sample is close to those in other samples of the same size

  • We don't particularly care about our specific sample: we care about the population!


The Central Limit Theorem

  • Sampling distribution of the mean is approximately normal

  • True no matter the shape of the population distribution!

  • This is the Central Limit Theorem

    • "Central" as in "really important" because, well, it is!

CLT in action

[animated figures not captured]
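
As a minimal stand-in for the animation (a sketch of mine, assuming a strongly right-skewed population), means of samples from an exponential distribution still pile up in a bell shape:

x_bar_exp <- replicate(10000, mean(rexp(50, rate = 1)))
ggplot(NULL, aes(x_bar_exp)) +
  geom_histogram(bins = 51) +
  labs(x = "Sample mean", y = "Frequency")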

Approximately normal

  • As N gets larger, the sampling distribution of x̄ tends towards a normal distribution with mean = μ and SD = σ/√N

pop <- tibble(x = runif(100000, 0, 1)) %>%
  ggplot(aes(x)) +
  geom_histogram(bins = 25) +
  geom_vline(aes(xintercept = mean(x)), lwd = 1, color = second_col) +
  labs(title = "Population", x = "", y = "Frequency")

# sampling distributions of the mean for three sample sizes
plots <- list()
n <- c(5, 30, 1000)
ylab <- c("", "Frequency", "")
for (i in 1:3) {
  plot_tib <- tibble(x = replicate(1000, mean(runif(n[i], 0, 1))))
  plots[[i]] <- ggplot(plot_tib, aes(x)) +
    geom_histogram(aes(y = ..density..), bins = 30) +
    labs(title = bquote(paste(
      italic(N) == .(n[i]),
      "; ",
      italic(SE) == .(round(sd(plot_tib$x), 2)))),
      x = "", y = ylab[i]) +
    # xlim(0:1) +
    stat_density(geom = "line", color = second_col, lwd = 1)
}
plot_grid(pop, plots[[1]], plots[[2]], plots[[3]])

Take-home message

  • A distribution is the number of observations for each value of a variable

  • There are many mathematically well-described distributions

    • Normal (Gaussian) distribution is one of them
  • Each has a formula that allows us to calculate the probability of drawing a value within an arbitrary range


Take-home message

  • Normal distribution is

    • continuous
    • unimodal
    • symmetrical
    • bell-shaped
    • it's the right proportions that make a distribution normal!
  • In a normal distribution it is true that

    • ∼68.2% of the data is within ±1 SD from the mean
    • ∼95.4% of the data is within ±2 SD from the mean
    • ∼99.7% of the data is within ±3 SD from the mean
  • Every known distribution has its own critical values


Take-home message

  • Statistics of random samples differ from parameters of a population

  • As N gets bigger, sample statistics approach population parameters

  • The distribution of a sample statistic is its sampling distribution

  • Standard error of a parameter estimate is the SD of its sampling distribution

    • Provides a margin of error for the estimated parameter

    • The larger the sample, the less the estimate varies from sample to sample


Take-home message

  • Central Limit Theorem

    • Really important!

    • Sampling distribution of the mean tends to normal even if population distribution is not normal

  • Understanding distributions, sampling distributions, standard errors, and the CLT is most of what you need in order to understand all the stats techniques we will cover







And that's it!

