class: center, middle, inverse, title-slide # Chi-Square ## Lecture 05 ### Dr Jennifer Mankin ### 26 February 2021 --- <script type="text/javascript"> <!-- message to display at the bottom of slides when ?live=true --> slideMessage = "<span>Ask questions at </span><a href = 'pollev.com/milanvalasek890'>pollev.com/milanvalasek890</a>" plotsToPanels() live() </script> ## Looking Ahead (and Behind) - Last week: Correlation -- - This week: Chi-Square ( `\(\chi^2\)` ) -- - Week 6: *t*-test -- - Week 7: The Linear Model - Week 8: The Linear Model --- ## Lab Report: Green Study - Today we will talk about one of the analyses for the lab report - `\(\chi^2\)` : Green study (Griskevicious et al., 2010) - *t*-test: Red study (Elliott et al., 2020), next week - We will talk about the lab report in the lectures **and** work on it in the practicals - Make sure you come to your registered sessions --- ## Objectives After this lecture you will understand: - The concepts behind tests of goodness-of-fit and association - How to calculate the `\(\chi^2\)` statistic - How to read tables and figures of counts - How to interpret and report significance tests of `\(\chi^2\)` - The relationship between association and causation --- ## Roll of the Dice <img src="assets/all_dice.jpg" alt="Photo of two plastic organising boxes full of dice of various shapes and colours, from large 20-sided dice to 4-sided dice" height = "500"> --- ## Roll of the Dice - I want to know if my four-sided die (d4) is fair - If it is, each number should come up with equal probability - Four numbers = 100%/4 = 25% probability of rolling each number - So, if I roll the dice 100 times, each number should come up (approximately) 25 times --- ## Roll of the Dice <br><br> .pull-left[ ![Photo of a four-sided die in a padded box, next to a piece of paper with tally marks counting rolls](assets/dice.jpg) ] <br><br> .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Dice Roll </th> <th style="text-align:right;"> Observed Count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 29 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 22 </td> </tr> </tbody> </table> ] --- ## A Fair Shake? - These numbers are not *exactly* 25/25/25/25 - But we live in a random universe! - How different is **different enough** to believe that the die is not actually fair? ![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- --> --- ### Steps of the Analysis - Calculate the (standardised) *difference* between observed and expected frequencies -- - Compare that *test statistic* to its distribution under the null hypothesis -- - Obtain the probability *p* of encountering a test statistic of the size we have, or larger, if the null hypothesis is true -- - ????? - Profit --- ## Step 1: Calculate a Test Statistic - How different are the *observed* counts from the *expected* counts? .left[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Dice Roll </th> <th style="text-align:right;"> Obs. Count </th> <th style="text-align:right;"> Exp. Count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 25 </td> </tr> </tbody> </table> ] -- .center[ `\(\chi^2 = \frac{(25-25)^2}{25} + \frac{(29-25)^2}{25} + \frac{(24-25)^2}{25} + \frac{(22-25)^2}{25}\)` ] -- .center[ `\(\chi^2 = \frac{0}{25} + \frac{16}{25} + \frac{1}{25} + \frac{9}{25}\)` ] -- .center[ `\(\chi^2 = 0 + 0.64 + 0.04 + 0.36\)` ] -- .center[ The total squared (and scaled) difference between *observed* and *expected* counts is the sum of those four numbers, or 1.04 ] --- ## Step 2: Compare to the Distribution - We've calculated a *test statistic* that represents the thing we are trying to test - Is this test statistic big or small in the grand scheme of things? -- - Compare our test statistic to the *distribution* of similar statistics - **IMPORTANT**: These distributions assume that *the null hypothesis is true!* --- ## The Chi-Square (*χ*<sup>2</sup>) Distribution - Unfortunately test statistics like the one we have are **not** normally distributed - No problem - we just have to use a different distribution! -- - Meet the `\(\chi^2\)` distribution - The sum of squared normal distributions - See [this excellent Khan Academy explainer](https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests/chi-square-goodness-of-fit-tests/v/chi-square-distribution-introduction) for more! <iframe id="chisq-app" class="app" src="https://and.netlify.app/viz/chisq.html" height=680px></iframe> --- ## Detour: Degrees of Freedom - Degrees of freedom are calculated differently for different test statistics - Important because they determine the distribution's shape and proportions - At base, they are the number of values that are *free to vary* -- - Consider our dice example... - We know our test statistic is 1.04 - If we know the first three values (0 + 0.64 + 0.04), the last value **must** be 0.36 - Alternatively, if we had three random values (e.g. 0.23 + 0.54 + 0.1), the last value cannot be random: it **must** be 0.17 to add up to 1.04 - So, we have **three** degrees of freedom --- ## Step 3: Obtain the Probability *p* - Look at the distribution for 3 degrees of freedom - What percentage of the distribution is greater than or equal to 1.04? <iframe id="chisq-app" class="app" src="https://and.netlify.app/viz/chisq.html" height=680px></iframe> --- ## Interpreting the Results - The sum of squared differences between our *expected* and *observed* counts ( `\(\chi^2\)` ) was 1.04 - For a `\(\chi^2\)` distribution with 3 degrees of freedom, this value is **extremely common** under the null hypothesis! - If our die **is** fair, our data are extremely likely - To believe that the die was *not* fair, we would have needed a test statistic of ~7.8 or greater ( `\(\alpha\)` = .05) - If only there were an easier way to do this...! -- ```r chisq.test(dice_table$obs_count) ``` ``` ## ## Chi-squared test for given probabilities ## ## data: dice_table$obs_count ## X-squared = 1.04, df = 3, p-value = 0.7916 ``` --- ## Interim Summary - The `\(\chi^2\)` test statistic quantifies how different a set of *observed* frequencies are from *expected* frequencies - We obtain the probability *p* of finding the test statistic we have calculated (or one even larger) using the distribution of the `\(\chi^2\)` statistic under the null hypothesis, with a given number of degrees of freedom - Given an `\(\alpha\)` level of .05... - If *p* > .05, we conclude that our results are *likely* to occur under the null hypothesis, so we have no evidence that the null hypothesis is not true - If *p* < .05, we conclude that our results are sufficiently *unlikely* to occur that it may in fact be the case that the null hypothesis is *not* true --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## More *χ*<sup>2</sup> - We just saw a *goodness of fit* test - Tests whether a sample of data came from a population with a specific distribution - Next, let's look at a *test of association*, or *independence* - Are two categorical variables associated or not? -- - For your lab reports, you will again write about the Green or Red studies - You can freely choose which! - If you choose the Green study, **this is the test you will use** --- ## Quick Refresher: Variable Types - **Continuous** data - Represent some measurement or score on a scale - Examples: ratings of romantic attraction, age in years - Answers the question: *how much?* -- - **Categorical** data - Represent membership in a particular group or condition - Examples: control vs experimental group, year of uni - Answers the question: *which one?* --- ## *χ*<sup>2</sup> Test of Association - This time we will have two variables, both categorical - Data: counts of how many observations fall into each combination of categories --- ## Sequence-Space Synaesthesia - Spatial orientation of sequences, such as numbers, months, or days of the week .pull-left[ <iframe width="560" height="315" src="https://www.youtube.com/embed/BWGNWgBk76k?start=1486" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> ] .pull-right[ ![Drawing of a human figure in the centre of a circle made up of coloured segments, each labeled with a month of the year in order](https://images.ctfassets.net/cnu0m8re1exe/0XiQXU9wikCY4saHF2IQy/6eef3d4687e8b297d2252594a18149bf/IVelspu.jpg?w=650&h=433&fit=fill) [Image Source](https://www.discovermagazine.com/health/the-rare-humans-who-see-time-and-have-amazing-memories) ] --- ## Sequence-Space Synaesthesia - "Calendars" of spatial orientations of months of the year - [Brang et al. (2011)](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1111/j.1748-6653.2011.02012.x?casa_token=xDJyLXaLC4QAAAAA%3Ak7PsUINKvtuZev5SHRKG8fG9uwq739qOI_knrdYKwIOUepYVZMwxo1YBhQS559jjAZP5YGspZXUEg5JU): Is the orientation of the calendar related to the synaesthete's handedness? - Orientation: months progress clockwise or counterclockwise in space - Handedness: left or right handed -- .pull-left[ - Each synaesthete has one value for orientation and one value for handedness - Data: counts of *how many synaesthetes* fall into each combination of categories ] .pull-right[ ![Drawing of a human figure in the centre of a circle made up of coloured segments, each labeled with a month of the year in order](https://images.ctfassets.net/cnu0m8re1exe/0XiQXU9wikCY4saHF2IQy/6eef3d4687e8b297d2252594a18149bf/IVelspu.jpg?w=650&h=433&fit=fill) [Image Source](https://www.discovermagazine.com/health/the-rare-humans-who-see-time-and-have-amazing-memories) ] --- ## Let's Think About This... - What is the **null hypothesis** in this case? - What is the **alternative hypothesis**? - What do you think we will find? --- ## Let's Think About This... - **Null hypothesis**: Calendar orientation is *not* associated with synaesthete handedness - **Alternative hypothesis**: Calendar orientation *is* associated with synaesthete handedness - Prediction from the paper: - Right-handed synaesthetes will tend to have a clockwise calendar - Left-handed synaesthetes will tend to have an anticlockwise calendar --- ## Visualising the Data .codePanel[ ```r ggplot(ss_tab, aes(x = handedness, y = n)) + geom_bar( aes(fill = orientation), stat="identity", position = position_dodge(0.8), width = 0.7) + labs(x = "Handedness", y = "Frequency", fill = "Calendar\nOrientation") + scale_y_continuous(limits = c(0, 20)) + scale_color_manual(values = c("#009FA7", "#52006F"))+ scale_fill_manual(values = c("#009FA7", "#52006F"), labels = c("Anticlockwise", "Clockwise"))+ scale_x_discrete(labels = c("Left","Right")) ``` ![](index_files/figure-html/unnamed-chunk-7-1.png)<!-- -->] - Left-handed synaesthetes have more anti-clockwise than clockwise - Right-handed synaesthetes have the reverse --- ## Test Result Are these data different enough from the expected frequencies to believe that there may be an **association** between orientation and handedness? ``` ## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: seq_space$orientation and seq_space$handedness ## X-squared = 9.7798, df = 1, p-value = 0.001764 ``` <br><br><br> .center[ **What can you conclude from this result?** ] --- ## Test Result Are these data different enough from the expected frequencies to believe that there may be an **association** between orientation and handedness? ``` ## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: seq_space$orientation and seq_space$handedness ## X-squared = 9.7798, df = 1, p-value = 0.001764 ``` <br><br><br> <font size = "+2"> .center["There was a significant association between calendar orientation and handedness ( `\(\chi^2\)`(1) = 9.78, *p* = .002)."] </font> --- ## Interpreting the Result .codePanel[ ```r ggplot(ss_tab, aes(x = handedness, y = n)) + geom_bar( aes(fill = orientation), stat="identity", position = position_dodge(0.8), width = 0.7) + labs(x = "Handedness", y = "Frequency", fill = "Calendar\nOrientation") + scale_y_continuous(limits = c(0, 20)) + scale_color_manual(values = c("#009FA7", "#52006F"))+ scale_fill_manual(values = c("#009FA7", "#52006F"), labels = c("Anticlockwise", "Clockwise"))+ scale_x_discrete(labels = c("Left","Right")) ``` ![](index_files/figure-html/unnamed-chunk-10-1.png)<!-- -->] - Our hypothesis is supported by the data - The association is in the *direction* we predicted --- ## Expected Frequencies - Are these data different enough from the **expected frequencies** to believe that there may be an association between orientation and handedness? - We can get these easily out of R! -- <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Orientation </th> <th style="text-align:right;"> Left </th> <th style="text-align:right;"> Right </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Anti-Clockwise </td> <td style="text-align:right;"> 3.53 </td> <td style="text-align:right;"> 8.47 </td> </tr> <tr> <td style="text-align:left;"> Clockwise </td> <td style="text-align:right;"> 6.47 </td> <td style="text-align:right;"> 15.53 </td> </tr> </tbody> </table> -- - One of the **assumptions** of `\(\chi^2\)` is that all expected frequencies are greater than 5 - Otherwise this test can give you a drastically wrong answer 😱 - In this case, use Fisher's exact test (`fisher.test()`) instead --- ## Final Overview - The `\(\chi^2\)` test quantifies the difference between observed and expected frequencies - Goodness of Fit - Tests whether a sample of data came from a population with a specific distribution 🎲 - Test of Association/Independence - Tests whether two categorical variables are associated with each other 🌈 - Like with correlation, **association is not causation** -- - For quizzes/exam: - You will **not** be expected to calculate `\(\chi^2\)` by hand! - You **will** be expected to read and interpret the output of `chisq.test()` for tests of association - More in the tutorial! --- ## Lab Reports - You can choose either the red or green study to write your report on - See [Lab Report Information on Canvas](https://canvas.sussex.ac.uk/courses/9242/pages/Lab%20Report%20Information%20and%20Resources?titleize=0) for more - If you choose the green study (Griskevicius et al., 2010), you must use and report the results of `\(\chi^2\)` -- - Choose one of three products to analyse - Report observed frequencies and `\(\chi^2\)` result - Include a figure of the results - Will be covered in depth in the next tutorial and practical! --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- class: last-slide <br><br><br><br><br> # Have a lovely weekend!