Introduction

The goals of this activity are twofold:

  • Gain experience making contingency tables and conducting contingency tests.
  • Build your repertoire of plotting skills.

Contingency Tables

Load the penguins data

  • Load the palmerpenguins data using require().
  • Inspect the dataset using head()
    • Note which columns are categorical.
    • Year is coded as an integer, but could you also consider it a categorical variable here?
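A minimal sketch of these steps is shown below. It assumes the palmerpenguins package is already installed. I use library() here in place of require() because it stops with an error if the package is missing, and the sapply() call is just one convenient way to list the class of every column.

```r
# Assumes install.packages("palmerpenguins") has already been run.
library(palmerpenguins)

# Show the first six rows of the data.
head(penguins)

# List the class of each column; factor columns are categorical.
col_classes = sapply(penguins, class)
print(col_classes)
```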

Counting the number of records using table().

The table() function is helpful for two important tasks:

  • Counting the number of observations belonging to distinct factor levels.
  • Creating two-way tables (see below).

For example, if I want to know how many penguins were observed in each year, I could use the following syntax:

table(penguins$year)
## 
## 2007 2008 2009 
##  110  114  120

This tells me that there were 110 observations in 2007, 114 in 2008, and 120 in 2009.
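One detail worth knowing: by default, table() leaves out missing values (NA) without any warning. The sketch below uses the useNA argument of table() to make them visible. I use the sex column as the example because it contains some missing entries; the variable names are my own.

```r
library(palmerpenguins)

# Default behavior: NA values are not counted.
sex_counts = table(penguins$sex)
print(sex_counts)

# useNA = "ifany" adds an NA entry when missing values are present.
sex_counts_with_na = table(penguins$sex, useNA = "ifany")
print(sex_counts_with_na)

# The second table accounts for every row in the data.
sum(sex_counts_with_na) == nrow(penguins)
```

This is why the counts in a table can add up to fewer than the number of rows in the dataset.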

Question: Use table() to determine how many penguins were counted on each of the three islands. How many penguins were observed on Dream island?

Building a contingency table using table().

I can use table() to build a contingency table, also known as a two-way table. For example, if I wanted to build a contingency table showing the counts of male and female penguins that were observed each year, I could use the following syntax:

require(palmerpenguins)
head(penguins)
## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm
##   <fct>   <fct>              <dbl>         <dbl>             <int>
## 1 Adelie  Torgersen           39.1          18.7               181
## 2 Adelie  Torgersen           39.5          17.4               186
## 3 Adelie  Torgersen           40.3          18                 195
## 4 Adelie  Torgersen           NA            NA                  NA
## 5 Adelie  Torgersen           36.7          19.3               193
## 6 Adelie  Torgersen           39.3          20.6               190
## # ... with 3 more variables: body_mass_g <int>, sex <fct>, year <int>
table(penguins$sex, penguins$year)
##         
##          2007 2008 2009
##   female   51   56   58
##   male     52   57   59
  • How many female penguins were observed in 2008?

Chi-Square tests

We can test whether significant associations occur in a contingency table using the chi-square test.

The syntax is simple in R: the function is chisq.test().

chisq.test() works with the output from the table() function:

sex_year_table = table(penguins$sex, penguins$year)
chisq.test(sex_year_table)
## 
##  Pearson's Chi-squared test
## 
## data:  sex_year_table
## X-squared = 7.8283e-05, df = 2, p-value = 1

Test Statistic: X-squared

The test statistic for the chi-square test is just the X-squared value in the test output. We don’t usually try to interpret the value directly, but notice here that the value is quite small: approximately 0.00008.

  • What is the associated p-value? Is it high, or is it below our traditional cutoff of 0.05?
  • Do you conclude that there is a significant association between observation year and sex?
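The value returned by chisq.test() can be stored and inspected piece by piece, which helps when answering questions like the ones above. This sketch uses the statistic, p.value, and expected components of the returned object; the variable name sex_year_test is my own choice.

```r
library(palmerpenguins)

sex_year_table = table(penguins$sex, penguins$year)
sex_year_test = chisq.test(sex_year_table)

# The test statistic (X-squared) and the p-value.
sex_year_test$statistic
sex_year_test$p.value

# Counts expected in each cell if sex and year were unrelated.
sex_year_test$expected
```

Comparing the expected counts to the observed table gives a feel for why the test statistic is so small here.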

Chi-square test hypotheses

One possible way you could frame the null and alternative hypotheses for this test might be:

Generic hypotheses

  • H0: There is no association between sex and observation year.
  • H1: There is an association between sex and observation year.

Note that these hypotheses are pretty generic. A better set of hypotheses would be:

Specific hypotheses

  • H0: The proportions of male and female penguins observed were approximately the same in each of the study years.
  • H1: In some years the male/female ratio was significantly different from the ratio in other years.
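The specific hypotheses are phrased in terms of proportions, so it can help to look at the proportions directly. As a sketch, prop.table() with margin = 2 converts each year's column of counts into proportions that sum to 1; the name sex_year_props is my own.

```r
library(palmerpenguins)

sex_year_table = table(penguins$sex, penguins$year)

# margin = 2 computes proportions within each column (year).
sex_year_props = prop.table(sex_year_table, margin = 2)
round(sex_year_props, 3)
```

If the proportions in each column are nearly identical, the data are consistent with the null hypothesis.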

Fun with scatterplots

Let’s experiment with some ways to customize scatterplots in R.

We’ll use the following basic scatterplot of penguin body mass and bill length as an example. You should copy my code and use it as a template for your explorations.

plot(
  bill_length_mm ~ body_mass_g, 
  data = penguins)

Change the size of the plotting symbol

The first elaboration we’ll look at is changing the size of the plotting symbol using the cex argument. For example, if I wanted to make the points twice as large, I could use cex = 2:

plot(
  bill_length_mm ~ body_mass_g, 
  data = penguins,
  cex = 2)

Change the shape of the plotting symbol

I can use the pch argument to change the shape of the plotting symbol:

plot(
  bill_length_mm ~ body_mass_g, 
  data = penguins,
  cex = 1.2,
  pch = 16)

I happen to like plotting symbol 16 because it is a filled circle.

Here’s a guide to the basic plotting symbols available in R.
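If the guide is not handy, a quick reference chart can be drawn in R itself. This sketch plots symbols 0 through 25, the numbered set described in ?points. The layout choices (grid positions, label placement, title) are arbitrary and only meant as a starting point.

```r
# Symbol codes 0 to 25 are the numbered plotting symbols.
symbol_codes = 0:25

# Arrange the symbols in a grid with 7 symbols per row,
# starting at the top and moving down one row at a time.
grid_x = symbol_codes %% 7
grid_y = -(symbol_codes %/% 7)

plot(
  grid_x, grid_y,
  pch = symbol_codes,
  cex = 2,
  axes = FALSE,
  xlab = "", ylab = "",
  xlim = c(-0.5, 6.5),
  ylim = c(-4, 0.5),
  main = "Plotting symbols (pch)")

# Label each symbol with its code number, just below it.
text(grid_x, grid_y, labels = symbol_codes, pos = 1, offset = 1)
```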

Change the color of the plotting symbol (color numbers)

It’s easy to change the plotting symbol color using the col argument.

There are several ways you can specify a color. The first is using a numeric code. In R's default palette, the code 2 stands for a shade of red:

plot(
  bill_length_mm ~ body_mass_g, 
  data = penguins,
  cex = 1.2,
  pch = 17,
  col = 2)

You should experiment with some different numbers to see what colors you get!
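To see which colors the numbers stand for, you can print the current palette with palette(). The exact colors depend on your R version and settings, so no output is shown here. The variable names and plot layout are my own choices.

```r
# The current color palette: color number 1 is the first entry,
# color number 2 is the second entry, and so on.
current_palette = palette()
print(current_palette)

# Draw one large filled point per palette color.
color_numbers = seq_along(current_palette)
plot(
  color_numbers, rep(1, length(color_numbers)),
  col = color_numbers,
  pch = 16,
  cex = 3,
  yaxt = "n",
  xlab = "color number", ylab = "")
```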

Change the color of the plotting symbol (color name)

R also understands the names of many colors.

Here’s a guide to the named colors in R.

I like the steelblue color:

plot(
  bill_length_mm ~ body_mass_g, 
  data = penguins,
  cex = 1.2,
  pch = 17,
  col = "steelblue")
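Titles and axis labels can be set in the same call. As a sketch, the main, xlab, and ylab arguments of plot() control the title and the two axis labels. The label text below is only a placeholder that you should replace with your own.

```r
library(palmerpenguins)

plot(
  bill_length_mm ~ body_mass_g,
  data = penguins,
  main = "Penguin bill length and body mass",
  xlab = "Body mass (g)",
  ylab = "Bill length (mm)",
  cex = 1.2,
  pch = 17,
  col = "steelblue")
```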

Questions

  • Q1 (1 pt.): How many penguins were observed on Dream island?
    • Show the code you used to find out.
  • Q2 (1 pt.): How many female penguins were observed on Torgersen island?
    • Show the code you used to find out.
  • Q3 (2 pts.): Conduct a Chi-square test to determine if there is a significant association between island and species.
    • Show the code you used to create your contingency table.
    • Show the code you used to conduct the test.
  • Q4 (1 pt.): What was the value of the test statistic for the test?
  • Q5 (2 pts.): What was the p-value of the test? Given this value, do you conclude that there is a significant association?
  • Q6 (1 pt.): State the null and alternative hypotheses of the test. You should use the specific hypotheses example above as a template.
  • Q7 (2 pts.): Include your favorite plot from the exercise above. It should be different from my example plots. Show your code. Your plot must have:
    • a title
    • custom axis labels
    • custom plot symbol size
    • custom plotting symbol shape
    • custom plot symbol color