This vignette is an overview of how factors (i.e. categorical variables) are coded under the hood using dummy coding and which types of dummy coding can be set in R.
Introduction
There’s seems to be a bit of terminological mix-up in the wild, so we first present a terminological set that will be used throughout the vignette.
Categorical variables in R are generally stored using factors. A factor is a vector of values from a categorical variable. The possible values in a factor are called levels in R.
For each observation in the factor, the vector specifies the level of that observation.
For example, let’s assume we have a data set with information on
dinosaurs and one column specifies the dinosaur’s diet:
carnivore
, herbivore
,
omnivore
.
In R, this column can be coded as a factor:
factor(c("carnivore", "carnivore", "herbivore", "omnivore", "herbivore"))
#> [1] carnivore carnivore herbivore omnivore herbivore
#> Levels: carnivore herbivore omnivore
We are so accustomed to using factors in regression models that sometimes we forget that regressions only work with numbers and cannot work with categorical variables.
To fit a regression model with categorical variables, these are first converted to numbers. The process of conversion is called coding.
One type of coding is the dummy variable coding or
simply dummy coding. This consists of assigning
0
s or 1
s to the levels in the variable.
Let’s go through a simple example of dummy coding of a categorical
variable with only 2 levels: metropolitan
and
rural
.
The most simple way of coding this categorical variable as a number
is to assign 0
to one level and 1
to the other
level. For example:
factor(c("rural", "rural", "metropolitan", "rural"))
#> [1] rural rural metropolitan rural
#> Levels: metropolitan rural
# dummy coded
c(0, 0, 1, 0)
#> [1] 0 0 1 0
In R, dummy coding is done under the hood for you when using factors, so you don’t have to worry about the conversion.
When the categorical variable has 3 levels instead of 2, we need a
work-around in order to code the 3-level factor with only
0
s and 1
s (we can’t use higher numbers for
reasons we will see later).
With three levels, we can code the variable using two numeric variables (instead of just one). Going back to the dinosaur’s diet example, we can use:
- One variable that codes whether the dinosaur is a carnivore
0
or a herbivore1
. - One variable that codes whether the dinosaur is a carnivore
0
or an omnivore1
.
Let call the first variable dummy_1
and the second
variable dummy_2
. Then:
- When
dummy_1
is0
anddummy_2
is also0
, the dinosaur is a carnivore. - When
dummy_1
is1
anddummy_2
is0
, the dinosaur is a herbivore. - When
dummy_1
is0
anddummy_2
is1
, the dinosaur is an omnivore.
So the following factor (repeated from above):
factor(c("carnivore", "carnivore", "herbivore", "omnivore", "herbivore"))
#> [1] carnivore carnivore herbivore omnivore herbivore
#> Levels: carnivore herbivore omnivore
can be coded as:
dummy_1 <- c(0, 0, 1, 0, 1) # carnivore (0) or herbivore (1)?
dummy_2 <- c(0, 0, 0, 1, 0) # carnivore (0) or omnivore (1)?
library(tidyverse)
tibble(
dummy_1, dummy_2
)
#> # A tibble: 5 × 2
#> dummy_1 dummy_2
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 0
#> 3 1 0
#> 4 0 1
#> 5 1 0
If this doesn’t make much sense, try and figure it out by checking the value of the two columns for each row with the following code:
# case_when() is a very helpful function from dplyr!
case_when(
dummy_1 == 0 & dummy_2 == 0 ~ "carnivore",
dummy_1 == 1 & dummy_2 == 0 ~ "herbivore",
dummy_1 == 0 & dummy_2 == 1 ~ "omnivore",
)
What if the factor has 4 levels? Then you can code it with 3 dummy variables. And what about 5 levels? Use 4 dummy variables. The number of dummy variables needed is equal to the number of levels minus 1 (\(n_{dummy} = n_{levels} - 1\)).
Summing up
To sum up:
- Factors are vectors that code categorical variables.
- The values in a factor are called levels.
- Regression models cannot work directly with factors, so these are coded using numbers.
-
Dummy coding is one way of coding factors as
numbers using one or more numeric variables of
0
s and1
s.
Coding and contrasts
Now. We’ve seen that dummy coding is simply using dummy numeric
variables with 0
s and 1
s.
In fact, this is one way of coding factors, or one coding scheme.1 Different coding schemes in R are called contrasts. Dummy coding is called treatment contrasts in R.
Treatment contrasts
The term treatment contrasts comes from the clinical sciences where you test, for example, the efficacy of a medical intervention (a drug, surgery, etc…) by comparing a control group (which has not received the “treatment”) with a group that has received the medical intervention (the treatment group).
The control group can be used as the reference group to see if the treatment group has benefited from the medical treatment (i.e. if the treatment group’s health has improved after the intervention relative to the control group, then one can infer that the treatment was effective).
Let’s look at treatment contrasts in R.
In the previous section, we’ve been illustrating dummy coding by
assigning 0
s and 1
s using one or more dummy
variables. In practice, you do not need to do that to run real analyses,
because R does that under the hood for you.
Factors in R are coded with treatment contrasts by default. Also by
default, the first level is set as the reference level (the order is
alphabetical by default). The reference level is the level that gets
coded only with 0
s, as we have seen above for the
dinosaur’s diet factor (carnivorous
had
dummy_1 = 0
and dummy_2 = 0
).
Let’s see an example using a data table with measurements of vowel duration.
data("vowels")
glimpse(vowels)
#> Rows: 886
#> Columns: 9
#> $ item <dbl> 20, 2, 11, 1, 15, 10, 13, 3, 14, 19, 4, 6, 16, 17, 5, 23…
#> $ speaker <chr> "it01", "it01", "it01", "it01", "it01", "it01", "it01", …
#> $ word <chr> "pugu", "pada", "poco", "pata", "boco", "podo", "boto", …
#> $ v1_duration <dbl> 95.23720, 138.96844, 126.93226, 127.49888, 132.33310, 12…
#> $ c2_voicing <chr> "voiced", "voiced", "voiceless", "voiceless", "voiceless…
#> $ vowel <chr> "u", "a", "o", "a", "o", "o", "o", "a", "o", "u", "a", "…
#> $ c2_place <chr> "velar", "coronal", "velar", "coronal", "velar", "corona…
#> $ speech_rate <dbl> 4.893206, 5.015636, 4.819541, 5.031662, 5.063435, 5.0632…
#> $ speech_rate_c <dbl> -0.55937531, -0.43694485, -0.63303978, -0.42091937, -0.3…
For example, let’s take the vowel
column. This column
indicates which vowel the measurement was taken from, and that can be
/a/, /o/, or /u/.
If we convert the vowel
column into a factor, the levels
will be a
, o
and u
, and
a
will be the reference level.
To get a sense of how a factor would be coded with treatment
contrasts, we can print a dummy coding table with the
contr.treatment()
function.
contr.treatment(levels(vowels$vowel))
#> o u
#> a 0 0
#> o 1 0
#> u 0 1
Now, let’s run a regression model with v1_duration
as
the outcome variable and vowel
as the predictor.
vow_lm <- lm(v1_duration ~ vowel, data = vowels)
summary(vow_lm)
#>
#> Call:
#> lm(formula = v1_duration ~ vowel, data = vowels)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -66.874 -22.567 -2.293 17.025 106.755
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 128.616 1.806 71.227 <2e-16 ***
#> vowelo -5.641 2.549 -2.213 0.0272 *
#> vowelu -29.763 2.603 -11.432 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 31.38 on 883 degrees of freedom
#> Multiple R-squared: 0.1419, Adjusted R-squared: 0.1399
#> F-statistic: 72.99 on 2 and 883 DF, p-value: < 2.2e-16
The summary returns three coefficients:
-
Intercept
. -
vowelo
. -
vowelu
.
Since a
is the reference level of vowel
,
the Intercept
corresponds to the mean duration of the vowel
a
, i.e. 128 ms.
The coefficient of o
is the difference between
the mean duration of o
and the mean duration of the
reference level a
(i.e. the
Intercept
). So o
is 5.6 ms shorter than
a
on average (shorter because the coefficient is
negative).
Finally, the coefficient of u
is the difference
between the mean duration of u
and the mean duration of the
reference level a
So u
is 29.7 ms
shorter than a
.
This is how treatment contrasts work.
Sum contrasts
Another type of coding is effect coding. In R, the corresponding contrast type are the so-called sum contrasts.
When using sum contrasts, the levels in a factor are coded using
1
s, -1
s and 0
s. If you sum the
values of each dummy variable you always get 0
(hence the
name “sum” contrast).
Let’s see what happens to the factor vowel
when using
sum contrasts (remember that factors use treatment contrasts by
default).
This is how sum coding would look like for this factor:
Since there are 3 levels, we need two dummy variables. So
a
is coded as 1, 0
, o
is coded as
0, 1
, and u
as -1, -1
.
To set the contrasts of a factor to sum coding, we can run the following:
contrasts(vowels$vowel) <- "contr.sum"
# If you want to change it back to treatment contrasts you can run
# contrasts(vowels$vowel) <- "contr.treatment"
With sum contrasts the reference level is in fact the grand mean.
In our model of vowel duration this means that the
Intercept
coefficient will be the grand mean of vowel
duration.
Let’s rerun the model and look at the output.
vow_lm <- lm(v1_duration ~ vowel, data = vowels)
summary(vow_lm)
#>
#> Call:
#> lm(formula = v1_duration ~ vowel, data = vowels)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -66.874 -22.567 -2.293 17.025 106.755
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 116.815 1.055 110.728 < 2e-16 ***
#> vowel1 11.801 1.483 7.957 5.40e-15 ***
#> vowel2 6.160 1.481 4.160 3.49e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 31.38 on 883 degrees of freedom
#> Multiple R-squared: 0.1419, Adjusted R-squared: 0.1399
#> F-statistic: 72.99 on 2 and 883 DF, p-value: < 2.2e-16
The Intercept
now is 116 ms, which mean that the mean of
vowel duration across the three vowels is 116 ms.
We can check this by taking the mean:
mean(vowels$v1_duration)
#> [1] 117.2747
Yup, pretty close (small differences are fine).
So what are now the coefficients called vowel1
and
vowel2
?
These are, respectively, the difference between the mean duration of
a
and the grand mean, and the difference between the mean
duration of o
and the grand mean.
So a
is 11.8 ms longer than the grand mean, and
o
is 6.1 ms longer than the grand mean.
What about u
then?
Easy. You just subtract the coefficients of both a
and
o
from the grand mean: \(116.8 -
11.8 - 6.1 = 98.9\).
If you want to check that this is correct, the mean duration of
u
according to the model above where we used treatment
contrasts was \(128.616 - 29.763 =
98.853\).
Sum contrasts and interactions
Sum contrasts can be very handy when the model contains interactions between factors.
Let’s say we want to include in our model of vowel duration a predictor that specifies the voicing of the stop following the vowel. We also add an interaction between vowel and voicing, so that we can model differences in the effect of voicing across vowels.
vow_lm_2 <- lm(v1_duration ~ c2_voicing + vowel + c2_voicing:vowel, data = vowels)
summary(vow_lm_2)
#>
#> Call:
#> lm(formula = v1_duration ~ c2_voicing + vowel + c2_voicing:vowel,
#> data = vowels)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -64.905 -23.262 -2.033 18.134 115.813
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 123.469 1.449 85.232 < 2e-16 ***
#> c2_voicingvoiceless -13.309 2.049 -6.496 1.37e-10 ***
#> vowel1 14.606 2.037 7.171 1.57e-12 ***
#> vowel2 8.564 2.033 4.212 2.79e-05 ***
#> c2_voicingvoiceless:vowel1 -5.609 2.880 -1.947 0.0518 .
#> c2_voicingvoiceless:vowel2 -4.808 2.876 -1.672 0.0949 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 30.47 on 880 degrees of freedom
#> Multiple R-squared: 0.1937, Adjusted R-squared: 0.1891
#> F-statistic: 42.29 on 5 and 880 DF, p-value: < 2.2e-16
Now the Intercept
is the mean vowel duration when the
following stop is voiced (the reference level of
c2_voicing
). This means that the average vowel followed by
a voiced stop is 123 ms long in our data.
The coefficient of c2_voicingvoiceless
tells us the mean
effect of c2_voicing
on vowel duration, averaged across all
vowels. So, on average, a vowel is about 13 ms shorter when followed by
a voiceless stop.
The coefficients of vowel1
and vowel2
indicate the difference between the average vowel duration before a
voiced stop (the Intercept
) and a
and
o
respectively. As before, to get the difference between
the average vowel duration of u
before a voiceless stop and
the mean vowel duration, you just need to subtract the coefficients of
vowel1
and vowel2
from the
Intercept
: \(123.5 - 14.6 - 8.5 =
100.4\).
The last two coefficients, c2_voicingvoiceless:vowel1
and c2_voicingvoiceless:vowel2
correspond to the difference
in the effect of voicing between the average effect of voicing
(c2_voicingvoiceless
, i.e. -13 ms) and the effect of
voicing in a
and o
respectively. That is, the
decrease of vowel duration for a
is 5.6 ms greater than the
average effect, while the decrease of vowel duration for o
is 4.8 ms greater than the average effect.
Following the usual formula, the effect of voicing for u
is \(-13.309 - (-5.6) - (-4.8) =
-2.9\).