class: center, middle, inverse, title-slide # Bayesian Data Analysis for
Speech Sciences ## Application to regression ### Timo Roettger, Stefano Coretta and Joseph Casillas ### LabPhon workshop ### 2021/07/05 (updated: 2021-07-05) --- class: title-slide-section-grey, middle background-image: url(https://www.bellevuerarecoins.com/wp-content/uploads/bigstock-Coin-Flip-5807921.jpg) background-size: contain background-position: 100% 50% # What is probability? ??? Before we get started I'd like to ask you to take a second and think about "probability". What does it mean? What do you understand if somebody asks you the probability you are going to be on time? What is the probability it will rain tomorrow? Over the next two days we will be thinking about the notion of probability and how to define it in light of the research that we do --- background-image: url(libs/worldview.png) background-size: contain background-color: black # .white[The world according to <br>frequentists] ??? A key reason for *why* we will talk about probability has to do with frequentism You might have heard this term before, but if not, that's fine. I'm certain you already have a grasp of what it is. Over the next few days we will be thinking a lot about frequentism and comparing it directly with Bayesian methods. --- background-image: url(https://www.bellevuerarecoins.com/wp-content/uploads/bigstock-Coin-Flip-5807921.jpg) background-size: contain background-position: 100% 50% # What is frequentism? .pull-left[ - "Classical" statistics - Population parameters are fixed, actually exist - Probability refers to the long-run frequency of a given event - A sample of data is the result of one of an infinite number of exactly repeated experiments - Data are random, result from sampling a fixed population distribution ] ??? - So what is it? 
- Frequentist statistics represent what most of us have always been doing - It's a bit of a blanket term used to refer to "classical" statistics - The frequentist view of probability is what leads to some of the odd definitions of the statistical machinery you already know, like confidence intervals - A key idea is that population parameters, what we estimate when we do statistics, are fixed, actually exist - Under this view, probability refers to the long-run frequency of a given event - We can consider a sample of data as the result of one of an infinite number of exactly repeated experiments - In this sense, the data are random, and result from sampling a fixed population distribution - The key idea that I'd like you to keep in mind at this point is that this view of probability seems to be at odds with how we think/reason - For example when we ask something like what is the probability it will rain tomorrow? (we aren't typically thinking about long term frequencies or exactly repeated experiments) - We will be coming back to these ideas of *probability* and *frequentism* --- class: center, middle # What if... ### There isn't one true population parameter... <br> ??? But for now, consider this... What if there isn't one *true* population parameter -- but an entire distribution of parameters,<br> some more plausible than others ??? What if instead there is an entire distribution of parameters, some more plausible than others --- background-color: black background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/bayesian_bey.png) background-size: contain count: false ??? This idea is at the heart of Bayesian inference --- class: middle, center # Bayesian inference <br>is all about the posterior ??? 
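A quick reference for these notes: the posterior we keep returning to is just Bayes' rule applied to parameters (a standard identity, stated here for orientation, not something specific to this deck):

$$ p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)} $$

That is, the posterior combines the likelihood with the prior, normalized by the marginal likelihood of the data.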
One key idea that I'd like you to hold on to for right now is that Bayesian inference is all about what we call the posterior distribution I'm going to spend the rest of my time in this brief presentation building up intuition about the posterior. In other words, in Bayesian data analysis we produce/approximate a posterior distribution of possible parameters that we can then summarize in different ways to answer questions and quantify uncertainty This is a fundamental difference with regard to frequentism That is to say, there is no posterior distribution in frequentist statistics; you estimate only a single value for a parameter I am going to walk us through applying this idea to regression --- class: middle # .center[Applied to regression] ## .blue[Classical]: There is a single "true" line of best fit, and I'll give my best estimate of it. ??? As many of us already know, in standard linear regression in a frequentist framework, we typically estimate what we call a line of best fit, either using OLS or maximum likelihood estimation It's actually quite simple at its core: we have some blob of data and we try to shove a line through it to describe it the best we can -- ## **Bayesian**: There is a distribution of lines of fit...some more plausible than others...and I'll give you samples from that distribution. ??? Under a Bayesian framework, we derive a distribution of lines that are compatible with our data and our prior assumptions Some of these lines are more plausible than others, so we can use standard methods to summarize and quantify uncertainty about them Let's illustrate this to make it clear --- class: middle <img src="index_files/figure-html/cars-f-vs-cars-b1-1.png" width="936" /> ??? This scatter plot represents a traditional linear regression with a line of best fit +/- standard error --- class: middle count: false <img src="index_files/figure-html/cars-f-vs-cars-b2-1.png" width="936" /> ??? 
If we plot this next to the most plausible line from a Bayesian regression we see essentially the same line --- class: middle count: false <img src="index_files/figure-html/cars-b-draws1-1.png" width="936" /> ??? Here we plot the most plausible line along with 20 random draws of other plausible lines --- class: middle count: false <img src="index_files/figure-html/cars-b-draws2-1.png" width="936" /> ??? Here we plot the most plausible line along with 50 random draws of other plausible lines --- class: middle count: false <img src="index_files/figure-html/cars-b-draws3-1.png" width="936" /> ??? Here we plot the most plausible line along with 200 random draws of other plausible lines --- class: middle <img src="index_files/figure-html/cars-f-vs-cars-b-2-1.png" width="936" /> ??? Notice that we seem to converge on the same result, though we are not expressing our best guess of the "true" regression line. Instead we express uncertainty about the relationship by plotting the posterior distribution of lines that are compatible with our assumptions and the data The orange line merely represents the line that appears most often in the posterior, thus we assume it is the most plausible outcome We will build our first model in just a second, but first let's get to know the data set we will be using --- # Get to know the data set ```r dim(polite) ``` ``` ## [1] 224 27 ``` ??? 
The data set we will be using in this workshop is from Winter XXXX This study examined the effect of politeness on the realization of pitch We can see that the data set has 27 columns (or variables) and 224 observations --- ``` ## Rows: 224 ## Columns: 27 ## $ subject <chr> "F1", "F1", "F1", "F1", "F1", "F1", "F1", "F1", "F1"… ## $ gender <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F… ## $ birthplace <chr> "seoul_area", "seoul_area", "seoul_area", "seoul_are… ## $ musicstudent <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye… ## $ months_ger <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, … ## $ scenario <dbl> 6, 6, 7, 7, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7… ## $ task <chr> "not", "not", "not", "not", "dct", "dct", "dct", "dc… ## $ attitude <chr> "inf", "pol", "inf", "pol", "pol", "inf", "pol", "in… ## $ total_duration <dbl> 55.159, 28.508, 60.309, 40.825, 18.392, 13.551, 5.21… ## $ articulation_rate <dbl> 3.380492, 12.159277, 6.188191, 6.025763, 5.614616, 5… ## $ f0mn <dbl> 229.3, 181.1, 219.8, 175.8, 214.6, 210.9, 284.7, 265… ## $ f0sd <dbl> 4.18, 3.78, 4.18, 4.59, 4.28, 3.43, 2.09, 4.93, 3.22… ## $ f0range <dbl> 10.6, 9.0, 10.4, 13.1, 10.3, 10.5, 7.1, 18.3, 6.7, 1… ## $ inmn <dbl> 64.6, 63.2, 61.8, 61.7, 62.0, 61.5, 65.7, 64.9, 61.3… ## $ insd <dbl> 7.94, 7.75, 8.89, 8.51, 8.29, 8.95, 5.81, 8.52, 8.43… ## $ inrange <dbl> 25.3, 24.3, 28.4, 26.1, 25.4, 28.0, 19.1, 27.6, 26.4… ## $ shimmer <dbl> 0.08207, 0.08074, 0.08645, 0.08713, 0.06942, 0.08247… ## $ jitter <dbl> 0.01138, 0.01161, 0.01282, 0.01320, 0.01069, 0.01112… ## $ HNRmn <dbl> 18.1, 17.8, 17.0, 17.1, 18.5, 18.8, 20.9, 14.6, 20.6… ## $ H1H2 <dbl> 5.7893421, 6.3669091, 6.2444737, 5.0164835, 5.442307… ## $ breath_count <dbl> 11, 3, 6, 7, 3, 3, 1, 0, 1, 0, 0, 1, 1, 0, 9, 7, 11,… ## $ filler_count <dbl> 4, 2, 2, 4, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0… ## $ hiss_count <dbl> 0, 1, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0… ## $ nasal_count <dbl> 0, 3, 0, 0, 4, 0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ sil_count <dbl> 8, 3, 20, 7, 0, 2, 0, 1, 2, 2, 1, 0, 0, 0, 1, 1, 6, … ## $ ya_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0… ## $ yey_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ``` ??? We certainly won't be using all of the 27 variables, but we can see that Winter XXXX measured many things, such as F0 min, F0 sd, and F0 range, to name just a few We are going to begin by exploring the variable articulation_rate --- class: middle <img src="index_files/figure-html/ar-hist-1.png" width="1008" /> ??? We can use a histogram to get an idea of the distribution of the variable --- # Get to know the data set <img src="index_files/figure-html/ar-hist-rep-1.png" width="504" style="float:right" /> .pull-left[ Let's explore the `articulation_rate` variable - N = 224 - The range is 3.38 to 12.16. - The .blue[mean] is 6.68. - The .RUred[SD] is 1.16. - The 95% quantiles are 4.86, 9.11. ] ??? We can use standard methods to describe the distribution of values for articulation_rate For example, we have 224 observations, which range from 3.38 to 12.16 The average articulation rate is 6.68 (SD = 1.16) -- .pull-left[ We'll fit an intercept only model: $$ articulation\ rate \sim Normal(\mu, \sigma) $$ .center[ ```lm(articulation_rate ~ 1)``` ] ] ??? Now we will use this distribution in a simple linear model; the simplest model we can fit is an intercept-only model An intercept-only model is essentially a flat line (slope of 0); the intercept will be the mean of the distribution We will fit frequentist and Bayesian versions of this model and compare --- # Frequentist model We'll fit an intercept only model. The mean `articulation_rate` is 6.68 ±1.16 SD. -- ```r lm(articulation_rate ~ 1, data = polite) %>% summary() ``` ``` ## ## Call: ## lm(formula = articulation_rate ~ 1, data = polite) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.2996 -0.8456 -0.1150 0.6270 5.4791 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 6.68013 0.07783 85.83 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.165 on 223 degrees of freedom ``` ??? The frequentist version is easily fit with the `lm` function. We specify the criterion on the left side of the `~` and put a 1 on the right side to specify an intercept only model If we look at the summary of the model we see the estimate for the intercept is essentially the same as the mean of the articulation rate distribution --- # Bayesian model We'll fit an intercept only model. The mean `articulation_rate` is 6.68 ±1.16 SD. -- ```r brm(articulation_rate ~ 1, data = polite) %>% summary ``` ``` ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: articulation_rate ~ 1 ## Data: polite (Number of observations: 224) ## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; ## total post-warmup samples = 4000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept 6.68 0.08 6.53 6.83 1.00 3375 2668 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 1.17 0.06 1.07 1.29 1.00 3391 2832 ## ## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). ``` ??? We can fit the Bayesian version of this model using the brm function from the package `brms` (note the syntax is identical) There are some under-the-hood defaults that I am completely ignoring for now (priors and likelihood). We will dedicate time to these as we move along. 
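As an aside for these notes (a sketch only, not run in the slides): those hidden defaults could be written out explicitly. The prior values below are illustrative choices of ours, not the brms defaults.

```r
# Sketch: spelling out the family and priors that brm() otherwise fills in.
# The priors here are illustrative guesses, NOT the brms defaults.
library(brms)

b_mod_00_explicit <- brm(
  articulation_rate ~ 1,
  data   = polite,
  family = gaussian(),                        # default likelihood made explicit
  prior  = c(
    prior(normal(7, 5), class = Intercept),   # weakly informative prior on the mean
    prior(exponential(1), class = sigma)      # positive-only prior on the SD
  )
)
```

Running `get_prior(articulation_rate ~ 1, data = polite)` would list the actual defaults brms falls back on.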
Notice that one key difference is that this model estimated 2 parameters: the intercept and sigma (the mean and SD of the articulation rate distribution) Also notice that the estimates are essentially the same as the mean and SD we calculated before --- class: title-slide-section-blue, middle, center # So what's the difference? ??? At this point you might be saying to yourself "So what? The models are the same?" and you would have a good point. The information we have looked at in the summaries is essentially the same. The main difference, and the key takeaway from this section, is that the Bayesian model provides a posterior distribution of values for the parameters it estimates, the intercept and sigma. Let's see what we can do with a posterior distribution --- # Exploring the posterior .pull-left[ ```r post_00 <- posterior_samples(b_mod_00) ``` ``` ## b_Intercept sigma ## 1 6.645696 1.237890 ## 2 6.665647 1.125728 ## 3 6.670478 1.129146 ## 4 6.760504 1.196283 ## 5 6.870616 1.264796 ## 6 6.777265 1.259011 ## 7 6.659514 1.153504 ## 8 6.699240 1.157076 ## 9 6.691047 1.222263 ## 10 6.724507 1.208100 ## 11 6.716813 1.218816 ## 12 6.737001 1.323182 ## 13 6.697180 1.294377 ## 14 6.645625 1.065328 ## 15 6.692892 1.163420 ``` ] ??? I can turn the model object into a dataframe/tibble of the posterior using the as_tibble function (or posterior_samples) Because of the default values in brms I have 4000 samples. Here I print the first 15 -- .pull-right[ - This posterior distribution has 4000 draws of mean and SD values that are compatible with our data - The posterior is just like any other distribution of data - We can analyze/summarize it however we want ```r mean(post_00$b_Intercept) ``` ``` ## [1] 6.678269 ``` ```r quantile( post_00$b_Intercept, probs = c(0.025, 0.975) ) ``` ``` ## 2.5% 97.5% ## 6.527723 6.826254 ``` ] ??? We can now use all of the mathematical tricks we have up our sleeve to describe the posterior distribution in any way we find meaningful. 
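Because the posterior is a plain distribution of draws, any summary we can compute, we can report. A minimal self-contained sketch (using simulated stand-in draws so it runs on its own; in the slides the draws come from post_00$b_Intercept):

```r
# Stand-in posterior draws, simulated here so this snippet is self-contained;
# in the slides these come from post_00$b_Intercept.
set.seed(42)
draws <- rnorm(4000, mean = 6.68, sd = 0.08)

# Posterior probability that the mean articulation rate exceeds 6.5
p_above <- mean(draws > 6.5)

# The interval level is our choice: here an 89% credible interval
ci_89 <- quantile(draws, probs = c(0.055, 0.945))
```

The point is not the particular numbers but that exceedance probabilities, arbitrary interval levels, and so on are all one line of code on the draws.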
Here I calculate the mean value for the intercept samples as well as a 95% credible interval, but I can describe the posterior in any way I want This is how we can quantify our uncertainty about the estimates the model generates --- class: center, middle <img src="index_files/figure-html/ar-post-param-plot-1.png" width="864" /> ??? Here we describe the posterior distributions of the two model parameters in a single plot This is a handy way to summarize a model A key takeaway is that the peak of the distributions represents the value that is the most common --- class: center, middle <img src="index_files/figure-html/ar-post-paramspace-plot-1.png" width="864" /> ??? Here is a scatter plot of the entire posterior distribution (4000 samples) This is a handy way to visualize the posterior parameter space (we will come back to this) because (1) it emphasizes the fact that we are looking at a 2D space, a joint probability distribution, and (2) there are many estimates for b_Intercept and sigma that are essentially the same. When we have a lot of estimates that appear often in the posterior it implies that these values are more plausible I've highlighted the *most* plausible combination of the intercept and sigma, the medians (i.e., the values that are in the middle of the distribution), with the orange point. We can think of the joint distribution as a mountain and the orange point represents the highest point --- class: center, middle background-color: black <img src="index_files/figure-html/ar-post-density-plot-1.png" width="864" /> --- class: center, middle background-color: black background-image: url(./libs/dark_density_posterior.png) background-size: contain --- # Takeaways .Large[ - There are primarily two camps in statistics: frequentists and Bayesians - The two camps see probability in different ways - **Bayesian statistics is all about the posterior** ] ??? 
- There are primarily two camps in statistics: frequentists and Bayesians - The two camps see probability in different ways - Frequentists: parameters are real/true, provide best estimate of it - Bayesians: there are distributions of possible parameters, we summarize the distribution - Bayesian statistics is all about the posterior --- background-color: black background-image: url(https://i.pinimg.com/originals/1b/58/77/1b58772e23bb5b761ded9aa1d1e1d7e4.gif) background-size: contain ??? With this in mind, you are now on your way to becoming Bayesians We will continue building intuitions about these core ideas --- count: false class: title-slide-final background-image: url(https://raw.githubusercontent.com/learnB4SS/b4ss-theme/main/assets/Logo.png) background-size: 275px background-position: 50% 40% # .white[Thank you] <br><br><br><br><br><br><br><br><br><br><br><br> ### https://learnB4SS.github.io | | | | ------------------------------: | :---------------------- | |
| .white[@TimoRoettger] | |
| .white[@StefanoCoretta] | |
| .white[@jvcasill] | --- exclude: true (McElreath, 2018) --- class: title-slide-section-grey # References [1] R. McElreath. _Statistical rethinking: A Bayesian course with examples in R and Stan_. Chapman and Hall/CRC, 2018.