class: center, middle, inverse, title-slide

# Bayesian Data Analysis for Speech Sciences

## Sampling from the posterior

### Timo Roettger, Stefano Coretta and Joseph Casillas

### LabPhon workshop

### 2021/07/06 (updated: 2021-07-06)

---
count: false
background-image: url(./libs/mira1.png)
background-size: contain
background-position: 50% 100%

???

In this session we are going to talk about model assessment. That is, what to do after you have run your model and you want to know about goodness of fit.

We are going to talk about Markov chains and some diagnostics like Rhat and effective sample size, as well as best practices for dealing with divergent transitions.

The purpose of this session is to help you build intuition about how the sampling process works and to know when your model is performing poorly... luckily brms makes that quite obvious.

If you have young children or are fans of Disney Junior, then you might recognize Mira, Royal Detective. If not, Mira is the star of my daughter's favorite mystery cartoon. She is the brave and resourceful royal detective in a fictional kingdom called Jalpur.

She travels around the kingdom solving mysteries, and I think this is a bit similar to what it's like, especially at first, when building and assessing models.

So, the plan is that we are going to be like Mira and solve some Bayesian mysteries. We'll get a good introduction to the topic, though it won't be exhaustive, and at the end I'll include some references for further reading.

---
class: center, middle
background-image: url(./libs/chain1.png)
background-size: contain

???

We hinted before that Bayesian regression involves complicated sampling algorithms that help us approximate a posterior distribution.

You may have heard about one class of these algorithms before, called Markov chain Monte Carlo methods.

---
count: false
class: center, middle
background-image: url(./libs/chain2.png)
background-size: contain

# Markov chain Monte Carlo (MCMC)

--

### When we fit a model using `brm()` we run multiple Markov chains

--

### The chains sample the parameter space (more on this in a bit)

--

### We assess the chains to get an idea of how well the model converged

--

### If the sampling procedure is sub-optimal we can inspect the chains and check several diagnostics

---
class: center, middle

<img src="index_files/figure-html/good-chains-1.png" width="936" />

???

The way we usually do this is by looking at trace plots.

Here we see the values of a parameter on the vertical axis, in this case the estimate of the intercept, and the sample iteration on the horizontal axis.

It's hard to tell, but there are 4 different lines moving from left to right.

This is an example of what we could call good, healthy chains; the algorithm worked... we say the chains have mixed well.

As we look from left to right we see what many describe as a hairy caterpillar.

So when we inspect our chains, this is what we want.

---
class: right, middle
background-image: url(https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ75CzFCTvGIxClCFzrJ6iF2YA6yzxVN-GraQ&usqp=CAU)
background-size: 800px

<br><br><br><br><br><br>
<br><br><br><br><br><br>
<br><br><br><br><br><br>

### .grey[A promising young Markov chain]

???

This is also a healthy chain.

---
class: center, middle

<img src="index_files/figure-html/bad-chains-1.png" width="936" />

???

If the sampling process does not work properly we might see something like this.

If you follow any given line you will notice that the chains do not sample the parameter space like in the previous plot.

That is, there is not a lot of random up-and-down movement... there is no hairy caterpillar... this means the chains have not mixed well.
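In case you want to draw trace plots like these yourself, here is a minimal sketch, assuming a fitted `brmsfit` object called `mod` (a name we use here just for illustration); `plot()` and `mcmc_plot()` are both part of brms:

```r
library(brms)

# Default diagnostic plot: marginal posterior densities plus trace plots
plot(mod)

# Or just the trace plots, drawn via the bayesplot backend
mcmc_plot(mod, type = "trace")
```

Healthy chains should overlap and wander around the same region: the hairy caterpillar from the previous slide.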
Now, obviously, determining whether your trace plot looks like a hairy caterpillar is not an exact science, but luckily there are also diagnostics that tell us about the chains' performance.

---

# Model assessment
### Rhat and effective sample size

???

Two of those diagnostics are Rhat and effective sample size, and we're going to talk about both.

You probably noticed over the course of the last two days that when we examine a summary of a brms model we see something like this...

--

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: articulation_rate ~ attitude 
##    Data: polite (Number of observations: 224) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       6.88      0.11     6.67     7.10 1.00     3677     2815
## attitudepol    -0.41      0.15    -0.70    -0.11 1.00     3657     2873
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     1.15      0.05     1.05     1.26 1.00     3684     3013
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

???

This is the same model we fit earlier today, looking at articulation rate as a function of attitude.

(Explain the "Samples" line.)

Now I'd like to call your attention to this area of the table...

--
background-image: url(./libs/mira2.png), url(https://raw.githubusercontent.com/jvcasillas/media/master/general/img/circle.png)
background-size: 800px, 410px
background-position: 150% 50%, 72% 72%

???

You have already noticed these three columns on the right-hand side: Rhat, Bulk_ESS and Tail_ESS. What is all this about?

Rhat is a convergence diagnostic that compares estimates between and within chains (I'm simplifying).

The key thing to remember is that this value should be less than 1.01; if not, it is a sign the chains did not mix well, i.e., the between- and within-chain estimates don't agree.

In this case, we can see that all the Rhat values are 1.00.

With regard to effective sample size, this metric "captures the number of independent draws that have the same amount of information as the dependent sample obtained by the MCMC algorithm".

That's a lot to unpack if you are new at this... the important thing to remember is that these numbers should be high (the maximum here is 4,000, the total number of post-warmup samples).

A general rule of thumb is that, at a minimum, the ESS should be 100 times the number of chains (100 × 4 = 400).

It's helpful to compare with a poorly fitted model.
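If you'd rather pull these diagnostics out programmatically than read them off the summary, here's a minimal sketch (again assuming a fitted `brmsfit` called `mod`; `rhat()` and `neff_ratio()` are the brms accessors, and the posterior package computes bulk and tail ESS):

```r
library(brms)
library(posterior)

# Rhat per parameter: anything above 1.01 is a warning sign
rhat(mod)

# Ratio of effective sample size to total draws (closer to 1 is better)
neff_ratio(mod)

# Or get Rhat, bulk ESS and tail ESS together in one table
summarise_draws(as_draws_df(mod))
```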
---

# Model assessment
### Rhat and effective sample size (bad example)

???

This model fits the same parameters, but I purposely gave it very bad priors.

--
background-image: url(./libs/mira2.png), url(https://raw.githubusercontent.com/jvcasillas/media/master/general/img/circle.png)
background-size: 800px, 410px
background-position: 150% 50%, 72% 72%

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: articulation_rate ~ attitude 
##    Data: polite (Number of observations: 224) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.01      0.00     0.01     0.02 1.14      181       77
## attitudepol    -0.00      0.00    -0.00     0.00 1.10       99       46
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.32      0.00     0.32     0.32 1.03      137      156
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

???

Notice that the Rhat values are higher than 1.01, and the ESS values are very low, well below 100 times the 4 chains (400).

---
background-color: #272822
background-image: url(./libs/warning.png)
background-size: contain

???

Luckily, when something like this happens brms will also give us warnings.

Note that it specifically suggests caution when analyzing the results and gives helpful suggestions:

1. Run more iterations on each chain
2. Take another look at your priors (the problem here)

--

.pull-left[

```r
mod <- brm(
  formula = y ~ x,
* prior = priors,
  chains = 4,
* iter = 4000,
  warmup = 2000,
  data = data
)
```
]

???

Notice that there is another warning (part 2): divergent transitions.

In this case brms also gives us a suggestion for combating this problem (increase adapt_delta), as well as a link with more information.

So we already know what the solution is, but let's unpack the problem...

---
class: middle, center
background-image: url(https://d23.com/app/uploads/2018/11/1180w-600h_112918_mira-royal-detective-announce-780x440.jpg)
background-size: 800px
background-position: 50% 40%

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>

# Divergent transitions

???

Again, this is an issue related to sampling the parameter space, so let's talk about parameter space...

---
class: center, middle

# So what is parameter space?

???

What is it? You may have noticed that I say this quite a bit.

---
class: middle, center

<img src="index_files/figure-html/data-space-1.png" width="864" />

???

Let's consider a simple example, y ~ x, where both variables are continuous and we can plot them in a scatter plot like this one. This is the data space.

We can fit a line through the data that is defined by an intercept and a slope.
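To make this concrete with a real fit, here is a small sketch: assuming a hypothetical model `mod` fit with `brm(y ~ x, data = data)`, brms names the coefficients `b_Intercept` and `b_x`, and plotting the draws against each other shows where the chains went in parameter space:

```r
library(brms)
library(posterior)

# One row per post-warmup draw, one column per parameter
draws <- as_draws_df(mod)

# Each point is one sample the chains took in parameter space
plot(draws$b_Intercept, draws$b_x,
     xlab = "Intercept", ylab = "Slope (x)",
     pch = 16, col = rgb(0, 0, 0, 0.2))
```

With well-mixed chains, the dense middle of that cloud is exactly the mountain we will look at in a moment.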
---

<br><br>

<iframe src="https://jvcasillas.shinyapps.io/shiny_parameter_space/" style="border:none;" width="100%" height="100%">

???

We could also plot the values of the intercept and the slope in a 2D plane, as you see in the right panel here.

This allows us to compare the data space with the parameter space in an interactive way.

If the value of the intercept were higher, the line would move up in the data space and the red point would move up in the parameter space.

If the value of the slope were higher, the line would become steeper in the data space and the red point would move to the right in the parameter space.

If I decrease the slope, the inclination of the line changes in the opposite direction and the red point moves to the left.

When we fit a Bayesian model the Markov chains sample the parameter space; in other words, they move around in the grid we see in the plot on the right.

Keep in mind that this is an extremely simple example in which we only fit a couple of parameters and can easily visualize the parameter space in two dimensions.

---
background-image: url(./libs/dark_density_posterior1.png)
background-size: contain
background-color: black

???

We can also visualize this in 3 dimensions, as the number of samples from a given area of the parameter space increases.

---
background-image: url(./libs/dark_density_posterior2.png)
background-size: contain
background-color: black

???

To understand this it helps to look at the plot with a slight inclination, which helps us realize that it is like a small mountain.

Now I'm going to show you a simulation of how the sampling procedure actually works.

Keep in mind that in our simplified example the parameter space is 3-dimensional, but for math reasons let's assume it's flipped upside down, like a bowl.

---
background-image: url(./libs/param_space.png)
background-size: contain

???

So now we are looking at a similar plot of a hypothetical parameter space.

We can imagine that the vertical axis is the intercept and the horizontal axis is the slope.

The blue circles and gradient in the center represent the depth of our bowl in 3D.

To understand how the sampling works, let's pretend that our bowl has a frictionless surface and that we don't know its shape (just that it is bowl-like), so we are going to survey it.

In this imaginary example we'll put a marble in the bowl in a random spot and flick it to see where it goes.

We'll keep track of where it stops, and those will be the values we sampled in the parameter space.

---
background-image: url(./libs/spiral.gif)
background-size: contain

???

I like to think about it a bit like this, if you have ever seen one of these.

---

<iframe src="https://chi-feng.github.io/mcmc-demo/app.html" style="border:none;" width="100%" height="100%">

???

This helps us develop intuition about the problem, but remember we rarely fit such simple models.

It isn't unusual to fit models with hundreds or even thousands of parameters.

As humans, we struggle to even imagine what a high-dimensional parameter space might look like... so plotting can't help us there.

---

# Model assessment
### Dealing with divergences

```r
mod <- brm(
  formula = y ~ x,
  prior = priors,
  chains = 4,
  iter = 4000,
  warmup = 2000,
* control = list(adapt_delta = 0.99),
  data = data
)
```

???

More often than not you can get rid of divergent transitions in simple models by adjusting adapt_delta, as you see in the code here.

This forces the sampler to "take smaller steps down the mountain".

The tradeoff is that your model may be slower to fit.

This may also be a solution to your problems when dealing with a more complex parameter space, but in my personal experience divergences often have to do with the assumptions of the generative model, like the choice of likelihood and sensible priors, so definitely think critically about those and double-check your choices.
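A quick way to see whether the fix worked is to count the divergent transitions directly. A minimal sketch, assuming a fitted `brmsfit` called `mod` (`nuts_params()` comes from bayesplot and works on brms models):

```r
library(brms)

# NUTS sampler diagnostics: one row per draw and quantity
np <- nuts_params(mod)

# "divergent__" is 1 for a divergent draw and 0 otherwise
sum(np$Value[np$Parameter == "divergent__"])
```

If that count stays above zero even with a higher adapt_delta, that's usually the cue to revisit the likelihood and the priors rather than the sampler settings.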
---

# Model assessment
### More information

- [About Markov chains](http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/)
- More on Rhat and effective sample size: [Vehtari et al. (2020)](https://arxiv.org/abs/1903.08008)
- [Explanations of Stan warnings](https://mc-stan.org/misc/warnings.html)
- Hoffman, M. & Gelman, A. (2011). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo". [arxiv.org/abs/1111.4246](https://www.arxiv.org/abs/1111.4246)
- Betancourt, M. (2017). "A Conceptual Introduction to Hamiltonian Monte Carlo". [arxiv.org/abs/1701.02434](https://arxiv.org/abs/1701.02434)

---
class: middle, center
background-color: black

# .white["We put a tiny sliver of scientific information into the model and then we bet on the things that can happen a large number of ways"]

<!--
Calculating a posterior is computationally costly. This is part of the reason why we didn't start seeing complex Bayesian models until relatively recently. As computational power has increased, so has our ability to approximate posterior distributions. Furthermore, we've developed faster methods of sampling from the parameter space.

The surface in this space is a giant bowl. Suppose that this surface is frictionless. Now what we do is flick the marble in a random direction. The marble moves across the bowl and eventually will turn around. If we take samples of the marble's position along its path, and then flick it in another random direction once it stops, then we can learn about the shape of the surface. It flows across the target and maps out its whole shape very efficiently.
-->