class: center, middle, inverse, title-slide

# Bayesian Data Analysis for Speech Sciences

## Sampling from the posterior

### Timo Roettger, Stefano Coretta and Joseph Casillas

### LabPhon workshop

### 2021/07/06 (updated: 2021-07-06)

---
count: false
background-image: url(./libs/mira1.png)
background-size: contain
background-position: 50% 100%

???

In this session we are going to talk about model assessment. That is, what to do after you have run your model and you want to know about goodness of fit.

We are going to talk about Markov chains and some diagnostics like Rhat and effective sample size, as well as best practices for dealing with divergent transitions.

The purpose of this session is to help you build intuition about how the sampling process works and to know when your model is performing poorly... luckily brms makes that quite obvious.

If you have young children or are fans of Disney Junior, then you might recognize Mira, Royal Detective. If not, Mira is the star of my daughter's favorite mystery cartoon. She is the brave and resourceful royal detective in a fictional kingdom called Jalpur.

She travels around the kingdom solving mysteries, and I think this is a bit similar to what it's like, especially at first, when building and assessing models.

So, the plan is that we are going to be like Mira and solve some Bayesian mysteries. We'll get a good introduction to the topic, though it won't be exhaustive, and at the end I'll include some references for further reading.

---
class: center, middle
background-image: url(./libs/chain1.png)
background-size: contain

???

We hinted before that Bayesian regression involves complicated sampling algorithms that help us approximate a posterior distribution.

You may have heard about one class of these algorithms before, called Markov chain Monte Carlo methods.

---
count: false
class: center, middle
background-image: url(./libs/chain2.png)
background-size: contain

# Markov chain Monte Carlo (MCMC)

--

### When we fit a model using `brm()` we run multiple Markov chains

--

### The chains sample the parameter space (more on this in a bit)

--

### We assess the chains to get an idea of how well the model converged

--

### If the sampling procedure is sub-optimal we can inspect the chains and check several diagnostics

---
class: center, middle

<img src="index_files/figure-html/good-chains-1.png" width="936" />

???

The way we usually do this is by looking at trace plots.

Here we see the values of a parameter on the vertical axis, in this case the estimate of the intercept, and the sample iteration on the horizontal axis.

It's hard to tell, but there are 4 different lines moving from left to right.

This is an example of what we could call good, healthy chains; the algorithm worked... we say the chains have mixed well.

As we look from left to right we see what many describe as a hairy caterpillar.

So when we inspect our chains, this is what we want.

---
class: right, middle
background-image: url(https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ75CzFCTvGIxClCFzrJ6iF2YA6yzxVN-GraQ&usqp=CAU)
background-size: 800px

<br><br><br><br><br><br>
<br><br><br><br><br><br>
<br><br><br><br><br><br>

### .grey[A promising young Markov chain]

???

This is also a healthy chain.

---
class: center, middle

<img src="index_files/figure-html/bad-chains-1.png" width="936" />

???

If the sampling process does not work properly we might see something like this.

If you follow any given line you will notice that the chains do not sample the parameter space like in the previous plot.

That is, there is not a lot of random up-and-down movement... there is no hairy caterpillar... this means the chains have not mixed well.
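In case you want to draw trace plots like these yourself, here is a minimal sketch, assuming a fitted `brmsfit` object called `mod` (a name we use here just for illustration); `plot()` and `mcmc_plot()` are both part of brms:

```r
library(brms)

# Default diagnostic plot: marginal posterior densities plus trace plots
plot(mod)

# Or just the trace plots, drawn via the bayesplot backend
mcmc_plot(mod, type = "trace")
```

Healthy chains should overlap and wander around the same region: the hairy caterpillar from the previous slide.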
Now, obviously, determining whether your trace plot looks like a hairy caterpillar is not an exact science, but luckily there are also diagnostics that tell us about the chains' performance.

---

# Model assessment
### Rhat and effective sample size

???

Two of those diagnostics are Rhat and effective sample size, and we're going to talk about both.

You probably noticed over the course of the last two days that when we examine a summary of a brms model we see something like this...

--

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: articulation_rate ~ attitude 
##    Data: polite (Number of observations: 224) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       6.88      0.11     6.67     7.10 1.00     3677     2815
## attitudepol    -0.41      0.15    -0.70    -0.11 1.00     3657     2873
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     1.15      0.05     1.05     1.26 1.00     3684     3013
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

???

This is the same model we fit earlier today, looking at articulation rate as a function of attitude.

(Explain the "Samples" line.)

Now I'd like to call your attention to this area of the table...

--
background-image: url(./libs/mira2.png), url(https://raw.githubusercontent.com/jvcasillas/media/master/general/img/circle.png)
background-size: 800px, 410px
background-position: 150% 50%, 72% 72%

???

You have already noticed these three columns on the right-hand side: Rhat, Bulk_ESS and Tail_ESS. What is all this about?

Rhat is a convergence diagnostic that compares estimates between and within chains (I'm simplifying).

The key thing to remember is that this value should be less than 1.01; if not, it is a sign the chains did not mix well, i.e., the between- and within-chain estimates don't agree.

In this case, we can see that all the Rhat values are 1.00.

With regard to effective sample size, this metric "captures the number of independent draws that have the same amount of information as the dependent sample obtained by the MCMC algorithm".

That's a lot to unpack if you are new at this... the important thing to remember is that these numbers should be high (the maximum here is 4,000, the total number of post-warmup samples).

A general rule of thumb is that, at a minimum, the ESS should be 100 times the number of chains (100 × 4 = 400).

It's helpful to compare with a poorly fitted model.
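If you'd rather pull these diagnostics out programmatically than read them off the summary, here's a minimal sketch (again assuming a fitted `brmsfit` called `mod`; `rhat()` and `neff_ratio()` are the brms accessors, and the posterior package computes bulk and tail ESS):

```r
library(brms)
library(posterior)

# Rhat per parameter: anything above 1.01 is a warning sign
rhat(mod)

# Ratio of effective sample size to total draws (closer to 1 is better)
neff_ratio(mod)

# Or get Rhat, bulk ESS and tail ESS together in one table
summarise_draws(as_draws_df(mod))
```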
---

# Model assessment
### Rhat and effective sample size (bad example)

???

This model fits the same parameters, but I purposely gave it very bad priors.

--
background-image: url(./libs/mira2.png), url(https://raw.githubusercontent.com/jvcasillas/media/master/general/img/circle.png)
background-size: 800px, 410px
background-position: 150% 50%, 72% 72%

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: articulation_rate ~ attitude 
##    Data: polite (Number of observations: 224) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.01      0.00     0.01     0.02 1.14      181       77
## attitudepol    -0.00      0.00    -0.00     0.00 1.10       99       46
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.32      0.00     0.32     0.32 1.03      137      156
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

???

Notice that the Rhat values are higher than 1.01, and the ESS values are very low, well below 100 times the 4 chains (400).

---
background-color: #272822
background-image: url(./libs/warning.png)
background-size: contain

???

Luckily, when something like this happens brms will also give us warnings.

Note that it specifically suggests caution when analyzing the results and gives helpful suggestions:

1. Run more iterations on each chain
2. Take another look at your priors (the problem here)

--

.pull-left[

```r
mod <- brm(
  formula = y ~ x,
* prior = priors,
  chains = 4,
* iter = 4000,
  warmup = 2000,
  data = data
)
```
]

???

Notice that there is another warning (part 2): divergent transitions.

In this case brms also gives us a suggestion for combating this problem (increase adapt_delta), as well as a link with more information.

So we already know what the solution is, but let's unpack the problem...

---
class: middle, center
background-image: url(https://d23.com/app/uploads/2018/11/1180w-600h_112918_mira-royal-detective-announce-780x440.jpg)
background-size: 800px
background-position: 50% 40%

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>

# Divergent transitions

???

Again, this is an issue related to sampling the parameter space, so let's talk about parameter space...

---
class: center, middle

# So what is parameter space?

???

What is it? You may have noticed that I say this quite a bit.

---
class: middle, center

<img src="index_files/figure-html/data-space-1.png" width="864" />

???

Let's consider a simple example, y ~ x, where both variables are continuous and we can plot them in a scatter plot like this one. This is the data space.

We can fit a line through the data that is defined by an intercept and a slope.
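To make this concrete with a real fit, here is a small sketch: assuming a hypothetical model `mod` fit with `brm(y ~ x, data = data)`, brms names the coefficients `b_Intercept` and `b_x`, and plotting the draws against each other shows where the chains went in parameter space:

```r
library(brms)
library(posterior)

# One row per post-warmup draw, one column per parameter
draws <- as_draws_df(mod)

# Each point is one sample the chains took in parameter space
plot(draws$b_Intercept, draws$b_x,
     xlab = "Intercept", ylab = "Slope (x)",
     pch = 16, col = rgb(0, 0, 0, 0.2))
```

With well-mixed chains, the dense middle of that cloud is exactly the mountain we will look at in a moment.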
---

<br><br>

<iframe src="https://jvcasillas.shinyapps.io/shiny_parameter_space/" style="border:none;" width="100%" height="100%">

???

We could also plot the values of the intercept and the slope in a 2D plane, as you see in the right panel here.

This allows us to compare the data space with the parameter space in an interactive way.

If the value of the intercept were higher, the line would move up in the data space and the red point would move up in the parameter space.

If the value of the slope were higher, the line would become steeper in the data space and the red point would move to the right in the parameter space.

If I decrease the slope, the inclination of the line changes in the opposite direction and the red point moves to the left.

When we fit a Bayesian model the Markov chains sample the parameter space; in other words, they move around in the grid we see in the plot on the right.

Keep in mind that this is an extremely simple example in which we only fit a couple of parameters and can easily visualize the parameter space in two dimensions.

---
background-image: url(./libs/dark_density_posterior1.png)
background-size: contain
background-color: black

???

We can also visualize this in 3 dimensions, as the number of samples from a given area of the parameter space increases.

---
background-image: url(./libs/dark_density_posterior2.png)
background-size: contain
background-color: black

???

To understand this it helps to look at the plot with a slight inclination, which helps us realize that it is like a small mountain.

Now I'm going to show you a simulation of how the sampling procedure actually works.

Keep in mind that in our simplified example the parameter space is 3-dimensional, but for math reasons let's assume it's flipped upside down, like a bowl.

---
background-image: url(./libs/param_space.png)
background-size: contain

???

So now we are looking at a similar plot of a hypothetical parameter space.

We can imagine that the vertical axis is the intercept and the horizontal axis is the slope.

The blue circles and gradient in the center represent the depth of our bowl in 3D.

To understand how the sampling works, let's pretend that our bowl has a frictionless surface and that we don't know its shape (just that it is bowl-like), so we are going to survey it.

In this imaginary example we'll put a marble in the bowl in a random spot and flick it to see where it goes.

We'll keep track of where it stops, and those will be the values we sampled in the parameter space.

---
background-image: url(./libs/spiral.gif)
background-size: contain

???

I like to think about it a bit like this, if you have ever seen one of these.

---

<iframe src="https://chi-feng.github.io/mcmc-demo/app.html" style="border:none;" width="100%" height="100%">

???

This helps us develop intuition about the problem, but remember we rarely fit such simple models.

It isn't unusual to fit models with hundreds or even thousands of parameters.

As humans, we struggle to even imagine what a high-dimensional parameter space might look like... so plotting can't help us there.

---

# Model assessment
### Dealing with divergences

```r
mod <- brm(
  formula = y ~ x,
  prior = priors,
  chains = 4,
  iter = 4000,
  warmup = 2000,
* control = list(adapt_delta = 0.99),
  data = data
)
```

???

More often than not you can get rid of divergent transitions in simple models by adjusting adapt_delta, as you see in the code here.

This forces the sampler to "take smaller steps down the mountain".

The tradeoff is that your model may be slower to fit.

This may also be a solution to your problems when dealing with a more complex parameter space, but in my personal experience divergences often have to do with the assumptions of the generative model, like the choice of likelihood and sensible priors, so definitely think critically about those and double-check your choices.
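A quick way to see whether the fix worked is to count the divergent transitions directly. A minimal sketch, assuming a fitted `brmsfit` called `mod` (`nuts_params()` comes from bayesplot and works on brms models):

```r
library(brms)

# NUTS sampler diagnostics: one row per draw and quantity
np <- nuts_params(mod)

# "divergent__" is 1 for a divergent draw and 0 otherwise
sum(np$Value[np$Parameter == "divergent__"])
```

If that count stays above zero even with a higher adapt_delta, that's usually the cue to revisit the likelihood and the priors rather than the sampler settings.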
---

# Model assessment
### More information

- [About Markov chains](http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/)
- More on Rhat and effective sample size: [Vehtari et al. (2020)](https://arxiv.org/abs/1903.08008)
- [Explanations of Stan warnings](https://mc-stan.org/misc/warnings.html)
- Hoffman, M. & Gelman, A. (2011). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo". [arxiv.org/abs/1111.4246](https://www.arxiv.org/abs/1111.4246)
- Betancourt, M. (2017). "A Conceptual Introduction to Hamiltonian Monte Carlo". [arxiv.org/abs/1701.02434](https://arxiv.org/abs/1701.02434)

---
class: middle, center
background-color: black

# .white["We put a tiny sliver of scientific information into the model and then we bet on the things that can happen a large number of ways"]

<!--
Calculating a posterior is computationally costly. This is part of the reason why we didn't start seeing complex Bayesian models until relatively recently. As computational power has increased, so has our ability to approximate posterior distributions. Furthermore, we've developed faster methods of sampling from the parameter space.

The surface in this space is a giant bowl. Suppose that this surface is frictionless. Now what we do is flick the marble in a random direction. The marble moves across the bowl and eventually will turn around. If we take samples of the marble's position along its path, and then flick it in another random direction once it stops, then we can learn about the shape of the surface. It flows across the target and maps out its whole shape very efficiently.
-->