class: center, middle, inverse, title-slide # Bayesian Data Analysis for
Speech Sciences ## Application to regression ### Timo Roettger, Stefano Coretta and Joseph Casillas ### LabPhon workshop ### 2021/07/05 (updated: 2021-07-05) --- class: title-slide-section-grey, middle background-image: url(https://www.bellevuerarecoins.com/wp-content/uploads/bigstock-Coin-Flip-5807921.jpg) background-size: contain background-position: 100% 50% # What is probability? ??? Before we get started I'd like to ask you to take a second and think about "probability". What does it mean? What do you understand if somebody asks you the probability you are going to be on time? What is the probability it will rain tomorrow? Over the next two days we will be thinking about the notion of probability and how to define it in light of the research that we do --- background-image: url(libs/worldview.png) background-size: contain background-color: black # .white[The world according to <br>frequentists] ??? A key reason for *why* we will talk about probability has to do with frequentism You might have heard this term before, but if not, that's fine. I'm certain you already have a grasp of what it is. Over the next few days we will be thinking a lot about frequentism and comparing it directly with Bayesian methods. --- background-image: url(https://www.bellevuerarecoins.com/wp-content/uploads/bigstock-Coin-Flip-5807921.jpg) background-size: contain background-position: 100% 50% # What is frequentism? .pull-left[ - "Classical" statistics - Population parameters are fixed, actually exist - Probability refers to the long-run frequency of a given event - A sample of data is the result of one of an infinite number of exactly repeated experiments - Data are random, result from sampling a fixed population distribution ] ??? - So what is it? 
- Frequentist statistics represent what most of us have always been doing - It's a bit of a blanket term used to refer to "classical" statistics - The frequentist view of probability is what leads to some of the odd definitions of the statistical machinery you already know, like confidence intervals - A key idea is that population parameters, what we estimate when we do statistics, are fixed, actually exist - Under this view, probability refers to the long-run frequency of a given event - We can consider a sample of data as the result of one of an infinite number of exactly repeated experiments - In this sense, the data are random, and result from sampling a fixed population distribution - The key idea that I'd like you to keep in mind at this point is that this view of probability seems to be at odds with how we think/reason - For example when we ask something like what is the probability it will rain tomorrow? (we aren't typically thinking about long term frequencies or exactly repeated experiments) - We will be coming back to these ideas of *probability* and *frequentism* --- class: center, middle # What if... ### There isn't one true population parameter... <br> ??? But for now, consider this... What if there isn't one *true* population parameter -- but an entire distribution of parameters,<br> some more plausible than others ??? What if instead there is an entire distribution of parameters, some more plausible than others --- background-color: black background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/bayesian_bey.png) background-size: contain count: false ??? This idea is at the heart of Bayesian inference --- class: middle, center # Bayesian inference <br>is all about the posterior ??? 
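A quick reference for these notes: the posterior we keep returning to is just Bayes' rule applied to parameters (a standard identity, stated here for orientation, not something specific to this deck):

$$ p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)} $$

That is, the posterior combines the likelihood with the prior, normalized by the marginal likelihood of the data.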
One key idea that I'd like you to hold on to for right now is that Bayesian inference is all about what we call the posterior distribution I'm going to spend the rest of my time in this brief presentation building up intuition about the posterior. In other words, in Bayesian data analysis we produce/approximate a posterior distribution of possible parameters that we can then summarize in different ways to answer questions and quantify uncertainty This is a fundamental difference with regard to frequentism That is to say, there is no posterior distribution in frequentist statistics; you estimate only a single value for a parameter I am going to walk us through applying this idea to regression --- class: middle # .center[Applied to regression] ## .blue[Classical]: There is a single "true" line of best fit, and I'll give my best estimate of it. ??? As many of us already know, in standard linear regression in a frequentist framework, we typically estimate what we call a line of best fit, either using OLS or maximum likelihood estimation It's actually quite simple at its core: we have some blob of data and we try to shove a line through it to describe it the best we can -- ## **Bayesian**: There is a distribution of lines of fit...some more plausible than others...and I'll give you samples from that distribution. ??? Under a Bayesian framework, we derive a distribution of lines that are compatible with our data and our prior assumptions Some of these lines are more plausible than others, so we can use standard methods to summarize and quantify uncertainty about them Let's illustrate this to make it clear --- class: middle <img src="index_files/figure-html/cars-f-vs-cars-b1-1.png" width="936" /> ??? This scatter plot represents a traditional linear regression with a line of best fit +/- standard error --- class: middle count: false <img src="index_files/figure-html/cars-f-vs-cars-b2-1.png" width="936" /> ??? 
If we plot this next to the most plausible line from a Bayesian regression we see essentially the same line --- class: middle count: false <img src="index_files/figure-html/cars-b-draws1-1.png" width="936" /> ??? Here we plot the most plausible line along with 20 random draws of other plausible lines --- class: middle count: false <img src="index_files/figure-html/cars-b-draws2-1.png" width="936" /> ??? Here we plot the most plausible line along with 50 random draws of other plausible lines --- class: middle count: false <img src="index_files/figure-html/cars-b-draws3-1.png" width="936" /> ??? Here we plot the most plausible line along with 200 random draws of other plausible lines --- class: middle <img src="index_files/figure-html/cars-f-vs-cars-b-2-1.png" width="936" /> ??? Notice that we seem to converge on the same result, though we are not expressing our best guess of the "true" regression line. Instead we express uncertainty about the relationship by plotting the posterior distribution of lines that are compatible with our assumptions and the data The orange line merely represents the line that appears most often in the posterior, thus we assume it is the most plausible outcome We will build our first model in just a second, but first let's get to know the data set we will be using --- # Get to know the data set ```r dim(polite) ``` ``` ## [1] 224 27 ``` ??? 
The data set we will be using in this workshop is from Winter XXXX This study examined the effect of politeness on the realization of pitch We can see that the data set has 27 columns (or variables) and 224 observations --- ``` ## Rows: 224 ## Columns: 27 ## $ subject <chr> "F1", "F1", "F1", "F1", "F1", "F1", "F1", "F1", "F1"… ## $ gender <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F… ## $ birthplace <chr> "seoul_area", "seoul_area", "seoul_area", "seoul_are… ## $ musicstudent <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye… ## $ months_ger <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, … ## $ scenario <dbl> 6, 6, 7, 7, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7… ## $ task <chr> "not", "not", "not", "not", "dct", "dct", "dct", "dc… ## $ attitude <chr> "inf", "pol", "inf", "pol", "pol", "inf", "pol", "in… ## $ total_duration <dbl> 55.159, 28.508, 60.309, 40.825, 18.392, 13.551, 5.21… ## $ articulation_rate <dbl> 3.380492, 12.159277, 6.188191, 6.025763, 5.614616, 5… ## $ f0mn <dbl> 229.3, 181.1, 219.8, 175.8, 214.6, 210.9, 284.7, 265… ## $ f0sd <dbl> 4.18, 3.78, 4.18, 4.59, 4.28, 3.43, 2.09, 4.93, 3.22… ## $ f0range <dbl> 10.6, 9.0, 10.4, 13.1, 10.3, 10.5, 7.1, 18.3, 6.7, 1… ## $ inmn <dbl> 64.6, 63.2, 61.8, 61.7, 62.0, 61.5, 65.7, 64.9, 61.3… ## $ insd <dbl> 7.94, 7.75, 8.89, 8.51, 8.29, 8.95, 5.81, 8.52, 8.43… ## $ inrange <dbl> 25.3, 24.3, 28.4, 26.1, 25.4, 28.0, 19.1, 27.6, 26.4… ## $ shimmer <dbl> 0.08207, 0.08074, 0.08645, 0.08713, 0.06942, 0.08247… ## $ jitter <dbl> 0.01138, 0.01161, 0.01282, 0.01320, 0.01069, 0.01112… ## $ HNRmn <dbl> 18.1, 17.8, 17.0, 17.1, 18.5, 18.8, 20.9, 14.6, 20.6… ## $ H1H2 <dbl> 5.7893421, 6.3669091, 6.2444737, 5.0164835, 5.442307… ## $ breath_count <dbl> 11, 3, 6, 7, 3, 3, 1, 0, 1, 0, 0, 1, 1, 0, 9, 7, 11,… ## $ filler_count <dbl> 4, 2, 2, 4, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0… ## $ hiss_count <dbl> 0, 1, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0… ## $ nasal_count <dbl> 0, 3, 0, 0, 4, 0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ sil_count <dbl> 8, 3, 20, 7, 0, 2, 0, 1, 2, 2, 1, 0, 0, 0, 1, 1, 6, … ## $ ya_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0… ## $ yey_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ``` ??? We certainly won't be using all of the 27 variables, but we can see that Winter XXXX measured many things, such as F0 min, F0 sd, and F0 range, to name just a few We are going to begin by exploring the variable articulation_rate --- class: middle <img src="index_files/figure-html/ar-hist-1.png" width="1008" /> ??? We can use a histogram to get an idea of the distribution of the variable --- # Get to know the data set <img src="index_files/figure-html/ar-hist-rep-1.png" width="504" style="float:right" /> .pull-left[ Let's explore the `articulation_rate` variable - N = 224 - The range is 3.38 to 12.16. - The .blue[mean] is 6.68. - The .RUred[SD] is 1.16. - The 95% quantiles are 4.86, 9.11. ] ??? We can use standard methods to describe the distribution of values for articulation_rate For example, we have 224 observations, which range from 3.38 to 12.16 The average articulation rate is 6.68 (SD = 1.16) -- .pull-left[ We'll fit an intercept only model: $$ articulation\ rate \sim Normal(\mu, \sigma) $$ .center[ ```lm(articulation_rate ~ 1)``` ] ] ??? Now we will use this distribution in a simple linear model; the simplest model we can fit is an intercept-only model An intercept-only model is essentially a flat line (slope of 0); the intercept will be the mean of the distribution We will fit frequentist and Bayesian versions of this model and compare --- # Frequentist model We'll fit an intercept only model. The mean `articulation_rate` is 6.68 ±1.16 SD. -- ```r lm(articulation_rate ~ 1, data = polite) %>% summary() ``` ``` ## ## Call: ## lm(formula = articulation_rate ~ 1, data = polite) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.2996 -0.8456 -0.1150 0.6270 5.4791 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 6.68013 0.07783 85.83 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.165 on 223 degrees of freedom ``` ??? The frequentist version is easily fit with the `lm` function. We specify the criterion on the left side of the `~` and put a 1 on the right side to specify an intercept only model If we look at the summary of the model we see the estimate for the intercept is essentially the same as the mean of the articulation rate distribution --- # Bayesian model We'll fit an intercept only model. The mean `articulation_rate` is 6.68 ±1.16 SD. -- ```r brm(articulation_rate ~ 1, data = polite) %>% summary ``` ``` ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: articulation_rate ~ 1 ## Data: polite (Number of observations: 224) ## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; ## total post-warmup samples = 4000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept 6.68 0.08 6.53 6.83 1.00 3375 2668 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 1.17 0.06 1.07 1.29 1.00 3391 2832 ## ## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). ``` ??? We can fit the Bayesian version of this model using the brm function from the package `brms` (note the syntax is identical) There are some under-the-hood defaults that I am completely ignoring for now (priors and likelihood). We will dedicate time to these as we move along. 
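As an aside for these notes (a sketch only, not run in the slides): those hidden defaults could be written out explicitly. The prior values below are illustrative choices of ours, not the brms defaults.

```r
# Sketch: spelling out the family and priors that brm() otherwise fills in.
# The priors here are illustrative guesses, NOT the brms defaults.
library(brms)

b_mod_00_explicit <- brm(
  articulation_rate ~ 1,
  data   = polite,
  family = gaussian(),                        # default likelihood made explicit
  prior  = c(
    prior(normal(7, 5), class = Intercept),   # weakly informative prior on the mean
    prior(exponential(1), class = sigma)      # positive-only prior on the SD
  )
)
```

Running `get_prior(articulation_rate ~ 1, data = polite)` would list the actual defaults brms falls back on.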
Notice that one key difference is that this model estimated 2 parameters: the intercept and sigma (the mean and SD of the articulation rate distribution) Also notice that the estimates are essentially the same as the mean and SD we calculated before --- class: title-slide-section-blue, middle, center # So what's the difference? ??? At this point you might be saying to yourself "So what? The models are the same?" and you would have a good point. The information we have looked at in the summaries is essentially the same. The main difference, and the key takeaway from this section, is that the Bayesian model provides a posterior distribution of values for the parameters it estimates, the intercept and sigma. Let's see what we can do with a posterior distribution --- # Exploring the posterior .pull-left[ ```r post_00 <- posterior_samples(b_mod_00) ``` ``` ## b_Intercept sigma ## 1 6.645696 1.237890 ## 2 6.665647 1.125728 ## 3 6.670478 1.129146 ## 4 6.760504 1.196283 ## 5 6.870616 1.264796 ## 6 6.777265 1.259011 ## 7 6.659514 1.153504 ## 8 6.699240 1.157076 ## 9 6.691047 1.222263 ## 10 6.724507 1.208100 ## 11 6.716813 1.218816 ## 12 6.737001 1.323182 ## 13 6.697180 1.294377 ## 14 6.645625 1.065328 ## 15 6.692892 1.163420 ``` ] ??? I can turn the model object into a dataframe/tibble of the posterior using the as_tibble function (or posterior_samples) Because of the default values in brms I have 4000 samples. Here I print the first 15 -- .pull-right[ - This posterior distribution has 4000 draws of mean and SD values that are compatible with our data - The posterior is just like any other distribution of data - We can analyze/summarize it however we want ```r mean(post_00$b_Intercept) ``` ``` ## [1] 6.678269 ``` ```r quantile( post_00$b_Intercept, probs = c(0.025, 0.975) ) ``` ``` ## 2.5% 97.5% ## 6.527723 6.826254 ``` ] ??? We can now use all of the mathematical tricks we have up our sleeve to describe the posterior distribution in any way we find meaningful. 
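Because the posterior is a plain distribution of draws, any summary we can compute, we can report. A minimal self-contained sketch (using simulated stand-in draws so it runs on its own; in the slides the draws come from post_00$b_Intercept):

```r
# Stand-in posterior draws, simulated here so this snippet is self-contained;
# in the slides these come from post_00$b_Intercept.
set.seed(42)
draws <- rnorm(4000, mean = 6.68, sd = 0.08)

# Posterior probability that the mean articulation rate exceeds 6.5
p_above <- mean(draws > 6.5)

# The interval level is our choice: here an 89% credible interval
ci_89 <- quantile(draws, probs = c(0.055, 0.945))
```

The point is not the particular numbers but that exceedance probabilities, arbitrary interval levels, and so on are all one line of code on the draws.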
Here I calculate the mean value for the intercept samples as well as a 95% credible interval, but I can describe the posterior in any way I want This is how we can quantify our uncertainty about the estimates the model generates --- class: center, middle <img src="index_files/figure-html/ar-post-param-plot-1.png" width="864" /> ??? Here we describe the posterior distributions of the two model parameters in a single plot This is a handy way to summarize a model A key takeaway is that the peak of the distributions represents the value that is the most common --- class: center, middle <img src="index_files/figure-html/ar-post-paramspace-plot-1.png" width="864" /> ??? Here is a scatter plot of the entire posterior distribution (4000 samples) This is a handy way to visualize the posterior parameter space (we will come back to this) because (1) it emphasizes the fact that we are looking at a 2D space, a joint probability distribution, and (2) there are many estimates for b_Intercept and sigma that are essentially the same. When we have a lot of estimates that appear often in the posterior it implies that these values are more plausible I've highlighted the *most* plausible combination of the intercept and sigma, the medians (i.e., the values that are in the middle of the distribution), with the orange point. We can think of the joint distribution as a mountain and the orange point represents the highest point --- class: center, middle background-color: black <img src="index_files/figure-html/ar-post-density-plot-1.png" width="864" /> --- class: center, middle background-color: black background-image: url(./libs/dark_density_posterior.png) background-size: contain --- # Takeaways .Large[ - There are primarily two camps in statistics: frequentists and Bayesians - The two camps see probability in different ways - **Bayesian statistics is all about the posterior** ] ??? 
- There are primarily two camps in statistics: frequentists and Bayesians - The two camps see probability in different ways - Frequentists: parameters are real/true, provide best estimate of it - Bayesians: there are distributions of possible parameters, we summarize the distribution - Bayesian statistics is all about the posterior --- background-color: black background-image: url(https://i.pinimg.com/originals/1b/58/77/1b58772e23bb5b761ded9aa1d1e1d7e4.gif) background-size: contain ??? With this in mind, you are now on your way to becoming Bayesians We will continue building intuitions about these core ideas --- count: false class: title-slide-final background-image: url(https://raw.githubusercontent.com/learnB4SS/b4ss-theme/main/assets/Logo.png) background-size: 275px background-position: 50% 40% # .white[Thank you] <br><br><br><br><br><br><br><br><br><br><br><br> ### https://learnB4SS.github.io | | | | ------------------------------: | :---------------------- | |
| .white[@TimoRoettger] | |
| .white[@StefanoCoretta] | |
| .white[@jvcasill] | --- exclude: true (McElreath, 2018) --- class: title-slide-section-grey # References [1] R. McElreath. _Statistical rethinking: A Bayesian course with examples in R and Stan_. Chapman and Hall/CRC, 2018.