Multilevel data in brms distributional models with smooth terms (splines)

This short entry summarizes what I found (and learned) about using smooth terms (splines) within brms mixed-models fitted with the brms R-package and a zero-one-inflated beta family (zoib) and, about the importance of modelling sigma in these cases.

The main reference to this entry was found in Estimating Distributional Models with brms (Paul Bürkner, 2023) and specifically in the final section about Additive distributional models.

In my particular case, I’m trying to fit a mixed-model to predict the percentage (ratio) of online homework coverage (outcome) ranging from 0 to 1 where there are some zeroes and ones and it did not follow a normal distribution (this is why I’m using a zoib family).

To predict the response variable (homework coverage) I have to deal with smooth terms because one of my predictors (Attempts by item) didn’t show a linear relationship with the outcome. Furthermore, my model includes two varying intercepts: (1|Topic) and (1|IDnumber) because observations are nested by topics and pupils.

Smooth term predictors can be added into a brms formula as s(smooth_term_predictor) as indicated here and in this another useful example. However, as I want to go further and investigate how the smooth term «interact» with another linear predictor to explain the outcome, I have two choices depending on whether the linear predictor is continuous (continuous_predictor) or categorical (categorical_predictor). This is:

If it is continuous we can use:

t2(smooth_term_predictor, continuous_predictor)

and, if it is categorical we can use:

s(smooth_term_predictor by= categorical_predictor)

The interpretation of outcomes from a zoib mixed-models for these «particular» explanatory variables is not straightforward. Firstly, because coefficients are given in logit scale which need to be transformed to properly understand their effects, and secondly, because the effects of smooth terms are different depending on which zone of the spline you are looking at. Therefore, it is recommended to look at their plots (e.g., with brms::conditional_effects() or marginaleffects::plot_predictions()) to get a first intuition about their relationships (effects) with the outcome.

Interestingly, I also learned how important is «not to assume that the residual standard deviation sigma was the same» . On the contrary, and very often, one should expect it to vary by the smoothing terms and the varying intercepts as explained here:

[…] In the above example, we only considered the mean of the response distribution to vary by predictor1 and predictor2, but this my not necessarily reasonable assumption, as the variation of the response might vary with these variables as well. Accordingly, we fit splines and effects of district for both the location and the scale parameter, which is called sigma in Gaussian models.

Therefore, as soon as I began to model sigma, I noticed a big improvement in my models fits:

sigma ~ t2(smooth_term, continuous_predictor)+ (1|ID1|TOPIC) +(1|ID2|IDnumber)

Notice as well how, instead of adding the random intercepts as (1|TOPIC) +(1 |IDnumber), I now use (1|ID1|TOPIC) +(1 | ID2|IDnumber) in both formulas (the random effect part and the sigma part) to model the varying intercepts of both model parts as correlated as it is also explained in page 5 here.

Outcome ~ covariate + t2(smooth_term,continuous_predictor) + (1|ID1|TOPIC) + (1|ID2|IDnumber),

  sigma ~ covariate + t2(smooth_term,continuous_predictor) + (1|ID1|TOPIC) + (1|ID2|IDnumber)

For example, in one of my comparisons through loo_compare() I got the next difference for model1 (without modelling sigma) and model 2 (after modelling sigma):

check_loo_results <- loo_compare(model1, model2)
compare(check_loo_results)

          elpd_diff    se_diff


model1:       0.0        0.0
model2:     -16.5        7.5 

This is what I think I know at the moment of writing this post about smooth terms and about modelling residual variance. Readers (if any) must be cautious about my explanations because I’m still a novice in most of these things.

Install CmdStan for R in Windows

Assuming that Rtools is already installed, these are the steps to install cmdstanr from RStudio in Windows 10:

install.packages("devtools")
devtools::install_github("stan-dev/cmdstanr")
library(cmdstanr)

This is cmdstanr version 0.5.3 – CmdStanR documentation and vignettes: mc-stan.org/cmdstanr – CmdStan path: C:/Users/tonis/Documents/.cmdstan/cmdstan-2.30.0 – CmdStan version: 2.30.0 A newer version of CmdStan is available. See ?install_cmdstan() to install it. To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.

check_cmdstan_toolchain()

Error: Rtools42 installation found but the toolchain was not installed. Run cmdstanr::check_cmdstan_toolchain(fix = TRUE) to fix the issue. cmdstanr::check_cmdstan_toolchain(fix = TRUE)

Installing mingw32-make and g++ with Rtools42. The C++ toolchain required for CmdStan is setup properly!

check_cmdstan_toolchain()

The C++ toolchain required for CmdStan is setup properly!

install_cmdstan() # This check, download and install the latest release (it takes long!)

[…]

* Finished installing CmdStan to C:/Users/you/Documents/.cmdstan/cmdstan-2.31.0 CmdStan path set to: C:/Users/you/Documents/.cmdstan/cmdstan-2.31.0

 

«The install_cmdstan() function attempts to download and install the latest release of CmdStan. Installing a previous release or a new release candidate is also possible by specifying the version or release_url argument. The rebuild_cmdstan() function cleans and rebuilds the CmdStan installation. Use this function in case of any issues when compiling models.«

From time to time it might be worth to check if a new version is available. In such case, install_cmdstan() and rebuild_cmdstan()will do rest.

If all goes well, you can use backend = "cmdstanr« to fit brms models.

The main reference to this summary is the CmdStan User’s Guide.

Read also this page with some other recommendations, for example:

Instead of installing R in the standard location, C:\Program Files\R\R-x.y.z, I suggest that you use C:\R\R-x.y.z. Again, x.y.z is the current version of R. This will allow you to install packages in the main R library without running R with administrator privileges and may avoid problems that sometimes occur when there are spaces in paths.


Read also the image in this tweet about how to install Stan and related packages after updating R to a new version (e.g. from 4.2 to 4.3). Basically, the author says that «I first completely removed R, RTools and cmdstanr, then installed R 4.3 (alpha), RTools 4.3, then make sure your path/environmental variables (on Windows) point to the correct RTools, then I installed Stan and cmdstanr (again, check path to cmdstanr in PATH variable).«

DAGitty — draw and analyze causal diagrams

Visit this site to create causal diagrams online or in R.

DAGitty is a browser-based environment for creating, editing, and analyzing causal diagrams (also known as directed acyclic graphs or causal Bayesian networks). The focus is on the use of causal diagrams for minimizing bias in empirical studies in epidemiology and other disciplines. For background information, see the «learn» page.

Worth reading: Is this the future of assessment? (Christodoulou, D., 2022)

Another interesting «worth to read» piece is found in TES magazine with the title: «Is this the future of assessment?».  The author is Daisy Christodoulou and she introduces the topic as follows:

Calls for exams to be scrapped have grown louder in the wake of the Covid-19 pandemic. But rather than getting rid of exams, we may just need to reimagine them

Some of my takeaways are:

  • Scores (numbers) are better than grades (letters) to assess students.
  • The war against «traditional summative closed-book  exams» in school education seems to be a worldwide trending topic.
  • However, «Many non-exam assessment techniques introduce grey areas and ambiguities into the assessment process that are vulnerable to being exploited» and, as Daisy also points out: «Teacher assessment is subject to human biases«.
  • On the other hand, on-screen assessments can play an important complementary role but adopters must be aware of that it introduces complexity with both opportunities and risks. (i.e., mode effect , backwash-effect and other consequences).
  • The arrival of  ChatGPT AI will have strong multi-level consequences in education. Thus, practitioners must start thinking how to deal with these AI technologies in the near (immediate) future.

Enjoy the reading!

Worth reading: A Critical Perspective on Effect Sizes (Quentin A., 2022)

Effect sizes (ES) are usually required by journal editors to help readers to better understand research outcomes. However, ES should be understood and, when reported, researcher must provide not only which ES metric is used (Hedge’s g, Cohen’s d, and the Glass Δ,  η2 and ηp2 …) but also additional information such as sample sizes and statistical power to detect small effect sizes of interest (sesoi) to make them fully informative.:

In this post «A Critical Perspective on Effect Sizes (Quentin A., 2022)» the André Quentin introduces the topic as follows:

In this blog post, I am sharing a slightly-modified version of a presentation I gave to HEC Montreal’s “Research Day on Open Science and Replications in Marketing”. It is called “A Critical Perspective on Effect Sizes”.

I start with a quick refresher on what effect sizes are, discuss the conditions under which effect sizes contain useful information, and conclude by offering some heuristics to evaluate effect sizes.

Enjoy the reading 🙂

Protegido: Questionable research findings

Este contenido está protegido por contraseña. Para verlo, por favor, introduce tu contraseña a continuación:

Covid-19: Cuarentena día 6. La situación de hoy en España

El post de hoy es un gráfico que muestra la tasa de infectados por cada 100.000 habitantes en las comunidades autónomas (regiones) de España:

Tasa diaria de infectados por n-Covid19 por cada 100.000 habitantes en las distintas Comunidades Autónomas (Regiones) de España.

Tasa diaria de infectados por n-Covid19 por cada 100.000 habitantes en las distintas Comunidades Autónomas (Regiones) de España.

Se aprecia que La Rioja (1ª) y Navarra (3ª) comparten con Madrid (2ª) las tres primeras posiciones a pesar de que de las dos primeras apenas se habla. Las dos líneas verticales señalan fechas importantes en la evolución de la pandemia en España. La negra corresponde al 8 de marzo y la roja al 15 de marzo.

Covid-19: Cuarentena día 4. Dr. Tedros, la PCR y los datos de Italia.

En este segundo post que escribo en el 4ª día de cuarentena (leer aquí el primero) hablaré de tres asuntos  relacionados con la pandemia #Covid-19. Queda para una próxima entrada seguir analizando las herramientas para ofrecer soporte y feedback educativo online sobre las que ya hablé en el post anterior.

Covid-19

El Lunes, 16 de Marzo, mientras los españoles comenzábamos oficialmente la cuarentena de 15 días impuesta por el gobierno, el Dr. Tedros Adhanom Ghebreyesus Director-General de la Organización Mundial de la Salud (OMS) comparecía en rueda de prensa y anunciaba:

Continue reading ‘Covid-19: Cuarentena día 4. Dr. Tedros, la PCR y los datos de Italia.’ »