--- title: "An introduction to repeatability estimation with rptR" author: "Martin A. Stoffel, Shinichi Nakagawa & Holger Schielzeth" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{An introduction to repeatability estimation with rptR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, hide = TRUE} load("vignette_out.RData") ``` The concept of 'repeatability' relates to the way of quantifying the reliability of measurements. In a wider context, however, the repeatability offers insights into the components contributing to variability in the data. The repeatability describes the relative partitioning of variance into within-group and between-group sources of variance and is more generally referred to as the intra-class correlation (ICC). The hierarchical decomposition of variances has become the focus of several disciplines. In the context of behavioral research, for example, repeatability is often used to quantify stable individual differences and thus became a fundamental quantity of interest in the field of animal personality research. A very powerful tool for estimating repeatabilities is the mixed effects model framework. Random-effect predictors are used to estimate variances at different hierarchical levels. In the simplest case of a design where repeated measures (e.g. quantifications of behavior) were taken from the same features (e.g. individuals), the repeatability $R$ is calculated as the variance among group means (group-level variance $V_G$) over the sum of group-level and data-level (residual) variance $V_R$: $$ R = {V_G}/{(V_G+V_R)} $$ While it is relatively easy to extract point estimates for $V_G$ and $V_R$ from standard software packages, it is considerably more difficult to quantify the uncertainty in estimated $R$ values. The `rptR` package offers easy-to-use functions for estimating $R$ and its uncertainty. The package includes estimates of the repeatability (and also raw variances and *marginal* $R^2$; *sensu* Nakagawa and Schielzeth 2013) for Gaussian traits, binomial and Poisson-distributed traits. The theory for for the quantification of the repeatability for non-Gaussian traits is reviewed in Nakagawa & Schielzeth (2010). The functions build heavily on functions for fitting mixed models, in particular on `lmer` from the `lme4` package (Bates et al. 2015). Mixed-effect models are fitted within the function call and fitted model objects are returned as part of the class `rpt` return object. Uncertainty is quantified via parametric bootstrapping. In a nutshell, parametric bootstrapping works by (i) fitting the model, (ii) simulating new data based on the estimators of the original model while adhering to the original sampling design, (iii) refitting the model to the simulated data and storage of the repeatability of the model fit. Repeating steps (ii) to (iii) a large number of times (e.g. 1000 times) yields the sampling variance give the data structure under the assumption that the model is valid. ## Basic usage As usual, the CRAN-version of the package can be installed via a call to: ```{r package-download-cran, tidy = TRUE, eval=FALSE} install.packages("rptR") ``` Since the package is developed on Github (www.github.com), there is likely to be a developmental version that implements new features prior to release on CRAN. This current stable version on Github can be installed via a call to: ```{r package-download-github, tidy = TRUE, eval=FALSE} install.packages("devtools") devtools::install_github("mastoffel/rptR", build_vignettes = TRUE) ``` After one of those two steps, the package can be loaded by ```{r package-loading, tidy = TRUE, results='hide'} library(rptR) ``` The citation can be found with ```{r citation, tidy = TRUE, results='hide'} citation("rptR") ``` The central wrapper function for estimating the repeatability is `rpt`. The argument `datatype` decides whether the call is forwarded to the specialized functions `rptGaussian` for Gaussian traits, `rptBinary` for binary (0/1) outcomes, `rptProportion` for proportion data represented as joint vectors (or matrix) of successes and failures and `rptPoisson` for Poisson-distributed count data. The functions return S3 class `rpt` objects that can be viewed via the `str` function. The generic functions `summary`, `print` and `plot` are available for a more convenient display of the results. In the following we walk through the features of `rptR` starting with Gaussian models for introducing the basic plots and outputs, before addressing the specifics of count and binary trait analyses. ## A toy dataset As an example dataset we will use a simulated toy dataset that was introduced for a different purpose (Nakagawa & Schielzeth 2013). It offers a balanced dataset with rather simple structure, sizable effects and decent sample size, just right to demonstrate some features of `rptR`. Sufficient sample size is required in particular for the non-Gaussian traits, because those tend to be more computationally demanding and less rich in information per data point than simple Gaussian traits. In brief, the imaginary sampling design of the simulated dataset is as follows. Beetle larvae were sampled from 12 populations (`Population`) with samples taken from two discrete microhabitats at each location (`Habitat`). Samples were split in equal proportions and raised in two dietary treatments (`Treatment`). Beetles were sexed at the pupal stage (`Sex`) and pupae were kept in single-sex containers (`Container`). Phenotypes (`BodyL`, `Egg` and `Color`) were measured at the imaginal stage and are introduced below in the separate sections. ## Repeatability for Gaussian data As an example for Gaussian data, we will analyse body length (`BodyL`). To start with, we will analyse the repeatability at the level of the population that might arise through genetic differentiation and/or early environmental influences prior to sampling. ```{r data-preparation-gaussian, tidy = TRUE, fig.width=7, fig.height=3} data(BeetlesBody) str(BeetlesBody) hist(BeetlesBody$BodyL) table(BeetlesBody$Population) ``` We analyse the data using the `rpt` function with the argument `datatype = "Gaussian"`. It is possible to call `rptGaussian` directly, while omitting the `datatype` argument. As parametric bootstrapping and permutation can be slow, we recommend starting with shutting bootstrapping and permutation off by setting `nboot = 0` and `npermut = 0`. Note that the formula argument uses the same syntax as `lmer` from the `lme4` packages, because the formula-call is directly forwarded to `lmer`. Also note that `grname` has to be specified with quotation marks and that the term has to match exactly a random effect as it is used in the `formula` argument. The `data` argument is mandatory and is specified without quotes. ```{r gaussian-reduced-boot, tidy = TRUE} rpt(BodyL ~ (1|Population), grname="Population", data=BeetlesBody, datatype="Gaussian", nboot=0, npermut=0) ``` Since this works fine and we see sizable repeatability, we increase the number of parametric bootstraps to 1000, in order to evaluate the uncertainty in the repeatability estimate. The increase in the number of parametric bootstraps will not affect the point estimates, but interval estimates will be affected. At the same time, the permutation test can be kept switched off. The results are stored in an object named `rep1`. ```{r gaussian-full-boot, tidy = TRUE, eval = FALSE} rep1 <- rpt(BodyL ~ (1|Population), grname="Population", data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0) ``` It is also possible to increase the number of bootstrap iterations (and also of permutations, see below) by calling the function again with the `update = TRUE` and `rptObj` accepting the rpt object from the previous function call. Note that all other arguments (except for `nboot` and `npermut`) have to be identical to the original function call. We here add an additional 500 bootstraps to the original 1000 bootstraps. ```{r gaussian-full-boot-update, tidy = TRUE, eval = FALSE} rep1 <- rpt(BodyL ~ (1|Population), grname="Population", data=BeetlesBody, datatype="Gaussian", nboot=500, npermut=0, update=TRUE, rptObj=rep1) ``` The `rptR` package offers three generic function for convenient display. The `print` function displays the essence of the repeatability estimation: the point estimate R, standard error SE, confidence interval CI and $P$ values. The `summary` function is more verbose, summarizing parts of the data structure, distributional summaries of the bootstrap and permutation samples and the full details on likelihood ratio tests. Finally, the `plot` function shows the distribution of the parametric bootstrap samples along with the point estimate and the limits of the confidence interval. The plot can be customized by specifying additional graphical parameters as with the `cex.main` argument below (see `?par` for additional arguments). ```{r gaussian-print, tidy = TRUE, fig.width=7, fig.height=4} print(rep1) summary(rep1) plot(rep1, cex.main=1.0) ``` Parametric bootstrapping and permutations are inherently repetitive procedures that can be computed on multiple cores in parallel. The `rptR` package has build-in options for using multiple cores by setting `parallel = TRUE`. The default `ncores = NULL` determines the number of cores on your machine and uses all but one core. It is also possible to specify the number of cores to be used via the `ncores` argument. The statistical significance of the repeatability is tested by likelihood ratio tests (LRT), comparing the model fit of a model including the grouping factor of interest and one excluding it (i.e., constraining it group-level variance to zero). Because the LRT are conducted at the boundary of possible parameter space (against zero group-level variance), the difference in log-likelihoods is assumed to follow a mixture distribution of a $\chi^2$-distribution with 1 degree-of-freedom and point mass at zero. $P$ values based on LRT are part of the standard `print` output and the full details can be see by a call to `summary` (see above). In this case the variance among populations is highly significant and so is the repeatability at this level. Permutation can be used as a robust alternative or addition to LRT. In the simplest case this will work by randomizing the vector of group identities against the response vector followed by refitting the model to the randomized data and storage of the repeatability for each of many replications. However, in more complex models involving multiple random effects and/or fixed effects, this will also break the data structure between the grouping factor of interest and other aspects of the experimental design. We therefore construct randomized datasets by fitting a model without the grouping factor of interest and then add the permutated residuals to the fitted values of this model. This maintains the general data structure and any effects of other design aspects on the response while still breaking the link between grouping factor and the response. ```{r gaussian-full-boot-permut, tidy = TRUE, warning=1, eval = FALSE} rep2 <- rpt(BodyL ~ (1|Population), grname="Population", data=BeetlesBody, datatype="Gaussian", nboot=0, npermut=1000) ``` It is not unusual to see warnings about convergence problems, because the data are simulated with zero group-level variance and this might occasionally produce odd data. The results of the permutation tests are shown as part of `print` and `summary`, and can be visualized using the `type` argument in the `plot` function. The results are in general agreement with the LRT in this case, but note that the resolution of the permutation tests is limited by the number of iterations. Since the empirical estimate is always included as one permutation sample (following the reasoning that even under the null hypothesis, the observed data are a possible, albeit possibly unlikely, outcome, Manly 2007), the resolution is limited to $1/N$ were $N$ is the number of permutations. It is typical to see the spike at zero and an extended right tail if variances are low. ```{r gaussian-plot-permut, tidy = TRUE, fig.width=7, fig.height=4} plot(rep2, type="permut", cex.main=1.0) ``` The full return object can be viewed by the `str` functions, which shows, for example, that `R_boot` holds all parametric bootstrapping samples, `R_permut` all permutation samples (the first one of which is the point estimate), `all_warnings` hold all warning messages and `mod` holds the model fit as an `lmerMod` as returned by `lmer`. ```{r gaussian-str, tidy = TRUE, results='hide'} str(rep2) # Output omitted ``` ```{r gaussian-show-model} summary(rep2$mod) ``` ## Repeatabilities with multiple grouping factors for Gaussian data It is possible to include additional random effect components in the repeatability estimation. Indeed there does exist a repeatability for each grouping level (i.e. random effects) and `rptR` allows estimating one, multiple or all of them simultaneously by using the `grname` argument. For example, we might want to estimate the repeatability at the level of the source population (`Population`) as well as at the level of the housing group at the pupal stage (`Container`). Variances at the level of the pupal housing group (`Container`) might arise from various environmental influences, such as temperature or humidity, that might vary on a fine scale among containers. ```{r gaussian-multiple-randeff, tidy = TRUE, eval = FALSE} rep3 <- rpt(BodyL ~ (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0) ``` ```{r gaussian-multiple-randeff-print, tidy = TRUE} print(rep3) ``` The `print` and `summary` functions will show results for both grouping factors, while the action of the `plot` function needs to be controlled by the `grname` argument. The results show substantial repeatable variation among containers with still sizable (albeit slightly reduced) variation among populations. ```{r gaussian-multiple-randeff-plot, tidy = TRUE, fig.show='hold', fig.width=3.2} plot(rep3, grname="Container", type="boot", cex.main=0.8) plot(rep3, grname="Population", type="boot", cex.main=0.8) ``` In our example, containers are nested within populations in the sense that samples from the same population are spread over multiple containers, but each container houses individuals from only a single population. In principle it might have been possible to apply a crossed sampling design with samples from the same population spread across multiple containers while each container houses samples from multiple populations. Crossed designs are usually preferred, but their appliation might be constraint like in the case of beetles where it is not possible to follow individuals (and consequently population identities) across the pupal stage, because all external markings are lost. Nested sampling has consequences for how variances are interpreted. In our example, the population level repeatability captures all variation among populations (arising for genetic and environmental reasons) over all other variance components. But the container-level variance with population included in the model captures only among-container variance after controlling for variation among populations and is hence foremost environmental in origin. The population x container interaction variance, however, will appear as part of the container repeatability. The interaction variance might, in this case, arise from genotype-environment interactions or early environmental x late environment interactions (see Schielzeth & Nakagawa 2013 for a more detailed discussion of nesting). One might argue that the full container variance should contain the population variance. It is here possible to estimate the repeatability at the container-level without controlling for population and the results turn out to be almost equivalent to adding the population-level variance to the container-level variance. ```{r gaussian-containeronly-randeff, tidy = TRUE, eval = FALSE} rep4 <- rpt(BodyL ~ (1|Container), grname="Container", data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0) ``` ```{r gaussian-containeronly-randeff-print, tidy = TRUE, eval = TRUE} print(rep4) ``` ## Adjusted repeatabilities for Gaussian data All repeatability estimates so far were agreement repeatabilities in the sense that they represent the group-level variance over the sum of all other variance components. This situation changes when fixed effects come into play. Fixed effects explain part of the variances in the data and the variance assigned to fixed effects in the model will not appear in the denominator of the repeatability, at least under the `rptR` default settings. It is still possible to fit fixed effects in `rptR`, but the repeatability has then to be interpreted as the repeatability after controlling for fixed effects. This can be thought of as the repeatability when all observations are shifted along the slopes of the fixed effects covariates till they reach the ordinate. An equivalent estimate would have been derived if all data had been collected at the point where all covariates are zero. Since the procedure involves statistically adjustments, we have called this the _adjusted repeatability_ (Nakagawa & Schielzeth 2010). Estimating adjusted repeatability is facilitated in `rprR` by the `formula` argument. Fixed effect covariates can be included as in any `lmer` formula. We will here include `Treatment` and `Sex` as covariates. Note that the treatment was artificially applied, hence it potentially introduces excess variation to the sample that we might want to control for statistically. `Sex` is a more controversial example, because it represents part of the natural variation, but often we want to know the (adjusted) repeatabilities given that the sexual identities of the individuals are known (and assuming that repeatabilities are identical across the sexes). ```{r gaussian-adjusted, tidy = TRUE, eval = FALSE} rep5 <- rpt(BodyL ~ Treatment + Sex + (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0) ``` The results can be viewed and plotted as before and the significance of the fixed effect can be assessed from the fitted model. ```{r gaussian-adjusted-plot, tidy = TRUE, fig.show='hold', fig.width=3.2} print(rep5) plot(rep5, type="boot", grname="Container", cex.main=0.8) plot(rep5, type="boot", grname="Population", cex.main=0.8) ``` It turns out that most of the variance among containers was caused by sexual dimorphism (females are larger, as can be seen from the slope of `SexMale` in the fitted model `rep5$mod`), albeit some significant container repeatability remains even after controlling for sexual dimorphism. ## Estimating enhanced agreement repeatabilities It is sometimes desiarable to fit models with fixed effects without loosing the variance explained by fixed effects from the denominator. We have therefore introduced the argument `adjusted`. It defaults to `TRUE`, which means that adjusted repeatabilites are estimated. But when the argument is set to `FALSE`, the function will estimate what we here call the _enhanced agreement repeatability_, because it allows to fit an improved model without loosing the variance from the repeatability denominator of the repeatability. With `adjusted = FALSE` the variance explained by fixed effects will be calculated by the variance in the linear predictor (on the link scale for non-Gaussian data). This variance is added to the denominator of the repeatability equation just as other variance components. ```{r gaussian-enhanced-agreement, tidy = TRUE, eval = FALSE} rep6 <- rpt(BodyL ~ Treatment + Sex + (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0, adjusted=FALSE) ``` ```{r gaussian-enhanced-agreement-print, tidy = TRUE} print(rep6) ``` In our example, it turns out that the fixed effects explain a substantial amount of phenotypic variance and this reduces the relative contribution of `Population` and `Container`. We will see below that the change is entirely explained by the fraction of the phenotypic variance explained by fixed effects. Note that the repeatability for `Population` is now the same as in our initial model (`rep1`), although a far more complex model was fitted. The estimation of enhanced agreement repeatabilities works equally for the non-Gaussian models introduced below. ## Estimating the coefficient determination $R^2$ for fixed effects If we are able to feed the variance explained by fixed effects to the denominator of the repeatability calculation, it is also possible to extract the variance explained by fixed effects in the model as a quantity of interest. We have introduced a reserved term `Fixed` to the `grname` argument for this purpose. In combination with `ratio = FALSE`, `grname = "Fixed"` will return the variance in the linear predictor (on the latent scale for non-Gaussian data). In combination with `ratio = TRUE`, `grname = "Fixed"` will return the fraction of the total variance that is explained by variance in the linear predictor, i.e. by fixed effects. Incidently, the combination `grname = "Fixed", ratio = TRUE` estimates a form of the coefficient of determination $R^2$ that we have called the _marginal $R^2$_ of mixed effects models (Nakagawa & Schielzeth 2013). `rptR` provides a quantification of uncertainty for the variance explained by fixed effects by parametric bootstrapping as for other variance components. However, we have not implemented significance testing for the fixed effect variance. This should be quantified by other means. ```{r gaussian-r2, tidy = TRUE, eval = FALSE} rep7 <- rpt(BodyL ~ Treatment + Sex + (1|Container) + (1|Population), grname=c("Container", "Population", "Fixed"), data=BeetlesBody, datatype="Gaussian", nboot=1000, npermut=0, adjusted=FALSE) ``` ```{r gaussian-r2-plot, tidy = TRUE, fig.width=7, fig.height=4, eval = TRUE} print(rep7) plot(rep7, grname="Fixed", type="boot") ``` The variances explained by container and population are evidently unchanged, but it is now visible that the fixed effects alone explain almost 40% of the phenotypic variance in body length in our example. A big part of this is simply sexual dimorphism. This serves as an example, because in this particular case, we would probably be more interested in the adjusted repeatability controlling for `Treatment` and `Sex`. ## Repeatabilities for Poisson-distributed count data Count data represent a typical situation in which we might want to consider fitting generalized linear mixed-effects models (GLMM). Poisson error distributions with log-link are often appropriate for count data, although there is no guarantee that count data will be best modeled by Poisson error distributions (negative binomial distributions are an obvious alternative). Using the link function, the model parameters are fitted on the latent scale rather than on the original data scale and consequently the model output will represent results on the latent scale. While in a Gaussian model, there are different parameters for the location and the spread of the distribution (mean and variance), this is not the case in non-Gaussian models. The Poisson distribution, for example, has an inherent variance that is equal to the mean. This inherent variance needs to be respected when quantifying repeatabilities for non-Gaussian data. We have reviewed the literature on non-Gaussian link-scale variances (Nakagawa & Schielzeth 2010) and `rptR` is based on the formula presented therein. It turns out that the distribution-specific variance for a Poisson model with log link is approximated by $ln(1/\lambda + 1)$ where $\lambda$ is the average count. This approximation however becomes increasingly inaccurate at low $\lambda$. In order to account for additive overdisperison in Poisson and other non-Gaussian models (except for binary data), the current implementation of `rptR` internally adds an observation level random effect to the model specified in `formula`. An observation level random effect is an additional random effect term (named `Overdispersion`) with as many levels as there are observations to the specified model. The additive overdispersion explained by the new `Overdispersion` term is added to the distribution-specific variance in the denominator of the repeatability. See als section 'Multiplicative and additive overdispersion models' towards the end of the vignette. As an example, we will analyse the number of eggs laid by female beetles (`Eggs`) in our toy dataset. We will estimate the repeatability at the level of the population as well as at the level of the container simultaneously while controlling for excess variation introduced by the experimental nutritional treatment. There is no need to control for sex-differences in this sex-limited trait. ```{r data-loading-poisson, tidy = TRUE, fig.width=7} data(BeetlesFemale) hist(BeetlesFemale$Egg, nclass=max(BeetlesFemale$Egg)) ``` We will estimate the repeatability through the `rpt` function using the `datatype = "Poisson"` argument. Alternatively, it would be possible to fit the model directly using the `rptPoisson` function. The `formula` syntax follows the rules used in `glmer` from the `lme4` package, which is also the central engine for fitting the mixed effects model. ```{r poisson-model, tidy = TRUE, warning=1, eval = FALSE} rep8 <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesFemale, datatype="Poisson", nboot=1000, npermut=0) ``` In non-Gaussian models, it frequently happens that some of the bootstrap or permutation iterations do notF converge on a good model fit. Warnings are thrown by `glmer` and are collected in the `warnings` element of the `rpt` return object. As usual, the output can be viewed and displayed by `print`, `summary` and `plot`. The results show substantial repeatability at the level of the population (due to genetic differentiation or lasting environmental influence prior to sampling), but very little repeatability at the level of the container (indicating limited shared environmental effects). ```{r poisson-model-print} print(rep8) ``` There are two repeatabilities for Poisson GLMM, derived by a latent scale approximation and a original scale approximation. We have previously called the two estimates the latent scale repeatability and the original scale repeatability. However, they are both estimates of the same quantities, not entirely different repeatabilites, hence we now prefer to call them the _latent and orginal scale approximations_. In the case of Poisson GLMM, original scale approximations are actually the exact solutions, while link scale approximations are indeed approximate. However, in our experience the two typically give very similar results. ```{r poisson-model-plot, tidy = TRUE, fig.width=3.2, fig.show="hold"} plot(rep8, grname="Container", scale="link", cex.main=0.8) plot(rep8, grname="Population", scale="link", cex.main=0.8) plot(rep8, grname="Container", scale="original", cex.main=0.8) plot(rep8, grname="Population", scale="original", cex.main=0.8) ``` As with Gaussian models, it is possible to estimate enhanced agreement repeatabilities as well as the proportion of variances explained by fixed effects (marginal $R^2$). In this case the treatment effect explained less than 10% of the variance and including this fraction to the denomionator of the repeatability calculation gives marginally reduced repeatability estimates for population and container: ```{r poisson-enhanced-agreement, tidy = TRUE, eval = FALSE} rep9 <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population", "Fixed"), data=BeetlesFemale, datatype="Poisson", nboot=1000, npermut=0, adjusted=FALSE) print(rep9) ``` ```{r poisson-enhanced-agreement-print, tidy = TRUE} print(rep9) ``` The quantification of the distribution-specific variance of Poisson models with log-link requires a parameter $\lambda$ that represents the expected value, i.e. the global mean on the data scale. We have previously given different advice for estimating $\lambda$, including the inverse-link of the intercept on the latent scale. While this gives a reasonable approximation in many cases, the estimate will be biased for small expected values (smaller than about 2). A preferred alternative is to use the mean of the observations in the sample as we have also suggested previously. The is now the default option in `rptR` that is determined by the `expect` argument with its default `expect = "meanobs"`. There is also an analytical alternative that takes the intercept and variances as estimated on the latent scale. The expected value is $E[y] = exp(\beta_0 + var_{tot}/2)$, where $\beta_0$ represents the intercept on the latent scale and $var_{tot}$ represents the sum of the variances, including variance caused by fixed effects, on the latent scale. While this is generally a preferable solution, the estimates are susceptible to the distribution of covariates and, in most cases, it gives appropriate results only if all covariates are centered. Since centering is not enforced within `rptR`, we have made this an opt-in option that can be used by setting `expect = "latent"`. The repeatabilities estimated by the formulae presented in Nakagawa & Schielzeth (2010) can be requested by `expect = "liability"`. ## Repeatabilities for Binary data Binary data are sets of single observations with failure/success status (coded 0/1). They are a special case of proportion data (see below) with a sample size of 1 per observation. Because of a number of specificities, repeatabilities for binary and proportion data are separately coded in `rptR`. For example, binary models to not fit the addional overdispersion term, since there is not definable overdispersion in binary models. Binary data are handed over to `rpt` as a single vector of 0's and 1's or as a two-level factor. Males in our toy example occur in two distinct color morphs: dark and reddish-brown. We will analyze the repeatability of color morph identities (`Colour`) as an example for the repeatability of binary data. This is a test for spatial heterogeneity in morph ratios. ```{r data-loading-binary, tidy = TRUE, fig.width=7} data(BeetlesMale) table(c("dark","reddish")[BeetlesMale$Colour+1]) ``` We estimate the repeatability through the wrapper function `rpt` at the level of population and at the level of containers. As with Poisson models it frequently happens that some of the bootstrap or permutation iterations do not converge on a model fit and that `glmer` throws warnings that are collected in the `all_warnings` element of the `rpt` object. ```{r binary-model, warning=1, tidy = TRUE, fig.width=7, fig.height=4, eval = FALSE} rep10 <- rpt(Colour ~ (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesMale, datatype="Binary", nboot=1000, npermut=0) ``` ```{r binary-model-print, warning=1, tidy = TRUE, fig.width=7, fig.height=4} print(rep10) ``` As with Poisson GLMM, there are two approximations for the repeatability that we call the latent scale approximation and original scale approximation and in the case of Binomial GLMM, the two are both approximations of the same quantity and in our experience the two typically give very similar results. It turns out that there is significant repeatability at the level of populations, but not on the level of containers (at least not beyond variance among populations). It frequently happens that small variances are estimated as zero, because small variances are difficult to estimate with any level of confidence. For a check of robustness, we still estimate the enhanced agreement repeatabilities controlling for the treatment manipulation, but it turns out that this change to the model does hardly affect the repeatability estimates due to the low explanatory power of the treatment. ```{r binary-enhaned-agreement, tidy = TRUE, fig.width=3.2, fig.show='hold', eval = FALSE} rep11 <- rpt(Colour ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population", "Fixed"), data=BeetlesMale, datatype="Binary", nboot=1000, npermut=0, adjusted=FALSE) ``` ```{r binary-enhaned-agreement-plot, tidy = TRUE, fig.width=3.2, fig.show='hold'} print(rep11) plot(rep11, grname="Population", scale="link", cex.main=0.8) plot(rep11, grname="Population", scale="original", cex.main=0.8) plot(rep11, grname="Container", scale="link", cex.main=0.8) plot(rep11, grname="Container", scale="original", cex.main=0.8) plot(rep11, grname="Fixed", scale="link", cex.main=0.8) plot(rep11, grname="Fixed", scale="original", cex.main=0.8) ``` We have previously adviced to estimate the distribution specific variance for the link scale approximation as $\pi^2/3$ (Nakagawa & Schielzeth 2010). While this gives a reasonable approximation, it is preferable to estimate the link scale variance as $1 / (p(1-p))$ where $p$ is the expected probability of success (Nakagawa & Schielzeth 2016). The current version of `rptR` implements the more accurate approximation $1 / (p(1-p))$. As with Poisson GLMM, we have implemented two options for estimating the expectation for $p$. The default version with `expect = "meanobs"` takes the average succes rate across all observations. There is also a latent scale approximation (details are presented in Nakagawa & Schielzeth 2016) that can be used upon choice by setting `expect = "latent"`. As with Poisson GLMM, we consider the analytical solution via `expect = "latent"` more accurate and putitatively more robust to unbalanced sampling, but `expect = "meanobs"` seems safer, because it does not depend on how covariates are coded. ## Repeatabilities for Proportion data Counts of successes and failures are conveniently modelled as proportion data. They are handed over to `rpt` (and ultimately to `glmer`) as combined vectors of integers. We will illustrate the repeatability analysis of proportion data by aggregating color morph identities within containers. The analysis is equivalent to the analysis of the data as binary data with `Container` treated as a random effect as it was done above (model `rep10`). ```{r proportion-model, warning=1, tidy = TRUE, fig.width=7, fig.height=4, eval = FALSE} BeetlesMale$Dark = BeetlesMale$Colour BeetlesMale$Reddish = (BeetlesMale$Colour-1)*-1 BeetlesMaleAggr <- aggregate(cbind(Dark, Reddish) ~ Population + Container, data=BeetlesMale, FUN=sum) rep12 <- rpt(cbind(Dark, Reddish) ~ (1|Population), grname=c("Population"), data=BeetlesMaleAggr, datatype="Proportion", nboot=1000, npermut=0) print(rep12) ``` ```{r proportion-model-print, warning=1, tidy = TRUE, fig.width=7, fig.height=4} print(rep12) ``` ## Multiplicative and additive overdispersion models There are two alternative ways of modelling overdispersion in Poisson and proportion models. In the current implementation of `rptR` we use the underlying `glmer` model fits for estimating additive overdispersion as excess variation at the latent level (excess beyond variability arising from the sampling distribution alone). It is implemented as an additional random effect term (named `Overdispersion`) with as many levels as there are observations. This variation is added to the distribution-specific variance in the denominator of the repeatability. Multiplicative overdispersion, on the contrary, models overdispersion as a parameter (conveniently labelled $\omega$) that multiplies the distribution-specific variance. $\omega = 1$ is thus equivalent to no overdispersion beyond what is expected from the distribution. See Nakagawa & Schielzeth (2010) for details on additive and multiplicative overdispersion. The choice of additive versus multiplicative overdispersion is a matter of convenience. The data for the toy example was simulated as additive overdispersion and it happens that the powerful `glmer` function implements additive overdispersion. We have previously used `glmmPQL` for fitting multiplicative overdispersion models, but do not maintain this option any longer. ## Link functions In our examples we have used the default link functions of Poisson models (`log` link) and binary/proportion data (`logit` link). Alternative link functions are possible, as long as the distribution-specific variances are known for the distribution-link family (see Nakagawa & Schielzeth 2010 for an overview). Beyond what has been demonstrated above, `rptR` supports the `sqrt` link for Poisson and the `probit` link for binary/proportion data. In the above examples, the differences between alternative link functions are rather minor (but note that the original scale approximations are not available for Possion data with `sqrt` link and for binary/proportion models with `probit` link). ###Poisson models### ```{r poisson-model-link comparison-1, tidy = TRUE, warning=FALSE} print(rep8) # log link ``` ```{r poisson-model-link comparison-2, tidy = TRUE, warning=FALSE, eval = FALSE} rep8a <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesFemale, datatype="Poisson", link="sqrt", nboot=1000, npermut=0) ``` ```{r poisson-model-link comparison-3, tidy = TRUE, warning=FALSE} print(rep8a) # sqrt link ``` ###Binary models### ```{r biary-model-link comparison-1, tidy = TRUE, warning=FALSE} print(rep10) # logit link ``` ```{r biary-model-link comparison-2, tidy = TRUE, warning=FALSE, eval = FALSE} rep10a <- rpt(Colour ~ (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesMale, datatype="Binary", link="probit", nboot=1000, npermut=0) ``` ```{r biary-model-link comparison-3, tidy = TRUE, warning=FALSE} print(rep10a) # probit link ``` ###Proportion models### ```{r proportion-model-link comparison-1, tidy = TRUE, warning=FALSE} print(rep12) # logit link ``` ```{r proportion-model-link comparison-2, tidy = TRUE, warning=FALSE, eval = FALSE} rep12a <- rpt(cbind(Dark, Reddish) ~ (1|Population), grname=c("Population"), data=BeetlesMaleAggr, datatype="Proportion", link="probit", nboot=1000, npermut=0) ``` ```{r proportion-model-link comparison-3, tidy = TRUE, warning=FALSE} print(rep12a) # probit link ``` ## Estimating variances instead of ratios We wrote `rptR` with the focus primarily on estimating repeatabilities, including a quantification of their uncertainty. However, we realize that it is often of interest to estimate variances and their uncertainty directly rather than as ratios of variances. `rptR` can be used for this purpose too, by setting the `ratio` argument from `TRUE` (the default) to `FALSE`. For example, if we are interested in the link scale variances of colour morph variation, we can estimate the variances as: ```{r binary-variance-model, warning=1, tidy = TRUE, fig.show='hold', fig.width=3.2, eval = FALSE } rep13 <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population"), data=BeetlesFemale, datatype="Poisson", nboot=1000, npermut=0, ratio=FALSE) ``` ```{r binary-variance-model-plot, warning=1, tidy = TRUE, fig.show='hold', fig.width=3.2 } print(rep13) plot(rep13, grname="Container", scale="link", cex.main=0.8) plot(rep13, grname="Population", scale="link", cex.main=0.8) ``` Evidently, the significance of the random effect terms won't change no matter whether they are presented as variances or as variances devided by the phenotypic variance (repeatabilities). If we are interested in inspecting raw variances, we are often also interested in quantifying the amount of residual variance and its uncertainty, too. We have therefore introduced two addition reserved terms for the `grname` argument (besides `grname = "Fixed"`) to allow this. `Residual` will allow estimating the residual variance, which in the case of non-Gaussian models is the overdispersion on the latent scale plus the distribution-specific variance. `Overdispersion` will allow estimating the overdispersion variance and the latent scale, which in the case of Gaussian models (with implicit identity link) is equal to the residual estimated by `Residual`. It is necessary to specify at least one additional grouping level in the `grname` argument along with `Residual`. The results for `Container` and `Population` are unaffected and are the same as above (`rep13`). In our example there is significant overdispersion, meaning that besides treatment effects and variation among populations and among containers within populations there is more unexplained variation than expected from the Poisson distribution alone. ```{r binary-residual-model, warning=1, tidy = TRUE, fig.show='hold', fig.width=3.2, eval = FALSE} rep14 <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population", "Overdispersion", "Residual"), data=BeetlesFemale, datatype="Poisson", nboot=1000, npermut=0, ratio=FALSE) ``` ```{r binary-residual-model-plot, warning=1, tidy = TRUE, fig.show='hold', fig.width=3.2} print(rep14) plot(rep14, grname="Overdispersion", scale="link", cex.main=0.8) plot(rep14, grname="Residual", scale="link", cex.main=0.8) ``` It is pointless to test for the statistical significance of residual variance against a null hypothesis of no residual variance and we therefore omit this test. But for non-Gaussian models it is worthwhile testing if there is statistically significant overdispersion and we have therefore implemented the LRT test (but not the permutation test) for `Overdispersion`. In gerneral, however, be aware that each grouping level specified in `grname` will be tested in separate iterations if a permutation test is required. Hence it is recommened to be careful with the use of `npermut` in combination with multiple grouping levels. ```{r binary-overdisperion-test, warning=1, tidy = TRUE, fig.width=7, fig.height=4, eval = FALSE} rep15 <- rpt(Egg ~ Treatment + (1|Container) + (1|Population), grname=c("Container", "Population", "Overdispersion"), data=BeetlesFemale, datatype="Poisson", nboot=0, npermut=1000, ratio=FALSE) ``` ```{r binary-overdisperion-test-print, warning=1, tidy = TRUE, fig.width=7, fig.height=4} print(rep15) ``` It is possible to use both `Overdisperion` and `Residual` in combination with `ratio = FALSE` or with `ratio = TRUE`. The ratio of the residual variance to the total variance should then be interpreted as the 'non-repeatability'. We assume that the combination of these two terms with `ratio = FALSE` will be more commonly used. ## Random-slope models It is possible to fit random-slope models following the approach introduced by Johnson (2014). The function call will return the average repeatability across the distribution of the covariate in the sample (see Johnson 2014 for details). Permutation tests and likelihood ratio test are conducted with random-intercept and random-slope terms being jointly removed from the model. Note that random-slope models can be specified in multiple ways and the function will likely handle only those models well that have been specified in the classical way `(foo | grname)`, implicitly fitting random-intercepts, random slopes for the covariate `foo` and their correlation. Multiple random-slopes are possible, but models with covariances being constraint do currently not work. The following code gives and example for a repeatability estimation for body size where random slopes are fitted for `Sex` and `Treatment` at the level of populations. The resulting model allows for variance components to vary by sex and by treatment with the average across the sex/treatment levels being returned as the model output. ```{r gaussian-random-slopes, tidy = TRUE, eval = FALSE} rep16 <- rpt(BodyL ~ Treatment + Sex + (1|Container) + (Treatment + Sex|Population), grname=c("Population"), data=BeetlesBody, datatype="Gaussian", nboot=500, npermut=0, adjusted=FALSE) ``` ```{r gaussian-random-slopes-print, warning=1, tidy = TRUE, eval = TRUE} summary(rep16) ``` ## Concluding remarks The `rptR` package offers convenient extraction of (ratios of) variance components for multiple grouping levels for estimating repeatabilities, including the option to control for fixed effects in order to estimate adjusted repeatabilities. While this is easy-to-use we want to remind users to make sure their data is appropriately modeled. We therefore provide the fitted mixed model as part of the `rpt` object and recommend that this is inspected for any indications of violated model assumptions or missing predictors. Repeatability estimates from Poisson, binary and proportion data might often appear disappointingly low. This is likely to be a simple reflection of the fact that for non-Gaussian data, the numerator contains the among-group variance in predispositions (e.g. probabilities of successes, long-range averages of counts) while the observed data is discretely scattered around these expectations. A lot of the phenotypic variance is hence inherently random in origin and each datapoint contains only very limited information about the underlying predisposition. The estimation of repeatabilities for count, binary and proportion data will thus require substantial sample sizes and (ideally) optimized sampling designs. ## References Bates, D., M. Maechler, B. Bolker & S. Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67: 1-48. Johnson, P.C.D. (2014). Extension of Nakagawa & Schielzeth's $R^2_{GLMM}$ to random slopes models. Methods in Ecology and Evolution 5: 944-946 Nakagawa, S. & H. Schielzeth (2010). Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists. Biological Reviews 85: 935-956. Nakagawa, S. & H. Schielzeth (2013). A general and simple method for obtaining $R^2$ from generalized linear mixed-effects models. Methods in Ecology and Evolution 4: 133-142. Nakagawa, S. & H. Schielzeth (2016). The coefficient of determination $R^2$ and intra-class correlation ICC from generalized linear mixed-effects models revisited and expanded. bioRxive: doi 10.1101/095851. Manly, B. F. J. (2007). Randomization, bootstrap and Monte Carlo methods in Biology. 3rd edn. Chapman & Hall, Boca Rato. Schielzeth, H. & S. Nakagawa (2013). Nested by design: Model fitting and interpretation in a mixed model era. Methods in Ecology and Evolution 4: 14-24.