--- title: "inbreedR step by step" author: "Martin A. Stoffel" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{inbreedR step by step} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} library(knitr) knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = FALSE, fig.width = 5, fig.height = 4) # warning = FALSE ``` ## inbreedR The idea behind `inbreedR` is to provide a consistent framework for the analysis of inbreeding and heterozygosity-fitness correlations (HFCs) based on genetic markers. This vignette gives a practical introduction into the concept of the package and how to use the functions. For a more concise theoretical background and a citation, please refer to our paper [@stoffel2016inbreedr]. We are happy about any suggestions and feedback on the package. Just write a mail to martin.adam.stoffel[at]gmail.com. ## Installation `inbreedR` is available on [CRAN](https://cran.r-project.org/). Here is the code to download and install the current stable release. ```{r, eval = FALSE} install.packages("inbreedR") ``` The development version can be downloaded from [GitHub](https://github.com/mastoffel/inbreedR) with the following code: ```{r, eval = FALSE} install.packages("devtools") devtools::install_github("mastoffel/inbreedR", build_vignettes = TRUE) ``` The package provides documentation for every function. To get an overview, just look at `inbreedR`'s help file. ```{r} library("inbreedR") ``` ```{r, eval = FALSE} ?inbreedR ``` ## inbreedR contains the following functions: * `convert_raw`: Converts a common format for genetic markers (two columns per locus) into the `inbreedR` working format, type `?convert_raw` for detailed information. * `check_data`: Checks whether the genotypes `data.frame` or `matrix` has the correct format for the `inbreedR` functions. * `sMLH`: Computes standardized multilocus heterozygosities [@coltman1999parasite]. * `MLH`: Computes multilocus heterozygosity. * `g2_microsats`: Calculates $g_2$, a measure if identity disequlibrium (ID) from smaller datasets, such as microsatellites. Based on the formula from @DAVID:2007kgba. * `g2_snps`: Calculates $g_2$ for larger datasets, such as SNPs. Allows for parallelization to speed up computation times. Based on the formula from the appendix of @Hoffman:2014bxba. * `HHC`: Computes heterozygosity-heterozygosity correlations, another measure of identity disequilibrium [@balloux2004does]. * `r2_Wf` : Calculates the expected squared correlation of inbreeding-level (f) with a fitness trait (W) according to @szulkin2010heterozygosity. * `r2_hf`: Calculates the expected quared correlation of inbreeding-level (f) with multilocus heterozygosity (h) according to @szulkin2010heterozygosity. * `simulate_g2`: A simulation that allows the user to draw different numbers of markers independently from a simulated genome and calculate respective $g_2$ values. Can be used to evaluate the effects of the number of individuals and loci on the precision and magnitude of $g_2$. * `simulate_r2_hf`: Works equivalent to `simulate_g2`. However the estimates are the expected squared correlations between inbreeding and heterozygosity $r^2(h, f)$. * `plot.inbreed`: Plots for objects of class `inbreed` ## Example datasets In the following sections, the functionality of `inbreedR` is illustrated using genetic and phenotypic data from an inbred captive population of oldfield mice (Peromyscus polionotus) [@Hoffman:2014bxba]. These mice were paired to produce offspring with a range of inbreeding coefficients (0-0.453) over six generations of laboratory breeding and the resulting pedigree was recorded, from which individual f values were calculated. Example files are provided containing the genotypes of 36 P. polionotus individuals at 12 microsatellites and 13,198 SNPs respectively. Data on body mass at weaning, a fitness proxy, are also available for the same individuals. ```{r, echo=TRUE, results='hide'} library(inbreedR) data("mouse_msats") # microsatellite data data("mouse_snps") # snp data data("bodyweight") # fitness data ``` ## Data format and checking The working format of `inbreedR` is an `individual * loci` matrix or data frame in which rows represent individuals and each column represents a locus. If an individual is heterozygous at a given locus, it is coded as 1, whereas a homozygote is coded as 0, and missing data are coded as NA. The `mouse_snps` dataset accompanying the package is already formatted in the right way. ```{r} data("mouse_snps") mouse_snps[1:10, 1:10] ``` You can check whether your data is in the right format with the `check_data` function, which gives an error with a message when something went wrong and `TRUE` otherwise. Look up the documentation with `?check_data` to see what exactly this functions checks for. ```{r} check_data(mouse_snps, num_ind = 36, num_loci = 13198) ``` ### Conversion from a more common format `convert_raw` is a function to convert a more common format, where each locus is represented by two columns (alleles), into the `inbreedR` working format. Microsatellite data is often formatted like `mouse_msats`, which is the second dataset accompanying the package. ```{r} data("mouse_msats") mouse_msats[1:8, 1:8] ``` To convert it into the `inbreedR` working format, just use the `convert_raw` function. ```{r} mouse_microsats <- convert_raw(mouse_msats) mouse_microsats[1:8, 1:8] ``` The same procedure works when you have letters (e.g. basepairs 'A', 'T') in two adjacent columns instead of microsatellite allele lengths. ### Loading and extracting genotype information from VCF files SNP data will naturally occur as VCF file after variant calling. Here is a short workflow how to load a VCF file into R with the `vcfR package`, extract the genotypes and transform them into a 0/1 format to be used within `inbreedR`. ```{r, eval = FALSE} # install.packages("vcfR") # install.packages("reshape") library(vcfR) library(reshape2) vcf_file <- "yourvcffile.vcf" # read vcf vcf <- read.vcfR(vcf_file, verbose = FALSE ) # extract genotypes gt <- extract.gt(vcf) # transpose and data.frame gt <- as.data.frame(t(gt), stringsAsFactors = FALSE) # NA handling gt[gt == "."] <- NA # split columns snp_geno <- do.call(cbind, apply(gt, 2, function(x) colsplit(x, "/", c("a","b")))) # convert mouse_snp_genotypes <- inbreedR::convert_raw(snp_geno) # check data check_data(mouse_snp_genotypes) ``` ## A short theory of heterozygosity-fitness correlations (HFC) Most HFC studies solely report the correlation between heterozygosity (*h*) and fitness (*W*). However, according to HFC theory, this correlation results from the simultaneous effects of inbreeding level (*f*) on fitness ($r(W,f)$) and heterozygosity ($r(h,f$)) [@Slate:2004iibacadaeafagahaia; @szulkin2010heterozygosity]: $$ r(W,h) = r(h,f)r(W,f) $$ (Equation 1) Although we cannot directly measure the inbreeding level *f*, we can use the extent to which heterozygosity is correlated across loci, termed identity disequilibrium (ID), as a proxy to characterize the distribution of *f* in populations. A measure of ID that can be related to HFC theory is the two-locus heterozygosity disequilibrium, $g_2$ [@DAVID:2007kgba], which quantifies the extent to which heterozygosities are correlated across pairs of loci. Based on $g_2$ as an estimate of ID, it is then possible to calculate $\hat{r}^2(h, f)$ as follows [@szulkin2010heterozygosity]: $$\hat{r}^2(h, f) = \frac{\hat{g}_{2}}{\hat{\sigma}^2(h)}$$ (Equation 2) Finally, the expected determination coefficient between a fitness trait and inbreeding level can simply be derived be rearranging equation 1 [@szulkin2010heterozygosity]: $$\hat{r}^2(W, f) = \frac{\hat{r}^2(W, h)}{\hat{r}^2(h, f)}$$ (Equation 3) Software is already available for calculating $g_2$ from microsatellite datasets [@DAVID:2007kgba]. However, for larger datasets, e.g. SNPs, the original formula is not computationally practical, as it requires a double summation over all pairs of loci. For example, with 15.000 loci, the double summations take of the order of 0.2 x 109 computation steps. For this reason, `inbreedR` implements a computationally more feasible formula based on the assumption that missing values do not vary much between pairs of loci [@Hoffman:2014bxba]. In turn, the $g_2$ parameter builds the foundation for the implementation of the above framework to analyse HFCs, which is recommended to be routinely computed in future HFC studies [@szulkin2010heterozygosity]. ## Identity disequilibrium The package provides two functions to calculate **$g_2$**, a proxy for Identity disequilibrium, for both small datasets (e.g. microsatellites) and large datasets (e.g.SNPs). * The `g2_microsats` function implements the formula given in @DAVID:2007kgba. * For large datasets, e.g. SNPs, the `g2_snps` function implements a computationally more feasible formula. This function also provides an additional argument for parallelization which distributes bootstrapping and permutation across cores. The results of both functions can be plotted as histograms with CIs. Have a look at the help files with `?g2_microsats` and `?g2_snps` for more information on the formulas. For both microsatellites and SNPs, `inbreedR` calculates confidence intervals by bootstrapping over individuals. It also permutes the genetic data to generate a P-value for the null hypothesis of no variance in inbreeding in the sample (i.e. $g_2$ = 0). ```{r, echo=TRUE, results='hide'} g2_mouse_microsats <- g2_microsats(mouse_microsats, nperm = 100, nboot = 100, CI = 0.95) g2_mouse_snps <- g2_snps(mouse_snps, nperm = 100, nboot = 10, CI = 0.95, parallel = FALSE, ncores = NULL) ``` To display a summary of the results just print the output of an `inbreedR` function. ```{r} g2_mouse_microsats ``` `plot` shows the distribution of bootstrap results including the confidence interval. ```{r g2_bootstrapping, echo=TRUE, fig.cap="Distribution of g2 from bootstrapping with confidence interval", fig.width=7, fig.height=4} par(mfrow=c(1,2)) plot(g2_mouse_microsats, main = "Microsatellites", col = "cornflowerblue", cex.axis=0.85) plot(g2_mouse_snps, main = "SNPs", col = "darkgoldenrod1", cex.axis=0.85) ``` Another approach for estimating ID is to divide the marker panel into two random subsets, compute the correlation in heterozygosity between the two, and repeat this hundreds or thousands of times in order to obtain a distribution of **heterozygosity-heterozygosity correlation coefficients (HHCs)** [@balloux2004does]. This approach is intuitive but can be criticised on the grounds that samples within the HHC distribution are non-independent. Moreover, $g_2$ is preferable because it directly relates to HFC theory (equation 2). The `HHC` function in `inbreedR` calculates HHCs together with confidence intervals, specifying how often the dataset is randomly split into two halves with the `reps` argument. The results can be outputted as text or plotted as histograms with CIs. ```{r, echo=TRUE, results='hide'} HHC_mouse_microsats <- HHC(mouse_microsats , reps = 1000) HHC_mouse_snps <- HHC(mouse_snps, reps = 100) ``` ```{r} HHC_mouse_microsats ``` ```{r HHC, echo=TRUE, fig.cap="Distribution of heterozygosity-heterozygosity correlations", fig.width=7, fig.height=4} par(mfrow=c(1,2)) plot(HHC_mouse_microsats, main = "Microsatellites", col = "cornflowerblue", cex.axis=0.85) plot(HHC_mouse_snps, main = "SNPs", col = "darkgoldenrod1", cex.axis=0.85) ``` ## HFC parameters Assuming that HFCs are due to inbreeding, it is possible to calculate both the expected correlation between heterozygosity and inbreeding level ($\hat{r}^2(h, f)$) and the expected correlation between a fitness trait and inbreeding ($\hat{r}^2(W, f)$) as described in Equation 1. These are implemented in `inbreedR` using the functions `r2_hf` and `r2_Wf`. Equal to the `glm` function, the distribution of the fitness trait can be specified in the `family` argument, as shown below: ```{r, echo=TRUE} # r^2 between inbreeding and heterozygosity hf <- r2_hf(genotypes = mouse_microsats, type = "msats") # r^2 between inbreeding and fitness Wf <- r2_Wf(genotypes = mouse_microsats, trait = bodyweight, family = gaussian, type = "msats") ``` In addition, bootstrapping over individuals can be used to estimate confidence intervals around these estimates. Also, there is the possibility of parallelization, by specifying `parallel = TRUE` ```{r, echo=TRUE, results = "hide"} # r^2 between inbreeding and heterozygosity with bootstrapping hf <- r2_hf(genotypes = mouse_microsats, nboot = 100, type = "msats", parallel = FALSE) ``` Plotting the histogram with confidence interval for `r2_hf`. ```{r} plot(hf) ``` ## Workflow for estimating the impact of inbreeding on fitness using HFC @szulkin2010heterozygosity in their online Appendix 1 provide a worked example of how to estimate the impact of inbreeding on fitness within an HFC framework. Below, we show how the required calculations can be implemented in `inbreedR`. We are now describing a coding workflow to estimate useful parameters for the interpretation of HFCs. We compare the results based on microsatellite and SNP data derived from a single inbred population of oldfield mice. We start with the estimation of identity disequilibrium ($\hat{g}_2$) and calculation of the distribution variance of standardized multilocus heterozygosity ($\hat{\sigma}^2(h)$), followed by the regression slope of fitness on heterozygosity ($\hat{\beta}_{Wh}$) and the three correlations from equation 1. Example code for the microsatellite dataset is shown below and the results for both microsatellites and SNPs are given in Table 1. ```{r, echo=TRUE, results='hide'} # g2 g2 <- g2_microsats(mouse_microsats) # calculate sMLH het <- sMLH(mouse_microsats) # variance in sMLH het_var <- var(het) # Linear model of fitness trait on heterozygosity mod <- lm(bodyweight ~ het) # regression slope beta <- coef(mod)[2] # r2 between fitness and heterozygosity Wh <- cor(bodyweight,predict(mod))^2 # r2 between inbreeding and heterozygosity hf <- r2_hf(genotypes = mouse_microsats, type = "msats") # r2 between inbreeding and fitness Wf <- r2_Wf(genotypes = mouse_microsats, trait = bodyweight, family = gaussian, type = "msats") ``` ```{r, echo=FALSE} library(knitr) df_msats <- data.frame(g2[["g2"]], het_var, beta, Wh, hf$r2_hf_full, Wf$r2_Wf_full) row.names(df_msats) <- c("microsats") kable(df_msats, digits=3, caption="Descriptors of HFCs", col.names = c("$\\hat{g}_2$", "$\\hat{\\sigma}^2(h)$", "$\\hat{\\beta}_{Wh}$", "$\\hat{r}^2_{Wh}$", "$\\hat{r}^2_{hf}$","$\\hat{r}^2_{Wf}$")) ``` ## Sensitivity to the number of genetic markers So far, the uncertainty of $g_2$ and other estimates is assessed via bootstrapping and confidence intervals. However, for planning future studies it might be of interest how the uncertainty of $g_2$ changes by increasing or decreasing the number of genetic markers.`simulate_g2` can be used to evaluate the effects of the number of individuals and loci on the precision and magnitude of $g_2$. The user specifies the number of simulated individuals (`n_ind`), the subsets of loci (`subsets`) to be drawn, the heterozygosity of non-inbred individuals (`H_nonInb`) and the distribution of *f* among the simulated individuals. The *f* values of the simulated individuals are sampled randomly from a beta distribution with mean (`meanF`) and variance (`varF`) specified by the user (e.g. as in [@wang2011]). This enables the simulation to mimic populations with known inbreeding characteristics, or to simulate hypothetical scenarios of interest. For computational simplicity, allele frequencies are assumed to be constant across all loci and the simulated loci are unlinked. Genotypes (i.e. the heterozygosity/homozygosity status at each locus) are assigned stochastically based on the *f* values of the simulated individuals. Specifically, the probability of an individual being heterozygous at any given locus ($H$) is expressed as $H = H0(1-\f)$ , where $H0$ is the user-specified heterozygosity of a non-inbred individual and *f* is an individual's inbreeding coefficient drawn from the beta distribution. The `type` argument specifies the $g_2$ formula to use. With `type = snps`, simulations with larger loci sets are possible. However, bear in mind that the function creates independent loci for every repition, which leads to a rapid increase in working memory use and computation time. ```{r, echo=TRUE, results='hide'} sim_g2 <- simulate_g2(n_ind = 20, H_nonInb = 0.5, meanF = 0.2, varF = 0.05, subsets = c(4,6,8,10,12), reps = 100, type = "msats", CI = 0.95) ``` The results can easily be plotted with the `plot` function again. ```{r sim_g2, echo=FALSE, fig.cap="Simulation: Sensitivity of g2 estimated from an increasing number of markers", fig.width=7, fig.height=4} plot(sim_g2, main = "Microsatellites", cex.axis=0.85) ``` Also, the plot function allows to plot the *real* $g_2$ value, which is directly computed from the realized inbreeding values of the individuals. ```{r sim_g2_true, echo=FALSE, fig.cap="Simulation: Sensitivity of g2 estimated from an increasing number of markers with true g2 value", fig.width=7, fig.height=4} plot(sim_g2, true_g2 = TRUE, main = "Microsatellites", cex.axis=0.85) ``` ## Note on calculating g2 for real-world SNP datasets SNP datasets obtained by most sequencing approaches will yield a high proportion of missing data. It is important to be aware that the $g_2$ formula for SNPs is fast because it relies on the assumption that missing values don't vary much between loci. We thus advice to calculate $g_2$ for datasets which are as complete as possible. In many situations, it might therefore be necessary to reduce the amount of SNPs by filtering for SNPs which have been genotyped in most individuals. Calculating confidence intervals around $g_2$ in combination with estimating the correlation between marker heterozygosity and inbreeding with the $r2_hf$ function will give you insights on whether the SNP density is sufficient to estimate $g_2$. Also, @huisman2016inbreeding show how to combine $g_2$ and other inbreeding estimates in the framework described above, thereby giving a good guideline for future studies. ## Extracting raw data from inbreedR objects You may wish to extract and plot the data yourself. Most function outputs are `inbreed` objects and lists. In the `Value` section of each functions documentation (`?fun` ), you can see the data which you can extract. Alternatively, use `str()` to look at the object's structure. Just index the function output with `[["."]]` or `$` as in the following example: Running the function. ```{r, results = "hide"} g2_seals <- g2_microsats(mouse_microsats, nperm = 100, nboot = 100, CI = 0.95) ``` Looking at the structure. ```{r} str(g2_seals) ``` Now extract whatever you want from the object, such as the $g_2$ bootstrap results. ```{r} g2_bootstrap_results <- g2_seals$g2_boot str(g2_bootstrap_results) ``` ## Literature