Using pcadapt to detect local adaptation with pooled sequencing data

Keurcien Luu, Michael G.B. Blum

The pcadapt package also handles pooled sequencing (Pool-seq) data. In the following, we show how to perform scans for selection on such data using the example pool3pops provided with the package. The pool3pops data contain simulated allele frequencies in 3 populations for 1,500 diploid markers. The allele frequencies were computed from the geno3pops dataset.

To run the package, you need to install and load it using the following command lines:

install.packages("pcadapt")
library(pcadapt)

A. Reading the file

A Pool-seq example is provided in the package, and can be loaded as follows:

pooldata <- system.file("extdata", "pool3pops", package = "pcadapt")

The frequency matrix should have n rows and L columns by default (where n is the number of populations and L is the number of genetic markers).
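To make the expected layout concrete, here is a minimal base-R sketch of a frequency matrix with populations in rows and markers in columns. The toy values, the object names (freq, freq_by_marker, freq_again) and the dimension labels are hypothetical and are not taken from pool3pops.

```r
# Hypothetical toy data: n = 3 populations (rows), L = 5 markers (columns).
set.seed(1)
n_pop <- 3
n_snp <- 5
freq <- matrix(runif(n_pop * n_snp), nrow = n_pop, ncol = n_snp,
               dimnames = list(paste0("pop", 1:n_pop),
                               paste0("snp", 1:n_snp)))

# Allele frequencies are proportions, so they must lie in [0, 1].
stopifnot(all(freq >= 0 & freq <= 1))

# A matrix stored the other way round (markers x populations) can be
# transposed back to the populations x markers layout.
freq_by_marker <- t(freq)          # 5 x 3
freq_again <- t(freq_by_marker)    # 3 x 5 again

dim(freq)
```

Checking dim() before running the analysis is a cheap way to catch a transposed input.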

B. Running the analysis

When calling the pcadapt function, make sure to specify data.type = "pool".

x.pool <- pcadapt(pooldata, data.type = "pool")
## Number of SNPs: 1500
## Number of populations: 3
## [1] 2

As with genotype data, the pcadapt function performs two successive tasks. First, PCA is performed on the centered (but not scaled) matrix of allele frequencies. Second, test statistics (Mahalanobis distances by default) and p-values are computed from the covariances between allele frequencies and the first K PCs. By default, the function sets the number K of PCs to the number of populations minus one (K = n-1). The user can choose a smaller number of PCs (K < n-1) by inspecting the scree plot.
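The two stages can be illustrated with a simplified base-R sketch. It is not the package's implementation: the simulated frequencies, the plain (non-robust) Mahalanobis distance and the chi-square approximation with K degrees of freedom are all assumptions made for illustration, and pcadapt's own estimator and calibration may differ.

```r
# Hypothetical simulated input: 4 populations (rows), 200 markers (columns).
set.seed(42)
n_pop <- 4
n_snp <- 200
freq <- matrix(runif(n_pop * n_snp), nrow = n_pop, ncol = n_snp)

# Stage 1: PCA on the centered, unscaled frequency matrix.
pca <- prcomp(freq, center = TRUE, scale. = FALSE)
K <- n_pop - 1                          # default described in the text
scores <- pca$x[, 1:K, drop = FALSE]    # n_pop x K matrix of PC scores

# Proportion of variance per PC, i.e. what a scree plot displays.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Stage 2: covariance of each marker's frequencies with the first K PCs,
# then one Mahalanobis distance and one p-value per marker.
z <- cov(freq, scores)                  # n_snp x K
d2 <- mahalanobis(z, center = colMeans(z), cov = cov(z))
pval <- pchisq(d2, df = K, lower.tail = FALSE)

head(order(pval))   # markers with the smallest p-values in this toy example
```

Because the toy frequencies are drawn at random, no marker here is truly under selection; the sketch only shows how the quantities fit together.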

Plotting options mentioned in section D such as "manhattan", "qqplot", "scores", "screeplot", or "stat.distribution" are still valid for Pool-seq data. For example, to display the Manhattan plot, type the following command line:

plot(x.pool,option="manhattan")

C. Specifying the coverage matrix

If the mean coverage per population is known for each SNP, it can be taken into account to increase the power of the selection scan, because the test statistics can then account for low-coverage SNPs. The coverage matrix must have the same dimensions as your pooled sequencing data. To use it, provide the coverage matrix to the pcadapt function through the argument cover.matrix.
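As a rough sketch of the dimension requirement, the base-R code below builds a hypothetical coverage matrix alongside a toy frequency matrix and checks that the two match. The simulated values, the Poisson mean of 30 and the threshold of 20 reads are illustrative assumptions, not package defaults.

```r
# Hypothetical toy data: 3 populations (rows), 6 SNPs (columns).
set.seed(7)
n_pop <- 3
n_snp <- 6
freq  <- matrix(runif(n_pop * n_snp), nrow = n_pop, ncol = n_snp)

# Mean coverage per population for each SNP, simulated here as Poisson counts.
cover <- matrix(rpois(n_pop * n_snp, lambda = 30), nrow = n_pop, ncol = n_snp)

# The coverage matrix must line up entry by entry with the frequencies.
stopifnot(identical(dim(cover), dim(freq)))

# Illustrative check: SNPs whose coverage averaged over populations is low
# (the threshold of 20 is an arbitrary value chosen for this example).
low_cov <- which(colMeans(cover) < 20)
low_cov
```

A matrix prepared this way would then be supplied through the cover.matrix argument described above.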