Introduction to Using the pooledpeaks Workflow

Introduction

Welcome to pooledpeaks. This guide introduces researchers to pooledpeaks for analyzing microsatellite markers, scoring .fsa files, and calculating genetic measures like Nei’s GST and Jost’s D. Basic R skills are required, but most steps are straightforward. This vignette includes .fsa files from two different sources:

These files are intended solely to demonstrate the functionality of the pooledpeaks package and are not for diagnostic or clinical use.To access the example .fsa files included with the package, use the following path within R:

system.file("extdata", package = "pooledpeaks")

Data Objects

Several internal data objects are created and used in this vignette to demonstrate the analytical workflow of the pooledpeaks package. These include:

What This Vignette Covers

This vignette outlines:

  1. General setup: Preparing your environment and data.
  2. Peak scoring: Using pooledpeaks to process .fsa files.
  3. Data manipulation: Cleaning and preparing peak scores.
  4. Genetic analysis: Calculating diversity and differentiation measures.

Each step builds on the previous, so follow the vignette sequentially. However, the three sections, Peak Scoring, Data Manipulation, and Genetic Analysis, can be run separately.

1. General Setup

To get started you will need to set up the R environment by setting the working directory, loading the required libraries, reading in the source files with the functions written specifically for this analysis pipeline.

In addition to the pooledpeaks library the following packages are require to utilize this package to its fullest capacity.

library(pooledpeaks)
library(Fragman)
library(ape)
library(magrittr)
library(tibble)

if (!rlang::is_installed("plyr")) {
  stop("This vignette requires the 'plyr' package. Please install it with
       install.packages('plyr').")
}

library(plyr)
library(dplyr)

Identify where the .fsa files are located on your computer, load in the eggcount data (should be an excel or csv file but can be a dataframe), and provide the expected peak size panels for your markers and ladder.

file_path <- system.file("extdata", package = "pooledpeaks")
eggcount <- data.frame(
    ID = c("X23.2",  "X30.3", "X33.1", "X1086.3", "X1087.3", "X1205.3",
           "X121.3",  "X1222.3", "X1354.3", "X1453.3", "X1531.3", "X1540.1",
           "Multiplex_set_I_Shaem.1",
           "Multiplex_set_I_Shaem.3", "Multiplex_set_I_Shaem.4"),
    n = c( 20, 46, 80, 156, 154, 122, 19, 45, 117, 75,
          22, 175, 100, 97, 183)
  )
Shae10 <- c(161,164,167,170,173,176,179,182,185,188,191,194,197,200,203,206,209,
            212,215,218)
mic_SMMS2 <- c(211, 215, 219, 223, 227, 231, 235, 239)
GS600LIZ <- c(20, 40, 60, 80, 100, 114, 120, 140, 160, 180, 200, 214, 220,
              240, 250, 260, 280, 300, 314, 320, 340, 360, 380, 400, 414,
              420, 440, 460, 480, 500, 514, 520, 540, 560, 580, 600)

2. Peak Scoring

With your data loaded, we can move on to the peak scoring section. To facilitate this process, pooledpeaks incorporates functionality adapted from the Fragman package, originally developed for microsatellite typing in plants. These adaptations allow for the scoring of both allele sizes and their corresponding heights in the newer .fsa file versions.

Batch Import and Extracting .fsa Data

This section demonstrates how to import the .fsa files from the file directory into R and combine all of them into a list of data frames, wherein each file is stored as a data frame within the list. channels specifies that we are using a five channel dye set of which 1-4 are fluorescent label colors, and 5 contains the ladder. fourier and saturated should both be set to TRUE and lets.pullup to FALSE. When rawPlot is set to TRUE, the function will provide an overview of all peaks across all files within each channel. Once .fsa files have been imported, the dyes must be associated with the channels. This can be done using associate_dyes().

fsa_data <- fsa_batch_imp(file_path, channels = 5, rawPlot = TRUE,
                              fourier = TRUE, saturated = TRUE,
                              lets.pullup = FALSE)
#> Reading FSA files
#> Applying Fourier tranformation for smoothing...
#> Checking and correcting for saturated peaks...
#> Plotting raw data

fsa_data <- associate_dyes(fsa_data, file_path)
#> Columns have been renamed with associated fluoresecent dye.

Note: If you are encountering issues, you may want to consider checking the file version and/or metadata using check_fsa_v_batch() or fsa_metadata().

Match Sizing Ladder

To calibrate fragment sizes, the internal size marker peaks in each sample must match the expected sizes from the ladder. For this example, we use the LIZ600 object, which contains the expected allele sizes for the ladder. If you are using a different ladder or wish to adjust the fragment sizes, modify the c(...) values in the setup.

Next, we associate the ladder with the imported data. This step is performed once per dataset and ensures proper sizing by comparing the expected ladder sizes with the observed values. The program checks the correlation between these values, outputs the correlation results, and flags poorly correlated samples (<99.9%) in a vector named bad.

ladder.info.attach(stored = fsa_data,ladder = GS600LIZ,
                   ladd.init.thresh = 200, prog = FALSE, draw = FALSE)
#> 
#> Reducing threshold 2x to find ladder 
#> 
#> Reducing threshold 2x to find ladder 
#> 
#> Sizing process complete. Information has been stored in the environment for posterior functions.
#> For example to be used by the overview2() or score.markers() functions.
corro <- unlist(sapply(list.data.covarrubias, function(x){x$corr}))
bad <- which(corro < .999)

Note: If warnings are thrown, lowering the ladd.init.thresh may resolve the issue, or certain samples may need to be addressed manually as per the Fragman documentation (run ?ladder.corrector).

Scoring Peaks by Marker

The above chunks set up all samples for all markers and only need to be done once per analysis. The following steps will need to be repeated as many times as the number of microsatellite markers you have.

Using the score_markers_rev3 function (adapted from Fragman), you can score genotyped peaks based on size (weight) and intensity (height). This function bins peaks by comparing observed fragment sizes to expected microsatellite fragment sizes.

Key Parameters to Customize

Other options like window (distance from the expected size to count as a peak) and shift (handling stutter peaks) can be adjusted as needed. Refer to the Fragman documentation for detailed explanations.

Additional Updates of score_markers_rev390

scores_SMMS2 <- score_markers_rev3(my.inds = fsa_data,
                                   channel = 1,
                                   channel.ladder = 5,
                                   panel = "mic_SMMS2",
                                   ladder = GS600LIZ,
                                   init.thresh = 100,
                                   ploidy = length(mic_SMMS2),
                                   shift = 1,
                                   windowL = 1,
                                   windowR= 1,
                                   left.cond = c(0, 2.5),
                                   right.cond = 0,
                                   pref = 1,
                                   plotting = FALSE
                                   )

scores_Shae10 <- score_markers_rev3(my.inds = fsa_data,
                                   channel = 1,
                                   channel.ladder = 5,
                                   panel = "Shae10",
                                   ladder = GS600LIZ,
                                   init.thresh = 100,
                                   ploidy = length(Shae10),
                                   shift = 1,
                                   windowL = 1,
                                   windowR= 1,
                                   left.cond = c(0, 2.5),
                                   right.cond = 0,
                                   pref = 1,
                                   plotting = FALSE
                                   )

Note: The author recommends setting plotting to TRUE and then visually inspecting the PDFs to confirm that each peak is being called as expected. If they are not, adjust the parameters until satisfied.

Combining, Cleaning, and Exporting the Peak Dataframe

After scoring peaks, combine the data frames for all samples of the same marker into a single data frame instead of a list of lists You’ll also clean the sample IDs for consistency and prepare the data for downstream analyses.

Workflow

1. Combine Data and Create Simplified IDs

clean_scores() row-binds all the individual data frames and removes machine-added information from the ID column, keeping only the collection number and replicate

(e.g., filename: 104.1a_FA060920_2020-06-09_C05.fsa becomes ID: 104.1a).

scores_SMMS2_lf<-clean_scores(scores_SMMS2, pattern1 = "_I_[A|B|C].*",replacement1 = "",
                              pattern2 = "_[1|2|3]_Sample.*", replacement2 = "")

scores_Shae10_lf<-clean_scores(scores_Shae10, pattern1 = "_I_[A|B|C].*",replacement1 = "",
                              pattern2 = "_[1|2|3]_Sample.*", replacement2 = "")

2. Transform from Long Format to Table Format

scores_SMMS2_tdf <- lf_to_tdf(scores_SMMS2_lf)

scores_Shae10_tdf <- lf_to_tdf(scores_Shae10_lf)

3. Export Tables

To save time in future analyses, export the processed peak data as .txt files. This ensures you can access the data without rerunning the entire pipeline.

write.table(scores_SMMS2_lf, file = "scores_SMMS2_lfex.txt", col.names = NA,
            quote = FALSE, row.names = TRUE, sep = "\t")
write.table(scores_SMMS2_tdf, file = "scores_SMMS2_tdfex.txt", col.names = NA,
            quote = FALSE, row.names = TRUE, sep = "\t")

write.table(scores_Shae10_lf, file = "scores_Shae10_lfex.txt", col.names = NA,
            quote = FALSE, row.names = TRUE, sep = "\t")
write.table(scores_Shae10_tdf, file = "scores_Shae10_tdfex.txt", col.names = NA,
            quote = FALSE, row.names = TRUE, sep = "\t")

3. Data Manipulation

This data manipulation section is important in order to prepare for the genetic analysis but it is much simpler than the peak scoring portion above. Begin by calling in the previously exported tdf data frames.

SMMS2<- read.delim("./scores_SMMS2_tdfex.txt")%>%
  column_to_rownames(var = "X")
head(SMMS2[, 1:9])
#>     X23.2a X23.2b X30.3a X30.3b X33.1a X33.1b Multiplex_set_I_Shaem.1a
#> 211      0      0      0      0    119      0                        0
#> 215    916    657   4273   1443  12424   3605                        0
#> 219      0      0      0      0    102      0                        0
#> 231      0      0      0      0      0      0                      103
#> 235    547    140    419    259   4172   1140                        0
#> 239      0      0      0      0      0      0                      348
#>     Multiplex_set_I_Shaem.1b Multiplex_set_I_Shaem.3a
#> 211                      158                        0
#> 215                        0                        0
#> 219                        0                        0
#> 231                        0                        0
#> 235                        0                        0
#> 239                      382                        0

Initial Data Manipulation

The data_manipulation function should be used to clean the data first. It

  1. Removes samples without at least one peak exceeding the threshold.

  2. Eliminates alleles that are absent in all samples.

SMMS2_IDM <- data_manipulation(SMMS2, threshold = 200)
head(SMMS2_IDM[, 1:9])
#>     X23.2a X23.2b X30.3a X30.3b X33.1a X33.1b Multiplex_set_I_Shaem.1a
#> 211      0      0      0      0    119      0                        0
#> 215    916    657   4273   1443  12424   3605                        0
#> 219      0      0      0      0    102      0                        0
#> 231      0      0      0      0      0      0                      103
#> 235    547    140    419    259   4172   1140                        0
#> 239      0      0      0      0      0      0                      348
#>     Multiplex_set_I_Shaem.1b Multiplex_set_I_Shaem.4a
#> 211                      158                      398
#> 215                        0                        0
#> 219                        0                        0
#> 231                        0                      131
#> 235                        0                        0
#> 239                      382                      735

Replicate Check

Replicate samples are compared in the cleaned data frame (you can skip this step if you only ran each sample once). If you do have replicate samples, it replaces the individual columns .a and .b (.c, .d, etc.) with an average of the two and calculates the Jost’s D between the samples.

SMMS2_repcheck <- Rep_check(SMMS2_IDM)
#> Jost D between duplicate samples
#> Multiplex_set_I_Shaem.1, Multiplex_set_I_Shaem.4, X23.2, X30.3, X33.1
#> 0.115023363295114, 0.137798575802657, 0.0632672283449427, 0.00500598676325659, 0.000473242075468283
#> Samples where duplicates have a Jost D exceeding 0.05
#> Multiplex_set_I_Shaem.1, Multiplex_set_I_Shaem.4, X23.2
head(SMMS2_repcheck)
#>     Multiplex_set_I_Shaem.1 Multiplex_set_I_Shaem.4 X23.2 X30.3  X33.1
#> 211                    79.0                   199.0   0.0     0   59.5
#> 215                     0.0                     0.0 786.5  2858 8014.5
#> 219                     0.0                     0.0   0.0     0   51.0
#> 231                    51.5                   125.5   0.0     0    0.0
#> 235                     0.0                     0.0 343.5   339 2656.0
#> 239                   365.0                   589.0   0.0     0    0.0

Data Manipulation for Genetic Analysis

PCDM or the Post-Consolidation Manipulation function prepares the data for the Genetic Analysis Section by:

  1. Matching eggcount information for each sample.

  2. Calculating allelic frequencies.

  3. Adding the marker name to separate the data frames once combined.

SMMS2_PCM<-PCDM(SMMS2_repcheck,eggcount,'SMMS2')
head(SMMS2_PCM[,1:6])
#>   Locus_allele Multiplex_set_I_Shaem.1 Multiplex_set_I_Shaem.4       X23.2
#> 1        SMMS2                    <NA>                    <NA>        <NA>
#> 2            n                     100                     183          20
#> 3          211             0.159434914             0.217843459 0.000000000
#> 4          215               0.0000000               0.0000000   0.6960177
#> 5          219             0.000000000             0.000000000 0.000000000
#> 6          231               0.1039354               0.1373837   0.0000000
#>         X30.3       X33.1
#> 1        <NA>        <NA>
#> 2          46          80
#> 3 0.000000000 0.005518969
#> 4   0.8939631   0.7433912
#> 5 0.000000000 0.004730544
#> 6   0.0000000   0.0000000

Binding Markers and Exporting Allele Frequencies

If you have multiple markers, you can combine them into a single data frame using functions like rbind.fill(). This creates a consolidated structure with one column per sample and replaces any empty cells with NA. The processed data frame can be exported as a .txt file, allowing for efficient reuse in future analyses without repeating these steps.

combined<-rbind.fill(SMMS2_PCM, SMMS13_PCM, SMMS16_PCM)

write.table(combined, file = "combined.txt", col.names = NA,
            quote = FALSE, row.names = TRUE, sep = "\t")

4. Genetic Analysis

Welcome to the final stage of the pipeline: genetic analysis! This section provides a high-level overview of the key steps involved in analyzing the processed data for population genetics. It introduces essential methods like calculating genetic distance, visualizing population structure, and creating phylogenetic trees.

Note: This is a general guide intended to demonstrate the pipeline’s capabilities, not a comprehensive or in-depth example. For detailed use cases or advanced analyses, you may need to adjust the parameters and explore additional functions in the package.

Note: This section uses a different dataset than the rest of the vignette. It includes a more extensive set of microsatellite markers and represents a subset of Schistosoma mansoni samples collected as part of a series of three studies conducted in Brazil. These studies are described in detail by Long et al. (2022), available at https://www.nature.com/articles/s41598-022-04776-0:

Long, J.C., Taylor, S.E., Barbosa, L.M., et al. (2022). Cryptic population structure and transmission dynamics uncovered for Schistosoma mansoni populations by genetic analyses. Scientific Reports, 12, 1059. https://doi.org/10.1038/s41598-022-04776-0

Loading the Dataset

The LoadData function modifies and saves the data frame as the gends object with an added column that indexes the Locus number.

gends <- LoadData("./combined3.txt")
head(gends[1:8])
#>   Locus Locus_allele      J19       J82       J83       J95     J320     J323
#> 1     1        SMMS2       NA        NA        NA        NA       NA       NA
#> 2     1            n 9376.000 13824.000 45144.000 42081.000 3119.000 5185.000
#> 3     1          212    0.002     0.006     0.005     0.006    0.005    0.003
#> 4     1        215.9    0.697     0.773     0.743     0.748    0.819    0.719
#> 5     1        219.8    0.000     0.000     0.000     0.000    0.000    0.008
#> 6     1        235.2    0.301     0.221     0.252     0.246    0.176    0.270

Calculating Gene Identity and Genetic Distance Matrices

Next, we calculate the gene identity and genetic distances between samples. This step is fundamental to all downstream genetic analyses, as they are the basis for differentiation indices, clustering, phylogenetic trees, and other population genetic metrics. This involves:

1. Counting loci successfully genotyped for each individual (TypedLoci).

N <- TypedLoci(gends)
head(N[,1:5])
#>      J19 J82 J83 J95 J320
#> J19   14  14  14  14   14
#> J82   14  14  14  14   14
#> J83   14  14  14  14   14
#> J95   14  14  14  14   14
#> J320  14  14  14  14   14
#> J323  13  13  13  13   13

2. Calculating pairwise gene identity for all samples (GeneIdentityMatrix).

J <- GeneIdentityMatrix(gends,N)
head(J[,1:5])
#>            J19       J82       J83       J95      J320
#> J19  0.3571634 0.3314657 0.3350303 0.3280159 0.3625713
#> J82  0.3314657 0.3376424 0.3383771 0.3226482 0.3491884
#> J83  0.3350303 0.3383771 0.3464639 0.3286256 0.3511009
#> J95  0.3280159 0.3226482 0.3286256 0.3276570 0.3308035
#> J320 0.3625713 0.3491884 0.3511009 0.3308035 0.4251507
#> J323 0.2928828 0.2954567 0.2972713 0.2907427 0.2910343

3. Deriving a genetic distance matrix (GeneticDistanceMatrix).

D <- GeneticDistanceMatrix(J)
head(D[,1:5])
#>             J19         J82         J83         J95       J320
#> J19  0.00000000 0.015937125 0.016783286 0.014394321 0.02858577
#> J82  0.01593713 0.000000000 0.003676054 0.010001446 0.03220821
#> J83  0.01678329 0.003676054 0.000000000 0.008434786 0.03470645
#> J95  0.01439432 0.010001446 0.008434786 0.000000000 0.04560041
#> J320 0.02858577 0.032208214 0.034706446 0.045600411 0.00000000
#> J323 0.03281022 0.020475812 0.023071967 0.020197154 0.06865241

Differentiation Indices

We can use the gene identity matrix to calculate Nei’s GST and Jost’s D.

print(head(GST(J)[,1:5]))
#>            [,1]        [,2]        [,3]        [,4]       [,5]
#> [1,] 0.00000000 0.012063240 0.012780877 0.010826266 0.02293703
#> [2,] 0.01206324 0.000000000 0.002785797 0.007437667 0.02537248
#> [3,] 0.01278088 0.002785797 0.000000000 0.006321440 0.02747737
#> [4,] 0.01082627 0.007437667 0.006321440 0.000000000 0.03527280
#> [5,] 0.02293703 0.025372485 0.027477374 0.035272795 0.00000000
#> [6,] 0.02375101 0.014745538 0.016689967 0.014443897 0.05088081

print(head(JostD(J)[,1:5]))
#>            [,1]       [,2]       [,3]       [,4]       [,5]
#> [1,] 0.00000000 0.04587505 0.04770505 0.04203824 0.07308003
#> [2,] 0.04587505 0.00000000 0.01074703 0.03006600 0.08444810
#> [3,] 0.04770505 0.01074703 0.00000000 0.02502455 0.08995798
#> [4,] 0.04203824 0.03006600 0.02502455 0.00000000 0.12114756
#> [5,] 0.07308003 0.08444810 0.08995798 0.12114756 0.00000000
#> [6,] 0.10073970 0.06481071 0.07202264 0.06495518 0.19086724

PCA Plot

We can use the genetic distance matrix to visualize the “spread” of our population in space. This can be done using a PCA plot. It accepts the distance matrix, which PCs we want to include on the graph and how we want to differentiate the points.

M <- MDSplot(D,pcs=c(1,2))

Phylogenetic Tree

You can also create a phylogenetic tree using nj from ape on the genetic distance matrix. The resulting tree is ladderized and then plotted as an unrooted tree.

Tr <- nj(D)
Tr <- ladderize(Tr)
plot(Tr,cex=0.5,no.margin = TRUE,type='phylogram')

Conclusion

This pipeline provides powerful tools for exploring population genetics, offering flexibility to adapt to various datasets and research questions. While this section highlights the main features of the pipeline, further customization may be required for specific analyses. The combination of reproducibility, offline capability, and user control makes this pipeline a valuable resource for genetic studies.