| Title: | Single-Cell Analysis of Host-Microbiome Interactions | 
| Version: | 0.0.2 | 
| Description: | A computational resource designed to accurately detect microbial nucleic acids while filtering out contaminants and false-positive taxonomic assignments from standard transcriptomic sequencing of mammalian tissues. For more details, see Ghaddar (2023) <doi:10.1038/s43588-023-00507-1>. This implementation leverages the 'polars' package for fast and systematic microbial signal recovery and denoising from host tissue genomic sequencing. | 
| License: | MIT + file LICENSE | 
| ByteCompile: | true | 
| Encoding: | UTF-8 | 
| OS_type: | unix | 
| RoxygenNote: | 7.3.2 | 
| SystemRequirements: | kraken2, seqkit | 
| Imports: | blit (≥ 0.1.0), cli, rlang (≥ 1.1.0), ShortRead, utils | 
| Suggests: | polars (≥ 0.17.0) | 
| Additional_repositories: | https://community.r-multiverse.org | 
| URL: | https://github.com/Yunuuuu/rsahmi | 
| BugReports: | https://github.com/Yunuuuu/rsahmi/issues | 
| NeedsCompilation: | no | 
| Packaged: | 2025-03-21 18:48:11 UTC; yun | 
| Author: | Yun Peng | 
| Maintainer: | Yun Peng <yunyunp96@163.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-03-24 12:00:05 UTC | 
Barcode level signal denoising
Description
True taxa are detected on multiple barcodes and with a proprotional number of
total and unique k-mer sequences across barcodes, measured as a significant
Spearman correlation between the number of total and unique k-mers across
barcodes. (padj < 0.05)
Usage
blsd(
  kmer,
  method = "spearman",
  ...,
  p.adjust = "BH",
  min_kmer_len = 3L,
  min_number = 3L
)
Arguments
| kmer | kmer data returned by  | 
| method | A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. | 
| ... | Other arguments passed to cor.test. | 
| p.adjust | Pvalue correction method, a character string. Can be abbreviated. Details see p.adjust. | 
| min_kmer_len | An integer, the minimal number of kmer to filter taxa.
SAHMI use  | 
| min_number | An integer, the minimal number of cell per taxid. SAHMI use
 | 
Value
A polars DataFrame
See Also
https://github.com/sjdlabgroup/SAHMI
Examples
## Not run: 
# 1. `sahmi_datasets` should be the output of all samples from 
      `prep_dataset()` 
# 2. `real_taxids_slsd` should be the output of `slsd()`
umi_list <- lapply(sahmi_datasets, function(dataset) {
    # Barcode level signal denoising (barcode k-mer correlation test)
    blsd <- blsd(dataset$kmer)
    real_taxids <- blsd$filter(pl$col("padj")$lt(0.05))$get_column("taxid")
    # only keep taxids pass Sample level signal denoising
    real_taxids <- real_taxids$filter(real_taxids$is_in(real_taxids_slsd))
    # remove contaminants
    real_taxids <- real_taxids$filter(
        real_taxids$is_in(attr(truly_microbe, "truly"))
    )
    # filter UMI data
    dataset$umi$filter(pl$col("taxid")$is_in(real_taxids))
})
## End(Not run)
Extract reads and output from Kraken
Description
Extract reads and output from Kraken
Usage
extract_taxids(
  kraken_report,
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses")
)
extract_kraken_output(
  kraken_out,
  taxids,
  odir,
  ofile = "kraken_microbiome_output.txt",
  ...
)
extract_kraken_reads(
  kraken_out,
  reads,
  ofile = NULL,
  odir = getwd(),
  threads = NULL,
  ...,
  envpath = NULL,
  seqkit = NULL
)
Arguments
| kraken_report | The path to kraken report file. | 
| taxon | An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses") | 
| kraken_out | The path to kraken output file. | 
| taxids | A character specify NCBI taxonony identifier to extract. | 
| odir | A string of directory to save the  | 
| ofile | A string of file save the kraken output of specified  | 
| ... | 
 | 
| reads | The original fastq files (input in  | 
| threads | Number of threads to use, see
 | 
| envpath | A string of path to be added to the environment variable
 | 
| seqkit | A string of path to  | 
Value
-  extract_taxids: An atomic character vector of taxon identifiers.
-  extract_kraken_output: A polars DataFrame.
-  extract_kraken_reads: Exit status invisiblely.
See Also
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Examples
## Not run: 
# For 10x Genomic data, `fq1` only contain barcode and umi, but the official
# didn't give any information for this. In this way, I prefer using
# `umi-tools` to transform the `umi` into fq2 and then run `rsahmi` with
# only fq2.
blit::kraken2(
    fq1 = fq1,
    fq2 = fq2,
    classified_out = "classified.fq",
    # Number of threads to use
    blit::arg("--threads", 10L, format = "%d"),
    # the kraken database
    blit::arg("--db", kraken_db),
    "--use-names", "--report-minimizer-data",
) |> blit::cmd_run()
# `kraken_report` should be the output of `blit::kraken2()`
taxids <- extract_taxids(kraken_report = "kraken_report.txt")
# 1. `kraken_out` should be the output of `blit::kraken2()`
# 2. `taxids` should be the output of `extract_taxids()`
# 3. `odir`: the output directory
extract_kraken_output(
    kraken_out = "kraken_output.txt",
    taxids = taxids,
    odir = # specify the output directory
)
# 1. `kraken_out` should be the output of `extract_kraken_output()`
# 2. `fq1` and `fq2` should be the same with `blit::kraken2()`
extract_kraken_reads(
    kraken_out = "kraken_microbiome_output.txt",
    reads = c(fq1, fq2),
    threads = 10L, # Number of threads to use
    # try to change `seqkit` argument into your seqkit path. If `NULL`, the
    # internal will detect it in your `PATH` environment variable
    seqkit = NULL
)
## End(Not run)
Parse kraken report file
Description
Parse kraken report file
Usage
parse_kraken_report(kraken_report, intermediate_ranks = TRUE, mpa = FALSE)
Arguments
| kraken_report | The path to kraken report file. | 
| intermediate_ranks | A bool indicates whether to include non-traditional taxonomic ranks in output. | 
| mpa | A bool indicates whether to use mpa-style. | 
Value
A polars DataFrame.
See Also
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Prepare kraken report, k-mer statistics, UMI data
Description
Three elements returned by this function:
-  kreport: Used byslsd().
-  kmer: Used byblsd(). The function count the number of k-mers and unique k-mers assigned to a taxon across barcodes. The cell barcode and unique molecular identifier (UMI) are used to identify unique barcodes and reads. Data is reported for taxa of pre-specified ranks (default genus + species) taking into account all subsequently higher resolution ranks. The output is a table of barcodes, taxonomic IDs, number of k-mers, and number of unique k-mers.
-  umi: Used bytaxa_counts().
Usage
prep_dataset(
  fa1,
  kraken_report,
  kraken_out,
  fa2 = NULL,
  cb_and_umi = function(sequence_id, read1, read2) {
     list(substring(read1, 1L, 16L),
    substring(read1, 17L, 28L))
 },
  ranks = c("G", "S"),
  kmer_len = 35L,
  min_frac = 0.5,
  exclude = "9606",
  threads = 10L,
  overwrite = TRUE,
  odir = NULL
)
read_dataset(dir)
Arguments
| fa1,fa2 | The path to microbiome fasta 1 and 2 file (returned by
 | 
| kraken_report | The path to kraken report file. | 
| kraken_out | The path of microbiome output file. Usually should be
filtered with  | 
| cb_and_umi | A function takes sequence id, read1, read2 and return a list of 2 corresponding to cell barcode and UMI respectively., each should have the same length of the input. | 
| ranks | Taxa ranks to analyze. | 
| kmer_len | Kraken kmer length. Default:  | 
| min_frac | Minimum fraction of kmers directly assigned to taxid to use
read. Reads with  | 
| exclude | A character of taxid to exclude, for  | 
| threads | Number of threads to use. | 
| overwrite | A bool indicates whether to overwrite the files in  | 
| odir | A string of directory to save the results. | 
| dir | A string of directory containing the files returned by
 | 
Value
A list of three polars DataFrame:
- kreport: Used by - slsd().
- kmer: Used by - blsd().
- umi: Used by - taxa_counts().
See Also
https://github.com/sjdlabgroup/SAHMI
Examples
# for sequence from `umi-tools`, we can use following function
cb_and_umi <- function(sequence_id, read1, read2) {
    out <- lapply(
        strsplit(sequence_id, "_", fixed = TRUE),
        `[`, 2:3
    )
    lapply(1:2, function(i) {
        vapply(out, function(o) as.character(.subset2(o, i)), character(1L))
    })
}
## Not run: 
# 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()`
# 2. `kraken_report` should be the output of `blit::kraken2()`
# 3. `kraken_out` should be the output of `extract_kraken_output()`
# 4. `dir`: you may want to specify the output directory since this process 
#           is time-consuming
sahmi_dataset <- prep_dataset(
    fa1 = "kraken_microbiome_reads.fa",
    # if you have paired sequence, please also specify `fa2`,
    # !!! Also pay attention to the file name of `fa1` (add suffix `_1`)
    # if you use paired reads.
    # fa2 = "kraken_microbiome_reads_2.fa",
    kraken_report = "kraken_report.txt",
    kraken_out = "kraken_microbiome_output.txt",
    odir = NULL
)
# you may want to prepare all datasets for subsequent workflows.
# `paths` should be the output directory for each sample from
# `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`.
sahmi_datasets <- lapply(paths, function(dir) {
    prep_dataset(
        fa1 = file.path(dir, "kraken_microbiome_reads.fa"),
        # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"),
        kraken_report = file.path(dir, "kraken_report.txt"),
        kraken_out = file.path(dir, "kraken_microbiome_output.txt"),
        odir = dir
    )
})
## End(Not run)
Identifying contaminants and false positives taxa (cell line quantile test)
Description
Identifying contaminants and false positives taxa (cell line quantile test)
Usage
remove_contaminants(
  kraken_reports,
  study = "current study",
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"),
  quantile = 0.95,
  alpha = 0.05,
  alternative = "greater",
  exclusive = FALSE
)
Arguments
| kraken_reports | A character of path to all kraken report files. | 
| study | A string of the study name, used to differentiate with cell line data. | 
| taxon | An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses") | 
| quantile | Probabilities with values in  | 
| alpha | Level of significance. | 
| alternative | A string specifying the alternative hypothesis, must be one of "two.sided", "greater" (default) or "less". You can specify just the initial letter. | 
| exclusive | A boolean value, indicates whether taxa not found in
celllines data should be regarded as truly. Default:  | 
Value
A polars DataFrame with following attributes:
-  pvalues: Quantile test pvalue.
-  exclusive: taxids in current study but not found in cellline data.
-  significant: significant taxids withpvalues < alpha.
-  truly: truly taxids based onalphaandexclusive. IfexclusiveisTRUE, this should be the union ofexclusiveandsignificant, otherwise, this should be the same withsignificant.
Examples
## Not run: 
# `paths` should be the output directory for each sample from
# `blit::kraken2()`
truly_microbe <- remove_contaminants(
    kraken_reports = file.path(paths, "kraken_report.txt"),
    quantile = 0.99, exclusive = FALSE
)
microbe_for_plot <- attr(truly_microbe, "truly")[
    order(attr(truly_microbe, "pvalue")[attr(truly_microbe, "truly")])
]
microbe_for_plot <- microbe_for_plot[
    !microbe_for_plot %in% attr(truly_microbe, "exclusive")
]
ggplot(
    truly_microbe$filter(pl$col("taxid")$is_in(microbe_for_plot))$
        to_data_frame(),
    aes(rpmm),
) +
    geom_density(aes(fill = study), alpha = 0.5) +
    scale_x_log10() +
    facet_wrap(facets = vars(taxa), scales = "free") +
    theme(
        strip.clip = "off",
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "inside",
        legend.position.inside = c(1, 0),
        legend.justification.inside = c(1, 0)
    )
## End(Not run)
Sample level signal denoising
Description
In the low-microbiome biomass setting, real microbes also exhibit a
proportional number of total k-mers, number of unique k-mers, as well as
number of total assigned sequencing reads across samples; i.e. the following
three Spearman correlations are significant when tested using sample-level
data provided in Kraken reports: cor(minimizer_len, minimizer_n_unique),
cor(minimizer_len, total_reads) and cor(total_reads, minimizer_n_unique).
(r1>0 & r2>0 & r3>0 & p1<0.05 & p2<0.05 & p3<0.05).
Usage
slsd(
  kreports,
  method = "spearman",
  ...,
  min_reads = 3L,
  min_minimizer_n_unique = 3L,
  min_number = 3L
)
Arguments
| kreports | kreports data returned by  | 
| method | A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. | 
| ... | Other arguments passed to cor.test. | 
| min_reads | An integer, the minimal number of the total reads to filter
taxa. SAHMI use  | 
| min_minimizer_n_unique | An integer, the minimal number of the unique
number of minimizer to filter taxa. SAHMI use  | 
| min_number | An integer, the minimal number of samples per taxid. SAHMI
use  | 
Value
A polars DataFrame of correlation
coefficient and pvalue for cor(minimizer_len, minimizer_n_unique) (r1 and
p1), cor(minimizer_len, total_reads) (r2 and p2) and cor(total_reads, minimizer_n_unique) (r3 and p3).
Examples
## Not run: 
# `sahmi_datasets` should be the output of all samples from `prep_dataset()`
slsd <- slsd(lapply(sahmi_datasets, `[[`, "kreport"))
real_taxids_slsd <- slsd$filter(
    pl$col("r1")$gt(0),
    pl$col("r2")$gt(0),
    pl$col("r3")$gt(0),
    pl$col("p1")$lt(0.05),
    pl$col("p2")$lt(0.05),
    pl$col("p3")$lt(0.05)
)$get_column("taxid")
## End(Not run)
Quantitation of microbes
Description
After identifying true taxa, reads assigned to those taxa are extracted and then undergo a series of filters. The cell barcode and UMI are used to demultiplex the reads and create a barcode x taxa counts matrix. The full taxonomic classification of all resulting barcodes and the number of counts assigned to each clade are tabulated.
Usage
taxa_counts(umi_list, samples = NULL)
Arguments
| umi_list | A list of polars DataFrame for UMI data returned by prep_dataset. | 
| samples | A character of sample identifier for each element in
 | 
Value
A polars DataFrame.
See Also
https://github.com/sjdlabgroup/SAHMI
Examples
## Not run: 
# `umi_list` should be the output of all samples from `prep_dataset()`, and 
# filtered by `slsd()` and `blsd()`
taxa_counts(umi_list, basename(names(umi_list)))
## End(Not run)