Title: | Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires |
---|---|
Description: | A comprehensive framework for bioinformatics exploratory analysis of bulk and single-cell T-cell receptor and antibody repertoires. It provides seamless data loading, analysis and visualisation for AIRR (Adaptive Immune Receptor Repertoire) data, both bulk immunosequencing (RepSeq) and single-cell sequencing (scRNAseq). Immunarch implements most of the widely used AIRR analysis methods, such as: clonality analysis, estimation of repertoire similarities in distribution of clonotypes and gene segments, repertoire diversity analysis, annotation of clonotypes using external immune receptor databases and clonotype tracking in vaccination and cancer studies. A successor to our previously published 'tcR' immunoinformatics package (Nazarov 2015) <doi:10.1186/s12859-015-0613-1>. |
Authors: | Vadim I. Nazarov [aut, cre], Vasily O. Tsvetkov [aut], Siarhei Fiadziushchanka [aut], Eugene Rumynskiy [aut], Aleksandr A. Popov [aut], Ivan Balashov [aut], Maria Samokhina [aut], Anna Lorenc [ctb], Daniel J. Moore [ctb], Victor Greiff [ctb], ImmunoMind [cph, fnd] |
Maintainer: | Vadim I. Nazarov <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 0.9.1 |
Built: | 2024-11-20 05:08:06 UTC |
Source: | https://github.com/immunomind/immunarch |
Get a column's name using the input alias
.quant_column_choice(x)
.quant_column_choice(x)
x |
Character vector of length 1. |
A string with the column name.
immunarch:::.quant_column_choice("count") immunarch:::.quant_column_choice("freq")
Amino acid / codon table
AA_TABLE
AA_TABLE
An object of class table
of length 65.
Add a new class attribute
add_class(.obj, .class)
add_class(.obj, .class)
.obj |
R object. |
.class |
String with the desired class name. |
Input object with additional class .class
.
tmp <- "abc" class(tmp) tmp <- immunarch:::add_class(tmp, "new_class") class(tmp)
Apply the given function to every pair in the given datalist. Function either symmetrical (i.e. fun(x,y) == fun(y,x)) or assymmetrical (i.e. fun(x,y) != fun(y,x)).
apply_symm(.datalist, .fun, ..., .diag = NA, .verbose = TRUE) apply_asymm(.datalist, .fun, ..., .diag = NA, .verbose = TRUE)
apply_symm(.datalist, .fun, ..., .diag = NA, .verbose = TRUE) apply_asymm(.datalist, .fun, ..., .diag = NA, .verbose = TRUE)
.datalist |
List with some data.frames. |
.fun |
Function to apply, which return basic class value. |
... |
Arguments passsed to .fun. |
.diag |
Either NA for NA or something else != NULL for .fun(x,x). |
.verbose |
if TRUE then output a progress bar. |
Matrix with values M[i,j] = fun(datalist[i], datalist[j])
data(immdata) apply_symm(immdata$data, function(x, y) { nrow(x) + nrow(y) })
data(immdata) apply_symm(immdata$data, function(x, y) { nrow(x) + nrow(y) })
A dataset with BCR data for testing and examplatory purposes.
bcrdata
bcrdata
A list of two elements. The first element ("data") is a list of 1 element named "full_clones" that contains immune repertoire data frame. The second element ("meta") is empty metadata table.
List of immune repertoire data frames.
Metadata
...
Nucleotide to amino acid sequence translation
bunch_translate(.seq, .two.way = TRUE, .ignore.n = FALSE)
bunch_translate(.seq, .two.way = TRUE, .ignore.n = FALSE)
.seq |
Vector or list of strings. |
.two.way |
Logical. If TRUE (default) then translate from the both ends (like MIXCR). |
.ignore.n |
Logical. If FALSE (default) then return NA for sequences that have N, else parse triplets with N as ~ |
Character vector of translated input sequences.
data(immdata) head(bunch_translate(immdata$data[[1]]$CDR3.nt))
data(immdata) head(bunch_translate(immdata$data[[1]]$CDR3.nt))
Check if the given .data is a distribution and normalise it if necessary with an optional Laplace correction.
check_distribution( .data, .do.norm = NA, .laplace = 1, .na.val = 0, .warn.zero = FALSE, .warn.sum = TRUE )
check_distribution( .data, .do.norm = NA, .laplace = 1, .na.val = 0, .warn.zero = FALSE, .warn.sum = TRUE )
.data |
Numeric vector of values. |
.do.norm |
One of the three values - NA, TRUE or FALSE. If NA then checks for distrubution (sum(.data) == 1) and normalises if needed with the given laplace correction value. if TRUE then does the normalisation and laplace correction. If FALSE then doesn't do either normalisaton or laplace correction. |
.laplace |
Value for the laplace correction. |
.na.val |
Replace all NAs with this value. |
.warn.zero |
if TRUE then the function checks if in the resulted vector (after normalisation) are any zeros, and prints a warning message if there are some. |
.warn.sum |
if TRUE then the function checks if the sum of resulted vector (after normalisation) is equal to one, and prints a warning message if not. |
Numeric vector.
immunarch:::check_distribution(c(1, 2, 3)) immunarch:::check_distribution(c(1, 2, 3), TRUE) immunarch:::check_distribution(c(1, 2, 3), FALSE)
Filter out clonotypes with non-coding, coding, in-frame or out-of-frame CDR3 sequences:
'coding()' - remove all non-coding sequences (i.e., remove all sequences with stop codons and frame shifts);
'noncoding()' - remove all coding sequences (i.e., leave sequences with stop codons and frame shifts only);
'inframes()' - remove all out-of-frame sequences (i.e., remove all sequences with frame shifts);
'outofframes()' - remove all in-frame sequences (i.e., leave sequences with frame shifts only).
Note: the function will remove all clonotypes sequences with NAs in the CDR3 amino acid column.
coding(.data) noncoding(.data) inframes(.data) outofframes(.data)
coding(.data) noncoding(.data) inframes(.data) outofframes(.data)
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
Filtered data frame.
data(immdata) immdata_cod <- coding(immdata$data) immdata_cod1 <- coding(immdata$data[[1]])
data(immdata) immdata_cod <- coding(immdata$data) immdata_cod1 <- coding(immdata$data[[1]])
Annotate clonotypes using immune receptor databases with known condition-associated receptors. Before using this function, you need to download database files first. For more details see the tutorial https://immunarch.com/articles/web_only/v11_db.html.
dbAnnotate(.data, .db, .data.col, .db.col)
dbAnnotate(.data, .db, .data.col, .db.col)
.data |
The data to process. It can be a data.frame, a data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.db |
A data frame or a data table with an immune receptor database. See dbLoad on how to load databases into R. |
.data.col |
Character vector. Vector of columns in the input repertoires to use for clonotype search. E.g., '"CDR3.aa"' or 'c("CDR3.aa", "V.name")'. |
.db.col |
Character vector. Vector of columns in the database to use for clonotype search. The order must match the order of ".data.col". E.g., if ".data.col" is 'c("CDR3.aa", "V.name")', then ".db.col" must have the exact order of columns. i.e., the first column must correspond to CDR3 amino acid sequences, and the second column must correspond to V gene segment names. |
Data frame with input sequences and counts or proportions for each of the input repertoire.
data(immdata) #' # Example file path file_path <- paste0(system.file(package = "immunarch"), "/extdata/db/vdjdb.example.txt") # Load the database with human-only TRB-only receptors for all known antigens db <- dbLoad(file_path, "vdjdb", "HomoSapiens", "TRB") res <- dbAnnotate(immdata$data, db, "CDR3.aa", "cdr3") res
data(immdata) #' # Example file path file_path <- paste0(system.file(package = "immunarch"), "/extdata/db/vdjdb.example.txt") # Load the database with human-only TRB-only receptors for all known antigens db <- dbLoad(file_path, "vdjdb", "HomoSapiens", "TRB") res <- dbAnnotate(immdata$data, db, "CDR3.aa", "cdr3") res
The function automatically detects the database format and loads it into R. Additionally, the function provides a general query interface to databases that allows filtering by species, chain types (i.e., locus) and pathology (i.e., antigen species).
Currently we support three popular databases:
VDJDB - https://github.com/antigenomics/vdjdb-db
McPAS-TCR - http://friedmanlab.weizmann.ac.il/McPAS-TCR/
TBAdb from PIRD - https://db.cngb.org/pird/
dbLoad(.path, .db, .species = NA, .chain = NA, .pathology = NA)
dbLoad(.path, .db, .species = NA, .chain = NA, .pathology = NA)
.path |
Character. A path to the database file, e.g., "/Users/researcher/Downloads/McPAS-TCR.csv". |
.db |
Character. A database type: either "vdjdb", "vdjdb-search", "mcpas" or "tbadb". "vdjdb" for VDJDB; "vdjdb-search" for search table obtained from the web interface of VDJDB; "mcpas" for McPAS-TCR; "tbadb" for PIRD TBAdb. |
.species |
Character. A string or a vector of strings specifying which species need to be in the database, e.g., "HomoSapiens". Pass NA (by default) to load all available species. |
.chain |
Character. A string or a vector of strings specifying which chains need to be in the database, e.g., "TRB". Pass NA (by default) to load all available chains. |
.pathology |
Character. A string or a vector of strings specifying which disease, virus, bacteria or any condition needs to be in the database, e.g., "CMV". Pass NA (by default) to load all available conditions. |
Data frame with the input database records.
# Example file path file_path <- paste0(system.file(package = "immunarch"), "/extdata/db/vdjdb.example.txt") # Load the database with human-only TRB-only receptors for all known antigens db <- dbLoad(file_path, "vdjdb", "HomoSapiens", "TRB") db
# Example file path file_path <- paste0(system.file(package = "immunarch"), "/extdata/db/vdjdb.example.txt") # Load the database with human-only TRB-only receptors for all known antigens db <- dbLoad(file_path, "vdjdb", "HomoSapiens", "TRB") db
Compute information-based estimates and distances.
entropy(.data, .base = 2, .norm = FALSE, .do.norm = NA, .laplace = 1e-12) kl_div(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12) js_div(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12, .norm.entropy = FALSE) cross_entropy(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12, .norm.entropy = FALSE)
entropy(.data, .base = 2, .norm = FALSE, .do.norm = NA, .laplace = 1e-12) kl_div(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12) js_div(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12, .norm.entropy = FALSE) cross_entropy(.alpha, .beta, .base = 2, .do.norm = NA, .laplace = 1e-12, .norm.entropy = FALSE)
.data |
Numeric vector. Any distribution. |
.base |
Numeric. A base of logarithm. |
.norm |
Logical. If TRUE then normalises the entropy by the maximal value of the entropy. |
.do.norm |
If TRUE then normalises the input distributions to make them sum up to 1. |
.laplace |
Numeric. A value for the laplace correction. |
.alpha |
Numeric vector. A distribution of some random value. |
.beta |
Numeric vector. A distribution of some random value. |
.norm.entropy |
Logical. If TRUE then normalises the resulting value by the average entropy of input distributions. |
A numeric value.
P <- abs(rnorm(10)) Q <- abs(rnorm(10)) entropy(P) kl_div(P, Q) js_div(P, Q) cross_entropy(P, Q)
P <- abs(rnorm(10)) Q <- abs(rnorm(10)) entropy(P) kl_div(P, Q) js_div(P, Q) cross_entropy(P, Q)
The fixVis
is a built-in software tool for the manipulation
of plots, such as adjusting title text font and size, axes, and more. It is a powerful
tool designed to produce publication-ready plots with minimal amount of coding.
fixVis(.plot = NA)
fixVis(.plot = NA)
.plot |
A ggplot2 plot. |
No return value because it is an application.
if (interactive()) { # Compute gene usage, visualise it and tweak via fixVis data(immdata) # load test data gu <- geneUsage(immdata$data) p <- vis(gu) fixVis(p) }
if (interactive()) { # Compute gene usage, visualise it and tweak via fixVis data(immdata) # load test data gu <- geneUsage(immdata$data) p <- vis(gu) fixVis(p) }
WIP
gene_stats()
gene_stats()
gene_stats
returns all segment gene statistics
gene_stats() get_genes("hs.trbv", "segment")
gene_stats() get_genes("hs.trbv", "segment")
An utility function to analyse the immune receptor gene usage
(IGHD, IGHJ, IDHV, IGIJ, IGKJ, IGKV, IGLJ, IGLV, TRAJ, TRAV, TRBD, etc.)
and statistics. For gene details run gene_stats()
.
geneUsage( .data, .gene = c("hs.trbv", "HomoSapiens.TRBJ", "macmul.IGHV"), .quant = c(NA, "count"), .ambig = c("inc", "exc", "maj"), .type = c("segment", "allele", "family"), .norm = FALSE )
geneUsage( .data, .gene = c("hs.trbv", "HomoSapiens.TRBJ", "macmul.IGHV"), .quant = c(NA, "count"), .ambig = c("inc", "exc", "maj"), .type = c("segment", "allele", "family"), .norm = FALSE )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.gene |
A character vector of length one with the name of the gene you want
to analyse of the specific species. If you provide a vector of different length, only the first element
will be used. The string should also contain the species of interest, for example, valid ".gene" arguments
are "hs.trbv", "HomoSapiens.TRBJ" or "macmul.IGHV". For details run |
.quant |
Selects the column with data to evaluate. Pass NA if you want to compute gene statistics at the clonotype level without re-weighting. Pass "count" to use the "Clones" column to weight genes by abundance of their corresponding clonotypes. |
.ambig |
An option to handle ambiguous gene assigments, e.g., "TRAV1,TRAV2". - Pass "inc" to include all possible gene segments, so "TRAV1,TRAV2" is counted as a different gene segment. - Pass "exc" to exclude all ambiguous gene assignments, so "TRAV1,TRAV2" is excluded from the resultant gene table. We recommend to turn it on by passing "inc" (turned on by default). You can exclude data for the cases where there is no clear match for gene, include it for every supplied gene, or pick only first from the set. Set it to "exc", "inc" or "maj", respectively. |
.type |
Set the type of data to evaluate: "segment", "allele", or "family". |
.norm |
If TRUE then return proportions of genes. If FALSE then return counts of genes. |
A data frame with rows corresponding to gene segments and columns corresponding to the input samples.
data(immdata) gu <- geneUsage(immdata$data) vis(gu)
data(immdata) gu <- geneUsage(immdata$data) vis(gu)
The geneUsageAnalysis
function deploys several
data analysis methods, including PCA, multidimensional scaling,
Jensen-Shannon divergence, k-means, hierarchical clustering, DBscan, and different
correlation coefficients.
geneUsageAnalysis( .data, .method = c("js+hclust", "pca+kmeans", "anova", "js+pca+kmeans"), .base = 2, .norm.entropy = FALSE, .cor = c("pearson", "kendall", "spearman"), .do.norm = TRUE, .laplace = 1e-12, .verbose = TRUE, .k = 2, .eps = 0.01, .perp = 1, .theta = 0.1 )
geneUsageAnalysis( .data, .method = c("js+hclust", "pca+kmeans", "anova", "js+pca+kmeans"), .base = 2, .norm.entropy = FALSE, .cor = c("pearson", "kendall", "spearman"), .do.norm = TRUE, .laplace = 1e-12, .verbose = TRUE, .k = 2, .eps = 0.01, .perp = 1, .theta = 0.1 )
.data |
The |
.method |
A string that defines the type of analysis to perform. Can be "pca",
"mds", "js", "kmeans", "hclust", "dbscan" or "cor" if you want to calculate
correlation coefficient. In the latter case you have to provide |
.base |
A numerical value that defines the logarithm base for Jensen-Shannon divergence. |
.norm.entropy |
A logical value. Set TRUE to normalise your data if you haven't done it already. |
.cor |
A string that defines the correlation coefficient for analysis. Can be "pearson", "kendall" or "spearman". |
.do.norm |
A logical value. If TRUE it forces Laplace smoothing, if NA it checks if smoothing is necessary, if FALSE does nothing. |
.laplace |
The numeric value, which is used as a pseudocount for Laplace smoothing. |
.verbose |
A logical value. |
.k |
The number of clusters to create, passed as |
.eps |
A numerical value, DBscan epsylon parameter, see
|
.perp |
A numerical value, t-SNE perplexity, see |
.theta |
A numerical value, t-SNE theta parameter, see |
Depends on the last element in the .method
string. See immunr_tsne for more info.
data(immdata) gu <- geneUsage(immdata$data, .norm = TRUE) geneUsageAnalysis(gu, "js+hclust", .verbose = FALSE) %>% vis()
data(immdata) gu <- geneUsage(immdata$data, .norm = TRUE) geneUsageAnalysis(gu, "js+hclust", .verbose = FALSE) %>% vis()
Calculate the k-mer statistics of immune repertoires
getKmers(.data, .k, .col = c("aa", "nt"), .coding = TRUE)
getKmers(.data, .k, .col = c("aa", "nt"), .coding = TRUE)
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.k |
Integer. Length of k-mers. |
.col |
Character. Which column to use, pass "aa" (by default) for CDR3 amino acid sequence, pass "nt" for CDR3 nucleotide sequences. |
.coding |
Logical. If TRUE (by default) then removes all non-coding sequences from input data first. |
Data frame with two columns (k-mers and their counts).
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmers %>% vis()
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmers %>% vis()
Get a character vector of samples' groups from the input metadata file
group_from_metadata(.by, .metadata, .sep = "; ")
group_from_metadata(.by, .metadata, .sep = "; ")
.by |
Character vector. Specify a column or columns in the input metadata to group by. |
.metadata |
Metadata object. |
.sep |
Character vector. Defines a separator between groups if more than one group passed in |
Character vector with group names.
immunarch:::group_from_metadata("Status", data.frame(Status = c("A", "A", "B", "B", "C")))
A function to check if an input object has a specific class name.
has_class(.data, .class)
has_class(.data, .class)
.data |
Any R object. |
.class |
Character vector. Specifies a class name to check against. |
Logical value.
tmp <- "abc" immunarch:::has_class(tmp, "new_class") tmp <- immunarch:::add_class(tmp, "new_class") immunarch:::has_class(tmp, "new_class")
A dataset with single chain TCR data for testing and examplatory purposes.
immdata
immdata
A list of two elements. The first element ("data") is a list with data frames with clonotype tables. The second element ("meta") is a metadata table.
List of immune repertoire data frames.
Metadata
...
- "Clones" - number of barcodes (events, UMIs) or reads;
- "Proportion" - proportion of barcodes (events, UMIs) or reads;
- "CDR3.nt" - CDR3 nucleotide sequence;
- "CDR3.aa" - CDR3 amino acid sequence;
- "V.name" - names of aligned Variable gene segments;
- "D.name" - names of aligned Diversity gene segments or NA;
- "J.name" - names of aligned Joining gene segments;
- "V.end" - last positions of aligned V gene segments (1-based);
- "D.start" - positions of D'5 end of aligned D gene segments (1-based);
- "D.end" - positions of D'3 end of aligned D gene segments (1-based);
- "J.start" - first positions of aligned J gene segments (1-based);
- "VJ.ins" - number of inserted nucleotides (N-nucleotides) at V-J junction (-1 for receptors with VDJ recombination);
- "VD.ins" - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
- "DJ.ins" - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
- "Sequence" - full nucleotide sequence.
Clusters the data with one of the following methods:
- immunr_hclust
clusters the data using the hierarchical clustering from hcut;
- immunr_kmeans
clusters the data using the K-means algorithm from kmeans;
- immunr_dbscan
clusters the data using the DBSCAN algorithm from dbscan.
immunr_hclust(.data, .k = 2, .k.max = nrow(.data) - 1, .method = "complete", .dist = TRUE) immunr_kmeans(.data, .k = 2, .k.max = as.integer(sqrt(nrow(.data))) + 1, .method = c("silhouette", "gap_stat")) immunr_dbscan(.data, .eps, .dist = TRUE)
immunr_hclust(.data, .k = 2, .k.max = nrow(.data) - 1, .method = "complete", .dist = TRUE) immunr_kmeans(.data, .k = 2, .k.max = as.integer(sqrt(nrow(.data))) + 1, .method = c("silhouette", "gap_stat")) immunr_dbscan(.data, .eps, .dist = TRUE)
.data |
Matrix or data frame with features, distance matrix or output from repOverlapAnalysis or geneUsageAnalysis functions. |
.k |
The number of clusters to create, defined as |
.k.max |
Limits the maximum number of clusters. It is passed as |
.method |
Passed to hcut or as fviz_nbclust. In case of hcut the agglomeration method is going to be used (argument In case of fviz_nbclust it is the method to be used for estimating the optimal number of clusters (argument |
.dist |
If TRUE then ".data" is expected to be a distance matrix. If FALSE then the euclidean distance is computed for the input objects. |
.eps |
Local radius for expanding clusters, minimal distance between points to expand clusters. Passed as |
immunr_hclust
- list with two elements. The first element is an output from hcut.
The second element is an output from fviz_nbclust
immunr_kmeans
- list with three elements. The first element is an output from kmeans.
The second element is an output from fviz_nbclust.
The third element is the input dataset .data
.
immunr_dbscan
- list with two elements. The first element is an output from dbscan.
The second element is the input dataset .data
.
data(immdata) gu <- geneUsage(immdata$data, .norm = TRUE) immunr_hclust(t(as.matrix(gu[, -1])), .dist = FALSE) gu[is.na(gu)] <- 0 immunr_kmeans(t(as.matrix(gu[, -1])))
data(immdata) gu <- geneUsage(immdata$data, .norm = TRUE) immunr_hclust(t(as.matrix(gu[, -1])), .dist = FALSE) gu[is.na(gu)] <- 0 immunr_kmeans(t(as.matrix(gu[, -1])))
Collects a set of principal variables, reducing the number of not important variables to analyse. Dimensionality reduction makes data analysis algorithms work faster and sometimes more accurate, since it also reduces noise in the data. Currently available methods are:
- immunr_pca
performs PCA (Principal Component Analysis) using prcomp;
- immunr_mds
performs MDS (Multi-Dimensional Scaling) using isoMDS;
- immunr_tsne
performs tSNE (t-Distributed Stochastic Neighbour Embedding) using Rtsne.
immunr_pca(.data, .scale = default_scale_fun, .raw = TRUE, .orig = FALSE, .dist = FALSE) immunr_mds(.data, .scale = default_scale_fun, .raw = TRUE, .orig = FALSE, .dist = TRUE) immunr_tsne(.data, .perp = 1, .dist = TRUE, ...)
immunr_pca(.data, .scale = default_scale_fun, .raw = TRUE, .orig = FALSE, .dist = FALSE) immunr_mds(.data, .scale = default_scale_fun, .raw = TRUE, .orig = FALSE, .dist = TRUE) immunr_tsne(.data, .perp = 1, .dist = TRUE, ...)
.data |
A matrix or a data frame with features, distance matrix or output from repOverlapAnalysis or geneUsageAnalysis functions. |
.scale |
A function to apply to your data before passing it to any of dimensionality reduction algorithms. There is no scaling by default. |
.raw |
If TRUE then returns the non-processed output from dimensionality reduction algorithms. Pass FALSE if you want to visualise results. |
.orig |
If TRUE then returns the original result from algorithms. Pass FALSE if you want to visualise results. |
.dist |
If TRUE then assumes that ".data" is a distance matrix. |
.perp |
The perplexity parameter for Rtsne. Sepcifies the number of neighbours each data point must have in the resulting plot. |
... |
Other parameters passed to Rtsne. |
immunr_pca
- an output from prcomp.
immunr_mds
- an output from isoMDS.
immunr_tsne
- an output from Rtsne.
vis.immunr_pca for visualisations.
data(immdata) gu <- geneUsage(immdata$data) gu[is.na(gu)] <- 0 gu <- t(as.matrix(gu[, -1])) immunr_pca(gu) immunr_mds(dist(gu)) immunr_tsne(dist(gu))
data(immdata) gu <- geneUsage(immdata$data) gu[is.na(gu)] <- 0 gu <- t(as.matrix(gu[, -1])) immunr_pca(gu) immunr_mds(dist(gu)) immunr_tsne(dist(gu))
For reference please look up https://www.pnas.org/content/111/16/5980 (Fig. 4).
inc_overlap( .data, .fun, .step = 1000, .n.steps = 10, .downsample = FALSE, .bootstrap = NA, .verbose.inc = TRUE, ... )
inc_overlap( .data, .fun, .step = 1000, .n.steps = 10, .downsample = FALSE, .bootstrap = NA, .verbose.inc = TRUE, ... )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.fun |
Function to compute overlaps. e.g., |
.step |
Either an integer or a numeric vector. In the first case, the integer defines the step of incremental overlap. In the second case, the vector encodes all repertoire sampling depths. |
.n.steps |
Integer. Number of steps if |
.downsample |
If TRUE then performs downsampling to N clonotypes at each step instead of choosing the top N clonotypes. |
.bootstrap |
Set NA to turn off any bootstrapping, set a number to perform bootstrapping with this number of tries. |
.verbose.inc |
Logical. If TRUE then shows the output from the computation process. |
... |
Other arguments passed to |
List with overlap matrices.
data(immdata) ov <- repOverlap(immdata$data, "inc+overlap", .step = 100, .verbose.inc = FALSE, .verbose = FALSE) vis(ov)
data(immdata) ov <- repOverlap(immdata$data, "inc+overlap", .step = 100, .verbose.inc = FALSE, .verbose = FALSE) vis(ov)
Copy the upper matrix triangle to the lower one
matrixdiagcopy(.mat)
matrixdiagcopy(.mat)
.mat |
Matrix. |
Matrix with its upper tri part copied to the lower tri part.
mat <- matrix(0, 3, 3) mat mat[1, 3] <- 1 mat <- immunarch:::matrixdiagcopy(mat) mat
Get a matrix with public clonotype frequencies
public_matrix(.data)
public_matrix(.data)
.data |
Public repertoire, an output from pubRep. |
Matrix with per-sample clonotype counts / proportions only.
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr.mat <- public_matrix(pr) dim(pr.mat) head(pr.mat)
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr.mat <- public_matrix(pr) dim(pr.mat) head(pr.mat)
Create a repertoire of public clonotypes
pubRep( .data, .col = "aa+v", .quant = c("count", "prop"), .coding = TRUE, .min.samples = 1, .max.samples = NA, .verbose = TRUE )
pubRep( .data, .col = "aa+v", .quant = c("count", "prop"), .coding = TRUE, .min.samples = 1, .max.samples = NA, .verbose = TRUE )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.col |
A string that specifies the column(s) to be processed. Outputs one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute overlaps on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. |
.quant |
A string that specifies the column to be processed. Set "count" to see public clonotype sharing with the number of clones, set "prop" to see proportions. |
.coding |
Logical. If TRUE then preprocesses the data to filter out non-coding sequences. |
.min.samples |
Integer. A minimal number of samples a clonotype must have to be included in the public repertoire table. |
.max.samples |
Integer. A maxminal number of samples a clonotype must have to be included in the public repertoire table. Set NA (by default) to have the maximal amount of samples. |
.verbose |
Logical. If TRUE then outputs the progress. |
Data table with columns for:
- Clonotypes (e.g., CDR3 sequence, or two columns for CDR3 sequence and V gene)
- Incidence of clonotypes
- Per-sample proportions or counts
# Subset the data to make the example faster to run immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "clonotypes", 1, 2)
# Subset the data to make the example faster to run immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "clonotypes", 1, 2)
Work In Progress
pubRepApply(.pr1, .pr2, .fun = function(x) log10(x[1])/log10(x[2]))
pubRepApply(.pr1, .pr2, .fun = function(x) log10(x[1])/log10(x[2]))
.pr1 |
First public repertoire. |
.pr2 |
Second public repertoire. |
.fun |
A function to apply to pairs of frequencies of same clonotypes from "pr1" and "pr2".
By default - |
Work in progress.
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr1 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "MS")) pr2 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "C")) prapp <- pubRepApply(pr1, pr2) head(prapp)
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr1 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "MS")) pr2 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "C")) prapp <- pubRepApply(pr1, pr2) head(prapp)
Filter our clonotypes with low incidence in a specific group.
pubRepFilter(.pr, .meta, .by, .min.samples = 1)
pubRepFilter(.pr, .meta, .by, .min.samples = 1)
.pr |
Public repertoires, an output from pubRep. |
.meta |
Metadata file. |
.by |
Named character vector. Names of the group to filter by. |
.min.samples |
Integer. Filters out clonotypes with the number of samples below than this number. |
Data frame with filtered clonotypes.
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr1 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "MS")) head(pr1)
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pr1 <- pubRepFilter(pr, immdata$meta, .by = c(Status = "MS")) head(pr1)
Statistics of number of public clonotypes for each possible combinations of repertoires
pubRepStatistics(.data, .by = NA, .meta = NA)
pubRepStatistics(.data, .by = NA, .meta = NA)
.data |
Public repertoire, an output from the pubRep function. |
.by |
Work in Progress. |
.meta |
Work in Progress. |
Data frame with incidence statistics per sample.
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pubRepStatistics(pr) %>% vis()
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pubRepStatistics(pr) %>% vis()
This function aligns all sequences (incliding germline) that belong to one clonal lineage and one cluster. After clustering and building the clonal lineage and germline, the next step is to analyze the degree of mutation and maturity of each clonal lineage. This allows for finding high mature cells and cells with a large number of offspring. The phylogenetic analysis will find mutations that increase the affinity of BCR. Making alignment of the sequence is the first step towards sequence analysis including BCR.
repAlignLineage(.data, .min_lineage_sequences, .prepare_threads, .align_threads, .nofail)
repAlignLineage(.data, .min_lineage_sequences, .prepare_threads, .align_threads, .nofail)
.data |
The data to be processed. Can be data.frame, data.table or a list of these objects. |
.min_lineage_sequences |
If number of sequences in the same clonal lineage and the same cluster (not including germline) is lower than this threshold, this group of sequences will be filtered out from the dataframe; so only large enough lineages will be included. |
.prepare_threads |
Number of threads to prepare results table. Please note that high number can cause heavy memory usage! |
.align_threads |
Number of threads for lineage alignment. It must have columns in the immunarch compatible format immunarch_data_format, and also must contain 'Cluster' column, which is added by seqCluster() function, and 'Germline.sequence' column, which is added by repGermline() function. |
.nofail |
Will return NA instead of stopping if Clustal W is not installed. Used to avoid raising errors in examples on computers where Clustal W is not installed. |
Dataframe or list of dataframes (if input is a list with multiple samples). The dataframe has these columns: * Cluster: cluster name * Germline: germline sequence * Alignment: DNAbin object with alignment * Sequences: nested dataframe containing all sequences for this combination of cluster and germline; it has columns * Sequence, CDR1.nt, CDR2.nt, CDR3.nt, FR1.nt, FR2.nt, FR3.nt, FR4.nt, V.allele, J.allele, V.aa, J.aa: all values taken from the input dataframe * Clone.ID: taken from the input dataframe, or created (filled with row numbers) if missing * Clones: taken from the input dataframe, or created (filled with '1' values) if missing
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE)
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE)
This function uses the PHYLIP package to make phylogenetic analysis. For making trees it uses maximum parsimony methods.
repClonalFamily(.data, .vis_groups, .threads, .nofail)
repClonalFamily(.data, .vis_groups, .threads, .nofail)
.data |
The data to be processed, output of repAlignLineage() function. |
.vis_groups |
Groups for visualization, used to annotate specific clones on chart and display them in different colors. This is a named list, where names are for the chart legend, and list items are clone IDs that belong to the groups. It's not necessary to assign groups to all clonotypes; unassigned ones will be displayed on the chart as "Clonotype" category. It's also possible to assign multiple clonotypes to the same group by providing nested lists or vectors of clone IDs instead of single clone IDs. Example: .vis_groups = list(A = 817, B = 201, C = list(303, 42)) |
.threads |
Number of threads to use. |
.nofail |
Returns NA instead of stopping if PHYLIP is not installed. Used to avoid raising errors in examples on computers where PHYLIP is not installed. |
Dataframe or list of dataframes (if input is a list with multiple samples). The dataframe has these columns: * Cluster: cluster name * Germline.Input: germline sequence, like it was in the input; not aligned * Germline.Output: germline sequence, parsed from PHYLIP dnapars function output; it contains difference of germline from the common ancestor; "." characters mean matching letters * Common.Ancestor: common ancestor sequence, parsed from PHYLIP dnapars function output * Trunk.Length: mean trunk length, representing the distance between the most recent common ancestor and germline sequence as a measure of the maturity of a lineage * Tree: output tree in "phylo" format, loaded from by PHYLIP dnapars function output * TreeStats: nested dataframe containing data about tree nodes, needed for visualization * Sequences: nested dataframe containing all sequences for this combination of cluster and germline; it contains regions from original sequences, saved for repSomaticHypermutation() calculation, and also data needed for visualizations
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE)
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE)
repClonality
function encompasses several methods to measure
clonal proportions in a given repertoire.
repClonality( .data, .method = c("clonal.prop", "homeo", "top", "rare"), .perc = 10, .clone.types = c(Rare = 1e-05, Small = 1e-04, Medium = 0.001, Large = 0.01, Hyperexpanded = 1), .head = c(10, 100, 1000, 3000, 10000, 30000, 1e+05), .bound = c(1, 3, 10, 30, 100) )
repClonality( .data, .method = c("clonal.prop", "homeo", "top", "rare"), .perc = 10, .clone.types = c(Rare = 1e-05, Small = 1e-04, Medium = 0.001, Large = 0.01, Hyperexpanded = 1), .head = c(10, 100, 1000, 3000, 10000, 30000, 1e+05), .bound = c(1, 3, 10, 30, 100) )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method |
A String with one of the following options: Set Set Set Set |
.perc |
A single numerical value ranging from 0 to 100. |
.clone.types |
A named numerical vector with the threshold of the half-closed intervals that mark off clonal groups. |
.head |
A numerical vector with ranges of the top clonotypes. |
.bound |
A numerical vector with ranges of abundance for the rare clonotypes in the dataset. |
Clonal proportion assessment is a different approach to estimate repertoire diversity. When visualised, it allows for thorough examination of immune repertoire structure and composition.
In its core this type of analysis is similar to the relative species abundance concept in ecology. Relative abundance is the percent composition of an organism of a particular kind relative to the total number of organisms in the area.
A stacked barplot of relative clonotype abundances can be therefore viewed as a non-parametric approach to comparing their underlying distributions.
If input data is a single immune repertoire, then the function returns a numeric vector with clonality statistics.
Otherwise, it returns a numeric matrix with clonality statistics for all input repertoires.
# Load the data data(immdata) imm_pr <- repClonality(immdata$data, .method = "clonal.prop") vis(imm_pr) imm_top <- repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000)) vis(imm_top) imm_rare <- repClonality(immdata$data, .method = "rare") vis(imm_rare) imm_hom <- repClonality(immdata$data, .method = "homeo") vis(imm_hom)
# Load the data data(immdata) imm_pr <- repClonality(immdata$data, .method = "clonal.prop") vis(imm_pr) imm_top <- repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000)) vis(imm_top) imm_rare <- repClonality(immdata$data, .method = "rare") vis(imm_rare) imm_hom <- repClonality(immdata$data, .method = "homeo") vis(imm_hom)
This is a utility function to estimate the diversity of species or objects in the given distribution.
Note: functions will check if .data is a distribution of a random variable (sum == 1) or not. To force normalisation and / or to prevent this, set .do.norm to TRUE (do normalisation) or FALSE (don't do normalisation), respectively.
repDiversity( .data, .method = "chao1", .col = "aa", .max.q = 6, .min.q = 1, .q = 5, .step = NA, .quantile = c(0.025, 0.975), .extrapolation = NA, .perc = 50, .norm = TRUE, .verbose = TRUE, .do.norm = NA, .laplace = 0 )
repDiversity( .data, .method = "chao1", .col = "aa", .max.q = 6, .min.q = 1, .q = 5, .step = NA, .quantile = c(0.025, 0.975), .extrapolation = NA, .perc = 50, .norm = TRUE, .verbose = TRUE, .do.norm = NA, .laplace = 0 )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method |
Picks a method used for estimation out of a following list: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dxx. |
.col |
A string that specifies the column(s) to be processed. Pass one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute diversity estimations on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up. |
.max.q |
The max hill number to calculate (default: 5). |
.min.q |
Function calculates several hill numbers. Set the min (default: 1). |
.q |
q-parameter for the Diversity index. |
.step |
Rarefaction step's size. |
.quantile |
Numeric vector with quantiles for confidence intervals. |
.extrapolation |
An integer. An upper limit for the number of clones to extrapolate to. Pass 0 (zero) to turn extrapolation subroutines off. |
.perc |
Set the percent to dXX index measurement. |
.norm |
Normalises rarefaction curves. |
.verbose |
If TRUE then outputs progress. |
.do.norm |
One of the three values - NA, TRUE or FALSE. If NA then checks for distrubution (sum(.data) == 1) and normalises if needed with the given laplace correction value. if TRUE then does normalisation and laplace correction. If FALSE then doesn't do neither normalisaton nor laplace correction. |
.laplace |
A numeric value, which is used as a pseudocount for Laplace smoothing. |
- True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.
- Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.
- The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income).
- The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.
- Chao1 estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population).
- Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.
- Hill numbers are a mathematically unified family of diversity indices (differing among themselves only by an exponent q).
- d50 is a recently developed immune diversity estimate. It calculates the minimum number of distinct clonotypes amounting to greater than or equal to 50 percent of a total of sequencing reads obtained following amplification and sequencing
- dXX is a similar to d50 index where XX corresponds to desirable percent of total sequencing reads.
div, gini, gini.simp, inv.simp, raref return numeric vector of length 1 with value.
chao1 returns 4 values: estimated number of species, standart deviation of this number and two 95
hill returns a vector of specified length .max.q - .min.q
For most methods, if input data is a single immune repertoire, then the function returns a numeric vector with diversity statistics.
Otherwise, it returns a numeric matrix with diversity statistics for all input repertoires.
For Chao1 the function returns a matrix with diversity estimations.
For rarefaction the function returns either a matrix with diversity estimatinos on different step of the simulaiton process or a list with such matrices.
repOverlap, entropy, repClonality Rarefaction wiki https://en.wikipedia.org/wiki/Rarefaction_(ecology) Hill numbers paper https://www.uvm.edu/~ngotelli/manuscriptpdfs/ChaoHill.pdf Diversity wiki https://en.wikipedia.org/wiki/Measurement_of_biodiversity
data(immdata) # Make data smaller for testing purposes immdata$data <- top(immdata$data, 4000) # chao1 repDiversity(.data = immdata$data, .method = "chao1") %>% vis() # Hill numbers repDiversity( .data = immdata$data, .method = "hill", .max.q = 6, .min.q = 1, .do.norm = NA, .laplace = 0 ) %>% vis() # diversity repDiversity(.data = immdata$data, .method = "div", .q = 5, .do.norm = NA, .laplace = 0) %>% vis() # Gini-Simpson repDiversity(.data = immdata$data, .method = "gini.simp", .q = 5, .do.norm = NA, .laplace = 0) %>% vis() # inverse Simpson repDiversity(.data = immdata$data, .method = "inv.simp", .do.norm = NA, .laplace = 0) %>% vis() # Gini coefficient repDiversity(.data = immdata$data, .method = "gini", .do.norm = NA, .laplace = 0) # d50 repDiversity(.data = immdata$data, .method = "d50") %>% vis()
data(immdata) # Make data smaller for testing purposes immdata$data <- top(immdata$data, 4000) # chao1 repDiversity(.data = immdata$data, .method = "chao1") %>% vis() # Hill numbers repDiversity( .data = immdata$data, .method = "hill", .max.q = 6, .min.q = 1, .do.norm = NA, .laplace = 0 ) %>% vis() # diversity repDiversity(.data = immdata$data, .method = "div", .q = 5, .do.norm = NA, .laplace = 0) %>% vis() # Gini-Simpson repDiversity(.data = immdata$data, .method = "gini.simp", .q = 5, .do.norm = NA, .laplace = 0) %>% vis() # inverse Simpson repDiversity(.data = immdata$data, .method = "inv.simp", .do.norm = NA, .laplace = 0) %>% vis() # Gini coefficient repDiversity(.data = immdata$data, .method = "gini", .do.norm = NA, .laplace = 0) # d50 repDiversity(.data = immdata$data, .method = "d50") %>% vis()
The repExplore
function calculates the basic statistics of
repertoire: the number of unique immune receptor clonotypes, their relative abundances,
and sequence length distribution across the input dataset.
repExplore( .data, .method = c("volume", "count", "len", "clones"), .col = c("nt", "aa"), .coding = TRUE )
repExplore( .data, .method = c("volume", "count", "len", "clones"), .col = c("nt", "aa"), .coding = TRUE )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method |
A string that specifies the method of analysis. It can be either "volume", "count", "len" or "clones". When .method is set to "volume" the repExplore calculates the number of unique clonotypes in the input data. When .method is set to "count" the repExplore calculates the distribution of clonotype abundances, i.e., how frequent receptors with different abundances are. When .method is set to "len" the repExplore calculates the distribution of CDR3 sequence lengths. When .method is set to "clones" the repExplore returns the number of clones (i.e., cells) per input repertoire. |
.col |
A string that specifies the column to be processed. Pass "nt" for nucleotide sequence or "aa" for amino acid sequence. |
.coding |
If |
If input data is a single immune repertoire, then the function returns a numeric vector with exploratory analysis statistics.
Otherwise, it returns a numeric matrix with exploratory analysis statistics for all input repertoires.
data(immdata) # Calculate statistics and generate a visual output with vis() repExplore(immdata$data, .method = "volume") %>% vis() repExplore(immdata$data, .method = "count") %>% vis() repExplore(immdata$data, .method = "len") %>% vis()
data(immdata) # Calculate statistics and generate a visual output with vis() repExplore(immdata$data, .method = "volume") %>% vis() repExplore(immdata$data, .method = "count") %>% vis() repExplore(immdata$data, .method = "len") %>% vis()
Main function for data filtering
repFilter( .data, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame")), .match = "exact" )
repFilter( .data, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame")), .match = "exact" )
.data |
The data to be processed. Must be the list of 2 elements: a data table and a metadata table. |
.method |
Method of filtering. Implemented methods: by.meta, by.repertoire (by.rep), by.clonotype (by.cl) Default value: 'by.clonotype'. |
.query |
Filtering query. It's a named list of filters that will be applied to data. Possible values for names in this list are dependent on filter methods: - by.meta: filters by metadata. Names in the named list are metadata column headers. - by.repertoire: filters by the number of clonotypes or total number of clones in sample. Possible names in the named list are "n_clonotypes" and "n_clones". - by.clonotype: filters by data in all samples. Names in the named list are data column headers. Elements of the named list for each of the filters are filtering options. Possible values for filtering options: - include("STR1", "STR2", ...): keeps only rows with matching values. Available for methods: "by.meta", "by.clonotype". - exclude("STR1", "STR2", ...): removes rows with matching values. Available for methods: "by.meta", "by.clonotype". - lessthan(value): keeps rows/samples with numeric values less than specified. Available for methods: "by.meta", "by.repertoire", "by.clonotype". - morethan(value): keeps rows/samples with numeric values more than specified. Available for methods: "by.meta", "by.repertoire", "by.clonotype". - interval(from, to): keeps rows/samples with numeric values that fits in this interval. from is inclusive, to is exclusive. Available for methods: "by.meta", "by.repertoire", "by.clonotype". Default value: 'list(CDR3.aa = exclude("partial", "out_of_frame"))'. |
.match |
Matching method for "include" and "exclude" options in query. Possible values: - exact: matches only the exact specified string; - startswith: matches all strings starting with the specified substring; - substring: matches all strings containing the specified substring. Default value: 'exact'. |
data(immdata) # Select samples with status "MS" repFilter(immdata, "by.meta", list(Status = include("MS"))) # Select samples without status "MS" repFilter(immdata, "by.meta", list(Status = exclude("MS"))) # Select samples from lanes "A" and "B" with age > 15 repFilter(immdata, "by.meta", list(Lane = include("A", "B"), Age = morethan(15))) # Select samples that are not from lanes "A" and "B" repFilter(immdata, "by.meta", list(Lane = exclude("A", "B"))) # Select samples with a number of clonotypes from 1000 to 5000 repFilter(immdata, "by.repertoire", list(n_clonotypes = interval(1000, 5000))) # Select clonotypes in all samples with alpha chains repFilter(immdata, "by.clonotype", list(V.name = include("AV"), J.name = include("AJ")), .match = "substring" )
data(immdata) # Select samples with status "MS" repFilter(immdata, "by.meta", list(Status = include("MS"))) # Select samples without status "MS" repFilter(immdata, "by.meta", list(Status = exclude("MS"))) # Select samples from lanes "A" and "B" with age > 15 repFilter(immdata, "by.meta", list(Lane = include("A", "B"), Age = morethan(15))) # Select samples that are not from lanes "A" and "B" repFilter(immdata, "by.meta", list(Lane = exclude("A", "B"))) # Select samples with a number of clonotypes from 1000 to 5000 repFilter(immdata, "by.repertoire", list(n_clonotypes = interval(1000, 5000))) # Select clonotypes in all samples with alpha chains repFilter(immdata, "by.clonotype", list(V.name = include("AV"), J.name = include("AJ")), .match = "substring" )
This function creates germlines for clonal lineages. B cell clonal lineage represents a set of B cells that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. Each clonal lineage has its own germline sequence that represents the ancestral sequence for each BCR in clonal lineage. In other words, germline sequence is a sequence of B-cells immediately after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful for assessing the degree of mutation and maturity of the repertoire.
repGermline(.data, .species, .min_nuc_outside_cdr3, .threads)
repGermline(.data, .species, .min_nuc_outside_cdr3, .threads)
.data |
The data to be processed. Can be data.frame, data.table or a list of these objects. It must have columns in the immunarch compatible format immunarch_data_format. |
.species |
Species from which the data was acquired. Available options: "HomoSapiens" (default), "MusMusculus", "BosTaurus", "CamelusDromedarius", "CanisLupusFamiliaris", "DanioRerio", "MacacaMulatta", "MusMusculusDomesticus", "MusMusculusCastaneus", "MusMusculusMolossinus", "MusMusculusMusculus", "MusSpretus", "OncorhynchusMykiss", "OrnithorhynchusAnatinus", "OryctolagusCuniculus", "RattusNorvegicus", "SusScrofa". |
.min_nuc_outside_cdr3 |
This parameter sets how many nucleotides should have V or J chain outside of CDR3 to be considered good for further alignment. |
.threads |
Number of threads to use. |
Data with added columns: * Sequence (FR1+CDR1+FR2+CDR2+FR3+CDR3+FR4 in nucleotides; the column will be replaced if exists) * V.allele, J.allele (chosen alleles of V and J genes), * V.aa, J.aa (V and J sequences from original clonotype, outside CDR3, converted to amino acids) * Germline.sequence (combined germline nucleotide sequence)
data(bcrdata) bcrdata$data %>% top(5) %>% repGermline()
data(bcrdata) bcrdata$data %>% top(5) %>% repGermline()
The repLoad
function loads repertoire files
into R workspace in the immunarch format where you can immediately use them for
the analysis. repLoad
automatically detects the right format for
your files, so all you need is simply provide the path to your files.
See "Details" for more information on supported formats. See "Examples" for diving right into it.
repLoad(.path, .mode = "paired", .coding = TRUE, ...)
repLoad(.path, .mode = "paired", .coding = TRUE, ...)
.path |
A character string specifying the path to the input data. Input data can be one of the following: - a single repertoire file.
In this case - a vector of paths to repertoire files. Same as in the case with no metadata file presented in the next section below; - a path to the folder with repertoire files and, if available, metadata file "metadata.txt".
If the metadata file if presented, then the |
.mode |
Either "single" for single chain data or "paired" for paired chain data. Currently "single" works for every format, and "paired" works only for 10X Genomics data. By default, 10X Genomics data will be loaded as paired chain data, and other files will be loaded as single chain data. |
.coding |
A logical value. Set TRUE to get coding-only clonotypes (by defaul). Set FALSE to get all clonotypes. |
... |
Extra arguments for parsing functions |
The metadata has to be a tab delimited file with first column named "Sample". It can have any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder. Example:
Sample | Sex | Age | Status |
immunoseq_1 | M | 1 | C |
immunoseq_2 | M | 2 | C |
immunoseq_3 | FALSE | 3 | A |
Currently, Immunarch support the following formats:
- "immunoseq" - ImmunoSEQ of any version. http://www.adaptivebiotech.com/immunoseq
- "mitcr" - MiTCR. https://github.com/milaboratory/mitcr
- "mixcr" - MiXCR (the "all" files) of any version. https://github.com/milaboratory/mixcr
- "migec" - MiGEC. http://migec.readthedocs.io/en/latest/
- "migmap" - For parsing IgBLAST results postprocessed with MigMap. https://github.com/mikessh/migmap
- "tcr" - tcR, our previous package. https://imminfo.github.io/tcr/
- "vdjtools" - VDJtools of any version. http://vdjtools-doc.readthedocs.io/en/latest/
- "imgt" - IMGT HighV-QUEST. http://www.imgt.org/HighV-QUEST/
- "airr" - adaptive immune receptor repertoire (AIRR) data format. http://docs.airr-community.org/en/latest/datarep/overview.html
- "10x" - 10XGenomics clonotype annotations tables. https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation
- "archer" - ArcherDX clonotype tables. https://archerdx.com/
A list with two named elements:
- "data" is a list of input samples;
- "meta" is a data frame with sample metadata.
immunr_data_format for immunarch data format; repSave for file saving; repOverlap, geneUsage and repDiversity for starting with immune repertoires basic statistics.
# To load the data from a single file (note that you don't need to specify the data format): file_path <- paste0(system.file(package = "immunarch"), "/extdata/io/Sample1.tsv.gz") immdata <- repLoad(file_path) # Suppose you have a following structure in your folder: # >_ ls # immunoseq1.txt # immunoseq2.txt # immunoseq3.txt # metadata.txt # To load the whole folder with every file in it type: file_path <- paste0(system.file(package = "immunarch"), "/extdata/io/") immdata <- repLoad(file_path) print(names(immdata)) # We recommend creating a metadata file named "metadata.txt" in the folder. # In that case, when you load your data you will see: # > immdata <- repLoad("path/to/your/folder/") # > names(immdata) # [1] "data" "meta" # If you do not have "metadata.txt", you will see the same output, # but your metadata will be almost empty: # > immdata <- repLoad("path/to/your/folder/") # > names(immdata) # [1] "data" "meta"
# To load the data from a single file (note that you don't need to specify the data format): file_path <- paste0(system.file(package = "immunarch"), "/extdata/io/Sample1.tsv.gz") immdata <- repLoad(file_path) # Suppose you have a following structure in your folder: # >_ ls # immunoseq1.txt # immunoseq2.txt # immunoseq3.txt # metadata.txt # To load the whole folder with every file in it type: file_path <- paste0(system.file(package = "immunarch"), "/extdata/io/") immdata <- repLoad(file_path) print(names(immdata)) # We recommend creating a metadata file named "metadata.txt" in the folder. # In that case, when you load your data you will see: # > immdata <- repLoad("path/to/your/folder/") # > names(immdata) # [1] "data" "meta" # If you do not have "metadata.txt", you will see the same output, # but your metadata will be almost empty: # > immdata <- repLoad("path/to/your/folder/") # > names(immdata) # [1] "data" "meta"
The repOverlap
function is designed to analyse the overlap between
two or more repertoires. It contains a number of methods to compare immune receptor
sequences that are shared between individuals.
repOverlap( .data, .method = c("public", "overlap", "jaccard", "tversky", "cosine", "morisita", "inc+public", "inc+morisita"), .col = "aa", .a = 0.5, .b = 0.5, .verbose = TRUE, .step = 1000, .n.steps = 10, .downsample = FALSE, .bootstrap = NA, .verbose.inc = NA, .force.matrix = FALSE )
repOverlap( .data, .method = c("public", "overlap", "jaccard", "tversky", "cosine", "morisita", "inc+public", "inc+morisita"), .col = "aa", .a = 0.5, .b = 0.5, .verbose = TRUE, .step = 1000, .n.steps = 10, .downsample = FALSE, .bootstrap = NA, .verbose.inc = NA, .force.matrix = FALSE )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method |
A string that specifies the method of analysis or a combination of
methods. The |
.col |
A string that specifies the column(s) to be processed. Pass one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute overlaps on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up. |
.a , .b
|
Alpha and beta parameters for Tversky Index. Default values give the Jaccard index measure. |
.verbose |
if TRUE then output the progress. |
.step |
Either an integer or a numeric vector. In the first case, the integer defines the step of incremental overlap. In the second case, the vector encodes all repertoire sampling depths. |
.n.steps |
Skipped if ".step" is a numeric vector. |
.downsample |
If TRUE then performs downsampling to N clonotypes at each step instead of choosing the top N clonotypes in incremental overlaps. Change nothing of you are using conventional methods. |
.bootstrap |
Set NA to turn off any bootstrapping, set a number to perform bootstrapping with this number of tries. |
.verbose.inc |
Logical. If TRUE then shows output from the computation process. |
.force.matrix |
Logical. If TRUE then always forces the matrix output even in case of two input repertoires. |
"public" and "shared" are synonyms that exist for the convenience of researchers.
The "overlap" coefficient is a similarity measure that measures the overlap between two finite sets.
The "jaccard" index is conceptually a percentage of how many objects two sets have in common out of how many objects they have total.
The "tversky" index is an asymmetric similarity measure on sets that compares a variant to a prototype.
The "cosine" index is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
The "morisita" index measures how many times it is more likely to randomly select two sampled points from the same quadrat (the dataset is covered by a regular grid of changing size) then it would be in the case of a random distribution generated from a Poisson process. Duplicate objects are merged with their counts are summed up.
In most cases the return value is a matrix with overlap values for each pair of repertoires.
If only two repertoires were provided, return value is single numeric value.
If one of the incremental method is chosen, return list of overlap matrix.
data(immdata) # Make data smaller for testing purposes immdata$data <- top(immdata$data, 4000) ov <- repOverlap(immdata$data, .verbose = FALSE) vis(ov) ov <- repOverlap(immdata$data, "jaccard", .verbose = FALSE) vis(ov, "heatmap2")
data(immdata) # Make data smaller for testing purposes immdata$data <- top(immdata$data, 4000) ov <- repOverlap(immdata$data, .verbose = FALSE) vis(ov) ov <- repOverlap(immdata$data, "jaccard", .verbose = FALSE) vis(ov, "heatmap2")
The repOverlapAnalysis
function contains advanced data
analysis methods. You can use several clustering and dimensionality reduction
techniques in order to investigate further the difference between repertoires
provided.
To cluster a subset of similar data with repOverlapAnalysis
you can
perform hierarchical clustering, k-means or dbscan ('hclust', 'kmeans', 'dbscan'
respectively).
To reduce dimensions, for example, to select features for subsequent analysis, you can execute the multidimensional scaling or t-sne algorithms ('mds' and 'tsne' respectively).
repOverlapAnalysis( .data, .method = ("hclust"), .scale = default_scale_fun, .raw = TRUE, .perp = 1, .theta = 0.1, .eps = 0.01, .k = 2 )
repOverlapAnalysis( .data, .method = ("hclust"), .scale = default_scale_fun, .raw = TRUE, .perp = 1, .theta = 0.1, .eps = 0.01, .k = 2 )
.data |
Any distance matrix between pairs of repertoires. You can also pass your
output from |
.method |
A string that defines the type of analysis to perform. |
.scale |
A function to scale the data before passing it to the MDS algorithm. |
.raw |
A logical value. Set TRUE if you want to receive raw output of clustering
or dimensionality reduction function of choice. Set FALSE if you want to receive
processed output that can be subjected to visualisation with |
.perp |
A numerical value, t-SNE parameter, see |
.theta |
A numerical value, t-SNE parameter, see |
.eps |
A numerical value, DBscan epsylon parameter, see |
.k |
The number of clusters to create, passed as |
Depends on the last element in the .method
string. See immunr_tsne for more info.
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+hclust") %>% vis()
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+hclust") %>% vis()
Sample (downsample) repertoires using different approches.
repSample( .data, .method = c("downsample", "resample", "sample"), .n = NA, .prob = TRUE )
repSample( .data, .method = c("downsample", "resample", "sample"), .n = NA, .prob = TRUE )
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method |
Character. Name of a sampling method. See "Details" for more details. Default value is "downsample" that downsamples the repertoires to the number of clones (i.e., reads / UMIs) that the smallest repertoire has, if user doesn't set any value to the ".n" argument. |
.n |
Integer. Number of clones / clonotypes / reads / UMIs to choose, depending on the method. Set NA to sample repertoires to the size of the smallest repertoire in the ".data". |
.prob |
Logical. If TRUE then samples the clonotypes with probability weights equal to their number of clones. Used only if ".method" is "sample". |
If .method
is "downsample" then repSample
chooses .n
clones (not clonotypes!) from the input repertoires without any probabilistic simulation,
but exactly computing each choosed clones. Such approach is is more consistent and biologically pleasant than
an output from the function if .method
is "resample".
If .method
is "resample" then repSample
uses multinomial distribution to compute the number of occurences for each cloneset.
then it removes zero-number clonotypes and return the resulting data frame. Probabilities for rmultinom
for each cloneset
is a percentage of this cloneset in the "Proportion" column. It's a some sort of simulation of how clonotypes are chosen from the organisms.
if .method
is "sample" then repSample
chooses .n
clonotypes (not clones!) randomly. Depending on the
.prob
argument, the function chooses clonotypes either according to their size (if .prob
is TRUE, by default),
or each clonotype has an equal chance to be choosed (if .prob
is FALSE). Note that sampling is done without replacing.
Subsampled immune repertoire or a list of subsampled immune repertoires.
data(immdata) # Downsampling to 1000 clones (not clonotypes!) tmp <- repSample(immdata$data[[1]], .n = 1000) sum(tmp$Clones) # Downsampling to 1000 clonotypes tmp <- repSample(immdata$data[[1]], "sample", .n = 1000) nrow(tmp) # Downsampling to the smallest repertoire by clones (not clonotypes!) tmp <- repSample(immdata$data[c(1, 2)]) sum(tmp[[1]]$Clones) sum(tmp[[2]]$Clones) # Downsampling to the smallest repertoire by clonotypes tmp <- repSample(immdata$data[c(1, 2)], "sample") nrow(tmp[[1]]$Clones) nrow(tmp[[2]]$Clones)
data(immdata) # Downsampling to 1000 clones (not clonotypes!) tmp <- repSample(immdata$data[[1]], .n = 1000) sum(tmp$Clones) # Downsampling to 1000 clonotypes tmp <- repSample(immdata$data[[1]], "sample", .n = 1000) nrow(tmp) # Downsampling to the smallest repertoire by clones (not clonotypes!) tmp <- repSample(immdata$data[c(1, 2)]) sum(tmp[[1]]$Clones) sum(tmp[[2]]$Clones) # Downsampling to the smallest repertoire by clonotypes tmp <- repSample(immdata$data[c(1, 2)], "sample") nrow(tmp[[1]]$Clones) nrow(tmp[[2]]$Clones)
The repSave
function is deigned to save your data to the disk
in desirable format. Currently supports "immunarch" and "vdjtools" file formats.
repSave(.data, .path, .format = c("immunarch", "vdjtools"), .compress = TRUE)
repSave(.data, .path, .format = c("immunarch", "vdjtools"), .compress = TRUE)
.data |
An R dataframe, a list of R dataframes or a list with |
.path |
A string with the path to the output directory. It should include file name if a single dataframe is provided to .data argument. |
.format |
A string with desirable format specification. Current options are "immunarch" and "vdjtools". |
.compress |
A boolean value. Defines whether the output will be compressed or not. |
It is not necessary to create directories beforehand. If the provided directory does not exist it will be created automatically.
No return value.
data(immdata) # Reduce data to save time on examples immdata$data <- purrr::map(immdata$data, ~ .x %>% head(10)) dirpath <- tempdir() # Save the list of repertoires repSave(immdata, dirpath) # Load it and check if it is the same new_immdata <- repLoad(dirpath) # sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) # sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) # sum(immdata$meta != new_immdata$meta, na.rm = TRUE)
data(immdata) # Reduce data to save time on examples immdata$data <- purrr::map(immdata$data, ~ .x %>% head(10)) dirpath <- tempdir() # Save the list of repertoires repSave(immdata, dirpath) # Load it and check if it is the same new_immdata <- repLoad(dirpath) # sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) # sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) # sum(immdata$meta != new_immdata$meta, na.rm = TRUE)
This function aligns V and J genes from the germline in each cluster with corresponding genes in each clonotype, saves the alignments for purpose of visualization, and calculates number of mutations for each clonotype.
repSomaticHypermutation(.data, .threads, .nofail)
repSomaticHypermutation(.data, .threads, .nofail)
.data |
The data to be processed: an output of repClonalFamily(); variants with one sample and list of samples are both supported. |
.threads |
Number of threads to use. |
.nofail |
Will return NA instead of stopping if Clustal W is not installed. Used to avoid raising errors in examples on computers where Clustal W is not installed. |
Dataframe or list of dataframes (if input is a list with multiple samples). The dataframe has all the columns from repClonalFamily() output dataframe, with Sequence column unnested: the resulting dataframe has one line per clonotype. Clone.ID column contains original IDs for clonotypes, and can be used as dataframe key. New columns are added: * Germline.Alignment.V: contains V gene alignment of current clonotype with the germline * Germline.Alignment.J: contains J gene alignment of current clonotype with the germline * Substitutions: contains number of substitutions in the alignment (summary for V and J) * Insertions: contains number of insertions in the clonotype relative to germline (summary for V and J) * Deletions: contains number of deletions in the clonotype relative to germline (summary for V and J) * Mutations: contains total number of mutations in the alignment (summary for V and J)
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) %>% repSomaticHypermutation(.threads = 1, .nofail = TRUE)
data(bcrdata) bcr_data <- bcrdata$data bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) %>% repSomaticHypermutation(.threads = 1, .nofail = TRUE)
A dataset with paired chain IG data for testing and examplatory purposes.
scdata
scdata
A list of four elements: "data" is a list with data frames with clonotype tables. "meta" is a metadata table. "bc_patients" is a list of barcodes corresponding to specific patients. "bc_clusters" is a list of barcodes corresponding to specific cell clusters.
List of immune repertoire data frames.
Metadata
...
Subsets the input immune repertoire by barcodes. Creates a vector of barcodes to subset or a vector cluster IDs and corresponding barcodes to get a list of immune repertoires corresponding to cluster IDs. Columns with clonotype counts and proportions are changed accordingly to the filtered barcodes.
select_barcodes(.data, .barcodes, .force.list = FALSE)
select_barcodes(.data, .barcodes, .force.list = FALSE)
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.barcodes |
Either a character vector with barcodes or a named character/factor vector with
barcodes as names and cluster IDs a vector elements. The output of Seurat's |
.force.list |
Logical. If TRUE then always returns a list, even if the result is one data frame. |
An immune repertoire (if ".barcodes" is a barcode vector) or a list of immune repertoires (if ".barcodes" is named vector or an output from Seurat::Idents()). Each element is an immune repertoire with clonotype barcodes corresponding to the input barcodes. The output list names are cluster names in the ".barcode" argument (Seurat::Idents() case only).
## Not run: data(immdata) # Create a fake single-cell data df <- immdata$data[[1]] df$Barcode <- "AAAAACCCCC" df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" barcodes <- "AAAAACCCCC" df <- select_barcodes(df, barcodes) nrow(df) ## End(Not run)
## Not run: data(immdata) # Create a fake single-cell data df <- immdata$data[[1]] df$Barcode <- "AAAAACCCCC" df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" barcodes <- "AAAAACCCCC" df <- select_barcodes(df, barcodes) nrow(df) ## End(Not run)
Given the vector of barcodes from Seurat, splits the input repertoires to separate subsets following the barcodes' assigned IDs. Useful in case you want to split immune repertoires by patients or clusters.
select_clusters(.data, .clusters, .field = "Cluster")
select_clusters(.data, .clusters, .field = "Cluster")
.data |
List of two elements "data" and "meta", with "data" being a list of immune repertoires, and "meta" being a metadata table. |
.clusters |
Factor vector with barcodes as vector names and cluster IDs as vector elements.
The output of the Seurat |
.field |
A string specifying the name of the field in the input metadata. New immune repertoire subsets will have cluster IDs in this field. |
A list with two elements "data" and "meta" with updated immune repertoire tables and metadata.
## Not run: library(Seurat) Idents(pbmc_small) new_cluster_ids <- c("A", "B", "C") new_cluster_ids <- levels(pbmc_small) new_cluster_ids pbmc_small <- RenameIdents(pbmc_small, new_cluster_ids) ## End(Not run)
## Not run: library(Seurat) Idents(pbmc_small) new_cluster_ids <- c("A", "B", "C") new_cluster_ids <- levels(pbmc_small) new_cluster_ids pbmc_small <- RenameIdents(pbmc_small, new_cluster_ids) ## End(Not run)
Graph clustering based on distances between sequences
seqCluster(.data, .dist, .perc_similarity, .nt_similarity, .fixed_threshold)
seqCluster(.data, .dist, .perc_similarity, .nt_similarity, .fixed_threshold)
.data |
The data which was used to caluculate .dist object. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format immunarch_data_format |
.dist |
List of distance objects produced with seqDist function. |
.perc_similarity |
Numeric value between 0 and 1 specifying the maximum acceptable weight of an edge in a graph. This threshold depends on the length of sequences. |
.nt_similarity |
Numeric between 0-sequence length specifying the threshold of allowing a 1 in n nucleotides mismatch in sequencies. |
.fixed_threshold |
Numeric specifying the threshold on the maximum weight of an edge in a graph. |
Immdata data format object. Same as .data, but with extra 'Cluster' column with clusters assigned.
data(immdata) # In this example, we will use only 2 samples with 500 clonotypes in each for time saving input_data <- lapply(immdata$data[1:2], head, 500) dist_result <- seqDist(input_data) cluster_result <- seqCluster(input_data, dist_result, .fixed_threshold = 1)
data(immdata) # In this example, we will use only 2 samples with 500 clonotypes in each for time saving input_data <- lapply(immdata$data[1:2], head, 500) dist_result <- seqDist(input_data) cluster_result <- seqCluster(input_data, dist_result, .fixed_threshold = 1)
Computing sequential distances between clonotypes from two repertoires:
seqDist(.data, .col = 'CDR3.nt', .method = 'hamming', .group_by = c("V.name", "J.name"), .group_by_seqLength = TRUE, .trim_genes = TRUE, ...)
seqDist(.data, .col = 'CDR3.nt', .method = 'hamming', .group_by = c("V.name", "J.name"), .group_by_seqLength = TRUE, .trim_genes = TRUE, ...)
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format immunarch_data_format |
.col |
A string that specifies the column name to be processed. The default value is 'CDR3.nt'. |
.method |
Character value or user-defined function. |
.group_by |
Character vector of column names to group sequence by. The default value is c("V.first", "J.first"). Columns "V.first" and "J.first" containing first genes without allele suffixes are calculated automatically from "V.name" and "J.name" if absent in the data. Pass NA for no grouping options. |
.group_by_seqLength |
If TRUE - adds grouping by sequence length of .col argument |
.trim_genes |
If TRUE - use only general gene values (e.g. "IGHV1-18") of .group_by columns for clustering; if FALSE - can cause very small clusters in case of high resolution genotyping |
... |
Extra arguments for user-defined function. The default value is Other possible values are:
In case of user-defined function, it should take x and y parameters as input and return dist object. |
Named list of list with dist objects for given repertoires for each combination of .group_by variable(s) and/or sequence length of .col.
data(immdata) # Reducing data to save time on examples immdata$data <- purrr::map(immdata$data, ~ .x %>% head(10)) # Computing hamming distance for the first two repertoires in \code{'immdata'} seqDist(immdata$data[1:2]) # Here we define a custom distance function # that will count the difference in number of characters in sequences. f <- function(x, y) { res <- matrix(nrow = length(x), ncol = length(y)) for (i in 1:length(x)) { res[i, ] <- abs(nchar(x[i]) - nchar(y)) } dimnames(res) <- list(x, y) return(as.dist(res)) } seqDist(immdata$data[1:2], .method = f, .group_by_seqLength = FALSE)
data(immdata) # Reducing data to save time on examples immdata$data <- purrr::map(immdata$data, ~ .x %>% head(10)) # Computing hamming distance for the first two repertoires in \code{'immdata'} seqDist(immdata$data[1:2]) # Here we define a custom distance function # that will count the difference in number of characters in sequences. f <- function(x, y) { res <- matrix(nrow = length(x), ncol = length(y)) for (i in 1:length(x)) { res[i, ] <- abs(nchar(x[i]) - nchar(y)) } dimnames(res) <- list(x, y) return(as.dist(res)) } seqDist(immdata$data[1:2], .method = f, .group_by_seqLength = FALSE)
Set and update progress bars
set_pb(.max) add_pb(.pb, .value = 1)
set_pb(.max) add_pb(.pb, .value = 1)
.max |
Integer. Maximal value of the progress bar. |
.pb |
Progress bar object from |
.value |
Numeric. Value to add to the progress bar at each step. |
An updated progress bar.
pb <- immunarch:::set_pb(100) immunarch:::add_pb(pb, 25) immunarch:::add_pb(pb, 25) immunarch:::add_pb(pb, 25) immunarch:::add_pb(pb, 25) close(pb)
Immune repertoire spectratyping
spectratype(.data, .quant = c("id", "count"), .col = "nt")
spectratype(.data, .quant = c("id", "count"), .col = "nt")
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.quant |
Select the column with clonal counts to evaluate. Set to "id" to count every clonotype once. Set to "count" to take into the account number of clones per clonotype. |
.col |
A string that specifies the column(s) to be processed. The output is one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" for spectratyping on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up. |
Data frame with distributions of clonotypes per CDR3 length.
# Load the data data(immdata) sp <- spectratype(immdata$data[[1]], .col = "aa+v") vis(sp)
# Load the data data(immdata) sp <- spectratype(immdata$data[[1]], .col = "aa+v") vis(sp)
Analysis immune repertoire kmer statistics: sequence profiles, etc.
split_to_kmers(.data, .k) kmer_profile(.data, .method = c("freq", "prob", "wei", "self"), .remove.stop = TRUE)
split_to_kmers(.data, .k) kmer_profile(.data, .method = c("freq", "prob", "wei", "self"), .remove.stop = TRUE)
.data |
Character vector or the output from |
.k |
Integer. Size of k-mers. |
.method |
Character vector of length one. If "freq" then returns a position frequency matrix (PFM) - a matrix with occurences of each amino acid in each position. If "prob" then returns a position probability matrix (PPM) - a matrix with probabilities of occurences of each amino acid in each position. This is a traditional representation of sequence motifs. If "wei" then returns a position weight matrix (PWM) - a matrix with log likelihoods of PPM elements. If "self" then returns a matrix with self-information of elements in PWM. For more information see https://en.wikipedia.org/wiki/Position_weight_matrix. |
.remove.stop |
Logical. If TRUE (by default) remove stop codons. |
split_to_kmers
- Data frame with two columns (k-mers and their counts).
kmer_profile
- a matrix with per-position amino acid statistics.
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmer_profile(kmers) %>% vis()
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmer_profile(kmers) %>% vis()
Return a column's name
switch_type(type) process_col_argument(.col)
switch_type(type) process_col_argument(.col)
type |
Character. Specifies the column to choose: "nt" chooses the CDR3 nucleotide column, "aa" chooses the CDR3 amino acid column, "v" chooses the V gene segment column, "j" chooses the J gene segment column. |
.col |
A string that specifies the column(s) to be processed. Select one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. |
A column's name.
immunarch:::switch_type("nuc") immunarch:::switch_type("v")
Get the N most abundant clonotypes
top(.data, .n = 10)
top(.data, .n = 10)
.data |
The data to be processed. Can be data.frame, data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.n |
Numeric. Number of the most abundant clonotypes to return. |
Data frame with the .n
most abundant clonotypes only.
data(immdata) top(immdata$data) top(immdata$data[[1]])
data(immdata) top(immdata$data) top(immdata$data[[1]])
Tracks the temporal dynamics of clonotypes in repertoires. For example, tracking across multiple time points after vaccination.
Note: duplicated clonotypes are merged and their counts are summed up.
trackClonotypes(.data, .which = list(1, 15), .col = "aa", .norm = TRUE)
trackClonotypes(.data, .which = list(1, 15), .col = "aa", .norm = TRUE)
.data |
The data to process. It can be a data.frame, a data.table, or a list of these objects. Every object must have columns in the immunarch compatible format. immunarch_data_format Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrame from copy_to or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.which |
An argument that regulates which clonotypes to choose for tracking. There are three options for this argument: 1) passes a list with two elements 2) passes a character vector of sequences to take from all data frames; 3) passes a data frame (data table, database) with one or more columns - first for sequences, and other for gene segments (if applicable). See the "Examples" below with examples for each option. |
.col |
A character vector of length 1. Specifies an identifier for a column, from which the function chooses clonotype sequences. Specify "nt" for nucleotide sequences, "aa" for amino acid sequences, "aa+v" for amino acid sequences and Variable genes, "nt+j" for nucleotide sequences with Joining genes, or any combination of the above. Used only if ".which" has option 1) or option 2). |
.norm |
Logical. If TRUE then uses Proportion instead of the number of Clones per clonotype to store in the function output. |
Data frame with input sequences and counts or proportions for each of the input repertoire.
# Load an example data that comes with immunarch data(immdata) # Make the data smaller in order to speed up the examples immdata$data <- immdata$data[c(1, 2, 3, 7, 8, 9)] immdata$meta <- immdata$meta[c(1, 2, 3, 7, 8, 9), ] # Option 1 # Choose the first 10 amino acid clonotype sequences # from the first repertoire to track tc <- trackClonotypes(immdata$data, list(1, 10), .col = "aa") # Choose the first 20 nucleotide clonotype sequences # and their V genes from the "MS1" repertoire to track tc <- trackClonotypes(immdata$data, list("MS1", 20), .col = "nt+v") # Option 2 # Choose clonotypes with amino acid sequences "CASRGLITDTQYF" or "CSASRGSPNEQYF" tc <- trackClonotypes(immdata$data, c("CASRGLITDTQYF", "CSASRGSPNEQYF"), .col = "aa") # Option 3 # Choose the first 10 clonotypes from the first repertoire # with amino acid sequences and V segments target <- immdata$data[[1]] %>% select(CDR3.aa, V.name) %>% head(10) tc <- trackClonotypes(immdata$data, target) # Visualise the output regardless of the chosen option # Therea are three way to visualise it, regulated by the .plot argument vis(tc, .plot = "smooth") vis(tc, .plot = "area") vis(tc, .plot = "line") # Visualising timepoints # First, we create an additional column in the metadata with randomly choosen timepoints: immdata$meta$Timepoint <- sample(1:length(immdata$data)) immdata$meta # Next, we create a vector with samples in the right order, # according to the "Timepoint" column (from smallest to greatest): sample_order <- order(immdata$meta$Timepoint) # Sanity check: timepoints are following the right order: immdata$meta$Timepoint[sample_order] # Samples, sorted by the timepoints: immdata$meta$Sample[sample_order] # And finally, we visualise the data: vis(tc, .order = sample_order)
# Load an example data that comes with immunarch data(immdata) # Make the data smaller in order to speed up the examples immdata$data <- immdata$data[c(1, 2, 3, 7, 8, 9)] immdata$meta <- immdata$meta[c(1, 2, 3, 7, 8, 9), ] # Option 1 # Choose the first 10 amino acid clonotype sequences # from the first repertoire to track tc <- trackClonotypes(immdata$data, list(1, 10), .col = "aa") # Choose the first 20 nucleotide clonotype sequences # and their V genes from the "MS1" repertoire to track tc <- trackClonotypes(immdata$data, list("MS1", 20), .col = "nt+v") # Option 2 # Choose clonotypes with amino acid sequences "CASRGLITDTQYF" or "CSASRGSPNEQYF" tc <- trackClonotypes(immdata$data, c("CASRGLITDTQYF", "CSASRGSPNEQYF"), .col = "aa") # Option 3 # Choose the first 10 clonotypes from the first repertoire # with amino acid sequences and V segments target <- immdata$data[[1]] %>% select(CDR3.aa, V.name) %>% head(10) tc <- trackClonotypes(immdata$data, target) # Visualise the output regardless of the chosen option # Therea are three way to visualise it, regulated by the .plot argument vis(tc, .plot = "smooth") vis(tc, .plot = "area") vis(tc, .plot = "line") # Visualising timepoints # First, we create an additional column in the metadata with randomly choosen timepoints: immdata$meta$Timepoint <- sample(1:length(immdata$data)) immdata$meta # Next, we create a vector with samples in the right order, # according to the "Timepoint" column (from smallest to greatest): sample_order <- order(immdata$meta$Timepoint) # Sanity check: timepoints are following the right order: immdata$meta$Timepoint[sample_order] # Samples, sorted by the timepoints: immdata$meta$Sample[sample_order] # And finally, we visualise the data: vis(tc, .order = sample_order)
Output from every function in immunarch can be visualised with a
single function - vis
. The vis
automatically detects
the type of the data and draws a proper visualisation. For example, output
from the repOverlap
function will be identified as repertoire overlap values
and respective visualisation will be chosen without any additional arguments.
See "Details" for the list of available visualisations.
vis(.data, ...)
vis(.data, ...)
.data |
Pass the output from any immunarch analysis tool to |
... |
Any other arguments, see the "Details" section for specific visualisation functions. |
List of available visualisations for different kinds of data.
Basic analysis:
- Exploratory analysis results (from repExplore) - see vis.immunr_exp_vol;
- Clonality statistics (from repClonality) - see vis.immunr_homeo.
Overlaps and public clonotypes:
- Overlaps (from repOverlap) using heatmaps, circos plots, polar area plots - see vis.immunr_ov_matrix;
- Overlap clustering (from repOverlapAnalysis) - see vis.immunr_hclust;
- Repertoire incremental overlaps (from repOverlap) - see vis.immunr_inc_overlap;
- Public repertoire abundance (from pubRep) - vis vis.immunr_public_repertoire.
Gene usage:
- Gene usage statistics (from geneUsage) using bar plots, box plots - see vis.immunr_gene_usage;
- Gene usage distances (from geneUsageAnalysis) using heatmaps, circos plots, polar area plots - see vis.immunr_ov_matrix;
- Gene usage clustering (from geneUsageAnalysis) - see vis.immunr_hclust.
Diversity estimation:
- Diversity estimations (from repDiversity) - see vis.immunr_chao1.
BCR analysis:
- Clonal tree (from repClonalFamily) - see vis.clonal_family and vis.clonal_family_tree.
Advanced analysis:
- Repertoire dynamics (from trackClonotypes) - see vis.immunr_dynamics;
- Sequence logo plots of amino acid distributions (from kmer_profile) - see vis_seqlogo;
- Kmers distributions (from getKmers) - see vis.immunr_kmer_table;
- Mutation networks (from mutationNetwork) - Work In Progress on vis.immunr_mutation_network;
- CDR3 amino acid properties, e.g., biophysical (from cdrProp) - Work In Progress on vis.immunr_cdr_prop.
Additionaly, we provide a wrapper functions for visualisations of common data types:
- Any data frames or matrices using heatmaps - see vis_heatmap and vis_heatmap2;
- Any data frames or matrices using circos plots - see vis_circos.
A ggplot2, pheatmap or circlize object.
fixVis for precise manipulation of plots.
# Load the test data data(immdata) # Compute and visualise: ov <- repOverlap(immdata$data) vis(ov) gu <- geneUsage(immdata$data) vis(gu) dv <- repDiversity(immdata$data) vis(dv)
# Load the test data data(immdata) # Compute and visualise: ov <- repOverlap(immdata$data) vis(ov) gu <- geneUsage(immdata$data) vis(gu) dv <- repDiversity(immdata$data) vis(dv)
Bar plots
vis_bar( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .stack = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .errorbar.width = 0.2, .defgroupby = "Sample", .grouping.var = "Group", .labs = c("X", "Y"), .title = "Barplot (.title argument)", .subtitle = "Subtitle (.subtitle argument)", .legend = NA, .leg.title = "Legend (.leg.title argument)", .legend.pos = "right", .rotate_x = 90 )
vis_bar( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .stack = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .errorbar.width = 0.2, .defgroupby = "Sample", .grouping.var = "Group", .labs = c("X", "Y"), .title = "Barplot (.title argument)", .subtitle = "Subtitle (.subtitle argument)", .legend = NA, .leg.title = "Legend (.leg.title argument)", .legend.pos = "right", .rotate_x = 90 )
.data |
Data to visualise. |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.errorbars |
A numeric vector of length two with quantiles for error bars on sectors. Disabled if ".errorbars.off" is TRUE. |
.errorbars.off |
If TRUE then plot CI bars for distances between each group. Disabled if no group passed to the ".by" argument. |
.stack |
If TRUE and .errorbars.off is TRUE then plot stacked bar plots for each Group or Sample |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.signif.label.size |
An integer value defining the size of text for p-value. |
.errorbar.width |
Numeric. Width for error bars. |
.defgroupby |
A name for the column with sample names. |
.grouping.var |
A name for the column to group by. |
.labs |
A character vector of length two specifying names for x-axis and y-axis. |
.title |
The text for the plot's title. |
.subtitle |
The text for the plot's subtitle. |
.legend |
If TRUE then displays a legend, otherwise removes legend from the plot. |
.leg.title |
The text for the plots's legend. Provide NULL to remove the legend's title completely. |
.legend.pos |
Positions of the legend: either "top", "bottom", "left" or "right". |
.rotate_x |
How much the x tick text should be rotated? In angles. |
A ggplot2 object.
vis_bar(data.frame(Sample = c("A", "B", "C"), Value = c(1, 2, 3)))
vis_bar(data.frame(Sample = c("A", "B", "C"), Value = c(1, 2, 3)))
Visualisation of distributions using ggplot2-based boxplots.
vis_box( .data, .by = NA, .meta = NA, .melt = TRUE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .defgroupby = "Sample", .grouping.var = "Group", .labs = c("X", "Y"), .title = "Boxplot (.title argument)", .subtitle = "Subtitle (.subtitle argument)", .legend = NA, .leg.title = "Legend (.leg.title argument)", .legend.pos = "right" )
vis_box( .data, .by = NA, .meta = NA, .melt = TRUE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .defgroupby = "Sample", .grouping.var = "Group", .labs = c("X", "Y"), .title = "Boxplot (.title argument)", .subtitle = "Subtitle (.subtitle argument)", .legend = NA, .leg.title = "Legend (.leg.title argument)", .legend.pos = "right" )
.data |
Input matrix or data frame. |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.melt |
If TRUE then apply melt to the ".data" before plotting. In this case ".data" is supposed to be a data frame with the first character column reserved for names of genes and other numeric columns reserved to counts or frequencies of genes. Each numeric column should be associated with a specific repertoire sample. |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.signif.label.size |
An integer value defining the size of text for p-value. |
.defgroupby |
A name for the column with sample names. |
.grouping.var |
A name for the column to group by. |
.labs |
Character vector of length two with names for x-axis and y-axis, respectively. |
.title |
The text for the title of the plot. |
.subtitle |
The The text for the plot's subtitle. |
.legend |
If TRUE then displays a legend, otherwise removes legend from the plot. |
.leg.title |
The The text for the plots's legend. Provide NULL to remove the legend's title completely. |
.legend.pos |
Positions of the legend: either "top", "bottom", "left" or "right". |
A ggplot2 object.
vis.immunr_gene_usage, geneUsage
vis_box(data.frame(Sample = sample(c("A", "B", "C"), 100, TRUE), Value = rnorm(100)), .melt = FALSE)
vis_box(data.frame(Sample = sample(c("A", "B", "C"), 100, TRUE), Value = rnorm(100)), .melt = FALSE)
Visualise matrices with the chordDiagram function from the circlize package.
vis_circos(.data, .title = NULL, ...)
vis_circos(.data, .title = NULL, ...)
.data |
Input matrix. |
.title |
The The text for the title of the plot. |
... |
Other arguments passed to chordDiagram from the 'circlize' package. |
A circlize object.
data(immdata) ov <- repOverlap(immdata$data) vis(ov, .plot = "circos")
data(immdata) ov <- repOverlap(immdata$data) vis(ov, .plot = "circos")
Fast and easy visualisations of matrices or data frames with functions based on the ggplot2 package.
vis_heatmap( .data, .text = TRUE, .scientific = FALSE, .signif.digits = 2, .text.size = 4, .axis.text.size = NULL, .labs = c("Sample", "Sample"), .title = "Overlap", .leg.title = "Overlap values", .legend = TRUE, .na.value = NA, .transpose = FALSE, ... )
vis_heatmap( .data, .text = TRUE, .scientific = FALSE, .signif.digits = 2, .text.size = 4, .axis.text.size = NULL, .labs = c("Sample", "Sample"), .title = "Overlap", .leg.title = "Overlap values", .legend = TRUE, .na.value = NA, .transpose = FALSE, ... )
.data |
Input object: a matrix or a data frame. If matrix: column names and row names (if presented) will be used as names for labs. If data frame: the first column will be used for row names and removed from the data. Other columns will be used for values in the heatmap. |
.text |
If TRUE then plots values in the heatmap cells. If FALSE does not plot values, just plot coloured cells instead. |
.scientific |
If TRUE then uses the scientific notation for numbers (e.g., "2.0e+2"). |
.signif.digits |
Number of significant digits to display on plot. |
.text.size |
Size of text in the cells of heatmap. |
.axis.text.size |
Size of text on the axis labels. |
.labs |
A character vector of length two with names for x-axis and y-axis, respectively. |
.title |
The The text for the plot's title. |
.leg.title |
The The text for the plots's legend. Provide NULL to remove the legend's title completely. |
.legend |
If TRUE then displays a legend, otherwise removes the legend from the plot. |
.na.value |
Replace NA values with this value. By default they remain NA. |
.transpose |
Logical. If TRUE then switch rows and columns. |
... |
Other passed arguments. |
A ggplot2 object.
data(immdata) ov <- repOverlap(immdata$data) vis_heatmap(ov) gu <- geneUsage(immdata$data, "hs.trbj") vis_heatmap(gu)
data(immdata) ov <- repOverlap(immdata$data) vis_heatmap(ov) gu <- geneUsage(immdata$data, "hs.trbj") vis_heatmap(gu)
Visualise matrices with the functions based on the pheatmap package with minimum amount of arguments.
vis_heatmap2( .data, .meta = NA, .by = NA, .title = NA, .color = colorRampPalette(c("#67001f", "#d6604d", "#f7f7f7", "#4393c3", "#053061"))(1024), ... )
vis_heatmap2( .data, .meta = NA, .by = NA, .title = NA, .color = colorRampPalette(c("#67001f", "#d6604d", "#f7f7f7", "#4393c3", "#053061"))(1024), ... )
.data |
Input matrix. Column names and row names (if presented) will be used as names for labs. |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.by |
Set NA if you want to plot samples without grouping. |
.title |
The text for the plot's title (same as the "main" argument in pheatmap). |
.color |
A vector specifying the colors (same as the "color" argument in pheatmap). Pass NA to use the default pheatmap colors. |
... |
Other arguments for the pheatmap function. |
A pheatmap object.
data(immdata) ov <- repOverlap(immdata$data) vis_heatmap2(ov)
data(immdata) ov <- repOverlap(immdata$data) vis_heatmap2(ov)
Visualisation of distributions using ggplot2-based histograms.
vis_hist( .data, .by = NA, .meta = NA, .title = "Gene usage", .ncol = NA, .points = TRUE, .test = TRUE, .coord.flip = FALSE, .grid = FALSE, .labs = c("Gene", NA), .melt = TRUE, .legend = NA, .add.layer = NULL, ... )
vis_hist( .data, .by = NA, .meta = NA, .title = "Gene usage", .ncol = NA, .points = TRUE, .test = TRUE, .coord.flip = FALSE, .grid = FALSE, .labs = c("Gene", NA), .melt = TRUE, .legend = NA, .add.layer = NULL, ... )
.data |
Input matrix or data frame. |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.title |
The text for the title of the plot. |
.ncol |
A number of columns to display. Provide NA (by default) if you want the function to automatically detect the optimal number of columns. |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.coord.flip |
If TRUE then swap x- and y-axes. |
.grid |
If TRUE then plot separate visualisations for each sample. |
.labs |
A character vector of length two with names for x-axis and y-axis, respectively. |
.melt |
If TRUE then apply melt to the ".data" before plotting. In this case ".data" is supposed to be a data frame with the first character column reserved for names of genes and other numeric columns reserved to counts or frequencies of genes. Each numeric column should be associated with a specific repertoire sample. |
.legend |
If TRUE then plots the legend. If FALSE removes the legend from the plot. If NA automatically detects the best way to display legend. |
.add.layer |
Addditional ggplot2 layers, that added to each plot in the output plot or grid of plots. |
... |
Is not used here. |
If data is grouped, then statistical tests for comparing means of groups will be performed, unless .test = FALSE
is supplied.
In case there are only two groups, the Wilcoxon rank sum test (https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) is performed
(R function wilcox.test
with an argument exact = FALSE
) for testing if there is a difference in mean rank values between two groups.
In case there more than two groups, the Kruskal-Wallis test (https://en.wikipedia.org/wiki/Kruskal
A significant Kruskal-Wallis test indicates that at least one sample stochastically dominates one other sample.
Adjusted for multiple comparisons P-values are plotted on the top of groups.
P-value adjusting is done using the Holm method (https://en.wikipedia.org/wiki/Holm
You can execute the command ?p.adjust
in the R console to see more.
A ggplot2 object.
vis.immunr_gene_usage, geneUsage
data(immdata) imm_gu <- geneUsage(immdata$data[[1]]) vis(imm_gu, .plot = "hist", .add.layer = theme(axis.text.x = element_text(angle = 75, vjust = 1)) ) imm_gu <- geneUsage(immdata$data[1:4]) vis(imm_gu, .plot = "hist", .grid = TRUE, .add.layer = theme(axis.text.x = element_text(angle = 75, vjust = 1)) )
data(immdata) imm_gu <- geneUsage(immdata$data[[1]]) vis(imm_gu, .plot = "hist", .add.layer = theme(axis.text.x = element_text(angle = 75, vjust = 1)) ) imm_gu <- geneUsage(immdata$data[1:4]) vis(imm_gu, .plot = "hist", .grid = TRUE, .add.layer = theme(axis.text.x = element_text(angle = 75, vjust = 1)) )
Visualise kmer profiles
vis_immunr_kmer_profile_main(.data, .plot, ...)
vis_immunr_kmer_profile_main(.data, .plot, ...)
.data |
Kmer data, an output from kmer_profile. |
.plot |
String specifying the plot type: - "seqlogo" for traditional sequence logo plots using vis_seqlogo; - "textlogo" for modified approach to sequence logo plots via text labels using vis_textlogo; |
... |
Other arguments passed to vis_textlogo or vis_seqlogo, depending on the ".plot" argument. |
A ggplot2 object.
data(immdata) getKmers(immdata$data[[1]], 5) %>% kmer_profile() %>% vis("seqlogo")
data(immdata) getKmers(immdata$data[[1]], 5) %>% kmer_profile() %>% vis("seqlogo")
Visualise correlation of public clonotype frequencies in pairs of repertoires.
vis_public_clonotypes( .data, .x.rep = NA, .y.rep = NA, .title = NA, .ncol = 3, .point.size.modif = 1, .cut.axes = TRUE, .density = TRUE, .lm = TRUE, .radj.size = 3.5 )
vis_public_clonotypes( .data, .x.rep = NA, .y.rep = NA, .title = NA, .ncol = 3, .point.size.modif = 1, .cut.axes = TRUE, .density = TRUE, .lm = TRUE, .radj.size = 3.5 )
.data |
Public repertoire data - an output from the pubRep function. |
.x.rep |
Either indices of samples or character vector of sample names for the x-axis. Must be of the same length as ".y.rep". |
.y.rep |
Either indices of samples or character vector of sample names for the y-axis. Must be of the same length as ".x.rep". |
.title |
The text for the title of the plot. |
.ncol |
An integer number of columns to print in the grid of pairs of repertoires. |
.point.size.modif |
An integer value that is a modifier of the point size. The larger the number, the larger the points. |
.cut.axes |
If TRUE then axes limits become shorter. |
.density |
If TRUE then displays density plot for distributions of clonotypes for each sample. If FALSE then removes density plot from the visualisation. |
.lm |
If TRUE then fit a linear model and displays an R adjusted coefficient that shows how similar samples are in terms of shared clonotypes. |
.radj.size |
An integer value, that defines the size of the The text for the R adjusted coefficient. |
A ggplot2 object.
pubRep, vis.immunr_public_repertoire
data(immdata) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "clonotypes", 1, 2)
data(immdata) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "clonotypes", 1, 2)
Visualise public clonotype frequencies.
vis_public_frequencies( .data, .by = NA, .meta = NA, .type = c("boxplot", "none", "mean") )
vis_public_frequencies( .data, .by = NA, .meta = NA, .type = c("boxplot", "none", "mean") )
.data |
Public repertoire - an output from the pubRep function. |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.type |
Character. Either "boxplot" for plotting distributions of frequencies, "none" for plotting everything, or "mean" for plotting average values only. |
A ggplot2 object.
data(immdata) immdata$data <- lapply(immdata$data, head, 500) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "freq", .type = "boxplot") vis(pr, "freq", .type = "none") vis(pr, "freq", .type = "mean") vis(pr, "freq", .by = "Status", .meta = immdata$meta)
data(immdata) immdata$data <- lapply(immdata$data, head, 500) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "freq", .type = "boxplot") vis(pr, "freq", .type = "none") vis(pr, "freq", .type = "mean") vis(pr, "freq", .by = "Status", .meta = immdata$meta)
Plot sequence logo plots for visualising of amino acid motif sequences / profiles.
'vis_textlogo' plots sequences in a text format - each letter has the same height. Useful when there are no big differences between occurences of amino acids in the motif.
'vis_seqlogo' is a traditional sequence logo plots. Useful when there are one or two amino acids with clear differences in their occurrences.
vis_textlogo(.data, .replace.zero.with.na = TRUE, .width = 0.1, ...) vis_seqlogo(.data, .scheme = "chemistry", ...)
vis_textlogo(.data, .replace.zero.with.na = TRUE, .width = 0.1, ...) vis_seqlogo(.data, .scheme = "chemistry", ...)
.data |
Output from the |
.replace.zero.with.na |
if TRUE then replace all zeros with NAs, therefore letters with zero frequency wont appear at the plot. |
.width |
Width for jitter, i.e., how much points will scatter around the verical line. Pass 0 (zero) to plot points on the straight vertical line for each position. |
... |
Not used here. |
.scheme |
Character. An argumentt passed to geom_logo specifying how to colour symbols. |
A ggplot2 object.
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) ppm <- kmer_profile(kmers, "prob") vis(ppm, .plot = "text") vis(ppm, .plot = "seq") d <- kmer_profile(c("CASLL", "CASSQ", "CASGL")) vis_textlogo(d) vis_seqlogo(d)
data(immdata) kmers <- getKmers(immdata$data[[1]], 5) ppm <- kmer_profile(kmers, "prob") vis(ppm, .plot = "text") vis(ppm, .plot = "seq") d <- kmer_profile(c("CASLL", "CASSQ", "CASGL")) vis_textlogo(d) vis_seqlogo(d)
Visualise clonal family tree: wrapper for calling on the entire repClonalFamily output
## S3 method for class 'clonal_family' vis(.data, ...)
## S3 method for class 'clonal_family' vis(.data, ...)
.data |
Clonal families from 1 or multiple samples: |
... |
Not used here. |
A ggraph object.
data(bcrdata) bcr_data <- bcrdata$data clonal_family <- bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) %>% vis()
data(bcrdata) bcr_data <- bcrdata$data clonal_family <- bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) %>% vis()
Visualise clonal family tree
## S3 method for class 'clonal_family_tree' vis(.data, ...)
## S3 method for class 'clonal_family_tree' vis(.data, ...)
.data |
Single clonal family tree data from 1 cluster: 1 element from TreeStats column from |
... |
Not used here. |
A ggraph object.
data(bcrdata) bcr_data <- bcrdata$data clonal_family <- bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) # This condition can be omitted; it prevents the example from crashing # when ClustalW or PHYLIP are not installed if (!("step_failure_ignored" %in% class(clonal_family))) { vis(clonal_family[["full_clones"]][["TreeStats"]][[2]]) }
data(bcrdata) bcr_data <- bcrdata$data clonal_family <- bcr_data %>% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% repGermline(.threads = 1) %>% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% repClonalFamily(.threads = 1, .nofail = TRUE) # This condition can be omitted; it prevents the example from crashing # when ClustalW or PHYLIP are not installed if (!("step_failure_ignored" %in% class(clonal_family))) { vis(clonal_family[["full_clones"]][["TreeStats"]][[2]]) }
An utility function to visualise the output from repDiversity
.
## S3 method for class 'immunr_chao1' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
## S3 method for class 'immunr_chao1' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
.data |
Output from |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.errorbars |
A numeric vector of length two with quantiles for error bars on sectors. Disabled if ".errorbars.off" is TRUE. |
.errorbars.off |
If TRUE then plot CI bars for distances between each group. Disabled if no group passed to the ".by" argument. |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.signif.label.size |
An integer value defining the size of text for p-value. |
... |
Not used here. |
If data is grouped, then statistical tests for comparing means of groups will be performed, unless .test = FALSE
is supplied.
In case there are only two groups, the Wilcoxon rank sum test (https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) is performed
(R function wilcox.test
with an argument exact = FALSE
) for testing if there is a difference in mean rank values between two groups.
In case there more than two groups, the Kruskal-Wallis test (https://en.wikipedia.org/wiki/Kruskal
A significant Kruskal-Wallis test indicates that at least one sample stochastically dominates one other sample.
Adjusted for multiple comparisons P-values are plotted on the top of groups.
P-value adjusting is done using the Holm method (https://en.wikipedia.org/wiki/Holm
You can execute the command ?p.adjust
in the R console to see more.
A ggplot2 object.
data(immdata) dv <- repDiversity(immdata$data, "chao1") vis(dv)
data(immdata) dv <- repDiversity(immdata$data, "chao1") vis(dv)
An utility function to visualise the output from repClonality
.
## S3 method for class 'immunr_clonal_prop' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
## S3 method for class 'immunr_clonal_prop' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
.data |
Output from |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.errorbars |
A numeric vector of length two with quantiles for error bars on sectors. Disabled if ".errorbars.off" is TRUE. |
.errorbars.off |
If TRUE then plot CI bars for distances between each group. Disabled if no group passed to the ".by" argument. |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.signif.label.size |
An integer value defining the size of text for p-value. |
... |
Not used here. |
If data is grouped, then statistical tests for comparing means of groups will be performed, unless .test = FALSE
is supplied.
In case there are only two groups, the Wilcoxon rank sum test (https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) is performed
(R function wilcox.test
with an argument exact = FALSE
) for testing if there is a difference in mean rank values between two groups.
In case there more than two groups, the Kruskal-Wallis test (https://en.wikipedia.org/wiki/Kruskal
A significant Kruskal-Wallis test indicates that at least one sample stochastically dominates one other sample.
Adjusted for multiple comparisons P-values are plotted on the top of groups.
P-value adjusting is done using the Holm method (https://en.wikipedia.org/wiki/Holm
You can execute the command ?p.adjust
in the R console to see more.
A ggplot2 object.
data(immdata) clp <- repClonality(immdata$data, "clonal.prop") vis(clp) hom <- repClonality(immdata$data, "homeo") # Remove p values and points from the plot vis(hom, .by = "Status", .meta = immdata$meta, .test = FALSE, .points = FALSE)
data(immdata) clp <- repClonality(immdata$data, "clonal.prop") vis(clp) hom <- repClonality(immdata$data, "homeo") # Remove p values and points from the plot vis(hom, .by = "Status", .meta = immdata$meta, .test = FALSE, .points = FALSE)
Visualise clonotype dynamics
## S3 method for class 'immunr_dynamics' vis(.data, .plot = c("smooth", "area", "line"), .order = NA, .log = FALSE, ...)
## S3 method for class 'immunr_dynamics' vis(.data, .plot = c("smooth", "area", "line"), .order = NA, .log = FALSE, ...)
.data |
Output from the trackClonotypes function. |
.plot |
Character. Either "smooth", "area" or "line". Each specifies a type of plot for visualisation of clonotype dynamics. |
.order |
Numeric or character vector. Specifies the order to samples, e.g., it used for ordering samples by timepoints. Either See "Examples" below for more details. |
.log |
Logical. If TRUE then use log-scale for the frequency axis. |
... |
Not used here. |
A ggplot2 object.
# Load an example data that comes with immunarch data(immdata) # Make the data smaller in order to speed up the examples immdata$data <- immdata$data[c(1, 2, 3, 7, 8, 9)] immdata$meta <- immdata$meta[c(1, 2, 3, 7, 8, 9), ] # Option 1 # Choose the first 10 amino acid clonotype sequences # from the first repertoire to track tc <- trackClonotypes(immdata$data, list(1, 10), .col = "aa") # Choose the first 20 nucleotide clonotype sequences # and their V genes from the "MS1" repertoire to track tc <- trackClonotypes(immdata$data, list("MS1", 20), .col = "nt+v") # Option 2 # Choose clonotypes with amino acid sequences "CASRGLITDTQYF" or "CSASRGSPNEQYF" tc <- trackClonotypes(immdata$data, c("CASRGLITDTQYF", "CSASRGSPNEQYF"), .col = "aa") # Option 3 # Choose the first 10 clonotypes from the first repertoire # with amino acid sequences and V segments target <- immdata$data[[1]] %>% select(CDR3.aa, V.name) %>% head(10) tc <- trackClonotypes(immdata$data, target) # Visualise the output regardless of the chosen option # Therea are three way to visualise it, regulated by the .plot argument vis(tc, .plot = "smooth") vis(tc, .plot = "area") vis(tc, .plot = "line") # Visualising timepoints # First, we create an additional column in the metadata with randomly choosen timepoints: immdata$meta$Timepoint <- sample(1:length(immdata$data)) immdata$meta # Next, we create a vector with samples in the right order, # according to the "Timepoint" column (from smallest to greatest): sample_order <- order(immdata$meta$Timepoint) # Sanity check: timepoints are following the right order: immdata$meta$Timepoint[sample_order] # Samples, sorted by the timepoints: immdata$meta$Sample[sample_order] # And finally, we visualise the data: vis(tc, .order = sample_order)
# Load an example data that comes with immunarch data(immdata) # Make the data smaller in order to speed up the examples immdata$data <- immdata$data[c(1, 2, 3, 7, 8, 9)] immdata$meta <- immdata$meta[c(1, 2, 3, 7, 8, 9), ] # Option 1 # Choose the first 10 amino acid clonotype sequences # from the first repertoire to track tc <- trackClonotypes(immdata$data, list(1, 10), .col = "aa") # Choose the first 20 nucleotide clonotype sequences # and their V genes from the "MS1" repertoire to track tc <- trackClonotypes(immdata$data, list("MS1", 20), .col = "nt+v") # Option 2 # Choose clonotypes with amino acid sequences "CASRGLITDTQYF" or "CSASRGSPNEQYF" tc <- trackClonotypes(immdata$data, c("CASRGLITDTQYF", "CSASRGSPNEQYF"), .col = "aa") # Option 3 # Choose the first 10 clonotypes from the first repertoire # with amino acid sequences and V segments target <- immdata$data[[1]] %>% select(CDR3.aa, V.name) %>% head(10) tc <- trackClonotypes(immdata$data, target) # Visualise the output regardless of the chosen option # Therea are three way to visualise it, regulated by the .plot argument vis(tc, .plot = "smooth") vis(tc, .plot = "area") vis(tc, .plot = "line") # Visualising timepoints # First, we create an additional column in the metadata with randomly choosen timepoints: immdata$meta$Timepoint <- sample(1:length(immdata$data)) immdata$meta # Next, we create a vector with samples in the right order, # according to the "Timepoint" column (from smallest to greatest): sample_order <- order(immdata$meta$Timepoint) # Sanity check: timepoints are following the right order: immdata$meta$Timepoint[sample_order] # Samples, sorted by the timepoints: immdata$meta$Sample[sample_order] # And finally, we visualise the data: vis(tc, .order = sample_order)
An utility function to visualise the output from repExplore
.
## S3 method for class 'immunr_exp_vol' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
## S3 method for class 'immunr_exp_vol' vis( .data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ... )
.data |
Output from |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.errorbars |
A numeric vector of length two with quantiles for error bars on sectors. Disabled if ".errorbars.off" is TRUE. |
.errorbars.off |
If TRUE then plot CI bars for distances between each group. Disabled if no group passed to the ".by" argument. |
.points |
A logical value defining whether points will be visualised or not. |
.test |
A logical vector whether statistical tests should be applied. See "Details" for more information. |
.signif.label.size |
An integer value defining the size of text for p-value. |
... |
Not used here. |
If data is grouped, then statistical tests for comparing means of groups will be performed, unless .test = FALSE
is supplied.
In case there are only two groups, the Wilcoxon rank sum test (https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) is performed
(R function wilcox.test
with an argument exact = FALSE
) for testing if there is a difference in mean rank values between two groups.
In case there more than two groups, the Kruskal-Wallis test (https://en.wikipedia.org/wiki/Kruskal
A significant Kruskal-Wallis test indicates that at least one sample stochastically dominates one other sample.
Adjusted for multiple comparisons P-values are plotted on the top of groups.
P-value adjusting is done using the Holm method (https://en.wikipedia.org/wiki/Holm
You can execute the command ?p.adjust
in the R console to see more.
A ggplot2 object.
data(immdata) repExplore(immdata$data, "volume") %>% vis() repExplore(immdata$data, "count") %>% vis() repExplore(immdata$data, "len") %>% vis() repExplore(immdata$data, "clones") %>% vis()
data(immdata) repExplore(immdata$data, "volume") %>% vis() repExplore(immdata$data, "count") %>% vis() repExplore(immdata$data, "len") %>% vis() repExplore(immdata$data, "clones") %>% vis()
Visualise distributions of genes using heatmaps or other plots.
## S3 method for class 'immunr_gene_usage' vis(.data, .plot = c("hist", "box", "heatmap", "heatmap2", "circos"), ...)
## S3 method for class 'immunr_gene_usage' vis(.data, .plot = c("hist", "box", "heatmap", "heatmap2", "circos"), ...)
.data |
Output from the geneUsage function. |
.plot |
String specifying the plot type: - "hist" for histograms using vis_hist; - "heatmap" for heatmaps using vis_heatmap; - "heatmap2" for heatmaps using vis_heatmap2; - "circos" for circos plots using vis_circos. |
... |
Other arguments passed to corresponding functions depending on the plot type: - "hist" - passes arguments to vis_hist; - "box" - passes arguments to vis_box; - "heatmap" - passes arguments to vis_heatmap; - "heatmap2" - passes arguments to vis_heatmap2 and heatmap from the "pheatmap" package; - "circos" - passes arguments to vis_circos and chordDiagram from the "circlize" package. |
A ggplot2 object, pheatmap or circlize object.
data(immdata) gu <- geneUsage(immdata$data[[1]]) vis(gu) gu <- geneUsage(immdata$data) vis(gu, .by = "Status", .meta = immdata$meta) vis(gu, "box", .by = "Status", .meta = immdata$meta)
data(immdata) gu <- geneUsage(immdata$data[[1]]) vis(gu) gu <- geneUsage(immdata$data) vis(gu, .by = "Status", .meta = immdata$meta) vis(gu, "box", .by = "Status", .meta = immdata$meta)
Visualisation of the results of hierarchical clustering. For other clustering visualisations see vis.immunr_kmeans.
## S3 method for class 'immunr_hclust' vis(.data, .rect = FALSE, .plot = c("clust", "best"), ...)
## S3 method for class 'immunr_hclust' vis(.data, .rect = FALSE, .plot = c("clust", "best"), ...)
.data |
Clustering results from repOverlapAnalysis or geneUsageAnalysis. |
.rect |
Passed to fviz_dend - whether to add a rectangle around groups. |
.plot |
A character vector of length one or two specifying which plots to visualise. If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. If both then plot both. |
... |
Not used here. |
Ggplot2 objects inside the patchwork container.
vis, repOverlapAnalysis, geneUsageAnalysis
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+hclust") %>% vis()
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+hclust") %>% vis()
Visualise incremental overlaps
## S3 method for class 'immunr_inc_overlap' vis(.data, .target = 1, .grid = FALSE, .ncol = 2, ...)
## S3 method for class 'immunr_inc_overlap' vis(.data, .target = 1, .grid = FALSE, .ncol = 2, ...)
.data |
Output from the repOverlap function that uses "top" methods. |
.target |
Index of a repertoire to plot. Omitted if .grid is TRUE. |
.grid |
Logical. If TRUE then plot all similarities in a grid. |
.ncol |
Numeric. Number of columns in the resulting grid. |
... |
Not used here. |
A ggplot2 object.
data(immdata) tmp <- repOverlap(immdata$data[1:4], "inc+overlap", .verbose.inc = FALSE, .verbose = FALSE) vis(tmp, .target = 1) vis(tmp, .grid = TRUE)
data(immdata) tmp <- repOverlap(immdata$data[1:4], "inc+overlap", .verbose.inc = FALSE, .verbose = FALSE) vis(tmp, .target = 1) vis(tmp, .grid = TRUE)
Visualisation of the results of K-means and DBSCAN clustering. For hierarhical clustering visualisations see vis.immunr_hclust.
## S3 method for class 'immunr_kmeans' vis( .data, .point = TRUE, .text = TRUE, .ellipse = TRUE, .point.size = 2, .text.size = 10, .plot = c("clust", "best"), ... )
## S3 method for class 'immunr_kmeans' vis( .data, .point = TRUE, .text = TRUE, .ellipse = TRUE, .point.size = 2, .text.size = 10, .plot = c("clust", "best"), ... )
.data |
Clustering results from repOverlapAnalysis or geneUsageAnalysis. |
.point |
If TRUE then plot sample points. Passed to fviz_cluster. |
.text |
If TRUE then plot text labels. Passed to fviz_cluster. |
.ellipse |
If TRUE then plot ellipses around all samples. Passed to "ellipse" from fviz_cluster. |
.point.size |
Size of points, passed to "pointsize" from fviz_cluster. |
.text.size |
Size of text labels, passed to labelsize from fviz_cluster. |
.plot |
A character vector of length one or two specifying which plots to visualise. If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. If both then plot both. |
... |
Not used here. |
Ggplot2 objects inside the pathwork container.
vis, repOverlapAnalysis, geneUsageAnalysis
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+kmeans") %>% vis()
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+kmeans") %>% vis()
Plot a distribution (bar plot) of the most frequent kmers in a data.
## S3 method for class 'immunr_kmer_table' vis( .data, .head = 100, .position = c("stack", "dodge", "fill"), .log = FALSE, ... )
## S3 method for class 'immunr_kmer_table' vis( .data, .head = 100, .position = c("stack", "dodge", "fill"), .log = FALSE, ... )
.data |
Data frame with two columns "Kmers" and "Count" or a list with such data frames. See Examples. |
.head |
Number of the most frequent kmers to choose for plotting from each data frame. |
.position |
Character vector of length 1. Position of bars for each kmers. Value for the |
.log |
Logical. If TRUE then plot log-scaled plots. |
... |
Not used here. |
A ggplot2 object.
get.kmers
# Load necessary data and package. data(immdata) # Get 5-mers. imm.km <- getKmers(immdata$data[[1]], 5) # Plots for kmer proportions in each data frame in immdata. p1 <- vis(imm.km, .position = "stack") p2 <- vis(imm.km, .position = "fill") p1 + p2
# Load necessary data and package. data(immdata) # Get 5-mers. imm.km <- getKmers(immdata$data[[1]], 5) # Plots for kmer proportions in each data frame in immdata. p1 <- vis(imm.km, .position = "stack") p2 <- vis(imm.km, .position = "fill") p1 + p2
PCA / MDS / tSNE visualisation (mainly overlap / gene usage)
## S3 method for class 'immunr_mds' vis( .data, .by = NA, .meta = NA, .point = TRUE, .text = TRUE, .ellipse = TRUE, .point.size = 2, .text.size = 4, ... )
## S3 method for class 'immunr_mds' vis( .data, .by = NA, .meta = NA, .point = TRUE, .text = TRUE, .ellipse = TRUE, .point.size = 2, .text.size = 4, ... )
.data |
Output from analysis functions such as geneUsageAnalysis or immunr_pca, immunr_mds or immunr_tsne. |
.by |
Pass NA if you want to plot samples without grouping. You can pass a character vector with one or several column names from ".meta" to group your data before plotting. In this case you should provide ".meta". You can pass a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to ".meta". |
.meta |
A metadata object. An R dataframe with sample names and their properties, such as age, serostatus or hla. |
.point |
Logical. If TRUE then plot points corresponding to objects. |
.text |
Logical. If TRUE then plot sample names. |
.ellipse |
Logical. If TRUE then plot ellipses around clusters of grouped samples. |
.point.size |
Numeric. A size of points to plot. |
.text.size |
Numeric. A size of sample names' labels. |
... |
Not used here. |
Other visualisation methods:
- PCA - vis.immunr_pca
- MDS - vis.immunr_mds
- tSNE - vis.immunr_tsne
A ggplot2 object.
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds") %>% vis()
data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds") %>% vis()
Visualises matrices with overlap values or gene usage distances among samples. For details see the links below.
## S3 method for class 'immunr_ov_matrix' vis(.data, .plot = c("heatmap", "heatmap2", "circos"), ...)
## S3 method for class 'immunr_ov_matrix' vis(.data, .plot = c("heatmap", "heatmap2", "circos"), ...)
.data |
Output from repOverlap or geneUsageAnalysis. |
.plot |
A string specifying the plot type: - "heatmap" for heatmaps using vis_heatmap; - "heatmap2" for heatmaps using vis_heatmap2; - "circos" for circos plots using vis_circos; |
... |
Other arguments are passed through to the underlying plotting function: - "heatmap" - passes arguments to vis_heatmap; - "heatmap2" - passes arguments to vis_heatmap2 and heatmap from the "pheatmap" package; - "circos" - passes arguments to vis_circos and chordDiagram from the "circlize" package; |
A ggplot2, pheatmap or circlize object.
data(immdata) ov <- repOverlap(immdata$data) vis(ov) vis(ov, "heatmap") vis(ov, "heatmap2") vis(ov, "circos")
data(immdata) ov <- repOverlap(immdata$data) vis(ov) vis(ov, "heatmap") vis(ov, "heatmap2") vis(ov, "circos")
Public repertoire visualisation
## S3 method for class 'immunr_public_repertoire' vis(.data, .plot = c("freq", "clonotypes"), ...)
## S3 method for class 'immunr_public_repertoire' vis(.data, .plot = c("freq", "clonotypes"), ...)
.data |
Public repertoire, an output from pubRep. |
.plot |
A string specifying the plot type: - "freq" for visualisation of the distribution of occurrences of clonotypes and their frequencies using vis_public_frequencies. - "clonotypes" for visualisation of public clonotype frequenciy correlations between pairs of samples using vis_public_clonotypes |
... |
Further arguments passed vis_public_frequencies or vis_public_clonotypes, depending on the ".plot" argument. |
A ggplot2 object.
data(immdata) immdata$data <- lapply(immdata$data, head, 300) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "freq") vis(pr, "freq", .type = "none") vis(pr, "clonotypes", 1, 2)
data(immdata) immdata$data <- lapply(immdata$data, head, 300) pr <- pubRep(immdata$data, .verbose = FALSE) vis(pr, "freq") vis(pr, "freq", .type = "none") vis(pr, "clonotypes", 1, 2)
Visualise public clonotype frequencies.
## S3 method for class 'immunr_public_statistics' vis(.data, ...)
## S3 method for class 'immunr_public_statistics' vis(.data, ...)
.data |
Public repertoire - an output from the pubRep function. |
... |
Other arguments passsed directly to upset. |
A ggplot2 object.
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pubRepStatistics(pr) %>% vis()
data(immdata) immdata$data <- lapply(immdata$data, head, 2000) pr <- pubRep(immdata$data, .verbose = FALSE) pubRepStatistics(pr) %>% vis()
Handler for .nofail argument of pipeline steps that prevents examples from crashing on computers where certain dependencies are not installed
## S3 method for class 'step_failure_ignored' vis(.data, ...)
## S3 method for class 'step_failure_ignored' vis(.data, ...)
.data |
Not used here. |
... |
Not used here. |
An empty object with "step_failure_ignored" class.