
Gene Set Similarity Tutorial
Source: vignettes/Tutorial_GeneSetSimilarity.Rmd
Installation
The user can install the development version of markeR from GitHub with:
# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")
Signature Similarity
Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and a reference set, quantifying their overlap either as a share of genes in common (Jaccard) or as a statistical enrichment (logOR).
Users can compare their signatures to:
- Custom gene sets, defined manually, or
- MSigDB collections, via the msigdbr package.
The function provides options to:
- Filter by Jaccard index threshold, using jaccard_threshold
- Filter by odds ratio and p-value, using or_threshold and pval_threshold
- Limit the number of top-matching reference signatures shown, using num_sigs_toplot
Similarity via Jaccard Index
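The Jaccard index measures raw set overlap: the number of genes shared by two sets divided by the number of distinct genes across both, J(A, B) = |A ∩ B| / |A ∪ B|. It ranges from 0 (no genes in common) to 1 (identical sets).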
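As a quick illustration of this measure, the base-R sketch below computes the Jaccard index for two gene vectors. The helper name jaccard_index() is hypothetical and is not a markeR function; how the package itself handles duplicated or differently cased symbols is not shown in this tutorial, so treat this as an assumption-laden sketch.

```r
# Hypothetical helper (not part of markeR): Jaccard index of two gene sets.
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(NA_real_)  # undefined when both sets are empty
  length(intersect(a, b)) / n_union
}

query     <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
reference <- c("CDK2", "CDK4", "CCNB1", "MYC")

shared <- intersect(query, reference)  # genes present in both sets
j <- jaccard_index(query, reference)   # 2 shared genes / 7 distinct genes
print(shared)
print(j)
```

Here two genes (MYC and CDK2) are shared out of seven distinct genes, so the index is 2/7.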
Example 1: Compare against user-defined and MSigDB gene sets
library(markeR)

# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")
signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0.05,
  msig_subset = NULL,
  metric = "jaccard"
)
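To make the user-defined part of this comparison concrete, the sketch below recomputes the pairwise Jaccard indices for the same example signatures in base R and then applies a 0.05 cutoff. It is only an illustration: the helper jaccard() is hypothetical, the MSigDB sets are left out, and treating the threshold as "keep values at or above the cutoff" is an assumption rather than documented markeR behavior.

```r
# Hypothetical helper (not part of markeR)
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

signatures <- list(
  Sig1 = c("TP53", "BRCA1", "MYC", "EGFR", "CDK2"),
  Sig2 = c("ATXN2", "FUS", "MTOR", "CASP3")
)
reference <- list(
  User_Apoptosis = c("TP53", "CASP3", "BAX"),
  User_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC"),
  User_DNARepair = c("BRCA1", "RAD51", "ATM"),
  User_MTOR      = c("MTOR", "AKT1", "RPS6KB1")
)

# Rows: query signatures; columns: reference gene sets
jac <- sapply(reference, function(ref) sapply(signatures, jaccard, b = ref))
print(round(jac, 3))

# Assumed filtering rule: keep reference sets with Jaccard >= threshold
threshold <- 0.05
kept <- lapply(
  setNames(rownames(jac), rownames(jac)),
  function(s) names(which(jac[s, ] >= threshold))
)
print(kept)
```

With these toy sets, Sig1 overlaps the apoptosis, cell cycle, and DNA repair sets, while Sig2 overlaps only the apoptosis and MTOR sets; all other pairs share no genes and fall below the cutoff.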
Example 2: Restrict comparison to a custom subset of MSigDB
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0,
  # "NON_EXISTENT_PATHWAY" is included as a name that matches no gene set
  msig_subset = c(
    "KEGG_MTOR_SIGNALING_PATHWAY",
    "KEGG_APOPTOSIS",
    "NON_EXISTENT_PATHWAY"
  ),
  metric = "jaccard"
)
Similarity via Log Odds Ratio
The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.
- Log odds ratio (logOR): Derived from contingency tables using:
  - Genes in both sets
  - Genes in one but not the other
  - Gene universe as background

Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.
Note: When using metric = "odds_ratio", the universe parameter must be supplied.
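The contingency-table logic described above can be sketched in base R as follows. The helper overlap_enrichment() and the toy gene names are hypothetical, and the exact table construction, odds-ratio estimator, and p-value correction used inside markeR are not shown in this tutorial, so this is a minimal sketch under those assumptions rather than the package's implementation.

```r
# Hypothetical helper (not part of markeR): 2x2 table, log odds ratio,
# and one-sided Fisher's exact test for the overlap of two gene sets.
overlap_enrichment <- function(set_a, set_b, universe) {
  universe <- unique(universe)
  in_a <- universe %in% set_a  # genes outside the universe are ignored
  in_b <- universe %in% set_b
  tab <- table(
    in_A = factor(in_a, levels = c(TRUE, FALSE)),
    in_B = factor(in_b, levels = c(TRUE, FALSE))
  )
  # Sample odds ratio: (both * neither) / (A only * B only).
  # It is Inf or NaN when a cell is zero; no correction is applied here.
  odds_ratio <- (as.numeric(tab[1, 1]) * tab[2, 2]) /
    (as.numeric(tab[1, 2]) * tab[2, 1])
  test <- fisher.test(tab, alternative = "greater")  # enrichment only
  list(table = tab, log_or = log(odds_ratio), p_value = test$p.value)
}

# Toy universe of 100 genes; the two sets share 5 genes
universe <- paste0("GENE", 1:100)
set_a <- universe[1:10]
set_b <- universe[6:20]

res <- overlap_enrichment(set_a, set_b, universe)
print(res$table)
print(res$log_or)
print(res$p_value)
```

In this toy case the table holds 5 shared genes, 5 genes only in the first set, 10 only in the second, and 80 in neither, giving a sample odds ratio of 8. When many reference sets are tested, the resulting p-values would typically be corrected with p.adjust(); which correction method markeR applies is not stated here.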
Example 3: Compare against user-defined and MSigDB gene sets
# Define gene universe (e.g., genes from HPA or your dataset)
# msigdbr >= 10.0.0 takes `collection`; the older `category` argument is deprecated
gene_universe <- unique(c(
  signature1, signature2,
  unlist(signature_list),
  msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
))
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  universe = gene_universe,
  or_threshold = 1,
  pval_threshold = 0.05,
  width_text = 50
)
Session Information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] markeR_0.99.2
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 gridExtra_2.3 rlang_1.1.6
## [4] magrittr_2.0.3 clue_0.3-66 GetoptLong_1.0.5
## [7] msigdbr_25.1.1 matrixStats_1.5.0 compiler_4.5.1
## [10] png_0.1-8 systemfonts_1.2.3 vctrs_0.6.5
## [13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [16] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [19] backports_1.5.0 effectsize_1.0.1 rmarkdown_2.29
## [22] ragg_1.4.0 purrr_1.1.0 xfun_0.52
## [25] cachem_1.1.0 jsonlite_2.0.0 BiocParallel_1.42.1
## [28] broom_1.0.8 parallel_4.5.1 cluster_2.1.8.1
## [31] R6_2.6.1 stringi_1.8.7 bslib_0.9.0
## [34] RColorBrewer_1.1-3 limma_3.64.1 car_3.1-3
## [37] jquerylib_0.1.4 Rcpp_1.1.0 assertthat_0.2.1
## [40] iterators_1.0.14 knitr_1.50 parameters_0.27.0
## [43] IRanges_2.42.0 Matrix_1.7-3 tidyselect_1.2.1
## [46] abind_1.4-8 yaml_2.3.10 doParallel_1.0.17
## [49] codetools_0.2-20 curl_6.4.0 lattice_0.22-7
## [52] tibble_3.3.0 plyr_1.8.9 withr_3.0.2
## [55] bayestestR_0.16.1 evaluate_1.0.4 desc_1.4.3
## [58] circlize_0.4.16 pillar_1.11.0 ggpubr_0.6.1
## [61] carData_3.0-5 foreach_1.5.2 stats4_4.5.1
## [64] insight_1.3.1 generics_0.1.4 S4Vectors_0.46.0
## [67] ggplot2_3.5.2 scales_1.4.0 glue_1.8.0
## [70] tools_4.5.1 data.table_1.17.8 fgsea_1.34.2
## [73] locfit_1.5-9.12 ggsignif_0.6.4 babelgene_22.9
## [76] fs_1.6.6 fastmatch_1.1-6 cowplot_1.2.0
## [79] grid_4.5.1 tidyr_1.3.1 datawizard_1.2.0
## [82] edgeR_4.6.3 colorspace_2.1-1 Formula_1.2-5
## [85] cli_3.6.5 textshaping_1.0.1 ComplexHeatmap_2.24.1
## [88] dplyr_1.1.4 gtable_0.3.6 ggh4x_0.3.1
## [91] rstatix_0.7.2 sass_0.4.10 digest_0.6.37
## [94] BiocGenerics_0.54.0 ggrepel_0.9.6 rjson_0.2.23
## [97] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1
## [100] pkgdown_2.1.3 lifecycle_1.0.4 GlobalOptions_0.1.2
## [103] statmod_1.5.0