
Gene Set Similarity Tutorial
Source: vignettes/Tutorial_GeneSetSimilarity.Rmd
Installation
Install the development version of markeR from GitHub with:
# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")
Signature Similarity
Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and a reference set, quantifying their overlap either as a percentage or as a statistical enrichment.
Users can compare their signatures to:
- Custom gene sets, defined manually, or
- MSigDB collections, via the msigdbr package.
The function provides options to:
- Filter by Jaccard index threshold, using jaccard_threshold
- Filter by odds ratio and p-value, using or_threshold and pval_threshold
- Limit the number of top-matching reference signatures shown, using num_sigs_toplot
Similarity via Jaccard Index
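The Jaccard index measures raw set overlap as the size of the intersection of two gene sets divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A is a user signature and B is a reference gene set.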
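To make the overlap measure concrete, here is a minimal base R sketch that computes it by hand for one signature and one reference set reused from Example 1. The helper name jaccard_index() is hypothetical and is not part of markeR; how geneset_similarity() computes the value internally is not shown in this tutorial.

```r
# Toy gene sets reused from Example 1 of this tutorial
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
user_cell_cycle <- c("CDK2", "CDK4", "CCNB1", "MYC")

# jaccard_index() is a hypothetical helper for illustration, not a markeR function
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  # Two empty sets have no defined overlap
  if (union_size == 0) return(NA_real_)
  length(intersect(a, b)) / union_size
}

# Shared genes: MYC and CDK2 (2); union: 7 distinct genes, so the index is 2/7
jaccard_sig1_cellcycle <- jaccard_index(signature1, user_cell_cycle)
print(jaccard_sig1_cellcycle)
```

With these two sets the index is 2/7 (about 0.29), which would pass the jaccard_threshold = 0.05 filter used in Example 1.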
Example 1: Compare against user-defined and MSigDB gene sets
# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")
signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0.05,
  msig_subset = NULL,
  metric = "jaccard"
)
Example 2: Restrict comparison to a custom subset of MSigDB
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0,
  msig_subset = c("KEGG_MTOR_SIGNALING_PATHWAY", "KEGG_APOPTOSIS", "NON_EXISTENT_PATHWAY"),
  metric = "jaccard"
)
Similarity via Log Odds Ratio
The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.
- Log odds ratio (logOR): derived from a contingency table built from:
  - Genes in both sets
  - Genes in one set but not the other
  - The gene universe, used as background

Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.
Note: When using metric = "odds_ratio", the universe parameter must be supplied.
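The role of the universe can be illustrated with a small base R sketch that builds the 2×2 table and runs a one-sided Fisher's exact test for one signature against two reference sets. This is an illustration under stated assumptions, not markeR's implementation: overlap_test() is a hypothetical helper, the GENE1 to GENE100 placeholders stand in for a real background, the natural log and the conditional odds ratio estimate returned by fisher.test() are assumed, and Benjamini-Hochberg is assumed as the p-value adjustment because the tutorial does not name the method.

```r
# Toy inputs reused from the tutorial examples
signature <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
reference_sets <- list(
  User_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC"),
  User_DNARepair = c("BRCA1", "RAD51", "ATM")
)
# Placeholder background: GENE1..GENE100 are made-up symbols (assumption)
universe <- unique(c(signature, unlist(reference_sets), paste0("GENE", 1:100)))

# overlap_test() is a hypothetical helper, not a markeR function
overlap_test <- function(sig, ref, universe) {
  # Only genes present in the universe are counted
  sig <- intersect(sig, universe)
  ref <- intersect(ref, universe)
  n_both     <- length(intersect(sig, ref))
  n_sig_only <- length(setdiff(sig, ref))
  n_ref_only <- length(setdiff(ref, sig))
  n_neither  <- length(universe) - n_both - n_sig_only - n_ref_only
  # Rows: in signature (yes/no); columns: in reference (yes/no); filled column-wise
  tab <- matrix(
    c(n_both, n_ref_only, n_sig_only, n_neither),
    nrow = 2,
    dimnames = list(in_signature = c("yes", "no"), in_reference = c("yes", "no"))
  )
  # One-sided test for enrichment (odds ratio greater than 1)
  ft <- fisher.test(tab, alternative = "greater")
  c(log_or = log(unname(ft$estimate)), p_value = ft$p.value)
}

results <- as.data.frame(t(sapply(
  reference_sets, overlap_test, sig = signature, universe = universe
)))
# Multiple-testing correction across reference sets (BH is an assumption)
results$p_adjusted <- p.adjust(results$p_value, method = "BH")
print(results)
```

Because the count of genes in neither set depends directly on the size of the universe, the same overlap can look more or less enriched depending on the background chosen, which is why the universe has to be supplied explicitly.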
Example 3: Compare against user-defined and MSigDB gene sets
# Define gene universe (e.g., genes from HPA or your dataset)
gene_universe <- unique(c(
  signature1, signature2,
  unlist(signature_list),
  # `collection` replaces the `category` argument, deprecated as of msigdbr 10.0.0
  msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
))
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  universe = gene_universe,
  or_threshold = 1,
  pval_threshold = 0.05,
  width_text = 50
)
Session Information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] markeR_0.99.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 gridExtra_2.3 rlang_1.1.6
## [4] magrittr_2.0.3 clue_0.3-66 GetoptLong_1.0.5
## [7] msigdbr_24.1.0 matrixStats_1.5.0 compiler_4.5.1
## [10] png_0.1-8 systemfonts_1.2.3 vctrs_0.6.5
## [13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [16] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [19] backports_1.5.0 labeling_0.4.3 effectsize_1.0.1
## [22] rmarkdown_2.29 ragg_1.4.0 purrr_1.0.4
## [25] xfun_0.52 cachem_1.1.0 jsonlite_2.0.0
## [28] BiocParallel_1.42.1 broom_1.0.8 parallel_4.5.1
## [31] cluster_2.1.8.1 R6_2.6.1 stringi_1.8.7
## [34] bslib_0.9.0 RColorBrewer_1.1-3 limma_3.64.1
## [37] car_3.1-3 jquerylib_0.1.4 Rcpp_1.0.14
## [40] assertthat_0.2.1 iterators_1.0.14 knitr_1.50
## [43] parameters_0.26.0 IRanges_2.42.0 Matrix_1.7-3
## [46] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
## [49] doParallel_1.0.17 codetools_0.2-20 curl_6.4.0
## [52] lattice_0.22-7 tibble_3.3.0 plyr_1.8.9
## [55] withr_3.0.2 bayestestR_0.16.0 evaluate_1.0.4
## [58] desc_1.4.3 circlize_0.4.16 pillar_1.10.2
## [61] ggpubr_0.6.1 carData_3.0-5 foreach_1.5.2
## [64] stats4_4.5.1 insight_1.3.0 generics_0.1.4
## [67] S4Vectors_0.46.0 ggplot2_3.5.2 scales_1.4.0
## [70] glue_1.8.0 tools_4.5.1 data.table_1.17.6
## [73] fgsea_1.34.0 locfit_1.5-9.12 ggsignif_0.6.4
## [76] babelgene_22.9 fs_1.6.6 fastmatch_1.1-6
## [79] cowplot_1.1.3 grid_4.5.1 tidyr_1.3.1
## [82] datawizard_1.1.0 edgeR_4.6.2 colorspace_2.1-1
## [85] Formula_1.2-5 cli_3.6.5 textshaping_1.0.1
## [88] ComplexHeatmap_2.24.1 dplyr_1.1.4 gtable_0.3.6
## [91] ggh4x_0.3.1 rstatix_0.7.2 sass_0.4.10
## [94] digest_0.6.37 BiocGenerics_0.54.0 ggrepel_0.9.6
## [97] rjson_0.2.23 htmlwidgets_1.6.4 farver_2.1.2
## [100] htmltools_0.5.8.1 pkgdown_2.1.3 lifecycle_1.0.4
## [103] GlobalOptions_0.1.2 statmod_1.5.0