
Gene Set Similarity Tutorial
Rita M. Silva
2025-09-19
Source:vignettes/articles/Article_GeneSetSimilarity.Rmd
Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the geneset_similarity() function implements two complementary similarity metrics:
- Jaccard Index: the number of genes shared by the two sets divided by the total number of distinct genes across both sets.
- Log Odds Ratio (logOR): derived from Fisher’s exact test of association between gene sets, given a specified gene universe.
Users can compare their signatures to:
- Custom gene sets, defined manually;
- MSigDB collections, via the msigdbr package.
The function provides options to:
- Filter by Jaccard index threshold, using jaccard_threshold;
- Filter by odds ratio and p-value, using or_threshold and pval_threshold, respectively.
Similarity via Jaccard Index
The Jaccard index measures raw set overlap. For two gene sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
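As a concrete illustration of this overlap measure, the sketch below computes the index by hand in base R. The helper jaccard_index() is hypothetical and is not part of markeR; it only assumes that duplicated gene symbols should count once, which may differ from how geneset_similarity() handles its inputs.

```r
# Hypothetical helper (not a markeR function): Jaccard index of two gene vectors
jaccard_index <- function(a, b) {
  a <- unique(a)  # assumption: each gene symbol counts once
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)  # undefined for two empty sets
  length(intersect(a, b)) / union_size
}

sig <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
cell_cycle <- c("CDK2", "CDK4", "CCNB1", "MYC")

# 2 shared genes (MYC, CDK2) out of 7 distinct genes
jaccard_index(sig, cell_cycle)
#> [1] 0.2857143
```

Because the index depends only on set sizes, a small signature compared with a very large pathway yields a low value even when every signature gene belongs to that pathway.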
Example 1: Compare against user-defined and MSigDB gene sets
library(markeR)
#> Warning: markeR has been tested with ggplot2 <= 3.5.2. Using newer versions may cause incompatibilities.
# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")
signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0.05,
  msig_subset = NULL,
  metric = "jaccard"
)$plot
Example 2: Restrict comparison to a custom subset of MSigDB
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0,
  msig_subset = c("KEGG_MTOR_SIGNALING_PATHWAY", "KEGG_APOPTOSIS", "NON_EXISTENT_PATHWAY"),
  metric = "jaccard",
  limits = c(0, 0.1)
)$plot
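The subset above includes a name that does not correspond to any gene set. How geneset_similarity() treats such names is not described here, so it can be useful to check a requested subset beforehand. The sketch below does this in base R; the available vector is a small stand-in, and in practice it would presumably be built from the gs_name column returned by msigdbr for the chosen collection.

```r
requested <- c(
  "KEGG_MTOR_SIGNALING_PATHWAY",
  "KEGG_APOPTOSIS",
  "NON_EXISTENT_PATHWAY"
)

# Stand-in for the gene set names available in the collection (assumption:
# in real use, take unique values of the gs_name column from msigdbr output)
available <- c(
  "KEGG_MTOR_SIGNALING_PATHWAY",
  "KEGG_APOPTOSIS",
  "KEGG_CELL_CYCLE"
)

missing_sets <- setdiff(requested, available)   # names with no match
valid_subset <- intersect(requested, available) # names safe to pass on

if (length(missing_sets) > 0) {
  message("Not found: ", paste(missing_sets, collapse = ", "))
}
```

Here valid_subset keeps the two KEGG names and missing_sets holds the unmatched one, so a typo in a pathway name is caught before plotting.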
Similarity via Log Odds Ratio
The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table.
- Log odds ratio (logOR): derived from a contingency table built from:
  - Genes in both sets
  - Genes in one set but not the other
  - The gene universe as background
Note: When using metric = "odds_ratio", the universe parameter must be supplied.
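To make the role of the universe concrete, the following base R sketch builds the 2×2 table for two toy gene sets and passes it to fisher.test(). The helper log_odds_ratio() and the GENE1 to GENE20 symbols are hypothetical, and the use of log10 follows the log10OR comment in the example below; none of this is taken from the markeR source, which may compute the statistic differently.

```r
# Hypothetical helper (not a markeR function)
log_odds_ratio <- function(a, b, universe) {
  universe <- unique(universe)
  # Assumption: genes outside the universe are ignored
  in_a <- universe %in% a
  in_b <- universe %in% b
  # Fix the level order so the [1, 1] cell counts genes in both sets
  tab <- table(
    in_a = factor(in_a, levels = c(TRUE, FALSE)),
    in_b = factor(in_b, levels = c(TRUE, FALSE))
  )
  ft <- fisher.test(tab)
  or <- unname(ft$estimate)  # conditional maximum likelihood estimate
  list(table = tab, odds_ratio = or, log10_or = log10(or), p_value = ft$p.value)
}

universe <- paste0("GENE", 1:20)
set_a <- paste0("GENE", 1:6)
set_b <- paste0("GENE", 4:10)

res <- log_odds_ratio(set_a, set_b, universe)
res$table     # 3 shared, 3 only in A, 4 only in B, 10 in neither
res$log10_or  # positive when overlap exceeds what the universe predicts
res$p_value
```

Enlarging the universe while keeping both sets fixed adds genes to the "neither" cell, which raises the odds ratio. This is why the choice of background matters and why the universe has to be stated explicitly.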
Example 3: Compare against user-defined and MSigDB gene sets
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  # Define gene universe (e.g., genes from HPA or your dataset)
  universe = unique(c(
    signature1, signature2,
    unlist(signature_list),
    # msigdbr >= 10.0.0 deprecates `category` in favour of `collection`
    msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
  )),
  or_threshold = 100, # odds ratio of 100, i.e. log10OR = 2
  width_text = 50,
  pval_threshold = 0.05
)$plot
Session Information
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] markeR_0.99.5
#>
#> loaded via a namespace (and not attached):
#> [1] pROC_1.19.0.1 gridExtra_2.3 rlang_1.1.6
#> [4] magrittr_2.0.4 clue_0.3-66 GetoptLong_1.0.5
#> [7] msigdbr_25.1.1 matrixStats_1.5.0 compiler_4.5.1
#> [10] png_0.1-8 systemfonts_1.2.3 vctrs_0.6.5
#> [13] reshape2_1.4.4 stringr_1.5.2 pkgconfig_2.0.3
#> [16] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
#> [19] backports_1.5.0 labeling_0.4.3 effectsize_1.0.1
#> [22] rmarkdown_2.29 ragg_1.5.0 purrr_1.1.0
#> [25] xfun_0.53 cachem_1.1.0 jsonlite_2.0.0
#> [28] BiocParallel_1.42.2 broom_1.0.10 parallel_4.5.1
#> [31] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
#> [34] stringi_1.8.7 RColorBrewer_1.1-3 limma_3.64.3
#> [37] car_3.1-3 jquerylib_0.1.4 Rcpp_1.1.0
#> [40] assertthat_0.2.1 iterators_1.0.14 knitr_1.50
#> [43] parameters_0.28.2 IRanges_2.42.0 Matrix_1.7-3
#> [46] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
#> [49] doParallel_1.0.17 codetools_0.2-20 curl_7.0.0
#> [52] lattice_0.22-7 tibble_3.3.0 plyr_1.8.9
#> [55] withr_3.0.2 bayestestR_0.17.0 S7_0.2.0
#> [58] evaluate_1.0.5 desc_1.4.3 circlize_0.4.16
#> [61] pillar_1.11.1 ggpubr_0.6.1 carData_3.0-5
#> [64] foreach_1.5.2 stats4_4.5.1 insight_1.4.2
#> [67] generics_0.1.4 S4Vectors_0.46.0 ggplot2_4.0.0
#> [70] scales_1.4.0 glue_1.8.0 tools_4.5.1
#> [73] data.table_1.17.8 fgsea_1.34.2 locfit_1.5-9.12
#> [76] ggsignif_0.6.4 babelgene_22.9 fs_1.6.6
#> [79] fastmatch_1.1-6 cowplot_1.2.0 grid_4.5.1
#> [82] tidyr_1.3.1 datawizard_1.2.0 edgeR_4.6.3
#> [85] colorspace_2.1-1 Formula_1.2-5 cli_3.6.5
#> [88] textshaping_1.0.3 ComplexHeatmap_2.24.1 dplyr_1.1.4
#> [91] gtable_0.3.6 ggh4x_0.3.1 rstatix_0.7.2
#> [94] sass_0.4.10 digest_0.6.37 BiocGenerics_0.54.0
#> [97] ggrepel_0.9.6 rjson_0.2.23 htmlwidgets_1.6.4
#> [100] farver_2.1.2 htmltools_0.5.8.1 pkgdown_2.1.3
#> [103] lifecycle_1.0.4 GlobalOptions_0.1.2 statmod_1.5.0