
Gene Set Similarity Tutorial
Rita M. Silva
2025-08-21
Source: vignettes/articles/Article_GeneSetSimilarity.Rmd
Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and a set of reference gene sets, quantifying their overlap either as a fraction of shared genes or as a statistical enrichment.
Users can compare their signatures to:
- Custom gene sets, defined manually, or
- MSigDB collections, via the msigdbr package.
The function provides options to:
- Filter by Jaccard index threshold, using jaccard_threshold
- Filter by odds ratio and p-value, using or_threshold and pval_threshold
- Limit the number of top-matching reference signatures shown, using num_sigs_toplot
Similarity via Jaccard Index
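The Jaccard index measures raw set overlap: the number of genes shared by two sets divided by the number of genes in their union, J(A, B) = |A ∩ B| / |A ∪ B|. It ranges from 0 (no shared genes) to 1 (identical sets).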
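As a quick sanity check on this measure, the index can be computed by hand in base R. The sketch below is illustrative only: the helper jaccard() is a hypothetical name, not a markeR function, and it is not claimed to reproduce how geneset_similarity() computes or filters values internally. The gene sets are toy examples.

```r
# Hypothetical helper (not a markeR function): Jaccard index of two gene sets.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(NA_real_)  # undefined when both sets are empty
  length(intersect(a, b)) / n_union
}

# Toy query signatures and reference sets.
signatures <- list(
  Sig1 = c("TP53", "BRCA1", "MYC", "EGFR", "CDK2"),
  Sig2 = c("ATXN2", "FUS", "MTOR", "CASP3")
)
reference <- list(
  Ref_Apoptosis = c("TP53", "CASP3", "BAX"),
  Ref_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC")
)

# Rows are query signatures, columns are reference sets.
jaccard_matrix <- sapply(reference, function(ref) {
  sapply(signatures, jaccard, b = ref)
})
print(round(jaccard_matrix, 3))
```

Here Sig1 shares two of seven distinct genes with Ref_CellCycle (MYC and CDK2), giving 2/7 ≈ 0.286, while Sig2 shares none with it and scores 0.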
Example 1: Compare against user-defined and MSigDB gene sets
# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")
signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0.05,
  msig_subset = NULL,
  metric = "jaccard"
)$plot
Example 2: Restrict comparison to a custom subset of MSigDB
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  jaccard_threshold = 0,
  msig_subset = c(
    "KEGG_MTOR_SIGNALING_PATHWAY",
    "KEGG_APOPTOSIS",
    "NON_EXISTENT_PATHWAY"
  ),
  metric = "jaccard"
)$plot
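Example 2 requests a name, NON_EXISTENT_PATHWAY, that as its name suggests does not match any gene set. How geneset_similarity() handles unmatched names is not described here, so it can be useful to check a requested subset against the available gene set names beforehand. The following is a minimal base R sketch; available_sets is a hypothetical stand-in for the real names, which could for instance be taken from the gs_name column returned by msigdbr::msigdbr().

```r
# Hypothetical stand-in for the gene set names available in the chosen
# collection (in practice, e.g., the unique gs_name values from msigdbr).
available_sets <- c(
  "KEGG_APOPTOSIS",
  "KEGG_CELL_CYCLE",
  "KEGG_MTOR_SIGNALING_PATHWAY"
)

requested <- c(
  "KEGG_MTOR_SIGNALING_PATHWAY",
  "KEGG_APOPTOSIS",
  "NON_EXISTENT_PATHWAY"
)

# Split the request into names that can and cannot be matched.
matched <- intersect(requested, available_sets)
unmatched <- setdiff(requested, available_sets)

if (length(unmatched) > 0) {
  message("Not found in reference: ", paste(unmatched, collapse = ", "))
}
```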
Similarity via Log Odds Ratio
The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.
- Log odds ratio (logOR): derived from contingency tables using:
  - Genes in both sets
  - Genes in one but not the other
  - Gene universe as background
Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.
Note: When using metric = "odds_ratio", the universe parameter must be supplied.
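To make the contingency-table construction concrete, the following base R sketch builds the 2×2 table for two toy gene sets within a toy universe and runs a one-sided Fisher's exact test with stats::fisher.test(). It illustrates the general technique only: the gene names are made up, and the exact odds ratio estimator, any continuity correction, and the p-value adjustment method used by geneset_similarity() are not specified here and may differ.

```r
# Toy data: a universe of 100 genes and two overlapping gene sets.
universe <- paste0("G", 1:100)
set_a <- paste0("G", 1:10)   # stands in for a query signature
set_b <- paste0("G", 6:20)   # stands in for a reference gene set

# Membership of every universe gene in each set.
in_a <- factor(universe %in% set_a, levels = c(TRUE, FALSE))
in_b <- factor(universe %in% set_b, levels = c(TRUE, FALSE))

# 2x2 contingency table: rows = in set A?, columns = in set B?
tab <- table(in_a = in_a, in_b = in_b)
print(tab)

# Sample odds ratio from the four cell counts.
sample_or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
log10_or <- log10(sample_or)

# One-sided test for enrichment (odds ratio greater than 1).
ft <- fisher.test(tab, alternative = "greater")
cat("sample OR:", sample_or, " log10(OR):", round(log10_or, 3),
    " p-value:", signif(ft$p.value, 3), "\n")
```

With these counts (5 genes in both sets, 5 only in A, 10 only in B, 80 in neither) the sample odds ratio is (5 × 80) / (5 × 10) = 8. Note that fisher.test() reports a conditional maximum likelihood estimate in ft$estimate, which generally differs slightly from this sample value. Because the table depends on the size of the universe, the same two gene sets give a different odds ratio and p-value against a different background.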
Example 3: Compare against user-defined and MSigDB gene sets
geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  # Define gene universe (e.g., genes from HPA or your dataset)
  universe = unique(c(
    signature1, signature2,
    unlist(signature_list),
    # `collection` replaces the `category` argument deprecated in msigdbr 10.0.0
    msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
  )),
  or_threshold = 100, # log10(OR) = 2
  pval_threshold = 0.05,
  width_text = 50
)$plot
Session Information
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] markeR_0.99.3
#>
#> loaded via a namespace (and not attached):
#> [1] pROC_1.19.0.1 gridExtra_2.3 rlang_1.1.6
#> [4] magrittr_2.0.3 clue_0.3-66 GetoptLong_1.0.5
#> [7] msigdbr_25.1.1 matrixStats_1.5.0 compiler_4.5.1
#> [10] png_0.1-8 systemfonts_1.2.3 vctrs_0.6.5
#> [13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
#> [16] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
#> [19] backports_1.5.0 labeling_0.4.3 effectsize_1.0.1
#> [22] rmarkdown_2.29 ragg_1.4.0 purrr_1.1.0
#> [25] xfun_0.53 cachem_1.1.0 jsonlite_2.0.0
#> [28] BiocParallel_1.42.1 broom_1.0.9 parallel_4.5.1
#> [31] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
#> [34] stringi_1.8.7 RColorBrewer_1.1-3 limma_3.64.3
#> [37] car_3.1-3 jquerylib_0.1.4 Rcpp_1.1.0
#> [40] assertthat_0.2.1 iterators_1.0.14 knitr_1.50
#> [43] parameters_0.28.0 IRanges_2.42.0 Matrix_1.7-3
#> [46] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
#> [49] doParallel_1.0.17 codetools_0.2-20 curl_7.0.0
#> [52] lattice_0.22-7 tibble_3.3.0 plyr_1.8.9
#> [55] withr_3.0.2 bayestestR_0.16.1 evaluate_1.0.4
#> [58] desc_1.4.3 circlize_0.4.16 pillar_1.11.0
#> [61] ggpubr_0.6.1 carData_3.0-5 foreach_1.5.2
#> [64] stats4_4.5.1 insight_1.4.0 generics_0.1.4
#> [67] S4Vectors_0.46.0 ggplot2_3.5.2 scales_1.4.0
#> [70] glue_1.8.0 tools_4.5.1 data.table_1.17.8
#> [73] fgsea_1.34.2 locfit_1.5-9.12 ggsignif_0.6.4
#> [76] babelgene_22.9 fs_1.6.6 fastmatch_1.1-6
#> [79] cowplot_1.2.0 grid_4.5.1 tidyr_1.3.1
#> [82] datawizard_1.2.0 edgeR_4.6.3 colorspace_2.1-1
#> [85] Formula_1.2-5 cli_3.6.5 textshaping_1.0.1
#> [88] ComplexHeatmap_2.24.1 dplyr_1.1.4 gtable_0.3.6
#> [91] ggh4x_0.3.1 rstatix_0.7.2 sass_0.4.10
#> [94] digest_0.6.37 BiocGenerics_0.54.0 ggrepel_0.9.6
#> [97] rjson_0.2.23 htmlwidgets_1.6.4 farver_2.1.2
#> [100] htmltools_0.5.8.1 pkgdown_2.1.3 lifecycle_1.0.4
#> [103] GlobalOptions_0.1.2 statmod_1.5.0