Even if a user-defined gene signature discriminates strongly between conditions, it may simply reflect known biological pathways rather than novel mechanisms. To help assess this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and a collection of reference gene sets, quantifying their overlap either as a proportion of shared genes or as a statistical enrichment.

Users can compare their signatures to:

  • Custom gene sets, defined manually, or
  • MSigDB collections, via the msigdbr package.

The function provides options to:

  • Filter by Jaccard index threshold, using jaccard_threshold
  • Filter by odds ratio and p-value, using or_threshold and pval_threshold
  • Limit the number of top-matching reference signatures shown, using num_sigs_toplot

Similarity via Jaccard Index

The Jaccard index measures raw set overlap:

$$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
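
As a quick check of the formula, the base R sketch below computes the index for two small, made-up gene sets. The helper name jaccard_index() is an illustrative assumption and is not part of the package.

```r
# Hypothetical helper: Jaccard index of two character vectors
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

set_a <- c("TP53", "MYC", "CDK2", "EGFR")
set_b <- c("MYC", "CDK2", "CCNB1")

# Two shared genes (MYC, CDK2) out of five distinct genes overall
jaccard_index(set_a, set_b)
#> [1] 0.4
```

A value of 0 means the sets share no genes and a value of 1 means they are identical.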

Example 1: Compare against user-defined and MSigDB gene sets

# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")

signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0.05,
  msig_subset = NULL, 
  metric = "jaccard"
)$plot
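
To see which values to expect for the user-defined sets in Example 1, the overlaps can be computed by hand in base R. This is only a sketch for checking the numbers, not the function's internal code; in particular, keeping pairs at or above the cutoff is an assumption about how jaccard_threshold is applied.

```r
# Signatures and reference sets copied from Example 1
signatures <- list(
  Sig1 = c("TP53", "BRCA1", "MYC", "EGFR", "CDK2"),
  Sig2 = c("ATXN2", "FUS", "MTOR", "CASP3")
)
reference_sets <- list(
  User_Apoptosis = c("TP53", "CASP3", "BAX"),
  User_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC"),
  User_DNARepair = c("BRCA1", "RAD51", "ATM"),
  User_MTOR      = c("MTOR", "AKT1", "RPS6KB1")
)

# Rows are reference sets, columns are signatures
jaccard_matrix <- sapply(signatures, function(sig) {
  sapply(reference_sets, function(ref) {
    length(intersect(sig, ref)) / length(union(sig, ref))
  })
})
round(jaccard_matrix, 3)

# Assumed filtering rule: keep pairs with Jaccard >= threshold
jaccard_threshold <- 0.05
kept <- which(jaccard_matrix >= jaccard_threshold, arr.ind = TRUE)
data.frame(
  reference = rownames(jaccard_matrix)[kept[, "row"]],
  signature = colnames(jaccard_matrix)[kept[, "col"]],
  jaccard   = jaccard_matrix[kept]
)
```

For instance, Sig1 and User_CellCycle share MYC and CDK2 out of seven distinct genes, giving 2/7, while Sig2 shares no genes with User_CellCycle or User_DNARepair and would drop out at this cutoff.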

Example 2: Restrict comparison to a custom subset of MSigDB


geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0,
  msig_subset = c("KEGG_MTOR_SIGNALING_PATHWAY", "KEGG_APOPTOSIS", "NON_EXISTENT_PATHWAY"), 
  metric = "jaccard"
)$plot

Similarity via Log Odds Ratio

The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.

The log odds ratio is derived from a contingency table built from:

  • Genes in both sets
  • Genes in one set but not the other
  • The gene universe, used as background

Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.

Note: When using metric = "odds_ratio", the universe parameter must be supplied.
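
The contingency-table logic described above can be sketched in base R as follows. The gene identifiers and universe are made up, the variable names are illustrative, and the function's own implementation may differ in detail (for example, in how the odds ratio is estimated). Multiple-testing adjustment is not shown.

```r
# Toy universe and gene sets (hypothetical identifiers)
universe <- paste0("G", 1:20)
set_a <- paste0("G", 1:5)   # G1..G5
set_b <- paste0("G", 3:8)   # G3..G8

n_both    <- length(intersect(set_a, set_b))                  # in A and B
n_a_only  <- length(setdiff(set_a, set_b))                    # in A only
n_b_only  <- length(setdiff(set_b, set_a))                    # in B only
n_neither <- length(setdiff(universe, union(set_a, set_b)))   # background

# 2x2 table: rows = membership in A, columns = membership in B
# (matrix() fills column by column)
contingency <- matrix(
  c(n_both, n_b_only, n_a_only, n_neither),
  nrow = 2,
  dimnames = list(in_a = c("yes", "no"), in_b = c("yes", "no"))
)

# One-sided Fisher's exact test for over-representation
fisher_res <- fisher.test(contingency, alternative = "greater")

# Sample odds ratio and its log10; note that fisher.test() itself
# reports a conditional maximum likelihood estimate, which differs slightly
sample_or <- (n_both * n_neither) / (n_a_only * n_b_only)
log10(sample_or)
fisher_res$p.value
```

Because the background count depends directly on the universe, the choice of universe changes both the odds ratio and the p-value, which is why it has to be supplied explicitly.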

Example 3: Compare against user-defined and MSigDB gene sets


geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  # Define gene universe (e.g., genes from HPA or your dataset)
  universe = unique(c(
    signature1, signature2,
    unlist(signature_list),
    msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
  )),
  or_threshold = 100, # OR = 100, i.e. log10(OR) = 2
  pval_threshold = 0.05,
  width_text = 50
)$plot

Session Information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] markeR_0.99.3
#> 
#> loaded via a namespace (and not attached):
#>   [1] pROC_1.19.0.1         gridExtra_2.3         rlang_1.1.6          
#>   [4] magrittr_2.0.3        clue_0.3-66           GetoptLong_1.0.5     
#>   [7] msigdbr_25.1.1        matrixStats_1.5.0     compiler_4.5.1       
#>  [10] png_0.1-8             systemfonts_1.2.3     vctrs_0.6.5          
#>  [13] reshape2_1.4.4        stringr_1.5.1         pkgconfig_2.0.3      
#>  [16] shape_1.4.6.1         crayon_1.5.3          fastmap_1.2.0        
#>  [19] backports_1.5.0       labeling_0.4.3        effectsize_1.0.1     
#>  [22] rmarkdown_2.29        ragg_1.4.0            purrr_1.1.0          
#>  [25] xfun_0.53             cachem_1.1.0          jsonlite_2.0.0       
#>  [28] BiocParallel_1.42.1   broom_1.0.9           parallel_4.5.1       
#>  [31] cluster_2.1.8.1       R6_2.6.1              bslib_0.9.0          
#>  [34] stringi_1.8.7         RColorBrewer_1.1-3    limma_3.64.3         
#>  [37] car_3.1-3             jquerylib_0.1.4       Rcpp_1.1.0           
#>  [40] assertthat_0.2.1      iterators_1.0.14      knitr_1.50           
#>  [43] parameters_0.28.0     IRanges_2.42.0        Matrix_1.7-3         
#>  [46] tidyselect_1.2.1      abind_1.4-8           yaml_2.3.10          
#>  [49] doParallel_1.0.17     codetools_0.2-20      curl_7.0.0           
#>  [52] lattice_0.22-7        tibble_3.3.0          plyr_1.8.9           
#>  [55] withr_3.0.2           bayestestR_0.16.1     evaluate_1.0.4       
#>  [58] desc_1.4.3            circlize_0.4.16       pillar_1.11.0        
#>  [61] ggpubr_0.6.1          carData_3.0-5         foreach_1.5.2        
#>  [64] stats4_4.5.1          insight_1.4.0         generics_0.1.4       
#>  [67] S4Vectors_0.46.0      ggplot2_3.5.2         scales_1.4.0         
#>  [70] glue_1.8.0            tools_4.5.1           data.table_1.17.8    
#>  [73] fgsea_1.34.2          locfit_1.5-9.12       ggsignif_0.6.4       
#>  [76] babelgene_22.9        fs_1.6.6              fastmatch_1.1-6      
#>  [79] cowplot_1.2.0         grid_4.5.1            tidyr_1.3.1          
#>  [82] datawizard_1.2.0      edgeR_4.6.3           colorspace_2.1-1     
#>  [85] Formula_1.2-5         cli_3.6.5             textshaping_1.0.1    
#>  [88] ComplexHeatmap_2.24.1 dplyr_1.1.4           gtable_0.3.6         
#>  [91] ggh4x_0.3.1           rstatix_0.7.2         sass_0.4.10          
#>  [94] digest_0.6.37         BiocGenerics_0.54.0   ggrepel_0.9.6        
#>  [97] rjson_0.2.23          htmlwidgets_1.6.4     farver_2.1.2         
#> [100] htmltools_0.5.8.1     pkgdown_2.1.3         lifecycle_1.0.4      
#> [103] GlobalOptions_0.1.2   statmod_1.5.0