Gene Set Similarity Tutorial

Installation

The user can install the development version of markeR from GitHub with:

# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")

Signature Similarity

Even if a user-defined gene signature discriminates strongly between conditions, it may reflect known biological pathways rather than novel mechanisms. To check this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and reference gene sets, quantifying their overlap either as the proportion of shared genes (Jaccard) or as a statistical enrichment (logOR).

Users can compare their signatures to:

- Custom gene sets, defined manually, or
- MSigDB collections, via the msigdbr package.

The function provides options to:

- Filter by Jaccard index threshold, using jaccard_threshold
- Filter by odds ratio and p-value, using or_threshold and pval_threshold
- Limit the number of top-matching reference signatures shown, using num_sigs_toplot

Similarity via Jaccard Index

The Jaccard index measures raw set overlap:

$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
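
As a worked illustration with two small made-up sets (chosen for this example, not taken from the package), let $A = \{\text{TP53}, \text{BRCA1}, \text{MYC}\}$ and $B = \{\text{MYC}, \text{TP53}, \text{EGFR}, \text{CDK2}\}$. They share two genes out of five distinct genes in total, so:

$\text{Jaccard}(A, B) = \frac{|\{\text{TP53}, \text{MYC}\}|}{|\{\text{TP53}, \text{BRCA1}, \text{MYC}, \text{EGFR}, \text{CDK2}\}|} = \frac{2}{5} = 0.4$

The index ranges from 0 (no shared genes) to 1 (identical sets).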

Example 1: Compare against user-defined and MSigDB gene sets

# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")

signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0.05,
  msig_subset = NULL, 
  metric = "jaccard"
)
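
To get a feel for the values involved, the Jaccard indices between the two signatures and the four user-defined sets can be recomputed by hand in base R. This is a minimal sketch: jaccard_index() is a hypothetical helper written for this illustration, not a markeR function, and it ignores the MSigDB sets that geneset_similarity() also compares against. Whether the package applies jaccard_threshold inclusively is an assumption here.

```r
# Hypothetical helper for illustration only; not part of markeR.
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Same gene sets as in Example 1, repeated so the block runs on its own.
signatures <- list(
  Sig1 = c("TP53", "BRCA1", "MYC", "EGFR", "CDK2"),
  Sig2 = c("ATXN2", "FUS", "MTOR", "CASP3")
)
references <- list(
  User_Apoptosis = c("TP53", "CASP3", "BAX"),
  User_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC"),
  User_DNARepair = c("BRCA1", "RAD51", "ATM"),
  User_MTOR      = c("MTOR", "AKT1", "RPS6KB1")
)

# Rows are reference sets, columns are query signatures.
jaccard_matrix <- sapply(signatures, function(sig) {
  sapply(references, function(ref) jaccard_index(sig, ref))
})
round(jaccard_matrix, 3)

# Assumption: pairs at or above the threshold are the ones retained.
passes_threshold <- jaccard_matrix >= 0.05
```

In this toy setting, every pair with at least one shared gene clears a threshold of 0.05, so only the pairs with no overlap at all would be dropped.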

Example 2: Restrict comparison to a custom subset of MSigDB

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0,
  msig_subset = c("KEGG_MTOR_SIGNALING_PATHWAY", "KEGG_APOPTOSIS", "NON_EXISTENT_PATHWAY"), 
  metric = "jaccard"
)

Similarity via Log Odds Ratio

The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.

The odds ratio is derived from a contingency table built from:

- Genes in both sets
- Genes in one set but not the other
- The gene universe, used as background

Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.

Note: When using metric = "odds_ratio", the universe parameter must be supplied.
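
The contingency-table idea can be sketched in base R with stats::fisher.test(). This is an illustration under stated assumptions, not markeR's internal code: overlap_log_or() and the toy gene names are hypothetical. fisher.test() reports a conditional maximum likelihood estimate of the odds ratio, which can differ slightly from the simple cross-product ratio and possibly from the value the package reports. No multiple-testing adjustment is applied here.

```r
# Hypothetical helper for illustration only; not part of markeR.
overlap_log_or <- function(set_a, set_b, universe) {
  universe <- unique(universe)
  # Restrict both sets to the universe so the four cells sum to its size.
  set_a <- intersect(set_a, universe)
  set_b <- intersect(set_b, universe)

  n_both    <- length(intersect(set_a, set_b))
  n_a_only  <- length(setdiff(set_a, set_b))
  n_b_only  <- length(setdiff(set_b, set_a))
  n_neither <- length(universe) - n_both - n_a_only - n_b_only

  # Rows: membership in set A; columns: membership in set B.
  tab <- matrix(
    c(n_both, n_b_only, n_a_only, n_neither),
    nrow = 2,
    dimnames = list(in_a = c("yes", "no"), in_b = c("yes", "no"))
  )

  # One-sided test: is the overlap larger than expected by chance?
  ft <- fisher.test(tab, alternative = "greater")
  list(table = tab, log_or = log(unname(ft$estimate)), p_value = ft$p.value)
}

# Toy data (assumed): a 100-gene universe and two sets sharing 5 genes.
toy_universe <- paste0("G", 1:100)
res <- overlap_log_or(paste0("G", 1:10), paste0("G", 6:20), toy_universe)
res$table
res$log_or
res$p_value
```

The choice of universe matters: a larger background inflates the "neither" cell and makes the same overlap look more enriched, which is presumably why the universe has to be supplied explicitly.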

Example 3: Compare against user-defined and MSigDB gene sets

# Define gene universe (e.g., genes from HPA or your dataset)
gene_universe <- unique(c(
  signature1, signature2,
  unlist(signature_list),
  msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
))

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  universe = gene_universe,
  or_threshold = 1,
  pval_threshold = 0.05,
  width_text = 50
)

Session Information

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] markeR_0.99.2
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.19.0.1         gridExtra_2.3         rlang_1.1.6          
##   [4] magrittr_2.0.3        clue_0.3-66           GetoptLong_1.0.5     
##   [7] msigdbr_25.1.1        matrixStats_1.5.0     compiler_4.5.1       
##  [10] png_0.1-8             systemfonts_1.2.3     vctrs_0.6.5          
##  [13] reshape2_1.4.4        stringr_1.5.1         pkgconfig_2.0.3      
##  [16] shape_1.4.6.1         crayon_1.5.3          fastmap_1.2.0        
##  [19] backports_1.5.0       effectsize_1.0.1      rmarkdown_2.29       
##  [22] ragg_1.4.0            purrr_1.1.0           xfun_0.53            
##  [25] cachem_1.1.0          jsonlite_2.0.0        BiocParallel_1.42.1  
##  [28] broom_1.0.9           parallel_4.5.1        cluster_2.1.8.1      
##  [31] R6_2.6.1              stringi_1.8.7         bslib_0.9.0          
##  [34] RColorBrewer_1.1-3    limma_3.64.3          car_3.1-3            
##  [37] jquerylib_0.1.4       Rcpp_1.1.0            assertthat_0.2.1     
##  [40] iterators_1.0.14      knitr_1.50            parameters_0.28.0    
##  [43] IRanges_2.42.0        Matrix_1.7-3          tidyselect_1.2.1     
##  [46] abind_1.4-8           yaml_2.3.10           doParallel_1.0.17    
##  [49] codetools_0.2-20      curl_7.0.0            lattice_0.22-7       
##  [52] tibble_3.3.0          plyr_1.8.9            withr_3.0.2          
##  [55] bayestestR_0.16.1     evaluate_1.0.4        desc_1.4.3           
##  [58] circlize_0.4.16       pillar_1.11.0         ggpubr_0.6.1         
##  [61] carData_3.0-5         foreach_1.5.2         stats4_4.5.1         
##  [64] insight_1.4.0         generics_0.1.4        S4Vectors_0.46.0     
##  [67] ggplot2_3.5.2         scales_1.4.0          glue_1.8.0           
##  [70] tools_4.5.1           data.table_1.17.8     fgsea_1.34.2         
##  [73] locfit_1.5-9.12       ggsignif_0.6.4        babelgene_22.9       
##  [76] fs_1.6.6              fastmatch_1.1-6       cowplot_1.2.0        
##  [79] grid_4.5.1            tidyr_1.3.1           datawizard_1.2.0     
##  [82] edgeR_4.6.3           colorspace_2.1-1      Formula_1.2-5        
##  [85] cli_3.6.5             textshaping_1.0.1     ComplexHeatmap_2.24.1
##  [88] dplyr_1.1.4           gtable_0.3.6          ggh4x_0.3.1          
##  [91] rstatix_0.7.2         sass_0.4.10           digest_0.6.37        
##  [94] BiocGenerics_0.54.0   ggrepel_0.9.6         rjson_0.2.23         
##  [97] htmlwidgets_1.6.4     farver_2.1.2          htmltools_0.5.8.1    
## [100] pkgdown_2.1.3         lifecycle_1.0.4       GlobalOptions_0.1.2  
## [103] statmod_1.5.0