
Installation

You can install the development version of markeR from GitHub with:

# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")

Signature Similarity

Even if a user-defined gene signature discriminates strongly between conditions, it may reflect known biological pathways rather than novel mechanisms. To check this, the geneset_similarity() function computes pairwise Jaccard indices or log odds ratios (logOR) between user-provided gene signatures and a collection of reference gene sets, quantifying their overlap either as a proportion of shared genes or as a statistical enrichment.

Users can compare their signatures to:

  • Custom gene sets, defined manually, or
  • MSigDB collections, via the msigdbr package.

The function provides options to:

  • Filter by Jaccard index threshold, using jaccard_threshold
  • Filter by odds ratio and p-value, using or_threshold and pval_threshold
  • Limit the number of top-matching reference signatures shown, using num_sigs_toplot

Similarity via Jaccard Index

The Jaccard index measures raw set overlap:

$$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
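
As a quick sanity check of the formula, the sketch below computes Jaccard indices in base R for the toy signatures used in the examples on this page. The helper `jaccard()` is a hypothetical name introduced here for illustration; it is not part of markeR, and geneset_similarity() may compute the index differently internally.

```r
# Hypothetical helper (not part of markeR): Jaccard index of two gene vectors.
# Note: returns NaN if both vectors are empty (0 / 0).
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

signatures <- list(
  Sig1 = c("TP53", "BRCA1", "MYC", "EGFR", "CDK2"),
  Sig2 = c("ATXN2", "FUS", "MTOR", "CASP3")
)

references <- list(
  User_Apoptosis = c("TP53", "CASP3", "BAX"),
  User_CellCycle = c("CDK2", "CDK4", "CCNB1", "MYC"),
  User_DNARepair = c("BRCA1", "RAD51", "ATM"),
  User_MTOR      = c("MTOR", "AKT1", "RPS6KB1")
)

# Rows are signatures, columns are reference gene sets
jaccard_matrix <- sapply(references, function(ref) {
  sapply(signatures, function(sig) jaccard(sig, ref))
})

round(jaccard_matrix, 3)
```

For instance, Sig1 and User_CellCycle share two genes (MYC and CDK2) out of seven distinct genes in their union, giving 2/7 ≈ 0.286, while Sig1 and User_MTOR share none and score 0.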

Example 1: Compare against user-defined and MSigDB gene sets

# Example data
signature1 <- c("TP53", "BRCA1", "MYC", "EGFR", "CDK2")
signature2 <- c("ATXN2", "FUS", "MTOR", "CASP3")

signature_list <- list(
  "User_Apoptosis" = c("TP53", "CASP3", "BAX"),
  "User_CellCycle" = c("CDK2", "CDK4", "CCNB1", "MYC"),
  "User_DNARepair" = c("BRCA1", "RAD51", "ATM"),
  "User_MTOR" = c("MTOR", "AKT1", "RPS6KB1")
)

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0.05,
  msig_subset = NULL, 
  metric = "jaccard"
)

Example 2: Restrict comparison to a custom subset of MSigDB

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = NULL,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY", 
  jaccard_threshold = 0,
  msig_subset = c("KEGG_MTOR_SIGNALING_PATHWAY", "KEGG_APOPTOSIS", "NON_EXISTENT_PATHWAY"), 
  metric = "jaccard"
)

Similarity via Log Odds Ratio

The log odds ratio (logOR) provides a statistically grounded alternative for assessing gene set similarity. It measures enrichment of one set within another, relative to a defined background or gene universe, using a 2×2 contingency table and a one-sided Fisher’s exact test.

The contingency table is built from:

  • Genes in both sets
  • Genes in one set but not the other
  • The remaining genes of the universe, which serve as background

Log-transformed odds ratios are visualized; statistical significance is assessed via the adjusted p-value.

Note: When using metric = "odds_ratio", the universe parameter must be supplied.
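
To make the contingency-table idea concrete, here is a minimal base R sketch on a toy universe. All gene identifiers (`g1` to `g20`) and variable names are hypothetical, and this illustrates the general technique rather than markeR's internal implementation, which may handle zero counts or p-value adjustment differently.

```r
# Toy data: hypothetical gene identifiers, for illustration only
universe <- paste0("g", 1:20)
set_a    <- paste0("g", 1:6)
set_b    <- paste0("g", 4:10)

# Membership of every universe gene in each set
in_a <- universe %in% set_a
in_b <- universe %in% set_b

# 2x2 table with TRUE first, so cell [1, 1] counts genes in both sets
contingency <- table(
  in_A = factor(in_a, levels = c(TRUE, FALSE)),
  in_B = factor(in_b, levels = c(TRUE, FALSE))
)

# Sample odds ratio: (both * neither) / (A only * B only)
both    <- as.numeric(contingency["TRUE", "TRUE"])
a_only  <- as.numeric(contingency["TRUE", "FALSE"])
b_only  <- as.numeric(contingency["FALSE", "TRUE"])
neither <- as.numeric(contingency["FALSE", "FALSE"])

sample_or <- (both * neither) / (a_only * b_only)
log_or    <- log(sample_or)

# One-sided Fisher's exact test for enrichment (odds ratio > 1)
test <- fisher.test(contingency, alternative = "greater")

c(odds_ratio = sample_or, log_odds_ratio = log_or, p_value = test$p.value)
```

Here the counts are 3 (both), 3 (A only), 4 (B only), and 10 (neither), so the sample odds ratio is 30/12 = 2.5. The `estimate` returned by `fisher.test()` is a conditional maximum likelihood estimate and will generally differ slightly from this sample value. When several gene set pairs are tested, the raw p-values would typically be corrected with `p.adjust()`; which correction method applies is left open here.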

Example 3: Compare against user-defined and MSigDB gene sets

# Define gene universe (e.g., genes from HPA or your dataset)
gene_universe <- unique(c(
  signature1, signature2,
  unlist(signature_list),
  # `collection` replaces the `category` argument deprecated in msigdbr 10.0.0
  msigdbr::msigdbr(species = "Homo sapiens", collection = "C2")$gene_symbol
))

geneset_similarity(
  signatures = list(Sig1 = signature1, Sig2 = signature2),
  other_user_signatures = signature_list,
  collection = "C2",
  subcollection = "CP:KEGG_LEGACY",
  metric = "odds_ratio",
  universe = gene_universe,
  or_threshold = 1,
  pval_threshold = 0.05,
  width_text = 50
)

Session Information

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] markeR_0.99.0
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.18.5           gridExtra_2.3         rlang_1.1.6          
##   [4] magrittr_2.0.3        clue_0.3-66           GetoptLong_1.0.5     
##   [7] msigdbr_24.1.0        matrixStats_1.5.0     compiler_4.5.1       
##  [10] png_0.1-8             systemfonts_1.2.3     vctrs_0.6.5          
##  [13] reshape2_1.4.4        stringr_1.5.1         pkgconfig_2.0.3      
##  [16] shape_1.4.6.1         crayon_1.5.3          fastmap_1.2.0        
##  [19] backports_1.5.0       labeling_0.4.3        effectsize_1.0.1     
##  [22] rmarkdown_2.29        ragg_1.4.0            purrr_1.0.4          
##  [25] xfun_0.52             cachem_1.1.0          jsonlite_2.0.0       
##  [28] BiocParallel_1.42.1   broom_1.0.8           parallel_4.5.1       
##  [31] cluster_2.1.8.1       R6_2.6.1              stringi_1.8.7        
##  [34] bslib_0.9.0           RColorBrewer_1.1-3    limma_3.64.1         
##  [37] car_3.1-3             jquerylib_0.1.4       Rcpp_1.0.14          
##  [40] assertthat_0.2.1      iterators_1.0.14      knitr_1.50           
##  [43] parameters_0.26.0     IRanges_2.42.0        Matrix_1.7-3         
##  [46] tidyselect_1.2.1      abind_1.4-8           yaml_2.3.10          
##  [49] doParallel_1.0.17     codetools_0.2-20      curl_6.4.0           
##  [52] lattice_0.22-7        tibble_3.3.0          plyr_1.8.9           
##  [55] withr_3.0.2           bayestestR_0.16.0     evaluate_1.0.4       
##  [58] desc_1.4.3            circlize_0.4.16       pillar_1.10.2        
##  [61] ggpubr_0.6.1          carData_3.0-5         foreach_1.5.2        
##  [64] stats4_4.5.1          insight_1.3.0         generics_0.1.4       
##  [67] S4Vectors_0.46.0      ggplot2_3.5.2         scales_1.4.0         
##  [70] glue_1.8.0            tools_4.5.1           data.table_1.17.6    
##  [73] fgsea_1.34.0          locfit_1.5-9.12       ggsignif_0.6.4       
##  [76] babelgene_22.9        fs_1.6.6              fastmatch_1.1-6      
##  [79] cowplot_1.1.3         grid_4.5.1            tidyr_1.3.1          
##  [82] datawizard_1.1.0      edgeR_4.6.2           colorspace_2.1-1     
##  [85] Formula_1.2-5         cli_3.6.5             textshaping_1.0.1    
##  [88] ComplexHeatmap_2.24.1 dplyr_1.1.4           gtable_0.3.6         
##  [91] ggh4x_0.3.1           rstatix_0.7.2         sass_0.4.10          
##  [94] digest_0.6.37         BiocGenerics_0.54.0   ggrepel_0.9.6        
##  [97] rjson_0.2.23          htmlwidgets_1.6.4     farver_2.1.2         
## [100] htmltools_0.5.8.1     pkgdown_2.1.3         lifecycle_1.0.4      
## [103] GlobalOptions_0.1.2   statmod_1.5.0