Discovery Mode Tutorial • markeR

This vignette provides a comprehensive introduction to the markeR package, focusing on its Discovery Mode. The discovery mode was designed for users who are interested in quantifying a known, robust gene set in a given dataset to explore associations with other phenotypic or clinical variables. This approach is particularly suited to hypothesis generation where the phenotype marked by the gene set is of known biological or clinical relevance.

Installation

The user can install the development version of markeR from GitHub with:

# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")

Case-study: Senescence

We will be using an already pre-processed gene expression dataset, derived from the Marthandan et al. (2016) study (GSE63577), that includes human fibroblast samples cultured under two different conditions: replicative senescence and proliferative control. The dataset has already been filtered and normalized using the edgeR package. For more information about the dataset structure, see the help pages for ?counts_example and ?metadata_example.

This dataset serves as a working example to demonstrate the main functionalities of the markeR package. In particular, it will be used to showcase the two primary modules designed for benchmarking gene signatures:

Score: calculates expression-based signature scores for each sample, and
Enrichment: evaluates the over-representation of gene signatures within ranked gene lists.

To illustrate the usage of markeR, we use the HernandezSegura gene set: A transcriptomic gene set identified by Hernandez-Segura et al. (2017) as consistently altered across multiple senescence models. This set includes information on the directionality of gene regulation. It has shown strong performance in classification and enrichment analyses, including in the original paper of markeR, and also in the Tutorial on the Benchmarking Mode of markeR.

library(markeR)

# Load example gene sets
data("genesets_example")
HernandezSegura_GeneSet <- list(HernandezSegura=genesets_example$HernandezSegura)
head(HernandezSegura_GeneSet)

## $HernandezSegura
##        gene enrichment
## 1    ACADVL          1
## 2     ADPGK          1
## 3  ARHGAP35         -1
## 4     ARID2         -1
## 5     ASCC1         -1
## 6   B4GALT7          1
## 7    BCL2L2          1
## 8     C2CD5         -1
## 9     CCND1          1
## 10    CHMP5          1
## 11    CNTLN         -1
## 12   CREBBP         -1
## 13     DDA1          1
## 14     DGKA          1
## 15   DYNLT3          1
## 16    EFNB3         -1
## 17  FAM214B          1
## 18     GBE1          1
## 19     GDNF          1
## 20    GSTM4         -1
## 21     ICE1         -1
## 22    KCTD3         -1
## 23     KLC1          1
## 24    MEIS1         -1
## 25   MT-CYB          1
## 26     NFIA         -1
## 27     NOL3          1
## 28    P4HA2          1
## 29    PATZ1         -1
## 30    PCIF1         -1
## 31   PDLIM4          1
## 32    PDS5B         -1
## 33     PLK3          1
## 34   PLXNA3          1
## 35   POFUT2          1
## 36    RAI14          1
## 37    RHNO1         -1
## 38     SCOC          1
## 39  SLC10A3          1
## 40  SLC16A3          1
## 41      SMO         -1
## 42   SPATA6         -1
## 43    SPIN4         -1
## 44    STAG1         -1
## 45    SUSD6          1
## 46    TAF13          1
## 47  TMEM87B          1
## 48   TOLLIP          1
## 49   TRDMT1         -1
## 50  TSPAN13          1
## 51     UFM1          1
## 52   USP6NL         -1
## 53   ZBTB7A          1
## 54    ZC3H4         -1
## 55   ZNHIT1          1

data(counts_example)
# Load example data
counts_example[1:5,1:5]

##          SRR1660534 SRR1660535 SRR1660536 SRR1660537 SRR1660538
## A1BG        9.94566   9.476768   8.229231   8.515083   7.806479
## A1BG-AS1   12.08655  11.550303  12.283976   7.580694   7.312666
## A2M        77.50289  56.612839  58.860268   8.997624   6.981857
## A4GALT     14.74183  15.226083  14.815891  14.675780  15.222488
## AAAS       47.92755  46.292377  43.965972  47.109493  47.213739

For illustration purposes of different variable types, let’s imagine we also had two additional variables: one indicating the number of days between sample preparation and sequencing (DaysToSequencing), and another identifying the person who processed each sample (researcher). These variables are hypothetical and not part of the original study design.

data(metadata_example)

set.seed("123456")

metadata_example$Researcher <- sample(c("John","Ana","Francisca"),39, replace = TRUE)
metadata_example$DaysToSequencing <- sample(c(1:20),39, replace = TRUE)
 
head(metadata_example)

##       sampleID      DatasetID   CellType     Condition       SenescentType
## 252 SRR1660534 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 253 SRR1660535 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 254 SRR1660536 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 255 SRR1660537 Marthandan2016 Fibroblast Proliferative                none
## 256 SRR1660538 Marthandan2016 Fibroblast Proliferative                none
## 257 SRR1660539 Marthandan2016 Fibroblast Proliferative                none
##                         Treatment Researcher DaysToSequencing
## 252 PD72 (Replicative senescence)        Ana                6
## 253 PD72 (Replicative senescence)        Ana               18
## 254 PD72 (Replicative senescence)       John               19
## 255                         young        Ana                2
## 256                         young  Francisca                9
## 257                         young       John               10

Score-Based approaches

The Score module in markeR quantifies the association between a gene signature and phenotypic variables by calculating a score for each sample based on the expression of genes in the signature. This score can then be correlated with other variables, such as clinical or experimental conditions:

Quantifies associations between phenotype variables and a gene signature score using Cohen’s effect sizes and p-values.
Visualizes results through lollipop plots, contrast plots, and distribution plots.

This is useful for identifying:

Biological relationships (e.g., phenotype-score associations)
Technical confounders (e.g., batch effects)

Outputs

The main function returns a structured list with:

overall: Effect sizes (Cohen’s f) and p-values for each variable.
contrasts: For categorical variables, pairwise or grouped comparisons using Cohen’s d with BH-adjusted p-values.
plot: A combined visualization showing:
1. Lollipop plot of effect sizes (Cohen’s f)
2. Distribution plots of the score by variable (density or scatter)
3. Lollipop plots of contrasts (Cohen’s d) for categorical variables, if applicable
plot_overall, plot_contrasts, plot_distributions: Individual components of the combined plot.

Controlling Contrast Resolution

The mode parameter determines the level of contrast detail for categorical variables:

"simple": All pairwise contrasts between group levels (e.g., A vs B, A vs C, B vs C).
"medium": One group vs. the union of all others (e.g., A vs B+C+D).
"extensive": All algebraic combinations of group levels (e.g., A+B vs C+D).

Results

For this example, we use:

The HernandezSegura gene signature
The logmedian scoring method
mode = "extensive" for thorough contrast analysis

We also include two synthetic phenotypic variables:

Researcher — a categorical variable representing who processed each sample
DaysToSequencing — a numeric variable indicating time between preparation and sequencing

Though artificial, these mimic potential technical covariates. Strong associations between these and the score could indicate batch effects, where technical variation may confound biological interpretation.

results_scoreassoc_bidirect <- VariableAssociation(data = counts_example, 
                          metadata = metadata_example, 
                          method = "logmedian",
                          cols = c("Condition","Researcher","DaysToSequencing"),  
                          gene_set = HernandezSegura_GeneSet,
                          mode="extensive",
                          nonsignif_color = "white", signif_color = "red", saturation_value=NULL,sig_threshold = 0.05,
                          widthlabels=30, labsize=10, titlesize=14, pointSize=5, discrete_colors=NULL,
                          continuous_color = "#8C6D03", color_palette = "Set2")

results_scoreassoc_bidirect$Overall

##           Variable    Cohen_f      P_Value
## 1        Condition 1.19407625 1.267423e-08
## 2       Researcher 0.12816159 7.458318e-01
## 3 DaysToSequencing 0.00792836 9.617953e-01

results_scoreassoc_bidirect$Contrasts

##     Variable                  Contrast          Group1           Group2
## 1  Condition Proliferative - Senescent   Proliferative        Senescent
## 2 Researcher                Ana - John             Ana             John
## 3 Researcher           Ana - Francisca             Ana        Francisca
## 4 Researcher          Francisca - John       Francisca             John
## 5 Researcher  Ana - (Francisca + John)             Ana Francisca + John
## 6 Researcher  (Ana + Francisca) - John Ana + Francisca             John
## 7 Researcher  (Ana + John) - Francisca      Ana + John        Francisca
##        CohenD       PValue         padj
## 1 -2.33302465 1.267423e-08 8.871962e-08
## 2 -0.05584444 9.021669e-01 9.021669e-01
## 3  0.24731622 4.904610e-01 8.034391e-01
## 4 -0.27989036 5.477802e-01 8.034391e-01
## 5  0.14113793 6.646037e-01 8.034391e-01
## 6 -0.16850285 6.886621e-01 8.034391e-01
## 7  0.25285837 4.472210e-01 8.034391e-01

Enrichment-based approaches

The GSEA_VariableAssociation() function evaluates how phenotypic variables are associated with gene set activity, using enrichment scores derived from gene expression statistics (B- or t-statistics). This allows users to understand whether a gene set is enriched or depleted in relation to different sample attributes.

Outputs

The GSEA_VariableAssociation() function returns a list with two elements:

data: A tidy data frame of GSEA results. For each variable contrast, this includes:
- Contrast: The comparison performed (e.g., A - B, A - (B+C))
- Statistic: The metric used for gene ranking (either t or B)
- NES: Normalized Enrichment Score
- Adjusted p-value: Multiple-testing corrected p-value (e.g., Benjamini–Hochberg)
- Gene Set Name: The gene set being tested
plot: A ggplot2 object showing the NES and significance of each contrast as a lollipop plot:
- Point position reflects NES (directional enrichment)
- Point color reflects adjusted p-value significance
- Dashed lines represent contrasts with negative NES under the B statistic, indicating likely non-alteration

Output Interpretation

Depending on the statistic used (t or B), the interpretation of results varies:

Negative NES (Normalized Enrichment Score):
- t statistic: A negative NES implies the gene set is depleted in over-expressed genes — i.e., genes in the set are under-expressed relative to others.
- B statistic: A negative NES suggests the gene set is less altered than most other genes — indicating a lack of strong expression change.
Dashed Lines in the Plot:
- Dashed lines denote contrasts with negative NES using the B statistic, indicating gene sets that are putatively not altered.
Subtitle Differences in the Plot:
- When using the B statistic, the plot subtitle is “Altered Contrasts”.
- When using the t statistic, the subtitle becomes “Enriched/Depleted Contrasts”.

Example: Exploring Gene Set Enrichment by Variable

The following code evaluates how three phenotypic variables — Condition, Researcher, and DaysToSequencing — are associated with the HernandezSegura gene set:

VariableAssociation(
  data = counts_example,
  metadata = metadata_example,
  method = "GSEA",
  cols = c("Condition", "Researcher", "DaysToSequencing"),
  mode = "simple",
  gene_set = HernandezSegura_GeneSet,
  saturation_value = NULL,
  nonsignif_color = "white",
  signif_color = "red",
  sig_threshold = 0.05,
  widthlabels = 30,
  labsize = 10,
  titlesize = 14,
  pointSize = 5
)

## $plot

## 
## $data
##            pathway         pval         padj    log2err         ES        NES
##             <char>        <num>        <num>      <num>      <num>      <num>
## 1: HernandezSegura 7.420457e-10 3.710228e-09 0.80121557 -0.6263553 -2.6486833
## 2: HernandezSegura 9.994967e-01 9.994967e-01 0.01188457  0.1127391  0.4471673
## 3: HernandezSegura 1.644800e-04 4.111999e-04 0.51884808  0.4544704  2.0326219
## 4: HernandezSegura 2.632307e-04 4.387178e-04 0.49849311 -0.4726426 -1.9550716
## 5: HernandezSegura 9.636917e-01 9.994967e-01 0.01516609  0.1677355  0.6709891
##     size  leadingEdge stat_used                  Contrast
##    <int>       <list>    <char>                    <char>
## 1:    52 CNTLN, D....         t Proliferative - Senescent
## 2:    52 P4HA2, S....         t                Ana - John
## 3:    52 NOL3, PL....         t           Ana - Francisca
## 4:    52 MEIS1, S....         t          Francisca - John
## 5:    52 BCL2L2, ....         t          DaysToSequencing

Session Information

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] markeR_0.99.2
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.18.5           gridExtra_2.3         rlang_1.1.6          
##   [4] magrittr_2.0.3        clue_0.3-66           GetoptLong_1.0.5     
##   [7] msigdbr_25.1.1        matrixStats_1.5.0     compiler_4.5.1       
##  [10] mgcv_1.9-3            png_0.1-8             systemfonts_1.2.3    
##  [13] vctrs_0.6.5           reshape2_1.4.4        stringr_1.5.1        
##  [16] pkgconfig_2.0.3       shape_1.4.6.1         crayon_1.5.3         
##  [19] fastmap_1.2.0         backports_1.5.0       labeling_0.4.3       
##  [22] effectsize_1.0.1      rmarkdown_2.29        ragg_1.4.0           
##  [25] purrr_1.1.0           xfun_0.52             cachem_1.1.0         
##  [28] jsonlite_2.0.0        BiocParallel_1.42.1   broom_1.0.8          
##  [31] parallel_4.5.1        cluster_2.1.8.1       R6_2.6.1             
##  [34] stringi_1.8.7         bslib_0.9.0           RColorBrewer_1.1-3   
##  [37] limma_3.64.1          car_3.1-3             jquerylib_0.1.4      
##  [40] Rcpp_1.1.0            assertthat_0.2.1      iterators_1.0.14     
##  [43] knitr_1.50            parameters_0.27.0     IRanges_2.42.0       
##  [46] splines_4.5.1         Matrix_1.7-3          tidyselect_1.2.1     
##  [49] abind_1.4-8           yaml_2.3.10           doParallel_1.0.17    
##  [52] codetools_0.2-20      curl_6.4.0            lattice_0.22-7       
##  [55] tibble_3.3.0          plyr_1.8.9            withr_3.0.2          
##  [58] bayestestR_0.16.1     evaluate_1.0.4        desc_1.4.3           
##  [61] circlize_0.4.16       pillar_1.11.0         ggpubr_0.6.1         
##  [64] carData_3.0-5         foreach_1.5.2         stats4_4.5.1         
##  [67] insight_1.3.1         generics_0.1.4        S4Vectors_0.46.0     
##  [70] ggplot2_3.5.2         scales_1.4.0          glue_1.8.0           
##  [73] tools_4.5.1           data.table_1.17.8     fgsea_1.34.2         
##  [76] locfit_1.5-9.12       ggsignif_0.6.4        babelgene_22.9       
##  [79] fs_1.6.6              fastmatch_1.1-6       cowplot_1.2.0        
##  [82] grid_4.5.1            tidyr_1.3.1           datawizard_1.2.0     
##  [85] edgeR_4.6.3           colorspace_2.1-1      nlme_3.1-168         
##  [88] Formula_1.2-5         cli_3.6.5             textshaping_1.0.1    
##  [91] ComplexHeatmap_2.24.1 dplyr_1.1.4           gtable_0.3.6         
##  [94] ggh4x_0.3.1           rstatix_0.7.2         sass_0.4.10          
##  [97] digest_0.6.37         BiocGenerics_0.54.0   ggrepel_0.9.6        
## [100] rjson_0.2.23          htmlwidgets_1.6.4     farver_2.1.2         
## [103] htmltools_0.5.8.1     pkgdown_2.1.3         lifecycle_1.0.4      
## [106] GlobalOptions_0.1.2   statmod_1.5.0