This vignette provides a comprehensive introduction to the markeR package, focusing on its Discovery Mode. The discovery mode was designed for users who are interested in quantifying a known, robust gene set in a given dataset to explore associations with other phenotypic or clinical variables. This approach is particularly suited to hypothesis generation where the phenotype marked by the gene set is of known biological or clinical relevance.
Installation
The user can install the development version of markeR from GitHub with:
# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")
Case-study: Senescence
We will be using an already pre-processed gene expression dataset,
derived from the Marthandan et al. (2016) study (GSE63577), that
includes human fibroblast samples cultured under two different
conditions: replicative senescence and proliferative control. The
dataset has already been filtered and normalized using the
edgeR
package. For more information about the dataset
structure, see the help pages for ?counts_example
and
?metadata_example
.
This dataset serves as a working example to demonstrate the main functionalities of the markeR package. In particular, it will be used to showcase the two primary modules designed for benchmarking gene signatures:
- Score: calculates expression-based signature scores for each sample, and
- Enrichment: evaluates the over-representation of gene signatures within ranked gene lists.
To illustrate the usage of markeR
, we use the
HernandezSegura gene set: A transcriptomic gene set
identified by Hernandez-Segura et al. (2017) as consistently altered
across multiple senescence models. This set includes information on the
directionality of gene regulation. It has shown strong performance in
classification and enrichment analyses, including in the original paper
of markeR, and also in the Tutorial on the Benchmarking Mode of
markeR
.
# Load example gene sets
data("genesets_example")
HernandezSegura_GeneSet <- list(HernandezSegura=genesets_example$HernandezSegura)
print(HernandezSegura_GeneSet)
## $HernandezSegura
## gene enrichment
## 1 ACADVL 1
## 2 ADPGK 1
## 3 ARHGAP35 -1
## 4 ARID2 -1
## 5 ASCC1 -1
## 6 B4GALT7 1
## 7 BCL2L2 1
## 8 C2CD5 -1
## 9 CCND1 1
## 10 CHMP5 1
## 11 CNTLN -1
## 12 CREBBP -1
## 13 DDA1 1
## 14 DGKA 1
## 15 DYNLT3 1
## 16 EFNB3 -1
## 17 FAM214B 1
## 18 GBE1 1
## 19 GDNF 1
## 20 GSTM4 -1
## 21 ICE1 -1
## 22 KCTD3 -1
## 23 KLC1 1
## 24 MEIS1 -1
## 25 MT-CYB 1
## 26 NFIA -1
## 27 NOL3 1
## 28 P4HA2 1
## 29 PATZ1 -1
## 30 PCIF1 -1
## 31 PDLIM4 1
## 32 PDS5B -1
## 33 PLK3 1
## 34 PLXNA3 1
## 35 POFUT2 1
## 36 RAI14 1
## 37 RHNO1 -1
## 38 SCOC 1
## 39 SLC10A3 1
## 40 SLC16A3 1
## 41 SMO -1
## 42 SPATA6 -1
## 43 SPIN4 -1
## 44 STAG1 -1
## 45 SUSD6 1
## 46 TAF13 1
## 47 TMEM87B 1
## 48 TOLLIP 1
## 49 TRDMT1 -1
## 50 TSPAN13 1
## 51 UFM1 1
## 52 USP6NL -1
## 53 ZBTB7A 1
## 54 ZC3H4 -1
## 55 ZNHIT1 1
data(counts_example)
# Load example data
counts_example[1:5,1:5]
## SRR1660534 SRR1660535 SRR1660536 SRR1660537 SRR1660538
## A1BG 9.94566 9.476768 8.229231 8.515083 7.806479
## A1BG-AS1 12.08655 11.550303 12.283976 7.580694 7.312666
## A2M 77.50289 56.612839 58.860268 8.997624 6.981857
## A4GALT 14.74183 15.226083 14.815891 14.675780 15.222488
## AAAS 47.92755 46.292377 43.965972 47.109493 47.213739
For illustration purposes of different variable types, let’s imagine
we also had two additional variables: one indicating the number of days
between sample preparation and sequencing
(DaysToSequencing
), and another identifying the person who
processed each sample (researcher
). These variables are
hypothetical and not part of the original study design.
data(metadata_example)
set.seed("123456")
metadata_example$Researcher <- sample(c("John","Ana","Francisca"),39, replace = TRUE)
metadata_example$DaysToSequencing <- sample(c(1:20),39, replace = TRUE)
head(metadata_example)
## sampleID DatasetID CellType Condition SenescentType
## 252 SRR1660534 Marthandan2016 Fibroblast Senescent Telomere shortening
## 253 SRR1660535 Marthandan2016 Fibroblast Senescent Telomere shortening
## 254 SRR1660536 Marthandan2016 Fibroblast Senescent Telomere shortening
## 255 SRR1660537 Marthandan2016 Fibroblast Proliferative none
## 256 SRR1660538 Marthandan2016 Fibroblast Proliferative none
## 257 SRR1660539 Marthandan2016 Fibroblast Proliferative none
## Treatment Researcher DaysToSequencing
## 252 PD72 (Replicative senescence) Ana 6
## 253 PD72 (Replicative senescence) Ana 18
## 254 PD72 (Replicative senescence) John 19
## 255 young Ana 2
## 256 young Francisca 9
## 257 young John 10
Score-Based approaches
The Score module in markeR
quantifies
the association between a gene signature and phenotypic variables by
calculating a score for each sample based on the expression of genes in
the signature. This score can then be correlated with other variables,
such as clinical or experimental conditions:
- Quantifies associations between phenotype variables and a gene signature score using Cohen’s effect sizes and p-values.
- Visualizes results through lollipop plots, contrast plots, and distribution plots.
This is useful for identifying:
- Biological relationships (e.g., phenotype-score associations)
- Technical confounders (e.g., batch effects)
Outputs
The main function returns a structured list with:
-
overall
: Effect sizes (Cohen’s f) and p-values for each variable. -
contrasts
: For categorical variables, pairwise or grouped comparisons using Cohen’s d with BH-adjusted p-values. -
plot
: A combined visualization showing:- Lollipop plot of effect sizes (Cohen’s f)
- Distribution plots of the score by variable (density or scatter)
- Lollipop plots of contrasts (Cohen’s d) for categorical variables, if applicable
-
plot_overall
,plot_contrasts
,plot_distributions
: Individual components of the combined plot.
Controlling Contrast Resolution
The mode
parameter determines the level of contrast
detail for categorical variables:
-
"simple"
: All pairwise contrasts between group levels (e.g., A vs B, A vs C, B vs C). -
"medium"
: One group vs. the union of all others (e.g., A vs B+C+D). -
"extensive"
: All algebraic combinations of group levels (e.g., A+B vs C+D).
Results
For this example, we use:
- The HernandezSegura gene signature
- The
logmedian
scoring method -
mode = "extensive"
for thorough contrast analysis
We also include two synthetic phenotypic variables:
-
Researcher
— a categorical variable representing who processed each sample -
DaysToSequencing
— a numeric variable indicating time between preparation and sequencing
Though artificial, these mimic potential technical covariates. Strong associations between these and the score could indicate batch effects, where technical variation may confound biological interpretation.
results_scoreassoc_bidirect <- VariableAssociation(data = counts_example,
metadata = metadata_example,
method = "logmedian",
cols = c("Condition","Researcher","DaysToSequencing"),
gene_set = HernandezSegura_GeneSet,
mode="extensive",
nonsignif_color = "white", signif_color = "red", saturation_value=NULL,sig_threshold = 0.05,
widthlabels=30, labsize=10, titlesize=14, pointSize=5, discrete_colors=NULL,
continuous_color = "#8C6D03", color_palette = "Set2")
results_scoreassoc_bidirect$Overall
## Variable Cohen_f P_Value
## 1 Condition 1.19407625 1.267423e-08
## 2 Researcher 0.12816159 7.458318e-01
## 3 DaysToSequencing 0.00792836 9.617953e-01
results_scoreassoc_bidirect$Contrasts
## Variable Contrast Group1 Group2
## 1 Condition Proliferative - Senescent Proliferative Senescent
## 2 Researcher Ana - John Ana John
## 3 Researcher Ana - Francisca Ana Francisca
## 4 Researcher Francisca - John Francisca John
## 5 Researcher Ana - (Francisca + John) Ana Francisca + John
## 6 Researcher (Ana + Francisca) - John Ana + Francisca John
## 7 Researcher (Ana + John) - Francisca Ana + John Francisca
## CohenD PValue padj
## 1 -2.33302465 1.267423e-08 8.871962e-08
## 2 -0.05584444 9.021669e-01 9.021669e-01
## 3 0.24731622 4.904610e-01 8.034391e-01
## 4 -0.27989036 5.477802e-01 8.034391e-01
## 5 0.14113793 6.646037e-01 8.034391e-01
## 6 -0.16850285 6.886621e-01 8.034391e-01
## 7 0.25285837 4.472210e-01 8.034391e-01
Enrichment-based approaches
The GSEA_VariableAssociation()
function evaluates how
phenotypic variables are associated with gene set
activity, using enrichment scores derived from gene expression
statistics (B- or t-statistics). This allows users to understand whether
a gene set is enriched or depleted in relation to different sample
attributes.
Outputs
The GSEA_VariableAssociation()
function returns a list
with two elements:
-
data
: A tidy data frame of GSEA results. For each variable contrast, this includes:- Contrast: The comparison performed (e.g., A - B, A - (B+C))
- Statistic: The metric used for gene ranking (either t or B)
- NES: Normalized Enrichment Score
- Adjusted p-value: Multiple-testing corrected p-value (e.g., Benjamini–Hochberg)
- Gene Set Name: The gene set being tested
-
plot
: Aggplot2
object showing the NES and significance of each contrast as a lollipop plot:- Point position reflects NES (directional enrichment)
- Point color reflects adjusted p-value significance
- Dashed lines represent contrasts with negative NES under the B statistic, indicating likely non-alteration
Output Interpretation
Depending on the statistic used (t
or B
),
the interpretation of results varies:
-
Negative NES (Normalized Enrichment Score):
- t statistic: A negative NES implies the gene set is depleted in over-expressed genes — i.e., genes in the set are under-expressed relative to others.
- B statistic: A negative NES suggests the gene set is less altered than most other genes — indicating a lack of strong expression change.
-
Dashed Lines in the Plot:
- Dashed lines denote contrasts with negative NES using the B statistic, indicating gene sets that are putatively not altered.
-
Subtitle Differences in the Plot:
- When using the B statistic, the plot subtitle is “Altered Contrasts”.
- When using the t statistic, the subtitle becomes “Enriched/Depleted Contrasts”.
Example: Exploring Gene Set Enrichment by Variable
The following code evaluates how three phenotypic variables —
Condition
, Researcher
, and
DaysToSequencing
— are associated with the
HernandezSegura gene set:
VariableAssociation(
data = counts_example,
metadata = metadata_example,
method = "GSEA",
cols = c("Condition", "Researcher", "DaysToSequencing"),
mode = "simple",
gene_set = HernandezSegura_GeneSet,
saturation_value = NULL,
nonsignif_color = "white",
signif_color = "red",
sig_threshold = 0.05,
widthlabels = 30,
labsize = 10,
titlesize = 14,
pointSize = 5
)
## $plot
##
## $data
## pathway pval padj log2err ES NES
## <char> <num> <num> <num> <num> <num>
## 1: HernandezSegura 7.502274e-10 3.751137e-09 0.80121557 -0.6376777 -2.6601988
## 2: HernandezSegura 1.000000e+00 1.000000e+00 0.01186731 0.1098945 0.4337543
## 3: HernandezSegura 3.225779e-04 8.064448e-04 0.49849311 0.4485591 1.9585999
## 4: HernandezSegura 4.936314e-04 8.227190e-04 0.47727082 -0.4631808 -1.8931036
## 5: HernandezSegura 9.604021e-01 1.000000e+00 0.01538060 0.1689494 0.6775488
## size leadingEdge stat_used Contrast
## <int> <list> <char> <char>
## 1: 52 CNTLN, D.... t Proliferative - Senescent
## 2: 52 P4HA2, S.... t Ana - John
## 3: 52 NOL3, PL.... t Ana - Francisca
## 4: 52 MEIS1, S.... t Francisca - John
## 5: 52 BCL2L2, .... t DaysToSequencing
Session Information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] markeR_0.99.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 gridExtra_2.3 rlang_1.1.6
## [4] magrittr_2.0.3 clue_0.3-66 GetoptLong_1.0.5
## [7] msigdbr_24.1.0 matrixStats_1.5.0 compiler_4.5.1
## [10] mgcv_1.9-3 png_0.1-8 systemfonts_1.2.3
## [13] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
## [16] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3
## [19] fastmap_1.2.0 backports_1.5.0 labeling_0.4.3
## [22] effectsize_1.0.1 rmarkdown_2.29 ragg_1.4.0
## [25] purrr_1.0.4 xfun_0.52 cachem_1.1.0
## [28] jsonlite_2.0.0 BiocParallel_1.42.1 broom_1.0.8
## [31] parallel_4.5.1 cluster_2.1.8.1 R6_2.6.1
## [34] stringi_1.8.7 bslib_0.9.0 RColorBrewer_1.1-3
## [37] limma_3.64.1 car_3.1-3 jquerylib_0.1.4
## [40] Rcpp_1.0.14 assertthat_0.2.1 iterators_1.0.14
## [43] knitr_1.50 parameters_0.26.0 IRanges_2.42.0
## [46] splines_4.5.1 Matrix_1.7-3 tidyselect_1.2.1
## [49] abind_1.4-8 yaml_2.3.10 doParallel_1.0.17
## [52] codetools_0.2-20 curl_6.4.0 lattice_0.22-7
## [55] tibble_3.3.0 plyr_1.8.9 withr_3.0.2
## [58] bayestestR_0.16.0 evaluate_1.0.4 desc_1.4.3
## [61] circlize_0.4.16 pillar_1.10.2 ggpubr_0.6.1
## [64] carData_3.0-5 foreach_1.5.2 stats4_4.5.1
## [67] insight_1.3.0 generics_0.1.4 S4Vectors_0.46.0
## [70] ggplot2_3.5.2 scales_1.4.0 glue_1.8.0
## [73] tools_4.5.1 data.table_1.17.6 fgsea_1.34.0
## [76] locfit_1.5-9.12 ggsignif_0.6.4 babelgene_22.9
## [79] fs_1.6.6 fastmatch_1.1-6 cowplot_1.1.3
## [82] grid_4.5.1 tidyr_1.3.1 datawizard_1.1.0
## [85] edgeR_4.6.2 colorspace_2.1-1 nlme_3.1-168
## [88] Formula_1.2-5 cli_3.6.5 textshaping_1.0.1
## [91] ComplexHeatmap_2.24.1 dplyr_1.1.4 gtable_0.3.6
## [94] ggh4x_0.3.1 rstatix_0.7.2 sass_0.4.10
## [97] digest_0.6.37 BiocGenerics_0.54.0 ggrepel_0.9.6
## [100] rjson_0.2.23 htmlwidgets_1.6.4 farver_2.1.2
## [103] htmltools_0.5.8.1 pkgdown_2.1.3 lifecycle_1.0.4
## [106] GlobalOptions_0.1.2 statmod_1.5.0