Skip to contents

This vignette provides a comprehensive introduction to the markeR package, focusing on its Benchmarking Mode. This mode is designed to evaluate the performance of gene signatures in quantifying specific biological states or phenotypes, such as disease states or cellular conditions. It allows users to assess the robustness and reliability of gene signatures across various conditions, providing a standardized framework for benchmarking.

Installation

# install.packages("devtools")
devtools::install_github("DiseaseTranscriptomicsLab/markeR")

Case-study: Senescence

We will be using an already pre-processed gene expression dataset, derived from the Marthandan et al. (2016) study (GSE63577), that includes human fibroblast samples cultured under two different conditions: replicative senescence and proliferative control. The dataset has already been filtered and normalized using the edgeR package. For more information about the dataset structure, see the help pages for ?counts_example and ?metadata_example.

This dataset serves as a working example to demonstrate the main functionalities of the markeR package. In particular, it will be used to showcase the two primary modules designed for benchmarking gene signatures:

  • Score: calculates expression-based signature scores for each sample, and
  • Enrichment: evaluates the over-representation of gene signatures within ranked gene lists.

To illustrate the usage of markeR, we use three gene sets commonly associated with cellular senescence:

  • LiteratureMarkers: A small, curated set of well-established senescence-associated genes repeatedly reported in the literature. This concise gene set includes key markers often used for validating senescence phenotypes. This set includes information on the directionality of gene regulation (i.e., whether genes are typically up- or down-regulated in senescence).
  • REACTOME_CELLULAR_SENESCENCE: A comprehensive gene set from the MSigDB REACTOME collection, representing known molecular pathways involved in cellular senescence and commonly used in enrichment-based analyses. This is also treated as an undirected gene set without information on the up- or down-regulation of individual genes.
  • HernandezSegura: A transcriptomic gene set identified by Hernandez-Segura et al. (2017) as consistently altered across multiple senescence models. This set includes information on the directionality of gene regulation. It has shown strong performance in classification and enrichment analyses, including in the original paper of markeR.
# Load example gene sets
data("genesets_example")
print(genesets_example) 
## $HernandezSegura
##        gene enrichment
## 1    ACADVL          1
## 2     ADPGK          1
## 3  ARHGAP35         -1
## 4     ARID2         -1
## 5     ASCC1         -1
## 6   B4GALT7          1
## 7    BCL2L2          1
## 8     C2CD5         -1
## 9     CCND1          1
## 10    CHMP5          1
## 11    CNTLN         -1
## 12   CREBBP         -1
## 13     DDA1          1
## 14     DGKA          1
## 15   DYNLT3          1
## 16    EFNB3         -1
## 17  FAM214B          1
## 18     GBE1          1
## 19     GDNF          1
## 20    GSTM4         -1
## 21     ICE1         -1
## 22    KCTD3         -1
## 23     KLC1          1
## 24    MEIS1         -1
## 25   MT-CYB          1
## 26     NFIA         -1
## 27     NOL3          1
## 28    P4HA2          1
## 29    PATZ1         -1
## 30    PCIF1         -1
## 31   PDLIM4          1
## 32    PDS5B         -1
## 33     PLK3          1
## 34   PLXNA3          1
## 35   POFUT2          1
## 36    RAI14          1
## 37    RHNO1         -1
## 38     SCOC          1
## 39  SLC10A3          1
## 40  SLC16A3          1
## 41      SMO         -1
## 42   SPATA6         -1
## 43    SPIN4         -1
## 44    STAG1         -1
## 45    SUSD6          1
## 46    TAF13          1
## 47  TMEM87B          1
## 48   TOLLIP          1
## 49   TRDMT1         -1
## 50  TSPAN13          1
## 51     UFM1          1
## 52   USP6NL         -1
## 53   ZBTB7A          1
## 54    ZC3H4         -1
## 55   ZNHIT1          1
## 
## $REACTOME_CELLULAR_SENESCENCE
##   [1] "CDC27"    "E2F2"     "SCMH1"    "MRE11"    "MAP2K3"   "MAPK9"   
##   [7] "ANAPC4"   "MAP2K4"   "MAP4K4"   "RPS6KA2"  "UBE2D1"   "EED"     
##  [13] "MAP2K7"   "TNRC6C"   "MAPKAPK5" "ANAPC5"   "TNRC6A"   "TINF2"   
##  [19] "AGO1"     "CDC23"    "CABIN1"   "MAPK1"    "HIRA"     "TNRC6B"  
##  [25] "E2F1"     "RBBP7"    "MAPK3"    "ACD"      "NBN"      "CCNE1"   
##  [31] "FZR1"     "ERF"      "CDK6"     "H2AZ2"    "EZH2"     "MAPK8"   
##  [37] "UBE2S"    "MAP2K6"   "NFKB1"    "MAPK10"   "ANAPC15"  "CDKN1B"  
##  [43] "PHC1"     "ASF1A"    "MAPK14"   "E2F3"     "LMNB1"    "RAD50"   
##  [49] "TFDP2"    "MAPKAPK3" "IL1A"     "RPS6KA1"  "UBN1"     "RNF2"    
##  [55] "CDKN2C"   "CDK2"     "H1-3"     "H1-1"     "H2BC11"   "CDKN1A"  
##  [61] "ID1"      "AGO3"     "POT1"     "CDKN2D"   "CDC16"    "H3-3B"   
##  [67] "KDM6B"    "TERF2"    "CCNA1"    "PHC2"     "AGO4"     "ETS1"    
##  [73] "CDK4"     "MDM2"     "IL6"      "TXN"      "HMGA1"    "RB1"     
##  [79] "MINK1"    "TP53"     "ANAPC11"  "CBX8"     "CBX4"     "RPS27A"  
##  [85] "CCNA2"    "H2BC1"    "TERF1"    "CDKN2B"   "CDKN2A"   "ATM"     
##  [91] "HMGA2"    "UBC"      "VENTX"    "ANAPC1"   "TNIK"     "MOV10"   
##  [97] "ETS2"     "H2BC5"    "H4C8"     "RBBP4"    "MAPKAPK2" "H3-3A"   
## [103] "IGFBP7"   "ANAPC10"  "ANAPC16"  "MAPK7"    "TERF2IP"  "H3-4"    
## [109] "BMI1"     "H1-4"     "STAT3"    "CXCL8"    "UBE2E1"   "UBB"     
## [115] "FOS"      "IFNB1"    "CEBPB"    "KAT5"     "RELA"     "PHC3"    
## [121] "CBX2"     "UBE2C"    "CCNE2"    "ANAPC2"   "CDC26"    "RPS6KA3" 
## [127] "JUN"      "SUZ12"    "H2AC6"    "H2BC4"    "EHMT1"    "EP400"   
## [133] "H3C13"    "CBX6"     "H2AC20"   "H1-5"     "H2BC21"   "H2BC13"  
## [139] "MAPK11"   "SP1"      "H1-2"     "H2AX"     "H1-0"     "ANAPC7"  
## [145] "H2AC7"    "H2BC26"   "H4C3"     "H3C12"    "H4C11"    "H3C4"    
## [151] "MAP3K5"   "H4C16"    "H2BC12"   "TFDP1"    "MDM4"     "H3C14"   
## [157] "H3C15"    "RING1"    "EHMT2"    "UBA52"    "H2AJ"     "H4C15"   
## [163] "H4C14"    "H4C12"    "H2BC14"   "H2BC8"    "H3C8"     "H2AB1"   
## [169] "H2BC6"    "H4C6"     "H2BC17"   "H3C6"     "H4C13"    "H3C11"   
## [175] "H2BC9"    "H3C1"     "H4C9"     "H2AC14"   "H2BC3"    "H4C5"    
## [181] "H2AC8"    "H4C4"     "H2BC7"    "H3C7"     "H2AC4"    "H2BC10"  
## [187] "H4C1"     "H4C2"     "H3C10"    "MIR24-2"  "MIR24-1"  "H3C2"    
## [193] "H3C3"     "H2AC18"   "H2AC19"  
## 
## $LiteratureMarkers
##     gene enrichment
## 1 CDKN1A          1
## 2 CDKN2A          1
## 3   GLB1          1
## 4   TP53          1
## 5   CCL2          1
## 6  LMNB1         -1
## 7  MKI67         -1
data(counts_example)
# Load example data
counts_example[1:5,1:5]
##          SRR1660534 SRR1660535 SRR1660536 SRR1660537 SRR1660538
## A1BG        9.94566   9.476768   8.229231   8.515083   7.806479
## A1BG-AS1   12.08655  11.550303  12.283976   7.580694   7.312666
## A2M        77.50289  56.612839  58.860268   8.997624   6.981857
## A4GALT     14.74183  15.226083  14.815891  14.675780  15.222488
## AAAS       47.92755  46.292377  43.965972  47.109493  47.213739

For illustration purposes of different variable types, let’s imagine we also had two additional variables: one indicating the number of days between sample preparation and sequencing (DaysToSequencing), and another identifying the person who processed each sample (researcher). These variables are hypothetical and not part of the original study design.

data(metadata_example)

set.seed("123456")

metadata_example$Researcher <- sample(c("John","Ana","Francisca"),39, replace = TRUE)
metadata_example$DaysToSequencing <- sample(c(1:20),39, replace = TRUE)
 
head(metadata_example)
##       sampleID      DatasetID   CellType     Condition       SenescentType
## 252 SRR1660534 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 253 SRR1660535 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 254 SRR1660536 Marthandan2016 Fibroblast     Senescent Telomere shortening
## 255 SRR1660537 Marthandan2016 Fibroblast Proliferative                none
## 256 SRR1660538 Marthandan2016 Fibroblast Proliferative                none
## 257 SRR1660539 Marthandan2016 Fibroblast Proliferative                none
##                         Treatment Researcher DaysToSequencing
## 252 PD72 (Replicative senescence)        Ana                6
## 253 PD72 (Replicative senescence)        Ana               18
## 254 PD72 (Replicative senescence)       John               19
## 255                         young        Ana                2
## 256                         young  Francisca                9
## 257                         young       John               10

Calculate Senescence Scores

The CalculateScores function computes the signature scores for each sample based on predefined gene sets, such as a senescence gene set. It returns a named list where each entry corresponds to a specific gene set and includes the calculated scores, along with metadata (if available). When setting method = "all", the function returns a list, where each element corresponds to a scoring method and contains the respective data frame of scores, allowing comparison between methods. The function allows users to select from three different scoring methods:

  • ssGSEA: Computes an enrichment score for each gene set in each sample.
  • logmedian: Calculates the score as the sum of the normalized (log2-median-centered) expression values of the genes in the gene set, divided by the number of genes.
  • ranking: Determines the score by ranking the expression of the genes in the gene set and normalizing the result.

These methods are very similar and, when applied to a robust gene set, will yield similar results across all three methods. Empirically, a good gene set will be one that shows consistent results, both in the calculated scores and in Cohen’s d or F statistics, across different methods. If the gene set is not robust, or if there is considerable noise, the results across methods may differ significantly. Consistent scores across methods typically indicate a more reliable and meaningful gene set. These methods are explained in more detail below, allowing the user to select the most appropriate one for their analysis.

The PlotScores function can be used to compute and visualize the scores in various ways, depending on the method and variable chosen.

  • If method = "all" and the variable is categorical, it will return a heatmap of Cohen’s d or F statistics and a volcano plot showing contrasts between all groups of that variable.
  • If method = "all" and the variable is numeric, a heatmap of Cohen’s F and a volcano plot will be produced.
  • If method != "all" and the variable is categorical, it will generate a violin plot for each gene set.
  • If method != "all" and the variable is NULL, a density plot of the score distribution will be displayed.
  • If method != "all" and the variable is numeric, a scatter plot will be created to show the relationship between the scores and the numeric variable.

logmedian method

The following example uses the logmedian method to calculate a gene signature score. This method first applies a log2 transformation to the expression values, and then centers them by subtracting the median expression (across all samples) for each genes. The score for each sample is then computed by summing the normalised expression values of the genes in the gene set, and dividing by the number of genes in the gene set. This normalization makes each gene’s expression relative to its typical behavior across the dataset, allowing for meaningful comparisons between genes with different expression scales. By using log2 median-centering, the method ensures that both highly and lowly expressed genes contribute comparably to the score, as long as their variances are similar. This normalization emphasizes relative changes in expression rather than absolute values, allowing the score to reflect the coordinated behavior of the genes in a gene set. Users can calculate the gene signature score for each sample based on one or more predefined gene sets (signatures).

Here’s an example where we calculate the signature score using the “logmedian” method:

df_Scores <- CalculateScores(data = counts_example,
                             metadata = metadata_example,
                             method = "logmedian",
                             gene_sets = genesets_example)
## Considering bidirectional gene signature mode for signature HernandezSegura
## Considering unidirectional gene signature mode for signature REACTOME_CELLULAR_SENESCENCE
## Considering bidirectional gene signature mode for signature LiteratureMarkers
head(df_Scores$REACTOME_CELLULAR_SENESCENCE)
##       sample       score      DatasetID   CellType     Condition
## 1 SRR1660534 -0.05098427 Marthandan2016 Fibroblast     Senescent
## 2 SRR1660535 -0.05497976 Marthandan2016 Fibroblast     Senescent
## 3 SRR1660536 -0.03370396 Marthandan2016 Fibroblast     Senescent
## 4 SRR1660537 -0.02091467 Marthandan2016 Fibroblast Proliferative
## 5 SRR1660538 -0.01089274 Marthandan2016 Fibroblast Proliferative
## 6 SRR1660539 -0.01012807 Marthandan2016 Fibroblast Proliferative
##         SenescentType                     Treatment Researcher DaysToSequencing
## 1 Telomere shortening PD72 (Replicative senescence)        Ana                6
## 2 Telomere shortening PD72 (Replicative senescence)        Ana               18
## 3 Telomere shortening PD72 (Replicative senescence)       John               19
## 4                none                         young        Ana                2
## 5                none                         young  Francisca                9
## 6                none                         young       John               10
head(df_Scores$HernandezSegura)
##       sample      score      DatasetID   CellType     Condition
## 1 SRR1660534  0.4460830 Marthandan2016 Fibroblast     Senescent
## 2 SRR1660535  0.4564033 Marthandan2016 Fibroblast     Senescent
## 3 SRR1660536  0.4727664 Marthandan2016 Fibroblast     Senescent
## 4 SRR1660537 -0.1676110 Marthandan2016 Fibroblast Proliferative
## 5 SRR1660538 -0.1198661 Marthandan2016 Fibroblast Proliferative
## 6 SRR1660539 -0.1467108 Marthandan2016 Fibroblast Proliferative
##         SenescentType                     Treatment Researcher DaysToSequencing
## 1 Telomere shortening PD72 (Replicative senescence)        Ana                6
## 2 Telomere shortening PD72 (Replicative senescence)        Ana               18
## 3 Telomere shortening PD72 (Replicative senescence)       John               19
## 4                none                         young        Ana                2
## 5                none                         young  Francisca                9
## 6                none                         young       John               10
head(df_Scores$LiteratureMarkers)
##       sample      score      DatasetID   CellType     Condition
## 1 SRR1660534  2.2008955 Marthandan2016 Fibroblast     Senescent
## 2 SRR1660535  1.8078284 Marthandan2016 Fibroblast     Senescent
## 3 SRR1660536  1.8542998 Marthandan2016 Fibroblast     Senescent
## 4 SRR1660537 -0.3543785 Marthandan2016 Fibroblast Proliferative
## 5 SRR1660538 -0.3799436 Marthandan2016 Fibroblast Proliferative
## 6 SRR1660539 -0.4281760 Marthandan2016 Fibroblast Proliferative
##         SenescentType                     Treatment Researcher DaysToSequencing
## 1 Telomere shortening PD72 (Replicative senescence)        Ana                6
## 2 Telomere shortening PD72 (Replicative senescence)        Ana               18
## 3 Telomere shortening PD72 (Replicative senescence)       John               19
## 4                none                         young        Ana                2
## 5                none                         young  Francisca                9
## 6                none                         young       John               10

The user can also chose to directly plot the scores.

Effect sizes can be computed using the compute_cohen parameter (default = T): when the grouping variable has only two levels, Cohen’s d is calculated by default. If there are more than two levels, Cohen’s f is used unless a specific pairwise comparison is defined via cond_cohend, in which case Cohen’s d is reported for that comparison. If pvalcalc = TRUE (default = FALSE), an associated p-value (not corrected for multiple testing) is also reported. The p-value is derived from a two-sample t-test for two-group comparisons or numeric variables, or from an ANOVA for multi-group comparisons.

senescence_triggers_colors <- c(
  "none" = "#E57373",  # Soft red   
  "Telomere shortening" = "#4FC3F7"  # Vivid sky blue  
)

cond_cohend <- list(A=c("Senescent"),  
                    B=c("Proliferative"))

PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets = genesets_example,
           ColorVariable = "SenescentType", 
           Variable="Condition",  
           method ="logmedian", 
           ColorValues = senescence_triggers_colors, 
           ConnectGroups=TRUE, 
           ncol = NULL, 
           nrow = 1, 
           widthTitle=24, 
           limits = NULL, 
           legend_nrow = 1, 
           pointSize=4,
           compute_cohen=TRUE,
           cond_cohend=cond_cohend,
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 12)  

Interestingly, when we provide directionality for a signature—such as the Literature_Senescence set—the interpretation of the results can change substantially. For example, without specifying direction, senescent samples may appear to have lower scores than proliferative ones. But once directionality is accounted for, the scores shift in a way that aligns better with biological expectations. Therefore, it is strongly advised that, whenever possible (i.e., if known), the user states the putative regulation “sign” of the genes in the gene set This helps ensure more accurate and meaningful interpretations of the data.

GeneSets_Example_Bidirectionality <- list(LiteratureMarkers=genesets_example$LiteratureMarkers,
                                          LiteratureMarkers_Unidirectional =  genesets_example$LiteratureMarkers$gene)
print(GeneSets_Example_Bidirectionality)
## $LiteratureMarkers
##     gene enrichment
## 1 CDKN1A          1
## 2 CDKN2A          1
## 3   GLB1          1
## 4   TP53          1
## 5   CCL2          1
## 6  LMNB1         -1
## 7  MKI67         -1
## 
## $LiteratureMarkers_Unidirectional
## [1] "CDKN1A" "CDKN2A" "GLB1"   "TP53"   "CCL2"   "LMNB1"  "MKI67"
PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets =  GeneSets_Example_Bidirectionality,
           ColorVariable = "SenescentType", 
           Variable="Condition",  
           method ="logmedian", 
           ColorValues = senescence_triggers_colors, 
           ConnectGroups=TRUE, 
           ncol = NULL, 
           nrow = NULL, 
           widthTitle=24, 
           limits = NULL, 
           legend_nrow = 1, 
           pointSize=4,
           compute_cohen=TRUE,
           cond_cohend=cond_cohend,
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 12)  

To use the function for numeric variables, the user should specify the relevant parameters, including the numeric variable to be analysed. The function will generate a scatter plot for the numeric variable, optionally calculating Cohen’s f as the effect size. The user can choose a correlation method (e.g., Pearson, Spearman, or Kendall) to assess the relationship between the variable and the signature scores. The plot will also include optional p-value calculations for comparisons.

Here is an example of how to configure the function for numeric variables:

PlotScores(data = counts_example,
           metadata = metadata_example,
           gene_sets = genesets_example,
           Variable = "DaysToSequencing",               
           method = "logmedian",            
           ColorValues = "#3B415B",             
           ConnectGroups = FALSE,          
           ncol = NULL,                    
           nrow = 1,                    
           pointSize = 6,                  
           compute_cohen = TRUE,          
           pvalcalc = TRUE,                
           title = "Marthandan et al. 2016",  
           labsize=9, 
           titlesize = 12, 
           widthTitle = 26,
           cor = "pearson")

For users interested in viewing the overall distribution of scores across gene signatures, the PlotScores function can be used without specifying the GroupingVariable parameter, i.e, without grouping scores by any metadata variable. In this case, the function will automatically generate a grid of density plots, with each plot representing the distribution of scores for a specific gene set

PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets = genesets_example, 
           method ="logmedian", 
           ColorValues = NULL,  
           ncol = NULL, 
           nrow = 1, 
           widthTitle=24, 
           limits = NULL,  
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 11)  

ssGSEA method

The same approach can be applied for ssGSEA (single-sample Gene Set Enrichment Analysis; Barbie et al. (2009)) for score calculation and visualization, both for unidirectional and bidirectional signatures. ssGSEA computes an enrichment score for each gene signature in each sample using an adaptation of the gsva() function from the GSVA package. This method is useful for evaluating gene set enrichment in individual samples rather than groups, as described in the sections below.

senescence_triggers_colors <- c(
  "none" = "#E57373",  # Soft red   
  "Telomere shortening" = "#4FC3F7"  # Vivid sky blue  
)

cond_cohend <- list(A=c("Senescent"),  
                    B=c("Proliferative"))

PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets = genesets_example,
           ColorVariable = "SenescentType", 
           Variable="Condition",  
           method ="ssGSEA", 
           ColorValues = senescence_triggers_colors, 
           ConnectGroups=TRUE, 
           ncol = NULL, 
           nrow = 1, 
           widthTitle=24, 
           limits = NULL, 
           legend_nrow = 1, 
           pointSize=4,
           compute_cohen=TRUE,
           cond_cohend=cond_cohend,
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 12)  

Ranking method

The ranking method computes gene signature scores for each sample by ranking the expression of signature genes in the dataset and normalizing the score based on the total number of genes.

The following example demonstrates the use of the “ranking” method for both unidirectional and bidirectional signatures:

senescence_triggers_colors <- c(
  "none" = "#E57373",  # Soft red   
  "Telomere shortening" = "#4FC3F7"  # Vivid sky blue  
)

cond_cohend <- list(A=c("Senescent"),  
                    B=c("Proliferative"))

PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets = genesets_example,
           ColorVariable = "SenescentType", 
           Variable="Condition",  
           method ="ranking", 
           ColorValues = senescence_triggers_colors, 
           ConnectGroups=TRUE, 
           ncol = NULL, 
           nrow = 1, 
           widthTitle=24, 
           limits = NULL, 
           legend_nrow = 1, 
           pointSize=4,
           compute_cohen=TRUE,
           cond_cohend=cond_cohend,
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 12)  

All methods

To compare various metrics across different condition combinations, violin plots may not always be the best choice. In such cases, users can setmethod = "all" to generate a summary heatmap and volcano-like plot. The function will return one heatmap per gene set, with rows corresponding to all possible combinations of values in the GroupingVariable. In parenthesis is represented the p-value, adjusted using the BH method (i.e., the Benjamini–Hochberg false discovery rate procedure), across all combinations of contrasts and gene signatures. It will also return a volcano-like plot (Cohen’s d effect sizes vs -log10(adjusted p-values)), where each dot represents a method-signature pair, faceted by contrast. The dashed lines represent user-defined thresholds for significance and effect size.

The mode parameter controls how contrasts are generated for categorical variables, allowing users to adjust the complexity of the analysis:

  • “simple”: Performs the minimal number of pairwise comparisons between group levels (e.g., for a factor with levels A, B, C and D, it generates A - B, A - C, A - D, B - C, B - D, C - D).
  • “medium”: Includes comparisons between one group and the union of other groups (e.g., A - (B + C + D); B - (A + C + D)), allowing for broader contrasts beyond simple pairwise comparisons.
  • “extensive”: Allows for all possible algebraic combinations of group levels (e.g., (A + B) - (C + D)).
Overall_Scores <- PlotScores(data = counts_example, 
                             metadata = metadata_example,  
                             gene_sets=genesets_example, 
                             Variable="Condition",  
                             method ="all",   
                             ncol = NULL, 
                             nrow = 1, 
                             widthTitle=30, 
                             limits = c(0,3.5),   
                             title="Marthandan et al. 2016", 
                             titlesize = 10,
                             ColorValues = list(heatmap=c("#F9F4AE", "#B44141"),
                                                volcano=signature_colors <- c(
                                                  HernandezSegura = "#A07395",               
                                                  REACTOME_CELLULAR_SENESCENCE = "#6B8E9E",  
                                                  LiteratureMarkers = "#CA7E45"             
                                                )
                             ),
                             mode="simple",
                             widthlegend=30, 
                             sig_threshold=0.05, 
                             cohen_threshold=0.6,
                             pointSize=6,
                             colorPalette="Paired")  
Overall_Scores$heatmap

Overall_Scores$volcano

Classification Potential of Gene Signatures

The ROC_Scores and AUC_Scores functions allow users to evaluate the classification potential of gene set scores based on ROC curves and their AUC values. These functions help to assess how well a given score can differentiate between conditions, based on predefined contrasts. Besides method="all", these functions can also be used for each method individually.

The ROC_Scores function generates ROC curves for different scoring methods across contrasts, allowing users to visualize performance differences.

 ROC_Scores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets=genesets_example, 
           method = "all", 
           variable ="Condition",
           colors = c(logmedian = "#3E5587", ssGSEA = "#B65285", ranking = "#B68C52"), 
           grid = TRUE, 
           spacing_annotation=0.3, 
           ncol=NULL, 
           nrow=1,
            mode = "simple",
            widthTitle = 28,
           titlesize = 10,  
           title="Marthandan et al. 2016") 

The AUC_Scores function generates heatmaps summarizing AUC values for each gene set, with methods as columns and contrasts as rows. The plot is returned in the $plt element, and the underlying data used to generate it is available in the $data element.

AUC_Scores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets=genesets_example, 
           method = "all", 
           mode = "simple", 
           variable="Condition", 
           nrow = 1, 
           ncol = NULL, 
           limits = NULL, 
           widthTitle = 28, 
           titlesize = 10, 
           ColorValues = c("#F9F4AE", "#B44141"),   
           title="Marthandan et al. 2016") 

False Positive Rate (FPR) Calculations

The user can assess the significance of gene set scores by comparing observed effect sizes against a distribution of those originated by random gene sets with the same number of genes and matched directionality. For each original gene set, the function calculates the observed Cohen’s d (and p‑value) using (GroupingVariable). It then generates a number of simulated gene sets (number_of_sims) by randomly sampling the same number of genes from a user provided gene list (gene_list) and computes their Cohen’s d values. The simulation results are visualised as violin plots of the distribution of Cohen’s d values for each method, overlaid with the observed values of the original gene sets, and a 95th percentile threshold. Significance is indicated by distinct point shapes based on the associated p‑value.

FPR_Simulation(data = counts_example,
               metadata = metadata_example,
               original_signatures = genesets_example,
               gene_list = row.names(counts_example),
               number_of_sims = 10,
               title = "Marthandan et al. 2016",
               widthTitle = 30,
               Variable = "Condition",
               titlesize = 12,
               pointSize = 5,
               labsize = 10,
               mode = "simple",
               ColorValues=NULL,
               ncol=NULL, 
               nrow=1 ) 

Enrichment-Based Methods

Differentially Expressed Genes

The calculateDE function in the markeR package leverages the limma framework to compute differential gene expression statistics from raw count data. This function is highly flexible and supports several modes of operation depending on the user’s experimental design. In the examples below, we illustrate two common scenarios:

  • Automatic Design Matrix with Contrasts: In the first example, the design matrix is built automatically from the metadata using a specified variable (here, "Condition"). The user must explicitly define the comparisons of interest based on the levels of this variable (e.g., Senescent - Proliferative). Internally, this approach fits a linear model without an intercept, enabling the user to define contrasts between the levels. This is ideal for simpler experimental designs, where quick comparison between predefined groups is desired without manually specifying the full model matrix.
  • Providing an Externally Constructed Design Matrix: In the second example, the user manually creates the design matrix (e.g., using model.matrix(~ 0 + Condition) for a no-intercept model) and provides it directly to calculateDE. This gives full control over how the design is specified, including complex experimental setups or custom encodings. Multiple contrasts can be defined later, based on this matrix, to extract specific comparisons of interest, using the Contrast parameter. If this parameter is left as NULL, the function will return results for all conditions (i.e. columns) defined in the design matrix. This approach is recommended when the user has a complex design or has already constructed the design matrix as part of a broader analysis pipeline.

Below are the corresponding code snippets demonstrating each scenario, by answering the same question: What are the genes differentially expressed between senescence and proliferative cells?

# Example 1: Build design matrix from variables (Condition) and apply a contrast.
# In this case, the design matrix is constructed automatically using the variable "Condition".
DEGs <- calculateDE(data = counts_example,
                    metadata = metadata_example,
                    variables = "Condition",
                    contrasts = c("Senescent - Proliferative"))
DEGs$`Senescent-Proliferative`[1:5,]
##            logFC  AveExpr         t      P.Value    adj.P.Val        B
## CCND2   3.816674 4.406721 12.393130 2.747202e-15 2.435712e-12 24.70982
## MKI67  -3.581174 6.605339 -9.197510 2.007135e-11 4.992097e-10 15.96203
## PTCHD4  3.398914 3.556007 10.740867 2.318137e-13 3.002118e-11 20.35906
## KIF20A -3.365481 5.934893 -9.728709 4.176398e-12 1.844336e-10 17.51106
## CDC20  -3.304602 6.104079 -9.801724 3.375212e-12 1.657492e-10 17.72110
# Example 2: Supply a custom design matrix directly.
# Here, the design matrix is created externally (using no intercept, for instance).
design <- model.matrix(~0 + Condition, data = metadata_example)
colnames(design) <- c("Proliferative","Senescent")
DEGs2 <- calculateDE(data = counts_example,
                     metadata = NULL,
                     variables = NULL,
                     modelmat = design,
                     contrasts = c("Senescent - Proliferative"))
DEGs2$`Senescent-Proliferative`[1:5,]
##            logFC  AveExpr         t      P.Value    adj.P.Val        B
## CCND2   3.816674 4.406721 12.393130 2.747202e-15 2.435712e-12 24.70982
## MKI67  -3.581174 6.605339 -9.197510 2.007135e-11 4.992097e-10 15.96203
## PTCHD4  3.398914 3.556007 10.740867 2.318137e-13 3.002118e-11 20.35906
## KIF20A -3.365481 5.934893 -9.728709 4.176398e-12 1.844336e-10 17.51106
## CDC20  -3.304602 6.104079 -9.801724 3.375212e-12 1.657492e-10 17.72110

After running differential expression analysis (for example, using the calculateDE function), the user can visualize their results with the plotVolcano function. This function provides a flexible interface for exploring their data by allowing the user to:

  • Plot Differential Gene Expression Statistics:
    Display a volcano plot with chosen statistics (e.g., log fold-change on the x-axis and –log₁₀ adjusted p-value on the y-axis).
  • Color Interesting Genes:
    Highlight genes that pass user-specified thresholds by adjusting threshold_y and threshold_x.
  • Annotate Top and Bottom N Genes:
    Optionally, label the top (and bottom) N genes based on the chosen statistic to quickly identify the most significant genes.
  • Highlight Gene Signatures: If the user provides a list of gene signatures using the genes argument, the function can highlight these genes in the plot. The user can also specify distinct colors for putativelyupregulated and downregulated if their direction is known, or a color for genes that do not have a putative direction.

Below is an example usage of plotVolcano that visualizes differential expression results from a DEResultsList. The first plot shows the default behavior, generating a basic volcano plot without thresholds or gene highlights. Subsequent examples demonstrate how to customize the plot:

  • Adding significance thresholds to highlight genes of interest,
  • Annotating the top and bottom N genes by effect size,
  • And using gene signatures to color genes across multiple plots arranged by contrast and signature.

These examples illustrate how users can customise the output plot to highlight biologically meaningful patterns or focus on specific gene sets.

# Plot Differentially Expressed Genes:
plotVolcano(DEGs, genes = NULL, N = NULL,
            x = "logFC", y = "-log10(adj.P.Val)", pointSize = 2,
            color = "#6489B4", highlightcolor = "#05254A", nointerestcolor = "#B7B7B7",
            threshold_y = NULL, threshold_x = NULL,
            xlab = NULL, ylab = NULL, ncol = NULL, nrow = NULL, title = "Marthandan et al. 2016",
            labsize = 8, widthlabs = 25, invert = FALSE)

# Color Interesting Genes:
plotVolcano(DEGs, genes = NULL, N = NULL,
            x = "logFC", y = "-log10(adj.P.Val)", pointSize = 2,
            color = "#6489B4", highlightcolor = "#05254A", nointerestcolor = "#B7B7B7",
            threshold_y = 0.0001, threshold_x = 1,
            xlab = NULL, ylab = NULL, ncol = NULL, nrow = NULL, title = "Marthandan et al. 2016",
            labsize = 8, widthlabs = 25, invert = FALSE)

# Annotate Top and Bottom N Genes:
plotVolcano(DEGs, genes = NULL, N = 5,
            x = "logFC", y = "-log10(adj.P.Val)", pointSize = 2,
            color = "pink", highlightcolor = "#05254A", nointerestcolor = "#B7B7B7",
            threshold_y = NULL, threshold_x = NULL,
            xlab = NULL, ylab = NULL, ncol = NULL, nrow = NULL, title = "Marthandan et al. 2016",
            labsize = 8, widthlabs = 25, invert = FALSE)

# Change order: signatures in columns, contrast in rows
plotVolcano(DEGs, genes = genesets_example, 
            N = NULL,
            x = "logFC", y = "-log10(adj.P.Val)", pointSize = 2,
            color = "#6489B4", highlightcolor = "#05254A", highlightcolor_upreg = "#038C65", highlightcolor_downreg = "#8C0303",nointerestcolor = "#B7B7B7",
            threshold_y = NULL, threshold_x = NULL,
            xlab = NULL, ylab = NULL, ncol = NULL, nrow = NULL, title = "Marthandan et al. 2016",
            labsize = 10, widthlabs = 24, invert = TRUE)

Note on Continuous Variables

If the user wants to analyse continuous variables (e.g., time or dosage), they must provide a custom design matrix via the modelmat argument, instead of using the variables argument. This is because the variables argument is intended to be used for categorical variables only: internally, when using variables to specify a phenotypic variable of interest, the function constructs a design matrix without an intercept and treats each level of the variable as a discrete group. This is appropriate for defining explicit contrasts between categorical levels but is not suitable for continuous variables, where such discretization would distort the relationship.

For continuous variables, the user should instead build a design matrix manually (e.g., using model.matrix(~ variable)), ensuring that the variable of interest is numeric and has a corresponding column in the matrix with the same name. The calculateDE() function will then use this matrix directly for linear modelling.

The interpretation of the differential expression results remains consistent: the statistics reflect always the expected change in expression associated with a 1-unit increase in the variable (e.g., one day, one unit of dosage, “Senescent” to “Proliferative”, etc).

design <- model.matrix(~1 + DaysToSequencing, data = metadata_example)

DEGs_continuous1 <- calculateDE(data = counts_example,
                                metadata = metadata_example, 
                                modelmat = design,
                                contrasts = c("DaysToSequencing"))
## Warning in limma::makeContrasts(contrasts = contrasts, levels = design_matrix):
## Renaming (Intercept) to Intercept
DEGs_continuous1$DaysToSequencing[1:3,]
##                logFC  AveExpr         t   P.Value adj.P.Val         B
## RNA45SN2 -0.07628608 7.712585 -1.626674 0.1117361 0.9997685 -6.018805
## RNA18SN2  0.06743439 9.577467  1.554171 0.1280974 0.9997685 -6.129426
## RNA18SN3  0.06743439 9.577467  1.554171 0.1280974 0.9997685 -6.129426

This usage of the modelmat argument allows the user to combine categorical and numeric variables in a fully customized design matrix. In the example below:

  • (Intercept): Baseline expression when days = 0 and Condition = Control.
  • DaysToSequencing: Change in expression per unit of time (day) increase.
  • Senescent: Average difference in expression between Senescent and Control conditions.

This approach is useful when modeling continuous effects alongside group comparisons, and it provides complete flexibility in specifying the design.

# Manually construct the design matrix
model_matrix <- model.matrix(~ DaysToSequencing + Condition, data = metadata_example)
colnames(model_matrix) <- c("(Intercept)", "DaysToSequencing", "Senescent")

# Provide the custom design matrix to calculateDE using the `modelmat` argument
DEGs_continuous2 <- calculateDE(data = counts_example,
                                modelmat = model_matrix)

# Access results for each coefficient
DEGs_continuous2$`(Intercept)`[1:3, ]
##           logFC  AveExpr         t      P.Value    adj.P.Val         B
## FN1    13.23994 13.21879  74.04645 1.265799e-43 3.210740e-43  89.73456
## EEF1A1 12.94192 12.85091 179.29631 1.402180e-58 9.033979e-57 121.69861
## GAPDH  12.71139 12.53188 147.86006 2.586838e-55 5.583501e-54 115.21324
DEGs_continuous2$DaysToSequencing[1:3, ]
##                logFC  AveExpr         t    P.Value adj.P.Val         B
## RNA45SN2 -0.08056063 7.712585 -1.689209 0.09914647 0.9853746 -6.052523
## RNA18SN2  0.06598991 9.577467  1.490666 0.14407989 0.9853746 -6.355482
## RNA18SN3  0.06598991 9.577467  1.490666 0.14407989 0.9853746 -6.355482
DEGs_continuous2$Senescent[1:3, ]
##            logFC  AveExpr         t      P.Value    adj.P.Val        B
## CCND2   3.863054 4.406721 12.344742 4.643461e-15 4.865503e-12 24.20437
## MKI67  -3.655159 6.605339 -9.296026 1.907736e-11 5.010329e-10 16.02834
## PTCHD4  3.453362 3.556007 10.774059 2.907380e-13 3.864144e-11 20.14942

Gene Set Enrichment Analyses

To perform GSEA, the user can use the runGSEA() function. This function takes a named list of differential expression statistics (one per contrast) and a set of gene signatures to compute enrichment scores.

  • DEGList: A list of differentially expressed genes (DEGs) for each contrast.

  • gene_sets: A list of gene sets, where each entry can be:

    • A vector of gene names (unidirectional analysis).
    • A data frame where the first column is the gene name and the second column indicates the expected direction (+1 or -1, bidirectional analysis).
  • stat: The ranking statistic. If NULL, the ranking statistic is automatically selected:

    • "B" for gene sets with no known direction (vectors).
    • "t" for unidirectional or bidirectional gene sets (data frames).
    • If provided, this argument overrides the automatic selection.
  • ContrastCorrection: Logical, default is FALSE. If TRUE, applies an additional multiple testing correction (Benjamini–Hochberg) across all contrasts returned in the DEGList results list. This accounts for the number of contrasts tested per signature and provides more stringent control of false discovery rate across multiple comparisons (similar to answering the question “Is there any signature that is significant in any of the contrasts” instead of “For each contrast, is there any signature that is significant”). If FALSE, the function only corrects for the number of gene sets.

GSEAresults <- runGSEA(DEGList = DEGs, 
                       gene_sets = genesets_example,
                       stat = NULL,
                        ContrastCorrection = FALSE)

GSEAresults
## $`Senescent-Proliferative`
##                         pathway         pval         padj   log2err        ES
##                          <char>        <num>        <num>     <num>     <num>
## 1:              HernandezSegura 5.728568e-10 1.718570e-09 0.8012156 0.6376777
## 2: REACTOME_CELLULAR_SENESCENCE 2.457757e-03 3.686636e-03 0.2972329 0.3807095
## 3:            LiteratureMarkers 2.013284e-02 2.013284e-02 0.1461515 0.7068136
##         NES  size  leadingEdge stat_used
##       <num> <int>       <list>    <char>
## 1: 2.647154    52 CNTLN, D....         t
## 2: 1.570456   137 H1-2, H2....         B
## 3: 1.667541     7 LMNB1, M....         t

Depending on the statistic used, the interpretation of the plots changes:

  1. B Statistic vs. t Statistic:
    • The B statistic does not specify the direction (enriched or depleted) of the gene set’s alterations. It only indicates the strength of evidence for any alterations.
    • The t statistic orders genes from stronger evidence of over-expression to stronger evidence for under-expression.
  2. Graph’s annotation:
    • When using the B statistic, the plot will include “Altered Gene Set” beneath its title to reflect this focus on whether genes are altered.
    • For the t statistic, the plot will include “Enriched/Depleted Gene Set” beneath its title, indicating its focus on the enrichment or depletion of genes.

After running GSEA, the user can visualize enrichment plots using the plotGSEAenrichment() function. This function generates enrichment plots for each gene signature and contrast, displaying also the Normalized Enrichment Scores (NES) and adjusted p-value for each enrichment result. As plotGSEAenrichment’s relevant graphical parameters:

  • grid = TRUE: Arranges the plots in a grid for better visualization.
  • titlesize: Adjusts title font size.
  • nrow/ncol: Specifies the grid layout for arranging plots.
plotGSEAenrichment(GSEA_results=GSEAresults, 
                   DEGList=DEGs, 
                   gene_sets=genesets_example, 
                   widthTitle=40, grid = TRUE, titlesize = 10, nrow=1, ncol=3) 

The plotNESlollipop() function creates lollipop plots for visualizing Gene Set Enrichment Analysis (GSEA) results. Each plot displays gene sets on the y-axis and Normalized Enrichment Scores (NES) on the x-axis, with a color gradient proportional to the adjusted p-value. The function supports multiple contrasts and includes options for customizing the color gradient, significance threshold, and plot layout. It can also arrange individual plots into a grid layout for comparative visualization.

plotNESlollipop(GSEA_results=GSEAresults, 
                saturation_value=NULL, 
                nonsignif_color = "#F4F4F4", 
                signif_color = "red",
                sig_threshold = 0.05, 
                grid = FALSE, 
                nrow = NULL, ncol = NULL, 
                widthlabels=13, 
                title=NULL, titlesize=12) 
## $`Senescent-Proliferative`

The plotCombinedGSEA() function generates a scatter plot to visualize the results of Gene Set Enrichment Analysis (GSEA) across multiple contrasts. Each point represents a pathway, with:

  • X-axis: Normalized Enrichment Score (NES)
  • Y-axis: -log10 adjusted p-value (significance)
  • Color: Gene sets
  • Shape: Contrasts
  • Dashed line: Significance threshold

This function helps compare enrichment results when the number of contrasts and the number of pathways is high, scenario of which the example herein is not too representative.

plotCombinedGSEA(GSEAresults, sig_threshold = 0.05, PointSize=6, widthlegend = 26 )

Note on GSEA for All-negative gene sets

In gene set enrichment analysis, the behaviour of gene sets annotated with regulatory direction must be interpreted carefully. In our current implementation, if a gene set includes both positive and negative regulators of a process, the statistics for the negatively annotated genes are inverted, meaning that the enrichment reflects their effect on the process rather than simply their expression direction. This ensures that the final enrichment score captures the coordinated functional outcome of the gene set.

The same logic is preserved when a gene set contains only negatively annotated genes (e.g. negative regulators of senescence): the statistics are inverted so that a positive normalised enrichment score (NES) implies these inhibitors are less expressed, thereby supporting the activation of the process they repress (e.g. senescence). Conversely, a negative NES would suggest their upregulation, which could potentially suppress the process.

Comparison of both approaches

Based on these very simple analyses, the REACTOME_CELLULAR_SENESCENCE showed consistently poor performance, highlighing caution when this or other MSigDB pathways appear in common enrichment-based analyses with RNA-seq data. The HernandezSegura signature, with its moderate size (55 genes) and annotated gene direction, performed best in enrichment-based methods such as GSEA. In contrast, the LiteratureMarkers set yielded stronger results in scoring approaches, likely due to a small number of highly informative genes rather than coordinated activity across the entire set.

These findings highlight important trade-offs: while scoring methods offer per-sample resolution and are less sensitive (in terms of statistical significance) to gene set size, making them useful for classification tasks, they may be overly influenced by a small subset of genes, which could limit biological interpretability. While more robust to sample heterogeneity and better at capturing coordinated expression changes, enrichment-based methods are sensitive to gene set composition and size. Caution is warranted when interpreting results, especially from score-based approaches, as strong signals may not always reflect the intended biological process, but rather a handful of dominant genes.

Visualise Individual Gene Behaviour

As highlighted in the previous section, score-based approaches can be disproportionately influenced by a small subset of genes. To address this, markeR includes dedicated functions for exploring individual gene behaviour, enabling users to assess if and which genes may be driving the overall signal. We demonstrate this functionality using the LiteratureMarkers gene set.

The ExpressionHeatmap function generates a heatmap to display the expression levels of selected senescence genes across samples. Samples are annotated by a chosen condition, and expression values are color-scaled for easy visual comparison. Clustering options and customizable color palettes allow for flexible and informative visualization.

annotation_colors <- list( 
  Condition = c(
    "Senescent"     = "#65AC7C",  # Example color: greenish
    "Proliferative" = "#5F90D4"  # Example color: blueish 
  )
)

ExpressionHeatmap(data=counts_example, 
                  metadata = metadata_example, 
                  genes=genesets_example$LiteratureMarkers$gene,  
                  annotate.by = c("Condition"),
                  annotation_colors = annotation_colors,
                  colorlist = list(low = "#3F4193", mid = "#F9F4AE", high = "#B44141"),
                  cluster_rows = TRUE, 
                  cluster_columns = FALSE,
                  title = "Senescence Genes", 
                  titlesize = 20,
                  legend_position = "right",
                  scale_position="right")

The IndividualGenes_Violins function creates violin plots to visualize the expression distributions of selected senescence genes across conditions. Jittered points represent individual samples, and grouping (x axis, GroupingVariable) and color variables (ColorVariable and ColorValues) from the metadata allow for additional stratification and insight. Customization options include layout, point size, colors, and axis labeling.

senescence_triggers_colors <- c(
  "none" = "#E57373",  # Soft red   
  "Telomere shortening" = "#4FC3F7"  # Vivid sky blue  
)


IndividualGenes_Violins(data = counts_example, 
                        metadata = metadata_example, 
                        genes = genesets_example$LiteratureMarkers$gene, 
                        GroupingVariable = "Condition", 
                        plot=TRUE, 
                        ncol=NULL, 
                        nrow=1, 
                        divide=NULL, 
                        invert_divide=FALSE,
                        ColorValues=senescence_triggers_colors, 
                        pointSize=2, 
                        ColorVariable="SenescentType", 
                        title="Senescence Genes", 
                        widthTitle=16,
                        y_limits = NULL,
                        legend_nrow=1, 
                        xlab="Condition",
                        colorlab="") 

The CorrelationHeatmap function displays pairwise correlations between selected genes, helping to reveal co-expression patterns within the senescence signature. Correlations can be computed separately for different conditions, and the heatmap is fully customizable with options for clustering, color scaling, and correlation method (e.g., Spearman or Pearson).

CorrelationHeatmap(data=counts_example, 
                   metadata = metadata_example, 
                   genes=genesets_example$LiteratureMarkers$gene, 
                   separate.by = "Condition", 
                   method = "spearman",  
                   colorlist = list(low = "#3F4193", mid = "#F9F4AE", high = "#B44141"),
                   limits_colorscale = c(-1,0,1), 
                   widthTitle = 16, 
                   title = "Senescence Genes", 
                   cluster_rows = TRUE, 
                   cluster_columns = TRUE,  
                   detailedresults = FALSE, 
                   legend_position="right",
                   titlesize=20)

The ROCandAUCplot function evaluates the discriminatory power of individual genes in the signature by computing ROC curves and AUC values based on a binary classification (e.g., senescent vs. proliferative). If the selected grouping variable has more than two levels, the user can specify the reference group using the class parameter. For example, if a variable has levels A, B, C, and D, setting class = c("A", "B") will group samples from A and B together as the positive class*, while the remaining samples (C and D) are automatically grouped as the negative class. Additionally, the user can use the group_var parameter to split and display results separately for each level of another metadata variable — allowing for subgroup-specific ROC analyses. Outputs include individual ROC plots and an AUC heatmap, with customisable layout, color schemes, and clustering options—ideal for identifying genes with strong discriminatory ability. If group_varis not specified, the AUC values will be displayed in a barplot.

  • For ease of interpretation, the directionality of the comparison is adjusted so that the AUC is always >=0.5, showing only discriminatory power rather than directionality.
options(error=recover)
ROCandAUCplot(counts_example, 
              metadata_example, 
              condition_var = "Condition", 
              class = "Senescent", 
              group_var=NULL,
              title = NULL,
              genes=genesets_example$LiteratureMarkers$gene, 
              plot_type = "all",
              auc_params =  list(colors =  "#3B415B",
                                    limits = c(0.5,1) ),
              roc_params = list(nrow=3,
                                ncol=3, 
                                colors="#3B415B"),
              commomplot_params = list(widths=c(0.5,0.3)))

The CohenD_IndividualGenes function computes the effect size (Cohen’s d) of the difference in expression of each gene between two conditions, given by the variable condition_var. If the selected condition variable has more than two levels, the class parameter specifies which condition will be compared to the rest. Additionally, the user can use the group_var parameter to split and display results separately for each level of another metadata variable. Results are visualized as a heatmap, with customizable color scales and clustering options for easy interpretation of effect sizes across genes. If group_var is not specified, the function will return a barplot instead.

CohenD_IndividualGenes(counts_example, 
                       metadata_example, 
                       genes=genesets_example$LiteratureMarkers$gene,
                       condition_var = "Condition", 
                       class = "Senescent", 
                       group_var = NULL,  
                       params = list(colors = "#3B415B",
                                             limits = NULL,
                                             cluster_rows=TRUE))

The plotPCA function performs a PCA on the expression of a selected set of genes to understand if they explain enough variance in the data, allowing the user to test if the genes in the signature are sufficient to separate the group of interest (given by the `ColourVariable’ parameter). While previous metrics assess discriminatory power quantitatively (e.g., AUC), PCA provides a complementary, unsupervised visualization that can help assess whether the gene signature is sufficient to visually separate groups of interest. Users can customize which principal components to display using the PCs argument and adjust layout, point size, and colour annotations for comparison across conditions.

annotation_colors <- c(  
  "Senescent"     = "#65AC7C",  # Example color: greenish
  "Proliferative" = "#5F90D4"  # Example color: blueish 
)


plotPCA(data = counts_example, 
        metadata = metadata_example, 
        genes=genesets_example$LiteratureMarkers$gene, 
        scale=FALSE, 
        center=TRUE, 
        PCs=list(c(1,2), c(2,3), c(3,4)), 
        ColorVariable="Condition",
        ColorValues=annotation_colors,
        pointSize=5,
        legend_nrow=1, 
        ncol=3, 
        nrow=NULL)

Comment

As demonstrated by the behaviour of individual genes in the LiteratureMarkers gene set, LMNB1, MKI67, and GLB1 appear to drive the overall signal. These genes consistently show higher performance metrics (e.g., Cohen’s d, AUC), strong expression changes between conditions, and LMNB1 and MKI67 specifically have correlated expression patterns. This underscores the importance of examining gene-level behaviour, as a strong overall signature score may reflect the influence of only a few informative genes, rather than coordinated activity across the entire set. In this case, the strong performance of the LiteratureMarkers set in scoring approaches is likely driven by these genes. However, relying heavily on a few markers can be a caveat: for example, MKI67 and LMNB1 (proliferation-related genes) may also change in other biological contexts like quiescence or differentiation, potentially limiting their specificity for senescence. Thus, the choice of gene set and analysis strategy should be guided by the research question, and complemented with both score distributions, enrichment analyses, and individual gene behaviour. Notably, scoring with just these three genes yielded results similar (or, even, slightly better) to the full LiteratureMarkers set.

GeneSets_Example_DrivingGenes <- list(LiteratureMarkers=genesets_example$LiteratureMarkers,
                                          LiteratureMarkers_subset=genesets_example$LiteratureMarkers[genesets_example$LiteratureMarkers$gene %in% c("LMNB1", "MKI67","GLB1"),])
print(GeneSets_Example_DrivingGenes)
## $LiteratureMarkers
##     gene enrichment
## 1 CDKN1A          1
## 2 CDKN2A          1
## 3   GLB1          1
## 4   TP53          1
## 5   CCL2          1
## 6  LMNB1         -1
## 7  MKI67         -1
## 
## $LiteratureMarkers_subset
##    gene enrichment
## 3  GLB1          1
## 6 LMNB1         -1
## 7 MKI67         -1
PlotScores(data = counts_example, 
           metadata = metadata_example, 
           gene_sets =  GeneSets_Example_DrivingGenes,
           ColorVariable = "SenescentType", 
           Variable="Condition",  
           method ="logmedian", 
           ColorValues = senescence_triggers_colors, 
           ConnectGroups=TRUE, 
           ncol = NULL, 
           nrow = NULL, 
           widthTitle=24, 
           limits = NULL, 
           legend_nrow = 1, 
           pointSize=4,
           compute_cohen=TRUE,
           cond_cohend=cond_cohend,
           title="Marthandan et al. 2016",
           labsize=9, 
           titlesize = 12)  

Session Information

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] markeR_0.99.0
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.18.5           gridExtra_2.3         rlang_1.1.6          
##   [4] magrittr_2.0.3        clue_0.3-66           GetoptLong_1.0.5     
##   [7] msigdbr_24.1.0        matrixStats_1.5.0     compiler_4.5.1       
##  [10] mgcv_1.9-3            png_0.1-8             systemfonts_1.2.3    
##  [13] vctrs_0.6.5           reshape2_1.4.4        stringr_1.5.1        
##  [16] pkgconfig_2.0.3       shape_1.4.6.1         crayon_1.5.3         
##  [19] fastmap_1.2.0         backports_1.5.0       labeling_0.4.3       
##  [22] effectsize_1.0.1      rmarkdown_2.29        ragg_1.4.0           
##  [25] purrr_1.0.4           xfun_0.52             cachem_1.1.0         
##  [28] jsonlite_2.0.0        BiocParallel_1.42.1   broom_1.0.8          
##  [31] parallel_4.5.1        cluster_2.1.8.1       R6_2.6.1             
##  [34] stringi_1.8.7         bslib_0.9.0           RColorBrewer_1.1-3   
##  [37] limma_3.64.1          car_3.1-3             jquerylib_0.1.4      
##  [40] Rcpp_1.0.14           assertthat_0.2.1      iterators_1.0.14     
##  [43] knitr_1.50            parameters_0.26.0     IRanges_2.42.0       
##  [46] splines_4.5.1         Matrix_1.7-3          tidyselect_1.2.1     
##  [49] abind_1.4-8           yaml_2.3.10           doParallel_1.0.17    
##  [52] codetools_0.2-20      curl_6.4.0            lattice_0.22-7       
##  [55] tibble_3.3.0          plyr_1.8.9            withr_3.0.2          
##  [58] bayestestR_0.16.0     evaluate_1.0.4        desc_1.4.3           
##  [61] circlize_0.4.16       pillar_1.10.2         ggpubr_0.6.1         
##  [64] carData_3.0-5         foreach_1.5.2         stats4_4.5.1         
##  [67] insight_1.3.0         generics_0.1.4        S4Vectors_0.46.0     
##  [70] ggplot2_3.5.2         scales_1.4.0          glue_1.8.0           
##  [73] tools_4.5.1           data.table_1.17.6     fgsea_1.34.0         
##  [76] locfit_1.5-9.12       ggsignif_0.6.4        babelgene_22.9       
##  [79] fs_1.6.6              fastmatch_1.1-6       cowplot_1.1.3        
##  [82] grid_4.5.1            tidyr_1.3.1           datawizard_1.1.0     
##  [85] edgeR_4.6.2           colorspace_2.1-1      nlme_3.1-168         
##  [88] Formula_1.2-5         cli_3.6.5             textshaping_1.0.1    
##  [91] ComplexHeatmap_2.24.1 dplyr_1.1.4           gtable_0.3.6         
##  [94] ggh4x_0.3.1           rstatix_0.7.2         sass_0.4.10          
##  [97] digest_0.6.37         BiocGenerics_0.54.0   ggrepel_0.9.6        
## [100] rjson_0.2.23          htmlwidgets_1.6.4     farver_2.1.2         
## [103] htmltools_0.5.8.1     pkgdown_2.1.3         lifecycle_1.0.4      
## [106] GlobalOptions_0.1.2   statmod_1.5.0