GSEA Variable Association — GSEA_VariableAssociation • markeR

This function assesses the association between gene expression (or another molecular score) and metadata variables using differential expression (DE) analysis and Gene Set Enrichment Analysis (GSEA). It generates all possible contrasts for categorical variables and uses linear modeling for continuous variables.

Usage

GSEA_VariableAssociation(
  data,
  metadata,
  cols,
  stat = NULL,
  mode = c("simple", "medium", "extensive"),
  gene_set,
  nonsignif_color = "grey",
  signif_color = "red",
  saturation_value = NULL,
  sig_threshold = 0.05,
  widthlabels = 18,
  labsize = 10,
  titlesize = 14,
  pointSize = 5,
  ignore_NAs = FALSE,
  printplt = TRUE
)

Arguments

data

A matrix or data frame containing gene expression data, where rows represent genes and columns represent samples.

metadata

A data frame containing sample metadata with at least one column corresponding to the variables of interest.

cols

A character vector specifying the metadata columns (variables) to analyse.

stat

Optional. The statistic to use for ranking genes before GSEA. If NULL, it is automatically determined based on the gene set:

"B" for gene sets with no known direction (vectors).
"t" for unidirectional or bidirectional gene sets (data frames).
If provided, this argument overrides the automatic selection.

mode

A string specifying the level of detail for contrasts. Options are:

"simple": Performs the minimal number of pairwise comparisons between individual group levels (e.g., A - B, A - C). Default.
"medium": Includes comparisons between one group and the union of all other groups (e.g., A - (B + C + D)), enabling broader contrasts beyond simple pairs.
"extensive": Allows for all possible algebraic combinations of group levels (e.g., (A + B) - (C + D)), supporting flexible and complex contrast definitions.

gene_set

A named list defining the gene sets for GSEA. (Required)

If using unidirectional gene sets, provide a list where each element is a vector of gene names representing a signature.
If using bidirectional gene sets, provide a list where each element is a data frame:
The first column should contain gene names.
The second column should indicate the expected direction of enrichment (1 for upregulated, -1 for downregulated).

nonsignif_color

A string specifying the color for the middle of the adjusted p-value gradient. Default is "white". Lower limit correspond to the value of sig_threshold.

signif_color

A string specifying the color for the low end of the adjusted p-value gradient until the value chosen for significance (sig_threshold). Default is "red".

saturation_value

A numeric value specifying the lower limit of the adjusted p-value gradient, below which the color will correspond to signif_color. Default is the results' minimum, unless that value is above the sig_threshold; in that case, it is 0.001.

sig_threshold

A numeric value specifying the threshold for significance visualization in the plot. Default: 0.05.

widthlabels

An integer controlling the maximum width of contrast labels before text wrapping. Default: 18.

labsize

An integer controlling the axis text size in the plot. Default: 10.

titlesize

An integer specifying the plot title size. Default: 14.

pointSize

Numeric. The size of points in the lollipop plot (default is 5).

ignore_NAs

Boolean (default: FALSE). Whether to ignore NAs in the metadata when fitting the linear model. If TRUE, rows with any NAs will be removed before analysis, leading to a loss of data to be fitted in the model.

printplt

Boolean specifying if plot is to be printed. Default: TRUE.

Value

A list with two elements:

data: A data frame containing the GSEA results, including normalized enrichment scores (NES), adjusted p-values, and contrasts.
plot: A ggplot2 object visualizing the GSEA results as a lollipop plot.

Examples

# Simulate gene expression data (genes as rows, samples as columns)
set.seed(42)
expr <- as.data.frame(matrix(rnorm(500), nrow = 50, ncol = 10))
rownames(expr) <- paste0("Gene", 1:50)
colnames(expr) <- paste0("Sample", 1:10)

# Simulate metadata (categorical and continuous)
metadata <- data.frame(
  sampleID = paste0("Sample", 1:10),
  Group = rep(c("A", "B"), each = 5),
  Age = sample(20:60, 10),
  row.names = colnames(expr)
)

# Define a toy gene set: one gene set only for discovery mode!
gene_set <- list(
  Signature1 = paste0("Gene", 1:10)
)

# Score-based association (e.g., logmedian)
res_score <- VariableAssociation(
  method = "logmedian",
  data = expr,
  metadata = metadata,
  cols = c("Group", "Age"),
  gene_set = gene_set
)
#> Considering unidirectional gene signature mode for signature Signature1
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: NaNs produced
#> Warning: no non-missing arguments to min; returning Inf
#> `geom_smooth()` using formula = 'y ~ x'

print(res_score$Overall)
#>   Variable   Cohen_f   P_Value
#> 1    Group 0.4754035 0.2156171
#> 2      Age 0.0233770 0.9489047
print(res_score$plot)


# GSEA-based association (if GSEA_VariableAssociation is available)
# res_gsea <- VariableAssociation(
#   method = "GSEA",
#   data = expr,
#   metadata = metadata,
#   cols = "Group",
#   gene_set = gene_set
# )
# print(res_gsea$data)
print(res_score$plot)