This function evaluates the association between gene expression scores and metadata variables. It fits linear models to obtain Cohen's f for each variable and, for categorical variables, uses contrast-based comparisons between levels to compute Cohen's d. It also generates plots summarizing the results.

Usage

Score_VariableAssociation(
  data,
  metadata,
  cols,
  method = c("logmedian", "ssGSEA", "ranking"),
  gene_set,
  mode = c("simple", "medium", "extensive"),
  nonsignif_color = "grey",
  signif_color = "red",
  saturation_value = NULL,
  sig_threshold = 0.05,
  widthlabels = 18,
  labsize = 10,
  title = NULL,
  titlesize = 14,
  pointSize = 5,
  discrete_colors = NULL,
  continuous_color = "#8C6D03",
  color_palette = "Set2",
  printplt = TRUE
)

Arguments

data

A data frame or matrix containing gene expression data.

metadata

A data frame containing sample metadata with at least one column corresponding to the variables of interest.

cols

A character vector specifying metadata columns to analyse.

method

A character string specifying the scoring method ("logmedian", "ssGSEA", or "ranking"). Three methods are available:

  • ssGSEA: Uses the single-sample Gene Set Enrichment Analysis (ssGSEA) method to compute an enrichment score for each signature in each sample, using an adaptation of the gsva() function from the GSVA package.

  • logmedian: Computes the score as the sum of the normalized (log2-median-centered) expression values of the signature genes divided by the number of genes in the signature.

  • ranking: Computes gene signature scores for each sample by ranking the expression of signature genes in the dataset and normalizing the score by the total number of genes.

gene_set

A named list containing one gene set for scoring.

mode

A character string specifying the contrast generation method ("simple", "medium", or "extensive").

nonsignif_color

A string specifying the color for non-significant results. Default: "grey".

signif_color

A string specifying the color for significant results. Default: "red".

saturation_value

A numeric value for color saturation threshold. Default: NULL (auto-determined).

sig_threshold

A numeric value specifying the significance threshold. Default: 0.05.

widthlabels

An integer controlling contrast label wrapping. Default: 18.

labsize

An integer controlling axis text size. Default: 10.

title

A string specifying the plot title. Default: NULL.

titlesize

An integer specifying the title size. Default: 14.

pointSize

A numeric value for point size in plots. Default: 5.

discrete_colors

A named list mapping categorical variable levels to colors. Each element should be a named vector where names correspond to factor levels. Default: NULL.

continuous_color

A string specifying the color for continuous variables. Default: "#8C6D03".

color_palette

A string specifying the color palette for discrete variables. Default: "Set2".

printplt

A logical value specifying whether the plot is printed. Default: TRUE.
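As a rough illustration of the logmedian and ranking scores described for the method argument, the base R sketch below computes both on a toy matrix. It is not the package's implementation: the per-gene centering across samples, the within-sample ranking, and the names expr and signature are assumptions made for this example only.

```r
# Toy expression matrix: 6 genes (rows) x 4 samples (columns), strictly positive
expr <- matrix(
  c( 10,  12,   8,  11,
     20,  25,  15,  30,
      5,   4,   6,   5,
    100,  90, 110,  95,
      1,   2,   1,   3,
     50,  55,  45,  60),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:6), paste0("Sample", 1:4))
)
signature <- c("Gene1", "Gene2", "Gene5")  # hypothetical signature

# logmedian-style score (assumption: after a log2 transform, each gene is
# centered on its own median across samples)
log_expr <- log2(expr)
centered <- sweep(log_expr, 1, apply(log_expr, 1, median), "-")
logmedian_score <- colSums(centered[signature, , drop = FALSE]) / length(signature)

# ranking-style score (assumption: genes are ranked within each sample, and the
# summed ranks of the signature genes are divided by the total number of genes)
ranks <- apply(expr, 2, rank)
ranking_score <- colSums(ranks[signature, , drop = FALSE]) / nrow(expr)

logmedian_score
ranking_score
```

Both vectors hold one score per sample, which is the quantity that is then tested against each metadata variable.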

Value

A list with:

  • Overall: Data frame of effect sizes and p-values for each contrasted phenotypic variable.

  • Contrasts: Data frame of Cohen’s d and adjusted p-values for contrasts between levels of categorical variables, with the resolution of contrasts determined by the mode parameter.

  • plot: A combined visualization with three main panels: (1) lollipop plots of Cohen’s f for each variable of interest, (2) distribution plots of the score by variable (density or scatter depending on variable type), and (3, if applicable) lollipop plots of Cohen’s d for contrasts in categorical variables.

  • plot_contrasts: Lollipop plots of Cohen’s d effect sizes for contrasts between levels of non-numerical variables (if applicable), colored by BH-adjusted p-value.

  • plot_overall: Lollipop plot showing Cohen’s f effect sizes for each variable, colored by p-value.

  • plot_distributions: List of density or scatter plots of the score across variable levels, depending on variable type.
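To make the effect sizes reported in Overall and Contrasts more concrete, the following base R sketch computes Cohen's f from a linear model, Cohen's d for a two-level contrast, and a BH adjustment. It uses the textbook definitions and simulated data; the function's internal computation may differ, and the variable names and p-values are hypothetical.

```r
set.seed(1)
# Simulated scores for two groups of 20 samples each
score <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))
group <- factor(rep(c("A", "B"), each = 20))

# Cohen's f from the proportion of variance explained by a linear model:
# f = sqrt(R^2 / (1 - R^2))
fit <- lm(score ~ group)
r2 <- summary(fit)$r.squared
cohens_f <- sqrt(r2 / (1 - r2))

# Cohen's d for the contrast A vs B, using the pooled standard deviation
n <- tapply(score, group, length)
m <- tapply(score, group, mean)
v <- tapply(score, group, var)
pooled_sd <- sqrt(sum((n - 1) * v) / (sum(n) - length(n)))
cohens_d <- unname((m["A"] - m["B"]) / pooled_sd)

# Benjamini-Hochberg adjustment for a vector of hypothetical contrast p-values
p_adj <- p.adjust(c(0.01, 0.04, 0.20), method = "BH")

c(f = cohens_f, d = cohens_d)
p_adj
```

Cohen's f summarizes how much of the score's variance a variable explains overall, while Cohen's d is signed and describes a single contrast between levels, which is why the two are reported in separate tables.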

Examples

data <- as.data.frame(abs(matrix(rnorm(1000), ncol = 10)))
rownames(data) <- paste0("Gene", 1:100)  # Name rows as Gene1, Gene2, ..., Gene100
colnames(data) <- paste0("Sample", 1:10)  # Name columns as Sample1, Sample2, ..., Sample10

metadata <- data.frame(
  sample = colnames(data),  # Sample IDs match the column names of the data
  Condition = rep(c("A", "B"), each = 5)  # Two conditions (A and B), five samples each
)
gene_set <- list(SampleSet = c("Gene1", "Gene2", "Gene3"))
results <- Score_VariableAssociation(data, metadata, cols = "Condition", gene_set = gene_set)
#> Considering unidirectional gene signature mode for signature SampleSet
#> Warning: no non-missing arguments to min; returning Inf

print(results$plot)