Subclustering Analysis (Optional)

Overview

Subclustering analysis studies a specific cell population in greater detail. After the initial CASSIA annotation, this feature lets you annotate the subclusters of a chosen population, such as T cells or fibroblasts.

Quick Start

import CASSIA

# subcluster_results: marker genes for each subcluster (DataFrame or file path)
CASSIA.runCASSIA_subclusters(
    marker = subcluster_results,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results",
    model = "anthropic/claude-sonnet-4.6",
    provider = "openrouter",
    tissue = "lung",
    species = "human"
)

Input

Workflow Summary

  1. Initial CASSIA analysis on full dataset
  2. Subcluster extraction and processing (using Seurat or Scanpy)
  3. Marker identification for subclusters
  4. CASSIA subclustering analysis
  5. Uncertainty assessment (optional)
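Steps 2 and 3 of the workflow above could look roughly like the following with Scanpy. This is a minimal sketch, not CASSIA's own code: the function name `subcluster_markers`, the `cluster_key` column, the `"subcluster"` key, and the `resolution` value are all illustrative assumptions, and the input is assumed to be a normalized, log-transformed AnnData object.

```python
def subcluster_markers(adata, cluster_key, target_cluster, resolution=0.5):
    """Re-cluster one annotated cluster and rank marker genes per subcluster.

    Assumes `adata` is a normalized, log-transformed AnnData object and that
    `adata.obs[cluster_key]` holds the labels from the initial annotation.
    """
    import scanpy as sc  # imported lazily; scanpy is only needed when this runs

    # Step 2: extract the target cluster and re-cluster it on its own.
    sub = adata[adata.obs[cluster_key] == target_cluster].copy()
    sc.pp.pca(sub)
    sc.pp.neighbors(sub)
    sc.tl.leiden(sub, resolution=resolution, key_added="subcluster")

    # Step 3: rank marker genes for each subcluster.
    sc.tl.rank_genes_groups(sub, groupby="subcluster", method="wilcoxon")
    markers = sc.get.rank_genes_groups_df(sub, group=None)
    return sub, markers
```

The returned `markers` table is the kind of input described under Required Input; whether it can be passed to `marker` unchanged depends on the column format CASSIA expects, which should be checked against your installed version.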

Required Input

  • Marker genes: DataFrame or file path containing marker genes for each subcluster (output from FindAllMarkers in Seurat or sc.tl.rank_genes_groups in Scanpy)

We recommend running the default CASSIA annotation first, then subclustering the target cluster with a standard pipeline (Seurat or Scanpy) and computing marker genes for the resulting subclusters.

Parameters

Required Parameters

Parameter | Description
marker | Marker genes for the subclusters (DataFrame or file path)
major_cluster_info | Description of the parent cluster or context (e.g., "CD8+ T cells" or "cd8 t cell mixed with other celltypes")
output_name | Base name for the output CSV file
model | LLM model to use
provider | API provider

Optional Parameters

Parameter | Default | Description
temperature | 0 | Sampling temperature (0-1)
n_genes | 50 | Number of top marker genes to use
tissue | None | Tissue type being analyzed (e.g., "lung", "brain")
species | None | Species being analyzed (e.g., "human", "mouse")
use_reference | False | Retrieve expert subtype references and inject them into the subclustering prompt
reference_model | None | Model used for reference selection (defaults to the provider's fast model)
reference_cell_type_hint | None | Parent lineage hint for reference selection (e.g., "macrophage")
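To make the effect of `n_genes` concrete, here is a small sketch of keeping the top N marker genes per subcluster. It assumes Seurat-style column names (`cluster`, `gene`, `avg_log2FC`) and ranking by log fold change; how CASSIA ranks genes internally is not stated here, so treat the ranking rule, the helper name `top_markers`, and the toy rows as illustrative only.

```python
from collections import defaultdict


def top_markers(rows, n_genes=50):
    """Return {cluster: [gene, ...]} with at most n_genes genes per cluster."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    result = {}
    for cluster, items in by_cluster.items():
        # Assumed ranking rule: highest average log2 fold change first.
        ranked = sorted(items, key=lambda r: r["avg_log2FC"], reverse=True)
        result[cluster] = [r["gene"] for r in ranked[:n_genes]]
    return result


# Toy marker rows for two subclusters (made-up values for illustration).
rows = [
    {"cluster": "0", "gene": "GZMK", "avg_log2FC": 2.1},
    {"cluster": "0", "gene": "CCL5", "avg_log2FC": 1.4},
    {"cluster": "0", "gene": "NKG7", "avg_log2FC": 1.9},
    {"cluster": "1", "gene": "SELL", "avg_log2FC": 1.7},
    {"cluster": "1", "gene": "CCR7", "avg_log2FC": 2.3},
]
top2 = top_markers(rows, n_genes=2)
print(top2)
```

A smaller `n_genes` keeps the prompt focused on the strongest markers, while a larger value gives the model more context per subcluster.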

Reference-Assisted Subclustering

Reference mode is useful for difficult subtype problems such as macrophage, fibroblast, or immune-cell subclustering.

When enabled, the reference agent first reads the lineage overview/router together with all subcluster marker sets, then reads only the detailed reference documents needed for this run. Rather than pasting in the whole reference library, it injects a compact reference brief containing objective paper facts, cluster-specific guidance, and cross-cluster subtype distinctions. For example:

CASSIA.runCASSIA_subclusters(
    marker = macrophage_subcluster_results,
    major_cluster_info = "human tumor macrophage",
    output_name = "macrophage_subclustering_reference",
    model = "moonshotai/kimi-k2.6",
    provider = "openrouter",
    tissue = "tumor",
    species = "human",
    use_reference = True,
    reference_model = "moonshotai/kimi-k2.6",
    reference_cell_type_hint = "macrophage"
)

To record token usage and provider cost for a run:

CASSIA.reset_llm_usage_log()
CASSIA.runCASSIA_subclusters(..., use_reference = True)
usage = CASSIA.get_llm_usage_summary(reset = True)
print(usage)

For OpenRouter, CASSIA records the usage.cost value returned by the API.

Uncertainty Assessment Functions

For more confident results, calculate consistency scores (CS) using multiple iterations:

runCASSIA_n_subcluster() - Run multiple annotation iterations:

CASSIA.runCASSIA_n_subcluster(
    n=5,
    marker=subcluster_results,
    major_cluster_info="cd8 t cell",
    base_output_name="subclustering_results_n",
    model="anthropic/claude-sonnet-4.6",
    temperature=0,
    provider="openrouter",
    max_workers=5,
    n_genes=50,
    tissue="lung",
    species="human"
)
Parameter | Description
n | Number of iterations to run
base_output_name | Base name for output files (appended with iteration number)
max_workers | Number of parallel workers
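The iteration files written here are later picked up by a glob pattern such as "subclustering_results_n_*.csv". The sketch below checks that a pattern matches the files you expect before running the similarity step. The exact naming scheme (base name, underscore, 1-based iteration number) is an assumption for illustration; confirm it against the files your run actually produces.

```python
from fnmatch import fnmatch

base_output_name = "subclustering_results_n"
n = 5

# Hypothetical naming scheme: base name, underscore, 1-based iteration number.
expected_files = [f"{base_output_name}_{i}.csv" for i in range(1, n + 1)]

file_pattern = "subclustering_results_n_*.csv"
matched = [name for name in expected_files if fnmatch(name, file_pattern)]
print(f"{len(matched)} of {len(expected_files)} expected files match the pattern")

# A single-run output with a different base name should not be picked up.
single_run_matches = fnmatch("subclustering_results.csv", file_pattern)
print("single-run file matches:", single_run_matches)
```

Choosing a `base_output_name` that is distinct from the single-run `output_name` keeps the pattern from sweeping in unrelated result files.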

runCASSIA_similarity_score_batch() - Calculate similarity scores across iterations:

CASSIA.runCASSIA_similarity_score_batch(
    marker = subcluster_results,
    file_pattern = "subclustering_results_n_*.csv",
    output_name = "subclustering_uncertainty",
    max_workers = 6,
    model = "anthropic/claude-sonnet-4.6",
    provider = "openrouter",
    main_weight = 0.5,
    sub_weight = 0.5
)
Parameter | Default | Description
file_pattern | - | Glob pattern matching iteration result files
main_weight | 0.5 | Weight for main cell type similarity
sub_weight | 0.5 | Weight for subtype similarity
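One way to read the two weights is as a linear combination of a main cell type similarity and a subtype similarity. The formula CASSIA actually uses is not given here, so the function below is only an assumed illustration of how such weights typically interact; `weighted_similarity` and the input values are hypothetical.

```python
def weighted_similarity(main_sim, sub_sim, main_weight=0.5, sub_weight=0.5):
    """Combine two similarity values in [0, 1] using the given weights.

    Assumed form: main_weight * main_sim + sub_weight * sub_sim.
    """
    return main_weight * main_sim + sub_weight * sub_sim


# Iterations agree fully on the main cell type but only half the time on the subtype.
balanced = weighted_similarity(main_sim=1.0, sub_sim=0.5)

# Shifting weight toward the subtype penalizes subtype disagreement more strongly.
subtype_heavy = weighted_similarity(1.0, 0.5, main_weight=0.3, sub_weight=0.7)

print(balanced, subtype_heavy)
```

Under this reading, weights that sum to 1 keep the combined score on the same 0 to 1 scale as its inputs.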

Output

File | Description
{output_name}.csv | Basic CASSIA analysis results
{output_name}.html | HTML report with visualizations
{output_name}_uncertainty.csv | Similarity scores (if uncertainty assessment is performed)

Automatic Report Generation: An HTML report is automatically generated alongside the CSV output for easy visualization of subclustering results.