Subclustering Analysis (Optional)
Overview
Subclustering analysis examines a specific cell population in greater detail. After an initial CASSIA annotation, this feature annotates the subclusters of a chosen population, such as T cells or fibroblasts.
Quick Start

    import CASSIA

    # subcluster_results: marker DataFrame or file path for the subclusters
    CASSIA.runCASSIA_subclusters(
        marker=subcluster_results,
        major_cluster_info="cd8 t cell",
        output_name="subclustering_results",
        model="anthropic/claude-sonnet-4.6",
        provider="openrouter",
        tissue="lung",
        species="human"
    )

Input
Workflow Summary
- Initial CASSIA analysis on full dataset
- Subcluster extraction and processing (using Seurat or Scanpy)
- Marker identification for subclusters
- CASSIA subclustering analysis
- Uncertainty assessment (optional)
Required Input
- Marker genes: DataFrame or file path containing marker genes for each subcluster (output from FindAllMarkers in Seurat or sc.tl.rank_genes_groups in Scanpy)
We recommend running the default CASSIA annotation on the full dataset first. Then subcluster the target cluster with a standard pipeline (Seurat or Scanpy) and compute marker genes for the resulting subclusters.
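The Scanpy route for the subclustering and marker steps could look roughly like the sketch below. It is not CASSIA's own code: the function name `subcluster_markers`, the `obs` column name used in the usage comment, and the resolution value are assumptions for illustration, and `adata` is assumed to be already normalized and log-transformed. Leiden clustering also requires the `leidenalg` package.

```python
def subcluster_markers(adata, cluster_key, target_cluster, resolution=0.5):
    """Subcluster one annotated population and return its marker table.

    adata          : AnnData object, assumed normalized and log-transformed
    cluster_key    : obs column holding the initial annotation (hypothetical name)
    target_cluster : label of the population to subcluster
    """
    # Imported lazily so this sketch can be loaded without Scanpy installed.
    import scanpy as sc

    # Keep only the cells of the target population.
    sub = adata[adata.obs[cluster_key] == target_cluster].copy()

    # Recompute the embedding and neighbor graph on the subset, then recluster.
    sc.pp.pca(sub)
    sc.pp.neighbors(sub)
    sc.tl.leiden(sub, resolution=resolution, key_added="subcluster")

    # Rank genes per subcluster and return them as one long DataFrame.
    sc.tl.rank_genes_groups(sub, groupby="subcluster", method="wilcoxon")
    return sc.get.rank_genes_groups_df(sub, group=None)


# Usage sketch (the obs column name "cassia_annotation" is hypothetical):
# markers = subcluster_markers(adata, "cassia_annotation", "cd8 t cell")
# CASSIA.runCASSIA_subclusters(marker=markers, major_cluster_info="cd8 t cell", ...)
```

Whether the returned DataFrame can be passed to CASSIA unchanged depends on the column layout CASSIA expects, so some renaming may be needed.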
Parameters
Required Parameters
| Parameter | Description |
|---|---|
| marker | Marker genes for the subclusters (DataFrame or file path) |
| major_cluster_info | Description of the parent cluster or context (e.g., "CD8+ T cells" or "cd8 t cell mixed with other celltypes") |
| output_name | Base name for the output CSV file |
| model | LLM model to use |
| provider | API provider |
Optional Parameters
| Parameter | Default | Description |
|---|---|---|
| temperature | 0 | Sampling temperature (0-1) |
| n_genes | 50 | Number of top marker genes to use |
| tissue | None | Tissue type being analyzed (e.g., "lung", "brain") |
| species | None | Species being analyzed (e.g., "human", "mouse") |
| use_reference | False | Retrieve expert subtype references and inject them into the subclustering prompt |
| reference_model | None | Model used for reference selection (defaults to the provider's fast model) |
| reference_cell_type_hint | None | Parent lineage hint for reference selection (e.g., "macrophage") |
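To make the effect of n_genes concrete, the sketch below keeps the top N genes per subcluster from a marker table. The ranking criterion (descending log fold change), the column names, and the toy values are assumptions for illustration only; the criterion CASSIA applies internally is not stated here and may differ.

```python
from collections import defaultdict


def top_markers_per_cluster(rows, n_genes=50):
    """Return {cluster: [gene, ...]} with at most n_genes genes per cluster.

    rows: iterable of dicts with the hypothetical keys
          "cluster", "gene", and "avg_log2FC".
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)

    top = {}
    for cluster, cluster_rows in by_cluster.items():
        # Assumed ranking: strongest positive fold change first.
        ranked = sorted(cluster_rows, key=lambda r: r["avg_log2FC"], reverse=True)
        top[cluster] = [r["gene"] for r in ranked[:n_genes]]
    return top


# Toy marker rows with made-up fold changes, not real results.
rows = [
    {"cluster": "0", "gene": "GZMK", "avg_log2FC": 2.1},
    {"cluster": "0", "gene": "CCL5", "avg_log2FC": 1.4},
    {"cluster": "0", "gene": "CD8A", "avg_log2FC": 0.9},
    {"cluster": "1", "gene": "HAVCR2", "avg_log2FC": 2.6},
    {"cluster": "1", "gene": "PDCD1", "avg_log2FC": 1.8},
]
print(top_markers_per_cluster(rows, n_genes=2))
```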
Reference-Assisted Subclustering
Reference mode is useful for difficult subtype problems such as macrophage, fibroblast, or immune-cell subclustering.
When enabled, the reference agent first reads the lineage overview/router and all subcluster marker sets together, then reads only the detailed reference documents needed for this run. It injects a compact reference brief with objective paper facts, cluster-specific guidance, and cross-cluster subtype distinctions rather than pasting the whole reference library.
For example:

    CASSIA.runCASSIA_subclusters(
        marker=macrophage_subcluster_results,
        major_cluster_info="human tumor macrophage",
        output_name="macrophage_subclustering_reference",
        model="moonshotai/kimi-k2.6",
        provider="openrouter",
        tissue="tumor",
        species="human",
        use_reference=True,
        reference_model="moonshotai/kimi-k2.6",
        reference_cell_type_hint="macrophage"
    )

To record token usage and provider cost for a run:

    CASSIA.reset_llm_usage_log()
    CASSIA.runCASSIA_subclusters(..., use_reference=True)
    usage = CASSIA.get_llm_usage_summary(reset=True)
    print(usage)

For OpenRouter, CASSIA records the usage.cost value returned by the API.
Uncertainty Assessment Functions
To gauge how stable the annotations are, run several iterations and calculate consistency scores (CS) across them:
runCASSIA_n_subcluster() - Run multiple annotation iterations:

    CASSIA.runCASSIA_n_subcluster(
        n=5,
        marker=subcluster_results,
        major_cluster_info="cd8 t cell",
        base_output_name="subclustering_results_n",
        model="anthropic/claude-sonnet-4.6",
        temperature=0,
        provider="openrouter",
        max_workers=5,
        n_genes=50,
        tissue="lung",
        species="human"
    )

| Parameter | Description |
|---|---|
| n | Number of iterations to run |
| base_output_name | Base name for output files (appended with iteration number) |
| max_workers | Number of parallel workers |
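Because each iteration writes its own file, the glob passed as file_pattern to the similarity step has to match every one of them. The sketch below checks a pattern against the expected file names before running the scoring. The naming scheme `<base_output_name>_<iteration>.csv` with iterations starting at 1 is an assumption inferred from the examples in this section, not a documented guarantee, and `iteration_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def iteration_files(base_output_name, n):
    """List the file names assumed to be written by n iterations.

    Assumed scheme: "<base_output_name>_<iteration>.csv", iterations from 1.
    """
    return [f"{base_output_name}_{i}.csv" for i in range(1, n + 1)]


files = iteration_files("subclustering_results_n", 5)
pattern = "subclustering_results_n_*.csv"

# Keep only the names the glob pattern would pick up.
matched = [name for name in files if fnmatchcase(name, pattern)]
print(f"{len(matched)} of {len(files)} iteration files match {pattern!r}")
```

If fewer files match than iterations were run, the pattern or the base name is likely misspelled.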
runCASSIA_similarity_score_batch() - Calculate similarity scores across iterations:

    CASSIA.runCASSIA_similarity_score_batch(
        marker=subcluster_results,
        file_pattern="subclustering_results_n_*.csv",
        output_name="subclustering_uncertainty",
        max_workers=6,
        model="anthropic/claude-sonnet-4.6",
        provider="openrouter",
        main_weight=0.5,
        sub_weight=0.5
    )

| Parameter | Default | Description |
|---|---|---|
| file_pattern | - | Glob pattern matching iteration result files |
| main_weight | 0.5 | Weight for main cell type similarity |
| sub_weight | 0.5 | Weight for subtype similarity |
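The table does not spell out how the two weights are combined. One plausible reading, assumed here, is a weighted sum of a main-type similarity and a subtype similarity, each between 0 and 1. The function below is a hypothetical illustration of that reading, not CASSIA's implementation.

```python
def combined_similarity(main_sim, sub_sim, main_weight=0.5, sub_weight=0.5):
    """Weighted sum of two similarity values (assumed to lie in [0, 1])."""
    return main_weight * main_sim + sub_weight * sub_sim


# Iterations agree on the main cell type but only half agree on the subtype.
print(combined_similarity(main_sim=1.0, sub_sim=0.5))

# Emphasizing the main type makes subtype disagreement count for less.
print(combined_similarity(main_sim=1.0, sub_sim=0.5, main_weight=0.8, sub_weight=0.2))
```

Under this reading, weights that sum to 1 keep the combined score on the same 0 to 1 scale as its inputs.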
Output
| File | Description |
|---|---|
| {output_name}.csv | Basic CASSIA analysis results |
| {output_name}.html | HTML report with visualizations |
| {output_name}_uncertainty.csv | Similarity scores (if uncertainty assessment is performed) |
Automatic Report Generation: An HTML report is automatically generated alongside the CSV output for easy visualization of subclustering results.
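After a run, a quick check that the expected files exist can catch a wrong output_name or working directory early. The helpers below are hypothetical and only build the file names listed in the table above. That the files land in the given directory is an assumption.

```python
from pathlib import Path


def expected_outputs(output_name, directory=".", with_uncertainty=False):
    """Build the output paths listed in the table for a given output_name."""
    suffixes = [".csv", ".html"]
    if with_uncertainty:
        suffixes.append("_uncertainty.csv")
    return [Path(directory) / f"{output_name}{suffix}" for suffix in suffixes]


def missing_outputs(output_name, directory=".", with_uncertainty=False):
    """Return the expected output paths that do not exist on disk."""
    paths = expected_outputs(output_name, directory, with_uncertainty)
    return [path for path in paths if not path.exists()]


# Report which expected files are absent from the current directory.
for path in missing_outputs("subclustering_results"):
    print(f"missing: {path}")
```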