Subclustering Analysis (Optional)

Overview

Subclustering analysis is a powerful technique for studying specific cell populations in greater detail. This feature allows you to analyze subclustered populations, such as T cells or fibroblasts, using Cassia and Seurat.

Quick Start

runCASSIA_subclusters(
    marker = marker_sub,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results",
    model = "anthropic/claude-sonnet-4.5",
    provider = "openrouter"
)
R

Input

Prerequisites

A Seurat object containing your single-cell data
The Cassia package installed and loaded
Basic familiarity with R and single-cell analysis

Workflow Summary

Initial Cassia analysis on full dataset
Subcluster extraction and processing
Marker identification for subclusters
Cassia subclustering analysis
Uncertainty assessment (optional)

Preparing Subclusters

First, run the default Cassia pipeline on your complete dataset to identify major cell populations. Then extract and process your target cluster using Seurat:

# Extract target population (example using CD8+ T cells)
cd8_cells <- subset(large, cell_ontology_class == "cd8-positive, alpha-beta t cell")

# Normalize data
cd8_cells <- NormalizeData(cd8_cells)

# Identify variable features
cd8_cells <- FindVariableFeatures(cd8_cells,
    selection.method = "vst",
    nfeatures = 2000)

# Scale data
all.genes <- rownames(cd8_cells)
cd8_cells <- ScaleData(cd8_cells, features = all.genes)

# Run PCA
cd8_cells <- RunPCA(cd8_cells,
    features = VariableFeatures(object = cd8_cells),
    npcs = 30)

# Perform clustering
cd8_cells <- FindNeighbors(cd8_cells, dims = 1:20)
cd8_cells <- FindClusters(cd8_cells, resolution = 0.3)

# Generate UMAP visualization
cd8_cells <- RunUMAP(cd8_cells, dims = 1:20)
R

Marker Identification

Identify markers for each subcluster:

# Find markers
cd8_markers <- FindAllMarkers(cd8_cells,
    only.pos = TRUE,
    min.pct = 0.1,
    logfc.threshold = 0.25)

# Filter significant markers
cd8_markers <- cd8_markers %>% filter(p_val_adj < 0.05)

# Save results
write.csv(cd8_markers, "cd8_subcluster_markers.csv")
R

Parameters

Required Parameters

Parameter	Description
`marker`	Marker genes for the subclusters (data frame or file path)
`major_cluster_info`	Description of the parent cluster or context (e.g., "CD8+ T cells" or "cd8 t cell mixed with other celltypes")
`output_name`	Base name for the output CSV file
`model`	LLM model to use
`provider`	API provider

Optional Parameters

Parameter	Default	Description
`temperature`	0	Sampling temperature (0-1)
`n_genes`	50	Number of top marker genes to use

Example: Mixed Populations

# For mixed populations
runCASSIA_subclusters(
    marker = marker_sub,
    major_cluster_info = "cd8 t cell mixed with other celltypes",
    output_name = "subclustering_results2",
    model = "anthropic/claude-sonnet-4.5",
    provider = "openrouter"
)
R

Uncertainty Assessment Functions

For more confident results, calculate CS scores using multiple iterations:

runCASSIA_n_subcluster() - Run multiple annotation iterations:

runCASSIA_n_subcluster(
    n = 5,
    marker = marker_sub,
    major_cluster_info = "cd8 t cell",
    base_output_name = "subclustering_results_n",
    model = "anthropic/claude-sonnet-4.5",
    temperature = 0,
    provider = "openrouter",
    max_workers = 5,
    n_genes = 50
)
R

Parameter	Description
`n`	Number of iterations to run
`base_output_name`	Base name for output files (appended with iteration number)
`max_workers`	Number of parallel workers

runCASSIA_similarity_score_batch() - Calculate similarity scores across iterations:

similarity_scores <- runCASSIA_similarity_score_batch(
    marker = marker_sub,
    file_pattern = "subclustering_results_n_*.csv",
    output_name = "subclustering_uncertainty",
    max_workers = 6,
    model = "anthropic/claude-sonnet-4.5",
    provider = "openrouter",
    main_weight = 0.5,
    sub_weight = 0.5
)
R

Parameter	Default	Description
`file_pattern`	-	Glob pattern matching iteration result files
`main_weight`	0.5	Weight for main cell type similarity
`sub_weight`	0.5	Weight for subtype similarity

Output

File	Description
`cd8_subcluster_markers.csv`	Marker genes for each subcluster
`{output_name}.csv`	Basic Cassia analysis results
`{output_name}.html`	HTML report with visualizations
`{output_name}_uncertainty.csv`	Similarity scores (if uncertainty assessment is performed)

Automatic Report Generation: An HTML report is automatically generated alongside the CSV output for easy visualization of subclustering results.

Tips and Recommendations

Always run the default Cassia analysis first before subclustering
Adjust clustering resolution based on your data's complexity
When dealing with mixed populations, specify this in the major_cluster_info parameter
Use the uncertainty assessment for more robust results