Batch Processing

Overview

Batch processing allows you to annotate all clusters in your single-cell dataset in one run. CASSIA processes multiple clusters in parallel, generating cell type predictions with detailed reasoning for each cluster.


Quick Start

runCASSIA_batch(
    marker = markers,
    output_name = "my_annotation",
    model = "anthropic/claude-sonnet-4.5",
    tissue = "brain",
    species = "human",
    provider = "openrouter"
)

Input

Marker Data Formats

CASSIA accepts three input formats:

1. Seurat FindAllMarkers Output (Recommended)

Standard output from Seurat's FindAllMarkers function with differential expression statistics:

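| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
|-------|------------|-------|-------|-----------|---------|------|
| 0 | 3.02 | 0.973 | 0.152 | 0 | 0 | CD79A |
| 0 | 2.74 | 0.938 | 0.125 | 0 | 0 | MS4A1 |
| 0 | 2.54 | 0.935 | 0.138 | 0 | 0 | CD79B |
| 0 | 1.89 | 0.812 | 0.089 | 0 | 1 | IL7R |
| 0 | 1.76 | 0.756 | 0.112 | 0 | 1 | CCR7 |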

2. Scanpy rank_genes_groups Output

Output from Scanpy's sc.tl.rank_genes_groups() function, typically exported using sc.get.rank_genes_groups_df():

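| group | names | scores | pvals | pvals_adj | logfoldchanges |
|-------|-------|--------|-------|-----------|----------------|
| 0 | CD79A | 28.53 | 0 | 0 | 3.02 |
| 0 | MS4A1 | 25.41 | 0 | 0 | 2.74 |
| 0 | CD79B | 24.89 | 0 | 0 | 2.54 |
| 1 | IL7R | 22.15 | 0 | 0 | 1.89 |
| 1 | CCR7 | 20.87 | 0 | 0 | 1.76 |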

3. Simplified Format

A two-column data frame with cluster ID and comma-separated marker genes:

| cluster | marker_genes |
|---------|--------------|
| 0 | CD79A,MS4A1,CD79B,HLA-DRA,TCL1A |
| 1 | IL7R,CCR7,LEF1,TCF7,FHIT,MAL |
| 2 | CD8A,CD8B,GZMK,CCL5,NKG7 |
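
If you already have a long-format marker table and want the simplified format, the conversion can be sketched in base R as below. This is an illustration only: the toy table reuses a few values from the Seurat example above, and the object names `markers_full` and `markers_simple` are hypothetical.

```r
# Toy FindAllMarkers-style table (a few rows copied from the Seurat example).
markers_full <- data.frame(
  cluster = c(0, 0, 0, 1, 1),
  gene = c("CD79A", "MS4A1", "CD79B", "IL7R", "CCR7"),
  avg_log2FC = c(3.02, 2.74, 2.54, 1.89, 1.76),
  stringsAsFactors = FALSE
)

# Sort by cluster, then by decreasing fold change, so the strongest
# markers come first within each cluster.
markers_full <- markers_full[order(markers_full$cluster, -markers_full$avg_log2FC), ]

# Collapse each cluster's genes into one comma-separated string.
genes_by_cluster <- tapply(markers_full$gene, markers_full$cluster,
                           paste, collapse = ",")

# Two-column data frame: cluster ID and comma-separated marker genes.
markers_simple <- data.frame(
  cluster = names(genes_by_cluster),
  marker_genes = as.vector(genes_by_cluster),
  stringsAsFactors = FALSE
)
print(markers_simple)
```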

Loading Marker Data

# Pick ONE of the options below; each one assigns `markers`.
# Option 1: Use Seurat's FindAllMarkers output directly (Recommended)
# (assuming you already have a Seurat object)
markers <- FindAllMarkers(seurat_obj)

# Option 2: Load Scanpy rank_genes_groups output (exported as CSV)
markers <- read.csv("scanpy_markers.csv")

# Option 3: Load your own simplified marker data
markers <- read.csv("path/to/your/markers.csv")

# Load example marker data for testing
markers <- loadExampleMarkers()

# Preview the data
head(markers)

Parameters

Required

| Parameter | Description |
|-----------|-------------|
| marker | Marker data (data frame or file path) |
| output_name | Base name for output files |
| model | LLM model ID (see Model Selection below) |
| tissue | Tissue type (e.g., "brain", "blood") |
| species | Species (e.g., "human", "mouse") |
| provider | API provider ("openrouter", "openai", "anthropic") or custom base URL |

Optional

| Parameter | Default | Description |
|-----------|---------|-------------|
| max_workers | 4 | Number of parallel workers. Recommend ~75% of CPU cores. |
| n_genes | 50 | Top marker genes per cluster |
| additional_info | NULL | Extra experimental context (see below) |
| temperature | 0 | Output randomness (0=deterministic, 1=creative). Keep at 0 for reproducible results. |
| validator_involvement | "v1" | Validation intensity: "v1" (moderate) or "v0" (high, slower) |
| reasoning | NULL | Reasoning depth for GPT-5 series via OpenRouter only: "low", "medium", "high" |
| ranking_method | "avg_log2FC" | Gene ranking: "avg_log2FC", "p_val_adj", "pct_diff", "Score" |
| ascending | NULL | Sort direction (uses method default) |
| celltype_column | NULL | Column name for cluster IDs |
| gene_column_name | NULL | Column name for gene symbols |
| max_retries | 1 | Max retries for failed API calls |
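
As a rough sketch of the ~75% guideline for max_workers, the value can be derived from the machine's core count with the base `parallel` package. The fallback to one core when detection fails is an assumption of this example, not documented CASSIA behavior.

```r
library(parallel)

# detectCores() can return NA on some platforms; fall back to 1 core.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1

# About 75% of the available cores, but never fewer than one worker.
max_workers <- max(1, floor(0.75 * n_cores))
print(max_workers)

# The value would then be passed on, for example:
# runCASSIA_batch(marker = markers, ..., max_workers = max_workers)
```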

Parameter Details

Model Selection

  • Default is anthropic/claude-sonnet-4.5 for best performance
  • Use google/gemini-2.5-flash for faster, preliminary analysis
  • For detailed model recommendations, see How to Select Models and Providers

Marker Gene Selection

  • Default: top 50 genes per cluster
  • ranking_method controls how marker genes are ranked and selected:
    • "avg_log2FC" (default): Rank by average log2 fold change
    • "p_val_adj": Rank by adjusted p-value
    • "pct_diff": Rank by difference in percentage expression
    • "Score": Rank by custom score
  • Filtering criteria: adjusted p-value < 0.05, avg_log2FC > 0.25, min percentage > 0.1
  • If fewer than 50 genes pass filters, all passing genes are used
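
The filtering and ranking described above can be approximated in base R as follows. This is a sketch for intuition, not CASSIA's internal code: the helper `select_top_markers` is hypothetical, and it assumes that "min percentage" refers to the pct.1 column and that ranking uses the default "avg_log2FC" in descending order.

```r
# Hypothetical helper: keep genes passing the stated filters, then take the
# top n_genes per cluster ranked by decreasing avg_log2FC.
select_top_markers <- function(markers, n_genes = 50) {
  keep <- markers$p_val_adj < 0.05 &
    markers$avg_log2FC > 0.25 &
    markers$pct.1 > 0.1          # assumption: "min percentage" means pct.1
  passed <- markers[keep, , drop = FALSE]
  if (nrow(passed) == 0) return(passed)

  passed <- passed[order(passed$cluster, -passed$avg_log2FC), , drop = FALSE]
  # head() returns all rows when a cluster has fewer than n_genes.
  top <- lapply(split(passed, passed$cluster), head, n = n_genes)
  do.call(rbind, top)
}

# Toy data: the last row (GENE_X, a made-up gene) fails all three filters.
toy <- data.frame(
  p_val_adj = c(0, 0, 0, 0, 0, 0.2),
  avg_log2FC = c(3.02, 2.74, 2.54, 1.89, 1.76, 0.10),
  pct.1 = c(0.973, 0.938, 0.935, 0.812, 0.756, 0.050),
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD79A", "MS4A1", "CD79B", "IL7R", "CCR7", "GENE_X"),
  stringsAsFactors = FALSE
)

top2 <- select_top_markers(toy, n_genes = 2)
print(top2$gene)
```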

Additional Context

  • Use additional_info to provide experimental context
  • Examples:
    • Treatment conditions: "Samples were antibody-treated"
    • Analysis focus: "Please carefully distinguish between cancer and non-cancer cells"
  • Tip: Compare results with and without additional context
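
One way to follow the tip above is to run the same batch twice, once without and once with additional_info, writing to different output names. The sketch below only uses parameters documented on this page; the output names are made up, and the calls run only when CASSIA is loaded and a `markers` object exists.

```r
# Arguments shared by both runs (values taken from the Quick Start).
common_args <- list(
  model = "anthropic/claude-sonnet-4.5",
  tissue = "brain",
  species = "human",
  provider = "openrouter"
)

# Two run configurations that differ only in output name and context.
runs <- list(
  no_context = c(common_args, list(output_name = "annotation_no_context")),
  with_context = c(common_args, list(
    output_name = "annotation_with_context",
    additional_info = "Samples were antibody-treated"
  ))
)

# Guard: skip the API calls unless CASSIA and the marker data are available.
if (exists("runCASSIA_batch") && exists("markers")) {
  for (run in runs) {
    do.call(runCASSIA_batch, c(list(marker = markers), run))
  }
}
```

The two summary files can then be compared side by side, cluster by cluster.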

Output

Files Generated

| File | Description |
|------|-------------|
| {output_name}_summary.csv | Annotation results with cell types, markers, and metadata |
| {output_name}_conversations.json | Complete conversation history for debugging |
| {output_name}_report.html | Interactive HTML report with visualizations |
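
A minimal sketch for checking which output files exist and previewing the summary, assuming the files were written to the current working directory and that `output_name` matches the value passed to runCASSIA_batch:

```r
output_name <- "my_annotation"

# Expected file names, built from the naming pattern in the table above.
output_files <- c(
  summary = paste0(output_name, "_summary.csv"),
  conversations = paste0(output_name, "_conversations.json"),
  report = paste0(output_name, "_report.html")
)

# Check which of the files are present (assumes the working directory).
found <- setNames(file.exists(output_files), names(output_files))
print(data.frame(file = output_files, exists = found, row.names = NULL))

# Preview the annotation table if it is there; check.names = FALSE keeps
# column names that contain spaces unchanged.
if (found[["summary"]]) {
  results <- read.csv(output_files[["summary"]], check.names = FALSE)
  print(head(results))
}
```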

Add Results to Seurat Object

Use add_cassia_to_seurat to add the annotation results back to your Seurat object. The function matches each cluster in the CASSIA results to the corresponding cluster identifier in the Seurat object.

seurat_obj <- add_cassia_to_seurat(
    seurat_obj = seurat_obj,
    cassia_results_path = "my_annotation_summary.csv",
    cluster_col = "seurat_clusters",
    cassia_cluster_col = "Cluster"
)

This adds columns to your Seurat object:

  • Cluster ID: The cluster identifier
  • Predicted General Cell Type: The broad cell type category
  • Predicted Detailed Cell Type: The specific cell type prediction
  • Possible Mixed Cell Types: Information on potential mixed populations
  • Marker List: The marker genes used for annotation
  • Additional metadata columns: Iterations, Model, Provider, Tissue, Species
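
Conceptually, the cluster-based mapping works like a lookup from each cell's cluster ID into the results table. The base R sketch below illustrates the idea on toy data; it is not the implementation of add_cassia_to_seurat, and the cell names, the cell type labels, and the `general_cell_type` column are made up for the example.

```r
# Toy per-cell metadata, standing in for a Seurat object's metadata.
cell_meta <- data.frame(
  cell = c("cell_1", "cell_2", "cell_3", "cell_4"),
  seurat_clusters = c("0", "1", "0", "2"),
  stringsAsFactors = FALSE
)

# Toy results table with one row per annotated cluster (made-up labels).
cassia_results <- data.frame(
  "Cluster" = c("0", "1"),
  "Predicted General Cell Type" = c("B cell", "T cell"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# For every cell, find the row of the results table for its cluster.
idx <- match(cell_meta$seurat_clusters, cassia_results$Cluster)

# Copy the label over; cells in clusters without a result get NA.
cell_meta$general_cell_type <-
  cassia_results[["Predicted General Cell Type"]][idx]
print(cell_meta)
```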