Batch Processing

Overview

Batch processing allows you to annotate all clusters in your single-cell dataset in one run. CASSIA processes multiple clusters in parallel, generating cell type predictions with detailed reasoning for each cluster.


Quick Start

import CASSIA

# `markers` is a marker table in one of the formats described under Input
CASSIA.runCASSIA_batch(
    marker=markers,
    output_name="my_annotation",
    model="anthropic/claude-sonnet-4.5",
    tissue="brain",
    species="human",
    provider="openrouter",
)

Input

Marker Data Formats

CASSIA accepts three input formats:

1. Seurat FindAllMarkers Output

Standard output from Seurat's FindAllMarkers function with differential expression statistics:

p_val  avg_log2FC  pct.1  pct.2  p_val_adj  cluster  gene
0      3.02        0.973  0.152  0          0        CD79A
0      2.74        0.938  0.125  0          0        MS4A1
0      2.54        0.935  0.138  0          0        CD79B
0      1.89        0.812  0.089  0          1        IL7R
0      1.76        0.756  0.112  0          1        CCR7

2. Scanpy rank_genes_groups Output (Recommended for Python)

Output from Scanpy's sc.tl.rank_genes_groups() function, typically exported using sc.get.rank_genes_groups_df():

group  names  scores  pvals  pvals_adj  logfoldchanges
0      CD79A  28.53   0      0          3.02
0      MS4A1  25.41   0      0          2.74
0      CD79B  24.89   0      0          2.54
1      IL7R   22.15   0      0          1.89
1      CCR7   20.87   0      0          1.76

3. Simplified Format

A two-column DataFrame with cluster ID and comma-separated marker genes:

cluster  marker_genes
0        CD79A,MS4A1,CD79B,HLA-DRA,TCL1A
1        IL7R,CCR7,LEF1,TCF7,FHIT,MAL
2        CD8A,CD8B,GZMK,CCL5,NKG7
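
If your markers are stored one gene per row, the simplified format above can be produced with a pandas group-by. The following is a minimal sketch, not part of CASSIA: the helper name to_simplified and the toy values are hypothetical, and it assumes the output column names cluster and marker_genes shown in the table.

```python
import pandas as pd

# Long-form marker table with Seurat-style column names (toy values).
long_markers = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1, 1],
        "gene": ["CD79A", "MS4A1", "CD79B", "IL7R", "CCR7"],
        "avg_log2FC": [3.02, 2.74, 2.54, 1.89, 1.76],
    }
)


def to_simplified(df, cluster_col="cluster", gene_col="gene"):
    """Collapse one-row-per-gene data into one row per cluster.

    Hypothetical helper: genes keep the order they have in `df`.
    """
    return (
        df.groupby(cluster_col, sort=True)[gene_col]
        .agg(",".join)
        .reset_index()
        .rename(columns={cluster_col: "cluster", gene_col: "marker_genes"})
    )


simplified = to_simplified(long_markers)
print(simplified)
```

Because the genes are joined in their existing row order, sort the long-form table by your preferred ranking metric before collapsing it.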

Loading Marker Data

import CASSIA
import scanpy as sc
import pandas as pd

# Choose ONE of the options below; each assigns the `markers` DataFrame.

# Option 1: Load Seurat FindAllMarkers output (exported as CSV)
markers = pd.read_csv("seurat_markers.csv")

# Option 2: Use Scanpy rank_genes_groups output directly (Recommended for Python)
# (assumes `adata` is an AnnData object with rank_genes_groups already computed)
markers = sc.get.rank_genes_groups_df(adata, group=None)  # Get all groups

# Option 3: Load your own simplified marker data
markers = pd.read_csv("path/to/your/markers.csv")

# Alternative: load the bundled example marker data for testing
markers = CASSIA.loadmarker(marker_type="unprocessed")

# Preview the data
print(markers.head())

Parameters

Required

Parameter    Description
marker       Marker data (DataFrame or file path)
output_name  Base name for output files
model        LLM model ID (see Model Selection below)
tissue       Tissue type (e.g., "brain", "blood")
species      Species (e.g., "human", "mouse")
provider     API provider ("openrouter", "openai", "anthropic") or custom base URL

Optional

Parameter              Default       Description
max_workers            4             Number of parallel workers. Recommend ~75% of CPU cores.
n_genes                50            Top marker genes per cluster
additional_info        None          Extra experimental context (see below)
temperature            0             Output randomness (0=deterministic, 1=creative). Keep at 0 for reproducible results.
validator_involvement  "v1"          Validation intensity: "v1" (moderate) or "v0" (high, slower)
reasoning              None          Reasoning depth for GPT-5 series via OpenRouter only: "low", "medium", "high"
ranking_method         "avg_log2FC"  Gene ranking: "avg_log2FC", "p_val_adj", "pct_diff", "Score"
ascending              None          Sort direction (uses method default)
celltype_column        None          Column name for cluster IDs
gene_column_name       None          Column name for gene symbols
max_retries            1             Max retries for failed API calls
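
The "~75% of CPU cores" guideline for max_workers can be computed rather than hard-coded. This is a small sketch using only the standard library; the helper name suggested_max_workers is hypothetical and not a CASSIA function.

```python
import os


def suggested_max_workers(fraction=0.75):
    """Return roughly `fraction` of the available CPU cores, never less than 1."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, int(cores * fraction))


max_workers = suggested_max_workers()
print(max_workers)
```

The resulting value can then be passed as max_workers=max_workers in the batch call.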

Parameter Details

Model Selection

  • Default is anthropic/claude-sonnet-4.5 for best performance
  • Use google/gemini-2.5-flash for faster, preliminary analysis
  • For detailed model recommendations, see How to Select Models and Providers

Marker Gene Selection

  • Default: top 50 genes per cluster
  • ranking_method controls how marker genes are ranked and selected:
    • "avg_log2FC" (default): Rank by average log2 fold change
    • "p_val_adj": Rank by adjusted p-value
    • "pct_diff": Rank by difference in percentage expression
    • "Score": Rank by custom score
  • Filtering criteria: adjusted p-value < 0.05, avg_log2FC > 0.25, min percentage > 0.1
  • If fewer than 50 genes pass filters, all passing genes are used
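
To make the filter-then-rank behavior concrete, here is an illustrative pandas sketch of the criteria listed above. It is not CASSIA's internal implementation. It assumes Seurat-style column names, assumes that "min percentage" refers to pct.1, assumes that pct_diff means pct.1 minus pct.2, and assumes that p_val_adj ranks smallest-first while the other metrics rank largest-first. The function name select_top_markers and the toy data are hypothetical.

```python
import pandas as pd


def select_top_markers(df, n_genes=50, ranking_method="avg_log2FC"):
    """Illustrative filter-then-rank step on a Seurat-style marker table."""
    passed = df[
        (df["p_val_adj"] < 0.05)
        & (df["avg_log2FC"] > 0.25)
        & (df["pct.1"] > 0.1)  # assumption: "min percentage" refers to pct.1
    ].copy()
    # Assumption: pct_diff is the gap between in-cluster and out-of-cluster expression.
    passed["pct_diff"] = passed["pct.1"] - passed["pct.2"]
    # Assumption: small adjusted p-values rank first; other metrics rank high to low.
    ascending = ranking_method == "p_val_adj"
    ranked = passed.sort_values(
        ["cluster", ranking_method], ascending=[True, ascending]
    )
    # Keep at most n_genes rows per cluster; clusters with fewer keep all of them.
    return ranked.groupby("cluster").head(n_genes)


# Toy data: LOWFC fails the fold-change and p-value filters, RARE fails pct.1.
markers = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1, 1],
        "gene": ["CD79A", "MS4A1", "LOWFC", "IL7R", "RARE"],
        "avg_log2FC": [3.02, 2.74, 0.10, 1.89, 1.50],
        "pct.1": [0.973, 0.938, 0.500, 0.812, 0.050],
        "pct.2": [0.152, 0.125, 0.400, 0.089, 0.010],
        "p_val_adj": [0.0, 0.0, 0.2, 0.0, 0.0],
    }
)

top = select_top_markers(markers, n_genes=2)
print(top[["cluster", "gene", "avg_log2FC"]])
```

Running a check like this on your own table before annotation shows how many genes per cluster are likely to survive the filters.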

Additional Context

  • Use additional_info to provide experimental context
  • Examples:
    • Treatment conditions: "Samples were antibody-treated"
    • Analysis focus: "Please carefully distinguish between cancer and non-cancer cells"
  • Tip: Compare results with and without additional context
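
One way to follow the tip above is to prepare two argument sets that differ only in additional_info and write to different output names. The sketch below only builds the arguments. The helper with_and_without_context and the _no_context / _with_context suffixes are hypothetical, and the actual CASSIA call is left as a comment because it needs the package, an API key, and a loaded marker table.

```python
def with_and_without_context(base_kwargs, additional_info):
    """Return two keyword-argument sets that differ only in additional_info."""
    baseline = dict(base_kwargs, additional_info=None)
    baseline["output_name"] = f"{base_kwargs['output_name']}_no_context"
    contextual = dict(base_kwargs, additional_info=additional_info)
    contextual["output_name"] = f"{base_kwargs['output_name']}_with_context"
    return baseline, contextual


base_kwargs = {
    "output_name": "my_annotation",
    "model": "anthropic/claude-sonnet-4.5",
    "tissue": "brain",
    "species": "human",
    "provider": "openrouter",
}

baseline, contextual = with_and_without_context(
    base_kwargs, "Samples were antibody-treated"
)

for kwargs in (baseline, contextual):
    print(kwargs["output_name"], "->", kwargs["additional_info"])
    # With CASSIA installed, an API key configured, and `markers` loaded:
    # CASSIA.runCASSIA_batch(marker=markers, **kwargs)
```

Keeping every other argument identical makes any difference between the two result sets attributable to the added context.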

Output

Files Generated

File                              Description
{output_name}_summary.csv         Annotation results with cell types, markers, and metadata
{output_name}_conversations.json  Complete conversation history for debugging
{output_name}_report.html         Interactive HTML report with visualizations
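
After a run, it can help to check that all three files exist. The sketch below derives the file names from output_name using the patterns in the table. The helper expected_outputs is hypothetical, and it assumes the files are written to the current working directory unless another directory is given.

```python
from pathlib import Path


def expected_outputs(output_name, directory="."):
    """Map each output kind to the path it should have for `output_name`."""
    base = Path(directory)  # assumption: outputs land in this directory
    return {
        "summary": base / f"{output_name}_summary.csv",
        "conversations": base / f"{output_name}_conversations.json",
        "report": base / f"{output_name}_report.html",
    }


outputs = expected_outputs("my_annotation")
for kind, path in outputs.items():
    status = "found" if path.exists() else "missing"
    print(f"{kind}: {path} ({status})")
```

Once the summary file is present, it can be loaded with pd.read_csv(outputs["summary"]) for downstream analysis.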