Uncertainty Quantification (Optional)

Overview

Uncertainty quantification in CASSIA helps assess annotation reliability through multiple analysis iterations and similarity scoring. This process is crucial for:

  • Identifying robust cell type assignments
  • Detecting mixed or ambiguous clusters
  • Quantifying annotation confidence
  • Understanding prediction variability

Cost Warning: Running multiple iterations with an LLM can incur significant costs. Each iteration makes separate API calls, so the total cost is approximately n times the cost of a single run.
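As a rough budgeting aid, the n-times scaling described above can be sketched in a few lines. The function name and the per-run cost figure are hypothetical and not part of CASSIA; actual costs depend on the model, provider pricing, and token usage.

```python
def estimate_uq_cost(single_run_cost: float, n: int) -> float:
    """Approximate total cost, assuming each iteration costs about as much as one run."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return single_run_cost * n


# Hypothetical example: a single run costing about $0.40, repeated n=5 times.
print(f"Estimated total: ${estimate_uq_cost(0.40, 5):.2f}")
```

For batch runs, the single-run cost already covers every cluster in the marker data, so the same multiplication applies.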

Quick Start

Single Cluster Analysis

from CASSIA import runCASSIA_n_times_similarity_score

result = runCASSIA_n_times_similarity_score(
    tissue="large intestine",
    species="human",
    marker_list=["CD38", "CD138", "JCHAIN", "MZB1", "SDC1"],
    model="openai/gpt-5.1",
    provider="openrouter",
    n=5,
    reasoning="medium"
)

print(f"Main cell type: {result['general_celltype_llm']}")
print(f"Similarity score: {result['similarity_score']}")

Batch Analysis

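import CASSIA

# marker_data can be a pandas DataFrame or a path to a marker file.
# The path below is a placeholder; replace it with your own data.
marker_data = "path/to/markers.csv"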

# Step 1: Run multiple iterations
CASSIA.runCASSIA_batch_n_times(
    n=5,
    marker=marker_data,
    output_name="my_annotation",
    model="openai/gpt-5.1",
    provider="openrouter",
    tissue="large intestine",
    species="human",
    reasoning="medium"
)

# Step 2: Calculate similarity scores
CASSIA.runCASSIA_similarity_score_batch(
    marker=marker_data,
    file_pattern="my_annotation_*_summary.csv",
    output_name="similarity_results",
    model="openai/gpt-5.1",
    provider="openrouter",
    reasoning="medium"
)

Input

| Input | Description | Format |
|---|---|---|
| marker_list | Marker genes for single cluster | List of gene names |
| marker | Marker gene data for batch | DataFrame or file path |
| tissue | Tissue type context | String (e.g., "brain", "large intestine") |
| species | Species context | String (e.g., "human", "mouse") |
| file_pattern | Pattern to match iteration results | Glob pattern with * wildcard |
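To check which files a given file_pattern would select before running the scoring step, the pattern can be tested against file names with Python's standard fnmatch module. This is a minimal sketch: the file names are hypothetical, and it assumes CASSIA applies ordinary shell-style glob matching, which this section does not spell out.

```python
from fnmatch import fnmatchcase

output_name = "my_annotation"
pattern = f"{output_name}_*_summary.csv"

# Hypothetical file names, as they might appear in the working directory.
candidates = [
    "my_annotation_1_summary.csv",
    "my_annotation_2_summary.csv",
    "my_annotation_similarity.csv",
    "other_run_1_summary.csv",
]

# Keep only the names that the glob pattern matches.
matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```

Building the pattern from the same output_name used in the iteration step keeps the two steps in sync.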

Parameters

Single Cluster (runCASSIA_n_times_similarity_score)

| Parameter | Required | Default | Description |
|---|---|---|---|
| tissue | Yes | - | Tissue type for context |
| species | Yes | - | Species for context |
| marker_list | Yes | - | List of marker genes |
| model | Yes | - | LLM model to use |
| provider | Yes | - | API provider ("openrouter", "openai", "anthropic") |
| n | No | 5 | Number of analysis iterations |
| temperature | No | 0.3 | LLM temperature (lower = more consistent) |
| max_workers | No | 3 | Parallel processing workers |
| main_weight | No | 0.5 | Weight for main cell type in similarity (0-1) |
| sub_weight | No | 0.5 | Weight for subtype in similarity (0-1) |
| validator_involvement | No | "v1" | Validator mode ("v0" strict, "v1" moderate) |
| additional_info | No | None | Additional context string |
| generate_report | No | True | Generate HTML report |
| report_output_path | No | "uq_report.html" | Path for HTML report |
| reasoning | No | None | Reasoning effort level ("low", "medium", "high") - only for GPT-5 models |
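To build intuition for how main_weight and sub_weight trade off, the sketch below combines two agreement fractions into one weighted score. This is an illustrative assumption only: this section does not describe CASSIA's actual scoring formula, and the function name and example labels are hypothetical.

```python
from collections import Counter


def weighted_agreement(main_types, sub_types, main_weight=0.5, sub_weight=0.5):
    """Illustrative only: weighted agreement with the most common label per level."""
    if not main_types or not sub_types:
        raise ValueError("both label lists must be non-empty")

    def agreement(labels):
        # Fraction of iterations that match the most frequent label.
        return Counter(labels).most_common(1)[0][1] / len(labels)

    return main_weight * agreement(main_types) + sub_weight * agreement(sub_types)


# Hypothetical labels from 5 iterations: full agreement on the main type,
# a 3-to-2 split on the subtype.
main = ["plasma cell"] * 5
sub = ["IgA plasma cell"] * 3 + ["IgG plasma cell"] * 2
score = weighted_agreement(main, sub)
print(f"Illustrative score: {score:.2f}")
```

Under this assumed formula, raising main_weight relative to sub_weight makes the score less sensitive to subtype disagreement.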

Batch Iteration (runCASSIA_batch_n_times)

| Parameter | Required | Default | Description |
|---|---|---|---|
| n | Yes | - | Number of analysis iterations (recommended: 5) |
| marker | Yes | - | Marker gene data (DataFrame or path) |
| output_name | Yes | - | Base name for output files |
| model | Yes | - | LLM model to use |
| provider | Yes | - | API provider |
| tissue | Yes | - | Tissue type |
| species | Yes | - | Species |
| max_workers | No | 4 | Overall parallel processing limit |
| batch_max_workers | No | 2 | Workers per iteration |
| reasoning | No | None | Reasoning effort level ("low", "medium", "high") - only for GPT-5 models |

Similarity Scoring (runCASSIA_similarity_score_batch)

| Parameter | Required | Default | Description |
|---|---|---|---|
| marker | Yes | - | Marker gene data |
| file_pattern | Yes | - | Pattern to match iteration results (e.g., "output_*_summary.csv") |
| output_name | Yes | - | Base name for results |
| model | Yes | - | LLM model for scoring |
| provider | Yes | - | API provider |
| max_workers | No | 4 | Number of parallel workers |
| main_weight | No | 0.5 | Importance of main cell type match (0-1) |
| sub_weight | No | 0.5 | Importance of subtype match (0-1) |
| generate_report | No | True | Generate HTML report |
| report_output_path | No | "uq_batch_report.html" | Path for HTML report |
| reasoning | No | None | Reasoning effort level ("low", "medium", "high") - only for GPT-5 models |

Output

Files Generated

| File | Description |
|---|---|
| {output_name}_{n}_summary.csv | Results from each iteration |
| {output_name}_similarity.csv | Similarity scores across iterations |
| uq_report.html / uq_batch_report.html | HTML visualization report |

Return Values (Single Cluster)

| Key | Description |
|---|---|
| general_celltype_llm | Consensus main cell type |
| sub_celltype_llm | Consensus sub cell type |
| similarity_score | Overall similarity across iterations (0-1) |
| consensus_types | Cell types that appeared most frequently |
| Possible_mixed_celltypes_llm | Detected mixed cell type populations |
| original_results | Raw results from each iteration |
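The keys above can be read defensively when post-processing a result. The helper below is a hypothetical sketch, not part of CASSIA; the example dictionary uses made-up values, and the value types (for instance, a list for Possible_mixed_celltypes_llm) are assumptions.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary from the documented return keys."""
    main = result.get("general_celltype_llm", "unknown")
    sub = result.get("sub_celltype_llm", "unknown")
    score = result.get("similarity_score")
    # Treat a missing or empty value as "nothing reported".
    mixed = result.get("Possible_mixed_celltypes_llm") or "none reported"
    score_text = f"{score:.2f}" if isinstance(score, (int, float)) else "n/a"
    return f"{main} / {sub} (similarity {score_text}; possible mixed types: {mixed})"


# Hypothetical values for illustration; a real call returns actual annotations.
example = {
    "general_celltype_llm": "plasma cell",
    "sub_celltype_llm": "IgA plasma cell",
    "similarity_score": 0.92,
    "Possible_mixed_celltypes_llm": [],
}
print(summarize_result(example))
```

Using dict.get keeps the summary from failing if a key is absent from a particular result.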

Interpreting Similarity Scores

| Score Range | Interpretation | Action |
|---|---|---|
| > 0.9 | High consistency | Robust annotation |
| 0.75 - 0.9 | Moderate consistency | Review recommended |
| < 0.75 | Low consistency | Use Annotation Boost or Subclustering |
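The thresholds above can be applied programmatically when triaging many clusters. The helper below is a hypothetical sketch, not a CASSIA function; it assumes scores of exactly 0.75 and 0.9 fall in the moderate band, following the ranges in the table.

```python
def interpret_similarity(score: float) -> tuple[str, str]:
    """Map a similarity score (0-1) to the interpretation and action from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity score must be between 0 and 1")
    if score > 0.9:
        return ("High consistency", "Robust annotation")
    if score >= 0.75:
        return ("Moderate consistency", "Review recommended")
    return ("Low consistency", "Use Annotation Boost or Subclustering")


# Hypothetical scores for three clusters.
for s in (0.95, 0.80, 0.60):
    interpretation, action = interpret_similarity(s)
    print(f"{s:.2f}: {interpretation} -> {action}")
```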

Troubleshooting Low Scores

  1. Review Data: Check marker gene quality and cluster heterogeneity
  2. Try Advanced Agents: Use Annotation Boost Agent or Subclustering
  3. Adjust Parameters: Increase iteration count for more reliable consensus