Uncertainty Quantification (Optional)

Overview

Uncertainty quantification in CASSIA assesses annotation reliability by running the same analysis multiple times and scoring how similar the results are across runs. This process is useful for:

  • Identifying robust cell type assignments
  • Detecting mixed or ambiguous clusters
  • Quantifying annotation confidence
  • Understanding prediction variability

Cost Warning: Running multiple iterations with LLM models can incur significant costs. Each iteration makes its own set of API calls, so the total cost is roughly n times the cost of a single run (e.g., n = 5 costs about five times as much as one run).
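
As a rough budgeting step, you can estimate the total before launching a run. This is a back-of-envelope sketch; cost_per_run is a placeholder you would replace with the observed cost of a single-run test against your provider:

n <- 5
cost_per_run <- 0.40  # USD; hypothetical -- measure with a single run first
message(sprintf("Estimated total: ~$%.2f", n * cost_per_run))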

Quick Start

library(CASSIA)

# marker_data is your marker gene table (a data frame or a file path);
# see the Input section below for the expected format.

# Step 1: Run multiple iterations
runCASSIA_batch_n_times(
    n = 5,
    marker = marker_data,
    output_name = "my_annotation",
    model = "openai/gpt-5.1",
    provider = "openrouter",
    tissue = "brain",
    species = "human",
    reasoning = "medium"
)

# Step 2: Calculate similarity scores
runCASSIA_similarity_score_batch(
    marker = marker_data,
    file_pattern = "my_annotation_*_summary.csv",
    output_name = "similarity_results",
    model = "openai/gpt-5.1",
    provider = "openrouter",
    reasoning = "medium"
)

Input

| Input | Description | Format |
| --- | --- | --- |
| marker | Marker gene data | Data frame or file path |
| tissue | Tissue type context | String (e.g., "brain", "large intestine") |
| species | Species context | String (e.g., "human", "mouse") |
| file_pattern | Pattern to match iteration results | Glob pattern with * wildcard |
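
For illustration, marker can be a data frame of per-cluster marker genes, such as differential expression output in the style of Seurat's FindAllMarkers(). The column names below are an assumption for the sketch, not CASSIA's documented schema; check the package documentation for the exact expected format:

# Hypothetical marker table; column names are illustrative only
marker_data <- data.frame(
    cluster    = c(0, 0, 1, 1),
    gene       = c("GFAP", "AQP4", "SNAP25", "SYT1"),
    avg_log2FC = c(2.1, 1.8, 2.5, 2.2),
    p_val_adj  = c(1e-20, 3e-15, 5e-30, 2e-25)
)

# A file path to a saved marker table also works:
# marker_data <- "markers.csv"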

Parameters

Batch Iteration (runCASSIA_batch_n_times)

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| n | Yes | - | Number of analysis iterations (recommended: 5) |
| marker | Yes | - | Marker gene data (data frame or path) |
| output_name | Yes | - | Base name for output files |
| model | Yes | - | LLM model to use |
| provider | Yes | - | API provider |
| tissue | Yes | - | Tissue type |
| species | Yes | - | Species |
| max_workers | No | 4 | Overall parallel processing limit |
| batch_max_workers | No | 2 | Workers per iteration (max_workers * batch_max_workers should match your available cores) |
| reasoning | No | NULL | Reasoning effort level ("low", "medium", "high"); only for GPT-5 models |
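
Since the two worker settings multiply, it helps to derive them from your core count. A minimal sketch, assuming you want max_workers * batch_max_workers to match the machine's cores:

# Derive worker settings from the available cores
cores <- parallel::detectCores()
batch_max_workers <- 2
max_workers <- max(1, cores %/% batch_max_workers)

runCASSIA_batch_n_times(
    n = 5,
    marker = marker_data,
    output_name = "my_annotation",
    model = "openai/gpt-5.1",
    provider = "openrouter",
    tissue = "brain",
    species = "human",
    max_workers = max_workers,
    batch_max_workers = batch_max_workers
)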

Similarity Scoring (runCASSIA_similarity_score_batch)

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| marker | Yes | - | Marker gene data |
| file_pattern | Yes | - | Pattern to match iteration results (e.g., "output_*_summary.csv") |
| output_name | Yes | - | Base name for results |
| model | Yes | - | LLM model for scoring |
| provider | Yes | - | API provider |
| max_workers | No | 4 | Number of parallel workers |
| main_weight | No | 0.5 | Importance of main cell type match (0-1) |
| sub_weight | No | 0.5 | Importance of subtype match (0-1) |
| generate_report | No | TRUE | Generate HTML report |
| report_output_path | No | "uq_batch_report.html" | Path for HTML report |
| reasoning | No | NULL | Reasoning effort level ("low", "medium", "high"); only for GPT-5 models |
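
If agreement on the main cell type matters more to you than agreement on the subtype, shift the weights. The values below are illustrative choices passed to the main_weight and sub_weight parameters documented above:

runCASSIA_similarity_score_batch(
    marker = marker_data,
    file_pattern = "my_annotation_*_summary.csv",
    output_name = "similarity_results",
    model = "openai/gpt-5.1",
    provider = "openrouter",
    main_weight = 0.7,  # emphasize main cell type agreement
    sub_weight = 0.3,   # de-emphasize subtype agreement
    generate_report = TRUE,
    report_output_path = "uq_batch_report.html"
)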

Output

Files Generated

| File | Description |
| --- | --- |
| {output_name}_{n}_summary.csv | Results from each iteration |
| {output_name}_similarity.csv | Similarity scores across iterations |
| uq_batch_report.html | HTML visualization report |
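
To confirm which iteration summaries the similarity step will pick up, you can expand the same glob pattern yourself with base R:

# Preview the files matched by the file_pattern glob
iteration_files <- Sys.glob("my_annotation_*_summary.csv")
print(iteration_files)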

Interpreting Similarity Scores

| Score Range | Interpretation | Action |
| --- | --- | --- |
| > 0.9 | High consistency | Robust annotation |
| 0.75 - 0.9 | Moderate consistency | Review recommended |
| < 0.75 | Low consistency | Use Annotation Boost or Subclustering |
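
To triage clusters programmatically, you can bin the scores with the same thresholds. The file name follows the {output_name}_similarity.csv pattern above; the similarity_score column name is an assumption, so adjust it to match the actual CSV header:

# Bin similarity scores into the bands from the table above
# NOTE: the score column name is assumed; check your CSV header
scores <- read.csv("similarity_results_similarity.csv")
scores$band <- cut(
    scores$similarity_score,
    breaks = c(-Inf, 0.75, 0.9, Inf),
    labels = c("low", "moderate", "high")
)
table(scores$band)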

Troubleshooting Low Scores

  1. Review Data: Check marker gene quality and cluster heterogeneity
  2. Try Advanced Agents: Use Annotation Boost Agent or Subclustering
  3. Adjust Parameters: Increase iteration count for more reliable consensus (see the sketch after this list)
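
For example, re-running with a larger n gives a more stable consensus at proportionally higher cost (parameters reused from Quick Start):

# More iterations = more reliable consensus; cost scales linearly with n
runCASSIA_batch_n_times(
    n = 10,
    marker = marker_data,
    output_name = "my_annotation_n10",
    model = "openai/gpt-5.1",
    provider = "openrouter",
    tissue = "brain",
    species = "human",
    reasoning = "medium"
)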