Addressing Batch Correction with SAKURA
Batch effects are technical variations that occur between different experimental batches, which can confound biological signals in single-cell RNA sequencing data. SAKURA utilizes flexible approaches to handle batch effects, allowing users to choose the strategy that best fits their data and analysis goals.
Approach 1: Pre-corrected Expression Input
Perform batch correction methods on the raw expression matrix prior to utilizing SAKURA. The resulting expression matrix will serve as the input dataset for implementing SAKURA.
- Advantages:
Corrects at the expression level, preserving biological signals
Flexible choice of established methods with extensive validation
SAKURA operates normally without modifications
- Considerations:
May over-correct and remove subtle biological variations
Requires careful parameter tuning for optimal results due to transformed statistical properties and potential information loss in the batch corrected data
- Suitable Methods:
Seurat CCA/RPCA: Batch effect correction aligning canonical basis vectors or reciprocal PCA
scVI: Probabilistic modeling of batch effects
Liger: Batch effect correction relies on integrative non-negative matrix factorization
Approach 2: Post-hoc Embedding Correction
Use SAKURA result embeddings as input to external batch correction methods, and then specifically remove technical batch variation within this low-dimensional space already enriched for biological signal.
Note
Use SAKURA embeddings as a functional substitute for PCA coordinates in batch correction, but ensure the method does not mathematically depends on PCA-specific constructs (e.g., gene loadings for matrix reconstruction).
- Typical Workflow:
Extract batch information from metadata during data preprocessing (
batch,donor,sequencing_run, etc.)Generate SAKURA embeddings using standard training pipeline
Apply batch correction to SAKURA embeddings using external tools
Use corrected embeddings for clustering, visualization, and analysis
- Advantages:
Preserves SAKURA’s biological signal learning
Reduces correction computational cost and time with low-dimensional SAKURA embedding
Enables modular evaluation on both SAKURA feature learning quality and batch correction efficacy
- Considerations:
Requires compatible correction methods applied after feature learning
May involve iterative optimization between external correction methods and SAKURA targeting two objectives, i.e. biological signal learning and batch effect correction
- Suitable Methods:
Approach 3: Knowledge-Guided Training with Pre-Corrected Embeddings
Use pre-computed, batch-corrected low dimensional embeddings as the knowledge input for SAKURA’s feature training.
Note
This option is designed for scenarios where a reliable, batch-corrected low-dimensional representation of the data already exists or is preferred to be generated upstream.The core idea is to format prior knowledge about the desired, batch-effect-free cell-state together with necessary cell and feature metadata as the knowledge input to SAKURA for feature learning.
Configuration Example:
{
"dataset": {
...
"pheno_csv_path": "./<dataset>/pheno_df_with_batch_effect_correction_features.csv",
"pheno_meta_path": "./<dataset>_<signature/phenotype>_bec/pheno_meta.json",
"selected_pheno": [
"BE_cor",
...
],
...
},
...
"story": [
{
...
"<pheno_with_>batch_effect_correction": {
"use_split": "overall_train",
"batch_size": 100,
"train_main_latent": "False",
"train_pheno": "True",
"train_signature": "False",
"selected_pheno": {
...
"BE_cor": {
"loss": "*",
"regularization": "*"
}
}
},
}
]
}
- Advantages:
Separates the complex problem of batch correction from knowledge-guided deep learning
Flexible integration with existing workflows where batch correction is a standardized upstream step
- Considerations:
Requires appropriate choice of upstream batch correction methods and decent validation
Requires careful parameter tuning balancing the intensity of batch correction and other biological signal learning
Evaluation is holistic; difficult to disentangle the combined effect due to the fused learning.
- Suitable Methods:
Conclusion
SAKURA’s flexible architecture supports multiple strategies for batch effect handling. The choice between pre-correction, post-hoc correction, or corrected embedding knowledge input depends on the severity of batch effects, data complexity, and specific research questions.
For most applications, we recommend starting with Approach 1 (pre-corrected expression) for well-established datasets, or Approach 2 (post-hoc embedding correction) when SAKURA’s biological feature extraction is prioritized.