BIOINFORMATICS & COMPUTATIONAL BIOLOGY METHODS
Faculty of Computers and Artificial Intelligence
Benha University
Lecture · M.Sc in AI
AI for Single-Cell RNA Sequencing Analysis
A high-dimensional data problem disguised as biology — and how modern AI is learning to solve it.
scrna-seq · ai · foundation models
Today
Three things, in order.
PART 1
Biology you actually need
Cells, DNA, RNA, gene expression — through a CS lens. ~10 min.
PART 2
The scRNA-seq pipeline
From raw reads to clusters: QC, normalization, dimensionality reduction, annotation, batch effects, trajectories. ~20 min.
PART 3
Where AI shows up
Classical ML, deep learning, scVI, GNNs, and the new foundation models — scGPT, Geneformer. ~15 min.
Why a CS student should care
A cell is a process running inside a giant distributed system.
| BIOLOGY | CS ANALOG |
| The body | Distributed system |
| A tissue | Compute cluster |
| DNA | Stored source code |
| Gene expression | Runtime activity |
| Disease | Abnormal system state |
| A drug | External intervention |
02 · What is a cell?
The basic unit of life — and a unit of computation.
A human body has ~37 trillion cells. Different jobs:
- neuron: transmits electrical signals
- muscle: contracts
- immune: fights infection
- red blood: carries oxygen
- cancer: activates abnormal growth
The puzzle
Almost all of these cells contain the same DNA.
So if the source code is identical — why do they behave so differently?
The answer
Gene expression.
Different cells activate different genes. Same source code — different functions running.
NEURON
runs signaling & communication genes
MUSCLE
runs contraction genes
IMMUNE
runs defense & pathogen-recognition genes
CANCER
runs abnormal growth & survival genes
Cell identity is determined not only by which genes exist in the DNA, but by which are active, inactive, up-regulated, or down-regulated right now.
The central flow
DNA → RNA → Protein → Behavior
DNA
Long-term storage of genetic information.
≈ source code on disk
GENE
A region of DNA that codes for one functional product.
≈ a function definition
RNA
A temporary copy made when a gene is being used.
≈ executed instruction
PROTEIN
The functional molecule that does work in the cell.
≈ runtime object
scRNA-seq measures RNA — so it tells us which functions each cell is currently running, not just which exist.
Biology becomes a matrix
Expression level → number. Cell → vector. Dataset → matrix.
|        | GENE A | GENE B | GENE C | GENE D |
| cell 1 | 0 | 5 | 2 | 0 |
| cell 2 | 8 | 0 | 1 | 3 |
| cell 3 | 0 | 0 | 7 | 2 |
value = number of RNA transcripts detected
10³ to 10⁶ cells (rows) per study
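The toy table above maps directly onto an array. A minimal sketch, assuming NumPy; the variable names are illustrative only:

```python
import numpy as np

# Toy count matrix from the slide: rows are cells, columns are genes A-D.
genes = ["GENE A", "GENE B", "GENE C", "GENE D"]
X = np.array([
    [0, 5, 2, 0],  # cell 1
    [8, 0, 1, 3],  # cell 2
    [0, 0, 7, 2],  # cell 3
])

cell_2 = X[1]                            # one cell = one vector of length n_genes
library_sizes = X.sum(axis=1)            # total transcripts detected per cell
zero_fraction = float((X == 0).mean())   # share of entries that are zero
```

Even in this 3 × 4 toy, 5 of the 12 entries are zero. Real datasets are far sparser.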
CHECK-IN · 1 of 3
Quick check
scRNA-seq directly measures…
A. DNA sequences in each cell
B. RNA transcripts in each cell
C. Protein abundances in each cell
D. Cell-cell physical contacts
Bulk RNA-seq · the old way
One average profile from millions of mixed cells.
“Like measuring the average CPU usage of a whole data center without knowing which server is overloaded.”
cancer cells
immune cells
blood vessels
connective tissue
All blended into one measurement.
Side by side
Bulk averages it all. Single-cell keeps every cell.
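A small numeric illustration of the contrast, using made-up counts (the 95/5 split and the expression values are assumptions chosen for the example, not data from a real tissue):

```python
import numpy as np

# Hypothetical tissue: 95 "normal" cells and 5 rare cells that strongly
# express gene 0. Values are made-up transcript counts for two genes.
normal = np.tile([1, 10], (95, 1))
rare = np.tile([100, 10], (5, 1))
cells = np.vstack([normal, rare])

bulk_profile = cells.mean(axis=0)        # the single averaged profile bulk gives you
n_high = int((cells[:, 0] > 50).sum())   # single-cell view: count the outlier cells
```

The bulk profile shows gene 0 at a modest average level. Only the per-cell rows reveal that the signal comes from five cells.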
What you actually get back
A cell × gene count matrix.
Shape
- rows: cells (10³ to 10⁶)
- cols: genes (~20,000)
- values: RNA transcript counts (integers)
Properties
- very large & very sparse
- noisy, with technical artifacts
- partially missing (dropout)
- heterogeneous across cells
Not a tidy ML benchmark. A real-world, messy, observational dataset.
Zoom in on the matrix
Most entries are zero.
Each tile is one (cell, gene) pair. A zero may mean truly not expressed — or not captured. We can't always tell.
The five difficulties
Why scRNA-seq isn't standard ML.
01
High-dim
~20,000 features per cell — distance metrics misbehave.
02
Sparse
85%+ zeros. Each zero is ambiguous: real or missing?
03
Noisy
Sequencing variation, dying cells, doublets, dropouts.
04
Batch effects
Lab, day, machine — confounded with biology.
05
Unlabeled
No ground truth cell types. Mostly unsupervised.
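The "distance metrics misbehave" point under High-dim can be checked numerically. A minimal sketch on uniform random points (not real expression data); `relative_contrast` is a hypothetical helper that compares the farthest and nearest neighbor of one query point:

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of the distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

low = relative_contrast(200, 2)        # 2 features: near and far neighbors differ a lot
high = relative_contrast(200, 20_000)  # ~20,000 features: all distances look alike
```

When the contrast shrinks toward zero, "nearest neighbor" carries little information. This is one reason the pipeline reduces dimensions before building graphs.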
The central computational problem
Given a noisy, high-dimensional, sparse matrix —
learn a representation that captures real biological variation while suppressing technical noise.
Everything else in this lecture — clustering, autoencoders, foundation models — is one answer to that one sentence.
The standard pipeline
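Quality control → normalization → feature selection → dimensionality reduction → graph & clustering → cell-type annotation → differential expression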
Step 1 · Quality control
Not every droplet is a real cell.
DROP THESE
- empty droplets — no cell inside
- dying cells — high mito %
- doublets — two cells in one
- low-quality — too few genes detected
CS framing
Data cleaning before training.
Garbage in, artifacts out. Many “surprising” discoveries in poorly QC'd datasets turn out to be technical noise.
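Two of the filters above (too few genes detected, high mito %) can be sketched as a boolean mask. The function name, the `is_mito` flag vector, and both thresholds are assumptions for illustration. Empty-droplet and doublet detection need dedicated methods and are not covered here:

```python
import numpy as np

def qc_mask(counts, is_mito, min_genes=200, max_mito_frac=0.2):
    """Boolean mask of cells (rows) that pass two simple QC filters.

    Thresholds are illustrative defaults, not recommendations.
    """
    counts = np.asarray(counts)
    is_mito = np.asarray(is_mito, dtype=bool)
    genes_detected = (counts > 0).sum(axis=1)
    total = counts.sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Cells with zero total counts get a mito fraction of 0 instead of NaN.
    mito_frac = np.divide(mito, total, out=np.zeros(len(total)), where=total > 0)
    return (genes_detected >= min_genes) & (mito_frac <= max_mito_frac)

# Toy usage: the last column is a mitochondrial gene.
toy = [[5, 3, 2, 0],   # healthy-looking cell
       [1, 0, 0, 9],   # 90% mitochondrial reads
       [0, 0, 0, 0]]   # empty droplet
keep = qc_mask(toy, [False, False, False, True], min_genes=2)
```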
Step 2 · Normalization
Make cells comparable.
Different cells get sequenced to different depths. Without correction, the model learns library size instead of biology.
raw counts → scale by library size → log(x+1)
Goal
Reduce technical differences just enough that biological structure becomes detectable.
Not perfect. Just comparable.
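The three-step recipe (raw counts, scale by library size, log(x+1)) in a minimal NumPy sketch. The function name and the `target_sum` of 10,000 are assumptions for illustration:

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)   # library size per cell
    scaled = np.divide(counts * target_sum, lib,
                       out=np.zeros_like(counts), where=lib > 0)
    return np.log1p(scaled)

# Two cells with the same composition, sequenced at 10x different depth,
# plus one empty row to show the zero-total guard.
out = normalize_log1p([[1, 3], [10, 30], [0, 0]])
```

After normalization the first two rows are identical: the depth difference is gone and only the composition remains.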
Step 3 · Feature selection
Keep the genes that vary. Drop the rest.
START
~20k
all detected genes
SELECT
~2k
highly variable genes
USE
For all downstream distance & modeling.
Roughly 10× fewer features: faster, less noise, most of the signal kept.
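A deliberately simple version of "keep the genes that vary": rank genes by plain variance on normalized data. This is a stand-in, since real pipelines usually correct for the dependence of variance on mean expression. `top_variable_genes` is a hypothetical name:

```python
import numpy as np

def top_variable_genes(log_expr, n_top=2000):
    """Column indices of the n_top genes with the largest variance across cells."""
    variances = np.asarray(log_expr, dtype=float).var(axis=0)
    n_top = min(n_top, variances.size)
    return np.sort(np.argsort(variances)[::-1][:n_top])

# Toy usage: gene 0 is constant, gene 1 varies a lot, gene 2 barely varies.
toy = [[1, 0, 2.0], [1, 10, 2.1], [1, 0, 2.0], [1, 10, 2.1]]
selected = top_variable_genes(toy, n_top=1)
```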
Step 4 · Dimensionality reduction
Compress high-dimensional cells into something we can compute & visualize.
PCA
Linear. Fast. First reduction step. ~2k selected genes → ~50 dims.
UMAP / t-SNE
Nonlinear. For visualization. 50 → 2 dims.
DEEP MODELS
Autoencoders, scVI. Learned latent space. Keeps nonlinear structure.
⚠ WARNING · A UMAP plot is a visualization — not a proof of biological truth. Distances and shapes can mislead.
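The PCA step can be written in a few lines with a singular value decomposition. A minimal sketch; the `pca` helper is illustrative, and production code would typically use a truncated or randomized solver:

```python
import numpy as np

def pca(X, n_components=50):
    """Project cells onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, S.size)
    scores = U[:, :k] * S[:k]                     # cell coordinates in PC space
    explained = S[:k] ** 2 / np.sum(S ** 2)       # fraction of variance per PC
    return scores, explained
```

The `scores` matrix is what the later kNN-graph and UMAP steps consume. The 2D plot is a further, lossy compression of it.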
Figure · UMAP of PBMCs (peripheral blood mononuclear cells), with marker genes shown per cluster.
Step 5 · Graph & clustering
Build a kNN graph. Find communities.
- node: each cell
- edge: connects a cell to its k nearest neighbors in PCA space
- cluster: Leiden / Louvain community detection
A cluster is a hypothesis, not a conclusion. It becomes biology only after validation.
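The graph-building half of this step, as a brute-force sketch. It covers only the kNN graph; Leiden / Louvain community detection would then run on this graph using a dedicated library and is not implemented here. `knn_graph` is a hypothetical helper, and the all-pairs distance matrix only suits small inputs:

```python
import numpy as np

def knn_graph(embedding, k=15):
    """Symmetric boolean adjacency linking each cell to its k nearest neighbors."""
    Z = np.asarray(embedding, dtype=float)
    k = min(k, len(Z) - 1)
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                   # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros(d2.shape, dtype=bool)
    A[np.repeat(np.arange(len(Z)), k), nn.ravel()] = True
    return A | A.T                                 # make edges undirected

# Toy usage: two well-separated groups of three cells each.
toy = [[0, 0], [0, 1], [1, 0], [100, 100], [100, 101], [101, 100]]
A = knn_graph(toy, k=2)
```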
Step 6 · Cell-type annotation
Give clusters biological names.
MANUAL
Use known marker genes.
cluster 3 → CD3D high → T cell
Interpretable, slow, needs expertise.
AUTOMATED
Reference atlases & classifiers.
unknown cell → classifier → label
Fast, scalable, fails on novel states.
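The manual route (cluster → marker high → label) can be sketched as a lookup over per-cluster mean expression. CD3D is the marker from the slide. MS4A1 and LYZ are commonly used PBMC markers added here as assumptions, and `annotate_clusters` is a hypothetical helper:

```python
import numpy as np

# Illustrative marker panel: one marker gene per cell type.
MARKERS = {"T cell": ["CD3D"], "B cell": ["MS4A1"], "Monocyte": ["LYZ"]}

def annotate_clusters(mean_expr, gene_names, markers=MARKERS):
    """Label each cluster with the cell type whose markers score highest."""
    mean_expr = np.asarray(mean_expr, dtype=float)
    idx = {g: i for i, g in enumerate(gene_names)}
    labels = []
    for profile in mean_expr:
        scores = {}
        for cell_type, genes in markers.items():
            present = [idx[g] for g in genes if g in idx]
            if present:  # skip cell types whose markers are absent from the data
                scores[cell_type] = float(np.mean(profile[present]))
        labels.append(max(scores, key=scores.get))
    return labels

# Toy usage: rows are clusters, columns follow gene_names.
gene_names = ["CD3D", "MS4A1", "LYZ"]
labels = annotate_clusters([[5, 0.1, 0.2], [0.1, 4, 0.3], [0.2, 0.1, 6]], gene_names)
```

A lookup like this always returns some label, so it shares the weakness noted for automated methods: a novel cell state is forced into the nearest known name.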
Step 7 · Differential expression
Which genes differ between groups?
- cancer cells vs normal cells
- pre-treatment vs post-treatment
- responder vs non-responder
- cluster A vs all others
Why it matters
The bridge from numbers to biology.
Differential genes become marker signatures, disease pathways, drug targets, biomarkers.
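A bare-bones version of a two-group comparison: per-gene log2 fold change plus a Welch t statistic. This sketch assumes the inputs are already normalized and stops short of p-values and multiple-testing correction, which a real analysis needs. `de_stats` is a hypothetical name:

```python
import numpy as np

def de_stats(group_a, group_b, pseudocount=1.0):
    """Per-gene log2 fold change (A over B) and Welch t statistic."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # The pseudocount keeps the ratio finite when a gene is absent in one group.
    log2fc = np.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.divide(mean_a - mean_b, se, out=np.zeros_like(se), where=se > 0)
    return log2fc, t

# Toy usage: gene 0 is higher in group A, gene 1 is unchanged.
log2fc, t = de_stats([[7, 2], [9, 2], [8, 3]], [[1, 2], [0, 3], [2, 2]])
```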
Figure · Marker gene heatmap, for cell-type and disease comparisons.
The biggest practical headache
Same cell type, different batch — looks like different biology.
SOURCES
- different patients
- different labs / hospitals
- different sequencing days
- different machines / protocols
CS FRAMING
Domain adaptation.
Align distributions across batches while preserving real biological variation.
Risk: over-correction erases the signal you cared about.
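The crudest possible correction makes both the idea and the risk concrete: shift every batch so its per-gene mean matches the global mean. This is a naive stand-in, not what integration tools actually do, and `center_per_batch` is a hypothetical helper:

```python
import numpy as np

def center_per_batch(expr, batch):
    """Shift each batch so its per-gene mean equals the overall mean."""
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    corrected = expr.copy()
    grand_mean = expr.mean(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        corrected[rows] += grand_mean - expr[rows].mean(axis=0)
    return corrected

# Toy usage: batch 1 is batch 0 plus a constant technical offset of 10.
expr = [[1, 2], [3, 4], [11, 12], [13, 14]]
corrected = center_per_batch(expr, [0, 0, 1, 1])
```

Here the offset was purely technical, so centering works. If the two batches had contained different cell types, the same operation would have erased a real biological difference, which is the over-correction risk noted above.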
Figure · The same dataset before and after batch integration.
Pseudotime & trajectory inference
scRNA-seq is a snapshot — but cells caught mid-process can reveal the process.
If your sample contains cells at different stages of a dynamic process, you can order them along an inferred path.
Examples: stem cell → mature neuron · naive T cell → activated T cell · healthy → cancer progression.
Manifold assumption
Cells lie on a low-dimensional trajectory embedded in the high-dimensional gene-expression space. Pseudotime methods reconstruct it.
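For the simplest case, a single unbranched and roughly linear process, ordering cells along the first principal component already gives a crude pseudotime. This is an illustrative assumption, not a trajectory method from the lecture. Real methods handle curved and branching paths, and the direction of the axis is arbitrary until a known starting cell fixes it:

```python
import numpy as np

def pseudotime_pc1(expr):
    """Crude pseudotime: position along PC1, rescaled to [0, 1]."""
    X = np.asarray(expr, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1 = U[:, 0] * S[0]
    span = pc1.max() - pc1.min()
    if span == 0:
        return np.zeros(len(X))
    return (pc1 - pc1.min()) / span

# Toy usage: 20 cells sampled in shuffled order along one linear process.
rng = np.random.default_rng(0)
true_time = rng.permutation(np.linspace(0, 1, 20))
cells = np.outer(true_time, [1.0, 2.0, -1.0]) + 5.0
pt = pseudotime_pc1(cells)
```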
Figure · Cells along a developmental path, ordered by pseudotime.
Where classical ML lives
Still the workhorse of most pipelines.
- PCA: linear dimensionality reduction
- kNN: cell similarity graphs
- Leiden: clustering
- LR / RF (logistic regression / random forest): annotation classifiers
- UMAP: 2D visualization
LIMITS
Nonlinear structure
Cross-dataset transfer
Count-distribution modeling
Multimodal integration
Deep learning · the pattern
Cell → embedding → everything.
noisy 20k-d gene vector → neural encoder → ~32-d embedding → downstream tasks
Downstream tasks: clustering, classification, visualization, integration, prediction.
Same playbook as image embeddings or text embeddings — just for cells.
Autoencoder · the simplest version
Compress, then reconstruct.
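"Compress, then reconstruct" in its most stripped-down form: a linear autoencoder trained by gradient descent on mean squared error, written in plain NumPy. Real autoencoders add nonlinear layers and are normally built in a deep learning framework. Every name and hyperparameter here is an assumption for illustration:

```python
import numpy as np

def train_linear_autoencoder(X, latent_dim=2, lr=0.01, epochs=500, seed=0):
    """Fit encoder/decoder weight matrices that minimize reconstruction MSE."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_genes = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_genes, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, n_genes))
    losses = []
    for _ in range(epochs):
        Z = X @ W_enc                      # compress: cells -> latent space
        err = Z @ W_dec - X                # reconstruct and compare
        losses.append(float((err ** 2).mean()))
        scale = 2.0 / err.size             # derivative of the mean of squares
        grad_dec = Z.T @ err * scale
        grad_enc = X.T @ (err @ W_dec.T) * scale
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses
```

The embedding of a cell is `x @ W_enc`. Without nonlinearities this model can only recover the subspace PCA would find, which is why the deep variants matter.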
Probabilistic deep learning · scVI
A VAE that respects how RNA counts actually look.
- latent: a distribution, not a point
- likelihood: zero-inflated negative binomial
- batches: modeled explicitly
- stack: PyTorch + AnnData (scvi-tools)
Why it matters
Models the data-generation process — uncertainty, technical noise, and batch in one framework.
Used for: integration, denoising, differential expression, automated annotation.
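To make the zero-inflated negative binomial concrete, here is its log-probability for a single count, using a mean / inverse-dispersion parameterization. This shows only the shape of the likelihood. In a model like scVI the parameters come from neural network decoders, which this sketch does not attempt to reproduce. `zinb_log_prob` is a hypothetical name:

```python
import math

def zinb_log_prob(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    mu: mean of the NB component (> 0), theta: inverse dispersion (> 0),
    pi: probability of an extra zero on top of the NB zeros (0 <= pi < 1).
    """
    log_nb = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )
    if x == 0:
        # A zero is either an inflated zero or a genuine NB zero.
        return math.log(pi + (1 - pi) * math.exp(log_nb))
    return math.log(1 - pi) + log_nb

p_zero = math.exp(zinb_log_prob(0, mu=3.0, theta=2.0, pi=0.3))
```

The two routes to a zero in the `x == 0` branch are the same "real or missing?" ambiguity listed under the Sparse difficulty, written as a probability.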
Graph neural networks
Cells & genes are graphs by nature.
CELL GRAPH
Edges between transcriptionally similar cells.
GENE GRAPH
Regulatory & pathway interactions between genes.
SPATIAL GRAPH
Edges between cells physically near each other in tissue.
Used for: cell-cell communication, regulatory network inference, perturbation prediction, disease modeling.
The analogy
Genes are tokens. A cell is a structured pattern of gene activity.
NLP
sentence = sequence of words
Pretrain on billions of tokens → general language model → fine-tune for tasks.
SINGLE-CELL
cell = pattern of gene activity
Pretrain on tens of millions of cells → general cell model → fine-tune for tasks.
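One simplified way to turn a cell into a token sequence is to list its expressed genes from highest to lowest expression. This is only an illustration of the "genes are tokens" idea, loosely in the spirit of rank-based encodings. The tokenization schemes of actual foundation models differ in detail and are not described in this deck. `cell_to_tokens` and `max_len` are assumptions:

```python
import numpy as np

def cell_to_tokens(expr, gene_names, max_len=2048):
    """Order a cell's expressed genes from highest to lowest expression."""
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(-expr, kind="stable")       # highest expression first
    expressed = [i for i in order if expr[i] > 0]  # unexpressed genes emit no token
    return [gene_names[i] for i in expressed[:max_len]]

# Toy usage: cell 2 from the count-matrix slide.
tokens = cell_to_tokens([8, 0, 1, 3], ["GENE A", "GENE B", "GENE C", "GENE D"])
```

The resulting list plays the role a sentence plays in NLP: a variable-length sequence drawn from a fixed vocabulary of about 20,000 genes.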
scGPT
33M+
cells in pretraining
Generative pretrained transformer for single-cell biology.
Same recipe as language models — applied to gene-expression contexts.
cell-type annotation
batch correction
multi-omics integration
gene network modeling
perturbation prediction
cross-dataset transfer
Geneformer
~30M
single-cell transcriptomes
Context-aware model focused on gene network dynamics.
Strong on transfer learning when labeled data is scarce — the regime that breaks training-from-scratch.
gene network modeling
disease-gene prioritization
therapeutic target discovery
perturbation effects
CHECK-IN · 2 of 3
Quick check
You have 500 cells from a rare disease.
Which approach do you reach for first?
A. Train scVI from scratch on the 500 cells
B. Train a deep CNN from scratch
C. Fine-tune a foundation model (scGPT / Geneformer)
D. Just use UMAP and hope for the best
Industry & clinical applications
scRNA-seq is now foundational in five fields.
CANCER
tumor heterogeneity, resistance, immune evasion
IMMUNOLOGY
exhausted T cells, vaccine response, inflammation
DRUG DISCOVERY
target ID, mechanism of action, biomarkers
PRECISION MED
why same diagnosis, different outcomes
TOXICOLOGY
which cell types a drug damages
Case study · tumor microenvironment
Cancer samples — pre & post treatment.
QUESTIONS
- which cell types are present?
- which immune cells are exhausted?
- which cancer cells survived therapy?
- which genes mark survivors?
- can we predict who will respond?
PIPELINE
QC + norm → cluster + annotate → batch correct → DE: pre vs post → predict response
Real translational science. Every step in this lecture, in one project.
Limitations of AI in scRNA-seq
Five things to stay honest about.
- 01 · Models can learn batch effects, not biology
- 02 · Training data is biased toward common tissues and well-studied diseases
- 03 · Deep models are hard to interpret: predictions ≠ explanations
- 04 · Predictions need wet-lab validation; until then they are hypotheses
- 05 · Preprocessing choices change results, sometimes a lot
Best practices
Boring rules that separate good work from bad.
01 · QUESTION FIRST
Start with the biology. Choose the model second.
02 · STRICT QC
Bad cells, doublets, dying cells — gone before training.
03 · CHECK BATCHES
Are clusters splitting by biology or by lab?
04 · DON'T TRUST UMAP
It's a sketch, not a proof.
05 · VALIDATE CLUSTERS
Marker genes, references, known biology.
06 · DOCUMENT EVERYTHING
Thresholds, parameters, software versions.
Where the field is going
Six active fronts.
FOUNDATION MODELS
scGPT, Geneformer, scaling further.
MULTIMODAL
RNA + chromatin + protein + imaging.
SPATIAL
Keep cells' physical location in tissue.
PERTURBATION
Predict knockout / drug effects in silico.
VIRTUAL CELL
Simulate full cell behavior under conditions.
LLM AGENTS
Automate analysis pipelines, interpret results.
The take-home
AI for scRNA-seq turns noisy molecular measurements from individual cells into reliable biological insight.
You bring the high-dimensional modeling. The biology brings the questions and the validation. The combination is what produces real science.
End of lecture
Questions?