BIOINFORMATICS & COMPUTATIONAL BIOLOGY METHODS
Faculty of Computers and
Artificial Intelligence
Benha University
Lecture · M.Sc in AI

AI for Single-Cell
RNA Sequencing Analysis

A high-dimensional data problem disguised as biology — and how modern AI is learning to solve it.

scrna-seq · ai · foundation models
Today

Three things, in order.

PART 1
Biology you actually need
Cells, DNA, RNA, gene expression — through a CS lens. ~10 min.
PART 2
The scRNA-seq pipeline
From raw reads to clusters: QC, normalization, dimensionality reduction, annotation, batch effects, trajectories. ~20 min.
PART 3
Where AI shows up
Classical ML, deep learning, scVI, GNNs, and the new foundation models — scGPT, Geneformer. ~15 min.
02 / 52
Why a CS student should care

A cell is a process running inside a giant distributed system.

BIOLOGY
  • The body
  • A tissue
  • DNA
  • Gene expression
  • Disease
  • A drug
CS ANALOG
  • Distributed system
  • Compute cluster
  • Stored source code
  • Runtime activity
  • Abnormal system state
  • External intervention
03 / 52
PART 01
Biology Primer · 10 min

Just enough biology
to do the math.

Six concepts. Cell, DNA, gene, RNA, protein, expression. After this, biology becomes a matrix.

Section 1 / 8
02 · What is a cell?

The basic unit of life — and a unit of computation.

A human body has ~37 trillion cells. Different jobs:

  • neuron: transmits electrical signals
  • muscle: contracts
  • immune: fights infection
  • red blood: carries oxygen
  • cancer: activates abnormal growth
The puzzle

Almost all of these cells contain the same DNA.

So if the source code is identical — why do they behave so differently?

05 / 52
The answer

Gene expression.

Different cells activate different genes. Same source code — different functions running.

NEURON

runs signaling & communication genes

MUSCLE

runs contraction genes

IMMUNE

runs defense & pathogen-recognition genes

CANCER

runs abnormal growth & survival genes

Identity is determined not only by which genes exist in the DNA, but by which of them are active, inactive, up-regulated, or down-regulated right now.

06 / 52
The central flow

DNA → RNA → Protein → Behavior

DNA

Long-term storage of genetic information.

≈ source code on disk

GENE

A region of DNA that codes for one functional product.

≈ a function definition

RNA

A temporary copy made when a gene is being used.

≈ executed instruction

PROTEIN

The functional molecule that does work in the cell.

≈ runtime object

scRNA-seq measures RNA — so it tells us which functions each cell is currently running, not just which exist.

07 / 52
Biology becomes a matrix

Expression level → number. Cell → vector. Dataset → matrix.

          GENE A  GENE B  GENE C  GENE D
cell 1       0       5       2       0
cell 2       8       0       1       3
cell 3       0       0       7       2

value = number of RNA transcripts detected

Up to ~10⁶ cells (rows) in large studies
~2 · 10⁴ genes (columns)
08 / 52
CHECK-IN · 1 of 3
Quick check
scRNA-seq directly measures…
A. DNA sequences in each cell
B. RNA transcripts in each cell
C. Protein abundances in each cell
D. Cell-cell physical contacts
09 / 52
PART 02
Bulk vs Single-Cell · 5 min

From averages
to individuals.

The conceptual revolution that made scRNA-seq worth doing.

Section 2 / 8
Bulk RNA-seq · the old way

One average profile from millions of mixed cells.

Like measuring the average CPU usage of a whole data center
without knowing which server is overloaded.
cancer cells
immune cells
blood vessels
connective tissue

All blended into one measurement.

11 / 52
Side by side

Bulk averages it all. Single-cell keeps every cell.
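
A minimal sketch of the difference, using made-up counts for two genes in six cells (the cell labels and numbers are assumptions for illustration only). Averaging over all cells gives one blended profile; grouping by cell type first keeps the two populations apart.

```python
from statistics import mean

# Hypothetical counts for two genes in six cells from one tissue sample.
single_cell = [
    ("cancer", [10, 0]),
    ("cancer", [12, 0]),
    ("cancer", [8, 0]),
    ("immune", [0, 6]),
    ("immune", [0, 4]),
    ("immune", [0, 5]),
]

def bulk_profile(cells):
    """Average every gene over all cells, as a bulk measurement effectively does."""
    n_genes = len(cells[0][1])
    return [mean(counts[g] for _, counts in cells) for g in range(n_genes)]

def per_type_profile(cells):
    """Average every gene within each cell type, which needs per-cell data."""
    by_type = {}
    for cell_type, counts in cells:
        by_type.setdefault(cell_type, []).append(counts)
    return {t: [mean(col) for col in zip(*rows)] for t, rows in by_type.items()}

bulk = bulk_profile(single_cell)
by_type = per_type_profile(single_cell)
print("bulk:", bulk)
print("per type:", by_type)
```

The bulk profile describes a cell that expresses both genes at moderate levels, yet no such cell exists in the sample.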

interactive
12 / 52
What you actually get back

A cell × gene count matrix.

Shape
  • rows: cells (10³ to 10⁶)
  • cols: genes (~20,000)
  • values: RNA transcript counts (integers)
Properties
  • very large & very sparse
  • noisy, with technical artifacts
  • partially missing (dropout)
  • heterogeneous across cells

Not a tidy ML benchmark. A real-world, messy, observational dataset.

13 / 52
Zoom in on the matrix

Most entries are zero.

interactive

Each tile is one (cell, gene) pair. A zero may mean truly not expressed — or not captured. We can't always tell.
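
To make "most entries are zero" concrete, here is a small sketch on a made-up matrix. It measures the share of zeros and stores only the non-zero entries as coordinate triples, which is the basic idea behind sparse matrix formats. The function names are chosen here for illustration.

```python
# Toy count matrix with made-up values: rows are cells, columns are genes.
counts = [
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0],
    [4, 0, 0, 0, 0, 0],
]

def zero_fraction(matrix):
    """Share of (cell, gene) entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def to_coordinate_list(matrix):
    """Keep only non-zero entries as (cell index, gene index, count) triples."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

sparsity = zero_fraction(counts)
nonzero = to_coordinate_list(counts)
print(f"{sparsity:.0%} zeros, {len(nonzero)} stored entries")
```

Note that the stored form cannot say why an entry is zero: a gene that is truly off and a transcript that was not captured look identical.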

14 / 52
PART 03
Why this data is hard · 5 min

Five reasons
your toolkit breaks.

Section 3 / 8
The five difficulties

Why scRNA-seq isn't standard ML.

01
High-dim

~20,000 features per cell — distance metrics misbehave.

02
Sparse

85%+ zeros. Each zero is ambiguous: real or missing?

03
Noisy

Sequencing variation, dying cells, doublets, dropouts.

04
Batch effects

Lab, day, machine — confounded with biology.

05
Unlabeled

No ground truth cell types. Mostly unsupervised.

16 / 52
The central computational problem

Given a noisy, high-dimensional, sparse matrix —
learn a representation that captures real biological variation while suppressing technical noise.

Everything else in this lecture — clustering, autoencoders, foundation models — is one answer to that one sentence.

17 / 52
The standard pipeline

Click any step to see what it does.

interactive
18 / 52
PART 04
Pipeline walk-through · 12 min

QC. Normalize.
Reduce. Cluster.
Annotate.

Section 4 / 8
Step 1 · Quality control

Not every droplet is a real cell.

DROP THESE
  • empty droplets — no cell inside
  • dying cells — high mito %
  • doublets — two cells in one
  • low-quality — too few genes detected
CS framing

Data cleaning before training.

Garbage in, artifacts out. Most “surprising” discoveries in poorly-QC'd datasets turn out to be technical noise.
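
A rough sketch of the two simplest filters: genes detected per cell and mitochondrial fraction. It assumes human gene symbols, where mitochondrial genes carry the "MT-" prefix. The thresholds and the toy cells are hypothetical and far smaller than real data; doublet detection needs a dedicated method and is not covered.

```python
def qc_metrics(cell):
    """Summarize one cell given as a {gene symbol: count} dict."""
    total = sum(cell.values())
    n_genes = sum(1 for count in cell.values() if count > 0)
    mito = sum(count for gene, count in cell.items() if gene.startswith("MT-"))
    return {
        "total": total,
        "n_genes": n_genes,
        "mito_frac": mito / total if total else 0.0,
    }

def passes_qc(cell, min_genes=3, max_mito_frac=0.2):
    """Hypothetical thresholds; real cutoffs are chosen per dataset."""
    m = qc_metrics(cell)
    return m["n_genes"] >= min_genes and m["mito_frac"] <= max_mito_frac

# Made-up droplets: one healthy cell, one dying cell, one near-empty droplet.
droplets = {
    "healthy": {"CD3D": 5, "LYZ": 2, "ACTB": 10, "MT-CO1": 1},
    "dying": {"CD3D": 1, "LYZ": 1, "ACTB": 2, "MT-CO1": 12},
    "empty": {"ACTB": 1},
}

kept = [name for name, cell in droplets.items() if passes_qc(cell)]
print(kept)
```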

20 / 52
Step 2 · Normalization

Make cells comparable.

Different cells get sequenced to different depths. Without correction, the model learns library size instead of biology.

raw counts
scale by library size
log(x+1)
Goal

Reduce technical differences just enough that biological structure becomes detectable.

Not perfect. Just comparable.
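
The three boxes above (raw counts, scale by library size, log(x+1)) can be sketched as follows. The target of 10,000 counts per cell is a common convention but an assumption here, and the two toy cells are made up: they have the same composition, sequenced at different depths.

```python
import math

def normalize_cell(counts, target_sum=10_000):
    """Scale a cell to a fixed total count, then apply log(x + 1)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / total * target_sum) for c in counts]

# Same relative expression, 10x difference in sequencing depth.
shallow_cell = [1, 0, 3]
deep_cell = [10, 0, 30]

norm_shallow = normalize_cell(shallow_cell)
norm_deep = normalize_cell(deep_cell)
print(norm_shallow)
print(norm_deep)
```

After normalization the two cells coincide, so a downstream model sees their shared composition rather than their library size.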

21 / 52
Step 3 · Feature selection

Keep the genes that vary. Drop the rest.

START
~20k

all detected genes

SELECT
~2k

highly variable genes

USE

For all downstream
distance & modeling.

Roughly 10× fewer features: faster, less noise, most of the signal retained.
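
A simplified sketch of the selection step: rank genes by their variance across cells and keep the top few. Plain variance on log-normalized values is an assumption made here for brevity; real pipelines usually also correct for the dependence of variance on mean expression. The expression values are made up.

```python
from statistics import pvariance

def top_variable_genes(matrix, gene_names, n_top):
    """Return the n_top gene names with the largest variance across cells."""
    columns = list(zip(*matrix))  # one tuple of values per gene
    variances = [pvariance(column) for column in columns]
    ranked = sorted(range(len(gene_names)), key=lambda g: variances[g], reverse=True)
    return [gene_names[g] for g in ranked[:n_top]]

# Made-up log-normalized expression: rows are cells, columns are genes.
genes = ["ACTB", "CD3D", "GAPDH", "MS4A1"]
log_expr = [
    [2.0, 0.1, 1.0, 0.0],
    [2.1, 3.0, 1.0, 0.0],
    [1.9, 0.0, 1.0, 2.5],
    [2.0, 2.9, 1.0, 0.1],
]

selected = top_variable_genes(log_expr, genes, n_top=2)
print(selected)
```

The two genes that stay nearly constant across cells are dropped, since they cannot help tell cells apart.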

22 / 52
Step 4 · Dimensionality reduction

Compress high-dimensional cells into something we can compute & visualize.

PCA

Linear. Fast. First reduction step. ~2k selected genes (of ~20k) → ~50 dims.

UMAP / t-SNE

Nonlinear. For visualization. 50 → 2 dims.

DEEP MODELS

Autoencoders, scVI. Learned latent space. Keeps nonlinear structure.

⚠ WARNING · A UMAP plot is a visualization — not a proof of biological truth. Distances and shapes can mislead.
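
A minimal PCA sketch using NumPy's SVD, on synthetic data generated from two hidden factors plus noise. The data, the sizes, and the function name are assumptions for illustration; a real analysis would run this on the normalized expression matrix.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic data: 200 cells, 50 genes, driven by 2 hidden factors plus noise.
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 50))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 50))

scores, explained = pca(X, n_components=2)
print(scores.shape, explained.sum())
```

Because the synthetic data really is two-dimensional plus small noise, two components capture almost all of the variance. Real expression data is rarely this clean.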

23 / 52
Live · A PBMC UMAP

Hover any cluster to see its marker genes.

interactive
24 / 52
Step 5 · Graph & clustering

Build a kNN graph. Find communities.

  • node: each cell
  • edge: connect to k nearest neighbors in PCA space
  • cluster: Leiden / Louvain community detection

A cluster is a hypothesis, not a conclusion. It becomes biology only after validation.
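
A small sketch of the graph step on made-up 2-D coordinates standing in for PCA space. Leiden and Louvain need a dedicated library, so connected components of the kNN graph are used here as a deliberately crude stand-in for community detection.

```python
import math

def knn_graph(points, k):
    """Map each cell index to the set of its k nearest neighbors (Euclidean)."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = {j for _, j in dists[:k]}
    return graph

def connected_components(graph):
    """Label cells by connected component of the symmetrized graph."""
    undirected = {i: set(neighbors) for i, neighbors in graph.items()}
    for i, neighbors in graph.items():
        for j in neighbors:
            undirected[j].add(i)
    labels, current = {}, 0
    for start in undirected:
        if start in labels:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = current
            stack.extend(n for n in undirected[node] if n not in labels)
        current += 1
    return labels

# Two made-up, well-separated groups of cells.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = connected_components(knn_graph(points, k=2))
print(labels)
```

On real data the graph is usually one connected piece, which is why modularity-based methods such as Leiden are used instead.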

25 / 52
Step 6 · Cell-type annotation

Give clusters biological names.

MANUAL

Use known marker genes.

cluster 3
CD3D high
T cell

Interpretable, slow, needs expertise.

AUTOMATED

Reference atlases & classifiers.

unknown cell
classifier
label

Fast, scalable, fails on novel states.
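
The manual route can be sketched as a marker-score lookup: for each candidate cell type, average the cluster's expression of its marker genes and pick the highest. The lecture names only CD3D; the other markers below are widely used PBMC markers added here for illustration, and the expression values are made up.

```python
from statistics import mean

# Marker lists are an illustrative assumption, not a curated reference.
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["LYZ", "CD14"],
}

def annotate_cluster(mean_expr, markers=MARKERS):
    """Pick the cell type whose markers have the highest average expression."""
    scores = {
        cell_type: mean(mean_expr.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Made-up mean expression per gene for one cluster.
cluster_3 = {"CD3D": 2.8, "CD3E": 2.1, "MS4A1": 0.1, "LYZ": 0.3}
label, scores = annotate_cluster(cluster_3)
print(label, scores)
```

A lookup like this always returns one of the known labels, which is exactly how automated annotation can fail on novel cell states.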

26 / 52
Step 7 · Differential expression

Which genes differ between groups?

  • cancer cells vs normal cells
  • pre-treatment vs post-treatment
  • responder vs non-responder
  • cluster A vs all others
Why it matters

The bridge from numbers to biology.

Differential genes become marker signatures, disease pathways, drug targets, biomarkers.
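
A bare-bones sketch of a two-group comparison: log2 fold change plus Welch's t statistic per gene. The gene names and values are made up, and this is not how production tools do it; they typically use count-aware or rank-based tests and correct for multiple testing across thousands of genes.

```python
import math
from statistics import mean, variance

def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean(group_a) + pseudocount) / (mean(group_b) + pseudocount))

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

# Made-up expression values: (post-treatment cells, pre-treatment cells).
expression = {
    "GENE_A": ([7, 9, 8, 8], [1, 0, 2, 1]),
    "GENE_B": ([3, 4, 3, 4], [4, 3, 4, 3]),
}

results = {
    gene: (log2_fold_change(post, pre), welch_t(post, pre))
    for gene, (post, pre) in expression.items()
}
ranked = sorted(results, key=lambda g: abs(results[g][0]), reverse=True)
print(ranked, results)
```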

27 / 52
Live · Marker gene heatmap

Toggle between cell-type and disease comparisons.

interactive
28 / 52
PART 05
Hard problems · 8 min

Batch effects.
Trajectories.

Two things that break a naive pipeline.

Section 5 / 8
The biggest practical headache

Same cell type, different batch — looks like different biology.

SOURCES
  • different patients
  • different labs / hospitals
  • different sequencing days
  • different machines / protocols
CS FRAMING

Domain adaptation.

Align distributions across batches while preserving real biological variation.

Risk: over-correction erases the signal you cared about.
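
The simplest possible correction is to subtract each batch's own mean, gene by gene. The sketch below does that for one gene on made-up values; it is a naive baseline chosen for illustration, not one of the integration methods used in practice.

```python
from statistics import mean

def center_per_batch(values, batches):
    """Subtract each batch's own mean from its cells (one gene at a time)."""
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]

# Made-up expression of one gene in the same cell type, measured in two labs.
values = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
batches = ["lab1", "lab1", "lab1", "lab2", "lab2", "lab2"]

corrected = center_per_batch(values, batches)
print(corrected)
```

Here the shift between labs is purely technical, so centering aligns the two batches. If lab2 had actually contained a different cell type, the same operation would have erased that real difference, which is the over-correction risk named above.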

30 / 52
Toggle the correction

Watch what integration does.

interactive
31 / 52
Pseudotime & trajectory inference

scRNA-seq is a snapshot — but cells caught mid-process can reveal the process.

If your sample contains cells at different stages of a dynamic process, you can order them along an inferred path.

Examples: stem cell → mature neuron · naive T cell → activated T cell · healthy → cancer progression.

Manifold assumption

Cells lie on a low-dimensional trajectory embedded in the high-dimensional gene-expression space. Pseudotime methods reconstruct it.
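
One simple way to turn the manifold assumption into an ordering is to build a kNN graph and use each cell's shortest-path distance from a chosen root cell as its pseudotime. This is a sketch of that idea only, not a specific published method; the root cell, the value of k, and the arc-shaped toy data are all assumptions.

```python
import heapq
import math

def pseudotime(points, root, k=2):
    """Shortest-path distance from the root cell over a symmetric kNN graph."""
    n = len(points)
    neighbors = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            neighbors[i][j] = d
            neighbors[j][i] = d
    # Dijkstra's algorithm from the root cell.
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, math.inf):
            continue
        for j, w in neighbors[i].items():
            if d + w < dist.get(j, math.inf):
                dist[j] = d + w
                heapq.heappush(heap, (d + w, j))
    return [dist.get(i, math.inf) for i in range(n)]

# Made-up cells sampled along an arc, listed in shuffled order.
angles_deg = [90, 0, 150, 30, 180, 60, 120]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

t = pseudotime(points, root=angles_deg.index(0))
ordered = [angles_deg[i] for i in sorted(range(len(t)), key=t.__getitem__)]
print(ordered)
```

Sorting by pseudotime recovers the order along the arc even though the cells were measured in no particular order. Cells unreachable from the root get infinite pseudotime.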

32 / 52
Watch a developmental path light up

Drag the slider, or hit auto.

interactive
33 / 52
PART 06
Where AI shows up · 10 min

Classical → Deep →
Foundation.

Section 6 / 8
Where classical ML lives

Still the workhorse of most pipelines.

  • PCA: linear dimensionality reduction
  • kNN: cell similarity graphs
  • Leiden: clustering
  • LR / RF: annotation classifiers
  • UMAP: 2D visualization
LIMITS

Nonlinear structure

Cross-dataset transfer

Count-distribution modeling

Multimodal integration

35 / 52
Deep learning · the pattern

Cell → embedding → everything.

noisy 20k-d gene vector
neural encoder
~32-d embedding
downstream tasks
clustering
classification
visualization
integration
prediction

Same playbook as image embeddings or text embeddings — just for cells.

36 / 52
Autoencoder · the simplest version

Compress, then reconstruct.
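
A stripped-down sketch of "compress, then reconstruct": a linear autoencoder trained by plain gradient descent in NumPy. The synthetic data, layer sizes, learning rate, and step count are assumptions for illustration. Real single-cell autoencoders add nonlinear layers and count-aware losses; a purely linear one can only learn the same subspace PCA finds.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 cells, 20 genes, generated from 3 hidden factors.
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
X = X - X.mean(axis=0)

n_latent = 3
W_enc = 0.1 * rng.normal(size=(20, n_latent))  # encoder weights
W_dec = 0.1 * rng.normal(size=(n_latent, 20))  # decoder weights

def loss_and_grads(X, W_enc, W_dec):
    """Mean squared reconstruction error and its gradients."""
    Z = X @ W_enc          # compress each cell to n_latent numbers
    X_hat = Z @ W_dec      # reconstruct the gene vector
    err = X_hat - X
    loss = np.mean(err ** 2)
    scale = 2.0 / err.size
    grad_dec = scale * (Z.T @ err)
    grad_enc = scale * (X.T @ (err @ W_dec.T))
    return loss, grad_enc, grad_dec

learning_rate = 0.05
losses = []
for step in range(1000):
    loss, grad_enc, grad_dec = loss_and_grads(X, W_enc, W_dec)
    losses.append(loss)
    W_enc -= learning_rate * grad_enc
    W_dec -= learning_rate * grad_dec

embedding = X @ W_enc  # the learned low-dimensional representation
print(losses[0], losses[-1], embedding.shape)
```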

interactive
37 / 52
Probabilistic deep learning · scVI

A VAE that respects how RNA counts actually look.

  • latent: distribution, not point
  • likelihood: zero-inflated negative binomial
  • batches: modeled explicitly
  • stack: PyTorch + AnnData (scvi-tools)
Why it matters

Models the data-generation process — uncertainty, technical noise, and batch in one framework.

Used for: integration, denoising, differential expression, automated annotation.
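
To show what a zero-inflated negative binomial looks like, here is a standalone sketch of its probability mass function. The parameter names (mu for the mean, theta for inverse dispersion, pi for the extra-zero probability) and the example values are notation chosen here. In a model like scVI such parameters would be learned from data rather than fixed by hand, and this snippet is not scvi-tools code.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_pmf(x, mu, theta, pi):
    """Probability of count x when a fraction pi of entries is forced to zero."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb

# Example values are arbitrary.
mu, theta, pi = 2.0, 1.5, 0.3
probs = [zinb_pmf(x, mu, theta, pi) for x in range(200)]
mean_count = sum(x * p for x, p in enumerate(probs))
print(probs[0], math.exp(nb_log_pmf(0, mu, theta)), mean_count)
```

The zero-inflated version puts more mass on zero than the plain negative binomial with the same mu and theta, which is how the likelihood accounts for dropout on top of genuine low expression.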

38 / 52
Graph neural networks

Cells & genes are graphs by nature.

CELL GRAPH

Edges between transcriptionally similar cells.

GENE GRAPH

Regulatory & pathway interactions between genes.

SPATIAL GRAPH

Edges between cells physically near each other in tissue.

Used for: cell-cell communication, regulatory network inference, perturbation prediction, disease modeling.

39 / 52
PART 07
Foundation models · 5 min

A cell as
a sentence.

What happens when we apply the GPT recipe to biology.

Section 7 / 8
The analogy

Genes are tokens. A cell is a structured pattern of gene activity.

NLP

sentence = sequence of words

Pretrain on billions of tokens → general language model → fine-tune for tasks.

SINGLE-CELL

cell = pattern of gene activity

Pretrain on tens of millions of cells → general cell model → fine-tune for tasks.
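
One simple way to make a cell look like a sentence is to rank its expressed genes from highest to lowest and use the gene names as tokens. The sketch below illustrates that idea on made-up values. It is a simplified illustration of rank-based encoding, not the exact preprocessing of scGPT or Geneformer, and the function name and max_len cutoff are assumptions.

```python
def cell_to_tokens(expression, max_len=4):
    """Turn one cell into a gene-token sequence, most expressed gene first."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))  # ties broken alphabetically
    return [gene for gene, _ in expressed[:max_len]]

# Made-up normalized expression for one cell.
cell = {"CD3D": 9.0, "ACTB": 4.0, "MS4A1": 0.0, "CD3E": 6.0, "LYZ": 1.0, "GAPDH": 4.0}

tokens = cell_to_tokens(cell)
print(tokens)
```

Unlike words in a sentence, genes have no natural order, so the ordering has to be imposed by a rule like this one. Unexpressed genes simply do not appear in the sequence.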

41 / 52
scGPT
33M+
cells in pretraining

Generative pretrained transformer for single-cell biology.

Same recipe as language models — applied to gene-expression contexts.

cell-type annotation
batch correction
multi-omics integration
gene network modeling
perturbation prediction
cross-dataset transfer
42 / 52
Geneformer
~30M
single-cell transcriptomes

Context-aware model focused on gene network dynamics.

Strong on transfer learning when labeled data is scarce — the regime that breaks training-from-scratch.

gene network modeling
disease-gene prioritization
therapeutic target discovery
perturbation effects
43 / 52
CHECK-IN · 2 of 3
Quick check
You have 500 cells from a rare disease.
Which approach do you reach for first?
A. Train scVI from scratch on the 500 cells
B. Train a deep CNN from scratch
C. Fine-tune a foundation model (scGPT / Geneformer)
D. Just use UMAP and hope for the best
PART 08
Applications & wrap · 5 min

Where this matters.

Section 8 / 8
Industry & clinical applications

scRNA-seq is now foundational in five fields.

CANCER

tumor heterogeneity, resistance, immune evasion

IMMUNOLOGY

exhausted T cells, vaccine response, inflammation

DRUG DISCOVERY

target ID, mechanism of action, biomarkers

PRECISION MED

why same diagnosis, different outcomes

TOXICOLOGY

which cell types a drug damages

46 / 52
Case study · tumor microenvironment

Cancer samples — pre & post treatment.

QUESTIONS
  • which cell types are present?
  • which immune cells are exhausted?
  • which cancer cells survived therapy?
  • which genes mark survivors?
  • can we predict who will respond?
PIPELINE
QC + norm
cluster + annotate
batch correct
DE: pre vs post
predict response

Real translational science. Every step in this lecture, in one project.

47 / 52
Limitations of AI in scRNA-seq

Five things to stay honest about.

48 / 52
Best practices

Boring rules that separate good work from bad.

01 · QUESTION FIRST

Start with the biology. Choose the model second.

02 · STRICT QC

Bad cells, doublets, dying cells — gone before training.

03 · CHECK BATCHES

Are clusters splitting by biology or by lab?

04 · DON'T TRUST UMAP

It's a sketch, not a proof.

05 · VALIDATE CLUSTERS

Marker genes, references, known biology.

06 · DOCUMENT EVERYTHING

Thresholds, parameters, software versions.

49 / 52
Where the field is going

Six active fronts.

FOUNDATION MODELS

scGPT, Geneformer, scaling further.

MULTIMODAL

RNA + chromatin + protein + imaging.

SPATIAL

Keep cells' physical location in tissue.

PERTURBATION

Predict knockout / drug effects in silico.

VIRTUAL CELL

Simulate full cell behavior under conditions.

LLM AGENTS

Automate analysis pipelines, interpret results.

50 / 52
The take-home

AI for scRNA-seq turns noisy molecular measurements
from individual cells into reliable biological insight.

You bring the high-dimensional modeling. The biology brings the questions and the validation. The combination is what produces real science.

51 / 52
End of lecture

Questions?

The interactive demos stay live in this deck — feel free to play with them.
