In this notebook, we will use scVelo (Bergen et al.) to do RNA-velocity analysis for a primary tumor sample of patient T71 from the dataset GSE137804
(Transcriptomic Characterization of neuroblastoma by Single Cell RNA Sequencing, Donr R. 2020)
After projecting the primary tumor cells from neuroblastoma patients into a reference atlas of early human embryos and fetal adrenal glands at relatively late development stages, the authors found that tumor cells from patient data resemble two sub-populations: cycling chromaffin cells as undifferentiated chromaffin cells (UCHCs
) and non-cycling chromaffin cells as differentiated chromaffin cells (DCHCs
).
We will begin with identifying two sub-populations of tumor cells in this sample, and use scVelo to predict the progression from one tumor sub-population to another.
import scvelo as scv
import scanpy as sc
import urllib.request
The input of this notebook is an Adata object generated by the Kallisto BUS velocity pipeline. We intersected the cell barcodes generated by Kallisto with the filtered cells annotation provided by the author to produce an AnnData object of 9398
cells, whic are tumor cells only. The AnnData object has two layers of expression data which are spliced
and unspliced
patient_t71, _ = urllib.request.urlretrieve(
'https://www.dropbox.com/s/tqhuzfwqda4pn0b/GSE137804-kallisto.symbol.h5ad?dl=1',
'GSE137804-kallisto.symbol.h5ad'
)
adata = sc.read_h5ad('GSE137804-kallisto.symbol.h5ad')
adata
Display the proportion of spliced molecules and unspliced molecules in this dataset
scv.pl.proportions(adata)
Depending on the protocol used (Drop-Seq, Smart-Seq), there are about 10%-25% of unspliced mRNA molecules which containing intronic sequences.
Preprocessing requisites consist of gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.
scv.pp.filter_and_normalize(adata, min_shared_counts=3, n_top_genes=4000)
scv.pp.log1p(adata)
Computes moments for velocity estimation. First-/second-order moments are computed for each cell across its nearest neighbors, where the neighbor graph is obtained from euclidean distances in PCA space.
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Compute the t-SNE
sc.tl.tsne(adata)
Do Leiden cell clustering. We suspect that there are two sub-population of tumor cells in this dataset so we use a small resolution
sc.tl.leiden(adata, resolution = 0.2)
We plot the t-SNE with cells colored by some marker genes like PCP4 (chromaffin cell-like marker), TOP2A, and HELLS (cell-cycle markers), and the leiden clustering
sc.pl.tsne(adata, color=['PCP4', 'NPY', 'TOP2A', 'HELLS', 'leiden'], ncols=3, color_map='YlOrRd')
Since the author didn't provide the exact annotation for each cell, we will use the marker genes reported in the publication to identify cluster 0 as the DCHC-like
cell population, and cluster 1 as the UCHC-like
population
adata.obs['celltype'] = (
adata.obs['leiden'].map(lambda x: {
'0': 'DCHC-like',
'1': 'UCHC-like'
}.get(x, x)).astype('category')
)
Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic').
For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio.
scv.tl.velocity(adata)
The computed velocities are stored in adata.layers just like the count matrices
adata.layers['velocity']
Positive velocity indicates that a gene is actively being transcribed, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis='tsne', color='celltype', arrow_size=2)
The velocity vectors were embedded in t-SNE plot. It provides a snapshot of the cellular dynamics within DCHC-like and the UCHC-like populations and between them. At the junction between two cell populations, there are positive velocity going from UCHC-like cells toward the DCHC-like cells but not vice versa. This result is consistent with the finding described in the paper that UCHC-like cells are the undifferentiated chromaffin cells and actively transitioning toward the differentiated state.
scv.pl.velocity(adata, ['PCP4', 'TOP2A', 'HELLS'], color = 'celltype')