In this notebook, we will use scVelo (Bergen et al.) to do RNA-velocity analysis for a primary tumor sample of patient T71 from the dataset GSE137804
(Transcriptomic Characterization of neuroblastoma by Single Cell RNA Sequencing, Donr R. 2020)
After projecting the primary tumor cells from neuroblastoma patients into a reference atlas of early human embryos and fetal adrenal glands at relatively late development stages, the authors found that tumor cells from patient data resemble two sub-populations: cycling chromaffin cells as undifferentiated chromaffin cells (UCHCs
) and non-cycling chromaffin cells as differentiated chromaffin cells (DCHCs
).
We will begin with identifying two sub-populations of tumor cells in this sample, and use scVelo to predict the progression from one tumor sub-population to another.
import scvelo as scv
import scanpy as sc
import urllib.request
The input of this notebook is an Adata object generated by the Kallisto BUS velocity pipeline. We intersected the cell barcodes generated by Kallisto with the filtered cells annotation provided by the author to produce an AnnData object of 9398
cells, whic are tumor cells only. The AnnData object has two layers of expression data which are spliced
and unspliced
patient_t71, _ = urllib.request.urlretrieve(
'https://www.dropbox.com/s/tqhuzfwqda4pn0b/GSE137804-kallisto.symbol.h5ad?dl=1',
'GSE137804-kallisto.symbol.h5ad'
)
adata = sc.read_h5ad('GSE137804-kallisto.symbol.h5ad')
adata
Display the proportion of spliced molecules and unspliced molecules in this dataset
scv.pl.proportions(adata)
Depending on the protocol used (Drop-Seq, Smart-Seq), there are about 10%-25% of unspliced mRNA molecules which containing intronic sequences.
Preprocessing requisites consist of gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.
scv.pp.filter_and_normalize(adata, min_shared_counts=3, n_top_genes=4000)
scv.pp.log1p(adata)
Computes moments for velocity estimation. First-/second-order moments are computed for each cell across its nearest neighbors, where the neighbor graph is obtained from euclidean distances in PCA space.
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Compute the t-SNE
sc.tl.tsne(adata)
Do Leiden cell clustering. We suspect that there are two sub-population of tumor cells in this dataset so we use a small resolution
sc.tl.leiden(adata, resolution = 0.2)
We plot the t-SNE with cells colored by some marker genes like PCP4 (chromaffin cell-like marker), TOP2A, and HELLS (cell-cycle markers), and the leiden clustering
sc.pl.tsne(adata, color=['PCP4', 'NPY', 'TOP2A', 'HELLS', 'leiden'], ncols=3, color_map='YlOrRd')
Since the author didn't provide the exact annotation for each cell, we will use the marker genes reported in the publication to identify cluster 0 as the DCHC-like
cell population, and cluster 1 as the UCHC-like
population
adata.obs['celltype'] = (
adata.obs['leiden'].map(lambda x: {
'0': 'DCHC-like',
'1': 'UCHC-like'
}.get(x, x)).astype('category')
)
Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic').
For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio.
scv.tl.velocity(adata)
The computed velocities are stored in adata.layers just like the count matrices
adata.layers['velocity']
Positive velocity indicates that a gene is actively being transcribed, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis='tsne', color='celltype', arrow_size=2)
The velocity vectors were embedded in t-SNE plot. It provides a snapshot of the cellular dynamics within DCHC-like and the UCHC-like populations and between them. At the junction between two cell populations, there are positive velocity going from UCHC-like cells toward the DCHC-like cells but not vice versa. This result is consistent with the finding described in the paper that UCHC-like cells are the undifferentiated chromaffin cells and actively transitioning toward the differentiated state.
scv.pl.velocity(adata, ['PCP4', 'TOP2A', 'HELLS'], color = 'celltype')
The first column contain so-called phase portrait plot displaying the spliced count versus unspliced count for each gene. To interprete this plot, please check https://user-images.githubusercontent.com/31883718/80227452-eb822480-864d-11ea-9399-56886c5e2785.gif. In generall, the black line in phase portrait corresponds to the estimated ‘steady-state’ ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line.
The second colum show the velocities embedded in t-SNE plot. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
The third column display gene expression after taking into account both spliced and unspliced counts.
scv.pl.scatter(adata, ['PCP4', 'TOP2A', 'HELLS'], color=['velocity'])
It is improtant to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population.
The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.
scv.tl.rank_velocity_genes(adata, groupby='celltype', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head(n=10)
scv.tl.score_genes_cell_cycle(adata)
scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])
scv.pl.velocity_embedding_stream(adata, basis='tsne', color='celltype', arrow_size=2)
The cell-cycle enrichment score shows that most of the UCHC-like
cells are in the S phase, which are predicted to transit into the DCHC-like
population. Part of the UCHC-like
population has the self-cycling velocity vectors, and low S-core enriched
s_genes, g2m_genes = scv.utils.get_phase_marker_genes(adata)
s_genes = scv.get_df(adata[:, s_genes], 'spearmans_score', sort_values=True).index
g2m_genes = scv.get_df(adata[:, g2m_genes], 'spearmans_score', sort_values=True).index
kwargs = dict(frameon=False, ylabel='cell cycle genes')
scv.pl.scatter(adata, list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs)
The velocity values and gene expression values of two cell-cycle related genes, HELLS
and TOP2A
scv.pl.velocity(adata, ['HELLS', 'TOP2A'], ncols=1, add_outline=True, color='celltype')
Two more useful stats: - The speed or rate of differentiation is given by the length of the velocity vector. - The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.
scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])
df = adata.obs.groupby('celltype')[keys].mean().T
df.style.background_gradient(cmap='coolwarm', axis=1)
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells. It consistently predicts that the UCHC-like cells are the root and giving rise to DCHC-like cells.