In this notebook, we will use scVelo (Bergen et al.) to do RNA-velocity analysis for a primary tumor sample of patient T71 from the dataset GSE137804 (Transcriptomic Characterization of neuroblastoma by Single Cell RNA Sequencing, Donr R. 2020)

After projecting the primary tumor cells from neuroblastoma patients into a reference atlas of early human embryos and fetal adrenal glands at relatively late development stages, the authors found that tumor cells from patient data resemble two sub-populations: cycling chromaffin cells as undifferentiated chromaffin cells (UCHCs) and non-cycling chromaffin cells as differentiated chromaffin cells (DCHCs).

We will begin with identifying two sub-populations of tumor cells in this sample, and use scVelo to predict the progression from one tumor sub-population to another.

Load required packages

In [2]:
import scvelo as scv
import scanpy as sc
import urllib.request

Load the data

The input of this notebook is an Adata object generated by the Kallisto BUS velocity pipeline. We intersected the cell barcodes generated by Kallisto with the filtered cells annotation provided by the author to produce an AnnData object of 9398 cells, whic are tumor cells only. The AnnData object has two layers of expression data which are spliced and unspliced

In [3]:
patient_t71, _ = urllib.request.urlretrieve(
    'https://www.dropbox.com/s/tqhuzfwqda4pn0b/GSE137804-kallisto.symbol.h5ad?dl=1',
    'GSE137804-kallisto.symbol.h5ad'
)
In [4]:
adata = sc.read_h5ad('GSE137804-kallisto.symbol.h5ad')
adata
Out[4]:
AnnData object with n_obs × n_vars = 9398 × 58367
    obs: 'sample', 'celltype'
    layers: 'spliced', 'unspliced'

Display the proportion of spliced molecules and unspliced molecules in this dataset

In [5]:
scv.pl.proportions(adata)

Depending on the protocol used (Drop-Seq, Smart-Seq), there are about 10%-25% of unspliced mRNA molecules which containing intronic sequences.

Basic preprocessing the data

Preprocessing requisites consist of gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.

In [6]:
scv.pp.filter_and_normalize(adata, min_shared_counts=3, n_top_genes=4000)
scv.pp.log1p(adata)
Filtered out 48829 genes that are detected 3 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 4000 highly variable genes.
Logarithmized X.

Computes moments for velocity estimation. First-/second-order moments are computed for each cell across its nearest neighbors, where the neighbor graph is obtained from euclidean distances in PCA space.

In [7]:
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
computing neighbors
    finished (0:00:24) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:01) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

Compute the t-SNE

In [9]:
sc.tl.tsne(adata)

Do Leiden cell clustering. We suspect that there are two sub-population of tumor cells in this dataset so we use a small resolution

In [10]:
sc.tl.leiden(adata, resolution = 0.2)

We plot the t-SNE with cells colored by some marker genes like PCP4 (chromaffin cell-like marker), TOP2A, and HELLS (cell-cycle markers), and the leiden clustering

In [11]:
sc.pl.tsne(adata, color=['PCP4', 'NPY', 'TOP2A', 'HELLS', 'leiden'], ncols=3, color_map='YlOrRd')

Since the author didn't provide the exact annotation for each cell, we will use the marker genes reported in the publication to identify cluster 0 as the DCHC-like cell population, and cluster 1 as the UCHC-like population

In [12]:
adata.obs['celltype'] = (
    adata.obs['leiden'].map(lambda x: {
        '0': 'DCHC-like',
        '1': 'UCHC-like'
    }.get(x, x)).astype('category')
)

Estimate RNA velocity

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic').

For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio.

In [13]:
scv.tl.velocity(adata)
computing velocities
    finished (0:00:01) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

The computed velocities are stored in adata.layers just like the count matrices

In [14]:
adata.layers['velocity']
Out[14]:
array([[-0.00843609, -0.01818206, -0.00271271, ..., -0.00329949,
         0.12779334,  0.02179979],
       [ 0.        , -0.03086859, -0.00939301, ...,  0.        ,
         0.14171043,  0.        ],
       [ 0.        ,  0.        , -0.01660287, ..., -0.00402984,
        -0.03750505,  0.03165812],
       ...,
       [-0.03606855,  0.        ,  0.00575436, ..., -0.003869  ,
        -0.00928451,  0.05416595],
       [-0.0017605 ,  0.02446539, -0.00124819, ..., -0.00184293,
         0.04092183,  0.01954305],
       [-0.00179475,  0.00253087, -0.00587782, ..., -0.00213099,
         0.14145498,  0.01485818]], dtype=float32)

Positive velocity indicates that a gene is actively being transcribed, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

In [15]:
scv.tl.velocity_graph(adata)
computing velocity graph (using 1/192 cores)
  0%|          | 0/9398 [00:00<?, ?cells/s]
    finished (0:00:08) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

Project the velocities

In [17]:
scv.pl.velocity_embedding_stream(adata, basis='tsne', color='celltype', arrow_size=2)
computing velocity embedding
    finished (0:00:01) --> added
    'velocity_tsne', embedded velocity vectors (adata.obsm)

The velocity vectors were embedded in t-SNE plot. It provides a snapshot of the cellular dynamics within DCHC-like and the UCHC-like populations and between them. At the junction between two cell populations, there are positive velocity going from UCHC-like cells toward the DCHC-like cells but not vice versa. This result is consistent with the finding described in the paper that UCHC-like cells are the undifferentiated chromaffin cells and actively transitioning toward the differentiated state.

Interprete the velocities

In [18]:
scv.pl.velocity(adata, ['PCP4', 'TOP2A', 'HELLS'], color = 'celltype')

The first column contain so-called phase portrait plot displaying the spliced count versus unspliced count for each gene. To interprete this plot, please check https://user-images.githubusercontent.com/31883718/80227452-eb822480-864d-11ea-9399-56886c5e2785.gif. In generall, the black line in phase portrait corresponds to the estimated ‘steady-state’ ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line.

The second colum show the velocities embedded in t-SNE plot. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

The third column display gene expression after taking into account both spliced and unspliced counts.

In [19]:
scv.pl.scatter(adata, ['PCP4', 'TOP2A', 'HELLS'], color=['velocity'])

Identify important genes

It is improtant to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population.

The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.

In [20]:
scv.tl.rank_velocity_genes(adata, groupby='celltype', min_corr=.3)

df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head(n=10)
ranking velocity genes
    finished (0:00:06) --> added 
    'rank_velocity_genes', sorted scores by group ids (adata.uns) 
    'spearmans_score', spearmans correlation scores (adata.var)
Out[20]:
DCHC-like UCHC-like
0 RGS5 POLQ
1 GAP43 CDCA2
2 CYB561 SHCBP1
3 AKAP7 DNA2
4 IGFBPL1 KIF2C
5 ALCAM NUF2
6 NUCKS1 EXO1
7 ARRDC3 KIF15
8 CREB5 WDR62
9 RAD51AP1 MCM6

Velocities in cycling progenitors

In [21]:
scv.tl.score_genes_cell_cycle(adata)
scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])
scv.pl.velocity_embedding_stream(adata, basis='tsne', color='celltype', arrow_size=2)
calculating cell cycle phase
-->     'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)

The cell-cycle enrichment score shows that most of the UCHC-like cells are in the S phase, which are predicted to transit into the DCHC-like population. Part of the UCHC-like population has the self-cycling velocity vectors, and low S-core enriched

In [22]:
s_genes, g2m_genes = scv.utils.get_phase_marker_genes(adata)
s_genes = scv.get_df(adata[:, s_genes], 'spearmans_score', sort_values=True).index
g2m_genes = scv.get_df(adata[:, g2m_genes], 'spearmans_score', sort_values=True).index

kwargs = dict(frameon=False, ylabel='cell cycle genes')
scv.pl.scatter(adata, list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs)

The velocity values and gene expression values of two cell-cycle related genes, HELLS and TOP2A

In [23]:
scv.pl.velocity(adata, ['HELLS', 'TOP2A'], ncols=1, add_outline=True, color='celltype')

Speed and coherence

Two more useful stats: - The speed or rate of differentiation is given by the length of the velocity vector. - The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.

In [24]:
scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])
--> added 'velocity_length' (adata.obs)
--> added 'velocity_confidence' (adata.obs)
--> added 'velocity_confidence_transition' (adata.obs)
In [25]:
df = adata.obs.groupby('celltype')[keys].mean().T
df.style.background_gradient(cmap='coolwarm', axis=1)
Out[25]:
celltype DCHC-like UCHC-like
velocity_length 2.715776 3.382733
velocity_confidence 0.532254 0.706930

Velocity pseudotime

In [26]:
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
computing terminal states
    identified 1 region of root cells and 1 region of end points .
    finished (0:00:01) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)

Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells. It consistently predicts that the UCHC-like cells are the root and giving rise to DCHC-like cells.