Developed by Marius Lange and colleagues at Theis lab, CellRank is a tool aiming to identify cell state dynamics without predefined direction. The core concept behind Cellrank is Markov chain, in which each cell is assigned a state and the probability of transition to other cells is only based on the current cell. Since CellRank does not require prior knowledge of biological direction, it has many uses particularly where traditional trajectory inference is infeasible.

For versatility, the authors have modularized CellRank into kernels and estimators to fit different needs and is extensible (see image). The kernels compute transition matrix from various data types such as RNA velocity vector, cell similarity, pseudotime, etc. The estimators use that transition matrix to compute aggregated dynamics.

PPARG-low Tregs precuror data

We use expression data of PPARG-low Tregs precursor population from this publication by Chaoran Li and colleagues. The paper reports that lowly-expressed PPARgamma Tregs is indeed transcriptionally heterogeneous. And RNA velocity reveals that its subpopulations exhibit clear transition dynamics to so-called precusor states.

The expression data for this study can be aquired through GSE168605, which comprises of 2 samples: Spleen PPARg negative Tregs and Spleen PPARGg low Tregs. Tregs from these 2 were sorted selected from spleens of 6-8 weeks old Pparg-Tdt.Foxp3-Gfp mice and then prepared with Chromium Single Cell 3' v2 platform. We have processed the single-cell data with a standard scanpy pipeline like normalization and doing data integration using harmony for this notebook.

Load required packages

In [2]:
import scvelo as scv
import scanpy as sc
import cellrank as cr
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

#Settings
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo')
cr.settings.verbosity = 2

Read in the data

Download prepared scanpy object:

In [ ]:
import os

os.system('wget -O pparg.h5ad https://www.dropbox.com/s/0iw552yiwnlz9by/pparg.h5ad?dl=1')
In [3]:
adata = sc.read_h5ad('./pparg.h5ad')
adata
Out[3]:
AnnData object with n_obs × n_vars = 4881 × 11237
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'batch'
    var: 'gene_ids-0', 'feature_types-0', 'n_cells-0', 'mt-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'gene_ids-1', 'feature_types-1', 'n_cells-1', 'mt-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'batch_colors', 'hvg', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_pca_harmony', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

Next, we compute moments for velocity estimation. However, scv.pp.moments requires layer.spliced and layer.unspliced to impute the moments. We trick it by assign .X matrix to the 2 slots for it to able to run:

In [5]:
adata.layers['spliced'] = adata.X
adata.layers['unspliced'] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
computing neighbors
    finished (0:00:04) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:10) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

Run CytoTRACE kernel

We often use RNA velocity kernel to build transition matrix. However cell velocity requires splicing information of the data. In order to get the splicing data, we have to do intron-aware quantification from raw sequencing data and the raw might not be available or taking a long time to run. In that case, we have to use other kernels like Cell similarity or Pseudotime.

This notebook attempts to run CellRank with a pseudotime kernel, CytoTRACE Cellular Trajectory Reconstruction Analysis using gene Counts and Expression. CytoTRACE infers the relative differentiation state of cells without any prior knowledge. Unlike scVelo, it uses the gene expression profile of each cell to calculate the developmental potential.

Import the kernel and run:

In [6]:
from cellrank.tl.kernels import CytoTRACEKernel
ctk = CytoTRACEKernel(adata)
Computing CytoTRACE score with `11237` genes
Adding `adata.obs['ct_score']`
       `adata.obs['ct_pseudotime']`
       `adata.obs['ct_num_exp_genes']`
       `adata.var['ct_gene_corr']`
       `adata.var['ct_correlates']`
       `adata.uns['ct_params']`
    Finish (0:00:01)

We can see 6 slots are added to the adata object after running CytoTRACEKernel, including the CytoTRACE score ct_score and ct_pseudotime:

We visualize the ct_pseudotime on UMAP:

In [7]:
scv.pl.scatter(
    adata,
    c=['ct_pseudotime'],
    basis='X_umap',
    legend_loc='right',
    color_map='gnuplot2',
)

Cells that have scores close to 1 is immature-like and otherwise

Compute transition matrix

From the CytoTRACE kernel, we build the transition matrix with function compute_transition_matrix:

In [8]:
ctk.compute_transition_matrix()
Computing transition matrix based on `ct_pseudotime`
  0%|          | 0/4881 [00:00<?, ?cell/s]
    Finish (0:00:01)
Out[8]:
<CytoTRACEKernel>

Then we compute a projection of the transition matrix in the umap embedding, here is X_umap:

In [9]:
ctk.compute_projection(basis='X_umap')
Projecting transition matrix onto `X_umap`
Adding `adata.obsm['T_fwd_X_umap']`
    Finish (0:00:01)

We run leiden clustering of this dataset before annotating cell type:

In [10]:
sc.tl.leiden(adata, resolution=0.8)

Add slot T_fwd_params using data umap embeddings:

In [11]:
adata.uns['T_fwd_params'] = {'embeddings': ['umap']}
adata.obsm['T_fwd_umap'] = adata.obsm['T_fwd_X_umap']

Let's see how the cell transition vector stream embeded on the UMAP, with cells colored by the leiden clustering

In [36]:
scv.pl.velocity_embedding_stream(
    adata,
    color='leiden', 
    vkey='T_fwd',
    basis='X_umap',
    legend_loc='right', 
    recompute=False,
    density=2,
    size=20,
    alpha=1,
)

We use marker genes Pdcd1, Tcf7, Ccr7, Klrg1 to annotate cell states according to the paper:

(A) Violin plots of transcripts characteristically up-regulated as tissue-Treg precursors exit the spleen. Numbers refer to the percentage of cells in the designated cluster that express a given transcript. A.U., arbitrary units. B) Violin plots of transcripts encoding proteins that need to be down-regulated for tissue-Treg precursors to exit the spleen. Numbers refer to the percentage of cells in the designated cluster expressing a given transcript.

We view the expression of these genes in the umap embedding:

In [35]:
sc.pl.umap(
    adata,
    color = ['leiden', 'Pdcd1', 'Tcf7', 'Ccr7', 'Klrg1'],
    ncols=3,
    size=20,
    color_map='YlOrRd'
)

It is apparent that our leiden clustering has not separated cells into clusters too well, making cluster-based cell labelling impractical. Thus we decide to employ BioTuring Browser lasso tool to arbitrarilly select cells that match with marker genes expression. The result has been uploaded to dropbox for our convenience.

In [ ]:
os.system('wget -O pparg-treg-annotation-batchrename.tsv https://www.dropbox.com/s/96lv61ipgmym6m1/pparg-treg-annotation-batchrename.tsv?dl=1')
In [15]:
metadata = pd.read_csv('pparg-treg-annotation-batchrename.tsv', sep='\t')

We replace the slot containing cell state annotation (Tregs sub population) with that from the metadata:

In [16]:
adata.obs['Tregs sub population'] = metadata['Tregs sub population'].to_list()

View the aggregated cell states in the umap:

In [17]:
umap = sc.pl.umap(adata, color=['Tregs sub population'])

Initially batch in the andata has been annotated as number, we need to rename them accordingly to reflex the correct information:

In [18]:
adata.obs['batch'] = metadata['batch'].to_list()

Let's view the UMAP of batches together with the Tregs sub-populations:

In [49]:
sc.pl.umap(
    adata,
    color = ['batch', 'Tregs sub population'],
    size=20,
    wspace=0.4
)

Let's see the Tregs sub-population composition of two samples, PPARGneg and PPARGlow:

In [20]:
pct2 = (adata.obs.groupby(['batch','Tregs sub population']).size() / adata.obs.groupby(['batch']).size()).reset_index().rename({0:'percent'}, axis=1)
sns.barplot(x='batch', hue='Tregs sub population', y='percent', data=pct2)
plt.legend(loc='best')
plt.show()

PPARg negative has more resting Tregs than PPARg low's. On the contrary, PPARg low has more Precursor Y than PPARg negative's. The changes in cell states composition suggest a possible transformation from Treg resting state to Treg precussor states.

Following this thought, we view the cell transition plot between cell states, but we need to rename T_fwd_X_umap to T_fwd_umap first:

We use function scv.pl.velocity_embedding_stream to plot the transition vector streams on UMAP, and color cells by Treg sub-population:

In [34]:
scv.pl.velocity_embedding_stream(
    adata,
    color='Tregs sub population', 
    vkey='T_fwd',
    basis='X_umap',
    legend_loc='right', 
    recompute=False,
    density=2,
    size=20,
    alpha=1,
)

We also visualize the transition streams embeded on the expression plot of each marker gene that were used to define cell states:

In [37]:
scv.pl.velocity_embedding_stream(
    adata,
    color=['Pdcd1', 'Tcf7', 'Ccr7', 'Klrg1', 'Tregs sub population'], 
    vkey='T_fwd',
    basis='X_umap',
    legend_loc='right',
    recompute=False,
    ncols=3,
    size=20,
    alpha=1,
    color_map='YlOrRd'
)

Inititate GPCCA estimator

GPCCA or Generalized Perron Cluster Cluster Analysis is an estimator, the second component of CellRank workflow. We use it to compute the macrostates for this dataset. First we call up GPCCA and compute in the forward direction:

In [38]:
from cellrank.tl.estimators import GPCCA
g_fwd = GPCCA(ctk)

Subsequently we compute the Schur decompostion that will be used to define macrostates. Please refer to Cellrank publication and the GPCCA publication to know more about this procedure.

In [39]:
g_fwd.compute_schur(n_components=20)
g_fwd.plot_spectrum(real_only=True)
Computing Schur decomposition
When computing macrostates, choose a number of states NOT in `[4, 15]`
Adding `adata.uns['eigendecomposition_fwd']`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:38)

From the expression of marker genes like Pdcd1, Tcf7, Klrg1, CCr7, we assume that there are possibly 4 macrostates in our data. So we use the function compute_macrostates to find those 4 macrostates:

In [40]:
g_fwd.compute_macrostates(n_states=4, cluster_key='Tregs sub population')
Computing `5` macrostates
Adding `.macrostates`
       `.macrostates_memberships`
       `.coarse_T`
       `.coarse_initial_distribution
       `.coarse_stationary_distribution`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:03)

Visualize the macrostate findings:

In [41]:
g_fwd.plot_macrostates(
    discrete=True,
    legend_loc='right',
    size=20,
    basis='X_umap'
)

The macrostates might be legitimates as they locate at the edges, together with the fact that these 4 macrostates are caculated forward. We manually set them as the terminal states using set_terminal_states_from_macrostates:

In [42]:
g_fwd.set_terminal_states_from_macrostates()
Adding `adata.obs['terminal_states']`
       `adata.obs['terminal_states_probs']`
       `.terminal_states`
       `.terminal_states_probabilities`
       `.terminal_states_memberships
    Finish`

Plot to confirm the assigment these terminal states:

In [43]:
sc.pl.umap(adata, color=['terminal_states'])

Compute cell fate probability

Having define 4 terminal states, we can compute the transition probability of cells lead to these locations. We run compute_absorption_probabilities on the GPCCA output:

In [44]:
g_fwd.compute_absorption_probabilities()
Computing absorption probabilities
Defaulting to `'gmres'` solver.
  0%|          | 0/5 [00:00<?, ?/s]
Adding `adata.obsm['to_terminal_states']`
       `.absorption_probabilities`
    Finish (0:00:00)

We visualize the absorption probabilities on the UMAP:

In [52]:
g_fwd.plot_absorption_probabilities(
    same_plot=False,
    basis='X_umap',
    size=20,
    ncols=3,
    cmap='YlOrRd'
)

Another useful ploting implementation in cellrank is PAGA or graph abstraction reconciles clustering. We can use it to aggregate all fate maps into just one fate map to have a complete view of cell fates. However, for PAGA we need to specify initial state, for that we would need to labouriously compute from the fastq files. Thus it is worth a separate notebook to this.

Nevertheless, this notebook demonstrates that CellRank with its CytoTRACE kernel is well able to impute the transition direction of individual cells just from the gene expression.