Developed by Marius Lange and colleagues at Theis lab, CellRank is a tool aiming to identify cell state dynamics without predefined direction. The core concept behind Cellrank is Markov chain, in which each cell is assigned a state and the probability of transition to other cells is only based on the current cell. Since CellRank does not require prior knowledge of biological direction, it has many uses particularly where traditional trajectory inference is infeasible.
For versatility, the authors have modularized CellRank into kernels and estimators to fit different needs and is extensible (see image). The kernels compute transition matrix from various data types such as RNA velocity vector, cell similarity, pseudotime, etc. The estimators use that transition matrix to compute aggregated dynamics.
We use expression data of PPARG-low Tregs precursor population from this publication by Chaoran Li and colleagues. The paper reports that lowly-expressed PPARgamma Tregs is indeed transcriptionally heterogeneous. And RNA velocity reveals that its subpopulations exhibit clear transition dynamics to so-called precusor states.
The expression data for this study can be aquired through GSE168605, which comprises of 2 samples: Spleen PPARg negative Tregs
and Spleen PPARGg low Tregs
. Tregs from these 2 were sorted selected from spleens of 6-8 weeks old Pparg-Tdt.Foxp3-Gfp mice and then prepared with Chromium Single Cell 3' v2 platform. We have processed the single-cell data with a standard scanpy pipeline like normalization and doing data integration using harmony for this notebook.
import scvelo as scv
import scanpy as sc
import cellrank as cr
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
#Settings
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo')
cr.settings.verbosity = 2
Download prepared scanpy object:
import os
os.system('wget -O pparg.h5ad https://www.dropbox.com/s/0iw552yiwnlz9by/pparg.h5ad?dl=1')
adata = sc.read_h5ad('./pparg.h5ad')
adata
Next, we compute moments for velocity estimation. However, scv.pp.moments
requires layer.spliced
and layer.unspliced
to impute the moments. We trick it by assign .X
matrix to the 2 slots for it to able to run:
adata.layers['spliced'] = adata.X
adata.layers['unspliced'] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
We often use RNA velocity kernel to build transition matrix. However cell velocity requires splicing information of the data. In order to get the splicing data, we have to do intron-aware quantification from raw sequencing data and the raw might not be available or taking a long time to run. In that case, we have to use other kernels like Cell similarity or Pseudotime.
This notebook attempts to run CellRank with a pseudotime kernel, CytoTRACE
Cellular Trajectory Reconstruction Analysis using gene Counts and Expression. CytoTRACE infers the relative differentiation state of cells without any prior knowledge. Unlike scVelo, it uses the gene expression profile of each cell to calculate the developmental potential.
Import the kernel and run:
from cellrank.tl.kernels import CytoTRACEKernel
ctk = CytoTRACEKernel(adata)
We can see 6 slots are added to the adata object after running CytoTRACEKernel
, including the CytoTRACE score ct_score
and ct_pseudotime
:
We visualize the ct_pseudotime
on UMAP:
scv.pl.scatter(
adata,
c=['ct_pseudotime'],
basis='X_umap',
legend_loc='right',
color_map='gnuplot2',
)
Cells that have scores close to 1
is immature-like and otherwise
From the CytoTRACE kernel, we build the transition matrix with function compute_transition_matrix
:
ctk.compute_transition_matrix()
Then we compute a projection of the transition matrix in the umap embedding, here is X_umap
:
ctk.compute_projection(basis='X_umap')
We run leiden clustering of this dataset before annotating cell type:
sc.tl.leiden(adata, resolution=0.8)
Add slot T_fwd_params
using data umap embeddings:
adata.uns['T_fwd_params'] = {'embeddings': ['umap']}
adata.obsm['T_fwd_umap'] = adata.obsm['T_fwd_X_umap']
Let's see how the cell transition vector stream embeded on the UMAP, with cells colored by the leiden clustering
scv.pl.velocity_embedding_stream(
adata,
color='leiden',
vkey='T_fwd',
basis='X_umap',
legend_loc='right',
recompute=False,
density=2,
size=20,
alpha=1,
)
We use marker genes Pdcd1
, Tcf7
, Ccr7
, Klrg1
to annotate cell states according to the paper:
(A) Violin plots of transcripts characteristically up-regulated as tissue-Treg precursors exit the spleen. Numbers refer to the percentage of cells in the designated cluster that express a given transcript. A.U., arbitrary units. B) Violin plots of transcripts encoding proteins that need to be down-regulated for tissue-Treg precursors to exit the spleen. Numbers refer to the percentage of cells in the designated cluster expressing a given transcript.
We view the expression of these genes in the umap embedding:
sc.pl.umap(
adata,
color = ['leiden', 'Pdcd1', 'Tcf7', 'Ccr7', 'Klrg1'],
ncols=3,
size=20,
color_map='YlOrRd'
)
It is apparent that our leiden clustering has not separated cells into clusters too well, making cluster-based cell labelling impractical. Thus we decide to employ BioTuring Browser lasso tool to arbitrarilly select cells that match with marker genes expression. The result has been uploaded to dropbox for our convenience.
os.system('wget -O pparg-treg-annotation-batchrename.tsv https://www.dropbox.com/s/96lv61ipgmym6m1/pparg-treg-annotation-batchrename.tsv?dl=1')
metadata = pd.read_csv('pparg-treg-annotation-batchrename.tsv', sep='\t')
We replace the slot containing cell state annotation (Tregs sub population
) with that from the metadata:
adata.obs['Tregs sub population'] = metadata['Tregs sub population'].to_list()
View the aggregated cell states in the umap:
umap = sc.pl.umap(adata, color=['Tregs sub population'])
Initially batch in the andata has been annotated as number, we need to rename them accordingly to reflex the correct information:
adata.obs['batch'] = metadata['batch'].to_list()
Let's view the UMAP of batches together with the Tregs sub-populations:
sc.pl.umap(
adata,
color = ['batch', 'Tregs sub population'],
size=20,
wspace=0.4
)
Let's see the Tregs sub-population composition of two samples, PPARGneg and PPARGlow:
pct2 = (adata.obs.groupby(['batch','Tregs sub population']).size() / adata.obs.groupby(['batch']).size()).reset_index().rename({0:'percent'}, axis=1)
sns.barplot(x='batch', hue='Tregs sub population', y='percent', data=pct2)
plt.legend(loc='best')
plt.show()
PPARg negative
has more resting Tregs than PPARg low
's. On the contrary, PPARg low
has more Precursor Y than PPARg negative
's. The changes in cell states composition suggest a possible transformation from Treg resting state to Treg precussor states.
Following this thought, we view the cell transition plot between cell states, but we need to rename T_fwd_X_umap
to T_fwd_umap
first:
We use function scv.pl.velocity_embedding_stream
to plot the transition vector streams on UMAP, and color cells by Treg sub-population:
scv.pl.velocity_embedding_stream(
adata,
color='Tregs sub population',
vkey='T_fwd',
basis='X_umap',
legend_loc='right',
recompute=False,
density=2,
size=20,
alpha=1,
)
We also visualize the transition streams embeded on the expression plot of each marker gene that were used to define cell states:
scv.pl.velocity_embedding_stream(
adata,
color=['Pdcd1', 'Tcf7', 'Ccr7', 'Klrg1', 'Tregs sub population'],
vkey='T_fwd',
basis='X_umap',
legend_loc='right',
recompute=False,
ncols=3,
size=20,
alpha=1,
color_map='YlOrRd'
)
GPCCA or Generalized Perron Cluster Cluster Analysis is an estimator, the second component of CellRank workflow. We use it to compute the macrostates for this dataset. First we call up GPCCA and compute in the forward direction:
from cellrank.tl.estimators import GPCCA
g_fwd = GPCCA(ctk)
Subsequently we compute the Schur decompostion that will be used to define macrostates. Please refer to Cellrank publication and the GPCCA publication to know more about this procedure.
g_fwd.compute_schur(n_components=20)
g_fwd.plot_spectrum(real_only=True)
From the expression of marker genes like Pdcd1
, Tcf7
, Klrg1
, CCr7
, we assume that there are possibly 4 macrostates in our data. So we use the function compute_macrostates
to find those 4 macrostates:
g_fwd.compute_macrostates(n_states=4, cluster_key='Tregs sub population')
Visualize the macrostate findings:
g_fwd.plot_macrostates(
discrete=True,
legend_loc='right',
size=20,
basis='X_umap'
)
The macrostates might be legitimates as they locate at the edges, together with the fact that these 4 macrostates are caculated forward. We manually set them as the terminal states using set_terminal_states_from_macrostates
:
g_fwd.set_terminal_states_from_macrostates()
Plot to confirm the assigment these terminal states:
sc.pl.umap(adata, color=['terminal_states'])
Having define 4 terminal states, we can compute the transition probability of cells lead to these locations. We run compute_absorption_probabilities
on the GPCCA output:
g_fwd.compute_absorption_probabilities()
We visualize the absorption probabilities on the UMAP:
g_fwd.plot_absorption_probabilities(
same_plot=False,
basis='X_umap',
size=20,
ncols=3,
cmap='YlOrRd'
)
Another useful ploting implementation in cellrank is PAGA or graph abstraction reconciles clustering. We can use it to aggregate all fate maps into just one fate map to have a complete view of cell fates. However, for PAGA we need to specify initial state, for that we would need to labouriously compute from the fastq files. Thus it is worth a separate notebook to this.
Nevertheless, this notebook demonstrates that CellRank with its CytoTRACE kernel is well able to impute the transition direction of individual cells just from the gene expression.