This notebook shows how to do gene regulatory analysis for scRNA-Seq dataset using SCENIC. The example data used in this notebook is the PBMC4k dataset downloaded from 10xGenomics

SCENIC package allows users to characterize the single-cell gene regulatory network interference and cluster cell by set of regulons. Below is the main steps in SCENIC workflow:

  1. Building the gene regulatory network (GRN):

    a. Identify potential targets for each TF based on co-expression.

    • Filtering the expression matrix and running GENIE3/GRNBoost.
    • Formatting the targets from GENIE3/GRNBoost into co-expression modules.

    b. Select potential direct-binding targets (regulons) based on DNA-motif analysis (RcisTarget: TF motif analysis)

  1. Identify cell states and their regulators:

    c. Analyzing the network activity in each individual cell (AUCell)

    • Scoring regulons in the cells (calculate AUC)
    • Optional: Convert the network activity into ON/OFF (binary activity matrix)

    d. Identify stable cell states based on their gene regulatory network activity (cell clustering) and exploring the results.

Load required packages

In [41]:
import os, sys, glob, pickle
import operator as op

import pandas as pd
import seaborn as sns
import numpy as np
import scanpy as sc
import loompy as lp
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context


from pyscenic.export import add_scenic_metadata
from pyscenic.utils import load_motifs
from pyscenic.transform import df2regulons
from pyscenic.binarization import binarize
from pyscenic.rss import regulon_specificity_scores
from pyscenic.plotting import plot_binarization, plot_rss

from IPython.display import HTML, display

Set some settings for Scanpy about verbosity and the figure size

In [4]:
sc.set_figure_params(dpi=150, fontsize=10, dpi_save=600)
sc.settings.njobs = 16

Data preprocessing

Get the expression data

In [ ]:
os.system('wget https://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_filtered_gene_bc_matrices.tar.gz')
In [ ]:
os.system('tar -xvzf pbmc4k_filtered_gene_bc_matrices.tar.gz')
In [10]:
adata = sc.read_10x_mtx(
    './filtered_gene_bc_matrices/GRCh38/' ,  
    var_names='gene_symbols',
) 

Filter low-quality cells

In [11]:
# simply compute the number of genes per cell (computers 'n_genes' column)
sc.pp.filter_cells(adata, min_genes=0)

# for each cell compute fraction of counts in mito genes versus all genes
mito_genes = adata.var_names.str.startswith('MT-')
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

# initial cuts
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['n_genes'] < 5000, :]
adata = adata[adata.obs['percent_mito'] < 0.15, :]
In [12]:
adata
Out[12]:
View of AnnData object with n_obs × n_vars = 4335 × 33694
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids'

Basic pre-processing

In [13]:
# save a copy of the raw data
adata.raw = adata

# Total-count normalize (library-size correct) to 10,000 reads/cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

# log transform the data.
sc.pp.log1p(adata)

# identify highly variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# keep only highly variable genes:
adata = adata[:, adata.var['highly_variable']]

# regress out total counts per cell and the percentage of mitochondrial genes expressed
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])

# scale each gene to unit variance, clip values exceeding SD 10.
sc.pp.scale(adata, max_value=10)
In [14]:
adata
Out[14]:
AnnData object with n_obs × n_vars = 4335 × 1775
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg'

We write the basic filtered expression matrix to a loom file. This will be used in the command-line pySCENIC steps

In [15]:
# create basic row and column attributes for the loom file:
row_attrs = {
    'Gene': np.array(adata.var_names),
}
col_attrs = {
    'CellID': np.array(adata.obs_names),
    'nGene': np.array(np.sum(adata.X.transpose() > 0, axis=0)).flatten(),
    'nUMI': np.array(np.sum(adata.X.transpose(), axis=0)).flatten(),
}
lp.create('./pbmc4k.loom', adata.X.transpose(), row_attrs, col_attrs)

Perform dimensionality reduction and clustering

In [ ]:
sc.tl.pca(adata, svd_solver='arpack')

# neighborhood graph of cells (determine optimal number of PCs here)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)

# compute tSNE
sc.tl.tsne(adata)

# cluster the neighbourhood graph
sc.tl.louvain(adata, resolution=0.4)
In [36]:
sc.pl.tsne(adata, color=['louvain'])

Cell type labeling

In [30]:
with rc_context({'figure.figsize': (2, 2)}):
    sc.pl.tsne( adata, color=[
        'IL7R', 'CCR7', 'CD14', 'LYZ',  'S100A4', 'MS4A1', 'CD8A',
        'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP'
    ], ncols=3, color_map='YlOrRd', alpha=1, size=10)