This notebook shows how to perform gene regulatory network analysis on an scRNA-Seq dataset using SCENIC. The example data is the PBMC4k dataset downloaded from 10x Genomics.

The SCENIC package lets users infer single-cell gene regulatory networks and cluster cells by sets of regulons. The main steps in the SCENIC workflow are:

  1. Building the gene regulatory network (GRN):

    a. Identify potential targets for each TF based on co-expression.

    • Filtering the expression matrix and running GENIE3/GRNBoost.
    • Formatting the targets from GENIE3/GRNBoost into co-expression modules.

    b. Select potential direct-binding targets (regulons) based on DNA-motif analysis (RcisTarget: TF motif analysis)

  2. Identify cell states and their regulators:

    c. Analyzing the network activity in each individual cell (AUCell)

    • Scoring regulons in the cells (calculate AUC)
    • Optional: Convert the network activity into ON/OFF (binary activity matrix)

    d. Identify stable cell states based on their gene regulatory network activity (cell clustering) and exploring the results.

Load required packages

In [41]:
import os, sys, glob, pickle
import operator as op

import pandas as pd
import seaborn as sns
import numpy as np
import scanpy as sc
import loompy as lp
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context


from pyscenic.export import add_scenic_metadata
from pyscenic.utils import load_motifs
from pyscenic.transform import df2regulons
from pyscenic.binarization import binarize
from pyscenic.rss import regulon_specificity_scores
from pyscenic.plotting import plot_binarization, plot_rss

from IPython.display import HTML, display

Configure Scanpy's figure settings and the number of parallel jobs

In [4]:
sc.set_figure_params(dpi=150, fontsize=10, dpi_save=600)
# the Scanpy setting is named n_jobs; assigning to 'njobs' has no effect
sc.settings.n_jobs = 16

Data preprocessing

Get the expression data

In [ ]:
os.system('wget https://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_filtered_gene_bc_matrices.tar.gz')
In [ ]:
os.system('tar -xvzf pbmc4k_filtered_gene_bc_matrices.tar.gz')
In [10]:
adata = sc.read_10x_mtx(
    './filtered_gene_bc_matrices/GRCh38/' ,  
    var_names='gene_symbols',
) 

Filter low-quality cells

In [11]:
# simply compute the number of genes per cell (computes the 'n_genes' column)
sc.pp.filter_cells(adata, min_genes=0)

# for each cell compute fraction of counts in mito genes versus all genes
mito_genes = adata.var_names.str.startswith('MT-')
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

# initial cuts
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['n_genes'] < 5000, :]
adata = adata[adata.obs['percent_mito'] < 0.15, :]
In [12]:
adata
Out[12]:
View of AnnData object with n_obs × n_vars = 4335 × 33694
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids'
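
The mitochondrial-fraction filter above can be checked by hand on a tiny matrix. The following is a minimal sketch, assuming a made-up 3 × 4 sparse count matrix and made-up gene names; only the 'MT-' prefix rule and the 0.15 cutoff come from the cell above.

```python
import numpy as np
from scipy import sparse

# Toy counts: 3 cells x 4 genes (values are made up for illustration).
gene_names = np.array(['MT-CO1', 'CD3E', 'MT-ND1', 'LYZ'])
counts = sparse.csr_matrix(np.array([
    [1, 8, 1, 0],   # 2 of 10 counts are mitochondrial
    [5, 3, 0, 2],   # 5 of 10 counts are mitochondrial
    [0, 4, 0, 6],   # 0 of 10 counts are mitochondrial
]))

# Same rule as above: mitochondrial genes are those whose name starts with 'MT-'.
mito_idx = np.flatnonzero(np.char.startswith(gene_names, 'MT-'))

# Per-cell fraction of counts that fall in mitochondrial genes.
mito_counts = np.asarray(counts[:, mito_idx].sum(axis=1)).ravel()
total_counts = np.asarray(counts.sum(axis=1)).ravel()
percent_mito = mito_counts / total_counts

# Cells passing the same cutoff as in the notebook.
keep = percent_mito < 0.15
print(percent_mito)
print(keep)
```

Only the third toy cell passes the cutoff; the first two have 20% and 50% of their counts in mitochondrial genes.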

Basic pre-processing

In [13]:
# save a copy of the raw data
adata.raw = adata

# Total-count normalize (library-size correct) to 10,000 reads/cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

# log transform the data.
sc.pp.log1p(adata)

# identify highly variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# keep only highly variable genes:
adata = adata[:, adata.var['highly_variable']]

# regress out total counts per cell and the percentage of mitochondrial genes expressed
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])

# scale each gene to unit variance, clip values exceeding SD 10.
sc.pp.scale(adata, max_value=10)
In [14]:
adata
Out[14]:
AnnData object with n_obs × n_vars = 4335 × 1775
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg'
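
To make the normalization step concrete, here is a minimal NumPy sketch of total-count normalization followed by a log1p transform. The toy matrix is made up, and this is an illustration of the arithmetic rather than Scanpy's implementation.

```python
import numpy as np

# Toy counts: 2 cells x 3 genes. Both cells have the same gene proportions
# but different sequencing depth (made-up values).
counts = np.array([
    [2.0, 6.0, 2.0],      # 10 counts in total
    [10.0, 30.0, 10.0],   # 50 counts in total
])
target_sum = 1e4  # same target as counts_per_cell_after above

# Scale every cell so that its counts sum to target_sum.
size_factors = counts.sum(axis=1, keepdims=True)
normalized = counts / size_factors * target_sum

# Natural log of (1 + x), which is what a log1p transform computes.
logged = np.log1p(normalized)
print(normalized)
print(logged)
```

After normalization the two cells are identical, which is the point of the library-size correction: differences in depth no longer look like differences in expression.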

We write the expression matrix as it stands after the pre-processing above (restricted to the highly variable genes) to a loom file. This file is the input for the command-line pySCENIC steps.

In [15]:
# create basic row and column attributes for the loom file:
row_attrs = {
    'Gene': np.array(adata.var_names),
}
col_attrs = {
    'CellID': np.array(adata.obs_names),
    'nGene': np.array(np.sum(adata.X.transpose() > 0, axis=0)).flatten(),
    'nUMI': np.array(np.sum(adata.X.transpose(), axis=0)).flatten(),
}
lp.create('./pbmc4k.loom', adata.X.transpose(), row_attrs, col_attrs)

Perform dimensionality reduction and clustering

In [ ]:
sc.tl.pca(adata, svd_solver='arpack')

# neighborhood graph of cells (determine optimal number of PCs here)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)

# compute tSNE
sc.tl.tsne(adata)

# cluster the neighbourhood graph
sc.tl.louvain(adata, resolution=0.4)
In [36]:
sc.pl.tsne(adata, color=['louvain'])

Cell type labeling

In [30]:
with rc_context({'figure.figsize': (2, 2)}):
    sc.pl.tsne( adata, color=[
        'IL7R', 'CCR7', 'CD14', 'LYZ',  'S100A4', 'MS4A1', 'CD8A',
        'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP'
    ], ncols=3, color_map='YlOrRd', alpha=1, size=10)

We follow the Seurat pbmc3k and Scanpy pbmc3k clustering tutorials to label the cell types in this dataset.

In [31]:
adata.obs['celltype'] = adata.obs['louvain']
new_cluster_names = [
    'CD4 T', 'CD14 Monocytes', 'B', 'NK1', 'CD8 T',
    'Naive T cell', 'NK2','Dendritic', 'FCGR3A Monocytes', 'Unknown'
]
adata.rename_categories('celltype', new_cluster_names)
In [37]:
sc.pl.tsne(adata, color=['celltype'])

SCENIC analysis

STEP 1: Gene regulatory network inference, and generation of co-expression modules

Phase Ia: GRN inference using the GRNBoost2 algorithm

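For this step we use the CLI version of pySCENIC, with the loom file we wrote earlier as input. pySCENIC is meant to be run on the counts matrix (without log transformation or further processing); note that the loom file above was written after the basic pre-processing cell, so it holds the processed matrix restricted to the highly variable genes.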

Output: List of adjacencies between a TF and its targets, stored in adj.csv.

Download the list of human transcription factors

In [ ]:
os.system('wget https://raw.githubusercontent.com/aertslab/pySCENIC/master/resources/lambert2018.txt')

Let's load all transcription factors in the list and see if they are in the highly-variable gene list

In [39]:
tf = pd.read_csv('lambert2018.txt', header=None)
tf
Out[39]:
0
0 TFAP2A
1 TFAP2B
2 TFAP2C
3 TFAP2D
4 TFAP2E
... ...
1634 THAP5
1635 THAP6
1636 THAP7
1637 THAP8
1638 THAP9

1639 rows × 1 columns

In [40]:
hvg = np.array(adata.var.index)
expressed_tfs = []
for t in np.array(tf[0]):
    if t in hvg:
        expressed_tfs.append(t)
print('Number of highly variable transcription factors', len(expressed_tfs))
Number of highly variable transcription factors 149
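
The loop above can also be written as a vectorized membership test. This is a minimal sketch on stand-in data; the short name lists are made up, and `tf_names` / `hvg_names` are hypothetical variables playing the roles of `tf[0]` and `adata.var.index`.

```python
import pandas as pd

# Stand-ins for the TF list and the highly variable gene names (made-up subsets).
tf_names = pd.Series(['TFAP2A', 'JUN', 'TBX21', 'THAP9'])
hvg_names = pd.Index(['DUSP1', 'JUN', 'NKG7', 'TBX21'])

# Keep the TFs that are also highly variable genes, preserving the TF list order.
expressed_tfs_toy = tf_names[tf_names.isin(hvg_names)].tolist()
print('Number of highly variable transcription factors', len(expressed_tfs_toy))
```

On the real data the equivalent call would be `tf[0][tf[0].isin(adata.var.index)].tolist()`, which should give the same list as the loop.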

Save list of highly variable transcription factors

In [42]:
with open('hv_tfs_pbmc4k.txt', 'w') as f:
    f.write('\n'.join(expressed_tfs))

Add the directory of the current Python executable to PATH so that the pyscenic CLI can be found

In [6]:
os.environ['PATH'] = os.path.dirname(sys.executable) + ':' + os.environ['PATH']
In [299]:
os.system('pyscenic grn ./pbmc4k.loom ./hv_tfs_pbmc4k.txt -o adj.csv --num_workers 14')
2021-12-05 17:56:59,387 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2021-12-05 17:56:59,591 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
preparing dask client
parsing input
creating dask graph
14 partitions
computing dask graph
not shutting down client, client was created externally
finished

2021-12-05 18:02:55,595 - pyscenic.cli.pyscenic - INFO - Writing results to file.

Read in the adjacencies matrix:

In [2]:
adjacencies = pd.read_csv('adj.csv', index_col=False, sep=',')
adjacencies.head()
Out[2]:
TF target importance
0 JUN DUSP1 172.803409
1 TSC22D1 ACRBP 130.713070
2 TSC22D1 CTTN 126.471167
3 TBX21 FGFBP2 124.856615
4 TBX21 NKG7 122.912053
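
A common first look at the adjacency table is the strongest target of each TF. The sketch below rebuilds the five rows shown above as a small dataframe (`adj_toy` is a hypothetical name) and picks the highest-importance target per TF; on the real data the same chain could be applied to `adjacencies`.

```python
import pandas as pd

# The five rows printed by adjacencies.head() above.
adj_toy = pd.DataFrame({
    'TF': ['JUN', 'TSC22D1', 'TSC22D1', 'TBX21', 'TBX21'],
    'target': ['DUSP1', 'ACRBP', 'CTTN', 'FGFBP2', 'NKG7'],
    'importance': [172.803409, 130.713070, 126.471167, 124.856615, 122.912053],
})

# Sort by importance, then keep the first (strongest) row of every TF.
top_per_tf = (
    adj_toy.sort_values('importance', ascending=False)
           .groupby('TF', sort=False)
           .head(1)
           .reset_index(drop=True)
)
print(top_per_tf)
```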

STEP 2-3: Regulon prediction aka cisTarget from CLI

For this step the CLI version of SCENIC is used. This step can be deployed on a High Performance Computing system.

Output: List of enriched motifs, with their target genes, for each TF module, stored in reg.csv.

Locations of the ranking databases and motif annotations:

Download the motif annotation table, motifs-v9-nr.hgnc-m0.001-o0.0.tbl, and the ranking database for human genes, hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather

In [ ]:
os.system('wget -O hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather https://www.dropbox.com/s/drpobsb3ipb98h5/hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather?dl=1')
In [ ]:
os.system('wget -O motifs-v9-nr.hgnc-m0.001-o0.0.tbl https://www.dropbox.com/s/ejqytwyka3ixcir/motifs-v9-nr.hgnc-m0.001-o0.0.tbl?dl=1')
In [375]:
import glob
# ranking databases
f_db_glob = '*feather'
f_db_names = ' '.join( glob.glob(f_db_glob) )
In [376]:
f_db_names
Out[376]:
'hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather'

Run the Regulon prediction command

Here, we use the --mask_dropouts option, which affects how the correlation between TF and target genes is calculated during module creation. It is important to note that prior to pySCENIC v0.9.18, the default behavior was to mask dropouts, while in v0.9.18 and later, the correlation is performed using the entire set of cells (including those with zero expression). When using the modules_from_adjacencies function directly in python instead of via the command line, the rho_mask_dropouts option can be used to control this.
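
The effect of masking dropouts can be seen on a toy TF/target pair. This is a minimal sketch of the idea, not pySCENIC's implementation: it assumes that "masking" means dropping every cell in which either gene has zero expression before computing the Pearson correlation, and all expression values are made up.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Made-up expression of one TF and one target gene across six cells.
tf_expr = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0])
target_expr = np.array([0.0, 5.0, 1.0, 2.0, 3.0, 4.0])

# Without masking: all cells contribute, including those with zeros.
rho_all = pearson(tf_expr, target_expr)

# With masking (assumption of this sketch): keep only cells where both genes are non-zero.
nonzero = (tf_expr > 0) & (target_expr > 0)
rho_masked = pearson(tf_expr[nonzero], target_expr[nonzero])

print(rho_all, rho_masked)
```

In this toy case the masked correlation is perfect while the unmasked one is slightly negative, which illustrates why the two settings can assign a TF-target pair to different (activating or repressing) modules.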

In [377]:
os.system('pyscenic ctx adj.csv {} --annotations_fname ./motifs-v9-nr.hgnc-m0.001-o0.0.tbl --expression_mtx_fname ./pbmc4k.loom --output reg.csv --mask_dropouts --num_workers 20'.format(f_db_names))
2021-12-06 07:13:12,428 - pyscenic.cli.pyscenic - INFO - Creating modules.

2021-12-06 07:13:12,541 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2021-12-06 07:13:12,893 - pyscenic.utils - INFO - Calculating Pearson correlations.

2021-12-06 07:13:12,915 - pyscenic.utils - WARNING - Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC verion 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R verision of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI.
	Dropout masking is currently set to [True].

2021-12-06 07:13:13,931 - pyscenic.utils - INFO - Creating modules.

2021-12-06 07:13:20,822 - pyscenic.cli.pyscenic - INFO - Loading databases.

2021-12-06 07:13:20,822 - pyscenic.cli.pyscenic - INFO - Calculating regulons.
[                                        ] | 0% Completed | 15.1s
2021-12-06 07:13:36,619 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF527 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 15.4s
2021-12-06 07:13:36,979 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF562 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 15.5s
2021-12-06 07:13:37,043 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF570 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 16.2s
2021-12-06 07:13:37,794 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF718 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 16.3s
2021-12-06 07:13:37,861 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF778 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 16.5s
2021-12-06 07:13:38,080 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF860 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed | 16.6s
2021-12-06 07:13:38,145 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF891 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  1min 37.5s
2021-12-06 07:14:59,064 - pyscenic.transform - WARNING - Less than 80% of the genes in JRK could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  3min  2.2s
2021-12-06 07:16:23,779 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF527 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  3min 14.2s
2021-12-06 07:16:35,818 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for CC2D1A could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  3min 25.6s
2021-12-06 07:16:47,277 - pyscenic.transform - WARNING - Less than 80% of the genes in ZBTB3 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  3min 28.1s
2021-12-06 07:16:49,715 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF570 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[                                        ] | 0% Completed |  3min 29.6s
2021-12-06 07:16:51,142 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF891 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  5min 43.3s
2021-12-06 07:19:04,921 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF528 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  6min 57.4s
2021-12-06 07:20:18,936 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for JRK could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  8min 43.5s
2021-12-06 07:22:05,135 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZBTB3 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  8min 45.2s
2021-12-06 07:22:06,871 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF570 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  8min 46.3s
2021-12-06 07:22:07,891 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF860 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2021-12-06 07:22:07,957 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF891 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  8min 46.4s
2021-12-06 07:22:08,013 - pyscenic.transform - WARNING - Less than 80% of the genes in AEBP1 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  8min 46.6s
2021-12-06 07:22:08,225 - pyscenic.transform - WARNING - Less than 80% of the genes in ARID5B could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  9min  1.3s
2021-12-06 07:22:22,845 - pyscenic.transform - WARNING - Less than 80% of the genes in CC2D1A could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  9min 21.1s
2021-12-06 07:22:42,674 - pyscenic.transform - WARNING - Less than 80% of the genes in ETV7 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  9min 30.8s
2021-12-06 07:22:52,430 - pyscenic.transform - WARNING - Less than 80% of the genes in HES1 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  9min 46.0s
2021-12-06 07:23:07,552 - pyscenic.transform - WARNING - Less than 80% of the genes in JRK could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed |  9min 48.4s
2021-12-06 07:23:09,951 - pyscenic.transform - WARNING - Less than 80% of the genes in JUN could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 10min 25.2s
2021-12-06 07:23:46,803 - pyscenic.transform - WARNING - Less than 80% of the genes in PAX5 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 23.1s
2021-12-06 07:24:44,748 - pyscenic.transform - WARNING - Less than 80% of the genes in THAP6 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 23.4s
2021-12-06 07:24:44,992 - pyscenic.transform - WARNING - Less than 80% of the genes in TSHZ2 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 23.8s
2021-12-06 07:24:45,386 - pyscenic.transform - WARNING - Less than 80% of the genes in ZBTB8A could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.1s
2021-12-06 07:24:50,696 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF490 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.3s
2021-12-06 07:24:50,884 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF527 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.4s
2021-12-06 07:24:51,062 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF558 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.5s
2021-12-06 07:24:51,128 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF562 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.6s
2021-12-06 07:24:51,187 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF570 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.8s
2021-12-06 07:24:51,387 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF608 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2021-12-06 07:24:51,453 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF683 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 29.9s
2021-12-06 07:24:51,512 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF684 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 30.1s
2021-12-06 07:24:51,705 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF778 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 11min 30.3s
2021-12-06 07:24:51,886 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF860 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2021-12-06 07:24:51,944 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF891 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 12min  1.4s
2021-12-06 07:25:23,009 - pyscenic.transform - WARNING - Less than 80% of the genes in ETV2 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 12min 26.8s
2021-12-06 07:25:48,314 - pyscenic.transform - WARNING - Less than 80% of the genes in JRK could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 12min 51.4s
2021-12-06 07:26:12,982 - pyscenic.transform - WARNING - Less than 80% of the genes in MEF2A could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 14min  1.5s
2021-12-06 07:27:23,044 - pyscenic.transform - WARNING - Less than 80% of the genes in THAP6 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 14min  2.5s
2021-12-06 07:27:24,036 - pyscenic.transform - WARNING - Less than 80% of the genes in ZBTB8A could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 14min  3.2s
2021-12-06 07:27:24,780 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF320 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 14min  5.8s
2021-12-06 07:27:27,385 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF449 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[##########                              ] | 25% Completed | 14min  5.9s
2021-12-06 07:27:27,456 - pyscenic.transform - WARNING - Less than 80% of the genes in ZNF490 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.
[########################################] | 100% Completed | 14min  7.5s

2021-12-06 07:27:29,997 - pyscenic.cli.pyscenic - INFO - Writing results to file.

The result is a list of enriched motifs for the modules.

Column name            Description
TF                     Transcription Factor (TF) for which an enriched motif is discovered.
MotifID                The identifier of the enriched motif.
AUC                    Area Under the recovery Curve statistic for this enriched motif.
NES                    Normalized Enrichment Score for this enriched motif.
Context                Collection of tags clarifying the origin of the module for this factor: e.g. ranking database, ...
Annotation             Verbose description of the annotation available for this motif.
MotifSimilarityQvalue  The TomTom derived Q-value for motif similarity (if used for assigning the factor to this enriched motif).
OrthologousIdentity    The Amino Acid Identity between factors (if used for assigning the factor to this enriched motif).
RankAtMax              The position of the Leading Edge which is used as a threshold on the whole genome ranking of the motif to decide if a gene in the input is a direct target of a TF that binds this motif.
TargetGenes            A list of pairs: genes and their associated weights from GENIE3/GRNBoost2.

Read the TF-motif results

In [3]:
df_motifs = load_motifs('./reg.csv')
df_motifs.head()
Out[3]:
Enrichment
AUC NES MotifSimilarityQvalue OrthologousIdentity Annotation Context TargetGenes RankAtMax
TF MotifID
ATF3 dbcorrdb__CEBPD__ENCSR000BQJ_1__m1 0.083382 3.808435 0.000001 1.0 motif similar to dbcorrdb__ATF3__ENCSR000BNU_1... (hg38__refseq-r80__500bp_up_and_100bp_down_tss... [(AQP9, 1.1227345953791656), (CREB5, 5.2188569... 1568
cisbp__M0314 0.077860 3.376442 0.000052 1.0 motif similar to factorbook__CREB ('CREB'; q-v... (hg38__refseq-r80__500bp_up_and_100bp_down_tss... [(CREB5, 5.218856962902287), (RXRA, 1.69832948... 4550
transfac_pro__M07414 0.094758 4.698509 0.000972 1.0 motif similar to factorbook__CREB ('CREB'; q-v... (hg38__refseq-r80__500bp_up_and_100bp_down_tss... [(CREB5, 5.218856962902287), (AQP9, 1.12273459... 1505
cisbp__M0326 0.078454 3.422876 0.000832 1.0 motif similar to factorbook__CREB ('CREB'; q-v... (hg38__refseq-r80__500bp_up_and_100bp_down_tss... [(CREB5, 5.218856962902287), (ATF3, 1.0), (NFK... 2260
BCL11A cisbp__M4475 0.096300 5.628151 0.000000 1.0 gene is annotated for similar motif cisbp__M44... (hg38__refseq-r80__500bp_up_and_100bp_down_tss... [(FCGR2B, 6.452480106575977), (ALOX5AP, 5.4368... 3093
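
Since one TF usually has several enriched motifs, it can be useful to pick the best-scoring motif per TF. The sketch below does this on a flat toy dataframe built from the TF, MotifID and NES values printed above; `motifs_toy` is a hypothetical name, and the real `df_motifs` has a (TF, MotifID) index and two-level column names, so the same idea would need to address the ('Enrichment', 'NES') column there.

```python
import pandas as pd

# TF, MotifID and NES values copied from the output above, flattened for simplicity.
motifs_toy = pd.DataFrame({
    'TF': ['ATF3', 'ATF3', 'ATF3', 'ATF3', 'BCL11A'],
    'MotifID': [
        'dbcorrdb__CEBPD__ENCSR000BQJ_1__m1',
        'cisbp__M0314',
        'transfac_pro__M07414',
        'cisbp__M0326',
        'cisbp__M4475',
    ],
    'NES': [3.808435, 3.376442, 4.698509, 3.422876, 5.628151],
})

# For every TF, keep the row with the highest Normalized Enrichment Score.
best_motif = motifs_toy.loc[motifs_toy.groupby('TF')['NES'].idxmax()]
print(best_motif)
```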

Let's embed the motif logo of each enriched motif into the dataframe

In [379]:
BASE_URL = 'http://motifcollections.aertslab.org/v9/logos/'
COLUMN_NAME_LOGO = 'MotifLogo'
COLUMN_NAME_MOTIF_ID = 'MotifID'
COLUMN_NAME_TARGETS = 'TargetGenes'
FIGURES_FOLDERNAME = './figures/'
In [380]:
### Set up some helper functions first
def savesvg(fname: str, fig, folder: str=FIGURES_FOLDERNAME) -> None:
    """
    Save figure as vector-based SVG image format.
    """
    fig.tight_layout()
    fig.savefig(os.path.join(folder, fname), format='svg')

def display_logos(df: pd.DataFrame, top_target_genes: int = 3, base_url: str = BASE_URL):
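    """
    Display the first rows of the dataframe as HTML with a motif-logo column added.

    :param df: Dataframe of enriched motifs, indexed by TF and MotifID.
    :param top_target_genes: Number of target genes to keep per motif.
    :param base_url: Base URL of the motif logo images.
    """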
    # Make sure the original dataframe is not altered.
    df = df.copy()
    
    # Add column with URLs to sequence logo.
    def create_url(motif_id):
        return '<img src="{}{}.png" style="max-height:124px;"></img>'.format(base_url, motif_id)
    df[("Enrichment", COLUMN_NAME_LOGO)] = list(map(create_url, df.index.get_level_values(COLUMN_NAME_MOTIF_ID)))
    
    # Truncate TargetGenes.
    def truncate(col_val):
        return sorted(col_val, key=op.itemgetter(1))[:top_target_genes]
    df[("Enrichment", COLUMN_NAME_TARGETS)] = list(map(truncate, df[("Enrichment", COLUMN_NAME_TARGETS)]))
    
    # Temporarily lift the column-width limit so the <img> tags are not truncated.
    MAX_COL_WIDTH = pd.get_option('display.max_colwidth')
    pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated
    display(HTML(df.head().to_html(escape=False)))
    pd.set_option('display.max_colwidth', MAX_COL_WIDTH)
    
display_logos(df_motifs.head())
Enrichment
AUC NES MotifSimilarityQvalue OrthologousIdentity Annotation Context TargetGenes RankAtMax MotifLogo
TF MotifID
ATF3 dbcorrdb__CEBPD__ENCSR000BQJ_1__m1 0.083382 3.808435 0.000001 1.0 motif similar to dbcorrdb__ATF3__ENCSR000BNU_1__m1 ('ATF3 (ENCSR000BNU-1, motif 1)'; q-value = 1.45e-06) which is directly annotated (hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, activating, weight>75.0%) [(ATF3, 1.0), (NEAT1, 1.0447401413985318), (GABARAP, 1.060969255395327)] 1568
cisbp__M0314 0.077860 3.376442 0.000052 1.0 motif similar to factorbook__CREB ('CREB'; q-value = 5.21e-05) which is directly annotated (hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, activating, weight>75.0%) [(ATF3, 1.0), (RETN, 1.0312089031247464), (NEAT1, 1.0447401413985318)] 4550
transfac_pro__M07414 0.094758 4.698509 0.000972 1.0 motif similar to factorbook__CREB ('CREB'; q-value = 0.000972) which is directly annotated (hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, activating, weight>75.0%) [(ATF3, 1.0), (GABARAP, 1.060969255395327), (CSTA, 1.064965479804881)] 1505
cisbp__M0326 0.078454 3.422876 0.000832 1.0 motif similar to factorbook__CREB ('CREB'; q-value = 0.000832) which is directly annotated (hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, activating, weight>75.0%) [(ATF3, 1.0), (NEAT1, 1.0447401413985318), (GABARAP, 1.060969255395327)] 2260
BCL11A cisbp__M4475 0.096300 5.628151 0.000000 1.0 gene is annotated for similar motif cisbp__M4453 ('BCL11A[gene ID: "ENSG00000119866" species: "Homo sapiens" TF status: "direct" TF family: "C2H2 ZF" DBDs: "zf-C2H2"]; BCL11B[gene ID: "ENSG00000127152" species: "Homo sapiens" TF status: "inferred" TF family: "C2H2 ZF" DBDs: "zf-C2H2"]; Bcl11a[gene ID: "ENSMUSG00000000861" species: "Mus musculus" TF status: "inferred" TF family: "C2H2 ZF" DBDs: "zf-C2H2"]; Bcl11b[gene ID: "ENSMUSG00000048251" species: "Mus musculus" TF status: "inferred" TF family: "C2H2 ZF" DBDs: "zf-C2H2"]'; q-value = 6.32e-14) (hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, activating, weight>75.0%) [(ENTPD1, 1.1548777195742543), (SPI1, 1.1968982137430095), (SCIMP, 1.254032449123743)] 3093

We convert the motif dataframe into a list of regulons for further analysis

In [381]:
regulons = df2regulons(df_motifs)
# Pickle these regulons.
with open('./regulons.pkl', 'wb') as f:
    pickle.dump(regulons, f)
Create regulons from a dataframe of enriched features.
Additional columns saved: []
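
The pickled regulons can be reloaded in a later session without recomputing them. The block below is a minimal sketch of that round trip. It assumes a toy list of dictionaries (`demo_regulons`, with placeholder names) in place of the real regulon objects, and it writes to a temporary file rather than `./regulons.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the regulon list: names and genes are placeholders,
# not results from this analysis.
demo_regulons = [
    {"name": "TF_A(+)", "genes": ["GENE_1", "GENE_2"]},
    {"name": "TF_B(+)", "genes": ["GENE_3"]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regulons_demo.pkl")

    # Write the list to disk, as done above for the real regulons.
    with open(path, "wb") as f:
        pickle.dump(demo_regulons, f)

    # Read it back; the result is an equal, independent copy.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print([r["name"] for r in reloaded])
```

With the real regulon objects, the packages that define their classes must be importable in the session that calls `pickle.load`.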
In [382]:
def fetch_logo(regulon, base_url = BASE_URL):
    for elem in regulon.context:
        if elem.endswith('.png'):
            return '<img src="{}{}" style="max-height:124px;"></img>'.format(base_url, elem)
    return ""

Create a regulon dataframe with the motif logo included

In [383]:
df_regulons = pd.DataFrame(data=[
    list(map(op.attrgetter('name'), regulons)),
    list(map(len, regulons)),
    list(map(fetch_logo, regulons))
], index=['name', 'count', 'logo']).T
In [386]:
### Replace this by your TF name
SELECTED_TF = 'IRF8(+)'

# Temporarily lift the column-width limit so the logo <img> tag is not truncated.
# None means "no limit"; passing -1 is deprecated since pandas 1.0.
MAX_COL_WIDTH = pd.get_option('display.max_colwidth')
pd.set_option('display.max_colwidth', None)
display(HTML(df_regulons[df_regulons.name == SELECTED_TF].to_html(escape=False)))
pd.set_option('display.max_colwidth', MAX_COL_WIDTH)
name count logo
16 IRF8(+) 300

STEP 4: Cellular enrichment (aka AUCell) from CLI

We use AUCell to determine whether a given gene set (here, a regulon) is active or inactive in each cell. Within a cell, AUCell ranks the genes by expression value; genes with the same expression value are ordered randomly. It then walks down the ranking, starting from the most highly expressed gene, and keeps track of how many genes of the given set have been recovered at each rank. The area under this recovery curve (AUC) is the enrichment score of the gene set in that cell. For more information, please refer to the AUCell documentation.
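
The ranking-and-recovery idea described above can be sketched in a few lines of plain Python. This is an illustration only, not the AUCell or pySCENIC implementation: the function name `recovery_auc`, the `top_fraction` parameter (the share of top-ranked genes that is scanned), and the normalization by the best achievable area are all assumptions of this sketch.

```python
import random


def recovery_auc(expression, gene_set, top_fraction=0.05, seed=0):
    """Toy AUCell-style score for a single cell.

    expression: dict mapping gene name -> expression value in this cell
    gene_set:   set of gene names (e.g. the targets of one regulon)
    """
    rng = random.Random(seed)
    genes = list(expression)
    # Shuffle first, then sort with a stable sort: genes with identical
    # expression end up in random relative order.
    rng.shuffle(genes)
    genes.sort(key=lambda g: expression[g], reverse=True)

    # Only the top of the ranking is scanned (assumption of this sketch).
    max_rank = max(1, int(round(top_fraction * len(genes))))

    hits = 0   # gene-set members recovered so far
    area = 0   # running area under the recovery curve
    for gene in genes[:max_rank]:
        if gene in gene_set:
            hits += 1
        area += hits

    # Best case: every gene-set member sits at the very top of the ranking.
    n_set = min(len(gene_set), max_rank)
    max_area = sum(min(rank, n_set) for rank in range(1, max_rank + 1))
    return area / max_area if max_area else 0.0


# Ten hypothetical genes with distinct expression values (g0 highest).
cell = {f"g{i}": 10 - i for i in range(10)}
score_top = recovery_auc(cell, {"g0", "g1"}, top_fraction=0.5)
score_mid = recovery_auc(cell, {"g2", "g4"}, top_fraction=0.5)
score_low = recovery_auc(cell, {"g8", "g9"}, top_fraction=0.5)
print(score_top, score_mid, score_low)
```

A gene set whose members sit at the top of the ranking scores 1.0, a set found only among the weakly expressed genes scores 0.0, and anything in between reflects how early in the ranking the set is recovered.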

In [387]:
os.system('pyscenic aucell ./pbmc4k.loom reg.csv --output ./pyscenic_output.loom --num_workers 20')
2021-12-06 07:29:22,603 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2021-12-06 07:29:22,808 - pyscenic.cli.pyscenic - INFO - Loading gene signatures.
Create regulons from a dataframe of enriched features.
Additional columns saved: []

2021-12-06 07:29:25,001 - pyscenic.cli.pyscenic - INFO - Calculating cellular enrichment.

2021-12-06 07:29:30,939 - pyscenic.cli.pyscenic - INFO - Writing results to file.

Read the aucell results as a matrix

In [8]:
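# collect SCENIC AUCell output
# The file is only read here, so open it read-only to avoid modifying it by accident.
lf = lp.connect('pyscenic_output.loom', mode='r', validate=False)
# Column attribute RegulonsAUC holds one AUC value per cell and regulon.
auc_mtx = pd.DataFrame(lf.ca.RegulonsAUC, index=lf.ca.CellID)
lf.close()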
In [9]:
auc_mtx
Out[9]:
ASCL2(+) ATF3(+) BACH1(+) BCL11A(+) CEBPB(+) CEBPD(+) CREB5(+) EGR1(+) EOMES(+) ETV7(+) ... SPIB(+) STAT1(+) STAT2(+) TBX21(+) TCF4(+) TCF7(+) TCF7L2(+) TFEC(+) ZNF438(+) ZNF503(+)
AAACCTGAGAAGGCCT-1 0.056777 0.049699 0.050274 0.026008 0.067909 0.061067 0.068027 0.004579 0.011818 0.000000 ... 0.020416 0.055901 0.047061 0.015026 0.020545 0.026786 0.062698 0.057540 0.0 0.071429
AAACCTGAGACAGACC-1 0.040293 0.046113 0.049503 0.005363 0.062451 0.055776 0.000000 0.016789 0.007299 0.000000 ... 0.006840 0.040114 0.040365 0.007840 0.004512 0.000000 0.062103 0.156746 0.0 0.079365
AAACCTGAGATAGTCA-1 0.006410 0.035499 0.048476 0.019949 0.065615 0.046921 0.002551 0.006105 0.012948 0.000000 ... 0.019162 0.014105 0.066034 0.015897 0.003168 0.035289 0.071131 0.057540 0.0 0.000000
AAACCTGAGCGCCTCA-1 0.000000 0.038224 0.015245 0.023220 0.029900 0.032444 0.000000 0.016484 0.058655 0.105590 ... 0.031433 0.053313 0.032552 0.051757 0.032834 0.103316 0.030060 0.011905 0.0 0.000000
AAACCTGAGGCATGGT-1 0.000000 0.009968 0.012761 0.020646 0.015939 0.015579 0.034014 0.002747 0.035454 0.045549 ... 0.024436 0.065476 0.027468 0.019019 0.027266 0.125000 0.010913 0.000000 0.0 0.170635
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTGGTTTCGCTAGCG-1 0.000000 0.053500 0.033231 0.017643 0.057309 0.055592 0.070578 0.000000 0.003650 0.000000 ... 0.009033 0.061594 0.040613 0.008856 0.008737 0.023384 0.049107 0.067460 0.0 0.103175
TTTGTCACACTTAACG-1 0.001832 0.010255 0.001028 0.015176 0.009492 0.027300 0.024660 0.000000 0.104884 0.038820 ... 0.021564 0.018634 0.006262 0.091391 0.019009 0.025510 0.013294 0.000000 0.0 0.000000
TTTGTCACAGGTCCAC-1 0.000000 0.003371 0.003854 0.011154 0.005616 0.030938 0.000000 0.008547 0.075773 0.000000 ... 0.010808 0.022386 0.008557 0.064533 0.014401 0.000000 0.015278 0.000000 0.0 0.000000
TTTGTCAGTTAAGACA-1 0.000000 0.019291 0.019527 0.093737 0.014713 0.015065 0.000000 0.014347 0.009819 0.000000 ... 0.087041 0.000000 0.038876 0.022067 0.080933 0.000000 0.014980 0.054563 0.0 0.000000
TTTGTCATCCCAAGAT-1 0.000000 0.059380 0.049418 0.010564 0.061660 0.061986 0.054422 0.001221 0.009559 0.000000 ... 0.014202 0.032997 0.025050 0.005226 0.013729 0.002976 0.045536 0.026786 0.0 0.047619

4331 rows × 47 columns

Visualization of SCENIC's AUC matrix

Create heatmap with binarized regulon activity.

For a given gene set, cells with a higher AUC score are more likely to be in the 'active' state for that gene set. In the ideal situation, the AUC scores of all cells follow a bimodal distribution in which active cells are clearly separated from inactive cells.

In the steps below, we use the function binarize from the module pyscenic.binarization to compute an AUC threshold for each gene set and use it to assign each cell an active or inactive state. Note that deciding whether a signature is active in a given cell is not always trivial; please refer to the AUCell documentation for more information about this step.
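
To make the thresholding idea concrete, the block below is a minimal sketch that splits the AUC scores of one regulon into a low and a high group and places the threshold halfway between the two group means. This simple one-dimensional two-means split is a stand-in chosen for illustration; it is not the method used by `pyscenic.binarization.binarize`. The function name `two_means_threshold` and the example scores are hypothetical.

```python
def two_means_threshold(values, n_iter=100):
    """Split 1-D scores into a low and a high group (two-means clustering)
    and return the midpoint between the two group means."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All scores are identical: there is nothing to separate.
        return lo
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # group means no longer change
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


# Hypothetical AUC scores of one regulon across six cells.
auc_scores = [0.01, 0.02, 0.03, 0.20, 0.22, 0.25]
threshold = two_means_threshold(auc_scores)

# A cell is called "on" (1) when its score exceeds the threshold.
binarized = [int(score > threshold) for score in auc_scores]
print(round(threshold, 4), binarized)
```

When the scores are clearly bimodal, as in this toy example, any reasonable method places the threshold in the gap between the two modes. For unimodal or skewed distributions the choice of method matters, which is why the per-regulon thresholds are worth inspecting.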

Compute the on/off thresholds for each regulon

In [46]:
bin_mtx, thresholds = binarize(auc_mtx)
thresholds.to_frame().rename(columns={0:'threshold'}).to_csv('onoff_thresholds.csv')

Draw the AUC score distribution with the threshold included

In [47]:
fig, ((ax1, ax2, ax3)) = plt.subplots(1, 3, figsize=(8, 4), dpi=100)

plot_binarization(auc_mtx, 'PAX5(+)', thresholds['PAX5(+)'], ax=ax1)
plot_binarization(auc_mtx, 'IRF8(+)', thresholds['IRF8(+)'], ax=ax2)
plot_binarization(auc_mtx, 'POU2AF1(+)', thresholds['POU2AF1(+)'], ax=ax3)

plt.tight_layout()
/project_envs/pyscenic-fresh/lib/python3.8/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

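A helper function to draw the cell type color palette

In [48]:
def palplot(pal, names, colors=None, size=1):
    """Draw the palette `pal` as a row of swatches labeled with `names`.

    `colors` sets the text color of each label (black by default) and
    `size` scales the figure. Returns the matplotlib figure.
    """
    n = len(pal)
    f, ax = plt.subplots(1, 1, figsize=(n * size, size))
    # One image cell per palette entry, colored through a discrete colormap.
    ax.imshow(np.arange(n).reshape(1, n),
              cmap=mpl.colors.ListedColormap(list(pal)),
              interpolation="nearest", aspect="auto")
    # Place the ticks on the swatch borders and hide their labels.
    ax.set_xticks(np.arange(n) - .5)
    ax.set_yticks([-.5, .5])
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    colors = n * ['k'] if colors is None else colors
    # Write each name centered on its swatch.
    for idx, (name, color) in enumerate(zip(names, colors)):
        ax.text(0.0+idx, 0.0, name, color=color, horizontalalignment='center', verticalalignment='center')
    return f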
In [49]:
N_COLORS = len(adata.obs.celltype.dtype.categories)
COLORS = [color['color'] for color in mpl.rcParams["axes.prop_cycle"]]
In [50]:
### cell type -> color lookup table
cell_type_color_lut = dict(zip(adata.obs.celltype.dtype.categories, COLORS))
### black/white palette
bw_palette = sns.xkcd_palette(['white', 'black'])
### cell type color palette
# A single call is enough: sns.set(font_scale=0.8) also applies the seaborn defaults.
sns.set(font_scale=0.8)
fig = palplot(sns.color_palette(COLORS), adata.obs.celltype.dtype.categories, size=2.0)
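
One detail of building a lookup table such as `cell_type_color_lut` with `zip` is that `zip` stops at the shorter input, so cell types beyond the number of available colors would silently receive no color. The block below is a minimal sketch of a variant that recycles colors instead. The helper name `build_color_lut`, the cell names, and the two-color palette are hypothetical and chosen only to show the behavior.

```python
from itertools import cycle


def build_color_lut(categories, colors):
    """Map every category to a color, reusing colors if there are too few."""
    return dict(zip(categories, cycle(colors)))


# Three cell types but only two colors (hypothetical toy inputs).
cell_types = ["B", "CD4 T", "NK1"]
palette = ["#1f77b4", "#ff7f0e"]

plain_lut = dict(zip(cell_types, palette))         # "NK1" is dropped
safe_lut = build_color_lut(cell_types, palette)    # "NK1" reuses the first color

# Per-cell colors follow from chaining two lookups: cell -> type -> color.
cell_to_type = {"cell_1": "CD4 T", "cell_2": "B", "cell_3": "NK1"}
cell_colors = {cell: safe_lut[ctype] for cell, ctype in cell_to_type.items()}
print(cell_colors)
```

Recycling colors makes two cell types share a color, so when the palette is too short a larger palette is usually the better fix; the point here is only that the plain `zip` fails silently.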

Map cell ID to cell type

In [51]:
cell_id2cell_type_lut = adata.obs['celltype'].to_dict()
cell_id2cell_type_lut
Out[51]:
{'AAACCTGAGAAGGCCT-1': 'CD4 T',
 'AAACCTGAGACAGACC-1': 'CD14 Monocytes',
 'AAACCTGAGATAGTCA-1': 'CD4 T',
 'AAACCTGAGCGCCTCA-1': 'Naive T cell',
 'AAACCTGAGGCATGGT-1': 'CD4 T',
 'AAACCTGCAAGGTTCT-1': 'CD8 T',
 'AAACCTGCAGGATTGG-1': 'Unknown',
 'AAACCTGCAGGCGATA-1': 'Dendritic',
 'AAACCTGCATCCCATC-1': 'B',
 'AAACCTGCATGAAGTA-1': 'B',
 'AAACCTGGTACATCCA-1': 'B',
 'AAACCTGGTGCGGTAA-1': 'CD8 T',
 'AAACCTGTCGTGGTCG-1': 'CD4 T',
 'AAACCTGTCTCTGCTG-1': 'Naive T cell',
 'AAACGGGAGCGGCTTC-1': 'Dendritic',
 'AAACGGGAGGCTAGCA-1': 'NK1',
 'AAACGGGAGTGTCCAT-1': 'CD8 T',
 'AAACGGGGTCTTCTCG-1': 'CD14 Monocytes',
 'AAACGGGGTGGACGAT-1': 'NK1',
 'AAACGGGGTTTGCATG-1': 'CD14 Monocytes',
 'AAACGGGTCTGGGCCA-1': 'CD4 T',
 'AAACGGGTCTGGTATG-1': 'Naive T cell',
 'AAAGATGAGACATAAC-1': 'CD4 T',
 'AAAGATGGTCCCTACT-1': 'B',
 'AAAGATGTCCGAATGT-1': 'CD8 T',
 'AAAGATGTCTGGTTCC-1': 'CD14 Monocytes',
 'AAAGCAAAGTGTCCAT-1': 'Unknown',
 'AAAGCAACACCGAAAG-1': 'CD8 T',
 'AAAGCAAGTAAACACA-1': 'NK2',
 'AAAGCAAGTCAAGCGA-1': 'CD4 T',
 'AAAGCAAGTCTCACCT-1': 'CD14 Monocytes',
 'AAAGCAAGTCTTTCAT-1': 'Naive T cell',
 'AAAGCAAGTTGTTTGG-1': 'CD4 T',
 'AAAGCAATCGTATCAG-1': 'NK2',
 'AAAGCAATCTGCCCTA-1': 'Dendritic',
 'AAAGTAGAGAAGGTGA-1': 'Naive T cell',
 'AAAGTAGAGACAATAC-1': 'CD4 T',
 'AAAGTAGAGCGACGTA-1': 'NK1',
 'AAAGTAGAGGGCACTA-1': 'CD14 Monocytes',
 'AAAGTAGAGTCAAGGC-1': 'FCGR3A Monocytes',
 'AAAGTAGAGTCCATAC-1': 'B',
 'AAAGTAGCACATGACT-1': 'B',
 'AAAGTAGCAGCTGTTA-1': 'B',
 'AAAGTAGGTAAGGGAA-1': 'CD4 T',
 'AAAGTAGGTTACTGAC-1': 'CD4 T',
 'AAAGTAGTCAAACCAC-1': 'CD4 T',
 'AAAGTAGTCTCAAGTG-1': 'CD4 T',
 'AAATGCCAGACGCACA-1': 'Naive T cell',
 'AAATGCCAGGGCTTCC-1': 'CD14 Monocytes',
 'AAATGCCAGGGTATCG-1': 'CD4 T',
 'AAATGCCCACATTTCT-1': 'CD8 T',
 'AAATGCCCAGACACTT-1': 'CD14 Monocytes',
 'AAATGCCGTGCAGTAG-1': 'NK1',
 'AAATGCCGTTTCGCTC-1': 'CD4 T',
 'AAATGCCTCACCAGGC-1': 'CD14 Monocytes',
 'AAATGCCTCAGGCCCA-1': 'B',
 'AACACGTAGGTGCTAG-1': 'CD14 Monocytes',
 'AACACGTCACCTGGTG-1': 'Unknown',
 'AACACGTCATGGTTGT-1': 'CD8 T',
 'AACACGTGTACCGTAT-1': 'CD4 T',
 'AACACGTGTATCAGTC-1': 'Unknown',
 'AACACGTTCGCATGGC-1': 'CD8 T',
 'AACACGTTCTTACCGC-1': 'NK1',
 'AACCATGAGAGTACCG-1': 'Dendritic',
 'AACCATGAGGAGCGTT-1': 'Naive T cell',
 'AACCATGCAGTCCTTC-1': 'CD14 Monocytes',
 'AACCATGGTACAGCAG-1': 'CD14 Monocytes',
 'AACCATGGTATCTGCA-1': 'CD4 T',
 'AACCATGTCGGAAACG-1': 'CD8 T',
 'AACCATGTCGGATGTT-1': 'Unknown',
 'AACCGCGAGACACTAA-1': 'Naive T cell',
 'AACCGCGAGATCCCAT-1': 'Unknown',
 'AACCGCGAGGGTGTGT-1': 'CD14 Monocytes',
 'AACCGCGCACGGCCAT-1': 'NK1',
 'AACCGCGCATGGTCTA-1': 'CD4 T',
 'AACCGCGGTAAGTGTA-1': 'CD8 T',
 'AACCGCGGTAGGAGTC-1': 'Naive T cell',
 'AACCGCGGTTACGCGC-1': 'CD4 T',
 'AACCGCGTCTGCTTGC-1': 'CD4 T',
 'AACGTTGAGCGATATA-1': 'CD4 T',
 'AACGTTGAGGTCATCT-1': 'CD14 Monocytes',
 'AACGTTGCACACCGCA-1': 'CD4 T',
 'AACGTTGCAGTCGTGC-1': 'NK1',
 'AACGTTGCAGTTAACC-1': 'Naive T cell',
 'AACGTTGCATGCAATC-1': 'CD14 Monocytes',
 'AACGTTGGTGTAACGG-1': 'NK1',
 'AACGTTGGTTGGTGGA-1': 'Naive T cell',
 'AACGTTGTCACTGGGC-1': 'CD14 Monocytes',
 'AACGTTGTCCACGAAT-1': 'CD14 Monocytes',
 'AACGTTGTCTTGTATC-1': 'CD8 T',
 'AACTCAGCAACGATGG-1': 'CD8 T',
 'AACTCAGCACGGTTTA-1': 'CD4 T',
 'AACTCAGGTGCACTTA-1': 'NK1',
 'AACTCAGGTGTAAGTA-1': 'B',
 'AACTCAGGTTACCGAT-1': 'NK1',
 'AACTCAGGTTGTCGCG-1': 'CD4 T',
 'AACTCAGGTTTGACAC-1': 'Dendritic',
 'AACTCAGTCAGAGCTT-1': 'NK2',
 'AACTCAGTCCAACCAA-1': 'NK1',
 'AACTCCCAGAAACCTA-1': 'CD4 T',
 'AACTCCCAGCGTGAAC-1': 'Unknown',
 'AACTCCCCACATGGGA-1': 'CD4 T',
 'AACTCCCCACCATGTA-1': 'NK1',
 'AACTCCCCACTACAGT-1': 'B',
 'AACTCCCCATTAGCCA-1': 'CD8 T',
 'AACTCCCCATTGGTAC-1': 'CD8 T',
 'AACTCCCGTACCGTAT-1': 'CD14 Monocytes',
 'AACTCCCGTAGAGGAA-1': 'NK2',
 'AACTCCCGTATTCTCT-1': 'CD4 T',
 'AACTCCCGTCTCAACA-1': 'B',
 'AACTCCCGTTGGTAAA-1': 'Naive T cell',
 'AACTCCCTCACGGTTA-1': 'CD4 T',
 'AACTCCCTCGTATCAG-1': 'CD4 T',
 'AACTCTTAGATAGGAG-1': 'CD4 T',
 'AACTCTTCAAGGGTCA-1': 'NK1',
 'AACTCTTCAGTCTTCC-1': 'CD8 T',
 'AACTCTTGTTACGCGC-1': 'Naive T cell',
 'AACTCTTTCAACGAAA-1': 'Naive T cell',
 'AACTCTTTCTCTTATG-1': 'B',
 'AACTGGTAGCGATAGC-1': 'CD14 Monocytes',
 'AACTGGTAGCTGCCCA-1': 'NK2',
 'AACTGGTGTACATCCA-1': 'CD4 T',
 'AACTGGTGTACGAAAT-1': 'B',
 'AACTGGTGTGAGCGAT-1': 'NK2',
 'AACTGGTGTTTCGCTC-1': 'CD4 T',
 'AACTGGTTCCTTCAAT-1': 'NK2',
 'AACTGGTTCGGAGGTA-1': 'B',
 'AACTTTCAGAGCTATA-1': 'NK1',
 'AACTTTCAGAGTAAGG-1': 'CD4 T',
 'AACTTTCAGGGTGTTG-1': 'CD4 T',
 'AACTTTCCAATACGCT-1': 'B',
 'AACTTTCCACACCGAC-1': 'B',
 'AACTTTCCATAAAGGT-1': 'NK1',
 'AACTTTCTCAACACGT-1': 'CD14 Monocytes',
 'AACTTTCTCAAGGCTT-1': 'B',
 'AACTTTCTCATCGGAT-1': 'Dendritic',
 'AAGACCTAGAAACGAG-1': 'NK1',
 'AAGACCTAGAATTCCC-1': 'CD4 T',
 'AAGACCTAGCACCGTC-1': 'CD4 T',
 'AAGACCTGTGAGGCTA-1': 'NK1',
 'AAGACCTGTGCAGGTA-1': 'B',
 'AAGACCTGTTTGACAC-1': 'CD4 T',
 'AAGACCTTCACCGGGT-1': 'CD8 T',
 'AAGACCTTCGAACGGA-1': 'CD14 Monocytes',
 'AAGCCGCAGGCCATAG-1': 'CD4 T',
 'AAGCCGCCAATCTGCA-1': 'CD8 T',
 'AAGCCGCCAATGTAAG-1': 'CD14 Monocytes',
 'AAGCCGCCACGAAACG-1': 'NK2',
 'AAGCCGCGTAGCAAAT-1': 'Naive T cell',
 'AAGCCGCGTGTAACGG-1': 'CD4 T',
 'AAGCCGCTCAAGGTAA-1': 'CD4 T',
 'AAGCCGCTCATACGGT-1': 'CD14 Monocytes',
 'AAGGAGCAGTTATCGC-1': 'CD14 Monocytes',
 'AAGGAGCCAAGTTAAG-1': 'CD4 T',
 'AAGGAGCGTACGCACC-1': 'CD4 T',
 'AAGGAGCGTACTCAAC-1': 'CD14 Monocytes',
 'AAGGAGCGTAGTACCT-1': 'CD4 T',
 'AAGGAGCGTCAAAGCG-1': 'Naive T cell',
 'AAGGAGCGTGGTTTCA-1': 'CD14 Monocytes',
 'AAGGAGCGTTGAGGTG-1': 'B',
 'AAGGAGCTCATGCAAC-1': 'B',
 'AAGGAGCTCTGTGCAA-1': 'CD4 T',
 'AAGGCAGAGAATCTCC-1': 'B',
 'AAGGCAGAGTGGGTTG-1': 'CD14 Monocytes',
 'AAGGCAGCAAATACAG-1': 'CD14 Monocytes',
 'AAGGCAGCAAATCCGT-1': 'NK1',
 'AAGGCAGCACCATCCT-1': 'Dendritic',
 'AAGGCAGGTAGTGAAT-1': 'CD8 T',
 'AAGGCAGGTGGTAACG-1': 'CD8 T',
 'AAGGCAGTCCGCATCT-1': 'CD4 T',
 'AAGGCAGTCCGCGGTA-1': 'CD8 T',
 'AAGGCAGTCCTATGTT-1': 'B',
 'AAGGCAGTCGTACGGC-1': 'NK1',
 'AAGGTTCAGGTGATAT-1': 'CD14 Monocytes',
 'AAGGTTCAGTGTACCT-1': 'CD4 T',
 'AAGGTTCCAAGGCTCC-1': 'B',
 'AAGGTTCCACACGCTG-1': 'CD8 T',
 'AAGGTTCCACCGTTGG-1': 'CD4 T',
 'AAGGTTCCACGAAACG-1': 'NK2',
 'AAGGTTCGTGCCTGCA-1': 'CD8 T',
 'AAGGTTCGTGTTCTTT-1': 'NK1',
 'AAGGTTCGTTCACGGC-1': 'CD14 Monocytes',
 'AAGGTTCTCCATGAAC-1': 'NK1',
 'AAGGTTCTCTACTATC-1': 'NK2',
 'AAGGTTCTCTCGTATT-1': 'B',
 'AAGGTTCTCTTGAGGT-1': 'CD8 T',
 'AAGTCTGAGAGGGATA-1': 'CD14 Monocytes',
 'AAGTCTGAGAGTACAT-1': 'Naive T cell',
 'AAGTCTGAGCGGATCA-1': 'NK2',
 'AAGTCTGAGTCAAGCG-1': 'B',
 'AAGTCTGCAAGGGTCA-1': 'CD14 Monocytes',
 'AAGTCTGCACTGTTAG-1': 'Naive T cell',
 'AATCCAGAGAGACTAT-1': 'NK2',
 'AATCCAGAGCCAACAG-1': 'Naive T cell',
 'AATCCAGAGCCACCTG-1': 'B',
 'AATCCAGAGTAATCCC-1': 'NK2',
 'AATCCAGCAAAGAATC-1': 'B',
 'AATCCAGCACAGGCCT-1': 'B',
 'AATCCAGCAGCCTGTG-1': 'CD14 Monocytes',
 'AATCCAGGTAAAGTCA-1': 'CD14 Monocytes',
 'AATCCAGGTATAATGG-1': 'CD4 T',
 'AATCCAGGTGAAATCA-1': 'Naive T cell',
 'AATCCAGGTGCGCTTG-1': 'CD14 Monocytes',
 'AATCCAGTCCCAGGTG-1': 'B',
 'AATCCAGTCCGAATGT-1': 'CD8 T',
 'AATCCAGTCTCCAGGG-1': 'CD4 T',
 'AATCCAGTCTCGCTTG-1': 'CD14 Monocytes',
 'AATCGGTCAGGACCCT-1': 'FCGR3A Monocytes',
 'AATCGGTCATCGTCGG-1': 'CD14 Monocytes',
 'AATCGGTGTCGGATCC-1': 'CD14 Monocytes',
 'AATCGGTGTGACTCAT-1': 'CD4 T',
 'AATCGGTGTTCCATGA-1': 'Naive T cell',
 'AATCGGTTCACCGTAA-1': 'Unknown',
 'ACACCAAAGAAACGCC-1': 'NK2',
 'ACACCAAAGACTGGGT-1': 'CD8 T',
 'ACACCAAAGGCAAAGA-1': 'CD4 T',
 'ACACCAAAGGCAATTA-1': 'CD4 T',
 'ACACCAAAGGCTATCT-1': 'NK1',
 'ACACCAAAGTTCCACA-1': 'NK1',
 'ACACCAAAGTTTCCTT-1': 'CD4 T',
 'ACACCAACAACTGCTA-1': 'NK1',
 'ACACCAAGTAGCGATG-1': 'NK1',
 'ACACCAATCATTGCGA-1': 'CD14 Monocytes',
 'ACACCAATCTTGACGA-1': 'NK1',
 'ACACCCTAGTACGTTC-1': 'B',
 'ACACCCTAGTGTTAGA-1': 'NK1',
 'ACACCCTCATCTCCCA-1': 'FCGR3A Monocytes',
 'ACACCCTCATGCAATC-1': 'CD4 T',
 'ACACCCTGTAAGGGCT-1': 'CD4 T',
 'ACACCCTGTCGCTTTC-1': 'Naive T cell',
 'ACACCCTGTTCTGGTA-1': 'CD14 Monocytes',
 'ACACCGGAGCTAAGAT-1': 'CD14 Monocytes',
 'ACACCGGAGGCAGGTT-1': 'CD4 T',
 'ACACCGGAGTAACCCT-1': 'CD14 Monocytes',
 'ACACCGGCAACCGCCA-1': 'CD4 T',
 'ACACCGGCACACATGT-1': 'NK2',
 'ACACCGGCACAGGTTT-1': 'NK2',
 'ACACCGGCAGCTGCTG-1': 'CD4 T',
 'ACACCGGCATCCTTGC-1': 'CD4 T',
 'ACACCGGGTATGAATG-1': 'B',
 'ACACCGGTCCCATTTA-1': 'Naive T cell',
 'ACACCGGTCGTAGGAG-1': 'CD8 T',
 'ACACTGAAGCGGCTTC-1': 'CD8 T',
 'ACACTGACAAGTAATG-1': 'NK1',
 'ACACTGACACAGGTTT-1': 'CD14 Monocytes',
 'ACACTGAGTGGTGTAG-1': 'CD14 Monocytes',
 'ACACTGATCCTTCAAT-1': 'CD4 T',
 'ACACTGATCGTTACGA-1': 'Naive T cell',
 'ACACTGATCTTTAGGG-1': 'NK1',
 'ACAGCCGCAGTTCATG-1': 'B',
 'ACAGCCGGTCATACTG-1': 'CD4 T',
 'ACAGCCGGTCTCAACA-1': 'CD4 T',
 'ACAGCCGGTCTTGATG-1': 'FCGR3A Monocytes',
 'ACAGCCGTCACCGTAA-1': 'CD4 T',
 'ACAGCTAAGATAGCAT-1': 'B',
 'ACAGCTAAGCCGATTT-1': 'CD14 Monocytes',
 'ACAGCTAAGGACAGCT-1': 'CD8 T',
 'ACAGCTAAGTTAGGTA-1': 'NK1',
 'ACAGCTACACTAAGTC-1': 'CD14 Monocytes',
 'ACAGCTACATACGCTA-1': 'NK2',
 'ACAGCTAGTAAGTGGC-1': 'CD4 T',
 'ACAGCTAGTATTAGCC-1': 'NK1',
 'ACAGCTAGTCGGGTCT-1': 'CD14 Monocytes',
 'ACAGCTATCAACGGCC-1': 'Naive T cell',
 'ACAGCTATCGCATGAT-1': 'CD4 T',
 'ACATACGAGTTACGGG-1': 'CD4 T',
 'ACATACGCAGCTGCAC-1': 'CD4 T',
 'ACATACGTCACCGGGT-1': 'CD14 Monocytes',
 'ACATACGTCGGGAGTA-1': 'CD4 T',
 'ACATCAGAGGTGCAAC-1': 'Naive T cell',
 'ACATCAGCAGCTGTAT-1': 'CD8 T',
 'ACATCAGCATATACGC-1': 'B',
 'ACATCAGGTGTGACCC-1': 'Naive T cell',
 'ACATCAGGTTTGTTTC-1': 'Naive T cell',
 'ACATCAGTCCTATTCA-1': 'CD4 T',
 'ACATGGTAGCATGGCA-1': 'B',
 'ACATGGTAGTCTCAAC-1': 'NK1',
 'ACATGGTCAAATACAG-1': 'Naive T cell',
 'ACATGGTCAACCGCCA-1': 'CD4 T',
 'ACATGGTCATCGTCGG-1': 'B',
 'ACATGGTCATTTCACT-1': 'B',
 'ACATGGTTCTGCAAGT-1': 'B',
 'ACCAGTAAGGGCTCTC-1': 'NK1',
 'ACCAGTAAGTGATCGG-1': 'CD4 T',
 'ACCAGTACAGACAGGT-1': 'CD14 Monocytes',
 'ACCAGTAGTACAAGTA-1': 'CD8 T',
 'ACCAGTAGTAGAGCTG-1': 'CD14 Monocytes',
 'ACCAGTAGTCTCTCTG-1': 'NK1',
 'ACCAGTAGTGAGTGAC-1': 'CD8 T',
 'ACCAGTAGTGGTTTCA-1': 'CD8 T',
 'ACCAGTAGTTGACGTT-1': 'NK1',
 'ACCAGTAGTTGTACAC-1': 'B',
 'ACCAGTATCCGTACAA-1': 'Unknown',
 'ACCAGTATCCTCAATT-1': 'CD4 T',
 'ACCAGTATCGCCTGAG-1': 'Naive T cell',
 'ACCAGTATCGCGTAGC-1': 'FCGR3A Monocytes',
 'ACCAGTATCTCATTCA-1': 'Dendritic',
 'ACCCACTAGCAATATG-1': 'B',
 'ACCCACTAGCGCCTCA-1': 'NK2',
 'ACCCACTAGCTAGTGG-1': 'CD8 T',
 'ACCCACTAGGTGATAT-1': 'CD14 Monocytes',
 'ACCCACTAGTGGACGT-1': 'B',
 'ACCCACTCATAAGACA-1': 'B',
 'ACCCACTGTAAGTAGT-1': 'NK1',
 'ACCCACTGTAGGCTGA-1': 'CD4 T',
 'ACCCACTTCAGTCAGT-1': 'CD8 T',
 'ACCCACTTCGAGAGCA-1': 'NK1',
 'ACCCACTTCGGAGCAA-1': 'CD8 T',
 'ACCCACTTCGGCTTGG-1': 'NK1',
 'ACCCACTTCTACTATC-1': 'B',
 'ACCCACTTCTGTACGA-1': 'Naive T cell',
 'ACCGTAAAGATAGGAG-1': 'CD4 T',
 'ACCGTAACAGGACCCT-1': 'B',
 'ACCGTAAGTGGTCTCG-1': 'CD14 Monocytes',
 'ACCGTAAGTTAAGACA-1': 'CD4 T',
 'ACCGTAATCACATACG-1': 'Unknown',
 'ACCTTTAAGCAGCGTA-1': 'CD14 Monocytes',
 'ACCTTTAAGTATTGGA-1': 'CD4 T',
 'ACCTTTACAATTGCTG-1': 'Naive T cell',
 'ACCTTTACACATTAGC-1': 'B',
 'ACCTTTACACCGTTGG-1': 'B',
 'ACCTTTACATGGTCAT-1': 'NK2',
 'ACCTTTACATGTTGAC-1': 'CD4 T',
 'ACCTTTATCTCTTATG-1': 'Naive T cell',
 'ACGAGCCAGAGTACCG-1': 'B',
 'ACGAGCCAGGGTCTCC-1': 'NK1',
 'ACGAGCCAGTCGATAA-1': 'NK1',
 'ACGAGCCCAGGGTTAG-1': 'CD4 T',
 'ACGAGCCCATGAACCT-1': 'CD8 T',
 'ACGAGCCCATTGAGCT-1': 'CD8 T',
 'ACGAGCCGTTGGTAAA-1': 'CD4 T',
 'ACGAGCCTCCTCAACC-1': 'B',
 'ACGAGCCTCGGAGCAA-1': 'CD4 T',
 'ACGAGCCTCTCTGAGA-1': 'CD14 Monocytes',
 'ACGAGCCTCTGTTTGT-1': 'Naive T cell',
 'ACGAGGAAGAATGTTG-1': 'FCGR3A Monocytes',
 'ACGAGGAAGCAATATG-1': 'CD8 T',
 'ACGAGGAAGCACACAG-1': 'CD14 Monocytes',
 'ACGAGGAAGCTAACTC-1': 'NK1',
 'ACGAGGAAGTGGTAGC-1': 'B',
 'ACGAGGACAAAGGAAG-1': 'NK2',
 'ACGAGGACATCAGTCA-1': 'B',
 'ACGAGGACATTCGACA-1': 'Naive T cell',
 'ACGAGGATCTACTATC-1': 'NK1',
 'ACGATACAGAATGTTG-1': 'CD14 Monocytes',
 'ACGATACAGCCCAGCT-1': 'Dendritic',
 'ACGATACAGGAACTGC-1': 'CD14 Monocytes',
 'ACGATACCAATTCCTT-1': 'B',
 'ACGATACCAGATGAGC-1': 'CD4 T',
 'ACGATACCATGCCTAA-1': 'NK1',
 'ACGATACGTCGGCATC-1': 'CD8 T',
 'ACGATACGTTAAGAAC-1': 'CD4 T',
 'ACGATACTCGTAGGTT-1': 'NK2',
 'ACGATGTAGAGACTAT-1': 'CD4 T',
 'ACGATGTAGATCTGCT-1': 'NK1',
 'ACGATGTAGTCCAGGA-1': 'CD4 T',
 'ACGATGTGTCATCGGC-1': 'CD4 T',
 'ACGCAGCAGTGACATA-1': 'CD8 T',
 'ACGCAGCCACCTTGTC-1': 'CD14 Monocytes',
 'ACGCAGCCATAAAGGT-1': 'Dendritic',
 'ACGCAGCGTGTGGCTC-1': 'CD4 T',
 'ACGCAGCGTTTACTCT-1': 'B',
 'ACGCAGCTCTACTATC-1': 'CD8 T',
 'ACGCAGCTCTCGAGTA-1': 'Naive T cell',
 'ACGCCAGAGAATAGGG-1': 'B',
 'ACGCCAGAGAGACGAA-1': 'NK1',
 'ACGCCAGAGGGTATCG-1': 'Dendritic',
 'ACGCCAGAGTACGTAA-1': 'NK1',
 'ACGCCAGGTACTCTCC-1': 'B',
 'ACGCCAGGTGAACCTT-1': 'B',
 'ACGCCAGTCAAGCCTA-1': 'Unknown',
 'ACGCCGAAGATTACCC-1': 'Unknown',
 'ACGCCGAAGTCGTACT-1': 'Naive T cell',
 'ACGCCGAGTCATATCG-1': 'NK1',
 'ACGCCGAGTGCTAGCC-1': 'NK1',
 'ACGCCGATCAGGATCT-1': 'CD14 Monocytes',
 'ACGCCGATCTTAACCT-1': 'Unknown',
 'ACGGAGAGTCACAAGG-1': 'CD4 T',
 'ACGGAGAGTCGAGTTT-1': 'CD14 Monocytes',
 'ACGGAGATCGTGACAT-1': 'NK2',
 'ACGGCCAAGAGATGAG-1': 'CD14 Monocytes',
 'ACGGCCAAGCGACGTA-1': 'B',
 'ACGGCCAAGCGATGAC-1': 'NK2',
 'ACGGCCAAGGAGTAGA-1': 'Naive T cell',
 'ACGGCCAAGTACGTAA-1': 'CD14 Monocytes',
 'ACGGCCAAGTGACTCT-1': 'CD14 Monocytes',
 'ACGGCCACAGGGTTAG-1': 'CD4 T',
 'ACGGCCAGTCTCATCC-1': 'Naive T cell',
 'ACGGCCAGTTAAAGAC-1': 'Naive T cell',
 'ACGGCCATCCTTTACA-1': 'CD14 Monocytes',
 'ACGGCCATCTTGTACT-1': 'CD4 T',
 'ACGGGCTAGTTCGATC-1': 'CD14 Monocytes',
 'ACGGGCTCACGGTTTA-1': 'Naive T cell',
 'ACGGGCTCATGATCCA-1': 'CD14 Monocytes',
 'ACGGGCTGTAAACACA-1': 'Naive T cell',
 'ACGGGCTGTACAGACG-1': 'CD14 Monocytes',
 'ACGGGCTGTCATACTG-1': 'CD8 T',
 'ACGGGCTGTCTTCTCG-1': 'CD14 Monocytes',
 'ACGGGCTGTGTTGAGG-1': 'B',
 'ACGGGCTGTTATGCGT-1': 'CD4 T',
 'ACGGGCTGTTGACGTT-1': 'Naive T cell',
 'ACGGGCTTCCACGTGG-1': 'CD14 Monocytes',
 'ACGGGTCAGCTACCTA-1': 'B',
 'ACGGGTCAGTCGATAA-1': 'CD8 T',
 'ACGGGTCCAGGGAGAG-1': 'NK1',
 'ACGGGTCGTATCACCA-1': 'CD4 T',
 'ACGGGTCGTCGACTGC-1': 'CD14 Monocytes',
 'ACGGGTCGTGCACCAC-1': 'CD4 T',
 'ACGGGTCGTGCTCTTC-1': 'CD4 T',
 'ACGGGTCGTGGTCCGT-1': 'CD14 Monocytes',
 'ACGGGTCGTTACAGAA-1': 'CD4 T',
 'ACGGGTCTCACTCCTG-1': 'CD8 T',
 'ACGGGTCTCCGCTGTT-1': 'B',
 'ACGGGTCTCGCTGATA-1': 'CD4 T',
 'ACGTCAAAGACGCTTT-1': 'Naive T cell',
 'ACGTCAAAGACTTGAA-1': 'Naive T cell',
 'ACGTCAAAGATCACGG-1': 'NK1',
 'ACGTCAAAGGCCCTCA-1': 'CD14 Monocytes',
 'ACGTCAAAGGGTCTCC-1': 'CD14 Monocytes',
 'ACGTCAAGTAAGAGAG-1': 'NK1',
 'ACGTCAAGTATGGTTC-1': 'CD8 T',
 'ACGTCAAGTCTAGGTT-1': 'CD14 Monocytes',
 'ACGTCAAGTTCAGCGC-1': 'CD4 T',
 'ACGTCAATCTGGCGAC-1': 'Naive T cell',
 'ACTATCTAGTCTCAAC-1': 'B',
 'ACTATCTAGTTATCGC-1': 'B',
 'ACTATCTGTATATCCG-1': 'CD14 Monocytes',
 'ACTATCTGTCGCATAT-1': 'CD14 Monocytes',
 'ACTATCTGTTAAGAAC-1': 'CD4 T',
 'ACTATCTGTTATCACG-1': 'CD14 Monocytes',
 'ACTATCTTCACAGGCC-1': 'CD14 Monocytes',
 'ACTATCTTCCGTACAA-1': 'CD14 Monocytes',
 'ACTATCTTCGCCATAA-1': 'CD4 T',
 'ACTATCTTCTACGAGT-1': 'CD4 T',
 'ACTGAACAGCGTTGCC-1': 'CD4 T',
 'ACTGAACCAAGCCGTC-1': 'CD14 Monocytes',
 'ACTGAACCAGCCAATT-1': 'B',
 'ACTGAACCAGGCAGTA-1': 'CD4 T',
 'ACTGAACGTAGCTTGT-1': 'CD14 Monocytes',
 'ACTGAACGTTTGTGTG-1': 'B',
 'ACTGAACTCGTTTGCC-1': 'CD8 T',
 'ACTGAACTCTGGCGTG-1': 'FCGR3A Monocytes',
 'ACTGAACTCTTTAGTC-1': 'Dendritic',
 'ACTGAGTAGAATTCCC-1': 'NK1',
 'ACTGAGTAGTTGTAGA-1': 'CD4 T',
 'ACTGAGTCAACACGCC-1': 'CD8 T',
 'ACTGAGTCAGCGAACA-1': 'NK2',
 'ACTGAGTCATCCCATC-1': 'Dendritic',
 'ACTGAGTCATGATCCA-1': 'B',
 'ACTGAGTGTAGCTAAA-1': 'CD4 T',
 'ACTGAGTGTCATACTG-1': 'NK1',
 'ACTGAGTGTCCTAGCG-1': 'CD8 T',
 'ACTGAGTGTGACCAAG-1': 'CD14 Monocytes',
 'ACTGAGTGTGAGTGAC-1': 'B',
 'ACTGAGTGTTGATTGC-1': 'NK1',
 'ACTGAGTTCCGGGTGT-1': 'Naive T cell',
 'ACTGAGTTCGAATCCA-1': 'NK1',
 'ACTGAGTTCGCCAGCA-1': 'CD14 Monocytes',
 'ACTGATGAGGTAAACT-1': 'CD14 Monocytes',
 'ACTGATGCAAGAAAGG-1': 'CD14 Monocytes',
 'ACTGATGCAAGTTAAG-1': 'CD14 Monocytes',
 'ACTGATGCAATACGCT-1': 'CD4 T',
 'ACTGATGCACCTGGTG-1': 'CD4 T',
 'ACTGATGCACGTTGGC-1': 'B',
 'ACTGATGCAGTCCTTC-1': 'CD14 Monocytes',
 'ACTGATGCATGGGACA-1': 'CD4 T',
 'ACTGATGCATTCCTGC-1': 'CD14 Monocytes',
 'ACTGATGGTCAGCTAT-1': 'CD14 Monocytes',
 'ACTGATGTCAGTACGT-1': 'CD4 T',
 'ACTGATGTCGGTCTAA-1': 'Naive T cell',
 'ACTGATGTCTTGGGTA-1': 'NK2',
 'ACTGCTCAGCCAGTAG-1': 'B',
 'ACTGCTCAGGGTATCG-1': 'CD4 T',
 'ACTGCTCAGTATCGAA-1': 'CD4 T',
 'ACTGCTCCAATGAATG-1': 'NK2',
 'ACTGCTCCAGGGTACA-1': 'NK1',
 'ACTGCTCGTCTGATTG-1': 'CD14 Monocytes',
 'ACTGCTCTCCTCAATT-1': 'CD4 T',
 'ACTGCTCTCCTCATTA-1': 'NK1',
 'ACTGCTCTCGGCTACG-1': 'NK1',
 'ACTGCTCTCTTGTCAT-1': 'Naive T cell',
 'ACTGTCCAGCCACGTC-1': 'B',
 'ACTGTCCAGCTCCTTC-1': 'CD4 T',
 'ACTGTCCAGCTGAAAT-1': 'CD4 T',
 'ACTGTCCAGTGGGATC-1': 'CD4 T',
 'ACTGTCCCACGAGGTA-1': 'CD4 T',
 'ACTGTCCCAGCCTGTG-1': 'CD4 T',
 'ACTGTCCCATGTAGTC-1': 'NK1',
 'ACTGTCCCATGTCGAT-1': 'B',
 'ACTGTCCCATTCCTGC-1': 'Naive T cell',
 'ACTGTCCGTATGGTTC-1': 'CD14 Monocytes',
 'ACTGTCCGTTAAGTAG-1': 'Unknown',
 'ACTGTCCTCACAGGCC-1': 'CD4 T',
 'ACTTACTAGATCCTGT-1': 'B',
 'ACTTACTAGCTATGCT-1': 'NK1',
 'ACTTACTAGTCCCACG-1': 'CD4 T',
 'ACTTACTCAAGCCGCT-1': 'CD4 T',
 'ACTTACTCACTGCCAG-1': 'CD14 Monocytes',
 'ACTTACTCAGTACACT-1': 'CD8 T',
 'ACTTACTCATTTGCTT-1': 'B',
 'ACTTACTGTTGCGTTA-1': 'NK1',
 'ACTTACTTCAGTTCGA-1': 'NK1',
 'ACTTGTTAGGGTGTGT-1': 'NK1',
 'ACTTGTTAGTGTCCAT-1': 'CD4 T',
 'ACTTGTTCAACTGGCC-1': 'B',
 'ACTTGTTCACCGGAAA-1': 'CD14 Monocytes',
 'ACTTGTTGTCCGAGTC-1': 'CD4 T',
 'ACTTGTTGTGACCAAG-1': 'CD8 T',
 'ACTTGTTGTGACGCCT-1': 'CD8 T',
 'ACTTGTTGTGTCCTCT-1': 'CD8 T',
 'ACTTGTTTCATGTCTT-1': 'Naive T cell',
 'ACTTGTTTCCCGACTT-1': 'CD14 Monocytes',
 'ACTTGTTTCGAACTGT-1': 'CD14 Monocytes',
 'ACTTGTTTCGGCTACG-1': 'CD8 T',
 'ACTTGTTTCTGGAGCC-1': 'CD8 T',
 'ACTTGTTTCTTAGAGC-1': 'CD8 T',
 'ACTTTCAAGCTAGTCT-1': 'CD8 T',
 'ACTTTCAAGTGCGATG-1': 'Naive T cell',
 'ACTTTCACAAGCTGAG-1': 'B',
 'ACTTTCACACATGTGT-1': 'NK1',
 'ACTTTCACATCCCACT-1': 'B',
 'ACTTTCAGTCGCTTCT-1': 'B',
 'ACTTTCATCACAACGT-1': 'CD14 Monocytes',
 'ACTTTCATCCACGACG-1': 'CD4 T',
 'ACTTTCATCTCGTTTA-1': 'Naive T cell',
 'ACTTTCATCTGCGACG-1': 'CD14 Monocytes',
 'AGAATAGAGACAAGCC-1': 'B',
 'AGAATAGAGACCTTTG-1': 'Naive T cell',
 'AGAATAGCAAGCGAGT-1': 'CD14 Monocytes',
 'AGAATAGCAGTGGGAT-1': 'CD4 T',
 'AGAATAGCATTATCTC-1': 'FCGR3A Monocytes',
 'AGAATAGGTGACCAAG-1': 'CD4 T',
 'AGAATAGGTGGTCTCG-1': 'NK2',
 'AGAATAGTCCAGATCA-1': 'FCGR3A Monocytes',
 'AGAATAGTCCTATTCA-1': 'Dendritic',
 'AGAATAGTCTGGCGAC-1': 'B',
 'AGACGTTAGACCACGA-1': 'FCGR3A Monocytes',
 'AGACGTTCATAGAAAC-1': 'NK1',
 'AGACGTTGTGAAATCA-1': 'CD4 T',
 'AGACGTTGTGACCAAG-1': 'CD8 T',
 'AGACGTTGTTCCTCCA-1': 'NK1',
 'AGACGTTTCACGCATA-1': 'CD4 T',
 'AGACGTTTCTGGTATG-1': 'CD4 T',
 'AGAGCGAAGTGTTAGA-1': 'CD4 T',
 'AGAGCGACACAGACTT-1': 'NK2',
 'AGAGCGACACGTCTCT-1': 'CD14 Monocytes',
 'AGAGCGACACTTAACG-1': 'FCGR3A Monocytes',
 'AGAGCGAGTTCCACTC-1': 'CD8 T',
 'AGAGCGATCCGCAGTG-1': 'NK1',
 'AGAGCTTAGACTGGGT-1': 'B',
 'AGAGCTTAGAGCCCAA-1': 'CD14 Monocytes',
 'AGAGCTTCACATGACT-1': 'NK1',
 'AGAGCTTGTCAGAGGT-1': 'Naive T cell',
 'AGAGCTTGTCCGAATT-1': 'Naive T cell',
 'AGAGCTTGTCGCTTCT-1': 'NK2',
 'AGAGCTTGTCTGATTG-1': 'CD8 T',
 'AGAGCTTGTTGAGTTC-1': 'B',
 'AGAGCTTTCAGTCAGT-1': 'CD4 T',
 'AGAGTGGAGAGGGCTT-1': 'CD14 Monocytes',
 'AGAGTGGAGTACGATA-1': 'CD14 Monocytes',
 'AGAGTGGTCGGTCCGA-1': 'CD14 Monocytes',
 'AGATCTGAGAGATGAG-1': 'CD14 Monocytes',
 'AGATCTGAGGATCGCA-1': 'CD4 T',
 'AGATCTGAGTCGAGTG-1': 'Naive T cell',
 'AGATCTGCAAAGCGGT-1': 'CD4 T',
 'AGATCTGCACCAACCG-1': 'NK1',
 'AGATCTGCAGGACGTA-1': 'CD14 Monocytes',
 'AGATCTGCATCAGTAC-1': 'CD4 T',
 'AGATCTGGTATGAATG-1': 'B',
 'AGATCTGGTCCTGCTT-1': 'CD14 Monocytes',
 'AGATCTGGTGTTCGAT-1': 'CD14 Monocytes',
 'AGATCTGGTTATCCGA-1': 'NK1',
 'AGATCTGTCTGCGTAA-1': 'Dendritic',
 'AGATTGCAGCAATCTC-1': 'CD4 T',
 'AGATTGCCATCGTCGG-1': 'B',
 'AGATTGCGTCAACATC-1': 'CD4 T',
 'AGATTGCGTTAAGAAC-1': 'NK1',
 'AGATTGCTCAACGCTA-1': 'CD4 T',
 'AGCAGCCAGATGTCGG-1': 'Dendritic',
 'AGCAGCCAGCATGGCA-1': 'Dendritic',
 'AGCAGCCCAGACACTT-1': 'NK2',
 'AGCAGCCCAGATAATG-1': 'CD8 T',
 'AGCAGCCGTTCAGTAC-1': 'B',
 'AGCAGCCTCAATCACG-1': 'CD4 T',
 'AGCAGCCTCAGAGCTT-1': 'CD4 T',
 'AGCAGCCTCGAATCCA-1': 'FCGR3A Monocytes',
 'AGCAGCCTCTGTACGA-1': 'CD8 T',
 'AGCATACAGAAGAAGC-1': 'CD4 T',
 'AGCATACCAAGCCATT-1': 'Naive T cell',
 'AGCATACCACAGGAGT-1': 'CD14 Monocytes',
 'AGCATACTCCTCTAGC-1': 'CD14 Monocytes',
 'AGCATACTCGGATGTT-1': 'Naive T cell',
 'AGCATACTCTTGCAAG-1': 'Dendritic',
 'AGCCTAAAGGGAGTAA-1': 'CD4 T',
 'AGCCTAACAAGGACAC-1': 'CD4 T',
 'AGCCTAACAATGTAAG-1': 'CD4 T',
 'AGCCTAACATCCTTGC-1': 'CD14 Monocytes',
 'AGCCTAAGTAGAAGGA-1': 'CD8 T',
 'AGCCTAAGTAGCGCAA-1': 'NK1',
 'AGCCTAAGTCCGAACC-1': 'Naive T cell',
 'AGCCTAAGTTTGGGCC-1': 'CD14 Monocytes',
 'AGCCTAATCATTCACT-1': 'Dendritic',
 'AGCCTAATCTCGATGA-1': 'NK1',
 'AGCGGTCAGAGTAATC-1': 'CD4 T',
 'AGCGGTCAGATCACGG-1': 'CD4 T',
 'AGCGGTCAGATCGGGT-1': 'CD4 T',
 'AGCGGTCAGCTCCCAG-1': 'CD14 Monocytes',
 'AGCGGTCAGTCCGGTC-1': 'CD14 Monocytes',
 'AGCGGTCCAATGTTGC-1': 'CD14 Monocytes',
 'AGCGGTCCATGTTGAC-1': 'CD8 T',
 'AGCGGTCTCCCTTGCA-1': 'B',
 'AGCGGTCTCCGAATGT-1': 'CD14 Monocytes',
 'AGCGTATAGATACACA-1': 'FCGR3A Monocytes',
 'AGCGTATCACGAGAGT-1': 'CD14 Monocytes',
 'AGCGTATCAGGCTGAA-1': 'CD4 T',
 'AGCGTATGTGTAACGG-1': 'CD8 T',
 'AGCGTATTCTACTTAC-1': 'CD14 Monocytes',
 'AGCGTCGAGACCGGAT-1': 'NK2',
 'AGCGTCGAGATCCCAT-1': 'CD14 Monocytes',
 'AGCGTCGAGTGAAGTT-1': 'CD4 T',
 'AGCGTCGCACGGTGTC-1': 'NK1',
 'AGCGTCGCACTCGACG-1': 'Naive T cell',
 'AGCGTCGCACTGTCGG-1': 'CD14 Monocytes',
 'AGCGTCGCAGCAGTTT-1': 'Naive T cell',
 'AGCGTCGCAGTCAGAG-1': 'B',
 'AGCGTCGGTGCGAAAC-1': 'Naive T cell',
 'AGCGTCGGTGTCAATC-1': 'CD4 T',
 'AGCGTCGGTGTTCGAT-1': 'Dendritic',
 'AGCGTCGGTTTCCACC-1': 'CD14 Monocytes',
 'AGCTCCTAGATGCCTT-1': 'CD4 T',
 'AGCTCCTAGCTAGTTC-1': 'CD8 T',
 'AGCTCCTCACATGACT-1': 'NK1',
 'AGCTCCTGTATAAACG-1': 'B',
 'AGCTCCTTCTGCGGCA-1': 'Naive T cell',
 'AGCTCCTTCTTACCGC-1': 'Naive T cell',
 'AGCTCTCAGAGTCTGG-1': 'CD4 T',
 'AGCTCTCAGATGCCAG-1': 'CD4 T',
 'AGCTCTCAGCACCGCT-1': 'CD14 Monocytes',
 'AGCTCTCAGGCTCATT-1': 'B',
 'AGCTCTCAGTGACTCT-1': 'B',
 'AGCTCTCAGTTACGGG-1': 'NK1',
 'AGCTCTCCAATGTAAG-1': 'CD4 T',
 'AGCTCTCCAGCTTCGG-1': 'CD14 Monocytes',
 'AGCTCTCCATGGTCTA-1': 'CD14 Monocytes',
 'AGCTCTCGTTGATTGC-1': 'Unknown',
 'AGCTCTCTCACATACG-1': 'Naive T cell',
 'AGCTCTCTCACCGTAA-1': 'CD8 T',
 'AGCTCTCTCCAGAGGA-1': 'NK1',
 'AGCTCTCTCTTCAACT-1': 'CD8 T',
 'AGCTTGAAGCAATATG-1': 'CD14 Monocytes',
 'AGCTTGAAGCCCAACC-1': 'NK2',
 'AGCTTGAAGGTGCTTT-1': 'CD8 T',
 'AGCTTGACAGGGTTAG-1': 'CD4 T',
 'AGCTTGAGTAGGGTAC-1': 'CD14 Monocytes',
 'AGCTTGAGTGACAAAT-1': 'CD8 T',
 'AGCTTGAGTTAAGAAC-1': 'CD4 T',
 'AGCTTGATCACTTATC-1': 'CD4 T',
 'AGCTTGATCTTTAGTC-1': 'CD4 T',
 'AGGCCACAGCTTATCG-1': 'CD4 T',
 'AGGCCACAGGGTGTTG-1': 'CD14 Monocytes',
 'AGGCCACAGTGTCCCG-1': 'NK1',
 'AGGCCACCACCGAATT-1': 'Naive T cell',
 'AGGCCACGTAGAGCTG-1': 'CD4 T',
 'AGGCCACGTGATAAGT-1': 'CD14 Monocytes',
 'AGGCCACGTTCGTCTC-1': 'B',
 'AGGCCACGTTTACTCT-1': 'CD4 T',
 'AGGCCGTAGAACTGTA-1': 'NK1',
 'AGGCCGTAGACTACAA-1': 'CD8 T',
 'AGGCCGTAGCTCCTTC-1': 'CD14 Monocytes',
 'AGGCCGTAGCTGGAAC-1': 'NK1',
 'AGGCCGTAGGAATTAC-1': 'CD14 Monocytes',
 'AGGCCGTCACATTAGC-1': 'NK1',
 'AGGCCGTCAGCGTAAG-1': 'CD4 T',
 'AGGCCGTGTACCATCA-1': 'CD4 T',
 'AGGCCGTGTCTTGCGG-1': 'NK1',
 'AGGCCGTTCGGCATCG-1': 'NK1',
 'AGGGAGTCACATGACT-1': 'CD4 T',
 'AGGGAGTCACGGTGTC-1': 'B',
 'AGGGAGTGTCTTCTCG-1': 'CD14 Monocytes',
 'AGGGAGTTCAAGCCTA-1': 'NK1',
 'AGGGATGCAAGTAGTA-1': 'CD14 Monocytes',
 'AGGGATGCAGCATACT-1': 'CD8 T',
 'AGGGATGGTTTGACAC-1': 'B',
 'AGGGATGGTTTGCATG-1': 'CD4 T',
 'AGGGTGAGTGTATGGG-1': 'B',
 'AGGGTGATCCCAACGG-1': 'Naive T cell',
 'AGGGTGATCGCCTGTT-1': 'CD8 T',
 'AGGTCATAGCTAAGAT-1': 'NK1',
 'AGGTCATCAATGGAAT-1': 'CD4 T',
 'AGGTCATCACACCGAC-1': 'CD4 T',
 'AGGTCATCAGCTTCGG-1': 'CD14 Monocytes',
 'AGGTCATCATTGGCGC-1': 'B',
 'AGGTCATGTTACCGAT-1': 'CD4 T',
 'AGGTCATGTTCCACGG-1': 'CD8 T',
 'AGGTCATGTTCGCTAA-1': 'NK1',
 'AGGTCATTCACTTACT-1': 'B',
 'AGGTCATTCCGCGGTA-1': 'Naive T cell',
 'AGGTCATTCCTTAATC-1': 'Dendritic',
 'AGGTCCGAGAATTGTG-1': 'CD14 Monocytes',
 'AGGTCCGAGAGGTTGC-1': 'CD14 Monocytes',
 'AGGTCCGGTGAGGCTA-1': 'B',
 'AGGTCCGTCCAAGTAC-1': 'NK2',
 'AGGTCCGTCCTAAGTG-1': 'B',
 'AGGTCCGTCGGACAAG-1': 'CD14 Monocytes',
 'AGTAGTCAGGTACTCT-1': 'FCGR3A Monocytes',
 'AGTAGTCAGTCCGTAT-1': 'CD8 T',
 'AGTAGTCCACCGAAAG-1': 'CD8 T',
 'AGTAGTCCATGGTTGT-1': 'CD14 Monocytes',
 'AGTAGTCGTAACGACG-1': 'Naive T cell',
 'AGTAGTCGTAGAGCTG-1': 'Naive T cell',
 'AGTAGTCGTTCCCGAG-1': 'CD4 T',
 'AGTAGTCGTTGATTCG-1': 'CD4 T',
 'AGTAGTCTCACCTCGT-1': 'CD4 T',
 'AGTAGTCTCAGGCCCA-1': 'NK1',
 'AGTAGTCTCATTTGGG-1': 'NK2',
 'AGTAGTCTCCGGCACA-1': 'CD14 Monocytes',
 'AGTAGTCTCGTAGATC-1': 'CD4 T',
 'AGTAGTCTCTGCGTAA-1': 'NK1',
 'AGTCTTTAGCACAGGT-1': 'Naive T cell',
 'AGTCTTTCAAAGTCAA-1': 'NK2',
 'AGTCTTTCAAGCTGGA-1': 'Naive T cell',
 'AGTCTTTCAGCCAGAA-1': 'CD8 T',
 'AGTCTTTCAGCTCGAC-1': 'NK1',
 'AGTCTTTCATGTTGAC-1': 'Naive T cell',
 'AGTCTTTCATTGGCGC-1': 'Naive T cell',
 'AGTCTTTGTCCATGAT-1': 'CD14 Monocytes',
 'AGTCTTTGTCGCTTCT-1': 'NK1',
 'AGTCTTTGTGTTTGGT-1': 'B',
 'AGTCTTTTCCCAACGG-1': 'Naive T cell',
 'AGTCTTTTCGGTCTAA-1': 'CD4 T',
 'AGTGAGGAGCCACTAT-1': 'CD14 Monocytes',
 'AGTGAGGCAAGTCTAC-1': 'CD8 T',
 'AGTGAGGCATAGACTC-1': 'CD4 T',
 'AGTGAGGCATGGTCTA-1': 'NK2',
 'AGTGAGGGTGTCTGAT-1': 'NK2',
 'AGTGAGGGTTCCACAA-1': 'CD14 Monocytes',
 'AGTGAGGGTTCGGGCT-1': 'CD8 T',
 'AGTGAGGGTTTAAGCC-1': 'CD4 T',
 'AGTGAGGTCTCGGACG-1': 'CD14 Monocytes',
 'AGTGAGGTCTGTACGA-1': 'B',
 'AGTGAGGTCTTTACGT-1': 'B',
 'AGTGGGAAGACCCACC-1': 'CD14 Monocytes',
 'AGTGGGACACTTCTGC-1': 'Naive T cell',
 'AGTGGGACAGCTTCGG-1': 'CD8 T',
 'AGTGGGAGTACAGACG-1': 'NK1',
 'AGTGGGAGTAGCGTGA-1': 'CD14 Monocytes',
 'AGTGGGAGTAGCTAAA-1': 'NK1',
 'AGTGGGAGTCAGAAGC-1': 'B',
 'AGTGGGAGTGCCTGTG-1': 'Naive T cell',
 'AGTGGGAGTGCCTTGG-1': 'CD8 T',
 'AGTGGGATCACTTACT-1': 'CD8 T',
 'AGTGGGATCATCACCC-1': 'NK1',
 'AGTGGGATCTTAACCT-1': 'Dendritic',
 'AGTGTCAAGGGTGTTG-1': 'Unknown',
 'AGTGTCACAGACGTAG-1': 'CD14 Monocytes',
 'AGTGTCACAGCTGCTG-1': 'CD14 Monocytes',
 'AGTGTCAGTGCAGTAG-1': 'CD4 T',
 'AGTGTCAGTTACGACT-1': 'CD8 T',
 'AGTGTCATCATAACCG-1': 'NK1',
 'AGTGTCATCCGCGGTA-1': 'NK1',
 'AGTTGGTAGGACATTA-1': 'CD4 T',
 'AGTTGGTGTACCGGCT-1': 'Unknown',
 'AGTTGGTGTCTAAACC-1': 'NK2',
 'AGTTGGTGTGCCTTGG-1': 'Naive T cell',
 'AGTTGGTGTGGTACAG-1': 'Naive T cell',
 'AGTTGGTTCCTAGAAC-1': 'CD4 T',
 'AGTTGGTTCGCGATCG-1': 'FCGR3A Monocytes',
 'AGTTGGTTCTGCGGCA-1': 'B',
 'ATAACGCAGGTGACCA-1': 'NK1',
 'ATAACGCAGTGTCCAT-1': 'FCGR3A Monocytes',
 'ATAACGCCACAGGCCT-1': 'CD14 Monocytes',
 'ATAACGCGTAAGTTCC-1': 'CD14 Monocytes',
 'ATAACGCGTCAGATAA-1': 'CD14 Monocytes',
 'ATAACGCGTTCGAATC-1': 'CD14 Monocytes',
 'ATAACGCTCCTGTAGA-1': 'CD14 Monocytes',
 'ATAACGCTCTACGAGT-1': 'CD8 T',
 'ATAACGCTCTTGTTTG-1': 'Naive T cell',
 'ATAAGAGAGGTGCACA-1': 'NK2',
 'ATAAGAGCACCCAGTG-1': 'NK1',
 'ATAAGAGCATCGGGTC-1': 'Naive T cell',
 'ATAAGAGGTTCACCTC-1': 'Dendritic',
 'ATAAGAGTCCGATATG-1': 'B',
 'ATAAGAGTCGAATGCT-1': 'Naive T cell',
 'ATAAGAGTCGCCATAA-1': 'Naive T cell',
 'ATAAGAGTCTGCAGTA-1': 'B',
 'ATAGACCAGTACGCGA-1': 'Naive T cell',
 'ATAGACCAGTGGAGAA-1': 'NK1',
 'ATAGACCCAAAGAATC-1': 'B',
 'ATAGACCCAGCTGTTA-1': 'CD4 T',
 'ATAGACCGTAAGGGAA-1': 'CD4 T',
 'ATAGACCGTCAGGACA-1': 'CD8 T',
 'ATAGACCGTGCACCAC-1': 'CD14 Monocytes',
 'ATAGACCGTGTCAATC-1': 'NK1',
 'ATAGACCGTGTGACGA-1': 'CD14 Monocytes',
 'ATAGACCTCAATCACG-1': 'CD8 T',
 'ATAGACCTCTCACATT-1': 'NK1',
 'ATAGACCTCTCGATGA-1': 'FCGR3A Monocytes',
 'ATCACGAAGAAGCCCA-1': 'NK1',
 'ATCACGACAACACCTA-1': 'B',
 'ATCACGACATCCTTGC-1': 'NK1',
 'ATCACGACATGCAATC-1': 'CD14 Monocytes',
 'ATCACGAGTCTGGTCG-1': 'B',
 'ATCACGATCACCACCT-1': 'B',
 'ATCACGATCATCGCTC-1': 'B',
 'ATCACGATCATGTAGC-1': 'CD14 Monocytes',
 'ATCACGATCCAAGCCG-1': 'CD4 T',
 'ATCATCTAGACAAGCC-1': 'CD4 T',
 'ATCATCTAGCGCCTCA-1': 'NK1',
 'ATCATCTAGCTCCTCT-1': 'CD4 T',
 'ATCATCTAGGATGGTC-1': 'CD14 Monocytes',
 'ATCATCTAGGCACATG-1': 'CD14 Monocytes',
 'ATCATCTCAGTAAGCG-1': 'B',
 'ATCATCTCAGTACACT-1': 'Naive T cell',
 'ATCATCTCATCCAACA-1': 'FCGR3A Monocytes',
 'ATCATCTGTGACTACT-1': 'NK1',
 'ATCATCTTCGTCCAGG-1': 'CD14 Monocytes',
 'ATCATGGAGCTGCGAA-1': 'NK1',
 'ATCATGGAGGCAATTA-1': 'NK1',
 'ATCATGGAGGCCCTCA-1': 'NK1',
 'ATCATGGAGTTGAGAT-1': 'NK1',
 'ATCATGGCAAATCCGT-1': 'CD4 T',
 'ATCATGGCAATGAATG-1': 'NK1',
 'ATCATGGCAGCCAATT-1': 'Dendritic',
 'ATCATGGCAGGGTACA-1': 'NK1',
 'ATCATGGGTAACGTTC-1': 'B',
 'ATCATGGGTCAATACC-1': 'NK1',
 'ATCCACCAGAGCTGGT-1': 'CD4 T',
 'ATCCACCAGCAGACTG-1': 'CD14 Monocytes',
 'ATCCACCAGCGATCCC-1': 'NK1',
 'ATCCACCAGCGTTTAC-1': 'CD4 T',
 'ATCCACCAGGAGCGTT-1': 'B',
 'ATCCACCAGTCACGCC-1': 'CD8 T',
 'ATCCACCCAACGCACC-1': 'CD14 Monocytes',
 'ATCCACCCAACTGGCC-1': 'B',
 'ATCCACCCAAGCGCTC-1': 'B',
 'ATCCACCCAAGTCTGT-1': 'CD4 T',
 'ATCCACCCACAGTCGC-1': 'CD4 T',
 'ATCCACCCAGGGTTAG-1': 'B',
 'ATCCACCCATTACGAC-1': 'CD14 Monocytes',
 'ATCCACCGTATCAGTC-1': 'CD4 T',
 'ATCCACCTCTAACGGT-1': 'CD4 T',
 'ATCCGAAAGTTTAGGA-1': 'CD4 T',
 'ATCCGAACACACTGCG-1': 'NK1',
 'ATCCGAACAGTTCCCT-1': 'CD8 T',
 'ATCCGAAGTAACGCGA-1': 'Naive T cell',
 'ATCCGAAGTACTCAAC-1': 'CD8 T',
 'ATCCGAAGTCTTGATG-1': 'NK1',
 'ATCCGAAGTGACTCAT-1': 'CD8 T',
 'ATCCGAATCCTTGCCA-1': 'CD4 T',
 'ATCCGAATCGGAATCT-1': 'CD8 T',
 'ATCGAGTAGCCAGAAC-1': 'CD8 T',
 'ATCGAGTAGCCCAATT-1': 'NK1',
 'ATCGAGTAGCTCTCGG-1': 'CD14 Monocytes',
 'ATCGAGTCAAAGGTGC-1': 'CD14 Monocytes',
 'ATCGAGTCAACAACCT-1': 'FCGR3A Monocytes',
 'ATCGAGTCAGGGTATG-1': 'CD14 Monocytes',
 'ATCGAGTCAGGTCCAC-1': 'B',
 'ATCGAGTCATCACAAC-1': 'CD4 T',
 'ATCGAGTCATCCTAGA-1': 'CD14 Monocytes',
 'ATCGAGTGTACAGCAG-1': 'B',
 'ATCGAGTGTCGTGGCT-1': 'B',
 'ATCGAGTGTGATGTGG-1': 'CD14 Monocytes',
 'ATCGAGTGTGCAGTAG-1': 'NK1',
 'ATCGAGTGTTCAGTAC-1': 'CD14 Monocytes',
 'ATCGAGTGTTTAAGCC-1': 'CD14 Monocytes',
 'ATCGAGTTCATTTGGG-1': 'CD8 T',
 'ATCGAGTTCCGCATAA-1': 'CD14 Monocytes',
 'ATCGAGTTCGCAAGCC-1': 'B',
 'ATCTACTAGAGCTTCT-1': 'CD14 Monocytes',
 'ATCTACTCAAACGCGA-1': 'B',
 'ATCTACTCATGTCTCC-1': 'CD14 Monocytes',
 'ATCTACTGTTGGTAAA-1': 'CD4 T',
 'ATCTACTGTTTAGCTG-1': 'NK1',
 'ATCTACTTCCCATTTA-1': 'CD8 T',
 'ATCTACTTCGTTTATC-1': 'CD8 T',
 'ATCTACTTCTATCCCG-1': 'NK1',
 'ATCTACTTCTGAGTGT-1': 'CD14 Monocytes',
 'ATCTACTTCTGGTTCC-1': 'NK2',
 'ATCTGCCAGACAAAGG-1': 'Naive T cell',
 'ATCTGCCAGTGTACCT-1': 'CD14 Monocytes',
 'ATCTGCCAGTTGTAGA-1': 'CD4 T',
 'ATCTGCCCACCAGATT-1': 'NK1',
 'ATCTGCCCACCAGTTA-1': 'CD4 T',
 'ATCTGCCGTACTTGAC-1': 'B',
 'ATCTGCCTCAACACCA-1': 'CD4 T',
 'ATGAGGGAGAAGGTGA-1': 'FCGR3A Monocytes',
 'ATGAGGGAGAGTCGGT-1': 'CD14 Monocytes',
 'ATGAGGGAGCGTGTCC-1': 'CD14 Monocytes',
 'ATGAGGGCAGACGTAG-1': 'NK2',
 'ATGAGGGCATATGGTC-1': 'CD4 T',
 'ATGAGGGCATGCAACT-1': 'CD4 T',
 'ATGAGGGTCAAACCGT-1': 'CD4 T',
 'ATGAGGGTCCCAAGTA-1': 'Naive T cell',
 'ATGAGGGTCGTCGTTC-1': 'B',
 'ATGCGATAGCTAGTGG-1': 'NK2',
 'ATGCGATAGGCGATAC-1': 'CD4 T',
 'ATGCGATAGGCTAGCA-1': 'CD4 T',
 'ATGCGATGTACATGTC-1': 'NK2',
 'ATGCGATTCAGGATCT-1': 'B',
 'ATGGGAGAGAGTCGGT-1': 'NK1',
 'ATGGGAGAGCACACAG-1': 'CD4 T',
 'ATGGGAGAGTCCCACG-1': 'CD14 Monocytes',
 'ATGGGAGCAGCTCGCA-1': 'Dendritic',
 'ATGGGAGCAGTACACT-1': 'CD4 T',
 'ATGGGAGCATCAGTCA-1': 'B',
 'ATGGGAGGTAAACCTC-1': 'NK1',
 'ATGGGAGGTAGCGTGA-1': 'NK1',
 'ATGGGAGGTCTTGATG-1': 'Naive T cell',
 'ATGGGAGTCAACTCTT-1': 'B',
 'ATGGGAGTCGACAGCC-1': 'CD8 T',
 'ATGTGTGAGAGGGATA-1': 'CD4 T',
 'ATGTGTGAGAGTCTGG-1': 'B',
 'ATGTGTGAGCACCGTC-1': 'Dendritic',
 'ATGTGTGAGCCCTAAT-1': 'CD8 T',
 'ATGTGTGAGGGCATGT-1': 'FCGR3A Monocytes',
 'ATGTGTGCATGCAACT-1': 'B',
 'ATGTGTGGTCCCTACT-1': 'CD14 Monocytes',
 'ATGTGTGGTCTTCAAG-1': 'CD8 T',
 'ATGTGTGGTGCACTTA-1': 'B',
 'ATGTGTGTCAGGTAAA-1': 'CD8 T',
 'ATGTGTGTCTCTGCTG-1': 'B',
 'ATTACTCAGAGGGATA-1': 'B',
 'ATTACTCAGCCACGCT-1': 'CD14 Monocytes',
 'ATTACTCAGTGATCGG-1': 'B',
 'ATTACTCCACAAGTAA-1': 'CD8 T',
 'ATTACTCGTACTCAAC-1': 'B',
 'ATTACTCGTGAACCTT-1': 'NK1',
 'ATTACTCTCGGCCGAT-1': 'CD14 Monocytes',
 'ATTATCCAGAGGGCTT-1': 'CD8 T',
 'ATTATCCAGGGTCTCC-1': 'B',
 'ATTATCCCAATAGCAA-1': 'CD4 T',
 'ATTATCCCAGTATAAG-1': 'CD8 T',
 'ATTATCCGTAAATGAC-1': 'CD8 T',
 'ATTATCCGTCTCTCGT-1': 'NK1',
 'ATTATCCTCGGATGTT-1': 'NK1',
 'ATTCTACCAAGTAGTA-1': 'B',
 'ATTCTACGTCGAGTTT-1': 'NK1',
 'ATTCTACGTTCGCGAC-1': 'CD14 Monocytes',
 'ATTCTACTCATCGCTC-1': 'Naive T cell',
 'ATTGGACAGAGTACAT-1': 'Naive T cell',
 'ATTGGACAGCACCGTC-1': 'NK1',
 'ATTGGACAGCGTTCCG-1': 'B',
 'ATTGGACAGTACGTAA-1': 'CD14 Monocytes',
 'ATTGGACAGTGACATA-1': 'Naive T cell',
 'ATTGGACCAGAAGCAC-1': 'B',
 'ATTGGACCATCGGGTC-1': 'CD14 Monocytes',
 'ATTGGACGTCGACTAT-1': 'CD4 T',
 'ATTGGACGTCTGCGGT-1': 'NK1',
 'ATTGGACGTCTTCGTC-1': 'CD14 Monocytes',
 'ATTGGACGTTCACGGC-1': 'NK1',
 'ATTGGACTCCTCATTA-1': 'NK1',
 'ATTGGACTCCTGTACC-1': 'NK2',
 'ATTGGACTCGAATGCT-1': 'CD4 T',
 'ATTGGACTCTTCAACT-1': 'CD14 Monocytes',
 'ATTGGACTCTTGTACT-1': 'B',
 'ATTGGTGAGGAATTAC-1': 'CD8 T',
 'ATTGGTGCATCAGTAC-1': 'B',
 'ATTGGTGGTCCGTCAG-1': 'CD14 Monocytes',
 'ATTGGTGGTCGCTTTC-1': 'CD4 T',
 'ATTGGTGGTCTTGTCC-1': 'CD4 T',
 'ATTGGTGGTTACGACT-1': 'CD14 Monocytes',
 'ATTGGTGTCGTTGACA-1': 'CD14 Monocytes',
 'ATTTCTGAGTCATGCT-1': 'CD4 T',
 'ATTTCTGCAAGAGTCG-1': 'NK1',
 'ATTTCTGGTGACAAAT-1': 'CD4 T',
 'ATTTCTGTCACCTTAT-1': 'B',
 'ATTTCTGTCTGGGCCA-1': 'CD14 Monocytes',
 'CAACCAAAGCGTAGTG-1': 'CD14 Monocytes',
 'CAACCAAAGTTACGGG-1': 'NK2',
 'CAACCAACACAGACAG-1': 'FCGR3A Monocytes',
 'CAACCAACAGGGTTAG-1': 'CD14 Monocytes',
 'CAACCAACATTCACTT-1': 'B',
 'CAACCAAGTAAGAGGA-1': 'CD14 Monocytes',
 'CAACCAATCTCGCATC-1': 'CD8 T',
 'CAACCTCAGATAGGAG-1': 'CD14 Monocytes',
 'CAACCTCGTAGGCATG-1': 'Naive T cell',
 'CAACCTCTCAACACGT-1': 'NK2',
 'CAACCTCTCTGCTGTC-1': 'NK1',
 'CAACTAGAGAAACCAT-1': 'B',
 'CAACTAGAGGTGCAAC-1': 'NK1',
 'CAACTAGAGTGTGAAT-1': 'NK1',
 'CAACTAGCAACTGCGC-1': 'CD4 T',
 'CAACTAGCACACCGAC-1': 'CD14 Monocytes',
 'CAACTAGGTAGCTTGT-1': 'CD4 T',
 'CAACTAGGTGCACTTA-1': 'CD14 Monocytes',
 'CAACTAGGTTAAAGTG-1': 'CD8 T',
 'CAACTAGTCAGCATGT-1': 'B',
 'CAACTAGTCAGTTAGC-1': 'CD4 T',
 'CAACTAGTCCTCTAGC-1': 'CD8 T',
 'CAAGAAAAGACTAGAT-1': 'CD14 Monocytes',
 'CAAGAAAAGCTGCGAA-1': 'CD4 T',
 'CAAGAAAAGGCTAGGT-1': 'CD8 T',
 'CAAGAAACACAGTCGC-1': 'Unknown',
 'CAAGAAACACGTCTCT-1': 'CD4 T',
 'CAAGAAACAGCTATTG-1': 'B',
 'CAAGAAACAGTCGATT-1': 'NK2',
 'CAAGAAATCCAGAAGG-1': 'Naive T cell',
 'CAAGAAATCCGCTGTT-1': 'CD4 T',
 'CAAGAAATCGAACGGA-1': 'NK2',
 'CAAGAAATCTCTAAGG-1': 'CD4 T',
 ...}

The cluster heatmap function

In [52]:
sns.set()
sns.set(font_scale=1.0)
sns.set_style('ticks', {'xtick.minor.size': 1, 'ytick.minor.size': 0.1})
g = sns.clustermap(bin_mtx.T, 
               col_colors=auc_mtx.index.map(cell_id2cell_type_lut).map(cell_type_color_lut),
               cmap=bw_palette, figsize=(20,20))
g.ax_heatmap.set_xticklabels([])
g.ax_heatmap.set_xticks([])
g.ax_heatmap.set_xlabel('Cells')
g.ax_heatmap.set_ylabel('Regulons')
g.ax_col_colors.set_yticks([0.5])
g.ax_col_colors.set_yticklabels(['Cell Type'])
g.cax.set_visible(False)
/project_envs/pyscenic-fresh/lib/python3.8/site-packages/seaborn/matrix.py:654: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

The heatmap shows that cells of the same type cluster together by regulon activity, which can be interpreted as the transcriptional program of each cell type.

Create a new t-SNE embedding in regulon activity space

Since we are going to replace the current t-SNE coordinates, we first save the current embedding for later analysis.

In [53]:
# A flat list of column names avoids creating an unintended MultiIndex
embedding_pca_tsne = pd.DataFrame(adata.obsm['X_tsne'], columns=['_X', '_Y'], index=adata.obs_names)

We add all metadata derived from SCENIC to the scanpy.AnnData object.

In [421]:
add_scenic_metadata(adata, auc_mtx, regulons)

Current adata

In [422]:
adata
Out[422]:
AnnData object with n_obs × n_vars = 4331 × 1677
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'celltype', 'Regulon(ASCL2(+))', 'Regulon(ATF3(+))', 'Regulon(BACH1(+))', 'Regulon(BCL11A(+))', 'Regulon(CEBPB(+))', 'Regulon(CEBPD(+))', 'Regulon(CREB5(+))', 'Regulon(EGR1(+))', 'Regulon(EOMES(+))', 'Regulon(ETV7(+))', 'Regulon(FOSL2(+))', 'Regulon(FOXO1(+))', 'Regulon(GATA2(+))', 'Regulon(GATA3(+))', 'Regulon(IRF1(+))', 'Regulon(IRF7(+))', 'Regulon(IRF8(+))', 'Regulon(JUN(+))', 'Regulon(JUND(+))', 'Regulon(KLF10(+))', 'Regulon(KLF4(+))', 'Regulon(LEF1(+))', 'Regulon(MAF(+))', 'Regulon(MEF2A(+))', 'Regulon(MYC(+))', 'Regulon(NFE2(+))', 'Regulon(NFIA(+))', 'Regulon(NFIL3(+))', 'Regulon(PAX5(+))', 'Regulon(POU2AF1(+))', 'Regulon(PRDM1(+))', 'Regulon(REL(+))', 'Regulon(RUNX3(+))', 'Regulon(RXRA(+))', 'Regulon(SOX4(+))', 'Regulon(SP1(+))', 'Regulon(SPI1(+))', 'Regulon(SPIB(+))', 'Regulon(STAT1(+))', 'Regulon(STAT2(+))', 'Regulon(TBX21(+))', 'Regulon(TCF4(+))', 'Regulon(TCF7(+))', 'Regulon(TCF7L2(+))', 'Regulon(TFEC(+))', 'Regulon(ZNF438(+))', 'Regulon(ZNF503(+))'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std', 'Regulon(ASCL2(+))', 'Regulon(ATF3(+))', 'Regulon(BACH1(+))', 'Regulon(BCL11A(+))', 'Regulon(CEBPB(+))', 'Regulon(CEBPD(+))', 'Regulon(CREB5(+))', 'Regulon(EGR1(+))', 'Regulon(EOMES(+))', 'Regulon(ETV7(+))', 'Regulon(FOSL2(+))', 'Regulon(FOXO1(+))', 'Regulon(GATA2(+))', 'Regulon(GATA3(+))', 'Regulon(IRF1(+))', 'Regulon(IRF7(+))', 'Regulon(IRF8(+))', 'Regulon(JUN(+))', 'Regulon(JUND(+))', 'Regulon(KLF10(+))', 'Regulon(KLF4(+))', 'Regulon(LEF1(+))', 'Regulon(MAF(+))', 'Regulon(MEF2A(+))', 'Regulon(MYC(+))', 'Regulon(NFE2(+))', 'Regulon(NFIA(+))', 'Regulon(NFIL3(+))', 'Regulon(PAX5(+))', 'Regulon(POU2AF1(+))', 'Regulon(PRDM1(+))', 'Regulon(REL(+))', 'Regulon(RUNX3(+))', 'Regulon(RXRA(+))', 'Regulon(SOX4(+))', 'Regulon(SP1(+))', 'Regulon(SPI1(+))', 'Regulon(SPIB(+))', 'Regulon(STAT1(+))', 'Regulon(STAT2(+))', 'Regulon(TBX21(+))', 'Regulon(TCF4(+))', 'Regulon(TCF7(+))', 'Regulon(TCF7L2(+))', 'Regulon(TFEC(+))', 'Regulon(ZNF438(+))', 'Regulon(ZNF503(+))'
    uns: 'celltype_colors', 'hvg', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'tsne', 'aucell'
    obsm: 'X_pca', 'X_tsne', 'X_aucell'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

AUCELL + tSNE PROJECTION

We change the tSNE projection so that it relies on AUCell instead of PCA.

In [423]:
sc.tl.tsne(adata, use_rep = 'X_aucell')
computing tSNE
    using sklearn.manifold.TSNE
/project_envs/pyscenic-fresh/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:00:09)

Now we plot the new tSNE and color by cell types

In [424]:
sc.pl.tsne(adata, color = 'celltype')

As the plot shows, the T-cell populations (Naive, CD4, and CD8) are mixed with the NK populations. Myeloid-derived populations such as monocytes and dendritic cells are mixed together. The B-cell population is split into two sub-populations.

Cell type specific regulators - RSS

In [426]:
rss = regulon_specificity_scores(auc_mtx, adata.obs.celltype)
rss
Out[426]:
ASCL2(+) ATF3(+) BACH1(+) BCL11A(+) CEBPB(+) CEBPD(+) CREB5(+) EGR1(+) EOMES(+) ETV7(+) ... SPIB(+) STAT1(+) STAT2(+) TBX21(+) TCF4(+) TCF7(+) TCF7L2(+) TFEC(+) ZNF438(+) ZNF503(+)
CD4 T 0.254965 0.354335 0.350609 0.317162 0.357388 0.339908 0.305970 0.310514 0.335177 0.362216 ... 0.318163 0.422523 0.345023 0.335693 0.308746 0.472705 0.357702 0.334944 0.199853 0.268990
CD14 Monocytes 0.294127 0.470699 0.476293 0.277640 0.470806 0.449611 0.500718 0.340014 0.227779 0.234375 ... 0.267299 0.354912 0.416408 0.241749 0.271382 0.190989 0.451460 0.423925 0.199587 0.342221
Naive T cell 0.230936 0.243760 0.251116 0.251576 0.250168 0.247842 0.203252 0.261649 0.289624 0.347146 ... 0.254438 0.297404 0.240050 0.290954 0.252784 0.319659 0.254741 0.220501 0.200983 0.221731
CD8 T 0.217953 0.246389 0.240407 0.249987 0.249458 0.243357 0.209219 0.256116 0.288731 0.279764 ... 0.253478 0.302533 0.244697 0.276455 0.253385 0.355960 0.248659 0.239402 0.210596 0.194785
Unknown 0.182973 0.182710 0.174130 0.207132 0.172568 0.173577 0.170440 0.174086 0.173801 0.172610 ... 0.208543 0.174475 0.197332 0.174733 0.222518 0.169157 0.170713 0.169696 0.174956 0.170325
Dendritic 0.202423 0.213203 0.214122 0.211660 0.210776 0.211961 0.204109 0.215008 0.180613 0.180764 ... 0.207749 0.192810 0.207239 0.184642 0.213761 0.169990 0.206384 0.198008 0.200993 0.197551
B 0.187051 0.241578 0.239479 0.483090 0.233196 0.234245 0.188361 0.315691 0.218357 0.209480 ... 0.479905 0.207144 0.337435 0.238588 0.473801 0.197984 0.226621 0.210691 0.212560 0.194394
NK1 0.345601 0.245713 0.238197 0.252902 0.250483 0.285914 0.211123 0.276463 0.426044 0.345437 ... 0.261252 0.262000 0.239800 0.409009 0.253153 0.269352 0.244461 0.223430 0.218163 0.187301
NK2 0.316572 0.192335 0.190429 0.197329 0.197770 0.234541 0.189275 0.192605 0.298271 0.221064 ... 0.200695 0.186527 0.192913 0.302035 0.198620 0.181264 0.197463 0.194362 0.203358 0.175036
FCGR3A Monocytes 0.195610 0.228610 0.231542 0.188614 0.231532 0.223991 0.193151 0.203231 0.185050 0.181421 ... 0.188601 0.202445 0.222092 0.186652 0.185567 0.169331 0.255973 0.199822 0.182411 0.246853

10 rows × 47 columns
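
To make the numbers in this table easier to interpret, here is a minimal sketch of how a regulon specificity score can be computed for one regulon and one cell type. It assumes the score is one minus the Jensen-Shannon distance (natural logarithm) between the normalized AUCell vector of the regulon and the normalized indicator vector of the cell type. The helper names `js_distance` and `rss_sketch` and the toy data are made up for illustration; this is not the pySCENIC source code.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (natural log) between two non-negative vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()          # normalize to probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution

    def kl(a, b):
        mask = a > 0         # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

def rss_sketch(auc, labels, cell_type):
    """Assumed definition: 1 - JS distance between AUC values and the cell type indicator."""
    indicator = (np.asarray(labels) == cell_type).astype(float)
    return 1.0 - js_distance(auc, indicator)

# Toy data: a regulon that is active only in the two 'B' cells
toy_auc = [1.0, 1.0, 0.0, 0.0]
toy_labels = ['B', 'B', 'T', 'T']
rss_b = rss_sketch(toy_auc, toy_labels, 'B')   # perfectly specific
rss_t = rss_sketch(toy_auc, toy_labels, 'T')   # no overlap at all
```

Under this assumed definition the score ranges from `1 - sqrt(ln 2)` (about 0.167, no overlap between regulon activity and the cell type) to 1 (activity restricted to that cell type), which matches the range of values seen in the table.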

Now we plot the regulon specificity score (RSS) for each cell type to see whether any transcription factors are specific to a particular cell type.

In [427]:
sns.set()
sns.set(style='whitegrid', font_scale=0.7)
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 6))
plot_rss(rss, 'B', ax=ax1)
ax1.set_xlabel('')
plot_rss(rss, 'CD4 T', ax=ax2)
ax2.set_xlabel('')
ax2.set_ylabel('')
plot_rss(rss, 'CD14 Monocytes', ax=ax3)
ax3.set_xlabel('')
ax3.set_ylabel('')
plot_rss(rss, 'Naive T cell', ax=ax4)
In [428]:
sns.set()
sns.set(style='whitegrid', font_scale=0.8)
fig, ((ax5, ax6, ax7), (ax8, ax9, ax10)) = plt.subplots(2, 3, figsize=(8, 6))
plot_rss(rss, 'CD8 T', ax=ax5)
plot_rss(rss, 'Dendritic', ax=ax6)
plot_rss(rss, 'NK1', ax=ax7)
plot_rss(rss, 'NK2', ax=ax8)
plot_rss(rss, 'FCGR3A Monocytes', ax=ax9)
plot_rss(rss, 'Unknown', ax=ax10)
plt.tight_layout()

Cell type specific regulators - Z-score

Another way to find cell type specific regulators is to use a Z-score (i.e. the average AUCell score for the cells of a given type is standardized using the overall average AUCell score and its standard deviation).

In [429]:
df_obs = adata.obs
signature_column_names = list(df_obs.select_dtypes('number').columns)
signature_column_names = list(filter(lambda s: s.startswith('Regulon('), signature_column_names))
df_scores = df_obs[signature_column_names + ['celltype']]
df_results = ((df_scores.groupby(by='celltype').mean() - df_obs[signature_column_names].mean())/ df_obs[signature_column_names].std()).stack().reset_index().rename(columns={'level_1': 'regulon', 0:'Z'})
df_results['regulon'] = list(map(lambda s: s[8:-1], df_results.regulon))
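
The same standardization can be tried on a toy table to see what the Z value means. The sketch below uses four invented cells and a single regulon column; the values and the name `toy` are assumptions made for illustration only.

```python
import pandas as pd

# Toy AUCell scores: four cells, one regulon (values are invented)
toy = pd.DataFrame({
    'celltype': ['B', 'B', 'T', 'T'],
    'Regulon(PAX5(+))': [0.9, 0.7, 0.1, 0.3],
})
cols = ['Regulon(PAX5(+))']

# Mean score per cell type, centered on the overall mean and
# scaled by the overall (sample) standard deviation
group_means = toy.groupby('celltype')[cols].mean()
z = (group_means - toy[cols].mean()) / toy[cols].std()

z_b = z.loc['B', 'Regulon(PAX5(+))']
z_t = z.loc['T', 'Regulon(PAX5(+))']
```

A positive Z means the regulon is, on average, more active in that cell type than in the dataset as a whole; a negative Z means it is less active.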

Let's look at the top results for B cells.

In [430]:
df_Bresults = df_results[df_results['celltype'] == 'B']
df_Bresults.sort_values('Z', ascending=False).head()
Out[430]:
celltype regulon Z
122 B PAX5(+) 2.266961
125 B REL(+) 2.247385
123 B POU2AF1(+) 2.199252
97 B BCL11A(+) 2.178879
131 B SPIB(+) 2.168406

The table shows that the top regulons for B cells include PAX5, REL, and POU2AF1. We plot the activities of these three regulons on the t-SNE embedding to check whether they show a distinct pattern for B cells.

In [431]:
sc.pl.tsne(adata, color = ['celltype', 'Regulon(PAX5(+))', 'Regulon(REL(+))', 'Regulon(POU2AF1(+))'])

We can also draw a heatmap showing the Z-score of regulons on each cell type

In [432]:
ZSCORE_THRESHOLD = 1
CELLTYPE_COLUMN_NAME = 'celltype'
df_heatmap = pd.pivot_table(data=df_results[df_results.Z >= ZSCORE_THRESHOLD].sort_values('Z', ascending=False),
                           index=CELLTYPE_COLUMN_NAME, columns='regulon', values='Z')
In [433]:
fig, ax1 = plt.subplots(1, 1, figsize=(10, 8))
sns.heatmap(df_heatmap, ax=ax1, annot=True, fmt='.1f', linewidths=.7, cbar=False, square=True, linecolor='gray', 
            cmap='YlGnBu', annot_kws={'size': 6})
ax1.set_ylabel('')
Out[433]:
Text(223.875, 0.5, '')
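
The thresholding and pivoting step that feeds the heatmap can be checked on a small example. This is a sketch with invented Z values; the name `toy_results` and the numbers are assumptions made for illustration.

```python
import pandas as pd

# Long-format results in the same layout as df_results (values are invented)
toy_results = pd.DataFrame({
    'celltype': ['B', 'B', 'T', 'T'],
    'regulon': ['PAX5(+)', 'SPI1(+)', 'TCF7(+)', 'PAX5(+)'],
    'Z': [2.2, -0.5, 1.5, -1.0],
})

threshold = 1
# Keep only rows at or above the threshold, then spread regulons into columns
toy_heatmap = pd.pivot_table(
    data=toy_results[toy_results.Z >= threshold],
    index='celltype', columns='regulon', values='Z')
```

Combinations that fall below the threshold end up as missing values in the pivoted table, which is why the heatmap shows empty cells for most regulon and cell type pairs.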

Discover more fine-grained cell types and their gene regulators

We will subcluster the B cells and find the regulators that define these B-cell subtypes. More specifically, we build a nearest-neighbour graph in the AUCell-based, dimensionally reduced cell space, then apply Louvain community detection to the resulting graph to cluster the cells.

In [434]:
N_NEIGHBORS = 5
sc.pp.neighbors(adata, use_rep='X_aucell', n_neighbors=N_NEIGHBORS)
sc.tl.louvain(adata)
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished: found 20 clusters and added
    'louvain', the cluster labels (adata.obs, categorical) (0:00:00)
In [435]:
### Some helper functions to label the sub clusters
counter = 1

def init(data, col_name: str = 'cellular_phenotype'):
    data.obs[col_name] = data.obs['celltype'].astype(str)
    return data

def subcluster(cell_type: str, abbr: str, data, min_cells: int = 20, col_name: str = 'cellular_phenotype'):
    global counter
    counter = 1
    ct_idx = data.obs[data.obs.celltype == cell_type].index
    df_cell_numbers = data[ct_idx, :].obs.louvain.value_counts()
    clusters = df_cell_numbers[df_cell_numbers >= min_cells].index
    def rename(n: str) -> str:
        global counter
        if n in clusters:
            res = '{}{}'.format(abbr, counter)
            counter += 1
        else:
            res = '?'
        return res
    data.obs.loc[ct_idx, [col_name]] = data[ct_idx, :].obs['louvain'].map(rename).values
    return data
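
The `subcluster` helper above keeps its numbering in a module-level counter that is updated inside `map`, so the numbering appears to depend on `rename` being called once per cluster rather than once per cell. As an alternative, here is a minimal sketch of a stateless variant that works on plain pandas Series. The function name `label_subclusters` is hypothetical, and it numbers clusters by decreasing size, which may differ from the numbering produced by the helper above.

```python
import pandas as pd

def label_subclusters(cell_types: pd.Series, clusters: pd.Series,
                      cell_type: str, abbr: str, min_cells: int = 20) -> pd.Series:
    """Return per-cell labels where cells of `cell_type` are relabelled by cluster."""
    labels = cell_types.astype(str).copy()
    in_type = cell_types == cell_type

    # Count cells of the chosen type per cluster and keep the large clusters
    counts = clusters[in_type].astype(str).value_counts()
    kept = counts[counts >= min_cells].index

    # Explicit cluster -> label mapping, numbered from the largest cluster down
    names = {c: '{}{}'.format(abbr, i) for i, c in enumerate(kept, start=1)}
    mapped = clusters[in_type].astype(str).map(lambda c: names.get(c, '?'))
    labels.loc[in_type] = mapped.to_numpy()
    return labels

# Toy example: three B cells in two clusters, one T cell
toy_types = pd.Series(['B', 'B', 'B', 'T'])
toy_clusters = pd.Series(['0', '0', '1', '0'])
toy_labels = label_subclusters(toy_types, toy_clusters, 'B', 'B', min_cells=2)
```

Building the mapping as a dictionary first makes the result independent of how often the mapping function is called.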
In [436]:
adata = init(adata)
adata = subcluster('B', 'B', adata)
In [437]:
adata.obs.cellular_phenotype
Out[437]:
AAACCTGAGAAGGCCT-1    CD4 T         
AAACCTGAGACAGACC-1    CD14 Monocytes
AAACCTGAGATAGTCA-1    CD4 T         
AAACCTGAGCGCCTCA-1    Naive T cell  
AAACCTGAGGCATGGT-1    CD4 T         
                      ...           
TTTGGTTTCGCTAGCG-1    CD14 Monocytes
TTTGTCACACTTAACG-1    NK1           
TTTGTCACAGGTCCAC-1    NK2           
TTTGTCAGTTAAGACA-1    B2            
TTTGTCATCCCAAGAT-1    CD14 Monocytes
Name: cellular_phenotype, Length: 4331, dtype: object
In [438]:
sc.pl.tsne(adata, color=['cellular_phenotype'], title=['pbmc4k'], legend_loc='on data')
/project_envs/pyscenic-fresh/lib/python3.8/site-packages/anndata/_core/anndata.py:1228: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'cellular_phenotype' as categorical

We use RSS to identify regulators for these different types of B cells (B1 and B2)

In [439]:
rss = regulon_specificity_scores(auc_mtx, adata.obs.cellular_phenotype)
rss
Out[439]:
ASCL2(+) ATF3(+) BACH1(+) BCL11A(+) CEBPB(+) CEBPD(+) CREB5(+) EGR1(+) EOMES(+) ETV7(+) ... SPIB(+) STAT1(+) STAT2(+) TBX21(+) TCF4(+) TCF7(+) TCF7L2(+) TFEC(+) ZNF438(+) ZNF503(+)
CD4 T 0.254965 0.354335 0.350609 0.317162 0.357388 0.339908 0.305970 0.310514 0.335177 0.362216 ... 0.318163 0.422523 0.345023 0.335693 0.308746 0.472705 0.357702 0.334944 0.199853 0.268990
CD14 Monocytes 0.294127 0.470699 0.476293 0.277640 0.470806 0.449611 0.500718 0.340014 0.227779 0.234375 ... 0.267299 0.354912 0.416408 0.241749 0.271382 0.190989 0.451460 0.423925 0.199587 0.342221
Naive T cell 0.230936 0.243760 0.251116 0.251576 0.250168 0.247842 0.203252 0.261649 0.289624 0.347146 ... 0.254438 0.297404 0.240050 0.290954 0.252784 0.319659 0.254741 0.220501 0.200983 0.221731
CD8 T 0.217953 0.246389 0.240407 0.249987 0.249458 0.243357 0.209219 0.256116 0.288731 0.279764 ... 0.253478 0.302533 0.244697 0.276455 0.253385 0.355960 0.248659 0.239402 0.210596 0.194785
Unknown 0.182973 0.182710 0.174130 0.207132 0.172568 0.173577 0.170440 0.174086 0.173801 0.172610 ... 0.208543 0.174475 0.197332 0.174733 0.222518 0.169157 0.170713 0.169696 0.174956 0.170325
Dendritic 0.202423 0.213203 0.214122 0.211660 0.210776 0.211961 0.204109 0.215008 0.180613 0.180764 ... 0.207749 0.192810 0.207239 0.184642 0.213761 0.169990 0.206384 0.198008 0.200993 0.197551
B2 0.175943 0.197595 0.197279 0.303319 0.194519 0.194714 0.174526 0.226824 0.189171 0.187408 ... 0.302276 0.183132 0.241988 0.197357 0.297515 0.179982 0.193054 0.186487 0.167445 0.183422
? 0.171113 0.175557 0.174587 0.189672 0.174541 0.174608 0.176052 0.180040 0.172341 0.171540 ... 0.189235 0.172041 0.179315 0.173879 0.188600 0.170261 0.173093 0.171849 0.177767 0.169614
NK1 0.345601 0.245713 0.238197 0.252902 0.250483 0.285914 0.211123 0.276463 0.426044 0.345437 ... 0.261252 0.262000 0.239800 0.409009 0.253153 0.269352 0.244461 0.223430 0.218163 0.187301
B3 0.169670 0.174715 0.175192 0.208068 0.174252 0.174039 0.170516 0.183857 0.171911 0.171253 ... 0.207892 0.171994 0.187071 0.174176 0.207483 0.170438 0.173621 0.172652 0.312961 0.174516
NK2 0.316572 0.192335 0.190429 0.197329 0.197770 0.234541 0.189275 0.192605 0.298271 0.221064 ... 0.200695 0.186527 0.192913 0.302035 0.198620 0.181264 0.197463 0.194362 0.203358 0.175036
FCGR3A Monocytes 0.195610 0.228610 0.231542 0.188614 0.231532 0.223991 0.193151 0.203231 0.185050 0.181421 ... 0.188601 0.202445 0.222092 0.186652 0.185567 0.169331 0.255973 0.199822 0.182411 0.246853
B1 0.180443 0.214724 0.213004 0.372993 0.208358 0.209404 0.179780 0.271071 0.199169 0.193945 ... 0.370681 0.192189 0.277191 0.212552 0.368966 0.187111 0.203911 0.196266 0.167445 0.182965

13 rows × 47 columns

In [440]:
sns.set()
sns.set(style='whitegrid', font_scale=0.8)
fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(8, 3))
plot_rss(rss, 'B1', ax=ax1)
plot_rss(rss, 'B2', ax=ax2)

MEF2A is among the top 5 regulators of B1 but does not appear in the top 5 of B2. Let's plot the activity of MEF2A.

In [341]:
sc.pl.tsne(adata, color = ["cellular_phenotype", 'Regulon(MEF2A(+))'], legend_loc='on data')

Using a conventional method to find regulator markers

We will create another AnnData object with the AUCell scores of all cells as the main matrix. Then we will use the marker gene detection method from the scanpy package to find regulator markers.

In [342]:
bdata = sc.AnnData(adata.obsm['X_aucell'])
bdata.var_names = np.array(auc_mtx.keys())
# Copy the cell annotations from the old anndata to the new one
bdata.obs = adata.obs

Finding regulator markers using the Wilcoxon method

In [343]:
GROUPS = ['B1', 'B2'] # we are only interested in the B1 and B2 populations for now
sc.tl.rank_genes_groups(bdata, 'cellular_phenotype', method='wilcoxon', groups = GROUPS)
sc.pl.rank_genes_groups(bdata, n_genes=10, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
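
For intuition about the scores ranked above, here is a minimal sketch of a Wilcoxon rank-sum z statistic for one regulon, comparing the cells of one group against a reference set of cells. It assumes there are no tied values and applies no tie or continuity correction, so it is not a reproduction of the scanpy implementation; the function name `rank_sum_z` and the toy values are made up for illustration.

```python
import numpy as np

def rank_sum_z(group, rest):
    """Normal-approximation z score of the rank sum of `group` versus `rest` (no ties)."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    pooled = np.concatenate([group, rest])

    # Rank all values together: smallest value gets rank 1
    ranks = np.empty(len(pooled))
    ranks[np.argsort(pooled)] = np.arange(1, len(pooled) + 1)

    n1, n2 = len(group), len(rest)
    rank_sum = ranks[:n1].sum()
    expected = n1 * (n1 + n2 + 1) / 2.0            # mean rank sum under the null
    std = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # its standard deviation
    return (rank_sum - expected) / std

# Toy AUCell scores: the regulon is clearly more active in the group
z_up = rank_sum_z([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
z_down = rank_sum_z([0.1, 0.2, 0.3], [0.9, 0.8, 0.7])
```

A large positive score means the regulon's activity ranks consistently higher in the group than in the reference cells.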

It turns out that the conventional method gives the same results as the RSS and Z-score methods.