SCIB is a framework used to benchmark different methods for single-cell data integration. In the original publication, the authors benchmarked different batch effect removal methods in terms of scalability and usability. The methods are assessed in different metrics, in both technical effect removing and biological conservation.

In this notebook, we will use the api from SCIB to run different batch effect removal methods for the human pancreas data from GSE81076, GSE85241, GSE86469, GSE84133, GSE81608 then compute the evaluation metrics for the integrated results.

Load required packages

In [1]:
import scanpy as sc
import pandas as pd
import os
import sys

# Required step before importing scib
RHOME = os.path.join(sys.executable.split('/bin/')[0],'lib/R')
from rpy2.rinterface_lib import openrlib
openrlib.R_HOME = RHOME
import scib

Read the human pancreas data

In [4]:
import urllib.request
pancreas_data, _ = urllib.request.urlretrieve(
    'https://www.dropbox.com/s/xbu8a1uw36ip5ia/human_pancreas_norm_complexBatch.h5ad?dl=1',
    'human_pancreas_norm_complexBatch.h5ad'
)

# Import adata
adata = sc.read_h5ad(pancreas_data)

integration_result = {}
bench_result = {}
BATCH_KEY = 'tech'
CELLTYPE_KEY = 'celltype'

Pre-processing

Find highly-variable features

In [7]:
# Total-count normalize (library-size correct) to 10,000 reads/cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

# log transform the data.
sc.pp.log1p(adata)

# identify highly variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# keep only highly variable genes:
adata = adata[:, adata.var['highly_variable']]

Run different batch effect removal methods

In this notebook, we will run five methods Harmony, BBKNN, Scanorama, Combat, Saucie for the illustration purpose. SCIB provideds a simple API to run other batch effect removal methods as below:

scib.integration.<method>(adata, batch=<batch_key>)

You just need to replace the method name in the function above by your method. For example, the function used to run bbknn with the batch key batch will be scib.integration.scanorama(adata, batch='batch')

The full list of supported Python methods is: scanorama, trvae, trvaep, scgen, mnn, bbknn, scvi, scanvi, combat, saucie, desc

Note that some methods that involve the training steps like trvae, trvaep, scgen, scvi, scanvi will run faster with GPU

In [10]:
# Run PCA prior to run Harmony
sc.tl.pca(adata)

# Harmony runs through R so:
sc.external.pp.harmony_integrate(adata, BATCH_KEY)
2022-01-07 07:53:45,026 - harmonypy - INFO - Iteration 1 of 10
2022-01-07 07:53:58,372 - harmonypy - INFO - Iteration 2 of 10
2022-01-07 07:54:11,699 - harmonypy - INFO - Iteration 3 of 10
2022-01-07 07:54:24,525 - harmonypy - INFO - Iteration 4 of 10
2022-01-07 07:54:38,707 - harmonypy - INFO - Iteration 5 of 10
2022-01-07 07:54:50,671 - harmonypy - INFO - Iteration 6 of 10
2022-01-07 07:55:03,993 - harmonypy - INFO - Iteration 7 of 10
2022-01-07 07:55:15,671 - harmonypy - INFO - Iteration 8 of 10
2022-01-07 07:55:22,043 - harmonypy - INFO - Iteration 9 of 10
2022-01-07 07:55:26,623 - harmonypy - INFO - Iteration 10 of 10
2022-01-07 07:55:31,050 - harmonypy - INFO - Converged after 10 iterations
In [11]:
eval_metrics = {
        'isolated_labels_asw_': True,
        'silhouette_':True,
        'hvg_score_':True,
        'graph_conn_':True,
        'pcr_':True,
        'isolated_labels_f1_':True,
        'nmi_':True,
        'ari_':True,
        'cell_cycle_':True,
        'kBET_':True,
        'ilisi_':True,
        'clisi_':True
}
In [12]:
# Compute the evaluation metrics for Harmony
bench_result['harmony'] = scib.me.metrics(
        adata,
        adata,
        batch_key = BATCH_KEY,
        label_key = CELLTYPE_KEY, 
        verbose = True,
        embed = 'X_pca_harmony',
        organism = 'human',
        **eval_metrics
)
NMI...
ARI...
Silhouette score...
PC regression...
covariate: tech
compute PCA n_comps: 50
covariate: tech
Compute PCR on embedding n_comps: 50
Trying to set attribute `.obs` of view, copying.
cell cycle effect...
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Isolated labels F1...
Isolated labels ASW...
Graph connectivity...
kBET...
batch: tech
Use 19 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 70.
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:261: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/numpy2ri.py:195: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
Use 52 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 167.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 550.
Use 29 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 106.
Use 70 nearest neighbors.
Adding diffusion to step 4
importing expression matrix
kBET estimation
Number of kBET tests is set to 417.
Use 66 nearest neighbors.
Adding diffusion to step 4
Adding diffusion to step 5
Adding diffusion to step 6
importing expression matrix
kBET estimation
Number of kBET tests is set to 215.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 32.
Use 14 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 47.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 3.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
t_cell consists of a single batch or is too small. Skip.
cLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_wyabgwn1/input.mtx /tmp/lisi_wyabgwn1/
LISI score estimation
16 processes started.
iLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_8i2zx_up/input.mtx /tmp/lisi_8i2zx_up/
LISI score estimation
16 processes started.
/project_envs/scib-conda/lib/python3.8/site-packages/statsmodels/compat/pandas.py:12: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  version = LooseVersion(pd.__version__)
/project_envs/scib-conda/lib/python3.8/site-packages/statsmodels/compat/pandas.py:14: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  pandas_lt_1_0_0 = version < LooseVersion('1.0.0')
/project_envs/scib-conda/lib/python3.8/site-packages/statsmodels/compat/pandas.py:15: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  pandas_lt_25_0 = version < LooseVersion('0.25.0')
/project_envs/scib-conda/lib/python3.8/site-packages/statsmodels/compat/pandas.py:16: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  pandas_gte_23_0 = version >= LooseVersion('0.23.0')
In [13]:
bench_result['harmony'].rename(columns = {0 : 'harmony'}, inplace = True)
bench_result['harmony']
Out[13]:
harmony
NMI_cluster/label 0.641552
ARI_cluster/label 0.297319
ASW_label 0.602215
ASW_label/batch 0.856864
PCR_batch 0.848686
cell_cycle_conservation 0.892146
isolated_label_F1 0.109375
isolated_label_silhouette 0.534415
graph_conn 0.870155
kBET 0.383899
iLISI 0.001107
cLISI 0.999470
hvg_overlap 1.000000
trajectory NaN
In [14]:
# Run bbknn
integration_result['bbknn'] = scib.integration.bbknn(adata, BATCH_KEY)

# Eval bbknn
bench_result['bbknn'] = scib.me.metrics(
        adata,
        integration_result['bbknn'], 
        batch_key = BATCH_KEY,
        label_key = CELLTYPE_KEY,
        verbose = True, embed = 'X_pca',
        organism = 'human',
        **eval_metrics
)
Warning: cluster key cluster already exists in adata.obs and will be overwritten
NMI...
ARI...
Silhouette score...
PC regression...
covariate: tech
compute PCA n_comps: 50
covariate: tech
compute PCA n_comps: 50
Trying to set attribute `.obs` of view, copying.
cell cycle effect...
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Isolated labels F1...
Isolated labels ASW...
Graph connectivity...
kBET...
batch: tech
Use 19 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 70.
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:261: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/numpy2ri.py:195: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
Use 52 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 167.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 550.
Use 29 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 106.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 417.
Use 66 nearest neighbors.
Adding diffusion to step 4
importing expression matrix
kBET estimation
Number of kBET tests is set to 215.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 32.
Use 14 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 47.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 3.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
t_cell consists of a single batch or is too small. Skip.
cLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_oj1_ssl8/input.mtx /tmp/lisi_oj1_ssl8/
LISI score estimation
16 processes started.
iLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_dw4r6lms/input.mtx /tmp/lisi_dw4r6lms/
LISI score estimation
16 processes started.
In [15]:
# Rename column
bench_result['bbknn'].rename(columns = {0 : 'bbknn'}, inplace = True)
bench_result['bbknn']
Out[15]:
bbknn
NMI_cluster/label 0.898412
ARI_cluster/label 0.945042
ASW_label 0.550620
ASW_label/batch 0.792516
PCR_batch 0.000000
cell_cycle_conservation 1.000000
isolated_label_F1 0.016949
isolated_label_silhouette 0.514728
graph_conn 0.984663
kBET 0.208242
iLISI 0.378540
cLISI 0.855703
hvg_overlap 1.000000
trajectory NaN
In [16]:
# Run scanorama
integration_result['scanorama'] = scib.integration.scanorama(adata, BATCH_KEY)
integration_result['scanorama'].obs[BATCH_KEY] = adata.obs[BATCH_KEY]
integration_result['scanorama'].obs[CELLTYPE_KEY] = adata.obs[CELLTYPE_KEY].astype('category')

# Bench scanorama
bench_result['scanorama'] = scib.me.metrics(
        adata,
        integration_result['scanorama'], 
        batch_key = BATCH_KEY,
        label_key = CELLTYPE_KEY,
        verbose = True,
        embed = 'X_pca',
        organism = 'human',
        **eval_metrics
)
/project_envs/scib-conda/lib/python3.8/site-packages/intervaltree/intervaltree.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  class IntervalTree(collections.MutableSet):
Found 4130 genes among all datasets
[[0.00000000e+00 8.30677291e-01 5.79937304e-02 1.80278884e-01
  4.98007968e-02 3.98406375e-02 7.96812749e-03 7.96812749e-03
  1.99203187e-03]
 [0.00000000e+00 0.00000000e+00 1.81818182e-01 2.27133479e-01
  8.15694373e-02 2.20417633e-02 6.56455142e-03 7.67459708e-03
  5.36193029e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 9.40438871e-02
  4.23197492e-02 2.97805643e-02 2.35109718e-02 1.09717868e-02
  6.45768025e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  1.64995823e-01 1.19047619e-01 1.92147034e-02 2.45587107e-02
  5.56300268e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 5.52784223e-01 7.29478575e-01 4.31312356e-01
  3.09757357e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 3.40487239e-01 7.58700696e-01
  6.70241287e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 4.88104375e-01
  6.70241287e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  6.29316961e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00]]
Processing datasets (0, 1)
Processing datasets (5, 7)
Processing datasets (4, 6)
Processing datasets (2, 8)
Processing datasets (4, 5)
Processing datasets (6, 7)
Processing datasets (4, 7)
Processing datasets (5, 6)
Processing datasets (1, 3)
Processing datasets (1, 2)
Processing datasets (0, 3)
Processing datasets (3, 4)
Processing datasets (3, 5)
Warning: cluster key cluster already exists in adata.obs and will be overwritten
NMI...
ARI...
Silhouette score...
PC regression...
covariate: tech
compute PCA n_comps: 50
covariate: tech
compute PCA n_comps: 50
Trying to set attribute `.obs` of view, copying.
cell cycle effect...
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Isolated labels F1...
Isolated labels ASW...
Graph connectivity...
kBET...
batch: tech
Use 19 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 70.
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:261: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/numpy2ri.py:195: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
Use 52 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 167.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 550.
Use 29 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 106.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 417.
Use 66 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 215.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 32.
Use 14 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 47.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 3.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
t_cell consists of a single batch or is too small. Skip.
cLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_8pklnsvb/input.mtx /tmp/lisi_8pklnsvb/
LISI score estimation
16 processes started.
File has no entries. Doing nothing.
iLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_rlaqwaor/input.mtx /tmp/lisi_rlaqwaor/
LISI score estimation
16 processes started.
In [17]:
# Rename column
bench_result['scanorama'].rename(columns = {0 : 'scanorama'}, inplace = True)
bench_result['scanorama']
Out[17]:
scanorama
NMI_cluster/label 0.641288
ARI_cluster/label 0.297198
ASW_label 0.550620
ASW_label/batch 0.792516
PCR_batch 0.643726
cell_cycle_conservation 0.935562
isolated_label_F1 0.112903
isolated_label_silhouette 0.514728
graph_conn 0.863278
kBET 0.207821
iLISI 0.001115
cLISI 0.999475
hvg_overlap 0.298667
trajectory NaN
In [18]:
# Run combat
integration_result['combat'] = scib.integration.combat(adata, 'tech')

# Bench combat
bench_result['combat'] = scib.me.metrics(
        adata,
        integration_result['combat'],
        batch_key = BATCH_KEY,
        label_key = CELLTYPE_KEY,
        verbose = True,
        embed = 'X_pca',
        organism = 'human',
        **eval_metrics
)
/project_envs/scib-conda/lib/python3.8/site-packages/scanpy/preprocessing/_combat.py:340: RuntimeWarning: divide by zero encountered in true_divide
  (abs(g_new - g_old) / g_old).max(), (abs(d_new - d_old) / d_old).max()
Warning: cluster key cluster already exists in adata.obs and will be overwritten
NMI...
ARI...
Silhouette score...
PC regression...
covariate: tech
compute PCA n_comps: 50
covariate: tech
compute PCA n_comps: 50
Trying to set attribute `.obs` of view, copying.
cell cycle effect...
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Isolated labels F1...
Isolated labels ASW...
Graph connectivity...
kBET...
batch: tech
Use 19 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 70.
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:261: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/numpy2ri.py:195: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
Use 52 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 167.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 550.
Use 29 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 106.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 417.
Use 66 nearest neighbors.
Adding diffusion to step 4
importing expression matrix
kBET estimation
Number of kBET tests is set to 215.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 32.
Use 14 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 47.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 3.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
t_cell consists of a single batch or is too small. Skip.
cLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_k6bdrtlc/input.mtx /tmp/lisi_k6bdrtlc/
LISI score estimation
16 processes started.
File has no entries. Doing nothing.
iLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_hfimfksv/input.mtx /tmp/lisi_hfimfksv/
LISI score estimation
16 processes started.
In [19]:
# Rename column
bench_result['combat'].rename(columns = {0 : 'combat'}, inplace = True)
bench_result['combat']
Out[19]:
combat
NMI_cluster/label 0.641552
ARI_cluster/label 0.297319
ASW_label 0.550620
ASW_label/batch 0.792516
PCR_batch 0.999973
cell_cycle_conservation 0.954411
isolated_label_F1 0.109375
isolated_label_silhouette 0.514728
graph_conn 0.870155
kBET 0.207976
iLISI 0.001291
cLISI 0.999466
hvg_overlap 0.435111
trajectory NaN
In [20]:
# Run saucie
integration_result['saucie'] = scib.integration.combat(adata, 'tech')

# Bench saucie
bench_result['saucie'] = scib.me.metrics(
        adata,
        integration_result['saucie'],
        batch_key = BATCH_KEY,
        label_key = CELLTYPE_KEY,
        verbose = True,
        embed = 'X_pca',
        organism = 'human',
        **eval_metrics
)
/project_envs/scib-conda/lib/python3.8/site-packages/scanpy/preprocessing/_combat.py:340: RuntimeWarning: divide by zero encountered in true_divide
  (abs(g_new - g_old) / g_old).max(), (abs(d_new - d_old) / d_old).max()
Warning: cluster key cluster already exists in adata.obs and will be overwritten
NMI...
ARI...
Silhouette score...
PC regression...
covariate: tech
compute PCA n_comps: 50
covariate: tech
compute PCA n_comps: 50
Trying to set attribute `.obs` of view, copying.
cell cycle effect...
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
Isolated labels F1...
Isolated labels ASW...
Graph connectivity...
kBET...
batch: tech
Use 19 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 70.
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:261: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/project_envs/scib-conda/lib/python3.8/site-packages/rpy2/robjects/numpy2ri.py:195: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
Use 52 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 167.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 550.
Use 29 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 106.
Use 70 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 417.
Use 66 nearest neighbors.
Adding diffusion to step 4
importing expression matrix
kBET estimation
Number of kBET tests is set to 215.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 32.
Use 14 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 47.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 3.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
Use 10 nearest neighbors.
Use 10 nearest neighbors.
importing expression matrix
kBET estimation
Number of kBET tests is set to 25.
t_cell consists of a single batch or is too small. Skip.
cLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_0xwgq4do/input.mtx /tmp/lisi_0xwgq4do/
LISI score estimation
16 processes started.
iLISI score...
using precomputed kNN graph
Convert nearest neighbor matrix and distances for LISI.
Compute knn on shortest paths
/tmp/lisi_ehchh6bb/input.mtx /tmp/lisi_ehchh6bb/
LISI score estimation
16 processes started.
In [21]:
# Rename column
bench_result['saucie'].rename(columns = {0 : 'saucie'}, inplace = True)
bench_result['saucie']
Out[21]:
saucie
NMI_cluster/label 0.641552
ARI_cluster/label 0.297319
ASW_label 0.550620
ASW_label/batch 0.792516
PCR_batch 0.999973
cell_cycle_conservation 0.954411
isolated_label_F1 0.109375
isolated_label_silhouette 0.514728
graph_conn 0.870155
kBET 0.209427
iLISI 0.001253
cLISI 0.999495
hvg_overlap 0.435111
trajectory NaN

Merge all evaluation scores to summary

In [22]:
merged_benchscores = pd.concat(bench_result.values(), axis = 1)
In [23]:
# Metrics for batch effect removal
batch_effect_removal = {
        'PCR_batch',
        'ARI_cluster/label',
        'graph_conn',
}

# Metrics for conservation of biological variance
biological_conservation_metrics = {
        'NMI_cluster/label',
        'ASW_label',
        'ASW_label/batch',
        'iLISI',
        'cLISI',
        'isolated_label_F1',
        'isolated_label_silhouette',
        'cell_cycle_conservation',
        'hvg_overlap'
}

The overall score of all scores is calculated by a 40% weighted mean of batch correction metrics per 60% of biological conservation metrics.

In [24]:
total_scores = {}
for k in merged_benchscores.keys():
    batch_score = merged_benchscores[k].loc[batch_effect_removal].sum()/len(batch_effect_removal)*0.4
    conservation_score = merged_benchscores[k].loc[biological_conservation_metrics].sum()/len(biological_conservation_metrics)*0.6
    total_scores[k] = batch_score + conservation_score
In [25]:
totalscore_df = pd.DataFrame.from_dict({"Total score": total_scores}, orient="index")
merged_benchscores = merged_benchscores.append(totalscore_df)
merged_benchscores = merged_benchscores.T
In [26]:
# Move Total score column up
cols = merged_benchscores.columns.tolist()
cols = cols[-1:] + cols[:-1]
merged_benchscores = merged_benchscores[cols]
In [27]:
# Sort benchscore table by Total score
merged_benchscores.sort_values(by = 'Total score', ascending = False)
Out[27]:
Total score NMI_cluster/label ARI_cluster/label ASW_label ASW_label/batch PCR_batch cell_cycle_conservation isolated_label_F1 isolated_label_silhouette graph_conn kBET iLISI cLISI hvg_overlap trajectory
bbknn 0.657792 0.898412 0.945042 0.550620 0.792516 0.000000 1.000000 0.016949 0.514728 0.984663 0.208242 0.378540 0.855703 1.000000 NaN
harmony 0.644631 0.641552 0.297319 0.602215 0.856864 0.848686 0.892146 0.109375 0.534415 0.870155 0.383899 0.001107 0.999470 1.000000 NaN
combat 0.622264 0.641552 0.297319 0.550620 0.792516 0.999973 0.954411 0.109375 0.514728 0.870155 0.207976 0.001291 0.999466 0.435111 NaN
saucie 0.622264 0.641552 0.297319 0.550620 0.792516 0.999973 0.954411 0.109375 0.514728 0.870155 0.209427 0.001253 0.999495 0.435111 NaN
scanorama 0.563685 0.641288 0.297198 0.550620 0.792516 0.643726 0.935562 0.112903 0.514728 0.863278 0.207821 0.001115 0.999475 0.298667 NaN

Explain each evaluation metric

All evaluation metrics below are scaled from 0 to 1. Higher is better.

PCR (Principal component regression)

Batch identity is a covariate that shouldn't contribute to the variance of the expression data. For each Principal component computed from the expression data, we do a linear regression model with PC as the response value and batch identity (one-hot encoded) as the covariate. The ratio of $R^2$ from the linear regression model to the variance of the PCA for each PC is the contribution of the batch to the variance of the expression data.

The PCR value computed by scib is the difference ratio of the contribution mentioned above after batch effect removal

In [28]:
merged_benchscores[['PCR_batch']].sort_values(by = 'PCR_batch', ascending = False)
Out[28]:
PCR_batch
combat 0.999973
saucie 0.999973
harmony 0.848686
scanorama 0.643726
bbknn 0.000000

Graph connectivity

Assesses whether the kNN graph representation (of the integrated data) directly connects all cells with the same cell identity label.

In [29]:
merged_benchscores[['graph_conn']].sort_values(by = 'graph_conn', ascending = False)
Out[29]:
graph_conn
bbknn 0.984663
harmony 0.870155
combat 0.870155
saucie 0.870155
scanorama 0.863278

kBET (k-nearest-neighbor batch effect test)

Determines whether the label composition of a k nearest neighborhood of a cell is similar to the expected (global) label composition. The test is repeated for a random subset of cells, and the results are summarized as a rejection rate over all tested neighborhoods.

In [30]:
merged_benchscores[['kBET']].sort_values(by = 'kBET', ascending = False)
Out[30]:
kBET
harmony 0.383899
saucie 0.209427
bbknn 0.208242
combat 0.207976
scanorama 0.207821

NMI (normalized mutual information)

Compares the overlap of two cell identities in the integrated dataset, Louvain cluster cell label and cell-type label. NMI scores of 0 or 1 correspond to uncorrelated clustering or a perfect match, respectively.

Louvain clustering was performed at a resolution range of 0.1 to 2 in steps of 0.1, and the clustering output with the highest NMI with the label set was used.

scib uses the NMI implementation from scikit-learn (v.0.22.1)

In [31]:
merged_benchscores[['NMI_cluster/label']].sort_values(by = 'NMI_cluster/label', ascending = False)
Out[31]:
NMI_cluster/label
bbknn 0.898412
harmony 0.641552
combat 0.641552
saucie 0.641552
scanorama 0.641288

ARI (Adjusted Rand Index)

Compares the overlap of two cell identities in the integrated dataset, Louvain cluster cell label and cell-type label. One is the NMI-optimized Louvain clustering and the other is the cell-type labels. ARI of 0 or 1 corresponds to random labeling or a perfect match, respectively.

scib uses the ARI implementation from scikit-learn (v.0.22.1)

In [32]:
merged_benchscores[['ARI_cluster/label']].sort_values(by = 'ARI_cluster/label', ascending = False)
Out[32]:
ARI_cluster/label
bbknn 0.945042
harmony 0.297319
combat 0.297319
saucie 0.297319
scanorama 0.297198

ASW (Average silhouette width)

ASW is commonly used to determine the separation of clusters where 1 represents dense and well-separated clusters, while 0 or −1 corresponds to overlapping clusters or strong miscalculation. For more information, please see: "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis"

Measures the relationship between the within-cluster distances of a cell and the between-cluster distances of that cell to the closest cluster. Averaging over all silhouette widths of a set of cells yields the ASW, which ranges between −1 and 1.

Cell-type ASW is the silhouette of the cell labels and batch ASW measures batch mixing.

The ASW metric for a batch effect removal method is the Cell-type ASW / Batch ASW

In [33]:
# ASW_label/batch
merged_benchscores[['ASW_label/batch']].sort_values(by = 'ASW_label/batch', ascending = False)
Out[33]:
ASW_label/batch
harmony 0.856864
bbknn 0.792516
scanorama 0.792516
combat 0.792516
saucie 0.792516
In [34]:
# ASW_label
merged_benchscores[['ASW_label']].sort_values(by = 'ASW_label', ascending = False)
Out[34]:
ASW_label
harmony 0.602215
bbknn 0.550620
scanorama 0.550620
combat 0.550620
saucie 0.550620

Graph LISI

Is a diversity score assess bath mixing (iLISI) and cell-type separation (cLISI). This evaluation metric was proposed by Korsunsky et. al 2019 Harmony

iLISI defines the effective number of datasets in a neighborhood. Neighborhoods represented by only a single dataset get an iLISI of 1, while neighborhoods with an equal number of cells from 2 datasets get an iLISI of 2. Note that even under ideal mixing, if the datasets have different numbers of cells, iLISI would be less than 2

To assess accuracy, the authors use “cell-type LISI” (cLISI), the same mathematical measure, but applied to cell-type instead of dataset labels. Accurate integration should maintain a cLISI of 1, reflecting a separation of unique cell types throughout the embedding.

To extend LISI graph-based integration outputs, the developers of scib developed graph LISI, which uses the integrated graph structure as an embedded space for distance calculation.

In [35]:
# iLISI
merged_benchscores[['iLISI']].sort_values(by = 'iLISI', ascending = False)
Out[35]:
iLISI
bbknn 0.378540
combat 0.001291
saucie 0.001253
scanorama 0.001115
harmony 0.001107
In [36]:
# cLISI
merged_benchscores[['cLISI']].sort_values(by = 'cLISI', ascending = False)
Out[36]:
cLISI
saucie 0.999495
scanorama 0.999475
harmony 0.999470
combat 0.999466
bbknn 0.855703

Isolated_label score

Evaluates how well the isolated cell label (the cell type label present in the least number of batches) separate from other cell identities. There are 2 version: the best clustering of the isolated label (F1 score) and the global ASW (average silhouette width) of the isolated label. It gives value between 0 & 1, where 1 shows that all the isolated label cells and no others are captured in the cluster.

In [37]:
# isolated_label_F1
merged_benchscores[['isolated_label_F1']].sort_values(by = 'isolated_label_F1', ascending = False)
Out[37]:
isolated_label_F1
scanorama 0.112903
harmony 0.109375
combat 0.109375
saucie 0.109375
bbknn 0.016949
In [38]:
# isolated_label_silhouette
merged_benchscores[['isolated_label_silhouette']].sort_values(by = 'isolated_label_silhouette', ascending = False)
Out[38]:
isolated_label_silhouette
harmony 0.534415
bbknn 0.514728
scanorama 0.514728
combat 0.514728
saucie 0.514728

Cell_cycle_conservation

Evaluates how well the cell-cycle effect can be captured before and after integration. Values close to 0 indicate lower conservation and 1 indicates complete conservation of the variance explained by cell cycle

The authors computed cell-cycle scores using Scanpy’s score_cell_cycle function with a reference gene set from Tirosh et al. then computed the variance contribution of the resulting S and G2/M phase scores using principal component regression (PCs as the response values and GS/M and S scores as the covariates), which was performed for each batch separately.

Then the differences in variance before, $Var_{before}$, and after, $Var_{after}$, integration were aggregated into a final score between 0 and 1, using the equation

$${\mathrm{CC}}\,{\mathrm{conservation}} = 1 - \frac{{\left| {{\mathrm{Var}}_{{\mathrm{after}}} - {\mathrm{Var}}_{{\mathrm{before}}}} \right|}}{{{\mathrm{Var}}_{{\mathrm{before}}}}}{{{\mathrm{.}}}}$$
In [39]:
merged_benchscores[['cell_cycle_conservation']].sort_values(by = 'cell_cycle_conservation', ascending = False)
Out[39]:
cell_cycle_conservation
bbknn 1.000000
combat 0.954411
saucie 0.954411
scanorama 0.935562
harmony 0.892146

Highly variable gene (HVG) conservation

Computes the number of HVGs before and after correction of each batch. The overall HVG score is the mean of the per-batch HVG overlap coefficients.

In [40]:
merged_benchscores[['cell_cycle_conservation']].sort_values(by = 'cell_cycle_conservation', ascending = False)
Out[40]:
cell_cycle_conservation
bbknn 1.000000
combat 0.954411
saucie 0.954411
scanorama 0.935562
harmony 0.892146

Visualization

To illustrate the integration results of different methods, we draw the UMAP plot with cells colored by batches and the cell types

BBKNN

In [41]:
sc.tl.umap(integration_result['bbknn'])
sc.pl.umap(integration_result['bbknn'], color = ['tech', 'celltype'])

Combat

In [42]:
sc.tl.umap(integration_result['combat'])
sc.pl.umap(integration_result['combat'], color = ['tech', 'celltype'])

Saucie

In [43]:
sc.tl.umap(integration_result['saucie'])
sc.pl.umap(integration_result['saucie'], color = ['tech', 'celltype'])

Scanorama

In [44]:
sc.tl.umap(integration_result['scanorama'])
sc.pl.umap(integration_result['scanorama'], color = ['tech', 'celltype'])

Harmony

In [45]:
# Find neigbors for adata
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
In [46]:
sc.tl.umap(adata)
sc.pl.umap(adata, color = ['tech', 'celltype'])