SCIB is a framework for benchmarking single-cell data integration methods. In the original publication, the authors benchmarked batch effect removal methods in terms of scalability and usability, and assessed them with a range of metrics covering both removal of technical (batch) effects and conservation of biological signal.

In this notebook, we will use the API from SCIB to run different batch effect removal methods on the human pancreas data from `GSE81076`, `GSE85241`, `GSE86469`, `GSE84133` and `GSE81608`, then compute the evaluation metrics for the integrated results.

In [1]:

```
import scanpy as sc
import pandas as pd
import os
import sys
# Required step before importing scib
RHOME = os.path.join(sys.executable.split('/bin/')[0],'lib/R')
from rpy2.rinterface_lib import openrlib
openrlib.R_HOME = RHOME
import scib
```

In [4]:

```
import urllib.request
pancreas_data, _ = urllib.request.urlretrieve(
'https://www.dropbox.com/s/xbu8a1uw36ip5ia/human_pancreas_norm_complexBatch.h5ad?dl=1',
'human_pancreas_norm_complexBatch.h5ad'
)
# Import adata
adata = sc.read_h5ad(pancreas_data)
integration_result = {}
bench_result = {}
BATCH_KEY = 'tech'
CELLTYPE_KEY = 'celltype'
```

Find highly-variable features

In [7]:

```
# Total-count normalize (library-size correct) to 10,000 reads/cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
# log transform the data.
sc.pp.log1p(adata)
# identify highly variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# keep only highly variable genes:
adata = adata[:, adata.var['highly_variable']]
```

In this notebook, we will run five methods, **Harmony, BBKNN, Scanorama, Combat, Saucie**, for illustration purposes. SCIB provides a simple API to run other batch effect removal methods as below:

`scib.integration.<method>(adata, batch=<batch_key>)`

You just need to replace the method name in the function above with your method of choice. For example, the call to run `bbknn` with the batch key `batch` is `scib.integration.bbknn(adata, batch='batch')`.

The full list of supported Python methods is: **scanorama, trvae, trvaep, scgen, mnn, bbknn, scvi, scanvi, combat, saucie, desc**

Note that methods that involve a training step, such as trvae, trvaep, scgen, scvi and scanvi, will run faster on a GPU.

In [10]:

```
# Run PCA prior to running Harmony
sc.tl.pca(adata)
# Run Harmony via scanpy's external API (backed by harmonypy)
sc.external.pp.harmony_integrate(adata, BATCH_KEY)
```

In [11]:

```
eval_metrics = {
'isolated_labels_asw_': True,
'silhouette_':True,
'hvg_score_':True,
'graph_conn_':True,
'pcr_':True,
'isolated_labels_f1_':True,
'nmi_':True,
'ari_':True,
'cell_cycle_':True,
'kBET_':True,
'ilisi_':True,
'clisi_':True
}
```

In [12]:

```
# Compute the evaluation metrics for Harmony.
# adata carries both the unintegrated data and the Harmony embedding,
# so it is passed as both the original and the integrated object.
bench_result['harmony'] = scib.me.metrics(
adata,
adata,
batch_key = BATCH_KEY,
label_key = CELLTYPE_KEY,
verbose = True,
embed = 'X_pca_harmony',
organism = 'human',
**eval_metrics
)
```

In [13]:

```
bench_result['harmony'].rename(columns = {0 : 'harmony'}, inplace = True)
bench_result['harmony']
```

Out[13]:

In [14]:

```
# Run bbknn
integration_result['bbknn'] = scib.integration.bbknn(adata, BATCH_KEY)
# Eval bbknn
bench_result['bbknn'] = scib.me.metrics(
adata,
integration_result['bbknn'],
batch_key = BATCH_KEY,
label_key = CELLTYPE_KEY,
verbose = True, embed = 'X_pca',
organism = 'human',
**eval_metrics
)
```

In [15]:

```
# Rename column
bench_result['bbknn'].rename(columns = {0 : 'bbknn'}, inplace = True)
bench_result['bbknn']
```

Out[15]:

In [16]:

```
# Run scanorama
integration_result['scanorama'] = scib.integration.scanorama(adata, BATCH_KEY)
integration_result['scanorama'].obs[BATCH_KEY] = adata.obs[BATCH_KEY]
integration_result['scanorama'].obs[CELLTYPE_KEY] = adata.obs[CELLTYPE_KEY].astype('category')
# Bench scanorama
bench_result['scanorama'] = scib.me.metrics(
adata,
integration_result['scanorama'],
batch_key = BATCH_KEY,
label_key = CELLTYPE_KEY,
verbose = True,
embed = 'X_pca',
organism = 'human',
**eval_metrics
)
```

In [17]:

```
# Rename column
bench_result['scanorama'].rename(columns = {0 : 'scanorama'}, inplace = True)
bench_result['scanorama']
```

Out[17]:

In [18]:

```
# Run combat
integration_result['combat'] = scib.integration.combat(adata, BATCH_KEY)
# Bench combat
bench_result['combat'] = scib.me.metrics(
adata,
integration_result['combat'],
batch_key = BATCH_KEY,
label_key = CELLTYPE_KEY,
verbose = True,
embed = 'X_pca',
organism = 'human',
**eval_metrics
)
```

In [19]:

```
# Rename column
bench_result['combat'].rename(columns = {0 : 'combat'}, inplace = True)
bench_result['combat']
```

Out[19]:

In [20]:

```
# Run saucie
integration_result['saucie'] = scib.integration.saucie(adata, BATCH_KEY)
# Bench saucie
bench_result['saucie'] = scib.me.metrics(
adata,
integration_result['saucie'],
batch_key = BATCH_KEY,
label_key = CELLTYPE_KEY,
verbose = True,
embed = 'X_pca',
organism = 'human',
**eval_metrics
)
```

In [21]:

```
# Rename column
bench_result['saucie'].rename(columns = {0 : 'saucie'}, inplace = True)
bench_result['saucie']
```

Out[21]:

In [22]:

```
merged_benchscores = pd.concat(bench_result.values(), axis = 1)
```

In [23]:

```
# Metrics for batch effect removal
batch_effect_removal = {
'PCR_batch',
'ASW_label/batch',
'graph_conn',
'iLISI',
}
# Metrics for conservation of biological variance
biological_conservation_metrics = {
'NMI_cluster/label',
'ARI_cluster/label',
'ASW_label',
'cLISI',
'isolated_label_F1',
'isolated_label_silhouette',
'cell_cycle_conservation',
'hvg_overlap'
}
```

The overall score is computed as a weighted mean: 40% for the batch correction metrics and 60% for the biological conservation metrics.

In [24]:

```
total_scores = {}
for k in merged_benchscores.keys():
batch_score = merged_benchscores[k].loc[batch_effect_removal].sum()/len(batch_effect_removal)*0.4
conservation_score = merged_benchscores[k].loc[biological_conservation_metrics].sum()/len(biological_conservation_metrics)*0.6
total_scores[k] = batch_score + conservation_score
```

In [25]:

```
totalscore_df = pd.DataFrame.from_dict({"Total score": total_scores}, orient="index")
merged_benchscores = pd.concat([merged_benchscores, totalscore_df])
merged_benchscores = merged_benchscores.T
```

In [26]:

```
# Move Total score column up
cols = merged_benchscores.columns.tolist()
cols = cols[-1:] + cols[:-1]
merged_benchscores = merged_benchscores[cols]
```

In [27]:

```
# Sort benchscore table by Total score
merged_benchscores.sort_values(by = 'Total score', ascending = False)
```

Out[27]:

All evaluation metrics below are scaled from 0 to 1. Higher is better.

Batch identity is a covariate that should not contribute to the variance of the expression data. For each principal component (PC) computed from the expression data, we fit a linear regression with the PC as the response and the one-hot-encoded batch identity as the covariate. The variance-weighted average of the resulting $R^2$ values over all PCs gives the contribution of batch to the variance of the expression data.

The PCR score computed by `scib` is based on the difference in this batch contribution before and after batch effect removal.
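The regression step can be sketched with toy data; the variable names and the additive batch shift below are illustrative, not the scib implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Toy data: 200 cells x 50 genes with a strong additive batch shift
batch = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 50)) + batch[:, None] * 2.0

pca = PCA(n_components=10).fit(X)
pcs = pca.transform(X)
onehot = np.eye(2)[batch]  # one-hot encoded batch covariate

# R^2 of the batch -> PC regression, for each PC
r2 = np.array([
    LinearRegression().fit(onehot, pcs[:, i]).score(onehot, pcs[:, i])
    for i in range(pcs.shape[1])
])
# Variance-weighted average R^2 = batch contribution to total variance
var = pca.explained_variance_
batch_contribution = float((r2 * var).sum() / var.sum())
print(batch_contribution)  # large here, since the batch shift dominates PC1
```

Running the same computation before and after integration and comparing the two contributions gives the PCR score.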

In [28]:

```
merged_benchscores[['PCR_batch']].sort_values(by = 'PCR_batch', ascending = False)
```

Out[28]:

Assesses whether the kNN graph representation (of the integrated data) directly connects all cells with the same cell identity label.
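The idea can be sketched on a tiny adjacency matrix (a toy construction, not scib's code): for each label, take the subgraph of cells carrying that label and measure the fraction of them inside its largest connected component.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy kNN graph over 6 cells; cell 5 is disconnected from its label group
edges = [(0, 1), (1, 2), (3, 4)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

per_label = []
for lab in np.unique(labels):
    mask = labels == lab
    sub = csr_matrix(adj[np.ix_(mask, mask)])
    n_comp, comp = connected_components(sub, directed=False)
    # fraction of this label's cells inside its largest connected component
    per_label.append(np.bincount(comp).max() / mask.sum())
graph_conn = float(np.mean(per_label))
print(graph_conn)  # 'A' fully connected (1.0), 'B' split 2/3 -> mean 5/6
```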

In [29]:

```
merged_benchscores[['graph_conn']].sort_values(by = 'graph_conn', ascending = False)
```

Out[29]:

Determines whether the label composition of a k nearest neighborhood of a cell is similar to the expected (global) label composition. The test is repeated for a random subset of cells, and the results are summarized as a rejection rate over all tested neighborhoods.
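A toy approximation of this test, using a chi-squared goodness-of-fit check per neighborhood (the real kBET implementation differs in its test statistic and neighborhood handling):

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy embedding: two batches of 150 cells that overlap completely (well mixed)
batch = np.repeat([0, 1], 150)
emb = rng.normal(size=(300, 2))

k = 25
nn = NearestNeighbors(n_neighbors=k).fit(emb)
_, idx = nn.kneighbors(emb)

expected = np.bincount(batch) / len(batch) * k  # global batch composition
tested = rng.choice(len(batch), size=100, replace=False)  # random subset
rejected = 0
for i in tested:
    observed = np.bincount(batch[idx[i]], minlength=2)
    _, p = chisquare(observed, f_exp=expected)
    rejected += p < 0.05
rejection_rate = rejected / len(tested)
print(rejection_rate)  # low for well-mixed data
```

kBET summarizes such rejection rates over all tested neighborhoods; a low rejection rate indicates good batch mixing.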

In [30]:

```
merged_benchscores[['kBET']].sort_values(by = 'kBET', ascending = False)
```

Out[30]:

Compares the overlap of two cell identities in the integrated dataset, Louvain cluster cell label and cell-type label. NMI scores of 0 or 1 correspond to uncorrelated clustering or a perfect match, respectively.

Louvain clustering was performed at resolutions from `0.1` to `2` in steps of `0.1`, and the clustering output with the highest NMI with the label set was used.

**scib** uses the NMI implementation from `scikit-learn (v.0.22.1)`
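A minimal illustration of the two endpoints with scikit-learn (toy labels, no Louvain resolution sweep):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 250)            # 4 cell types, 1000 cells
random_clusters = rng.integers(0, 4, size=1000)  # clustering independent of labels

perfect_nmi = nmi(labels, labels)          # clustering identical to labels
random_nmi = nmi(labels, random_clusters)
print(perfect_nmi)  # 1.0 for a perfect match
print(random_nmi)   # close to 0 for uncorrelated clustering
```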

In [31]:

```
merged_benchscores[['NMI_cluster/label']].sort_values(by = 'NMI_cluster/label', ascending = False)
```

Out[31]:

Compares the overlap of two cell identities in the integrated dataset: the NMI-optimized Louvain clustering and the cell-type labels. An `ARI` of 0 or 1 corresponds to random labeling or a perfect match, respectively.

**scib** uses the ARI implementation from `scikit-learn (v.0.22.1)`
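Unlike the raw Rand index, ARI is chance-adjusted, so a random labeling scores near 0 (toy illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score as ari

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 250)            # 4 cell types, 1000 cells
random_clusters = rng.integers(0, 4, size=1000)

perfect_ari = ari(labels, labels)
random_ari = ari(labels, random_clusters)
print(perfect_ari)  # 1.0 for a perfect match
print(random_ari)   # ~0 for random labeling (chance-adjusted)
```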

In [32]:

```
merged_benchscores[['ARI_cluster/label']].sort_values(by = 'ARI_cluster/label', ascending = False)
```

Out[32]:

ASW is commonly used to determine the separation of clusters, where 1 represents dense, well-separated clusters, while 0 or −1 corresponds to overlapping clusters or strong misclassification. For more information, please see: "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis".

The silhouette width measures the relationship between the `within-cluster` distances of a cell and the `between-cluster` distances of that cell to the `closest` cluster. Averaging over all silhouette widths of a set of cells yields the ASW, which ranges between −1 and 1.

Cell-type ASW (`ASW_label`) is the silhouette computed on the cell-type labels, while batch ASW (`ASW_label/batch`) measures batch mixing.
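A toy silhouette computation with scikit-learn; the [0, 1] rescaling shown here mirrors how scib reports silhouette-based scores (an assumption on my part; check scib's documentation for its exact scaling):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated toy clusters in 2-D
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
               rng.normal(5, 0.2, size=(50, 2))])
labels = np.repeat([0, 1], 50)

asw = silhouette_score(X, labels)  # raw silhouette, in [-1, 1]
scaled = (asw + 1) / 2             # rescaled to [0, 1], higher is better
print(scaled)                      # close to 1 for well-separated clusters
```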

In [33]:

```
# ASW_label/batch
merged_benchscores[['ASW_label/batch']].sort_values(by = 'ASW_label/batch', ascending = False)
```

Out[33]:

In [34]:

```
# ASW_label
merged_benchscores[['ASW_label']].sort_values(by = 'ASW_label', ascending = False)
```

Out[34]:

LISI is a diversity score assessing batch mixing (iLISI) and cell-type separation (cLISI). This evaluation metric was proposed by Korsunsky et al. 2019 (Harmony).

`iLISI` defines the effective number of datasets in a neighborhood. Neighborhoods represented by only a single dataset get an iLISI of 1, while neighborhoods with an equal number of cells from 2 datasets get an iLISI of 2. Note that even under ideal mixing, if the datasets have different numbers of cells, iLISI will be less than 2.

To assess accuracy, the authors use "cell-type LISI" (cLISI), the same mathematical measure, but applied to cell-type labels instead of dataset labels. Accurate integration should maintain a cLISI of 1, reflecting a separation of unique cell types throughout the embedding.

To extend LISI to graph-based integration outputs, the developers of `scib` developed **graph LISI**, which uses the integrated graph structure as an embedded space for distance calculation.
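The core of LISI is an inverse Simpson's index over the label composition of a neighborhood. The real method weights neighbors with a Gaussian kernel; this sketch uses raw counts:

```python
import numpy as np

def inverse_simpson(neighborhood_labels):
    # Effective number of categories (batches or cell types) in a neighborhood
    _, counts = np.unique(neighborhood_labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 / np.sum(p ** 2))

one_batch = inverse_simpson([0] * 30)            # 1.0: a single batch
balanced = inverse_simpson([0] * 15 + [1] * 15)  # 2.0: two equally mixed batches
skewed = inverse_simpson([0] * 25 + [1] * 5)     # between 1 and 2: unbalanced mixing
print(one_batch, balanced, skewed)
```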

In [35]:

```
# iLISI
merged_benchscores[['iLISI']].sort_values(by = 'iLISI', ascending = False)
```

Out[35]:

In [36]:

```
# cLISI
merged_benchscores[['cLISI']].sort_values(by = 'cLISI', ascending = False)
```

Out[36]:

Evaluates how well the isolated cell label (the cell-type label present in the fewest batches) separates from other cell identities. There are two versions: the best clustering of the isolated label (F1 score) and the global ASW (average silhouette width) of the isolated label. Both yield a value between 0 and 1, where 1 means that all cells carrying the isolated label, and no others, are captured in one cluster.
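The F1 variant can be sketched as scoring each cluster as a one-vs-rest predictor of the isolated label and keeping the best-scoring cluster (toy labels below; scib's own optimization differs in detail):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy example: 'delta' is the isolated label; cluster 2 captures it exactly
labels   = np.array(['alpha'] * 40 + ['beta'] * 40 + ['delta'] * 20)
clusters = np.array([0] * 40 + [1] * 40 + [2] * 20)

is_isolated = labels == 'delta'
# Treat membership in each cluster as a prediction of the isolated label,
# then keep the best cluster
best_f1 = max(f1_score(is_isolated, clusters == c) for c in np.unique(clusters))
print(best_f1)  # 1.0: one cluster contains exactly the isolated cells
```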

In [37]:

```
# isolated_label_F1
merged_benchscores[['isolated_label_F1']].sort_values(by = 'isolated_label_F1', ascending = False)
```

Out[37]:

In [38]:

```
# isolated_label_silhouette
merged_benchscores[['isolated_label_silhouette']].sort_values(by = 'isolated_label_silhouette', ascending = False)
```

Out[38]:

Evaluates how well the cell-cycle effect is conserved before and after integration. Values close to 0 indicate low conservation, while 1 indicates complete conservation of the variance explained by the cell cycle.

The authors computed cell-cycle scores using Scanpy's `score_genes_cell_cycle` function with a reference gene set from `Tirosh et al.`, then computed the variance contribution of the resulting S and G2/M phase scores using principal component regression (PCs as the response values and the `G2/M` and `S` scores as the covariates), performed for each batch separately.

Then the differences in variance before, $Var_{before}$, and after, $Var_{after}$, integration were aggregated into a final score between 0 and 1, using the equation

$${\mathrm{CC\ conservation}} = 1 - \frac{\left| {\mathrm{Var}}_{\mathrm{after}} - {\mathrm{Var}}_{\mathrm{before}} \right|}{{\mathrm{Var}}_{\mathrm{before}}}$$

In [39]:

```
merged_benchscores[['cell_cycle_conservation']].sort_values(by = 'cell_cycle_conservation', ascending = False)
```

Out[39]:
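Plugging numbers into the formula above (the variance contributions here are hypothetical):

```python
def cc_conservation(var_before, var_after):
    # CC conservation = 1 - |Var_after - Var_before| / Var_before
    return 1 - abs(var_after - var_before) / var_before

print(cc_conservation(0.05, 0.05))            # 1.0: variance fully conserved
print(round(cc_conservation(0.05, 0.02), 3))  # 0.4: partial conservation
```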

Computes the overlap of highly variable genes (HVGs) identified before and after correction within each batch. The overall HVG score is the mean of the per-batch HVG overlap coefficients.
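An overlap coefficient between two gene sets can be sketched as follows (hypothetical gene lists; not scib's exact per-batch procedure):

```python
def overlap_coefficient(before, after):
    # |A ∩ B| / min(|A|, |B|): 1.0 when the smaller set is fully contained
    before, after = set(before), set(after)
    return len(before & after) / min(len(before), len(after))

hvg_before = ['INS', 'GCG', 'SST', 'PPY', 'KRT19']   # hypothetical HVG lists
hvg_after  = ['INS', 'GCG', 'SST', 'REG1A', 'PRSS1']
score = overlap_coefficient(hvg_before, hvg_after)
print(score)  # 0.6: 3 of the 5 genes are retained
```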

In [40]:

```
merged_benchscores[['hvg_overlap']].sort_values(by = 'hvg_overlap', ascending = False)
```

Out[40]:

To illustrate the integration results of the different methods, we draw UMAP plots with cells colored by batch and by cell type.

In [41]:

```
sc.tl.umap(integration_result['bbknn'])
sc.pl.umap(integration_result['bbknn'], color = ['tech', 'celltype'])
```

In [42]:

```
sc.tl.umap(integration_result['combat'])
sc.pl.umap(integration_result['combat'], color = ['tech', 'celltype'])
```

In [43]:

```
sc.tl.umap(integration_result['saucie'])
sc.pl.umap(integration_result['saucie'], color = ['tech', 'celltype'])
```

In [44]:

```
sc.tl.umap(integration_result['scanorama'])
sc.pl.umap(integration_result['scanorama'], color = ['tech', 'celltype'])
```

In [45]:

```
# Find neighbors for adata using the Harmony embedding
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
```

In [46]:

```
sc.tl.umap(adata)
sc.pl.umap(adata, color = ['tech', 'celltype'])
```