Cells are the basis of life. Many of the world’s greatest challenges, such as developing treatments for cancers or understanding life itself, are fundamentally tied to cells and the roles they play. Determining the type of a cell is known as the Cell Type Prediction problem, and it has stood as a grand challenge in scRNA-seq analysis for the past several years.

BioTuring Cell Type Prediction addresses this challenge by combining knowledge from high-quality training datasets with the latest advances in deep learning. Currently, our prediction tool covers 54 cell types and 183 subtypes. Please refer to this link for a full list of subtypes.

To use BioTuring Cell Type Prediction, you will need an authorized token to make API calls to our server. Please go here and enter your email.
We will send the token to the email address you entered. Then paste the token into the cell below for later use.

In [2]:
token = '70d2acfda3a54ca6a4390699394****'

Currently, we support three data formats:

Zip: a zipped folder containing 3 files:

  • barcodes.(tsv|csv|gz|tar|tar.gz)
  • features.(tsv|csv|gz|tar|tar.gz) or genes.(tsv|csv|gz|tar|tar.gz)
  • matrix.(mtx|gz|tar|tar.gz)

Hdf5: an HDF5 file containing 5 keys:

  • barcodes
  • genes or features
  • data
  • indices
  • indptr

Text: a text file holding the full matrix, with values separated by tabs or commas
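Before uploading a zip archive, it can be useful to confirm locally that the three expected files are present. The sketch below is not part of the BioTuring API: the helper name missing_zip_members is hypothetical, and it only checks file-name prefixes (barcodes, features or genes, matrix), not the extensions listed above.

```python
import zipfile

# Required file-name prefixes, taken from the zip format description above.
# Checking prefixes only (not extensions) is a simplification of this sketch.
REQUIRED_PREFIXES = {
    'barcodes': ('barcodes.',),
    'features': ('features.', 'genes.'),
    'matrix': ('matrix.',),
}


def missing_zip_members(zip_path):
    """Return the labels of required files that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Drop any folder component so 'GSE98638/matrix.mtx' becomes 'matrix.mtx'
        base_names = [name.rsplit('/', 1)[-1] for name in archive.namelist()]
    return [
        label
        for label, prefixes in REQUIRED_PREFIXES.items()
        if not any(name.startswith(prefixes) for name in base_names)
    ]
```

An empty list means all three files were found; otherwise the list names what is missing.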

For this tutorial, we will predict cell types for the dataset from Zemin Zhang et al.: Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Let's download the data.

In [ ]:
import os

os.system('wget https://bioturingpublic.s3.us-west-2.amazonaws.com/GSE98638.zip')

The downloaded file is a zip archive containing 3 files:

  • barcodes.tsv
  • features.tsv
  • matrix.mtx

Now we are ready to submit the data. Let's write some functions to handle data submission/status check first.

In [3]:
import requests
import time

import pandas as pd

from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor
from sys import stderr as STREAM
from pathlib import Path
from tqdm import tqdm
from io import StringIO
In [4]:
def submit_file(species, version, file_path, file_type, shape, token):
	# Form fields sent alongside the uploaded file
	fields = {
		'token': token,
		'type': file_type,
		'shape': shape,
		'species': species,
		'version': str(version)
	}

	upload_url = 'https://talk2data.bioturing.com/predict/submit'
	path = Path(file_path)
	total_size = path.stat().st_size
	file_name = path.name

	# The progress bar counts bytes, so the base unit is 'B' (scaled to KB/MB by tqdm)
	with tqdm(
		desc=file_name, total=total_size, unit='B', unit_scale=True, unit_divisor=1024,
	) as bar, open(file_path, 'rb') as exp_file:
		fields['exp_matrix'] = (file_name, exp_file)
		encoder = MultipartEncoder(fields=fields)
		# Advance the bar by the number of bytes read since the last callback
		multipart = MultipartEncoderMonitor(
			encoder, lambda monitor: bar.update(monitor.bytes_read - bar.n)
		)
		headers = {
			'Content-Type': multipart.content_type
		}
		return requests.post(upload_url, data=multipart, headers=headers).json()
In [5]:
def get_result(token, project_id):
	request_url = 'https://talk2data.bioturing.com/predict/get_result'
	data = {
		'token': token,
		'project_id': project_id
	}

	last_status = []

	while True:
		status = requests.post(request_url, json=data).json()
		if not status or 'status' not in status:
			print(
				'Internal server error. Please try again later! '
				'You don\'t need to re-submit your data. '
				'Add --project_id project_id to get your result.'
			)
			return None

		# Once the prediction is ready, it is returned as tab-separated text
		if 'data' in status:
			return pd.read_csv(StringIO(status['data']), sep='\t')

		if 'is_running' not in status:
			print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
			return None

		# Print only the log lines that have not been shown yet
		current_status = status['running_status'].split('\n')[:-1]
		new_status = current_status[len(last_status):]
		if len(new_status):
			print('\n'.join(new_status))

		last_status += new_status
		if not status['is_running']:
			print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
			return None

		# Wait before polling again (the 5-second interval is an assumed value)
		time.sleep(5)


The submit_file function takes 6 input parameters:

  • species: Currently we support 2 species: human and mouse
  • version: Prediction version (Human: 1, 2. Mouse: 1)
  • file_path: The path to the dataset you want to predict
  • file_type: The file format of the dataset. It must be one of the three formats we currently support (as described above). Available values: zip, hdf5, tsv
  • shape: The orientation of the gene expression matrix. genesxcells: choose this if each row of your gene expression matrix represents a gene. cellsxgenes: choose this if each row of your gene expression matrix represents a cell
  • token: The authorized token we sent to your email
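Because a mistyped argument is only reported after the upload, a local check of the documented values can save a round trip. The following is a minimal sketch, not part of the BioTuring API: the helper validate_submission and the constant names are hypothetical, and the allowed values are copied from the parameter list above.

```python
# Allowed values copied from the parameter list above
SUPPORTED_VERSIONS = {'human': {1, 2}, 'mouse': {1}}
SUPPORTED_FILE_TYPES = {'zip', 'hdf5', 'tsv'}
SUPPORTED_SHAPES = {'genesxcells', 'cellsxgenes'}


def validate_submission(species, version, file_type, shape):
    """Raise ValueError if an argument is outside the documented values."""
    if species not in SUPPORTED_VERSIONS:
        raise ValueError(f'Unsupported species: {species!r}')
    if version not in SUPPORTED_VERSIONS[species]:
        raise ValueError(f'Unsupported version {version!r} for {species}')
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f'Unsupported file type: {file_type!r}')
    if shape not in SUPPORTED_SHAPES:
        raise ValueError(f'Unsupported shape: {shape!r}')
```

Calling validate_submission('human', 2, 'zip', 'genesxcells') returns silently, while validate_submission('mouse', 2, 'zip', 'genesxcells') raises a ValueError.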

For this example, the dataset we will submit is a zip file, and each row of the gene expression matrix represents a gene. Let's submit the file to the BioTuring server, using the token you received above.

In [ ]:
submission_status = submit_file(
    species = 'human',
    version = 2,
    file_path = './GSE98638.zip',
    file_type = 'zip',
    shape = 'genesxcells',
    token = token
)
In [9]:
submission_status
{'count_projects': '97.0/10000.0',
 'message': 'Successfully submitted the data!',
 'project_id': '02397638-90a0-4a9e-853f-97f1971ef8e0',
 'species': 'human',
 'status': 200,
 'version': 2}

A status of 200 means that the data was submitted successfully; any other value means the submission failed. If you encounter any problems, feel free to email us at support@bioturing.com
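The status check can also be made explicit in code, so that a failed submission stops the notebook instead of surfacing later as a missing project_id. This is a minimal sketch: the helper name require_success is hypothetical, and the assumption that a failed response still carries a message field is not confirmed by this tutorial.

```python
def require_success(submission_status):
    """Return the project ID, or raise if the submission did not succeed."""
    if submission_status.get('status') != 200:
        # Assumption: failed responses may include a 'message' field
        message = submission_status.get('message', 'no message returned')
        raise RuntimeError(f'Submission failed: {message}')
    return submission_status['project_id']
```

With the response shown above, require_success(submission_status) returns the project_id string.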

Let's check the current status of the prediction process and write the prediction result to a file for later use.

In [33]:
prediction_result = get_result(token, submission_status['project_id'])
[2022-08-08 04:51:29] Waiting in the queue...(2 left)
[2022-08-08 04:51:29] Extracting data...
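The timestamped lines above come from the running_status log returned by the server. On each poll, get_result prints only the lines it has not printed before. That slicing step can be sketched in isolation; the helper name new_log_lines is hypothetical and simply mirrors the logic inside get_result.

```python
def new_log_lines(running_status, seen):
    """Return the log lines in `running_status` that come after those in `seen`."""
    # The log ends with a newline, so the final split element is empty and dropped
    current = running_status.split('\n')[:-1]
    return current[len(seen):]
```

For example, if one line has already been printed and the server now reports two, only the second line is returned.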
In [34]:
prediction_result
      Barcodes     Major cell types                 Cell sub types
0     PTH13        CD4-positive; alpha-beta T cell  CD4-positive; CD8-positive; alpha-beta T cell
1     PTH1         CD4-positive; alpha-beta T cell  CD4-positive; CD8-positive; alpha-beta T cell
2     PTH20        Unassigned                       Unassigned
3     PTH26        CD4-positive; alpha-beta T cell  CD4-positive; CD8-positive; alpha-beta T cell
4     PTH5         CD4-positive; alpha-beta T cell  CD4-positive; CD8-positive; alpha-beta T cell
...   ...          ...                              ...
5058  TTR167-0508  CD4-positive; alpha-beta T cell  regulatory T cell
5059  TTR173-0508  CD4-positive; alpha-beta T cell  regulatory T cell
5060  TTR176-0508  CD4-positive; alpha-beta T cell  regulatory T cell
5061  TTR179-0508  CD4-positive; alpha-beta T cell  regulatory T cell
5062  TTR189-0508  CD4-positive; alpha-beta T cell  regulatory T cell

5063 rows × 3 columns
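To keep the prediction for later use, the DataFrame can be written to disk. This is a minimal sketch using pandas; the helper names and the tab-separated format are choices made here for illustration, not something prescribed by the service.

```python
import pandas as pd


def save_prediction(prediction_result, out_path):
    """Write the prediction table as tab-separated text without the row index."""
    prediction_result.to_csv(out_path, sep='\t', index=False)


def load_prediction(out_path):
    """Read a table previously written by save_prediction."""
    return pd.read_csv(out_path, sep='\t')
```

For example, save_prediction(prediction_result, 'prediction_result.tsv') stores the table (the file name is hypothetical), and load_prediction reads it back in a later session without calling the server again.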

Now we will process the example data through the standard scanpy pipeline to visualize the prediction result.

In [81]:
import scanpy as sc
import pandas as pd
import numpy as np
from scipy import sparse, io
import shutil
In [27]:
def unzip(source, destination):
    shutil.unpack_archive(source, destination)
In [36]:
unzip('./GSE98638.zip', '.')
In [71]:
matrix = sparse.csr_matrix(io.mmread('./GSE98638/matrix.mtx.gz'))
features = pd.read_csv('./GSE98638/features.tsv.gz', header=None)
features.columns = ['Genes']

We read the matrix successfully. Next, we will normalize the raw matrix.

In [ ]:
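adata = sc.AnnData(matrix.T, dtype=np.float32)
adata.var = features
# Scale every cell to the same total count
sc.pp.normalize_total(adata, target_sum=1e4)
# highly_variable_genes with its default flavor expects log-transformed data
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=2000)

# Keep the full normalized matrix in .raw, then restrict to highly variable genes
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)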

Let's compute a t-SNE embedding to visualize the cells in a 2D plane.

In [ ]:
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# Compute the t-SNE coordinates that sc.pl.tsne needs (default settings)
sc.tl.tsne(adata)

Now we will assign the predicted labels and visualize them together with gene expression.

In [84]:
adata.obs = prediction_result
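Assigning the table directly relies on the prediction rows being in the same order as the cells in the matrix. If that is not guaranteed, the rows can be reordered by barcode first. The sketch below assumes barcodes are unique; the helper name align_predictions is hypothetical, and the list of barcodes would have to be read from the barcodes file of the dataset, which this tutorial does not load.

```python
def align_predictions(prediction_result, barcodes):
    """Reorder prediction rows to follow `barcodes`, failing on any mismatch."""
    # Index the table by its 'Barcodes' column so rows can be looked up by cell
    indexed = prediction_result.set_index('Barcodes')
    missing = [b for b in barcodes if b not in indexed.index]
    if missing:
        raise ValueError(
            f'{len(missing)} barcodes have no prediction, e.g. {missing[0]!r}'
        )
    return indexed.loc[list(barcodes)]
```

The returned DataFrame is indexed by barcode and ordered like the input list, so it could be assigned to adata.obs in place of the raw table.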
In [85]:
sc.pl.tsne(adata, color=['Major cell types', 'Cell sub types'], wspace=0.4)

We can verify the result using marker genes. For example, let's check the CD4+ Treg cells using the gene FOXP3.

In [86]:
sc.pl.tsne(adata, color=['FOXP3'], color_map='Reds', gene_symbols='Genes')
sc.pl.tsne(adata, color=['Cell sub types'], color_map='Reds', groups="regulatory T cell")

This is the end of this tutorial. We are still working to add more cell types to the list. If you are interested in any particular cell types, please email us at support@bioturing.com

Thank you!