Cells are the basis of life. Many of the world’s greatest challenges, such as developing treatments for cancers or understanding life itself, are fundamentally tied to cells and the roles they play. Identifying the type of a cell is known as the Cell Type Prediction problem, and it has stood as a grand challenge in scRNA-seq analysis for years.

BioTuring Cell Type Prediction addresses this grand challenge by combining the knowledge from high-quality training datasets with the latest advances in Deep Learning. Currently our prediction tool covers 54 cell types and 183 subtypes. Please refer to this link for a full list of subtypes.

To use BioTuring Cell Type Prediction, you will need an authorized token to make API calls to our server. Please go here and enter your email.
We will send the token to the email address you entered. After that, enter the token in the block below for later use.

token = '70d2acfda3a54ca6a4390699394****'
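If you prefer not to keep the token in the notebook itself, one option is to read it from an environment variable. The following is a minimal sketch; the helper load_token and the variable name BIOTURING_TOKEN are hypothetical and not part of the BioTuring API.

```python
import os


def load_token(env_var='BIOTURING_TOKEN'):
    """Read the API token from an environment variable (variable name is an assumption)."""
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable to the token sent to your email.')
    return token


# Usage, once the variable is set in your shell or notebook environment:
# token = load_token()
```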

Currently, we support three data formats:

Zip: a zipped folder containing 3 files:

barcodes.(tsv|csv|gz|tar|tar.gz)
features.(tsv|csv|gz|tar|tar.gz) or genes.(tsv|csv|gz|tar|tar.gz)
matrix.(mtx|gz|tar|tar.gz)

Hdf5: an HDF5 file containing 5 keys:

barcodes
genes or features
data
indices
indptr

Text: a full matrix text file, with values separated by tabs or commas
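
As an illustration of the Zip format, the following minimal sketch packs a tiny genes-by-cells count matrix into barcodes.tsv, features.tsv and matrix.mtx using only the Python standard library. The function name write_zip_dataset, the folder name and the toy values are hypothetical. The one-identifier-per-line layout of the two TSV files and the MatrixMarket coordinate header are assumptions about what the server accepts, not a documented specification.

```python
import tempfile
import zipfile
from pathlib import Path


def write_zip_dataset(out_zip, genes, barcodes, entries, folder='my_dataset'):
    """Write a zipped folder with barcodes.tsv, features.tsv and matrix.mtx.

    entries: (gene_index, cell_index, count) triples, 0-based, rows = genes.
    """
    # MatrixMarket coordinate format: header, "rows cols nnz", then 1-based triples.
    mtx_lines = ['%%MatrixMarket matrix coordinate integer general',
                 f'{len(genes)} {len(barcodes)} {len(entries)}']
    mtx_lines += [f'{g + 1} {c + 1} {v}' for g, c, v in entries]
    with zipfile.ZipFile(out_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{folder}/barcodes.tsv', '\n'.join(barcodes) + '\n')
        zf.writestr(f'{folder}/features.tsv', '\n'.join(genes) + '\n')
        zf.writestr(f'{folder}/matrix.mtx', '\n'.join(mtx_lines) + '\n')
    return out_zip


# Toy example: 2 genes x 3 cells with 3 non-zero counts.
demo_zip = Path(tempfile.mkdtemp()) / 'my_dataset.zip'
write_zip_dataset(
    demo_zip,
    genes=['CD3D', 'FOXP3'],
    barcodes=['cell_1', 'cell_2', 'cell_3'],
    entries=[(0, 0, 5), (0, 2, 1), (1, 1, 3)],
)
with zipfile.ZipFile(demo_zip) as zf:
    print(sorted(zf.namelist()))
```

Because each row of this matrix is a gene, such a file would be submitted with the genesxcells shape.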

For this tutorial, we will predict cell types for the dataset from Zemin Zhang et al.: Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Let's download the data.

import os

os.system('wget https://bioturingpublic.s3.us-west-2.amazonaws.com/GSE98638.zip')

The downloaded file is a zip archive containing 3 files:

barcodes.tsv
features.tsv
matrix.mtx
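
To confirm that a download completed and the archive holds the expected files, you can list its contents before submitting. This is a small sketch using the standard zipfile module; the helper name list_archive is hypothetical.

```python
import zipfile


def list_archive(path):
    """Return the file names stored in a zip archive, skipping directory entries."""
    with zipfile.ZipFile(path) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


# Usage, after the download above has finished:
# print(list_archive('./GSE98638.zip'))
```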

Now we are ready to submit the data. Let's write some functions to handle data submission/status check first.

import requests
import time

import pandas as pd

from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor
from sys import stderr as STREAM
from pathlib import Path
from tqdm import tqdm
from io import StringIO

def submit_file(species, version, file_path, file_type, shape, token):
	fields = {
		'token': token,
		'type': file_type,
		'shape': shape,
		'species': species,
		'version': str(version)
	}

	upload_url = 'https://talk2data.bioturing.com/predict/submit'
	path = Path(file_path)
	total_size = path.stat().st_size
	file_name = path.name

	# Open the file in a context manager so the handle is closed after the upload.
	with open(file_path, 'rb') as exp_matrix, tqdm(
		desc=file_name, total=total_size, unit='B', unit_scale=True, unit_divisor=1024,
	) as bar:
		fields['exp_matrix'] = (file_name, exp_matrix)
		encoder = MultipartEncoder(fields=fields)
		# Advance the progress bar by the number of bytes read since the last callback.
		multipart = MultipartEncoderMonitor(
			encoder, lambda monitor: bar.update(monitor.bytes_read - bar.n)
		)
		headers = {
			'Content-Type': multipart.content_type
		}
		return requests.post(upload_url, data=multipart, headers=headers).json()

def get_result(token, project_id):
	request_url = 'https://talk2data.bioturing.com/predict/get_result'
	data = {
		'token': token,
		'project_id': project_id
	}

	last_status = []

	while True:
		status = requests.post(request_url, json=data).json()
		if not status or 'status' not in status:
			print('Internal server error. Please try again later! You don\'t need to re-submit your data. '
				'Call get_result(token, project_id) again to get your result.')
			return None

		if 'data' in status:
			return pd.read_csv(StringIO(status['data']), sep='\t')

		if 'is_running' not in status:
			print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
			return None

		current_status = status['running_status'].split('\n')[:-1]
		new_status = current_status[len(last_status):]
		if len(new_status):
			print('\n'.join(new_status))

		last_status += new_status
		if not status['is_running']:
			print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
			return None

		time.sleep(60)
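
The polling loop in get_result prints each log line only once by remembering how many lines it has already shown. The sketch below isolates that step as a pure function so it can be tried without a server. The helper name new_status_lines and the sample log snapshots are illustrative, and it assumes, as the loop does, that the log text ends with a newline.

```python
def new_status_lines(running_status, already_printed):
    """Return the log lines in running_status that have not been printed yet."""
    # The log ends with '\n', so the last piece after split() is empty and is dropped.
    current = running_status.split('\n')[:-1]
    return current[len(already_printed):]


printed = []
snapshots = (
    '[t0] Waiting in the queue...\n',
    '[t0] Waiting in the queue...\n[t1] Extracting data...\n',
)
for snapshot in snapshots:
    fresh = new_status_lines(snapshot, printed)
    if fresh:
        print('\n'.join(fresh))
    printed += fresh
```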

The submit_file function takes 6 input parameters:

species: Currently we support 2 species: human and mouse
version: Prediction version (Human: 1, 2. Mouse: 1)
file_path: The path to the dataset you want to predict
file_type: The file format of the dataset. It should be one of the three formats that we currently support (as mentioned above). Available values: zip, hdf5, tsv
shape: The orientation of the gene expression matrix. genesxcells: Choose this if each row of your gene expression matrix represents a gene. cellsxgenes: Choose this if each row of your gene expression matrix represents a cell
token: The authorized token we sent to your email
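
Since a rejected submission costs an upload, it can help to check the arguments locally first. The following sketch encodes the allowed values listed above; the helper check_submit_args and the constant names are hypothetical, and the server may apply additional checks that are not reproduced here.

```python
# Allowed values as listed in the parameter descriptions above.
SUPPORTED_VERSIONS = {'human': {1, 2}, 'mouse': {1}}
SUPPORTED_FILE_TYPES = {'zip', 'hdf5', 'tsv'}
SUPPORTED_SHAPES = {'genesxcells', 'cellsxgenes'}


def check_submit_args(species, version, file_type, shape):
    """Raise ValueError if an argument is outside the documented values."""
    if species not in SUPPORTED_VERSIONS:
        raise ValueError(f'species must be one of {sorted(SUPPORTED_VERSIONS)}, got {species!r}')
    if version not in SUPPORTED_VERSIONS[species]:
        raise ValueError(f'version {version!r} is not available for {species}')
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f'file_type must be one of {sorted(SUPPORTED_FILE_TYPES)}, got {file_type!r}')
    if shape not in SUPPORTED_SHAPES:
        raise ValueError(f'shape must be one of {sorted(SUPPORTED_SHAPES)}, got {shape!r}')


# The combination used in this tutorial passes silently.
check_submit_args('human', 2, 'zip', 'genesxcells')
```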

For this example, the dataset we will submit is a zipped file, and each row of the gene expression matrix represents a gene. Let's submit the file to the BioTuring server, using the token you retrieved above.

submission_status = submit_file(
    species='human',
    version=2,
    file_path='./GSE98638.zip',
    file_type='zip',
    shape='genesxcells',
    token=token
)

submission_status

{'count_projects': '97.0/10000.0',
 'message': 'Successfully submitted the data!',
 'project_id': '02397638-90a0-4a9e-853f-97f1971ef8e0',
 'species': 'human',
 'status': 200,
 'version': 2}

status: 200 means that you successfully submitted the data; any other value means the submission failed. If you encounter any problems, feel free to email us at support@bioturing.com
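
The status check can be made explicit before polling for results. This is a minimal sketch: the helper require_success is hypothetical, and it assumes that a failed response still carries a message field, which the example output above only shows for the successful case.

```python
def require_success(submission_status):
    """Return the project_id of a successful submission, or raise if it failed."""
    if submission_status.get('status') != 200:
        # Assumption: failed responses may include a 'message'; fall back if they do not.
        reason = submission_status.get('message', 'no message returned')
        raise RuntimeError(f'Submission failed: {reason}')
    return submission_status['project_id']


# Checked against the response shown above.
example_status = {
    'count_projects': '97.0/10000.0',
    'message': 'Successfully submitted the data!',
    'project_id': '02397638-90a0-4a9e-853f-97f1971ef8e0',
    'species': 'human',
    'status': 200,
    'version': 2,
}
project_id = require_success(example_status)
```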

Let's check the current status of the prediction process and write the prediction result to a file for later use.

prediction_result = get_result(token, submission_status['project_id'])

[2022-08-08 04:51:29] Waiting in the queue...(2 left)
[2022-08-08 04:51:29] Extracting data...

prediction_result
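
One way to write the result to a file is a tab-separated export with pandas. The sketch below uses a two-row stand-in table with the same column names as the real output; the helper save_predictions and the file name prediction_result.tsv are illustrative choices, not something the API prescribes.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_predictions(prediction_result, path):
    """Write the prediction table to a tab-separated file and return the path."""
    if prediction_result is None:
        # get_result returns None when the job failed or the server could not be reached.
        raise ValueError('No prediction result to save.')
    prediction_result.to_csv(path, sep='\t', index=False)
    return path


# Tiny stand-in for the real prediction_result DataFrame.
example = pd.DataFrame({
    'Barcodes': ['PTH13', 'PTH20'],
    'Major cell types': ['CD4-positive; alpha-beta T cell', 'Unassigned'],
    'Cell sub types': ['CD4-positive; CD8-positive; alpha-beta T cell', 'Unassigned'],
})
saved_path = save_predictions(example, Path(tempfile.mkdtemp()) / 'prediction_result.tsv')
reloaded = pd.read_csv(saved_path, sep='\t')
```

With the real data, the call would be save_predictions(prediction_result, './prediction_result.tsv').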

Now we will process the example data through the standard pipeline from scanpy, to visualize the prediction result.

import scanpy as sc
import pandas as pd
import numpy as np
from scipy import sparse, io
import shutil

def unzip(source, destination):
    shutil.unpack_archive(source, destination)

unzip('./GSE98638.zip', '.')

matrix = sparse.csr_matrix(io.mmread('./GSE98638/matrix.mtx.gz'))
features = pd.read_csv('./GSE98638/features.tsv.gz', header=None)
features.columns = ['Genes']

We read the matrix successfully. Next, we will normalize the raw matrix.

adata = sc.AnnData(matrix.T, dtype=np.float32)
adata.var = features
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=2000)

adata.raw = adata
adata = adata[:, adata.var.highly_variable].copy()  # copy so that scaling does not operate on a view
sc.pp.scale(adata, max_value=10)

Let's compute a t-SNE embedding to visualize the cells in a 2D plane.

sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.tsne(adata)

Now we will assign the predicted labels and visualize them together with gene expression.

adata.obs = prediction_result
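
The assignment above relies on the prediction table listing cells in the same order as the matrix columns. If you want to guard against a mismatch, the rows can be reordered by barcode first. This is a sketch under the assumption that the table has a Barcodes column, as in the output of this tutorial; the helper align_predictions and the toy barcodes are hypothetical.

```python
import pandas as pd


def align_predictions(prediction_result, barcodes):
    """Return the predictions indexed by barcode, in the order given by barcodes."""
    indexed = prediction_result.set_index('Barcodes')
    missing = [b for b in barcodes if b not in indexed.index]
    if missing:
        raise ValueError(f'{len(missing)} barcodes have no prediction, e.g. {missing[0]}')
    return indexed.loc[list(barcodes)]


# Toy example: the predictions arrive in a different order than the matrix.
predictions = pd.DataFrame({
    'Barcodes': ['cell_b', 'cell_a'],
    'Major cell types': ['Unassigned', 'CD4-positive; alpha-beta T cell'],
})
aligned = align_predictions(predictions, ['cell_a', 'cell_b'])
```

With the real data, barcodes would be the list read from the barcodes file of the dataset, and the aligned table would be assigned to adata.obs.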

sc.pl.tsne(adata, color=['Major cell types', 'Cell sub types'], wspace=0.4)

We can verify the result using known marker genes. For example, let's check the CD4+ Treg cells using the gene FOXP3.

sc.pl.tsne(adata, color=['FOXP3'], color_map='Reds', gene_symbols='Genes')
sc.pl.tsne(adata, color=['Cell sub types'], color_map='Reds', groups="regulatory T cell")

This is the end of this tutorial. We are still working to add more cell types to the list. If you are interested in a particular cell type, please email us at support@bioturing.com

Thank you!

Output of prediction_result (first and last five rows):

	Barcodes	Major cell types	Cell sub types
0	PTH13	CD4-positive; alpha-beta T cell	CD4-positive; CD8-positive; alpha-beta T cell
1	PTH1	CD4-positive; alpha-beta T cell	CD4-positive; CD8-positive; alpha-beta T cell
2	PTH20	Unassigned	Unassigned
3	PTH26	CD4-positive; alpha-beta T cell	CD4-positive; CD8-positive; alpha-beta T cell
4	PTH5	CD4-positive; alpha-beta T cell	CD4-positive; CD8-positive; alpha-beta T cell
...	...	...	...
5058	TTR167-0508	CD4-positive; alpha-beta T cell	regulatory T cell
5059	TTR173-0508	CD4-positive; alpha-beta T cell	regulatory T cell
5060	TTR176-0508	CD4-positive; alpha-beta T cell	regulatory T cell
5061	TTR179-0508	CD4-positive; alpha-beta T cell	regulatory T cell
5062	TTR189-0508	CD4-positive; alpha-beta T cell	regulatory T cell