Cells are the basis of life. Many of the world’s greatest challenges, like developing treatments for cancers or understanding life itself, are fundamentally tied to cells and the roles they play.
Determining the type of a cell is known as the Cell Type Prediction problem, and it has been a grand challenge in scRNA-seq analysis in recent years.
BioTuring Cell Type Prediction addresses this challenge by combining knowledge from high-quality training datasets with the latest advances in Deep Learning.
Currently, our prediction tool covers 54 cell types and 183 subtypes.
Please refer to this link for a full list of subtypes.
To use BioTuring Cell Type Prediction, you will need an authorized token to make API calls to our server. Please go here and enter your email.
We will send the token to the email address you entered. Then paste the token into the block below for later use.
token = '70d2acfda3a54ca6a4390699394****'
Currently, we support three data formats:
Zip: a zipped folder containing 3 files
Hdf5: an HDF5 file containing 5 keys
Text: a full matrix text file with values separated by tabs or commas
For this tutorial, we will predict cell types for the dataset from Zemin Zhang et al.: Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Let's download the data.
import os
os.system('wget https://bioturingpublic.s3.us-west-2.amazonaws.com/GSE98638.zip')
The downloaded file is a zip archive containing 3 files.
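Before uploading, it can help to confirm what the archive actually contains. The following is a minimal sketch using Python's standard zipfile module. The helper name list_archive_members is hypothetical and is not part of any BioTuring tooling.

```python
import zipfile


def list_archive_members(zip_path):
    """Return the names of the regular files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so that only real files are reported.
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Example usage (assumes GSE98638.zip was downloaded as shown above):
# for name in list_archive_members('./GSE98638.zip'):
#     print(name)
```

Listing the members first makes it easy to spot a missing or misnamed file before spending time on the upload.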
Now we are ready to submit the data. Let's write some functions to handle data submission/status check first.
import requests
import time
import pandas as pd
from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor
from sys import stderr as STREAM
from pathlib import Path
from tqdm import tqdm
from io import StringIO
def submit_file(species, version, file_path, file_type, shape, token):
    """Upload a dataset to the prediction server and return the JSON response."""
    fields = {
        'token': token,
        'type': file_type,
        'shape': shape,
        'species': species,
        'version': str(version)
    }
    upload_url = 'https://talk2data.bioturing.com/predict/submit'
    path = Path(file_path)
    total_size = path.stat().st_size
    file_name = path.name
    # Open the file in a context manager so the handle is closed after the upload.
    # unit='B' lets unit_scale add the k/M/G prefix to the progress bar.
    with open(path, 'rb') as handle, tqdm(
        desc=file_name, total=total_size, unit='B', unit_scale=True, unit_divisor=1024,
    ) as bar:
        fields['exp_matrix'] = (file_name, handle)
        encoder = MultipartEncoder(fields=fields)
        # Advance the bar by the number of bytes read since the last callback.
        multipart = MultipartEncoderMonitor(
            encoder, lambda monitor: bar.update(monitor.bytes_read - bar.n)
        )
        headers = {
            'Content-Type': multipart.content_type
        }
        return requests.post(upload_url, data=multipart, headers=headers).json()
def get_result(token, project_id):
    """Poll the server once a minute until the prediction finishes or fails."""
    request_url = 'https://talk2data.bioturing.com/predict/get_result'
    data = {
        'token': token,
        'project_id': project_id
    }
    last_status = []
    while True:
        status = requests.post(request_url, json=data).json()
        if not status or 'status' not in status:
            print("Internal server error. Please try again later! "
                  "You don't need to re-submit your data. "
                  "Call get_result again with the same project_id to get your result.")
            return None
        if 'data' in status:
            # The finished prediction arrives as tab-separated text.
            return pd.read_csv(StringIO(status['data']), sep='\t')
        if 'is_running' not in status:
            print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
            return None
        # Print only the log lines that have not been shown yet.
        current_status = status['running_status'].split('\n')[:-1]
        new_status = current_status[len(last_status):]
        if len(new_status):
            print('\n'.join(new_status))
            last_status += new_status
        if not status['is_running']:
            print('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')
            return None
        time.sleep(60)
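The polling loop in get_result prints only the log lines that have not been shown yet. That bookkeeping can be isolated in a small pure function, sketched below. The name new_status_lines is hypothetical, the sample log messages are invented for illustration, and the sketch assumes, as the code above does, that running_status is a log that only ever grows.

```python
def new_status_lines(running_status, seen_count):
    """Return the log lines that appeared after the first `seen_count` lines.

    Mirrors the logic in get_result: the text is split on newlines and the
    last piece is dropped, since it is either empty (the log ends with a
    newline) or a line that is still being written.
    """
    complete_lines = running_status.split('\n')[:-1]
    return complete_lines[seen_count:]


# Invented snapshots of a growing log, as successive polls might return it.
snapshots = ['Uploading\n', 'Uploading\nNormalizing\n', 'Uploading\nNormalizing\nPredic']
seen = 0
for snapshot in snapshots:
    fresh = new_status_lines(snapshot, seen)
    seen += len(fresh)
    print(fresh)
```

Keeping this step separate from the network call makes the incremental printing easy to check without contacting the server.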
The submit_file function takes 6 input parameters:
species: Currently we support 2 species: human and mouse
version: Prediction version (Human: 1, 2. Mouse: 1)
file_path: The path to the dataset you want to predict
file_type: The file format of the dataset. It should be one of the three formats that we currently support (as mentioned above). Available values: zip, hdf5, tsv
shape: The format of the gene expression matrix.
    genesxcells: Choose this if each row of your gene expression matrix represents a gene.
    cellsxgenes: Choose this if each row of your gene expression matrix represents a cell.
token: The authorized token we sent to your email
For this example, the dataset we will submit is a zipped file, and each row of the gene expression matrix represents a gene. Let's submit the file to the BioTuring server, using the token you retrieved above.
submission_status = submit_file(
    species='human',
    version=2,
    file_path='./GSE98638.zip',
    file_type='zip',
    shape='genesxcells',
    token=token
)
submission_status
status: 200 means that the data was submitted successfully. Any other status means the submission failed. If you encounter any problems, feel free to email us at support@bioturing.com.
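To catch typos before a long upload, the parameter values listed above can be checked locally. The sketch below is only a convenience check built from the values named in this tutorial. The function and constant names are hypothetical, and the server remains the authority on what it accepts.

```python
# Allowed values as listed in the parameter description of this tutorial.
SUPPORTED_VERSIONS = {'human': {1, 2}, 'mouse': {1}}
SUPPORTED_FILE_TYPES = {'zip', 'hdf5', 'tsv'}
SUPPORTED_SHAPES = {'genesxcells', 'cellsxgenes'}


def check_submission_args(species, version, file_type, shape):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if species not in SUPPORTED_VERSIONS:
        problems.append(f'unsupported species: {species!r}')
    elif version not in SUPPORTED_VERSIONS[species]:
        problems.append(f'unsupported version {version!r} for {species}')
    if file_type not in SUPPORTED_FILE_TYPES:
        problems.append(f'unsupported file_type: {file_type!r}')
    if shape not in SUPPORTED_SHAPES:
        problems.append(f'unsupported shape: {shape!r}')
    return problems


print(check_submission_args('human', 2, 'zip', 'genesxcells'))  # prints []
print(check_submission_args('mouse', 2, 'csv', 'genesxcells'))
```

Returning a list of problems rather than raising on the first one lets all mistakes be reported in a single pass.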
Let's check the current status of the prediction process and keep the prediction result for later use.
prediction_result = get_result(token, submission_status['project_id'])
prediction_result
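Because get_result returns a pandas DataFrame, or None when the job fails, the predictions can be saved to disk so that they do not have to be fetched again. The following is a minimal sketch. The helper name save_predictions and the output file name are assumptions, and the toy table only stands in for prediction_result, with illustrative labels.

```python
import pandas as pd


def save_predictions(result, out_path):
    """Write a prediction table to a tab-separated file; return True on success."""
    if result is None:
        # get_result returns None when the job fails or the server errors out.
        return False
    result.to_csv(out_path, sep='\t', index=False)
    return True


# Toy table standing in for prediction_result; the labels are illustrative only.
toy_result = pd.DataFrame({
    'Major cell types': ['T cell', 'T cell'],
    'Cell sub types': ['regulatory T cell', 'CD8+ T cell'],
})

# Example usage with the real result (file name is an assumption):
# save_predictions(prediction_result, 'predictions.tsv')
# prediction_result = pd.read_csv('predictions.tsv', sep='\t')
```

A tab-separated file matches the format in which the server returns the table, so it can be reloaded with the same pd.read_csv call.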
Now we will process the example data through the standard scanpy pipeline to visualize the prediction result.
import scanpy as sc
import pandas as pd
import numpy as np
from scipy import sparse, io
import shutil
def unzip(source, destination):
    shutil.unpack_archive(source, destination)
unzip('./GSE98638.zip', '.')
matrix = sparse.csr_matrix(io.mmread('./GSE98638/matrix.mtx.gz'))
features = pd.read_csv('./GSE98638/features.tsv.gz', header=None)
features.columns = ['Genes']
We read the matrix successfully. Next, we will normalize the raw matrix.
adata = sc.AnnData(matrix.T, dtype=np.float32)
adata.var = features
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=2000)
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)
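Conceptually, sc.pp.normalize_total rescales each cell so that its counts sum to target_sum, and sc.pp.log1p then takes the natural logarithm of one plus each value. The NumPy sketch below illustrates that arithmetic on a tiny dense matrix. It is not scanpy's implementation, which also handles sparse matrices and additional options, and the function name normalize_and_log is hypothetical.

```python
import numpy as np


def normalize_and_log(counts, target_sum=1e4):
    """Scale each row (cell) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    # Leave all-zero cells untouched instead of dividing by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return np.log1p(counts / safe_totals * target_sum)


# Toy matrix: 3 cells x 2 genes. Cells 0 and 1 differ only in sequencing depth.
toy_counts = np.array([[1, 3], [10, 30], [0, 0]])
normalized = normalize_and_log(toy_counts)
print(normalized)
```

The first two rows come out identical, which is the point of the normalization: cells with the same expression proportions but different sequencing depth become directly comparable.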
Let's compute a t-SNE embedding to visualize the cells in a 2D plane.
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.tsne(adata)
Now we will assign the predicted labels to the cells and visualize them together with gene expression.
adata.obs = prediction_result
sc.pl.tsne(adata, color=['Major cell types', 'Cell sub types'], wspace=0.4)
We can verify the result with known marker genes. For example, let's check CD4+ Treg cells using the gene FOXP3.
sc.pl.tsne(adata, color=['FOXP3'], color_map='Reds', gene_symbols='Genes')
sc.pl.tsne(adata, color=['Cell sub types'], color_map='Reds', groups="regulatory T cell")
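The visual comparison can be complemented with a simple number: the fraction of cells in each predicted group that express the marker at all. The sketch below uses invented toy values. With real data, the FOXP3 expression vector and the 'Cell sub types' column would take the place of the toy inputs. The function name fraction_expressing is hypothetical.

```python
import pandas as pd


def fraction_expressing(expression, labels):
    """For each label, return the fraction of cells with expression above zero."""
    table = pd.DataFrame({
        'expressed': pd.Series(list(expression)) > 0,
        'label': list(labels),
    })
    return table.groupby('label')['expressed'].mean()


# Invented toy data: six cells, three per group.
toy_expression = [2.1, 0.0, 1.4, 0.0, 0.0, 0.3]
toy_labels = ['regulatory T cell', 'regulatory T cell', 'regulatory T cell',
              'other', 'other', 'other']
print(fraction_expressing(toy_expression, toy_labels))
```

A clearly higher fraction in the group labeled regulatory T cell than elsewhere would be consistent with the prediction, although it is not proof on its own.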
This is the end of this tutorial. We are still working to add more cell types to the list. If you are interested in any particular cell types, please email us at support@bioturing.com.
Thank you!