{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Cells are the basis of life. Many of the world’s greatest challenges, such as developing treatments for cancers or understanding life itself, are fundamentally tied to cells and the roles they play.\n",
"Determining the type of a cell is known as the `Cell Type Prediction problem`, and it has stood as a grand challenge in scRNA-seq analysis for years.\n",
"\n",
"BioTuring Cell Type Prediction addresses this challenge by combining knowledge from high-quality training datasets with the latest advances in deep learning.\n",
"Currently, our prediction tool covers `54 cell types` and `183 sub types`.\n",
"Please refer to this link for a full list of subtypes.\n",
"\n",
"To use BioTuring Cell Type Prediction, you will need an authorized token to make API calls to our server. Please go here and enter your email.\n",
"We will send the token to the email address you entered. Then enter the token in the cell below for later use."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"token = '70d2acfda3a54ca6a4390699394****'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Currently, we support three data formats:\n",
"\n",
"**Zip**: a zipped folder containing 3 files:\n",
"\n",
"- barcodes.(tsv|csv|gz|tar|tar.gz)\n",
"- features.(tsv|csv|gz|tar|tar.gz) or genes.(tsv|csv|gz|tar|tar.gz)\n",
"- matrix.(mtx|gz|tar|tar.gz)\n",
"\n",
"**Hdf5**: an HDF5 file containing 5 keys:\n",
"\n",
"- barcodes\n",
"- genes or features\n",
"- data\n",
"- indices\n",
"- indptr\n",
"\n",
"**Text**: a text file holding the full matrix, with values separated by tabs or commas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this tutorial, we will predict cell types for the dataset from Zemin Zhang et al: `Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing`. Let's download the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.system('wget https://bioturingpublic.s3.us-west-2.amazonaws.com/GSE98638.zip')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloaded file is a zip archive that contains 3 files:\n",
"- barcodes.tsv\n",
"- features.tsv\n",
"- matrix.mtx\n",
"\n",
"Now we are ready to submit the data. First, let's write some functions to handle data submission and status checks."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import time\n",
"\n",
"import pandas as pd\n",
"\n",
"from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor\n",
"from sys import stderr as STREAM\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"from io import StringIO"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def submit_file(species, version, file_path, file_type, shape, token):\n",
"\t# Upload the dataset with its metadata and return the server's JSON response.\n",
"\tfields = {\n",
"\t\t'token': token,\n",
"\t\t'type': file_type,\n",
"\t\t'shape': shape,\n",
"\t\t'species': species,\n",
"\t\t'version': str(version)\n",
"\t}\n",
"\n",
"\tupload_url = 'https://talk2data.bioturing.com/predict/submit'\n",
"\tpath = Path(file_path)\n",
"\ttotal_size = path.stat().st_size\n",
"\tfile_name = path.name\n",
"\n",
"\t# Open the file in a context manager so the handle is closed after the upload.\n",
"\t# The progress bar counts bytes, so the unit is 'B' and tqdm scales it to KB/MB.\n",
"\twith open(path, 'rb') as exp_file, tqdm(\n",
"\t\tdesc=file_name, total=total_size, unit='B', unit_scale=True, unit_divisor=1024,\n",
"\t) as bar:\n",
"\t\tfields['exp_matrix'] = (file_name, exp_file)\n",
"\t\tencoder = MultipartEncoder(fields=fields)\n",
"\t\t# Advance the progress bar by the bytes read since the last callback.\n",
"\t\tmultipart = MultipartEncoderMonitor(\n",
"\t\t\tencoder, lambda monitor: bar.update(monitor.bytes_read - bar.n)\n",
"\t\t)\n",
"\t\theaders = {\n",
"\t\t\t'Content-Type': multipart.content_type\n",
"\t\t}\n",
"\t\treturn requests.post(upload_url, data=multipart, headers=headers).json()\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def get_result(token, project_id):\n",
"\t# Poll the server once a minute. Return the predictions as a DataFrame, or None on failure.\n",
"\trequest_url = 'https://talk2data.bioturing.com/predict/get_result'\n",
"\tdata = {\n",
"\t\t'token': token,\n",
"\t\t'project_id': project_id\n",
"\t}\n",
"\n",
"\tlast_status = []\n",
"\n",
"\twhile True:\n",
"\t\tstatus = requests.post(request_url, json=data).json()\n",
"\t\tif not status or 'status' not in status:\n",
"\t\t\tprint('Internal server error. Please try again later! You don\\'t need to re-submit your data. '\n",
"\t\t\t\t'Call get_result again with the same project_id to get your result.')\n",
"\t\t\treturn None\n",
"\n",
"\t\t# The result is ready: parse the tab-separated table.\n",
"\t\tif 'data' in status:\n",
"\t\t\treturn pd.read_csv(StringIO(status['data']), sep='\\t')\n",
"\n",
"\t\tif 'is_running' not in status:\n",
"\t\t\tprint('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')\n",
"\t\t\treturn None\n",
"\n",
"\t\t# Print only the log lines that have not been shown yet.\n",
"\t\tcurrent_status = status.get('running_status', '').split('\\n')[:-1]\n",
"\t\tnew_status = current_status[len(last_status):]\n",
"\t\tif new_status:\n",
"\t\t\tprint('\\n'.join(new_status))\n",
"\n",
"\t\tlast_status += new_status\n",
"\t\t# The job stopped without producing a result.\n",
"\t\tif not status['is_running']:\n",
"\t\t\tprint('Your process is corrupted. Please check your input data or contact us at support@bioturing.com!')\n",
"\t\t\treturn None\n",
"\n",
"\t\ttime.sleep(60)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `submit_file` function takes 6 input parameters:\n",
"- `species`: The species of the dataset. Currently we support 2 species: `human` and `mouse`\n",
"- `version`: The prediction version (Human: `1`, `2`. Mouse: `1`)\n",
"- `file_path`: The path to the dataset you want to predict\n",
"- `file_type`: The file format of the dataset. It must be one of the three formats we currently support (described above). Available values: `zip`, `hdf5`, `tsv`\n",
"- `shape`: The orientation of the gene expression matrix. `genesxcells`: Choose this if each row of your gene expression matrix represents a gene. `cellsxgenes`: Choose this if each row of your gene expression matrix represents a cell\n",
"- `token`: The authorized token we sent to your email"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this example, the dataset we will submit is a zipped file, and each row of the gene expression matrix represents a gene. Let's submit the file to the BioTuring server, using the token you retrieved above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"submission_status = submit_file(\n",
" species = 'human',\n",
" version = 2,\n",
" file_path='./GSE98638.zip',\n",
" file_type = 'zip',\n",
" shape = 'genesxcells',\n",
" token=token\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'count_projects': '97.0/10000.0',\n",
" 'message': 'Successfully submitted the data!',\n",
" 'project_id': '02397638-90a0-4a9e-853f-97f1971ef8e0',\n",
" 'species': 'human',\n",
" 'status': 200,\n",
" 'version': 2}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission_status"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `status` of `200` means that the data was submitted successfully. Any other value means that the submission failed. If you encounter any problems, feel free to email us at support@bioturing.com."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the current status of the prediction process and write the prediction result to a file for later use."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2022-08-08 04:51:29] Waiting in the queue...(2 left)\n",
"[2022-08-08 04:51:29] Extracting data...\n"
]
}
],
"source": [
"prediction_result = get_result(token, submission_status['project_id'])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Barcodes</th><th>Major cell types</th><th>Cell sub types</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>PTH13</td><td>CD4-positive; alpha-beta T cell</td><td>CD4-positive; CD8-positive; alpha-beta T cell</td></tr>\n",
"<tr><th>1</th><td>PTH1</td><td>CD4-positive; alpha-beta T cell</td><td>CD4-positive; CD8-positive; alpha-beta T cell</td></tr>\n",
"<tr><th>2</th><td>PTH20</td><td>Unassigned</td><td>Unassigned</td></tr>\n",
"<tr><th>3</th><td>PTH26</td><td>CD4-positive; alpha-beta T cell</td><td>CD4-positive; CD8-positive; alpha-beta T cell</td></tr>\n",
"<tr><th>4</th><td>PTH5</td><td>CD4-positive; alpha-beta T cell</td><td>CD4-positive; CD8-positive; alpha-beta T cell</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>5058</th><td>TTR167-0508</td><td>CD4-positive; alpha-beta T cell</td><td>regulatory T cell</td></tr>\n",
"<tr><th>5059</th><td>TTR173-0508</td><td>CD4-positive; alpha-beta T cell</td><td>regulatory T cell</td></tr>\n",
"<tr><th>5060</th><td>TTR176-0508</td><td>CD4-positive; alpha-beta T cell</td><td>regulatory T cell</td></tr>\n",
"<tr><th>5061</th><td>TTR179-0508</td><td>CD4-positive; alpha-beta T cell</td><td>regulatory T cell</td></tr>\n",
"<tr><th>5062</th><td>TTR189-0508</td><td>CD4-positive; alpha-beta T cell</td><td>regulatory T cell</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5063 rows × 3 columns</p>\n", "