{ "cells": [ { "cell_type": "markdown", "id": "920d0893", "metadata": {}, "source": [ "# oggmap: Step 3 - map gene/transcript IDs\n", "\n", "This notebook demonstrates how to match gene or transcript IDs between an orthomap and scRNA data." ] }, { "cell_type": "markdown", "id": "1ef70eb5", "metadata": {}, "source": [ "## Notebook file\n", "\n", "The notebook file can be obtained here:\n", "\n", "[https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/geneset_overlap.ipynb](https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/geneset_overlap.ipynb)" ] }, { "cell_type": "markdown", "id": "a34e9d03", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "69b4df2a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scanpy as sc\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from statannot import add_stat_annotation\n", "# increase dpi\n", "%matplotlib inline\n", "#plt.rcParams['figure.dpi'] = 300\n", "#plt.rcParams['savefig.dpi'] = 300\n", "plt.rcParams['figure.figsize'] = [6, 4.5]\n", "#plt.rcParams['figure.figsize'] = [4.4, 3.3]" ] }, { "cell_type": "markdown", "id": "156ec617", "metadata": {}, "source": [ "## Import oggmap Python package submodules" ] }, { "cell_type": "code", "execution_count": 2, "id": "c6654a1c", "metadata": {}, "outputs": [], "source": [ "# import submodules\n", "from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets" ] }, { "cell_type": "markdown", "id": "e5e67a8d", "metadata": {}, "source": [ "## Step 0, Step 1 and Step 2" ] }, { "cell_type": "markdown", "id": "e326c383", "metadata": {}, "source": [ "To proceed to Step 3, matching gene or transcript IDs, one needs the results from Step 0, Step 1 and Step 2.\n", "\n", "The query species in this part is: __*Danio rerio*__ (zebrafish).\n", "\n", "Please have a look at the documentation of [Step 0 - run 
OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/orthofinder.html) to learn which information and files are mandatory to extract gene age classes from [OrthoFinder](https://github.com/davidemms/OrthoFinder) results.\n", "\n", "[Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) introduced how to extract query lineage information with `oggmap` and the `qlin.get_qlin()` function.\n", "\n", "[Step 2 - gene age class assignment](https://oggmap.readthedocs.io/en/latest/tutorials/get_orthomap.html) introduced how to extract an orthomap (gene age class) from [OrthoFinder](https://github.com/davidemms/OrthoFinder) results with `oggmap` and the `of2orthomap.get_orthomap()` function, or how to import pre-calculated orthomaps with the `orthomap2tei.read_orthomap()` function." ] }, { "cell_type": "markdown", "id": "2f6846a5", "metadata": {}, "source": [ "### Step 0 - run OrthoFinder\n", "\n", "For this part of the documentation, all mandatory [OrthoFinder](https://github.com/davidemms/OrthoFinder) ([Emms and Kelly, 2019](https://doi.org/10.1186/s13059-019-1832-y)) results have been pre-calculated.\n", "\n", "Please have a look at the documentation of [Step 0 - run OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/orthofinder.html) for further insights.\n", "\n", "The results are available here: \n", "\n", "https://doi.org/10.5281/zenodo.7242264\n", "\n", "or can be accessed with the `datasets` submodule of `oggmap`:\n", "\n", "`datasets.ensembl105(datapath='data')` (download folder set to `'data'`)." 
] }, { "cell_type": "code", "execution_count": 3, "id": "d2fe62b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [..........................................................] 15662 / 15662" ] }, { "data": { "text/plain": [ "['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',\n", " 'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',\n", " 'data/ensembl_105_orthofinder_species_list.tsv']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets.ensembl105(datapath='data')" ] }, { "cell_type": "markdown", "id": "61ffce83", "metadata": {}, "source": [ "### Step 1 - get taxonomic information\n", "\n", "Please have a look at the documentation of [Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) for further insights." ] }, { "cell_type": "code", "execution_count": 4, "id": "8de4c664", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "query name: Danio rerio\n", "query taxID: 7955\n", "query kingdom: Eukaryota\n", "query lineage names: \n", "['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']\n", "query lineage: \n", "[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 
30727, 2743709, 2743711, 7954, 7955]\n" ] } ], "source": [ "# get query species taxonomic lineage information\n", "query_lineage = qlin.get_qlin(q='Danio rerio')" ] }, { "cell_type": "markdown", "id": "20f41cb5", "metadata": {}, "source": [ "### Step 2 - gene age class assignment\n", "\n", "Here, `oggmap` uses the query species information and [OrthoFinder](https://github.com/davidemms/OrthoFinder) results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.\n", "\n", "Please have a look at the documentation of [Step 2 - gene age class assignment](https://oggmap.readthedocs.io/en/latest/tutorials/get_orthomap.html) for further insights.\n", "\n", "__Note:__ This step can take up to five minutes, depending on your hardware." ] }, { "cell_type": "markdown", "id": "e2a31e5e", "metadata": {}, "source": [ "To get the query species orthomap in this step, one uses the `of2orthomap.get_orthomap()` function:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e8da7349", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Danio_rerio.GRCz11.cds.longest\n", "Danio rerio\n", "7955\n", " species taxID \\\n", "0 Acanthochromis_polyacanthus.ASM210954v1.cds.lo... 80966 \n", "1 Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon... 211598 \n", "2 Ailuropoda_melanoleuca.ASM200744v2.cds.longest 9646 \n", "3 Amazona_collaria.ASM394721v1.cds.longest 241587 \n", "4 Amphilophus_citrinellus.Midas_v5.cds.longest 61819 \n", ".. ... ... \n", "307 Xiphophorus_couchianus.Xiphophorus_couchianus-... 32473 \n", "308 Xiphophorus_maculatus.X_maculatus-5.0-male.cds... 8083 \n", "309 Zalophus_californianus.mZalCal1.pri.cds.longest 9704 \n", "310 Zonotrichia_albicollis.Zonotrichia_albicollis-... 44394 \n", "311 Zosterops_lateralis_melanops.ASM128173v1.cds.l... 
1220523 \n", "\n", " lineage youngest_common \\\n", "0 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625 \n", "1 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "2 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "3 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "4 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625 \n", ".. ... ... \n", "307 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625 \n", "308 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625 \n", "309 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "310 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "311 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571 \n", "\n", " youngest_name \n", "0 Clupeocephala \n", "1 Euteleostomi \n", "2 Euteleostomi \n", "3 Euteleostomi \n", "4 Clupeocephala \n", ".. ... \n", "307 Clupeocephala \n", "308 Clupeocephala \n", "309 Euteleostomi \n", "310 Euteleostomi \n", "311 Euteleostomi \n", "\n", "[312 rows x 5 columns]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " seqID Orthogroup PSnum PStaxID PSname \\\n", "0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria \n", "1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria \n", "2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria \n", "3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata \n", "4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata \n", "... ... ... ... ... ... \n", "25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala \n", "25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala \n", "25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio \n", "25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio \n", "25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi \n", "\n", " PScontinuity \n", "0 0.846154 \n", "1 0.846154 \n", "2 0.846154 \n", "3 0.909091 \n", "4 0.909091 \n", "... ... \n", "25167 0.400000 \n", "25168 0.400000 \n", "25169 1.000000 \n", "25170 1.000000 \n", "25171 0.222222 \n", "\n", "[25172 rows x 6 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get query species orthomap\n", "\n", "# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264\n", "# or download with datasets.ensembl105('data')\n", "query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(\n", " seqname='Danio_rerio.GRCz11.cds.longest',\n", " qt='7955',\n", " sl='data/ensembl_105_orthofinder_species_list.tsv',\n", " oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',\n", " og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',\n", " continuity=True)\n", "query_orthomap" ] }, { "cell_type": "markdown", "id": "14dafaa6", "metadata": {}, "source": [ "## Step 3 - map OrthoFinder gene names and scRNA gene/transcript names\n", "\n", "To link gene age assignments from an orthomap to the genes or transcripts of an scRNA dataset, one needs to check the overlap of the annotated gene names. 
With the `gtf2t2g` submodule of `oggmap` and the `gtf2t2g.parse_gtf()` function, one can extract gene and transcript names from a given gene feature file (`GTF`)." ] }, { "cell_type": "code", "execution_count": 6, "id": "514f6c2d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [....................................................] 18021890 / 18021890" ] }, { "data": { "text/plain": [ "'data/Danio_rerio.GRCz11.105.gtf.gz'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets.zebrafish_ensembl105_gtf(datapath='data')" ] }, { "cell_type": "code", "execution_count": 7, "id": "1d897e53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "32520 gene_id found\n", "59876 transcript_id found\n", "59876 protein_id found\n", "0 duplicated\n" ] } ], "source": [ "# get gene to transcript table for Danio rerio\n", "\n", "# download zebrafish GTF file here:\n", "# https://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz\n", "# or download with datasets.zebrafish_ensembl105_gtf(datapath='data')\n", "query_species_t2g = gtf2t2g.parse_gtf(\n", " gtf='data/Danio_rerio.GRCz11.105.gtf.gz',\n", " g=True, b=True, p=True, v=True, s=True, q=True)" ] }, { "cell_type": "code", "execution_count": 8, "id": "a7ea998f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gene_id gene_id_version transcript_id \\\n", "0 ENSDARG00000000001 ENSDARG00000000001.6 ENSDART00000000004 \n", "1 ENSDARG00000000002 ENSDARG00000000002.8 ENSDART00000000005 \n", "2 ENSDARG00000000018 ENSDARG00000000018.9 ENSDART00000181044 \n", "3 ENSDARG00000000018 ENSDARG00000000018.9 ENSDART00000138183 \n", "4 ENSDARG00000000019 ENSDARG00000000019.9 ENSDART00000124452 \n", "... ... ... ... \n", "59871 ENSDARG00000117825 ENSDARG00000117825.1 ENSDART00000194739 \n", "59872 ENSDARG00000117826 ENSDARG00000117826.1 ENSDART00000194042 \n", "59873 ENSDARG00000117826 ENSDARG00000117826.1 ENSDART00000194514 \n", "59874 ENSDARG00000117827 ENSDARG00000117827.1 ENSDART00000194378 \n", "59875 ENSDARG00000117827 ENSDARG00000117827.1 ENSDART00000194710 \n", "\n", " transcript_id_version gene_name gene_type protein_id \\\n", "0 ENSDART00000000004.5 slc35a5 protein_coding ENSDARP00000000004 \n", "1 ENSDART00000000005.7 ccdc80 protein_coding ENSDARP00000000005 \n", "2 ENSDART00000181044.1 nrf1 protein_coding ENSDARP00000149440 \n", "3 ENSDART00000138183.2 nrf1 protein_coding ENSDARP00000116798 \n", "4 ENSDART00000124452.3 ube2h protein_coding ENSDARP00000107407 \n", "... ... ... ... ... \n", "59871 ENSDART00000194739.1 CU207269.4 lincRNA None \n", "59872 ENSDART00000194042.1 CR385041.2 lincRNA None \n", "59873 ENSDART00000194514.1 CR385041.2 lincRNA None \n", "59874 ENSDART00000194378.1 CR388164.3 lincRNA None \n", "59875 ENSDART00000194710.1 CR388164.3 lincRNA None \n", "\n", " protein_id_version \n", "0 ENSDARP00000000004.2 \n", "1 ENSDARP00000000005.6 \n", "2 ENSDARP00000149440.1 \n", "3 ENSDARP00000116798.1 \n", "4 ENSDARP00000107407.2 \n", "... ... 
\n", "59871 None \n", "59872 None \n", "59873 None \n", "59874 None \n", "59875 None \n", "\n", "[59876 rows x 8 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_species_t2g" ] }, { "cell_type": "markdown", "id": "67431b35", "metadata": {}, "source": [ "### Import the scRNA dataset of the query species\n", "\n", "Here, the same data is used as in the publications ([Farrell et al., 2018](https://doi.org/10.1126/science.aar3131); [Wagner et al., 2018](https://doi.org/10.1126/science.aar4362); [Qiu et al., 2022](https://doi.org/10.1038/s41588-022-01018-x)).\n", "\n", "scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files so that it can be analysed with e.g. the Python packages scanpy or `oggmap`. It is available here:\n", "\n", "https://doi.org/10.5281/zenodo.7243602\n", "\n", "or can be accessed with the `datasets` submodule of `oggmap`:\n", "\n", "`datasets.qiu22_zebrafish(datapath='data')` (download folder set to `'data'`)." ] }, { "cell_type": "code", "execution_count": 9, "id": "cca377a9", "metadata": {}, "outputs": [], "source": [ "# load scRNA data\n", "\n", "# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602\n", "# or download with datasets.qiu22_zebrafish(datapath='data')\n", "\n", "#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')\n", "zebrafish_data = sc.read('data/zebrafish_data.h5ad')" ] }, { "cell_type": "markdown", "id": "45ea32d0", "metadata": {}, "source": [ "### Get an overview of observations" ] }, { "cell_type": "code", "execution_count": 10, "id": "4e60111f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " orig.ident nCount_RNA nFeature_RNA \\\n", "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC ZFHIGH 5773.0 2570 \n", "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT ZFHIGH 2312.0 1451 \n", "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC ZFHIGH 4180.0 2166 \n", "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN ZFHIGH 6686.0 2845 \n", "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT ZFHIGH 20095.0 4993 \n", "... ... ... ... \n", "hpf24_DEW057_TGACACAACAG_GCCACATC DEW057 3916.0 1328 \n", "hpf24_DEW057_CTTACGGG_AACCTGAC DEW057 5611.0 1700 \n", "hpf24_DEW057_TGAACATCTAT_GACGATGG DEW057 3676.0 1345 \n", "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT DEW057 7021.0 1778 \n", "hpf24_DEW057_ACGTGCTAG_CAAGTCAT DEW057 3378.0 1170 \n", "\n", " sample stage \\\n", "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC ZFHIGH_WT_DS5_AAAAGTTGCCTC hpf3.3 \n", "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT ZFHIGH_WT_DS5_AAACAAGTGTAT hpf3.3 \n", "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC ZFHIGH_WT_DS5_AAACACCTCGTC hpf3.3 \n", "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN ZFHIGH_WT_DS5_AAATGAGGTTTN hpf3.3 \n", "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT ZFHIGH_WT_DS5_AACCCTCTCGAT hpf3.3 \n", "... ... ... \n", "hpf24_DEW057_TGACACAACAG_GCCACATC DEW057_TGACACAACAG_GCCACATC hpf24 \n", "hpf24_DEW057_CTTACGGG_AACCTGAC DEW057_CTTACGGG_AACCTGAC hpf24 \n", "hpf24_DEW057_TGAACATCTAT_GACGATGG DEW057_TGAACATCTAT_GACGATGG hpf24 \n", "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT DEW057_TGAGGTTTCTC_CTCAGAAT hpf24 \n", "hpf24_DEW057_ACGTGCTAG_CAAGTCAT DEW057_ACGTGCTAG_CAAGTCAT hpf24 \n", "\n", " group cell_state \\\n", "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC F_3.3 hpf3.3:blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT F_3.3 hpf3.3:blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC F_3.3 hpf3.3:blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN F_3.3 hpf3.3:blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT F_3.3 hpf3.3:blastomere \n", "... ... ... 
\n", "hpf24_DEW057_TGACACAACAG_GCCACATC batch2 hpf24:midbrain \n", "hpf24_DEW057_CTTACGGG_AACCTGAC batch2 hpf24:pharyngeal arch \n", "hpf24_DEW057_TGAACATCTAT_GACGATGG batch2 hpf24:midbrain \n", "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT batch2 hpf24:optic cup \n", "hpf24_DEW057_ACGTGCTAG_CAAGTCAT batch2 hpf24:hindbrain dorsal \n", "\n", " cell_type \n", "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN blastomere \n", "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT blastomere \n", "... ... \n", "hpf24_DEW057_TGACACAACAG_GCCACATC midbrain \n", "hpf24_DEW057_CTTACGGG_AACCTGAC pharyngeal arch \n", "hpf24_DEW057_TGAACATCTAT_GACGATGG midbrain \n", "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT optic cup \n", "hpf24_DEW057_ACGTGCTAG_CAAGTCAT hindbrain dorsal \n", "\n", "[71203 rows x 8 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zebrafish_data.obs" ] }, { "cell_type": "markdown", "id": "30ce8dab", "metadata": {}, "source": [ "### Helper functions to match gene names\n", "\n", "The `orthomap2tei` submodule contains the `orthomap2tei.geneset_overlap()` helper function to check for gene name overlap between the constructed orthomap from `OrthoFinder` results and a given scRNA dataset." ] }, { "cell_type": "code", "execution_count": 11, "id": "203d2361", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " g1_g2_overlap g1_ratio g2_ratio\n", "0 0 0.0 0.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check overlap of orthomap and scRNA data \n", "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])" ] }, { "cell_type": "markdown", "id": "320b2cd1", "metadata": {}, "source": [ "As one can see, there is no overlap between the `zebrafish_data.var_names` and the sequence IDs from the orthomap so far." ] }, { "cell_type": "code", "execution_count": 12, "id": "22038002", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " g1_g2_overlap g1_ratio g2_ratio\n", "0 20418 1.0 0.62786" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check overlap of transcript table and scRNA data \n", "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_species_t2g['gene_id'])" ] }, { "cell_type": "markdown", "id": "afdbb50c", "metadata": {}, "source": [ "The overlap between `zebrafish_data.var_names` and the `gene_id` column of the imported `GTF` looks better and covers all gene IDs from the scRNA data (`g1_ratio` of 1.0).\n", "\n", "However, the orthomap obtained from [OrthoFinder](https://github.com/davidemms/OrthoFinder) results contains isoform (transcript) IDs.\n", "\n", "The first step in matching the orthomap to the `zebrafish_data.var_names` from the scRNA data is therefore to reduce each isoform ID to its corresponding gene ID." ] }, { "cell_type": "markdown", "id": "c5e297cf", "metadata": {}, "source": [ "### Reduce isoforms to genes\n", "\n", "The `orthomap2tei.replace_by()` helper function can be used to add a new column to the orthomap dataframe by matching e.g. gene isoform names to their corresponding gene names." ] }, { "cell_type": "code", "execution_count": 13, "id": "6de06fc1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " seqID Orthogroup PSnum PStaxID PSname \\\n", "0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria \n", "1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria \n", "2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria \n", "3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata \n", "4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata \n", "... ... ... ... ... ... \n", "25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala \n", "25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala \n", "25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio \n", "25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio \n", "25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi \n", "\n", " PScontinuity \n", "0 0.846154 \n", "1 0.846154 \n", "2 0.846154 \n", "3 0.909091 \n", "4 0.909091 \n", "... ... \n", "25167 0.400000 \n", "25168 0.400000 \n", "25169 1.000000 \n", "25170 1.000000 \n", "25171 0.222222 \n", "\n", "[25172 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_orthomap" ] }, { "cell_type": "code", "execution_count": 14, "id": "b7c1eaf5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['ENSDARG00000002968', 'ENSDARG00000056314', 'ENSDARG00000102274',\n", " 'ENSDARG00000012468', 'ENSDARG00000063621', 'ENSDARG00000044802',\n", " 'ENSDARG00000011410', 'ENSDARG00000041170', 'ENSDARG00000011855',\n", " 'ENSDARG00000103957',\n", " ...\n", " 'ENSDARG00000078476', 'ENSDARG00000058562', 'ENSDARG00000110745',\n", " 'ENSDARG00000114172', 'ENSDARG00000110433', 'ENSDARG00000098193',\n", " 'ENSDARG00000101137', 'ENSDARG00000095817', 'ENSDARG00000079034',\n", " 'ENSDARG00000063372'],\n", " dtype='object', name='index', length=20418)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zebrafish_data.var_names" ] }, { "cell_type": "code", "execution_count": 15, "id": "267829ff", "metadata": {}, "outputs": [], "source": [ "# 
convert orthomap transcript IDs into GeneIDs and add them to orthomap\n", "query_orthomap['geneID'] = orthomap2tei.replace_by(\n", " x_orig = query_orthomap['seqID'],\n", " xmatch = query_species_t2g['transcript_id_version'],\n", " xreplace = query_species_t2g['gene_id'],\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "id": "2e2ef7b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " seqID Orthogroup PSnum PStaxID PSname \\\n", "0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria \n", "1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria \n", "2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria \n", "3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata \n", "4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata \n", "... ... ... ... ... ... \n", "25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala \n", "25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala \n", "25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio \n", "25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio \n", "25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi \n", "\n", " PScontinuity geneID \n", "0 0.846154 ENSDARG00000087544 \n", "1 0.846154 ENSDARG00000095745 \n", "2 0.846154 ENSDARG00000097551 \n", "3 0.909091 ENSDARG00000086420 \n", "4 0.909091 ENSDARG00000086613 \n", "... ... ... \n", "25167 0.400000 ENSDARG00000110427 \n", "25168 0.400000 ENSDARG00000093188 \n", "25169 1.000000 ENSDARG00000069978 \n", "25170 1.000000 ENSDARG00000078193 \n", "25171 0.222222 ENSDARG00000109747 \n", "\n", "[25172 rows x 7 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_orthomap" ] }, { "cell_type": "markdown", "id": "d65e086f", "metadata": {}, "source": [ "Now, each `seqID` (isoform gene ID) is reduced to its gene ID `geneID` and one can check again the overlap." ] }, { "cell_type": "code", "execution_count": 17, "id": "36abdb98", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
g1_g2_overlapg1_ratiog2_ratio
0199820.9786460.793819
\n", "
" ], "text/plain": [ " g1_g2_overlap g1_ratio g2_ratio\n", "0 19982 0.978646 0.793819" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check overlap of orthomap and scRNA data\n", "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['geneID'])" ] }, { "cell_type": "markdown", "id": "64828d79", "metadata": {}, "source": [ "The created orthomap can be stored as a separated file like:" ] }, { "cell_type": "raw", "id": "3fa23ef7", "metadata": {}, "source": [ "query_orthomap.to_csv('./data/zebrafish_ensembl_105_orthomap.tsv', sep='\\t', index=False)" ] }, { "cell_type": "markdown", "id": "120356ab", "metadata": {}, "source": [ "To re-use the saved orthomap, so that one does not need to repeat Step 1 and Step 2, one could load it with `orthomap2tei.read_orthomap()` function." ] }, { "cell_type": "raw", "id": "709d87a1", "metadata": {}, "source": [ "zebrafish_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_105_orthomap.tsv')" ] }, { "cell_type": "markdown", "id": "52002dfd", "metadata": {}, "source": [ "If you like to continue, please have a look at the documentation of [Step 4 - TEI calculation](https://oggmap.readthedocs.io/en/latest/tutorials/add_tei.html) to get further insides." ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:scanpy]", "language": "python", "name": "conda-env-scanpy-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 5 }