{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "920d0893",
   "metadata": {},
   "source": [
    "# oggmap: Step 3 - map gene/transcript IDs\n",
    "\n",
    "This notebook will demonstrate how to match gene or transcript IDs between an orthomap and scRNA data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ef70eb5",
   "metadata": {},
   "source": [
    "## Notebook file\n",
    "\n",
    "The notebook file can be obtained here:\n",
    "\n",
    "[https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/geneset_overlap.ipynb](https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/geneset_overlap.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a34e9d03",
   "metadata": {},
   "source": [
    "## Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "69b4df2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scanpy as sc\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from statannot import add_stat_annotation\n",
    "# increase dpi\n",
    "%matplotlib inline\n",
    "#plt.rcParams['figure.dpi'] = 300\n",
    "#plt.rcParams['savefig.dpi'] = 300\n",
    "plt.rcParams['figure.figsize'] = [6, 4.5]\n",
    "#plt.rcParams['figure.figsize'] = [4.4, 3.3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "156ec617",
   "metadata": {},
   "source": [
    "## Import oggmap python package submodules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c6654a1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import submodules\n",
    "from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5e67a8d",
   "metadata": {},
   "source": [
    "## Step 0, Step 1 and Step 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e326c383",
   "metadata": {},
   "source": [
    "In order to come to Step 3, matching gene or transcript IDs, one needs to have the results from Step 0, Step 1 and Step 2.\n",
    "\n",
    "The query species in this part is: __*Danio rerio*__ (zebrafish).\n",
    "\n",
    "Please have a look at the documentation of [Step 0 - run OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/orthofinder.html) to get to know what information and files are mandatory to extract gene age classes from [OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/https://github.com/davidemms/OrthoFinder) results.\n",
    "\n",
    "In [Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) you have already been introduced how to extract query lineage information with `oggmap` and the `qlin.get_qlin()` function.\n",
    "\n",
    "In [Step 2 - gene age class assignment](https://oggmap.readthedocs.io/en/latest/tutorials/get_orthomap.html) you have already been introduced how to extract an orthomap (gene age class) from [OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/https://github.com/davidemms/OrthoFinder) results with `oggmap` and the `of2orthomap.get_orthomap()` function or how to import pre-calculated orthomaps with the `orthomap2tei.read_orthomap()` function."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f6846a5",
   "metadata": {},
   "source": [
    "### Step 0 - run OrthoFinder\n",
    "\n",
    "For this part of the documentation, all mandatory [OrthoFinder](https://github.com/davidemms/OrthoFinder) ([Emms and Kelly, 2019](https://doi.org/10.1186/s13059-019-1832-y)) results have been pre-calculated.\n",
    "\n",
    "Please have a look at the documentation of [Step 0 - run OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/orthofinder.html) for further insights.\n",
    "\n",
    "The results are available here:\n",
    "\n",
    "https://doi.org/10.5281/zenodo.7242264\n",
    "\n",
    "or can be accessed with the `datasets` submodule of `oggmap`:\n",
    "\n",
    "`datasets.ensembl105(datapath='data')` (download folder set to `'data'`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d2fe62b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100% [..........................................................] 15662 / 15662"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',\n",
       " 'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',\n",
       " 'data/ensembl_105_orthofinder_species_list.tsv']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datasets.ensembl105(datapath='data')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ffce83",
   "metadata": {},
   "source": [
    "### Step 1 - get taxonomic information\n",
    "\n",
    "Please have a look at the documentation of [Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) to get further insides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8de4c664",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query name: Danio rerio\n",
      "query taxID: 7955\n",
      "query kingdom: Eukaryota\n",
      "query lineage names: \n",
      "['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']\n",
      "query lineage: \n",
      "[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]\n"
     ]
    }
   ],
   "source": [
    "# get query species taxonomic lineage information\n",
    "query_lineage = qlin.get_qlin(q='Danio rerio')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20f41cb5",
   "metadata": {},
   "source": [
    "### Step 2 - gene age class assignment\n",
    "\n",
    "Here, `oggmap` use the query species information and [OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/https://github.com/davidemms/OrthoFinder) results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.\n",
    "\n",
    "Please have a look at the documentation of [Step 2 - gene age class assignment](https://oggmap.readthedocs.io/en/latest/tutorials/get_orthomap.html) to get further insides.\n",
    "\n",
    "__Note:__ This step can take up to five minutes, depending on your hardware."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2a31e5e",
   "metadata": {},
   "source": [
    "For this step to get the query species `oggmap`, one uses the `of2orthomap.get_orthomap()` function, like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e8da7349",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Danio_rerio.GRCz11.cds.longest\n",
      "Danio rerio\n",
      "7955\n",
      "                                               species    taxID  \\\n",
      "0    Acanthochromis_polyacanthus.ASM210954v1.cds.lo...    80966   \n",
      "1    Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon...   211598   \n",
      "2       Ailuropoda_melanoleuca.ASM200744v2.cds.longest     9646   \n",
      "3             Amazona_collaria.ASM394721v1.cds.longest   241587   \n",
      "4         Amphilophus_citrinellus.Midas_v5.cds.longest    61819   \n",
      "..                                                 ...      ...   \n",
      "307  Xiphophorus_couchianus.Xiphophorus_couchianus-...    32473   \n",
      "308  Xiphophorus_maculatus.X_maculatus-5.0-male.cds...     8083   \n",
      "309    Zalophus_californianus.mZalCal1.pri.cds.longest     9704   \n",
      "310  Zonotrichia_albicollis.Zonotrichia_albicollis-...    44394   \n",
      "311  Zosterops_lateralis_melanops.ASM128173v1.cds.l...  1220523   \n",
      "\n",
      "                                               lineage  youngest_common  \\\n",
      "0    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625   \n",
      "1    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "2    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "3    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "4    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625   \n",
      "..                                                 ...              ...   \n",
      "307  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625   \n",
      "308  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625   \n",
      "309  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "310  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "311  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571   \n",
      "\n",
      "     youngest_name  \n",
      "0    Clupeocephala  \n",
      "1     Euteleostomi  \n",
      "2     Euteleostomi  \n",
      "3     Euteleostomi  \n",
      "4    Clupeocephala  \n",
      "..             ...  \n",
      "307  Clupeocephala  \n",
      "308  Clupeocephala  \n",
      "309   Euteleostomi  \n",
      "310   Euteleostomi  \n",
      "311   Euteleostomi  \n",
      "\n",
      "[312 rows x 5 columns]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seqID</th>\n",
       "      <th>Orthogroup</th>\n",
       "      <th>PSnum</th>\n",
       "      <th>PStaxID</th>\n",
       "      <th>PSname</th>\n",
       "      <th>PScontinuity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSDART00000127643.3</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSDART00000171750.2</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSDART00000190648.1</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSDART00000130167.3</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSDART00000150909.2</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25167</th>\n",
       "      <td>ENSDART00000180796.1</td>\n",
       "      <td>OG0029510</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25168</th>\n",
       "      <td>ENSDART00000145618.2</td>\n",
       "      <td>OG0029511</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25169</th>\n",
       "      <td>ENSDART00000143229.2</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25170</th>\n",
       "      <td>ENSDART00000143837.3</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25171</th>\n",
       "      <td>ENSDART00000180573.1</td>\n",
       "      <td>OG0029513</td>\n",
       "      <td>13</td>\n",
       "      <td>117571</td>\n",
       "      <td>Euteleostomi</td>\n",
       "      <td>0.222222</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>25172 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      seqID Orthogroup  PSnum PStaxID         PSname  \\\n",
       "0      ENSDART00000127643.3  OG0000000      6   33213      Bilateria   \n",
       "1      ENSDART00000171750.2  OG0000000      6   33213      Bilateria   \n",
       "2      ENSDART00000190648.1  OG0000000      6   33213      Bilateria   \n",
       "3      ENSDART00000130167.3  OG0000001     10    7742     Vertebrata   \n",
       "4      ENSDART00000150909.2  OG0000001     10    7742     Vertebrata   \n",
       "...                     ...        ...    ...     ...            ...   \n",
       "25167  ENSDART00000180796.1  OG0029510     19  186625  Clupeocephala   \n",
       "25168  ENSDART00000145618.2  OG0029511     19  186625  Clupeocephala   \n",
       "25169  ENSDART00000143229.2  OG0029512     29    7955    Danio rerio   \n",
       "25170  ENSDART00000143837.3  OG0029512     29    7955    Danio rerio   \n",
       "25171  ENSDART00000180573.1  OG0029513     13  117571   Euteleostomi   \n",
       "\n",
       "       PScontinuity  \n",
       "0          0.846154  \n",
       "1          0.846154  \n",
       "2          0.846154  \n",
       "3          0.909091  \n",
       "4          0.909091  \n",
       "...             ...  \n",
       "25167      0.400000  \n",
       "25168      0.400000  \n",
       "25169      1.000000  \n",
       "25170      1.000000  \n",
       "25171      0.222222  \n",
       "\n",
       "[25172 rows x 6 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get query species orthomap\n",
    "\n",
    "# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264\n",
    "# or download with datasets.ensembl105('data')\n",
    "query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(\n",
    "    seqname='Danio_rerio.GRCz11.cds.longest',\n",
    "    qt='7955',\n",
    "    sl='data/ensembl_105_orthofinder_species_list.tsv',\n",
    "    oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',\n",
    "    og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',\n",
    "    continuity=True)\n",
    "query_orthomap"
   ]
  },
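  {
   "cell_type": "markdown",
   "id": "7c1d2e90",
   "metadata": {},
   "source": [
    "As a quick check of the result, the number of sequences per gene age class can be tabulated. The cell below is a minimal sketch and not part of the original tutorial; it assumes only that `query_orthomap` is the pandas DataFrame shown above, with the columns `PSnum` and `PSname`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a9f03b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch only: relies on the column names shown in the query_orthomap output above\n",
    "# count sequences per gene age class (PSnum = class index, PSname = tree node name)\n",
    "ps_counts = query_orthomap.groupby(['PSnum', 'PSname']).size()\n",
    "ps_counts"
   ]
  },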
  {
   "cell_type": "markdown",
   "id": "14dafaa6",
   "metadata": {},
   "source": [
    "## Step 3 - map OrthoFinder gene names and scRNA gene/transcript names\n",
    "\n",
    "To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the `gtf2t2g` submodule of `oggmap` and the `gtf2t2g.parse_gtf()` function, one can extract gene and transcript names from a given gene feature file (`GTF`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "514f6c2d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100% [....................................................] 18021890 / 18021890"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'data/Danio_rerio.GRCz11.105.gtf.gz'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datasets.zebrafish_ensembl105_gtf(datapath='data')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1d897e53",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32520 gene_id found\n",
      "59876 transcript_id found\n",
      "59876 protein_id found\n",
      "0 duplicated\n"
     ]
    }
   ],
   "source": [
    "# get gene to transcript table for Danio rerio\n",
    "\n",
    "# download zebrafish GTF file here:\n",
    "# https://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz\n",
    "# or download with datasets.zebrafish_ensembl105_gtf(datapath='data')\n",
    "query_species_t2g = gtf2t2g.parse_gtf(\n",
    "    gtf='data/Danio_rerio.GRCz11.105.gtf.gz',\n",
    "    g=True, b=True, p=True, v=True, s=True, q=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a7ea998f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_id</th>\n",
       "      <th>gene_id_version</th>\n",
       "      <th>transcript_id</th>\n",
       "      <th>transcript_id_version</th>\n",
       "      <th>gene_name</th>\n",
       "      <th>gene_type</th>\n",
       "      <th>protein_id</th>\n",
       "      <th>protein_id_version</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSDARG00000000001</td>\n",
       "      <td>ENSDARG00000000001.6</td>\n",
       "      <td>ENSDART00000000004</td>\n",
       "      <td>ENSDART00000000004.5</td>\n",
       "      <td>slc35a5</td>\n",
       "      <td>protein_coding</td>\n",
       "      <td>ENSDARP00000000004</td>\n",
       "      <td>ENSDARP00000000004.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSDARG00000000002</td>\n",
       "      <td>ENSDARG00000000002.8</td>\n",
       "      <td>ENSDART00000000005</td>\n",
       "      <td>ENSDART00000000005.7</td>\n",
       "      <td>ccdc80</td>\n",
       "      <td>protein_coding</td>\n",
       "      <td>ENSDARP00000000005</td>\n",
       "      <td>ENSDARP00000000005.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSDARG00000000018</td>\n",
       "      <td>ENSDARG00000000018.9</td>\n",
       "      <td>ENSDART00000181044</td>\n",
       "      <td>ENSDART00000181044.1</td>\n",
       "      <td>nrf1</td>\n",
       "      <td>protein_coding</td>\n",
       "      <td>ENSDARP00000149440</td>\n",
       "      <td>ENSDARP00000149440.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSDARG00000000018</td>\n",
       "      <td>ENSDARG00000000018.9</td>\n",
       "      <td>ENSDART00000138183</td>\n",
       "      <td>ENSDART00000138183.2</td>\n",
       "      <td>nrf1</td>\n",
       "      <td>protein_coding</td>\n",
       "      <td>ENSDARP00000116798</td>\n",
       "      <td>ENSDARP00000116798.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSDARG00000000019</td>\n",
       "      <td>ENSDARG00000000019.9</td>\n",
       "      <td>ENSDART00000124452</td>\n",
       "      <td>ENSDART00000124452.3</td>\n",
       "      <td>ube2h</td>\n",
       "      <td>protein_coding</td>\n",
       "      <td>ENSDARP00000107407</td>\n",
       "      <td>ENSDARP00000107407.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59871</th>\n",
       "      <td>ENSDARG00000117825</td>\n",
       "      <td>ENSDARG00000117825.1</td>\n",
       "      <td>ENSDART00000194739</td>\n",
       "      <td>ENSDART00000194739.1</td>\n",
       "      <td>CU207269.4</td>\n",
       "      <td>lincRNA</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59872</th>\n",
       "      <td>ENSDARG00000117826</td>\n",
       "      <td>ENSDARG00000117826.1</td>\n",
       "      <td>ENSDART00000194042</td>\n",
       "      <td>ENSDART00000194042.1</td>\n",
       "      <td>CR385041.2</td>\n",
       "      <td>lincRNA</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59873</th>\n",
       "      <td>ENSDARG00000117826</td>\n",
       "      <td>ENSDARG00000117826.1</td>\n",
       "      <td>ENSDART00000194514</td>\n",
       "      <td>ENSDART00000194514.1</td>\n",
       "      <td>CR385041.2</td>\n",
       "      <td>lincRNA</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59874</th>\n",
       "      <td>ENSDARG00000117827</td>\n",
       "      <td>ENSDARG00000117827.1</td>\n",
       "      <td>ENSDART00000194378</td>\n",
       "      <td>ENSDART00000194378.1</td>\n",
       "      <td>CR388164.3</td>\n",
       "      <td>lincRNA</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59875</th>\n",
       "      <td>ENSDARG00000117827</td>\n",
       "      <td>ENSDARG00000117827.1</td>\n",
       "      <td>ENSDART00000194710</td>\n",
       "      <td>ENSDART00000194710.1</td>\n",
       "      <td>CR388164.3</td>\n",
       "      <td>lincRNA</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>59876 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  gene_id       gene_id_version       transcript_id  \\\n",
       "0      ENSDARG00000000001  ENSDARG00000000001.6  ENSDART00000000004   \n",
       "1      ENSDARG00000000002  ENSDARG00000000002.8  ENSDART00000000005   \n",
       "2      ENSDARG00000000018  ENSDARG00000000018.9  ENSDART00000181044   \n",
       "3      ENSDARG00000000018  ENSDARG00000000018.9  ENSDART00000138183   \n",
       "4      ENSDARG00000000019  ENSDARG00000000019.9  ENSDART00000124452   \n",
       "...                   ...                   ...                 ...   \n",
       "59871  ENSDARG00000117825  ENSDARG00000117825.1  ENSDART00000194739   \n",
       "59872  ENSDARG00000117826  ENSDARG00000117826.1  ENSDART00000194042   \n",
       "59873  ENSDARG00000117826  ENSDARG00000117826.1  ENSDART00000194514   \n",
       "59874  ENSDARG00000117827  ENSDARG00000117827.1  ENSDART00000194378   \n",
       "59875  ENSDARG00000117827  ENSDARG00000117827.1  ENSDART00000194710   \n",
       "\n",
       "      transcript_id_version   gene_name       gene_type          protein_id  \\\n",
       "0      ENSDART00000000004.5     slc35a5  protein_coding  ENSDARP00000000004   \n",
       "1      ENSDART00000000005.7      ccdc80  protein_coding  ENSDARP00000000005   \n",
       "2      ENSDART00000181044.1        nrf1  protein_coding  ENSDARP00000149440   \n",
       "3      ENSDART00000138183.2        nrf1  protein_coding  ENSDARP00000116798   \n",
       "4      ENSDART00000124452.3       ube2h  protein_coding  ENSDARP00000107407   \n",
       "...                     ...         ...             ...                 ...   \n",
       "59871  ENSDART00000194739.1  CU207269.4         lincRNA                None   \n",
       "59872  ENSDART00000194042.1  CR385041.2         lincRNA                None   \n",
       "59873  ENSDART00000194514.1  CR385041.2         lincRNA                None   \n",
       "59874  ENSDART00000194378.1  CR388164.3         lincRNA                None   \n",
       "59875  ENSDART00000194710.1  CR388164.3         lincRNA                None   \n",
       "\n",
       "         protein_id_version  \n",
       "0      ENSDARP00000000004.2  \n",
       "1      ENSDARP00000000005.6  \n",
       "2      ENSDARP00000149440.1  \n",
       "3      ENSDARP00000116798.1  \n",
       "4      ENSDARP00000107407.2  \n",
       "...                     ...  \n",
       "59871                  None  \n",
       "59872                  None  \n",
       "59873                  None  \n",
       "59874                  None  \n",
       "59875                  None  \n",
       "\n",
       "[59876 rows x 8 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_species_t2g"
   ]
  },
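  {
   "cell_type": "markdown",
   "id": "d40b6e17",
   "metadata": {},
   "source": [
    "Because the orthomap `seqID` column holds versioned transcript IDs (e.g. `ENSDART00000127643.3`), it can be compared directly with the `transcript_id_version` column of the table above. The following cell is a minimal sketch of such a check and not part of the original tutorial; it assumes only the column names shown in the two outputs above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e2c81fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch only: count how many orthomap seqIDs are present in the GTF-derived table\n",
    "orthomap_ids = set(query_orthomap['seqID'])\n",
    "t2g_ids = set(query_species_t2g['transcript_id_version'])\n",
    "shared_ids = orthomap_ids & t2g_ids\n",
    "print(f'{len(shared_ids)} of {len(orthomap_ids)} orthomap seqIDs found in the GTF table')"
   ]
  },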
  {
   "cell_type": "markdown",
   "id": "67431b35",
   "metadata": {},
   "source": [
    "### Import the scRNA dataset of the query species\n",
    "\n",
    "Here, the data from the following publications are used ([Farrell et al., 2018](https://doi.org/10.1126/science.aar3131); [Wagner et al., 2018](https://doi.org/10.1126/science.aar4362); [Qiu et al., 2022](https://doi.org/10.1038/s41588-022-01018-x)).\n",
    "\n",
    "The scRNA data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object, and converted into loom and AnnData (h5ad) files so that they can be analysed with e.g. the python scanpy or `oggmap` packages. The data are available here:\n",
    "\n",
    "https://doi.org/10.5281/zenodo.7243602\n",
    "\n",
    "or can be accessed with the `datasets` submodule of `oggmap`:\n",
    "\n",
    "`datasets.qiu22_zebrafish(datapath='data')` (download folder set to `'data'`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cca377a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load scRNA data\n",
    "\n",
    "# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602\n",
    "# or download with datasets.qiu22_zebrafish(datapath='data')\n",
    "\n",
    "#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')\n",
    "zebrafish_data = sc.read('data/zebrafish_data.h5ad')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45ea32d0",
   "metadata": {},
   "source": [
    "### Get an overview of observations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "4e60111f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>orig.ident</th>\n",
       "      <th>nCount_RNA</th>\n",
       "      <th>nFeature_RNA</th>\n",
       "      <th>sample</th>\n",
       "      <th>stage</th>\n",
       "      <th>group</th>\n",
       "      <th>cell_state</th>\n",
       "      <th>cell_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC</th>\n",
       "      <td>ZFHIGH</td>\n",
       "      <td>5773.0</td>\n",
       "      <td>2570</td>\n",
       "      <td>ZFHIGH_WT_DS5_AAAAGTTGCCTC</td>\n",
       "      <td>hpf3.3</td>\n",
       "      <td>F_3.3</td>\n",
       "      <td>hpf3.3:blastomere</td>\n",
       "      <td>blastomere</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT</th>\n",
       "      <td>ZFHIGH</td>\n",
       "      <td>2312.0</td>\n",
       "      <td>1451</td>\n",
       "      <td>ZFHIGH_WT_DS5_AAACAAGTGTAT</td>\n",
       "      <td>hpf3.3</td>\n",
       "      <td>F_3.3</td>\n",
       "      <td>hpf3.3:blastomere</td>\n",
       "      <td>blastomere</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC</th>\n",
       "      <td>ZFHIGH</td>\n",
       "      <td>4180.0</td>\n",
       "      <td>2166</td>\n",
       "      <td>ZFHIGH_WT_DS5_AAACACCTCGTC</td>\n",
       "      <td>hpf3.3</td>\n",
       "      <td>F_3.3</td>\n",
       "      <td>hpf3.3:blastomere</td>\n",
       "      <td>blastomere</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN</th>\n",
       "      <td>ZFHIGH</td>\n",
       "      <td>6686.0</td>\n",
       "      <td>2845</td>\n",
       "      <td>ZFHIGH_WT_DS5_AAATGAGGTTTN</td>\n",
       "      <td>hpf3.3</td>\n",
       "      <td>F_3.3</td>\n",
       "      <td>hpf3.3:blastomere</td>\n",
       "      <td>blastomere</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT</th>\n",
       "      <td>ZFHIGH</td>\n",
       "      <td>20095.0</td>\n",
       "      <td>4993</td>\n",
       "      <td>ZFHIGH_WT_DS5_AACCCTCTCGAT</td>\n",
       "      <td>hpf3.3</td>\n",
       "      <td>F_3.3</td>\n",
       "      <td>hpf3.3:blastomere</td>\n",
       "      <td>blastomere</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf24_DEW057_TGACACAACAG_GCCACATC</th>\n",
       "      <td>DEW057</td>\n",
       "      <td>3916.0</td>\n",
       "      <td>1328</td>\n",
       "      <td>DEW057_TGACACAACAG_GCCACATC</td>\n",
       "      <td>hpf24</td>\n",
       "      <td>batch2</td>\n",
       "      <td>hpf24:midbrain</td>\n",
       "      <td>midbrain</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf24_DEW057_CTTACGGG_AACCTGAC</th>\n",
       "      <td>DEW057</td>\n",
       "      <td>5611.0</td>\n",
       "      <td>1700</td>\n",
       "      <td>DEW057_CTTACGGG_AACCTGAC</td>\n",
       "      <td>hpf24</td>\n",
       "      <td>batch2</td>\n",
       "      <td>hpf24:pharyngeal arch</td>\n",
       "      <td>pharyngeal arch</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf24_DEW057_TGAACATCTAT_GACGATGG</th>\n",
       "      <td>DEW057</td>\n",
       "      <td>3676.0</td>\n",
       "      <td>1345</td>\n",
       "      <td>DEW057_TGAACATCTAT_GACGATGG</td>\n",
       "      <td>hpf24</td>\n",
       "      <td>batch2</td>\n",
       "      <td>hpf24:midbrain</td>\n",
       "      <td>midbrain</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT</th>\n",
       "      <td>DEW057</td>\n",
       "      <td>7021.0</td>\n",
       "      <td>1778</td>\n",
       "      <td>DEW057_TGAGGTTTCTC_CTCAGAAT</td>\n",
       "      <td>hpf24</td>\n",
       "      <td>batch2</td>\n",
       "      <td>hpf24:optic cup</td>\n",
       "      <td>optic cup</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hpf24_DEW057_ACGTGCTAG_CAAGTCAT</th>\n",
       "      <td>DEW057</td>\n",
       "      <td>3378.0</td>\n",
       "      <td>1170</td>\n",
       "      <td>DEW057_ACGTGCTAG_CAAGTCAT</td>\n",
       "      <td>hpf24</td>\n",
       "      <td>batch2</td>\n",
       "      <td>hpf24:hindbrain dorsal</td>\n",
       "      <td>hindbrain dorsal</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>71203 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                  orig.ident  nCount_RNA  nFeature_RNA  \\\n",
       "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC     ZFHIGH      5773.0          2570   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT     ZFHIGH      2312.0          1451   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC     ZFHIGH      4180.0          2166   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN     ZFHIGH      6686.0          2845   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT     ZFHIGH     20095.0          4993   \n",
       "...                                      ...         ...           ...   \n",
       "hpf24_DEW057_TGACACAACAG_GCCACATC     DEW057      3916.0          1328   \n",
       "hpf24_DEW057_CTTACGGG_AACCTGAC        DEW057      5611.0          1700   \n",
       "hpf24_DEW057_TGAACATCTAT_GACGATGG     DEW057      3676.0          1345   \n",
       "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT     DEW057      7021.0          1778   \n",
       "hpf24_DEW057_ACGTGCTAG_CAAGTCAT       DEW057      3378.0          1170   \n",
       "\n",
       "                                                        sample   stage  \\\n",
       "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC   ZFHIGH_WT_DS5_AAAAGTTGCCTC  hpf3.3   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT   ZFHIGH_WT_DS5_AAACAAGTGTAT  hpf3.3   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC   ZFHIGH_WT_DS5_AAACACCTCGTC  hpf3.3   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN   ZFHIGH_WT_DS5_AAATGAGGTTTN  hpf3.3   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT   ZFHIGH_WT_DS5_AACCCTCTCGAT  hpf3.3   \n",
       "...                                                        ...     ...   \n",
       "hpf24_DEW057_TGACACAACAG_GCCACATC  DEW057_TGACACAACAG_GCCACATC   hpf24   \n",
       "hpf24_DEW057_CTTACGGG_AACCTGAC        DEW057_CTTACGGG_AACCTGAC   hpf24   \n",
       "hpf24_DEW057_TGAACATCTAT_GACGATGG  DEW057_TGAACATCTAT_GACGATGG   hpf24   \n",
       "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT  DEW057_TGAGGTTTCTC_CTCAGAAT   hpf24   \n",
       "hpf24_DEW057_ACGTGCTAG_CAAGTCAT      DEW057_ACGTGCTAG_CAAGTCAT   hpf24   \n",
       "\n",
       "                                    group              cell_state  \\\n",
       "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC   F_3.3       hpf3.3:blastomere   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT   F_3.3       hpf3.3:blastomere   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC   F_3.3       hpf3.3:blastomere   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN   F_3.3       hpf3.3:blastomere   \n",
       "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT   F_3.3       hpf3.3:blastomere   \n",
       "...                                   ...                     ...   \n",
       "hpf24_DEW057_TGACACAACAG_GCCACATC  batch2          hpf24:midbrain   \n",
       "hpf24_DEW057_CTTACGGG_AACCTGAC     batch2   hpf24:pharyngeal arch   \n",
       "hpf24_DEW057_TGAACATCTAT_GACGATGG  batch2          hpf24:midbrain   \n",
       "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT  batch2         hpf24:optic cup   \n",
       "hpf24_DEW057_ACGTGCTAG_CAAGTCAT    batch2  hpf24:hindbrain dorsal   \n",
       "\n",
       "                                          cell_type  \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC        blastomere  \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT        blastomere  \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC        blastomere  \n",
       "hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN        blastomere  \n",
       "hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT        blastomere  \n",
       "...                                             ...  \n",
       "hpf24_DEW057_TGACACAACAG_GCCACATC          midbrain  \n",
       "hpf24_DEW057_CTTACGGG_AACCTGAC      pharyngeal arch  \n",
       "hpf24_DEW057_TGAACATCTAT_GACGATGG          midbrain  \n",
       "hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT         optic cup  \n",
       "hpf24_DEW057_ACGTGCTAG_CAAGTCAT    hindbrain dorsal  \n",
       "\n",
       "[71203 rows x 8 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zebrafish_data.obs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30ce8dab",
   "metadata": {},
   "source": [
    "### Helper functions to match gene names\n",
    "\n",
    "The `orthomap2tei` submodule contains the `orthomap2tei.geneset_overlap()` helper function to check for gene name overlap between the constructed orthomap from `OrthoFinder` results and a given scRNA dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "203d2361",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>g1_g2_overlap</th>\n",
       "      <th>g1_ratio</th>\n",
       "      <th>g2_ratio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   g1_g2_overlap  g1_ratio  g2_ratio\n",
       "0              0       0.0       0.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check overlap of orthomap <seqID> and scRNA data <var_names>\n",
    "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "320b2cd1",
   "metadata": {},
   "source": [
    "As the table shows, there is no overlap so far: `zebrafish_data.var_names` holds Ensembl gene IDs (`ENSDARG...`), whereas the `seqID` column of the orthomap holds versioned transcript IDs (`ENSDART...`)."
   ]
  },
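  {
   "cell_type": "markdown",
   "id": "7b1e4c2a",
   "metadata": {},
   "source": [
    "To make the three columns of this table concrete, here is a minimal sketch of how such an overlap summary could be computed with plain Python sets. It is not the `oggmap` implementation: `overlap_sketch` is a hypothetical helper, the IDs are made up, and the assumption that each ratio is the number of shared IDs divided by the number of unique IDs in the respective input is only inferred from the outputs shown in this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def overlap_sketch(g1, g2):\n",
    "    # Hypothetical helper for illustration, not part of oggmap.\n",
    "    s1, s2 = set(g1), set(g2)\n",
    "    shared = s1 & s2\n",
    "    return pd.DataFrame({\n",
    "        'g1_g2_overlap': [len(shared)],\n",
    "        # Assumed definition: shared IDs / unique IDs of each input.\n",
    "        'g1_ratio': [len(shared) / len(s1) if s1 else 0.0],\n",
    "        'g2_ratio': [len(shared) / len(s2) if s2 else 0.0],\n",
    "    })\n",
    "\n",
    "\n",
    "# Made-up IDs: two of the four entries in g1 also occur in g2.\n",
    "toy_g1 = ['G1', 'G2', 'G3', 'G4']\n",
    "toy_g2 = ['G1', 'G3', 'G5']\n",
    "result = overlap_sketch(toy_g1, toy_g2)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "Under this reading, a `g1_ratio` of 1.0 means that every ID of the first input was found in the second one."
   ]
  },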
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "22038002",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>g1_g2_overlap</th>\n",
       "      <th>g1_ratio</th>\n",
       "      <th>g2_ratio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20418</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.62786</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   g1_g2_overlap  g1_ratio  g2_ratio\n",
       "0          20418       1.0   0.62786"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check overlap of transcript table <gene_id> and scRNA data <var_names>\n",
    "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_species_t2g['gene_id'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afdbb50c",
   "metadata": {},
   "source": [
    "The overlap between `zebrafish_data.var_names` and the `gene_id` column of the transcript table imported from the `GTF` file looks better: it covers all gene IDs from the scRNA data.\n",
    "\n",
    "However, the orthomap obtained from [OrthoFinder](https://github.com/davidemms/OrthoFinder) results contains transcript (isoform) IDs rather than gene IDs.\n",
    "\n",
    "The first step in matching the orthomap to `zebrafish_data.var_names` is therefore to reduce each transcript ID to its corresponding gene ID."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5e297cf",
   "metadata": {},
   "source": [
    "### Reduce isoforms to genes\n",
    "\n",
    "The `orthomap2tei.replace_by()` helper function can be used to add a new column to the orthomap dataframe, e.g. by matching transcript (isoform) IDs to their corresponding gene IDs."
   ]
  },
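  {
   "cell_type": "markdown",
   "id": "c49d0e71",
   "metadata": {},
   "source": [
    "The matching step can be pictured as a lookup from transcript ID to gene ID. The following is a minimal sketch with plain pandas, not the library's own code: the toy IDs and the names `toy_t2g`, `toy_orthomap` and `lookup` are made up for illustration, and how `replace_by()` itself treats IDs without a match is not covered here.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy transcript-to-gene table (made-up IDs).\n",
    "toy_t2g = pd.DataFrame({\n",
    "    'transcript_id_version': ['T1.1', 'T2.1', 'T3.2'],\n",
    "    'gene_id': ['G1', 'G1', 'G2'],\n",
    "})\n",
    "\n",
    "# Toy orthomap with one transcript ID that is absent from the table.\n",
    "toy_orthomap = pd.DataFrame({'seqID': ['T3.2', 'T1.1', 'T9.9']})\n",
    "\n",
    "# Build the lookup and map every seqID to its gene ID.\n",
    "lookup = dict(zip(toy_t2g['transcript_id_version'], toy_t2g['gene_id']))\n",
    "toy_orthomap['geneID'] = toy_orthomap['seqID'].map(lookup)\n",
    "print(toy_orthomap)\n",
    "```\n",
    "\n",
    "In this sketch, a transcript ID that is missing from the lookup ends up as `NaN` in the new column, which makes unmatched entries easy to spot."
   ]
  },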
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6de06fc1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seqID</th>\n",
       "      <th>Orthogroup</th>\n",
       "      <th>PSnum</th>\n",
       "      <th>PStaxID</th>\n",
       "      <th>PSname</th>\n",
       "      <th>PScontinuity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSDART00000127643.3</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSDART00000171750.2</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSDART00000190648.1</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSDART00000130167.3</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSDART00000150909.2</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25167</th>\n",
       "      <td>ENSDART00000180796.1</td>\n",
       "      <td>OG0029510</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25168</th>\n",
       "      <td>ENSDART00000145618.2</td>\n",
       "      <td>OG0029511</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25169</th>\n",
       "      <td>ENSDART00000143229.2</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25170</th>\n",
       "      <td>ENSDART00000143837.3</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25171</th>\n",
       "      <td>ENSDART00000180573.1</td>\n",
       "      <td>OG0029513</td>\n",
       "      <td>13</td>\n",
       "      <td>117571</td>\n",
       "      <td>Euteleostomi</td>\n",
       "      <td>0.222222</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>25172 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      seqID Orthogroup  PSnum PStaxID         PSname  \\\n",
       "0      ENSDART00000127643.3  OG0000000      6   33213      Bilateria   \n",
       "1      ENSDART00000171750.2  OG0000000      6   33213      Bilateria   \n",
       "2      ENSDART00000190648.1  OG0000000      6   33213      Bilateria   \n",
       "3      ENSDART00000130167.3  OG0000001     10    7742     Vertebrata   \n",
       "4      ENSDART00000150909.2  OG0000001     10    7742     Vertebrata   \n",
       "...                     ...        ...    ...     ...            ...   \n",
       "25167  ENSDART00000180796.1  OG0029510     19  186625  Clupeocephala   \n",
       "25168  ENSDART00000145618.2  OG0029511     19  186625  Clupeocephala   \n",
       "25169  ENSDART00000143229.2  OG0029512     29    7955    Danio rerio   \n",
       "25170  ENSDART00000143837.3  OG0029512     29    7955    Danio rerio   \n",
       "25171  ENSDART00000180573.1  OG0029513     13  117571   Euteleostomi   \n",
       "\n",
       "       PScontinuity  \n",
       "0          0.846154  \n",
       "1          0.846154  \n",
       "2          0.846154  \n",
       "3          0.909091  \n",
       "4          0.909091  \n",
       "...             ...  \n",
       "25167      0.400000  \n",
       "25168      0.400000  \n",
       "25169      1.000000  \n",
       "25170      1.000000  \n",
       "25171      0.222222  \n",
       "\n",
       "[25172 rows x 6 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_orthomap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b7c1eaf5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['ENSDARG00000002968', 'ENSDARG00000056314', 'ENSDARG00000102274',\n",
       "       'ENSDARG00000012468', 'ENSDARG00000063621', 'ENSDARG00000044802',\n",
       "       'ENSDARG00000011410', 'ENSDARG00000041170', 'ENSDARG00000011855',\n",
       "       'ENSDARG00000103957',\n",
       "       ...\n",
       "       'ENSDARG00000078476', 'ENSDARG00000058562', 'ENSDARG00000110745',\n",
       "       'ENSDARG00000114172', 'ENSDARG00000110433', 'ENSDARG00000098193',\n",
       "       'ENSDARG00000101137', 'ENSDARG00000095817', 'ENSDARG00000079034',\n",
       "       'ENSDARG00000063372'],\n",
       "      dtype='object', name='index', length=20418)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zebrafish_data.var_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "267829ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert orthomap transcript IDs into GeneIDs and add them to orthomap\n",
    "query_orthomap['geneID'] = orthomap2tei.replace_by(\n",
    "    x_orig = query_orthomap['seqID'],\n",
    "    xmatch = query_species_t2g['transcript_id_version'],\n",
    "    xreplace = query_species_t2g['gene_id'],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "2e2ef7b3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seqID</th>\n",
       "      <th>Orthogroup</th>\n",
       "      <th>PSnum</th>\n",
       "      <th>PStaxID</th>\n",
       "      <th>PSname</th>\n",
       "      <th>PScontinuity</th>\n",
       "      <th>geneID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSDART00000127643.3</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "      <td>ENSDARG00000087544</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSDART00000171750.2</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "      <td>ENSDARG00000095745</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSDART00000190648.1</td>\n",
       "      <td>OG0000000</td>\n",
       "      <td>6</td>\n",
       "      <td>33213</td>\n",
       "      <td>Bilateria</td>\n",
       "      <td>0.846154</td>\n",
       "      <td>ENSDARG00000097551</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSDART00000130167.3</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "      <td>ENSDARG00000086420</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSDART00000150909.2</td>\n",
       "      <td>OG0000001</td>\n",
       "      <td>10</td>\n",
       "      <td>7742</td>\n",
       "      <td>Vertebrata</td>\n",
       "      <td>0.909091</td>\n",
       "      <td>ENSDARG00000086613</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25167</th>\n",
       "      <td>ENSDART00000180796.1</td>\n",
       "      <td>OG0029510</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>ENSDARG00000110427</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25168</th>\n",
       "      <td>ENSDART00000145618.2</td>\n",
       "      <td>OG0029511</td>\n",
       "      <td>19</td>\n",
       "      <td>186625</td>\n",
       "      <td>Clupeocephala</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>ENSDARG00000093188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25169</th>\n",
       "      <td>ENSDART00000143229.2</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>ENSDARG00000069978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25170</th>\n",
       "      <td>ENSDART00000143837.3</td>\n",
       "      <td>OG0029512</td>\n",
       "      <td>29</td>\n",
       "      <td>7955</td>\n",
       "      <td>Danio rerio</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>ENSDARG00000078193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25171</th>\n",
       "      <td>ENSDART00000180573.1</td>\n",
       "      <td>OG0029513</td>\n",
       "      <td>13</td>\n",
       "      <td>117571</td>\n",
       "      <td>Euteleostomi</td>\n",
       "      <td>0.222222</td>\n",
       "      <td>ENSDARG00000109747</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>25172 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      seqID Orthogroup  PSnum PStaxID         PSname  \\\n",
       "0      ENSDART00000127643.3  OG0000000      6   33213      Bilateria   \n",
       "1      ENSDART00000171750.2  OG0000000      6   33213      Bilateria   \n",
       "2      ENSDART00000190648.1  OG0000000      6   33213      Bilateria   \n",
       "3      ENSDART00000130167.3  OG0000001     10    7742     Vertebrata   \n",
       "4      ENSDART00000150909.2  OG0000001     10    7742     Vertebrata   \n",
       "...                     ...        ...    ...     ...            ...   \n",
       "25167  ENSDART00000180796.1  OG0029510     19  186625  Clupeocephala   \n",
       "25168  ENSDART00000145618.2  OG0029511     19  186625  Clupeocephala   \n",
       "25169  ENSDART00000143229.2  OG0029512     29    7955    Danio rerio   \n",
       "25170  ENSDART00000143837.3  OG0029512     29    7955    Danio rerio   \n",
       "25171  ENSDART00000180573.1  OG0029513     13  117571   Euteleostomi   \n",
       "\n",
       "       PScontinuity              geneID  \n",
       "0          0.846154  ENSDARG00000087544  \n",
       "1          0.846154  ENSDARG00000095745  \n",
       "2          0.846154  ENSDARG00000097551  \n",
       "3          0.909091  ENSDARG00000086420  \n",
       "4          0.909091  ENSDARG00000086613  \n",
       "...             ...                 ...  \n",
       "25167      0.400000  ENSDARG00000110427  \n",
       "25168      0.400000  ENSDARG00000093188  \n",
       "25169      1.000000  ENSDARG00000069978  \n",
       "25170      1.000000  ENSDARG00000078193  \n",
       "25171      0.222222  ENSDARG00000109747  \n",
       "\n",
       "[25172 rows x 7 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_orthomap"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d65e086f",
   "metadata": {},
   "source": [
    "Now each `seqID` (transcript ID) has been mapped to its gene ID in the new `geneID` column, and the overlap can be checked again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "36abdb98",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>g1_g2_overlap</th>\n",
       "      <th>g1_ratio</th>\n",
       "      <th>g2_ratio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19982</td>\n",
       "      <td>0.978646</td>\n",
       "      <td>0.793819</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   g1_g2_overlap  g1_ratio  g2_ratio\n",
       "0          19982  0.978646  0.793819"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check overlap of orthomap <geneID> and scRNA data\n",
    "orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['geneID'])"
   ]
  },
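  {
   "cell_type": "markdown",
   "id": "e83a5f16",
   "metadata": {},
   "source": [
    "The overlap (19982) is smaller than the number of `var_names` (20418), so some genes of the scRNA data have no entry in the orthomap. One way to list such IDs is sketched below with `pandas.Index.isin`; the toy IDs and the names `toy_var_names`, `toy_gene_ids` and `missing` are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up stand-ins for zebrafish_data.var_names and query_orthomap['geneID'].\n",
    "toy_var_names = pd.Index(['G1', 'G2', 'G3', 'G4'])\n",
    "toy_gene_ids = pd.Series(['G1', 'G1', 'G3'])\n",
    "\n",
    "# Keep only the names that do not occur among the orthomap gene IDs.\n",
    "missing = toy_var_names[~toy_var_names.isin(toy_gene_ids)]\n",
    "print(list(missing))\n",
    "```\n",
    "\n",
    "The same pattern could be applied to the real objects to inspect which genes are left without a match."
   ]
  },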
  {
   "cell_type": "markdown",
   "id": "64828d79",
   "metadata": {},
   "source": [
    "The resulting orthomap can be stored as a tab-separated file:"
   ]
  },
  {
   "cell_type": "raw",
   "id": "3fa23ef7",
   "metadata": {},
   "source": [
    "query_orthomap.to_csv('./data/zebrafish_ensembl_105_orthomap.tsv', sep='\\t', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "120356ab",
   "metadata": {},
   "source": [
    "To reuse the saved orthomap without repeating Step 1 and Step 2, load it with the `orthomap2tei.read_orthomap()` function."
   ]
  },
  {
   "cell_type": "raw",
   "id": "709d87a1",
   "metadata": {},
   "source": [
    "zebrafish_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_105_orthomap.tsv')"
   ]
  },
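  {
   "cell_type": "markdown",
   "id": "5d2f9b84",
   "metadata": {},
   "source": [
    "A quick way to convince oneself that a tab-separated file keeps all columns is a round trip. The sketch below uses only generic pandas calls (`to_csv` and `read_csv`) on a made-up table and an in-memory buffer instead of a file; it does not show what `orthomap2tei.read_orthomap()` does internally.\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up orthomap-like table for illustration.\n",
    "toy_orthomap = pd.DataFrame({\n",
    "    'seqID': ['T1.1', 'T3.2'],\n",
    "    'Orthogroup': ['OG0000000', 'OG0000001'],\n",
    "    'PSnum': [6, 10],\n",
    "    'geneID': ['G1', 'G2'],\n",
    "})\n",
    "\n",
    "# Write to an in-memory buffer as tab-separated text, then read it back.\n",
    "buffer = io.StringIO()\n",
    "toy_orthomap.to_csv(buffer, sep='\\t', index=False)\n",
    "buffer.seek(0)\n",
    "reloaded = pd.read_csv(buffer, sep='\\t')\n",
    "print(reloaded)\n",
    "```\n",
    "\n",
    "Writing with `index=False` avoids an extra unnamed index column when the file is read back."
   ]
  },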
  {
   "cell_type": "markdown",
   "id": "52002dfd",
   "metadata": {},
   "source": [
    "To continue, see the documentation of [Step 4 - TEI calculation](https://oggmap.readthedocs.io/en/latest/tutorials/add_tei.html) for further insights."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:scanpy]",
   "language": "python",
   "name": "conda-env-scanpy-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}