oggmap: Step 3 - map gene/transcript IDs

This notebook will demonstrate how to match gene or transcript IDs between an orthomap and scRNA data.

Notebook file

Notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/get_orthomap.ipynb

Import libraries

[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import oggmap python package submodules

[2]:
# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets

Step 0, Step 1 and Step 2

In order to come to Step 3, matching gene or transcript IDs, one needs to have the results from Step 0, Step 1 and Step 2.

The query species in this part is: Danio rerio (zebrafish).

Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.

In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with oggmap and the qlin.get_qlin() function.

In Step 2 - gene age class assignment you have already been introduced how to extract an orthomap (gene age class) from OrthoFinder results with oggmap and the of2orthomap.get_orthomap() function or how to import pre-calculated orthomaps with the orthomap2tei.read_orthomap() function.

Step 0 - run OrthoFinder

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

Please have a look at the documentation of Step 0 - run OrthoFinder to get further insides.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of oggmap

datasets.ensembl105(datapath='data') (download folder set to 'data').

[3]:
datasets.ensembl105(datapath='data')
100% [..........................................................] 15662 / 15662
[3]:
['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
 'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
 'data/ensembl_105_orthofinder_species_list.tsv']

Step 1 - get taxonomic information

Please have a look at the documentation of Step 1 - get taxonomic information to get further insides.

[4]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio')
query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]

Step 2 - gene age class assignment

Here, oggmap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.

Please have a look at the documentation of Step 2 - gene age class assignment to get further insides.

Note: This step can take up to five minutes, depending on your hardware.

For this step to get the query species oggmap, one uses the of2orthomap.get_orthomap() function, like:

[5]:
# get query species orthomap

# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.ensembl105('data')
query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(
    seqname='Danio_rerio.GRCz11.cds.longest',
    qt='7955',
    sl='data/ensembl_105_orthofinder_species_list.tsv',
    oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
    og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
    continuity=True)
query_orthomap
Danio_rerio.GRCz11.cds.longest
Danio rerio
7955
                                               species    taxID  \
0    Acanthochromis_polyacanthus.ASM210954v1.cds.lo...    80966
1    Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon...   211598
2       Ailuropoda_melanoleuca.ASM200744v2.cds.longest     9646
3             Amazona_collaria.ASM394721v1.cds.longest   241587
4         Amphilophus_citrinellus.Midas_v5.cds.longest    61819
..                                                 ...      ...
307  Xiphophorus_couchianus.Xiphophorus_couchianus-...    32473
308  Xiphophorus_maculatus.X_maculatus-5.0-male.cds...     8083
309    Zalophus_californianus.mZalCal1.pri.cds.longest     9704
310  Zonotrichia_albicollis.Zonotrichia_albicollis-...    44394
311  Zosterops_lateralis_melanops.ASM128173v1.cds.l...  1220523

                                               lineage  youngest_common  \
0    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
1    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
2    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
3    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
4    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
..                                                 ...              ...
307  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
308  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
309  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
310  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
311  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571

     youngest_name
0    Clupeocephala
1     Euteleostomi
2     Euteleostomi
3     Euteleostomi
4    Clupeocephala
..             ...
307  Clupeocephala
308  Clupeocephala
309   Euteleostomi
310   Euteleostomi
311   Euteleostomi

[312 rows x 5 columns]
[5]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity
0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria 0.846154
1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria 0.846154
2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria 0.846154
3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata 0.909091
4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata 0.909091
... ... ... ... ... ... ...
25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala 0.400000
25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala 0.400000
25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio 1.000000
25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio 1.000000
25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi 0.222222

25172 rows × 6 columns

Step 3 - map OrthoFinder gene names and scRNA gene/transcript names

To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the gtf2t2g submodule of oggmap and the gtf2t2g.parse_gtf() function, one can extract gene and transcript names from a given gene feature file (GTF).

[6]:
datasets.zebrafish_ensembl105_gtf(datapath='data')
100% [....................................................] 18021890 / 18021890
[6]:
'data/Danio_rerio.GRCz11.105.gtf.gz'
[7]:
# get gene to transcript table for Danio rerio

# download zebrafish GTF file here:
# https://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz
# or download with datasets.zebrafish_ensembl105_gtf(datapath='data')
query_species_t2g = gtf2t2g.parse_gtf(
    gtf='data/Danio_rerio.GRCz11.105.gtf.gz',
    g=True, b=True, p=True, v=True, s=True, q=True)
32520 gene_id found
59876 transcript_id found
59876 protein_id found
0 duplicated
[8]:
query_species_t2g
[8]:
gene_id gene_id_version transcript_id transcript_id_version gene_name gene_type protein_id protein_id_version
0 ENSDARG00000000001 ENSDARG00000000001.6 ENSDART00000000004 ENSDART00000000004.5 slc35a5 protein_coding ENSDARP00000000004 ENSDARP00000000004.2
1 ENSDARG00000000002 ENSDARG00000000002.8 ENSDART00000000005 ENSDART00000000005.7 ccdc80 protein_coding ENSDARP00000000005 ENSDARP00000000005.6
2 ENSDARG00000000018 ENSDARG00000000018.9 ENSDART00000181044 ENSDART00000181044.1 nrf1 protein_coding ENSDARP00000149440 ENSDARP00000149440.1
3 ENSDARG00000000018 ENSDARG00000000018.9 ENSDART00000138183 ENSDART00000138183.2 nrf1 protein_coding ENSDARP00000116798 ENSDARP00000116798.1
4 ENSDARG00000000019 ENSDARG00000000019.9 ENSDART00000124452 ENSDART00000124452.3 ube2h protein_coding ENSDARP00000107407 ENSDARP00000107407.2
... ... ... ... ... ... ... ... ...
59871 ENSDARG00000117825 ENSDARG00000117825.1 ENSDART00000194739 ENSDART00000194739.1 CU207269.4 lincRNA None None
59872 ENSDARG00000117826 ENSDARG00000117826.1 ENSDART00000194042 ENSDART00000194042.1 CR385041.2 lincRNA None None
59873 ENSDARG00000117826 ENSDARG00000117826.1 ENSDART00000194514 ENSDART00000194514.1 CR385041.2 lincRNA None None
59874 ENSDARG00000117827 ENSDARG00000117827.1 ENSDART00000194378 ENSDART00000194378.1 CR388164.3 lincRNA None None
59875 ENSDARG00000117827 ENSDARG00000117827.1 ENSDART00000194710 ENSDART00000194710.1 CR388164.3 lincRNA None None

59876 rows × 8 columns

Import now, the scRNA dataset of the query species

Here, data is used, like in the publication (Farrell et al., 2018; Wagner et al., 2018; Qiu et al., 2022).

scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy or oggmap package and is available here:

https://doi.org/10.5281/zenodo.7243602

or can be accessed with the dataset submodule of oggmap:

datasets.qiu22_zebrafish(datapath='data') (download folder set to 'data').

[9]:
# load scRNA data

# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602
# or download with datasets.qui22_zebrafish(datapath='data')

#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')
zebrafish_data = sc.read('data/zebrafish_data.h5ad')

Get an overview of observations

[10]:
zebrafish_data.obs
[10]:
orig.ident nCount_RNA nFeature_RNA sample stage group cell_state cell_type
hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC ZFHIGH 5773.0 2570 ZFHIGH_WT_DS5_AAAAGTTGCCTC hpf3.3 F_3.3 hpf3.3:blastomere blastomere
hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT ZFHIGH 2312.0 1451 ZFHIGH_WT_DS5_AAACAAGTGTAT hpf3.3 F_3.3 hpf3.3:blastomere blastomere
hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC ZFHIGH 4180.0 2166 ZFHIGH_WT_DS5_AAACACCTCGTC hpf3.3 F_3.3 hpf3.3:blastomere blastomere
hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN ZFHIGH 6686.0 2845 ZFHIGH_WT_DS5_AAATGAGGTTTN hpf3.3 F_3.3 hpf3.3:blastomere blastomere
hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT ZFHIGH 20095.0 4993 ZFHIGH_WT_DS5_AACCCTCTCGAT hpf3.3 F_3.3 hpf3.3:blastomere blastomere
... ... ... ... ... ... ... ... ...
hpf24_DEW057_TGACACAACAG_GCCACATC DEW057 3916.0 1328 DEW057_TGACACAACAG_GCCACATC hpf24 batch2 hpf24:midbrain midbrain
hpf24_DEW057_CTTACGGG_AACCTGAC DEW057 5611.0 1700 DEW057_CTTACGGG_AACCTGAC hpf24 batch2 hpf24:pharyngeal arch pharyngeal arch
hpf24_DEW057_TGAACATCTAT_GACGATGG DEW057 3676.0 1345 DEW057_TGAACATCTAT_GACGATGG hpf24 batch2 hpf24:midbrain midbrain
hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT DEW057 7021.0 1778 DEW057_TGAGGTTTCTC_CTCAGAAT hpf24 batch2 hpf24:optic cup optic cup
hpf24_DEW057_ACGTGCTAG_CAAGTCAT DEW057 3378.0 1170 DEW057_ACGTGCTAG_CAAGTCAT hpf24 batch2 hpf24:hindbrain dorsal hindbrain dorsal

71203 rows × 8 columns

Helper functions to match gene names

The orthomap2tei submodule contains the orthomap2tei.geneset_overlap() helper function to check for gene name overlap between the constructed orthomap from OrthoFinder results and a given scRNA dataset.

[11]:
# check overlap of orthomap <seqID> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])
[11]:
g1_g2_overlap g1_ratio g2_ratio
0 0 0.0 0.0

As one can see, there is no overlap between the zebrafish_data.var_names and the sequence IDs from the orthomap so far.

[12]:
# check overlap of transcript table <gene_id> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_species_t2g['gene_id'])
[12]:
g1_g2_overlap g1_ratio g2_ratio
0 20418 1.0 0.62786

The overlap of the zebrafish_data.var_names and the imported GTF and gene_id looks better and covers all gene IDs from the scRNA data.

However, the orthomap obtained from OrthoFinder results contains isoform gene IDs.

As a first step matching the orthomap and the zebrafish_data.var_names from the scRNA data is to reduce each isoform gene ID to its corresponding gene ID.

Reduce isoforms to genes

The of2orthomap.replace_by() helper function can be used to add a new column to the orthomap dataframe by matching e.g. gene isoform names and their corresponding gene names.

[13]:
query_orthomap
[13]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity
0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria 0.846154
1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria 0.846154
2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria 0.846154
3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata 0.909091
4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata 0.909091
... ... ... ... ... ... ...
25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala 0.400000
25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala 0.400000
25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio 1.000000
25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio 1.000000
25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi 0.222222

25172 rows × 6 columns

[14]:
zebrafish_data.var_names
[14]:
Index(['ENSDARG00000002968', 'ENSDARG00000056314', 'ENSDARG00000102274',
       'ENSDARG00000012468', 'ENSDARG00000063621', 'ENSDARG00000044802',
       'ENSDARG00000011410', 'ENSDARG00000041170', 'ENSDARG00000011855',
       'ENSDARG00000103957',
       ...
       'ENSDARG00000078476', 'ENSDARG00000058562', 'ENSDARG00000110745',
       'ENSDARG00000114172', 'ENSDARG00000110433', 'ENSDARG00000098193',
       'ENSDARG00000101137', 'ENSDARG00000095817', 'ENSDARG00000079034',
       'ENSDARG00000063372'],
      dtype='object', name='index', length=20418)
[15]:
# convert orthomap transcript IDs into GeneIDs and add them to orthomap
query_orthomap['geneID'] = orthomap2tei.replace_by(
    x_orig = query_orthomap['seqID'],
    xmatch = query_species_t2g['transcript_id_version'],
    xreplace = query_species_t2g['gene_id'],
)
[16]:
query_orthomap
[16]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity geneID
0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria 0.846154 ENSDARG00000087544
1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria 0.846154 ENSDARG00000095745
2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria 0.846154 ENSDARG00000097551
3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata 0.909091 ENSDARG00000086420
4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata 0.909091 ENSDARG00000086613
... ... ... ... ... ... ... ...
25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala 0.400000 ENSDARG00000110427
25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala 0.400000 ENSDARG00000093188
25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio 1.000000 ENSDARG00000069978
25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio 1.000000 ENSDARG00000078193
25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi 0.222222 ENSDARG00000109747

25172 rows × 7 columns

Now, each seqID (isoform gene ID) is reduced to its gene ID geneID and one can check again the overlap.

[17]:
# check overlap of orthomap <geneID> and scRNA data
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['geneID'])
[17]:
g1_g2_overlap g1_ratio g2_ratio
0 19982 0.978646 0.793819

The created orthomap can be stored as a separated file like:

query_orthomap.to_csv('./data/zebrafish_ensembl_105_orthomap.tsv', sep='\t', index=False)

To re-use the saved orthomap, so that one does not need to repeat Step 1 and Step 2, one could load it with orthomap2tei.read_orthomap() function.

zebrafish_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_105_orthomap.tsv')

If you like to continue, please have a look at the documentation of Step 4 - TEI calculation to get further insides.