oggmap: Step 3 - map gene/transcript IDs

This notebook will demonstrate how to match gene or transcript IDs between an orthomap and scRNA data.

Notebook file

Notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/get_orthomap.ipynb

Import libraries

[1]:

import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import oggmap python package submodules

[2]:

# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets

Step 0, Step 1 and Step 2

In order to come to Step 3, matching gene or transcript IDs, one needs to have the results from Step 0, Step 1 and Step 2.

The query species in this part is: Danio rerio (zebrafish).

Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.

In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with oggmap and the qlin.get_qlin() function.

In Step 2 - gene age class assignment you have already been introduced how to extract an orthomap (gene age class) from OrthoFinder results with oggmap and the of2orthomap.get_orthomap() function or how to import pre-calculated orthomaps with the orthomap2tei.read_orthomap() function.

Step 0 - run OrthoFinder

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

Please have a look at the documentation of Step 0 - run OrthoFinder to get further insides.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of oggmap

datasets.ensembl105(datapath='data') (download folder set to 'data').

[3]:

datasets.ensembl105(datapath='data')

100% [..........................................................] 15662 / 15662

[3]:

['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
 'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
 'data/ensembl_105_orthofinder_species_list.tsv']

Step 1 - get taxonomic information

Please have a look at the documentation of Step 1 - get taxonomic information to get further insides.

[4]:

# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio')

query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]

Step 2 - gene age class assignment

Here, oggmap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.

Please have a look at the documentation of Step 2 - gene age class assignment to get further insides.

Note: This step can take up to five minutes, depending on your hardware.

For this step to get the query species oggmap, one uses the of2orthomap.get_orthomap() function, like:

[5]:

# get query species orthomap

# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.ensembl105('data')
query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(
    seqname='Danio_rerio.GRCz11.cds.longest',
    qt='7955',
    sl='data/ensembl_105_orthofinder_species_list.tsv',
    oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
    og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
    continuity=True)
query_orthomap

Danio_rerio.GRCz11.cds.longest
Danio rerio
7955
                                               species    taxID  \
0    Acanthochromis_polyacanthus.ASM210954v1.cds.lo...    80966
1    Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon...   211598
2       Ailuropoda_melanoleuca.ASM200744v2.cds.longest     9646
3             Amazona_collaria.ASM394721v1.cds.longest   241587
4         Amphilophus_citrinellus.Midas_v5.cds.longest    61819
..                                                 ...      ...
307  Xiphophorus_couchianus.Xiphophorus_couchianus-...    32473
308  Xiphophorus_maculatus.X_maculatus-5.0-male.cds...     8083
309    Zalophus_californianus.mZalCal1.pri.cds.longest     9704
310  Zonotrichia_albicollis.Zonotrichia_albicollis-...    44394
311  Zosterops_lateralis_melanops.ASM128173v1.cds.l...  1220523

                                               lineage  youngest_common  \
0    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
1    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
2    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
3    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
4    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
..                                                 ...              ...
307  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
308  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
309  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
310  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
311  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571

     youngest_name
0    Clupeocephala
1     Euteleostomi
2     Euteleostomi
3     Euteleostomi
4    Clupeocephala
..             ...
307  Clupeocephala
308  Clupeocephala
309   Euteleostomi
310   Euteleostomi
311   Euteleostomi

[312 rows x 5 columns]

[5]:

	seqID	Orthogroup	PSnum	PStaxID	PSname	PScontinuity
0	ENSDART00000127643.3	OG0000000	6	33213	Bilateria	0.846154
1	ENSDART00000171750.2	OG0000000	6	33213	Bilateria	0.846154
2	ENSDART00000190648.1	OG0000000	6	33213	Bilateria	0.846154
3	ENSDART00000130167.3	OG0000001	10	7742	Vertebrata	0.909091
4	ENSDART00000150909.2	OG0000001	10	7742	Vertebrata	0.909091
...	...	...	...	...	...	...
25167	ENSDART00000180796.1	OG0029510	19	186625	Clupeocephala	0.400000
25168	ENSDART00000145618.2	OG0029511	19	186625	Clupeocephala	0.400000
25169	ENSDART00000143229.2	OG0029512	29	7955	Danio rerio	1.000000
25170	ENSDART00000143837.3	OG0029512	29	7955	Danio rerio	1.000000
25171	ENSDART00000180573.1	OG0029513	13	117571	Euteleostomi	0.222222

25172 rows × 6 columns

Step 3 - map OrthoFinder gene names and scRNA gene/transcript names

To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the gtf2t2g submodule of oggmap and the gtf2t2g.parse_gtf() function, one can extract gene and transcript names from a given gene feature file (GTF).

[6]:

datasets.zebrafish_ensembl105_gtf(datapath='data')

100% [....................................................] 18021890 / 18021890

[6]:

'data/Danio_rerio.GRCz11.105.gtf.gz'

[7]:

# get gene to transcript table for Danio rerio

# download zebrafish GTF file here:
# https://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz
# or download with datasets.zebrafish_ensembl105_gtf(datapath='data')
query_species_t2g = gtf2t2g.parse_gtf(
    gtf='data/Danio_rerio.GRCz11.105.gtf.gz',
    g=True, b=True, p=True, v=True, s=True, q=True)

32520 gene_id found
59876 transcript_id found
59876 protein_id found
0 duplicated

[8]:

query_species_t2g

[8]:

	gene_id	gene_id_version	transcript_id	transcript_id_version	gene_name	gene_type	protein_id	protein_id_version
0	ENSDARG00000000001	ENSDARG00000000001.6	ENSDART00000000004	ENSDART00000000004.5	slc35a5	protein_coding	ENSDARP00000000004	ENSDARP00000000004.2
1	ENSDARG00000000002	ENSDARG00000000002.8	ENSDART00000000005	ENSDART00000000005.7	ccdc80	protein_coding	ENSDARP00000000005	ENSDARP00000000005.6
2	ENSDARG00000000018	ENSDARG00000000018.9	ENSDART00000181044	ENSDART00000181044.1	nrf1	protein_coding	ENSDARP00000149440	ENSDARP00000149440.1
3	ENSDARG00000000018	ENSDARG00000000018.9	ENSDART00000138183	ENSDART00000138183.2	nrf1	protein_coding	ENSDARP00000116798	ENSDARP00000116798.1
4	ENSDARG00000000019	ENSDARG00000000019.9	ENSDART00000124452	ENSDART00000124452.3	ube2h	protein_coding	ENSDARP00000107407	ENSDARP00000107407.2
...	...	...	...	...	...	...	...	...
59871	ENSDARG00000117825	ENSDARG00000117825.1	ENSDART00000194739	ENSDART00000194739.1	CU207269.4	lincRNA	None	None
59872	ENSDARG00000117826	ENSDARG00000117826.1	ENSDART00000194042	ENSDART00000194042.1	CR385041.2	lincRNA	None	None
59873	ENSDARG00000117826	ENSDARG00000117826.1	ENSDART00000194514	ENSDART00000194514.1	CR385041.2	lincRNA	None	None
59874	ENSDARG00000117827	ENSDARG00000117827.1	ENSDART00000194378	ENSDART00000194378.1	CR388164.3	lincRNA	None	None
59875	ENSDARG00000117827	ENSDARG00000117827.1	ENSDART00000194710	ENSDART00000194710.1	CR388164.3	lincRNA	None	None

59876 rows × 8 columns

Import now, the scRNA dataset of the query species

Here, data is used, like in the publication (Farrell et al., 2018; Wagner et al., 2018; Qiu et al., 2022).

scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy or oggmap package and is available here:

https://doi.org/10.5281/zenodo.7243602

or can be accessed with the dataset submodule of oggmap:

datasets.qiu22_zebrafish(datapath='data') (download folder set to 'data').

[9]:

# load scRNA data

# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602
# or download with datasets.qui22_zebrafish(datapath='data')

#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')
zebrafish_data = sc.read('data/zebrafish_data.h5ad')

Get an overview of observations

[10]:

zebrafish_data.obs

[10]:

	orig.ident	nCount_RNA	nFeature_RNA	sample	stage	group	cell_state	cell_type
hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC	ZFHIGH	5773.0	2570	ZFHIGH_WT_DS5_AAAAGTTGCCTC	hpf3.3	F_3.3	hpf3.3:blastomere	blastomere
hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT	ZFHIGH	2312.0	1451	ZFHIGH_WT_DS5_AAACAAGTGTAT	hpf3.3	F_3.3	hpf3.3:blastomere	blastomere
hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC	ZFHIGH	4180.0	2166	ZFHIGH_WT_DS5_AAACACCTCGTC	hpf3.3	F_3.3	hpf3.3:blastomere	blastomere
hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN	ZFHIGH	6686.0	2845	ZFHIGH_WT_DS5_AAATGAGGTTTN	hpf3.3	F_3.3	hpf3.3:blastomere	blastomere
hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT	ZFHIGH	20095.0	4993	ZFHIGH_WT_DS5_AACCCTCTCGAT	hpf3.3	F_3.3	hpf3.3:blastomere	blastomere
...	...	...	...	...	...	...	...	...
hpf24_DEW057_TGACACAACAG_GCCACATC	DEW057	3916.0	1328	DEW057_TGACACAACAG_GCCACATC	hpf24	batch2	hpf24:midbrain	midbrain
hpf24_DEW057_CTTACGGG_AACCTGAC	DEW057	5611.0	1700	DEW057_CTTACGGG_AACCTGAC	hpf24	batch2	hpf24:pharyngeal arch	pharyngeal arch
hpf24_DEW057_TGAACATCTAT_GACGATGG	DEW057	3676.0	1345	DEW057_TGAACATCTAT_GACGATGG	hpf24	batch2	hpf24:midbrain	midbrain
hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT	DEW057	7021.0	1778	DEW057_TGAGGTTTCTC_CTCAGAAT	hpf24	batch2	hpf24:optic cup	optic cup
hpf24_DEW057_ACGTGCTAG_CAAGTCAT	DEW057	3378.0	1170	DEW057_ACGTGCTAG_CAAGTCAT	hpf24	batch2	hpf24:hindbrain dorsal	hindbrain dorsal

71203 rows × 8 columns

Helper functions to match gene names

The orthomap2tei submodule contains the orthomap2tei.geneset_overlap() helper function to check for gene name overlap between the constructed orthomap from OrthoFinder results and a given scRNA dataset.

[11]:

# check overlap of orthomap <seqID> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])

[11]:

	g1_g2_overlap	g1_ratio	g2_ratio
0	0	0.0	0.0

As one can see, there is no overlap between the zebrafish_data.var_names and the sequence IDs from the orthomap so far.

[12]:

# check overlap of transcript table <gene_id> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_species_t2g['gene_id'])

[12]:

	g1_g2_overlap	g1_ratio	g2_ratio
0	20418	1.0	0.62786

The overlap of the zebrafish_data.var_names and the imported GTF and gene_id looks better and covers all gene IDs from the scRNA data.

However, the orthomap obtained from OrthoFinder results contains isoform gene IDs.

As a first step matching the orthomap and the zebrafish_data.var_names from the scRNA data is to reduce each isoform gene ID to its corresponding gene ID.

Reduce isoforms to genes

The of2orthomap.replace_by() helper function can be used to add a new column to the orthomap dataframe by matching e.g. gene isoform names and their corresponding gene names.

[13]:

query_orthomap

[13]:

	seqID	Orthogroup	PSnum	PStaxID	PSname	PScontinuity
0	ENSDART00000127643.3	OG0000000	6	33213	Bilateria	0.846154
1	ENSDART00000171750.2	OG0000000	6	33213	Bilateria	0.846154
2	ENSDART00000190648.1	OG0000000	6	33213	Bilateria	0.846154
3	ENSDART00000130167.3	OG0000001	10	7742	Vertebrata	0.909091
4	ENSDART00000150909.2	OG0000001	10	7742	Vertebrata	0.909091
...	...	...	...	...	...	...
25167	ENSDART00000180796.1	OG0029510	19	186625	Clupeocephala	0.400000
25168	ENSDART00000145618.2	OG0029511	19	186625	Clupeocephala	0.400000
25169	ENSDART00000143229.2	OG0029512	29	7955	Danio rerio	1.000000
25170	ENSDART00000143837.3	OG0029512	29	7955	Danio rerio	1.000000
25171	ENSDART00000180573.1	OG0029513	13	117571	Euteleostomi	0.222222

25172 rows × 6 columns

[14]:

zebrafish_data.var_names

[14]:

Index(['ENSDARG00000002968', 'ENSDARG00000056314', 'ENSDARG00000102274',
       'ENSDARG00000012468', 'ENSDARG00000063621', 'ENSDARG00000044802',
       'ENSDARG00000011410', 'ENSDARG00000041170', 'ENSDARG00000011855',
       'ENSDARG00000103957',
       ...
       'ENSDARG00000078476', 'ENSDARG00000058562', 'ENSDARG00000110745',
       'ENSDARG00000114172', 'ENSDARG00000110433', 'ENSDARG00000098193',
       'ENSDARG00000101137', 'ENSDARG00000095817', 'ENSDARG00000079034',
       'ENSDARG00000063372'],
      dtype='object', name='index', length=20418)

[15]:

# convert orthomap transcript IDs into GeneIDs and add them to orthomap
query_orthomap['geneID'] = orthomap2tei.replace_by(
    x_orig = query_orthomap['seqID'],
    xmatch = query_species_t2g['transcript_id_version'],
    xreplace = query_species_t2g['gene_id'],
)

[16]:

query_orthomap

[16]:

	seqID	Orthogroup	PSnum	PStaxID	PSname	PScontinuity	geneID
0	ENSDART00000127643.3	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000087544
1	ENSDART00000171750.2	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000095745
2	ENSDART00000190648.1	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000097551
3	ENSDART00000130167.3	OG0000001	10	7742	Vertebrata	0.909091	ENSDARG00000086420
4	ENSDART00000150909.2	OG0000001	10	7742	Vertebrata	0.909091	ENSDARG00000086613
...	...	...	...	...	...	...	...
25167	ENSDART00000180796.1	OG0029510	19	186625	Clupeocephala	0.400000	ENSDARG00000110427
25168	ENSDART00000145618.2	OG0029511	19	186625	Clupeocephala	0.400000	ENSDARG00000093188
25169	ENSDART00000143229.2	OG0029512	29	7955	Danio rerio	1.000000	ENSDARG00000069978
25170	ENSDART00000143837.3	OG0029512	29	7955	Danio rerio	1.000000	ENSDARG00000078193
25171	ENSDART00000180573.1	OG0029513	13	117571	Euteleostomi	0.222222	ENSDARG00000109747

25172 rows × 7 columns

Now, each seqID (isoform gene ID) is reduced to its gene ID geneID and one can check again the overlap.

[17]:

# check overlap of orthomap <geneID> and scRNA data
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['geneID'])

[17]:

	g1_g2_overlap	g1_ratio	g2_ratio
0	19982	0.978646	0.793819

The created orthomap can be stored as a separated file like:

query_orthomap.to_csv('./data/zebrafish_ensembl_105_orthomap.tsv', sep='\t', index=False)

To re-use the saved orthomap, so that one does not need to repeat Step 1 and Step 2, one could load it with orthomap2tei.read_orthomap() function.

zebrafish_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_105_orthomap.tsv')

If you like to continue, please have a look at the documentation of Step 4 - TEI calculation to get further insides.