oggmap: Step 3 - map gene/transcript IDs
This notebook will demonstrate how to match gene or transcript IDs between an orthomap and scRNA data.
Notebook file
Notebook file can be obtained here:
https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/get_orthomap.ipynb
Import libraries
[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]
Import oggmap python package submodules
[2]:
# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets
Step 0, Step 1 and Step 2
In order to come to Step 3, matching gene or transcript IDs, one needs to have the results from Step 0, Step 1 and Step 2.
The query species in this part is: Danio rerio (zebrafish).
Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.
In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with oggmap and the qlin.get_qlin() function.
In Step 2 - gene age class assignment you have already been introduced how to extract an orthomap (gene age class) from OrthoFinder results with oggmap and the of2orthomap.get_orthomap() function or how to import pre-calculated orthomaps with the orthomap2tei.read_orthomap() function.
Step 0 - run OrthoFinder
For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.
Please have a look at the documentation of Step 0 - run OrthoFinder to get further insides.
The results are available here:
https://doi.org/10.5281/zenodo.7242264
or can be accessed with the dataset submodule of oggmap
datasets.ensembl105(datapath='data') (download folder set to 'data').
[3]:
datasets.ensembl105(datapath='data')
100% [..........................................................] 15662 / 15662
[3]:
['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
'data/ensembl_105_orthofinder_species_list.tsv']
Step 1 - get taxonomic information
Please have a look at the documentation of Step 1 - get taxonomic information to get further insides.
[4]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio')
query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]
Step 2 - gene age class assignment
Here, oggmap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.
Please have a look at the documentation of Step 2 - gene age class assignment to get further insides.
Note: This step can take up to five minutes, depending on your hardware.
For this step to get the query species oggmap, one uses the of2orthomap.get_orthomap() function, like:
[5]:
# get query species orthomap
# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.ensembl105('data')
query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(
seqname='Danio_rerio.GRCz11.cds.longest',
qt='7955',
sl='data/ensembl_105_orthofinder_species_list.tsv',
oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
continuity=True)
query_orthomap
Danio_rerio.GRCz11.cds.longest
Danio rerio
7955
species taxID \
0 Acanthochromis_polyacanthus.ASM210954v1.cds.lo... 80966
1 Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon... 211598
2 Ailuropoda_melanoleuca.ASM200744v2.cds.longest 9646
3 Amazona_collaria.ASM394721v1.cds.longest 241587
4 Amphilophus_citrinellus.Midas_v5.cds.longest 61819
.. ... ...
307 Xiphophorus_couchianus.Xiphophorus_couchianus-... 32473
308 Xiphophorus_maculatus.X_maculatus-5.0-male.cds... 8083
309 Zalophus_californianus.mZalCal1.pri.cds.longest 9704
310 Zonotrichia_albicollis.Zonotrichia_albicollis-... 44394
311 Zosterops_lateralis_melanops.ASM128173v1.cds.l... 1220523
lineage youngest_common \
0 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625
1 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
2 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
3 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
4 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625
.. ... ...
307 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625
308 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 186625
309 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
310 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
311 [1, 131567, 2759, 33154, 33208, 6072, 33213, 3... 117571
youngest_name
0 Clupeocephala
1 Euteleostomi
2 Euteleostomi
3 Euteleostomi
4 Clupeocephala
.. ...
307 Clupeocephala
308 Clupeocephala
309 Euteleostomi
310 Euteleostomi
311 Euteleostomi
[312 rows x 5 columns]
[5]:
| seqID | Orthogroup | PSnum | PStaxID | PSname | PScontinuity | |
|---|---|---|---|---|---|---|
| 0 | ENSDART00000127643.3 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 1 | ENSDART00000171750.2 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 2 | ENSDART00000190648.1 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 3 | ENSDART00000130167.3 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 |
| 4 | ENSDART00000150909.2 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 |
| ... | ... | ... | ... | ... | ... | ... |
| 25167 | ENSDART00000180796.1 | OG0029510 | 19 | 186625 | Clupeocephala | 0.400000 |
| 25168 | ENSDART00000145618.2 | OG0029511 | 19 | 186625 | Clupeocephala | 0.400000 |
| 25169 | ENSDART00000143229.2 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 |
| 25170 | ENSDART00000143837.3 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 |
| 25171 | ENSDART00000180573.1 | OG0029513 | 13 | 117571 | Euteleostomi | 0.222222 |
25172 rows × 6 columns
Step 3 - map OrthoFinder gene names and scRNA gene/transcript names
To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the gtf2t2g submodule of oggmap and the gtf2t2g.parse_gtf() function, one can extract gene and transcript names from a given gene feature file (GTF).
[6]:
datasets.zebrafish_ensembl105_gtf(datapath='data')
100% [....................................................] 18021890 / 18021890
[6]:
'data/Danio_rerio.GRCz11.105.gtf.gz'
[7]:
# get gene to transcript table for Danio rerio
# download zebrafish GTF file here:
# https://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz
# or download with datasets.zebrafish_ensembl105_gtf(datapath='data')
query_species_t2g = gtf2t2g.parse_gtf(
gtf='data/Danio_rerio.GRCz11.105.gtf.gz',
g=True, b=True, p=True, v=True, s=True, q=True)
32520 gene_id found
59876 transcript_id found
59876 protein_id found
0 duplicated
[8]:
query_species_t2g
[8]:
| gene_id | gene_id_version | transcript_id | transcript_id_version | gene_name | gene_type | protein_id | protein_id_version | |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSDARG00000000001 | ENSDARG00000000001.6 | ENSDART00000000004 | ENSDART00000000004.5 | slc35a5 | protein_coding | ENSDARP00000000004 | ENSDARP00000000004.2 |
| 1 | ENSDARG00000000002 | ENSDARG00000000002.8 | ENSDART00000000005 | ENSDART00000000005.7 | ccdc80 | protein_coding | ENSDARP00000000005 | ENSDARP00000000005.6 |
| 2 | ENSDARG00000000018 | ENSDARG00000000018.9 | ENSDART00000181044 | ENSDART00000181044.1 | nrf1 | protein_coding | ENSDARP00000149440 | ENSDARP00000149440.1 |
| 3 | ENSDARG00000000018 | ENSDARG00000000018.9 | ENSDART00000138183 | ENSDART00000138183.2 | nrf1 | protein_coding | ENSDARP00000116798 | ENSDARP00000116798.1 |
| 4 | ENSDARG00000000019 | ENSDARG00000000019.9 | ENSDART00000124452 | ENSDART00000124452.3 | ube2h | protein_coding | ENSDARP00000107407 | ENSDARP00000107407.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59871 | ENSDARG00000117825 | ENSDARG00000117825.1 | ENSDART00000194739 | ENSDART00000194739.1 | CU207269.4 | lincRNA | None | None |
| 59872 | ENSDARG00000117826 | ENSDARG00000117826.1 | ENSDART00000194042 | ENSDART00000194042.1 | CR385041.2 | lincRNA | None | None |
| 59873 | ENSDARG00000117826 | ENSDARG00000117826.1 | ENSDART00000194514 | ENSDART00000194514.1 | CR385041.2 | lincRNA | None | None |
| 59874 | ENSDARG00000117827 | ENSDARG00000117827.1 | ENSDART00000194378 | ENSDART00000194378.1 | CR388164.3 | lincRNA | None | None |
| 59875 | ENSDARG00000117827 | ENSDARG00000117827.1 | ENSDART00000194710 | ENSDART00000194710.1 | CR388164.3 | lincRNA | None | None |
59876 rows × 8 columns
Import now, the scRNA dataset of the query species
Here, data is used, like in the publication (Farrell et al., 2018; Wagner et al., 2018; Qiu et al., 2022).
scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy or oggmap package and is available here:
https://doi.org/10.5281/zenodo.7243602
or can be accessed with the dataset submodule of oggmap:
datasets.qiu22_zebrafish(datapath='data') (download folder set to 'data').
[9]:
# load scRNA data
# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602
# or download with datasets.qui22_zebrafish(datapath='data')
#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')
zebrafish_data = sc.read('data/zebrafish_data.h5ad')
Get an overview of observations
[10]:
zebrafish_data.obs
[10]:
| orig.ident | nCount_RNA | nFeature_RNA | sample | stage | group | cell_state | cell_type | |
|---|---|---|---|---|---|---|---|---|
| hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC | ZFHIGH | 5773.0 | 2570 | ZFHIGH_WT_DS5_AAAAGTTGCCTC | hpf3.3 | F_3.3 | hpf3.3:blastomere | blastomere |
| hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT | ZFHIGH | 2312.0 | 1451 | ZFHIGH_WT_DS5_AAACAAGTGTAT | hpf3.3 | F_3.3 | hpf3.3:blastomere | blastomere |
| hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC | ZFHIGH | 4180.0 | 2166 | ZFHIGH_WT_DS5_AAACACCTCGTC | hpf3.3 | F_3.3 | hpf3.3:blastomere | blastomere |
| hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN | ZFHIGH | 6686.0 | 2845 | ZFHIGH_WT_DS5_AAATGAGGTTTN | hpf3.3 | F_3.3 | hpf3.3:blastomere | blastomere |
| hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT | ZFHIGH | 20095.0 | 4993 | ZFHIGH_WT_DS5_AACCCTCTCGAT | hpf3.3 | F_3.3 | hpf3.3:blastomere | blastomere |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| hpf24_DEW057_TGACACAACAG_GCCACATC | DEW057 | 3916.0 | 1328 | DEW057_TGACACAACAG_GCCACATC | hpf24 | batch2 | hpf24:midbrain | midbrain |
| hpf24_DEW057_CTTACGGG_AACCTGAC | DEW057 | 5611.0 | 1700 | DEW057_CTTACGGG_AACCTGAC | hpf24 | batch2 | hpf24:pharyngeal arch | pharyngeal arch |
| hpf24_DEW057_TGAACATCTAT_GACGATGG | DEW057 | 3676.0 | 1345 | DEW057_TGAACATCTAT_GACGATGG | hpf24 | batch2 | hpf24:midbrain | midbrain |
| hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT | DEW057 | 7021.0 | 1778 | DEW057_TGAGGTTTCTC_CTCAGAAT | hpf24 | batch2 | hpf24:optic cup | optic cup |
| hpf24_DEW057_ACGTGCTAG_CAAGTCAT | DEW057 | 3378.0 | 1170 | DEW057_ACGTGCTAG_CAAGTCAT | hpf24 | batch2 | hpf24:hindbrain dorsal | hindbrain dorsal |
71203 rows × 8 columns
Helper functions to match gene names
The orthomap2tei submodule contains the orthomap2tei.geneset_overlap() helper function to check for gene name overlap between the constructed orthomap from OrthoFinder results and a given scRNA dataset.
[11]:
# check overlap of orthomap <seqID> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])
[11]:
| g1_g2_overlap | g1_ratio | g2_ratio | |
|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 |
As one can see, there is no overlap between the zebrafish_data.var_names and the sequence IDs from the orthomap so far.
[12]:
# check overlap of transcript table <gene_id> and scRNA data <var_names>
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_species_t2g['gene_id'])
[12]:
| g1_g2_overlap | g1_ratio | g2_ratio | |
|---|---|---|---|
| 0 | 20418 | 1.0 | 0.62786 |
The overlap of the zebrafish_data.var_names and the imported GTF and gene_id looks better and covers all gene IDs from the scRNA data.
However, the orthomap obtained from OrthoFinder results contains isoform gene IDs.
As a first step matching the orthomap and the zebrafish_data.var_names from the scRNA data is to reduce each isoform gene ID to its corresponding gene ID.
Reduce isoforms to genes
The of2orthomap.replace_by() helper function can be used to add a new column to the orthomap dataframe by matching e.g. gene isoform names and their corresponding gene names.
[13]:
query_orthomap
[13]:
| seqID | Orthogroup | PSnum | PStaxID | PSname | PScontinuity | |
|---|---|---|---|---|---|---|
| 0 | ENSDART00000127643.3 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 1 | ENSDART00000171750.2 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 2 | ENSDART00000190648.1 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 |
| 3 | ENSDART00000130167.3 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 |
| 4 | ENSDART00000150909.2 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 |
| ... | ... | ... | ... | ... | ... | ... |
| 25167 | ENSDART00000180796.1 | OG0029510 | 19 | 186625 | Clupeocephala | 0.400000 |
| 25168 | ENSDART00000145618.2 | OG0029511 | 19 | 186625 | Clupeocephala | 0.400000 |
| 25169 | ENSDART00000143229.2 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 |
| 25170 | ENSDART00000143837.3 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 |
| 25171 | ENSDART00000180573.1 | OG0029513 | 13 | 117571 | Euteleostomi | 0.222222 |
25172 rows × 6 columns
[14]:
zebrafish_data.var_names
[14]:
Index(['ENSDARG00000002968', 'ENSDARG00000056314', 'ENSDARG00000102274',
'ENSDARG00000012468', 'ENSDARG00000063621', 'ENSDARG00000044802',
'ENSDARG00000011410', 'ENSDARG00000041170', 'ENSDARG00000011855',
'ENSDARG00000103957',
...
'ENSDARG00000078476', 'ENSDARG00000058562', 'ENSDARG00000110745',
'ENSDARG00000114172', 'ENSDARG00000110433', 'ENSDARG00000098193',
'ENSDARG00000101137', 'ENSDARG00000095817', 'ENSDARG00000079034',
'ENSDARG00000063372'],
dtype='object', name='index', length=20418)
[15]:
# convert orthomap transcript IDs into GeneIDs and add them to orthomap
query_orthomap['geneID'] = orthomap2tei.replace_by(
x_orig = query_orthomap['seqID'],
xmatch = query_species_t2g['transcript_id_version'],
xreplace = query_species_t2g['gene_id'],
)
[16]:
query_orthomap
[16]:
| seqID | Orthogroup | PSnum | PStaxID | PSname | PScontinuity | geneID | |
|---|---|---|---|---|---|---|---|
| 0 | ENSDART00000127643.3 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 | ENSDARG00000087544 |
| 1 | ENSDART00000171750.2 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 | ENSDARG00000095745 |
| 2 | ENSDART00000190648.1 | OG0000000 | 6 | 33213 | Bilateria | 0.846154 | ENSDARG00000097551 |
| 3 | ENSDART00000130167.3 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 | ENSDARG00000086420 |
| 4 | ENSDART00000150909.2 | OG0000001 | 10 | 7742 | Vertebrata | 0.909091 | ENSDARG00000086613 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 25167 | ENSDART00000180796.1 | OG0029510 | 19 | 186625 | Clupeocephala | 0.400000 | ENSDARG00000110427 |
| 25168 | ENSDART00000145618.2 | OG0029511 | 19 | 186625 | Clupeocephala | 0.400000 | ENSDARG00000093188 |
| 25169 | ENSDART00000143229.2 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 | ENSDARG00000069978 |
| 25170 | ENSDART00000143837.3 | OG0029512 | 29 | 7955 | Danio rerio | 1.000000 | ENSDARG00000078193 |
| 25171 | ENSDART00000180573.1 | OG0029513 | 13 | 117571 | Euteleostomi | 0.222222 | ENSDARG00000109747 |
25172 rows × 7 columns
Now, each seqID (isoform gene ID) is reduced to its gene ID geneID and one can check again the overlap.
[17]:
# check overlap of orthomap <geneID> and scRNA data
orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['geneID'])
[17]:
| g1_g2_overlap | g1_ratio | g2_ratio | |
|---|---|---|---|
| 0 | 19982 | 0.978646 | 0.793819 |
The created orthomap can be stored as a separated file like:
query_orthomap.to_csv('./data/zebrafish_ensembl_105_orthomap.tsv', sep='\t', index=False)To re-use the saved orthomap, so that one does not need to repeat Step 1 and Step 2, one could load it with orthomap2tei.read_orthomap() function.
If you like to continue, please have a look at the documentation of Step 4 - TEI calculation to get further insides.