oggmap: Step 2 - gene age class assignment

This notebook will demonstrate how to extract an orthomap (gene age class) from OrthoFinder results or how to import pre-calculated orthomaps.

Notebook file

Notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/get_orthomap.ipynb

Import libraries

[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import oggmap python package submodules

[2]:
# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets

Use OrthoFinder results to get gene age classification

The query species in this part is: Danio rerio (zebrafish).

Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.

Mandatory query species information:

  • query species sequence name (see header of <Orthogroups.GeneCount.tsv> or <Orthogroups.tsv> file) (seqname=)

  • query species taxonomic ID (qt=)

Mandatory information fromOrthoFinderrun:

  • tab delimted file containing sequence name and taxonomic IDs of all species used (<OrthoFinder name><tab><species taxID>) (sl=)

Mandatory files fromOrthoFinder:

  • <Orthogroups.GeneCount.tsv> or <Orthogroups.GeneCount.tsv.zip> (oc=)

  • <Orthogroups.tsv> or <Orthogroups.tsv.zip> (og=)

In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with oggmap and the qlin.get_qlin() function.

[3]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio')
query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]

Ensembl release-105

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

OrthoFinder was run with the option -S diamond_ultra_sens using translated, longest-isoform coding sequences (CDS) from Ensembl release-105 including species taxonomic IDs.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of oggmap

datasets.ensembl105(datapath='data') (download folder set to 'data').

Note: If you want to use your own OrthoFinder results, please have a look at the documentation of Step 0 - run OrthoFinder.

[4]:
datasets.ensembl105(datapath='data')
100% [..............................................................................] 15662 / 15662
[4]:
['data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
 'data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
 'data/ensembl_105_orthofinder_species_list.tsv']

Here, oggmap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.

In a pairwise manner, the query species and any other species in the OrthoFinder result might share multiple tree nodes down to the root of the species tree, but have only one youngest tree node in common. Among all possible comparison between the query species and the other species, the oldest as defined by the species tree root is selected and used for the gene age assignment.

Given the query species sequence name (seqname=) used in the OrthoFinder comparison, the query species taxonomic ID(qt=), the taxonomic IDs of all species from the <species_list.tsv> file (sl=) used in the OrthoFinder comparison, the orthogroup gene count <Orthogroups.GeneCount.tsv> file (oc=) results and the orthogroups <Orthogroups.tsv> file (og=), an orthomap is constructed.

Note: This step can take up to five minutes, depending on your hardware.

For this step to get the query species orthomap, one uses the of2orthomap.get_orthomap() function, like:

[5]:
# get query species orthomap

# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.ensembl105('data')
query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(
    seqname='Danio_rerio.GRCz11.cds.longest',
    qt='7955',
    sl='data/ensembl_105_orthofinder_species_list.tsv',
    oc='data/ensembl_105_orthofinder_Orthogroups.GeneCount.tsv.zip',
    og='data/ensembl_105_orthofinder_Orthogroups.tsv.zip',
    continuity=True)
query_orthomap
Danio_rerio.GRCz11.cds.longest
Danio rerio
7955
                                               species    taxID  \
0    Acanthochromis_polyacanthus.ASM210954v1.cds.lo...    80966
1    Accipiter_nisus.Accipiter_nisus_ver1.0.cds.lon...   211598
2       Ailuropoda_melanoleuca.ASM200744v2.cds.longest     9646
3             Amazona_collaria.ASM394721v1.cds.longest   241587
4         Amphilophus_citrinellus.Midas_v5.cds.longest    61819
..                                                 ...      ...
307  Xiphophorus_couchianus.Xiphophorus_couchianus-...    32473
308  Xiphophorus_maculatus.X_maculatus-5.0-male.cds...     8083
309    Zalophus_californianus.mZalCal1.pri.cds.longest     9704
310  Zonotrichia_albicollis.Zonotrichia_albicollis-...    44394
311  Zosterops_lateralis_melanops.ASM128173v1.cds.l...  1220523

                                               lineage  youngest_common  \
0    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
1    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
2    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
3    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
4    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
..                                                 ...              ...
307  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
308  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
309  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
310  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
311  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571

     youngest_name
0    Clupeocephala
1     Euteleostomi
2     Euteleostomi
3     Euteleostomi
4    Clupeocephala
..             ...
307  Clupeocephala
308  Clupeocephala
309   Euteleostomi
310   Euteleostomi
311   Euteleostomi

[312 rows x 5 columns]
[5]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity
0 ENSDART00000127643.3 OG0000000 6 33213 Bilateria 0.846154
1 ENSDART00000171750.2 OG0000000 6 33213 Bilateria 0.846154
2 ENSDART00000190648.1 OG0000000 6 33213 Bilateria 0.846154
3 ENSDART00000130167.3 OG0000001 10 7742 Vertebrata 0.909091
4 ENSDART00000150909.2 OG0000001 10 7742 Vertebrata 0.909091
... ... ... ... ... ... ...
25167 ENSDART00000180796.1 OG0029510 19 186625 Clupeocephala 0.400000
25168 ENSDART00000145618.2 OG0029511 19 186625 Clupeocephala 0.400000
25169 ENSDART00000143229.2 OG0029512 29 7955 Danio rerio 1.000000
25170 ENSDART00000143837.3 OG0029512 29 7955 Danio rerio 1.000000
25171 ENSDART00000180573.1 OG0029513 13 117571 Euteleostomi 0.222222

25172 rows × 6 columns

The created orthomap can be stored as a separated file like:

query_orthomap.to_csv('./orthomap.tsv', sep='\t', index=False)

To re-use the saved orthomap, so that one does not need to repeat step1 and step2, one could load it with orthomap2tei.read_orthomap() function.

Ensembl release-110

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

OrthoFinder was run with the option -S last using translated, longest-isoform coding sequences (CDS) from Ensembl release-110 including species taxonomic IDs.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of oggmap

datasets.ensembl110_last(datapath='data') (download folder set to 'data').

Note: If you want to use your own OrthoFinder results, please have a look at the documentation of Step 0 - run OrthoFinder.

[6]:
datasets.ensembl110_last(datapath='data')
100% [..............................................................................] 11317 / 11317
[6]:
['data/ensembl_110_orthofinder_last_Orthogroups.GeneCount.tsv.zip',
 'data/ensembl_110_orthofinder_last_Orthogroups.tsv.zip',
 'data/ensembl_110_orthofinder_last_species_list.tsv']
[7]:
# get query species orthomap

# download orthofinder results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.ensembl105('data')
query_orthomap, orthofinder_species_list, of_species_abundance = of2orthomap.get_orthomap(
    seqname='7955.danio_rerio.pep',
    qt='7955',
    sl='data/ensembl_110_orthofinder_last_species_list.tsv',
    oc='data/ensembl_110_orthofinder_last_Orthogroups.GeneCount.tsv.zip',
    og='data/ensembl_110_orthofinder_last_Orthogroups.tsv.zip',
    continuity=True)
query_orthomap
7955.danio_rerio.pep
Danio rerio
7955
                                    species  taxID  \
0                 10020.dipodomys_ordii.pep  10020
1    10029.cricetulus_griseus_chok1gshd.pep  10029
2       10029.cricetulus_griseus_crigri.pep  10029
3         10029.cricetulus_griseus_picr.pep  10029
4            10036.mesocricetus_auratus.pep  10036
..                                      ...    ...
313          9986.oryctolagus_cuniculus.pep   9986
314        99883.tetraodon_nigroviridis.pep  99883
315        9994.marmota_marmota_marmota.pep   9994
316            9999.urocitellus_parryii.pep   9999
317    Xtropicalisv9.0.Named.primaryTrs.pep   8364

                                               lineage  youngest_common  \
0    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
1    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
2    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
3    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
4    [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
..                                                 ...              ...
313  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
314  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           186625
315  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
316  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571
317  [1, 131567, 2759, 33154, 33208, 6072, 33213, 3...           117571

     youngest_name
0     Euteleostomi
1     Euteleostomi
2     Euteleostomi
3     Euteleostomi
4     Euteleostomi
..             ...
313   Euteleostomi
314  Clupeocephala
315   Euteleostomi
316   Euteleostomi
317   Euteleostomi

[318 rows x 5 columns]
[7]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity
0 ENSDART00000013359.10 OG0000000 6 33213 Bilateria 0.846154
1 ENSDART00000136092.3 OG0000001 6 33213 Bilateria 1.000000
2 ENSDART00000148349.3 OG0000001 6 33213 Bilateria 1.000000
3 ENSDART00000160249.2 OG0000001 6 33213 Bilateria 1.000000
4 ENSDART00000174091.2 OG0000001 6 33213 Bilateria 1.000000
... ... ... ... ... ... ...
24952 ENSDART00000125904.3 OG0035492 19 186625 Clupeocephala 0.400000
24953 ENSDART00000191935.1 OG0035493 25 30727 Cyprinoidei 1.000000
24954 ENSDART00000143229.2 OG0035494 29 7955 Danio rerio 1.000000
24955 ENSDART00000143837.3 OG0035494 29 7955 Danio rerio 1.000000
24956 ENSDART00000143384.2 OG0035495 22 186626 Otophysi 0.666667

24957 rows × 6 columns

The created orthomap can be stored as a separated file like:

query_orthomap.to_csv('./orthomap.tsv', sep='\t', index=False)

To re-use the saved orthomap, so that one does not need to repeat step1 and step2, one could load it with orthomap2tei.read_orthomap() function.

Gene age assignments per query species lineage node

Given an orthomap, one can get an overview of the gene age assignments per query species lineage node.

The oggmap submodule of2orhomap and the of2orthomap.get_counts_per_ps() function will show the distribution of the gene age classes and can be further visualized as follows:

[8]:
# show count per taxonomic group (PStaxID)
of2orthomap.get_counts_per_ps(query_orthomap)
[8]:
PSnum counts PStaxID PSname
PSnum
3 3 4376 33154 Opisthokonta
6 6 10530 33213 Bilateria
8 8 2390 7711 Chordata
10 10 2838 7742 Vertebrata
11 11 1515 7776 Gnathostomata
13 13 760 117571 Euteleostomi
14 14 235 7898 Actinopterygii
16 16 275 41665 Neopterygii
18 18 231 1489341 Osteoglossocephalai
19 19 506 186625 Clupeocephala
20 20 33 186634 Otomorpha
22 22 120 186626 Otophysi
25 25 317 30727 Cyprinoidei
29 29 831 7955 Danio rerio

Visualize number of species along query lineage and counts per gene age class

[9]:
# show number of species along query lineage
of_species_abundance

# bar plot number of species along query lineage
of_species_abundance.plot.bar(y='counts', use_index=True)
[9]:
<AxesSubplot: >
../_images/notebooks_get_orthomap_26_1.png
[10]:
# show count per taxonomic group (PStaxID)
of2orthomap.get_counts_per_ps(query_orthomap)

# bar plot count per taxonomic group (PSname)
ax = of2orthomap.get_counts_per_ps(query_orthomap).plot.bar(y='counts', x='PSname')
ax.set_title('D. rerio - Number of genes per gene age class')
plt.show()
../_images/notebooks_get_orthomap_27_0.png

Use pre-calculated gene age classification

The query species in this part is: Caenorhabditis elegans (nematode).

Any pre-calculated gene age classification can be imported as a table using the function orthomap2tei.read_orthomap(orthomapfile=filename).

The pre-calculated gene age classification file should be delimited with two columns GeneID<tab>Phylostratum, like e.g.:

GeneID<tab>Phylostratum
WBGene00000001<tab>1
WBGene00000002<tab>1
WBGene00000003<tab>1
WBGene00000004<tab>1
WBGene00000005<tab>2

The orthomap for this part was pre-calculated (Sun et al., 2021) and is available here:

https://doi.org/10.5281/zenodo.7242263

or can be accessed with the dataset submodule of oggmap

datasets.sun21_orthomap(datapath='data') (download folder set to 'data').

Note: If you want to use your own OrthoFinder results, please have a look at the documentation of Step 0 - run OrthoFinder.

A detailed description of how to extract an orthomap from OrthoFinder results is described in the first part.

[11]:
datasets.sun21_orthomap(datapath='data')
100% [............................................................................] 344640 / 344640
[11]:
'data/Sun2021_Orthomap.tsv'
[12]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Caenorhabditis elegans')
query name: Caenorhabditis elegans
query taxID: 6239
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Protostomia(33317)', 'Ecdysozoa(1206794)', 'Nematoda(6231)', 'Chromadorea(119089)', 'Rhabditida(6236)', 'Rhabditina(2301116)', 'Rhabditomorpha(2301119)', 'Rhabditoidea(55879)', 'Rhabditidae(6243)', 'Peloderinae(55885)', 'Caenorhabditis(6237)', 'Caenorhabditis elegans(6239)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 6231, 119089, 6236, 2301116, 2301119, 55879, 6243, 55885, 6237, 6239]
[13]:
# get query species orthomap

# download pre-calculated orthomap here: https://doi.org/10.5281/zenodo.7242263
# or download with datasets.sun21_orthomap(datapath='data')
query_orthomap = orthomap2tei.read_orthomap(orthomapfile='data/Sun2021_Orthomap.tsv')
query_orthomap
[13]:
GeneID Phylostratum
0 WBGene00000001 1
1 WBGene00000002 1
2 WBGene00000003 1
3 WBGene00000004 1
4 WBGene00000005 2
... ... ...
20035 WBGene00305132 9
20036 WBGene00305157 1
20037 WBGene00305158 13
20038 WBGene00305159 11
20039 WBGene00305173 0

20040 rows × 2 columns

Gene age assignments per query species lineage node

Given a pre-calculated orthomap, one can still get an overview of the gene age assignments per query species lineage node.

However, since it might be that some columns are missing which normally are produced by the of2orthomap.get_orthomap()function, the column names need to be set specifically with the of2orthomap.get_counts_per_ps() function to visualize the distribution of the gene age classes as follows:

[14]:
# show count per taxonomic group (Phylostratum)
of2orthomap.get_counts_per_ps(omap_df=query_orthomap,
    psnum_col='Phylostratum',
    pstaxid_col=None,
    psname_col=None)
[14]:
Phylostratum counts
Phylostratum
0 0 1290
1 1 5434
2 2 4666
3 3 603
4 4 1039
5 5 808
6 6 274
7 7 884
8 8 590
9 9 511
10 10 384
11 11 1113
12 12 1277
13 13 1167
[15]:
# bar plot count per taxonomic group (Phylostratum)
ax = of2orthomap.get_counts_per_ps(omap_df=query_orthomap,
    psnum_col='Phylostratum',
    pstaxid_col=None,
    psname_col=None).plot.bar(y='counts', x='Phylostratum')
ax.set_title('C. elegans - Number of genes per gene age class')
plt.show()
../_images/notebooks_get_orthomap_35_0.png

If you like to continue, please have a look at the documentation of Step 3 - map gene/transcript IDs to get further insides.