oggmap: Step 4 - TEI calculation

This notebook will demonstrate how to to add a transcriptome evolutionary index (short: TEI) to scRNA data.

Notebook file

Notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/add_tei.ipynb

Import libraries

[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import oggmap python package submodules

[2]:
# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets, ncbitax

Step 0, Step 1, Step 2 and Step 3

In order to come to Step 4, TEI calculation, one needs to have the results from Step 0, Step 1, Step 2 and Step 3.

The query species in this part is: Danio rerio (zebrafish).

Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.

In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with oggmap and the qlin.get_qlin() function.

In Step 2 - gene age class assignment you have already been introduced how to extract an orthomap (gene age class) from OrthoFinder results with oggmap and the of2orthomap.get_orthomap() function or how to import pre-calculated orthomaps with the orthomap2tei.read_orthomap() function.

In Step 3 - map gene/transcript IDs you have already been introduced how to extract gene IDs from GTF file with orthoamp and the gtf2t2g.parse_gtf() function. You have also been introduced how to use the orthomap2tei.geneset_overlap() function to check the overlap between the gene IDs and have learned how to use the orthomap2tei.replace_by() function to e.g. reduce isoform gene IDs to gene IDs.

Step 0 - run OrthoFinder

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

Please have a look at the documentation of Step 0 - run OrthoFinder to get further insides.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of oggmap

datasets.ensembl113_last(datapath='data') (download folder set to 'data').

Step 1 - get taxonomic information

Please have a look at the documentation of Step 1 - get taxonomic information to get further insides.

[3]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio', dbname='data/taxadb.sqlite')
query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]

Step 2 - gene age class assignment

Here, oggmap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.

Please have a look at the documentation of Step 2 - gene age class assignment to get further insides.

Step 3 - map gene/transcript IDs

To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the gtf2t2g submodule of oggmap and the gtf2t2g.parse_gtf() function, one can extract gene and transcript names from a given gene feature file (GTF).

Please have a look at the documentation of Step 3 - map gene/transcript IDs to get further insides.

Here, the pre-calculated orthomap from Danio rerio (zebrafish) obtained via Step 0, Step 1, Step 2 and Step 3 is loaded as follows:

[4]:
datasets.zebrafish_ensembl113_orthomap(datapath='data')
100% [..........................................................................] 1901053 / 1901053
[4]:
'data/zebrafish_ensembl_113_orthomap.tsv'
[5]:
# get query species orthomap

# download zebrafish orthomap results here: https://doi.org/10.5281/zenodo.7242263
# or download with datasets.zebrafish_ensembl_113_orthomap(datapath='data')

query_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_113_orthomap.tsv')
query_orthomap
[5]:
seqID Orthogroup PSnum PStaxID PSname PScontinuity geneID
0 ENSDART00000013359.10 OG0000000 10 7742 Vertebrata 0.909091 ENSDARG00000013014
1 ENSDART00000014270.5 OG0000000 10 7742 Vertebrata 0.909091 ENSDARG00000094080
2 ENSDART00000099395.5 OG0000000 10 7742 Vertebrata 0.909091 ENSDARG00000068659
3 ENSDART00000145885.2 OG0000000 10 7742 Vertebrata 0.909091 ENSDARG00000094125
4 ENSDART00000159675.3 OG0000000 10 7742 Vertebrata 0.909091 ENSDARG00000101806
... ... ... ... ... ... ... ...
24956 ENSDART00000191935.1 OG0025717 25 30727 Cyprinoidei 1.000000 ENSDARG00000114540
24957 ENSDART00000171005.2 OG0025718 22 186626 Otophysi 0.666667 ENSDARG00000102218
24958 ENSDART00000143229.2 OG0025719 29 7955 Danio rerio 1.000000 ENSDARG00000069978
24959 ENSDART00000143837.3 OG0025719 29 7955 Danio rerio 1.000000 ENSDARG00000078193
24960 ENSDART00000143384.2 OG0025720 22 186626 Otophysi 0.666667 ENSDARG00000092452

24961 rows × 7 columns

Import now, the scRNA dataset of the query species

Here, data is used, like in the publication (Farrell et al., 2018; Wagner et al., 2018; Qiu et al., 2022).

scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy or oggmap package and is available here:

https://doi.org/10.5281/zenodo.7243602

or can be accessed with the dataset submodule of oggmap:

datasets.qiu22_zebrafish(datapath='data') (download folder set to 'data').

[6]:
# load scRNA data

# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602
# or download with datasets.qui22_zebrafish(datapath='data')

#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')
zebrafish_data = sc.read_h5ad('data/zebrafish_data.h5ad')
#subset of the zebrafish_data
zebrafish_data_subset = zebrafish_data[:1000].copy()

Step 4 - Get TEI values and add them to scRNA dataset

Since now the gene names correspond to each other in the orthomap and the scRNA adata object, one can calculate the transcriptome evolutionary index (TEI) and add them to the scRNA dataset (adata object).

The TEI measure represents the weighted arithmetic mean (expression levels as weights for the phylostratum value) over all evolutionary age categories denoted as phylostra.

\(TEI_s = \sum (e_{is} * ps_i) / \sum e_{is}\)

, where \(TEI_s\) denotes the TEI value in developmental stage \({s, e_{is}}\) denotes the gene expression level of gene \({i}\) in stage \({s}\), and \({ps_i}\) denotes the corresponding phylostratum of gene \({i, i = 1,...,N}\) and \({N = total\ number\ of\ genes}\).

Note: If e.g. two different isoforms would fall into two different gene age classes, their gene ages might differ based on the oldest ortholog found in their corresponding orthologous groups. However, both isoforms share the same gene name and their gene ages would clash. In this case one can decide either to use the keep='min' or keep='max' gene age to be kept by the get_tei function, which defaults to keep in this cases the keep='min' or in other words the ‘older’ gene age.

To be able to re-use the original count data, they are added as a new layer to the adata object. This is useful because later on the count data can be used to extract either the relative expression per gene age class or re-calculate other metrics.

This can be done either on un-normalized counts, on normalized and log-transformed data.

[7]:
zebrafish_data.layers['counts'] = zebrafish_data.X

add TEI to adata object

Using the submodule orthomap2tei from oggmap and the orthomap2tei.get_tei() function, transcriptome evolutionary index (TEI) values are calculated and directyl added to the existing adata object (add_obs=True).

There are other options to e.g. not start from the adata.X counts but from another layer from the adata object, the default is to use the adata.X (layer=None). By default, the values are pre-processed by the normalize_total option and the log1p option.

Note: Please have a look at Liu & Robinson-Rechavi, 2018 to get more insides how expression transformation can influence TEI calculation.

If add_obs=True the resulting TEI values are added to the existing adata object as a new observation with the name set with the obs_name option.

If add_var=True the gene age values are added to the existing adata object as a new variable with the name set with the var_name option.

Note: Genes not assigned to any gene class will get a missing assignment.

If one wants to calculate bootstrap TEI values per cell, the boot option can be set to boot=True and gene age classes will be randomly chosen prior calculating TEI values bt=10 times.

Note: If memory is a problem with large annData objects, consider processing the adata object in chunks using the chunk_size parameter.

[8]:
# add TEI values to existing adata object
orthomap2tei.get_tei(adata=zebrafish_data,
    gene_id=query_orthomap['geneID'],
    gene_age=query_orthomap['PSnum'],
    keep='min',
    layer=None,
    add_var=True,
    var_name='Phylostrata',
    add_obs=True,
    obs_name='tei',
    boot=False,
    bt=10,
    normalize_total=True,
    log1p=True,
    target_sum=1e6,
    chunk_size=100000)
[8]:
tei
hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC 5.514485
hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT 5.503633
hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC 5.488072
hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN 5.50889
hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT 5.574979
... ...
hpf24_DEW057_TGACACAACAG_GCCACATC 5.189284
hpf24_DEW057_CTTACGGG_AACCTGAC 5.399853
hpf24_DEW057_TGAACATCTAT_GACGATGG 5.154028
hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT 5.050597
hpf24_DEW057_ACGTGCTAG_CAAGTCAT 5.344325

71203 rows × 1 columns

In the adata object, for each cell there is now a new variable called tei, which can be used in downstream analysis.

Once the gene age data has been added to the scRNA dataset, one can e.g. plot the corresponding transcriptome evolutionary index (TEI) values by any given observation pre-defined in the scRNA dataset.

Boxplot gene age class per embryo stage

[9]:
sc.pl.violin(adata=zebrafish_data,
             keys=['tei'],
             groupby='stage',
             rotation=90,
             palette='Paired',
             stripplot=False,
             inner='box',
             hue='stage',
             legend=False)
../_images/notebooks_add_tei_25_0.png

calculate bootstrap on subet of the data

[10]:
# calculate bootstrap
orthomap2tei.get_tei(adata=zebrafish_data_subset,
    gene_id=query_orthomap['geneID'],
    gene_age=query_orthomap['PSnum'],
    keep='min',
    layer=None,
    add_var=True,
    var_name='Phylostrata',
    add_obs=True,
    obs_name='tei',
    boot=True,
    bt=10,
    normalize_total=True,
    log1p=True,
    target_sum=1e6,
    chunk_size=100000)
|████████████████████████████████████████| 10/10 [100%] in 2.5s (4.00/s)
[10]:
(                                        tei
 hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC  5.514485
 hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT  5.503633
 hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC  5.488072
 hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN   5.50889
 hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT  5.574979
 ...                                     ...
 hpf4.3_ZFDOME_WT_DS5_CGGACCGCAGGT  5.301656
 hpf4.3_ZFDOME_WT_DS5_CGGATGGCCACT  5.358101
 hpf4.3_ZFDOME_WT_DS5_CGGCAATGCTTG  5.322054
 hpf4.3_ZFDOME_WT_DS5_CGGCAGTGTTAC  5.351171
 hpf4.3_ZFDOME_WT_DS5_CGGCCAGCGGAN  5.402065

 [1000 rows x 1 columns],
                                      boot_0    boot_1    boot_2    boot_3  \
 hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC  6.759911  6.647228  6.694393  6.849195
 hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT   6.64466  6.624946  6.689043  6.722857
 hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC  6.689828  6.718553  6.771587  6.837948
 hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN   6.78249  6.682479  6.687551  6.839673
 hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT  6.705242  6.738627  6.700644  6.805124
 ...                                     ...       ...       ...       ...
 hpf4.3_ZFDOME_WT_DS5_CGGACCGCAGGT  6.702291  6.733482  6.642275  6.853466
 hpf4.3_ZFDOME_WT_DS5_CGGATGGCCACT   6.77451   6.70771  6.707631  6.725549
 hpf4.3_ZFDOME_WT_DS5_CGGCAATGCTTG  6.940786  6.586875  6.724755  6.693181
 hpf4.3_ZFDOME_WT_DS5_CGGCAGTGTTAC  6.753914   6.69377  6.788657  6.836734
 hpf4.3_ZFDOME_WT_DS5_CGGCCAGCGGAN  6.715839  6.721093  6.690696  6.734315

                                      boot_4    boot_5    boot_6    boot_7  \
 hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC  6.806379  6.684169  6.647911  6.745181
 hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT  6.874048  6.857684  6.640579  6.835806
 hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC  6.781303  6.572003  6.632616   6.76269
 hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN  6.712307  6.724935  6.642234   6.80512
 hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT  6.809793   6.68517  6.684571  6.724404
 ...                                     ...       ...       ...       ...
 hpf4.3_ZFDOME_WT_DS5_CGGACCGCAGGT  6.739432  6.761997  6.659034  6.740662
 hpf4.3_ZFDOME_WT_DS5_CGGATGGCCACT  6.792971  6.753959  6.727792  6.888245
 hpf4.3_ZFDOME_WT_DS5_CGGCAATGCTTG  6.728347  6.707649  6.727023  6.785233
 hpf4.3_ZFDOME_WT_DS5_CGGCAGTGTTAC  6.904999  6.758114  6.599089  6.737372
 hpf4.3_ZFDOME_WT_DS5_CGGCCAGCGGAN  6.763937  6.733328  6.556272   6.80844

                                      boot_8    boot_9
 hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC  6.778171  6.647924
 hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT  6.779376  6.633174
 hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC  6.703302  6.685941
 hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN  6.713637  6.626618
 hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT  6.736239  6.686084
 ...                                     ...       ...
 hpf4.3_ZFDOME_WT_DS5_CGGACCGCAGGT  6.550693  6.772745
 hpf4.3_ZFDOME_WT_DS5_CGGATGGCCACT  6.587417   6.79228
 hpf4.3_ZFDOME_WT_DS5_CGGCAATGCTTG  6.635318  6.615079
 hpf4.3_ZFDOME_WT_DS5_CGGCAGTGTTAC  6.828364  6.599813
 hpf4.3_ZFDOME_WT_DS5_CGGCCAGCGGAN  6.743372  6.586846

 [1000 rows x 10 columns])

Please have a look at the documentation of Downstream analysis to get further insides.