oggmap: Step 1 - get taxonomic information

This notebook will demonstrate how to get taxonomic information for your query species with oggmap.

Given a species name or taxonomic ID, oggmap extracts the lineage of the query species from the NCBI taxonomy. oggmap v0.0.1 does this with the ete3 Python toolkit (Huerta-Cepas et al., 2016); oggmap v0.0.2 extracts it with taxadb2 instead (see taxadb2 for more information). This lineage is needed together with the taxonomic classification of every species used in the OrthoFinder comparison.

Note: To download or update the NCBI taxonomy database via the ete3 Python package with oggmap v0.0.1, use the oggmap command-line function ncbitax or run the following code:

# command line
oggmap ncbitax -u

# import submodule
from oggmap import ncbitax
ncbitax.update_ncbi()

Note: To download or update the NCBI taxonomy database via the taxadb2 Python package with oggmap v0.0.2, use the oggmap command-line function ncbitax or run the following code:

# command line
oggmap ncbitax -u -outdir taxadb -t taxa -dbname taxadb.sqlite

# import submodule
import sys
from oggmap import ncbitax
outdir = 'taxadb'
dbname = 'taxadb.sqlite'
sys.argv = ['ncbitax', '-u', '-outdir', outdir, '-t', 'taxa', '-dbname', dbname]
update_parser = ncbitax.define_parser()
update_args, unknown_args = update_parser.parse_known_args()
ncbitax.update_ncbi(update_args)
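Before passing a database file to oggmap, it can be worth confirming that the file exists and opens as SQLite. The following is a minimal sketch using only the Python standard library; `list_sqlite_tables` is a hypothetical helper (not part of oggmap or taxadb2), and the demo runs against a throwaway database whose table name `demo` is made up and says nothing about the real taxadb2 schema.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def list_sqlite_tables(db_path):
    """Return the table names in an SQLite file, or raise if the file is missing."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"taxonomy database not found: {path}")
    # closing() makes sure the connection is released after the query.
    with closing(sqlite3.connect(str(path))) as con:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway database so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.sqlite"
    with closing(sqlite3.connect(str(demo_db))) as con:
        con.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")
        con.commit()
    demo_tables = list_sqlite_tables(demo_db)

print(demo_tables)
```

To use it on real data, point `list_sqlite_tables` at wherever your `taxadb.sqlite` file was written; an empty list or a `FileNotFoundError` suggests the update step did not complete.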

Notebook file

The notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/query_lineage.ipynb

Import libraries

[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import oggmap python package submodules

[2]:
# import submodules
from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets, ncbitax

Get query species taxonomic lineage information

The oggmap submodule qlin retrieves the taxonomic information with the qlin.get_qlin() function, as follows:

[3]:
# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Caenorhabditis elegans', dbname='taxadb.sqlite')
query name: Caenorhabditis elegans
query taxID: 6239
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Protostomia(33317)', 'Ecdysozoa(1206794)', 'Nematoda(6231)', 'Chromadorea(119089)', 'Rhabditida(6236)', 'Rhabditina(2301116)', 'Rhabditomorpha(2301119)', 'Rhabditoidea(55879)', 'Rhabditidae(6243)', 'Peloderinae(55885)', 'Caenorhabditis(6237)', 'Caenorhabditis elegans(6239)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 6231, 119089, 6236, 2301116, 2301119, 55879, 6243, 55885, 6237, 6239]

The query_lineage variable now contains the following information in a list:

  • query name query_lineage[0]

  • query taxID query_lineage[1]

  • query lineage query_lineage[2]

  • query lineage dictionary query_lineage[3]

  • query lineage zip query_lineage[4]

  • query lineage names query_lineage[5]

  • reverse query lineage query_lineage[6]

  • query kingdom query_lineage[7]
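Positional indices are easy to mix up, so the list can be unpacked into named fields. The sketch below assumes the eight-slot layout listed above; `QueryLineage` and its field names are hypothetical (not part of oggmap), and a shortened stand-in list is used so that the code runs without a taxonomy database. The table at index 5 is represented here by a plain list of rows.

```python
from collections import namedtuple

# Hypothetical helper, not part of oggmap: one field per list position.
QueryLineage = namedtuple(
    "QueryLineage",
    ["name", "taxid", "lineage", "lineage_dict", "lineage_zip",
     "lineage_names", "reverse_lineage", "kingdom"],
)

# Shortened stand-in with the same layout as the list described above.
lineage = [1, 131567, 2759]
names = {1: "root", 131567: "cellular organisms", 2759: "Eukaryota"}
example = [
    "Eukaryota",                                        # [0] query name
    2759,                                               # [1] query taxID
    lineage,                                            # [2] query lineage
    names,                                              # [3] lineage dictionary
    [(t, names[t]) for t in lineage],                   # [4] lineage zip
    [(i, t, names[t]) for i, t in enumerate(lineage)],  # [5] stands in for the table
    lineage[::-1],                                      # [6] reverse lineage
    "Eukaryota",                                        # [7] query kingdom
]

ql = QueryLineage(*example)
print(ql.name, ql.taxid, ql.kingdom)
```

With real data, `QueryLineage(*query_lineage)` would give the same named access, provided the returned list keeps this order.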

[4]:
#query name
query_lineage[0]
[4]:
'Caenorhabditis elegans'
[5]:
#query taxID
query_lineage[1]
[5]:
6239
[6]:
#query lineage
query_lineage[2]
[6]:
[1,
 131567,
 2759,
 33154,
 33208,
 6072,
 33213,
 33317,
 1206794,
 6231,
 119089,
 6236,
 2301116,
 2301119,
 55879,
 6243,
 55885,
 6237,
 6239]
[7]:
#query lineage dictionary
query_lineage[3]
[7]:
{1: 'root',
 131567: 'cellular organisms',
 2759: 'Eukaryota',
 33154: 'Opisthokonta',
 33208: 'Metazoa',
 6072: 'Eumetazoa',
 33213: 'Bilateria',
 33317: 'Protostomia',
 1206794: 'Ecdysozoa',
 6231: 'Nematoda',
 119089: 'Chromadorea',
 6236: 'Rhabditida',
 2301116: 'Rhabditina',
 2301119: 'Rhabditomorpha',
 55879: 'Rhabditoidea',
 6243: 'Rhabditidae',
 55885: 'Peloderinae',
 6237: 'Caenorhabditis',
 6239: 'Caenorhabditis elegans'}
[8]:
#query lineage zip
query_lineage[4]
[8]:
[(1, 'root'),
 (131567, 'cellular organisms'),
 (2759, 'Eukaryota'),
 (33154, 'Opisthokonta'),
 (33208, 'Metazoa'),
 (6072, 'Eumetazoa'),
 (33213, 'Bilateria'),
 (33317, 'Protostomia'),
 (1206794, 'Ecdysozoa'),
 (6231, 'Nematoda'),
 (119089, 'Chromadorea'),
 (6236, 'Rhabditida'),
 (2301116, 'Rhabditina'),
 (2301119, 'Rhabditomorpha'),
 (55879, 'Rhabditoidea'),
 (6243, 'Rhabditidae'),
 (55885, 'Peloderinae'),
 (6237, 'Caenorhabditis'),
 (6239, 'Caenorhabditis elegans')]
[9]:
#query lineage names
query_lineage[5]
[9]:
    PSnum  PStaxID  PSname
0       0        1  root
1       1   131567  cellular organisms
2       2     2759  Eukaryota
3       3    33154  Opisthokonta
4       4    33208  Metazoa
5       5     6072  Eumetazoa
6       6    33213  Bilateria
7       7    33317  Protostomia
8       8  1206794  Ecdysozoa
9       9     6231  Nematoda
10     10   119089  Chromadorea
11     11     6236  Rhabditida
12     12  2301116  Rhabditina
13     13  2301119  Rhabditomorpha
14     14    55879  Rhabditoidea
15     15     6243  Rhabditidae
16     16    55885  Peloderinae
17     17     6237  Caenorhabditis
18     18     6239  Caenorhabditis elegans
[10]:
#reverse query lineage
query_lineage[6]
[10]:
[6239,
 6237,
 55885,
 6243,
 55879,
 2301119,
 2301116,
 6236,
 119089,
 6231,
 1206794,
 33317,
 33213,
 6072,
 33208,
 33154,
 2759,
 131567,
 1]
[11]:
#query kingdom
query_lineage[7]
[11]:
'Eukaryota'
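Judging from the outputs above, several of these elements follow directly from the lineage and its names: the zipped list pairs each taxID with its name, the reversed list is the lineage read from species to root, and PSnum appears to be the position in the root-to-species lineage. The sketch below recomputes them from values copied from the outputs; it does not call oggmap, and `ps_of` is a hypothetical lookup name.

```python
# Values copied from the outputs above (query_lineage[2] and query_lineage[3]).
lineage = [1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 6231,
           119089, 6236, 2301116, 2301119, 55879, 6243, 55885, 6237, 6239]
lineage_dict = {
    1: 'root', 131567: 'cellular organisms', 2759: 'Eukaryota',
    33154: 'Opisthokonta', 33208: 'Metazoa', 6072: 'Eumetazoa',
    33213: 'Bilateria', 33317: 'Protostomia', 1206794: 'Ecdysozoa',
    6231: 'Nematoda', 119089: 'Chromadorea', 6236: 'Rhabditida',
    2301116: 'Rhabditina', 2301119: 'Rhabditomorpha', 55879: 'Rhabditoidea',
    6243: 'Rhabditidae', 55885: 'Peloderinae', 6237: 'Caenorhabditis',
    6239: 'Caenorhabditis elegans',
}

# (taxID, name) pairs in root-to-species order, as in query_lineage[4].
lineage_zip = [(taxid, lineage_dict[taxid]) for taxid in lineage]

# Species-to-root order, as in query_lineage[6].
reverse_lineage = lineage[::-1]

# Assumption: PSnum equals the index in the root-to-species lineage.
ps_of = {taxid: psnum for psnum, taxid in enumerate(lineage)}

print(ps_of[6231], lineage_dict[6231])  # 9 Nematoda
```

A lookup such as `ps_of` is convenient when a taxID from another species needs to be placed on the query lineage.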

Get query species lineage as a tree object

[13]:
import sys
from Bio import Phylo
from io import StringIO
lineage_tree = qlin.get_lineage_topo(qt='6239', dbname='taxadb.sqlite')
newick_str = StringIO()
Phylo.write(lineage_tree, newick_str, "newick")
newick_str.seek(0)
newick_str.read()
[13]:
'(((((((((((((((((((18/6239/Caenorhabditis_elegans:0.00000):0.00000,17/6237/Caenorhabditis:0.00000):0.00000,16/55885/Peloderinae:0.00000):0.00000,15/6243/Rhabditidae:0.00000):0.00000,14/55879/Rhabditoidea:0.00000):0.00000,13/2301119/Rhabditomorpha:0.00000):0.00000,12/2301116/Rhabditina:0.00000):0.00000,11/6236/Rhabditida:0.00000):0.00000,10/119089/Chromadorea:0.00000):0.00000,9/6231/Nematoda:0.00000):0.00000,8/1206794/Ecdysozoa:0.00000):0.00000,7/33317/Protostomia:0.00000):0.00000,6/33213/Bilateria:0.00000):0.00000,5/6072/Eumetazoa:0.00000):0.00000,4/33208/Metazoa:0.00000):0.00000,3/33154/Opisthokonta:0.00000):0.00000,2/2759/Eukaryota:0.00000):0.00000,1/131567/cellular_organisms:0.00000):0.00000,0/1/root:0.00000):0.00000;\n'
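Each node label in the Newick string above has the form `PSnum/taxID/name`, with spaces in names written as underscores. As a rough sketch, the labels can be pulled back out with a regular expression from the standard library; `parse_lineage_labels` is a hypothetical helper (not part of oggmap), the input is a shortened stand-in for the full string, and turning every underscore back into a space is an assumption that would alter names that genuinely contain underscores.

```python
import re

# Shortened stand-in for the Newick string shown above.
newick = ("(((2/2759/Eukaryota:0.00000):0.00000,"
          "1/131567/cellular_organisms:0.00000):0.00000,"
          "0/1/root:0.00000):0.00000;\n")

# A label is digits, slash, digits, slash, then a name ending at the colon.
LABEL = re.compile(r"(\d+)/(\d+)/([^:]+):")


def parse_lineage_labels(newick_text):
    """Return (PSnum, taxID, name) tuples sorted from root to species."""
    rows = [
        (int(psnum), int(taxid), name.replace("_", " "))
        for psnum, taxid, name in LABEL.findall(newick_text)
    ]
    return sorted(rows)


rows = parse_lineage_labels(newick)
for row in rows:
    print(row)
```

For anything beyond label extraction, working with the tree object itself is likely the more robust route.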

To continue, see the documentation of Step 2 - gene age class assignment for further insights.