Inspect & map identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.

Bionty enables this by mapping metadata against versioned ontologies using inspect().

For terms that are not directly mappable, we offer map_synonyms(), lookup() and fuzzy_match() (also see /lookup).

import bionty as bt
import pandas as pd

Inspect and map synonyms of gene identifiers

To illustrate, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
                 gene symbol  hgnc id
ensembl_gene_id
ENSG00000148584  A1CF         HGNC:24086
ENSG00000121410  A1BG         HGNC:5
ENSG00000188389  FANCD1       HGNC:1101
ENSGcorrupted    corrupted    corrupted

First we can check whether any of our values are mappable against the ontology reference.

Tip: available fields are accessible via gene_bionty.fields

gene_bionty = bt.Gene()

gene_bionty
Gene
Species: human
Source: ensembl, release-108

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of ontology terms
🎯 Gene.fuzzy_match(): fuzzy match against ontology terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
 'not_mapped': ['ENSGcorrupted']}
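
Conceptually, this check is a membership test against a reference column. The sketch below reproduces the result above with plain pandas. The `reference_ids` set is a hypothetical stand-in for the Ensembl ID column of the ontology reference table; it is not Bionty's actual implementation.

```python
import pandas as pd

# Hypothetical stand-in for the reference column (assumption: in practice
# these values would come from the ontology reference table).
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}

identifiers = pd.Index(
    ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"]
)

# Boolean mask: True where the identifier exists in the reference
is_mapped = identifiers.isin(reference_ids)

result = {
    "mapped": identifiers[is_mapped].tolist(),
    "not_mapped": identifiers[~is_mapped].tolist(),
}
print(result)
```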

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)
🔶 The identifiers contain synonyms!
💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'
✅ 2 terms (50.0%) are mapped.
🔶 2 terms (50.0%) are not mapped.
{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}

Two of the four gene symbols map directly. Bionty also warns us that some of the symbols are synonyms that can be converted into standardized symbols.

Mapping synonyms returns a list of standardized terms:

mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)

mapped_symbol_synonyms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']

Optionally, return only a mapper of {synonym: standardized name}:

gene_bionty.map_synonyms(df_orig["gene symbol"], gene_bionty.symbol, return_mapper=True)
{'FANCD1': 'BRCA2'}
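
A mapper of this shape can be applied to a column with ordinary pandas tools. The following is a minimal sketch that copies the mapper from the output above and uses `Series.replace`; the variable names are illustrative only.

```python
import pandas as pd

symbols = pd.Series(["A1CF", "A1BG", "FANCD1", "corrupted"], name="gene symbol")
mapper = {"FANCD1": "BRCA2"}  # copied from the output above

# replace() swaps only the values that appear as keys in the mapper;
# all other values, including unmappable ones, are left untouched.
standardized = symbols.replace(mapper)
print(standardized.tolist())
```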

We can use the standardized symbols as the new index:

df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
           ensembl_gene_id  gene symbol  hgnc id
A1CF       ENSG00000148584  A1CF         HGNC:24086
A1BG       ENSG00000121410  A1BG         HGNC:5
BRCA2      ENSG00000188389  FANCD1       HGNC:1101
corrupted  ENSGcorrupted    corrupted    corrupted

You may return a DataFrame with a boolean column indicating if the identifiers are mappable:

gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
           __mapped__
A1CF       True
A1BG       True
BRCA2      True
corrupted  False
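
The boolean column makes it easy to split a table into mappable and unmappable rows. As a small sketch, the frame below is a hand-built copy of the result shown above (not produced by calling Bionty), filtered with standard boolean indexing.

```python
import pandas as pd

# Hand-built copy of the inspect(..., return_df=True) result shown above
inspected = pd.DataFrame(
    {"__mapped__": [True, True, True, False]},
    index=["A1CF", "A1BG", "BRCA2", "corrupted"],
)

# Keep only rows whose identifier was found in the reference
mapped_only = inspected[inspected["__mapped__"]]

# Collect the identifiers that still need manual curation
unmapped = inspected.index[~inspected["__mapped__"]].tolist()

print(mapped_only.index.tolist())
print(unmapped)
```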

Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using CellMarker.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cellmarker_bionty = bt.CellMarker()

cellmarker_bionty


CellMarker
Species: human
Source: cellmarker, 2.0

📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of ontology terms
🎯 CellMarker.fuzzy_match(): fuzzy match against ontology terms
🧐 CellMarker.inspect(): check if identifiers are mappable
👽 CellMarker.map_synonyms(): map synonyms to standardized names
🔗 CellMarker.ontology: Pronto.Ontology object

Now let’s check which cell markers from the file can be found in the reference:

cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)
🔶 The identifiers contain synonyms!
💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'
✅ 7 terms (50.0%) are mapped.
🔶 7 terms (50.0%) are not mapped.
{'mapped': ['CCR7', 'CD14', 'CD8', 'CD45RA', 'CD4', 'CD3', 'CD66b'],
 'not_mapped': ['KI67',
  'CD127a',
  'PD1',
  'Invalid-1',
  'Invalid-2',
  'Siglec8',
  'Time']}

Logging suggests we map synonyms:

synonyms_mapper = cellmarker_bionty.map_synonyms(
    markers.index, cellmarker_bionty.name, return_mapper=True
)

This maps 3 additional terms:

synonyms_mapper
{'KI67': 'Ki67', 'PD1': 'PD-1', 'Siglec8': 'SIGLEC8'}

Let’s replace the synonyms with standardized names in the markers DataFrame:

markers.rename(index=synonyms_mapper, inplace=True)
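
Note that rename only touches index labels that appear as keys in the mapper; every other label passes through unchanged, and mapper keys that are absent from the index are ignored. A minimal pandas-only sketch on a subset of the markers (the `markers_subset` name is illustrative):

```python
import pandas as pd

markers_subset = pd.DataFrame(index=["KI67", "CCR7", "PD1", "CD127a", "Time"])
synonyms_mapper = {"KI67": "Ki67", "PD1": "PD-1", "Siglec8": "SIGLEC8"}

# Without inplace=True, rename returns a new DataFrame and leaves the
# original untouched, which is convenient for comparing before and after.
renamed = markers_subset.rename(index=synonyms_mapper)

print(renamed.index.tolist())
```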

Inspecting again shows that 4 terms are still not found in the reference.

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which CellMarker will not curate.

cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)
✅ 10 terms (71.4%) are mapped.
🔶 4 terms (28.6%) are not mapped.
{'mapped': ['Ki67',
  'CCR7',
  'CD14',
  'CD8',
  'CD45RA',
  'CD4',
  'CD3',
  'PD-1',
  'CD66b',
  'SIGLEC8'],
 'not_mapped': ['CD127a', 'Invalid-1', 'Invalid-2', 'Time']}
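
If the non-marker channels are known in advance, they can be set aside before inspection so that the unmapped list contains only genuine curation problems. This is a sketch under that assumption; the `non_marker_channels` list is chosen by hand here and is not something Bionty provides.

```python
import pandas as pd

channels = pd.Index(["Ki67", "CCR7", "CD127a", "Invalid-1", "Invalid-2", "Time"])

# Assumption: channels known beforehand not to be cell markers
non_marker_channels = ["Time", "Invalid-1", "Invalid-2"]

# Keep only the channels that are expected to be cell markers
marker_channels = channels[~channels.isin(non_marker_channels)]

print(marker_channels.tolist())
```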

CD127a is not found, so let’s check the lookup with auto-completion:

lookup = cellmarker_bionty.lookup()
lookup.cd127
CellMarker(id='CM_CD127', name='CD127', ncbi_gene_id='3575', gene_symbol='IL7R', gene_name='interleukin 7 receptor', uniprotkb_id='P16871', synonyms=None)

Indeed, the correct name is CD127; CD127a was a typo.
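
To give a feel for how attribute-style lookup can support auto-completion, here is a rough sketch that exposes names as lowercase attributes. The `make_lookup` helper is hypothetical and written only for illustration; how Bionty builds its lookup object internally is not shown in this guide.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Hypothetical helper: expose each name as a lowercase attribute."""
    entries = {}
    for name in names:
        # Turn characters that are invalid in identifiers into underscores
        key = re.sub(r"\W+", "_", name).lower()
        # Identifiers cannot start with a digit
        if key[:1].isdigit():
            key = "_" + key
        entries[key] = name
    return SimpleNamespace(**entries)


demo_lookup = make_lookup(["CD127", "PD-1", "Ki67"])
print(demo_lookup.cd127)
print(demo_lookup.pd_1)
```

Because the attribute resolves to the stored name, a misspelled attribute raises an AttributeError instead of silently passing a wrong string along.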

Now let’s fix the markers so all of them can be linked:

Tip

Using the .lookup instead of passing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CD127a": lookup.cd127.name})

Optionally, run a fuzzy match:

cellmarker_bionty.fuzzy_match("CD127a", return_ranked_results=True).head(5)
name    id         ncbi_gene_id  gene_symbol  gene_name                                uniprotkb_id  synonyms  __ratio__
CD127   CM_CD127   3575          IL7R         interleukin 7 receptor                   P16871        None      90.909091
CD167a  CM_CD167a  None          None         None                                     None          None      83.333333
CD107a  CM_CD107a  3916          LAMP1        lysosomal associated membrane protein 1  A0A024RDY3    None      83.333333
CD172a  CM_CD172a  None          None         None                                     None          None      83.333333
CD120a  CM_CD120a  7132          TNFRSF1A     TNF receptor superfamily member 1A       P19438        None      83.333333
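
To build intuition for the ranking, the sketch below scores the same candidate names with the standard library's `difflib.SequenceMatcher`. This is a stand-in chosen for illustration: for these inputs it yields the same ratios as the table, but whether Bionty uses this scorer is an assumption that this guide does not confirm.

```python
from difflib import SequenceMatcher

query = "CD127a"
candidates = ["CD127", "CD167a", "CD107a", "CD172a", "CD120a", "CD14"]


def similarity(a, b):
    """Similarity in percent: 2 * matching characters / total characters."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Rank candidates from most to least similar to the query
ranked = sorted(candidates, key=lambda name: similarity(query, name), reverse=True)

for name in ranked:
    print(f"{name}: {similarity(query, name):.6f}")
```

CD127 ranks first because it shares five characters with the six-character query, giving 2 * 5 / 11, or about 90.9.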

Now we can run inspect again: all cell markers are linked, and only the three non-marker channels remain unmapped.

cellmarker_bionty.inspect(curated_df.index, cellmarker_bionty.name)
✅ 11 terms (78.6%) are mapped.
🔶 3 terms (21.4%) are not mapped.
{'mapped': ['Ki67',
  'CCR7',
  'CD14',
  'CD8',
  'CD45RA',
  'CD4',
  'CD3',
  'CD127',
  'PD-1',
  'CD66b',
  'SIGLEC8'],
 'not_mapped': ['Invalid-1', 'Invalid-2', 'Time']}
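
The returned dictionary can also be summarized programmatically, for example to track curation progress across datasets. A small sketch, using a hand-copied version of the result above and a hypothetical `percent_mapped` helper:

```python
# Hand-copied from the inspect output above
result = {
    "mapped": [
        "Ki67", "CCR7", "CD14", "CD8", "CD45RA", "CD4",
        "CD3", "CD127", "PD-1", "CD66b", "SIGLEC8",
    ],
    "not_mapped": ["Invalid-1", "Invalid-2", "Time"],
}


def percent_mapped(inspect_result):
    """Hypothetical helper: share of mapped terms in percent, one decimal."""
    n_mapped = len(inspect_result["mapped"])
    total = n_mapped + len(inspect_result["not_mapped"])
    if total == 0:
        return 0.0
    return round(100 * n_mapped / total, 1)


print(percent_mapped(result))  # 78.6, matching the log line above
```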