OGER web APIs

Table of Contents

Introduction

This document briefly describes what are OGER APIs, and how to use them, in a few simple steps.

OGER is a fast, accurate entity annotation tool, which is accessible either as a software package, or as a web service. Given text as input, it delivers annotations as output, as illustrated in the picture below.

LitCovid-OGER.png

The dictionaries used for annotation are obtained from major life science databases (cellosaurus, cell ontology, ChEBI, CTD, EntrezGene, Gene Ontology, MeSH, Molecular Process Ontology, NCBI Taxonomy, Protein Ontology, RxNorm, Sequence Ontology, Swiss-Prot, Uberon).

OGER dictionaries are sourced and kept synchronized with the original databases through usage of our own Bio Term Hub.

OGER APIs are Web service APIs (type REST) that allows easy access to online OGER annotation capabilities.

API short description

You can create requests to the OGER API by adding specifiers to the base URL:

https://pub.cl.uzh.ch/projects/ontogene/oger/

With the APIs it is possible to:

  • Check the service status
  • Create new dictionaries and check their status
  • Annotate
    • Documents that are in the local machine (upload)
    • Documents fetched from PubMed (fetch)
  • Main available endpoints (specifiers):
    • /status (check status of the service)
    • /fetch (fetch documents and process them, used for PubMed and PMC)
    • /upload (used to submit your own text for processing)
    • /dict (manipulate dictionaries)
      • dict/{}/status

A more detailed description of the API can be found here.

Below you can find examples of how to perform an online annotation.

Step 1: verify the service, and get ID of the default dictionary

Verify that the service is up and running.

curl --location --request GET 'https://pub.cl.uzh.ch/projects/ontogene/oger/status'

This should respond something similar to:

{
    "status": "running",
    "active annotation dictionaries": 2,
    "default dictionary": "509f822aaf527390"
}

From the response it can be see that the default dictionary is the one identified with the hextoken 509f822aaf527390. This is the dictionary that would be used for the annotation if no other is explicitly passed within the annotation request.

Step 2: verify the status of the selected dictionary

Verify the availability and status of the dictionary that you want to use. The complete URL of the request will be composed by the base URL, the endpoint dict, the dictionary ID, and the specific request (status in this case), thus:

https://pub.cl.uzh.ch/projects/ontogene/oger/dict/509f822aaf527390/status

If you run the following request you will get the status and description of the dictionary:

curl --location --request GET 'https://pub.cl.uzh.ch/projects/ontogene/oger/dict/509f822aaf527390/status'

This would respond with something similar to:

{
    "status": "ready",
    "description": "default+COVID terminology"
}

So, we are ready to go.

It is also possible to request a new dictionary based on specific settings, find out more in https://github.com/OntoGene/OGER/wiki/REST-API.

Step 3: annotate a PubMed abstract

We can request the annotation of a PubMed abstract by using the fetch endpoint. We need to specify which source to consider. Currently only Pubmed (pubmed) and PubMed Central (pmc) are enabled. Next we ned to provide the output format, the easiest to use is tsv. And finally we need to provide the ID of the resource that we intend to process. For example let's assume that we want to process the PubMed abstract 21436587.

The final URL will be:

https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/21436587

And the most basic format to request the annotation is the following:

curl https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/21436587

The output will be delivered to your shell (standard output), so you might want to redirect to a file, e.g.

curl https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/21436587 > 21436587.tsv

The format of the output is explained below.

Notice that since we are using the default dictionary, we do not need to specify it explicitly. If we wanted to use a different dictionary, it would have to be explicitly specified in the request, e.g.

curl https://pub.cl.uzh.ch/projects/ontogene/oger/fetch/pubmed/tsv/21436587?dict=509f822aaf527390

Step 4: annotate an arbitrary snippet of text

We can request the annotation of local data by doing a POST to the /upload endpoint and passing the route parameters that specify the input and output format.

The URL of the request will be composed of the base URL (https://pub.cl.uzh.ch/projects/ontogene/oger/), the target endpoint (upload), the input format specification (txt) and the output format specification (tsv), so the final URL will be:

https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/tsv

Note: the different I/O formats supported by OGER are documented here.

Below there is an example of the request. In this example the uploaded data is raw text (txt), the requested output format is a tabular table (tsv) and the text to be annotated is passed in the POST payload.

curl --location \
--request POST 'https://pub.cl.uzh.ch/projects/ontogene/oger/upload/txt/tsv' \
--header 'Content-Type: text/plain' \
--data-raw 'The initial cases of novel coronavirus (2019-nCoV)-infected 
pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 
and January 2020.
We analyzed data on the first 425 confirmed cases in Wuhan to
determine the epidemiologic characteristics of NCIP.We collected
information on demographic characteristics, exposure history, and
illness timelines of laboratory-confirmed cases of NCIP that had been
reported by January 22, 2020.'

Find below a response example:

oger-tsv-output.png Note: click on the picture to see a larger version in a separate tab.

The fields that you can see in this TSV output format are described in detail here.

If we do the same request but change the output format from tsv to bioc_json, we get the following response:

oger-json-output.png Note: click on the picture to see a larger version in a separate tab.

The different I/O formats supported by OGER are documented below.

I/O formats

OGER can deal with different input and output formats, as shown in the tables below. The 'Format' specifier (first column of each table) can be used as a specifier in a OGER request to (a) submit input data in different formats, and (b) request the output in the specified output format.

Input formats

Format Content-type Description
txt text/plain unstructured plain-text document
bioc text/xml document or collection in BioC XML
bioc_json application/json document or collection in BioC JSON
pxml text/xml abstract in PubMed's citation XML
nxml text/xml article in PubMed Central's full-text XML
pxml.gz application/gzip compressed collection of abstracts in Medline's citation XML

Output formats

Format Content-type Description
tsv text/tab-separated-values entities in a tab-separated table
xml text/xml entities in a simple, self-explanatory XML format
text_tsv text/tab-separated-values text and entities in a tab-separated table
bioc text/xml text and entities in BioC XML
bioc_json application/json text and entities in BioC JSON
pubanno_json application/json text and entities in PubAnnotator JSON
pubtator text/plain text and entities in PubTator format (mixture of pipe- and tab-separated text)
pubtator_fbk text/plain a variant of the above, with slightly different entity attributes
odin text/xml text and entities in ODIN XML
odin_custom text/xml text and entities in ODIN XML, with customisable CSS

More information about the annotation parameters is available at https://github.com/OntoGene/OGER/wiki/REST-API

Visualization of the annotations

Notice that some of the output formats are oriented towards data processing (tsv, bioc, bioc_json), while some of them are oriented towards visualization (pubanno_json, pubtator, odin_custom).

In particular you can test the following ways to visualize the output of OGER:

  1. use the pubtator format to produce output which can then be visualized with the well-know PubTator tool
  2. use the pubanno_json format, to be visualized with the PubAnnotation tool. You can also send some text to be annotated and visualized directly in PubAnnotation. Try it here.
  3. use the odin_custom output format to visualize ODIN output directly in your browser. This is a simple xml format which can be either visualized as xml (see example), or embellished with css decorations (see here).

If you want to visualize your files as suggested above, you have to edit them and add the following line:

<?xml-stylesheet type="text/css" href="odin_custom.css"?>

You can download here the odin_custom.css file. Notice that the xml file and the css file must be in the same directory. Then open the xml file in any browser.

Documentation on the annotation fields

If you use the tsv format as your output format, the resulting output will contain one line for every discovered term. Each line contains the following tab-separated fields:

  1. paper identifier (pubmed id)
  2. type of entity
  3. start offset
  4. end offset
  5. term as found in the paper
  6. preferred term
  7. term id
  8. section of the paper where term has been found
  9. sentence ID
  10. name of source vocabulary / database
  11. common term identifier

Notice that

  • (7) is the unique identifier of the term as from the corresponding reference database, which is indicated by (10).
  • (6) is a term semantically equivalent to the one found in the paper (5), but which is indicated in the reference database as the "best" or normative term for the corresponding concept
  • (11) is a common identifier, in cases where we have one. In most applications we try to provide an UMLS CUI at this position.

Also notice that the same span of text might generate multiple annotations. This is the case when a term is ambiguos, so that several potential annotations can be applied to it.

Who are we?

This page is currently maintained by the NLP Group at the Dalle Molle Institute for Artificial Intelligence (IDSIA).

The work described in this page was initially carried out by the Biomedical Text Mining group at the Institute of Computational Linguistics, University of Zurich. It is now being continued at IDSIA where the PI of the group (Fabio Rinaldi) and some group members have moved.

For additional information about the tools and research activities described in this page, please contact Fabio Rinaldi.

Go back to main page

Author: Fabio Rinaldi

Created: 2021-01-19 Tue 16:27

Validate