Processing LitCovid with OGER-BB
1 LitCovid
LitCovid is a collection of PubMed abstracts related to COVID-19, released by the National Library of Medicine. The abstracts are categorized by research topic and geographic location.
We have processed the LitCovid corpus with our entity recognition tools, Bio Term Hub and OGER-BB, which are described below. The current version contains 135'169 abstracts. The annotations are made accessible in several ways:
- from our own servers using the brat tool (see screenshot below):
  https://pub.cl.uzh.ch/projects/ontogene/brat/#/LitCovid/ (beware, it might be slow to load)
  How to use: select a PubMed identifier from the initial menu, and the corresponding abstract will be displayed with its annotations. You can then browse through the abstracts using the arrows at the top.
- as downloadable archive files in different formats:
  - in a simple tab-separated format: litcovid19.tsv.tgz
    A brief explanation of the data format can be found here: Documentation on the annotation fields
  - in BioC JSON format: litcovid19.bioc.json.tgz
  - the tab-separated format gives the entities and their positions in the original documents, which can be found here: litcovid19.txt.tgz
- through DBCLS's PubAnnotation interface:
  http://pubannotation.org/projects/LitCovid-OGER-BB
- we have also submitted this dataset to Europe PMC
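As an illustration of how the BioC JSON download might be read, the following Python sketch walks a BioC JSON collection and yields one record per annotated span. It relies only on the standard BioC JSON field names (documents, passages, annotations, locations with offset and length). How the files are arranged inside litcovid19.bioc.json.tgz is not stated on this page, so the assumption that the archive holds one or more .json members, each a BioC collection, is ours, as are all function names.

```python
import json
import tarfile


def iter_bioc_annotations(collection):
    """Yield (document id, start, end, text, infons) for each annotated span.

    `collection` is a parsed BioC JSON collection (a dict). Offsets are taken
    as given in the file; BioC counts them from the start of the document.
    """
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            for annotation in passage.get("annotations", []):
                for location in annotation.get("locations", []):
                    start = location["offset"]
                    end = start + location["length"]
                    yield (
                        document.get("id"),
                        start,
                        end,
                        annotation.get("text"),
                        annotation.get("infons", {}),
                    )


def iter_collections(archive):
    """Yield every parsed .json member of an open tar archive.

    Assumption: each .json member is one BioC collection.
    """
    for member in archive:
        if member.isfile() and member.name.endswith(".json"):
            with archive.extractfile(member) as handle:
                yield json.load(handle)


def read_archive(path):
    """Yield annotation records from a gzipped tar archive of BioC JSON files."""
    with tarfile.open(path, "r:gz") as archive:
        for collection in iter_collections(archive):
            yield from iter_bioc_annotations(collection)
```

Calling read_archive("litcovid19.bioc.json.tgz") on the downloaded file would then stream the records one at a time, without unpacking the archive to disk. The keys inside each infons dictionary (entity type, identifier, and so on) are not assumed here; the documentation on the annotation fields is the place to check them.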
It is also possible to submit any PubMed abstract (via PubMed ID), PubMed Central full-text paper (via PMC ID), or any plain text (via cut and paste) to our OGER annotation tool and have it annotated (see screenshot below).
2 Bio Term Hub (BTH)
The Bio Term Hub (BTH) is an aggregator of biomedical terminologies sourced from manually curated databases. The BTH allows the quick construction of a terminology resource in a simple standardized format for text mining purposes. The terminologies are sourced from well-known life science databases. The user can select the specific concept types (proteins, genes, diseases, cell lines, etc.) to be included in the generated terminology. The terminologies are provided with unique term identifiers from the original databases. The resources provided by our Bio Term Hub are kept up-to-date by checking the original databases for possible updates. Optionally, the user can request the generation of lexical statistics about the selected terminologies.
Try it at: https://bth.nlp.idsia.ch/
Reference:
Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, Fabio Rinaldi. A Combined Resource of Biomedical Terminology and its Statistics. Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain), pg. 39–50. http://ceur-ws.org/Vol-1495/paper_19.pdf
3 OGER
OGER is a fast, efficient, dictionary-based annotation tool, which is tightly coupled with the BTH, and allows rapid annotation of large quantities of text.
OGER can be accessed either through a web interface for testing purposes (single document annotation), or as a RESTful web service (typically for batch annotations).
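For batch use, a client would typically loop over document identifiers and issue one request per document. The Python sketch below shows that pattern with the standard library only. The base URL is a placeholder, and the path layout (base/fetch/source/format/id), the source name "pubmed" and the format name "tsv" are assumptions made for illustration; the actual endpoints should be taken from the service's own documentation.

```python
import urllib.parse
import urllib.request

# Placeholder address, NOT the real service URL.
BASE_URL = "https://example.org/oger"


def build_fetch_url(source, out_format, doc_id, base_url=BASE_URL):
    """Build a URL of the assumed form <base>/fetch/<source>/<format>/<id>."""
    parts = [
        urllib.parse.quote(str(part), safe="")
        for part in ("fetch", source, out_format, doc_id)
    ]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def fetch_annotations(source, out_format, doc_id, base_url=BASE_URL, timeout=30):
    """Request the annotations for one document and return the response text."""
    url = build_fetch_url(source, out_format, doc_id, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def annotate_batch(doc_ids, source="pubmed", out_format="tsv", base_url=BASE_URL):
    """Fetch annotations for many IDs, recording failures instead of aborting."""
    results, errors = {}, {}
    for doc_id in doc_ids:
        try:
            results[doc_id] = fetch_annotations(source, out_format, doc_id, base_url)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            errors[doc_id] = str(exc)
    return results, errors
```

Collecting errors per document keeps one failed request from discarding the rest of a long batch; the failed identifiers can be retried afterwards.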
Reference:
Lenz Furrer, Fabio Rinaldi. OGER: OntoGene’s Entity Recogniser in the BeCalm TIPS Task. Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (Barcelona, Spain, 26–27 April 2017), pg. 175–182. https://www.zora.uzh.ch/id/eprint/137276/
4 OGER-BB
OGER has been coupled with a deep learning model (based on BioBERT), which has been fine-tuned using the CRAFT corpus, in order to increase both precision and recall.
Details of this work can be found in the following papers:
UZH@CRAFT-ST: a Sequence-labeling Approach to Concept Recognition. Lenz Furrer, Joseph Cornelius, Fabio Rinaldi. Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. November 2019. https://www.aclweb.org/anthology/D19-5726
Parallel sequence tagging for concept recognition. Lenz Furrer, Joseph Cornelius, Fabio Rinaldi (2020). https://arxiv.org/abs/2003.07424
5 Who are we?
This page is currently maintained by the NLP Group at the Dalle Molle Institute for Artificial Intelligence (IDSIA).
The work described on this page was initially carried out by the Biomedical Text Mining group at the Institute of Computational Linguistics, University of Zurich. It is now being continued at IDSIA, where the PI of the group (Fabio Rinaldi) and some group members have moved.
For additional information about the tools and research activities described on this page, please contact Fabio Rinaldi.