Computational Linguistics for COVID-19!

Literature Based Discovery


Our goal is to enrich a given corpus of COVID-related biomedical literature with biomedical entities.

Our annotation process combines an efficient dictionary-based lookup (OGER) with a deep learning approach trained on existing corpora, such as the CRAFT corpus from the University of Colorado.

Our terminologies are derived from the major life science databases using our Bio Term Hub, which allows us to maintain up-to-date dictionaries synchronized with the original resources.

Our current annotation pipeline generates annotations for several entity types:

  • cell lines
  • clinical drugs (RxNorm)
  • cells
  • molecular processes
  • sequences
  • organ/tissue
  • chemicals
  • Gene Ontology (GO)
  • organisms
  • proteins
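
As a rough illustration of the dictionary-based lookup idea, the sketch below matches a small term list against a sentence and returns typed annotations with character offsets. It is not OGER's implementation or API: the `Annotation` type, the `annotate` function, the toy `TERMS` entries and their `EX:` identifiers are all hypothetical, and a real pipeline would load its dictionaries from the terminology resources and add the deep learning step.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int        # character offset where the match begins
    end: int          # character offset just past the match
    text: str         # matched surface form
    entity_type: str  # e.g. "organism", "chemical"
    concept_id: str   # identifier of the dictionary entry


# Toy dictionary: hypothetical entries and identifiers, for illustration only.
TERMS = {
    "sars-cov-2": ("organism", "EX:0001"),
    "ace2": ("protein", "EX:0002"),
    "remdesivir": ("chemical", "EX:0003"),
    "viral entry": ("molecular process", "EX:0004"),
}


def build_matcher(terms):
    """Compile one case-insensitive pattern covering every dictionary term."""
    # Longest terms first, so multi-word entries win over shorter overlaps.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in alternatives)
    # The lookarounds keep a term from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)", re.IGNORECASE)


def annotate(text, terms=TERMS):
    """Return an Annotation for every dictionary term found in the text."""
    matcher = build_matcher(terms)
    annotations = []
    for match in matcher.finditer(text):
        entity_type, concept_id = terms[match.group(0).lower()]
        annotations.append(
            Annotation(match.start(), match.end(), match.group(0),
                       entity_type, concept_id)
        )
    return annotations


if __name__ == "__main__":
    sentence = "Remdesivir may block SARS-CoV-2 before viral entry via ACE2."
    for annotation in annotate(sentence):
        print(annotation)
```

Keeping character offsets alongside each match makes it possible to export the results as stand-off annotations, where the original text is left unchanged.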

Click here to access the annotated documents

One interesting observation is that there has been an "infodemic" about COVID-19 not only in media outlets but also in the scientific literature. The graph below shows the number of COVID-19-related papers published on PubMed each day since the beginning of the year. Note that the weekly peaks and troughs arise simply because articles are not published on weekends.
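
A weekly pattern of this kind can be checked by grouping publication dates by weekday. The sketch below is a minimal example under stated assumptions: the `papers_per_weekday` helper is hypothetical, and the sample dates are made up (three papers on every working day over two weeks, none on weekends) rather than taken from PubMed.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def papers_per_weekday(publication_dates):
    """Count how many publication dates fall on each day of the week."""
    counts = Counter(WEEKDAY_NAMES[d.weekday()] for d in publication_dates)
    # Report every weekday, including those with no publications at all.
    return {name: counts.get(name, 0) for name in WEEKDAY_NAMES}


# Made-up sample data: two weeks starting on Monday 2020-04-06.
start = date(2020, 4, 6)
sample_dates = []
for offset in range(14):
    day = start + timedelta(days=offset)
    if day.weekday() < 5:  # Monday..Friday only
        sample_dates.extend([day] * 3)

if __name__ == "__main__":
    for name, count in papers_per_weekday(sample_dates).items():
        print(f"{name:<10}{count}")
```

With real data, the list of dates would come from the publication date field of each record; a near-zero count for Saturday and Sunday would then account for the regular troughs in a daily plot.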



  • [2020-08-26 Wed] Update of our processed version of the LitCovid dataset. It now contains 35,313 abstracts.
  • [2020-06-26 Fri] Update of our processed version of the LitCovid dataset. It contains about 25,000 abstracts.
  • [2020-06-03 Wed] Datasets updated until today.
  • [2020-04-21 Tue] We have updated our annotated LitCovid dataset (now containing 5630 abstracts).
  • [2020-04-16 Thu] We have completed the annotation of the PMC subset of LitCovid. Find it here.
  • [2020-04-08 Wed] Our online annotation platform OGER now includes COVID-specific terminology. You can also use it as a web service. Try it out!
  • [2020-04-08 Wed] Our OGER+BioBERT annotations are now accessible on a local brat installation.

    See a screenshot below: LitCovid-Brat.png

  • [2020-04-06 Mon] We have submitted our (improved) OGER+BioBERT annotations of the LitCovid dataset to Europe PMC.
  • [2020-04-03 Fri] We have annotated the LitCovid dataset with OGER+BioBERT and published our results on PubAnnotation, a tool developed by Jin-Dong Kim's group at DBCLS, Tokyo.


We have been working with two recently released datasets.

  • LitCovid, a set of more than 3000 abstracts released by the National Library of Medicine. The abstracts are categorized by research topic and geographic location.
  • CORD-19, a set of about 40000 documents made available by the Allen Institute for AI as the dataset for their CORD-19 challenge.

Our annotated datasets

  • Annotated corpora.
    • We have processed the LitCovid corpus with our entity recognition tools. Click here for details and downloads, in different formats.
    • Only a few of the abstracts contained in LitCovid also have a full text accessible from PubMed Central. We have processed this subset of full-text papers (which we refer to as LitCovid/PMC). The results are available here.

Who are we?

This page is currently maintained by the NLP Group at the Dalle Molle Institute for Artificial Intelligence (IDSIA).

The work described on this page was initially carried out by the Biomedical Text Mining group at the Institute of Computational Linguistics, University of Zurich. It is now being continued at IDSIA, where the PI of the group (Fabio Rinaldi) and some group members have moved.

For additional information about the tools and research activities described on this page, please contact Fabio Rinaldi.

Author: Fabio Rinaldi

Created: 2022-01-13 Thu 00:47