ScienceBeam - Open-source PDF extraction for research PDFs

ScienceBeam is an open-source project for converting research PDFs into structured, machine-readable data. I developed it while working at eLife as part of our efforts to improve open-science publishing workflows and make research outputs more accessible for text and data mining.

TLDR

The main repos are listed in the Repositories section below.

ScienceBeam was transferred to The Coko Foundation.

Working with PDFs

Before working on ScienceBeam, I hadn’t realised how complex working with PDFs can be. They are designed for presentation - they capture how text looks, not what it means. Characters may be drawn individually, sometimes rotated, sometimes on top of one another, sometimes arranged to form a figure. Custom fonts can have obscure effects. And characters are just one of the possible elements within a PDF. Let’s just say it is complicated.
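
To see what that looks like in practice, here is a minimal sketch (using pdfminer.six purely for illustration) that dumps the individual character objects of a PDF, each with its own font, size and position:

```python
# Minimal sketch using pdfminer.six to inspect the raw character-level
# content of a PDF: every character comes with its own font, size and
# bounding box rather than as part of a meaningful text structure.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer


def dump_characters(pdf_path: str, max_chars: int = 50) -> None:
    """Print individual characters with their font, size and bounding box."""
    printed = 0
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    x0, y0, x1, y1 = char.bbox
                    print(
                        f"{char.get_text()!r} font={char.fontname}"
                        f" size={char.size:.1f}"
                        f" bbox=({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f})"
                    )
                    printed += 1
                    if printed >= max_chars:
                        return


dump_characters("example.pdf")  # hypothetical input file
```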

Computer Vision Prototype

The first ScienceBeam prototype used Computer Vision to detect page regions such as titles, abstracts, and figures. That worked surprisingly well for traditional multi-column journal layouts. You can read more on the Computer Vision approach or in eLife’s Labs post on it. There are also slides from a TensorFlow Meetup, and we presented the work at Strata AI.
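
Just to illustrate the shape of that approach (this is not the actual prototype code): render each page to an image and let a vision model propose labelled regions. In the sketch below, pdf2image does the rendering and segment_page_regions is a hypothetical stand-in for the trained model.

```python
# Illustrative sketch of the Computer Vision idea, not the actual prototype:
# render each PDF page to an image, then let a vision model propose regions.
from pdf2image import convert_from_path  # requires poppler to be installed


def segment_page_regions(image):
    """Hypothetical stand-in for a trained detection/segmentation model.

    The real model would return labelled boxes such as
    [("title", (x0, y0, x1, y1)), ("figure", (x0, y0, x1, y1)), ...].
    """
    return []


def detect_regions(pdf_path: str) -> None:
    pages = convert_from_path(pdf_path, dpi=150)  # one PIL image per page
    for page_number, image in enumerate(pages, start=1):
        for label, bbox in segment_page_regions(image):
            print(page_number, label, bbox)


detect_regions("example.pdf")  # hypothetical input file
```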

Building on GROBID

Later, our focus shifted to preprints, which usually have simpler visual layouts but are far more inconsistent in structure. At that point, visual layout analysis became less important than semantic extraction from text.

To support that, I integrated and experimented with GROBID, an open-source library for structured PDF extraction using machine learning. It features a hierarchical model architecture, which adds complexity but makes it possible to improve small, specialised models independently.
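
For context, GROBID is normally used via its REST service. The sketch below sends a PDF to a locally running instance using GROBID’s documented processFulltextDocument endpoint; the host, port and file name are placeholders.

```python
# Minimal sketch: send a PDF to a locally running GROBID service and get TEI XML back.
# The /api/processFulltextDocument endpoint is part of GROBID's documented REST API;
# host, port and file name are placeholders.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("example.pdf", "rb") as pdf_file:  # hypothetical input file
    response = requests.post(
        GROBID_URL,
        files={"input": pdf_file},
        timeout=120,
    )

response.raise_for_status()
tei_xml = response.text  # TEI XML describing header, body, references, ...
print(tei_xml[:500])
```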

I focused on preparing and sharing datasets (including bioRxiv 10k), re-training models, and contributing small enhancements to GROBID itself, such as making it more cloud-friendly, among other things.

In early iterations I had over-engineered an end-to-end GROBID Airflow training pipeline. Let’s not talk about that. But at the very least I learned quite a bit.

Most improvements came from:

  • Bulk training data generation from existing PDF and XML pairs: no manual annotation, just rule-based fuzzy matching (sketched below)
  • Extending the deep learning models in delft with ScienceBeam Trainer DeLFT
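
The fuzzy matching idea in a nutshell: the publisher XML already contains the ground-truth values for fields such as the title or abstract, so instead of annotating PDFs manually we can look for the best approximate match of each value in the text extracted from the PDF and use that span as an automatic annotation. A toy sketch using Python’s difflib (the real pipeline is considerably more involved):

```python
# Toy sketch of rule-based fuzzy matching for training data generation:
# given a known field value from the XML (e.g. the title), find the line of
# the extracted PDF text that matches it best and treat it as an annotation.
from difflib import SequenceMatcher


def best_matching_line(field_value: str, pdf_text: str) -> tuple[str, float]:
    """Return the PDF text line most similar to the field value, with a score in [0, 1]."""
    best_line, best_score = "", 0.0
    for line in pdf_text.splitlines():
        score = SequenceMatcher(None, field_value.lower(), line.strip().lower()).ratio()
        if score > best_score:
            best_line, best_score = line.strip(), score
    return best_line, best_score


# Made-up example data: the XML field value and a slightly garbled PDF extraction.
xml_title = "Deep learning for structured PDF extraction"
pdf_text = "Deep  learning for structurd PDF extraction\nJane Doe\nAbstract\n..."

line, score = best_matching_line(xml_title, pdf_text)
if score > 0.8:  # only keep confident matches as automatic annotations
    print(f"auto-annotated title ({score:.2f}): {line}")
```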

With that, we already got significantly better results on bioRxiv papers. See also eLife’s Labs post covering our use of GROBID and the results. Additionally, the slides from the Open-Source Data Science Projects at eLife talk at the PKP Scholarly Publishing Conference 2019 cover the training process in slightly more detail.

However, extending or debugging GROBID proved difficult in practice: it is written in Java and communicates with the Python deep learning code through a bridge, which made experimentation slow and complex.

The Python port: ScienceBeam Parser

To make the system easier to extend, I began work on a Python port of GROBID, called ScienceBeam Parser. That was only possible because of how much I had learned from GROBID. The design kept compatibility with GROBID’s trained models but simplified integration and iteration. It is written in Python but still uses GROBID’s lower-level tools such as pdfalto, wapiti and delft. The design also made it straightforward to add features such as optional Computer Vision based figure detection.
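
As a rough idea of how it is used (the package name, module path, port and endpoint below are from memory and should be treated as assumptions): install the PyPI package, start the service, and talk to it much like you would talk to GROBID.

```python
# Rough usage sketch for ScienceBeam Parser; package name, module path, port
# and endpoint are from memory and may differ from the current release:
#
#   pip install sciencebeam-parser
#   python -m sciencebeam_parser.service.server --port=8080
#
# Because the service aims to stay mostly GROBID-compatible, a GROBID-style
# client request should work against it:
import requests

PARSER_URL = "http://localhost:8080/api/processFulltextDocument"  # assumed endpoint

with open("example.pdf", "rb") as pdf_file:  # hypothetical input file
    response = requests.post(PARSER_URL, files={"input": pdf_file}, timeout=120)

response.raise_for_status()
print(response.text[:500])  # TEI XML; the project also supports JATS output
```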

Some of the key advantages:

  • No more Java
  • No more Java (worth mentioning twice)
  • Can be used as a PyPI package
  • Clear separation of concerns making maintenance and re-use easier
  • Resources such as models are downloaded and loaded on demand by default: this keeps the repo and Docker image small while remaining adaptable
  • Mostly compatible with GROBID (but not feature complete)
  • TEI and JATS XML support
  • Word* support (Docx, …)

Later use and current status

ScienceBeam was used internally within eLife’s publishing pipelines for a period, enhancing the submission workflow and saving authors’ time.

When eLife announced its new technology direction away from Libero, active development by eLife paused, and stewardship of ScienceBeam was transferred to The Coko Foundation.

Repositories

Main Repos

Storage Repos

These repos were used as free public storage in the form of GitHub release artefacts. ScienceBeam Parser would point to some of those model versions. This definitely felt like a crutch from the start.

Over-Engineered GROBID End-To-End Training

Early Computer Vision Prototype

Other