ScienceBeam - Open-source PDF extraction for research PDFs

ScienceBeam is an open-source project for converting research PDFs into structured, machine-readable data. I developed it while working at eLife as part of our efforts to improve open-science publishing workflows and make research outputs more accessible for text and data mining.

TLDR

The main repos are:

ScienceBeam Parser: The Python port of GROBID (no more Java)
ScienceBeam Trainer DeLFT: Extension to delft that makes it cloud-friendly, handles long contexts, adds model features, …

ScienceBeam was transferred to The Coko Foundation.

Working with PDF

Before working on ScienceBeam, I hadn’t realised how complex working with PDFs alone can be. They are designed for presentation: they capture how text looks, not what it means. Characters might be drawn one at a time, sometimes rotated, sometimes on top of one another, sometimes to form a figure. Custom fonts can have obscure effects. And characters are just one of the many kinds of element a PDF can contain. Let’s just say it is complicated.
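
To give a feel for the problem, here is a minimal sketch of what even the simplest step looks like: turning individually positioned characters back into lines of text. This is not how ScienceBeam or GROBID do it. The Glyph record, the group_into_lines function and the tolerance value are all made up for illustration, and it ignores rotation, fonts, columns and everything else that makes real PDFs hard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    # Hypothetical minimal glyph record; real PDF tooling exposes far more
    # (font, size, rotation, transformation matrix, ...).
    char: str
    x: float
    y: float


def group_into_lines(glyphs: List[Glyph], y_tolerance: float = 2.0) -> List[str]:
    """Group glyphs with a similar y position into lines, then order each line by x."""
    lines: List[List[Glyph]] = []
    # Visit glyphs from the smallest y to the largest
    # (assuming, for this sketch, that y grows down the page).
    for glyph in sorted(glyphs, key=lambda g: g.y):
        # Compare against the first glyph of the current line
        if lines and abs(lines[-1][0].y - glyph.y) <= y_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    return [
        ''.join(g.char for g in sorted(line, key=lambda g: g.x))
        for line in lines
    ]


# Glyphs arrive in drawing order, which need not be reading order,
# and glyphs on the same visual line rarely share the exact same y.
glyphs = [
    Glyph('i', 20.0, 100.4), Glyph('H', 10.0, 100.0),
    Glyph('k', 20.0, 120.0), Glyph('O', 10.0, 120.3),
]
lines = group_into_lines(glyphs)
print(lines)
```

Even this toy version needs a tolerance, because nothing in the file says which characters belong together.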

Computer Vision Prototype

The first ScienceBeam prototype used Computer Vision to detect page regions such as titles, abstracts, and figures. That worked surprisingly well for traditional multi-column journal layouts. You can read more on the Computer Vision approach or eLife’s Labs post on it. There are also slides from a TensorFlow Meetup. We have also presented at Strata AI.

Building on GROBID

Later, our focus shifted to preprints, which usually have simpler visual layouts but are far more inconsistent in structure. At that point, visual layout analysis became less important than semantic extraction from text.

To support that, I integrated and experimented with GROBID, an open-source library for structured PDF extraction using machine learning. It features a hierarchical model architecture, which adds complexity but allows small specialised models to be improved individually.

I focused on preparing and sharing datasets (including bioRxiv 10k), re-training models, and contributing many small enhancements to GROBID itself, such as making it cloud-friendly.

In early iterations I had over-engineered an end-to-end GROBID Airflow training pipeline. Let’s not talk about that. But at the very least I learned quite a bit.

Most improvements came from:

Bulk training data generation using existing PDF and XML pairs: no manual annotation, but rule-based fuzzy matching instead
Extending deep learning models in delft with ScienceBeam Trainer DeLFT
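
The fuzzy matching idea from the first point can be sketched roughly as follows. This is not the ScienceBeam Alignment implementation: it swaps in difflib from the Python standard library, and the function names, the threshold and the example values are all made up for illustration. The idea is that the publisher XML already says what the title or author is, so each line of text extracted from the PDF can be labelled by finding the XML field it most resembles.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def normalise(text: str) -> str:
    # Lower-case and collapse whitespace so layout differences matter less
    return ' '.join(text.lower().split())


def best_label(
    line: str,
    fields: Dict[str, str],
    threshold: float = 0.8  # assumed cut-off, would need tuning in practice
) -> Optional[str]:
    """Return the XML field whose value best matches the line, or None."""
    best: Tuple[float, Optional[str]] = (0.0, None)
    for label, value in fields.items():
        score = SequenceMatcher(None, normalise(line), normalise(value)).ratio()
        if score > best[0]:
            best = (score, label)
    return best[1] if best[0] >= threshold else None


# Made-up values standing in for fields read from the original XML
xml_fields = {
    'title': 'A Study of Something Important',
    'author': 'Jane Doe',
}

# Made-up lines standing in for text extracted from the matching PDF
pdf_lines: List[str] = [
    'A Study of  Something Important',  # extra whitespace from the layout
    'Jane Doe1',                        # trailing affiliation marker
    'Received 1 January 2020',          # not present in the XML fields
]

labels = [best_label(line, xml_fields) for line in pdf_lines]
print(labels)
```

Comparing whole lines against whole field values only works for short fields. A long abstract spread over many lines would need matching against parts of the field value instead.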

With that we already got significantly better results on bioRxiv papers. See also eLife’s Labs post covering our use of GROBID and results. Additionally, the slides from the Open-Source Data Science Projects at eLife talk at the PKP Scholarly Publishing Conference 2019 cover the training process in slightly more detail.

However, extending or debugging GROBID proved difficult in practice: it’s written in Java and communicates with Python through a bridge, which made experimentation slow and complex.

The Python port: ScienceBeam Parser

To make the system easier to extend, I began work on a Python port of GROBID, called ScienceBeam Parser. That was only possible because I had learned so much from GROBID. The port kept compatibility with GROBID’s trained models while simplifying integration and iteration. It’s written in Python but still uses GROBID’s lower-level tools such as pdfalto, wapiti and delft. The design made it straightforward to add features such as optional Computer Vision based figure detection.

Some of the key advantages:

No more Java
No more Java (worth mentioning twice)
Can be used as a PyPI package
Clear separation of concerns making maintenance and re-use easier
Resources such as models downloaded and loaded on-demand by default: small repo and Docker image while making it adaptable
Mostly compatible with GROBID (but not feature complete)
TEI and JATS XML support
Word* support (Docx, …)
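
One practical upside of TEI output is that it can be read with nothing but the Python standard library. The sketch below shows that on the consuming side. The TEI fragment is hand-written for illustration and is not real ScienceBeam Parser output, and get_text is a hypothetical helper. Only the TEI namespace and the teiHeader element layout follow the TEI standard.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The standard TEI namespace
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hand-written TEI fragment for illustration only, not real parser output
SAMPLE_TEI = '''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">A Study of Something Important</title>
      </titleStmt>
    </fileDesc>
    <profileDesc>
      <abstract><p>We study something important.</p></abstract>
    </profileDesc>
  </teiHeader>
</TEI>'''


def get_text(root: ET.Element, path: str) -> Optional[str]:
    """Return the concatenated text below the first match, or None."""
    node = root.find(path, TEI_NS)
    if node is None:
        return None
    return ''.join(node.itertext()).strip()


root = ET.fromstring(SAMPLE_TEI)
title = get_text(root, 'tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title')
abstract = get_text(root, 'tei:teiHeader/tei:profileDesc/tei:abstract')
print(title)
print(abstract)
```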

Later use and current status

ScienceBeam was used internally within eLife’s publishing pipelines for a period, enhancing the submission workflow and saving authors’ time.

When eLife announced its new technology direction away from Libero, active development by eLife paused, and stewardship of ScienceBeam was transferred to The Coko Foundation.

Repositories

Main Repos

ScienceBeam Parser: The Python port of GROBID (no more Java)
ScienceBeam Trainer DeLFT: Extension to delft that makes it cloud-friendly, handles long contexts, adds model features, …
ScienceBeam Judge: Tool agnostic evaluation (it was used to evaluate ScienceBeam, GROBID, CERMINE, ScienceParse, and others)
ScienceBeam Utils: Some shared tools, mostly around file lists
ScienceBeam Alignment: Used for training data generation (fuzzy matching of original XML)
ScienceBeam Trainer Tools for GROBID: Tools to auto-annotate GROBID model TEI training data using original XML files and fuzzy matching text (using ScienceBeam Alignment)
ScienceBeam Charts: Helm charts to deploy ScienceBeam (and related) to a Kubernetes cluster
ScienceBeam Usage Examples: Sample notebooks illustrating the use of ScienceBeam Parser as a PyPI package
ScienceBeam Pipelines: Bulk conversion using Apache Beam

Storage Repos

These repos were used as free public storage in the form of GitHub release artefacts. ScienceBeam Parser would point to some of those model versions. This definitely felt like a crutch from the start.

GROBID Fork: With all of the GROBID PRs that I raised merged.
ScienceBeam GROBID bioRxiv: GROBID Fork with bioRxiv models pre-loaded

Over-Engineered GROBID End-To-End Training

ScienceBeam Trainer for GROBID: GROBID with light-weight scripts to make training cloud friendly
ScienceBeam Airflow: Airflow pipelines that would result in Docker images containing re-trained models

Early Computer Vision Prototype

ScienceBeam Gym: Mostly Computer Vision related training

Other

ScienceBeam Editor: A prototype that connects ScienceBeam to Libero Editor
ScienceBeam Texture: A prototype that connects ScienceBeam to Texture
ScienceBeam Orchester: Early cross-tool evaluation
ScienceBeam Lab: Early experimental tools, including one that visualises Layout XML (produced by pdf2xml or pdftoxml) as SVG.