Merge branch 'master' into feature/data-availability-statement
kermitt2 committed Sep 25, 2022
2 parents d22a1ff + 54d1c29 commit 40331eb
Showing 7 changed files with 1,353 additions and 16 deletions.
6 changes: 2 additions & 4 deletions doc/Consolidation.md
@@ -44,13 +44,11 @@ This service presents several advantages as compared to the CrossRef service. bi

Unfortunately, you need to install the service yourself, including loading and indexing the bibliographical resources, as documented [here](https://github.com/kermitt2/biblio-glutton#building-the-bibliographical-data-look-up-and-matching-databases). Note that a [docker container](https://github.com/kermitt2/biblio-glutton#running-with-docker) is available.

-After installing biblio-glutton, you need to select the glutton matching service in the `grobid-home/config/grobid.yaml` file, with its host and port, for instance:
+After installing biblio-glutton, you need to select the glutton matching service in the `grobid-home/config/grobid.yaml` file, with its url, for instance:

```yaml
consolidation:
  service: "glutton"
  glutton:
-   type: "http"
-   host: "localhost"
-   port: 8080
+   url: "http://localhost:8080"
```
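
The single `url` setting carries the same information as the former `host` and `port` pair. As a minimal sketch of that equivalence (the `GluttonEndpoint` class is hypothetical and not part of GROBID; only `java.net.URI` from the standard library is used), a URL can be split back into its parts like this:

```java
import java.net.URI;

// Hypothetical helper, not part of GROBID: splits a service URL into the
// host and port that the previous configuration listed separately.
final class GluttonEndpoint {
    final String scheme;
    final String host;
    final int port;

    private GluttonEndpoint(String scheme, String host, int port) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
    }

    static GluttonEndpoint parse(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL, got: " + url);
        }
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port in the URL: fall back to the scheme default.
            port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        }
        return new GluttonEndpoint(uri.getScheme(), uri.getHost(), port);
    }
}
```

With the value from the example above, `GluttonEndpoint.parse("http://localhost:8080")` yields host `localhost` and port `8080`.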
22 changes: 12 additions & 10 deletions doc/Principles.md
@@ -14,7 +14,7 @@ To process publisher XML, complementary to GROBID, we built [Pub2TEI](https://gi

The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](https://grobid.readthedocs.io/en/latest/). Recent benchmarks are available [here](https://grobid.readthedocs.io/en/latest/Benchmarking/).

-##Document parsing as a cascade of sequence labeling models
+## Document parsing as a cascade of sequence labeling models

GROBID uses a cascade of sequence labeling models to parse a document. This modular approach makes it possible to adapt the training data, the features, the text representations, and the models to the different hierarchical structures of the document. Each individual model maintains a small number of labels (which is easier to manage and train), but, in combination, the full cascade provides very detailed end-result structures. The final models produce 55 different "leaf" labels, while other document layout analysis systems support significantly fewer label categories (up to 22 for the GROTOAP2 dataset and CERMINE, _Tkaczyk et al., 2014_, the highest to our knowledge after GROBID).
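
The cascade idea can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the `CascadeSketch` class, its label names, and the toy string rules that stand in for trained sequence labeling models are not GROBID code. It only shows the shape of the approach, where a coarse model assigns a few zone labels and a dedicated model then refines one zone.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: toy rules stand in for trained sequence labeling models.
final class CascadeSketch {

    // A piece of text together with the label some model assigned to it.
    static final class Labeled {
        final String label;
        final String text;

        Labeled(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    // Stage 1: coarse segmentation into a few zones (small label set).
    static List<Labeled> segment(List<String> lines) {
        List<Labeled> zones = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            String label;
            if (i == 0) {
                label = "header";
            } else if (line.startsWith("[")) {
                label = "references";
            } else {
                label = "body";
            }
            zones.add(new Labeled(label, line));
        }
        return zones;
    }

    // Stage 2: a specialised model that only ever sees header zones.
    static List<Labeled> parseHeader(String header) {
        List<Labeled> fields = new ArrayList<>();
        int split = header.indexOf(" by ");
        if (split < 0) {
            fields.add(new Labeled("title", header));
        } else {
            fields.add(new Labeled("title", header.substring(0, split)));
            fields.add(new Labeled("author", header.substring(split + 4)));
        }
        return fields;
    }

    // The cascade: each zone is refined by the model dedicated to it.
    static List<Labeled> run(List<String> lines) {
        List<Labeled> result = new ArrayList<>();
        for (Labeled zone : segment(lines)) {
            if (zone.label.equals("header")) {
                result.addAll(parseHeader(zone.text));
            } else {
                result.add(zone);
            }
        }
        return result;
    }
}
```

Each stage keeps a small label set of its own (three zone labels, two header labels), yet the combined output is finer-grained than either stage alone.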

@@ -38,7 +38,7 @@ The structuring of the same entity type, however, can depend on the position of

Cascading models thus offers the flexibility to tune each model, and its simpler associated training data, to the nature of the structure to be recognized. In addition, it keeps each model small while producing, in combination, very fine-grained final structures. Finally, although errors from one model can propagate to the next, we train each model with a certain amount of realistic errors and noise as input (which more or less always occurs with PDF anyway), which makes it possible to recover from upstream model errors.

-##Layout tokens, not text
+## Layout tokens, not text

The different GROBID models do not work on text, but on **Layout Tokens**, to exploit the various visual/layout information available for every token. Layout information provides both more decision criteria for recognizing structures and more robustness to layout variations.

@@ -65,7 +65,7 @@ GROBID models maintain a synchronization between the labeling process and the la
<b>Fig. 4</b> - Visualization of a cited equation in context
</p>

-##Training data: _Qualität statt Quantität_
+## Training data: _Qualität statt Quantität_

GROBID does not use a vast amount of training data derived from existing publisher XML documents, like CERMINE _(Tkaczyk et al., 2015)_ or ScienceParse 1 & 2, but small, high-quality sets of manually labeled training data. The data to be labeled are generated directly from PDF (not from publisher XML) and continuously extended with error cases. Although we also experimented with large-set approaches and auto-generated training data at scale, we currently stay with the quality-over-quantity approach, for the following reasons:

@@ -81,31 +81,33 @@ GROBID does not use a vast amount of training data derived from existing publish

In practice, the GROBID training data is smaller than that of CERMINE _(Tkaczyk et al., 2015)_ by a factor of 30 to 100, and smaller than that of ScienceParse 2 by a factor of 2,500 to 10,000. Still, GROBID provides comparable or better accuracy scores. To help ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles/) to remove, as much as possible, disagreements and inconsistencies in annotation decisions. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter Annotator Agreement (as we should), because the average size of the annotation team is under 2 :)

-#### Evaluation
+## Evaluation

-As the training data is crafted for accuracy and coverage, it is strongly biased by undersampling non-edge cases. Our labeled data cannot be used for evaluation. Evaluations of GROBID models are thus done with separate and stable holdout sets from publishers, which follow more realistic distributions of document variations.
+As the training data is crafted for accuracy and coverage, it is strongly biased by undersampling non-edge cases. Our labeled data cannot be used for evaluation. Evaluations of GROBID models are thus done with separated and stable holdout sets from publishers, which follow more realistic distributions of document variations.

See the current evaluations with the [PubMed Central holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-pmc/) (1,943 documents, 90,125 bibliographical references in 139,835 citation contexts) and the [bioRxiv holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-biorxiv/) (2,000 documents, 98,753 bibliographical references in 142,796 citation contexts).

Our evaluation approach, however, raises two main issues:

-- our publisher evaluation sets present currently the same lack of diversity drawback as discussed above with publisher XML-based training data, because the evaluation sets are all coming from life science or preprints. At least, as compared to most of the similar works, we do not train and evaluate at the same time with the same domains and sources of publications, because we maintain a strong diversity in the training data.
+- our publisher evaluation sets present currently the same lack of diversity drawback as discussed above with publisher XML-based training data, because the evaluation sets are all coming from life science or preprints. However, as compared to most of the similar works, we do not train and evaluate at the same time with the same domains and sources of publications, because we maintain a strong diversity in the training data.

-- although much better adapted to tackle the gap between the n-fold validation and real performance, the usage of stable holdout sets (usually favored by ML practitioners) can lead too lower reliability over time due to successive re-uses of the holdout data for guiding design improvements (as we validate addition of training data and features based on holdout set performance).
+- although much better adapted to tackle the gap between the n-fold validation and real performance, the usage of stable holdout sets (usually favored by ML practitioners) can lead to lower reliability over time due to successive re-uses of the holdout data for guiding design improvements (as we validate addition of training data and features based on holdout set performance).

To address these two issues, we plan to regularly add new holdout sets from various sources over time, trying to exploit new XML publications available under an appropriate license.

-#### Transformer approaches incorporating layout information
+## Transformer approaches incorporating layout information

For the moment, we are also not relying on transformer approaches incorporating layout information, like LayoutLM _(Xu et al., 2020)_, LayoutLMv2 _(Xu et al., 2021)_, SelfDoc or VILA _(Shen et al., 2021)_, which require considerable GPU capacity and long inference runtimes, and do not show at this time convincing accuracy scores as compared to the current cheap GROBID approach (reported token-level accuracies are often lower than GROBID's field-level accuracy, while using fewer labels).

However, these approaches are very promising. In GROBID, it is possible to run BERT and SciBERT baseline fine-tuned models, ignoring available layout features. We think the system is thus more or less ready to experiment with fine-tuning such extended transformer models - or rather few-shot learning given the size of our annotated example set - when/if they can surpass some of the current models (and when we have saved enough money to buy a V100 GPU).

-##Balancing accuracy and scalability
+## Balancing accuracy and scalability

We develop a tool to process the full scholarly literature corpus (several tens of millions of PDF documents), but also to allow interactive usage, e.g. processing the header of a PDF article in under a second. This is why the default configuration of GROBID is still set to CRF: it maintains the ability to process PDF quickly on commodity hardware, with low memory usage, to ensure good parallelization and scalability.

However, if the priority is accuracy, custom settings make it possible to maximize accuracy with deep learning models. Using some deep learning models will improve results by a few additional F1-score points (nothing extraordinary to be honest), but at the price of a slower runtime (2 to 5 times slower), the cost of a GPU, and more limited parallelization.

-##References
+## References

_(Tkaczyk et al., 2014)_ Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. 2014. Grotoap2 - the methodology of creating a large ground truth dataset of scientific articles. D-Lib Magazine, 20(11/12)

@@ -179,8 +179,8 @@ public static List<LayoutToken> dehyphenize(List<LayoutToken> tokens) {

        for (int i = 0; i < tokens.size(); i++) {
            LayoutToken currentToken = tokens.get(i);
-           //the current token is dash checking what's around
-           if (currentToken.getText().equals("-")) {
+           //the current token is dash (and is neither subscript nor superscript) checking what's around
+           if (currentToken.getText().equals("-") && !(currentToken.isSubscript() || currentToken.isSuperscript())) {
                if (doesRequireDehypenisation(tokens, i)) {
                    //Clean up any additional spaces before the hyphen that have already been written to the output
                    int z = output.size() - 1;
@@ -137,4 +137,46 @@ public void testSubList_noEnd() throws Exception {
        assertThat(Iterables.getLast(result).getText(), is("!"));

    }
+
+   @Test
+   public void testMaterialNameWithHypenInFormula_shouldNotDehypenise() throws Exception {
+
+       String text = "based \n" +
+           "(Nd 1-x Ce x ) 2 Fe 14-y Co y B nanostructured magnets";
+
+       List<LayoutToken> layoutTokens = GrobidAnalyzer.getInstance().tokenizeWithLayoutToken(text);
+
+       layoutTokens.stream().forEach(layoutToken -> layoutToken.setOffset(layoutToken.getOffset() + 1000));
+       layoutTokens.get(4).setY(406.746);
+       layoutTokens.get(4).setX(55.754000000000005);
+
+       layoutTokens.get(6).setSubscript(true);
+       layoutTokens.get(6).setY(410.506);
+       layoutTokens.get(6).setX(65.2255);
+
+
+       layoutTokens.get(7).setSubscript(true);
+       layoutTokens.get(7).setY(410.506);
+       layoutTokens.get(7).setX(67.66675);
+
+       layoutTokens.get(8).setSubscript(true);
+       layoutTokens.get(8).setY(410.538);
+       layoutTokens.get(8).setX(70.108);
+
+       layoutTokens.get(10).setY(406.76);
+       layoutTokens.get(10).setX(73.3461);
+
+       layoutTokens.get(12).setSubscript(true);
+
+       layoutTokens.get(21).setSubscript(true);
+
+       List<LayoutToken> result = LayoutTokensUtil.dehyphenize(layoutTokens);
+
+       assertThat(result.get(7).getText(), is("-"));
+
+       System.out.println(result);
+   }
+
+
+
}
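
The guard added to `dehyphenize`, and exercised by the test above, can be shown in isolation. The sketch below is a self-contained approximation: the `Token` class is a hypothetical stand-in for GROBID's `LayoutToken`, keeping only the three properties the new condition reads, and `isHyphenationCandidate` is a name chosen here for illustration.

```java
// Hypothetical stand-in for LayoutToken, reduced to what the guard inspects.
final class DashGuardSketch {

    static final class Token {
        final String text;
        final boolean subscript;
        final boolean superscript;

        Token(String text, boolean subscript, boolean superscript) {
            this.text = text;
            this.subscript = subscript;
            this.superscript = superscript;
        }
    }

    // Mirrors the condition in the diff: a dash is only considered for
    // dehyphenation when it is neither subscript nor superscript.
    static boolean isHyphenationCandidate(Token token) {
        return "-".equals(token.text) && !(token.subscript || token.superscript);
    }
}
```

In a formula such as `Nd 1-x`, the dash inside the subscript `1-x` is part of the chemical notation, so it is skipped rather than treated as an end-of-line hyphen.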
9 changes: 9 additions & 0 deletions grobid-home/config/grobid.yaml
@@ -126,6 +126,15 @@ grobid:
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
+       useELMo: false
+       runtime:
+         # parameters used at runtime/prediction
+         max_sequence_length: 3000
+         batch_size: 1
+       training:
+         # parameters used for training
+         max_sequence_length: 2000
+         batch_size: 10

    - name: "name-header"
      engine: "wapiti"
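
How the deep learning backend applies `max_sequence_length` and `batch_size` is not shown in this commit. As a rough sketch under stated assumptions (sequences longer than the limit are truncated, and sequences are grouped into batches of the given size; the `SequenceBatcher` class is hypothetical and not part of GROBID or DeLFT), the two settings could interact as follows:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of truncation plus batching; not GROBID or DeLFT code.
final class SequenceBatcher {

    static <T> List<List<List<T>>> batch(List<List<T>> sequences,
                                         int maxSequenceLength,
                                         int batchSize) {
        if (maxSequenceLength <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("Both limits must be positive");
        }
        List<List<List<T>>> batches = new ArrayList<>();
        List<List<T>> current = new ArrayList<>();
        for (List<T> sequence : sequences) {
            // Assumption: anything beyond the limit is simply cut off.
            int end = Math.min(sequence.size(), maxSequenceLength);
            current.add(new ArrayList<>(sequence.subList(0, end)));
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // Keep the final, possibly smaller, batch.
            batches.add(current);
        }
        return batches;
    }
}
```

Under these assumptions, the runtime values above (`3000`, `1`) would process one long sequence at a time, while the training values (`2000`, `10`) would trade a shorter context for larger batches.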