Review patent process #1082

Merged
kermitt2 merged 18 commits into master from review-patent
Feb 7, 2024
kermitt2 (Owner) commented Feb 5, 2024

This is an update and review of the task of patent and non-patent reference extraction from patent documents.

  • Support of Deep Learning models with adapted segmentation of input sequences for training and prediction.
  • Batch processing for prediction.
  • Include a BidLSTM_CRF_FEATURES model, which significantly improves extraction accuracy for NPL references compared to CRF (+10 points F1-score) and performs similarly for patent citations (+1 point F1-score). Note that training also supports all BERT flavors, in particular Google's BERT for Patents.
  • Update of the mapping of US application prefix numbers to years (for patent publication number normalization according to epodoc), up to 2021/2022.
  • Review the XML serialization so that the result XML only includes paragraphs that contain references. References are now given for each paragraph (rather than all at the end), with position offsets relative to the paragraph (and not to the whole document as before). This improves readability and should normally require no change to the XML parser used to get the references.
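
To illustrate the change from document-level to paragraph-level offsets described in the last point, here is a minimal sketch. It is not GROBID's implementation: the function name `rebase_offset`, the sample paragraphs, and the assumption that paragraphs are joined by a single newline are all hypothetical and only serve to show the idea.

```python
from bisect import bisect_right

def rebase_offset(paragraph_starts, doc_offset):
    """Map a document-level character offset to (paragraph index, offset inside that paragraph).

    paragraph_starts: sorted list with the document-level start offset of each paragraph.
    Hypothetical helper for illustration only.
    """
    idx = bisect_right(paragraph_starts, doc_offset) - 1
    if idx < 0:
        raise ValueError("offset precedes the first paragraph")
    return idx, doc_offset - paragraph_starts[idx]

# Made-up sample paragraphs
paragraphs = [
    "Background of the invention.",
    "See US 2010/0123456 A1 for details.",
]

# Assumption: the document text is the paragraphs joined by a single newline
document = "\n".join(paragraphs)
starts = []
position = 0
for paragraph in paragraphs:
    starts.append(position)
    position += len(paragraph) + 1  # +1 for the newline separator

# Document-level offset of the reference, as in the previous serialization
doc_offset = document.find("US 2010/0123456 A1")

# Paragraph-level offset, as in the reviewed serialization
paragraph_index, local_offset = rebase_offset(starts, doc_offset)
print(paragraph_index, local_offset)
```

With paragraph-relative offsets, a consumer can locate a reference using only the paragraph it is attached to, without reconstructing the full document text.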

With a GPU and the DL model, using 8 threads, processing 500 EP B publications took 775 seconds.
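
For a rough sense of throughput, the reported figure can be converted into per-document numbers. This is plain arithmetic on the wall-clock time given above; the variable names are illustrative.

```python
# Reported figures: 500 EP B publications processed in 775 seconds (wall clock, 8 threads)
documents = 500
elapsed_seconds = 775

seconds_per_document = elapsed_seconds / documents
documents_per_second = documents / elapsed_seconds

print(f"{seconds_per_document:.2f} s per document")
print(f"{documents_per_second:.2f} documents per second")
```

That is about 1.55 seconds per document, or roughly 0.65 documents per second.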

Related: grobid_client_python has been extended to process directories of patent files (ST36 or PDF), for example:

grobid_client --input /media/lopez/data/document-quality-data/citation_recognition/patent/ground_truth/  --output resources/test_out/ --n 8 processCitationPatentST36

TODO: proper 10-fold cross-evaluation of models, benchmark, and optionally automatic download of the fine-tuned BERT large model for patents.

@kermitt2 kermitt2 marked this pull request as draft February 5, 2024 21:21

coveralls commented Feb 5, 2024

Coverage Status

coverage: 39.783% (-0.1%) from 39.893% when pulling 1ebc6c8 on review-patent into 4816a7a on master.

@kermitt2 kermitt2 marked this pull request as ready for review February 6, 2024 12:44
@kermitt2 kermitt2 merged commit 269c897 into master Feb 7, 2024
7 of 9 checks passed
@lfoppiano lfoppiano added this to the 0.8.1 milestone Jun 9, 2024