SciAssist


About | Announcement | Installation | Usage | Contribution

About

This is the repository of SciAssist, a toolkit that assists scientists with their research. SciAssist currently supports Summarization and Reference String Parsing; more functions are under active development by WING@NUS, Singapore. The project is built on an open-source template by ashleve, which uses PyTorch Lightning for model training and Hydra for configuration.

Announcement

  • CocoSciSum: A Scientific Summarization Toolkit with Compositional Controllability has been accepted as an EMNLP 2023 System Demonstration paper!
  • Our demo is online on Hugging Face Spaces!

Installation

conda create --name assist python=3.8
conda activate assist
[install pytorch]
pip install sciassist

Important: Install a PyTorch build that is compatible with your machine before installing SciAssist.

Set up Grobid for PDF processing

After installing the package, you can set up Grobid with the CLI:

setup_grobid

This sets up Grobid. After the setup completes, start the Grobid server with:

run_grobid
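
Before sending PDFs through a pipeline, it can help to confirm that the server is reachable. The sketch below assumes Grobid listens on its default address `http://localhost:8070` and answers on its `/api/isalive` health endpoint; whether `run_grobid` starts the server on that port is an assumption, and `grobid_is_alive` is a hypothetical helper, not part of SciAssist.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 2.0) -> bool:
    """Return True if a Grobid server answers its health check at base_url."""
    # Assumption: the server uses Grobid's default port and /api/isalive endpoint.
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or HTTP error: treat as down.
        return False
```

Calling `grobid_is_alive()` right after `run_grobid` gives a quick yes/no answer; if it returns `False`, the server may still be starting or may be listening on a different port.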

Usage

Task 1: (Single Document) Summarization

from SciAssist import Summarization

# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# summarizer = Summarization(device="cpu")
summarizer = Summarization(device="gpu")

text = """1 INTRODUCTION . Statistical learning theory studies the learning properties of machine learning algorithms , and more fundamentally , the conditions under which learning from finite data is possible . 
In this context , classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures , such as combinatorial dimensions , covering numbers and Rademacher/Gaussian complexities ( Shalev-Shwartz & Ben-David , 2014 ; Boucheron et al. , 2005 ) . 
Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data ( Bousquet & Elisseeff , 2001 ; Kutin & Niyogi , 2002 ) . 
In this view , the continuity of the process that maps data to estimators is crucial , rather than the complexity of the hypothesis space . 
Different notions of stability can be considered , depending on the data perturbation and metric considered ( Kutin & Niyogi , 2002 ) . 
Interestingly , the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other , and can be shown to be equivalent as shown in Poggio et al . ( 2004 ) and Shalev-Shwartz et al . ( 2010 ) . 
In modern machine learning overparameterized models , with a larger number of parameters than the size of the training data , have become common . 
The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process ( Bühlmann & Van De Geer , 2011 ; Steinwart & Christmann , 2008 ) . 
However , it was recently shown - first for deep networks ( Zhang et al. , 2017 ) , and more recently for kernel methods ( Belkin et al. , 2019 ) - that learning is possible in the absence of regularization , i.e. , when perfectly fitting/interpolating the data . 
Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding . 
Since learning using models that interpolate is not exclusive to deep neural networks , we study generalization in the presence of interpolation in the case of kernel methods . 
We study both linear and kernel least squares problems in this paper . """

# Summarize a raw string
res = summarizer.predict(text, type="str")
# Summarize a plain-text file
res = summarizer.predict("bodytext.txt", type="txt")
# Summarize a PDF file (processed with Grobid)
res = summarizer.predict("raw.pdf")
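
Rather than hard-coding the device, a script can choose it at run time. The following is a minimal sketch: `pick_device` is a hypothetical helper (not part of SciAssist) that returns the `"gpu"` or `"cpu"` strings used above, based on whether PyTorch reports a usable CUDA device.

```python
import importlib.util


def pick_device(prefer_gpu: bool = True) -> str:
    """Return "gpu" when PyTorch reports a usable CUDA device, otherwise "cpu"."""
    if not prefer_gpu or importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch  # imported lazily so the helper also works without PyTorch

    return "gpu" if torch.cuda.is_available() else "cpu"


# Usage (requires SciAssist to be installed):
# from SciAssist import Summarization
# summarizer = Summarization(device=pick_device())
```

The same helper can be passed to `ReferenceStringParsing(device=...)`.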

Task 2: Reference string parsing

from SciAssist import ReferenceStringParsing

# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# ref_parser = ReferenceStringParsing(device="cpu")
ref_parser = ReferenceStringParsing(device="gpu")

# For string
res = ref_parser.predict(
    """Calzolari, N. (1982) Towards the organization of lexical definitions on a 
    database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles 
    University, Prague, pp.61-64.""", type="str")
# Parse reference strings from a plain-text file
res = ref_parser.predict("test.txt", type="txt")
# Parse reference strings from a PDF file (processed with Grobid)
res = ref_parser.predict("test.pdf")
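
Both pipelines follow the same calling pattern in the examples above: `type="str"` for a raw string, `type="txt"` for a text file, and no `type` argument for a PDF. When inputs of mixed kinds are processed in one loop, that choice can be derived from the input itself. This is a sketch only: `predict_kwargs` is a hypothetical helper, and it relies on a simple file-extension heuristic rather than on anything SciAssist provides.

```python
def predict_kwargs(source: str) -> dict:
    """Pick the keyword arguments for predict() following the examples above."""
    lowered = source.strip().lower()
    is_single_line = "\n" not in lowered
    if is_single_line and lowered.endswith(".pdf"):
        return {}  # the PDF examples pass no `type` argument
    if is_single_line and lowered.endswith(".txt"):
        return {"type": "txt"}
    # Anything else is treated as raw text.
    return {"type": "str"}


# Usage (requires SciAssist to be installed):
# for source in ["test.pdf", "test.txt", "Calzolari, N. (1982) ..."]:
#     res = ref_parser.predict(source, **predict_kwargs(source))
```

A raw string that happens to end in `.pdf` or `.txt` would be misclassified by this heuristic, so pass `type` explicitly when the input kind is already known.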

CHANGE LOG

Source Code

  • Rename SingleSummarization to Summarization.
  • Change the format of output files from .txt to .json.

Documentation

  • Move the definition of Pipeline class from Usage to Contribution Guide.
  • Add catalog for Contribution Guide.
  • Add examples for choosing devices in Usage.

Contribution

Here is a brief introduction to incorporating a new task into SciAssist. To add a new task, you will generally need to:

1. Git clone this repo and prepare the virtual environment.
2. Install Grobid Server.
3. Create a LightningModule and a LightningDataModule.
4. Train a model.
5. Provide a pipeline for users.

We provide a step-by-step contribution guide; see SciAssist’s documentation.

LICENSE

This toolkit is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). Read LICENSE for more information.