[Logo]

Chat with SPARC datasets

Table of Contents

  • About
  • Problem Statement
  • Our Proposed Solution
  • Workflow
  • Data
  • Data pre-processing
  • Model
  • Running the app
  • Troubleshooting
  • Reporting issues
  • FAIR practices
  • Team Members
  • License
  • Acknowledgements
  • ToDos

About

This repository belongs to Team #5, SPARC CHAT, which took part in the SPARC Codeathon 2023. The project's concept and plan were formulated collaboratively by the team members during the event.

Problem Statement

For a new user, navigating unfamiliar resources such as SPARC and its associated portals can be challenging, especially when trying to find specific information quickly. Acquiring the relevant information or datasets may require extensive exploration across different sections, projects, and pipelines, and users can end up repeatedly searching for the same things. This makes the process time-consuming and effortful.

Our Proposed Solution

The emergence of OpenAI's ChatGPT marks a significant advancement in chatbot technology: this next-generation chatbot lets users ask questions interactively and receive relevant answers efficiently. However, it should be used with caution. ChatGPT is a large language model (LLM) trained on extensive data gathered from the internet, and since its launch numerous closed- and open-source LLMs have also been released.

In this project, we leverage open-source LLMs and the data available on the SPARC portal to create a chatbot that helps users find the links they are looking for and provides summaries of relevant information. Currently, the chatbot is limited to processing text-based information.

Workflow

[Pipeline diagram]

Data

We gathered data from various pages of the SPARC portal, including the SPARC Data & Models page and other linked web pages. For our model, we randomly picked 15 datasets, each containing valuable information such as the dataset description, abstract, protocols, related datasets, and other relevant details.

Data pre-processing

The data for each dataset, mainly its description, was stored manually in a .txt file. These files are available in the texts folder of the repository.
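
For illustration, the sketch below loads those files into memory. The texts folder name comes from the repository; the function and variable names here are our own and may not match app.py.

```python
from pathlib import Path

def load_descriptions(folder: str = "texts") -> dict[str, str]:
    """Read every manually curated dataset description stored as a .txt file."""
    docs = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Key each description by its file name (without the .txt extension).
        docs[path.stem] = path.read_text(encoding="utf-8")
    return docs

if __name__ == "__main__":
    documents = load_descriptions()
    print(f"Loaded {len(documents)} dataset descriptions")
```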

Model

We use publicly available Hugging Face models to vectorize (embed) our data. At query time we retrieve the information relevant to the user's prompt, generate an answer with an LLM, and serve the result through a Gradio GUI.
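
The sketch below shows one way such a pipeline can be wired together. The embedding model (sentence-transformers/all-MiniLM-L6-v2), the answering model (google/flan-t5-base), the prompt template, and the use of hnswlib (mentioned under Troubleshooting) as the vector index are our assumptions; the models and prompts actually used in app.py may differ.

```python
from pathlib import Path

import gradio as gr
import hnswlib
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Load the dataset descriptions from the texts/ folder.
documents = [p.read_text(encoding="utf-8") for p in sorted(Path("texts").glob("*.txt"))]

# Embed the descriptions with a publicly available Hugging Face model.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.encode(documents)

# Index the embeddings for approximate nearest-neighbour retrieval.
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(documents))
index.add_items(vectors, list(range(len(documents))))

# Illustrative answering model; the LLM used in app.py may differ.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(query: str) -> str:
    # Retrieve the descriptions closest to the query and ask the LLM to answer from them.
    labels, _ = index.knn_query(embedder.encode([query]), k=3)
    context = "\n\n".join(documents[i] for i in labels[0])
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generator(prompt, max_new_tokens=256)[0]["generated_text"]

# Gradio provides the GUI, served locally in the browser.
gr.Interface(fn=answer, inputs="text", outputs="text", title="SPARC CHAT").launch()
```

Running a script like this from the repository root (where the texts folder lives) launches a local Gradio interface of the kind described under Running the app below.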

Running the app

  • Create a virtual environment: conda create -n chat
  • Activate the virtual environment: conda activate chat
  • Install the requirements: pip install -r requirements.txt
  • Run the app: python app.py --hf_token <YOUR-HUGGING-FACE_TOKEN>
  • Open the app in your browser at http://127.0.0.1:7860

You should see the Gradio interface running locally, and you will be prompted to enter your query, like so:

[Screenshot: Gradio web app]

Troubleshooting

If you run into issues installing hnswlib, try installing it from source: pip install git+https://github.com/nmslib/hnswlib.git.

You may also need to run export HNSWLIB_NO_NATIVE=1. See this ongoing GitHub thread for the discussion.

Then proceed with installing the requirements from requirements.txt.

Reporting issues

Please report an issue or suggest a new feature using the issue page. Check existing issues before submitting a new one.

FAIR practices

Since the codeathon focused on FAIR data principles, SPARC CHAT also adheres to FAIR principles.

Team Members

License

This code is licensed under the MIT License.

  • We can change to another license if needed.

Acknowledgements

We would like to thank the organizers of the SPARC Codeathon 2023 for guidance and help during this Codeathon.

ToDos

  • FAIR practices statement for this project
