Use this link to access the Google Drive, which contains our dataset, word embeddings, the PPT for the project, and the sample videos.
-
Requirements:
- python: version 3.8.x
- yarn: version 1.22.x
- node: version 12.16.x
- pip and virtualenv.
-
Steps:
- Clone the repository using
git clone https://github.com/tanmaypardeshi/CDAC-Hackathon.git
- Download the glove folder from the Google Drive link provided above and save it in the project directory.
- Download all the other CSV and JSON files from the Google Drive link and store them in the data folder of the project directory.
- Run virtualenv venv in the project directory to create a virtualenv.
- Run source venv/bin/activate to activate the virtualenv.
- On first setup only, run pip install -r requirements.txt in the project directory to install all Python dependencies.
- On first setup only, navigate to the frontend folder and run yarn install to install all JavaScript dependencies for React.
- To start the Flask server, run python run.py in the project directory.
- Navigate to the frontend folder and run yarn start to start the development server, then use the platform while keeping the Flask server running.
- Run deactivate to deactivate the virtualenv when you are done.
1. glove: Embeddings used to perform text summarization and information retrieval for Real Time Research News.
2. summariser.py: Uses the TextRank algorithm to summarize the input biomedical text.
3. ir_author.py: Uses Levenshtein distance to generate a similarity score between the author-based query and the documents.
4. ir_title.py: Uses Levenshtein distance and keyword indexing to generate a similarity score between the title-based query and the documents.
5. ir_optimised.py: Uses Levenshtein distance and keyword indexing, along with a keywords pickle file, to generate a similarity score between the author-based query and the documents.
6. news.py: Scrapes unstructured COVID-19 research news from the internet and uses information retrieval to display the results relevant to a query.
7. Q&A_CDQA_Finetuning.py: Script that fine-tunes BERT on a subset of the CORD-19 dataset.
8. Anomaly_detection.py: Script that fine-tunes BERT on a subset of the CORD-19 dataset.
9. qna.joblib: Trained model that predicts answers to a question query.
10. ir_old.csv: Dataset created from CORD-19 data for information retrieval.
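The Levenshtein-based scoring described for the ir_*.py scripts can be sketched as follows. This is only an illustration under stated assumptions, not the code in this repository: the function names (levenshtein, similarity, rank), the normalisation by the longer string's length, and the dictionary layout of a document are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(query: str, text: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    query, text = query.lower().strip(), text.lower().strip()
    longest = max(len(query), len(text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(query, text) / longest


def rank(query: str, documents: list, field: str) -> list:
    """Return (score, document) pairs, best match first."""
    scored = [(similarity(query, doc[field]), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical documents; the real data lives in the CSV files.
    docs = [
        {"title": "Paper A", "author": "Jane Smith"},
        {"title": "Paper B", "author": "John Smyth"},
    ]
    for score, doc in rank("jane smith", docs, "author"):
        print(f"{score:.2f}  {doc['title']}")
```

A normalised score makes results comparable across fields of different lengths, which is one reason to divide by the longer string rather than report the raw distance.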
-
Unsupervised Text Summarization Using Sentence Embeddings:
This research paper explains the process of text summarization using unsupervised methods. It is done by clustering sentence embeddings trained to embed paraphrases near each other.
-
Application and analysis of text summarization for biomedical domain content:
This research paper implements and analyses abstractive and extractive text summarization machine learning models for general language as well as biomedical domain-specific text. For abstractive text summarization, a sequence-to-sequence model based on recurrent neural networks (RNNs) is used. For extractive text summarization, a pretrained BERT model is used.
-
Supervised Machine Learning for Extractive Query Based Summarisation of Biomedical Data:
This paper explores the impact of several supervised machine learning approaches for extracting multi-document summaries for given queries. It compares classification and regression approaches for query-based extractive summarisation using data provided by the BioASQ Challenge.
-
Information Retrieval as Statistical Translation:
This paper proposes a new probabilistic approach to information retrieval based upon the ideas of statistical machine translation. The core of the approach is a statistical model of how a document can be translated into a query.
-
Statistical Language Modeling For Information Retrieval:
This paper reviews research and applications in statistical language modelling for information retrieval (IR), which has emerged within the past several years as a new probabilistic framework for describing information retrieval processes.
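To make the language-modelling idea concrete, here is a minimal sketch of query-likelihood ranking with Jelinek-Mercer smoothing, one standard formulation in this family. It is not taken from this project or from the paper's text; the function names, the whitespace tokenizer, and the smoothing weight lam=0.5 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def query_log_likelihood(query, document, collection_counts,
                         collection_size, lam=0.5):
    """log P(query | document) under a smoothed unigram model.

    Each term probability mixes the document estimate with the
    collection estimate: (1 - lam) * P(w | d) + lam * P(w | C).
    """
    doc_tokens = tokenize(document)
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = (collection_counts[term] / collection_size
                  if collection_size else 0.0)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank_documents(query, documents, lam=0.5):
    """Return (log-likelihood, document) pairs, most likely first."""
    collection_counts = Counter()
    for doc in documents:
        collection_counts.update(tokenize(doc))
    collection_size = sum(collection_counts.values())
    scored = [
        (query_log_likelihood(query, doc, collection_counts,
                              collection_size, lam), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical mini-collection for demonstration only.
    corpus = [
        "covid vaccine trial results",
        "protein folding with deep learning",
        "covid outbreak detection",
    ]
    for score, doc in rank_documents("vaccine trial", corpus):
        print(f"{score:.3f}  {doc}")
```

Smoothing against the collection keeps a document from scoring zero just because it lacks one query term, while still favouring documents that contain the terms.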
-
Unsupervised Question Answering by Cloze Translation:
This research paper explores to what extent high-quality training data is actually required for extractive QA, and investigates the possibility of unsupervised extractive QA. The problem is approached by first learning to generate context, question, and answer triplets in an unsupervised manner, which are then used to synthesize extractive QA training data automatically.
-
A review on anomaly detection in disease outbreak detection:
This paper gives a brief description of how AI can be used to detect pandemic-like anomalies.
-
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding:
This paper explores the architecture of BERT, the current state-of-the-art language representation model.