EY_GDS_Project : Prototype submission for Intent-based Search and Classification


Presentation Link: https://youtu.be/41fjEdk7ymk

Hackpions – EY GDS Hackathon

Theme

Intent-based Search and Classification

Problem statement

  • Propose and design an algorithm that can periodically crawl content repositories of files (shared file system or SharePoint) to determine the properties (metadata) and the content of the files and match them against predefined intents. Examples: reusable assets, policy documents and solution accelerators.

  • List and match the metadata of newly added items, and add the metadata of matching items to a tracking database together with an index and the matched pre-defined intents.

We have created a solution for both of these points!

Solution

Features

  • Retrainable on new data.
  • Can crawl through all files and sub-folders in a parent repository and generate intents for each file.
  • Stores the filename, file location and predicted intents of every file in a CSV tracking file.
  • Provides an intelligent query search that understands the intents associated with a query sentence and returns the names and locations of the files that match those intents.

Dataset

External libraries and frameworks used :

  • pdfminer.six : To extract text from PDFs.
  • bert-for-tf2 : For fine-tuning the BERT model with our training data.
  • tensorflow : Using Keras to create a custom model.
  • pandas : For data analysis.
  • os : For crawling through the repository.
  • sklearn
  • nltk

Portability

  • We give the notebook access to Google Drive using an API access token.
  • In the same way, we can connect to SharePoint.
  • For the folder you want to crawl, just provide its path, as we have done for our Google Drive folder (see the sketch below).
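
A minimal sketch of how a crawl folder can be provided, assuming the notebook runs in Google Colab; the mount path and folder name below are placeholders, not the exact paths used in this repository:

    # Minimal sketch: mount Google Drive in Colab and point the crawler at a folder.
    from google.colab import drive

    drive.mount('/content/drive')  # prompts for an authorization token on first run

    # Hypothetical folder to crawl; a SharePoint-synced folder (or the client in
    # API_access_code_snippet) would provide the path in the same way.
    ROOT_FOLDER = '/content/drive/MyDrive/shared_repository'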

Repository structure :

  • data_to_preprocess :

    • This folder contains training files.
    • You can put PDFs or documents into this folder for training.
    • Then run python3 preprocessing_folder.py.
    • Executing preprocessing_folder.py will extract sentences from the files in this folder and store them with the respective intents.
  • API_access_code_snippet :

    • SharePoint API client.
  • preprocessing_folder.py :

    • Preprocesses the data_to_preprocess folder to extract training sentences from the files it contains.
    • Creates a CSV file of the sentences and their respective intents (see the first sketch after this list).
  • Crawler_and_intelligent_query_search.ipynb :

    • Uses train.csv, test.csv and valid.csv to fine-tune BERT.
    • Shows the test accuracies and metrics.
    • After training, the model contains the updated checkpoints.
    • Download this complete folder.
  • crawler.py :

    • Crawls through any given content repository.
    • Loads the trained model.
    • For each file, it predicts the intents of the file.
    • It stores the filename, file location and unique intents in a CSV (see the second sketch after this list).
  • query_search.py :

    • You can enter as many queries as you want.
    • For each query sentence, our model predicts its intent or intents.
    • We look through the CSV of files and their respective intents generated during crawling.
    • If the query sentence's intent matches an intent of any file, we display that file.
    • In this way, we display all files that share the intents of the query sentence (a sketch of this matching follows the next section).
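
Below is a minimal sketch of the preprocessing step, assuming each sub-folder of data_to_preprocess is named after an intent and contains the PDFs for that intent; the folder layout, column names and output filename are assumptions, not necessarily what preprocessing_folder.py does internally:

    # Hedged sketch of preprocessing: extract sentences from PDFs and label each
    # sentence with the intent taken from its sub-folder name (assumed layout).
    import os
    import nltk
    import pandas as pd
    from pdfminer.high_level import extract_text
    from nltk.tokenize import sent_tokenize

    nltk.download('punkt', quiet=True)  # sentence tokenizer data

    rows = []
    for intent in os.listdir('data_to_preprocess'):
        intent_dir = os.path.join('data_to_preprocess', intent)
        if not os.path.isdir(intent_dir):
            continue
        for name in os.listdir(intent_dir):
            if not name.lower().endswith('.pdf'):
                continue
            text = extract_text(os.path.join(intent_dir, name))  # pdfminer.six
            for sentence in sent_tokenize(text):
                rows.append({'sentence': sentence.strip(), 'intent': intent})

    pd.DataFrame(rows).to_csv('train.csv', index=False)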
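
And a minimal sketch of the crawling step; predict_intents is a stand-in for loading and running the fine-tuned BERT model, and the repository path and output filename are placeholders rather than the exact ones in crawler.py:

    # Hedged sketch of the crawler: walk a repository, predict intents per file,
    # and store the filename, file location and unique intents in a CSV.
    import os
    import pandas as pd
    from pdfminer.high_level import extract_text

    def predict_intents(text):
        # Stand-in: the real code tokenizes `text`, runs the fine-tuned BERT
        # classifier and returns the set of predicted intent labels.
        return set()

    ROOT_FOLDER = '/content/drive/MyDrive/shared_repository'  # hypothetical repository root

    records = []
    for root, _dirs, files in os.walk(ROOT_FOLDER):
        for name in files:
            path = os.path.join(root, name)
            if name.lower().endswith('.pdf'):
                text = extract_text(path)
            else:
                with open(path, errors='ignore') as fh:
                    text = fh.read()
            intents = predict_intents(text)
            records.append({'filename': name,
                            'location': path,
                            'intents': ';'.join(sorted(intents))})

    pd.DataFrame(records).to_csv('crawled_intents.csv', index=False)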

Why is this an intelligent search?

  • Because we recognise the intents of the query sentence first.
  • Then we check if some files have the same intents.
  • For each file in the repo, we have already predicted multiple intents.
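
A minimal sketch of this matching logic, reusing the CSV written by the crawler sketch above; the column names and the predict_intents stand-in are assumptions rather than the exact code in query_search.py:

    # Hedged sketch of the intelligent query search: predict the query's intents,
    # then list every crawled file that shares at least one of them.
    import pandas as pd

    def predict_intents(text):
        # Stand-in for the fine-tuned BERT classifier used on the query sentence.
        return set()

    files = pd.read_csv('crawled_intents.csv')  # produced during crawling

    query = input('Enter a query sentence: ')
    query_intents = predict_intents(query)

    for _, row in files.iterrows():
        file_intents = set(str(row['intents']).split(';'))
        if query_intents & file_intents:  # at least one shared intent
            print(row['filename'], '->', row['location'])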

Steps to execute :

  • Create a new conda environment.
  • Install the dependencies with pip install -r requirements.txt.
