🚀 Sangmin Lee
🐋 Sungbeom Choi
🦄 ChanHyuk Park
🌟 Minsu Park
We are developing an AI-powered online research tool aimed at streamlining repetitive and time-consuming tasks in online data research. Our goal is to enable individuals to focus on more important tasks by harnessing the capabilities of generative AI technology.
We plan to start by collecting data from trusted sources such as 통계청 (Statistics Korea) and 정책 브리핑 (Korea Policy Briefing), and then establish partnerships with other data-rich websites to expand our vector database. This will allow us to provide valuable and reliable information for research purposes.
Our tool will be versatile, capable of handling a wide range of data formats, including web pages, PDF documents, YouTube videos, and even audio content. This flexibility ensures that users can extract information from diverse sources efficiently.
To make our tool even more user-friendly and productive, we will implement an autonomous agent that can understand and execute user commands effectively. This agent will serve as a valuable assistant, helping users navigate and extract information from the vast pool of data available online.
In summary, our AI-powered online research tool aims to enhance the productivity and efficiency of data research by automating repetitive tasks, providing access to reliable data sources, and incorporating an autonomous agent to assist users in their research endeavors.
To make GPT more useful, we introduce an Agent: GPT enhanced to understand human language, think autonomously, and judge which tools to use.
It can retrieve the information it needs from our database or the web and, based on the data or figures it finds, draw graphs or charts as required.
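As a rough illustration of this tool-use loop, here is a minimal sketch assuming the pre-1.0 `openai` SDK's function-calling interface; the `search_documents` tool name, its schema, and the dispatch stub are hypothetical placeholders, not our exact tool set.

```python
import json
import openai  # assumes the pre-1.0 openai SDK with function calling

# One illustrative tool description; the real agent also exposes
# web search, scraping, and chart-drawing tools.
FUNCTIONS = [{
    "name": "search_documents",  # hypothetical tool name
    "description": "Semantic search over the research vector database.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query."}},
        "required": ["query"],
    },
}]

def run_agent(user_message: str) -> str:
    """Let GPT decide whether to answer directly or call a tool."""
    messages = [{"role": "user", "content": user_message}]
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=messages,
        functions=FUNCTIONS,
        function_call="auto",  # GPT judges when a tool is appropriate
    )
    message = response["choices"][0]["message"]
    if message.get("function_call"):
        args = json.loads(message["function_call"]["arguments"])
        # Dispatch to the real tool here, append its output as a
        # "function" role message, and ask GPT to continue.
        return f"agent requested search_documents({args['query']!r})"
    return message["content"]
```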
We conducted extensive preprocessing so that GPT could better understand the data. This involved removing noise (such as ads and navigation bars) and incorporating visual layout information to capture context and structural details. With official authorization, we fine-tuned a Faster R-CNN model on a dataset of 200 images from 통계청 (Statistics Korea) and 정부 브리핑 (Government Briefings). Using this visual information, we can classify each chunk into the following categories (a detection sketch follows the table):
| Category | Description |
| --- | --- |
| Topic | Identifying the central subject or theme of the content. |
| Title | Recognizing and understanding the document or presentation's title. |
| Contents | Grasping the textual information within the document or presentation. |
| Figure | Identifying visual elements such as images or illustrations. |
| Graph | Recognizing and interpreting graphical representations of data. |
| Table | Understanding tabular data structures. |
| Table Caption | Recognizing and comprehending captions associated with tables. |
| Comment | Identifying and understanding comments or annotations within the content. |
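A minimal sketch of the detection step, assuming a LayoutParser/Detectron2 Faster R-CNN setup; the config and weight paths are placeholders for our fine-tuned model, and the index-to-category mapping is illustrative.

```python
import cv2
import layoutparser as lp  # wraps Detectron2 Faster R-CNN models

# Placeholder paths for the Faster R-CNN fine-tuned on the 200 annotated pages;
# the index -> category mapping below is illustrative, not the exact one used.
LABEL_MAP = {0: "Topic", 1: "Title", 2: "Contents", 3: "Figure",
             4: "Graph", 5: "Table", 6: "Table Caption", 7: "Comment"}

model = lp.Detectron2LayoutModel(
    config_path="finetuned/faster_rcnn_config.yaml",  # placeholder path
    model_path="finetuned/model_final.pth",           # placeholder path
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map=LABEL_MAP,
)

image = cv2.imread("page.png")[..., ::-1]  # BGR -> RGB for the detector
layout = model.detect(image)

# Every detected block carries a category and a bounding box, which lets us
# route Tables, Graphs, etc. to their own chunking logic.
for block in layout:
    print(block.type, block.coordinates)
```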
When you do research, you collect many kinds of data. We made it possible to ingest multiple types of data, not just text.
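A sketch of per-format loaders, assuming common open-source libraries (requests/BeautifulSoup, pypdf, youtube-transcript-api, Whisper); our pipeline may rely on different loaders, such as the ones bundled with Embedchain cited below.

```python
import requests
import whisper                                   # speech-to-text for audio sources
from bs4 import BeautifulSoup
from pypdf import PdfReader
from youtube_transcript_api import YouTubeTranscriptApi

def load_web_page(url: str) -> str:
    """Fetch a page and keep only its visible text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def load_pdf(path: str) -> str:
    """Concatenate the extracted text of every PDF page."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def load_youtube(video_id: str) -> str:
    """Join the caption segments of a YouTube video into one transcript."""
    return " ".join(seg["text"] for seg in YouTubeTranscriptApi.get_transcript(video_id))

def load_audio(path: str) -> str:
    """Transcribe an audio file."""
    return whisper.load_model("base").transcribe(path)["text"]
```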
We designed our pipeline to work with you through the entire research process.
통계청 (Statistics Korea) only returns results when keywords match exactly, which makes searching difficult. We instead surface valuable material that is semantically similar to your query.
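A minimal sketch of semantic retrieval, assuming OpenAI's `text-embedding-ada-002` embeddings (pre-1.0 SDK) and brute-force cosine similarity; in practice the vectors live in a vector database, and the embedding model is interchangeable.

```python
import numpy as np
import openai  # assumes the pre-1.0 openai SDK; any embedding model works similarly

def embed(texts):
    """Embed a batch of strings (the model choice is an assumption)."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

def semantic_search(query, documents, top_k=3):
    """Rank documents by meaning instead of exact keyword overlap."""
    doc_vecs = embed(documents)
    query_vec = embed([query])[0]
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:top_k]
    return [(documents[i], float(sims[i])) for i in top]
```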
Tables carry a lot of useful information in their structure. However, if you scrape them directly into plain text, GPT cannot understand them well enough to use them. To handle tables well, it is important to preprocess the table's structure into a form GPT can understand rather than scraping it as-is. When we scraped tables verbatim, GPT did not use the table information; with our chunking method, the tables were understood and used far better.
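A minimal sketch of the idea behind this preprocessing, using pandas to turn a scraped HTML table into markdown that keeps rows and columns explicit; the HTML snippet and numbers are dummy values, and our actual chunker is more involved.

```python
from io import StringIO
import pandas as pd  # read_html needs lxml/bs4; to_markdown needs tabulate

def table_html_to_markdown(html: str) -> str:
    """Parse the first <table> on a page and serialize it with explicit pipes,
    so GPT sees the row/column structure instead of a flattened run of text."""
    df = pd.read_html(StringIO(html))[0]
    return df.to_markdown(index=False)

DUMMY_TABLE = """
<table>
  <tr><th>Year</th><th>Indicator A</th><th>Indicator B</th></tr>
  <tr><td>2021</td><td>120</td><td>34.5</td></tr>
  <tr><td>2022</td><td>135</td><td>31.2</td></tr>
</table>
"""

print(table_html_to_markdown(DUMMY_TABLE))
```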
GPT outputs text by default, so it cannot generate visuals such as graphs on its own. We give our agent tools, so it can write Python code to plot graphs based on the information it receives as text.
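A sketch of the kind of plotting tool the agent can call once it has pulled numbers out of the text; the function name and the sample values are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the agent returns an image file
import matplotlib.pyplot as plt

def plot_bar_chart(title, labels, values, path="chart.png"):
    """Draw a simple bar chart from values the agent extracted as text."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel("value")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path

# Example call with dummy numbers the agent might have scraped from a table.
plot_bar_chart("Quarterly figures", ["Q1", "Q2", "Q3", "Q4"], [10, 12, 9, 15])
```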
This is the first draft from our model. Based on the data we provided, the model wrote a good draft with useful tables and thumbnail images (from DALL-E).
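A minimal sketch of how such a thumbnail could be requested, assuming the pre-1.0 `openai` SDK's image endpoint; the prompt and size are illustrative.

```python
import openai  # assumes the pre-1.0 openai SDK

def generate_thumbnail(topic: str) -> str:
    """Ask DALL-E for a thumbnail image and return its URL."""
    response = openai.Image.create(
        prompt=f"Clean, minimal thumbnail illustration for a research report about {topic}",
        n=1,
        size="512x512",
    )
    return response["data"][0]["url"]
```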
@misc{embedchain,
  author = {Taranjeet Singh},
  title = {Embedchain: Framework to easily create LLM powered bots over any dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}
@article{shen2021layoutparser,
  title = {LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis},
  author = {Shen, Zejiang and Zhang, Ruochen and Dell, Melissa and Lee, Benjamin Charles Germain and Carlson, Jacob and Li, Weining},
  journal = {arXiv preprint arXiv:2103.15348},
  year = {2021}
}