feat: Perform RAG in search instead of simple keywords #11891

petros94 · 2024-11-19T08:37:32Z

On my team, we usually use the search field to find datasets with various characteristics. Currently, this field matches keywords against columns or words in a dataset's description.

Since RAG adoption is on the rise, I'd like to propose a new feature leveraging natural language for searching the datasets:

All the information regarding the datasets (description, tags, owner, etc.) is transformed into embeddings and stored in a vector db.
The user would then use the search bar to write a complete question like "What are the datasets we have about X?" or "Who is the owner of dataset Y?", and the system would perform a similarity search followed by LLM generation to answer the query.
Since there is already an ontology defined in DataHub, there could even be a more sophisticated graph RAG to answer questions involving relationships, like "How many datasets do we have regarding Z?", "Which dataset is the parent of W?", etc.
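
To make the retrieval part of the proposal concrete, here is a minimal sketch of the embed, index, and search steps. Everything in it is an assumption for illustration: the dataset records, the function names (`to_text`, `embed`, `retrieve`, `build_prompt`), and the bag-of-words "embedding", which stands in for a real embedding model. It does not use any DataHub API, an in-memory list plays the role of the vector db, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter

# Hypothetical dataset metadata; a real system would read this from the catalog.
DATASETS = [
    {"name": "orders_daily", "owner": "alice", "tags": ["sales"],
     "description": "Daily customer orders and revenue by region"},
    {"name": "web_clicks", "owner": "bob", "tags": ["marketing"],
     "description": "Clickstream events from the public website"},
    {"name": "hr_headcount", "owner": "carol", "tags": ["people"],
     "description": "Monthly employee headcount by department"},
]


def to_text(ds):
    """Flatten name, description, tags, and owner into one string to embed."""
    return f"{ds['name']} {ds['description']} {' '.join(ds['tags'])} owner {ds['owner']}"


def embed(text):
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# In-memory stand-in for the vector db: (vector, dataset) pairs.
INDEX = [(embed(to_text(ds)), ds) for ds in DATASETS]


def retrieve(question, k=2):
    """Return the k datasets whose metadata is most similar to the question."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [ds for _, ds in ranked[:k]]


def build_prompt(question, k=2):
    """Assemble the retrieved context and the question into an LLM prompt."""
    context = "\n".join(f"- {to_text(ds)}" for ds in retrieve(question, k))
    return f"Answer using only these datasets:\n{context}\n\nQuestion: {question}"
```

Calling `build_prompt("Who is the owner of web clicks?", k=1)` would produce a prompt whose context contains the `web_clicks` record, which an LLM could then use to answer. The graph RAG idea would need an additional step that follows ontology relationships, which this sketch does not attempt.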

I think this feature would greatly enhance the user experience and productivity, provide a competitive advantage over other solutions (https://www.secoda.co/blog/transforming-data-discovery-using-secoda-ai), and open new possibilities for the platform as a whole.
