This project allows you to crawl datasets and ingest them into Vectara by using pre-built or custom crawlers. You can use Vectara’s APIs to create conversational experiences—such as chatbots, semantic search, and workplace search—from your data.
For more information about this repository, see Code Organization and Crawling.
vectara-ingest is an open source Python project that demonstrates how to crawl datasets and ingest them into Vectara. It provides a step-by-step guide on building your own crawler and some pre-built crawlers for ingesting data from sources such as:
- Websites
- RSS feeds
- Jira tickets
- Notion notes
- Docusaurus documents
This guide explains how to create a basic crawler to scrape content from Paul Graham's website, and ingest it into Vectara.
- Free Vectara account
- Created data corpus
- API key
- Write access to the corpus
- Python 3.8 (or higher)
- pyyaml - install with:
pip install pyyaml
- Docker
This section explains how to clone the vectara-ingest repository to your machine.
Open a terminal session and clone the repository to a directory on your machine:
git clone https://github.com/vectara/vectara-ingest.git
1. Open a Windows PowerShell terminal.
2. Update your Windows Subsystem for Linux (WSL):
   wsl --update
3. Ensure that WSL has the correct version of Linux:
   wsl --install ubuntu-20.04
4. Open your Linux terminal and clone the repository to a directory on your machine:
   git clone https://github.com/vectara/vectara-ingest.git
Note: Run step 4 only within your Linux environment, not in the Windows environment. You may need to choose a username for your Ubuntu environment as part of the setup process.
In this example, we index the content of the https://www.paulgraham.com website into Vectara. Because this website provides an RSS feed but no sitemap, we use the vectara-ingest RSS crawler.
1. Navigate to the directory that you have cloned.
2. Copy the secrets.example.toml file to secrets.toml.
3. In the secrets.toml file, change api_key to your Vectara API key. To retrieve your API key from the Vectara console, click API Access > API Keys.
4. In the config/ directory, copy the news-bbc.yaml config file to pg-rss.yaml.
5. Edit the pg-rss.yaml file and make the following changes:
   - Change the vectara.corpus_id value to the ID of the corpus into which you want to ingest the content of the website. To retrieve your corpus ID from the Vectara console, click Data > Your Corpus Name.
   - Change the vectara.account_id value to the ID of your account. To retrieve your account ID from the Vectara console, click your username in the upper-right corner.
   - Change rss_crawler.source to pg.
   - Change rss_crawler.rss_pages to ["http://www.aaronsw.com/2002/feeds/pgessays.rss"].
   - Change rss_crawler.days_past to 365.
6. Ensure that Docker is running.
7. Run the script from the directory that you cloned, specifying your .yaml configuration file and your default profile from the secrets.toml file:
   bash run.sh config/pg-rss.yaml default
   Note:
   - On Linux, ensure that the run.sh file is executable by running the following command:
     chmod +x run.sh
   - On Windows, ensure that you run this command from within the WSL 2 environment.
   Note: To protect your system's resources and make it easier to move your crawlers to the cloud, the crawler executes inside a Docker container. Setting up the container is a lengthy process because it involves numerous dependencies.
8. When the container is set up, you can track your crawler’s progress:
   docker logs -f vingest
While your crawler is ingesting data into your Vectara corpus, you can try queries against the corpus in the Vectara console: click Data > Your Corpus Name and type in a query such as "What is a maker schedule?"
The codebase includes the following components.
- run.sh: The main shell script to execute when you want to launch a crawl job (see below for more details).
- ingest.py: The main entry point for a crawl job.
- Dockerfile: The Docker image definition file.
Fundamental utilities depended upon by the crawlers:
- indexer.py: Defines the Indexer class, which implements helpful methods to index data into Vectara, such as index_url(), index_file(), and index_document().
- crawler.py: Defines the Crawler base class for crawling; each specific crawler should implement the crawl() method specific to its type.
- pdf_convert.py: Helper class to convert URLs into local PDF documents.
- extract.py: Utilities for text extraction from HTML.
- utils.py: Some utility functions used by the other code.
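The relationship between the Crawler base class and a specific crawler can be sketched as follows. This is a minimal illustration, not the project's actual code: the constructor arguments, the stand-in Indexer, the UrlListCrawler class, and the urllist_crawler config section are all hypothetical names chosen for the example. Only the idea that each crawler implements its own crawl() method comes from the description above.

```python
from abc import ABC, abstractmethod


class Indexer:
    """Hypothetical stand-in for the project's Indexer; it only records URLs."""

    def __init__(self) -> None:
        self.indexed = []

    def index_url(self, url: str) -> bool:
        # A real indexer would fetch the page and send its text to Vectara.
        self.indexed.append(url)
        return True


class Crawler(ABC):
    """Base class: holds shared state and leaves crawl() to each subclass."""

    def __init__(self, cfg: dict, indexer: Indexer) -> None:
        self.cfg = cfg
        self.indexer = indexer

    @abstractmethod
    def crawl(self) -> None:
        """Discover content and hand it to the indexer."""


class UrlListCrawler(Crawler):
    """Hypothetical crawler that indexes a fixed list of URLs from its config."""

    def crawl(self) -> None:
        for url in self.cfg["urllist_crawler"]["urls"]:
            self.indexer.index_url(url)


cfg = {"urllist_crawler": {"urls": ["https://example.com/a", "https://example.com/b"]}}
indexer = Indexer()
UrlListCrawler(cfg, indexer).crawl()
print(indexer.indexed)
```

Keeping crawl() abstract means a new source type only needs one new subclass, while indexing logic stays in one place.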
The crawlers/ directory includes implementations of the various specific crawlers.
The config/ directory includes example YAML configuration files for various crawling jobs.
To crawl and index a source, you run a crawl "job", which is controlled by several parameters that you define in a YAML configuration file. You can see example configuration files in the config/ directory.
Each configuration YAML file includes a set of standard variables, for example:
vectara:
  # the corpus ID for indexing
  corpus_id: 4
  # the Vectara customer ID
  customer_id: 1234567
  # flag: should vectara-ingest reindex if document already exists (optional)
  reindex: false
  # timeout (optional); sets the URL crawling timeout in seconds
  timeout: 60
crawling:
  # type of crawler; valid options are website, docusaurus, notion, jira, rss, mediawiki, discourse, github and others (this continues to evolve as new crawler types are added)
  crawler_type: XXX
Following that, where needed, the same YAML configuration file includes a crawler-specific section with crawler-specific parameters (see about crawlers):
XXX_crawler:
  # specific parameters for the crawler XXX
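To make the naming convention concrete, here is a small sketch of how a loaded configuration could be checked and its crawler-specific section located. It works on a plain dictionary of the shape a YAML loader would return; the function name crawler_section and the choice of required keys are assumptions made for this example, not the project's own validation logic.

```python
def crawler_section(cfg: dict) -> dict:
    """Return the crawler-specific section named by crawling.crawler_type."""
    for key in ("corpus_id", "customer_id"):
        if key not in cfg.get("vectara", {}):
            raise KeyError(f"missing vectara.{key}")
    crawler_type = cfg["crawling"]["crawler_type"]
    section_name = f"{crawler_type}_crawler"
    # Crawler-specific parameters are only present "where needed", so default to {}.
    return cfg.get(section_name, {})


# The dict below mirrors what a YAML loader would return for a small config.
config = {
    "vectara": {"corpus_id": 4, "customer_id": 1234567, "reindex": False, "timeout": 60},
    "crawling": {"crawler_type": "rss"},
    "rss_crawler": {"source": "pg", "days_past": 365},
}
print(crawler_section(config))
```

With crawler_type set to rss, the lookup resolves to the rss_crawler section, which matches the XXX_crawler pattern shown above.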
We use a secrets.toml file to hold secret keys and parameters. You need to create this file in the root directory before running a crawl job. This file can hold multiple "profiles", with specific secrets for each of these profiles. For example:
[profile1]
api_key="<VECTARA-API-KEY-1>"
[profile2]
api_key="<VECTARA-API-KEY-2>"
[profile3]
api_key="<VECTARA-API-KEY-3>"
NOTION_API_KEY="<YOUR-NOTION-API-KEY>"
This allows easy secrets management when you have multiple crawl jobs that may not share the same secrets, for example when you have a different Vectara API key for indexing different corpora.
Many of the crawlers have their own secrets, for example Notion, Discourse, Jira, or GitHub. These are also kept in the secrets file in the appropriate section and need to be all upper case (e.g. NOTION_API_KEY or JIRA_PASSWORD).
The Indexer class provides useful functionality to index documents into Vectara.
index_url() is probably the most useful method. It takes a URL as input and extracts the content from that URL (using the playwright and Goose3 libraries), then sends that content to Vectara using the standard indexing API. If the URL points to a PDF document, special care is taken to ensure proper processing.
Note that we use Goose3 to extract the main (most important) content of the article, ignoring links, ads, and other unimportant content. If your crawled content has different requirements, you can change the code to use a different extraction mechanism (HTML to text).
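If you do swap in a different HTML-to-text mechanism, a very simple alternative could look like the sketch below. It uses only Python's standard-library html.parser and is a stand-in for illustration, not the Goose3-based extraction the project uses; the class and function names are hypothetical, and the set of skipped tags is an arbitrary choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style>, and <nav> content."""

    SKIPPED = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Maker's schedule.</p><script>x=1</script></body></html>"
)
print(html_to_text(sample))
```

Unlike an article extractor, this approach keeps all visible text outside the skipped tags, so it suits pages where every paragraph matters rather than pages dominated by ads or navigation.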
Use index_file() when you have a file that you want to index using Vectara's file_upload API, so that it takes care of format identification, segmentation of text, and indexing.
Use index_document() and related methods when you build the document JSON structure directly and want to index this document in the Vectara corpus.
Specifically, the reindex parameter determines whether an existing document should be reindexed or not. If reindexing is required, the code handles it automatically: it calls delete_doc() to first remove the document from the corpus and then sends the document to the corpus index.
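The reindex behavior can be sketched with an in-memory stand-in for a corpus. Everything here is hypothetical except the delete_doc() name and the delete-then-index order described above: the InMemoryCorpus class, the documentId field, and the boolean return value are assumptions made for the example, and no Vectara API is called.

```python
class InMemoryCorpus:
    """Hypothetical stand-in for a Vectara corpus, keyed by document ID."""

    def __init__(self) -> None:
        self.docs = {}

    def index_document(self, doc: dict, reindex: bool = False) -> bool:
        """Store doc; return True if it was (re)indexed, False if skipped."""
        doc_id = doc["documentId"]  # field name assumed for illustration
        if doc_id in self.docs:
            if not reindex:
                return False  # document already exists and reindex is off
            self.delete_doc(doc_id)  # remove the old copy first
        self.docs[doc_id] = doc
        return True

    def delete_doc(self, doc_id: str) -> None:
        del self.docs[doc_id]


corpus = InMemoryCorpus()
first = {"documentId": "essay-1", "text": "draft"}
second = {"documentId": "essay-1", "text": "final"}
print(corpus.index_document(first))
print(corpus.index_document(second))
print(corpus.index_document(second, reindex=True))
print(corpus.docs["essay-1"]["text"])
```

With reindex left off, repeated crawls skip documents that already exist; turning it on trades extra delete calls for fresh content.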
The project is designed to be used within a Docker container, so that a crawl job can be run anywhere - on a local machine or on any cloud machine. See the Dockerfile for more information on the Docker file structure and build.
To run vectara-ingest locally, perform the following steps:
- Make sure you have Docker installed on your machine, and that there is enough memory and storage to build the Docker image.
- Clone this repo locally with git clone https://github.com/vectara/vectara-ingest.git.
- Enter the directory with cd vectara-ingest.
- Choose the configuration file for your project and run bash run.sh config/<config-file>.yaml <profile>. This command creates the Docker container locally, configures it with the parameters specified in your configuration file (with secrets taken from the appropriate <profile> in secrets.toml), and starts up the Docker container.
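If you launch crawl jobs from a script rather than by hand, the final step above could be wrapped roughly as follows. This is only a sketch: build_run_command is a hypothetical helper, and the checks it performs are assumptions about what is worth validating before invoking run.sh.

```python
from pathlib import Path


def build_run_command(config_file: str, profile: str) -> list[str]:
    """Return the argument list for `bash run.sh <config> <profile>`."""
    if Path(config_file).suffix not in {".yaml", ".yml"}:
        raise ValueError(f"expected a YAML config file, got {config_file!r}")
    if not profile:
        raise ValueError("profile name must not be empty")
    return ["bash", "run.sh", config_file, profile]


command = build_run_command("config/pg-rss.yaml", "default")
print(command)
# To launch the job from the repository root (requires Docker to be running):
# import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting issues if a config path ever contains spaces.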
If you want to run vectara-ingest on Render, follow these steps:
- Sign Up/Log In: If you don't have a Render account, you'll need to create one. If you already have one, just log in.
- Create New Service: Once you're logged in, click the "New" button, usually found on the dashboard, and select "Background Worker".
- Choose "Deploy an existing image from a registry" and click "Next".
- Specify Docker Image: In the "Image URL" field, fill in "vectara/vectara-ingest" and click "Next".
- Choose a name for your deployment (e.g. "vectara-ingest") and, if needed, pick a region or leave the default. Then pick your instance type.
- Click "Create Web Service".
- Click "Environment", then "Add Secret File": name the file config.yaml, and copy in the contents of the config.yaml for your crawler.
- Assuming you have a secrets.toml file with multiple profiles and you want to use the secrets for the profile <my-profile>, click "Environment", then "Add Secret File": name the file secrets.toml, and copy only the contents of <my-profile> from the secrets.toml to this file (including the profile name).
- Click "Settings", go to "Docker Command", and click "Edit", then put in the following command:
/bin/bash -c "mkdir /home/vectara/env && cp /etc/secrets/config.yaml /home/vectara/env/ && cp /etc/secrets/secrets.toml /home/vectara/env/ && python3 ingest.py /home/vectara/env/config.yaml <my-profile>"
Then click "Save Changes", and your application should now be deployed.
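The Docker command is easy to mistype because the whole script after -c must reach bash as a single quoted argument. The sketch below assembles the same command for a given profile using Python's shlex module; render_docker_command is a hypothetical helper, and how Render itself parses the command field is not something this example verifies.

```python
import shlex


def render_docker_command(profile: str) -> str:
    """Build the single-line Docker Command for a given secrets profile."""
    env_dir = "/home/vectara/env"
    script = " && ".join(
        [
            f"mkdir {env_dir}",
            f"cp /etc/secrets/config.yaml {env_dir}/",
            f"cp /etc/secrets/secrets.toml {env_dir}/",
            f"python3 ingest.py {env_dir}/config.yaml {shlex.quote(profile)}",
        ]
    )
    # shlex.join quotes the script so that bash -c receives it as ONE argument.
    return shlex.join(["/bin/bash", "-c", script])


command = render_docker_command("profile1")
print(command)
```

shlex wraps the script in single quotes rather than double quotes; for this command, which contains no shell variables, the two are equivalent.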
Note:
- Hosting in this way does not support the CSV or folder crawlers.
- Where vectara-ingest uses playwright to crawl content (e.g. website crawler or docs crawler), the Render instance may require more RAM to work properly with a headless browser.
vectara-ingest can be easily deployed on any cloud platform such as AWS, Azure, or GCP. You simply create a cloud VM and follow the local-deployment instructions after you SSH into that machine.
The vectara-ingest container is available for easy deployment via Docker Hub.
👤 Vectara
- Website: vectara.com
- Twitter: @vectara
- GitHub: @vectara
- LinkedIn: @vectara
- Discord: @vectara
Contributions, issues and feature requests are welcome and appreciated!
Feel free to check the issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
Copyright © 2023 Vectara.
This project is Apache 2.0 licensed.