Improve error messages and documentation (#6)
* Improve error messages and documentation
* Fix issue with pagination when listing files in the OpenAI Assistant
jirispilka authored Jul 2, 2024
1 parent 59a5603 commit d3ff061
Showing 18 changed files with 3,228 additions and 1,918 deletions.
26 changes: 13 additions & 13 deletions .actor/input_schema.json
@@ -31,42 +31,42 @@
"prefill": ["url", "text", "metadata.title"],
"editor": "json"
},
"fileIdsToDelete": {
"title": "Array of vector store file ids to delete",
"type": "array",
"description": "Delete specified file ids associated with vector store. This can be useful when one needs to delete files that are no longer needed.",
"editor": "json"
},
"filePrefix": {
"title": "Delete/Create vector store files with a prefix",
"type": "string",
"description": "Using a file prefix streamlines the management of vector store file updates by eliminating the need to track each file's ID. For instance, if you set the filePrefix to 'apify-advisor', the Actor will initially locate all files in the vector store with this prefix. Subsequently, it will delete these files and create new ones, also prefixed accordingly.",
"description": "Using a file prefix helps with the management of vector store file updates by eliminating the need to track each file's ID. For instance, if you set the filePrefix to 'apify-advisor', the Actor will initially locate all files in the vector store with this prefix. Subsequently, it will delete these files and create new ones, also prefixed accordingly.",
"editor": "textfield",
"minLength": 5
},
"fileIdsToDelete": {
"title": "Array of vector store file ids to delete",
"type": "array",
"description": "Delete specified file ids associated with vector store. This can be useful when one needs to delete files that are no longer needed.",
"editor": "json"
},
"saveCrawledFiles": {
"title": "Save crawled files (docs, pdf, pptx) to OpenAI File Store",
"type": "boolean",
"description": "Enables saving files from Apify's key-value store to OpenAI's file store. Useful when utilizing Apify’s website content crawler with the 'saveFiles' option, allowing the found files to be directly store and used in the assistant.",
"description": "Save files from Apify's key-value store to OpenAI's file store. Useful when utilizing Apify’s website content crawler with the 'saveFiles' option, allowing the found files to be directly store and used in the assistant.",
"default": true
},
"datasetId": {
"title": "Dataset ID",
"title": "Apify's Dataset ID",
"type": "string",
"description": "The Dataset ID is provided automatically when the actor is set up as an integration. You can fill it in explicitly here to enable debugging of the actor",
"editor": "textfield",
"sectionCaption": "Debugging options"
},
"keyValueStoreId": {
"title": "Key-value store ID",
"title": "Apify's Key-value store ID (source for json, pdf, pptx files) ",
"type": "string",
"description": "Apify's key value store ID is provided automatically when the actor is set up as an integration. You can fill it in explicitly here to enable debugging of the actor",
"description": "This is the ID for the Key-value store on Apify, which serves as the data source for json, pdf, and pptx files. This ID is automatically provided when the actor is integrated. However, you can manually enter the ID here for debugging purposes.",
"editor": "textfield"
},
"saveInApifyKeyValueStore": {
"title": "Save all files in the Apify key-value store",
"title": "Save all created files in the Apify's key-value store",
"type": "boolean",
"description": "Save all created files in the Apify Key-Value Store to easily check and retrieve all files (this is typically used when debugging)",
"description": "Save all created files in the Apify's Key-Value Store to easily check and retrieve all files (this is typically used when debugging)",
"default": false
}
},
8 changes: 6 additions & 2 deletions CHANGELOG.md
@@ -1,9 +1,13 @@
# Change Log

## 0.2 (2024-05-09)
## 0.2.1 (2024-07-02)

- Fix issue with pagination when listing files in the OpenAI Assistant.

## 0.2.0 (2024-05-09)

- Added support to upload files to the OpenAI Assistant. The files are retrieved from the Apify's key-value store.

## 0.1 (2024-04-19)
## 0.1.0 (2024-04-19)

- Initial release of OpenAI vector store integration
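
The 0.2.1 entry concerns paginated file listings. The code change itself is not visible in this view, so the following is only a minimal sketch of cursor-based pagination, assuming a list endpoint whose pages carry `data`, `has_more`, and `last_id` fields. `list_all_files`, `fetch_page`, and the fake fetcher are hypothetical names used for illustration, not the Actor's code.

```python
from typing import Callable, Iterator, Optional


def list_all_files(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item from a cursor-paginated listing.

    `fetch_page(after)` is a hypothetical callable returning one page as a dict
    with `data` (list of items), `has_more` (bool) and `last_id` (cursor).
    """
    after: Optional[str] = None
    while True:
        page = fetch_page(after)
        yield from page["data"]
        # Stop once the server reports that no further pages exist.
        if not page.get("has_more"):
            break
        after = page["last_id"]


# Fake three-item store served two items per page, standing in for a real API.
_FILES = [{"id": "file-1"}, {"id": "file-2"}, {"id": "file-3"}]


def fake_fetch_page(after: Optional[str], limit: int = 2) -> dict:
    ids = [f["id"] for f in _FILES]
    start = ids.index(after) + 1 if after else 0
    chunk = _FILES[start:start + limit]
    return {
        "data": chunk,
        "has_more": start + limit < len(_FILES),
        "last_id": chunk[-1]["id"] if chunk else None,
    }


all_ids = [f["id"] for f in list_all_files(fake_fetch_page)]
```

Reading only the first page would silently miss `file-3` here, which is the kind of failure a pagination fix addresses.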
50 changes: 45 additions & 5 deletions README.md
@@ -31,7 +31,7 @@ Additional costs are associated with the use of OpenAI Assistant. Please refer t

To utilize this integration, ensure you have:

- An OpenAI account and an `OpenAI API token`. Create a free account at [OpenAI](https://beta.openai.com/).
- An OpenAI account and an `OpenAI API KEY`. Create a free account at [OpenAI](https://beta.openai.com/).
- Created an [OpenAI Vector Store](https://platform.openai.com/docs/assistants/tools/file-search/vector-stores). You will need `vectorStoreId` to run this integration.
- Created an [OpenAI Assistant](https://platform.openai.com/docs/assistants/overview).

@@ -45,8 +45,8 @@ Refer to [input schema](.actor/input_schema.json) for details.
size limit of 5,000,000 tokens (as of 2024-04-23). When necessary, the model associated with the assistant is
utilized to count tokens and split the large file into smaller, manageable segments.
- `datasetFields` - Array of datasetFields you want to save, e.g., `["url", "text", "metadata.title"]`.
- `fileIdsToDelete` - Delete specified file IDs from vector store as needed.
- `filePrefix` - Delete and create files using a filePrefix, streamlining vector store updates.
- `fileIdsToDelete` - Delete specified file IDs from vector store as needed.
- `datasetId`: _[Debug]_ Dataset ID (when running Actor as standalone without integration).
- `keyValueStoreId`: _[Debug]_ Key Value Store ID (when running Actor as standalone without integration).
- `saveInApifyKeyValueStore`: _[Debug]_ Save all created files in the Apify Key-Value Store to easily check and retrieve all files (this is typically used when debugging)
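
`datasetFields` accepts dotted names such as `metadata.title`. How the Actor resolves them is not shown in this diff. The sketch below assumes that a dotted name addresses a nested key and that fields missing from an item are skipped. `select_fields` and the sample item are hypothetical.

```python
from typing import Any, Dict, List


def select_fields(item: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Keep only the requested fields; dotted names walk nested dicts."""
    selected: Dict[str, Any] = {}
    for field in fields:
        value: Any = item
        for key in field.split("."):
            if not isinstance(value, dict) or key not in value:
                break  # field missing in this item: skip it
            value = value[key]
        else:
            # Reached only when every path segment was found.
            selected[field] = value
    return selected


# Made-up dataset item, loosely shaped like a crawler result.
item = {
    "url": "https://platform.openai.com/docs/assistants/overview",
    "text": "Assistants overview - OpenAI API ...",
    "metadata": {"title": "Assistants overview", "languageCode": "en"},
    "crawl": {"depth": 0},
}
record = select_fields(item, ["url", "text", "metadata.title"])
```

Keeping the dotted name as the output key is a choice made for this sketch; the Actor may nest or rename the selected values differently.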
@@ -66,13 +66,53 @@ Our Actors can automatically ingest entire websites, such as customer documentat
forums, blog posts, and other information sources to train or prompt your LLMs.
Integrate Apify into your product and allow your customers to upload their content in minutes.

## Example usage
## Save data from Website Content Crawler to OpenAI Vector Store

To use this integration, you need an OpenAI account and an `OpenAI API KEY`.
Additionally, you need to create an OpenAI Vector Store (vectorStoreId).

The Website Content Crawler can deeply crawl websites and save web page content to Apify's dataset.
It also stores files such as PDFs, PPTXs, and DOCXs.
A typical run crawling `https://platform.openai.com/docs/assistants/overview` includes the following dataset fields (truncated for brevity):

```json
[
{
"url": "https://platform.openai.com/docs/assistants/overview",
"text": "Assistants overview - OpenAI API\nThe Assistants API allows you to build AI assistants within your own applications ..."
},
{
"url": "https://platform.openai.com/docs/assistants/overview/step-1-create-an-assistant",
"text": "Assistants overview - OpenAI API\n An Assistant has instructions and can leverage models, tools, and files to respond to user queries ..."
}
]
```
Once you have the dataset, you can store the data in the OpenAI Vector Store.
Specify which fields you want to save to the OpenAI Vector Store, e.g., `["text", "url"]`.

```json
{
"assistantId": "YOUR-ASSISTANT-ID",
"datasetFields": ["text", "url"],
"openaiApiKey": "YOUR-OPENAI-API-KEY",
"vectorStoreId": "YOUR-VECTOR-STORE-ID"
}
```

### Update existing files in the OpenAI Vector Store

There are two ways to update existing files in the OpenAI Vector Store.
You can either delete all files with a specific prefix or delete specific files by their IDs.
It is more convenient to use the `filePrefix` parameter to delete and create files with the same prefix.
In the first run, the integration will save all the files with the prefix `openai_assistant_`.
In the next run, it will delete all the files with the prefix `openai_assistant_` and create new files.
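
The delete-then-create cycle described above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's implementation: `plan_prefix_update` is a hypothetical helper, and `existing` is a made-up snapshot mapping file IDs to file names.

```python
from typing import Dict, List, Tuple


def plan_prefix_update(
    existing: Dict[str, str], prefix: str, new_names: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (file IDs to delete, file names to create) for one run.

    `existing` maps file ID -> file name, a hypothetical snapshot of the store.
    """
    # Every stored file carrying the prefix is replaced on this run.
    to_delete = [fid for fid, name in existing.items() if name.startswith(prefix)]
    # New files are created with the same prefix so the next run finds them.
    to_create = [n if n.startswith(prefix) else prefix + n for n in new_names]
    return to_delete, to_create


existing = {
    "file-a": "openai_assistant_0.json",
    "file-b": "manual_upload.pdf",
    "file-c": "openai_assistant_1.json",
}
to_delete, to_create = plan_prefix_update(existing, "openai_assistant_", ["0.json"])
```

Files without the prefix, such as `manual_upload.pdf` in this sample, are left untouched, which is why no file IDs need to be tracked between runs.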

The settings for the integration are as follows:
```json
{
"assistantId": "YOUR-ASSISTANT-ID",
"datasetFields": ["text", "url", "metadata.title"],
"filePrefix": "apify_test_",
"datasetFields": ["text", "url"],
"filePrefix": "openai_assistant_",
"openaiApiKey": "YOUR-OPENAI-API-KEY",
"vectorStoreId": "YOUR-VECTOR-STORE-ID"
}
6 changes: 3 additions & 3 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

32 changes: 16 additions & 16 deletions src/input_model.py
@@ -1,6 +1,6 @@
# generated by datamodel-codegen:
# filename: input_schema.json
# timestamp: 2024-05-27T07:28:58+00:00
# timestamp: 2024-07-01T09:35:47+00:00

from __future__ import annotations

@@ -15,7 +15,7 @@ class OpenaiVectorStoreIntegration(BaseModel):
description='Vector Store ID where the data will be stored',
title='Vector Store ID',
)
openaiApiKey: str = Field(..., title='OpenAI API KEY')
openaiApiKey: str = Field(..., description='OpenAI API KEY', title='OpenAI API KEY')
assistantId: Optional[str] = Field(
None,
description='The ID of an OpenAI Assistant. This parameter is required only when a file exceeds the OpenAI size limit of 5,000,000 tokens (as of 2024-04-23).\n\n When necessary, the model associated with the assistant is utilized to count tokens and split the large file into smaller, manageable segments.',
@@ -26,34 +26,34 @@ class OpenaiVectorStoreIntegration(BaseModel):
description='A list of dataset fields which should be selected from the items, only these dataset fields will remain in the resulting record objects.\n\n For example, when using the website content crawler, you might select dataset fields such as `text` and `url`, and `metadata.title` among others, to be included in the vector store file.',
title='A list of dataset fields which should be selected from the dataset',
)
fileIdsToDelete: Optional[List] = Field(
None,
description='Delete specified file ids associated with vector store. This can be useful when one needs to delete files that are no longer needed.',
title='Array of vector store file ids to delete',
)
filePrefix: Optional[str] = Field(
None,
description="Using a file prefix streamlines the management of vector store file updates by eliminating the need to track each file's ID. For instance, if you set the filePrefix to 'apify-advisor', the Actor will initially locate all files in the vector store with this prefix. Subsequently, it will delete these files and create new ones, also prefixed accordingly.",
description="Using a file prefix helps with the management of vector store file updates by eliminating the need to track each file's ID. For instance, if you set the filePrefix to 'apify-advisor', the Actor will initially locate all files in the vector store with this prefix. Subsequently, it will delete these files and create new ones, also prefixed accordingly.",
min_length=5,
title='Delete/Create vector store files with a prefix',
)
saveFiles: Optional[bool] = Field(
fileIdsToDelete: Optional[List] = Field(
None,
description='Delete specified file ids associated with vector store. This can be useful when one needs to delete files that are no longer needed.',
title='Array of vector store file ids to delete',
)
saveCrawledFiles: Optional[bool] = Field(
True,
description="Enables saving files from Apify's key-value store to OpenAI's file store. Useful when utilizing Apify’s website content crawler with the 'saveFiles' option, allowing the found files to be directly store and used in the assistant.",
title='Save files from apify key-value store to OpenAI File Store',
description="Save files from Apify's key-value store to OpenAI's file store. Useful when utilizing Apify’s website content crawler with the 'saveFiles' option, allowing the found files to be directly store and used in the assistant.",
title='Save crawled files (docs, pdf, pptx) to OpenAI File Store',
)
datasetId: Optional[str] = Field(
None,
description='The Dataset ID is provided automatically when the actor is set up as an integration. You can fill it in explicitly here to enable debugging of the actor',
title='Dataset ID',
title="Apify's Dataset ID",
)
keyValueStoreId: Optional[str] = Field(
None,
description="Apify's key value store ID is provided automatically when the actor is set up as an integration. You can fill it in explicitly here to enable debugging of the actor",
title='Key-value store ID',
description='This is the ID for the Key-value store on Apify, which serves as the data source for json, pdf, and pptx files. This ID is automatically provided when the actor is integrated. However, you can manually enter the ID here for debugging purposes.',
title="Apify's Key-value store ID (source for json, pdf, pptx files) ",
)
saveInApifyKeyValueStore: Optional[bool] = Field(
False,
description='Save all created files in the Apify Key-Value Store to easily check and retrieve all files (this is typically used when debugging)',
title='Save all files in the Apify key-value store',
description="Save all created files in the Apify's Key-Value Store to easily check and retrieve all files (this is typically used when debugging)",
title="Save all created files in the Apify's key-value store",
)