Problem Statement
Publicly hosted CLIP retrieval is useful but limited:
The service can go down and is not always available
Retrieval is limited to 1000 URLs per query, which can be quite restrictive
It is slower than a self-hosted version
For that reason, we need a way to host the service externally and make it available for running our pipelines. The main aspects that need to be taken into account are:
Cost
Generalizability of solution across different clouds
Whether the solution will be made publicly available or only for private usage
Ease of deployment/ Speed of initialization
Testability
Proposed Approach
A) GKE node (similar to CF)
This approach relies on making the LAION-5B dataset (~2TB) available in cloud storage, in a specified region of whichever cloud is used. Making it available requires first spinning up a VM and then transferring the dataset to cloud storage (streaming or uploading).
Cost
Relatively low cost, since storage is not that expensive and egress costs are low (provided the node is in the same region as the bucket where the data is stored). Approx. 40€/month (source).
Generalizability of solution across different clouds
Generalizes well
Whether the solution will be made publicly available or only for private usage
Private usage
Ease of deployment/ Speed of initialization
Requires additional helper functions to download the data locally
Node pools with NVMe disks may not be available in all regions
Slow initialization (approx. 3 hours to download the whole dataset)
Testability
Difficult due to slow initialization
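As a rough sanity check on the initialization time, the figures above (about 2 TB in about 3 hours) imply a sustained download rate. The following is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; neither assumption comes from the proposal, and the function names are illustrative only.

```python
def required_throughput_mb_s(dataset_tb: float, hours: float) -> float:
    """Sustained rate in MB/s needed to move `dataset_tb` terabytes in `hours`."""
    total_mb = dataset_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB (assumption)
    return total_mb / (hours * 3600)


def download_hours(dataset_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to move `dataset_tb` terabytes at a constant `throughput_mb_s`."""
    total_mb = dataset_tb * 1_000_000
    return total_mb / throughput_mb_s / 3600


rate = required_throughput_mb_s(2.0, 3.0)
print(f"{rate:.0f} MB/s")  # prints "185 MB/s"
```

If the node's disk or network cannot sustain roughly that rate, the 3 hour estimate grows proportionally, which makes the testability concern worse.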
B) On an external VM
Relies on setting up a VM and exposing the retrieval service through an API. Initialization scripts can be set up to download the dataset and set up the CLIP service.
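To illustrate what calling such an API from a pipeline component could look like, here is a minimal sketch using only the standard library. The address, endpoint path, and JSON field names (`text`, `modality`, `num_images`) are assumptions made for illustration and would need to be checked against the service that is actually deployed.

```python
import json
import urllib.request


def build_query_request(service_url: str, prompt: str, num_images: int = 100) -> urllib.request.Request:
    """Build a POST request for a self-hosted retrieval service.

    The payload fields are hypothetical; adapt them to the deployed service.
    """
    payload = {"text": prompt, "modality": "image", "num_images": num_images}
    return urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address (documentation IP range), not a real deployment.
request = build_query_request("http://203.0.113.10:1234/knn-service", "a photo of a cat")
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

Building the request separately from sending it keeps the network call in one place, which makes it easier to swap the public endpoint for the VM address.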
Cost
Option 1: no snapshots. Relatively high cost (~440€/month). Costs are incurred even if the VM is stopped, because the SSD is still billed.
Option 2: use snapshots. The cost is the time the service is up and running (a few hours or days), plus egress costs for snapshot creation and restoration (~80€/month for one creation and one restoration per month), plus snapshot storage costs (~100€/month). Link 1, Link 2.
Another alternative is to set up a VM and download the dataset from cloud storage instead of restoring a snapshot, which saves snapshot storage and egress costs. This can be automated with an initialization script but relies on the user already having the dataset in cloud storage.
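The trade-off between the always-on VM and the snapshot variant can be sketched with the figures quoted above. This is only a rough model: it treats 440€ as the full monthly cost of the always-on VM, assumes the ~80€ egress cost scales linearly with the number of create and restore cycles, and takes the hourly VM rate as a hypothetical parameter, since the proposal does not give one.

```python
ALWAYS_ON_EUR = 440.0         # monthly cost without snapshots (figure from the estimate above)
SNAPSHOT_STORAGE_EUR = 100.0  # monthly snapshot storage (figure from the estimate above)
SNAPSHOT_CYCLE_EUR = 80.0     # egress for one create + restore cycle (figure from the estimate above)


def snapshot_monthly_cost(cycles: int, hours_up: float, hourly_rate_eur: float) -> float:
    """Estimated monthly cost of the snapshot variant.

    `hourly_rate_eur` is a hypothetical VM + disk price while the service is up.
    """
    return SNAPSHOT_STORAGE_EUR + cycles * SNAPSHOT_CYCLE_EUR + hours_up * hourly_rate_eur


def cheaper_variant(cycles: int, hours_up: float, hourly_rate_eur: float) -> str:
    """Return which variant is cheaper for the given usage pattern."""
    cost = snapshot_monthly_cost(cycles, hours_up, hourly_rate_eur)
    return "snapshot" if cost < ALWAYS_ON_EUR else "always-on"


# One restore per month, service up for 48 hours at an assumed 1 EUR/hour.
print(snapshot_monthly_cost(1, 48, 1.0))  # prints 228.0
print(cheaper_variant(4, 48, 1.0))        # prints always-on
```

Under these assumptions the snapshot variant only pays off when the service is restored a few times per month at most; frequent restores quickly exceed the always-on cost.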
Generalizability of solution across different clouds
Snapshots/VM setup may vary from one cloud to the other.
Whether the solution will be made publicly available or only for private usage
Depends on the option: option 1 can be made publicly available, whereas option 2 is better suited to private usage.
Ease of deployment/ Speed of initialization
Option 1 -> set it up once
Option 2 -> manual setup, which takes some time every time you want to run the pipeline
Testability
Easier to test compared with approach A
Implementation Steps/Tasks
Will vary depending on the chosen approach/option specified above.
Potential Impact
None. If we go for approach A (GKE node), we will have to make sure that the user has a configured VM with the required storage and disk.
Testing
Component testing. The component is difficult to test if the service is not always available, but the response can be mocked.
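Mocking the response could look roughly like the sketch below, which uses `unittest.mock` from the standard library. Both `retrieve_urls` and the shape of the client (a `query(text=...)` method returning a list of dicts with a `url` key) are hypothetical stand-ins for the real component logic.

```python
from unittest import mock


def retrieve_urls(client, prompt: str, limit: int = 1000) -> list:
    """Hypothetical component logic: query the service and keep the result URLs."""
    results = client.query(text=prompt)
    return [result["url"] for result in results[:limit]]


def test_retrieve_urls_with_mocked_service():
    # The mock replaces the retrieval service, so no network access is needed.
    fake_client = mock.Mock()
    fake_client.query.return_value = [
        {"url": "https://example.com/a.jpg", "similarity": 0.9},
        {"url": "https://example.com/b.jpg", "similarity": 0.8},
    ]

    urls = retrieve_urls(fake_client, "a photo of a cat", limit=1)

    assert urls == ["https://example.com/a.jpg"]
    fake_client.query.assert_called_once_with(text="a photo of a cat")


test_retrieve_urls_with_mocked_service()
```

Passing the client in as an argument is what makes the mock easy to inject; the same test then runs regardless of whether the public or self-hosted service is up.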
Documentation
Add relevant documentation to clip retrieval component.
Feel free to ask if you have any questions about clip retrieval. Agreed, it would be helpful to provide a self-hosted version in Fondant.
Maybe in addition to the big index, providing a demo with a smaller index would help mitigate some of the issues mentioned here.
Feedback and Suggestions
Dependent features
None
Additional Notes
Links