Assess options to host CLIP retrieval backend #182

Open
PhilippeMoussalli opened this issue Jun 5, 2023 · 2 comments

PhilippeMoussalli commented Jun 5, 2023

Problem Statement

Publicly hosted CLIP retrieval is useful but limited:

  • The service can go down and is not always available
  • Retrieval is limited to 1000 URLs per query, which can be quite restrictive
  • It is slower than a self-hosted version

For that reason, we need a way to host the service externally and make it available for running our pipelines. The main aspects that need to be taken into account are:

  • Cost
  • Generalizability of solution across different clouds
  • Whether the solution will be made publicly available or only for private usage
  • Ease of deployment / Speed of initialization
  • Testability

Proposed Approach

A) GKE node (similar to CF)
This approach relies on making the LAION-5B dataset (~2TB) available on each target cloud, in a bucket within a specified region. Making it available requires first spinning up a VM and then transferring the dataset to the cloud (by streaming or uploading).

  • Cost

    • Relatively low cost, since storage is not that expensive and egress costs are low (provided the node is in the same region as the bucket where the data is stored). Approx. 40€/month (source).
  • Generalizability of solution across different clouds

    • Generalizes well
  • Whether the solution will be made publicly available or only for private usage

    • Private usage
  • Ease of deployment / Speed of initialization

    • Requires adding additional helper functions to download the data locally
    • Node pools with NVMe disks may not be available in all regions
    • Slow initialization (approx 3 hours to download the whole dataset)
  • Testability

    • Difficult due to slow initialization
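As a rough sanity check on the initialization time above, the quoted figures (~2TB, approx 3 hours) imply a certain sustained download rate. The sketch below only does that arithmetic. It assumes decimal units (1 TB = 10^12 bytes), which is not stated in this issue, and the function name is made up for illustration.

```python
# Back-of-the-envelope: sustained throughput implied by the figures above.
# Assumption: decimal units (1 TB = 10**12 bytes); the issue does not specify.

DATASET_BYTES = 2 * 10**12      # ~2TB dataset size quoted above
DOWNLOAD_SECONDS = 3 * 3600     # approx 3 hours quoted above


def required_throughput_mb_s(size_bytes: int, seconds: int) -> float:
    """Return the average throughput in MB/s needed to move size_bytes in seconds."""
    return size_bytes / seconds / 10**6


mb_per_s = required_throughput_mb_s(DATASET_BYTES, DOWNLOAD_SECONDS)
gbit_per_s = mb_per_s * 8 / 1000
print(f"{mb_per_s:.0f} MB/s (~{gbit_per_s:.1f} Gbit/s)")  # 185 MB/s (~1.5 Gbit/s)
```

If the node or disk cannot sustain roughly this rate, initialization would take longer than the 3 hours estimated above.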

B) On an external VM
Relies on setting up a VM and exposing the retrieval service through an API. Initialization scripts can be set up to download the dataset and set up the CLIP service.

  • Cost

      1. Relatively high: 440€/month if no snapshots are used. Costs are incurred even when the VM is stopped, because the SSD is still billed.
      2. Use snapshots. Costs of the time the service is up and running (a few hours/days) + egress costs for creation and restoration (~80€/month for one snapshot creation + restoration per month) + snapshot storage costs (~100€/month). Link 1, Link 2.
        Another alternative is to set up a VM and download the dataset from cloud storage instead of a snapshot, to save storage and egress costs. This can be automated with an initialization script but relies on the user already having the dataset in the cloud.
  • Generalizability of solution across different clouds

    • Snapshots/VM setup may vary from one cloud to the other.
  • Whether the solution will be made publicly available or only for private usage

    • Depends on the option: option 1 can be made publicly available, whereas option 2 is more suited to private usage.
  • Ease of deployment / Speed of initialization

    • Option 1 -> set it up once
    • Option 2 -> manual setup, takes some time every time you want to run the pipeline
  • Testability

    • Easier to test compared with approach A
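To make the cost trade-off in approach B concrete, here is a small sketch comparing the always-on VM with the snapshot-based setup, using only the monthly figures quoted above. The cost of the hours the VM actually runs is not given in this issue, so it is left as a parameter, and the 30€ used in the example is purely hypothetical.

```python
# Rough monthly cost comparison for approach B, using the figures quoted above.
ALWAYS_ON_EUR = 440          # VM kept around with its SSD, no snapshots
SNAPSHOT_EGRESS_EUR = 80     # one snapshot creation + restoration per month
SNAPSHOT_STORAGE_EUR = 100   # snapshot storage


def snapshot_option_cost(runtime_eur: float) -> float:
    """Monthly cost of the snapshot setup, given the cost of the hours the VM runs."""
    return runtime_eur + SNAPSHOT_EGRESS_EUR + SNAPSHOT_STORAGE_EUR


# The snapshot setup is cheaper while the runtime cost stays below this budget.
break_even_runtime_eur = ALWAYS_ON_EUR - SNAPSHOT_EGRESS_EUR - SNAPSHOT_STORAGE_EUR

# Hypothetical example: 30 EUR of VM time in a month (not a figure from this issue).
example_cost = snapshot_option_cost(30)
print(break_even_runtime_eur, example_cost)  # 260 210
```

Under these numbers the snapshot setup wins as long as the VM time costs less than about 260€/month, which fits the "few hours/days" usage pattern described above.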

Implementation Steps/Tasks

Will vary depending on the chosen approach/option specified above.

Potential Impact

None. If we go for approach A (GKE node), we will have to make sure that the user has a configured VM with the required storage and disk.

Testing

Component testing. This is difficult if the service is not always available, so the response can be mocked.
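A minimal sketch of what mocking the response could look like, so the test never touches the real backend. Everything here is an assumption for illustration: `retrieve_urls`, the `client.query(...)` call and the response shape are hypothetical stand-ins, not the actual component code or the backend's confirmed API.

```python
from unittest.mock import Mock


def retrieve_urls(client, prompt: str, num_images: int) -> list[str]:
    """Hypothetical component helper: ask the retrieval backend for image URLs."""
    results = client.query(text=prompt, num_images=num_images)
    # Keep only entries that actually carry a URL.
    return [item["url"] for item in results if "url" in item]


def test_retrieve_urls_with_mocked_backend():
    # Stand-in for the real backend client, so the test never hits the network.
    client = Mock()
    client.query.return_value = [
        {"url": "https://example.com/a.jpg", "similarity": 0.31},
        {"id": 42},  # entry without a URL should be skipped
        {"url": "https://example.com/b.jpg", "similarity": 0.29},
    ]

    urls = retrieve_urls(client, prompt="a photo of a cat", num_images=3)

    assert urls == ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    client.query.assert_called_once_with(text="a photo of a cat", num_images=3)


test_retrieve_urls_with_mocked_backend()
```

Injecting the client keeps the component logic testable regardless of which hosting approach is chosen.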

Documentation

Add relevant documentation to clip retrieval component.

Feedback and Suggestions

Dependent features

None

Additional Notes

Links

@PhilippeMoussalli PhilippeMoussalli converted this from a draft issue Jun 5, 2023
@PhilippeMoussalli PhilippeMoussalli self-assigned this Jun 5, 2023
@PhilippeMoussalli PhilippeMoussalli added the Components Implementation of components label Jun 5, 2023
@RobbeSneyders RobbeSneyders moved this from Breakdown to Validation in Fondant development Jun 29, 2023
@RobbeSneyders
Member

This is being addressed in a separate repo: https://github.com/ml6team/laion5b-deployment/pull/1

@rom1504

rom1504 commented Oct 6, 2023

Feel free to ask if you have any questions around clip retrieval. Agreed, it would be helpful to provide a self-hosted version in fondant.
Maybe in addition to the big index, providing a demo with a smaller index would help mitigate some of the issues mentioned here
