🛠️ #46 Initialized the least-latency routing #70
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           develop      #70      +/- ##
==========================================
+ Coverage    63.87%   71.74%   +7.86%
==========================================
  Files           27       30       +3
  Lines         1182     1313     +131
==========================================
+ Hits           755      942     +187
+ Misses         381      317      -64
- Partials        46       54       +8
```

☔ View full report in Codecov by Sentry.
# Conflicts:
#	pkg/providers/provider.go
The diff under review (excerpts):

```go
s.expireAt = time.Now().Add(*s.model.LatencyUpdateInterval())
}
```

```go
// LeastLatencyRouting routes requests to the model that responds the fastest
```
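For readers skimming the thread, here is a minimal sketch of what a least-latency strategy boils down to. This is Go, but all names except `expireAt` and `LatencyUpdateInterval` (which appear in the diff above) are invented for illustration and are not taken from the PR:

```go
package routing

import (
	"sync"
	"time"
)

// modelLatency holds one model's latest latency sample and the time at
// which that sample should be considered stale and re-measured.
type modelLatency struct {
	mu       sync.RWMutex
	latency  time.Duration
	expireAt time.Time
}

// LeastLatencyRouting picks the model with the lowest recorded latency.
// For simplicity, the models map is assumed to be built once at startup;
// only its entries are mutated afterwards.
type LeastLatencyRouting struct {
	models map[string]*modelLatency
}

// Pick returns the name of the model that currently responds the fastest.
func (r *LeastLatencyRouting) Pick() string {
	var best string
	lowest := time.Duration(1<<62)
	for name, m := range r.models {
		m.mu.RLock()
		l := m.latency
		m.mu.RUnlock()
		if l < lowest {
			best, lowest = name, l
		}
	}
	return best
}

// Observe records a fresh latency sample and schedules its expiry,
// mirroring the expireAt bookkeeping visible in the diff above.
func (r *LeastLatencyRouting) Observe(model string, sample, interval time.Duration) {
	m, ok := r.models[model]
	if !ok {
		return // unknown model; real code would handle registration
	}
	m.mu.Lock()
	m.latency = sample
	m.expireAt = time.Now().Add(interval)
	m.mu.Unlock()
}
```

The interesting design question, as the comments below discuss, is what "latency" should actually mean for a streaming LLM response.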
I wonder how we can normalize this by the token count of the response, because token generation seems to be the real bottleneck when responses take a long time. It might be as simple as counting the generated tokens in the response and dividing by the response time.
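A rough sketch of that normalization (where the token count comes from is an assumption; with OpenAI it would be the response's `usage.completion_tokens` field):

```go
import "time"

// tokensPerSecond divides the number of generated tokens by the response
// time, yielding a generation velocity that is comparable across prompts
// and responses of different lengths.
func tokensPerSecond(completionTokens int, elapsed time.Duration) float64 {
	if completionTokens <= 0 || elapsed <= 0 {
		return 0 // guard against empty or failed responses
	}
	return float64(completionTokens) / elapsed.Seconds()
}
```

Routing would then prefer the model with the highest velocity rather than the lowest wall-clock latency.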
There are basically two options I see:
- use the time-to-first-byte metric (this needs client instrumentation to capture; see the sketch below)
- or use the approach you described, in which case we are essentially calculating the token generation velocity of each model.

I need to play with the OpenAI API, for example, to see which approach makes sense here.
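For what it's worth, measuring time to first byte needs very little instrumentation in Go; here is a sketch using the standard `net/http/httptrace` hooks (the URL is just a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

// measureTTFB issues a request and records the time between sending it
// and receiving the first byte of the response.
func measureTTFB(url string) (time.Duration, error) {
	var start time.Time
	var ttfb time.Duration

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}

	// GotFirstResponseByte fires as soon as the first response byte arrives.
	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() {
			ttfb = time.Since(start)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start = time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	return ttfb, nil
}

func main() {
	d, err := measureTTFB("https://api.openai.com/v1/models")
	if err != nil {
		panic(err)
	}
	fmt.Println("time to first byte:", d)
}
```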
Filed a ticket so we don't forget about this: #78
Adding a new routing strategy that picks the least-latency model. Also adding simple test coverage for some of the config-building logic.
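On the config-building side, such logic usually amounts to parsing the config and validating the strategy name. A loose illustration, assuming a YAML config and `gopkg.in/yaml.v3`; none of these names are the project's actual API, and the set of valid strategies here is invented:

```go
package config

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// RoutingConfig is an illustrative config shape, not the PR's actual type.
type RoutingConfig struct {
	Strategy string `yaml:"routing"`
}

// BuildRoutingConfig parses the raw config and rejects unknown strategies,
// which is the kind of logic the new tests would cover.
func BuildRoutingConfig(raw []byte) (*RoutingConfig, error) {
	var cfg RoutingConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parsing routing config: %w", err)
	}
	switch cfg.Strategy {
	case "round_robin", "least_latency":
		return &cfg, nil
	default:
		return nil, fmt.Errorf("unknown routing strategy %q", cfg.Strategy)
	}
}
```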