Caching between multiple backends so they can cross communicate the cached data to avoid reevaluation #355
Replies: 2 comments
-
chat1.txt
-
I recommend reading about the object lifecycle to understand how to reuse the existing state better.
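As a minimal sketch of that idea: keep the LlamaChatSession (and the context sequence it owns) alive between requests instead of recreating it for every message, so only the new tokens have to be evaluated. The conversationId key, the in-memory Map, and the model path below are illustrative, not taken from the code in this thread:
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: 'path/to/model.gguf' }); // placeholder path
const sessions = new Map(); // conversationId -> LlamaChatSession, kept alive across requests

async function getSession(conversationId) {
    if (!sessions.has(conversationId)) {
        const context = await model.createContext();
        sessions.set(conversationId, new LlamaChatSession({ contextSequence: context.getSequence() }));
    }
    return sessions.get(conversationId);
}

// Follow-up messages for the same conversation reuse the same session,
// so the earlier parts of the prompt are not reevaluated.
async function reply(conversationId, userMessage) {
    const session = await getSession(conversationId);
    return session.prompt(userMessage);
}
The trade-off is that every live context holds VRAM, so in practice idle conversations have to be capped or evicted.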
-
Hey! I am trying to set up multiple node-llama-cpp instances as backends behind a queue server that tracks how many active and busy sessions each backend has. However, it seems the prompt has to be reevaluated for every message sent. I know there is a way to avoid this, I just can't figure out how.
Does anyone know how I could make this more efficient by continuing the conversation rather than reevaluating it each time? The complicated part is that a user is unlikely to be sent back to the same server after they send a message, because the queue picks the first available backend. Am I going to have to change something so that, say, the 6-8 users who are currently active stick to the same machine? It's a huge inefficiency to wait 15-20 seconds before generation starts on a roughly 6,000-character prompt, when I know tools like textgenwebui handle this somehow: only the first message takes the extra 15-20 seconds, and the next ones take only about 5.
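Roughly, what I mean by keeping users on the same machine would be a sticky assignment per conversation on the queue server, something like this sketch (the backend URLs and the conversationId field are placeholders, not my actual setup):
const backends = ['http://127.0.0.1:3001', 'http://127.0.0.1:3002']; // placeholder backend URLs
const assignments = new Map(); // conversationId -> backend URL
const activeCount = new Map(backends.map((url) => [url, 0])); // busy sessions per backend

function pickBackend(conversationId) {
    // Route follow-up messages to the backend that already has this conversation's context
    if (assignments.has(conversationId)) {
        return assignments.get(conversationId);
    }
    // New conversation: fall back to the least-busy backend and remember the choice
    const [leastBusy] = [...activeCount.entries()].sort((a, b) => a[1] - b[1])[0];
    assignments.set(conversationId, leastBusy);
    return leastBusy;
}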
Here is my code below:
import express from 'express';
import { fileURLToPath } from 'url';
import path from 'path';
import { exec } from 'child_process'; // Import exec to run shell commands
import { performance } from 'perf_hooks'; // Import performance for timing
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
import cors from 'cors'; // Import CORS
// Create the Express app
const app = express();
const port = 3000;
// Enable CORS
app.use(cors());
// Middleware to parse JSON requests
app.use(express.json());
// Get the current directory name
const __dirname = path.dirname(fileURLToPath(import.meta.url));
// Load the Llama model
let model;
let modelLoaded = false;
// Adjust these settings as necessary
let MAX_CONTEXT_SIZE = 8192; // Example: Reduce context size
let GPU_LAYERS = 8; // Start with a high number of GPU layers
let MAX_ACTIVE_SESSIONS = 2; // Initial maximum number of concurrent sessions
let activeSessions = 0; // Counter for active sessions
const VRAM_PER_1SESSION = 750; // Example VRAM required per session in MB for 1 layer
// Function to get available VRAM
const getAvailableVRAM = async () => {
    return new Promise((resolve, reject) => {
        // Query total and used VRAM (in MB) and resolve with the free amount
        exec('nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits', (error, stdout) => {
            if (error) {
                return reject(error);
            }
            const [total, used] = stdout.trim().split(',').map((value) => parseInt(value, 10));
            resolve(total - used);
        });
    });
};
// Function to calculate maximum active sessions based on available VRAM
const calculateMaxActiveSessions = async () => {
const usableVRAM = await getAvailableVRAM();
const VRAM_PER_SESSION = VRAM_PER_1SESSION + (32*GPU_LAYERS); //every layer is 32MB more for context
console.log(VRAM_PER_SESSION);
MAX_ACTIVE_SESSIONS = Math.floor(usableVRAM / VRAM_PER_SESSION);
console.log(`Maximum active sessions updated to: ${MAX_ACTIVE_SESSIONS}`);
};
// Load the model with dynamic adjustment for GPU layers
import { InsufficientMemoryError } from 'node-llama-cpp'; // Import the specific error class if available
import { request } from 'http';
const loadModel = async () => {
    try {
        let attemptLayers = GPU_LAYERS; // Start with the current number of layers
        // ... (omitted) load the model, lowering attemptLayers on memory errors, then set modelLoaded = true
    } catch (error) {
        console.error('Error loading model:', error);
    }
};
// Load the model once when the server starts
loadModel().catch(console.error);
// Function to unload the model
const unloadModel = async () => {
if (model && typeof model.dispose === 'function') {
await model.dispose(); // Call dispose method if available
console.log("Model unloaded successfully");
modelLoaded = false; // Set to false once the model is unloaded
}
};
// Array to store the last 10 response generation times
const responseTimes = [];
const maxResponseTimes = 10; // Maximum number of response times to store
let isUnknown = false; // Variable to indicate unknown state
let averageGenerationTime = null;
// Function to log response time
function logResponseTime(responseTime) {
    // ... (omitted) record responseTime, keep only the last maxResponseTimes entries, and update averageGenerationTime
}
// Function to reset the response times and set the unknown variable
function resetResponseTimes() {
responseTimes.length = 0; // Clear the array
isUnknown = true; // Set the unknown state to true
console.log("Response times have been reset. Unknown state is set to true.");
}
app.post('/v1/chat/completions', async (req, res) => {
    if (!modelLoaded) {
        return res.status(503).send('Model is loading, please try again later.');
    }
    // ... (omitted) build `prompt` from req.body and obtain a LlamaChatSession `session`, then stream the reply:
    await session.prompt(prompt, {
        // seed: settings.seed || 10000, // Default seed if not provided
        onTextChunk(chunk) {
            // Send each chunk to the client
            res.write(JSON.stringify({ chunk }));
        }
    });
    res.end();
});
// POST endpoint to reload the model with new GPU layers
app.post('/reload-model', async (req, res) => {
const { newLayers } = req.body;
});
// Function to attempt to increase active sessions and adjust GPU layers if necessary
const tryToIncreaseSessions = async () => {
const previousLayers = GPU_LAYERS; // Store the previous GPU layers count
let previousMaxActiveSessions = MAX_ACTIVE_SESSIONS; // Store the previous max active sessions
};
// API endpoint to trigger the session increase attempt
app.post('/increase-sessions', async (req, res) => {
try {
await tryToIncreaseSessions(); // Attempt to increase active sessions
res.send('Attempted to increase active sessions based on current VRAM.');
} catch (error) {
console.error('Error increasing sessions:', error);
res.status(500).send('Error attempting to increase active sessions');
}
});
// API endpoint to increase active sessions
app.post('/decrease-sessions', async (req, res) => {
try {
const result = await tryToDecreaseSessions();
res.send(result);
} catch (error) {
console.error('Error decreasing sessions:', error);
res.status(500).send('Error decreasing sessions');
}
});
// Function to try to decrease active sessions
const tryToDecreaseSessions = async () => {
console.log('Attempting to decrease active sessions.');
const originalSessionCount = MAX_ACTIVE_SESSIONS;
// Save the original layer count for reference
const originalLayers = GPU_LAYERS;
let lastSuccessfulLayers = originalLayers; // Track the last successful layer count
let dropCount = 0; // Counter for session count drops
let layersUpCount = originalLayers;
};
// GET endpoint to show the status of the server
app.get('/status', (req, res) => {
res.json({
modelLoaded,
activeSessions,
maxActiveSessions: MAX_ACTIVE_SESSIONS,
averageGenerationTime,
isUnknown
});
});
// Start the server
app.listen(port, () => {
console.log(`Server running at http://localhost:${port}`);
});
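A related consideration for the VRAM budgeting above: if each conversation keeps its own context alive so that follow-up messages skip reevaluation, idle contexts also have to be disposed at some point, or MAX_ACTIVE_SESSIONS stops being meaningful. A rough sketch, where the sessionCache map, its { session, context, lastUsed } entry shape, and the idle threshold are all illustrative:
const sessionCache = new Map(); // conversationId -> { session, context, lastUsed } (illustrative cache)
const IDLE_LIMIT_MS = 10 * 60 * 1000; // evict conversations idle for 10 minutes (arbitrary threshold)

setInterval(async () => {
    for (const [conversationId, entry] of sessionCache) {
        if (Date.now() - entry.lastUsed > IDLE_LIMIT_MS) {
            await entry.context.dispose(); // frees the VRAM held by this conversation's context
            sessionCache.delete(conversationId);
            activeSessions = Math.max(0, activeSessions - 1);
        }
    }
}, 60 * 1000);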