Caching between multiple backends so they can cross communicate the cached data to avoid reevaluation #355
Replies: 2 comments
-
chat1.txt
-
I recommend reading about the object lifecycle to understand how to reuse the existing state better.
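As a minimal sketch of that idea: keep the LlamaChatSession (and the context sequence it owns) alive between requests instead of recreating it for every message, so only the new tokens have to be evaluated. The conversationId key, the in-memory Map, and the model path below are illustrative, not taken from the code in this thread:
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: 'path/to/model.gguf' }); // placeholder path
const sessions = new Map(); // conversationId -> LlamaChatSession, kept alive across requests

async function getSession(conversationId) {
    if (!sessions.has(conversationId)) {
        const context = await model.createContext();
        sessions.set(conversationId, new LlamaChatSession({ contextSequence: context.getSequence() }));
    }
    return sessions.get(conversationId);
}

// Follow-up messages for the same conversation reuse the same session,
// so the earlier parts of the prompt are not reevaluated.
async function reply(conversationId, userMessage) {
    const session = await getSession(conversationId);
    return session.prompt(userMessage);
}
The trade-off is that every live context holds VRAM, so in practice idle conversations have to be capped or evicted.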
-
Hey! I am trying to set up multiple node-llama-cpp instances as backends behind a queue server that tracks how many active and busy sessions each backend has. However, it seems the prompt has to be reevaluated for every message sent. I know there is a way to avoid this, I just can't figure out how.
Does anyone know how I could make this more efficient by continuing the conversation rather than reevaluating it each time? The complicated part is that a user is unlikely to be sent back to the same server after they send a message, because the queue picks the first available backend. Am I going to have to change something so that, say, the 6-8 users who are currently active stick to the same machine? It's a huge inefficiency to wait 15-20 seconds before generation starts on a roughly 6,000-character prompt, when I know tools like textgenwebui handle this somehow: only the first message takes the extra 15-20 seconds, and the next ones take only about 5.
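Roughly, what I mean by keeping users on the same machine would be a sticky assignment per conversation on the queue server, something like this sketch (the backend URLs and the conversationId field are placeholders, not my actual setup):
const backends = ['http://127.0.0.1:3001', 'http://127.0.0.1:3002']; // placeholder backend URLs
const assignments = new Map(); // conversationId -> backend URL
const activeCount = new Map(backends.map((url) => [url, 0])); // busy sessions per backend

function pickBackend(conversationId) {
    // Route follow-up messages to the backend that already has this conversation's context
    if (assignments.has(conversationId)) {
        return assignments.get(conversationId);
    }
    // New conversation: fall back to the least-busy backend and remember the choice
    const [leastBusy] = [...activeCount.entries()].sort((a, b) => a[1] - b[1])[0];
    assignments.set(conversationId, leastBusy);
    return leastBusy;
}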
Here is my code below:
import express from 'express';
import { fileURLToPath } from 'url';
import path from 'path';
import { exec } from 'child_process'; // Import exec to run shell commands
import { performance } from 'perf_hooks'; // Import performance for timing
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
import cors from 'cors'; // Import CORS
// Create the Express app
const app = express();
const port = 3000;
// Enable CORS
app.use(cors());
// Middleware to parse JSON requests
app.use(express.json());
// Get the current directory name
const __dirname = path.dirname(fileURLToPath(import.meta.url));
// Load the Llama model
let model;
let modelLoaded = false;
// Adjust these settings as necessary
let MAX_CONTEXT_SIZE = 8192; // Example: Reduce context size
let GPU_LAYERS = 8; // Start with a high number of GPU layers
let MAX_ACTIVE_SESSIONS = 2; // Initial maximum number of concurrent sessions
let activeSessions = 0; // Counter for active sessions
const VRAM_PER_1SESSION = 750; // Example VRAM required per session in MB for 1 layer
// Function to get available VRAM
const getAvailableVRAM = async () => {
    return new Promise((resolve, reject) => {
        // Query total and used VRAM (in MB) and resolve with the free amount
        exec('nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits', (error, stdout) => {
            if (error) {
                return reject(error);
            }
            const [total, used] = stdout.trim().split(',').map((value) => parseInt(value, 10));
            resolve(total - used);
        });
    });
};
// Function to calculate maximum active sessions based on available VRAM
const calculateMaxActiveSessions = async () => {
const usableVRAM = await getAvailableVRAM();
const VRAM_PER_SESSION = VRAM_PER_1SESSION + (32*GPU_LAYERS); //every layer is 32MB more for context
console.log(VRAM_PER_SESSION);
MAX_ACTIVE_SESSIONS = Math.floor(usableVRAM / VRAM_PER_SESSION);
console.log(`Maximum active sessions updated to: ${MAX_ACTIVE_SESSIONS}`);
};
// Load the model with dynamic adjustment for GPU layers
import { InsufficientMemoryError } from 'node-llama-cpp'; // Import the specific error class if available
import { request } from 'http';
const loadModel = async () => {
    try {
        let attemptLayers = GPU_LAYERS; // Start with the current number of layers
        // ... (omitted) load the model, lowering attemptLayers on memory errors, then set modelLoaded = true
    } catch (error) {
        console.error('Error loading model:', error);
    }
};
// Load the model once when the server starts
loadModel().catch(console.error);
// Function to unload the model
const unloadModel = async () => {
if (model && typeof model.dispose === 'function') {
await model.dispose(); // Call dispose method if available
console.log("Model unloaded successfully");
modelLoaded = false; // Set to false once the model is unloaded
}
};
// Array to store the last 10 response generation times
const responseTimes = [];
const maxResponseTimes = 10; // Maximum number of response times to store
let isUnknown = false; // Variable to indicate unknown state
let averageGenerationTime = null;
// Function to log response time
function logResponseTime(responseTime) {
    // ... (omitted) record responseTime, keep only the last maxResponseTimes entries, and update averageGenerationTime
}
// Function to reset the response times and set the unknown variable
function resetResponseTimes() {
responseTimes.length = 0; // Clear the array
isUnknown = true; // Set the unknown state to true
console.log("Response times have been reset. Unknown state is set to true.");
}
app.post('/v1/chat/completions', async (req, res) => {
    if (!modelLoaded) {
        return res.status(503).send('Model is loading, please try again later.');
    }
    // ... (omitted) build `prompt` from req.body and obtain a LlamaChatSession `session`, then stream the reply:
    await session.prompt(prompt, {
        // seed: settings.seed || 10000, // Default seed if not provided
        onTextChunk(chunk) {
            // Send each chunk to the client
            res.write(JSON.stringify({ chunk }));
        }
    });
    res.end();
});
// POST endpoint to reload the model with new GPU layers
app.post('/reload-model', async (req, res) => {
const { newLayers } = req.body;
});
// Function to attempt to increase active sessions and adjust GPU layers if necessary
const tryToIncreaseSessions = async () => {
const previousLayers = GPU_LAYERS; // Store the previous GPU layers count
let previousMaxActiveSessions = MAX_ACTIVE_SESSIONS; // Store the previous max active sessions
};
// API endpoint to trigger the session increase attempt
app.post('/increase-sessions', async (req, res) => {
try {
await tryToIncreaseSessions(); // Attempt to increase active sessions
res.send('Attempted to increase active sessions based on current VRAM.');
} catch (error) {
console.error('Error increasing sessions:', error);
res.status(500).send('Error attempting to increase active sessions');
}
});
// API endpoint to increase active sessions
app.post('/decrease-sessions', async (req, res) => {
try {
const result = await tryToDecreaseSessions();
res.send(result);
} catch (error) {
console.error('Error decreasing sessions:', error);
res.status(500).send('Error decreasing sessions');
}
});
// Function to try to decrease active sessions
const tryToDecreaseSessions = async () => {
console.log('Attempting to decrease active sessions.');
const originalSessionCount = MAX_ACTIVE_SESSIONS;
// Save the original layer count for reference
const originalLayers = GPU_LAYERS;
let lastSuccessfulLayers = originalLayers; // Track the last successful layer count
let dropCount = 0; // Counter for session count drops
let layersUpCount = originalLayers;
};
// GET endpoint to show the status of the server
app.get('/status', (req, res) => {
res.json({
modelLoaded,
activeSessions,
maxActiveSessions: MAX_ACTIVE_SESSIONS,
averageGenerationTime,
isUnknown
});
});
// Start the server
app.listen(port, () => {
console.log(`Server running at http://localhost:${port}`);
});
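A related consideration for the VRAM budgeting above: if each conversation keeps its own context alive so that follow-up messages skip reevaluation, idle contexts also have to be disposed at some point, or MAX_ACTIVE_SESSIONS stops being meaningful. A rough sketch, where the sessionCache map, its { session, context, lastUsed } entry shape, and the idle threshold are all illustrative:
const sessionCache = new Map(); // conversationId -> { session, context, lastUsed } (illustrative cache)
const IDLE_LIMIT_MS = 10 * 60 * 1000; // evict conversations idle for 10 minutes (arbitrary threshold)

setInterval(async () => {
    for (const [conversationId, entry] of sessionCache) {
        if (Date.now() - entry.lastUsed > IDLE_LIMIT_MS) {
            await entry.context.dispose(); // frees the VRAM held by this conversation's context
            sessionCache.delete(conversationId);
            activeSessions = Math.max(0, activeSessions - 1);
        }
    }
}, 60 * 1000);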