I wrote a Python program to translate large English texts into French. I feed a batch of reports to Ollama one at a time in a for loop:
from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        # Lazily create a single client for the local Ollama server.
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
prompt=f"translate this French text into English: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        # Average text size per report is between 750-1000 tokens.
        reports = ["reports_text_1", "reports_text_2", ...]
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text: {translated_report}, Time taken: {total_duration}")
            except Exception:
                # Swallowing errors silently hides failures; consider logging them.
                pass


if __name__ == '__main__':
    job = TestOllama()
    job.run()
Docker command to run Ollama:
docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama
My question is: when I run this script on a V100 and on an H100, I don't see a significant difference in execution time. I also avoided adding parallelism myself, assuming Ollama parallelizes the work internally, yet htop shows only one core doing the work. Am I right about this?
I am a beginner in NLP; any help/guidance on how to organize my code (e.g. using multithreading to send the Ollama requests, etc.) would be appreciated.
As far as I know, as of now (March 29, 2024), Ollama doesn't support parallel request processing.
Since you have two GPUs, you could try running two (or more, though I don't recommend it) Ollama containers on different ports. Here is an example.
OK, I shall make my answer self-contained. Here are the steps for you:
1. Pull the official ollama image from Docker Hub:
   docker pull ollama/ollama:latest
2. Run the 1st docker container (see the example commands after this list).
3. Run the 2nd docker container.
4. Watch out for the difference in the --gpus= and -p parameters.
5. Use async to manage your parallelization logic in Python or another language; a pseudo-code sketch follows below.
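For the two containers, something along these lines should work; the GPU selectors, volume names, container names, and the second host port 11435 are placeholders to adapt to your own setup:

docker run -d --gpus=device=0 -v ollama_gpu0:/root/.ollama -p 11434:11434 --name ollama_gpu0 ollama/ollama
docker run -d --gpus=device=1 -v ollama_gpu1:/root/.ollama -p 11435:11434 --name ollama_gpu1 ollama/ollama

The ollama Python package also ships an AsyncClient, so the async logic could look roughly like the following. This is only a minimal sketch, assuming the two instances above on ports 11434 and 11435 and the mistral model; the names translate and translate_all are made up for illustration:

import asyncio

from ollama import AsyncClient

# Hypothetical setup: one Ollama container per GPU, exposed on different host ports.
CLIENTS = [
    AsyncClient(host="http://127.0.0.1:11434"),
    AsyncClient(host="http://127.0.0.1:11435"),
]


async def translate(client: AsyncClient, text_to_translate: str) -> str:
    # One request to one Ollama instance; generation itself is still sequential.
    response = await client.generate(
        model="mistral",
        prompt=f"Translate this English text into French: {text_to_translate}",
    )
    return response["response"].lstrip()


async def translate_all(reports: list[str]) -> list[str]:
    # Round-robin the reports over the available clients and run the requests concurrently.
    tasks = [
        translate(CLIENTS[i % len(CLIENTS)], report)
        for i, report in enumerate(reports)
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    translated = asyncio.run(translate_all(["reports_text_1", "reports_text_2"]))
    for report in translated:
        print(report)

Each report is still generated sequentially on whichever GPU serves it; what the two containers buy you is that two reports can be in flight at the same time.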
The pseudo-code above doesn't matter much. The most important thing is that you may receive requests from the front end asking for different LLMs, and a single Ollama instance would have to load and unload models to switch between them. Try to avoid that as far as possible.
By the way, based on my own experience with Ollama on an A100, parallelization may not be a good choice: VRAM is abundant for sure, but the computing capacity is always the limit.