Is there parallelism inside Ollama?

I wrote a Python program to translate large English texts into French. What I do is feed a bunch of reports to Ollama using a for loop.

from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        reports = ["reports_text_1", "reports_text_2"....] # average text size per report is between 750-1000 tokens.
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text:{translated_report}, Time taken:{total_duration}")
            except Exception as e:
                print(f"Translation failed: {e}")


if __name__ == '__main__':
    job = TestOllama()
    job.run()

Docker command to run Ollama:

docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama

My question is: when I run this script on a V100 and on an H100, I don't see a significant difference in execution time. I also avoided adding parallelism myself, thinking that Ollama might parallelize requests internally, but htop shows only one busy core. Am I right about this?

I am a beginner in NLP; any help or guidance on how to organize my code (e.g. using multithreading to send Ollama requests, etc.) would be appreciated.
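
For context, the kind of change I have in mind looks roughly like the sketch below: a hypothetical multithreaded version of my loop using concurrent.futures threads. I don't know whether this actually helps if Ollama processes requests one at a time.

from concurrent.futures import ThreadPoolExecutor, as_completed

from ollama import Client

# Hypothetical sketch: same client as above, but requests are submitted from
# several threads. If Ollama serializes requests internally, the threads will
# simply wait in line and this won't speed anything up.
client = Client(host="http://127.0.0.1:11434")


def translate(text_to_translate: str) -> str:
    ollama_response = client.generate(
        model="mistral",
        prompt=f"translate this English text into French: {text_to_translate}",
    )
    return ollama_response['response'].lstrip()


reports = ["reports_text_1", "reports_text_2"]  # placeholder report texts

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(translate, report) for report in reports]
    for future in as_completed(futures):
        print(f"Translated text: {future.result()}")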

1 Answer

Answered by Justin Zhang:

To my knowledge, as of March 29, 2024, Ollama does not support processing requests in parallel.

Since you have two GPUs, you could try running two (or more, though that is not recommended) Ollama containers on different ports. Here is an example.

To make my answer self-contained, here are the steps:

  1. Pull the official Ollama image from Docker Hub: docker pull ollama/ollama:latest

  2. Run the first Ollama container using

    docker run -d \
    --gpus device=0 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port1}:11434 \
    --name ollama1 \
    ollama/ollama
    
  3. Run the second Ollama container using

    docker run -d \
    --gpus device=1 \
    -v {the_dir_u_save_models}:/root/.ollama \
    -p {port2}:11434 \
    --name ollama2 \
    ollama/ollama
    

    Note the difference in the --gpus and -p parameters, and that the two containers must have different names.

  4. Use async code to manage your parallelization logic in Python or another language. Here is some pseudo-code (JavaScript-style) describing the process:

    function processQuestionsAsync(questions) {
        // Sort questions based on a specific criterion, e.g., LLM
        questions.sort((a, b) => {
            // Example: sorting by 'LLM'. Adjust comparison logic as needed.
            return a.LLM.localeCompare(b.LLM);
        });
    
        // Divide questions into two batches
        let batch1 = questions.slice(0, questions.length / 2);
        let batch2 = questions.slice(questions.length / 2);
    
        // Initialize an array to collect responses
        let responses = [];
    
        // Function to send questions to a server asynchronously
        async function sendToServer(batch, serverURL) {
            for (let question of batch) {
                let response = await sendQuestionToServer(question, serverURL);
                // Once a response is received, send it to the front-end
                sendResponseToFrontEnd(response);
                // Store response in the responses array
                responses.push(response);
            }
        }
    
        // Define server URLs for each batch
        let serverURL1 = "http://server1.com/api";
        let serverURL2 = "http://server2.com/api";
    
        // Use Promise.all to handle both batches in parallel
        Promise.all([
            sendToServer(batch1, serverURL1),
            sendToServer(batch2, serverURL2)
        ]).then(() => {
            // All questions have been processed, and all responses have been sent to the front-end
            console.log("All questions processed");
        }).catch(error => {
            // Handle errors
            console.error("An error occurred:", error);
        });
    }
    
    // Function to simulate sending a question to a server
    async function sendQuestionToServer(question, serverURL) {
        // Simulate network request
        let response = await fetch(serverURL, {
            method: 'POST',
            body: JSON.stringify(question),
            headers: {
                'Content-Type': 'application/json'
            },
        });
    
        // Return the response data
        return await response.json();
    }
    
    // Function to simulate sending a response back to the front-end
    function sendResponseToFrontEnd(response) {
        console.log("Sending response to front-end:", response);
    }
    
    // Example usage with questions as objects including 'contentType'
    let questions = [
        { question: "What is 2+2?", contentType: "Math", LLM: "GPT-3" },
        { question: "What is the capital of France?", contentType: "Geography", LLM: "GPT-3" },
        { question: "What is the largest ocean?", contentType: "Geography", LLM: "GPT-4" },
        { question: "What is the speed of light?", contentType: "Physics", LLM: "GPT-3" }
    ];
    processQuestionsAsync(questions);
    

The pseudo-code above is only illustrative. The most important point is that you may receive front-end requests asking for different LLMs, and a single Ollama instance then has to load and unload models to switch between them. Try to avoid that as far as possible.
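
Since the client code in the question is Python, here is a minimal sketch of the same batching idea using asyncio and the AsyncClient from the ollama Python package (assuming a recent version of the package that ships it). The two endpoint URLs below are placeholders for whatever host ports you mapped {port1} and {port2} to:

    import asyncio

    from ollama import AsyncClient

    # Placeholder endpoints: substitute the host ports of your two containers.
    ENDPOINTS = ["http://127.0.0.1:11435", "http://127.0.0.1:11436"]


    async def translate_batch(host: str, texts: list[str]) -> list[str]:
        # Each batch is sent sequentially to one Ollama instance; the two
        # batches run concurrently against the two containers.
        client = AsyncClient(host=host)
        results = []
        for text in texts:
            response = await client.generate(
                model="mistral",
                prompt=f"translate this English text into French: {text}",
            )
            results.append(response["response"].lstrip())
        return results


    async def main() -> None:
        reports = ["reports_text_1", "reports_text_2", "reports_text_3", "reports_text_4"]
        half = len(reports) // 2
        batches = [reports[:half], reports[half:]]
        results = await asyncio.gather(
            *(translate_batch(host, batch) for host, batch in zip(ENDPOINTS, batches))
        )
        for batch in results:
            for translated in batch:
                print(translated)


    if __name__ == "__main__":
        asyncio.run(main())

The concurrency here is across the two containers, not within one; each instance still processes its own batch one request at a time.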

By the way, based on my own experience with Ollama on an A100, parallelization may not be a good choice: VRAM is abundant for sure, but compute capacity is always the limiting factor.