I wrote a Python program to translate large English texts into French. I feed a batch of reports to Ollama one at a time in a for loop:
from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        # Lazily create a single client for the local Ollama server.
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
prompt=f"translate this French text into English: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        # Average text size per report is between 750-1000 tokens.
        reports = ["reports_text_1", "reports_text_2", ...]
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text: {translated_report}, Time taken: {total_duration}")
            except Exception:
                # Swallowing errors silently hides failures; consider logging them.
                pass


if __name__ == '__main__':
    job = TestOllama()
    job.run()
Docker command to run Ollama:
docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama
My question is: when I run this script on a V100 and on an H100, I don't see a significant difference in execution time. I also avoided adding parallelism myself, assuming Ollama parallelizes the work internally, yet htop shows only one core doing the work. Am I right about this?
I am a beginner in NLP; any help/guidance on how to organize my code (e.g. using multithreading to send the Ollama requests, etc.) would be appreciated.
As far as I know, as of now (March 29, 2024), Ollama doesn't support parallel request processing.
Since you have two GPUs, you could try running two (or more, though I don't recommend it) Ollama containers on different ports. Here is an example.
OK, I shall make my answer self-contained. Here are the steps for you:
1. Pull the official ollama image from Docker Hub:
   docker pull ollama/ollama:latest
2. Run the 1st docker container (see the example commands after this list).
3. Run the 2nd docker container.
4. Watch out for the difference in the --gpus= and -p parameters.
5. Use async to manage your parallelization logic in Python or another language; a pseudo-code sketch follows below.
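For the two containers, something along these lines should work; the GPU selectors, volume names, container names, and the second host port 11435 are placeholders to adapt to your own setup:

docker run -d --gpus=device=0 -v ollama_gpu0:/root/.ollama -p 11434:11434 --name ollama_gpu0 ollama/ollama
docker run -d --gpus=device=1 -v ollama_gpu1:/root/.ollama -p 11435:11434 --name ollama_gpu1 ollama/ollama

The ollama Python package also ships an AsyncClient, so the async logic could look roughly like the following. This is only a minimal sketch, assuming the two instances above on ports 11434 and 11435 and the mistral model; the names translate and translate_all are made up for illustration:

import asyncio

from ollama import AsyncClient

# Hypothetical setup: one Ollama container per GPU, exposed on different host ports.
CLIENTS = [
    AsyncClient(host="http://127.0.0.1:11434"),
    AsyncClient(host="http://127.0.0.1:11435"),
]


async def translate(client: AsyncClient, text_to_translate: str) -> str:
    # One request to one Ollama instance; generation itself is still sequential.
    response = await client.generate(
        model="mistral",
        prompt=f"Translate this English text into French: {text_to_translate}",
    )
    return response["response"].lstrip()


async def translate_all(reports: list[str]) -> list[str]:
    # Round-robin the reports over the available clients and run the requests concurrently.
    tasks = [
        translate(CLIENTS[i % len(CLIENTS)], report)
        for i, report in enumerate(reports)
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    translated = asyncio.run(translate_all(["reports_text_1", "reports_text_2"]))
    for report in translated:
        print(report)

Each report is still generated sequentially on whichever GPU serves it; what the two containers buy you is that two reports can be in flight at the same time.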
The pseudo-code above doesn't matter much. The most important thing is that you may receive requests from the front end asking for different LLMs, and a single Ollama instance would have to load and unload models to switch between them. Try to avoid that as far as possible.
By the way, based on my own experience with Ollama on an A100, parallelization may not be a good choice: VRAM is abundant for sure, but the computing capacity is always the limit.