How to run ollama in Google Colab?


I have code like this. When I launch it, I get an ngrok link.

!pip install aiohttp pyngrok

import os
import asyncio
from aiohttp import ClientSession

# Set LD_LIBRARY_PATH so the system NVIDIA library becomes preferred
# over the built-in library. This is particularly important for
# Google Colab which installs older drivers
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

async def run(cmd):
  '''
  run is a helper function to run subcommands asynchronously.
  '''
  print('>>> starting', *cmd)
  p = await asyncio.subprocess.create_subprocess_exec(
      *cmd,
      stdout=asyncio.subprocess.PIPE,
      stderr=asyncio.subprocess.PIPE,
  )

  async def pipe(lines):
    async for line in lines:
      print(line.strip().decode('utf-8'))

  await asyncio.gather(
      pipe(p.stdout),
      pipe(p.stderr),
  )


await asyncio.gather(
    run(['ollama', 'serve']),
    run(['ngrok', 'http', '--log', 'stderr', '11434']),
)

I follow the link, but this is what appears on the page:

[screenshot of the page that opens at the ngrok link]

How can I fix this? Before that, I did the following

!choco install ngrok
!ngrok config add-authtoken -----
!curl https://ollama.ai/install.sh | sh
!command -v systemctl >/dev/null && sudo systemctl stop ollama

There are 3 answers

Gruff On BEST ANSWER

1. Run ollama but don't stop it

!curl https://ollama.ai/install.sh | sh

# should produce, among other thigns:
# The Ollama API is now available at 0.0.0.0:11434

This means Ollama is running (but do check for errors, especially around graphics capability/CUDA, as these may interfere).
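For example, a quick way to confirm that the Colab runtime actually has a GPU and a working NVIDIA driver (assuming you've selected a GPU runtime) is to run this in a cell:

# Should print a table with the driver/CUDA version and the attached GPU
!nvidia-smi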

However, don't run !command -v systemctl >/dev/null && sudo systemctl stop ollama (unless you want to stop Ollama).

The next step is to start the Ollama service. Since you are using ngrok, I'm assuming you want to be able to run the LLM from environments outside the Colab. If that isn't the case, you don't really need ngrok. But since Colabs are tricky to get working nicely with async code and threads, it's useful to use the Colab to, for example, run a VM powerful enough to play with larger models than anything you could run on your dev environment.

2. Set up ngrok and forward the local ollama service to a public URI

Ollama isn't yet running as a service but we can set up ngrok in advance of this:

import queue
import threading
import time

from pyngrok import ngrok

# Get your ngrok token from your ngrok account:
# https://dashboard.ngrok.com/get-started/your-authtoken
token = "your token goes here - don't forget to replace this with it!"
ngrok.set_auth_token(token)

# set up a stoppable thread (not mandatory, but cleaner if you want to stop this later)
class StoppableThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        super(StoppableThread, self).__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self):
        return self._stop_event.is_set()

def start_ngrok(q):
    try:
        # Start an HTTP tunnel on the specified port
        public_url = ngrok.connect(11434)
        # Put the public URL in the queue
        q.put(public_url)
        # Keep the thread alive until the thread's stop event is set
        # (threading.current_thread() is the StoppableThread running this target)
        while not threading.current_thread().is_stopped():
            time.sleep(1)  # Adjust sleep time as needed
    except Exception as e:
        print(f"Error in start_ngrok: {e}")

Run that code so the class and function exist. Then, in the next cell, start ngrok in a separate thread so it doesn't hang your Colab. We use a queue to share data between threads, because we want to know what the ngrok public URL is once it's running:

# Create a queue to share data between threads
url_queue = queue.Queue()

# Start ngrok in a separate thread
ngrok_thread = StoppableThread(target=start_ngrok, args=(url_queue,))
ngrok_thread.start()

That will be running, but you need to get the results from the queue to see what ngrok returned, so then do:

# Wait for the ngrok tunnel to be established
while True:
    try:
        # get() blocks for up to a second; queue.Empty means ngrok isn't ready yet
        public_url = url_queue.get(timeout=1)
        if public_url:
            break
    except queue.Empty:
        print("Waiting for ngrok URL...")
    except Exception as e:
        print(f"Error in retrieving ngrok URL: {e}")

print("Ngrok tunnel established at:", public_url)

This should output something like:

Ngrok tunnel established at: NgrokTunnel: "https://{somelongsubdomain}.ngrok-free.app" -> "http://localhost:11434"
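Note that pyngrok returns an NgrokTunnel object here, so if you just want the plain URL string (for example to print it or paste it somewhere), something like this small sketch should do it:

# public_url is a pyngrok NgrokTunnel; its public_url attribute holds the plain https URL string
print("Public URL:", public_url.public_url)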

3. Run ollama as an async process

import os
import asyncio

# NB: You may need to adjust these to get CUDA working, depending on which backend you are running.
# Add the CUDA binaries to PATH
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to include both the /usr/lib64-nvidia and CUDA lib directories
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # define an async pipe function that prints each line as it arrives
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # call it on both stdout and stderr
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

That creates the function to run an async command but doesn't run it yet.

This will start ollama in a separate thread so your Colab isn't blocked:

import asyncio
import threading

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro) 
    loop.close()

# Create a new event loop that will run in a new thread 
new_loop = asyncio.new_event_loop() 

# Start ollama serve in a separate thread so the cell won't block execution 
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start() 

It should produce something like:

>>> starting ollama serve
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 {some key}

2024/01/16 20:19:11 images.go:808: total blobs: 0
2024/01/16 20:19:11 images.go:815: total unused blobs removed: 0
2024/01/16 20:19:11 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
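Before moving on, you can sanity-check from another Colab cell that the server is actually reachable. As a rough sketch (pre-pulling a model here is optional, but it makes the first remote request faster; mistral is just an example name):

# Should reply with a short "Ollama is running" style message
!curl http://127.0.0.1:11434

# Optional: pre-pull a model on the Colab side
!ollama pull mistral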

Now you're all set up. You can do the next steps in the Colab, but it might be easier to run them on your local machine if you normally develop there.

4. Run an ollama model remotely from your local dev environment

This assumes you have installed ollama on your local dev environment (say WSL2, or some other Linux), i.e. the laptop or desktop machine in front of you (as opposed to the Colab).

Replace the actual URI below with whatever public URI ngrok reported above:

export OLLAMA_HOST=https://{longcode}.ngrok-free.app/

You can now run ollama and it will run on the remote in your Colab (so long as that stays up and running).

E.g. run this on your local machine and it will look as if it's running locally, but it's really running in your Colab, and the results are served to wherever you call it from (so long as OLLAMA_HOST is set correctly and is a valid tunnel to your ollama service):

ollama run mistral

You can now interact with the model on the command line locally but the model runs on the Colab.
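If you'd rather call the remote instance from code than from the ollama CLI, here is a minimal sketch against Ollama's REST /api/generate endpoint (using requests; the prompt and the mistral model name are just placeholders, and you should replace the URL with the actual one ngrok reported):

import json
import requests

# The public ngrok URL reported earlier (same value you'd put in OLLAMA_HOST)
OLLAMA_HOST = "https://{longcode}.ngrok-free.app"

# Stream a completion from the remote ollama instance; /api/generate returns
# newline-delimited JSON chunks, each carrying a partial "response" field
resp = requests.post(
    f"{OLLAMA_HOST}/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?"},
    stream=True,
    timeout=600,
)
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
print()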

If you want to run larger models, like mixtral, then you need to be sure to connect your Colab to a back-end compute that's powerful enough (e.g. 48GB+ of RAM, so a V100 GPU is the minimum spec for this at the time of writing).

Note: If you have any issues with CUDA or NVIDIA showing in the outputs of any steps above, don't proceed until you fix them.

Hope that helps!

Gruff

RootKore On

Add this

!pip install pyngrok

from pyngrok import ngrok
ngrok.set_auth_token("Your_Auth_token")
Alan Turing On

@sergey Mate, there's nothing wrong with the ngrok link. As it says, ollama is running, so everything is fine and already set up for you. You are running ollama as a remote server on Colab; now you can use it on your local machine super easily, and it'll only use Colab's computing resources, not your local machine's.

Let me explain it a bit (with my limited knowledge) so that anyone can get what's going on. In your case, it started the ollama service and exposed an endpoint using ngrok, which can be used to communicate with the ollama instance remotely. Unlike text-generation-webui developed by oobabooga, which is a web user interface for large language models, ollama is a command-line chatbot that makes it simple to use large language models almost anywhere.

Here's the complete guide. First Run this on Colab

!curl https://ollama.ai/install.sh | sh

!echo 'debconf debconf/frontend select Noninteractive' | sudo debconf-set-selections
!sudo apt-get update && sudo apt-get install -y cuda-drivers

!pip install pyngrok
from pyngrok import ngrok
ngrok.set_auth_token('Put_your_ngrok_auth_token_here')

import os
import asyncio

# Set LD_LIBRARY_PATH so the system NVIDIA library 
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

async def run_process(cmd):
  print('>>> starting', *cmd)
  p = await asyncio.subprocess.create_subprocess_exec(
      *cmd,
      stdout=asyncio.subprocess.PIPE,
      stderr=asyncio.subprocess.PIPE,
  )

  async def pipe(lines):
    async for line in lines:
      print(line.strip().decode('utf-8'))

  await asyncio.gather(
      pipe(p.stdout),
      pipe(p.stderr),
  )

from IPython.display import clear_output
clear_output()

await asyncio.gather(
    run_process(['ollama', 'serve']),
    run_process(['ngrok', 'http', '--log', 'stderr', '11434']),
)
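As a side note: since pyngrok is already installed and authenticated above, an alternative sketch is to open the tunnel with pyngrok instead of running the ngrok CLI, which makes the public URL easy to print. Then only ollama serve needs to run in the asyncio cell:

from pyngrok import ngrok

# Open the tunnel with pyngrok and print the public URL directly,
# instead of digging it out of the ngrok CLI's stderr log
tunnel = ngrok.connect(11434)
print("Ollama is reachable at:", tunnel.public_url)

# The last cell then only needs to run the server:
# await run_process(['ollama', 'serve'])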

Then on your machine, simply run the commands below and you can use any model you want. I used dolphin-mistral as an example.

On linux:

curl https://ollama.ai/install.sh | sh
export OLLAMA_HOST=(Put_your_ngrok_url_link_here)
ollama run dolphin-mistral

On mac:

brew install ollama
export OLLAMA_HOST=(Put_your_ngrok_url_link_here)
ollama run dolphin-mistral
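
If the model won't start, a quick way to check that the tunnel actually reaches ollama is a rough sketch like this (/api/tags just lists the models the remote server currently has):

# Should return a JSON list of the models installed on the Colab side
curl $OLLAMA_HOST/api/tags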