Why do I get a CUDA error when calling an API that uses RAPIDS/cuDF, while the same code succeeds inside the container?

The environment is Ubuntu 20.04.5 with a Tesla P4, CUDA 11.4, and NVIDIA driver 470.223.02. The RAPIDS image is based on rapidsai/base:23.10-cuda11.2-py3.10. I wrote a KServe transformer service and deployed it on Kubernetes with 4 CPUs, 4 GB of memory, and one Tesla P4. Calling the API raises an error, but the same code succeeds when I exec into the container to debug.

My Dockerfile is:

FROM rapidsai/base:23.10-cuda11.2-py3.10

USER root

RUN apt-get install -y --no-install-recommends libsasl2-dev libsasl2-modules gcc g++

RUN pip install kserve==0.10.0 \
    sasl==0.3.1 thrift==0.16.0 thrift-sasl==0.4.3 \
    pyhive==0.7.0 sqlalchemy==2.0.23 redis==5.0.1 pymysql==1.1.0 statsmodels \
    -i https://mirrors.aliyun.com/pypi/simple

RUN pip install httpx==0.25.1 protobuf==4.23.4 fastapi==0.88.0

The packages that might conflict are:
cudf originally comes with fastapi 0.104.1, but I use 0.88.0 to match kserve.
kserve originally comes with protobuf 3.19.0, but I use 4.23.4 to match cudf.

The server code looks like this:

import kserve
transformer = DriverTransformer()  # DriverTransformer inherits from kserve.Model
server = kserve.ModelServer()
server.start(models=[transformer])

The code that raises the error is:

class DriverTransformer(...):
  def inputs2df(self, inputs: Dict):
    ...
    A_list = [...]
    B_list = [...]
    df = cudf.DataFrame(A_list + B_list, columns=["C", "D", "E", "F"])
    ...

gives me:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/column/column.py", line 2337, in as_column
    memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
TypeError: memoryview: a bytes-like object is required, not 'tuple'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 270, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/opt/conda/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 706, in __call__
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 235, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/kserve/protocol/rest/v1_endpoints.py", line 106, in statistics
    response, response_headers = await self.dataplane.statistics(model_name=model_name, body=body, headers=headers)
  File "/opt/conda/lib/python3.10/site-packages/kserve/protocol/dataplane.py", line 333, in statistics
    response = await model(body, model_type=ModelType.STATISTICIAN)
  File "/opt/conda/lib/python3.10/site-packages/kserve/model.py", line 118, in __call__
    payload = await self.stat_preprocess(body, headers) if inspect.iscoroutinefunction(self.stat_preprocess) \
  File "/ims/transformer/driver_transformer.py", line 1023, in stat_preprocess
    input_df = self.inputs2df(inputs)
  File "/ims/transformer/dataprocess_gpu/utils.py", line 12, in wrapper
    result = func(*args, **kwargs)
  File "/ims/transformer/driver_transformer.py", line 352, in inputs2df
    df = cudf.DataFrame(
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py", line 814, in __init__
    self._init_from_list_like(
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py", line 987, in _init_from_list_like
    self._data[col_name] = column.as_column(col)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/column/column.py", line 2523, in as_column
    data = as_column(
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/column/column.py", line 2009, in as_column
    col = ColumnBase.from_arrow(arbitrary)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/column/column.py", line 379, in from_arrow
    result = libcudf.interop.from_arrow(data)[0]
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "interop.pyx", line 199, in cudf._lib.interop.from_arrow
RuntimeError: Fatal CUDA error encountered at: /opt/conda/conda-bld/work/cpp/src/bitmask/null_mask.cu:93: 3 cudaErrorInitializationError initialization error

I don't know where the "tuple" comes from or how to fix the error. When I start the base image with

docker run -it --rm --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 rapidsai/base:23.10-cuda11.2-py3.10 /bin/bash

or exec into the Kubernetes container above and run the same code in Python to debug, it succeeds. How can I solve this problem? Thanks in advance for your help.
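Since the same call fails inside the server process but succeeds in an interactive shell, one way to narrow this down would be to log the process context right before the cudf.DataFrame call in both situations and compare the output. The following is only a minimal diagnostic sketch using the Python standard library. The function name describe_process_context is hypothetical, and the choice of fields (process IDs, multiprocessing start method, GPU visibility variables) is an assumption about what might differ, not a confirmed cause.

```python
import multiprocessing
import os


def describe_process_context():
    """Collect facts about the current process that may differ between
    the server worker and an interactive Python session."""
    return {
        "pid": os.getpid(),
        "parent_pid": os.getppid(),
        # None means no start method has been fixed yet in this process.
        "start_method": multiprocessing.get_start_method(allow_none=True),
        # GPU visibility variables; None means the variable is not set.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    print(describe_process_context())
```

Calling describe_process_context() once at the top of inputs2df and once in the interactive session would show whether the two runs happen in differently configured processes.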
