GPU-Accelerated Containers for Deep Learning
From Basic NVIDIA CUDA Setup to Comprehensive PyTorch Development Environments
In this article, we explore the setup of GPU-accelerated Docker containers using NVIDIA GPUs. We cover the essential requirements for enabling GPU acceleration, including host system configuration and container-specific needs. The guide examines two main approaches: utilizing pre-built CUDA wheels for Python frameworks, and creating comprehensive development environments with full CUDA toolkit integration and PyTorch built from source.
Table of Contents
Brief History
Motivation
Preparing the Host Environment
Enabling GPUs in Containers
Utilizing Pre-built CUDA Wheels in Python
Comprehensive Development Containers
NVIDIA CUDA Docker Image Variants
CUDA Environment with PyTorch Built from Source
Conclusions
Brief History
Initial GPUs were designed solely for graphics processing, with specialized hardware tailored for rendering images and video. GPUs excel in these tasks due to their highly parallel architecture, which allows for simultaneous processing of large amounts of data, making them particularly efficient for computations that can be broken down into many independent calculations. As GPU architecture evolved, it became more flexible, allowing programmers to use these processors for general-purpose computing tasks beyond graphics, leading to the development of GPGPU (General-Purpose computing on Graphics Processing Units).
In 2006 NVIDIA introduced CUDA as a general-purpose parallel computing platform and programming model for their GPUs. CUDA allowed developers to use high-level programming languages like C++ to harness the power of GPUs for complex computational problems beyond graphics processing, making GPGPU more accessible and efficient.
The combination of general-purpose GPUs and NVIDIA's CUDA programming platform significantly expanded the potential of GPU computing beyond graphics processing, allowing researchers to execute arbitrary code on GPUs using a C-like language with a convenient, massively parallel programming model. In 2009, Raina, Madhavan, and Ng published the paper “Large-scale Deep Unsupervised Learning using Graphics Processors”, demonstrating that GPUs could accelerate certain deep-learning tasks by close to a hundredfold, which led to the rapid adoption of GPUs in deep learning research.
GPUs are widely used in deep learning today due to their exceptional ability to perform parallel matrix and vector computations, which are fundamental to neural network algorithms. This capability, combined with continuous advancements in GPU architecture, allows engineers to accelerate training by orders of magnitude, reducing computation times from weeks or months to days or weeks and enabling the development of larger, more complex models.
Motivation
GPU acceleration and CUDA Toolkit integration are valuable for certain containerized applications. Some examples:
Containerized large machine learning models: GPU-powered inference in web APIs significantly reduces response times and overall latency.
Development workbenches: Containerized environments deployed on GPU-enabled machines, allowing engineers and researchers to develop and test GPU-accelerated software.
However, leveraging GPUs in containerized environments comes with specific requirements. In the following sections, we'll explore these prerequisites and guide you through the process of enabling GPU acceleration for your containerized applications.
Preparing the Host Environment
The first step in enabling the host system to detect and communicate with attached GPUs is to install the appropriate NVIDIA drivers. The drivers provide the interface between the operating system and the GPU hardware. When selecting the driver version, it's important to consult NVIDIA's support matrix, which identifies which driver versions are compatible with which CUDA Toolkit versions and hardware. An installation guide for Ubuntu can be found in Ubuntu's official documentation.
The second required component is the NVIDIA Container Toolkit. The official documentation provides a concise description:
The NVIDIA Container Toolkit enables users to build and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.
NVIDIA provides an installation guide for the Container Toolkit in its official documentation.
With these two steps completed, the host system should be prepared for GPU-accelerated containerized applications.
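Before moving on, it is worth running a quick smoke test from the host. The commands below are a minimal sketch: nvidia-smi confirms the driver sees the GPU, and the second command confirms Docker can pass it through to a container. The CUDA image tag is only an example; pick one compatible with your driver version.
# Verify that the driver is installed and the GPU is visible on the host
nvidia-smi
# Verify that a container launched with GPU support can also see the GPU
# (the image tag is an example; use one that matches your driver's supported CUDA version)
docker run --rm --gpus all nvidia/cuda:12.5.1-base-ubuntu22.04 nvidia-smi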
Enabling GPUs in Containers
Containers requiring GPU acceleration typically need to include specific components of the CUDA Toolkit, tailored to their purpose. The CUDA Toolkit is a comprehensive suite that includes GPU-accelerated libraries, development tools, a C/C++ compiler, and runtime libraries.
For running GPU-accelerated applications:
The CUDA runtime library is essential. This allows deployed applications to interact with the GPU.
Specific GPU-accelerated libraries (e.g., cuBLAS, cuDNN) may be needed, depending on the application's requirements.
For the development of GPU-accelerated software:
The CUDA compiler (NVCC) is necessary to compile CUDA C/C++ code.
Debugging and optimization tools are useful for performance tuning.
The full set of GPU-accelerated libraries and header files is typically included for comprehensive development capabilities.
By selectively including only the necessary components, containers can be optimized for size and purpose, whether for production deployment or development environments.
A crucial yet easy-to-miss step is to launch containers with GPU support activated. For Docker, the command would be:
docker run --gpus all [other options] [image name]
When using containerd (via nerdctl):
nerdctl run --gpus all [other options] [image name]
and so on for other container runtimes. Omitting this flag will prevent the container from accessing the host system's GPUs, even if the container image includes the necessary CUDA components.
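The --gpus flag also accepts finer-grained values than all. The following sketch shows the Docker syntax for exposing specific devices or a device count; note that the quotes around device=... are required in most shells.
# Expose only the first GPU to the container
docker run --gpus '"device=0"' [other options] [image name]
# Request any two available GPUs
docker run --gpus 2 [other options] [image name]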
Utilizing Pre-built CUDA Wheels in Python
After meeting the host system prerequisites and understanding the container requirements, we can try to create Docker images with CUDA runtime support. For example, popular frameworks like PyTorch simplify this process by offering pre-built wheels that include the CUDA runtime. This removes the necessity for developers to install the CUDA runtime themselves, as well as cuDNN and NCCL. Let's analyze a PyTorch wheel as an example: torch-2.3.1+cu121-cp312-cp312-linux_x86_64.whl. This wheel name encodes several essential details:
cu121: Built for CUDA 12.1.
cp312-cp312: The first cp312 means built with CPython 3.12; the second means ABI (Application Binary Interface) compatible with CPython 3.12.
linux: For Linux operating systems.
x86_64: For the 64-bit x86 architecture (AMD64).
Assuming a standard Poetry-managed project structure, to use this wheel, specify it directly in your pyproject.toml:
torch = [
{ url = "https://download.pytorch.org/whl/cu121/torch-2.3.1%2Bcu121-cp312-cp312-linux_x86_64.whl", markers = "sys_platform == 'linux'" }
]
Then, you can build the Docker image as defined in “Crafting an Efficient Dockerfile”. This configuration ensures your Docker container has the necessary CUDA runtime support for PyTorch. You can verify CUDA availability for PyTorch within your container using:
import torch
print(torch.cuda.is_available())
# True
print("CUDA version:", torch.version.cuda)
# CUDA version: 12.1
print("cuDNN version:", torch.backends.cudnn.version())
# cuDNN version: 8902
print("GPU device:", torch.cuda.get_device_name(0))
# GPU device: Tesla T4
The pre-built wheels here only include the CUDA runtime components necessary for PyTorch's operations. They do not contain the complete CUDA Toolkit, which would be required to build PyTorch from source or to develop custom CUDA extensions. In the following sections, we'll explore Docker images for this use case.
It's important to note that serving models using PyTorch CUDA wheels within Docker containers, while functional, may not be the most efficient strategy for production environments. This holds true even for GPU-accelerated setups. For production deployments, more optimized serving strategies, such as those utilizing onnxruntime-gpu, are generally recommended. In upcoming articles, we will explore advanced model serving techniques and best practices for production environments.
When using CUDA-enabled frameworks, ensure compatibility between your environment components. The wheel's CUDA version should match your host's NVIDIA driver capabilities, and its Python version must align with your container's Python environment. These alignments prevent compatibility issues and runtime errors.
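A quick way to sanity-check these alignments is sketched below. As a rule of thumb, the CUDA version reported by nvidia-smi (the maximum the installed driver supports) should cover the wheel's CUDA version, and the container's Python must match the wheel's cp tag; the [image name] placeholder stands for your own image.
# Maximum CUDA version supported by the installed driver (should cover 12.1 for this wheel)
nvidia-smi | grep "CUDA Version"
# Python inside the container must match the wheel's cp312 tag (CPython 3.12)
docker run --rm [image name] python --version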
Comprehensive Development Containers
While pre-built wheels suffice for many use cases, more complex scenarios may require an entire CUDA development environment. These scenarios include:
Absence of pre-built wheels for your specific CUDA version or use case.
Need to build deep learning frameworks (like PyTorch) from source.
Development of custom CUDA extensions.
In these cases, containers need more than just the CUDA runtime; they require the complete CUDA toolkit, including headers, the NVIDIA CUDA Compiler (NVCC), and other development tools. Setting up and maintaining such an environment can be challenging. Fortunately, NVIDIA provides specialized Docker images to address these needs.
NVIDIA CUDA Docker Image Variants
The nvidia/cuda repository on Docker Hub offers a variety of Docker image variants to address the above-mentioned needs. The “main” variants, as the documentation states, are:
base: Includes the CUDA runtime (cudart).
runtime: Builds on base and adds the CUDA math libraries and NCCL.
devel: Builds on runtime and adds headers and development tools for building CUDA applications. These images are particularly useful for multi-stage builds.
Additionally, there are cudnn variants that add the GPU-accelerated library of primitives for deep neural networks (cuDNN). The most “packed” image variant, in this case, is cudnn-devel.
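To illustrate how the devel and runtime variants combine in a multi-stage build, here is a minimal, hypothetical Dockerfile sketch: a CUDA source file is compiled with NVCC in a devel stage, and only the resulting binary is copied into a slimmer runtime-based image. The file name vector_add.cu and the image tags are assumptions for illustration.
# Build stage: the devel variant provides nvcc, headers, and static libraries
FROM nvidia/cuda:12.5.1-devel-ubuntu22.04 AS builder
WORKDIR /src
# vector_add.cu is a hypothetical CUDA source file in the build context
COPY vector_add.cu .
RUN nvcc -O2 -o vector_add vector_add.cu
# Final stage: the runtime variant keeps the image much smaller
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04
COPY --from=builder /src/vector_add /usr/local/bin/vector_add
CMD ["vector_add"]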
CUDA Environment with PyTorch Built from Source
Having examined the NVIDIA CUDA image variants, we can now construct a Dockerfile tailored for a development workbench with a comprehensive CUDA environment. When designing a development-oriented workbench, we must address a few key points. These include incorporating the full CUDA Toolkit with all necessary tools and libraries, managing the Python runtime environment precisely, and integrating desired deep learning frameworks like PyTorch with their specific build requirements.
Considering all these points, we can start constructing a Dockerfile that provides the desired environment for experimentation.
FROM alpine/git AS pytorch-source
# This command clones the default branch. For reproducibility, consider pinning a release tag or commit,
# e.g.: git clone --branch v2.3.1 --depth 1 --recursive https://github.com/pytorch/pytorch.git pytorch
RUN git clone --depth 1 --recursive https://github.com/pytorch/pytorch.git pytorch && \
cd pytorch && \
git submodule sync && \
git submodule update --init --recursive && \
cd ..
FROM nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
# Install build essentials for Python (needed for pyenv) and PyTorch, plus common dev tools.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
cmake \
curl \
git \
libbz2-dev \
libffi-dev \
liblzma-dev \
libmkl-full-dev \
libncurses5-dev \
libncursesw5-dev \
libreadline-dev \
libsqlite3-dev \
libssl-dev \
libxml2-dev \
libxmlsec1-dev \
llvm \
ninja-build \
openssh-client \
tk-dev \
wget \
xz-utils \
zlib1g-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Here pyenv is used to create an isolated Python environment, separate from the system Python. This approach
# ensures system stability, prevents conflicts with Ubuntu's built-in tools, and provides flexibility in choosing
# Python versions.
# pyenv/poetry setup
ENV PYENV_GIT_TAG="v2.4.7" \
PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH" \
PYTHON_VERSION=3.12.3 \
PIP_INSTALL_VERSION=24.1.2 \
POETRY_VERSION=1.8.3
WORKDIR /app
RUN curl https://pyenv.run | bash && \
pyenv install $PYTHON_VERSION && \
pyenv local $PYTHON_VERSION && \
pip install --upgrade pip==$PIP_INSTALL_VERSION && \
pip install --no-cache-dir poetry==$POETRY_VERSION && \
pip cache purge
# PyTorch build configuration:
# https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
# TORCH_CUDA_ARCH_LIST specifies target GPU architectures (e.g., 7.5 for Turing / sm_75)
# Adjust MAX_JOBS based on available system resources
# Remove DEBUG flag if not needed.
ENV USE_CUDA=1 \
USE_CUDNN=1 \
CMAKE_PREFIX_PATH="/root/.pyenv/versions/$PYTHON_VERSION" \
TORCH_CUDA_ARCH_LIST="7.5" \
MAX_JOBS=12 \
DEBUG=1
COPY pyproject.toml poetry.toml poetry.lock ./
COPY --from=pytorch-source git/pytorch pytorch
# Install project dependencies from lock file for reproducibility, then build PyTorch from source.
# PyTorch is installed via pip to improve build time. We don't update the .lock file for PyTorch
# and its dependencies as we're not redistributing this specific built PyTorch version. This approach
# balances reproducibility for PROJECT dependencies with build efficiency for PyTorch.
RUN poetry install --no-root && \
. .venv/bin/activate && \
pip install -v ./pytorch && \
deactivate && \
rm -rf pytorch
EXPOSE 8888
CMD ["poetry", "run", "jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
To verify the setup, we can run the following version checks inside the Docker container:
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print("CUDA version:", torch.version.cuda)
CUDA version: 12.5
>>> print("cuDNN version:", torch.backends.cudnn.version())
cuDNN version: 90201
>>> print("GPU device:", torch.cuda.get_device_name(0))
GPU device: Tesla T4
As expected, the CUDA version (12.5) matches the one specified in our NVIDIA Docker image variant. The Dockerfile has several nice traits for development workbenches:
Customizable PyTorch source: allows targeting specific PyTorch versions or commits.
Flexible Python environment: uses pyenv for precise Python version control, independent of the base image.
Source-built PyTorch: builds PyTorch from source, targeted at the specified CUDA and Python setup.
Development tools: includes standard tools (Poetry, pyenv, git) for development and experimentation.
Important considerations: Building this Dockerfile can take a few hours depending on your machine and requires substantial computational resources. The resulting Docker image is also quite large (19.3GB for the Dockerfile provided above). This setup is therefore intended for experimentation environments only, not for production environments where image size and build time are important factors.
Conclusions
The NVIDIA Container Toolkit and the CUDA Toolkit provide a straightforward way for containers to leverage GPU acceleration. For Python-based deep learning frameworks, pre-built CUDA wheels offer a simple solution for many GPU acceleration scenarios, while NVIDIA's more comprehensive base Docker images enable the creation of extensive development environments for more advanced use cases. The Dockerfile provided above can serve as a practical starting point for a CUDA PyTorch development environment.