Running GPUStack with NVIDIA MIG: A Deep Dive into Multi-Instance GPU Orchestration

Multi-Instance GPU (MIG) technology promises to maximize GPU utilization by partitioning a single GPU into isolated instances. But getting MIG to work with container orchestration tools like GPUStack requires navigating a maze of CDI configuration, device enumeration, and runtime patches. This technical deep-dive shares our battle-tested solutions.

NVIDIA's Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, enables a single physical GPU to be partitioned into up to seven isolated instances. Each instance has dedicated compute resources, memory bandwidth, and L2 cache - making it ideal for multi-tenant AI inference workloads.

Yet deploying MIG with container orchestration tools remains challenging. The intersection of CDI (Container Device Interface), NVIDIA container runtime, and application-level device enumeration creates a complex web of potential failure points.

This article documents our journey deploying GPUStack with MIG on NVIDIA H200 NVL GPUs, including:

  • 8 distinct bugs discovered and fixed
  • Runtime patches for GPUStack and vLLM
  • Complete automation scripts for production deployment

The MIG Value Proposition

Before diving into implementation, let's understand why MIG matters for AI infrastructure.

Traditional GPU Allocation

Without MIG, GPUs are allocated as whole units:

┌─────────────────────────────────────────────────────────────┐
│                    TRADITIONAL GPU ALLOCATION               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPU 0 (H200 - 141GB)        GPU 1 (H200 - 141GB)         │
│   ┌─────────────────┐         ┌─────────────────┐          │
│   │                 │         │                 │          │
│   │   Model A       │         │   Model B       │          │
│   │   (uses 20GB)   │         │   (uses 35GB)   │          │
│   │                 │         │                 │          │
│   │   ░░░░░░░░░░    │         │   ░░░░░░░░░░    │          │
│   │   121GB WASTED  │         │   106GB WASTED  │          │
│   │                 │         │                 │          │
│   └─────────────────┘         └─────────────────┘          │
│                                                             │
│   Utilization: ~20%            Utilization: ~25%           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

MIG-Enabled Allocation

With MIG, a single GPU can serve multiple isolated workloads:

┌─────────────────────────────────────────────────────────────┐
│                    MIG-ENABLED GPU ALLOCATION               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPU 0 (H200 - Full)         GPU 1 (H200 - MIG Enabled)   │
│   ┌─────────────────┐         ┌─────────────────┐          │
│   │                 │         │ ┌─────────────┐ │          │
│   │   Large Model   │         │ │ 4g.71gb     │ │ Model B  │
│   │   (needs full   │         │ │ (71GB)      │ │          │
│   │    GPU memory)  │         │ └─────────────┘ │          │
│   │                 │         │ ┌───────┐       │          │
│   │                 │         │ │2g.35gb│       │ Model C  │
│   │                 │         │ │(35GB) │       │          │
│   │                 │         │ └───────┘       │          │
│   │                 │         │ ┌────┐          │          │
│   │                 │         │ │1g  │          │ Model D  │
│   │                 │         │ │18GB│          │          │
│   │                 │         │ └────┘          │          │
│   └─────────────────┘         └─────────────────┘          │
│                                                             │
│   Utilization: 100%            Utilization: ~90%           │
│   (workload needs it)          (3 isolated workloads)      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

MIG Profile Sizes (H200 NVL)

Profile      GPU Memory   GPU Slices   Compute    Use Case
1g.18gb      18 GB        1/7          1 SM       Small inference, testing
2g.35gb      35 GB        2/7          2 SMs      Medium models (7B params)
3g.47gb      47 GB        3/7          3 SMs      Large models (13B params)
4g.71gb      71 GB        4/7          4 SMs      Very large models (30B+)
7g.141gb     141 GB       7/7          All SMs    Full GPU (no partitioning)
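
For reference, the partition layout used later in this article can be created with nvidia-smi. This is a minimal sketch assuming GPU index 1 and the profile names above; the profiles your GPU actually supports can be listed with nvidia-smi mig -lgip.

# Enable MIG mode on GPU 1 (requires a GPU reset; stop all workloads on it first)
sudo nvidia-smi -i 1 -mig 1

# Create 4g.71gb, 2g.35gb, and 1g.18gb GPU instances, plus default compute instances (-C)
sudo nvidia-smi mig -i 1 -cgi 4g.71gb,2g.35gb,1g.18gb -C

# Confirm the MIG devices and note their UUIDs
nvidia-smi -L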

Our Environment

Hardware Configuration

┌─────────────────────────────────────────────────────────────┐
│                    HARDWARE SETUP                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Server: ai-1                                              │
│                                                             │
│   ┌─────────────────────────┐  ┌─────────────────────────┐ │
│   │        GPU 0            │  │        GPU 1            │ │
│   │   NVIDIA H200 NVL       │  │   NVIDIA H200 NVL       │ │
│   │   MIG: DISABLED         │  │   MIG: ENABLED          │ │
│   │   Memory: 143 GB        │  │   Memory: 141 GB total  │ │
│   │                         │  │                         │ │
│   │   UUID: GPU-aaaaaaaa-   │  │   UUID: GPU-11111111-   │ │
│   │         bbbb-cccc-...   │  │         2222-3333-...   │ │
│   │                         │  │                         │ │
│   │   ┌─────────────────┐   │  │   ┌─────────────────┐   │ │
│   │   │   Full GPU      │   │  │   │ MIG 4g.71gb     │   │ │
│   │   │   Available     │   │  │   │ 71 GB           │   │ │
│   │   └─────────────────┘   │  │   │ MIG-xxxxxxxx-.. │   │ │
│   │                         │  │   └─────────────────┘   │ │
│   │                         │  │   ┌───────────┐         │ │
│   │                         │  │   │ MIG 2g.35gb│        │ │
│   │                         │  │   │ 35 GB     │         │ │
│   │                         │  │   │ MIG-abc60c│         │ │
│   │                         │  │   └───────────┘         │ │
│   │                         │  │   ┌─────┐               │ │
│   │                         │  │   │1g.18│ 18 GB        │ │
│   │                         │  │   │MIG- │               │ │
│   │                         │  │   └─────┘               │ │
│   └─────────────────────────┘  └─────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Software Stack

Component                  Version
GPUStack                   v2.0.3
gpustack-runtime           v0.1.38.post4
vLLM                       0.13.0
NVIDIA Driver              590.48.01
NVIDIA Container Toolkit   Latest
Container Runtime          Docker with nvidia-container-runtime

The Problem: Eight Distinct Failures

When we first attempted to deploy models on MIG devices through GPUStack, we encountered a cascade of failures. Each fix revealed another underlying issue - a classic "peeling the onion" debugging experience.

Failure Cascade Overview

┌─────────────────────────────────────────────────────────────┐
│                    FAILURE CASCADE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. CDI Vendor Mismatch                                    │
│      │  "unresolvable CDI devices runtime.nvidia.com/gpu"   │
│      ▼                                                      │
│   2. CDI Device Naming                                      │
│      │  Indices don't match parent:child format             │
│      ▼                                                      │
│   3. MIG Temperature/Power Queries                          │
│      │  pynvml.NVMLError on MIG device queries              │
│      ▼                                                      │
│   4. MIG Enumeration NotFound                               │
│      │  Non-contiguous MIG indices throw errors             │
│      ▼                                                      │
│   5. MIG Name Reuse Bug                                     │
│      │  All MIG devices show same name                      │
│      ▼                                                      │
│   6. MIG Index Collision                                    │
│      │  MIG indices start at 0, collide with non-MIG GPU    │
│      ▼                                                      │
│   7. CUDA_VISIBLE_DEVICES Index Mismatch                    │
│      │  GPUStack indices don't match CUDA enumeration       │
│      ▼                                                      │
│   8. vLLM UUID Parsing                                      │
│      │  "ValueError: invalid literal for int()"             │
│      ▼                                                      │
│   ✓ SUCCESS: Models deploy on MIG devices                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Root Cause Analysis

Issue 1: CDI Vendor Mismatch

Symptom:

unresolvable CDI devices runtime.nvidia.com/gpu=2

Root Cause:

CDI (Container Device Interface) uses a vendor prefix to namespace devices. The default NVIDIA CDI generation creates devices under nvidia.com/gpu, but GPUStack requests devices using runtime.nvidia.com/gpu.

Default CDI Output:

# nvidia.com/gpu - DEFAULT (wrong)
nvidia.com/gpu=0
nvidia.com/gpu=1

Required CDI Output:

# runtime.nvidia.com/gpu - REQUIRED
runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1

Fix:

nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml
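
A quick way to confirm the regenerated spec carries the new vendor prefix (sanity checks, not part of any official workflow):

# The spec's kind field should use the new vendor
grep -m1 "^kind:" /etc/cdi/nvidia.yaml
# kind: runtime.nvidia.com/gpu

# Enumerated device names should start with runtime.nvidia.com/gpu=
nvidia-ctk cdi list | head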

Issue 2: CDI Device Naming Strategy

Symptom:

GPUStack requests MIG devices by index (e.g., runtime.nvidia.com/gpu=2), but CDI generates MIG devices using parent:child notation (e.g., runtime.nvidia.com/gpu=1:0).

Analysis:

MIG devices exist within a parent GPU context. The CDI default naming reflects this hierarchy:

GPU 0 (no MIG)     → gpu=0
GPU 1 (MIG parent) → gpu=1
  MIG instance 0   → gpu=1:0
  MIG instance 1   → gpu=1:1
  MIG instance 2   → gpu=1:2

But GPUStack's device allocation uses flat indices and UUIDs.

Fix:

Generate CDI with both index and UUID naming strategies:

# The second --device-name-strategy adds UUID-based device names alongside index-based ones
nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml

Resulting CDI Devices:

runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=all

Issue 3: MIG Temperature/Power Query Failures

Symptom:

pynvml.NVMLError: NVML_ERROR_NOT_SUPPORTED

Root Cause:

The pynvml library (Python bindings for NVIDIA Management Library) throws errors when querying temperature and power for MIG device handles. MIG instances don't support these queries - only the parent GPU does.

Problematic Code (gpustack_runtime/detector/nvidia.py):

# These calls fail for MIG devices
mdev_temp = pynvml.nvmlDeviceGetTemperature(mdev, pynvml.NVML_TEMPERATURE_GPU)
mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000

Fix:

Wrap queries in contextlib.suppress to gracefully handle failures:

import contextlib

mdev_temp = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_temp = pynvml.nvmlDeviceGetTemperature(
        mdev,
        pynvml.NVML_TEMPERATURE_GPU,
    )

mdev_power_used = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000
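
If you still want a reading for dashboards, one option outside the GPUStack patch is to fall back to the parent GPU's sensors. A sketch, assuming your pynvml version exposes nvmlDeviceGetDeviceHandleFromMigDeviceHandle:

import contextlib
import pynvml

def parent_gpu_temperature(mdev):
    """Best-effort temperature of the physical GPU that hosts a MIG device handle."""
    with contextlib.suppress(pynvml.NVMLError):
        parent = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(mdev)
        return pynvml.nvmlDeviceGetTemperature(parent, pynvml.NVML_TEMPERATURE_GPU)
    return None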

Issue 4: MIG Enumeration NotFound Errors

Symptom:

pynvml.NVMLError_NotFound during MIG device enumeration

Root Cause:

MIG device indices can be non-contiguous. If you create MIG instances 0, 1, 2, then delete instance 1, you're left with indices 0 and 2. The code assumed contiguous indexing:

mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)  # Throws NotFound!

Fix:

Add try/except handling to skip missing indices:

for mdev_idx in range(mdev_count):
    try:
        mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)
    except pynvml.NVMLError_NotFound:
        continue  # Skip non-existent MIG indices

Issue 5: MIG Name Reuse Bug

Symptom:

All MIG devices displayed the same name (the first MIG device's profile name).

Root Cause:

The mdev_name variable was initialized outside the MIG device loop and never reset:

mdev_name = ""  # Initialized once
mdev_cores = 1
mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    # mdev_name keeps the value from previous iteration!
    if some_condition:
        mdev_name = profile_name  # Only set conditionally

Fix:

Reset mdev_name inside the loop:

mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev_name = ""  # Reset for each MIG device
    mdev_cores = 1
    # ... rest of loop

Issue 6: MIG Index Collision

Symptom:

GPUStack showed MIG devices with indices 0, 1, 2 - but index 0 was already used by the non-MIG GPU. This caused confusion and potential device selection errors.

idx=0 name=NVIDIA H200 NVL    # Non-MIG GPU
idx=0 name=4g.71gb            # MIG device - COLLISION!
idx=1 name=2g.35gb
idx=2 name=1g.18gb

Root Cause:

MIG device index assignment used the local MIG index (0, 1, 2...) instead of a global index that accounts for non-MIG devices:

mdev_index = mdev_idx  # Local MIG index, starts at 0

Fix:

Use len(ret) to start MIG indices after all previously detected devices:

mig_global_idx = len(ret)  # ret = devices detected so far, so MIG indices continue after them
for mdev_idx in range(mdev_count):
    mdev_index = mig_global_idx  # Global, collision-free index
    mig_global_idx += 1

Corrected Output:

idx=0 name=NVIDIA H200 NVL    # Non-MIG GPU
idx=1 name=4g.71gb            # MIG device (unique index)
idx=2 name=2g.35gb
idx=3 name=1g.18gb
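
Putting fixes 3 through 6 together, the patched enumeration looks roughly like this. It is a simplified sketch of the gpustack-runtime logic, not the verbatim upstream code; device records are reduced to small dicts for readability.

import contextlib
import pynvml

pynvml.nvmlInit()
ret = []  # all detected devices; whole GPUs first, then MIG instances

for dev_idx in range(pynvml.nvmlDeviceGetCount()):
    dev = pynvml.nvmlDeviceGetHandleByIndex(dev_idx)

    try:
        mig_mode, _pending = pynvml.nvmlDeviceGetMigMode(dev)
    except pynvml.NVMLError:
        mig_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU without MIG support

    if mig_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
        # Whole GPU: record it as a single device
        ret.append({"index": len(ret),
                    "name": pynvml.nvmlDeviceGetName(dev),
                    "uuid": pynvml.nvmlDeviceGetUUID(dev)})
        continue

    # MIG-enabled GPU: enumerate its instances
    mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
    mig_global_idx = len(ret)                          # Fix 6: indices continue after existing devices
    for mdev_idx in range(mdev_count):
        try:
            mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)
        except pynvml.NVMLError_NotFound:
            continue                                   # Fix 4: tolerate non-contiguous MIG indices

        mdev_name = pynvml.nvmlDeviceGetName(mdev)     # Fix 5: resolved per device, never reused

        mdev_temp = None                               # Fix 3: MIG handles reject temp/power queries
        with contextlib.suppress(pynvml.NVMLError):
            mdev_temp = pynvml.nvmlDeviceGetTemperature(mdev, pynvml.NVML_TEMPERATURE_GPU)

        ret.append({"index": mig_global_idx,
                    "name": mdev_name,
                    "uuid": pynvml.nvmlDeviceGetUUID(mdev),
                    "temperature": mdev_temp})
        mig_global_idx += 1

for device in ret:
    print(device)
pynvml.nvmlShutdown()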

Issue 7: CUDA_VISIBLE_DEVICES Index Mismatch

Symptom:

Model containers failed to start with errors about invalid CUDA device indices.

Root Cause:

GPUStack assigns device indices 0, 1, 2, 3 to all detected devices (1 non-MIG GPU + 3 MIG devices). But CUDA's device enumeration is different - it only sees:

  • Device 0: The non-MIG GPU
  • Device 1: The MIG parent GPU (with MIG instances accessible via UUIDs)

When GPUStack sets CUDA_VISIBLE_DEVICES=2, CUDA fails because it doesn't have a device 2.

The Index Translation Problem:

┌─────────────────────────────────────────────────────────────┐
│               INDEX MISMATCH PROBLEM                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPUStack View:              CUDA View:                    │
│   ┌───────────────┐           ┌───────────────┐            │
│   │ idx=0: H200   │     →     │ device=0: H200│            │
│   │ idx=1: 4g.71gb│     ?     │ device=1: MIG │ (parent)   │
│   │ idx=2: 2g.35gb│     ?     │               │            │
│   │ idx=3: 1g.18gb│           │  NO device 2  │            │
│   └───────────────┘           │  NO device 3  │            │
│                               └───────────────┘            │
│                                                             │
│   CUDA_VISIBLE_DEVICES=2  →  ERROR: invalid device         │
│   CUDA_VISIBLE_DEVICES=MIG-uuid  →  SUCCESS                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Fix:

Configure GPUStack to use UUIDs instead of indices for CUDA_VISIBLE_DEVICES:

# Old code: maps index to sequential index
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}

# New code: maps index to UUID
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}

Enable via environment variable:

GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES

Issue 8: vLLM UUID Parsing

Symptom:

ValueError: invalid literal for int() with base 10: 'MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'

Root Cause:

After fixing Issue 7, GPUStack correctly sets CUDA_VISIBLE_DEVICES to the MIG UUID. But vLLM's device mapping code assumes this environment variable contains integers:

# vllm/platforms/interface.py (simplified)
def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    return int(physical_device_id)  # Fails with UUID!

Fix:

Patch vLLM to handle UUID values gracefully:

def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    try:
        return int(physical_device_id)
    except ValueError:
        # UUID format (e.g., MIG-xxx) - CUDA has already
        # remapped devices, so return the local device_id
        return device_id
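
A standalone reproduction of the patched logic (a hypothetical helper, not vLLM's actual function) shows why UUID values are now handled:

import os

def physical_device(device_id: int):
    """Mirror the patched mapping: integers pass through, UUIDs fall back to the local id."""
    value = os.environ["CUDA_VISIBLE_DEVICES"].split(",")[device_id]
    try:
        return int(value)
    except ValueError:
        return device_id

os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
print(physical_device(0))  # 0 - CUDA has already remapped the MIG instance to local device 0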

The Complete Solution

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    GPUSTACK + MIG ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                     HOST SYSTEM                             │   │
│   │   ┌─────────────────────────────────────────────────────┐   │   │
│   │   │  CDI Configuration (/etc/cdi/nvidia.yaml)           │   │   │
│   │   │  - vendor: runtime.nvidia.com                       │   │   │
│   │   │  - strategies: index + uuid                         │   │   │
│   │   └─────────────────────────────────────────────────────┘   │   │
│   │   ┌─────────────────────────────────────────────────────┐   │   │
│   │   │  NVIDIA Container Runtime Config                    │   │   │
│   │   │  - default-kind: runtime.nvidia.com/gpu             │   │   │
│   │   └─────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                  gpustack-worker Container                  │   │
│   │   ┌───────────────────────────────────────────────────┐     │   │
│   │   │  gpustack-runtime (PATCHED)                       │     │   │
│   │   │  - MIG temp/power: contextlib.suppress            │     │   │
│   │   │  - MIG enumeration: NotFound handling             │     │   │
│   │   │  - MIG naming: reset per device                   │     │   │
│   │   │  - MIG indexing: global indices                   │     │   │
│   │   │  - CUDA alignment: UUID-based                     │     │   │
│   │   └───────────────────────────────────────────────────┘     │   │
│   │   Environment Variables:                                    │   │
│   │   - NVIDIA_VISIBLE_DEVICES=all                              │   │
│   │   - GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE   │   │
│   │     _UUID=NVIDIA_VISIBLE_DEVICES                            │   │
│   │   - GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE   │   │
│   │     _ALIGNMENT=CUDA_VISIBLE_DEVICES                         │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              │ Spawns                               │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │              vLLM Runner Container (PATCHED)                │   │
│   │   ┌───────────────────────────────────────────────────┐     │   │
│   │   │  vllm (PATCHED)                                   │     │   │
│   │   │  - UUID handling in CUDA_VISIBLE_DEVICES          │     │   │
│   │   └───────────────────────────────────────────────────┘     │   │
│   │   Environment:                                              │   │
│   │   - NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-...                │   │
│   │   - CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-...                  │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    MIG Device Instance                      │   │
│   │   Profile: 4g.71gb | Memory: 71GB | Isolated Compute        │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Fix Implementation

We've automated the entire fix into a single script. Here's the breakdown:

Step 1: CDI Regeneration

nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml

Step 2: Container Runtime Configuration

sed -i 's|default-kind = "nvidia.com/gpu"|default-kind = "runtime.nvidia.com/gpu"|' \
  /etc/nvidia-container-runtime/config.toml
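
To confirm the change took effect (the default-kind key typically sits under the [nvidia-container-runtime.modes.cdi] section):

grep "default-kind" /etc/nvidia-container-runtime/config.toml
# default-kind = "runtime.nvidia.com/gpu"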

Step 3: Build Patched vLLM Runner Image

FROM gpustack/runner:cuda12.9-vllm0.13.0

# Patch vLLM to handle UUID values in CUDA_VISIBLE_DEVICES
RUN sed -i 's/return int(physical_device_id)/# VLLM_UUID_FIX\n            try:\n                return int(physical_device_id)\n            except ValueError:\n                return device_id/' \
    /usr/local/lib/python3.12/dist-packages/vllm/platforms/interface.py && \
    python3 -c "from vllm.platforms.interface import Platform; print('vLLM patch verified')"
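
Build and smoke-test the patched image. The tag below is only an example; reference whatever tag your GPUStack deployments expect for the vLLM runner:

docker build -t gpustack/runner:cuda12.9-vllm0.13.0-mig-fix .

# The import should succeed and the VLLM_UUID_FIX marker should be present in the patched file
docker run --rm --entrypoint python3 gpustack/runner:cuda12.9-vllm0.13.0-mig-fix -c "
import vllm.platforms.interface as i
assert 'VLLM_UUID_FIX' in open(i.__file__).read(), 'patch marker missing'
print('patched runner OK')"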

Step 4: Deploy GPUStack Worker with MIG Environment Variables

docker run -d \
  --name gpustack-worker \
  --hostname gpustack-worker \
  --restart unless-stopped \
  --network host \
  --runtime nvidia \
  --privileged \
  --shm-size 64m \
  -v /data/cache:/var/lib/gpustack/cache \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /data/models:/data/models \
  -v gpustack-data:/var/lib/gpustack \
  -e "GPUSTACK_TOKEN=$TOKEN" \
  -e "NVIDIA_DISABLE_REQUIRE=true" \
  -e "NVIDIA_VISIBLE_DEVICES=all" \
  -e "NVIDIA_DRIVER_CAPABILITIES=compute,utility" \
  -e "GPUSTACK_RUNTIME_DEPLOY_MIRRORED_DEPLOYMENT=true" \
  -e "GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE_UUID=NVIDIA_VISIBLE_DEVICES" \
  -e "GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES" \
  gpustack/gpustack:v2.0.3 \
  --server-url http://$SERVER_IP \
  --worker-ip $WORKER_IP \
  --worker-port 10170 \
  --worker-metrics-port 10171

Step 5: Apply Runtime Patches

The gpustack-runtime patches must be applied inside the running container. Here's the CUDA UUID alignment patch:

# Patch location: /usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

# Old code:
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}

# New code:
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}
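
One way to script the alignment patch shown above (a sketch; it assumes the target line still matches the shipped gpustack-runtime source exactly, so verify with grep before and after):

TYPES=/usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

docker exec gpustack-worker sed -i \
  's/alignment = {dev_indexes\[i\]: str(i) for i in range(len(devs))}/alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}/' \
  "$TYPES"

# Confirm the replacement, then restart so the patched module is reloaded
docker exec gpustack-worker grep -n 'dev_uuids\[i\]' "$TYPES"
docker restart gpustack-worker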

Verification

After applying all fixes, verify the setup:

GPU Detection Test

docker exec gpustack-worker python3 -c "
from gpustack_runtime.detector.nvidia import NVIDIADetector
det = NVIDIADetector()
result = det.detect()
for d in result:
    mem = getattr(d, 'memory', '?')
    uuid = getattr(d, 'uuid', '?')
    print(f'idx={d.index} name={d.name} mem={mem}MB uuid={uuid}')
"

Expected Output:

idx=0 name=NVIDIA H200 NVL mem=143771MB uuid=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
idx=1 name=4g.71gb mem=71424MB uuid=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
idx=2 name=2g.35gb mem=33280MB uuid=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
idx=3 name=1g.18gb mem=16384MB uuid=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz

vLLM MIG Test

docker run --rm --runtime nvidia \
  -e "NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  -e "CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  gpustack/runner:cuda12.9-vllm0.13.0 python3 -c "
import torch
print(f'CUDA devices: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'  {i}: {torch.cuda.get_device_name(i)}')
from vllm.platforms.cuda import CudaPlatform
print('vLLM imported successfully!')
"

Expected Output:

CUDA devices: 1
  0: NVIDIA H200 NVL
vLLM imported successfully!

CDI Verification

nvidia-ctk cdi list

Expected Output (includes UUIDs):

runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=all

Operational Considerations

Patch Persistence

Component                  Persistence         Notes
CDI configuration          Survives reboots    Written to /etc/cdi/nvidia.yaml
Container runtime config   Survives reboots    Written to /etc/nvidia-container-runtime/config.toml
vLLM runner image          Permanent           Baked into Docker image
gpustack-runtime patches   Survives restarts   Lost on container recreation

Important: If you docker rm and recreate the gpustack-worker container, you must reapply the runtime patches. However, docker restart preserves them.
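
A small guard helps catch the recreation case (a sketch building on the Step 5 patch above):

PATCHED_FILE=/usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

if ! docker exec gpustack-worker grep -q 'dev_uuids\[i\]' "$PATCHED_FILE"; then
  echo "gpustack-runtime patches are missing - reapply them (Step 5) and restart the worker"
fi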

Monitoring MIG Devices

MIG devices have limited monitoring capabilities compared to full GPUs:

Metric         Full GPU   MIG Device
Temperature    Yes        No (parent GPU only)
Power Usage    Yes        No (parent GPU only)
Memory Usage   Yes        Yes
Utilization    Yes        Yes
Process List   Yes        Yes

MIG Profile Selection Guidelines

Model Size         Recommended Profile   Notes
< 5B params        1g.18gb               Small inference tasks
5B - 13B params    2g.35gb               Typical 7B model with context
13B - 30B params   4g.71gb               Larger models, batch inference
> 30B params       Full GPU              Disable MIG for this GPU
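
The sizing above follows from simple arithmetic: FP16/BF16 weights take roughly 2 bytes per parameter, plus headroom for KV cache and activations. A rough sketch, not accounting for quantization, context length, or batch size:

def fp16_weight_gb(params_billion: float) -> float:
    """Approximate FP16/BF16 weight footprint in GB (2 bytes per parameter)."""
    return params_billion * 2.0

for params, profile in ((7, "2g.35gb"), (13, "4g.71gb"), (30, "4g.71gb or full GPU")):
    weights = fp16_weight_gb(params)
    # Leave 30-50% headroom on top of the weights for KV cache, activations, and CUDA overhead
    print(f"{params}B params: ~{weights:.0f} GB weights -> {profile}")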

Key Takeaways

  1. MIG + Container Orchestration is complex. The intersection of CDI, NVIDIA container runtime, and application-level device enumeration creates multiple potential failure points.

  2. Vendor prefixes matter. CDI's vendor namespace (nvidia.com vs runtime.nvidia.com) must match what your orchestrator requests.

  3. Device naming strategies must align. MIG devices can be addressed by parent:child index or UUID. Your orchestrator and runtime must agree on which to use.

  4. pynvml has MIG limitations. Not all NVML queries work on MIG device handles. Wrap potentially failing calls in error handlers.

  5. Index enumeration differs between layers. GPUStack, CUDA, and CDI may all enumerate devices differently. UUID-based device selection is the most reliable approach.

  6. Runtime patches may be necessary. Both GPUStack-runtime and vLLM required patches to handle MIG correctly. These should eventually be upstreamed.

  7. Test the full stack. Verify GPU detection, CDI configuration, and actual model deployment. Each layer can fail independently.

  8. Document your patches. Runtime patches don't survive container recreation. Automate their application and document the process.


MIG is a powerful technology for maximizing GPU utilization in multi-tenant environments. With the right configuration and patches, GPUStack can effectively orchestrate inference workloads across MIG partitions - enabling more efficient use of expensive GPU hardware.

Frederico Vicente

AI Research Engineer