Running GPUStack with NVIDIA MIG: A Deep Dive into Multi-Instance GPU Orchestration

Multi-Instance GPU (MIG) technology promises to maximize GPU utilization by partitioning a single GPU into isolated instances. But getting MIG to work with container orchestration tools like GPUStack requires navigating a maze of CDI configuration, device enumeration, and runtime patches. This technical deep-dive shares our battle-tested solutions.

NVIDIA's Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, enables a single physical GPU to be partitioned into up to seven isolated instances. Each instance has dedicated compute resources, memory bandwidth, and L2 cache - making it ideal for multi-tenant AI inference workloads.

Yet deploying MIG with container orchestration tools remains challenging. The intersection of CDI (Container Device Interface), NVIDIA container runtime, and application-level device enumeration creates a complex web of potential failure points.

This article documents our journey deploying GPUStack with MIG on NVIDIA H200 NVL GPUs, including:

  • 8 distinct bugs discovered and fixed
  • Runtime patches for GPUStack and vLLM
  • Complete automation scripts for production deployment

The MIG Value Proposition

Before diving into implementation, let's understand why MIG matters for AI infrastructure.

Traditional GPU Allocation

Without MIG, GPUs are allocated as whole units:

┌─────────────────────────────────────────────────────────────┐
│                    TRADITIONAL GPU ALLOCATION               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPU 0 (H200 - 141GB)        GPU 1 (H200 - 141GB)         │
│   ┌─────────────────┐         ┌─────────────────┐          │
│   │                 │         │                 │          │
│   │   Model A       │         │   Model B       │          │
│   │   (uses 20GB)   │         │   (uses 35GB)   │          │
│   │                 │         │                 │          │
│   │   ░░░░░░░░░░    │         │   ░░░░░░░░░░    │          │
│   │   121GB WASTED  │         │   106GB WASTED  │          │
│   │                 │         │                 │          │
│   └─────────────────┘         └─────────────────┘          │
│                                                             │
│   Utilization: ~20%            Utilization: ~25%           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

MIG-Enabled Allocation

With MIG, a single GPU can serve multiple isolated workloads:

┌─────────────────────────────────────────────────────────────┐
│                    MIG-ENABLED GPU ALLOCATION               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPU 0 (H200 - Full)         GPU 1 (H200 - MIG Enabled)   │
│   ┌─────────────────┐         ┌─────────────────┐          │
│   │                 │         │ ┌─────────────┐ │          │
│   │   Large Model   │         │ │ 4g.71gb     │ │ Model B  │
│   │   (needs full   │         │ │ (71GB)      │ │          │
│   │    GPU memory)  │         │ └─────────────┘ │          │
│   │                 │         │ ┌───────┐       │          │
│   │                 │         │ │2g.35gb│       │ Model C  │
│   │                 │         │ │(35GB) │       │          │
│   │                 │         │ └───────┘       │          │
│   │                 │         │ ┌────┐          │          │
│   │                 │         │ │1g  │          │ Model D  │
│   │                 │         │ │18GB│          │          │
│   │                 │         │ └────┘          │          │
│   └─────────────────┘         └─────────────────┘          │
│                                                             │
│   Utilization: 100%            Utilization: ~90%           │
│   (workload needs it)          (3 isolated workloads)      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

MIG Profile Sizes (H200 NVL)

Profile      GPU Memory   GPU Slices   Compute    Use Case
1g.18gb      18 GB        1/7          1 SM       Small inference, testing
2g.35gb      35 GB        2/7          2 SMs      Medium models (7B params)
3g.47gb      47 GB        3/7          3 SMs      Large models (13B params)
4g.71gb      71 GB        4/7          4 SMs      Very large models (30B+)
7g.141gb     141 GB       7/7          All SMs    Full GPU (no partitioning)
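
For reference, the partition layout used later in this article can be created with nvidia-smi. This is a minimal sketch assuming GPU index 1 and the profile names above; the profiles your GPU actually supports can be listed with nvidia-smi mig -lgip.

# Enable MIG mode on GPU 1 (requires a GPU reset; stop all workloads on it first)
sudo nvidia-smi -i 1 -mig 1

# Create 4g.71gb, 2g.35gb, and 1g.18gb GPU instances, plus default compute instances (-C)
sudo nvidia-smi mig -i 1 -cgi 4g.71gb,2g.35gb,1g.18gb -C

# Confirm the MIG devices and note their UUIDs
nvidia-smi -L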

Our Environment

Hardware Configuration

┌─────────────────────────────────────────────────────────────┐
│                    HARDWARE SETUP                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Server: ai-1                                              │
│                                                             │
│   ┌─────────────────────────┐  ┌─────────────────────────┐ │
│   │        GPU 0            │  │        GPU 1            │ │
│   │   NVIDIA H200 NVL       │  │   NVIDIA H200 NVL       │ │
│   │   MIG: DISABLED         │  │   MIG: ENABLED          │ │
│   │   Memory: 143 GB        │  │   Memory: 141 GB total  │ │
│   │                         │  │                         │ │
│   │   UUID: GPU-aaaaaaaa-   │  │   UUID: GPU-11111111-   │ │
│   │         bbbb-cccc-...   │  │         2222-3333-...   │ │
│   │                         │  │                         │ │
│   │   ┌─────────────────┐   │  │   ┌─────────────────┐   │ │
│   │   │   Full GPU      │   │  │   │ MIG 4g.71gb     │   │ │
│   │   │   Available     │   │  │   │ 71 GB           │   │ │
│   │   └─────────────────┘   │  │   │ MIG-xxxxxxxx-.. │   │ │
│   │                         │  │   └─────────────────┘   │ │
│   │                         │  │   ┌───────────┐         │ │
│   │                         │  │   │ MIG 2g.35gb│        │ │
│   │                         │  │   │ 35 GB     │         │ │
│   │                         │  │   │ MIG-abc60c│         │ │
│   │                         │  │   └───────────┘         │ │
│   │                         │  │   ┌─────┐               │ │
│   │                         │  │   │1g.18│ 18 GB        │ │
│   │                         │  │   │MIG- │               │ │
│   │                         │  │   └─────┘               │ │
│   └─────────────────────────┘  └─────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Software Stack

Component                  Version
GPUStack                   v2.0.3
gpustack-runtime           v0.1.38.post4
vLLM                       0.13.0
NVIDIA Driver              590.48.01
NVIDIA Container Toolkit   Latest
Container Runtime          Docker with nvidia-container-runtime

The Problem: Eight Distinct Failures

When we first attempted to deploy models on MIG devices through GPUStack, we encountered a cascade of failures. Each fix revealed another underlying issue - a classic "peeling the onion" debugging experience.

Failure Cascade Overview

┌─────────────────────────────────────────────────────────────┐
│                    FAILURE CASCADE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. CDI Vendor Mismatch                                    │
│      │  "unresolvable CDI devices runtime.nvidia.com/gpu"   │
│      ▼                                                      │
│   2. CDI Device Naming                                      │
│      │  Indices don't match parent:child format             │
│      ▼                                                      │
│   3. MIG Temperature/Power Queries                          │
│      │  pynvml.NVMLError on MIG device queries              │
│      ▼                                                      │
│   4. MIG Enumeration NotFound                               │
│      │  Non-contiguous MIG indices throw errors             │
│      ▼                                                      │
│   5. MIG Name Reuse Bug                                     │
│      │  All MIG devices show same name                      │
│      ▼                                                      │
│   6. MIG Index Collision                                    │
│      │  MIG indices start at 0, collide with non-MIG GPU    │
│      ▼                                                      │
│   7. CUDA_VISIBLE_DEVICES Index Mismatch                    │
│      │  GPUStack indices don't match CUDA enumeration       │
│      ▼                                                      │
│   8. vLLM UUID Parsing                                      │
│      │  "ValueError: invalid literal for int()"             │
│      ▼                                                      │
│   ✓ SUCCESS: Models deploy on MIG devices                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Root Cause Analysis

Issue 1: CDI Vendor Mismatch

Symptom:

unresolvable CDI devices runtime.nvidia.com/gpu=2

Root Cause:

CDI (Container Device Interface) uses a vendor prefix to namespace devices. The default NVIDIA CDI generation creates devices under nvidia.com/gpu, but GPUStack requests devices using runtime.nvidia.com/gpu.

Default CDI Output:

# nvidia.com/gpu - DEFAULT (wrong)
nvidia.com/gpu=0
nvidia.com/gpu=1

Required CDI Output:

# runtime.nvidia.com/gpu - REQUIRED
runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1

Fix:

nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml
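
A quick way to confirm the regenerated spec carries the new vendor prefix (sanity checks, not part of any official workflow):

# The spec's kind field should use the new vendor
grep -m1 "^kind:" /etc/cdi/nvidia.yaml
# kind: runtime.nvidia.com/gpu

# Enumerated device names should start with runtime.nvidia.com/gpu=
nvidia-ctk cdi list | head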

Issue 2: CDI Device Naming Strategy

Symptom:

GPUStack requests MIG devices by index (e.g., runtime.nvidia.com/gpu=2), but CDI generates MIG devices using parent:child notation (e.g., runtime.nvidia.com/gpu=1:0).

Analysis:

MIG devices exist within a parent GPU context. The CDI default naming reflects this hierarchy:

GPU 0 (no MIG)     → gpu=0
GPU 1 (MIG parent) → gpu=1
  MIG instance 0   → gpu=1:0
  MIG instance 1   → gpu=1:1
  MIG instance 2   → gpu=1:2

But GPUStack's device allocation uses flat indices and UUIDs.

Fix:

Generate CDI with both index and UUID naming strategies:

# The second --device-name-strategy adds UUID-based device names alongside index-based ones
nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml

Resulting CDI Devices:

runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=all

Issue 3: MIG Temperature/Power Query Failures

Symptom:

pynvml.NVMLError: NVML_ERROR_NOT_SUPPORTED

Root Cause:

The pynvml library (Python bindings for NVIDIA Management Library) throws errors when querying temperature and power for MIG device handles. MIG instances don't support these queries - only the parent GPU does.

Problematic Code (gpustack_runtime/detector/nvidia.py):

# These calls fail for MIG devices
mdev_temp = pynvml.nvmlDeviceGetTemperature(mdev, pynvml.NVML_TEMPERATURE_GPU)
mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000

Fix:

Wrap queries in contextlib.suppress to gracefully handle failures:

import contextlib

mdev_temp = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_temp = pynvml.nvmlDeviceGetTemperature(
        mdev,
        pynvml.NVML_TEMPERATURE_GPU,
    )

mdev_power_used = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000
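
If you still want a reading for dashboards, one option outside the GPUStack patch is to fall back to the parent GPU's sensors. A sketch, assuming your pynvml version exposes nvmlDeviceGetDeviceHandleFromMigDeviceHandle:

import contextlib
import pynvml

def parent_gpu_temperature(mdev):
    """Best-effort temperature of the physical GPU that hosts a MIG device handle."""
    with contextlib.suppress(pynvml.NVMLError):
        parent = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(mdev)
        return pynvml.nvmlDeviceGetTemperature(parent, pynvml.NVML_TEMPERATURE_GPU)
    return None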

Issue 4: MIG Enumeration NotFound Errors

Symptom:

pynvml.NVMLError_NotFound during MIG device enumeration

Root Cause:

MIG device indices can be non-contiguous. If you create MIG instances 0, 1, 2, then delete instance 1, you're left with indices 0 and 2. The code assumed contiguous indexing:

mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)  # Throws NotFound!

Fix:

Add try/except handling to skip missing indices:

for mdev_idx in range(mdev_count):
    try:
        mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)
    except pynvml.NVMLError_NotFound:
        continue  # Skip non-existent MIG indices

Issue 5: MIG Name Reuse Bug

Symptom:

All MIG devices displayed the same name (the first MIG device's profile name).

Root Cause:

The mdev_name variable was initialized outside the MIG device loop and never reset:

mdev_name = ""  # Initialized once
mdev_cores = 1
mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    # mdev_name keeps the value from previous iteration!
    if some_condition:
        mdev_name = profile_name  # Only set conditionally

Fix:

Reset mdev_name inside the loop:

mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev_name = ""  # Reset for each MIG device
    mdev_cores = 1
    # ... rest of loop

Issue 6: MIG Index Collision

Symptom:

GPUStack showed MIG devices with indices 0, 1, 2 - but index 0 was already used by the non-MIG GPU. This caused confusion and potential device selection errors.

idx=0 name=NVIDIA H200 NVL    # Non-MIG GPU
idx=0 name=4g.71gb            # MIG device - COLLISION!
idx=1 name=2g.35gb
idx=2 name=1g.18gb

Root Cause:

MIG device index assignment used the local MIG index (0, 1, 2...) instead of a global index that accounts for non-MIG devices:

mdev_index = mdev_idx  # Local MIG index, starts at 0

Fix:

Use len(ret) to start MIG indices after all previously detected devices:

mig_global_idx = len(ret)  # ret = devices detected so far, so MIG indices continue after them
for mdev_idx in range(mdev_count):
    mdev_index = mig_global_idx  # Global, collision-free index
    mig_global_idx += 1

Corrected Output:

idx=0 name=NVIDIA H200 NVL    # Non-MIG GPU
idx=1 name=4g.71gb            # MIG device (unique index)
idx=2 name=2g.35gb
idx=3 name=1g.18gb
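
Putting fixes 3 through 6 together, the patched enumeration looks roughly like this. It is a simplified sketch of the gpustack-runtime logic, not the verbatim upstream code; device records are reduced to small dicts for readability.

import contextlib
import pynvml

pynvml.nvmlInit()
ret = []  # all detected devices; whole GPUs first, then MIG instances

for dev_idx in range(pynvml.nvmlDeviceGetCount()):
    dev = pynvml.nvmlDeviceGetHandleByIndex(dev_idx)

    try:
        mig_mode, _pending = pynvml.nvmlDeviceGetMigMode(dev)
    except pynvml.NVMLError:
        mig_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU without MIG support

    if mig_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
        # Whole GPU: record it as a single device
        ret.append({"index": len(ret),
                    "name": pynvml.nvmlDeviceGetName(dev),
                    "uuid": pynvml.nvmlDeviceGetUUID(dev)})
        continue

    # MIG-enabled GPU: enumerate its instances
    mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
    mig_global_idx = len(ret)                          # Fix 6: indices continue after existing devices
    for mdev_idx in range(mdev_count):
        try:
            mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)
        except pynvml.NVMLError_NotFound:
            continue                                   # Fix 4: tolerate non-contiguous MIG indices

        mdev_name = pynvml.nvmlDeviceGetName(mdev)     # Fix 5: resolved per device, never reused

        mdev_temp = None                               # Fix 3: MIG handles reject temp/power queries
        with contextlib.suppress(pynvml.NVMLError):
            mdev_temp = pynvml.nvmlDeviceGetTemperature(mdev, pynvml.NVML_TEMPERATURE_GPU)

        ret.append({"index": mig_global_idx,
                    "name": mdev_name,
                    "uuid": pynvml.nvmlDeviceGetUUID(mdev),
                    "temperature": mdev_temp})
        mig_global_idx += 1

for device in ret:
    print(device)
pynvml.nvmlShutdown()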

Issue 7: CUDA_VISIBLE_DEVICES Index Mismatch

Symptom:

Model containers failed to start with errors about invalid CUDA device indices.

Root Cause:

GPUStack assigns device indices 0, 1, 2, 3 to all detected devices (1 non-MIG GPU + 3 MIG devices). But CUDA's device enumeration is different - it only sees:

  • Device 0: The non-MIG GPU
  • Device 1: The MIG parent GPU (with MIG instances accessible via UUIDs)

When GPUStack sets CUDA_VISIBLE_DEVICES=2, CUDA fails because it doesn't have a device 2.

The Index Translation Problem:

┌─────────────────────────────────────────────────────────────┐
│               INDEX MISMATCH PROBLEM                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   GPUStack View:              CUDA View:                    │
│   ┌───────────────┐           ┌───────────────┐            │
│   │ idx=0: H200   │     →     │ device=0: H200│            │
│   │ idx=1: 4g.71gb│     ?     │ device=1: MIG │ (parent)   │
│   │ idx=2: 2g.35gb│     ?     │               │            │
│   │ idx=3: 1g.18gb│           │  NO device 2  │            │
│   └───────────────┘           │  NO device 3  │            │
│                               └───────────────┘            │
│                                                             │
│   CUDA_VISIBLE_DEVICES=2  →  ERROR: invalid device         │
│   CUDA_VISIBLE_DEVICES=MIG-uuid  →  SUCCESS                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Fix:

Configure GPUStack to use UUIDs instead of indices for CUDA_VISIBLE_DEVICES:

# Old code: maps index to sequential index
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}

# New code: maps index to UUID
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}

Enable via environment variable:

GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES

Issue 8: vLLM UUID Parsing

Symptom:

ValueError: invalid literal for int() with base 10: 'MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'

Root Cause:

After fixing Issue 7, GPUStack correctly sets CUDA_VISIBLE_DEVICES to the MIG UUID. But vLLM's device mapping code assumes this environment variable contains integers:

# vllm/platforms/interface.py (simplified)
def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    return int(physical_device_id)  # Fails with UUID!

Fix:

Patch vLLM to handle UUID values gracefully:

def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    try:
        return int(physical_device_id)
    except ValueError:
        # UUID format (e.g., MIG-xxx) - CUDA has already
        # remapped devices, so return the local device_id
        return device_id
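
A standalone reproduction of the patched logic (a hypothetical helper, not vLLM's actual function) shows why UUID values are now handled:

import os

def physical_device(device_id: int):
    """Mirror the patched mapping: integers pass through, UUIDs fall back to the local id."""
    value = os.environ["CUDA_VISIBLE_DEVICES"].split(",")[device_id]
    try:
        return int(value)
    except ValueError:
        return device_id

os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
print(physical_device(0))  # 0 - CUDA has already remapped the MIG instance to local device 0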

The Complete Solution

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    GPUSTACK + MIG ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                     HOST SYSTEM                             │   │
│   │   ┌─────────────────────────────────────────────────────┐   │   │
│   │   │  CDI Configuration (/etc/cdi/nvidia.yaml)           │   │   │
│   │   │  - vendor: runtime.nvidia.com                       │   │   │
│   │   │  - strategies: index + uuid                         │   │   │
│   │   └─────────────────────────────────────────────────────┘   │   │
│   │   ┌─────────────────────────────────────────────────────┐   │   │
│   │   │  NVIDIA Container Runtime Config                    │   │   │
│   │   │  - default-kind: runtime.nvidia.com/gpu             │   │   │
│   │   └─────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                  gpustack-worker Container                  │   │
│   │   ┌───────────────────────────────────────────────────┐     │   │
│   │   │  gpustack-runtime (PATCHED)                       │     │   │
│   │   │  - MIG temp/power: contextlib.suppress            │     │   │
│   │   │  - MIG enumeration: NotFound handling             │     │   │
│   │   │  - MIG naming: reset per device                   │     │   │
│   │   │  - MIG indexing: global indices                   │     │   │
│   │   │  - CUDA alignment: UUID-based                     │     │   │
│   │   └───────────────────────────────────────────────────┘     │   │
│   │   Environment Variables:                                    │   │
│   │   - NVIDIA_VISIBLE_DEVICES=all                              │   │
│   │   - GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE   │   │
│   │     _UUID=NVIDIA_VISIBLE_DEVICES                            │   │
│   │   - GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE   │   │
│   │     _ALIGNMENT=CUDA_VISIBLE_DEVICES                         │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              │ Spawns                               │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │              vLLM Runner Container (PATCHED)                │   │
│   │   ┌───────────────────────────────────────────────────┐     │   │
│   │   │  vllm (PATCHED)                                   │     │   │
│   │   │  - UUID handling in CUDA_VISIBLE_DEVICES          │     │   │
│   │   └───────────────────────────────────────────────────┘     │   │
│   │   Environment:                                              │   │
│   │   - NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-...                │   │
│   │   - CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-...                  │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    MIG Device Instance                      │   │
│   │   Profile: 4g.71gb | Memory: 71GB | Isolated Compute        │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Fix Implementation

We've automated the entire fix into a single script. Here's the breakdown:

Step 1: CDI Regeneration

nvidia-ctk cdi generate \
  --vendor=runtime.nvidia.com \
  --device-name-strategy=index \
  --device-name-strategy=uuid \
  --output=/etc/cdi/nvidia.yaml

Step 2: Container Runtime Configuration

sed -i 's|default-kind = "nvidia.com/gpu"|default-kind = "runtime.nvidia.com/gpu"|' \
  /etc/nvidia-container-runtime/config.toml
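
To confirm the change took effect (the default-kind key typically sits under the [nvidia-container-runtime.modes.cdi] section):

grep "default-kind" /etc/nvidia-container-runtime/config.toml
# default-kind = "runtime.nvidia.com/gpu"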

Step 3: Build Patched vLLM Runner Image

FROM gpustack/runner:cuda12.9-vllm0.13.0

# Patch vLLM to handle UUID values in CUDA_VISIBLE_DEVICES
RUN sed -i 's/return int(physical_device_id)/# VLLM_UUID_FIX\n            try:\n                return int(physical_device_id)\n            except ValueError:\n                return device_id/' \
    /usr/local/lib/python3.12/dist-packages/vllm/platforms/interface.py && \
    python3 -c "from vllm.platforms.interface import Platform; print('vLLM patch verified')"
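
Build and smoke-test the patched image. The tag below is only an example; reference whatever tag your GPUStack deployments expect for the vLLM runner:

docker build -t gpustack/runner:cuda12.9-vllm0.13.0-mig-fix .

# The import should succeed and the VLLM_UUID_FIX marker should be present in the patched file
docker run --rm --entrypoint python3 gpustack/runner:cuda12.9-vllm0.13.0-mig-fix -c "
import vllm.platforms.interface as i
assert 'VLLM_UUID_FIX' in open(i.__file__).read(), 'patch marker missing'
print('patched runner OK')"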

Step 4: Deploy GPUStack Worker with MIG Environment Variables

docker run -d \
  --name gpustack-worker \
  --hostname gpustack-worker \
  --restart unless-stopped \
  --network host \
  --runtime nvidia \
  --privileged \
  --shm-size 64m \
  -v /data/cache:/var/lib/gpustack/cache \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /data/models:/data/models \
  -v gpustack-data:/var/lib/gpustack \
  -e "GPUSTACK_TOKEN=$TOKEN" \
  -e "NVIDIA_DISABLE_REQUIRE=true" \
  -e "NVIDIA_VISIBLE_DEVICES=all" \
  -e "NVIDIA_DRIVER_CAPABILITIES=compute,utility" \
  -e "GPUSTACK_RUNTIME_DEPLOY_MIRRORED_DEPLOYMENT=true" \
  -e "GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE_UUID=NVIDIA_VISIBLE_DEVICES" \
  -e "GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES" \
  gpustack/gpustack:v2.0.3 \
  --server-url http://$SERVER_IP \
  --worker-ip $WORKER_IP \
  --worker-port 10170 \
  --worker-metrics-port 10171

Step 5: Apply Runtime Patches

The gpustack-runtime patches must be applied inside the running container. Here's the CUDA UUID alignment patch:

# Patch location: /usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

# Old code:
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}

# New code:
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}
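
One way to script the alignment patch shown above (a sketch; it assumes the target line still matches the shipped gpustack-runtime source exactly, so verify with grep before and after):

TYPES=/usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

docker exec gpustack-worker sed -i \
  's/alignment = {dev_indexes\[i\]: str(i) for i in range(len(devs))}/alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}/' \
  "$TYPES"

# Confirm the replacement, then restart so the patched module is reloaded
docker exec gpustack-worker grep -n 'dev_uuids\[i\]' "$TYPES"
docker restart gpustack-worker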

Verification

After applying all fixes, verify the setup:

GPU Detection Test

docker exec gpustack-worker python3 -c "
from gpustack_runtime.detector.nvidia import NVIDIADetector
det = NVIDIADetector()
result = det.detect()
for d in result:
    mem = getattr(d, 'memory', '?')
    uuid = getattr(d, 'uuid', '?')
    print(f'idx={d.index} name={d.name} mem={mem}MB uuid={uuid}')
"

Expected Output:

idx=0 name=NVIDIA H200 NVL mem=143771MB uuid=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
idx=1 name=4g.71gb mem=71424MB uuid=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
idx=2 name=2g.35gb mem=33280MB uuid=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
idx=3 name=1g.18gb mem=16384MB uuid=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz

vLLM MIG Test

docker run --rm --runtime nvidia \
  -e "NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  -e "CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  gpustack/runner:cuda12.9-vllm0.13.0 python3 -c "
import torch
print(f'CUDA devices: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'  {i}: {torch.cuda.get_device_name(i)}')
from vllm.platforms.cuda import CudaPlatform
print('vLLM imported successfully!')
"

Expected Output:

CUDA devices: 1
  0: NVIDIA H200 NVL
vLLM imported successfully!

CDI Verification

nvidia-ctk cdi list

Expected Output (includes UUIDs):

runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=all

Operational Considerations

Patch Persistence

Component                  Persistence         Notes
CDI configuration          Survives reboots    Written to /etc/cdi/nvidia.yaml
Container runtime config   Survives reboots    Written to /etc/nvidia-container-runtime/config.toml
vLLM runner image          Permanent           Baked into Docker image
gpustack-runtime patches   Survives restarts   Lost on container recreation

Important: If you docker rm and recreate the gpustack-worker container, you must reapply the runtime patches. However, docker restart preserves them.
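
A small guard helps catch the recreation case (a sketch building on the Step 5 patch above):

PATCHED_FILE=/usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py

if ! docker exec gpustack-worker grep -q 'dev_uuids\[i\]' "$PATCHED_FILE"; then
  echo "gpustack-runtime patches are missing - reapply them (Step 5) and restart the worker"
fi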

Monitoring MIG Devices

MIG devices have limited monitoring capabilities compared to full GPUs:

Metric         Full GPU   MIG Device
Temperature    Yes        No (parent GPU only)
Power Usage    Yes        No (parent GPU only)
Memory Usage   Yes        Yes
Utilization    Yes        Yes
Process List   Yes        Yes

MIG Profile Selection Guidelines

Model Size         Recommended Profile   Notes
< 5B params        1g.18gb               Small inference tasks
5B - 13B params    2g.35gb               Typical 7B model with context
13B - 30B params   4g.71gb               Larger models, batch inference
> 30B params       Full GPU              Disable MIG for this GPU
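
The sizing above follows from simple arithmetic: FP16/BF16 weights take roughly 2 bytes per parameter, plus headroom for KV cache and activations. A rough sketch, not accounting for quantization, context length, or batch size:

def fp16_weight_gb(params_billion: float) -> float:
    """Approximate FP16/BF16 weight footprint in GB (2 bytes per parameter)."""
    return params_billion * 2.0

for params, profile in ((7, "2g.35gb"), (13, "4g.71gb"), (30, "4g.71gb or full GPU")):
    weights = fp16_weight_gb(params)
    # Leave 30-50% headroom on top of the weights for KV cache, activations, and CUDA overhead
    print(f"{params}B params: ~{weights:.0f} GB weights -> {profile}")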

Key Takeaways

  1. MIG + Container Orchestration is complex. The intersection of CDI, NVIDIA container runtime, and application-level device enumeration creates multiple potential failure points.

  2. Vendor prefixes matter. CDI's vendor namespace (nvidia.com vs runtime.nvidia.com) must match what your orchestrator requests.

  3. Device naming strategies must align. MIG devices can be addressed by parent:child index or UUID. Your orchestrator and runtime must agree on which to use.

  4. pynvml has MIG limitations. Not all NVML queries work on MIG device handles. Wrap potentially failing calls in error handlers.

  5. Index enumeration differs between layers. GPUStack, CUDA, and CDI may all enumerate devices differently. UUID-based device selection is the most reliable approach.

  6. Runtime patches may be necessary. Both GPUStack-runtime and vLLM required patches to handle MIG correctly. These should eventually be upstreamed.

  7. Test the full stack. Verify GPU detection, CDI configuration, and actual model deployment. Each layer can fail independently.

  8. Document your patches. Runtime patches don't survive container recreation. Automate their application and document the process.


MIG is a powerful technology for maximizing GPU utilization in multi-tenant environments. With the right configuration and patches, GPUStack can effectively orchestrate inference workloads across MIG partitions - enabling more efficient use of expensive GPU hardware.

Frederico Vicente

AI Research Engineer