NVIDIA DCGM simulation

Simulator side of the GPU story

This page covers the simulator — the metric OID types, the GPU cycler extension, device profiles, resource file layout, and integration points. Pair it with Protobuf model (the shape of the output data) and Pollaris and parsing rules (the collector side).

Overview

l8opensim simulates NVIDIA Data Center GPU Manager (DCGM) servers. Each simulated device represents a GPU server (e.g. DGX A100, DGX H100, HGX H200) with multiple GPUs. The simulation exposes GPU metrics via all three existing protocols: SNMP (enterprise MIB OIDs), SSH (nvidia-smi CLI), and REST API (DCGM HTTP endpoints).

What DCGM Exposes in the Real World

NVIDIA DCGM provides:

GPU utilization (compute and memory controller %)
GPU memory (used/total VRAM per GPU)
GPU temperature (current, shutdown threshold, slowdown threshold)
Power draw (current watts, power limit)
Fan speed (% of max)
GPU clock speeds (SM clock, memory clock in MHz)
PCIe throughput (TX/RX MB/s)
ECC error counts (single-bit corrected, double-bit uncorrected)
NVLink bandwidth (active links, throughput)
GPU health status (healthy, warning, critical)
Process/job info (running PIDs, memory per process)

These are accessible via:

SNMP: NVIDIA enterprise OIDs under 1.3.6.1.4.1.53246 (NVIDIA PEN)
SSH: nvidia-smi command variants (e.g., nvidia-smi, nvidia-smi -q, nvidia-smi dmon, dcgmi diag, dcgmi health)
REST API: DCGM exporter HTTP endpoints (/api/v1/gpu/status, /api/v1/gpu/{id}/metrics, etc.)

Scope

Device Flavors (3 resource profiles)

Resource File	GPU Model	GPUs	VRAM/GPU	Total VRAM	System RAM
`nvidia_dgx_a100.json`	A100 80GB SXM	8	80 GB	640 GB	1 TB
`nvidia_dgx_h100.json`	H100 80GB SXM	8	80 GB	640 GB	2 TB
`nvidia_hgx_h200.json`	H200 141GB SXM	8	141 GB	1128 GB	2 TB

Each flavor gets its own resource directory under resources/ with SNMP, SSH, and API JSON data.

GPU-Specific Metric Types (new)

Extend MetricsCycler to generate cycling values for GPU metrics alongside existing CPU/memory/temperature metrics.

Phase 1: GPU Metric Types and Cycler Extension

1.1 New Metric OID Types (`metrics_oids.go`)

Add new MetricOIDType constants:

MetricGPUUtil       MetricOIDType = iota + 10  // GPU compute utilization %
MetricGPUMemUsed                                // GPU memory used (MB)
MetricGPUMemTotal                               // GPU memory total (MB, constant)
MetricGPUMemUtil                                // GPU memory utilization %
MetricGPUTemp                                   // GPU temperature (Celsius)
MetricGPUPower                                  // GPU power draw (Watts)
MetricGPUFanSpeed                               // GPU fan speed %
MetricGPUClockSM                                // GPU SM clock (MHz)
MetricGPUClockMem                               // GPU memory clock (MHz)

Add vendorOIDs entries for each NVIDIA resource file, using NVIDIA's enterprise OID prefix 1.3.6.1.4.1.53246. Each GPU (0-7) gets its own OID suffix. Example for GPU 0:

3.6.1.4.1.53246.1.1.1.1.5.0  → MetricGPUUtil       (GPU 0 utilization)
3.6.1.4.1.53246.1.1.1.1.6.0  → MetricGPUMemUsed    (GPU 0 memory used)
3.6.1.4.1.53246.1.1.1.1.7.0  → MetricGPUMemTotal   (GPU 0 memory total)
3.6.1.4.1.53246.1.1.1.1.8.0  → MetricGPUTemp       (GPU 0 temperature)
3.6.1.4.1.53246.1.1.1.1.9.0  → MetricGPUPower      (GPU 0 power draw)
3.6.1.4.1.53246.1.1.1.1.10.0 → MetricGPUFanSpeed   (GPU 0 fan speed)
...
3.6.1.4.1.53246.1.1.1.1.5.7  → MetricGPUUtil       (GPU 7 utilization)

That's 6 metric OIDs × 8 GPUs = 48 dynamic metric OIDs per device, plus the existing CPU/memory/temperature host metrics.

1.2 GPU Device Profile (`device_profiles.go`)

Add three new profiles:

var profileGPUServerA100 = DeviceProfile{
    CPUBaseMin: 20, CPUBaseMax: 55, CPUSpike: 25,
    MemTotalKB: 1024 * 1024 * 1024,  // 1 TB system RAM
    MemBaseMin: 40, MemBaseMax: 75, MemVariance: 12,
    TempBaseMin: 32, TempBaseMax: 48, TempSpike: 6,
}

var profileGPUServerH100 = DeviceProfile{
    CPUBaseMin: 25, CPUBaseMax: 60, CPUSpike: 28,
    MemTotalKB: 2 * 1024 * 1024 * 1024,  // 2 TB system RAM
    MemBaseMin: 45, MemBaseMax: 80, MemVariance: 12,
    TempBaseMin: 34, TempBaseMax: 50, TempSpike: 7,
}

var profileGPUServerH200 = DeviceProfile{
    CPUBaseMin: 25, CPUBaseMax: 60, CPUSpike: 28,
    MemTotalKB: 2 * 1024 * 1024 * 1024,  // 2 TB system RAM
    MemBaseMin: 45, MemBaseMax: 80, MemVariance: 12,
    TempBaseMin: 34, TempBaseMax: 50, TempSpike: 7,
}

Add to deviceProfileMap:

"nvidia_dgx_a100.json": profileGPUServerA100,
"nvidia_dgx_h100.json": profileGPUServerH100,
"nvidia_hgx_h200.json": profileGPUServerH200,

1.3 GPU Metrics in MetricsCycler (`metrics_cycler.go`)

Add a new GPUMetrics struct with per-GPU cycling arrays:

type GPUMetrics struct {
    gpuUtil     [numDataPoints]int    // GPU utilization % (0-100)
    gpuMemUsed  [numDataPoints]int64  // GPU memory used (MB)
    gpuMemTotal int64                 // GPU VRAM total (MB, constant)
    gpuTemp     [numDataPoints]int    // GPU temperature (Celsius)
    gpuPower    [numDataPoints]int    // GPU power draw (Watts)
    gpuFanSpeed [numDataPoints]int    // GPU fan speed % (0-100)
    gpuClockSM  [numDataPoints]int    // SM clock MHz
    gpuClockMem [numDataPoints]int    // Memory clock MHz
    utilIndex   uint32
    memIndex    uint32
    tempIndex   uint32
    powerIndex  uint32
    fanIndex    uint32
    clockIndex  uint32
}

Add gpuMetrics [8]*GPUMetrics field to MetricsCycler. These are only allocated when the device profile is a GPU server (checked by resource file name).

Each GPU gets its own seed (baseSeed + gpuIndex) so GPUs within the same server have different but correlated curves (e.g., GPU 0 might run hotter when GPU 1 is under heavy compute load).

GPU-specific profile parameters (new struct):

type GPUProfile struct {
    GPUCount       int   // Number of GPUs (8 for DGX/HGX)
    VRAMPerGPUMB   int64 // VRAM per GPU in MB
    GPUUtilMin     int   // Min GPU utilization %
    GPUUtilMax     int   // Max GPU utilization %
    GPUUtilSpike   int   // Utilization spike amplitude
    GPUTempMin     int   // Min GPU temp (Celsius)
    GPUTempMax     int   // Max GPU temp (Celsius)
    GPUTempSpike   int   // Temp spike amplitude
    GPUPowerMin    int   // Min power draw (Watts)
    GPUPowerMax    int   // Max power draw (Watts)
    GPUPowerSpike  int   // Power spike amplitude
    GPUClockSMBase int   // Base SM clock (MHz)
    GPUClockMemBase int  // Base memory clock (MHz)
}

Parameter	A100	H100	H200
VRAMPerGPUMB	81920	81920	144384
GPUUtilMin/Max	15-85	20-90	20-90
GPUTempMin/Max	35-75	35-80	35-78
GPUPowerMin/Max	100-400	150-700	150-700
GPUClockSMBase	1410	1980	1980
GPUClockMemBase	1512	2619	2619

1.4 Getter Methods for GPU Metrics

Add methods to MetricsCycler:

func (c *MetricsCycler) GetGPUUtil(gpuIndex int) string
func (c *MetricsCycler) GetGPUMemUsed(gpuIndex int) string
func (c *MetricsCycler) GetGPUMemTotal(gpuIndex int) string
func (c *MetricsCycler) GetGPUMemUtil(gpuIndex int) string
func (c *MetricsCycler) GetGPUTemp(gpuIndex int) string
func (c *MetricsCycler) GetGPUPower(gpuIndex int) string
func (c *MetricsCycler) GetGPUFanSpeed(gpuIndex int) string
func (c *MetricsCycler) GetGPUClockSM(gpuIndex int) string
func (c *MetricsCycler) GetGPUClockMem(gpuIndex int) string

1.5 SNMP Handler Extension (`snmp_handlers.go`)

Update getMetricValue() to handle the new GPU metric types. The OID encodes the GPU index in the last component (e.g., .0 for GPU 0, .7 for GPU 7), so the handler must extract the GPU index from the OID suffix and call the appropriate getter.

Add a helper to parse GPU index from OID:

func parseGPUIndexFromOID(oid string) int {
    // Last component of OID is the GPU index
    parts := strings.Split(oid, ".")
    idx, _ := strconv.Atoi(parts[len(parts)-1])
    return idx
}

Files Modified

metrics_oids.go — new metric types + NVIDIA vendor OID mappings
device_profiles.go — 3 new profiles + GPUProfile struct + profile map entries
metrics_cycler.go — GPUMetrics struct, GPU data point generation, getter methods
snmp_handlers.go — GPU metric cases in getMetricValue()

Phase 2: SNMP Resource Files

2.1 Directory Structure

resources/
├── nvidia_dgx_a100/
│   ├── nvidia_dgx_a100_snmp_system.json    # sysDescr, sysObjectID, sysUpTime, sysContact, interfaces
│   ├── nvidia_dgx_a100_snmp_host.json      # Host Resources MIB (hrStorage, hrProcessor, hrDevice)
│   ├── nvidia_dgx_a100_snmp_gpu.json       # NVIDIA enterprise GPU MIB OIDs (static baselines)
│   ├── nvidia_dgx_a100_ssh.json            # SSH command/response pairs
│   └── nvidia_dgx_a100_api.json            # REST API endpoint responses
├── nvidia_dgx_h100/
│   ├── nvidia_dgx_h100_snmp_system.json
│   ├── nvidia_dgx_h100_snmp_host.json
│   ├── nvidia_dgx_h100_snmp_gpu.json
│   ├── nvidia_dgx_h100_ssh.json
│   └── nvidia_dgx_h100_api.json
└── nvidia_hgx_h200/
    ├── nvidia_hgx_h200_snmp_system.json
    ├── nvidia_hgx_h200_snmp_host.json
    ├── nvidia_hgx_h200_snmp_gpu.json
    ├── nvidia_hgx_h200_ssh.json
    └── nvidia_hgx_h200_api.json

2.2 SNMP System MIB (per flavor)

Standard MIB-II system group OIDs with GPU-server-appropriate values:

OID	Description	Example (DGX H100)
`1.3.6.1.2.1.1.1.0`	sysDescr	`NVIDIA DGX H100 - DCGM 3.3.0, Driver 535.129.03, CUDA 12.2`
`1.3.6.1.2.1.1.2.0`	sysObjectID	`1.3.6.1.4.1.53246.1.2.2`
`1.3.6.1.2.1.1.3.0`	sysUpTime	`234567800`
`1.3.6.1.2.1.1.4.0`	sysContact	`GPU Infrastructure Team`
`1.3.6.1.2.1.1.7.0`	sysServices	`72`

Plus IF-MIB interfaces (management NIC, InfiniBand/RoCE ports), Host Resources MIB entries (CPU, memory, storage).

2.3 SNMP GPU MIB (NVIDIA enterprise OIDs)

Static OIDs under 1.3.6.1.4.1.53246 for per-GPU identity info:

OID Pattern	Description	Example
`.1.1.1.1.1.{gpu}`	GPU name	`NVIDIA H100 80GB HBM3`
`.1.1.1.1.2.{gpu}`	GPU UUID	`GPU-a1b2c3d4-...`
`.1.1.1.1.3.{gpu}`	GPU serial	`1234567890ABC`
`.1.1.1.1.4.{gpu}`	GPU PCI bus ID	`0000:07:00.0`
`.1.1.1.1.13.{gpu}`	Driver version	`535.129.03`
`.1.1.1.1.14.{gpu}`	CUDA version	`12.2`
`.1.1.1.1.15.{gpu}`	ECC errors (corrected)	`0`
`.1.1.1.1.16.{gpu}`	ECC errors (uncorrected)	`0`
`.1.1.1.1.17.{gpu}`	NVLink active links	`12`

Dynamic OIDs (cycling via MetricsCycler) are listed in Phase 1.1 above.

Files Created

resources/nvidia_dgx_a100/nvidia_dgx_a100_snmp_system.json
resources/nvidia_dgx_a100/nvidia_dgx_a100_snmp_host.json
resources/nvidia_dgx_a100/nvidia_dgx_a100_snmp_gpu.json
resources/nvidia_dgx_h100/nvidia_dgx_h100_snmp_system.json
resources/nvidia_dgx_h100/nvidia_dgx_h100_snmp_host.json
resources/nvidia_dgx_h100/nvidia_dgx_h100_snmp_gpu.json
resources/nvidia_hgx_h200/nvidia_hgx_h200_snmp_system.json
resources/nvidia_hgx_h200/nvidia_hgx_h200_snmp_host.json
resources/nvidia_hgx_h200/nvidia_hgx_h200_snmp_gpu.json

Phase 3: SSH Resources (`nvidia-smi` and `dcgmi` CLI)

3.1 Supported Commands

Each flavor's SSH resource file provides responses for these commands:

Command	Description
`nvidia-smi`	Default GPU summary table (all 8 GPUs, utilization, memory, temp, power)
`nvidia-smi -q`	Detailed query (full GPU specs, clocks, ECC, power, thermals per GPU)
`nvidia-smi -q -d MEMORY`	Memory details (used/free/total per GPU)
`nvidia-smi -q -d UTILIZATION`	Utilization details (GPU%, memory controller%, encoder%, decoder%)
`nvidia-smi -q -d TEMPERATURE`	Temperature details (current, shutdown, slowdown thresholds)
`nvidia-smi -q -d POWER`	Power details (draw, limit, default limit, min/max limits)
`nvidia-smi -q -d CLOCK`	Clock details (current, max, SM, memory, video clocks)
`nvidia-smi -q -d ECC`	ECC error counts
`nvidia-smi topo -m`	GPU topology matrix (NVLink connections)
`nvidia-smi nvlink -s`	NVLink status
`dcgmi discovery -l`	DCGM discovered GPUs
`dcgmi diag -r 1`	DCGM quick diagnostic (level 1)
`dcgmi health -c`	DCGM health check
`dcgmi stats -e`	DCGM stats enable
`show version`	Linux OS version with NVIDIA driver info
`uname -a`	Linux kernel info
`lspci \| grep -i nvidia`	PCI device listing for GPUs
`cat /proc/driver/nvidia/version`	NVIDIA kernel module version
`free -h`	System memory summary
`lscpu`	CPU info (AMD EPYC / Intel Xeon for DGX)
`ip addr`	Network interfaces (management, InfiniBand)
`ibstat`	InfiniBand HCA status
`hostname`	Hostname

3.2 Example `nvidia-smi` Output (H100)

Fri Mar 14 12:00:00 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2                 |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:07:00.0   Off |                    0 |
| N/A   42C    P0             320W / 700W  |   45312MiB / 81920MiB  |     67%      Default |
|-----------------------------------------+------------------------+----------------------|
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:0B:00.0   Off |                    0 |
| N/A   39C    P0             285W / 700W  |   38400MiB / 81920MiB  |     54%      Default |
...  (GPUs 2-7)
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage       |
|=========================================================================================|
|    0   N/A  N/A     12345      C   python3                                     44800MiB |
|    1   N/A  N/A     12346      C   python3                                     37888MiB |
...
+-----------------------------------------------------------------------------------------+

Files Created

resources/nvidia_dgx_a100/nvidia_dgx_a100_ssh.json
resources/nvidia_dgx_h100/nvidia_dgx_h100_ssh.json
resources/nvidia_hgx_h200/nvidia_hgx_h200_ssh.json

Phase 4: REST API Resources (DCGM HTTP Endpoints)

4.1 API Endpoints

Each device exposes a REST API on its APIPort with these endpoints:

Method	Path	Description
`GET`	`/api/v1/gpu/status`	Overall GPU cluster health and summary
`GET`	`/api/v1/gpu/devices`	List of all GPUs with IDs, names, UUIDs
`GET`	`/api/v1/gpu/devices/{id}`	Single GPU detailed info
`GET`	`/api/v1/gpu/devices/{id}/metrics`	GPU metrics (util, memory, temp, power, clocks)
`GET`	`/api/v1/gpu/topology`	NVLink/NVSwitch topology
`GET`	`/api/v1/gpu/processes`	Running GPU processes
`GET`	`/api/v1/dcgm/health`	DCGM health check results
`GET`	`/api/v1/dcgm/diag`	DCGM diagnostic results
`GET`	`/api/v1/dcgm/config`	DCGM configuration
`GET`	`/api/v1/system/info`	System info (OS, driver, CUDA, hostname)
`GET`	`/api/v1/system/memory`	System memory stats
`GET`	`/api/v1/system/network`	Network interface info

4.2 Example Response: `/api/v1/gpu/status`

{
  "status": "healthy",
  "gpu_count": 8,
  "driver_version": "535.129.03",
  "cuda_version": "12.2",
  "dcgm_version": "3.3.0",
  "gpus": [
    {
      "index": 0,
      "name": "NVIDIA H100 80GB HBM3",
      "uuid": "GPU-a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "pci_bus_id": "00000000:07:00.0",
      "temperature_gpu": 42,
      "utilization_gpu": 67,
      "utilization_memory": 55,
      "memory_used_mib": 45312,
      "memory_total_mib": 81920,
      "power_draw_w": 320,
      "power_limit_w": 700,
      "fan_speed_pct": 0,
      "sm_clock_mhz": 1980,
      "mem_clock_mhz": 2619,
      "ecc_errors_corrected": 0,
      "ecc_errors_uncorrected": 0,
      "health": "healthy"
    }
  ]
}

4.3 Example Response: `/api/v1/gpu/topology`

{
  "gpu_count": 8,
  "nvlink_version": 4,
  "nvswitch_count": 4,
  "topology_matrix": [
    { "gpu": 0, "connections": [
      { "peer_gpu": 1, "type": "NVLink", "bandwidth_gbps": 900 },
      { "peer_gpu": 2, "type": "NVLink", "bandwidth_gbps": 900 },
      ...
    ]},
    ...
  ]
}

Files Created

resources/nvidia_dgx_a100/nvidia_dgx_a100_api.json
resources/nvidia_dgx_h100/nvidia_dgx_h100_api.json
resources/nvidia_hgx_h200/nvidia_hgx_h200_api.json

Phase 5: Integration with Simulator Framework

5.1 Resource Loading (`resources.go`)

No changes needed — the existing directory-based loading and merging already handles new resource directories automatically. The loadSpecificResourcesFromDir() function will pick up all JSON files in each NVIDIA directory and merge SNMP + SSH + API resources.

5.2 Device Type Detection (`resources.go`)

Add NVIDIA detection to getDeviceTypeFromName():

} else if strings.Contains(nameLower, "nvidia") || strings.Contains(nameLower, "dgx") || strings.Contains(nameLower, "hgx") {
    return "NVIDIA GPU Server"
}

5.3 Round-Robin Registration (`types.go`)

Add the 3 NVIDIA resource files to RoundRobinDeviceTypes:

var RoundRobinDeviceTypes = []string{
    // ... existing 19 types ...
    "nvidia_dgx_a100.json",
    "nvidia_dgx_h100.json",
    "nvidia_hgx_h200.json",
}

This brings the total to 22 device types in round-robin mode.

5.4 API Resource Merging (`resources.go`)

Update loadResourcesFromDir() and loadSpecificResourcesFromDir() to also merge API resources (currently only merges SNMP and SSH):

resources.API = append(resources.API, partResources.API...)

Initialize the API slice:

resources := &DeviceResources{
    SNMP: make([]SNMPResource, 0),
    SSH:  make([]SSHResource, 0),
    API:  make([]APIResource, 0),  // Add this
}

Files Modified

resources.go — NVIDIA device type detection + API merging in directory loader
types.go — 3 new entries in RoundRobinDeviceTypes

Phase 6: Build and Verify

cd go && go build ./simulator/ — verify compilation
cd go && go vet ./simulator/ — verify no issues
Start the simulator and create a single NVIDIA DGX H100 device
Verify SNMP walk returns GPU OIDs with cycling values
Verify SSH nvidia-smi returns formatted output
Verify REST API /api/v1/gpu/status returns GPU data
Create devices in round-robin mode — verify NVIDIA types appear among the 22 types
Verify web UI shows "NVIDIA GPU Server" as device type

File Change Summary

New Files (15 resource JSON files)

resources/nvidia_dgx_a100/
    nvidia_dgx_a100_snmp_system.json
    nvidia_dgx_a100_snmp_host.json
    nvidia_dgx_a100_snmp_gpu.json
    nvidia_dgx_a100_ssh.json
    nvidia_dgx_a100_api.json

resources/nvidia_dgx_h100/
    nvidia_dgx_h100_snmp_system.json
    nvidia_dgx_h100_snmp_host.json
    nvidia_dgx_h100_snmp_gpu.json
    nvidia_dgx_h100_ssh.json
    nvidia_dgx_h100_api.json

resources/nvidia_hgx_h200/
    nvidia_hgx_h200_snmp_system.json
    nvidia_hgx_h200_snmp_host.json
    nvidia_hgx_h200_snmp_gpu.json
    nvidia_hgx_h200_ssh.json
    nvidia_hgx_h200_api.json

Modified Files (6 Go files)

File	Changes
`metrics_oids.go`	9 new MetricOIDType constants + 3 NVIDIA vendor OID map entries (48 OIDs each)
`device_profiles.go`	GPUProfile struct + 3 GPU device profiles + profile map entries
`metrics_cycler.go`	GPUMetrics struct + GPU data generation in NewMetricsCycler + 9 GPU getter methods
`snmp_handlers.go`	GPU metric cases in getMetricValue() + parseGPUIndexFromOID helper
`resources.go`	NVIDIA device type detection + API merging in directory loaders
`types.go`	3 new entries in RoundRobinDeviceTypes

No Changes Required

File	Why
`device.go`	MetricsCycler already initialized per device; GPU init keyed on resource file
`manager.go`	Resource caching and device lifecycle unchanged
`snmp_server.go`	Protocol handling unchanged
`snmp_encoding.go`	Encoding unchanged
`ssh.go`	Command matching unchanged
`api.go`	API server already handles arbitrary endpoints from resource files
`web.go`	Web UI device listing works with new device types automatically
`web_routes.go`	Routes unchanged
`web/index.html`	Device type dropdown auto-populated from resources
`web/app_ui.js`	Renders device_type from API response automatically
`web/app_api.js`	API client unchanged

Overview​

What DCGM Exposes in the Real World​

Scope​

Device Flavors (3 resource profiles)​

GPU-Specific Metric Types (new)​

Phase 1: GPU Metric Types and Cycler Extension​

1.1 New Metric OID Types (metrics_oids.go)​

1.2 GPU Device Profile (device_profiles.go)​

1.3 GPU Metrics in MetricsCycler (metrics_cycler.go)​

1.4 Getter Methods for GPU Metrics​

1.5 SNMP Handler Extension (snmp_handlers.go)​

Files Modified​

Phase 2: SNMP Resource Files​

2.1 Directory Structure​

2.2 SNMP System MIB (per flavor)​

2.3 SNMP GPU MIB (NVIDIA enterprise OIDs)​

Files Created​

Phase 3: SSH Resources (nvidia-smi and dcgmi CLI)​

3.1 Supported Commands​

3.2 Example nvidia-smi Output (H100)​

Files Created​

Phase 4: REST API Resources (DCGM HTTP Endpoints)​

4.1 API Endpoints​

4.2 Example Response: /api/v1/gpu/status​

4.3 Example Response: /api/v1/gpu/topology​

Files Created​

Phase 5: Integration with Simulator Framework​

5.1 Resource Loading (resources.go)​

5.2 Device Type Detection (resources.go)​

5.3 Round-Robin Registration (types.go)​

5.4 API Resource Merging (resources.go)​

Files Modified​

Phase 6: Build and Verify​

File Change Summary​

New Files (15 resource JSON files)​

Modified Files (6 Go files)​

No Changes Required​

Overview

What DCGM Exposes in the Real World

Scope

Device Flavors (3 resource profiles)

GPU-Specific Metric Types (new)

Phase 1: GPU Metric Types and Cycler Extension

1.1 New Metric OID Types (`metrics_oids.go`)

1.2 GPU Device Profile (`device_profiles.go`)

1.3 GPU Metrics in MetricsCycler (`metrics_cycler.go`)

1.4 Getter Methods for GPU Metrics

1.5 SNMP Handler Extension (`snmp_handlers.go`)

Files Modified

Phase 2: SNMP Resource Files

2.1 Directory Structure

2.2 SNMP System MIB (per flavor)

2.3 SNMP GPU MIB (NVIDIA enterprise OIDs)

Files Created

Phase 3: SSH Resources (`nvidia-smi` and `dcgmi` CLI)

3.1 Supported Commands

3.2 Example `nvidia-smi` Output (H100)

Files Created

Phase 4: REST API Resources (DCGM HTTP Endpoints)

4.1 API Endpoints

4.2 Example Response: `/api/v1/gpu/status`

4.3 Example Response: `/api/v1/gpu/topology`

Files Created

Phase 5: Integration with Simulator Framework

5.1 Resource Loading (`resources.go`)

5.2 Device Type Detection (`resources.go`)

5.3 Round-Robin Registration (`types.go`)

5.4 API Resource Merging (`resources.go`)

Files Modified

Phase 6: Build and Verify

File Change Summary

New Files (15 resource JSON files)

Modified Files (6 Go files)

No Changes Required