GPU pollaris and parsing rules

This page describes the polling definitions (pollaris) and parsing rules in l8parser that collect SNMP, SSH, and REST data from simulated NVIDIA GPU servers and populate the GpuDevice protobuf model. It's split into two parts:

  1. Foundational SNMP pollaris — the SnmpGpuTable rule, the vendor detection wiring, and the initial set of SNMP polls that cover about half of GpuDevice.
  2. Complete coverage (SNMP + SSH + REST) — the gap analysis for the remaining fields and the additional SSH / REST polls plus new parsing rules needed to close it.

The GpuDevice protobuf messages themselves are defined in Protobuf model; the simulator side of the data source is in DCGM simulation.


Part 1: Foundational SNMP pollaris

Context

The opensim simulator already generates GPU device data (SNMP mock data, REST API, SSH) for NVIDIA DGX / HGX servers. A GPU protobuf model (GpuDevice) is described in Protobuf model. This part creates the initial polling definitions (pollaris) and the one new parsing rule required to collect SNMP data from GPU devices and populate the GpuDevice model — following the same patterns used for NetworkDevice (vendor-specific pollaris) and Cluster (K8s pollaris).

Prerequisite: The GpuDevice proto messages must be added to probler/proto/inventory.proto and bindings generated before these rules can be tested end-to-end.

Target project

l8parser/go/parser/boot/

SNMP OID inventory (from opensim mock data)

NVIDIA enterprise OID prefix: 1.3.6.1.4.1.53246

  • sysObjectID: 1.3.6.1.4.1.53246.1.2.1 (used for vendor detection).

Module-level (singleton)

OID                    Field
53246.1.1.1.0.1.0      GPU Count
53246.1.1.1.0.2.0      DCGM Version

Per-GPU static info (index 0-7)

Base prefix: 53246.1.1.1.1 — full OID is {base}.{X}.{gpu}.

OID Pattern    X     Field
.1.{gpu}       1     Device Name
.2.{gpu}       2     UUID
.3.{gpu}       3     Serial Number
.4.{gpu}       4     PCI Bus ID
.13.{gpu}      13    Driver Version
.14.{gpu}      14    CUDA Version
.15.{gpu}      15    ECC Corrected Count
.16.{gpu}      16    ECC Uncorrected Count
.17.{gpu}      17    Power State

Per-GPU dynamic metrics (OIDs 5-12)

OID Pattern    X     Field                   Rule Type
.5.{gpu}       5     GPU Utilization %       SetTimeSeries
.6.{gpu}       6     VRAM Used MiB           SetTimeSeries
.7.{gpu}       7     Temperature C           SetTimeSeries
.8.{gpu}       8     Power Draw W            SetTimeSeries
.9.{gpu}       9     Fan Speed %             SetTimeSeries
.10.{gpu}      10    SM Clock MHz            SetTimeSeries
.11.{gpu}      11    Memory Clock MHz        SetTimeSeries
.12.{gpu}      12    Memory Utilization %    SetTimeSeries

Standard MIBs (system, host resources, interfaces)

  • 1.3.6.1.2.1.1.* — sysDescr, sysName, sysLocation, sysUpTime → gpudevice.deviceinfo.*
  • 1.3.6.1.2.1.2.2.1.* — IF-MIB interface table → gpudevice.system.networkinterfaces.*
  • 1.3.6.1.2.1.25.* — Host Resources MIB (memory, storage, CPU) → gpudevice.system.*

Implementation phases

Phase 1.1: New parsing rule — SnmpGpuTable

New file: l8parser/go/parser/rules/SnmpGpuTable.go

The NVIDIA GPU SNMP data uses indexed OIDs: {base}.{metric_id}.{gpu_index}. Existing table rules (EntityMibToPhysicals, IfTableToPhysicals) are hardcoded for NetworkDevice. We need a new rule that:

  1. Receives a walked CMap of NVIDIA GPU OIDs.
  2. Groups entries by GPU index (0-7).
  3. Maps each metric OID suffix to the corresponding gpudevice.gpus.* property.
  4. Uses Set for static fields, SetTimeSeries for dynamic metrics.

Interface:

  • Name: "SnmpGpuTable"
  • Params:
    • oid_base — base OID prefix (e.g. 1.3.6.1.4.1.53246.1.1.1.1).
    • mapping — comma-separated oidSuffix:propertyName:type triples (e.g. 1:devicename:set,2:deviceuuid:set,5:gpuutilizationpercent:ts).
  • Logic: iterate CMap keys, extract {metric_id} and {gpu_index} from the OID, create / populate the Gpu repeated element at that index with the mapped property.
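
The mapping parameter described above can be materialised into a lookup before the walk is processed. A minimal sketch, assuming the rule parses the triples into a map keyed by OID suffix; gpuMapping and parseMapping are illustrative names, not existing l8parser API:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuMapping describes how one NVIDIA OID suffix maps onto a Gpu property.
// The field shapes are assumptions based on the "oidSuffix:propertyName:type"
// triples described above.
type gpuMapping struct {
	OidSuffix  string // e.g. "1" in {base}.1.{gpu_index}
	Property   string // e.g. "devicename"
	TimeSeries bool   // true for "ts", false for "set"
}

// parseMapping splits the comma-separated triples of the rule's
// "mapping" parameter into a lookup keyed by OID suffix.
func parseMapping(param string) (map[string]gpuMapping, error) {
	out := map[string]gpuMapping{}
	for _, triple := range strings.Split(param, ",") {
		parts := strings.Split(strings.TrimSpace(triple), ":")
		if len(parts) != 3 {
			return nil, fmt.Errorf("invalid mapping triple %q", triple)
		}
		out[parts[0]] = gpuMapping{
			OidSuffix:  parts[0],
			Property:   parts[1],
			TimeSeries: parts[2] == "ts",
		}
	}
	return out, nil
}

func main() {
	m, err := parseMapping("1:devicename:set,2:deviceuuid:set,5:gpuutilizationpercent:ts")
	if err != nil {
		panic(err)
	}
	fmt.Println(m["5"].Property, m["5"].TimeSeries)
}
```

With the lookup in hand, the rule only needs the metric id of each walked OID to decide which property to set and whether to use Set or SetTimeSeries.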

Phase 1.2: New pollaris file — nvidia.go

New file: l8parser/go/parser/boot/nvidia.go

Factory function: CreateNvidiaGpuBootPolls() *l8tpollaris.L8Pollaris

Pollaris Name: "nvidia-gpu"
Groups: ["nvidia", "nvidia-gpu"]

Polls to create:

Device info

  • What: .1.3.6.1.2.1.1 (system MIB)
  • Operation: L8C_Map
  • Cadence: DEFAULT_CADENCE
  • Attributes:
    • gpudevice.deviceinfo.hostname ← Set from .1.3.6.1.2.1.1.5.0 (sysName)
    • gpudevice.deviceinfo.vendor ← Contains "53246" → "NVIDIA"
    • gpudevice.deviceinfo.location ← Set from .1.3.6.1.2.1.1.6.0 (sysLocation)
    • gpudevice.deviceinfo.uptime ← Set from .1.3.6.1.2.1.1.3.0 (sysUpTime)
    • gpudevice.deviceinfo.devicestatus ← MapToDeviceStatus
    • gpudevice.deviceinfo.osversion ← Contains from sysDescr (parse OS info)
    • gpudevice.deviceinfo.kernelversion ← Contains from sysDescr (parse kernel)

Phase 1.3: Updates to SNMP.go

Add isNvidiaOid():

func isNvidiaOid(sysOid string) bool {
    normalizedOid := sysOid
    if !strings.HasPrefix(normalizedOid, ".") {
        normalizedOid = "." + normalizedOid
    }
    return strings.HasPrefix(normalizedOid, ".1.3.6.1.4.1.53246.")
}

Add NVIDIA to GetPollarisByOid() waterfall — before the default fallback:

if isNvidiaOid(sysOid) {
    return CreateNvidiaGpuBootPolls()
}

Add NVIDIA to GetAllPolarisModels() — append CreateNvidiaGpuBootPolls() to the returned slice.

Phase 1.4: Updates to ParsingRule.go

Add GpuDevice collection field mappings to injectIndexOrKey():

// GpuDevice collections
"gpus":              "{2}0", // repeated Gpu
"networkinterfaces": "{2}0", // repeated GpuNetworkInterface
"gpu_links":         "{2}0", // repeated GpuLink
"checks":            "{2}0", // repeated GpuHealthCheck

Phase 1.5: Register rule

Register SnmpGpuTable in l8parser/go/parser/service/Parser.go inside newParser(), following the existing pattern:

snmpGpuTable := &rules.SnmpGpuTable{}
p.rules[snmpGpuTable.Name()] = snmpGpuTable

Files (part 1)

File                                        Purpose                                                                Status
l8parser/go/parser/boot/nvidia.go           NVIDIA GPU pollaris factory (~200 lines)                               new
l8parser/go/parser/rules/SnmpGpuTable.go    Custom parsing rule for GPU table OIDs (~150 lines)                    new
l8parser/go/parser/boot/SNMP.go             Add isNvidiaOid(), update GetPollarisByOid() / GetAllPolarisModels()   modified
l8parser/go/parser/rules/ParsingRule.go     Add GPU collection mappings to injectIndexOrKey()                      modified
l8parser/go/parser/service/Parser.go        Register SnmpGpuTable rule in newParser()                              modified

Verification (part 1)

  1. cd l8parser/go && go build ./... — verify compilation.
  2. cd l8parser/go && go vet ./... — verify no issues.
  3. Verify GetAllPolarisModels() includes the NVIDIA pollaris.
  4. Verify GetPollarisByOid("1.3.6.1.4.1.53246.1.2.1") returns the NVIDIA pollaris.
  5. End-to-end test requires the GpuDevice proto model to be implemented first.

Part 2: Complete coverage (SNMP + SSH + REST)

Context

The initial NVIDIA GPU pollaris (nvidia.go) from Part 1 covers roughly half (32 of 64) of GpuDevice attributes via SNMP. This part adds the missing SNMP mappings and introduces SSH and REST polls to achieve full coverage of all pollable GpuDevice protobuf fields.

Target project: l8parser/go/parser/boot/nvidia.go and new parsing rules.

Data sources (from opensim mock data):

  • SNMP: OIDs under 1.3.6.1.4.1.53246.* + standard MIBs.
  • SSH: 10 commands (nvidia-smi, nvidia-smi -q -d *, dcgmi *, show version, lscpu).
  • REST: 7 endpoints (/api/v1/gpu/*, /api/v1/dcgm/*, /api/v1/system/*).

Gap analysis

GpuDeviceInfo — missing fields

Field                 Source      Protocol    Command / OID / Endpoint
model                 REST        RESTCONF    /api/v1/system/info → gpu_model
serial_number         SSH         SSH         dmidecode -s system-serial-number or from show version
ip_address            n/a         n/a         Set by collector (not polled)
kernel_version        SSH         SSH         uname -a → parse kernel version
cuda_version          SNMP        SNMPV2      OID .53246.1.1.1.1.14.0 (GPU 0; already in the GPU table but not at device level)
last_seen             n/a         n/a         Set by collector timestamp
latitude/longitude    n/a         n/a         Manual config (not polled)

GpuDeviceSystem — missing fields

Field                Source      Protocol    Command / Endpoint
cpu_sockets          SSH         SSH         lscpu → parse "Socket(s)"
cpu_cores_total      SSH         SSH         lscpu → parse "CPU(s)"
memory_used_bytes    SNMP        SNMPV2      HR MIB .1.3.6.1.2.1.25.2.3.1.6.1 (storage index 1 = Physical Memory)
memory_free_bytes    Computed    n/a         total - used (or from REST /api/v1/system/memory)
power_supplies       REST        RESTCONF    /api/v1/system/info (if available)
fans                 REST        RESTCONF    /api/v1/system/info (if available)
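
For the computed memory_free_bytes, note that the Host Resources MIB reports storage in allocation units rather than bytes: hrStorageAllocationUnits (.1.3.6.1.2.1.25.2.3.1.4) times hrStorageSize / hrStorageUsed gives the byte counts. A sketch of the conversion, with illustrative values:

```go
package main

import "fmt"

// memoryBytes converts the walked Host Resources MIB values for storage
// index 1 (Physical Memory) into the byte counts the proto expects:
//   hrStorageAllocationUnits  .1.3.6.1.2.1.25.2.3.1.4
//   hrStorageSize             .1.3.6.1.2.1.25.2.3.1.5
//   hrStorageUsed             .1.3.6.1.2.1.25.2.3.1.6
func memoryBytes(allocationUnits, size, used uint64) (totalBytes, usedBytes, freeBytes uint64) {
	totalBytes = size * allocationUnits
	usedBytes = used * allocationUnits
	freeBytes = totalBytes - usedBytes
	return
}

func main() {
	// Illustrative values: 4 KiB allocation units, 512 GiB total, half used.
	total, used, free := memoryBytes(4096, 134217728, 67108864)
	fmt.Println(total, used, free)
}
```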

Gpu (per-GPU) — missing fields

Field                          Source    Protocol    Command / Endpoint
gpu_index                      SNMP      SNMPV2      Implicit from OID index — set in SnmpGpuTable rule
compute_capability             REST      RESTCONF    /api/v1/gpu/devices → compute_capability
persistence_mode               REST      RESTCONF    /api/v1/gpu/devices → persistence_mode
numa_node                      REST      RESTCONF    /api/v1/gpu/topology → numa_affinity
vram_total_mib                 REST      RESTCONF    /api/v1/gpu/devices → memory_total_mib
encoder_utilization_percent    SSH       SSH         nvidia-smi -q -d UTILIZATION → "Encoder"
decoder_utilization_percent    SSH       SSH         nvidia-smi -q -d UTILIZATION → "Decoder"
memory_temperature_celsius     SSH       SSH         nvidia-smi -q -d TEMPERATURE → "GPU Memory Temp"
shutdown_temperature           SSH       SSH         nvidia-smi -q -d TEMPERATURE → "GPU Shutdown Temp"
slowdown_temperature           SSH       SSH         nvidia-smi -q -d TEMPERATURE → "GPU Slowdown Temp"
power_limit_watts              SSH       SSH         nvidia-smi -q -d POWER → "Default Power Limit"
sm_clock_base_mhz              REST      RESTCONF    /api/v1/gpu/devices (if available)
mem_clock_base_mhz             REST      RESTCONF    /api/v1/gpu/devices (if available)
health (GpuComponentHealth)    REST      RESTCONF    /api/v1/dcgm/health → per-check status
processes                      SSH       SSH         nvidia-smi → process table at bottom

GpuTopology — entirely missing

Field             Source    Protocol    Command / Endpoint
nvlink_version    REST      RESTCONF    /api/v1/gpu/topology → nvlink_version
nvswitch_count    REST      RESTCONF    /api/v1/gpu/topology → nvswitch_count
gpu_links         REST      RESTCONF    /api/v1/gpu/topology → connectivity array

GpuDeviceHealth — entirely missing

Field             Source    Protocol    Command / Endpoint
overall_status    REST      RESTCONF    /api/v1/dcgm/health → overall_health
checks            REST      RESTCONF    /api/v1/dcgm/health → checks object

Implementation phases

Phase 2.1: Fix missing SNMP attributes in existing polls

File: l8parser/go/parser/boot/nvidia.go

Add to existing polls:

  1. nvidiaGpuModule poll — add cuda_version from GPU 0:
    • gpudevice.deviceinfo.cudaversion ← Set from .1.3.6.1.4.1.53246.1.1.1.1.14.0.
  2. nvidiaHostResources poll — add memory used:
    • gpudevice.system.memoryusedbytes ← SetTimeSeries from .1.3.6.1.2.1.25.2.3.1.6.1 (Physical Memory used).
  3. SnmpGpuTable rule update — set gpu_index explicitly:
    • Currently the rule populates GPU fields but doesn't set the gpuindex field itself. Add logic in SnmpGpuTable.Parse() to set gpudevice.gpus.gpuindex = gpu_index from OID.

Phase 2.2: New SSH parsing rules

New file: l8parser/go/parser/rules/SshNvidiaSmiParse.go (~200 lines)

A parsing rule that handles nvidia-smi subcommand outputs. Uses a format parameter to select the parser:

  • Name: "SshNvidiaSmiParse"
  • Params: format — one of: utilization, temperature, power, version, lscpu.

Each format parser extracts per-GPU data from the structured nvidia-smi text output.

From nvidia-smi -q -d UTILIZATION — parses per-GPU blocks, extracts:

  • encoder_utilization_percent per GPU
  • decoder_utilization_percent per GPU
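
A sketch of the utilization parser, assuming the simulator emits the usual nvidia-smi -q layout of a "GPU <pci-bus-id>" header followed by indented "Key : Value %" lines; encDec and parseUtilization are hypothetical names, not the planned rule's actual code:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// encDec holds the two utilization fields this poll contributes per GPU.
type encDec struct {
	Encoder int32
	Decoder int32
}

// parseUtilization walks nvidia-smi -q -d UTILIZATION output.
// A new per-GPU block starts at each "GPU <pci-bus-id>" header line;
// inside a block, values look like "Encoder : 12 %".
func parseUtilization(output string) []encDec {
	var gpus []encDec
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "GPU ") {
			gpus = append(gpus, encDec{}) // start a new per-GPU block
			continue
		}
		key, value, found := strings.Cut(line, ":")
		if !found || len(gpus) == 0 {
			continue
		}
		// Strip the trailing " %" and parse the integer value.
		v, err := strconv.Atoi(strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(value), "%")))
		if err != nil {
			continue
		}
		switch strings.TrimSpace(key) {
		case "Encoder":
			gpus[len(gpus)-1].Encoder = int32(v)
		case "Decoder":
			gpus[len(gpus)-1].Decoder = int32(v)
		}
	}
	return gpus
}

func main() {
	sample := `GPU 00000000:07:00.0
    Utilization
        Gpu     : 45 %
        Memory  : 30 %
        Encoder : 12 %
        Decoder : 3 %
GPU 00000000:0F:00.0
    Utilization
        Encoder : 0 %
        Decoder : 0 %`
	fmt.Println(parseUtilization(sample))
}
```

The temperature and power formats can reuse the same block-splitting loop with different key sets.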

New file: l8parser/go/parser/rules/RestJsonParse.go (~150 lines)

A generic parsing rule that extracts fields from JSON REST API responses using dot-path notation.

  • Name: "RestJsonParse"
  • Params: mapping — comma-separated jsonPath:propertyId pairs, e.g. overall_health:gpudevice.health.overallstatus,nvlink_version:gpudevice.topology.nvlinkversion.

This rule deserialises the JSON response and walks the dot-paths to extract values, then sets them on the target properties. For array fields (like connectivity, checks), it iterates and populates repeated fields.
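
The dot-path walk can be sketched as follows; lookupPath is an illustrative helper, and only object keys are handled here, which is enough for the scalar mappings above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookupPath walks a decoded JSON document along a dot-path such as
// "system_memory.free_gb" and returns the value at the leaf, plus
// whether the path resolved. Array handling (connectivity, checks)
// would layer on top of this.
func lookupPath(doc map[string]any, path string) (any, bool) {
	var cur any = doc
	for _, seg := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = obj[seg]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	raw := `{"overall_health":"Healthy","system_memory":{"free_gb":1847.2}}`
	var doc map[string]any
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		panic(err)
	}
	v, _ := lookupPath(doc, "system_memory.free_gb")
	fmt.Println(v)
}
```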

Phase 2.3: New SSH polls in nvidia.go

Add SSH polls using L8PSSH protocol:

Poll                    Command                         Cadence                    Rule                                        PropertyId
nvidiaGpuUtilization    nvidia-smi -q -d UTILIZATION    EVERY_5_MINUTES_ALWAYS     SshNvidiaSmiParse(format: "utilization")    gpudevice.gpus
nvidiaGpuTemperature    nvidia-smi -q -d TEMPERATURE    EVERY_15_MINUTES_ALWAYS    SshNvidiaSmiParse(format: "temperature")    gpudevice.gpus
nvidiaGpuPower          nvidia-smi -q -d POWER          DEFAULT_CADENCE            SshNvidiaSmiParse(format: "power")          gpudevice.gpus
nvidiaVersion           show version                    DEFAULT_CADENCE            SshNvidiaSmiParse(format: "version")        gpudevice.deviceinfo
nvidiaCpuInfo           lscpu                           DEFAULT_CADENCE            SshNvidiaSmiParse(format: "lscpu")          gpudevice.system

Phase 2.4: New REST polls in nvidia.go

Add REST polls using L8PRESTCONF protocol:

Poll                  Endpoint                 Cadence                    Mapping                                                             PropertyId
nvidiaGpuDevices      /api/v1/gpu/devices      DEFAULT_CADENCE            per-GPU compute_capability, persistence_mode, memory_total_mib      gpudevice.gpus
nvidiaGpuTopology     /api/v1/gpu/topology     DEFAULT_CADENCE            nvlink_version, nvswitch_count, connectivity → gpu_links            gpudevice.topology
nvidiaDcgmHealth      /api/v1/dcgm/health      EVERY_5_MINUTES_ALWAYS     overall_health → overallstatus, checks → repeated GpuHealthCheck    gpudevice.health
nvidiaSystemMemory    /api/v1/system/memory    EVERY_15_MINUTES_ALWAYS    system_memory.free_gb → memoryfreebytes, gpu_memory.*               gpudevice.system

Phase 2.5: Update SnmpGpuTable rule

File: l8parser/go/parser/rules/SnmpGpuTable.go

Add automatic gpu_index population: when processing each GPU's data, also set gpudevice.gpus<{2}N>.gpuindex = N (as uint32).
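
Extracting the index from the walked OID can be sketched as below; splitGpuOid is a hypothetical helper, not the existing rule code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitGpuOid extracts the {metric_id} and {gpu_index} components from a
// walked OID of the form {base}.{metric_id}.{gpu_index}, where base is
// the rule's oid_base parameter (e.g. ".1.3.6.1.4.1.53246.1.1.1.1").
// The gpu_index is returned as uint32, matching the proto field type.
func splitGpuOid(oid, base string) (metricID string, gpuIndex uint32, err error) {
	suffix := strings.TrimPrefix(strings.TrimPrefix(oid, base), ".")
	parts := strings.Split(suffix, ".")
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("oid %q does not match {base}.{metric}.{gpu}", oid)
	}
	idx, err := strconv.ParseUint(parts[1], 10, 32)
	if err != nil {
		return "", 0, err
	}
	return parts[0], uint32(idx), nil
}

func main() {
	metric, gpu, err := splitGpuOid(".1.3.6.1.4.1.53246.1.1.1.1.5.3", ".1.3.6.1.4.1.53246.1.1.1.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(metric, gpu)
}
```

The rule can then set gpuindex on the repeated Gpu element at that index without any extra OID in the mapping.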

Phase 2.6: Register new rules

File: l8parser/go/parser/service/Parser.go

Register the two new rules:

sshNvidiaSmiParse := &rules.SshNvidiaSmiParse{}
p.rules[sshNvidiaSmiParse.Name()] = sshNvidiaSmiParse
restJsonParse := &rules.RestJsonParse{}
p.rules[restJsonParse.Name()] = restJsonParse

Coverage summary after implementation

Section                        Before    After
GpuDeviceInfo (17 fields)      10/17     15/17 (ip_address and lat/lng are config, not polled)
GpuDeviceSystem (13 fields)    8/13      12/13 (fans/PSU depend on device support)
Gpu per-GPU (29 fields)        14/29     28/29 (processes deferred)
GpuTopology (3 fields)         0/3       3/3
GpuDeviceHealth (2 fields)     0/2       2/2
Total (64 fields)              32/64     60/64

Remaining 4 unpolled fields:

  • ip_address — set by collector infrastructure, not polled.
  • latitude/longitude — manual configuration.
  • last_seen — set by collector timestamp.

Files (part 2)

New:

File                                             Purpose                                Est. Lines
l8parser/go/parser/rules/SshNvidiaSmiParse.go    SSH nvidia-smi/version/lscpu parser    ~300
l8parser/go/parser/rules/RestJsonParse.go        Generic REST JSON field extractor      ~200

Modified:

File                                        Change
l8parser/go/parser/boot/nvidia.go           Add 9 new polls (5 SSH + 4 REST), add missing SNMP attributes
l8parser/go/parser/rules/SnmpGpuTable.go    Auto-set gpuindex field
l8parser/go/parser/service/Parser.go        Register 2 new rules

Verification (part 2)

  1. cd l8parser/go && go build ./... — verify compilation.
  2. cd l8parser/go && go vet ./... — verify no issues.
  3. Count polls in CreateNvidiaGpuBootPolls() — should be 15 total.
  4. Verify all GpuDevice proto fields have a corresponding pollaris attribute.
  5. End-to-end test requires the GpuDevice proto model to be implemented first (see Protobuf model).