Observability

Cisco Virtual Kubelet uses OpenTelemetry as the shared correlation plane for controller reconciliation, per-device Virtual Kubelet work, MDT-over-gNMI telemetry, configuration, topology, and read-only device operations.

The important rule is that every signal carries stable resource identity. That lets Grafana, Prometheus, Jaeger, Loki, Tempo, Splunk Observability, or another OTel backend join traces, metrics, and logs by device, pod, process role, and Kubernetes namespace without relying on log text.

OTel Spine

Process identity

CVK emits from three process roles. They may run in one Kubernetes pod, but they are separate OTel resources by design:

Process | service.name | cvk.process.role | Notes
System controller | cisco-vk-controller | controller | Watches CiscoDevice and reconciles controller-managed Kubernetes resources.
Per-device VK provider | cisco-vk-vk | vk-provider | Owns the upstream Virtual Kubelet node and Pod lifecycle.
Per-device telemetry emitter | cisco-vk-telemetry | telemetry-emitter | Owns MDT-over-gNMI subscriptions and telemetry mapping.

Common attributes include service.instance.id, host.name, net.peer.name, k8s.pod.name, k8s.namespace.name, k8s.node.name, k8s.pod.uid, cisco.device.name, cisco.device.address, and cvk.driver.kind.
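As an illustration of the identity rule, the sketch below builds the attribute set for one process role. The role-to-service.name mapping comes from the table above; the helper name resource_attributes and its parameters are hypothetical, and only a subset of the common attributes is shown.

```python
# Role -> service.name mapping, taken from the process table above.
SERVICE_NAMES = {
    "controller": "cisco-vk-controller",
    "vk-provider": "cisco-vk-vk",
    "telemetry-emitter": "cisco-vk-telemetry",
}


def resource_attributes(role, device_name, namespace, pod_name):
    """Build identity attributes for one process role (hypothetical helper)."""
    if role not in SERVICE_NAMES:
        raise ValueError(f"unknown cvk.process.role: {role!r}")
    return {
        "service.name": SERVICE_NAMES[role],
        "cvk.process.role": role,
        "cisco.device.name": device_name,
        "k8s.namespace.name": namespace,
        "k8s.pod.name": pod_name,
    }


if __name__ == "__main__":
    attrs = resource_attributes("telemetry-emitter", "cat9000-1", "cisco-vk", "cvk-cat9000-1")
    print(attrs["service.name"], attrs["cvk.process.role"])
```

Because every role shares the same device and pod keys, a backend can join the three resources on cisco.device.name or k8s.pod.name while still telling them apart by service.name.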

Trace taxonomy

Span names use one stable shape across surfaces:

Surface | Span name pattern | Examples
Kubernetes reconcile | cvk.<resource>.reconcile | cvk.iosxeconfig.reconcile
Device transport | cvk.transport.<protocol>.<verb> | cvk.transport.netconf.get, cvk.transport.restconf.post
Config engine | cvk.config.<phase> | cvk.config.reconcile, cvk.config.plan, cvk.config.apply
Topology cycle | cvk.topology.cycle | root span per bounded topology emission
Device operation | cvk.op.<kind> | cvk.op.show_command, cvk.op.config_diff
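A stable naming shape is easy to check mechanically. The following sketch classifies a span name against the patterns in the table. It assumes each placeholder (<resource>, <protocol>, <verb>, <phase>, <kind>) is a lowercase identifier, which the table's examples suggest but the source does not state; classify_span_name is a hypothetical helper, not part of CVK.

```python
import re

# Assumption: every placeholder segment is a lowercase identifier.
_SEG = r"[a-z][a-z0-9_]*"

# Order matters: cvk.config.reconcile belongs to the config engine, so the
# more specific prefixes are tried before the generic reconcile pattern.
SPAN_NAME_PATTERNS = {
    "config": re.compile(rf"cvk\.config\.{_SEG}"),
    "transport": re.compile(rf"cvk\.transport\.{_SEG}\.{_SEG}"),
    "topology": re.compile(r"cvk\.topology\.cycle"),
    "operation": re.compile(rf"cvk\.op\.{_SEG}"),
    "reconcile": re.compile(rf"cvk\.{_SEG}\.reconcile"),
}


def classify_span_name(name):
    """Return the surface a span name belongs to, or None if it fits no pattern."""
    for surface, pattern in SPAN_NAME_PATTERNS.items():
        if pattern.fullmatch(name):
            return surface
    return None


if __name__ == "__main__":
    for name in ("cvk.iosxeconfig.reconcile", "cvk.transport.netconf.get", "Reconcile IOSXEConfig"):
        print(name, "->", classify_span_name(name))
```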

The process installs the upstream Virtual Kubelet OTel adapter at startup, so VK spans such as Pod admission, status sync, and lease updates can parent CVK driver spans automatically when context is propagated. The repository keeps a small scripts/lint-ctx.sh guard to catch new unreviewed context.Background() usage in controller or driver paths.

Correlation

Pod admission stores a (device, app_id) -> SpanContext entry in a bounded cache, serialized in the W3C traceparent format. MDT app-hosting recovery events consult that cache so recent device-side state transitions can be emitted under the Pod admission trace. The cache is deliberately bounded and short-lived: it exists for causality, not durable storage.
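A minimal sketch of such a cache, assuming a size cap with oldest-first eviction and a fixed time-to-live. The class name, the default limits, and the eviction policy are illustrative assumptions; the source only states that the cache is bounded and short-lived.

```python
import time
from collections import OrderedDict


class SpanContextCache:
    """Bounded, short-lived (device, app_id) -> traceparent cache (sketch)."""

    def __init__(self, max_entries=1024, ttl_seconds=300.0, clock=time.monotonic):
        # max_entries and ttl_seconds are assumed defaults, not CVK's values.
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (traceparent, expires_at)

    def put(self, device, app_id, traceparent):
        key = (device, app_id)
        self._entries.pop(key, None)  # re-insert so the key becomes the newest
        self._entries[key] = (traceparent, self._clock() + self._ttl)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, device, app_id):
        key = (device, app_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        traceparent, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: the causality window is over
            return None
        return traceparent


if __name__ == "__main__":
    cache = SpanContextCache(max_entries=2, ttl_seconds=60.0)
    cache.put("cat9000-1", "app-a", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    print(cache.get("cat9000-1", "app-a"))
```

A miss simply means the recovery event starts its own trace, which is why losing entries to eviction or expiry is acceptable here.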

Config reconcile writes trace hints back to status annotations:

Annotation | Meaning
cisco.vk/traceparent | W3C carrier for the current reconcile window.
cisco.vk/trace-window-end | Expiry for using that trace context.
cisco.vk/last-trace-id | Most recent reconcile trace ID.
cisco.vk/last-trace-duration | Most recent reconcile duration.
cisco.vk/last-error-trace-id | Failed reconcile trace ID, when present.
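A consumer of these annotations would typically check the window before reusing the carrier. The sketch below parses the version-00 W3C traceparent layout and rejects it once the window has ended. It assumes cisco.vk/trace-window-end holds a timezone-aware RFC 3339 timestamp; the source does not specify the encoding, and both function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# W3C Trace Context, version-00 layout: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}


def usable_trace_context(annotations, now):
    """Return the parsed context if the reconcile window is still open."""
    carrier = annotations.get("cisco.vk/traceparent")
    window_end = annotations.get("cisco.vk/trace-window-end")
    if not carrier or not window_end:
        return None
    # Assumption: RFC 3339 with an explicit offset or a trailing "Z".
    end = datetime.fromisoformat(window_end.replace("Z", "+00:00"))
    if now >= end:
        return None
    return parse_traceparent(carrier)


if __name__ == "__main__":
    annotations = {
        "cisco.vk/traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        "cisco.vk/trace-window-end": "2999-01-01T00:00:00Z",
    }
    print(usable_trace_context(annotations, datetime.now(timezone.utc)))
```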

Config Revision History

IOSXEConfigRevision objects store resolved intent for successful applies so spec.rollbackTo can replay a prior revision through the normal reconcile path. When an IOSXEConfig uses spec.secretRefs, revisions are still created, but secret-backed family blocks are omitted from spec.body and spec.secretMaterialOmitted is set. Rollback to those revisions is allowed only when the current IOSXEConfig references the same Secret names; otherwise status reports RollbackBlocked=True with reason RevisionMissingSecretMaterial.
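The rollback rule can be read as a small decision function. The sketch below is one interpretation: it assumes the revision records which Secret names it depended on (the secretNames key is hypothetical and not confirmed by the source) and treats "the same Secret names" as set equality.

```python
def rollback_decision(revision, current_secret_names):
    """Decide whether a rollback to `revision` may proceed.

    `revision` is a dict with 'secretMaterialOmitted' (bool) and a
    hypothetical 'secretNames' list. Returns (allowed, reason).
    """
    if not revision.get("secretMaterialOmitted", False):
        return True, ""  # full body stored, nothing to re-resolve
    # Assumption: "same Secret names" means the two sets are equal.
    if set(revision.get("secretNames", [])) == set(current_secret_names):
        return True, ""
    return False, "RevisionMissingSecretMaterial"


if __name__ == "__main__":
    revision = {"secretMaterialOmitted": True, "secretNames": ["snmp-creds"]}
    print(rollback_decision(revision, ["snmp-creds"]))
    print(rollback_decision(revision, ["aaa-keys"]))
```

When the function returns False, the reason string is the one the status condition RollbackBlocked=True would carry.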

Self metrics

CVK emits self metrics with the cisco_vk_telemetry_* prefix. Important pipeline health metrics include:

Metric | Meaning
cisco_vk_telemetry_active_streams | Active gNMI subscription streams by device and subscription.
cisco_vk_telemetry_stream_reconnects_total | Stream reconnect attempts.
cisco_vk_telemetry_log_records_emitted_total | OTel log records emitted by the MDT mapper.
cisco_vk_telemetry_instrument_cap_drops_total | Metric instruments dropped by cardinality cap.
cisco_vk_telemetry_transitions_dropped_total | State transition events dropped by transition buffering.
cisco_vk_telemetry_notifier_dropped_total | PodNotifier events dropped because the callback queue was full.
cisco_vk_telemetry_exporter_failures_total | OTLP exporter failures by signal and reason.

Current topology boundary

The MDT state cache is currently used for telemetry state, PodNotifier, and trace correlation. It is not yet the source of topology. The topology provider still reads the device through the existing driver paths. Moving CDP/OSPF topology to MDT remains gated on the lab validation of the YANG paths listed in the unified architecture plan.

Existing Surfaces

Cisco Virtual Kubelet also exposes four Kubernetes-facing observability surfaces:

  • Prometheus metrics on the standard kubelet /metrics/resource endpoint.
  • Kubernetes stats/summary on /stats/summary — powers kubectl top node.
  • OpenTelemetry topology traces emitted on a configurable interval. OpenTelemetry (OTEL) is a vendor-neutral observability framework; here we use it to emit a device-centric topology trace that backends like Splunk Observability Cloud can render as a service map.
  • Kubernetes node annotations populated on every status sync, surfacing basic network context like router ID and neighbor counts.

Three of these surfaces — metrics (topology subset), OTEL traces, and node annotations — depend on the driver implementing the optional TopologyProvider interface. The IOS-XE driver does; other drivers may not, in which case their VKs simply omit the topology-derived data without erroring.

CDP (Cisco Discovery Protocol) and OSPF (Open Shortest Path First) are referenced throughout this page. Both give the device a view of its neighbors: CDP is a Layer 2 discovery protocol for directly connected Cisco gear, and OSPF is a Layer 3 routing protocol whose adjacencies identify routing peers.

Metrics

Catalog

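All cisco_device_* metrics are of type gauge. The base set works on any driver; the topology-derived metrics require a driver that implements TopologyProvider. The cisco_vk_* lifecycle metrics whose names end in _total follow the Prometheus naming convention for counters.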

Device resources (always present)

Metric | Labels | Notes
cisco_device_cpu_usage_percent | | IOx subsystem CPU usage
cisco_device_memory_used_bytes | | IOx subsystem memory in use
cisco_device_memory_total_bytes | | IOx subsystem memory quota
cisco_device_storage_used_bytes | | IOx subsystem storage in use
cisco_device_storage_total_bytes | | IOx subsystem storage quota

gNOI lifecycle metrics

These metrics are registered when the corresponding controllers and gNOI client packages are linked into the process. The write-class and software upgrade reconcilers still require their enablement flags before they act on CRs.

Metric | Labels | Notes
cisco_vk_gnoi_rpc_total | service, outcome | gNOI RPC outcomes, including ok, unimplemented, unavailable, and other error classes.
cisco_vk_gnoi_capability_cache_total | service, result | Capability cache hit, miss, expiration, pin, and fail-fast decisions.
cisco_vk_devicegrpc_lease_events_total | class, event | Workload-classed gRPC pool lease and release events.
cisco_vk_devicegrpc_outstanding_leases | class | Current outstanding gRPC pool leases.
cisco_vk_devicegrpc_close_leak_detected_total | class | Outstanding leases observed when a pool closes.
cisco_vk_iosxe_software_upgrade_phase_transitions_total | device, target_version, from, to, reason | Software upgrade state transitions.
cisco_vk_iosxe_operational_action_transitions_total | device, kind, phase, reason | Write-class action phase transitions and terminal outcomes.

Interfaces (TopologyProvider)

Metric | Labels | Notes
cisco_device_interface_rx_bits_per_sec | interface | Current receive rate
cisco_device_interface_tx_bits_per_sec | interface | Current transmit rate
cisco_device_interface_state | interface, state | 1 when state="up", 0 otherwise

Neighbors (TopologyProvider)

Metric | Labels | Notes
cisco_device_cdp_neighbor_count | | Number of CDP neighbors
cisco_device_ospf_neighbor_count | | Number of OSPF neighbors
cisco_device_neighbor_link | target, interface, protocol, platform | Fixed value 1 per discovered neighbor; drop/cardinality control lives on the collector side
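To make the "fixed value 1 per neighbor" shape concrete, the sketch below renders cisco_device_neighbor_link samples in the Prometheus text exposition format. The input record layout and both function names are assumptions for illustration; this is not the provider's actual handler.

```python
def escape_label_value(value):
    """Escape a label value per the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def neighbor_link_lines(neighbors):
    """Render one sample with value 1 for each discovered neighbor.

    `neighbors` is a list of dicts with the four label keys from the table
    (a hypothetical input shape).
    """
    lines = ["# TYPE cisco_device_neighbor_link gauge"]
    for neighbor in neighbors:
        labels = ",".join(
            f'{key}="{escape_label_value(neighbor[key])}"'
            for key in ("target", "interface", "protocol", "platform")
        )
        lines.append(f"cisco_device_neighbor_link{{{labels}}} 1")
    return lines


if __name__ == "__main__":
    sample = [{
        "target": "core-sw-1",
        "interface": "GigabitEthernet0/0/1",
        "protocol": "cdp",
        "platform": "C9300",
    }]
    print("\n".join(neighbor_link_lines(sample)))
```

Since the series count grows with the number of neighbors, any relabeling or dropping has to happen in the collector, as the table notes.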

Scraping

The metrics are served on the node's kubelet HTTPS listener (port 10250) at /metrics/resource. Any workload that already scrapes kubelet metrics picks them up automatically — for example, kube-prometheus-stack's ServiceMonitor for nodes, or the Prometheus kubelet scrape config.

No extra configuration on the VK side is required — the handler is always wired.

Stats / summary

GET /stats/summary returns a Kubernetes statsv1alpha1.Summary for this node:

  • CPU: UsageNanoCores derived from IOx CPU percentage
  • Memory: UsageBytes, AvailableBytes (in bytes, converted from MB)
  • Filesystem: CapacityBytes, UsedBytes, AvailableBytes for IOx storage
  • Network: Per-interface RxBytes / TxBytes (when TopologyProvider is implemented)
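The unit conversions behind the CPU and memory fields can be sketched as follows. The exact derivation is not documented here, so this assumes the CPU percentage is relative to a given number of cores (one by default) and that the device reports binary megabytes; both are assumptions, and the function names are hypothetical.

```python
NANOCORES_PER_CORE = 1_000_000_000
BYTES_PER_MB = 1024 * 1024  # assumption: the device reports binary megabytes


def usage_nano_cores(cpu_percent, cores=1):
    """Convert a CPU percentage to nanocores (assumes percent of `cores`)."""
    clamped = min(max(cpu_percent, 0.0), 100.0)
    return int(clamped / 100.0 * cores * NANOCORES_PER_CORE)


def memory_stats(used_mb, total_mb):
    """Convert used/total megabytes to the summary's byte fields."""
    used = int(used_mb * BYTES_PER_MB)
    total = int(total_mb * BYTES_PER_MB)
    return {"UsageBytes": used, "AvailableBytes": max(total - used, 0)}


if __name__ == "__main__":
    print(usage_nano_cores(25.0))
    print(memory_stats(512, 2048))
```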

This is what kubectl top node cat9000-1 reads. It also feeds the Kubernetes Horizontal and Vertical Pod Autoscalers if you run them against VK nodes.

OpenTelemetry topology export

Enable per-device via the otel: block in the device config / CR:

otel:
  enabled: true
  endpoint: "otel-collector.observability.svc:4317"
  insecure: true
  serviceName: "cisco-network"
  intervalSecs: 60
  maxLinkSpans: 256

The exporter connects to the OTLP gRPC endpoint, emits one trace per interval, and shuts down cleanly on context cancel (5 s grace).

Resource attributes

Every span carries:

Key | Value
service.name | "{serviceName}.{hostname}", e.g. cisco-network.cat9000-1
service.namespace | "network.infrastructure"
host.name | VK node name
device.address | Management IP/hostname

Span hierarchy

Each emission cycle produces one trace:

root span: cvk.topology.cycle           [SERVER]
├── link.<localIface>-><peerDeviceID>   [CLIENT]  (one per CDP/OSPF neighbor)
├── link.<localIface>-><peerDeviceID>   [CLIENT]
├── …
└── hosted.<podNs>/<podName>            [CLIENT]  (one per hosted container)

Root span (cvk.topology.cycle)

Attribute | Source
topology.cycle.id | Unique ID for one bounded topology emission cycle
topology.emitted_link_count | Link spans emitted after applying maxLinkSpans
topology.dropped_link_count | Links omitted because the cap was reached
node.name | Device hostname
node.type | "network_device"
node.role | "router"
node.neighbor_count | Count of observed consolidated neighbors before span capping
node.interface_count | Count of interfaces with IPs
router.id | DeviceInfo.RouterID (OSPF/BGP)
router.platform | DeviceInfo.ProductID
router.os.version | DeviceInfo.SoftwareVersion
router.serial | DeviceInfo.SerialNumber
router.ip.addresses | Comma-joined interface IPs
network.layer | "L3"
network.type | "routed"
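The relationship between maxLinkSpans and the three count attributes on the root span can be sketched like this. It assumes the cap keeps the first N links in whatever order the provider supplies them; the source does not say which links are kept, and cap_links is a hypothetical helper.

```python
def cap_links(links, max_link_spans):
    """Apply the link-span cap and derive the root span's count attributes."""
    limit = max(max_link_spans, 0)
    emitted = links[:limit]  # assumption: keep the first N in input order
    root_attrs = {
        "node.neighbor_count": len(links),  # counted before capping
        "topology.emitted_link_count": len(emitted),
        "topology.dropped_link_count": len(links) - len(emitted),
    }
    return emitted, root_attrs


if __name__ == "__main__":
    links = [f"link-{i}" for i in range(300)]
    emitted, attrs = cap_links(links, 256)
    print(len(emitted), attrs)
```

With these attributes a dashboard can tell a genuinely small topology apart from one that was truncated: emitted plus dropped always equals the neighbor count.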

Link spans (link.<localIface>-><peerDeviceID>)

CDP and OSPF neighbors are consolidated per local interface: a single span represents a link even when both protocols are active.

Attribute | Notes
topology.cycle.id | Matches the cycle root span
peer.service | "{serviceName}.{peerDeviceID}"; matches the root span of the peer if it is also exporting, enabling service-map correlation
net.peer.name | peerDeviceID
net.peer.ip | Peer management IP
net.host.interface | Local interface (e.g. GigabitEthernet0/0/1)
net.peer.interface | Remote interface
link.type | "physical"
link.protocols | "+"-joined, e.g. "cdp+ospf"
link.state | "up" normally, "degraded" when OSPF state is not "full" or "2way"
link.utilization.in.bps | From interface stats (if available)
link.utilization.out.bps | From interface stats (if available)
link.speed.bps | Interface speed
ospf.neighbor.state | OSPF-only, when OSPF is a protocol on this link
ospf.area | OSPF-only
peer.platform, peer.capabilities | From CDP
topology.link_id | "{hostname}:{localIf}->{peerID}:{remoteIf}" (stable identifier)
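The consolidation rule and the derived link.protocols, link.state, and topology.link_id values can be sketched as below. The neighbor record keys (local_interface, device_id, neighbor_id, remote_interface, state) are hypothetical, and the sketch simplifies to one peer per local interface, with the CDP entry winning as the peer identity when both protocols see the link.

```python
def consolidate_links(hostname, cdp_neighbors, ospf_neighbors):
    """Merge CDP and OSPF neighbors into one record per local interface."""
    links = {}  # local interface -> partial link record (insertion-ordered)

    def record(local_if, peer_id, remote_if, protocol):
        link = links.setdefault(
            local_if,
            {"peer_id": peer_id, "remote_if": remote_if, "protocols": [], "ospf_state": None},
        )
        if protocol not in link["protocols"]:
            link["protocols"].append(protocol)
        return link

    for n in cdp_neighbors:
        record(n["local_interface"], n["device_id"], n.get("remote_interface", ""), "cdp")
    for n in ospf_neighbors:
        link = record(n["local_interface"], n["neighbor_id"], "", "ospf")
        link["ospf_state"] = n["state"]

    spans = []
    for local_if, link in links.items():
        state = link["ospf_state"]
        # "degraded" only when OSPF is present and not in full or 2way.
        degraded = state is not None and state.lower() not in ("full", "2way")
        attrs = {
            "net.host.interface": local_if,
            "net.peer.name": link["peer_id"],
            "link.protocols": "+".join(link["protocols"]),
            "link.state": "degraded" if degraded else "up",
            "topology.link_id": "{}:{}->{}:{}".format(
                hostname, local_if, link["peer_id"], link["remote_if"]
            ),
        }
        if state is not None:
            attrs["ospf.neighbor.state"] = state
        spans.append(attrs)
    return spans


if __name__ == "__main__":
    cdp = [{"local_interface": "GigabitEthernet0/0/1", "device_id": "core-sw-1",
            "remote_interface": "GigabitEthernet1/0/1"}]
    ospf = [{"local_interface": "GigabitEthernet0/0/1", "neighbor_id": "10.0.0.2", "state": "full"}]
    for span in consolidate_links("cat9000-1", cdp, ospf):
        print(span)
```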

Hosted-app spans (hosted.<namespace>/<podName>)

Attribute | Notes
topology.cycle.id | Matches the cycle root span
peer.service | "app.{namespace}/{podName}" (distinct namespace from network neighbors)
service.type | "app-hosting"
deployment.environment | "edge-compute"
app.id | Device app ID (e.g. cvk00000_<uid>)
app.state | Device lifecycle state (RUNNING, DEPLOYED, …)
k8s.pod.name, k8s.pod.namespace, k8s.pod.uid, k8s.container.name | Pod identity
app.ip, app.mac | When oper-data has resolved them
net.host.interface | Attached device interface
topology.link_id | "{hostname}->{peerService}"

What you get

  • Service map: in a backend like Splunk Observability Cloud, devices appear as services and links between them render as edges. Pods hosted on a device appear as downstream services of that device.
  • Change detection: each interval is a full snapshot. Diffing consecutive traces shows topology changes (new/lost neighbors, state transitions).
  • Correlation: topology.link_id is stable across emissions, so queries that group or filter by link are consistent over time.
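Because each cycle is a full snapshot keyed by a stable topology.link_id, change detection reduces to a set difference. A minimal sketch, assuming each snapshot has already been reduced to a mapping of topology.link_id to link.state (the reduction step and the function name are hypothetical):

```python
def diff_snapshots(previous, current):
    """Compare two cycles, each given as {topology.link_id: link.state}."""
    prev_ids, cur_ids = set(previous), set(current)
    return {
        "added": sorted(cur_ids - prev_ids),
        "removed": sorted(prev_ids - cur_ids),
        "state_changed": sorted(
            link_id for link_id in prev_ids & cur_ids
            if previous[link_id] != current[link_id]
        ),
    }


if __name__ == "__main__":
    before = {"r1:Gi1->r2:Gi1": "up", "r1:Gi2->r3:Gi1": "up"}
    after = {"r1:Gi1->r2:Gi1": "degraded", "r1:Gi3->r4:Gi1": "up"}
    print(diff_snapshots(before, after))
```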

Failure modes

OTEL initialisation failure is non-fatal. If the OTLP endpoint is unreachable at startup the VK pod logs a warning and continues without OTEL. Intermittent topology-data errors (CDP, OSPF, interfaces) are logged at debug level and the affected attributes are simply omitted from that emission.

Node annotations

On every node status sync the provider populates these annotations from the driver:

Annotation | Source
cisco.io/router-id | DeviceInfo.RouterID (OSPF/BGP)
cisco.io/hostname | DeviceInfo.Hostname
cisco.io/cdp-neighbor-count | Count from GetCDPNeighbors()
cisco.io/ospf-neighbor-count | Count from GetOSPFNeighbors()
cisco.io/protocols | Comma-joined list of protocols with at least one neighbor (cdp, ospf)

Use them for:

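  • kubectl get nodes -o custom-columns='NAME:.metadata.name,ROUTER_ID:.metadata.annotations.cisco\.io/router-id' (kubectl's -L flag selects label columns only, so it cannot display annotations)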
  • Dashboards that filter by active protocol
  • Basic alerting: cisco.io/cdp-neighbor-count == 0 → isolation alarm

Annotation size is deliberately kept small (scalar counts, not full neighbor lists). For full topology data use OTEL; for interface detail use the Prometheus metrics.
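The annotation set is small enough to compute in a few lines. The sketch below derives it from the neighbor lists; the function name and argument shapes are hypothetical, and the counts are rendered as strings because Kubernetes annotation values are always strings.

```python
def topology_annotations(router_id, hostname, cdp_neighbors, ospf_neighbors):
    """Build the node annotations from driver data (scalar values only)."""
    protocols = [
        name
        for name, neighbors in (("cdp", cdp_neighbors), ("ospf", ospf_neighbors))
        if neighbors  # list only protocols with at least one neighbor
    ]
    return {
        "cisco.io/router-id": router_id,
        "cisco.io/hostname": hostname,
        "cisco.io/cdp-neighbor-count": str(len(cdp_neighbors)),
        "cisco.io/ospf-neighbor-count": str(len(ospf_neighbors)),
        "cisco.io/protocols": ",".join(protocols),
    }


if __name__ == "__main__":
    print(topology_annotations("10.0.0.1", "cat9000-1", ["core-sw-1"], []))
```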

End-to-end example: Splunk Observability Cloud

Splunk Observability Cloud ingests both Prometheus metrics and OTLP traces through a single OpenTelemetry Collector, so a typical deployment is:

  1. Install the Splunk OpenTelemetry Collector for Kubernetes, pointed at your Splunk Observability Cloud realm and access token. It:
    • scrapes kubelet /metrics/resource automatically — the cisco_device_* metrics appear without extra config;
    • exposes an OTLP gRPC endpoint (default :4317) for traces.
  2. On each CiscoDevice, point the VK's OTEL exporter at the collector:
    spec:
      otel:
        enabled: true
        endpoint: "splunk-otel-collector-agent.observability.svc:4317"
        insecure: true
        intervalSecs: 60
    

Splunk Observability Cloud dashboards can then:

  • Plot per-interface throughput (cisco_device_interface_*_bits_per_sec)
  • Alert on neighbor loss (cisco_device_cdp_neighbor_count < previous)
  • Visualise the network as a service map from the OTEL traces