Troubleshooting
Common issues and how to diagnose them.
First — gather the basics
The Helm release name is cvk throughout this page (matching the Getting Started guide). If you installed with a different release name, substitute it wherever cvk appears below.
# CiscoDevice full state — usually the most useful starting point
kubectl describe ciscodevice <name>
kubectl get ciscodevice <name> -o yaml
# Controller logs
kubectl -n cvk-system logs deploy/cvk-controller --tail=200
# VK pod logs (one per device)
kubectl -n <device-namespace> logs deploy/<device-name>-vk --tail=200
# Virtual node status
kubectl describe node <device-name>
# Pods on the virtual node
kubectl get pods -A --field-selector spec.nodeName=<device-name>
On the device:
CiscoDevice stuck in Provisioning
Provisioning means the controller has created the ConfigMap and Deployment but no VK pod is Ready yet.
Check the VK Deployment:
kubectl -n <device-namespace> get deploy <device-name>-vk -o yaml
kubectl -n <device-namespace> describe pod -l app.kubernetes.io/name=cisco-vk,app.kubernetes.io/instance=<device-name>
Common causes:
- Image pull error: make sure image.repository / vkImage.repository points at a registry the cluster can pull from.
- Bad credentials: look for 401 Unauthorized in VK pod logs. Verify the Secret key is spelled password (not PASSWORD, not pass).
- Device unreachable: look for dial tcp: i/o timeout in VK pod logs. Check routing, firewall, and that RESTCONF is enabled (restconf in device config).
- TLS verification failing: look for x509: certificate signed by unknown authority. Either supply tls.caFile, or temporarily set tls.insecureSkipVerify: true to confirm.
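The log signatures above can be checked in one pass. The following is a minimal sketch, not part of CVK: classify_vk_log and its label names are hypothetical, and only the matched strings come from the list above. Image pull errors are not covered because they surface in kubectl describe pod rather than in VK logs.

```shell
# Hypothetical helper: read VK pod log lines on stdin and print one label
# for each failure signature that appears at least once.
# Typical use (placeholders as elsewhere on this page):
#   kubectl -n NAMESPACE logs deploy/DEVICE-vk --tail=200 | classify_vk_log
classify_vk_log() {
  while IFS= read -r line; do
    case "$line" in
      *"401 Unauthorized"*)
        echo "bad-credentials" ;;
      # Go usually prints the address between "dial tcp" and the error text.
      *"dial tcp"*"i/o timeout"*)
        echo "device-unreachable" ;;
      *"x509: certificate signed by unknown authority"*)
        echo "tls-verification" ;;
    esac
  done | sort -u
}
```

No output means none of the three signatures were found in the lines supplied.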
Pod stuck in Pending
Pending means the VK has accepted the pod but the app on the device has not yet reached RUNNING.
Walk the state machine. Check VK logs for the reconcile line:
Known intermediate states that are expected:
- INSTALLING: normal during the first 5–30 seconds.
- DEPLOYED: very brief; VK will issue activate on the next poll.
- ACTIVATED: very brief; VK will issue start on the next poll.
If the pod stays in the same state for more than a minute, something is wrong. See the specific sections below.
PackagePolicyInvalid false positives
Symptom
Pod shows Failed with:
status:
phase: Failed
reason: PackagePolicyInvalid
message: "install blocked: app package policy is invalid ..."
Why it happens
IOS-XE reports pkg-policy = iox-pkg-policy-invalid as the YANG default during the first 1–3 seconds of every install, before signature verification completes. A confirming install notification only appears when the device actually rejects the package. The reconciler tries to distinguish the two by waiting for the notification, but if the notification ordering is off or the device never emits one, you can get stuck.
Fix
If you're running unsigned packages on purpose — your own custom application builds or test images:
This does two things:
- Device-side: CVK PUTs app-hosting-cfg-data/controls with sign-verification: false on first connect, disabling the IOS-XE package signature check. Equivalent to no app-hosting signed-verification.
- Reconciler-side: treats iox-pkg-policy-invalid during INSTALLING as a transient (non-fatal) signal.
If the device-side PUT fails (e.g. platform does not support the YANG leaf, or insufficient privilege), CVK logs a warning and the device policy may still block unsigned installs.
If you want signing enforced:
- Verify the package is actually signed (show app-hosting detail appid <id> → Signature verified: YES).
- If the package is signed but the check is firing, check logs for the full notification text; the device explains what failed:
The pod recovery loop will automatically retry these failed pods with exponential backoff; you don't need to kubectl delete them.
Pod never gets an IP (shows 0.0.0.0)
IP discovery runs in two stages. If both come up empty the pod stays at 0.0.0.0.
1. Oper-data path
If the oper-data shows a real IP but the pod doesn't, the VK isn't scraping it — check VK logs for errors calling app-hosting-oper-data.
2. ARP fallback
If the container's MAC appears but the IP is still 0.0.0.0 at the pod, the VK's ARP lookup is failing. Most common cause: the MAC in oper-data doesn't match the ARP entry because the container hasn't finished its DHCP handshake yet. Give it 30 s; the reconciler will retry.
If no ARP entry exists at all:
- DHCP pool is misconfigured (wrong network, exhausted pool).
- VirtualPortGroup interface is down or has no IP.
- App-hosting does not have the guest-ipaddress fields populated (check show app-hosting list detailed).
kubectl top node returns an error
Verify the stats endpoint is reachable:
If that works but kubectl top fails, the metrics-server does not trust the kubelet certificate. On k3s:
On upstream Kubernetes, either supply a signed kubelet cert via --tls-cert-file / --tls-key-file, or add --kubelet-insecure-tls to the metrics-server deployment.
Prometheus metrics missing
cisco_device_* metrics are served from the VK pod's kubelet endpoint (/metrics/resource).
Expected setup: your Prometheus is already scraping kubelets (kube-prometheus-stack does this by default).
Check:
# Are the metrics there at all?
kubectl get --raw "/api/v1/nodes/<name>/proxy/metrics/resource" | grep cisco_device
gNOI actions or software upgrades do nothing
Both write-class gNOI surfaces are opt-in on the per-device VK pod:
- IOSXEOperationalAction requires --enable-write-class-gnoi or CISCO_VK_ENABLE_WRITE_CLASS_GNOI=true.
- IOSXESoftwareUpgrade requires --enable-iosxesoftwareupgrade or CISCO_VK_ENABLE_IOSXE_SOFTWARE_UPGRADE=true.
Check the VK pod args and logs:
kubectl -n <device-namespace> get deploy <device-name>-vk -o yaml | grep -E "enable-write-class-gnoi|enable-iosxesoftwareupgrade"
kubectl -n <device-namespace> logs deploy/<device-name>-vk --tail=200 | grep -i gnoi
If the CR remains untouched, verify that spec.deviceRef.name matches the
device worker's CiscoDevice name and that the VK service account can update
the CR status and finalizer subresources.
IOSXEOperationalAction is rejected
Common rejection reasons:
- ConfirmMismatch: spec.confirm must exactly equal spec.deviceRef.name.
- InvalidAction: exactly one typed args block must match spec.action.kind.
- Kubernetes admission rejects updates because spec is immutable after creation. Create a new action CR for a changed request.
For actions that reach Running and then fail, inspect both events and status:
kubectl describe iosxeoperationalaction <name>
kubectl get events --field-selector involvedObject.name=<name>
Running means the controller may already have invoked the device-side RPC.
The reconciler will not dispatch the same CR again after a restart.
IOSXESoftwareUpgrade fails during image resolution or transfer
For URL sources, imageSource.sha256 is required. Credential-bearing URLs are
redacted before they are written to CR status, events, or logs. For SCP/SFTP,
host-key verification is required unless the operator explicitly enables the
lab-only escape hatch:
When using localPath, add localPathSHA256 if the device supports gNOI
File.Get. A mismatch fails before activation with LocalPathHashMismatch.
Transfer interruptions move to TransferInterrupted and retry according to
spec.maxRetries unless spec.resumePolicy: Abort is set.
With rollbackOnFailure: true, a verify mismatch enters RollingBack and
attempts to re-activate the previously observed running version. If no previous
version was captured, the CR fails with RollbackVersionMissing.
For the missing Prometheus metrics case described earlier: if the raw endpoint (/api/v1/nodes/<name>/proxy/metrics/resource) returns metrics but Prometheus doesn't see them:
- The node ServiceMonitor isn't matching (check labels).
- The scrape job for kubelets doesn't use the /metrics/resource path; some configurations only scrape /metrics/cadvisor.
If the raw endpoint returns only cisco_device_cpu_*/memory_*/storage_* but no interface_* or cdp_*:
- The driver does not implement TopologyProvider. This is always the case for the FAKE driver and will be the case for future drivers that don't implement topology.
- Or the device has no CDP/OSPF neighbors to report.
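To see at a glance which metric families the raw endpoint returns, the output can be saved to a file and scanned. This is a rough sketch under one assumption: every family shares the cisco_device_ prefix (for example cisco_device_interface_*), which the abbreviated names above suggest but do not state. The function name is hypothetical.

```shell
# Hypothetical helper: report which cisco_device_* families appear in a
# file holding Prometheus text-format output from the raw endpoint.
# Assumption: all families use the cisco_device_ prefix.
cisco_metric_families() {
  metrics_file="$1"
  for family in cpu memory storage interface cdp; do
    if grep -q "^cisco_device_${family}_" "$metrics_file"; then
      echo "$family: present"
    else
      echo "$family: missing"
    fi
  done
}
```

If interface and cdp are reported missing while the other three are present, the two causes listed above are the ones to check.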
OTEL traces not appearing
Check it's enabled:
Verify VK pod startup:
You should see one of:
- OTEL topology exporter started: good, emitting
- Failed to initialise OTEL topology exporter: endpoint unreachable or config invalid
- driver does not implement TopologyProvider: wrong driver (FAKE doesn't, XE does)
Common misconfigurations:
- endpoint has a scheme prefix (wrong): https://otel:4317. Use host:port only.
- insecure: false against a plaintext gRPC collector. Use insecure: true for typical in-cluster OTLP collectors without TLS.
- intervalSecs set below 10: the enforced minimum is 10 s, and lower values are not clamped to 10; they silently fall back to 60 s.
No traces after 60 s: check the collector logs — it will receive spans in batches. Splunk Observability Cloud can sometimes take a minute to surface the first trace.
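The endpoint and interval rules above are easy to check before deploying. The two functions below are a hypothetical pre-flight sketch, not part of CVK; they only restate the behaviour described in the misconfiguration list.

```shell
# Hypothetical check: reject endpoints that carry a scheme prefix.
check_otel_endpoint() {
  case "$1" in
    *://*) echo "invalid: use host:port without a scheme"; return 1 ;;
    *:*)   echo "ok" ;;
    *)     echo "invalid: expected host:port"; return 1 ;;
  esac
}

# Hypothetical check: mirror the documented interval behaviour, where a
# value below 10 is replaced with 60 rather than raised to 10.
effective_interval() {
  if [ "$1" -lt 10 ]; then
    echo 60
  else
    echo "$1"
  fi
}

check_otel_endpoint "otel:4317"   # prints ok
effective_interval 5              # prints 60
```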
Pod stuck in PullingImage waiting state
Symptom
kubectl describe pod <name> shows the container in a waiting state:
State: Waiting
Reason: PullingImage
Message: Copying image to device flash; this may take several minutes
What is happening
The VK attempted a device-native HTTP pull that timed out (default 3 minutes), and is now running the copy fallback: it downloads the image tar from the HTTP URL and copies it to device flash via RESTCONF, then reinstalls from that local path. The copy RPC is synchronous and can take several minutes depending on image size and network speed.
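As a rough sizing aid for the copy step, transfer time is the image size in megabits divided by the link speed. The sketch below uses integer arithmetic and ignores protocol overhead, so treat the result as a lower bound; copy_seconds is a hypothetical name.

```shell
# Estimate copy time in seconds: megabytes * 8 bits / megabits per second.
copy_seconds() {
  size_mb="$1"
  link_mbps="$2"
  echo $(( size_mb * 8 / link_mbps ))
}

copy_seconds 500 100   # prints 40
```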
You can monitor progress via pod events:
kubectl describe pod <name>
# Look for the Events section at the bottom:
#
# Normal Pulling <time> cisco-virtual-kubelet Pulling image https://...
# Warning ImagePullFallback <time> cisco-virtual-kubelet Device-native pull timed out...
# Normal Copying <time> cisco-virtual-kubelet Copying image ... to flash:/...
# Normal Pulled <time> cisco-virtual-kubelet Image successfully copied to ...
# Normal Started <time> cisco-virtual-kubelet App ... is running
Wait times
- A 500 MB image over a 100 Mb/s management link takes roughly 40 seconds for the copy alone, plus 30 seconds for app activation. Allow 3–5 minutes total.
- If the pod does not transition to Running after 10 minutes, check VK logs for errors:
Avoiding the copy fallback
To use the copy path intentionally and skip the device-native pull attempt entirely, set imagePullPolicy: Never and pre-copy the tar to flash yourself. Then reference the flash path directly in the pod spec:
imagePullPolicy: IfNotPresent still re-downloads the image every time
Symptom
You set imagePullPolicy: IfNotPresent expecting the image to be reused from a local cache, but each pod creation issues a fresh download. dir flash: shows no cached tar.
Why
IOS-XE App Hosting does not leave a copy of the image on flash when using the device-native install path (app-hosting install appid ... package <url>). The device fetches and loads the image directly into the container runtime without writing it to flash. Since no flash copy is ever created, there is nothing for IfNotPresent to reuse — it behaves identically to Always on the device-native pull path.
The IfNotPresent flash-cache optimization only activates when the VK's copy fallback path has run at least once (i.e., the device-native pull timed out and the VK copied the tar to flash itself). After that first copy, subsequent deploys with IfNotPresent will reuse the cached tar.
Workaround
To reliably benefit from local caching, force the copy path by one of the following:
- Pre-copy the image to flash manually on the device, then use imagePullPolicy: Never with a flash path (flash:/virtual-kubelet/my-app.tar).
- Accept that on platforms where the device-native pull succeeds quickly, re-downloading from the registry on each deploy is the expected behaviour.
imagePullPolicy: Never with HTTP image URL
Symptom
Pod immediately goes to Failed with:
Why
imagePullPolicy: Never means the image must already exist on device flash and no download of any kind will be attempted. Using an HTTP or HTTPS URL with this policy is invalid.
Fix
Either:
- Change the image reference to a flash path: flash:/virtual-kubelet/my-app.tar
- Or change the imagePullPolicy to IfNotPresent or Always to allow the VK to download it.
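The rule above can be checked before a manifest is applied. This is a minimal sketch with a hypothetical function name; it only encodes the combination described in this section and does not reproduce the VK's own validation.

```shell
# Hypothetical pre-check: imagePullPolicy Never requires an image already
# on device flash, so an HTTP or HTTPS URL is rejected.
check_pull_policy() {
  policy="$1"
  image="$2"
  case "$policy:$image" in
    Never:http://*|Never:https://*)
      echo "invalid: Never needs a flash path"
      return 1 ;;
    *)
      echo "ok" ;;
  esac
}

check_pull_policy Never "flash:/virtual-kubelet/my-app.tar"   # prints ok
```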
Pod stuck Failed forever
Usually one of:
- reason: NotFound means the VK pod was restarted and lost state. The pod recovery loop handles this automatically.
- reason: ProviderFailed indicates a transient device issue. The recovery loop handles this.
- reason: PackagePolicyInvalid is covered in the PackagePolicyInvalid section above.
The pod recovery loop resets matching Failed pods to Pending with exponential backoff (15 s → 5 min). You should see this in VK logs:
If the loop isn't running, check the VK pod is healthy (not crash-looping). The recovery goroutine starts with the rest of the VK and stops when the VK exits.
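To get a feel for how long a Failed pod may wait between retries, the sketch below prints a backoff schedule from 15 s up to the 5 min cap. The doubling factor is an assumption; this page states only the bounds and that the backoff is exponential.

```shell
# Print the first N retry delays in seconds.
# Assumption: the delay doubles per attempt and is capped at 300 s.
backoff_schedule() {
  attempts="$1"
  delay=15
  i=0
  while [ "$i" -lt "$attempts" ]; do
    echo "$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$(( i + 1 ))
  done
}

backoff_schedule 7   # prints 15 30 60 120 240 300 300, one per line
```

Under that assumption the cap is reached on the sixth attempt, so a pod that keeps failing is retried every 5 minutes from then on.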
Virtual node lingers after CiscoDevice deletion
Normally the controller deletes the virtual Node as part of finalizer cleanup. If you see a lingering node:
This usually means the finalizer was skipped (force-delete of the CR, or the controller was down when deletion happened). Clean up by hand:
Check no orphaned Deployment / ConfigMap remains:
Avoid kubectl delete ciscodevice --force --grace-period=0 — it skips the finalizer and will cause this.
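The manual cleanup can be wrapped in a small helper. This is a sketch under stated assumptions, not the project's documented procedure: it assumes the virtual node is named after the device (as in the kubectl describe node command at the top of this page) and that leftover objects carry the device name in their own names. The function name is hypothetical.

```shell
# Hypothetical cleanup helper for a lingering virtual node.
# Usage: cleanup_virtual_node DEVICE_NAME DEVICE_NAMESPACE
cleanup_virtual_node() {
  device="$1"
  namespace="$2"
  # Remove the lingering Node object; harmless if it is already gone.
  kubectl delete node "$device" --ignore-not-found
  # List any Deployment or ConfigMap whose name still mentions the device.
  kubectl -n "$namespace" get deploy,configmap -o name | grep -- "$device" \
    || echo "no leftover Deployment or ConfigMap found"
}
```

Anything the second command lists should be reviewed before deletion, since the name match is only a heuristic.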
kubectl rollout restart did not pick up a new password
The controller only rotates the pod when the ConfigMap changes (via the cisco.vk/config-hash annotation on the pod template). A password change on the referenced Secret alone does not trigger a rollout. Force one manually:
Where to look next
- Architecture — internal state machines and data flow
- Configuration — every field and its defaults
- Observability — metrics and OTEL details
- GitHub issues: if your problem isn't listed here, file an issue with:
  - CiscoDevice spec (redact credentials)
  - kubectl describe ciscodevice output
  - VK pod logs (--tail=200)
  - show app-hosting detail appid <id> from the device