This page details how to troubleshoot Appian on Kubernetes for self-managed customers. It provides tips for troubleshooting both the Appian site and the Appian operator, as well as specifics on site startup and shutdown.
As documented in the install guide, the status of a newly created Appian custom resource should transition from not set to Creating to Starting within seconds, and then from Starting to Ready within 20 to 30 minutes. To check the status of your custom resource, run kubectl get appians:
$ kubectl -n <NAMESPACE> get appians
NAME URL STATUS AGE
appian-k8s-appn appian.example.com Starting 42m
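Rather than polling, you can also watch the status transitions as they happen. The -w (watch) flag works with Appian custom resources like any other resource type:
kubectl -n <NAMESPACE> get appians -w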
If the status is either not set or Creating, go to Site status stuck in not set or Creating. If the status is Starting, go to Site status stuck in Starting.
If the status of your custom resource never reaches Starting, the Appian operator is unable to create your custom resource's corresponding secondary resources, such as ConfigMaps, StatefulSets, or Deployments.
If you set webhooks.enabled to false when installing the Appian operator Helm chart, the operator is likely failing to reconcile your custom resource due to a validation error that the admission webhooks would otherwise have rejected up front. If webhooks are enabled, some other type of reconciliation error is the likely culprit.
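If you're not sure how the chart was installed, you can inspect the release's computed values. This assumes the Helm release is named appian-operator and was installed into the appian-operator namespace; adjust both to match your installation:
helm -n appian-operator get values appian-operator --all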
Reconciliation errors are recorded as events on Appian custom resources.
To check your custom resource for reconciliation errors, run:
kubectl -n <NAMESPACE> describe appian <APPIAN>
For example:
$ kubectl -n <NAMESPACE> describe appian appian-k8s
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReconcileError 32s (x15 over 2m) appian-controller Appian.crd.k8s.appian.com "appian-k8s" is invalid: spec.webapp.haExistingClaim: Required value: required when spec.webapp.replicas is greater than 1
Reconciliation errors are represented by ReconcileError events of type Warning. If you see such an event, take the appropriate steps to resolve the underlying error.
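You can also list these events directly instead of scanning the full describe output. Kubernetes Events support field selectors on reason and type, so the following lists only reconcile warnings, newest last:
kubectl -n <NAMESPACE> get events --field-selector reason=ReconcileError,type=Warning --sort-by=.lastTimestamp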
If you don't see a reconciliation error, it's likely that the operator itself isn't running properly.
To check the operator, run:
kubectl -n appian-operator get deployments,replicasets,pods
If everything is working properly, you should see something similar to the following:
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/appian-operator-controllers 1/1 1 1 2m45s
deployment.apps/appian-operator-webhooks 1/1 1 1 2m45s
NAME DESIRED CURRENT READY AGE
replicaset.apps/appian-operator-controllers-b9f7cc6fc 1 1 1 2m45s
replicaset.apps/appian-operator-webhooks-6f47f9d888 1 1 1 2m45s
NAME READY STATUS RESTARTS AGE
pod/appian-operator-controllers-b9f7cc6fc-d6p54 1/1 Running 0 2m45s
pod/appian-operator-webhooks-6f47f9d888-klzzg 1/1 Running 0 2m45s
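Even if both Pods are Running, the controller logs can still reveal problems such as repeated reconcile retries. kubectl can address the Deployment directly, so you don't need the generated Pod name:
kubectl -n appian-operator logs deployment/appian-operator-controllers --tail=100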
If a Pod's status is CrashLoopBackOff, check its logs by running:
kubectl -n appian-operator logs <POD> --previous
If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:
kubectl -n appian-operator describe pod <POD>
If a Pod doesn't exist but its ReplicaSet does, check its ReplicaSet's events by running:
kubectl -n appian-operator describe replicaset <REPLICA_SET>
If a Pod and its ReplicaSet don't exist, check its Deployment's events by running:
kubectl -n appian-operator describe deployment <DEPLOYMENT>
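To run through these checks quickly, a small loop can print each operator Pod's phase in one pass. This is a minimal sketch, assuming bash and a configured kubectl:
for pod in $(kubectl -n appian-operator get pods -o name); do
  # .status.phase is the coarse Pod phase; the STATUS column shown by
  # `kubectl get pods` can be more specific (e.g. CrashLoopBackOff).
  phase=$(kubectl -n appian-operator get "$pod" -o jsonpath='{.status.phase}')
  echo "$pod: $phase"
done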
If the status of your custom resource reaches Starting but never reaches Ready, the Appian operator has created your custom resource's corresponding secondary resources, but one or more components do not have a sufficient number of ready Pods.
To troubleshoot the site, run:
kubectl -n <NAMESPACE> get statefulsets,deployments,replicasets,pods
If everything is working properly, you should see something similar to the following:
NAME READY AGE
statefulset.apps/appian-k8s-data-server-0 1/1 25m
statefulset.apps/appian-k8s-kafka-0 1/1 25m
statefulset.apps/appian-k8s-search-server-0 1/1 25m
statefulset.apps/appian-k8s-service-manager-0 1/1 25m
statefulset.apps/appian-k8s-webapp-0 1/1 25m
statefulset.apps/appian-k8s-zookeeper-0 1/1 25m
NAME READY STATUS RESTARTS AGE
pod/appian-k8s-data-server-0-0 1/1 Running 0 25m
pod/appian-k8s-kafka-0-0 1/1 Running 0 25m
pod/appian-k8s-search-server-0-0 1/1 Running 0 25m
pod/appian-k8s-service-manager-0-0 1/1 Running 0 25m
pod/appian-k8s-webapp-0-0 1/1 Running 0 25m
pod/appian-k8s-zookeeper-0-0 1/1 Running 0 25m
For Search Server, Zookeeper, Kafka, Data Server, Service Manager, and Webapp, you should see a single StatefulSet and Pod per replica. If you specified multiple Webapp replicas, only the StatefulSet and Pod for the first will be created initially. The rest will be created once the first becomes ready.
If you enabled Apache Web Server (httpd), you should see a single Deployment and ReplicaSet, but one or more Pods depending on how many replicas you specified.
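Because the Webapp StatefulSets are created one at a time, it can be convenient to watch Pods appear and become ready instead of re-running the command above:
kubectl -n <NAMESPACE> get pods -w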
If a Pod's status is CrashLoopBackOff, check its logs by running:
kubectl -n <NAMESPACE> logs <POD> --previous
If a Pod's status is Running but its READY column displays 0/1, run:
kubectl -n <NAMESPACE> logs <POD>
If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:
kubectl -n <NAMESPACE> describe pod <POD>
For Apache Web Server (httpd), if a Pod doesn't exist but its ReplicaSet does, check the ReplicaSet's events by running kubectl -n <NAMESPACE> describe replicaset <REPLICA_SET>. If a Pod and its ReplicaSet don't exist, check the Deployment's events by running kubectl -n <NAMESPACE> describe deployment <DEPLOYMENT>.
For Zookeeper, Kafka, Search Server, Data Server, Service Manager, and Webapp, if a Pod doesn't exist, check its StatefulSet's events by running kubectl -n <NAMESPACE> describe statefulset <STATEFUL_SET>.
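If it isn't clear which object to describe first, a namespace-wide event listing sorted by time will usually point at the failing resource:
kubectl -n <NAMESPACE> get events --sort-by=.lastTimestamp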
Appian components depend on one another. If two components are having issues and one (the downstream component) depends on the other (the upstream component), the downstream component's issues are likely caused by the upstream component's. When troubleshooting multiple components, always troubleshoot upstream components first, since their issues cascade to downstream components.
The following table lists the downstream components for each upstream component:
Upstream Component | Downstream Components
--- | ---
Zookeeper | Kafka, Data Server, Service Manager, Webapp
Kafka | Data Server, Service Manager, Webapp
Search Server | Webapp
Data Server | Webapp
Service Manager | Webapp
Webapp | Apache Web Server (httpd)
Apache Web Server (httpd) | N/A
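In practice, this means inspecting components in the order the table gives. The sketch below lists each component's Pods, upstream components first; it assumes bash and that your Pod names contain the component names, as in the appian-k8s example above:
# List each component's Pods, checking upstream components first.
for component in zookeeper kafka search-server data-server service-manager webapp; do
  echo "== $component =="
  kubectl -n <NAMESPACE> get pods | grep "$component"
done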
When an Appian custom resource is deleted, the Appian operator gracefully shuts down the site by shutting down its stateful components one by one.
Stateful components are shut down in dependency order, with each downstream component stopping before the upstream components it depends on (see the dependency table above).
Each stateful component aside from Service Manager should shut down within 30 seconds. Service Manager may take several minutes to shut down based on site usage.
Sites with multiple replicas of Service Manager may get stuck during shutdown due to a known limitation which is currently being worked on.
In this situation, Service Manager Pods may be forcefully deleted, but only after checking that each Pod's engines have checkpointed to prevent data loss.
To check whether a Pod's engines have checkpointed, run the following for each Service Manager Pod:
kubectl -n <NAMESPACE> logs <SERVICE_MANAGER_POD>
If the engines have checkpointed, you should see something similar to the following in the logs:
All 15 requested engines are down [analytics00, analytics01, analytics02, channels, content, download-stats, execution00, execution01, execution02, forums, groups, notifications, notifications-email, portal, process-design]
...
2020-08-26 01:49:56,742 [MainComponentService STOPPING] INFO com.appian.komodo.MainComponentService - Komodo shutdown.
If each Service Manager Pod's logs contain the lines above, the Pods can be forcefully deleted. To delete a Pod, run:
kubectl -n <NAMESPACE> delete pod --force --grace-period=0 <SERVICE_MANAGER_POD>
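To check every Service Manager Pod in one pass before force-deleting any of them, you can grep each Pod's logs for the shutdown marker shown above. A minimal sketch, assuming bash and that Pod names contain service-manager:
for pod in $(kubectl -n <NAMESPACE> get pods -o name | grep service-manager); do
  # The "Komodo shutdown." line indicates the Pod's engines checkpointed.
  if kubectl -n <NAMESPACE> logs "$pod" | grep -q "Komodo shutdown"; then
    echo "$pod: engines checkpointed; safe to force delete"
  else
    echo "$pod: engines NOT checkpointed; do not force delete"
  fi
done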
If the status of your custom resource changes to Unready
after reaching Ready
, one or more components do not have a sufficient number of ready Pods. To troubleshoot, follow the instructions described in Site status stuck in Starting.