Troubleshooting Appian on Kubernetes

Overview

This page details how to troubleshoot Appian on Kubernetes for self-managed customers. It includes tips for troubleshooting both the Appian site and the Appian operator, as well as specifics for site startup and shutdown.

Troubleshooting site startup

As documented in the install guide, the status of a newly created Appian custom resource should transition from not set to Creating to Starting within seconds and then from Starting to Ready within 20 to 30 minutes. To check the status of your custom resource, run kubectl get appians:

$ kubectl -n <NAMESPACE> get appians
NAME              URL                  STATUS     AGE
appian-k8s-appn   appian.example.com   Starting   42m

If the status is either not set or Creating, go to Site status stuck in not set or Creating. If the status is Starting, go to Site status stuck in Starting.

Site status stuck in not set or Creating

If the status of your custom resource never reaches Starting, the Appian operator is unable to create your custom resource's corresponding secondary resources, such as ConfigMaps, StatefulSets, or Deployments.

Step 1: Check for reconciliation errors

If you set webhooks.enabled to false when installing the Appian operator Helm chart, validation is deferred until reconciliation, so the operator is likely failing to reconcile your custom resource due to a validation error. Otherwise, some other type of reconciliation error is the likely culprit.

Reconciliation errors are recorded as events on Appian custom resources.

To check your custom resource for reconciliation errors, run:

kubectl -n <NAMESPACE> describe appian <APPIAN>

For example:

$ kubectl -n <NAMESPACE> describe appian appian-k8s
...
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Warning  ReconcileError  32s (x15 over 2m)  appian-controller  Appian.crd.k8s.appian.com "appian-k8s" is invalid: spec.webapp.haExistingClaim: Required value: required when spec.webapp.replicas is greater than 1

Reconciliation errors are represented by ReconcileError events of type Warning. If you see such an event, resolve the underlying issue it describes.
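
For the example error above, the fix is to supply the field the validation names: spec.webapp.haExistingClaim is required whenever spec.webapp.replicas is greater than 1. A sketch of the relevant custom resource fragment follows; the claim name is a placeholder, and the exact field semantics are covered in the install guide.

```yaml
spec:
  webapp:
    replicas: 2
    # Required when replicas > 1, per the validation error above.
    # <EXISTING_PVC_NAME> is a placeholder for your HA storage claim.
    haExistingClaim: <EXISTING_PVC_NAME>
```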

Step 2: Check the operator

If you don't see a reconciliation error, it's likely that the operator itself isn't running properly.

To check the operator, run:

kubectl -n appian-operator get deployments,replicasets,pods

If everything is working properly, you should see something similar to the following:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/appian-operator-controllers   1/1     1            1           2m45s
deployment.apps/appian-operator-webhooks      1/1     1            1           2m45s

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/appian-operator-controllers-b9f7cc6fc   1         1         1       2m45s
replicaset.apps/appian-operator-webhooks-6f47f9d888     1         1         1       2m45s

NAME                                              READY   STATUS    RESTARTS   AGE
pod/appian-operator-controllers-b9f7cc6fc-d6p54   1/1     Running   0          2m45s
pod/appian-operator-webhooks-6f47f9d888-klzzg     1/1     Running   0          2m45s

Step 3: Check for operator Pods with bad status

If a Pod's status is CrashLoopBackOff, check its logs by running:

kubectl -n appian-operator logs <POD> --previous

If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:

kubectl -n appian-operator describe pod <POD>
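
These rules can be folded into a small helper for repeated use. The following sketch only encodes the decision logic from this step; <POD> is a placeholder and no kubectl call is made.

```shell
# Sketch: map an operator Pod's STATUS to the inspection command suggested above.
# <POD> is a placeholder; this encodes the decision rules only.
debug_cmd() {
  case "$1" in
    CrashLoopBackOff) echo "kubectl -n appian-operator logs <POD> --previous" ;;
    Running)          echo "Pod is Running; likely healthy" ;;
    *)                echo "kubectl -n appian-operator describe pod <POD>" ;;
  esac
}

debug_cmd CrashLoopBackOff
debug_cmd ImagePullBackOff
```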

Step 4: Check for operator Pods that don't exist

If a Pod doesn't exist but its ReplicaSet does, check its ReplicaSet's events by running:

kubectl -n appian-operator describe replicaset <REPLICA_SET>

If a Pod and its ReplicaSet don't exist, check its Deployment's events by running:

kubectl -n appian-operator describe deployment <DEPLOYMENT>

Site status stuck in Starting

If the status of your custom resource reaches Starting but never reaches Ready, the Appian operator has created your custom resource's corresponding secondary resources, but one or more components do not have a sufficient number of ready Pods.

Step 1: Inspect the resources

To troubleshoot the site, run:

kubectl -n <NAMESPACE> get statefulsets,deployments,replicasets,pods

If everything is working properly, you should see something similar to the following:

NAME                                            READY   AGE
statefulset.apps/appian-k8s-data-server-0       1/1     25m
statefulset.apps/appian-k8s-kafka-0             1/1     25m
statefulset.apps/appian-k8s-search-server-0     1/1     25m
statefulset.apps/appian-k8s-service-manager-0   1/1     25m
statefulset.apps/appian-k8s-webapp-0            1/1     25m
statefulset.apps/appian-k8s-zookeeper-0         1/1     25m

NAME                                   READY   STATUS     RESTARTS   AGE
pod/appian-k8s-data-server-0-0         1/1     Running    0          25m
pod/appian-k8s-kafka-0-0               1/1     Running    0          25m
pod/appian-k8s-search-server-0-0       1/1     Running    0          25m
pod/appian-k8s-service-manager-0-0     1/1     Running    0          25m
pod/appian-k8s-webapp-0-0              1/1     Running    0          25m
pod/appian-k8s-zookeeper-0-0           1/1     Running    0          25m

For Search Server, Zookeeper, Kafka, Data Server, Service Manager, and Webapp, you should see a single StatefulSet and Pod per replica. If you specified multiple Webapp replicas, only the StatefulSet and Pod for the first will be created initially. The rest will be created once the first becomes ready.

If you enabled Apache Web Server (httpd), you should see a single Deployment and ReplicaSet, but one or more Pods depending on how many replicas you specified.

Step 2: Check for Pods with bad status

If a Pod's status is CrashLoopBackOff, check its logs by running:

kubectl -n <NAMESPACE> logs <POD> --previous

If a Pod's status is Running but its READY column displays 0/1, run:

kubectl -n <NAMESPACE> logs <POD>

If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:

kubectl -n <NAMESPACE> describe pod <POD>

Step 3: Check for Pods that don't exist

For Apache Web Server (httpd), if a Pod doesn't exist but its ReplicaSet does, check the ReplicaSet's events by running:

kubectl -n <NAMESPACE> describe replicaset <REPLICA_SET>

If a Pod and its ReplicaSet don't exist, check the Deployment's events by running:

kubectl -n <NAMESPACE> describe deployment <DEPLOYMENT>

For Zookeeper, Kafka, Search Server, Data Server, Service Manager, and Webapp, if a Pod doesn't exist, check its StatefulSet's events by running:

kubectl -n <NAMESPACE> describe statefulset <STATEFUL_SET>

Troubleshooting multiple components

Appian components depend on one another. If two components are having issues and one (the downstream component) depends on the other (the upstream component), the downstream component's issues are likely caused by the upstream component's.

When troubleshooting multiple components, always troubleshoot upstream components first, as they will impact downstream components.

The following table depicts downstream components for each upstream component:

Upstream Component          Downstream Components
Zookeeper                   Kafka, Data Server, Service Manager, Webapp
Kafka                       Data Server, Service Manager, Webapp
Search Server               Webapp
Data Server                 Webapp
Service Manager             Webapp
Webapp                      Apache Web Server (httpd)
Apache Web Server (httpd)   N/A
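
Read upstream-first, the table yields a fixed troubleshooting order. A minimal sketch follows; the names are descriptive labels from this page, not Kubernetes resource names.

```shell
# Sketch: walk components upstream-first, per the dependency table above.
# Names are descriptive labels, not actual resource names.
UPSTREAM_FIRST="Zookeeper Kafka Search-Server Data-Server Service-Manager Webapp httpd"
for component in $UPSTREAM_FIRST; do
  echo "Troubleshoot next: $component"
done
```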

Troubleshooting site shutdown

When an Appian custom resource is deleted, the Appian operator gracefully shuts down the site by shutting down its stateful components one by one.

Stateful components are shut down in the following order:

  1. Webapp
  2. Search Server
  3. Data Server
  4. Service Manager
  5. Kafka
  6. Zookeeper

Each stateful component aside from Service Manager should shut down within 30 seconds. Service Manager may take several minutes to shut down based on site usage.

Service Manager shutdown issues

Sites with multiple replicas of Service Manager may get stuck during shutdown due to a known limitation that is currently being addressed.

In this situation, Service Manager Pods may be forcefully deleted, but only after verifying that each Pod's engines have checkpointed, to prevent data loss.

To check whether a Pod's engines have checkpointed, run the following for each Service Manager Pod:

kubectl -n <NAMESPACE> logs <SERVICE_MANAGER_POD>

If the engines have checkpointed, you should see something similar to the following in the logs:

All 15 requested engines are down [analytics00, analytics01, analytics02, channels, content, download-stats, execution00, execution01, execution02, forums, groups, notifications, notifications-email, portal, process-design]
...
2020-08-26 01:49:56,742 [MainComponentService STOPPING] INFO  com.appian.komodo.MainComponentService - Komodo shutdown.

If every Service Manager Pod's logs contain the lines above, the Pods can be forcefully deleted.
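
This log check can be scripted. The sketch below uses sample log text; in practice, capture LOG with kubectl logs as shown above. The grep patterns are an assumption based on the example output, so adjust them if your Appian version logs differently.

```shell
# Sketch: decide whether a Service Manager Pod looks safe to force delete by
# searching its logs for the checkpoint/shutdown markers. LOG holds sample
# text here; in practice use:
#   LOG="$(kubectl -n <NAMESPACE> logs <SERVICE_MANAGER_POD>)"
LOG='All 15 requested engines are down [analytics00, analytics01, analytics02, channels, content, download-stats, execution00, execution01, execution02, forums, groups, notifications, notifications-email, portal, process-design]
2020-08-26 01:49:56,742 [MainComponentService STOPPING] INFO  com.appian.komodo.MainComponentService - Komodo shutdown.'

if echo "$LOG" | grep -q 'requested engines are down' \
   && echo "$LOG" | grep -q 'Komodo shutdown'; then
  SAFE_TO_DELETE=yes
else
  SAFE_TO_DELETE=no
fi
echo "$SAFE_TO_DELETE"
```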

To delete the Pods, run:

kubectl -n <NAMESPACE> delete pod --force --grace-period=0 <SERVICE_MANAGER_POD>

Troubleshooting Unready sites

If the status of your custom resource changes to Unready after reaching Ready, one or more components do not have a sufficient number of ready Pods. To troubleshoot, follow the instructions described in Site status stuck in Starting.
