Troubleshooting Appian on Kubernetes [Appian on Kubernetes v0.154.1]

Page not found in the selected version; redirected to the home page.

Troubleshooting Appian on Kubernetes

Overview

This page details how to troubleshoot Appian on Kubernetes for self-managed customers. It has tips for troubleshooting both the Appian site and Appian operator and specifics for site startup and shutdown.

Troubleshooting site startup

As documented in the install guide, the status of a newly created Appian custom resource should transition from not set to Creating to Starting within seconds and then from Starting to Ready within 20 to 30 minutes. To check the status of your custom resource, run kubectl get appians:

$ kubectl -n <NAMESPACE> get appians
NAME     URL                  STATUS   AGE
appian   appian.example.com   Ready    25m

If the status is either not set or Creating, go to Site status stuck in not set or Creating. If the status is Starting, go to Site status stuck in starting.

Site status stuck in not set or Creating

If the status of your custom resource never reaches Starting, the Appian operator is unable to create your custom resource's corresponding secondary resources, such as ConfigMaps, StatefulSets, or Deployments.

Step 1: Check for reconciliation errors

If you set webhooks.enabled to false when installing the Appian operator Helm chart, it's likely that the operator is failing to reconcile your custom resource due to a validation error. If not, it's still likely that some other type of reconciliation error is the culprit.

Reconciliation errors are recorded as events on Appian custom resources.

To check your custom resource for reconciliation errors, run:

kubectl -n <NAMESPACE> describe appian <APPIAN>

For example:

$ kubectl -n <NAMESPACE> describe appian appian
...
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Warning  ReconcileError  32s (x15 over 2m)  appian-controller  Appian.crd.k8s.appian.com "appian" is invalid: spec.webapp.haExistingClaim: Required value: required when .spec.webapp.replicas is greater than 1

Reconciliation errors are represented by ReconcileError events of type Warning. If you see such an event, take the appropriate steps to resolve it as necessary.

Step 2: Check the operator

If you don't see a reconciliation error, it's likely that the operator itself isn't running properly.

To check the operator, run:

kubectl -n appian-operator get deployments,replicasets,pods

If everything is working properly, you should see something similar to the following:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/appian-operator-controllers   1/1     1            1           55m
deployment.apps/appian-operator-webhooks      1/1     1            1           55m

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/appian-operator-controllers-b9f7cc6fc   1         1         1       55m
replicaset.apps/appian-operator-webhooks-6f47f9d888     1         1         1       55m

NAME                                              READY   STATUS    RESTARTS   AGE
pod/appian-operator-controllers-b9f7cc6fc-d6p54   1/1     Running   0          55m
pod/appian-operator-webhooks-6f47f9d888-klzzg     1/1     Running   0          55m

Step 3: Check for operator Pods with bad status

If a Pod's status is CrashLoopBackOff, check its logs by running:

kubectl -n appian-operator logs <POD> --previous

If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:

kubectl -n appian-operator describe pod <POD>

Step 4: Check for operator Pods that don't exist

If a Pod doesn't exist but its ReplicaSet does, check its ReplicaSet's events by running:

kubectl -n appian-operator describe replicaset <REPLICA_SET>

If a Pod and its ReplicaSet don't exist, check its Deployment's events by running:

kubectl -n appian-operator describe deployment <DEPLOYMENT>

Site status stuck in Starting

If the status of your custom resource reaches Starting but never reaches Ready, the Appian operator has created your custom resource's corresponding secondary resources, but one or more components do not have a sufficient number of ready Pods.

Step 1: Inspect the resources

To troubleshoot the site, run:

kubectl -n <NAMESPACE> get statefulsets,deployments,replicasets,pods

If everything is working properly, you should see something similar to the following:

NAME                                                          READY   AGE
statefulset.apps/appian-data-server                           1/1     25m
statefulset.apps/appian-kafka                                 1/1     25m
statefulset.apps/appian-search-server                         1/1     25m
statefulset.apps/appian-service-manager-analytics00           1/1     25m
statefulset.apps/appian-service-manager-analytics01           1/1     25m
statefulset.apps/appian-service-manager-analytics02           1/1     25m
statefulset.apps/appian-service-manager-channels              1/1     25m
statefulset.apps/appian-service-manager-content               1/1     25m
statefulset.apps/appian-service-manager-download-stats        1/1     25m
statefulset.apps/appian-service-manager-execution00           1/1     25m
statefulset.apps/appian-service-manager-execution01           1/1     25m
statefulset.apps/appian-service-manager-execution02           1/1     25m
statefulset.apps/appian-service-manager-forums                1/1     25m
statefulset.apps/appian-service-manager-groups                1/1     25m
statefulset.apps/appian-service-manager-notifications         1/1     25m
statefulset.apps/appian-service-manager-notifications-email   1/1     25m
statefulset.apps/appian-service-manager-portal                1/1     25m
statefulset.apps/appian-service-manager-process-design        1/1     25m
statefulset.apps/appian-webapp                                1/1     25m
statefulset.apps/appian-zookeeper                             1/1     25m

NAME                                               READY   STATUS    RESTARTS   AGE
pod/appian-data-server-0                           1/1     Running   0          25m
pod/appian-kafka-0                                 1/1     Running   0          25m
pod/appian-search-server-0                         1/1     Running   0          25m
pod/appian-service-manager-analytics00-0           1/1     Running   0          25m
pod/appian-service-manager-analytics01-0           1/1     Running   0          25m
pod/appian-service-manager-analytics02-0           1/1     Running   0          25m
pod/appian-service-manager-channels-0              1/1     Running   0          25m
pod/appian-service-manager-content-0               1/1     Running   0          25m
pod/appian-service-manager-download-stats-0        1/1     Running   0          25m
pod/appian-service-manager-execution00-0           1/1     Running   0          25m
pod/appian-service-manager-execution01-0           1/1     Running   0          25m
pod/appian-service-manager-execution02-0           1/1     Running   0          25m
pod/appian-service-manager-forums-0                1/1     Running   0          25m
pod/appian-service-manager-groups-0                1/1     Running   0          25m
pod/appian-service-manager-notifications-0         1/1     Running   0          25m
pod/appian-service-manager-notifications-email-0   1/1     Running   0          25m
pod/appian-service-manager-portal-0                1/1     Running   0          25m
pod/appian-service-manager-process-design-0        1/1     Running   0          25m
pod/appian-webapp-0                                1/1     Running   0          25m
pod/appian-zookeeper-0                             1/1     Running   0          25m

For Search Server, Zookeeper, Kafka, Data Server, each engine, and Webapp, you should see a single StatefulSet. For each, you should also see a single Pod per replica. If you specified multiple Service Manager / engine or Webapp replicas, only the Pod for the first will be created initially. The rest will be created sequentially as they become ready.

If you enabled Apache Web Server (httpd), you should see a single Deployment and ReplicaSet, but one or more Pods depending on how many replicas you specified.

Step 2: Check for Pods with bad status

If a Pod's status is CrashLoopBackOff, check its logs by running:

kubectl -n <NAMESPACE> logs <POD> --previous

If a Pod's status is Running but its READY column displays 0/1, run:

kubectl -n <NAMESPACE> logs <POD>

If a Pod's status isn't CrashLoopBackOff or Running, check its events by running:

kubectl -n <NAMESPACE> describe pod <POD>

Step 3: Check for Pods that don't exist

For Apache Web Server (httpd), if a Pod doesn't exist but its ReplicaSet does, check its ReplicaSet's events by running: kubectl -n <NAMESPACE> describe replicaset <REPLICA_SET>. If a Pod and its ReplicaSet don't exist, check its Deployment's events by running:

kubectl -n <NAMESPACE> describe deployment <DEPLOYMENT>

For Zookeeper, Kafka, Search Server, Data Server, Service Manager, and Webapp, if a Pod doesn't exist, check its StatefulSet's events by running:

kubectl -n <NAMESPACE> describe statefulset <STATEFUL_SET>

Troubleshooting multiple components

Appian components have dependencies on one another. If two components are having issues and one component (the downstream component) depends on the other (the upstream component), it's likely that the issues with the downstream component are due to the issues with the upstream component.

When troubleshooting multiple components, always troubleshoot upstream components first, as they will impact downstream components.

The following table depicts downstream components for each upstream component:

Upstream Component	Downstream Components
Zookeeper	Kafka, Data Server, Service Manager, Webapp
Kafka	Data Server, Service Manager, Webapp
Search Server	Webapp
Data Server	Webapp
Service Manager	Webapp
Webapp	Apache Web Server (httpd)
Apache Web Server (httpd)	N/A

Troubleshooting site shutdown

When an Appian custom resource is deleted, the Appian operator gracefully shuts down the site by shutting down its stateful components one by one.

Stateful components are shut down in the following order:

Webapp
Search Server
Data Server
Service Manager
Kafka
Zookeeper

Each stateful component aside from Service Manager should shut down within 30 seconds. Service Manager may take several minutes to shut down based on site usage. If the Service Manager does not shut down, follow the instructions in Removing the Service Manager finalizer.

Removing the Service Manager finalizer

Service Manager uses a job based shutdown approach to ensure consistent checkpoints. These jobs block Appian shutdown by way of a finalizer named crd.k8s.appian.com/checkpoint-engines. Each job initiates a graceful shutdown of the corresponding Service Manager, which includes a checkpoint. Once all the jobs have completed the finalizer is removed and Appian shutdown proceeds.

Each job will attempt graceful shutdown six times before giving up. The status of the jobs can be checked by running the following command.

   kubectl -n <NAMESPACE> get jobs

If the jobs fail to shutdown Service Manager after six attempts then the finalizer must be manually removed.

Caution: If Service Manager fails to shutdown gracefully and the finalizer has to be manually removed then it is unsafe to upgrade Appian. Appian must be restarted on the same version to ensure data integrity. After restarting on the same version perform a graceful shutdown before upgrading.

The finalizer can be manually removed by running the following commands:

Edit the Appian resource.

kubectl -n <NAMESPACE> edit appians <APPIAN NAME>

Locate the section of the custom resource containing the finalizers (this line is not defined in the spec but added by the operator).

  finalizers:
    - crd.k8s.appian.com/checkpoint-engines

Delete the line containing the crd.k8s.appian.com/checkpoint-engines finalizer
Save your changes.

After removing the finalizer the Service Manager pods should terminate. Shut down may take 5-30 minutes depending on their size. If they do not you will have to shut them down forcefully.

Get the list of Service Manager pods.
```
1
kubectl get pods
```

Exec into one of the Service Manager pods.

kubectl exec -it <SERVICE MANAGER POD NAME> -- bash

Get the list of Java and k processes running in the pod.

ps -ef

Example:

UID        PID  PPID  C STIME TTY          TIME CMD
appian     144     1  6 12:11 ?        00:22:10 /usr/local/appian/ae/java/bin/java ...
appian     617     1  6 12:11 ?        00:22:10 /usr/local/appian/ae/server/_bin/k/linux64/k ...

Kill any Java or k processes.

Note: killing these processes should cause the pod to terminate and your exec shell session to be terminated.
```
1
kill -9 <PROCESS ID>
```
Repeat for each remaining service manager pod.

This will delete all remaining Service Manager pods and allow Appian shutdown to continue.

Troubleshooting Unready sites

If the status of your custom resource changes to Unready after reaching Ready, one or more components do not have a sufficient number of ready Pods. To troubleshoot, follow the instructions described in Site status stuck in Starting.

Generating a support bundle

After trying the above steps, if you are still having issues, you may try generating a support bundle. The support bundle generation process will check common scenarios for issues.

Open in Github Built: Mon, May 13, 2024 (05:09:07 PM)

Troubleshooting Appian on Kubernetes