Service Status

Cluster 1 Server Issues, 22 November 2024: Actions

As some of you will know, we have had ongoing issues with customer sites located on Cluster 1 (one of five Clusters) in our Managed Cloud Services hosting environment:

  • Cluster 1 consists of 3x load-balanced "virtual machine" (VM) application servers;
  • on Thursday 21 November, at a few minutes past 09:00, the CPU (instruction processing chip) on all three VMs spiked to 100%;
  • when a VM hits 100% CPU, the load balancer stops sending traffic to it;
  • as traffic stopped to each VM, its CPU usage dropped to almost nothing, but the load on the other two VMs increased;
  • this is what leads to the massive up/down spikes that you can see in the image attached to this post (the sketch below illustrates the knock-on effect with example numbers).
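
For anyone who wants a feel for the mechanics, here is a simplified, purely illustrative sketch of that knock-on effect. The capacity and traffic figures are invented for the example; they are not Cluster 1 metrics, and this is not our monitoring code.

    # Toy model of the behaviour described above; all numbers are invented.
    TOTAL_REQUESTS_PER_MIN = 12_000   # hypothetical traffic arriving at the Cluster
    VM_CAPACITY_PER_MIN = 5_000       # hypothetical requests/min one VM can serve at 100% CPU

    def cpu_load(active_vms: int) -> float:
        """Approximate per-VM CPU load (1.0 = 100%) when traffic is split evenly."""
        per_vm_share = TOTAL_REQUESTS_PER_MIN / active_vms
        return per_vm_share / VM_CAPACITY_PER_MIN

    print(cpu_load(3))  # 0.8: with all three VMs in the pool, each runs at around 80%
    print(cpu_load(2))  # 1.2: if one VM saturates and is removed, the remaining two
                        # are pushed past 100%, saturate, and are removed in turn,
                        # producing the up/down pattern seen in the attached image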

This particular problem has not happened before, either on Cluster 1 or on any of the other Clusters.

Root Cause

Identifying the Root Cause has been challenging:

  • VerseOne's team has not made any changes to Cluster 1 — either to the VM configurations or to the site configurations — immediately prior to Thursday;
  • our analysis shows that there has not been any significant increase in total traffic to Cluster 1 (bearing in mind that we are logging approximately 1,000,000 requests per hour);
  • the Lucee Java application (which ultimately serves the sites) is something of a "black box", making analysis of CPU usage difficult.

Actions

Our priority is to stabilise the environment, keeping our customers' sites up and running and so giving us the time to understand where the Root Cause lies. As such, we are today taking the following actions:

  • we are doubling the size of Cluster 1 — increasing to 6x VMs;
  • using our Web Application Firewalls (WAF), we will "soft" split Cluster 1 into 3x sub-Clusters of 2x VMs each;
  • all of the sites currently pointed at Cluster 1 will be redistributed, each to one of the 3x sub-Clusters (the sketch after this list illustrates the routing).
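
For illustration only, the "soft" split can be thought of as a routing table sitting in front of the Cluster. The VM and site names below are hypothetical, and the real rules live in the WAF configuration rather than in any application code; this is just a sketch of the idea.

    # Hypothetical sketch of the sub-Cluster routing; not the actual WAF rules.
    SUB_CLUSTERS = {
        "sub-cluster-a": ["vm-1", "vm-2"],
        "sub-cluster-b": ["vm-3", "vm-4"],
        "sub-cluster-c": ["vm-5", "vm-6"],
    }

    # Each site is pinned to exactly one sub-Cluster, so a recurrence of the CPU
    # issue is contained to two VMs and the sites routed to them.
    SITE_TO_SUB_CLUSTER = {
        "customer-one.example.org": "sub-cluster-a",
        "customer-two.example.org": "sub-cluster-b",
        # ...remaining Cluster 1 sites spread across the three sub-Clusters
    }

    def backends_for(host: str) -> list[str]:
        """Return the pool of VMs that should receive traffic for a given site."""
        return SUB_CLUSTERS[SITE_TO_SUB_CLUSTER[host]]

    print(backends_for("customer-one.example.org"))  # ['vm-1', 'vm-2']

Re-pointing a site at a different sub-Cluster then amounts to changing a single routing entry, which is what makes the fast mitigation described below possible.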

This approach gives us four key benefits:

  • we have more confidence that the sites will stay up;
  • if / when the issue recurs, fewer VMs will be affected, and so fewer customers;
  • if / when the issue recurs, we can very quickly redirect traffic for individual sites to different sub-Clusters, providing faster mitigation and keeping sites up;
  • it will enable us to better pinpoint which installations (if any) are causing the primary issue.

We will continue to monitor the environment over the weekend, and will maintain that focus into next week.

Root Cause Analysis Investigation

In the meantime, we are refocusing our attempts to understand where the problem lies:

  • instead of analysing all traffic, we are now going to analyse traffic to the Lucee application specifically: total traffic might not have changed significantly, but if the Lucee traffic has increased by, say, 20%, this would have a significant effect (the sketch after this list shows the kind of analysis involved);
  • we are using specialist analysis tools to delve into the Lucee Java application and identify any "threads" that might be looping or over-running, and so consuming all available CPU.
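
As an indication of the kind of traffic analysis involved, here is a rough sketch. The log location, the log format, and the assumption that ".cfm" requests are the ones handed to Lucee are all illustrative; this is not a description of our actual tooling.

    # Illustrative sketch: count requests per hour that were passed through to
    # Lucee, so a step change in Lucee-bound traffic stands out even when total
    # traffic looks flat. The log path and format here are hypothetical.
    import re
    from collections import Counter

    LOG_FILE = "/var/log/access.log"  # hypothetical path
    LINE_RE = re.compile(r'\[(?P<day>[^:]+):(?P<hour>\d{2})[^\]]*\] "(?:[A-Z]+) (?P<path>\S+)')

    def lucee_requests_per_hour(path: str = LOG_FILE) -> Counter:
        per_hour = Counter()
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                match = LINE_RE.search(line)
                if not match:
                    continue
                # Assume dynamic .cfm requests are the ones handed to Lucee;
                # static assets served directly by the web tier are ignored.
                request_path = match.group("path").split("?", 1)[0]
                if request_path.endswith(".cfm"):
                    per_hour[(match.group("day"), match.group("hour"))] += 1
        return per_hour

    if __name__ == "__main__":
        for (day, hour), count in sorted(lucee_requests_per_hour().items()):
            print(f"{day} {hour}:00  {count:>8} Lucee requests")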

As stated above, there is no obvious cause for this issue, and so analysis is likely to take some days. Our priority is to ensure that our customer sites continue to load, and to load as swiftly as possible.

We are sorry that this issue is affecting your service, but please be assured that the efficiency of your service is our number one priority — we will keep working at this until we have solved the current issues.

We will continue to post updates here, and also to proactively contact you where appropriate.

Cluster 1 Server Issues update, 22 November 2024

We are still experiencing issues across Cluster 1, although we have improved stability by isolating some installations.

We are preparing some new servers to add into the Cluster 1 "pool" to try to mitigate the symptoms, but we are still trying to identify the Root Cause, and to understand why it first manifested on Thursday morning.

We continue to investigate, and thank you for bearing with us.

Cluster 1 Issues, 22 November 2024

This morning we have seen a recurrence of the same CPU spikes that caused issues yesterday. Once again, it is limited to a single application server Cluster.

[Image: Cluster 1 CPU spikes]

Despite working through millions of lines of logs yesterday, we have not yet identified a Root Cause. However, we have formulated and enacted our mitigation strategies, so we should be able to restore full service much more quickly. We have also enabled additional logging, to give us more information to help track down the current issues.

Please be assured that we are treating this as the highest priority, and thank you for bearing with us.

Actions

  • We are spinning up a new server in the Cluster;
  • using this, we will attempt to isolate each site in turn, to identify and prove which sites are causing the problem;
  • we are continuing to prioritise stabilisation first, and Root Cause next.