September 2019

Fault-Tolerant, Highly-Available Privileged Access Management



We’re often asked about high availability and fault tolerance for Osirium PAM. This is understandable: when fully deployed, Privileged Access Management is the default route to all administrative interfaces. If it is down, you have a problem.

Here’s a video of our PxM Platform running on a VMware cluster built on modest systems:

PxM Platform Options


One option with PxM is “Mesh,” where multiple PxM installations send their configurations to each other, and one platform can take over the configuration of another if it becomes unavailable. While this works for multiple sites, many customers have just one data centre and would like a high-availability solution.


Our build systems deliver the PxM Platform in many formats: Azure, AWS, Hyper-V, VMware and even a kit form. The VMware version complies with all the prerequisites for vMotion and VMware clustering. Every year the VMware offerings improve, and vMotion and clustering have become cheaper.

Recently we built a test VMware cluster to see the changes for ourselves. In particular, we wanted to see how fail-over and fail-back worked. We found that VMware 6.7 is pretty much seamless. In our test cluster, we could pull the power cable on either system and the PxM Platform would continue to run, and all sessions would continue after a very short delay.

To stress the cluster, we ran active streaming SSH sessions through the PxM Platform whilst pulling the power cable on the active ESXi system in the cluster. The result was that no session was lost, no data was lost and there was a barely perceptible 0.3-second break in data transmission.
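For readers who want to run a similar test, here is a minimal sketch of how a break in a streaming session could be measured. It assumes you have recorded an arrival timestamp (in seconds) for each chunk of data received over the session. The function name `longest_gap` and the sample timestamps are purely illustrative and are not part of the PxM Platform or our test harness.

```python
from typing import Sequence


def longest_gap(arrival_times: Sequence[float]) -> float:
    """Return the longest interval, in seconds, between consecutive arrivals.

    arrival_times is assumed to be sorted in ascending order.
    """
    if len(arrival_times) < 2:
        # With fewer than two arrivals there is no interval to measure.
        return 0.0
    return max(
        later - earlier
        for earlier, later in zip(arrival_times, arrival_times[1:])
    )


# Hypothetical timestamps: data arrives every 0.1 s, with one pause
# between 0.2 s and 0.5 s standing in for the fail-over.
samples = [0.0, 0.1, 0.2, 0.5, 0.6]
print(f"Longest break: {longest_gap(samples):.1f} s")
```

Comparing the longest gap during a fail-over with the normal interval between chunks gives a simple figure for how noticeable the interruption was.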

To recover the cluster, we just returned the power cable and allowed the ESXi system to boot and rejoin the cluster – this was all that was needed. We could see on the vCenter display when the ESXi rejoined and the change of Fault Tolerance status.

In previous versions of VMware, it was necessary to ‘fail-back’ the virtual machines to the primary ESXi system. The current version has a more balanced approach where ESXi systems can be part of a cluster but also host ordinary virtual machines.

The alternative solution – database replication – is not so friendly

It’s worth comparing this approach with the ‘always on’ databases used by some other PAM tools. With those, the loss of any worker system would mean that sessions are dropped and users have to restart them on another worker. Perhaps of more concern is the behaviour of the database in a network partition scenario. Typically, always-on databases switch to read-only mode when their cluster loses quorum. This means that credential data is always available, but not updateable. This is far from ideal. For example, in this state, password cycling may not be available and history will need to be saved elsewhere.
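To illustrate the quorum behaviour described above, here is a minimal sketch that assumes a simple majority rule: a group of nodes stays writable only if it can reach more than half of the cluster. Real database products differ in the details, and the function names here are hypothetical.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Assumed rule: quorum requires a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def database_mode(reachable_nodes: int, cluster_size: int) -> str:
    """Return the mode a partitioned group of nodes would fall into."""
    if has_quorum(reachable_nodes, cluster_size):
        return "read-write"
    # Without quorum, data can still be read but not updated.
    return "read-only"


# A three-node cluster split 2/1 by a network partition:
print(database_mode(2, 3))  # the majority side stays writable
print(database_mode(1, 3))  # the minority side can only serve reads
```

Under this rule, an even split (for example, two nodes on each side of a four-node cluster) leaves both sides read-only, which is the situation where password cycling would stall everywhere.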

Global Scale

Using the PxM Mesh function, the PxM Platforms on either side of a network partition can each maintain their own history and credential lifecycle management, and they will re-mesh configurations once the partition is healed. Besides the “belt & braces” protection, users also benefit from a performance boost by working with local PxM systems rather than traversing global internet connections.

It’s an interesting thought that our larger customers could use VMware clusters and Mesh together to form collections of fault-tolerant systems for highly-available Privileged Access Management services. A win-win for availability, resilience and user productivity.

As always – if you’d like to know more, please get in touch.
