PostgreSQL High Availability Cluster - Quick Setup¶
What You're Building¶
A three-node PostgreSQL cluster with one primary node (handling writes) and two replicas (ready to take over on failure). The system includes automatic failover via repmgr and split-brain protection to prevent data corruption during network partitions.
Prerequisites¶
Three Ubuntu servers with static IP addresses and SSH access configured.
Step 1: Define Your Inventory¶
Create or edit your inventory file at ansible/inventory/offline/hosts.ini to define your PostgreSQL servers and their configuration.
The postgresql_rw group designates your primary node (accepts writes), while postgresql_ro contains replica nodes (follow the primary and can be promoted if needed). The network interface variable specifies which adapter to use for cluster communication.
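A minimal hosts.ini sketch is shown below. Only the postgresql_rw and postgresql_ro group names come from this guide; the hostnames, IP addresses, parent group, and interface variable name are illustrative placeholders, so match them to the variables your playbook expects.

```ini
# Hostnames, IPs, the parent group, and the interface variable name are
# placeholders; keep the postgresql_rw / postgresql_ro group names
[postgresql_rw]
pg-node-1 ansible_host=192.168.122.10

[postgresql_ro]
pg-node-2 ansible_host=192.168.122.11
pg-node-3 ansible_host=192.168.122.12

[postgresql:children]
postgresql_rw
postgresql_ro

[postgresql:vars]
postgresql_network_interface=enp1s0
```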
Step 2: Test Connectivity¶
Verify Ansible can reach all three servers:
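A typical connectivity check, assuming the inventory path from Step 1 and the group names defined there (adjust the host pattern if your groups differ):

```bash
ansible postgresql_rw,postgresql_ro -i ansible/inventory/offline/hosts.ini -m ping
```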
You should see three successful responses. If any node fails, check your SSH configuration and network connectivity.
Step 3: Deploy the Complete Cluster¶
Run the deployment playbook (takes 10-15 minutes):
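The playbook filename below is an assumption; substitute the PostgreSQL deployment playbook that ships with your wire-server-deploy checkout.

```bash
ansible-playbook -i ansible/inventory/offline/hosts.ini ansible/postgresql-deploy.yml
```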
This playbook installs PostgreSQL 17 and repmgr on all nodes, configures the primary node, clones and registers the replicas, deploys split-brain detection, creates the Wire database with credentials, and runs health checks.
Step 4: Verify the Cluster¶
Check cluster status from any node:
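A sketch using the repmgr CLI, assuming the Debian/Ubuntu default config location (adjust -f if your playbook places repmgr.conf elsewhere):

```bash
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
```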
You should see one primary node (marked with an asterisk) and two standby nodes, all running. Standby nodes should list the primary as their upstream node.
Verify critical services are running:
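For example, using the service names described below (the exact timer unit name is an assumption, so match it to what the playbook installed):

```bash
sudo systemctl status postgresql repmgrd
systemctl list-timers | grep -i rogue
```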
All three services should be active: postgresql (database engine), repmgrd (cluster health and failover), and detect-rogue-primary timer (checks for conflicting primaries every 30 seconds).
Step 5: Check Replication Status¶
On the primary node, verify both replicas are receiving data via streaming replication:
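One way to check this is to query pg_stat_replication on the primary:

```bash
sudo -u postgres psql -c "SELECT client_addr, application_name, state, sync_state, replay_lsn FROM pg_stat_replication;"
```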
You should see two rows (one per replica) with state "streaming", confirming continuous replication to both standby nodes.
Step 6: Wire Database Credentials¶
The playbook generates a secure password and stores it in the wire-postgresql-external-secret Kubernetes secret. Running bin/offline-deploy.sh automatically syncs this password to brig and galley service secrets in values/wire-server/secrets.yaml.
If you deploy or upgrade wire-server manually, use one of these methods:
Option 1: Run the sync script in the adminhosts container:¶
This script retrieves the password from wire-postgresql-external-secret, updates multiple YAML paths, creates a backup at secrets.yaml.bak, verifies updates, and works with any Kubernetes secret and YAML file.
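If you want to verify the value the script will use, you can read the generated password directly from the secret. The data key name ("password") is an assumption here; inspect the secret to confirm it.

```bash
# Key name is assumed; run `kubectl get secret wire-postgresql-external-secret -o yaml` to check
PG_PASSWORD=$(kubectl get secret wire-postgresql-external-secret \
  -o jsonpath='{.data.password}' | base64 -d)
echo "$PG_PASSWORD"
```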
Option 2: Manual Password Override¶
Override passwords during helm installation:
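A hedged sketch, with the password exported as PG_PASSWORD (as under Option 1). The brig/galley value paths are assumptions, so check your wire-server chart's values reference for the exact keys.

```bash
# Value paths are assumptions; adjust to match your wire-server chart version
helm upgrade --install wire-server ./charts/wire-server \
  -f values/wire-server/values.yaml \
  -f values/wire-server/secrets.yaml \
  --set-string brig.secrets.pgPassword="$PG_PASSWORD" \
  --set-string galley.secrets.pgPassword="$PG_PASSWORD"
```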
Optional: Test Automatic Failover¶
To verify automatic failover works, simulate a primary failure by stopping the PostgreSQL service on the primary node:
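For example, on the current primary only (this is a deliberate outage):

```bash
sudo systemctl stop postgresql
```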
Wait 30 seconds, then check cluster status from a replica node:
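The same repmgr status command from Step 4 applies here; the repmgr event log should also record the promotion (config path as before):

```bash
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
sudo -u postgres repmgr -f /etc/repmgr.conf cluster event
```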
One replica should now be promoted to primary. The repmgrd daemon detected the failure, formed quorum, selected the best candidate based on replication lag and priority, and promoted it. The remaining replica automatically reconfigures to follow the new primary.
What Happens During Failover¶
When the primary fails, repmgrd daemons retry connections every five seconds. After five failures (~25 seconds), the replicas reach consensus that the primary is down (requiring two-node quorum to prevent false positives). The system promotes the most up-to-date replica with the highest priority using PostgreSQL's native promotion function. The remaining replica detects the new primary and begins following it, while postgres-endpoint-manager updates Kubernetes services to point to the new primary.
Recovery Time Expectations¶
The cluster recovers within 30 seconds of a primary failure. Applications running in Kubernetes may take up to 2 minutes to reconnect due to the postgres-endpoint-manager's polling cycle, resulting in 30 seconds to 2 minutes of database unavailability during unplanned failover.
Troubleshooting¶
Common Issues During Deployment¶
PostgreSQL Service Won't Start¶
If PostgreSQL fails to start after deployment:
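A few starting points, assuming standard systemd units and the default Ubuntu/PGDG log location (adjust the cluster name in the log path if yours differs):

```bash
sudo systemctl status postgresql
sudo journalctl -u postgresql --since "15 minutes ago"
sudo tail -n 50 /var/log/postgresql/postgresql-17-main.log
```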
Replication Issues¶
If standby nodes show "disconnected" status:
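On the affected standby, check whether the WAL receiver is connected and let repmgr run its own diagnostics (config path may differ in your deployment):

```bash
sudo -u postgres psql -x -c "SELECT status, sender_host, sender_port FROM pg_stat_wal_receiver;"
sudo -u postgres repmgr -f /etc/repmgr.conf node check
```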
Post-Deployment Issues¶
Split-Brain Detection¶
If you suspect multiple primaries exist, check the cluster status on each node:
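For example, run these on each node; exactly one node should report that it is not in recovery (f):

```bash
sudo -u postgres psql -t -c "SELECT pg_is_in_recovery();"
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
```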
Failed Automatic Failover¶
If failover doesn't happen automatically:
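Check that repmgrd is actually running on the replicas and review its logs around the time of the failure (config path may differ):

```bash
sudo systemctl status repmgrd
sudo journalctl -u repmgrd --since "30 minutes ago"
sudo -u postgres repmgr -f /etc/repmgr.conf service status
```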
Replication Lag Issues¶
If standby nodes fall behind:
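Lag can be measured from both sides: per-replica lag columns on the primary, and the replay delay on a standby.

```bash
# On the primary
sudo -u postgres psql -c "SELECT application_name, state, write_lag, replay_lag FROM pg_stat_replication;"
# On a standby
sudo -u postgres psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"
```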
Kubernetes Integration Issues¶
If postgres-external chart fails to detect the primary:
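Check whether the rw/ro services have endpoints and whether recent endpoint-manager jobs completed (job names and namespace depend on your deployment):

```bash
kubectl get endpoints postgresql-external-rw postgresql-external-ro -o wide
kubectl get jobs --sort-by=.metadata.creationTimestamp | tail -n 5
```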
Recovery Scenarios¶
For detailed recovery procedures covering complex scenarios such as:
- Complete cluster failure recovery
- Corrupt data node replacement
- Network partition recovery
- Emergency manual intervention
- Backup and restore procedures
- Disaster recovery planning
Please refer to the comprehensive PostgreSQL cluster recovery documentation in the wire-server-deploy repository.
Next Steps¶
With your PostgreSQL HA cluster running, integrate it with your Wire server deployment. The cluster runs independently outside Kubernetes. The postgres-endpoint-manager component (deployed with postgres-external helm chart) keeps Kubernetes services pointed at the current primary, ensuring seamless connectivity during failover.
Install postgres-external helm chart¶
From the wire-server-deploy directory:
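A hedged sketch; the chart and values paths below are illustrative, so adjust them to your checkout layout.

```bash
# Paths are assumptions; point them at the postgres-external chart and values in your checkout
helm upgrade --install postgres-external ./charts/postgres-external \
  -f values/postgres-external/values.yaml
```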
This configures postgresql-external-rw and postgresql-external-ro services with corresponding endpoints.
The helm chart deploys a postgres-endpoint-manager cronjob that runs every 2 minutes to check the current primary. After a failover it updates the endpoints to point at the new primary and standbys; while the cluster is stable, it simply acts as a health probe.
Check cronjob logs:
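The label selector below is an assumption; match it to the labels your chart applies to the cronjob's pods.

```bash
kubectl logs -l app=postgres-endpoint-manager --tail=100
# Or list the jobs created by the cronjob and inspect a specific one
kubectl get jobs | grep postgres-endpoint-manager
```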
See the postgres-endpoint-manager repository for endpoint update details.