I’ve been involved in various Postgres roles (developer on it, consultant for it for BigCo, using it at my startup now) for around 18 years.
I’ve never in that time seen a Pacemaker/Corosync/etc. configuration go well. Ever. I have seen corrupted DBs, failovers for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.
The lesson I’ve taken from this is that it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script than it is to rely on automation. pgbouncer makes this easy enough.
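For what it's worth, a minimal sketch of what such a manual runbook can look like with pgbouncer in front (hostnames, paths, and the replica name are all made up for illustration; the pgbouncer admin commands PAUSE/RELOAD/RESUME and `pg_ctl promote` are real):

```shell
# 1. Pause pgbouncer so client connections queue instead of erroring out.
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "PAUSE;"

# 2. Promote the replica to primary.
ssh replica1 "pg_ctl promote -D /var/lib/postgresql/data"

# 3. Repoint pgbouncer at the new primary, reload its config, resume traffic.
ssh pgbouncer-host \
  "sed -i 's/host=old-primary/host=replica1/' /etc/pgbouncer/pgbouncer.ini"
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RELOAD;"
psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RESUME;"
```

The nice property is that a human confirms the replica is healthy before step 2, which is exactly the check automated cluster managers sometimes get wrong.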
That said, there was a lot of bad luck involved in this incident.
>Fortunately, as part of some unrelated work we'd done recently, we had a version of the cluster that we could run inside Docker containers. We used it to help us build a script that mimicked the failures we saw in production. Being able to rapidly turn clusters up and down let us iterate on that script quickly, until we found a combination of events that broke the cluster in just the right way.
This is the coolest part of this story. Any chance these scripts are open source?
Since I am investigating HA with PostgreSQL right now and have bitter experience of Pacemaker 'HA' instances that have been anything but, I am looking at Amazon Aurora and Microsoft's (in preview) Azure database for PostgreSQL offerings. I would really appreciate any insight from others who are already using them (we intend to do some PoC work shortly).
Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:
* Is the product a fork of PostgreSQL or a wrapper round the current version?
* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?
* If the DB engine keeps in lock-step, how long after a new version of PostgreSQL is released is it incorporated into the live product?
* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?
I think the root issue is that PostgreSQL does not offer an HA solution that works out of the box with minimal configuration, resulting in people using broken third-party ones and/or configuring them incorrectly.
They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).
The other problem is that GoCardless set up an asynchronous and a synchronous replica instead of two synchronous replicas (or preferably 4+), leaving only two nodes that could fail before the cluster was in trouble, which is not enough.
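A sketch of what running more than one synchronous replica looks like in postgresql.conf, using the quorum syntax available since PostgreSQL 10 (the standby names here are hypothetical):

```
# Require acknowledgement from any 2 of these 3 standbys before a
# commit returns, so losing one sync standby doesn't block writes.
synchronous_standby_names = 'ANY 2 (standby_a, standby_b, standby_c)'
synchronous_commit = on
```

With only one synchronous standby configured, that standby crashing stalls all commits on the primary until it's removed from the list, which is part of why a single sync replica is such a fragile setup.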
Makes me sad that running your own instances is now an "elephant in the room." No pride in old-school do-it-yourself nerditry these days :/
This is why you should be extremely wary of anything that is only run once in a blue moon. And very wary of such things that, when run, are being run to save your bacon.
"The RAID controller logged the simultaneous loss of 3 disks from the array. All subsequent read and write operations against it failed."
People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)
Pacemaker is known to wreak havoc if it gets angry. The usual path to quick recovery when the cluster goes crazy like this is to make really sure which replica is the most up to date, shut down Pacemaker completely, assign the VIP manually to a healthy replica, and promote it manually. Then, once you're back in business, figure out how to rebuild the cluster.
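The "which replica is most up to date" step can be sketched like this (hostnames are made up; `pg_last_wal_replay_lsn()` is the real function name on PostgreSQL 10+, older versions call it `pg_last_xlog_replay_location()`):

```shell
# Compare replay positions across replicas; the highest LSN has
# replayed the most WAL and is the safest promotion candidate.
for h in replica1 replica2; do
  echo -n "$h: "
  psql -h "$h" -U postgres -Atc "SELECT pg_last_wal_replay_lsn();"
done

# Then promote the winner by hand, e.g.:
#   ssh replica1 "pg_ctl promote -D /var/lib/postgresql/data"
```

Promoting anything other than the furthest-ahead replica silently discards committed transactions, which is why this check is worth doing slowly and by hand.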
Good write-up. I'm curious about two more things:
1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?
2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?
The end of this post mortem was a bit handwavy TBH. I feel like they didn't dig deep enough: the real problem was the backup VIP, not the coincidence of the two processes crashing at once.
I think by still allowing the backup VIP to run on the sync replica the same mistake is being repeated, there will always be the possibility of a situation where the VIP cannot be moved when promotion is required. That replica should be doing nothing but sitting there waiting to save the day, and if they want the backup VIP to be highly available they should provision 2 async replicas.
I too am coming up on a need for no-downtime HA failover for Postgres. I too am not allowed to use a hosted PaaS-ish solution like RDS. I was considering Citus's multi master impl (I don't need to spread the load, just need HA). I had not considered Pacemaker. Has GoCardless investigated this option and have any insight to give? HA has traditionally been a real pain point for traditional RDBMS's in my experience.
I'm told that MySQL replication blows Postgres out of the water by my company's data team, but they could just be biased since that is their area of expertise. I work on server code and don't really have much familiarity with the operations of running replica chains.
Postgres seems like a better choice for personal projects since it has a lot of nifty features. I'm also wary of Oracle, but that's my own attitude talking. For a startup eventually wanting to scale, would the better choice be to use MySQL out of the gate? Am I being misled about Postgres clusters and availability?
Serious (naive) question; not wanting to start a flame war.
There are specialized tools like Patroni (https://github.com/zalando/patroni).
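For anyone curious what that looks like in practice, a rough sketch of a per-node Patroni config (every hostname, address, and credential here is a placeholder, and real deployments need more settings than this):

```yaml
# patroni.yml sketch: one node of a cluster coordinated through etcd.
scope: mycluster          # cluster name shared by all nodes
name: node1               # unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

etcd:
  host: 10.0.0.10:2379    # the DCS that holds the leader lock

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: changeme
```

The design difference from Pacemaker is that leader election goes through a consensus store (etcd/Consul/ZooKeeper) rather than the cluster manager's own quorum logic, which tends to fail in less surprising ways.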
If you are running HA in AWS RDS, how would you compare your experience with the above? What are the types of RDS failure modes that you have experienced?
So far I've discovered that TCP keepalives are quite important, otherwise your queries may hang forever after failover (or at least for the default timeout which is like 30 minutes). The connection does not get broken otherwise by the failover.
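Concretely, libpq lets you set the keepalive knobs per connection (the `keepalives*` parameters are standard libpq options; the host and database names here are made up):

```shell
# Detect a dead peer in roughly 10 + 4*5 = 30 seconds instead of the
# OS default, which can be two hours or more.
psql "host=db.example.com dbname=app \
      keepalives=1 keepalives_idle=10 \
      keepalives_interval=5 keepalives_count=4"
```

The same parameters work in any connection string handed to a libpq-based driver, so application pools can be fixed the same way without touching kernel-wide sysctls.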
I am astonished that, in those two years, you had not already handled 100+ scheduled failovers. If your HA is good, customers don't notice, and if it isn't, you find out while fewer of them are watching (and in daytime!), and fix it.
Probably by now Pacemaker would have been abandoned. A hundred drills would have been enough to flush out these behaviors. If you are afraid to run drills on production equipment, you should be running them on a full-scale production testbed, ideally with mirrored production traffic. With a production-scale testbed, two years is enough to run thousands of risk-free failovers.
Not doing frequent production failure drills is just irresponsible.
I'm seeing the term "HA" being used a lot in the comments here, what does it mean?
Stolon with etcd or consul is a far superior solution for HA postgres.
Stop pretending that there's a magic bullet called "multi-master" and "transparent promotion". Your apps are super simple. Their DB interactions are super simple. Learn how to do federations and all these problems will go away.