Postgres: Long transactions and Access exclusive locks
I would like to share a postgres story that has happened to me a few times.
The story has the following ingredients:
- An overly centralised Entity/Relationship model
- Long database transactions to aid data consistency
- Schema migration of tables
- Poor observability of transactions
1. Centralised tables go like this: A database with many FK to a to a central table in the business logic that hasn’t been properly designed works fine most of the time.
Django or any other MVC (MTV) frameworks that manage the database schema and include some central tables (like User) suffer from this issue
As the system grows, many models FK directly to the existing models rather than redesigning the database to reduce the number of FKs to a single model. These models become accidental Aggregates.
2. Long database transactions are a common pattern in MVC frameworks: As the system grows, more and more database operations need to happen with transactional guarantees. These frameworks offer the option to use the database transaction to coordinate all of these operations.
In web applications where many workers access the same database, this is a good trade off as it can be represented in code without any external tools. Generally this coordination of what needs to be simultaneous is one of the hard problems to figure out in a system.
3. Schema migrations of tables
These are also common in MVC frameworks. In python both django and alembic offer to manage the database schema by reading the model definitions from python files. But partly because of their ease to change and apply, in some configurations these happen at any time.
Everything works fine until it doesn’t
Imagine the following set of N transactions:
T1 -> T2 -> T3 -> T4 -> T5 -> ... -> TN
None of these factors cause issues on their own.
- Accidental aggregates add extra cost to any FK changes, but not enough to be noticeable.
- Long transactions could cause a problem, but most of the time the will lock individual rows and not interfere with other operations.
- Schema changes could block the database briefly because of the expensive table lock, but avoiding changing already populated big tables is generally enough
Combining all of these though can cause the database to become unresponsive. Something as simple as creating a new table with a FK to a central table has the power to block any operation that reads or writes the central table.
So for the same set of transactions:
- T2 is a long transaction
- T3 is a schema change that requires Access Exclusive
T1 -> T2(Long) -> T3(Lock) -> T4 -> T5 -> ... -> TN
If T2 was short, T3 would be able to start immediately. If T3 didn’t lock the whole table, it would allow the subsequent transactions to continue.
In this scenario:
T4 -> ... -> TNare all blocked by T3 because they all refer to the centralised model and postgres honours the order of operations.T3is blocked untilT2finishes because its lock requires no other operation running on the table
So no read operation happens to any table affected until T2 and
T3 are done. Which is enough to bring a web app down.
Bad visibility
These type of issues are hard to see from regular observability tools like datadog and grafana.
My personal experience is with datadog. It is possible to see the long queries and the long transactions but it is impossible to attribute causality to it. So if your system features long transactions and frequent schema changes, it will take some time to infer which transaction caused it and what to do
Solutions
There are many ways to tackle mitigate the risk of this scenario:
Plan for downtime For some systems this might be a viable option: Establish a time of the day where the schema change happen, avoid long transactions during that time.
Use write constraints https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c It is possible to avoid using Foreign keys entirely and instead use a constraint. It will mitigate the risk of locking but add a penalty on every single write
Disallow long transactions Postgres allows to cancel long transactions https://www.postgresql.org/docs/17/runtime-config-client.html#GUC-TRANSACTION-TIMEOUT It is feasible to refactor the application to rely on external mechanisms instead of database locks. Even an advisory lock will be enough as it doesn’t use any particular table.
Refactor the models If having a central model causes this problem, separating the models within or even outside the database will fix it. Moving the coordination away from a single database has its own trade offs. Added complexity and out of order operations are common in microservice architectures.