Patching PgBouncer to Drop Slow Queries

17 September 2013  —  Hans Hasselberg & Torsten Becker

The other day my friend and colleague Hans was troubleshooting problems with one of our production PostgreSQL databases at 6Wunderkinder. The master was under high load and slowed down the whole system. So we needed to figure out why this happened and fix the problem.

Our setup includes one master PostgreSQL database and two slaves for reading. These database are connected to a PgBouncer which pools connections and is accessed by our app servers.

The problem was that something was running a bad query on the master:

SELECT COUNT(*)
FROM tasks
WHERE /* complex condition */;

This query is very problematic for the master because it is not usually doing these kind of reads with very complex WHERE clauses and it caused high I/O wait times which slowed down the master noticeably.

After searching the whole codebase and logs, we could not find any plausible line where this query could come from. So eventually we had to stop and yield to ActiveRecord’s ORM methods which (un)fortunately hide most raw SQL queries.

We tried getting more informations from PostgreSQL by running:

SELECT procpid, client_addr, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'SELECT COUNT(*) FROM tasks%';

The above query can help identifying the machine which is running a matching query. Since we have a PgBouncer in our setup, client_addr always points to the PgBouncer.

So what’s the solution? Of course: Patching PgBouncer to kill all bad queries. This would trigger exceptions in the Ruby code and would then make the bad code visible in the logs. Causing a few 500s would also not be a problem because all our clients have offline support.

So we put on our C-heads, fired up gdb, and dug a couple minutes through the PgBouncer source. Eventually we found a neat spot: safe_send(). This function was called for every packet being forwarded from the client to the PostgreSQL server.

While this is not a very clean solution (because this is a place in the generic network library), we rewrote all task count queries to be invalid just to test our hack:

int safe_send(int fd, const void *buf, int len, int flags)
{
    ...
    if (memcmp(((char *)buf) + 5, "SELECT COUNT(*) FROM tasks", 26) == 0) {
        memcpy(((char *)buf) + 5, "SELECT * FROM 1337;--", 21);
    }
    ...

And woohoo, this actually worked smooth as butter on our development machine! Testing it on production was also without much harm since PgBouncer has a very useful -R flag to perform an online restart.

So after some cleanup we ended up with a small patch for PgBouncer and LibUsual. The snippet above is the heart of the patch, but we added more: Logging the killed queries and a config option, which enables turning this feature on and off without re-compiling and without even restarting PgBouncer.

If you ever end up with untraceable slow queries in PostgreSQL, apply this patch without any guarantees and at your own risk. And then grep the logs for 1337!