[guardian-dev] Summary of findings: IOCiper / libsqlfs and WAL growth
sjlombardo at zetetic.net
Thu Feb 7 22:18:28 EST 2013
David asked me to write up a brief recap of our findings about WAL growth under load with libsqlfs.
During the performance tuning and stabilization effort for IOCipher we enabled WAL journal mode for libsqlfs databases. WAL mode is a relatively new feature in SQLite that replaces the transaction journal with a write ahead log file and shared memory segment. One significant benefit of WAL mode is that it allows clients to read from a database at the same time another connection is writing to it. This provides improved performance in situations where multiple processes or threads are be reading and writing concurrently, and differs from the standard journal mode which exclusively locks the database during writes.
WAL mode allows concurrent read and write operations by appending changes to the WAL file. When a commit occurs it is also marked in the WAL. Since writers are appending data, readers are able to operate both off the original database contents and those records in the WAL prior to the last commit when the read started. Thus, writers can continually update the WAL without affecting the "snapshot" of data the readers are using.
Periodically, SQLite should move data from the WAL back into the main database file using a checkpoint operation. By default SQLite will attempt a passive checkpoint when the WAL grows to a sufficient size and a commit occurs. When a checkpoint finishes, SQLite can try to reset so the next transaction will write at the start of the WAL file if possible. However, it is important to note that a checkpoint can only restart the WAL if there are no other readers currently using it. SQLite will not truncate or delete the WAL until all database handles are closed.
During libsqlfs testing using multiple concurrent fsx processes (3 or more), Hans observed that the WAL file would grow without bound in his test environment. On some test systems the WAL would grow slowly, allowing long test runs without issues. On others, the growth could quickly fill up a drive. This behavior is unexpected, but the WAL is actually operating as designed. The root cause is that, with fsx processes running constantly, SQLite doesn't have the opportunity to complete a checkpoint. As the WAL grows the read operations essentially block checkpoints from restarting writes at the beginning of the file. Thus, writes wind up appended to the WAL, which continues to grow unchecked. We verified this behavior by adding tracing code to output the result of repeated checkpoint operations over the course of a test run.
This leaves us with a few options.
1. Switch back to the default journal mode. The use of reserved locks in libsqfs and the new busy timeout have stabilized the library under load even without WAL. This would immediately resolve the problem, but would sacrifice concurrency and performance.
2. Do nothing, as the WAL is actually working as intended and most applications would never run into this issue. Of course this is not ideal, as the problem could crop up for applications that are make continuous concurrent use of the database, applications under high load, and in testing.
3. Introduce a periodic blocking checkpoint, which would force a reset after some number of commits.
Since the first two options seem less desirable, we began looking in depth at the 3rd. Unfortunately, initial attempts at periodic checkpoints were not successful. They slowed growth but didn't stop it entirely. The problem was that, even though a restart checkpoint can block new readers and writers from starting, it can't run if there are ongoing transactions. Thus, over time even the explicit checkpoints could fail under high load.
At this point it became clear that in order to allow reset checkpoints to complete periodically, we needed to block readers. However, this blocking would need to be selective -- if we made all reads and writes exclusive we'd loose most of the benefits of using WAL in the first place. Ideally, we would want to block reads only when when a checkpoint operation needed to run.
Thus, as a proof of concept, we enhanced the libsqlfs library to introduce a Pthreads read-write lock for the checkpoint operation. In functions that read the database (e.g. get_value_block, get_attr), the code attempts to read lock with a timeout before proceeding. When it is time to checkpoint, say after 500 commits, an attempt is made to write lock with a timeout. Once a write lock is obtained subsequent readers are blocked, allowing the checkpoint to run to completion and restart the WAL. This proof of concept code is available here, on the rwlock branch:
This patch attempts to maximize performance by allowing read/write concurrency during most operations and exclusively locking only for infrequent checkpoints. In my environment this halted uncontrolled WAL growth under multi-hour test runs with 3 or 4 concurrent fsx processes. I'd definitely be interested to hear test results from anyone else who has a chance to put it through it's paces. This code would need additional work and testing to make it production ready, but it at least demonstrates a potentially viable approach to this issue.
Please let me know if you have any questions. Thanks!
More information about the Guardian-dev