[guardian-dev] Summary of findings: IOCipher / libsqlfs and WAL growth

Stephen Lombardo sjlombardo at zetetic.net
Fri Feb 8 15:37:04 EST 2013


Hi Hans,

On 2013-02-08, Hans-Christoph Steiner wrote:
> Thanks for that thorough write-up!  I did test your pthread locking some,
> but didn't have time to come up with conclusive results.  At the very least, I
> think it's safe to say it did slow the WAL log growth.

I'm glad to hear that you've had some positive results on your side too. It is normal for the WAL file to grow, since any transactions that occur between checkpoints are appended to it. For example, with 3 fsx processes, at least one of them using large writes, I've seen the WAL grow as large as 50 MB, but it eventually stabilizes and stops growing. It would be great if you, and perhaps some other folks, could run some extended multi-hour tests to confirm that the patch halts growth at some upper limit, even if that limit is large.
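To make the multi-hour runs easier to judge, the WAL size could be sampled periodically and checked for a plateau. The following is only a minimal Python sketch, not part of the patch: `sample_wal_size` and `has_plateaued` are hypothetical helpers, and it assumes the WAL sits next to the database file under SQLite's usual `<database>-wal` name.

```python
import os


def sample_wal_size(db_path):
    """Return the current WAL size in bytes, or 0 if no WAL file exists."""
    try:
        # Assumption: SQLite's standard naming, "<database>-wal".
        return os.path.getsize(db_path + "-wal")
    except FileNotFoundError:
        return 0


def has_plateaued(samples, window):
    """True if none of the last `window` samples exceeds the earlier peak."""
    if window <= 0 or len(samples) <= window:
        return False  # not enough history to decide
    earlier_peak = max(samples[:-window])
    return max(samples[-window:]) <= earlier_peak
```

A test harness could call `sample_wal_size` once a minute while fsx runs and treat the run as a pass once `has_plateaued` holds for a sufficiently long window.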

Once we've established that the patch is really working, it will need tuning. The current interval between checkpoints is set to 500 transactions. This might be too high for some use cases, so we should try 100 or even 50. Alternatively, we could explore an entirely different approach to deciding when to force a checkpoint, e.g.:

1. Checkpoint based on a period of time instead of the number of transactions. For instance, we might force a checkpoint after any commit that occurs more than 30 seconds after the last checkpoint. This may work better for clients issuing large writes, since it could take them a long time to reach a fixed threshold based on the number of committed transactions.

2. Re-enable wal_autocheckpoint for the default behavior, but periodically track the size of the WAL file. Then we could force a checkpoint only if the WAL grows by a certain percentage. This would act as a failsafe, since WAL file growth would indicate that regular checkpoints are unable to complete due to competing operations.
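As a rough illustration of the two options, here is a minimal Python sketch using the standard `sqlite3` module rather than libsqlfs's C code. `CheckpointPolicy` and `force_checkpoint` are hypothetical names. The 30 second limit comes from option 1; the 50% growth figure, and measuring growth against the size seen at the last checkpoint, are assumptions made only for the example.

```python
import sqlite3
import time


class CheckpointPolicy:
    """Decides when a forced checkpoint is due (hypothetical helper)."""

    def __init__(self, max_age_s=30.0, growth_pct=50.0, clock=time.monotonic):
        self.max_age_s = max_age_s      # option 1: time since last checkpoint
        self.growth_pct = growth_pct    # option 2: assumed growth threshold
        self._clock = clock
        self._last_checkpoint = clock()
        self._baseline_bytes = None

    def due_by_time(self):
        # Option 1: call after each commit.
        return self._clock() - self._last_checkpoint > self.max_age_s

    def due_by_growth(self, wal_bytes):
        # Option 2: call periodically with the current WAL size.
        if not self._baseline_bytes:
            self._baseline_bytes = wal_bytes  # first non-empty sample
            return False
        limit = self._baseline_bytes * (1.0 + self.growth_pct / 100.0)
        return wal_bytes > limit

    def record_checkpoint(self, wal_bytes):
        # Reset both triggers after a forced checkpoint completes.
        self._last_checkpoint = self._clock()
        self._baseline_bytes = wal_bytes


def force_checkpoint(conn):
    """Run a RESTART checkpoint; return True if it was not blocked."""
    # The pragma returns one row: (busy, log_frames, checkpointed_frames).
    busy, _log, _done = conn.execute("PRAGMA wal_checkpoint(RESTART)").fetchone()
    return busy == 0
```

With option 2, `due_by_growth` would only ever fire when the automatic checkpoints are being starved, which matches the failsafe role described above.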

We'd also need to do a small audit of the code to make sure all read functions are covered by the read lock (the POC patch only covers the two most common ones right now).

In summary, there are a number of options we can discuss on how to proceed, but the first and most important thing is to verify that the POC approach reliably circumvents the WAL growth across a range of environments. 

> Also, about the WAL log not being deleted on unmount, I wasn't able to
> conclusively say one way or the other whether the WAL log was always being
> deleted properly or not.  My plan is to script a test of that, and run it
> under varying conditions.

I can't remember if I mentioned it on IRC, but by killing fuse_sqlfs I was able to reproduce a situation where the WAL was not cleaned up. However, the WAL file is always cleaned up properly in my environment when everything is stopped and unmounted cleanly. Let me know if you're able to narrow this down and reproduce it under other circumstances.
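The clean-shutdown half of such a scripted test can be sketched at the SQLite level. This is only an illustration in Python against plain SQLite (not SQLCipher or a mounted fuse_sqlfs volume), and `wal_state_around_clean_close` is a hypothetical helper; it relies on SQLite removing the `-wal` file when the last connection closes cleanly.

```python
import os
import sqlite3
import tempfile


def wal_state_around_clean_close():
    """Return (wal_exists_while_open, wal_exists_after_close) for a scratch DB."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "scratch.db")  # hypothetical scratch file
        wal_path = db_path + "-wal"                # SQLite's WAL naming

        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL").fetchall()
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        while_open = os.path.exists(wal_path)

        conn.close()  # clean shutdown of the last connection
        after_close = os.path.exists(wal_path)
        return while_open, after_close
```

A fuller test would repeat the check against the real mount and add a variant that kills the process instead of closing cleanly, to cover the case reproduced above.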

Cheers,
Stephen

