[guardian-dev] debugging libsqlfs

Hans-Christoph Steiner hans at guardianproject.info
Tue Jan 8 22:25:08 EST 2013


Ok narrowing it down, but boy are these bugs squirmy and hard to reproduce in
any well defined way.  Here's my thoughts after today:

* it seems pretty clear that the corruption and crashes only happen in
threaded mode, not in -s single thread mode.  I have a single threaded test
running now for >5000 iterations.

* my guess is that there are two separate-ish issues: file corruption and
crash.  The file corruption can happen without a crash happening (fsx will
stop and complain).  The crash seems to cause file corruption, which is not
surprising, but my guess is that its not related to the first file corruption
where there is no crash.

* I finally understand what the sqlfs_op_* functions are about.  By sending a
NULL/0 sqlfs to sqlfs_proc_* functions, it triggers get_sqlfs() to get the
sqlfs from the pthread_key_t.  So we should be able to use the sqlfs_op_*
functions in the JNI code to support multi-threading.  I added a test script
for using sqlfs_proc_* in the style of sqlfs_op_*, i.e. 0 for the sqlfs arg:

https://github.com/eighthave/libsqlfs

.hc

On 01/08/2013 01:45 PM, Abel Luck wrote:
> I can reproduce these.
> 
> 70-80% of the time the truncate op appears in the backtrace, so I
> definitely think there's something fishy going on there.
> 
> However, due to the nature of fsx (i.e., do random shit a whole bunch
> until something breaks) the bug might not necessarily lie in truncate.
> Perhaps something is being written incorrectly, and truncation merely
> triggers the fault.
> 
> :|
> 
> 
> Hans-Christoph Steiner:
>>
>> Ok round 2 with valgrind, this looks like it caught something.
>>
>> * as root, I ran:
>>  valgrind --tool=memcheck --trace-children=no --leak-check=full
>> --show-reachable=yes \
>>    --max-stackframe=3000000 -v ./.libs/fuse_sqlfs -d -o allow_other /mnt/
>>
>> * then the same fsx as with gdb:
>>  ./libsqlfs/tests/fsx  -l 10485760 -o 1048576  /mnt/testfile8
>>
>> And it quite quickly gave me a useful looking dump, but its late, and I'm
>> going to bed soon, so hopefully, Abel, you can make something of this:
>>
>> FUSE library version: 2.9.0
>> nullpath_ok: 0
>> nopath: 0
>> utime_omit_ok: 0
>> unique: 1, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
>> INIT: 7.17
>> flags=0x0000047b
>> max_readahead=0x00020000
>>    INIT: 7.18
>>    flags=0x00000010
>>    max_readahead=0x00020000
>>    max_write=0x00020000
>>    max_background=0
>>    congestion_threshold=0
>>    unique: 1, success, outsize: 40
>> unique: 2, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 9137
>> LOOKUP /testfile8
>> getattr /testfile8
>> --9131-- REDIR: 0x4267890 (__memset_sse2) redirected to 0x402aba0 (memset)
>> --9131-- REDIR: 0x42689a0 (__memcpy_ssse3) redirected to 0x4029c10 (memcpy)
>> --9131-- REDIR: 0x4273a40 (__memcmp_ssse3) redirected to 0x402a780 (bcmp)
>>    NODEID: 2
>>    unique: 2, success, outsize: 144
>> unique: 3, opcode: OPEN (14), nodeid: 2, insize: 48, pid: 9137
>> open flags: 0x8002 /testfile8
>>    open[0] flags: 0x8002 /testfile8
>>    unique: 3, success, outsize: 32
>> unique: 4, opcode: SETATTR (4), nodeid: 2, insize: 128, pid: 9137
>> truncate /testfile8 0
>> ==9131== Thread 4:
>> ==9131== Invalid write of size 4
>> ==9131==    at 0x4029E8A: memcpy (mc_replace_strmem.c:838)
>> ==9131==    by 0x412999A: get_value_block (sqlfs.c:1314)
>> ==9131==    by 0x412A6C7: get_value (sqlfs.c:1475)
>> ==9131==    by 0x412E803: sqlfs_proc_truncate (sqlfs.c:2533)
>> ==9131==    by 0x412F262: sqlfs_op_truncate (sqlfs.c:3421)
>> ==9131==    by 0x40F9809: fuse_fs_truncate (in
>> /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x40FE63F: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x4103E8B: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x4104560: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==  Address 0x66b4ae0 is 8,321,720 bytes inside a block of size
>> 8,321,722 alloc'd
>> ==9131==    at 0x4028308: malloc (vg_replace_malloc.c:263)
>> ==9131==    by 0x412A64D: get_value (sqlfs.c:1467)
>> ==9131==    by 0x412E803: sqlfs_proc_truncate (sqlfs.c:2533)
>> ==9131==    by 0x412F262: sqlfs_op_truncate (sqlfs.c:3421)
>> ==9131==    by 0x40F9809: fuse_fs_truncate (in
>> /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x40FE63F: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x4103E8B: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==    by 0x4104560: ??? (in /lib/i386-linux-gnu/libfuse.so.2.9.0)
>> ==9131==
>> ==9131== Invalid read of size 4
>> ==9131==    at 0x4049C32: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x4079F4A: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x404DA04: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x407F63F: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x4092B2A: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x40939A2: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x40C8665: ??? (in /usr/lib/i386-linux-gnu/libsqlite3.so.0.8.6)
>> ==9131==    by 0x6E3C3EF: ???
>> ==9131==  Address 0xc is not stack'd, malloc'd or (recently) free'd
>> ==9131==
>>
>>
>> .hc
>>
>>
>> On 01/07/2013 09:50 PM, Hans-Christoph Steiner wrote:
>>>
>>> Hey Abel,
>>>
>>> I'm now diving in deep into debugging using gdb for now.  I've been catching
>>> crashed in sqlfs with gdb and can get backtraces.  For the record, here's how
>>> I'm doing it:
>>>
>>> * using a VM to avoid filesystem crashes and lock-ups :)
>>>
>>> * to mount the file system, run this as root:
>>> 	gdb --args ./.libs/fuse_sqlfs -d -o allow_other /mnt/
>>>
>>> * to test it, run this as a normal user:
>>> 	./libsqlfs/tests/fsx  -l 10485760 -o 1048576  /mnt/testfile8
>>>
>>> Most of the backtraces look like this, i.e. occuring in sqlfs_proc_trunctate()
>>> with sqlfs=sqlfs at entry=0x0:
>>>
>>> #1  0xb7ede033 in get_value_block (sqlfs=sqlfs at entry=0x8072b10,
>>>     key=0x808a7c8
>>> "(\317\006\b\b\265\b\b\300\266\b\bx\270\b\b\230\250\b\b\260\267\b\b\b",
>>>     key at entry=0x805a618 "/testfile8",
>>>     data=0xb64d2008
>>> "\024\177\024\305\024\303\024\241\024\231\024\372\024\006\024\334\024u\024\314\024\336\024\353\024\064\024\266\024x\024\277\024$\024\063\024\033\024\332\024\206\024\257\024\034\024y\024\a\024\340\024\061\024\202\024\a\024e\024\017\024\071\024?\024\221\024$\024\274\024\271\024U\024\336\024\030\024\254\024\302\024\065\024@\024\020\024\061\024\257\024\243\024\261\024\"\024\305\024\353\024T\024D\024@\024\357\024\214\024\064\024%\024\356\024A\024\062\024\272\024p\024y\024#\024\321\024\246\024\326\024K\024\250\024\022\024\372\024!\024:\024\264\024O\024\356\024\317\024\377\024Y\024}\024\246\024\306\024\343\024K\024\b\024\261\024m\024\314\024]\024\005\024\261\024\351\024\\\024@\024\210\024\267\024\331\024\243"...,
>>> block_no=block_no at entry=63, size=size at entry=0x0) at sqlfs.c:1326
>>> #2  0xb7edee28 in get_value (sqlfs=0x8072b10, key=key at entry=0x805a618
>>> "/testfile8",
>>>     value=value at entry=0xb7504fb4, begin=8257536, begin at entry=0, end=8321722,
>>> end at entry=0)
>>>     at sqlfs.c:1487
>>> #3  0xb7ee36f9 in sqlfs_proc_truncate (sqlfs=sqlfs at entry=0x0,
>>> path=path at entry=0x805a618 "/testfile8",
>>>     size=0) at sqlfs.c:2545
>>> #4  0xb7ee3fb3 in sqlfs_op_truncate (path=0x805a618 "/testfile8", size=0) at
>>> sqlfs.c:3433
>>> #5  0xb7ef980a in fuse_fs_truncate () from /lib/i386-linux-gnu/libfuse.so.2
>>>
>>>
>>> It seems that sqlfs_t gets re-issued a lot, I added this to get_sqlfs() and
>>> got it many many times:
>>>
>>>     if (p)
>>>         return p;
>>>     else
>>>         show_msg(stderr, "sqlfs was 0");
>>>
>>> This lead me to examine the threading stuff, since it seems that when sqlfs ==
>>> 0, its getting a new sqlfs issued from:
>>>
>>>     sqlfs = (sqlfs_t *) (pthread_getspecific(sql_key));
>>>
>>> Since the libsqlfs code only has pthread code relating to the pthread_key_t
>>> creation and use, my guess is that FUSE itself is handling the rest of the
>>> threading.  So the good news is that libsqlfs was made with threading in mind,
>>> the bad news is that a lot of it is normally handled in libfuse, so we'll have
>>> to figure that part out and implement it in IOCipher in order to support threads.
>>>
>>> Now, I'm going to see if I can get valgrind to run on fuse_sqlfs.
>>>
>>> .hc
>>>
>> _______________________________________________
>> Guardian-dev mailing list
>>
>> Post: Guardian-dev at lists.mayfirst.org
>> List info: https://lists.mayfirst.org/mailman/listinfo/guardian-dev
>>
>> To Unsubscribe
>>         Send email to:  Guardian-dev-unsubscribe at lists.mayfirst.org
>>         Or visit: https://lists.mayfirst.org/mailman/options/guardian-dev/abel%40guardianproject.info
>>
>> You are subscribed as: abel at guardianproject.info
>>
> 
> _______________________________________________
> Guardian-dev mailing list
> 
> Post: Guardian-dev at lists.mayfirst.org
> List info: https://lists.mayfirst.org/mailman/listinfo/guardian-dev
> 
> To Unsubscribe
>         Send email to:  Guardian-dev-unsubscribe at lists.mayfirst.org
>         Or visit: https://lists.mayfirst.org/mailman/options/guardian-dev/hans%40guardianproject.info
> 
> You are subscribed as: hans at guardianproject.info
> 


More information about the Guardian-dev mailing list