Lost IRQ drop harddisk

Bug #577785 reported by alf@all.de
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Expired
Medium
Unassigned

Bug Description

Binary package hint: linux-image-2.6.32-21-server

The hardware is working fine for 2 years. Cause some ext4 problems I decide to upgrade.

After the Upgrade to 10.04LTS my server lost a random harddisk once a day.
(Ubuntu 2.6.32-21.32-server 2.6.32.11+drm33.2)
This occure with one disk of the sata soft-raid or sometimes with the ide root partition.

After a reboot all is working fine for a couple of hours.
A sync of 6 TB soft-raid (6*1,5GB sata) is often complete successful in this time.

PS: I note in the Call-Trace a recursion in sync_buffer. Is this really right ?
[ 3240.828016] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3240.829103] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.830119] [<ffffffff8116d215>] sync_buffer+0x45/0x50
[ 3240.831147] [<ffffffff81555bef>] __wait_on_bit+0x5f/0x90
[ 3240.832197] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50

# dmesg
[ 3081.982688] ata1: lost interrupt (Status 0x50)
[ 3081.983466] sd 0:0:0:0: [sda] Unhandled error code
[ 3081.984245] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3081.985092] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 14 eb 00 00 10 00
[ 3081.985984] end_request: I/O error, dev sda, sector 5641451
[ 3081.986878] Buffer I/O error on device dm-1, logical block 681028
[ 3081.987805] lost page write due to I/O error on dm-1
[ 3081.988647] Buffer I/O error on device dm-1, logical block 681029
[ 3081.989392] lost page write due to I/O error on dm-1
[ 3112.014960] ata1: lost interrupt (Status 0x50)
[ 3112.015887] sd 0:0:0:0: [sda] Unhandled error code
[ 3112.017101] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3112.018489] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 15 0b 00 00 08 00
[ 3112.020199] end_request: I/O error, dev sda, sector 5641483
[ 3112.021738] Buffer I/O error on device dm-1, logical block 681032
[ 3112.023267] lost page write due to I/O error on dm-1
[ 3143.011366] ata1: lost interrupt (Status 0x50)
[ 3143.012391] sd 0:0:0:0: [sda] Unhandled error code
[ 3143.013724] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3143.015232] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 15 1b 00 00 08 00
[ 3143.016835] end_request: I/O error, dev sda, sector 5641499
[ 3143.018425] Buffer I/O error on device dm-1, logical block 681034
[ 3143.019910] lost page write due to I/O error on dm-1
[ 3174.011299] ata1: lost interrupt (Status 0x50)
[ 3174.012362] sd 0:0:0:0: [sda] Unhandled error code
[ 3174.013470] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3174.015268] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 15 33 00 00 08 00
[ 3174.017145] end_request: I/O error, dev sda, sector 5641523
[ 3174.018442] Buffer I/O error on device dm-1, logical block 681037
[ 3174.019481] lost page write due to I/O error on dm-1
[ 3205.011295] ata1: lost interrupt (Status 0x50)
[ 3205.012313] sd 0:0:0:0: [sda] Unhandled error code
[ 3205.013447] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3205.015025] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 16 d3 00 00 08 00
[ 3205.016868] end_request: I/O error, dev sda, sector 5641939
[ 3205.018802] Buffer I/O error on device dm-1, logical block 681089
[ 3205.020483] lost page write due to I/O error on dm-1
[ 3236.011299] ata1: lost interrupt (Status 0x50)
[ 3236.012446] sd 0:0:0:0: [sda] Unhandled error code
[ 3236.014505] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3236.016469] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 16 e3 00 00 08 00
[ 3236.018560] end_request: I/O error, dev sda, sector 5641955
[ 3236.020006] Buffer I/O error on device dm-1, logical block 681091
[ 3236.020972] lost page write due to I/O error on dm-1
[ 3240.820072] INFO: task jbd2/dm-1-8:619 blocked for more than 120 seconds.
[ 3240.821024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.822163] jbd2/dm-1-8 D 00000000ffffffff 0 619 2 0x00000000
[ 3240.823341] ffff880220253c20 0000000000000046 0000000000015bc0 0000000000015bc0
[ 3240.824572] ffff880221ceb1a0 ffff880220253fd8 0000000000015bc0 ffff880221ceade0
[ 3240.825790] 0000000000015bc0 ffff880220253fd8 0000000000015bc0 ffff880221ceb1a0
[ 3240.826999] Call Trace:
[ 3240.828016] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3240.829103] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.830119] [<ffffffff8116d215>] sync_buffer+0x45/0x50
[ 3240.831147] [<ffffffff81555bef>] __wait_on_bit+0x5f/0x90
[ 3240.832197] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3240.833181] [<ffffffff81555c98>] out_of_line_wait_on_bit+0x78/0x90
[ 3240.834306] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3240.835484] [<ffffffff81084e67>] ? bit_waitqueue+0x17/0xd0
[ 3240.836574] [<ffffffff8116d1c6>] __wait_on_buffer+0x26/0x30
[ 3240.837561] [<ffffffff8121ac6c>] jbd2_journal_commit_transaction+0x90c/0x1240
[ 3240.838693] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.839890] [<ffffffff81221a9d>] kjournald2+0xbd/0x220
[ 3240.841026] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.842254] [<ffffffff812219e0>] ? kjournald2+0x0/0x220
[ 3240.843667] [<ffffffff81084c26>] kthread+0x96/0xa0
[ 3240.845201] [<ffffffff810141ea>] child_rip+0xa/0x20
[ 3240.846758] [<ffffffff81084b90>] ? kthread+0x0/0xa0
[ 3240.848027] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[ 3240.849049] INFO: task kjournald:1622 blocked for more than 120 seconds.
[ 3240.850217] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.851658] kjournald D 0000000000000000 0 1622 2 0x00000000
[ 3240.853149] ffff8802183edc30 0000000000000046 0000000000015bc0 0000000000015bc0
[ 3240.854596] ffff8802220483c0 ffff8802183edfd8 0000000000015bc0 ffff880222048000
[ 3240.856096] 0000000000015bc0 ffff8802183edfd8 0000000000015bc0 ffff8802220483c0
[ 3240.857086] Call Trace:
[ 3240.857983] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3240.858963] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.859877] [<ffffffff8116d215>] sync_buffer+0x45/0x50
[ 3240.860861] [<ffffffff81555bef>] __wait_on_bit+0x5f/0x90
[ 3240.861991] [<ffffffff8116bfb1>] ? submit_bh+0x111/0x140
[ 3240.863714] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3240.865069] [<ffffffff81555c98>] out_of_line_wait_on_bit+0x78/0x90
[ 3240.866597] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3240.867661] [<ffffffff8116d1c6>] __wait_on_buffer+0x26/0x30
[ 3240.868593] [<ffffffff81212beb>] journal_commit_transaction+0x31b/0xe40
[ 3240.869485] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
[ 3240.870526] [<ffffffff81076bdc>] ? lock_timer_base+0x3c/0x70
[ 3240.871731] [<ffffffff81077665>] ? try_to_del_timer_sync+0x75/0xd0
[ 3240.872899] [<ffffffff81216a8d>] kjournald+0xed/0x250
[ 3240.874013] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.875236] [<ffffffff812169a0>] ? kjournald+0x0/0x250
[ 3240.876185] [<ffffffff81084c26>] kthread+0x96/0xa0
[ 3240.877097] [<ffffffff810141ea>] child_rip+0xa/0x20
[ 3240.877937] [<ffffffff81084b90>] ? kthread+0x0/0xa0
[ 3240.878808] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[ 3240.879610] INFO: task flush-251:1:1633 blocked for more than 120 seconds.
[ 3240.880529] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.881585] flush-251:1 D 0000000000000000 0 1633 2 0x00000000
[ 3240.882895] ffff880215f496c0 0000000000000046 0000000000015bc0 0000000000015bc0
[ 3240.884393] ffff88022204df80 ffff880215f49fd8 0000000000015bc0 ffff88022204dbc0
[ 3240.885753] 0000000000015bc0 ffff880215f49fd8 0000000000015bc0 ffff88022204df80
[ 3240.886990] Call Trace:
[ 3240.887755] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.888516] [<ffffffff8129e37e>] get_request_wait+0xce/0x1a0
[ 3240.889285] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.890070] [<ffffffff812935eb>] ? elv_merge+0x1cb/0x200
[ 3240.890905] [<ffffffff8129ea65>] __make_request+0x95/0x4a0
[ 3240.892204] [<ffffffff8108962e>] ? up_read+0xe/0x10
[ 3240.893520] [<ffffffff8129d1c1>] generic_make_request+0x1b1/0x4f0
[ 3240.894830] [<ffffffff810f5685>] ? mempool_alloc_slab+0x15/0x20
[ 3240.895996] [<ffffffff811ded00>] ? noalloc_get_block_write+0x0/0x60
[ 3240.896875] [<ffffffff8116e9e0>] ? block_write_full_page_endio+0xe0/0x120
[ 3240.897753] [<ffffffff8129d580>] submit_bio+0x80/0x110
[ 3240.898597] [<ffffffff8116bf99>] submit_bh+0xf9/0x140
[ 3240.899415] [<ffffffff8116ded0>] __block_write_full_page+0x1e0/0x3c0
[ 3240.900237] [<ffffffff812b5da5>] ? radix_tree_gang_lookup_tag_slot+0x95/0xe0
[ 3240.901173] [<ffffffff8116d760>] ? end_buffer_async_write+0x0/0x180
[ 3240.902240] [<ffffffff811ded00>] ? noalloc_get_block_write+0x0/0x60
[ 3240.903328] [<ffffffff811ded00>] ? noalloc_get_block_write+0x0/0x60
[ 3240.904467] [<ffffffff8116e9e0>] block_write_full_page_endio+0xe0/0x120
[ 3240.905460] [<ffffffff811d98d0>] ? ext4_bh_delay_or_unwritten+0x0/0x30
[ 3240.906215] [<ffffffff8116ea35>] block_write_full_page+0x15/0x20
[ 3240.907040] [<ffffffff811db306>] ext4_writepage+0xd6/0x2a0
[ 3240.907848] [<ffffffff811db19c>] mpage_da_submit_io+0x14c/0x1c0
[ 3240.908688] [<ffffffff811dfcb9>] ext4_da_writepages+0x3e9/0x610
[ 3240.909421] [<ffffffff810fcbd1>] do_writepages+0x21/0x40
[ 3240.910392] [<ffffffff811655b6>] writeback_single_inode+0xf6/0x3d0
[ 3240.911323] [<ffffffff81166220>] writeback_inodes_wb+0x410/0x5e0
[ 3240.912233] [<ffffffff811664fa>] wb_writeback+0x10a/0x1d0
[ 3240.913077] [<ffffffff81077665>] ? try_to_del_timer_sync+0x75/0xd0
[ 3240.914019] [<ffffffff8155587b>] ? schedule_timeout+0x19b/0x300
[ 3240.915004] [<ffffffff8116682c>] wb_do_writeback+0x18c/0x1a0
[ 3240.915837] [<ffffffff81166893>] bdi_writeback_task+0x53/0xe0
[ 3240.916715] [<ffffffff8110e526>] bdi_start_fn+0x86/0x100
[ 3240.917539] [<ffffffff8110e4a0>] ? bdi_start_fn+0x0/0x100
[ 3240.918473] [<ffffffff81084c26>] kthread+0x96/0xa0
[ 3240.919280] [<ffffffff810141ea>] child_rip+0xa/0x20
[ 3240.920159] [<ffffffff81084b90>] ? kthread+0x0/0xa0
[ 3240.921116] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[ 3240.922204] INFO: task flush-251:3:1634 blocked for more than 120 seconds.
[ 3240.923477] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.924825] flush-251:3 D 00000000ffffffff 0 1634 2 0x00000000
[ 3240.925892] ffff880215f4b750 0000000000000046 0000000000015bc0 0000000000015bc0
[ 3240.926871] ffff880227169ab0 ffff880215f4bfd8 0000000000015bc0 ffff8802271696f0
[ 3240.927904] 0000000000015bc0 ffff880215f4bfd8 0000000000015bc0 ffff880227169ab0
[ 3240.928900] Call Trace:
[ 3240.929717] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.930607] [<ffffffff8129e37e>] get_request_wait+0xce/0x1a0
[ 3240.931543] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.932451] [<ffffffff812935eb>] ? elv_merge+0x1cb/0x200
[ 3240.933299] [<ffffffff8129ea65>] __make_request+0x95/0x4a0
[ 3240.934385] [<ffffffff8108962e>] ? up_read+0xe/0x10
[ 3240.935356] [<ffffffff8129eb2c>] ? __make_request+0x15c/0x4a0
[ 3240.936186] [<ffffffff8129d1c1>] generic_make_request+0x1b1/0x4f0
[ 3240.937105] [<ffffffff810f5685>] ? mempool_alloc_slab+0x15/0x20
[ 3240.937930] [<ffffffff8129d580>] submit_bio+0x80/0x110
[ 3240.938805] [<ffffffff8116bf99>] submit_bh+0xf9/0x140
[ 3240.939611] [<ffffffff8116ded0>] __block_write_full_page+0x1e0/0x3c0
[ 3240.940413] [<ffffffff810f3577>] ? unlock_page+0x27/0x30
[ 3240.941298] [<ffffffff8116d760>] ? end_buffer_async_write+0x0/0x180
[ 3240.942439] [<ffffffff8116e9e0>] block_write_full_page_endio+0xe0/0x120
[ 3240.943713] [<ffffffff8116d760>] ? end_buffer_async_write+0x0/0x180
[ 3240.944748] [<ffffffff8116ea35>] block_write_full_page+0x15/0x20
[ 3240.946082] [<ffffffff811b680d>] ext3_ordered_writepage+0x1dd/0x200
[ 3240.946903] [<ffffffff810fb717>] __writepage+0x17/0x40
[ 3240.947771] [<ffffffff810fc8cf>] write_cache_pages+0x21f/0x4d0
[ 3240.948558] [<ffffffff810fb700>] ? __writepage+0x0/0x40
[ 3240.949329] [<ffffffff810fcba4>] generic_writepages+0x24/0x30
[ 3240.950260] [<ffffffff810fcbe5>] do_writepages+0x35/0x40
[ 3240.951387] [<ffffffff811655b6>] writeback_single_inode+0xf6/0x3d0
[ 3240.952609] [<ffffffff81166220>] writeback_inodes_wb+0x410/0x5e0
[ 3240.953725] [<ffffffff811664fa>] wb_writeback+0x10a/0x1d0
[ 3240.954958] [<ffffffff81077665>] ? try_to_del_timer_sync+0x75/0xd0
[ 3240.955966] [<ffffffff8155587b>] ? schedule_timeout+0x19b/0x300
[ 3240.956770] [<ffffffff8116682c>] wb_do_writeback+0x18c/0x1a0
[ 3240.957692] [<ffffffff81166893>] bdi_writeback_task+0x53/0xe0
[ 3240.958575] [<ffffffff8110e526>] bdi_start_fn+0x86/0x100
[ 3240.959369] [<ffffffff8110e4a0>] ? bdi_start_fn+0x0/0x100
[ 3240.960311] [<ffffffff81084c26>] kthread+0x96/0xa0
[ 3240.961218] [<ffffffff810141ea>] child_rip+0xa/0x20
[ 3240.962497] [<ffffffff81084b90>] ? kthread+0x0/0xa0
[ 3240.963906] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[ 3240.965221] INFO: task rsyslogd:1698 blocked for more than 120 seconds.
[ 3240.966464] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.967282] rsyslogd D 00000000ffffffff 0 1698 1 0x00000000
[ 3240.968184] ffff880215a3b998 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3240.969126] ffff8802178ec890 ffff880215a3bfd8 0000000000015bc0 ffff8802178ec4d0
[ 3240.970116] 0000000000015bc0 ffff880215a3bfd8 0000000000015bc0 ffff8802178ec890
[ 3240.971206] Call Trace:
[ 3240.972459] [<ffffffff810f3460>] ? sync_page+0x0/0x50
[ 3240.973538] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3240.974486] [<ffffffff810f349d>] sync_page+0x3d/0x50
[ 3240.975362] [<ffffffff81555a9a>] __wait_on_bit_lock+0x5a/0xc0
[ 3240.976171] [<ffffffff810f3437>] __lock_page+0x67/0x70
[ 3240.977038] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3240.977854] [<ffffffff810f326e>] ? find_get_page+0x1e/0xa0
[ 3240.978707] [<ffffffff810f4b02>] find_lock_page+0x52/0x80
[ 3240.979521] [<ffffffff810f4b7a>] grab_cache_page_write_begin+0x4a/0xc0
[ 3240.980383] [<ffffffff811ba38f>] ext3_write_begin+0x7f/0x240
[ 3240.981388] [<ffffffff810f2bb2>] generic_perform_write+0xc2/0x1d0
[ 3240.982300] [<ffffffff811c1134>] ? __ext3_journal_stop+0x34/0x60
[ 3240.983128] [<ffffffff8155743e>] ? _spin_lock+0xe/0x20
[ 3240.984066] [<ffffffff810f3d43>] generic_file_buffered_write+0x73/0xd0
[ 3240.985050] [<ffffffff810f5330>] __generic_file_aio_write+0x240/0x470
[ 3240.985866] [<ffffffff81013b0e>] ? common_interrupt+0xe/0x13
[ 3240.986772] [<ffffffff810f55cf>] generic_file_aio_write+0x6f/0xe0
[ 3240.987597] [<ffffffff810f5561>] ? generic_file_aio_write+0x1/0xe0
[ 3240.988478] [<ffffffff8114311a>] do_sync_write+0xfa/0x140
[ 3240.989326] [<ffffffff812b69a6>] ? rb_erase+0xd6/0x160
[ 3240.990256] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
[ 3240.991440] [<ffffffff81250946>] ? security_file_permission+0x16/0x20
[ 3240.992794] [<ffffffff81143418>] vfs_write+0xb8/0x1a0
[ 3240.993876] [<ffffffff81143c81>] sys_write+0x51/0x80
[ 3240.995031] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3240.996046] INFO: task squid:1703 blocked for more than 120 seconds.
[ 3240.997093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.997938] squid D 00000000ffffffff 0 1703 1701 0x00000000
[ 3240.998834] ffff88021594d9e8 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3240.999887] ffff8802201d5f80 ffff88021594dfd8 0000000000015bc0 ffff8802201d5bc0
[ 3241.001005] 0000000000015bc0 ffff88021594dfd8 0000000000015bc0 ffff8802201d5f80
[ 3241.002366] Call Trace:
[ 3241.004031] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3241.005514] [<ffffffff815555c7>] io_schedule+0x47/0x70
[ 3241.006498] [<ffffffff8116d215>] sync_buffer+0x45/0x50
[ 3241.007421] [<ffffffff81555bef>] __wait_on_bit+0x5f/0x90
[ 3241.008288] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
[ 3241.009178] [<ffffffff81555c98>] out_of_line_wait_on_bit+0x78/0x90
[ 3241.010059] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3241.010984] [<ffffffff8116d1c6>] __wait_on_buffer+0x26/0x30
[ 3241.012254] [<ffffffff811b76a7>] __ext3_get_inode_loc+0x297/0x320
[ 3241.013619] [<ffffffff811b77bd>] ext3_iget+0x8d/0x4b0
[ 3241.014815] [<ffffffff811be1b8>] ext3_lookup+0xa8/0x130
[ 3241.016297] [<ffffffff8114d082>] real_lookup+0xe2/0x160
[ 3241.017383] [<ffffffff8114f028>] do_lookup+0xb8/0xf0
[ 3241.018172] [<ffffffff8114fb55>] __link_path_walk+0x765/0xf80
[ 3241.019039] [<ffffffff8101184e>] ? __switch_to+0x26e/0x320
[ 3241.019801] [<ffffffff811505ea>] path_walk+0x6a/0xe0
[ 3241.020694] [<ffffffff811507bb>] do_path_lookup+0x5b/0xa0
[ 3241.021673] [<ffffffff81144a81>] ? get_empty_filp+0xa1/0x170
[ 3241.022735] [<ffffffff81151763>] do_filp_open+0x103/0xba0
[ 3241.024094] [<ffffffff8155587b>] ? schedule_timeout+0x19b/0x300
[ 3241.025376] [<ffffffff81076ce0>] ? process_timeout+0x0/0x10
[ 3241.026354] [<ffffffff8117c99c>] ? ep_poll+0x22c/0x240
[ 3241.027222] [<ffffffff8115d2da>] ? alloc_fd+0x10a/0x150
[ 3241.028015] [<ffffffff81141109>] do_sys_open+0x69/0x170
[ 3241.028856] [<ffffffff81141250>] sys_open+0x20/0x30
[ 3241.029577] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3241.030665] INFO: task nagios3:3861 blocked for more than 120 seconds.
[ 3241.031991] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3241.033200] nagios3 D 00000000ffffffff 0 3861 1 0x00000000
[ 3241.034441] ffff88020edebc58 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3241.035842] ffff88020ed3b1a0 ffff88020edebfd8 0000000000015bc0 ffff88020ed3ade0
[ 3241.036781] 0000000000015bc0 ffff88020edebfd8 0000000000015bc0 ffff88020ed3b1a0
[ 3241.037663] Call Trace:
[ 3241.038422] [<ffffffff81219a7d>] do_get_write_access+0x30d/0x5d0
[ 3241.039277] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3241.040133] [<ffffffff81219ed1>] jbd2_journal_get_write_access+0x31/0x50
[ 3241.041107] [<ffffffff81201258>] __ext4_journal_get_write_access+0x38/0x70
[ 3241.042440] [<ffffffff811dd503>] ext4_reserve_inode_write+0x73/0xa0
[ 3241.043802] [<ffffffff811dd571>] ext4_mark_inode_dirty+0x41/0x1d0
[ 3241.045150] [<ffffffff811dd860>] ext4_dirty_inode+0x40/0x60
[ 3241.046095] [<ffffffff81165c22>] __mark_inode_dirty+0x42/0x1e0
[ 3241.047008] [<ffffffff8115a145>] touch_atime+0x135/0x180
[ 3241.047832] [<ffffffff81153ef0>] ? filldir+0x0/0xe0
[ 3241.048763] [<ffffffff81154166>] vfs_readdir+0xd6/0xe0
[ 3241.049638] [<ffffffff811542d5>] sys_getdents+0x85/0xf0
[ 3241.050544] [<ffffffff81557905>] ? page_fault+0x25/0x30
[ 3241.051519] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3241.052986] INFO: task collectd:4263 blocked for more than 120 seconds.
[ 3241.054162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3241.055488] collectd D 0000000000000000 0 4263 4249 0x00000000
[ 3241.056816] ffff88020d9fbb38 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3241.058149] ffff88020da1b1a0 ffff88020d9fbfd8 0000000000015bc0 ffff88020da1ade0
[ 3241.059092] 0000000000015bc0 ffff88020d9fbfd8 0000000000015bc0 ffff88020da1b1a0
[ 3241.060062] Call Trace:
[ 3241.061013] [<ffffffff81219a7d>] do_get_write_access+0x30d/0x5d0
[ 3241.062362] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
[ 3241.063889] [<ffffffff81219ed1>] jbd2_journal_get_write_access+0x31/0x50
[ 3241.065460] [<ffffffff81201258>] __ext4_journal_get_write_access+0x38/0x70
[ 3241.066425] [<ffffffff811dd503>] ext4_reserve_inode_write+0x73/0xa0
[ 3241.067272] [<ffffffff811dd571>] ext4_mark_inode_dirty+0x41/0x1d0
[ 3241.068214] [<ffffffff811dd860>] ext4_dirty_inode+0x40/0x60
[ 3241.069059] [<ffffffff81165c22>] __mark_inode_dirty+0x42/0x1e0
[ 3241.070049] [<ffffffff8115a145>] touch_atime+0x135/0x180
[ 3241.071184] [<ffffffff811d706d>] ext4_file_mmap+0x5d/0x60
[ 3241.072370] [<ffffffff81119a84>] mmap_region+0x404/0x590
[ 3241.073634] [<ffffffff81119f45>] do_mmap_pgoff+0x335/0x380
[ 3241.074928] [<ffffffff8110bca4>] sys_mmap_pgoff+0x1d4/0x2a0
[ 3241.075858] [<ffffffff810f6957>] ? sys_fadvise64_64+0x77/0x210
[ 3241.076724] [<ffffffff81017af9>] sys_mmap+0x29/0x30
[ 3241.077663] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3241.078550] INFO: task collectd:4264 blocked for more than 120 seconds.
[ 3241.079532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3241.080684] collectd D 0000000000000000 0 4264 4249 0x00000000
[ 3241.081814] ffff88020d9f7e98 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3241.082888] ffff88020da1c890 ffff88020d9f7fd8 0000000000015bc0 ffff88020da1c4d0
[ 3241.084060] 0000000000015bc0 ffff88020d9f7fd8 0000000000015bc0 ffff88020da1c890
[ 3241.085129] Call Trace:
[ 3241.086017] [<ffffffff8155716d>] __down_write_nested+0x8d/0xd0
[ 3241.086942] [<ffffffff815571bb>] __down_write+0xb/0x10
[ 3241.087817] [<ffffffff8155660e>] down_write+0xe/0x10
[ 3241.088708] [<ffffffff8110bb2c>] sys_mmap_pgoff+0x5c/0x2a0
[ 3241.089572] [<ffffffff81017af9>] sys_mmap+0x29/0x30
[ 3241.090452] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3241.091432] INFO: task collectd:4265 blocked for more than 120 seconds.
[ 3241.092646] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3241.093814] collectd D 0000000000000000 0 4265 4249 0x00000000
[ 3241.095239] ffff88020da39e98 0000000000000086 0000000000015bc0 0000000000015bc0
[ 3241.096867] ffff88020da1df80 ffff88020da39fd8 0000000000015bc0 ffff88020da1dbc0
[ 3241.097891] 0000000000015bc0 ffff88020da39fd8 0000000000015bc0 ffff88020da1df80
[ 3241.098820] Call Trace:
[ 3241.099777] [<ffffffff8155716d>] __down_write_nested+0x8d/0xd0
[ 3241.100946] [<ffffffff815571bb>] __down_write+0xb/0x10
[ 3241.102315] [<ffffffff8155660e>] down_write+0xe/0x10
[ 3241.103812] [<ffffffff8110bb2c>] sys_mmap_pgoff+0x5c/0x2a0
[ 3241.105266] [<ffffffff81017af9>] sys_mmap+0x29/0x30
[ 3241.106154] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 3267.010606] ata1: lost interrupt (Status 0x50)
[ 3267.011434] sd 0:0:0:0: [sda] Unhandled error code
[ 3267.012501] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3267.013673] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 17 03 00 00 08 00
[ 3267.015250] end_request: I/O error, dev sda, sector 5641987
[ 3267.016824] Buffer I/O error on device dm-1, logical block 681095
[ 3267.018274] lost page write due to I/O error on dm-1
[ 3298.011371] ata1: lost interrupt (Status 0x50)
[ 3298.012331] sd 0:0:0:0: [sda] Unhandled error code
[ 3298.013516] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3298.015019] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 17 2b 00 00 08 00
[ 3298.016511] end_request: I/O error, dev sda, sector 5642027
[ 3298.018012] Buffer I/O error on device dm-1, logical block 681100
[ 3298.019464] lost page write due to I/O error on dm-1
[ 3329.010109] ata1: lost interrupt (Status 0x50)
[ 3329.011119] sd 0:0:0:0: [sda] Unhandled error code
[ 3329.012152] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3329.013227] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 17 63 00 00 08 00
[ 3329.014258] end_request: I/O error, dev sda, sector 5642083
[ 3329.015333] Buffer I/O error on device dm-1, logical block 681107
[ 3329.016405] lost page write due to I/O error on dm-1
[ 3360.011317] ata1: lost interrupt (Status 0x50)
[ 3360.012177] sd 0:0:0:0: [sda] Unhandled error code
[ 3360.013007] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3360.014041] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 17 8b 00 00 08 00
[ 3360.015630] end_request: I/O error, dev sda, sector 5642123
[ 3360.017830] Buffer I/O error on device dm-1, logical block 681112
[ 3360.019508] lost page write due to I/O error on dm-1
[ 3391.010271] ata1: lost interrupt (Status 0x50)
[ 3391.011428] sd 0:0:0:0: [sda] Unhandled error code
[ 3391.013020] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3391.014755] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 17 9b 00 00 08 00
[ 3391.016611] end_request: I/O error, dev sda, sector 5642139
[ 3391.017924] Buffer I/O error on device dm-1, logical block 681114
[ 3391.018923] lost page write due to I/O error on dm-1
[ 3422.013503] ata1: lost interrupt (Status 0x50)
[ 3422.014520] sd 0:0:0:0: [sda] Unhandled error code
[ 3422.015555] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3422.016708] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 1b 3b 00 00 08 00
[ 3422.017897] end_request: I/O error, dev sda, sector 5643067
[ 3422.018768] Buffer I/O error on device dm-1, logical block 681230
[ 3422.019670] lost page write due to I/O error on dm-1
[ 3453.011354] ata1: lost interrupt (Status 0x50)
[ 3453.012246] sd 0:0:0:0: [sda] Unhandled error code
[ 3453.013138] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3453.014180] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 1b 63 00 00 10 00
[ 3453.015786] end_request: I/O error, dev sda, sector 5643107
[ 3453.017514] Buffer I/O error on device dm-1, logical block 681235
[ 3453.018925] lost page write due to I/O error on dm-1
[ 3453.020145] Buffer I/O error on device dm-1, logical block 681236
[ 3453.021120] lost page write due to I/O error on dm-1
[ 3484.011293] ata1: lost interrupt (Status 0x50)
[ 3484.012473] sd 0:0:0:0: [sda] Unhandled error code
[ 3484.014160] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 3484.015349] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 56 1b 93 00 00 08 00
[ 3484.016667] end_request: I/O error, dev sda, sector 5643155
[ 3484.017796] Buffer I/O error on device dm-1, logical block 681241
[ 3484.018891] lost page write due to I/O error on dm-1
[ 3515.011376] ata1: lost interrupt (Status 0x50)

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-21-server 2.6.32-21.32
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-21.32-server 2.6.32.11+drm33.2
Uname: Linux 2.6.32-21-server x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D3p', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.values: Error: [Errno 2] No such file or directory
Date: Sun May 9 13:24:00 2010
HibernationDevice: RESUME=/dev/mapper/server-swap_1
Lsusb:
 Bus 003 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: System manufacturer System Product Name
ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.32-21-server root=/dev/mapper/server-root ro splash
ProcEnviron:
 SHELL=/bin/bash
 LANG=en_US.UTF-8
RelatedPackageVersions: linux-firmware 1.34
RfKill:

SourcePackage: linux
dmi.bios.date: 04/28/2009
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: ASUS P5N32-E SLI PLUS ACPI BIOS Revision 1401
dmi.board.name: P5N32-E SLI PLUS
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: 1.XX
dmi.chassis.asset.tag: 123456789000
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvrASUSP5N32-ESLIPLUSACPIBIOSRevision1401:bd04/28/2009:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP5N32-ESLIPLUS:rvr1.XX:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Revision history for this message
alf@all.de (alf-all) wrote :
Revision history for this message
alf@all.de (alf-all) wrote :

The problems go away when I switch back to 2.6.31-21-server, so it's not an hardware issue.
Only discs which are controlled by the nVidia MCP55 seems to be affected (PATA & SATA):

*-ide:0
     description: IDE interface
     product: MCP55 IDE
     vendor: nVidia Corporation
   *-disk
        description: ATA Disk
        vendor: Western Digital
*-ide:1
     description: IDE interface
     product: MCP55 SATA Controller
     vendor: nVidia Corporation
   *-disk:0
        description: ATA Disk
        vendor: Western Digital
   *-disk:1
        description: ATA Disk
        vendor: Western Digital
*-ide:2
     description: IDE interface
     product: MCP55 SATA Controller
     vendor: nVidia Corporation
   *-disk:0
        description: ATA Disk
        vendor: Western Digital
   *-disk:1
        description: ATA Disk
        vendor: Western Digital
*-ide:3
     description: IDE interface
     product: MCP55 SATA Controller
     vendor: nVidia Corporation
   *-disk:0
        description: ATA Disk
        vendor: Western Digital
   *-disk:1
        description: ATA Disk
        vendor: Western Digital

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi <email address hidden>,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
alf@all.de (alf-all) wrote :

Hello

Im running "2.6.34-999-generic #201005121008 SMP Wed May 12 09:10:15 UTC 2010 x86_64 GNU/Linux" for two days now.
Ive read all raid twice concurent with a copy process. No lost IRQ and no dropped disk by now ;-)

ALF

Revision history for this message
alf@all.de (alf-all) wrote :

I laughed too soon ;-(

No lost IRQ, but dropped disks again ... Kernellog appended ...

Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (5.4 KiB)

This drop is other than before. An other bug ?

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md4 : active raid5 sdk1[3](F) sdi1[4](F) sdj1[5](F)
      1953519872 blocks level 5, 64k chunk, algorithm 2 [3/0] [___]
md1 : active raid5 sdd1[3](F) sdg1[2] sde1[1]
      2930269824 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
md0 : active raid5 sdb1[1] sda1[0] sdc1[3](F)
      2930271872 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

At first the disks (md4) connected at SIL are complete gone (6:55:40), which not happend before:

04:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)

May 19 06:55:40 downtown kernel: [117018.953812] ata9.00: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.954743] ata9.01: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.955641] ata9.02: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.956528] ata9.03: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.957418] ata9.04: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.958290] ata9.05: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.959155] ata9.06: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.960040] ata9.07: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.960916] ata9.08: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.961759] ata9.09: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.962564] ata9.10: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.963358] ata9.11: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.964120] ata9.12: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.964868] ata9.13: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.965593] ata9.14: failed to read SCR 1 (Emask=0x40)
May 19 06:55:40 downtown kernel: [117018.966289] ata9.15: exception Emask 0x4 SAct 0x0 SErr 0x0 action 0x6 frozen
May 19 06:55:40 downtown kernel: [117018.967044] ata9.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
May 19 06:55:40 downtown kernel: [117018.967723] ata9.01: exception Emask 0x100 SAct 0x1 SErr 0x0 action 0x6 frozen
May 19 06:55:40 downtown kernel: [117018.968353] ata9.01: failed command: READ FPDMA QUEUED
May 19 06:55:40 downtown kernel: [117018.968974] ata9.01: cmd 60/08:00:5f:00:56/00:00:72:00:00/40 tag 0 ncq 4096 in
May 19 06:55:40 downtown kernel: [117018.968975] res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May 19 06:55:40 downtown kernel: [117018.970248] ata9.01: status: { DRDY }
May 19 06:55:40 downtown kernel: [117018.970862] ata9.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
...

About 2 hours later (9:17:26) the MCP55 raids also breaks:

00:0d.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
0...

Read more...

Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (28.0 KiB)

And now (still 2.6.34-999) when the raids are syncing the initial behavier occur (Lost IRQ, dropping drives) ...

[ 2085.811299] CE: hpet increased min_delta_ns to 7500 nsec
[ 8050.011293] ata1: lost interrupt (Status 0x50)
[ 8050.012395] sd 0:0:0:0: [sda] Unhandled error code
[ 8050.013751] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 8050.015121] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 5e 87 1b 00 00 08 00
[ 8050.016592] end_request: I/O error, dev sda, sector 6194971
[ 8050.017704] Buffer I/O error on device dm-1, logical block 750218
[ 8050.018609] lost page write due to I/O error on dm-1
[ 8080.040103] ata1: lost interrupt (Status 0x50)
[ 8080.040995] sd 0:0:0:0: [sda] Unhandled error code
[ 8080.041858] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 8080.042828] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 5e 87 c3 00 00 08 00
[ 8080.043738] end_request: I/O error, dev sda, sector 6195139
[ 8080.044813] Buffer I/O error on device dm-1, logical block 750239
[ 8080.046276] lost page write due to I/O error on dm-1
[ 8111.043832] ata1: lost interrupt (Status 0x50)
[ 8111.044933] sd 0:0:0:0: [sda] Unhandled error code
[ 8111.046353] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 8111.048085] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 5c 0f 13 00 00 08 00
[ 8111.049970] end_request: I/O error, dev sda, sector 6033171
[ 8142.010949] ata1: lost interrupt (Status 0x50)
[ 8142.012058] sd 0:0:0:0: [sda] Unhandled error code
[ 8142.013166] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 8142.014373] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 5c 0f 13 00 00 08 00
[ 8142.015971] end_request: I/O error, dev sda, sector 6033171
[ 8161.060308] INFO: task jbd2/dm-1-8:615 blocked for more than 120 seconds.
[ 8161.061450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8161.062595] jbd2/dm-1-8 D ffff880001e95780 0 615 2 0x00000000
[ 8161.063675] ffff880221d93d10 0000000000000046 0000000000000000 ffff880221d93fd8
[ 8161.064863] ffff880223f11700 0000000000015780 0000000000015780 ffff880221d93fd8
[ 8161.066282] 0000000000015780 ffff880221d93fd8 0000000000015780 ffff880223f11700
[ 8161.067496] Call Trace:
[ 8161.068891] [<ffffffff812138e7>] jbd2_journal_commit_transaction+0x1e7/0x11d0
[ 8161.070346] [<ffffffff8100985b>] ? __switch_to+0xbb/0x2e0
[ 8161.071493] [<ffffffff8105118e>] ? put_prev_entity+0x2e/0x80
[ 8161.072598] [<ffffffff81540044>] ? _raw_spin_lock_irqsave+0x34/0x50
[ 8161.073878] [<ffffffff8106eb2b>] ? lock_timer_base+0x3b/0x70
[ 8161.073882] [<ffffffff8107edb0>] ? autoremove_wake_function+0x0/0x40
[ 8161.073886] [<ffffffff81219693>] kjournald2+0xc3/0x230
[ 8161.073889] [<ffffffff8107edb0>] ? autoremove_wake_function+0x0/0x40
[ 8161.073891] [<ffffffff812195d0>] ? kjournald2+0x0/0x230
[ 8161.073894] [<ffffffff8107e9d6>] kthread+0x96/0xa0
[ 8161.073897] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
[ 8161.073900] [<ffffffff8107e940>] ? kthread+0x0/0xa0
[ 8161.073902] [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
[ 8161.073918] INFO: task kjournald:1695 blocked for more than 120 seconds.
[ 816...

Revision history for this message
alf@all.de (alf-all) wrote :

By the way, in the "crashed" state the sync of the raids are going on:

root@downtown:/var/log# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md4 : active raid5 sdg1[1] sde1[0] sdf1[2]
      1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sdi1[3] sdk1[2] sdj1[1]
      2930269824 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [==========>..........] recovery = 52.8% (774825600/1465134912) finish=143.8min speed=79971K/sec

md0 : active raid5 sdh1[3] sdb1[0] sdc1[1]
      2930271872 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [=========>...........] recovery = 46.5% (681339124/1465135936) finish=185.8min speed=70279K/sec

alf@all.de (alf-all)
description: updated
tags: removed: needs-upstream-testing
Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (14.4 KiB)

There seems at least also an bug in error handling. Disks are gone till a complete reboot.
After a reboot all seems to be fine again ... for some hours.

[138662.040115] ata8: EH in SWNCQ mode,QC:qc_active 0xF sactive 0xF
[138662.040920] ata8: SWNCQ:qc_active 0x1 defer_bits 0xE last_issue_tag 0x0
[138662.040921] dhfis 0x0 dmafis 0x0 sdbfis 0x0
[138662.042476] ata8: ATA_REG 0x40 ERR_REG 0x0
[138662.043268] ata8: tag : dhfis dmafis sdbfis sacitve
[138662.044066] ata8: tag 0x0: 0 0 0 1
[138662.044848] ata8.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
[138662.045663] ata8.00: failed command: READ FPDMA QUEUED
[138662.046484] ata8.00: cmd 60/80:00:80:1d:d1/00:00:4a:00:00/40 tag 0 ncq 65536 in
[138662.046485] res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[138662.048173] ata8.00: status: { DRDY }
[138662.049018] ata8.00: failed command: READ FPDMA QUEUED
[138662.049884] ata8.00: cmd 60/58:08:00:1e:d1/00:00:4a:00:00/40 tag 1 ncq 45056 in
[138662.049885] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[138662.051674] ata8.00: status: { DRDY }
[138662.052563] ata8.00: failed command: READ FPDMA QUEUED
[138662.053475] ata8.00: cmd 60/28:10:58:1e:d1/00:00:4a:00:00/40 tag 2 ncq 20480 in
[138662.053476] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[138662.055352] ata8.00: status: { DRDY }
[138662.056266] ata8.00: failed command: READ FPDMA QUEUED
[138662.057230] ata8.00: cmd 60/80:18:00:1f:d1/00:00:4a:00:00/40 tag 3 ncq 65536 in
[138662.057231] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[138662.059142] ata8.00: status: { DRDY }
[138662.060121] ata8: hard resetting link
[138662.061080] ata8: nv: skipping hardreset on occupied port
[138662.540147] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[138663.010052] ata7: EH in SWNCQ mode,QC:qc_active 0x3 sactive 0x3
[138663.011086] ata7: SWNCQ:qc_active 0x1 defer_bits 0x2 last_issue_tag 0x0
[138663.011087] dhfis 0x1 dmafis 0x0 sdbfis 0x0
[138663.013101] ata7: ATA_REG 0x40 ERR_REG 0x0
[138663.014125] ata7: tag : dhfis dmafis sdbfis sacitve
[138663.015123] ata7: tag 0x0: 1 0 0 1
[138663.016133] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[138663.017172] ata7.00: failed command: READ FPDMA QUEUED
[138663.018211] ata7.00: cmd 60/b0:00:0f:3b:31/00:00:13:00:00/40 tag 0 ncq 90112 in
[138663.018212] res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[138663.020278] ata7.00: status: { DRDY }
[138663.021335] ata7.00: failed command: READ FPDMA QUEUED
[138663.022419] ata7.00: cmd 60/80:08:3f:3c:31/00:00:13:00:00/40 tag 1 ncq 65536 in
[138663.022420] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[138663.024615] ata7.00: status: { DRDY }
[138663.025708] ata7: hard resetting link
[138663.026778] ata7: nv: skipping hardreset on occupied port
[138663.500155] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[138667.560111] ata8.00: qc timeout (cmd 0x27)
[138667.561238] ata8.00: failed to read native max address (err_mask=0x4)
[138667.562305] ata8.00: HPA support seems broken, skipping HPA handling
[138667.563363] ata8.00: revalidation failed...

Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (58.2 KiB)

Error in md syncing ?
If the disk gone first, I wonder why it will possible to complete the sync successful.
If the disk is failing after the syncing, why does the sync occure so often ?

[46472.588775] md: recovery of RAID array md0
[46472.589655] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.590503] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[46472.591412] md: using 128k window, over a total of 1465135936 blocks.
[46472.592316] md: resuming recovery of md0 from checkpoint.
[46472.593229] md: md0: recovery done.
[46472.612683] RAID5 conf printout:
[46472.613588] --- rd:3 wd:2
[46472.614442] disk 0, o:1, dev:sde1
[46472.615279] disk 1, o:1, dev:sdf1
[46472.616111] disk 2, o:0, dev:sdh1
[46472.636717] md: recovery of RAID array md0
[46472.637602] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.638450] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[46472.639346] md: using 128k window, over a total of 1465135936 blocks.
[46472.640249] md: resuming recovery of md0 from checkpoint.
[46472.641152] md: md0: recovery done.
[46472.660743] RAID5 conf printout:
[46472.661682] --- rd:3 wd:2
[46472.662536] disk 0, o:1, dev:sde1
[46472.663378] disk 1, o:1, dev:sdf1
[46472.664218] disk 2, o:0, dev:sdh1
[46472.684758] md: recovery of RAID array md0
[46472.685632] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.686490] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[46472.687384] md: using 128k window, over a total of 1465135936 blocks.
[46472.688287] md: resuming recovery of md0 from checkpoint.
[46472.689186] md: md0: recovery done.
[46472.708745] RAID5 conf printout:
[46472.709661] --- rd:3 wd:2
[46472.710513] disk 0, o:1, dev:sde1
[46472.711357] disk 1, o:1, dev:sdf1
[46472.712188] disk 2, o:0, dev:sdh1
[46472.732776] md: recovery of RAID array md0
[46472.733663] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.734523] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[46472.735429] md: using 128k window, over a total of 1465135936 blocks.
[46472.736332] md: resuming recovery of md0 from checkpoint.
[46472.737234] md: md0: recovery done.
[46472.756744] RAID5 conf printout:
[46472.757659] --- rd:3 wd:2
[46472.758501] disk 0, o:1, dev:sde1
[46472.759340] disk 1, o:1, dev:sdf1
[46472.760179] disk 2, o:0, dev:sdh1
[46472.780770] md: recovery of RAID array md0
[46472.781680] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.782567] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[46472.783467] md: using 128k window, over a total of 1465135936 blocks.
[46472.784379] md: resuming recovery of md0 from checkpoint.
[46472.785285] md: md0: recovery done.
[46472.804736] RAID5 conf printout:
[46472.805653] --- rd:3 wd:2
[46472.806494] disk 0, o:1, dev:sde1
[46472.807332] disk 1, o:1, dev:sdf1
[46472.808162] disk 2, o:0, dev:sdh1
[46472.828773] md: recovery of RAID array md0
[46472.829653] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[46472.830498] md: using maxim...

tags: added: kernel-bug
tags: added: kernel-core kernel-needs-review regression-release
removed: filesystem kernel-bug regression-potential
Changed in linux (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
tags: added: kernel-reviewed maverick
removed: kernel-needs-review
Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (3.5 MiB)

I tested also the actual kernel "Linux 2.6.34-020634-generic #020634 SMP Mon". It crashed now the second time.

But I note a strange behavior. The kernel post a dropped disk, but the soft-raid syncing (sata leds are flashing) seems
to complete his job bejond this message. I will have a closer look the next time when the kernel drop disks ...

May 31 00:15:47 downtown kernel: [355161.980134] ata8: EH in SWNCQ mode,QC:qc_active 0x7 sactive 0x7
May 31 00:15:47 downtown kernel: [355161.981103] ata8: SWNCQ:qc_active 0x3 defer_bits 0x4 last_issue_tag 0x1
May 31 00:15:47 downtown kernel: [355161.981104] dhfis 0x1 dmafis 0x0 sdbfis 0x0
May 31 00:15:47 downtown kernel: [355161.982973] ata8: ATA_REG 0x40 ERR_REG 0x0
May 31 00:15:47 downtown kernel: [355161.983913] ata8: tag : dhfis dmafis sdbfis sacitve
May 31 00:15:47 downtown kernel: [355161.984844] ata8: tag 0x0: 1 0 0 1
May 31 00:15:47 downtown kernel: [355161.985769] ata8: tag 0x1: 0 0 0 1
May 31 00:15:47 downtown kernel: [355161.986700] ata8.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x6 frozen
May 31 00:15:47 downtown kernel: [355161.987670] ata8.00: failed command: READ FPDMA QUEUED
May 31 00:15:47 downtown kernel: [355161.988632] ata8.00: cmd 60/28:00:58:f3:a4/00:00:4a:00:00/40 tag 0 ncq 20480 in
May 31 00:15:47 downtown kernel: [355161.988633] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 31 00:15:47 downtown kernel: [355161.990616] ata8.00: status: { DRDY }
May 31 00:15:47 downtown kernel: [355161.991647] ata8.00: failed command: READ FPDMA QUEUED
May 31 00:15:47 downtown kernel: [355161.992657] ata8.00: cmd 60/80:08:00:f4:a4/00:00:4a:00:00/40 tag 1 ncq 65536 in
May 31 00:15:47 downtown kernel: [355161.992658] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 31 00:15:47 downtown kernel: [355161.994658] ata8.00: status: { DRDY }
May 31 00:15:47 downtown kernel: [355161.995667] ata8.00: failed command: READ FPDMA QUEUED
May 31 00:15:47 downtown kernel: [355161.996693] ata8.00: cmd 60/80:10:80:f4:a4/00:00:4a:00:00/40 tag 2 ncq 65536 in
May 31 00:15:47 downtown kernel: [355161.996694] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 31 00:15:47 downtown kernel: [355161.998802] ata8.00: status: { DRDY }
May 31 00:15:47 downtown kernel: [355161.999852] ata8: hard resetting link
May 31 00:15:47 downtown kernel: [355162.000895] ata8: nv: skipping hardreset on occupied port
May 31 00:15:48 downtown kernel: [355162.480063] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May 31 00:15:53 downtown kernel: [355167.500104] ata8.00: qc timeout (cmd 0x27)
May 31 00:15:53 downtown kernel: [355167.501196] ata8.00: failed to read native max address (err_mask=0x4)
May 31 00:15:53 downtown kernel: [355167.502239] ata8.00: HPA support seems broken, skipping HPA handling
May 31 00:15:53 downtown kernel: [355167.503277] ata8.00: revalidation failed (errno=-5)
May 31 00:15:53 downtown kernel: [355167.504303] ata8: hard resetting link
May 31 00:15:53 downtown kernel: [355167.505330] ata8: nv: skipping hardreset on occupied port
May 31 00:15:53 downtown kernel: [355167.990130] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl ...

Revision history for this message
alf@all.de (alf-all) wrote :
Revision history for this message
alf@all.de (alf-all) wrote :
Download full text (117.3 KiB)

A 2.6.35rc2 test (which has included these 3 patches).
It works for a day ....

Jun 11 18:31:38 downtown kernel: imklog 4.2.0, log source = /proc/kmsg started.
Jun 11 18:31:38 downtown kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jun 11 18:31:38 downtown kernel: [ 0.000000] Initializing cgroup subsys cpu
Jun 11 18:31:38 downtown kernel: [ 0.000000] Linux version 2.6.35-020635rc2-generic (root@zinc) (gcc version 4.2.3 (Ubuntu 4.2.3-2
ubuntu7)) #020635rc2 SMP Thu Jun 10 09:20:42 UTC 2010
Jun 11 18:31:38 downtown kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-2.6.35-020635rc2-generic root=/dev/mapper/server-ro
ot ro splash
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-provided physical RAM map:
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009c400 (usable)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 0000000000100000 - 00000000cfef0000 (usable)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 00000000cfef0000 - 00000000cfef3000 (ACPI NVS)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 00000000cfef3000 - 00000000cff00000 (ACPI data)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f2000000 (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
Jun 11 18:31:38 downtown kernel: [ 0.000000] NX (Execute Disable) protection: active
Jun 11 18:31:38 downtown kernel: [ 0.000000] DMI 2.4 present.
Jun 11 18:31:38 downtown kernel: [ 0.000000] Phoenix BIOS detected: BIOS may corrupt low RAM, working around it.
Jun 11 18:31:38 downtown kernel: [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] e820 update range: 0000000000000000 - 0000000000001000 (usable) ==> (reserved)
Jun 11 18:31:38 downtown kernel: [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
Jun 11 18:31:38 downtown kernel: [ 0.000000] No AGP bridge found
Jun 11 18:31:38 downtown kernel: [ 0.000000] last_pfn = 0x230000 max_arch_pfn = 0x400000000
Jun 11 18:31:38 downtown kernel: [ 0.000000] MTRR default type: uncachable
Jun 11 18:31:38 downtown kernel: [ 0.000000] MTRR fixed ranges enabled:
Jun 11 18:31:38 downtown kernel: [ 0.000000] 00000-9FFFF write-back
Jun 11 18:31:38 downtown kernel: [ 0.000000] A0000-BFFFF uncachable
Jun 11 18:31:38 downtown kernel: [ 0.000000] C0000-CEFFF write-protect
Jun 11 18:31:38 downtown kernel: [ 0.000000] CF000-EFFFF uncachable
Jun 11 18:31:38 downtown kernel: [ 0.000000] F0000-FFFFF write-through
Jun 11 18:31:38 downtown kernel: [ 0.000000] MTRR variable ranges enabled:
Jun 11 18:31:38 downtown kernel: [ 0.000000] 0 base 000000000 mask F80000000 write-back
Jun 11 18:31:...

Revision history for this message
alf@all.de (alf-all) wrote :

Since I had installed 2.6.35 behavior changed a little bit. Now all disks at all controllers are affected.

My guess is, that there are two bugs:

The first concern the handling of the MCP55 controller. This could be fixed with 2.6.35.
The second bug is a timeing/locking issue when using (multiple) softraid:
* recursiv call of sync_buffer (see in trace)
* completion of raid sync when the kernel had stated the lost of one or more disks
* problem at initialisation: SATA link up 3.0 Gbps (SStatus 123 SControl 320); hard resetting link; failed to resume link (SControl 0); SATA link down (SStatus 0 SControl 0); hard resetting link; ...
* When all softraid are stopped, no problems happend for days

Actual snip:
[74261.120083] sd 6:0:6:0: WARNING: Command (0x2a) timed out, resetting card.
[74283.111357] sd 6:0:6:0: WARNING: Command (0x0) timed out, resetting card.
[74305.040127] sd 6:0:6:0: WARNING: Command (0x0) timed out, resetting card.
...
[74645.940123] sd 6:0:6:0: Device offlined - not ready after error recovery
[74645.940968] sd 6:0:6:0: Device offlined - not ready after error recovery
...
[74645.952020] sd 6:0:6:0: rejecting I/O to offline device
[74645.952594] sd 6:0:6:0: rejecting I/O to offline device
[74645.953136] sd 6:0:6:0: rejecting I/O to offline device
[74645.953139] sd 6:0:6:0: [sdf] Unhandled error code
[74645.953142] sd 6:0:6:0: [sdf] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[74645.953145] sd 6:0:6:0: [sdf] CDB: Write(10): 2a 00 0c 8c 33 b4 00 00 08 00
[74645.953149] end_request: I/O error, dev sdf, sector 210514868
[74645.953153] Buffer I/O error on device dm-5, logical block 26290213
[74645.953155] lost page write due to I/O error on dm-5
[74645.957047] sd 6:0:6:0: rejecting I/O to offline device
[74645.957632] sd 6:0:6:0: rejecting I/O to offline device
[74645.957634] sd 6:0:6:0: [sdf] Unhandled error code
[74645.957636] sd 6:0:6:0: [sdf] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[74645.957638] sd 6:0:6:0: [sdf] CDB: Write(10): 2a 00 0c fe f2 ac 00 00 08 00
...

Revision history for this message
penalvch (penalvch) wrote :

<email address hidden>, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest server release of Ubuntu? ISO images are available from http://releases.ubuntu.com/raring/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily kernel folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.12-rc2

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.