Summary
  Reported Issue
Title: [6044] Kernel BUG at mm/rmap.c:555
Project: kernel
Item Last Modified: Tue, 23 Sep 2008 10:29:26
Details
[6044] Kernel BUG at mm/rmap.c:555
Reporter:   shadowngoz
Created:   Thu, 09 Feb 2006 13:11:00
Updated:   Tue, 23 Sep 2008 10:29:26
Key:   6044
Versions:   Not provided
Environment:  
Priority:   -1
Status:   Not provided
Resolution:   UNREPRODUCIBLE
Original Link:   http://bugzilla.kernel.org/show_bug.cgi?id=6044
Summary:   Kernel BUG at mm/rmap.c:555
Description:
Most recent kernel where this bug did not occur:
Distribution: Gentoo AMD64
Hardware Environment: AMD Athlon 64 3700+, 1G RAM, ASUS A8N-VM CSM motherboard
Software Environment: Gentoo ~AMD64, GNOME, gcc 3.4.5, glibc 2.3.6
Problem Description:

Happens when compiling big packages, e.g. firefox, thunderbird, gcc, glibc, etc.,
with or without X running.

First Log, exit X and rmmod nvidia:

Eeek! page_mapcount(page) went negative! (-1)
page->flags = 4000000000000094
page->count = 1
page->mapping = 0000000000000000
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/rmap.c:555
invalid opcode: 0000 [1]
CPU 0
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_pcm
snd_timer snd soundcore snd_page_alloc tuner tvaudio msp3400 bttv video_buf
compat_ioctl32 v4l2_common btcx_risc ir_common tveeprom videodev
Pid: 546, comm: cc1plus Tainted: P 2.6.16-rc2 #1
RIP: 0010:[<ffffffff801594b8>] <ffffffff801594b8>{page_remove_rmap+120}
RSP: 0018:ffff81003ab4fc58 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff8100017c49a0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81003bdc70e0
RBP: 0000000000d5d000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000004 R12: ffff8100017c49a0
R13: 0000000000000020 R14: ffff81003ab4fd18 R15: ffff810028954030
FS: 00002b2a5c5236d0(0000) GS:ffffffff80538000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b7e13b103c0 CR3: 000000000f986000 CR4: 00000000000006e0
Process cc1plus (pid: 546, threadinfo ffff81003ab4e000, task ffff81003760a730)
Stack: ffff810035ca3ae8 ffffffff80152139 0000000000dd9fff 0000000000dd9fff
0000000000dd9fff fffffffffffffea3 ffff81003a2b2dc0 0000000000dda000
0000000000dda000 ffff81000f23c000
Call Trace: <ffffffff80152139>{unmap_vmas+1193} <ffffffff801573ec>{exit_mmap+92}
<ffffffff80125e5b>{mmput+27} <ffffffff8012a219>{do_exit+521}
<ffffffff8012a879>{do_group_exit+153} <ffffffff801332b7>{get_signal_to_deliver+1271}
<ffffffff80109dbf>{do_signal+159} <ffffffff80124c90>{default_wake_function+0}
<ffffffff80124c90>{default_wake_function+0} <ffffffff8010a9ef>{sysret_signal+28}
<ffffffff8010acdb>{ptregscall_common+103}

Code: 0f 0b 68 dd ce 40 80 c2 2b 02 5b 48 c7 c6 ff ff ff ff bf 20
RIP <ffffffff801594b8>{page_remove_rmap+120} RSP <ffff81003ab4fc58>
<1>Fixing recursive fault but reboot is needed!


Second Log, with Hugh Dickins' patch:

Bad rmap: page ffff810001333ff8 flags 4000000000000094 count 1 mapcount 0
cc1plus addr 2ab62c55d000 ptpfn 170a3 vm_flags 100073
page mapping 0000000000000000 b98 vma mapping 0000000000000000 2ab62c55d


Steps to reproduce:

Happens when compiling big packages, e.g. firefox, thunderbird, gcc, glibc, etc.,
with or without X running.

Tested memory with memtest86+; everything seems OK.
Comments:
Andrew Morton Thu, 09 Feb 2006 13:18:23
It's probably the nvidia driver - it takes them
some time to catch up to the changes which we make in
core kernel.
Andrew Morton Thu, 09 Feb 2006 13:19:42
Is it possible to reproduce this on a kernel in which the
nvidia driver has not been loaded since bootup?
Martin J Bligh Thu, 09 Feb 2006 13:52:15
Tainted kernel. Please actually read the filing instructions.
They are right there, above the text box AS you file the bug.
shadow Thu, 09 Feb 2006 16:17:56
Sorry, I should have been clearer. The second log was from a fresh reboot
without the nvidia kernel module installed, but with the patch from Hugh Dickins.

Bad rmap: page ffff810001333ff8 flags 4000000000000094 count 1 mapcount 0
cc1plus addr 2ab62c55d000 ptpfn 170a3 vm_flags 100073
page mapping 0000000000000000 b98 vma mapping 0000000000000000 2ab62c55d


These logs are from a non-tainted kernel, but one running 2.6.15-gentoo rather
than 2.6.16-rc2, and without the debug patch:

Kernel BUG at mm/rmap.c:486
invalid operand: 0000 [1]
CPU 0
Modules linked in: cpufreq_conservative
Pid: 18935, comm: cc1 Not tainted 2.6.15-gentoo #1
RIP: 0010:[] {page_remove_rmap+18}
RSP: 0018:ffff81001da35de0 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff8100150aea48 RCX: ffffffff804d6d40
RDX: 0000000000000000 RSI: 00002aaaab549000 RDI: ffff81000169a3d8
RBP: 00002aaaab549000 R08: 800000001e2ed067 R09: ffff81000169a3d8
R10: 00002aaaaaeecec8 R11: 0000000000000246 R12: ffff81000169a3d8
R13: 0000000000000020 R14: ffff81001da35e98 R15: ffff810018cf5ad0
FS: 00002aaaaaff46d0(0000) GS:ffffffff8050e800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaad55a60 CR3: 000000001a612000 CR4: 00000000000006e0
Process cc1 (pid: 18935, threadinfo ffff81001da34000, task ffff81003bdf2f20)
Stack: ffffffff8015ae99 00002aaaac0eefff 00002aaaac0eefff 00002aaaac0eefff
ffffffffffffff3f ffff81003b6d1980 00002aaaab600000 00002aaaac0ef000
ffff81000fc31550 00002aaaac0ef000
Call Trace:{unmap_vmas+1257} {exit_mmap+92}
{mmput+27} {do_exit+521}
{do_group_exit+159} {system_call+126}


Code: 0f 0b 68 2c 02 3f 80 c2 e6 01 66 66 66 90 48 c7 c6 ff ff ff
RIP {page_remove_rmap+18} RSP
<1>Fixing recursive fault but reboot is needed!

----------------------

Kernel BUG at mm/rmap.c:486
invalid operand: 0000 [1]
CPU 0
Modules linked in:
Pid: 3304, comm: cc1plus Not tainted 2.6.15-gentoo #2
RIP: 0010:[] {page_remove_rmap+18}
RSP: 0018:ffff8100271ddde0 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff8100301f3338 RCX: ffffffff804d8f40
RDX: 0000000000000000 RSI: 00002aaaae467000 RDI: ffff81000111ab18
RBP: 00002aaaae467000 R08: 80000000050c5067 R09: ffff81000111ab18
R10: 00002aaaaaeecec8 R11: 0000000000000246 R12: ffff81000111ab18
R13: 0000000000000020 R14: ffff8100271dde98 R15: ffff810005dbab90
FS: 00002aaaaaff46d0(0000) GS:ffffffff80511800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaaaad55a60 CR3: 0000000039ddc000 CR4: 00000000000006e0
Process cc1plus (pid: 3304, threadinfo ffff8100271dc000, task ffff81003ae137f0)
Stack: ffffffff8015ae99 00002aaaae918fff 00002aaaae918fff 00002aaaae918fff
ffffffffffffff99 ffff81003aebd9c0 00002aaaae600000 00002aaaae919000
ffff810016dda550 00002aaaae919000
Call Trace:{unmap_vmas+1257} {exit_mmap+92}
{mmput+27} {do_exit+521}
{do_group_exit+159} {system_call+126}


Code: 0f 0b 68 ec 18 3f 80 c2 e6 01 66 66 66 90 48 c7 c6 ff ff ff
RIP {page_remove_rmap+18} RSP
<1>Fixing recursive fault but reboot is needed!


--------------------
shadow Thu, 09 Feb 2006 22:07:24
More logs:

Unable to handle kernel paging request at 0000000100000008 RIP:
{__rmqueue+71}
PGD 14402067 PUD 0
Oops: 0002 [1]
CPU 0
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_pcm
snd_timer snd soundcore snd_page_alloc
Pid: 28676, comm: ld Not tainted 2.6.16-rc2 #2
RIP: 0010:[] {__rmqueue+71}
RSP: 0000:ffff810031655d00 EFLAGS: 00010083
RAX: 0000000100000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffffffff8046dff0 RSI: ffff810000b95858 RDI: ffffffff8046df70
RBP: 0000000000000001 R08: ffffffff8046df70 R09: ffffffff8046dff0
R10: ffff810000b95830 R11: 0000000000000000 R12: ffffffff8046dfb0
R13: ffffffff8046df70 R14: 000000000000001f R15: 0000000000000001
FS: 00002b07d2459e30(0000) GS:ffffffff80538000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000008 CR3: 000000002443f000 CR4: 00000000000006e0
Process ld (pid: 28676, threadinfo ffff810031654000, task ffff81000314a0c0)
Stack: ffffffff8046dfc0 000000000000000a ffffffff8014a707 0000000000000000
0000000100000000 0000000100000000 ffffffff8046e650 0000004431655e28
000280d200000000 0000000000000256
Call Trace: {get_page_from_freelist+247}
{__alloc_pages+89} {__handle_mm_fault+465}
{generic_file_aio_read+52} {do_page_fault+952}
{vfs_read+349} {error_exit+0}

Code: 48 89 50 08 48 89 02 48 c7 46 08 00 02 20 00 48 c7 06 00 01
RIP {__rmqueue+71} RSP
CR2: 0000000100000008
<1>Unable to handle kernel paging request at 0000000100000008 RIP:
{__rmqueue+71}
PGD 2aba5067 PUD 0
Oops: 0002 [2]
CPU 0
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_pcm
snd_timer snd soundcore snd_page_alloc
Pid: 28719, comm: ld Not tainted 2.6.16-rc2 #2
RIP: 0010:[] {__rmqueue+71}
RSP: 0018:ffff8100067d78b0 EFLAGS: 00010083
RAX: 0000000100000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffffffff8046dff0 RSI: ffff810000b95858 RDI: ffffffff8046df70
RBP: 0000000000000001 R08: ffffffff8046df70 R09: ffffffff8046dff0
R10: ffff810000b95830 R11: 0000000000000000 R12: ffffffff8046dfb0
R13: ffffffff8046df70 R14: 000000000000001f R15: 0000000000000001
FS: 00002ba93cc45e30(0000) GS:ffffffff80538000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000008 CR3: 0000000027d46000 CR4: 00000000000006e0
Process ld (pid: 28719, threadinfo ffff8100067d6000, task ffff81000bd74af0)
Stack: ffffffff8046dfc0 0000000000000000 ffffffff8014a707 ffff8100067d7c18
0000000100000000 0000000100000000 ffffffff8046e650 00000044067d7ae8
000280d200000000 0000000000000256
Call Trace: {get_page_from_freelist+247}
{__alloc_pages+89} {__handle_mm_fault+465}
{do_page_fault+952} {__linvfs_get_block+156}
{__do_page_cache_readahead+128} {error_exit+0}
{file_read_actor+59} {page_cache_readahead+235}
{do_generic_mapping_read+462}
{file_read_actor+0}
{__generic_file_aio_read+425} {xfs_read+487}
{linvfs_aio_read+106} {do_sync_read+208}
{autoremove_wake_function+0} {vfs_read+233}
{sys_read+83} {system_call+126}

Code: 48 89 50 08 48 89 02 48 c7 46 08 00 02 20 00 48 c7 06 00 01
RIP {__rmqueue+71} RSP
CR2: 0000000100000008
Hugh Dickins Fri, 10 Feb 2006 08:58:50
Thanks for taking out the nVidia driver, and making that clear.
I presume you have given memtest86 and/or memtest86+ a good run?
I suspect your case won't be bad memory, but would like that
possibility eliminated.

Quite interesting: all those reports which contain any useful info
suggest a slab problem. Could be coincidence. But in both reports
showing page flags, PG_slab was set; and your latest __rmqueue+71
oopses are consistent with the page allocator hitting what should
be a free page's ->lru, instead being used by slab.
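
As a rough cross-check of the ->lru remark, the sketch below models the first store of a doubly linked list unlink (next->prev = prev). It assumes the usual two-pointer struct list_head layout, with prev at offset 8 on x86-64, and it assumes the leading bytes of the Code: line (48 89 50 08) decode to a store of RDX into 8(%rax). The helper name is hypothetical; the register values are copied from the two __rmqueue+71 oopses above.

```python
# Sketch only: models the first store of a list_del()-style unlink,
# next->prev = prev, on a 64-bit machine. Assumes struct list_head is
# { next, prev } with 8-byte pointers, so `prev` sits at offset 8.
POINTER_SIZE = 8
OFFSET_OF_PREV = 1 * POINTER_SIZE


def unlink_store_address(next_ptr):
    """Address written by `next->prev = prev` (hypothetical helper)."""
    return next_ptr + OFFSET_OF_PREV


# RAX and CR2 as printed in the two __rmqueue+71 oopses above.
rax = 0x0000000100000000
cr2 = 0x0000000100000008

matches = unlink_store_address(rax) == cr2
print(f"store to {unlink_store_address(rax):#018x}, CR2 matches: {matches}")
```

If those assumptions hold, RAX is the next pointer read from the supposedly free page's ->lru, and 0x100000000 is not a valid kernel pointer, which fits the idea that the page was still being used by something else.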

Would you mind rebuilding your current kernel (2.6.16-rc2 I believe)
with the 2.6.14 mm/slab.c and include/linux/slab.h? Since your config
is UP, the only conflicts I'd expect from that substitution are that
you'll have to #ifdef out the kzalloc() and the kstrdup() in the
2.6.14 slab.c, because they've now moved into the 2.6.16-rc2 util.c;
you'll also get warnings on kmem_cache_create from fs/dcache.c, but
they don't matter. (If you'd prefer me to send you a patch against
2.6.16-rc2 to revert that slab.c and slab.h to 2.6.14, just ask.)

If you cannot reproduce any problems with this hybrid kernel,
then we'll need to go hunting for bugs in the newer slab.c,
perhaps asking you to bisect for where things went wrong.

Please continue to use my "Bad rmap" patch, that gives us more info
to go on than the base kernels. We may want to extend it, to show
the kmem_cache involved when we hit a PG_slab, but I'd rather leave
that until later (it'll need mods to whatever slab.c we decide on).

Thanks,
Hugh
shadow Fri, 10 Feb 2006 20:26:41
Yes, I used both memtest86+ and mprime to test the CPU and memory. Everything
is fine.

Will try out the modifications you suggested and let you know what happens.

Thanks.
shadow Sun, 12 Feb 2006 14:12:05
This is what I get using the 2.6.14 slab.c and slab.h:

Bad rmap: page ffff8100010e9bf8 flags 4000000000000060 count 1 mapcount 1
cc1plus addr 2b480195d000 ptpfn 198a3 vm_flags 100073
page mapping ffff810036edbce9 2b4802070 vma mapping ffff810036edbce9 2b480195d

--------

Bad rmap: page ffff810001b29618 flags 4000000000000060 count 1 mapcount 1
cc1 addr 2b2b1d75d000 ptpfn 338a3 vm_flags 100073
page mapping ffff81003a4d5cd9 2b2b1db07 vma mapping ffff81003a4d5cd9 2b2b1d75d

--------

Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
{__rmqueue+71}
PGD 37fb5067 PUD 1b33f067 PMD 0
Oops: 0002 [1]
CPU 0
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_pcm
snd_timer snd soundcore snd_page_alloc
Pid: 7734, comm: sh Not tainted 2.6.16-rc2-os #7
RIP: 0010:[] {__rmqueue+71}
RSP: 0000:ffff810033173d00 EFLAGS: 00010087
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffffffff80458ff0 RSI: ffff8100005641a8 RDI: ffffffff80458f70
RBP: 0000000000000001 R08: ffffffff80458f70 R09: ffffffff80458ff0
R10: ffff810000564180 R11: 0000000000000000 R12: ffffffff80458fb0
R13: ffffffff80458f70 R14: 000000000000001f R15: 0000000000000001
FS: 00002b4788358ae0(0000) GS:ffffffff8051d000(0000) knlGS:00000000f7eaf6b0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000001cfdf000 CR4: 00000000000006e0
Process sh (pid: 7734, threadinfo ffff810033172000, task ffff8100396389f0)
Stack: ffffffff80458fc0 000000000000000f ffffffff8014a617 ffff81003bbe3ac0
000000013619c468 0000000100000000 ffffffff80459650 000000443b6bb000
000280d200000000 0000000000000256
Call Trace: {get_page_from_freelist+247}
{__alloc_pages+89} {__handle_mm_fault+465}
{do_path_lookup+574} {do_page_fault+952}
{sys_newlstat+54} {error_exit+0}

Code: 48 89 50 08 48 89 02 48 c7 46 08 00 02 20 00 48 c7 06 00 01
RIP {__rmqueue+71} RSP
CR2: 0000000000000008

--------

Don't know if this is related:

sh[6912]: segfault at 00002b2b250b9520 rip 00002b2b24eff255 rsp 00007ffffff79e88
error 7


shadow Sun, 12 Feb 2006 16:40:20
Bad rmap: page ffff810001019ec0 flags 814 count 0 mapcount 0
cc1 addr 2ae9d92ad000 ptpfn 358b3 vm_flags 100073
page mapping 0000000000000000 0 vma mapping 0000000000000000 2ae9d92ad
Bad page state in process 'cc1'
page:ffff810001019ec0 flags:0x0000000000000814 mapping:0000000000000000
mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace: {bad_page+89}
{free_hot_cold_page+131}
{unmap_vmas+1224} {exit_mmap+92}
{mmput+27} {do_exit+521}
{do_group_exit+153} {system_call+126}
Hugh Dickins Mon, 13 Feb 2006 08:55:34
Thanks for trying the 2.6.14 slab.c/slab.h and reporting back on that.

Well, you've continued to have similar problems, but PG_slab has not
been set in these cases. It looks to me like that was a coincidence
of the early reports, worth the 2.6.14 experiment, but actually of no
relevance. It'd probably be least confusing if you set your kernel
back to its original slab.c/slab.h now; but please continue to run
with the "Bad rmap" patch, since that is still giving good info.
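
The PG_slab observation can be checked by decoding the low bits of the page flags values reported so far. This is a sketch under an assumption: the bit numbers below are the ones recalled from the 2.6.16-era include/linux/page-flags.h and should be verified against the exact tree in use. The upper bits (the leading 4 in 4000000000000094) hold zone/node information and are ignored here. The function name is hypothetical.

```python
# Assumed bit assignments (2.6.16-era include/linux/page-flags.h);
# verify against the kernel tree actually being debugged.
PAGE_FLAG_NAMES = {
    0: "PG_locked",
    1: "PG_error",
    2: "PG_referenced",
    3: "PG_uptodate",
    4: "PG_dirty",
    5: "PG_lru",
    6: "PG_active",
    7: "PG_slab",
    8: "PG_checked",
    9: "PG_arch_1",
    10: "PG_reserved",
    11: "PG_private",
}


def decode_page_flags(flags):
    """Return the names of the low flag bits set in `flags` (hypothetical helper)."""
    return [
        name
        for bit, name in sorted(PAGE_FLAG_NAMES.items())
        if (flags >> bit) & 1
    ]


# Values copied from the "Eeek!" and "Bad rmap" reports in this thread.
for value in (0x4000000000000094, 0x4000000000000060, 0x814):
    print(f"{value:#x}: {', '.join(decode_page_flags(value))}")
```

Under that bit layout, only the 0x94 reports carry PG_slab; the 0x60 and 0x814 reports do not, which matches the conclusion that the slab connection was a coincidence.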

The interesting lines are:
cc1plus addr 2ab62c55d000 ptpfn 170a3 vm_flags 100073
cc1plus addr 2b480195d000 ptpfn 198a3 vm_flags 100073
cc1 addr 2b2b1d75d000 ptpfn 338a3 vm_flags 100073
cc1 addr 2ae9d92ad000 ptpfn 358b3 vm_flags 100073

There's something of a pattern to that, and it's actually the first
time I've seen the "ptpfn" value be useful - it's giving the page
frame number of the page table page containing the bad entry. From
that and the virtual "addr" we can see there were bad entries at
physical addresses 170a3ae8, 198a3ae8, 338a3ae8, 358b3568 (the last
departing somewhat from the pattern).
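
The arithmetic behind those four physical addresses can be reproduced with a short sketch. It assumes standard x86-64 4 KiB paging: a page table page holds 512 entries of 8 bytes each, indexed by bits 12..20 of the virtual address. The function name is hypothetical; the addr and ptpfn values are copied from the "Bad rmap" lines quoted above.

```python
PAGE_SHIFT = 12      # 4 KiB pages
PTRS_PER_PTE = 512   # entries per page table page on x86-64
PTE_SIZE = 8         # bytes per page table entry


def pte_phys_addr(ptpfn, vaddr):
    """Physical address of the pte that maps `vaddr`, given the page
    frame number of the page table page (hypothetical helper)."""
    index = (vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)
    return (ptpfn << PAGE_SHIFT) + index * PTE_SIZE


# (comm, addr, ptpfn) from the four "Bad rmap" reports.
reports = [
    ("cc1plus", 0x2AB62C55D000, 0x170A3),
    ("cc1plus", 0x2B480195D000, 0x198A3),
    ("cc1", 0x2B2B1D75D000, 0x338A3),
    ("cc1", 0x2AE9D92AD000, 0x358B3),
]

for comm, addr, ptpfn in reports:
    print(f"{comm:8s} pte at physical {pte_phys_addr(ptpfn, addr):08x}")
```

The first three reports share the same pte index (0x15d, byte offset 0xae8 within the page table page), while the fourth uses index 0x0ad (offset 0x568), which is the departure from the pattern noted above.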

I'm no expert on hardware peculiarities, and I cannot be sure: but
it does look to me more likely to be a hardware issue - bad memory
or overheating or somesuch weirdness - than a software issue. It's
certainly been noticed in the past that cc may exercise the memory
in some ways better than a mem tester.

But those addresses are well spread out, so it's no use me suggesting
you pull out this or that card and see what happens. I hope someone
watching will be able to make a more informed suggestion. Andrew?
Natalie Protasevich Mon, 22 Oct 2007 22:34:05
Does this problem still exist or was it resolved in later kernels?
Thanks.
Bruce Duncan Wed, 23 Jan 2008 08:04:39
I found the same(?) BUG in 2.6.24-rc8 (x86-64, UP, not tainted), but I don't
know how to reproduce it or even what triggered it. I hope the log is of some
use:

Jan 23 00:37:46 moon kernel: Eeek! page_mapcount(page) went negative! (-1)
Jan 23 00:37:46 moon kernel: page pfn = 0
Jan 23 00:37:47 moon kernel: page->flags = 400
Jan 23 00:37:47 moon kernel: page->count = 1
Jan 23 00:37:47 moon kernel: page->mapping = 0000000000000000
Jan 23 00:37:47 moon kernel: vma->vm_ops = 0xffffffff804f9ec0
Jan 23 00:37:47 moon kernel: vma->vm_ops->nopage = _stext+0x7fdf7000/0x20
Jan 23 00:37:47 moon kernel: vma->vm_ops->fault = filemap_fault+0x0/0x400
Jan 23 00:37:47 moon kernel: vma->vm_file->f_op->mmap =
generic_file_mmap+0x0/0x50
Jan 23 00:37:47 moon kernel: ------------[ cut here ]------------
Jan 23 00:37:47 moon kernel: kernel BUG at mm/rmap.c:631!
Jan 23 00:37:47 moon kernel: invalid opcode: 0000 [1]
Jan 23 00:37:47 moon kernel: CPU 0
Jan 23 00:37:47 moon kernel: Modules linked in: cpufreq_powersave psmouse
cpufreq_stats radeon drm k8temp hwmon
Jan 23 00:37:47 moon kernel: Pid: 4258, comm: kio_thumbnail Not tainted
2.6.24-rc8 #20
Jan 23 00:37:47 moon kernel: RIP: 0010:[]
[] page_remove_rmap+0x15d/0x170
Jan 23 00:37:47 moon kernel: RSP: 0018:ffff81001c239dd8 EFLAGS: 00010292
Jan 23 00:37:47 moon kernel: RAX: 000000000000003b RBX: ffff810001000000 RCX:
000000000000ffff
Jan 23 00:37:47 moon kernel: RDX: 00000000ffffff01 RSI: 000000000006d9b8 RDI:
0000000000000000
Jan 23 00:37:47 moon kernel: RBP: ffff8100120b4e70 R08: 0000000000000000 R09:
00000000ffffffff
Jan 23 00:37:47 moon kernel: R10: 0000000000000000 R11: 0000000000000001 R12:
00002b3b5ca11000
Jan 23 00:37:47 moon kernel: R13: ffff810001000000 R14: 00002b3b5cb37000 R15:
0000000000237460
Jan 23 00:37:47 moon kernel: FS: 00002b3b5d28a7b0(0000)
GS:ffffffff8051e000(0000) knlGS:00000000f7dff6b0
Jan 23 00:37:47 moon kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 23 00:37:47 moon kernel: CR2: 00002b3b5cf5e688 CR3: 000000002f859000 CR4:
00000000000006e0
Jan 23 00:37:47 moon kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
Jan 23 00:37:47 moon kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
Jan 23 00:37:47 moon kernel: Process kio_thumbnail (pid: 4258, threadinfo
ffff81001c238000, task ffff810017bf95c0)
Jan 23 00:37:47 moon kernel: Stack: 00002b3b5c921000 00002b3b5cb37000
ffff810037286080 ffffffff8026860d
Jan 23 00:37:47 moon kernel: 00002b3b5cb36fff 0000000000000000
ffff81001c239ed8 ffffffffffffffff
Jan 23 00:37:47 moon kernel: 0000000000000000 ffff8100120b4e70
ffff81001c239ee0 0000000100000292
Jan 23 00:37:47 moon kernel: Call Trace:
Jan 23 00:37:47 moon kernel: [] unmap_vmas+0x48d/0x6c0
Jan 23 00:37:47 moon kernel: [] exit_mmap+0x5e/0xe0
Jan 23 00:37:47 moon kernel: [] mmput+0x25/0x80
Jan 23 00:37:47 moon kernel: [] do_exit+0x178/0x850
Jan 23 00:37:47 moon kernel: [] do_group_exit+0x29/0x60
Jan 23 00:37:47 moon kernel: [] system_call+0x7e/0x83
Jan 23 00:37:47 moon kernel:
Jan 23 00:37:47 moon kernel:
Jan 23 00:37:47 moon kernel: Code: 0f 0b eb fe 48 8b 53 10 e9 65 ff ff ff 66 0f
1f 44 00 00 48
Jan 23 00:37:47 moon kernel: RIP []
page_remove_rmap+0x15d/0x170
Jan 23 00:37:47 moon kernel: RSP
Jan 23 00:37:47 moon kernel: ---[ end trace 6b806afd24002a92 ]---
Jan 23 00:37:47 moon kernel: Fixing recursive fault but reboot is needed!

Any idea how I might go about reproducing this?
Bruce
Hugh Dickins Wed, 23 Jan 2008 08:50:35
On Wed, 23 Jan 2008, bugme-daemon@bugzilla.kernel.org wrote:
> I found the same(?) BUG in 2.6.24-rc8 (x86-64, UP, not tainted), but I don't
> know how to reproduce it or even what triggered it. I hope the log is of some
> use:

Thanks for the report. Yes, your log is indeed of use:

> Jan 23 00:37:46 moon kernel: Eeek! page_mapcount(page) went negative! (-1)
> Jan 23 00:37:46 moon kernel: page pfn = 0

pfn 0 means the pte entry which caused this was mostly zeroes:
pfn 0 shouldn't normally appear as present in any page table
(a possible exception would be vbetool, but this is kio_thumbnail).

> Jan 23 00:37:47 moon kernel: R10: 0000000000000000 R11: 0000000000000001 R12:
> 00002b3b5ca11000

And (though I wouldn't swear to it), I suspect that R11 there is showing
the page table entry which caused this: just that first bit set, making it
look like a present entry, but otherwise 0. Looks like a single bit error.
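
To illustrate why a value of 1 reads as "present, pfn 0", the sketch below decodes a 64-bit x86-64 page table entry using the architectural bit positions (present = bit 0, NX = bit 63, frame number in bits 12..51). The function name is hypothetical. The second value is R08 from one of the earlier 2.6.15 register dumps, included only for contrast; that it holds a pte is an assumption.

```python
# Architectural x86-64 pte flag bits (4 KiB pages).
PTE_FLAG_BITS = {
    0: "present",
    1: "rw",
    2: "user",
    3: "pwt",
    4: "pcd",
    5: "accessed",
    6: "dirty",
    63: "nx",
}
PAGE_SHIFT = 12
PFN_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 12..51


def decode_pte(pte):
    """Return (pfn, [flag names]) for a raw pte value (hypothetical helper)."""
    flags = [
        name
        for bit, name in sorted(PTE_FLAG_BITS.items())
        if (pte >> bit) & 1
    ]
    return (pte & PFN_MASK) >> PAGE_SHIFT, flags


suspect = 0x0000000000000001   # R11 in the 2.6.24-rc8 oops
contrast = 0x800000001E2ED067  # R08 in an earlier dump, assumed to be a pte

for pte in (suspect, contrast):
    pfn, flags = decode_pte(pte)
    bits_set = bin(pte).count("1")
    print(f"{pte:#018x}: pfn={pfn:#x} flags={flags} bits set={bits_set}")
```

A pte of exactly 1 differs from an empty (all-zero) entry by a single bit, which is what makes a one-bit memory error a plausible explanation here.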

> Any idea how I might go about reproducing this?

First probability is that it's bad RAM: please try running memtest86+
overnight to see what that shows. Secondary possibilities are that
it's from something like overheating, bad cabling somewhere, or cosmic
rays (last time I said that about one of these, I was half-joking, and
the recipient thought I was ridiculing him; but several people joined
in the thread to support that particular hypothesis - though several
months later he eventually conceded that it was a hardware error).

Another possibility is that it does indeed come from a kernel bug,
something somewhere corrupting memory and here hitting your pagetable;
but there's nothing here to indicate that - we'd need several more
such reports to start suspecting that case.

Hugh
Bruce Duncan Thu, 24 Jan 2008 17:56:02
Hugh,

Thanks, your analysis makes for interesting reading, however:

Fresh BUG! Squash yours today!

Eeek! page_mapcount(page) went negative! (-1)
page pfn = 0
page->flags = 400
page->count = 1
page->mapping = 0000000000000000
vma->vm_ops = 0xffffffff804f9ec0
vma->vm_ops->nopage = _stext+0x7fdf7000/0x20
vma->vm_ops->fault = filemap_fault+0x0/0x400
vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x50
------------[ cut here ]------------
kernel BUG at mm/rmap.c:631!
invalid opcode: 0000 [1]
CPU 0
Modules linked in: radeon drm psmouse snd_hda_intel snd_pcm snd_timer snd
soundcore snd_page_alloc r8169 k8temp hwmon
Pid: 3298, comm: kmail Not tainted 2.6.24-rc8 #21
RIP: 0010:[] []
page_remove_rmap+0x15d/0x170
RSP: 0018:ffff81001f501dd8 EFLAGS: 00010292
RAX: 000000000000003b RBX: ffff810001000000 RCX: 0000000000010000
RDX: 00000000ffffff01 RSI: 0000000000009e61 RDI: 0000000000000000
RBP: ffff8100340d8d20 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000001 R12: 00002b5482011000
R13: ffff810001000000 R14: 00002b5482101000 R15: 00000000003e9e8a
FS: 00002b5483453630(0000) GS:ffffffff8051e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003009b02c78 CR3: 000000001f563000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kmail (pid: 3298, threadinfo ffff81001f500000, task ffff81001f5675c0)
Stack: 00002b5481e17000 00002b5482101000 ffff810034b63080 ffffffff8026860d
00002b5482100fff 0000000000000000 ffff81001f501ed8 ffffffffffffffff
0000000000000000 ffff8100340d8d20 ffff81001f501ee0 0000000100000292
Call Trace:
[] unmap_vmas+0x48d/0x6c0
[] exit_mmap+0x5e/0xe0
[] mmput+0x25/0x80
[] do_exit+0x178/0x850
[] do_group_exit+0x29/0x60
[] system_call+0x7e/0x83


Code: 0f 0b eb fe 48 8b 53 10 e9 65 ff ff ff 66 0f 1f 44 00 00 48
RIP [] page_remove_rmap+0x15d/0x170
RSP
---[ end trace 4662ef901f5ffc26 ]---
Fixing recursive fault but reboot is needed!


This happened about a minute after a s2ram cycle. Does this rule out cosmic
rays? I think my flat is fairly well shielded from cosmic rays, but there's
probably still some radioactive gases in the rock it's made from.

I've not had any other symptoms of bad hardware, but I'll run memtest tomorrow
when I'm at work.

Bruce
Hugh Dickins Fri, 25 Jan 2008 03:23:21
On Thu, 24 Jan 2008, bugme-daemon@bugzilla.kernel.org wrote:
>
> This happened about a minute after a s2ram cycle. Does this rule out cosmic
> rays? I think my flat is fairly well shielded from cosmic rays, but there's
> probably still some radioactive gases in the rock it's made from.

I regret to inform you that s2ram gives very little protection against
cosmic rays or radon emissions - though I'm sure the developers would
accept any patches you can offer to improve its shielding properties ;)

But that fact does increase my suspicion that you have a RAM problem.

> I've not had any other symptoms of bad hardware, but I'll run memtest
> tomorrow when I'm at work.

Yes, please do. The symptoms this time were identical to before, so
same conclusion as before. If memtest86+ doesn't show anything, that
doesn't rule out the issue, but it will make it worth putting in further
diagnostic patches, to find if there's more of a pattern to these.

Hugh