Summary
  Reported Issue
Title: [11237] corrupt PMD after resume
Project: kernel
Item Last Modified: Tue, 14 Oct 2008 07:24:15
Details
[11237] corrupt PMD after resume
Reporter:   rjw
Created:   Sat, 02 Aug 2008 14:04:00
Updated:   Tue, 14 Oct 2008 07:24:15
Key:   11237
Versions:   Not provided
Environment:  
Priority:   -1
Status:   Opened
Resolution:   Not provided
Original Link:   http://bugzilla.kernel.org/show_bug.cgi?id=11237
Summary:   corrupt PMD after resume
Description:
Subject : [BUG] 2.6.27-rc1 in ext3_find_entry
Submitter : Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Date : 2008-08-02 9:51
References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4
Handled-By : Hugh Dickins <hugh@veritas.com>

This entry is being used for tracking a regression from 2.6.26. Please don't
close it until the problem is fixed in the mainline.
Comments:
Alan Jenkins Sun, 03 Aug 2008 03:58:23
Hugh: Thanks for your suggestions for debugging the corrupt PMD entries, but
they're a bit scary for me :-). I'll try them if I can't get anything better.

I read basic-pm-debugging.txt and tried

echo core > /sys/power/pm_test # Do everything except the actual suspend
echo mem > /sys/power/state

and also a complete suspend to disk. But the BUG only seems to trigger with a
complete suspend to ram.

It might still be a driver bug though. I'm now using a statically linked
version of s2ram, so I should be able to try it from the initramfs before
drivers are loaded. (I just need to steal a PS/2 keyboard :-).
Alan Jenkins Sun, 03 Aug 2008 04:56:47
I ran s2ram from my Ubuntu initramfs, using a "break=premount" boot option. I
checked and the only modules which had been loaded were related to software
RAID. I also checked the contents of /sys/bus/pci/drivers. There were only
two pci drivers, "serial" and "pcieport-driver". I don't use a framebuffer
driver either. So that just leaves ACPI/pnp/platform/system devices...

I didn't get any BUGs printed to the console, so I continued the boot process
and logged into my KDE session. Previously, I found that after booting and
suspending, I had to log in, check my email etc. to reproduce the problem. I
did the same thing and this time it panicked (locked up with flashing keyboard
LEDs) before I could run "dmesg" to check for the BUGs.
Alan Jenkins Mon, 04 Aug 2008 06:41:27
I've bisected it to somewhere between v2.6.26-rc5-213-g1eede07 and
v2.6.26-rc9-696-g329513a. Unfortunately this seems to be a big hole of 132
commits worth of badness.

So far I've found a couple of different build failures, a BUG that kills init
during early boot, a panic just after "Kernel really alive"... nothing that
works well enough to test suspend.
Alan Jenkins Tue, 05 Aug 2008 10:17:47
I've bisected this down further to
v2.6.26-rc9-547-ga939098..v2.6.26-rc9-615-gd86623a. These are all x86 arch
commits, mostly unification. They are non-bisectable due to build breakage.

Looking at the oneline commit log, there's nothing that directly mentions suspend
to ram. However there must be something in here that's breaking it, causing it
to resume with these invalid page... thingies. IIRC, the resume path re-uses
some code (maybe including assembler?) from the boot path. So something in
here happens to work for boot but not suspend.

BTW I found running gitk is a great way to reproduce this BUG after resume.
Alan Jenkins Tue, 05 Aug 2008 10:21:21
Please change this bug to Platform/x86_64, to alert the appropriate people(s).
Hugh Dickins Tue, 05 Aug 2008 11:19:14
You seem to be making useful progress, thanks for your efforts.
Later tonight or tomorrow I'll look for clues in the output of
git diff -u v2.6.26-rc9-547-ga939098 v2.6.26-rc9-615-gd86623a

I notice there's a fair number of max_pfn patches in there:
so rather than ask you what your max_pfn is, or how much RAM
you have, please may I ask you to put the early part of your
bootup dmesg into this bugzilla - say, up as far as the
Freeing unused kernel memory: XXXk freed
though it's mainly the BIOS-e820 map at the beginning I'm
thinking might turn out to be useful when framing hypotheses.
Alan Jenkins Tue, 05 Aug 2008 12:40:09
Sorry, but what I had on disk as "v2.6.26-rc9-615-gd86623a" wasn't. I.e. it
claims it's a different version in dmesg. I can't actually build
v2.6.26-rc9-615-gd86623a.

Please use
v2.6.26-rc9-547-ga939098..v2.6.26-rc9-00681-g1a98fd1. (v2.6.26-rc9-00681-g1a98fd1~1
and immediate predecessors are the ones that panic just after "Kernel really
alive").
Alan Jenkins Tue, 05 Aug 2008 12:45:56
Created an attachment (id=17096) [details]
2.6.26-rc9-00681-g1a98fd1 boot messages

It also seems I missed the very noisy CPA self-test failures in this kernel
log! Note this happens *before* I suspend.
Hugh Dickins Fri, 08 Aug 2008 10:13:24
I've spent a while on this but made little progress, sorry.

I've probably spent too long wondering about those CPA self-test failures,
which I now think irrelevant. They all stem from the fact that the first
of the two pmds which cover the kernel's direct map of your 2GB didn't
have the global bit set in its entries (whereas the second did, to judge
by the 201 failures out of 400 tests), so clearing that bit never split
the level. That could easily be a bug at that point of the bisection,
which got fixed later (certainly the test got changed later, not to use
the global bit): before we're finished I should include a patch to check
that the global bit is now being set with your latest kernel, if it is
then those CPA failures won't be worth spending more time on.

But I've still no ideas about the corruption you're seeing in the second
of those pmds. I've gone through the diff, though I wouldn't pretend
with a fine-toothed comb, and nothing stood out as a good suspect. It's
really too big a diff for me to grasp, shame about that eclipse in the
bisection ("eclipse" is what comes to my mind when I hit an impenetrable
area of a bisection like that): there may be ways to narrow it, but it's
tedious on one's own machine, and worse trying to direct someone else
at a distance. And it's perfectly possible that changes here are just
shifting pre-existing or potential corruption to where it does visible
damage: the changes you've bisected to are not necessarily to blame.

I've not yet worked out a sensible next step. Presumably you saw a
variety of oopses as you bisected down, it would be worth attaching
those to the bugzilla, in case there's something to be learnt from
their pattern or their spread. But right now I'm stale: I'm going
to break off for a day or two.
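
A note on the "two pmds which cover the kernel's direct map of your 2GB" above. This is only a minimal sketch of the arithmetic, assuming standard x86-64 4-level paging with 4 KiB table pages, 8-byte entries, and a direct map built from 2 MiB large pages. The thread does not spell any of that out. Under those assumptions one PMD table covers 1 GiB, so 2GB of RAM needs two. The function names are illustrative and do not come from the kernel.

```python
PAGE_SIZE = 4096                                  # one page-table page
ENTRY_SIZE = 8                                    # bytes per entry on x86-64
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE       # 512 entries per table
LARGE_PAGE = 2 * 1024 * 1024                      # assumed: one PMD entry maps 2 MiB
PMD_TABLE_SPAN = ENTRIES_PER_TABLE * LARGE_PAGE   # 1 GiB covered per PMD table


def pmd_tables_needed(ram_bytes):
    """Number of PMD tables needed to direct-map ram_bytes (rounded up)."""
    return -(-ram_bytes // PMD_TABLE_SPAN)


def locate_pmd_entry(phys_addr):
    """Return (pmd_table_index, entry_index) covering a physical address."""
    table, rest = divmod(phys_addr, PMD_TABLE_SPAN)
    return table, rest // LARGE_PAGE


if __name__ == "__main__":
    print(pmd_tables_needed(2 * 1024**3))      # tables for 2 GiB of RAM
    print(locate_pmd_entry(0x60000000))        # which entry maps the 1.5 GiB mark
```

With these assumptions, any physical address at or above 1 GiB lands in the second table, which is the one Hugh describes as showing the corruption.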
Rafael J Wysocki Sun, 10 Aug 2008 05:43:50
On Sunday, 10 of August 2008, Hugh Dickins wrote:
> On Sat, 9 Aug 2008, Theodore Tso wrote:
> > On Sun, Aug 10, 2008 at 12:43:51AM +0200, Rafael J. Wysocki wrote:
> > > This message has been generated automatically as a part of a report
> > > of recent regressions.
> > >
> > > The following bug entry is on the current list of known regressions
> > > from 2.6.26. Please verify if it still should be listed and let me know
> > > (either way).
>
> Yes, it should still be listed.
>
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11237
> > > Subject : [BUG] 2.6.27-rc1 in ext3_find_entry
> > > Submitter : Alan Jenkins <alan-jenkins@tuffmail.co.uk>
> > > Date : 2008-08-02 9:51 (8 days old)
> > > References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4
> > > Handled-By : Hugh Dickins <hugh@veritas.com>
> > >
> >
> > You might want to change the description to include that it occurred
> > after a suspend/resume; Hugh suspects corrupted PMD entries as the
> > cause of the crash, and not necessarily anything in the ext3 code. So
> > the title might be a bit misleading. (At the same time, if it turns out
> > that the suspend/resume was a red herring, and it looks more like a
> > real ext3 bug, please send a note to that effect; right now I'm not
> > paying attention to this bug.)
>
> Right, there's no reason at all to suppose it's related to ext3,
> that just happened to be the first victim of the corruption on
> one occasion. Carry on paying no attention to this bug, Ted.
> "corrupt PMD after resume" perhaps.
Hugh Dickins Mon, 11 Aug 2008 12:48:44
I suppose we could try for a very lazy way out by asking:
do you see similar symptoms with 2.6.27-rc2 or current -git?
Though I won't feel very satisfied if you answer "no".
Alan Jenkins Wed, 13 Aug 2008 04:58:27
Created an attachment (id=17215) [details]
Kernel log 2.6.27-rc3 init=/bin/bash

Thanks for honesty. No, it's still here.

Reproduced on 2.6.27-rc3 with init=/bin/bash
- wait 30s for CPA self test - success
s2ram --force --acpi_sleep=3
- wait 30s for CPA self test - BUG

So I guess you're right that the CPA failures before suspend on bisect/bad were
not significant.
Hugh Dickins Wed, 13 Aug 2008 17:23:41
Thanks for trying. Your "No" is not the "no" which would have
left me unsatisfied, it's a "yes" which says we can't be lazy.
What a satisfying pity ;)

It's not at all surprising that the CPA self-test was successful:
even if what caused it to fail before is still an issue, the test
is now flipping a different bit, which is less likely to go wrong.

I'm uncertain whether you're implying that the CPA test after
resume was involved in the BUG, or whether you're just saying
that in the seconds you were waiting the BUG happened to occur.
I assume the latter (but not ruling out a connection).

Do you see the bug if CONFIG_CPA_DEBUG is switched off?
(Don't wait very long to suspend after bootup, after a few
minutes the CPA testing has split all and exhausted itself.)

The x86 laptop I can suspend/resume when _32 is these days not
resuming when _64 (not a recent regression, I believe). Seems to
be just a video issue, maybe I should try tweaking its s2ram args,
or try a more uptodate s2ram. If I can get it resuming usefully
when _64, it may be worth my trying CONFIG_CPA_DEBUG=y on it too.

(Well, I have been using that config a lot in the last few days,
for /proc/meminfo fixes; but not together with resume.)

I was hoping (did ask) for you to attach some more log extracts:
whenever you decided a bisection was bad, or saw the BUG above,
wasn't there an oops recorded in /var/log/messages? The more of
those I can see (within reason!), the better the (not good)
chance I can make sense of them. Thanks.
Alan Jenkins Thu, 14 Aug 2008 00:57:50
The CPA test after resume *was* involved in the bug. "do_pageattr_test" is on
the calltrace. The calltrace is at the end of the dmesg I posted.

When I was testing earlier & bisecting, I did not have CONFIG_CPA_DEBUG
enabled. Instead of just waiting, I had to trigger the bug by logging into X
and running some programs. I found "gitk" was a very reliable way to trigger
it.

Sorry about the logs. I misunderstood and thought you were talking about the
problems within the bisection "eclipse". (Most of that was build errors and
panics on early boot which leave no calltrace).

I didn't keep a persistent log of the BUGs myself. I'll see how easy it is to
dig them out from /var/log/messages. There were certainly a variety of
different ones and I can see how that could be useful.

When reproducing this from an X login, I ended up with a stream of BUGs which
soon left the computer unusable. I'll attach the log from my original report
which shows as much as I could capture of one of these sequences. If I can get
a population out of /var/log/messages then I'll limit them to the first three
BUGs per sequence.
Alan Jenkins Thu, 14 Aug 2008 01:00:27
Created an attachment (id=17228) [details]
Original BUG sequence (first one in ext3)

Here's the sequence of BUGs from my initial report.
Alan Jenkins Thu, 14 Aug 2008 01:15:31
Created an attachment (id=17232) [details]
BUG calltrace 1
Alan Jenkins Thu, 14 Aug 2008 01:15:50
Created an attachment (id=17233) [details]
Bug calltrace 2
Alan Jenkins Thu, 14 Aug 2008 01:16:05
Created an attachment (id=17234) [details]
Bug calltrace 3
Alan Jenkins Thu, 14 Aug 2008 01:16:20
Created an attachment (id=17235) [details]
Bug calltrace 4
Alan Jenkins Thu, 14 Aug 2008 01:16:34
Created an attachment (id=17236) [details]
Bug calltrace 5
Alan Jenkins Thu, 14 Aug 2008 01:23:40
That's all for now, I hope it helps our not good chances :-).

I selected a spread of different processes, and they include another trace in
ext3. I have lots more calltraces with git specifically (after I switched to
reproducing with gitk), but they all happen in copy_page_c and I didn't notice
any interesting differences.
Hugh Dickins Thu, 14 Aug 2008 07:56:28
On Thu, 14 Aug 2008, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #14 from alan-jenkins@tuffmail.co.uk 2008-08-14 00:57 -------
> The CPA test after resume *was* involved in the bug. "do_pageattr_test"
> is on the calltrace. The calltrace is at the end of the dmesg I posted.

Sorry, I was blind and now you've helped me see. Something about the
bugzilla mail format, I guess the http link always there at the top,
led me to miss the attachment completely. And thanks for all the
others you've sent: now all I need is time...
Alan Jenkins Sun, 17 Aug 2008 04:27:22
I've chipped away at this to reduce the suspect changes. What stands out now
is the setup unification. I can reproduce the bug after the sequence below. I
hope this helps in reducing the diff we have to look at.

# Get badness
git-checkout v2.6.26-rc9-00681-g1a98fd1

# Revert not-bad changes
git-revert d52d53b8a5b258bfaab9223a5e7284fcfdd48577
git-revert d8d5900ef8afc562088f8470feeaf17c4747790f

# docs, 32bit and mach are not relevant
git-checkout v2.6.26-rc9-547-ga939098 Documentation
git-checkout v2.6.26-rc9-547-ga939098 include/asm-x86/mach-*
git-checkout v2.6.26-rc9-547-ga939098 include/asm-x86/*_32.*
git-checkout v2.6.26-rc9-547-ga939098 arch/x86/*/*_32.*
git-checkout v2.6.26-rc9-547-ga939098 arch/x86/ia32
git-rm arch/x86/kernel/probe_roms_32.c

# Just need 2-line fix to build e820; drop the other changes
git-checkout v2.6.26-rc9-547-ga939098 arch/x86/kernel/e820.c
sed -i "s/\([^_a-zA-Z]\)end_pfn/\1max_pfn/g" arch/x86/kernel/e820.c

# Can also drop all these changed files
xargs git-checkout v2.6.26-rc9-547-ga939098 << EOF
arch/x86/kernel/acpi/sleep.c
arch/x86/kernel/apic_64.c
arch/x86/kernel/asm-offsets_64.c
arch/x86/kernel/cpu/amd_64.c
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/perfctr-watchdog.c
arch/x86/kernel/efi.c
arch/x86/kernel/head.c
arch/x86/kernel/head64.c
arch/x86/kernel/io_apic_64.c
arch/x86/kernel/machine_kexec_64.c
arch/x86/kernel/nmi.c
arch/x86/kernel/paravirt.c
arch/x86/kernel/paravirt_patch_64.c
arch/x86/kernel/probe_roms_32.c
arch/x86/kernel/process_64.c
arch/x86/kernel/smpboot.c
arch/x86/kernel/vsyscall_64.c
arch/x86/mm/fault.c
arch/x86/mm/numa_64.c
arch/x86/mm/pgtable.c
arch/x86/power/hibernate_64.c
arch/x86/xen/enlighten.c
arch/x86/xen/setup.c
fs/Kconfig
include/asm-x86/apic.h
include/asm-x86/cmpxchg_64.h
include/asm-x86/e820.h
include/asm-x86/elf.h
include/asm-x86/hw_irq.h
include/asm-x86/io_apic.h
include/asm-x86/msr.h
include/asm-x86/nmi.h
include/asm-x86/numa_64.h
include/asm-x86/paravirt.h
include/asm-x86/pgalloc.h
include/asm-x86/required-features.h
mm/page_alloc.c
EOF
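
For readers unsure what the sed line in the script above does: it renames end_pfn to max_pfn only where the name is preceded by a character that cannot be part of an identifier. A rough Python equivalent, as a sketch. The sample lines in the demo are made up for illustration and are not taken from e820.c.

```python
import re

# Same idea as: sed -i "s/\([^_a-zA-Z]\)end_pfn/\1max_pfn/g"
# The captured character before "end_pfn" is kept; only the name changes.
PATTERN = re.compile(r"([^_a-zA-Z])end_pfn")


def rename_end_pfn(line):
    """Apply the end_pfn -> max_pfn rename to one line of source text."""
    return PATTERN.sub(r"\1max_pfn", line)


if __name__ == "__main__":
    for sample in ("if (pfn > end_pfn)",      # renamed
                   "x = last_end_pfn;",       # untouched: '_' precedes the name
                   "foo(end_pfn_map);"):      # renamed too: no check on what follows
        print(rename_end_pfn(sample))
```

Two properties of the pattern are worth knowing. It needs one preceding character, so an occurrence at the very start of a line is left alone. It also does not look at what follows, so longer names that begin with end_pfn are rewritten as well.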
Rafael J Wysocki Sun, 17 Aug 2008 05:14:09
On Sunday, 17 of August 2008, Hugh Dickins wrote:
> On Sat, 16 Aug 2008, Rafael J. Wysocki wrote:
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11237
> > Subject : corrupt PMD after resume
> > Submitter : Alan Jenkins <alan-jenkins@tuffmail.co.uk>
> > Date : 2008-08-02 9:51 (15 days old)
> > References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4
> > Handled-By : Hugh Dickins <hugh@veritas.com>
>
> Definitely should still be listed: Alan has verified it still happens
> with -rc3. I keep on going back to look at the info he's sent, to
> try and work out what might be happening and what to try next.
Alan Jenkins Wed, 20 Aug 2008 02:19:04
I found a patch for the early boot problems
(http://lists-archives.org/linux-kernel/16541925-next-0704-x86_64-panics-on-booting.html),
which allowed me to isolate the bug to a specific commit,
4f9c11dd49fb73e1ec088b27ed6539681a445988.

In other words, this bug appears to be a duplicate of Bug #11313.
Rafael J Wysocki Wed, 20 Aug 2008 04:00:43
Thanks for identifying the offending commit:

commit 4f9c11dd49fb73e1ec088b27ed6539681a445988
Author: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Wed Jun 25 00:19:19 2008 -0400

x86, 64-bit: adjust mapping of physical pagetables to work with Xen

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

but I'm not sure if that's a duplicate. IMO it's yet another regression caused
by the same patch.
Hugh Dickins Wed, 20 Aug 2008 05:27:07
Great work, Alan, you're a hero.

I agree with you that it looks like a duplicate of Bug #11313
(not a bug I'd paid any attention to until you mentioned it here):
again bad PMDs, again often hitting in clear_page_c or copy_page_c.

But we needn't debate that: the thing is to find a fix to either
and then check if it fixes the other.

I was looking at init_memory_mapping() and what it calls last week
(for the much more trivial business of /proc/meminfo DirectMap lines)
so I'll have another look around there now - but probably with the
same success I've had so far :(

Ah, good, jeremy@goop.org is now on the Cc list for this one.
Jeremy Fitzhardinge Wed, 20 Aug 2008 08:57:24
Hm, yes, I wouldn't say it's obviously a duplicate. However:
- the change in question works fine on many, many machines
- in this bug, the failure is after a suspend/resume cycle
- in Bug #11313 the failure is when plugging the HDMI connector

In other words, both actions which involve the BIOS. I suspect some bad
interaction with the firmware, but I don't have any good theories.

Could you try reverting just a6523748bddd38bcec11431f57502090b6014a96, while
leaving 4f9c... applied? Could you also apply the attached patch, and post the
boot-time dmesg output?

What's the hardware? How much memory does it have? Is there a firmware update
available?

In both cases, the oops is in ext3, but that could just be because it does most
of the memory touching. In this case it's simply a not-present pmd, but in
#11313 it's corrupted. But that could just be a coincidence depending on
whether the P bit is set in the random corruption.
Jeremy Fitzhardinge Wed, 20 Aug 2008 08:58:41
Created an attachment (id=17333) [details]
Print info on initial kernel mappings
Alan Jenkins Wed, 20 Aug 2008 09:03:13
(In reply to comment #28)
> In both cases, the oops is in ext3, but that could just be because it does most
> of the memory touching. In this case it's simply a not-present pmd, but in
> #11313 it's corrupted. But that could just be a coincidence depending on
> whether the P bit is set in the random corruption.
>

ext3 is a red herring - sometimes it shows up in the traces, more often it
doesn't. As Hugh says, it's mainly clear_page_c, copy_page_c, also memset_c.
Alan Jenkins Wed, 20 Aug 2008 09:19:23
(In reply to comment #28)
> Hm, yes, I wouldn't say it's obviously a duplicate. However:
> - the change in question works fine on many, many machines
> - in this bug, the failure is after a suspend/resume cycle
> - in Bug #11313 the failure is when plugging the HDMI connector
>
> In other words, both actions which involve the BIOS. I suspect some bad
> interaction with the firmware, but I don't have any good theories.
>
> Could you try reverting just a6523748bddd38bcec11431f57502090b6014a96, while
> leaving 4f9c... applied? Could you also apply the attached patch, and post the
> boot-time dmesg output?
>
> What's the hardware? How much memory does it have?

Desktop, Intel Core 2 Duo 6420. Intel 965-something chipset. 2G ram.

> Is there a firmware update
> available?

Not sure, but I'd be unlikely to try it if there was one. dmidecode says the
BIOS is "Phoenix Technologies 6.00 PG", released 07/26/2007.
Alan Jenkins Wed, 20 Aug 2008 09:33:20
Created an attachment (id=17334) [details]
dmesg with patch applied

info on initial kernel mappings
Alan Jenkins Wed, 20 Aug 2008 09:54:55
(In reply to comment #28)
> Could you try reverting just a6523748bddd38bcec11431f57502090b6014a96, while
> leaving 4f9c... applied?

Done - the paging error still happens.
Jeremy Fitzhardinge Wed, 20 Aug 2008 10:06:31
(In reply to comment #33)
> Done - the paging error still happens.
>

Well, consistent with #11313 at least.
H Peter Anvin Wed, 20 Aug 2008 10:21:53
Could we get a dump of the full kernel page tables?

To do so:

1. make sure the kernel is configured with CONFIG_X86_PTDUMP.
2. make sure debugfs is mounted
(mount -t debugfs none /sys/kernel/debug)
3. cat /sys/kernel/debug/kernel_page_tables > kernel_page_tables.txt
Alan Jenkins Thu, 21 Aug 2008 02:06:26
Created an attachment (id=17352) [details]
Page tables from 2.6.27-rc3
Jeremy Fitzhardinge Thu, 21 Aug 2008 11:10:46
[ 0.000000] kernel direct mapping tables up to 7f6e0000 @ 8000-c000
                                                          ^^^^^^^^^
[ 0.000000] #5 [0000008000 - 000000a000] PGTABLE ==> [0000008000 - 000000a000]
                ^^^^^^^^^^   ^^^^^^^^^^^
Jeremy Fitzhardinge Thu, 21 Aug 2008 11:11:22
Duplicate of Bug #11313
Jeremy Fitzhardinge Thu, 21 Aug 2008 11:11:41
(or vice versa, given the ordering...)
Rafael J Wysocki Thu, 21 Aug 2008 11:27:42
*** Bug 11313 has been marked as a duplicate of this bug. ***
Jeremy Fitzhardinge Thu, 21 Aug 2008 11:43:58
Could you post the full dmesg output for a boot from before
4f9c11dd49fb73e1ec088b27ed6539681a445988?
Alan Jenkins Thu, 21 Aug 2008 12:25:48
Created an attachment (id=17363) [details]
2.6.26 dmesg

Here's dmesg from 2.6.26. I could do
4f9c11dd49fb73e1ec088b27ed6539681a445988~1 tomorrow, if that would be more
useful.
Jeremy Fitzhardinge Thu, 21 Aug 2008 12:28:00
4f9c11dd49fb73e1ec088b27ed6539681a445988~1 would be useful, but the 2.6.26 is
an interesting base for comparison.
Alan Jenkins Fri, 22 Aug 2008 01:36:19
Created an attachment (id=17368) [details]
"last known good" dmesg

Ok, this is what I get from 4f9c11dd49fb73e1ec088b27ed6539681a445988~1, after
applying ingo's boot fix as above
(http://lists-archives.org/linux-kernel/16541925-next-0704-x86_64-panics-on-booting.html).
Jeremy Fitzhardinge Fri, 22 Aug 2008 15:38:15
I'm losing track a bit here. Do you have a dmesg of exactly
4f9c11dd49fb73e1ec088b27ed6539681a445988 (+fix, if needed) as well? I'd like
to get a minimal difference of good vs bad.

At the moment, comparing the 4f9c11dd49fb73e1ec088b27ed6539681a445988~1 vs
"2.6.26-rc9-00681-g1a98fd1 boot messages" shows this difference:

-[ 0.000000] early res: 5 [8000-afff] PGTABLE
+[ 0.000000] #5 [ 0000008000 - 0000009000 ] PGTABLE ===> [ 0000008000 - 0000009000 ]

(afff vs 9000 (=8ffff)) but it's unclear to me whether that's a real problem or
just a difference in how the mapping is performed.
Jeremy Fitzhardinge Fri, 22 Aug 2008 15:39:10
er, 9000 == 8fff
Alan Jenkins Sat, 23 Aug 2008 01:09:27
Created an attachment (id=17381) [details]
dmesg from "bad" kernel

Good point.

Here's 4f9c11dd49fb73e1ec088b27ed6539681a445988 + boot fix.
Rafael J Wysocki Sat, 23 Aug 2008 12:18:02
Handled-By : Jeremy Fitzhardinge <jeremy@goop.org>
Jeremy Fitzhardinge Sat, 23 Aug 2008 16:41:45
OK, this shows:
-[ 0.000000] early res: 5 [8000-afff] PGTABLE
+[ 0.000000] early res: 5 [8000-8fff] PGTABLE

But this is expected, because the patch is reusing the boot-time pagetables
rather than allocating new ones. There are no other significant differences,
which is also expected.

Now, the question is whether this is leading to the reported bug? I'll put
together a patch to revert the reuse to see if that makes the problem go away.

Also, does booting with mem=1G or other values change the way it crashes?
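
The two log formats quoted in this thread print the PGTABLE reservation differently. The older "early res" line gives an inclusive last byte. The newer "#5 [...]" line appears to give an exclusive end, which is what the "9000 == 8fff" remark amounts to. A small sketch for turning either form into a page count, assuming 4 KiB pages (the helper names are hypothetical):

```python
PAGE_SIZE = 4096  # assumed x86 base page size


def pages_inclusive(start, last):
    """Pages in a range printed as [start-last], where last is the final byte."""
    return (last + 1 - start) // PAGE_SIZE


def pages_exclusive(start, end):
    """Pages in a range printed as [start - end], where end is one past the range."""
    return (end - start) // PAGE_SIZE


if __name__ == "__main__":
    print(pages_inclusive(0x8000, 0xafff))   # "early res: 5 [8000-afff] PGTABLE"
    print(pages_inclusive(0x8000, 0x8fff))   # "early res: 5 [8000-8fff] PGTABLE"
    print(pages_exclusive(0x8000, 0x9000))   # "#5 [ 0000008000 - 0000009000 ]"
```

Read this way, the good kernel reserves three pages for page tables at 0x8000 and the bad one reserves a single page, consistent with the explanation that the patch reuses the boot-time page tables instead of allocating new ones.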
zajec5 Sun, 24 Aug 2008 08:01:26
I can not test 4f9c11dd49fb... with mem=X option as ACPI doesn't work in system
booted this way (http://bugzilla.kernel.org/show_bug.cgi?id=11313#c40).

However, booting 2.6.27-rc4 with mem=1G or mem=2G gives a STABLE system after
playing with the HDMI port! I can plug in the HDMI cable, unplug it, and start X
(init 5)!

Alan: how does mem=X work in your case?
Hugh Dickins Mon, 25 Aug 2008 02:51:28
My suspicion (but sorry, I'm about to go out, and haven't written a
patch for you to verify this) is that Jeremy's patch gets implicated,
not because of any error in that patch, but because it changes which
physical pages are used for which page tables, moving corruption from
somewhere it would never get noticed (unless perhaps the kernel tried
accessing random addresses) to somewhere it is now sure to be noticed.

That "start = 0x8000;" in arch/x86/mm/init_64.c's find_early_table_space:
I wonder what the history of that particular offset is, and whether it
should be something else in the case of your machines (either for good
reason, or for hacky avoid-bug-in-BIOS reason).

Easy enough for you to patch that to "start = 0x10000;" say; but we
also need to memset page 8 (in Alan's case) or page 11 (in Rafal's)
and check for corruption thereafter, to see if there's any truth in
my suspicion.
Hugh Dickins Mon, 25 Aug 2008 04:19:54
Here's such a patch as I had in mind, which boots okay but stays silent
for me. I've no special reason for placing the check in do_page_fault,
just a nearby sourcefile which will check often. Against 2.6.27-rc4.

--- 2.6.27-rc4/arch/x86/mm/fault.c 2008-07-29 04:24:15.000000000 +0100
+++ linux/arch/x86/mm/fault.c 2008-08-25 12:11:59.000000000 +0100
@@ -680,6 +680,17 @@ void __kprobes do_page_fault(struct pt_r
error_code |= PF_USER;
again:
#endif
+ {
+ unsigned long *addr = (unsigned long *)__va(0x8000);
+ while (addr < (unsigned long *)__va(0xc000)) {
+ if (*addr) {
+ printk("%p: %lx\n", addr, *addr);
+ *addr = 0;
+ }
+ addr++;
+ }
+ }
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
--- 2.6.27-rc4/arch/x86/mm/init_64.c 2008-08-21 05:52:51.000000000 +0100
+++ linux/arch/x86/mm/init_64.c 2008-08-25 12:11:59.000000000 +0100
@@ -468,6 +468,12 @@ static void __init find_early_table_spac
* need roughly 0.5KB per GB.
*/
start = 0x8000;
+ while (start < 0xc000) {
+ void *adr = early_ioremap(start, PAGE_SIZE);
+ memset(adr, 0, PAGE_SIZE);
+ early_iounmap(adr, PAGE_SIZE);
+ start += PAGE_SIZE;
+ }
table_start = find_e820_area(start, end, tables, PAGE_SIZE);
if (table_start == -1UL)
panic("Cannot find space for the kernel page tables");
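
For anyone who wants to see what the do_page_fault hunk above does without building a kernel, here is a userspace model of its loop. It is a sketch only: it walks a bytearray instead of kernel memory, treats each unsigned long as a little-endian 64-bit word, and takes the virtual base address as a parameter. The value planted in the demo is made up.

```python
import struct

WORD = 8  # sizeof(unsigned long) on x86-64


def scan_and_clear(mem, base):
    """Report every nonzero 64-bit word in mem, then zero it.

    mem  -- bytearray standing in for the pages at 0x8000-0xc000
    base -- address to report for offset 0 (the kernel patch uses __va(0x8000))
    Returns a list of (address, value) pairs, like the printk in the patch.
    """
    hits = []
    for off in range(0, len(mem) - WORD + 1, WORD):
        (val,) = struct.unpack_from("<Q", mem, off)
        if val:
            hits.append((base + off, val))
            struct.pack_into("<Q", mem, off, 0)   # mirrors "*addr = 0;"
    return hits


if __name__ == "__main__":
    pages = bytearray(0xc000 - 0x8000)            # four zeroed pages
    struct.pack_into("<Q", pages, 0x100, 0xdeadbeef)   # made-up corruption
    for addr, val in scan_and_clear(pages, 0x8000):
        print("%x: %x" % (addr, val))
```

As in the patch, each hit is cleared once reported, so a second pass over the same buffer finds nothing unless something writes to it again.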
Jeremy Fitzhardinge Mon, 25 Aug 2008 10:38:11
(In reply to comment #51)
> My suspicion (but sorry, I'm about to go out, and haven't written a
> patch for you to verify this) is that Jeremy's patch gets implicated,
> not because of any error in that patch, but because it changes which
> physical pages are used for which page tables, moving corruption from
> somewhere it would never get noticed (unless perhaps the kernel tried
> accessing random addresses) to somewhere it is now sure to be noticed.

Yes, that's my suspicion as well. We would have seen many more bug reports if
there had been something inherently wrong with this patch.

> Easy enough for you to patch that to "start = 0x10000;" say; but we
> also need to memset page 8 (in Alan's case) or page 11 (in Rafal's)
> and check for corruption thereafter, to see if there's any truth in
> my suspicion.

The real difference my change makes is that it continues to use
init_level4_pgt, level3_ident_pgt, level3_kernel_pgt, level2_fixmap_pgt,
level1_fixmap_pgt, level2_ident_pgt, level2_kernel_pgt and level2_spare_pgt in
head_64.S. These are all early in the .text segment, nestled among code which
is never used again after boot.

The same 0x8000-0x[ac]000 space is used either way, so I'm not sure that
scanning that region for corruption will really help much.
H Peter Anvin Mon, 25 Aug 2008 10:58:19
It's certainly not hard to imagine the first 64K being clobbered by firmware
during resume. It might be a good idea to add a kernel boot-time
configuration option to reserve low memory beyond the first 4K (which we always
reserve.)
Jeremy Fitzhardinge Mon, 25 Aug 2008 11:32:23
In the oops in "Kernel log 2.6.27-rc3 init=/bin/bash" the pgd entry is "0", and
cr3=0000000000201000 - which is init_level4_pgt. That's always used as
init_mm.pgd, and its use is unchanged by my patch.
Jeremy Fitzhardinge Mon, 25 Aug 2008 11:34:52
(In reply to comment #54)
> It's certainly not hard to imagine the first 64K being clobbered by firmware
> during resume. It might be a good idea to add a kernel boot-time
> configuration option to reserve low memory beyond the first 4K (which we always
> reserve.)

Yes, but 4f9c11dd49fb73e1ec088b27ed6539681a445988 doesn't affect whether memory
at 0x8000 is used or not, just how much memory is being used. And given that
the oops is showing that a pmd in 0x8000 has been corrupted, it's really
unclear what's going on. One possibility is that memory was always being
corrupted there, and it's only now significant. Maybe Hugh's patch will point
something out...
Hugh Dickins Mon, 25 Aug 2008 18:48:24
Yes, the page at 0x8000 is used in all these cases (without my hack-patch).
My point is that, depending on whereabouts it's used in the pgd/pud/pmd
hierarchy, a corrupted area can easily get to be used for a part of the
vast address space that never actually corresponds to addresses that are
in use; but with the different usage of pages in 4f9c11dd, the corrupted
area gets shifted into a significant position, in the direct 1:1 map.

(I do have some doubts about your reuse of the head_64.S pagetables,
in particular the way level2_ident_pgt omits the NX bit but the direct
map usually sets it. But that's a very different issue, which I'd
rather come back to after this corruption question is sorted out.)
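
To put rough numbers on the point about position in the hierarchy: the same damaged slot in a table page affects very different address ranges depending on which level that page serves. The sketch below assumes standard x86-64 4-level paging with 512 eight-byte entries per table, and assumes the table is the first one at its level for the region it maps. The slot offset used in the demo is just an example.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# Address range covered by ONE entry at each level (x86-64, 4-level paging).
ENTRY_SPAN = {
    "pte": 4 * KIB,
    "pmd": 2 * MIB,
    "pud": 1 * GIB,
    "pgd": 512 * GIB,
}


def slot_index(byte_offset):
    """Index of the 8-byte entry containing byte_offset within a table page."""
    return byte_offset // 8


def slot_range(level, index, region_start=0):
    """Half-open [start, end) address range mapped by entry `index`."""
    start = region_start + index * ENTRY_SPAN[level]
    return start, start + ENTRY_SPAN[level]


if __name__ == "__main__":
    idx = slot_index(0x3e8)                       # example offset inside the page
    for level in ("pmd", "pud", "pgd"):
        start, end = slot_range(level, idx)
        print(level, idx, hex(start), hex(end))
```

On a 2GB machine, slot 125 of a first-level direct-map PMD covers RAM around the 250 MiB mark, which is in constant use. The same slot in a PUD or PGD page would cover addresses far beyond installed memory, where a bad entry could sit unnoticed.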
Alan Jenkins Tue, 26 Aug 2008 01:43:45
mem=1G works for me.

Hugh's patch also fixes it. Here's the output:

[ 116.546675] PM: Finishing wakeup.
[ 116.546675] Restarting tasks ... ffff8800000083e8: 803c85370cfc0000
[ 116.553658] ffff8800000083f0: 3000
[ 116.564012] done.
Hugh Dickins Tue, 26 Aug 2008 06:00:26
Thanks, Alan. Right, this fills me with shame, it tells us what
we knew three weeks ago, that suspend+resume corrupts the long at
0xffff8800000083ea to 0x3000803c85370cfc.
(But Rafał's case is not identical.)
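
The two dumped words can be combined into that single value by hand. A minimal sketch, assuming little-endian x86 byte order; the helper name straddling_long is hypothetical and only illustrates the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the 64-bit little-endian value that starts `byte_off` bytes into a
 * pair of adjacent, aligned 64-bit words. `lo` is the word at the lower
 * address, `hi` the word right after it. byte_off must be in 0..8.
 */
static uint64_t straddling_long(uint64_t lo, uint64_t hi, unsigned byte_off)
{
	assert(byte_off <= 8);
	if (byte_off == 0)
		return lo;
	if (byte_off == 8)
		return hi;
	/* Upper bytes of lo become the low bytes; low bytes of hi fill the top. */
	return (lo >> (8 * byte_off)) | (hi << (8 * (8 - byte_off)));
}
```

With lo = 0x803c85370cfc0000 (the word dumped at ...83e8), hi = 0x3000 (the word at ...83f0) and byte_off = 2, this yields 0x3000803c85370cfc. Under that reading the non-zero bytes run from ...83ea through ...83f1, which is why the damage shows up in two consecutive dump lines.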

And the message comes after "Restarting tasks" merely because of
my lazy (well, rushed) positioning of the check in do_page_fault.

Please let's now try what I originally intended way back then:
revert that patch and apply this patch below, which keeps the same
check through four pages (though in your case we now know that only
one long is corrupted to non-0), but does it during suspend+resume.

I've added the prefix "Corrupted" to make the messages in question
easier to find: please post those messages along with their context
of surrounding messages. I'm no expert on hardware initialization
or BIOS, but if it's just a BIOS issue then I'd expect the messages
to appear just before the first "EARLY resume" message, whereas if
a driver or device issue then just after its (perhaps LATE) suspend
or (perhaps EARLY) resume message.

--- 2.6.27-rc4/arch/x86/mm/init_64.c 2008-08-21 05:52:51.000000000 +0100
+++ linux/arch/x86/mm/init_64.c 2008-08-25 12:11:59.000000000 +0100
@@ -468,6 +468,12 @@ static void __init find_early_table_spac
* need roughly 0.5KB per GB.
*/
start = 0x8000;
+ while (start < 0xc000) {
+ void *adr = early_ioremap(start, PAGE_SIZE);
+ memset(adr, 0, PAGE_SIZE);
+ early_iounmap(adr, PAGE_SIZE);
+ start += PAGE_SIZE;
+ }
table_start = find_e820_area(start, end, tables, PAGE_SIZE);
if (table_start == -1UL)
panic("Cannot find space for the kernel page tables");
--- 2.6.27-rc4/drivers/base/power/main.c 2008-07-29 04:21:46.000000000 +0100
+++ linux/drivers/base/power/main.c 2008-08-26 13:20:00.000000000 +0100
@@ -17,6 +17,7 @@
* subsystem list maintains.
*/

+#define DEBUG
#include
#include
#include
@@ -263,6 +264,14 @@ static char *pm_verb(int event)

static void pm_dev_dbg(struct device *dev, pm_message_t state, char *info)
{
+ unsigned long *addr = (unsigned long *)__va(0x8000);
+ while (addr < (unsigned long *)__va(0xc000)) {
+ if (*addr) {
+ printk("Corrupted %p: %lx\n", addr, *addr);
+ *addr = 0;
+ }
+ addr++;
+ }
dev_dbg(dev, "%s%s%s\n", info, pm_verb(state.event),
((state.event & PM_EVENT_SLEEP) && device_may_wakeup(dev)) ?
", may wakeup" : "");
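
The check-and-clear loop in the second hunk can be tried outside the kernel. A userspace sketch, assuming an ordinary buffer stands in for the zeroed pages at 0x8000-0xc000; scan_and_clear is a hypothetical name, not something from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Walk `nwords` longs starting at `base`. Every non-zero word is reported
 * with the "Corrupted" prefix and then reset to zero, so a later scan only
 * reports damage that happened since the previous one.
 * Returns the number of non-zero words found.
 */
static size_t scan_and_clear(unsigned long *base, size_t nwords)
{
	size_t hits = 0;

	assert(base != NULL);
	for (size_t i = 0; i < nwords; i++) {
		if (base[i]) {
			printf("Corrupted %p: %lx\n", (void *)&base[i], base[i]);
			base[i] = 0;
			hits++;
		}
	}
	return hits;
}
```

Calling it from every suspend/resume debug message, as the patch does with pm_dev_dbg(), narrows the corruption down to the interval between two consecutive messages.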
Ingo Molnar Tue, 26 Aug 2008 06:31:11
> Please let's now try what I originally intended way back then: revert
> that patch and apply this patch below, which keeps the same check
> through four pages (though in your case we now know that only one long
> is corrupted to non-0), but does it during suspend+resume.

btw., if you know the exact corruption pattern you might want to utilize
ftrace's function tracing callback to check for the corruption in a
brute-force way, by essentially doing the check for every kernel
function that gets called, and to generate a one-time stackdump of the
incident. (save_stack_trace() can be used to display that dump later on
- often the thing that triggers such corruption cannot do a printk)

Ingo
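
The idea of checking on every function entry and recording only the first incident can be sketched in userspace. This is not the ftrace API: watch_word, check_on_entry and struct first_hit are hypothetical names, and the caller string stands in for the stack trace that save_stack_trace() would capture in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of the first call at which the watched word was non-zero. */
struct first_hit {
	bool seen;
	const char *caller;
	unsigned long value;
};

static volatile unsigned long *watched;
static struct first_hit hit;

/* Start watching one word that is expected to stay zero. */
static void watch_word(volatile unsigned long *addr)
{
	assert(addr != NULL);
	watched = addr;
	hit.seen = false;
	hit.caller = NULL;
	hit.value = 0;
}

/*
 * Meant to run at every function entry. It only stores data and never
 * prints, because the code that triggers the corruption may not be able
 * to log. Only the first incident is kept.
 */
static void check_on_entry(const char *caller)
{
	if (hit.seen || !watched || *watched == 0)
		return;
	hit.seen = true;
	hit.caller = caller;
	hit.value = *watched;
}
```

The stored record can be printed later from a context where logging is safe.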
Alan Jenkins Tue, 26 Aug 2008 07:59:01
Well, I did discourage you by saying "scary", not asking for a patch, and then
doing my own thing instead. (I thought your suggestion was too brittle because
the PMD fault numbers varied; it was non-obvious that the corruption was always
in the same place).

Blame the BIOS.

[ 101.456977] agpgart-intel 0000:00:00.0: LATE suspend
[ 101.456977] Back to C!
[ 101.456977] Corrupted ffff8800000083e8: 803c85370cfc0000
[ 101.456977] Corrupted ffff8800000083f0: 3000
[ 101.456977] agpgart-intel 0000:00:00.0: EARLY resume

So from what hpa says my hardware will need a special boot option to avoid
suffering from this bug in future? Though I could just use the existing memmap
option ("memmap=64K$0", I think) to mark the first 64K as reserved.
Jeremy Fitzhardinge Tue, 26 Aug 2008 08:18:12
Hm, a tradeoff. Should we burn 64k everywhere to deal with a couple of
machines, when there's a kernel parameter which will fix it? The trouble is
knowing when you need to use the parameter.

It would be interesting to see if Rafał can reproduce this corruption - then
we can generalize something about it.
H Peter Anvin Tue, 26 Aug 2008 09:38:48
You're right, the memmap= option should do that.

Keep in mind, too, how little 64K is in the modern world. An x86 machine with
128 MB is considered anemic today, and that would amount to losing 0.05% of its
memory.

However, before considering whether or not to do that generically, I'd like a
confirmation that "memmap=64K$0" works at all.
Hugh Dickins Tue, 26 Aug 2008 09:46:07
Blame the BIOS, yes, it looks like that, and I don't suppose we'll
find out any more about it. Ingo's ftrace suggestion was helpful,
hadn't crossed my mind, but I don't think it would tell us more.

And using the existing memmap= boot parameter is good thinking, yes,
I'd rather we use what's already available than add something extra.
Even if the first 64kB is theoretically more vulnerable than the rest,
we haven't noticed that in the past, and it seems just a freak effect
of Jeremy's patch that we notice it now in your case. (There may be
others already using memmap= to avoid such corruption, who could now
stop doing so because of Jeremy's rearrangement.)

" memmap=4k!32k" should be enough for you, Alan. And using memmap=
will be safer than my patch - poking around afterwards, I notice that
the way I coded it, page 8 wasn't used in the pagetables, but was freed
for use by the system afterwards, so left you liable to worse random
corruption (though I think the page allocator tends to leave those
lowest pages free - hmm, if so, might that be why we haven't noticed
such corruptions in the past??).

And " memmap=4k!44k" is likely to be good for Rafał, though I would
like to hear first what his HDMI case shows with my first patch (the
one with the check in do_page_fault, since suspend+resume is not
relevant to his case).

(By the way, back in #9, I said that I wanted to check that your
global flag is being properly set, since its absence was causing
CPA selftest noise during your bisection. That was resolved by
the -rc3 pagetables you showed in #36, which show GLB throughout
the Low Kernel Mapping: so we need not worry any more about that.)
Alan Jenkins Tue, 26 Aug 2008 09:52:01
(In reply to comment #63)
> You're right, the memmap= option should do that.
>
> Keep in mind, too, how little 64K is in the modern world. An x86 machine with
> 128 MB is considered anemic today, and that would amount to losing 0.05% of its
> memory.
>
> However, before considering whether or not to do that generically, I'd like a
> confirmation that "memmap=64K$0" works at all.
>

Yes, it does work.
Jeremy Fitzhardinge Tue, 26 Aug 2008 10:06:13
(In reply to comment #64)
> Blame the BIOS, yes, it looks like that, and I don't suppose we'll
> find out any more about it. Ingo's ftrace suggestion was helpful,
> hadn't crossed my mind, but I don't think it would tell us more.
>
> And using the existing memmap= boot parameter is good thinking, yes,
> I'd rather we use what's already available than add something extra.
> Even if the first 64kB is theoretically more vulnerable than the rest,
> we haven't noticed that in the past, and it seems just a freak effect
> of Jeremy's patch that we notice it now in your case. (There may be
> others already using memmap= to avoid such corruption, who could now
> stop doing so because of Jeremy's rearrangement.)
>
> " memmap=4k!32k" should be enough for you, Alan. And using memmap=
> will be safer than my patch - poking around afterwards, I notice that
> the way I coded it, page 8 wasn't used in the pagetables, but was freed
> for use by the system afterwards, so left you liable to worse random
> corruption (though I think the page allocator tends to leave those
> lowest pages free - hmm, if so, might that be why we haven't noticed
> such corruptions in the past??).

Quite possible they were happening without noticeable effect. Those pages were
still being used for pieces of pagetable, but they may have corresponded to
some of the vast no-mans land of unused address space.

> And " memmap=4k!44k" is likely to be good for Rafał, though I would
> like to hear first what his HDMI case shows with my first patch (the
> one with the check in do_page_fault, since suspend+resume is not
> relevant to his case).

Yes. I would really like to confirm that these are actually duplicate bugs.
And I wonder how many other machines have similar problems? Maybe banning the
first 64k really is the right answer.

> (By the way, back in #9, I said that I wanted to check that your
> global flag is being properly set, since its absence was causing
> CPA selftest noise during your bisection. That was resolved by
> the -rc3 pagetables you showed in #36, which show GLB throughout
> the Low Kernel Mapping: so we need not worry any more about that.)

Yes, I fixed the inconsistency of __KERNEL_X vs KERNEL_X, where the former
doesn't have _PAGE_GLOBAL set (meaning that the static head_64.S pagetables
didn't have it). Also, I changed the CPA test to use another page flag, since
_PAGE_GLOBAL isn't necessarily present on all kernel mappings anyway (old
32-bit and paravirtual 64-bit).
zajec5 Tue, 26 Aug 2008 12:48:10
I installed the 32-bit version of openSUSE 11.0 to verify a bug in iwlagn
(http://sourceforge.net/mailarchive/forum.php?thread_name=b170af450808120331j50a2e8a6o55b3412d8d24bbfa%40mail.gmail.com&forum_name=ipw3945-devel).

What's interesting is that I cannot reproduce this bug using openSUSE 11.0
32-bit with a self-compiled 2.6.27-rc4 using the same configuration as earlier.

I'll reinstall the 64-bit version of openSUSE and try these patches.
Alan Jenkins Tue, 26 Aug 2008 13:09:04
You could save yourself some effort and just build a 64 bit kernel. It will
happily run a 32-bit userspace.

I guess 32-bit page tables are laid out differently, and the corruption misses
the vital parts.
Jeremy Fitzhardinge Tue, 26 Aug 2008 13:17:41
Or the BIOS bug only manifests when the kernel is running in 64-bit mode. Not
unlikely, given the rarity of 64-bit Windows.
H Peter Anvin Tue, 26 Aug 2008 13:19:52
Well, the page tables are definitely laid out differently in 32- and 64-bit
mode. It is, in fact, one of the biggest differences between 32- and 64-bit
mode, and rather inherently so. So no surprise there.
Yinghai Lu Tue, 26 Aug 2008 13:28:14
Rafal, can you post dmesg for your 64bit kernel and 32bit kernel?
32bit is supposed to use RAM below 1M for pgtable too, as of 2.6.27-rc1.
Jeremy Fitzhardinge Tue, 26 Aug 2008 13:39:55
(In reply to comment #70)
> Well, the page tables are definitely laid out differently in 32- and 64-bit
> mode. It is, in fact, one of the biggest differences between 32- and 64-bit
> mode, and rather inherently so. So no surprise there.

They're not that different; I think 0x8000 can still be allocated for pagetable
use in 32-bit, and with PAE the entries are even the same format. That said, I
think we can fully unify init_mm pagetable construction now (in principle,
barring bugs like this).
H Peter Anvin Tue, 26 Aug 2008 13:42:07
Well, yes; however, right now they're constructed quite differently, which
certainly explains the difference in behaviour.
zajec5 Wed, 27 Aug 2008 07:26:21
OK, I tried a few posted patches

1) Clean 2.6.27-rc4 + patch from comment #51 (start = 0x10000)
Works fine, system doesn't crash (and keeps stable) after using HDMI "init 5".

2) Clean 2.6.27-rc4 + patch from comment #52 + prefix "fault.c" in printk
Works fine, system stable and doesn't crash. After plugging HDMI I get this in
dmesg:
fault.c: ffff88000000be98: b02a000400000000

3) Clean 2.6.27-rc4 + patch from comment #59
Works fine again, "Corrupted" msg doesn't appear in dmesg after using HDMI
port.

I will try memmap later.
zajec5 Thu, 28 Aug 2008 04:28:06
I was trying memmap options on clean 2.6.27-rc4.

1) memmap=64K$0
Works fine, HDMI doesn't crash my OS

2) memmap=4k!44k
OS doesn't boot:
Kernel alive
Kernel really alive
PANIC: early exception 0e rip 10:ffffffff8021fd37 error 0 cr2 ffffffffff5fc0f0
Alan Jenkins Thu, 28 Aug 2008 06:05:44
kernel-parameters.txt doesn't say anything about "!" character in memmap=. I
think Hugh meant "memmap=4k$44k".
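
To see which physical range a given "size$start" value reserves, the argument can be decoded by hand. A minimal userspace sketch, assuming the documented memmap=nn[KMG]$ss[KMG] form; parse_size, parse_reserve and covers are hypothetical helpers, not the kernel's own parser:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct mem_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/* Parse a number with an optional K/M/G suffix; *rest points past it. */
static uint64_t parse_size(const char *s, const char **rest)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*rest = end;
	return v;
}

/* Accept only "size$start"; any other separator is rejected. */
static bool parse_reserve(const char *arg, struct mem_range *out)
{
	const char *p;
	uint64_t size, start;

	assert(arg != NULL && out != NULL);
	size = parse_size(arg, &p);
	if (*p != '$')
		return false;
	start = parse_size(p + 1, &p);
	if (*p != '\0')
		return false;
	out->start = start;
	out->end = start + size;
	return true;
}

/* True if physical address `phys` lies inside the reserved range. */
static bool covers(const struct mem_range *r, uint64_t phys)
{
	return phys >= r->start && phys < r->end;
}
```

Decoded this way, "4k$44k" reserves 0xb000-0xbfff, which contains the physical address 0xbe98 behind the ffff88000000be98 report, and "64K$0" reserves 0x0-0xffff, which contains both that address and 0x83e8.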
Hugh Dickins Thu, 28 Aug 2008 08:40:44
Aagh, sorry for wasting your time again, indeed I meant "$" not "!"

And regarding your results from the patches (in #74): yes, results
are as expected (the last staying silent because it would only notice
during suspend+resume, not during HDMI plugging). But just as in
Alan's case, they don't actually tell us anything new, just confirm
what already appeared to be the case; and I doubt we shall learn any
more about it.

I'll leave it to Jeremy, Ingo and hpa to weigh up whether these two
cases now justify reserving the first 64kB on x86_64: I'm unsure.
Ingo Molnar Thu, 28 Aug 2008 08:54:16
> I'll leave it to Jeremy, Ingo and hpa to weigh up whether these two
> cases now justify reserving the first 64kB on x86_64: I'm unsure.

definitely - could you please send a patch for it?

I'd suggest making the initial version DMI quirk driven instead of
generic - in the hope of this being a one-off (or twice-off) anomaly.

Could everyone who is affected by this please attach dmidecode output?

If it turns out to be more common than suspected, we could still make it
unconditional in the future.

Ingo
Alan Jenkins Thu, 28 Aug 2008 09:00:08
Created an attachment (id=17508)
dmidecode on faulty system

Hah. Well, here's my DMI info. Problem is it's useless, at least if you're
limited to the same IDs as s2ram uses.

sudo s2ram -i
This machine can be identified by:
sys_vendor = "OEM"
sys_product = "OEM"
sys_version = "OEM"
bios_version = "6.00 PG"
Ingo Molnar Thu, 28 Aug 2008 09:05:04
> Hah. Well, here's my DMI info. Problem is it's useless, at least if you're
> limited to the same IDs as s2ram uses.
>
> sudo s2ram -i
> This machine can be identified by:
> sys_vendor = "OEM"
> sys_product = "OEM"
> sys_version = "OEM"
> bios_version = "6.00 PG"

bah ...

is there any indication about which exact area the BIOS really needs?
Maybe it's the first 8K instead of the first 4K? Wasting +4K of RAM is
not a big deal. Wasting 60K (on all systems, all around the globe) we
should try to avoid.

Or is there perhaps an indication somewhere about which area to protect?
Does the EBDA show it perhaps?

Ingo
Jeremy Fitzhardinge Thu, 28 Aug 2008 09:26:44
I think we should unconditionally reserve the first 64k, and add a debug option
to do a corruption scan along the lines of Hugh's patch. That way we can get a
sense of how common this kind of lowmem corruption is.
Ingo Molnar Thu, 28 Aug 2008 09:30:59
> I think we should unconditionally reserve the first 64k, and add a
> debug option to do a corruption scan along the lines of Hugh's patch.
> That way we can get a sense of how common this kind of lowmem
> corruption is.

ok. Please send a patch - if it's unintrusive enough we might still be
able to get it into 2.6.27.

Ingo
zajec5 Thu, 28 Aug 2008 12:54:23
Just to make the diagnosis 100% sure: memmap=4k$44k works fine.
zajec5 Thu, 28 Aug 2008 12:55:36
Created an attachment (id=17510)
dmidecode
Jeremy Fitzhardinge Thu, 28 Aug 2008 12:59:12
Created an attachment (id=17511)
Proposed patch for mainline to workaround this problem and detect other
instances of corruption

Alan, Rafał: could you test this patch and see if it solves the problem and
prints the expected warnings?
zajec5 Fri, 29 Aug 2008 03:01:39
Created an attachment (id=17526)
dmesg | grep -i corrup

After booting 2.6.27-rc5 with proposed patch applied I get only:
zajec@sony:~> dmesg | grep -i corr
scanning 2 areas for BIOS corruption

So I tried s2ram (-f -p -m, which works for me); the attachment is the output of
dmesg | grep -i corr after waking up.
Rafael J Wysocki Sat, 30 Aug 2008 14:59:17
Patch : http://marc.info/?l=linux-kernel&m=122001615314700&w=2
Andy Wettstein Fri, 12 Sep 2008 08:24:11
Just to try and confirm some suspicions that this bug may be widespread: I'm
seeing this problem on an MSI AMD socket 754 motherboard with VIA K8T800 chipset
and AMI BIOS running the 64 bit kernel. I get the "unable to handle kernel
paging request" when trying to resume from S3. Booting with memmap=64k$0 makes
suspend/resume work fine. The 32 bit kernel works fine without any extra boot
options.

Should I add any output to this bug report from this system?
Jeremy Fitzhardinge Fri, 12 Sep 2008 09:11:25
That's interesting to know. Could you include the output of the corruption
message here?

How much memory were you scanning for corruption? More than the default 64k I
guess.
Hugh Dickins Fri, 12 Sep 2008 09:25:27
Thanks for the report. It would be interesting to see how closely
yours resembles the previous two cases - but when I say interesting,
we'll probably just say, "ooh, that's interesting, it's different"
or "ooh, that's interesting, it's the same", and not much more -
so you may not want to go to great lengths to slake our curiosity!

But if you can, please attach your dmesg, preferably the whole dmesg
from boot through suspend/resume through the failure(s) - if there
are lots all showing the same PMD line, no need for more than one.

Even (slightly) better would be to apply the patch indicated in
#87 (Jeremy's made a further patch series since, but the one in
#87 should be good enough) and send the dmesg from boot through
suspend/resume and the next couple of minutes after resume (it
does a check every minute). I say _slightly_ better because I
don't really expect it to tell us anything more than the previous
dmesg; but it _might_ find some other places corrupted.

The 32-bit kernel is probably working fine because it's using the
corrupted pages in such a way the corruption doesn't hit anything
that matters: they may well be tied up in the GFP_DMA page reserve,
with nothing much needing those pages, or relying on their contents
across suspend/resume.

Thank you; but as I say, we probably won't work out anything much
from the info you supply, so don't break a leg getting it to us.
Jeremy Fitzhardinge Fri, 12 Sep 2008 09:45:04
(In reply to comment #90)
> Thanks for the report. It would be interesting to see how closely
> yours resembles the previous two cases - but when I say interesting,
> we'll probably just say, "ooh, that's interesting, it's different"
> or "ooh, that's interesting, it's the same", and not much more -
> so you may not want to go to great lengths to slake our curiosity!

Well, if we can get a sense of how common the problem is, we can determine
whether we should ban that area of memory by default. It's certainly not
"incredibly rare" any more.

> The 32-bit kernel is probably working fine because it's using the
> corrupted pages in such a way the corruption doesn't hit anything
> that matters: they may well be tied up in the GFP_DMA page reserve,
> with nothing much needing those pages, or relying on their contents
> across suspend/resume.

If the 32-bit kernel is scanning that area for corruption and not finding it,
then it means that the bios is only causing corruption when the kernel is
running in 64-bit mode. Which is interesting.
Hugh Dickins Fri, 12 Sep 2008 10:03:03
(In reply to comment #90)

I've assumed that Andy is not running a kernel with any scanning patches,
just observing corruption as it originally appeared: that the corruption
is below 64kB and so memmap=64k$0 inoculates against it successfully in
the 64-bit case, but it fell on fallow ground in the 32-bit case.

But it would be interesting to try the #87 patch with both 64-bit and
32-bit kernels (since it seems Andy's already equipped to try both),
to check that they then behave in exactly the same way, as we'd expect.
Andy Wettstein Fri, 12 Sep 2008 11:19:48
(In reply to comment #92)
> (In reply to comment #90)
>
> I've assumed that Andy is not running a kernel with any scanning patches,
> just observing corruption as it originally appeared: that the corruption
> is below 64kB and so memmap=64k$0 inoculates against it successfully in
> the 64-bit case, but it fell on fallow ground in the 32-bit case.
>
> But it would be interesting to try the #87 patch with both 64-bit and
> 32-bit kernels (since it seems Andy's already equipped to try both),
> to check that they then behave in exactly the same way, as we'd expect.

You are correct that I am not using the bios scanning corruption patch. I have
a 64 bit kernel compiled with it applied, but I haven't had time to test it yet
(There seems to also be a problem with suspend/resume with the motherboard's
Promise SATA controller, so that complicated the troubleshooting. I switched
to the VIA controller and that works fine).

I'll try and get the dmesg and test out the patched kernel sometime this
weekend.
Andy Wettstein Fri, 12 Sep 2008 15:46:24
Created an attachment (id=17755)
dmesg 2.6.26 suspend/resume
Andy Wettstein Fri, 12 Sep 2008 15:48:27
I've attached the dmesg for the suspend/resume problem.

I tried 2.6.27-rc5 with the patch applied. The machine resets itself on
resume, so that is a bit of a problem.
Jeremy Fitzhardinge Fri, 12 Sep 2008 16:17:22
Are you reporting that there's been a general suspend/resume regression? Does
it still fail when you boot with memmap=64k$0?
Andy Wettstein Fri, 12 Sep 2008 19:34:20
(In reply to comment #96)
> Are you reporting that there's been a general suspend/resume regression? Does
> it still fail when you boot with memmap=64k$0?
>

There is definitely something going on with 2.6.27 kernels. I just tested
without the patch applied both with and without the memmap option. In both
cases the machine resets itself when resuming.

I am using the debian experimental builds for 2.6.27-rc5 from here:
http://kernel-archive.buildserver.net/debian-kernel

I've only tested 64 bit so far.
Andy Wettstein Fri, 12 Sep 2008 21:44:14
Created an attachment (id=17756)
dmesg 2.6.27-rc5 bios corruption
Andy Wettstein Fri, 12 Sep 2008 21:50:03
Ok,

32 bit suspend/resume doesn't reboot itself. I compiled with the bios
corruption detection patch and attached the output.
Hugh Dickins Sat, 13 Sep 2008 01:42:17
I'm now thoroughly confused. I thought that patch (the one #87 points
to) only detects corruption between 0 and 0x10000, but your output is
showing corruption (or good use) between 0x243e0 and 0x268c4.

A bug in the patch, or you were running a later patchset from
Jeremy and changed the default, or I'm just in a muddle?

Sorry, I'll be offline for a day or so now: over to Jeremy I hope.
Jeremy Fitzhardinge Sat, 13 Sep 2008 07:55:15
Andy, as Hugh says, that output doesn't look right. It looks like the output
of my earlier buggy patch which scanned beyond the regions it was supposed to.

I also think the resume regression you're seeing is a separate bug, and you
should file a new one for it.
Rafael J Wysocki Sun, 14 Sep 2008 16:29:51
Andy, please file a separate bug report for the resume regression you're
seeing, thanks.
Andy Wettstein Sun, 14 Sep 2008 19:06:03
Created an attachment (id=17780)
corrected dmesg 2.6.27-rc6 bios corruption
Andy Wettstein Sun, 14 Sep 2008 19:07:40
Yes, you were right about the bad patch. Sorry. I've updated to the newer one
and attached the new output. Hopefully it looks better.
Ingo Molnar Sun, 14 Sep 2008 23:36:14
> [...] I've updated to the newer one and attached the new output.
> Hopefully it looks better.

thanks - it shows corruption in the 0xc000-0xc400 range. That's a 1 KB
block at 48KB. The content looks structured and intentional, but i don't
recognize it straight away.

i'm wondering, what's the EBDA of this box? The bootup log suggests it's
at 0x9f400:

[000009f400 - 0000100000] BIOS reserved

but i'm not completely sure.

i've added a printout into tip/master:

# 23ea780: x86: print out EBDA/lowmem address

could you please try to boot the latest tip/master:

http://people.redhat.com/mingo/tip.git/README

and post a new dmesg? (straight after bootup - but it's fine with a
corruption message included.) tip/master has all the detection patches
included.

the BIOS itself seems quirkable as well:

RSDT 3FFF0000, 002C (r1 AMIINT VIA_K8

do we have the dmidecode info of this system attached already?

Ingo
Andy Wettstein Mon, 15 Sep 2008 15:01:10
Created an attachment (id=17793)
msi dmidecode
Andy Wettstein Mon, 15 Sep 2008 15:03:47
Created an attachment (id=17794)
dmesg from tip

Here is the dmesg from tip with the EBDA printout and I've attached the
dmidecode from this machine, too.
Ingo Molnar Tue, 16 Sep 2008 00:42:27
> Here is the dmesg from tip with the EBDA printout and I've attached
> the dmidecode from this machine, too.

thanks - the EBDA is at the expected place:

[ 0.000000] BIOS EBDA/lowmem at: 0009fc00/0009f400

so there's no clue i can see in the logs or in other system environment
data that would notify the kernel that there's some extra activity
expected at physical address 0xc000. Memory 0xc000 is marked by the BIOS
as general purpose RAM:

[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)

and we utilize that RAM in Linux and only reserve the first 4K [which is
customary BIOS scratch area] - so it's 635K of perfectly fine RAM, and
the kernel will break if anything gets modified in that RAM. The commit
that got identified in the bisection just happened to move around
allocations so that we broke in a more apparent (and more violent) way.
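
The two numbers in that argument can be checked against the e820 line quoted above. A small sketch, assuming a one-entry map copied from that log line; struct e820_range and both helpers are simplified, hypothetical stand-ins rather than the kernel's e820 code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one BIOS-e820 line: [start, end) and its type. */
struct e820_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
	bool usable;
};

/* True if `addr` falls inside a range the BIOS reported as usable RAM. */
static bool addr_is_usable(const struct e820_range *map, size_t n, uint64_t addr)
{
	assert(map != NULL);
	for (size_t i = 0; i < n; i++)
		if (addr >= map[i].start && addr < map[i].end)
			return map[i].usable;
	return false;
}

/* KiB left in [0, end) after setting aside the first `reserved` bytes. */
static uint64_t kib_after_reserve(uint64_t end, uint64_t reserved)
{
	assert(reserved <= end);
	return (end - reserved) / 1024;
}
```

For the range 0x0-0x9fc00, address 0xc000 is reported as usable, and kib_after_reserve(0x9fc00, 0x1000) gives 635, matching the "635K" figure.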

So, based on your dmidecode info i've created a quirk for AMI BIOSen, to
reserve 0xc000 forcibly, and applied it to the
tip/x86/memory-corruption-check tree:

# 8a64124: x86: add DMI quirk for AMI BIOS which corrupts address 0xc000 during resume
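
The general shape of such a quirk is a table of BIOS identifiers that is matched against the machine's DMI strings at boot. A userspace sketch of that matching only: the struct layouts, the table and the vendor string "American Megatrends" are assumptions for illustration and are not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel's DMI data. */
struct dmi_info {
	const char *bios_vendor;
	const char *sys_vendor;
};

struct bios_quirk {
	const char *ident;		/* label printed when the quirk fires */
	const char *bios_vendor_substr;	/* substring to look for */
};

static const struct bios_quirk bad_bios_quirks[] = {
	{ "AMI BIOS", "American Megatrends" },	/* assumed vendor string */
};

/* Return the first quirk whose substring occurs in the BIOS vendor, or NULL. */
static const struct bios_quirk *match_quirk(const struct dmi_info *info)
{
	size_t n = sizeof(bad_bios_quirks) / sizeof(bad_bios_quirks[0]);

	assert(info != NULL);
	for (size_t i = 0; i < n; i++) {
		if (info->bios_vendor &&
		    strstr(info->bios_vendor, bad_bios_quirks[i].bios_vendor_substr))
			return &bad_bios_quirks[i];
	}
	return NULL;
}
```

A machine whose DMI fields only say "OEM", like the one reported earlier in this thread, would not match a table like this, which is the weakness of a quirk-driven approach compared with reserving the range unconditionally.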

Could you please check the end result and try latest tip/master with
CONFIG_X86_CHECK_BIOS_CORRUPTION _disabled_? The kernel should just work
out of box, and you should get the new DMI quirk printk during bootup.
Please attach that dmesg output too, so that we can double check the end
result.

If this commit works as expected then this is queued up for v2.6.28
merging as part of the x86 tree, alongside the memory-corruption-check
feature.

Thanks,

Ingo
Ingo Molnar Tue, 16 Sep 2008 01:20:12