failure to compile on x86 (32-bit) with "Control stack exhausted"

Bug #2000009 reported by Michael Pujos
This bug affects 2 people
Affects  Status   Importance  Assigned to  Milestone
SBCL     Invalid  Undecided   Unassigned

Bug Description

SBCL 2.2.10 (and 2.2.11) fails to compile on x86 32-bit.

It fails in make-target-2.sh, in "doing warm init - compilation phase", as can be seen at the end of this build log on OBS for openSUSE Tumbleweed:

https://build.openbuildservice.org/package/live_build_log/devel:languages:misc/sbcl/openSUSE_Tumbleweed/i586

It remains stuck for a good while on the "obj/from-xc/src/code/late-globaldb.lisp-obj" line before bailing out on that error.
I investigated and tried "--control-stack-size 4" but it does not help.
It builds fine on x86_64 so it seems a 32-bit specific issue.

Douglas Katzman (dougk) wrote :

I just built 32-bit x86 at release 2.2.10, 2.2.11, and the latest git revision and it all built fine.
I also tried downloading SBCL 1.4.3 from the binaries page and using that to build 2.2.11, which also worked.
We will need more information.

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Incomplete
Michael Pujos (bobbie424242) wrote :

Thank you for your quick reply and test.

Agreed, it must be a weird quirk of the x86 build environment employed on OBS, especially since v2.2.10 apparently built properly before (1.5 months ago), which I missed when submitting this bug report.

Michael Pujos (bobbie424242) wrote :

EDIT: either a problem with the x86 build environment, or some build dependency (maybe the compiler, gcc 12.2.1) that has been updated in these 1.5 months (as openSUSE TW is a rolling distro) that is causing the problem.
In any case, this is nothing obvious as I tried various things (disabling LTO, disabling openSUSE specific CFLAGS, disabling bootstrapping) without success.

Here's the full failing build log:

https://build.openbuildservice.org/public/build/devel:languages:misc/openSUSE_Tumbleweed/i586/sbcl/_log

Douglas Katzman (dougk) wrote :

I downloaded openSUSE-Tumbleweed-NET-i586-Current.iso, installed that under QEMU, installed gcc, make, git, and SBCL 1.4.3, which built SBCL just fine.
There's nothing in your build log which suggests how to proceed.

Michael Pujos (bobbie424242) wrote :

It's built via Open Build Service (OBS). The build can be reproduced on a local machine with the osc command line tool, which can be easily installed on an openSUSE distro (and is probably packaged on other distros as well):

https://en.opensuse.org/openSUSE:OSC

Once you have installed osc and created an account on OBS, you can check out the project:

osc -A https://api.opensuse.org checkout devel:languages:misc/sbcl && cd $_

and build it for i586 in a build sandbox:

osc build openSUSE_Factory i586

It is built from this RPM spec recipe:
https://build.openbuildservice.org/package/view_file/devel:languages:misc/sbcl/sbcl.spec?expand=1

For me, it fails exactly as the OBS online build used for the distro for which I linked the full log.

That build uses bootstrapping with sbcl 1.4.3. This issue also happens without bootstrapping, though (that is, using the existing sbcl 2.2.10 package compiled 1.5 months ago, at a time when it did compile in this environment).

I think the problem is either in the build environment, or in dependencies of that environment (that changed since it compiled successfully 1.5 months ago), or in something specified in the .spec file that causes a problem with the x86 build. It's probably not a problem with sbcl, or if it is, it must be triggered by a very specific combination of build conditions. In any case, no future version of sbcl is possible on i586 openSUSE TW until the cause is found.

Douglas Katzman (dougk) wrote :

I think that nobody among the SBCL developers has enough understanding of OBS to know how it would be harming the SBCL build; I think you're very much on your own.
(From my perspective, you lost me at "create an account". I don't know what that means and I don't want to know.)

One thing I can suggest is that you add something into handle_guard_page_triggered() in 'interrupt.c' to do:
 backtrace_from_context(context, 40000); // ridiculous number of frames, try showing them all

Michael Pujos (bobbie424242) wrote :

I'm not too surprised by the allergic reaction to "create an account", as this is exactly what I thought would happen when I wrote these instructions.

I will try your backtrace suggestion.

Michael Pujos (bobbie424242) wrote :

I added this line in interrupt.c with call to backtrace_from_context:

    if (lose_on_corruption_p) {
        backtrace_from_context(context, 40000);
        fake_foreign_function_call(context);
        lose("Control stack exhausted, fault: %p, PC: %p",
             addr, (void*)os_context_pc(context));
    }

As expected, it results in a giant backtrace with a bunch of blocks like below repeating, with the count (here 243) changing (usually between 200 and 250).

[ 443s] 0: fp=0xf7a5e5b8 pc=0x805e65a Foreign function fake_foreign_function_call_noassert
[ 443s] 1: fp=0xf7a5e5f8 pc=0x806556d Foreign function handle_guard_page_triggered
[ 443s] 2: fp=0xf7a5e648 pc=0x8075790 Foreign function (null)
[ 443s] 3: fp=0xf7a5e678 pc=0x805ed32 Foreign function (null)
[ 443s] 4: fp=0xf7a5ec78 pc=0xf7f6e580 Foreign function __kernel_rt_sigreturn
[ 443s] 5: fp=0xf7a5ecb8 pc=0x806556d Foreign function handle_guard_page_triggered
[ 443s] 6: fp=0xf7a5ed08 pc=0x8075790 Foreign function (null)
[ 443s] 7: fp=0xf7a5ed38 pc=0x805ed32 Foreign function (null)
[ 443s] 8: fp=0xf7a5f338 pc=0xf7f6e580 Foreign function __kernel_rt_sigreturn
[ 443s] 9: fp=0xf7a5f378 pc=0x806556d Foreign function handle_guard_page_triggered
[ 443s] 10: fp=0xf7a5f3c8 pc=0x8075790 Foreign function (null)
[ 443s] 11: fp=0xf7a5f3f8 pc=0x805ed32 Foreign function (null)
...
[ 443s] 243: fp=0xf7a77b78 pc=0x805ed32 Foreign function (null)

Michael Pujos (bobbie424242) wrote :

I think you can close this issue, as it is probably a problem in the build environment employed and I opened an issue on their github:

https://github.com/openSUSE/obs-build/issues/906

Douglas Katzman (dougk) wrote :

Before it got into an infinite loop of handle_guard_page_triggered, was there anything other than the repeating pattern?
Look at the oldest (highest-numbered) frames for anything different.
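
One way to follow this suggestion on a backtrace of this size is to collapse it into counts per (pc, function) pair, so that frames appearing only once stand out from the repeating pattern. The following is a minimal sketch, assuming the log lines are shaped like the backtrace excerpt earlier in this thread; the function name summarize_frames is hypothetical and not part of SBCL or OBS.

```shell
# Hypothetical helper: count how often each (pc, function) pair occurs in a
# backtrace log, rarest first. Assumes lines shaped like:
#   [ 443s] 12: fp=0x... pc=0x... Foreign function NAME
# Lines without a "pc=0x..." field are ignored.
summarize_frames() {
    sed -n 's/.*pc=\(0x[0-9a-fA-F]*\) \(.*\)$/\1 \2/p' "$1" \
        | sort \
        | uniq -c \
        | sort -n
}
```

Calling summarize_frames on the saved build log and reading the first few lines of output lists the rarest frames, which are the candidates for "anything different" from the handle_guard_page_triggered cycle.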

Michael Pujos (bobbie424242) wrote :

After much trial and error comparing the build logs of the last working and the failing builds, I finally found the cause of this problem.

The maintainer of the sbcl package (which I'm not) recently modified the build (RPM spec file), adding -D_GNU_SOURCE to CFLAGS, exported for use by all gcc invocations. With this comment:

"Inject -D_GNU_SOURCE to CFLAGS: fixes Fixes build issue due to O_LARGEFILE hiding behind feature test macro."

Which can be seen on line 136:
https://build.opensuse.org/package/view_file/openSUSE:Factory/sbcl/sbcl.spec?expand=1

This was done, because otherwise the build (both i586 and x86_64) would now fail on this error:

[ 39s] make: Entering directory '/home/abuild/rpmbuild/BUILD/sbcl-2.2.10/tools-for-build'
[ 39s] cc -I../src/runtime -fomit-frame-pointer -O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto grovel-headers.c -o grovel-headers
[ 39s] grovel-headers.c: In function 'main':
[ 39s] grovel-headers.c:228:32: error: 'O_LARGEFILE' undeclared (first use in this function)
[ 39s] 228 | defconstant("o_largefile", O_LARGEFILE);
[ 39s] | ^~~~~~~~~~~
[ 39s] grovel-headers.c:228:32: note: each undeclared identifier is reported only once for each function it appears in
[ 39s] make: *** [<builtin>: grovel-headers] Error 1

Note that this error was new, as compiler, gcc and build tools and other packages are an ever moving target in openSUSE TW.

So adding -D_GNU_SOURCE to CFLAGS made the build work on x86_64, but produced a broken sbcl binary on i586.

The proper workaround for the build to pass the compile error above is instead to add "-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64" to CFLAGS, which after some man reading, Googling and various failed attempts, I took from src/runtime/Config.x86-linux.
Maybe a change to the build system in SBCL could be done so this is not necessary.

I hadn't initially paid much attention to this -D_GNU_SOURCE addition, as I did not know what it did and thought it to be benign, until I read about it.
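
The effect of these feature test macros can be compared with a small probe. This is a sketch, assuming a Linux system with glibc and a C compiler reachable as cc (or through the CC variable); probe_largefile is a hypothetical helper, not part of the SBCL build. It compiles a tiny program with the given flags and prints O_LARGEFILE and sizeof(off_t). On a 32-bit glibc target, -D_GNU_SOURCE alone leaves off_t at 4 bytes, while -D_FILE_OFFSET_BITS=64 makes it 8 bytes. Whether that size difference is what broke the i586 binary is an assumption; the thread does not confirm the mechanism.

```shell
# Hypothetical helper: compile and run a probe with the given compiler flags.
# Fails at compile time if O_LARGEFILE is hidden behind a feature test macro.
probe_largefile() {
    probe_dir=$(mktemp -d) || return 1
    cat > "$probe_dir/probe.c" <<'EOF'
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
int main(void) {
    printf("O_LARGEFILE=%#o sizeof(off_t)=%zu\n",
           (unsigned)O_LARGEFILE, sizeof(off_t));
    return 0;
}
EOF
    # "$@" carries the flags under test, e.g. -D_GNU_SOURCE or -m32.
    ${CC:-cc} "$@" "$probe_dir/probe.c" -o "$probe_dir/probe" \
        && "$probe_dir/probe"
    probe_status=$?
    rm -rf "$probe_dir"
    return $probe_status
}
```

To compare the two approaches from this comment, run it once with -D_GNU_SOURCE and once with -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64, adding -m32 on a 64-bit host that has 32-bit development libraries installed.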

Stas Boukarev (stassats) wrote :

Building vanilla sbcl gives

cc -I../src/runtime -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -m32 -fno-omit-frame-pointer grovel-headers.c -ldl -lpthread -lzstd -o grovel-headers

So someone first had to remove "-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64" before breaking it with -D_GNU_SOURCE

Changed in sbcl:
status: Incomplete → Invalid
Michael Pujos (bobbie424242) wrote :

Nobody explicitly removed "-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64", but thanks to your hint I noticed that this line in the spec file is now causing the problem when it did not before (commenting out that line makes the problem go away):

export CFLAGS="%optflags"

%optflags is an RPM macro that expands to the standard optimization flags on openSUSE.
It obviously never contained "-D_LARGEFILE_SOURCE ..." before, which I suppose is added to CFLAGS by the relevant src/runtime/Config.* file

Interestingly, make was updated to v4.4 on Oct 31st in openSUSE TW and it has a huge list of changes (some marked backward incompatible), so I would not be surprised if the problem is there.

Anyway, I will stop here. Thanks to all of you for helping.

Michael Pujos (bobbie424242) wrote :

After digging even more, it turns out this issue is caused by a bug in GNU make 4.4, which should be reproducible on any normal sbcl build with make 4.4.

It does not matter whether CFLAGS is exported with a value or not, as I wrongly stated in the post above.

In make-target-1.sh we have this line to compile grovel-headers:

$GNUMAKE -C tools-for-build -I../src/runtime grovel-headers

The include directory specified with -I is not seen at all in tools-for-build/Makefile, so in that file genesis/Makefile.features and Config are not included (which fails silently because of the -include directive). As a result, CFLAGS is missing "-D_LARGEFILE_SOURCE..." and the build hits the error in #11.

This can be trivially worked around with this line, which works:

(cd tools-for-build ; $GNUMAKE -I../src/runtime grovel-headers)

I have not seen this issue on the make bugtracker, and I will report it.

So a faulty i586 sbcl build caused by a faulty workaround, leading to another workaround, to finally a bug in make 4.4...
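
The difference between the two make invocations in this comment can be checked outside the SBCL tree. The following is a sketch under stated assumptions: GNU make is available as make (or through GNUMAKE), and the directory layout, the one-line Config and the show target are stand-ins invented for illustration, not the real SBCL files. On an unaffected make both lines should print the flag; according to this report, make 4.4 prints an empty CFLAGS for the -C form.

```shell
# Hypothetical reproduction: a tiny tree mimicking the layout, where
# tools-for-build/Makefile optionally includes src/runtime/Config via -I.
repro_make_include() {
    repro_root=$(mktemp -d) || return 1
    mkdir -p "$repro_root/src/runtime" "$repro_root/tools-for-build"
    # Stand-in for src/runtime/Config: sets a single flag.
    printf '%s\n' 'CFLAGS = -D_FILE_OFFSET_BITS=64' \
        > "$repro_root/src/runtime/Config"
    # Stand-in Makefile: "-include" ignores a missing file, then the
    # "show" target prints whatever CFLAGS ended up being.
    {
        printf '%s\n' '-include Config' 'show:'
        printf '\t%s\n' '@echo "CFLAGS=$(CFLAGS)"'
    } > "$repro_root/tools-for-build/Makefile"

    # Form used in make-target-1.sh: change directory with -C.
    printf '%s' '-C form: '
    (cd "$repro_root" && ${GNUMAKE:-make} --no-print-directory \
        -C tools-for-build -I../src/runtime show)
    # Workaround form: cd first, then pass -I.
    printf '%s' 'cd form: '
    (cd "$repro_root/tools-for-build" && ${GNUMAKE:-make} \
        --no-print-directory -I../src/runtime show)
    repro_status=$?
    rm -rf "$repro_root"
    return $repro_status
}
```

Because -include never reports a missing file, comparing the printed CFLAGS is the only visible sign that the include was skipped.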

Michael Pujos (bobbie424242) wrote :

make 4.4 regression reported here:

https://savannah.gnu.org/bugs/?63552

Stas Boukarev (stassats) wrote :

Is it maybe because of the ".."?

Michael Pujos (bobbie424242) wrote :

I tried that and... nope.
