Comment 18 for bug 488354

Revision history for this message
In , Jacob Bramley (jacob-bramley) wrote :

That's a neat solution. The only worry I have is the fiddling of the stack pointer within C code; that is likely to break even in the same compiler with unrelated code changes causing it to prod the stack immediately before calling "func".

Actually, I'd be interested in seeing the assembly for that, as I'd like to see what it's doing differently.

(In reply to comment #11)
> Note I'm not entirely sure the attached patch is totally safe for any compiler,
> but it works properly when built with gcc 4.4. If this is not the case, I'm
> pretty sure this can be made safer.

That doesn't matter; the original assembly was extremely compiler-dependent!

> Also note invoke_count_words could also be removed, to improve speed even more,
> as per bug 530087.

Not without sacrificing memory: invoke_copy_to_stack on WinCE fills the stack downwards and avoids calling invoke_count_words to work out how much space to reserve. On Linux, with EABI, 64-bit arguments must be 64-bit aligned, so it's impossible for invoke_copy_to_stack to know where to put each argument (with one pass) if it's filling them backwards; the alignment requirements on an argument early in the list can shift everything above it.

The best solution I have is to reserve arg_count*2 words on the stack and waste stack space if we have 32-bit arguments. That will allow us to stick to one pass, skipping invoke_copy_to_stack entirely and filling the arguments forwards with proper alignment, at the cost of some stack space.

(In reply to comment #12)
> (In reply to comment #11)
> > I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T
> > target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t
> > target)
>
> (Note this was compiled for ARMv4T, but run on an ARMv5T)

"bl" and "blx" will exhibit equivalent performance; the only difference is when interworking is involved. Return stack prediction is used for both (but maybe not on an ARMv5T device).

There are two issues here. I think you know these anyway, but for the sake of documentation & clarity in the bug:
 * Correctness on a system with mixed ARM & Thumb code: "blx" _must_ be used here.
 * Performance:
   - ARMv4T interworking branches aren't return-stack-predicted on newer cores, and this costs a huge amount of performance. (I've seen figures ranging from 5% to 50%, but I don't know what would happen in this particular case.)
   - ARMv4 long-range branches involve doing "mov pc, <dst_reg>". That's what the old code used to do. ("bl" can't branch to an address in a register and is restricted to PC±32MB.)

----

I was planning to work on this (and 530087) further over the next few days because it's been in my "to do" list for ages, but you're welcome to take it if you want to.