arm64 autopkgtests are flaky

Bug #1785262 reported by Olivier Tilloy
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
LibreOffice
Confirmed
Medium
libreoffice (Ubuntu)
Fix Released
Medium
Heather Ellsworth
Groovy
Fix Released
Undecided
Unassigned

Bug Description

[Verification and regression analysis is all handled via the main SRU bug of LP: #1913353]

See http://autopkgtest.ubuntu.com/packages/libreoffice/cosmic/arm64.

In the past 30 days, 8 out of 45 test runs failed. That a 18% failure rate. It would be good to understand if all those failures are similar, what's causing them, and if they can easily be made more robust.

Olivier Tilloy (osomon)
Changed in libreoffice (Ubuntu):
assignee: nobody → Olivier Tilloy (osomon)
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Olivier Tilloy (osomon) wrote :

3 failures out of 54 test runs in the past 30 days, that's a 5.6% failure rate.

Revision history for this message
Olivier Tilloy (osomon) wrote :

1 failure out of 57 test runs in the past 30 days, that's a 1.75% failure rate.

Olivier Tilloy (osomon)
Changed in libreoffice (Ubuntu):
assignee: Olivier Tilloy (osomon) → nobody
Revision history for this message
In , Michael-stahl (michael-stahl) wrote :
Download full text (32.3 KiB)

Description:

current master f1604675e71c67024887d12bf73ccb86ac05a7a3

the problem is:
* thread 6 runs async call SwXTextDocument::release() which blocks to acquire SolarMutex
* thread 1 holds SolarMutex, blocks on sync call on same aThreadId

Steps to Reproduce:
have only seen it once

Actual Results:
(gdb) thread apply all bt

Thread 8 (Thread 0x7faa995c7700 (LWP 612145)):
#0 0x00007faabb116610 in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007faabb10efe1 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2 0x00007faabb995ba6 in osl_acquireMutex(oslMutex) (pMutex=0x8b33b0) at /work/lo/master/sal/osl/unx/mutex.cxx:100
#3 0x00007faab3a42867 in osl::Mutex::acquire() (this=0x8b85b8) at /work/lo/master/include/osl/mutex.hxx:57
#4 0x00007faab43aa1ba in SvpSalYieldMutex::doAcquire(unsigned int) (this=0x8b85b0, nLockCount=1) at /work/lo/master/vcl/headless/svpinst.cxx:360
#5 0x00007faaa90e8d6e in comphelper::SolarMutex::acquire(unsigned int) (this=0x8b85b0, nLockCount=1) at /work/lo/master/include/comphelper/solarmutex.hxx:86
#6 0x00007faaa90e91a5 in osl::Guard<comphelper::SolarMutex>::Guard(comphelper::SolarMutex&) (this=0x7faa995c5d80, t=...) at /work/lo/master/include/osl/mutex.hxx:135
#7 0x00007faaa90e8dc3 in SolarMutexGuard::SolarMutexGuard() (this=0x7faa995c5d80) at /work/lo/master/include/vcl/svapp.hxx:1359
#8 0x00007faaa9158476 in framework::OComponentAccess::createEnumeration() (this=0x5890aa0) at /work/lo/master/framework/source/helper/ocomponentaccess.cxx:55
#9 0x00007faabb67ff28 in gcc3::callVirtualMethod(void*, unsigned int, void*, _typelib_TypeDescriptionReference*, bool, unsigned long*, unsigned int, unsigned long*, double*) (pThis=0x5890ac8, nVtableIndex=5, pRegisterReturn=0x7faa995c5f20, pReturnTypeRef=0x9f66b0, bSimpleReturn=false, pStack=0x7faa995c5f30, nStack=0, pGPR=0x7faa995c6060, pFPR=0x7faa995c6090) at /work/lo/master/bridges/source/cpp_uno/gcc3_linux_x86-64/callvirtualmethod.cxx:133
#10 0x00007faabb67eb7b in cpp_call(bridges::cpp_uno::shared::UnoInterfaceProxy*, bridges::cpp_uno::shared::VtableSlot, typelib_TypeDescriptionReference*, sal_Int32, typelib_MethodParameter*, void*, void**, uno_Any**) (pThis=0x58914b0, aVtableSlot=..., pReturnTypeRef=0x9f66b0, nParams=0, pParams=0x0, pUnoReturn=0x588dce0, pUnoArgs=0x0, ppUnoExc=0x7faa995c6200) at /work/lo/master/bridges/source/cpp_uno/gcc3_linux_x86-64/uno2cpp.cxx:233
#11 0x00007faabb67f596 in bridges::cpp_uno::shared::unoInterfaceProxyDispatch(uno_Interface*, typelib_TypeDescription const*, void*, void**, uno_Any**) (pUnoI=0x58914b0, pMemberDescr=0x9f8190, pReturn=0x588dce0, pArgs=0x0, ppException=0x7faa995c6200) at /work/lo/master/bridges/source/cpp_uno/gcc3_linux_x86-64/uno2cpp.cxx:413
#12 0x00007faaa9f7a3d7 in binaryurp::IncomingRequest::execute_throw(binaryurp::BinaryAny*, std::__debug::vector<binaryurp::BinaryAny, std::allocator<binaryurp::BinaryAny> >*) const (this=0x5891210, returnValue=0x7faa995c65c0, outArguments=0x7faa995c6640) at /work/lo/master/binaryurp/source/incomingrequest.cxx:236
#13 0x00007faaa9f79280 in binaryurp::IncomingRequest::execute() const (this=0x5891210) at /work/lo/master/binaryurp/source/incomingrequest.cxx:79
#14 0x00007faaa9f9d...

Revision history for this message
In , Michael-stahl (michael-stahl) wrote :

Created attachment 156969
backtrace without linebreaks

Revision history for this message
In , Sbergman (sbergman) wrote :

The problem is that thread 1 frame 13 makes a call on a UNO object while it allegedly holds the SolarMutex.

The remote side servicing that call apparently sends back an async release call. However, async calls are executed on fresh threads (thread 6 here), see <https://wiki.openoffice.org/wiki/Uno/Remote/Specifications/Uno_Remote_Protocol#Threads>: "Whenever an asynchronous request message becomes the first entry in a queue for a given TID (of some UNO thread h), the receiving side starts to sequentially execute the computation steps corresponding to the requested UNO interface method, in a new (previously immaterial) thread h′. It uses the TID of thread h′ for all request messages sent while executing these computation steps. When it is finished with the computation steps, thread h′ transitions from active to done, and the asynchronous request message is removed from the head of the queue."

Iain Lane (laney)
tags: added: rls-hh-incoming
Revision history for this message
Iain Lane (laney) wrote :

Christian raised this with us just now - I think we should consider taking a look at this.

Looking at

http://autopkgtest.ubuntu.com/packages/libreoffice/hirsute/arm64
http://autopkgtest.ubuntu.com/packages/libreoffice/hirsute/armhf

the tests are basically OK but one of the uicheck-* ones ends up timing out often enough that it can be described as flaky. When this happens it causes a problem for other migrations, doubly so as the runs take many hours to complete whether they fail or not, which slows down migration speed.

If it's not possible to fix the timeouts (feels like it might be hard or take a while) then disabling these subtests (only) on arm would be an option - stable tests with less coverage are better than flaky ones with more.

We had

https://git.launchpad.net/~libreoffice/ubuntu/+source/libreoffice/commit/?id=baf5a581da7ce3bb27a5b8ed44fadeb6e245e1ff

in the past which might be a basis for what to do now, but that would want to be extended to arm64 too.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This recently showed up often enough that I bothered to find this bug :-/
I think it degraded again into a state where we should consider:

IRC:
[13:16] <cpaelzer> Was there any hope to not expect this to be a retry-monster forever?
[13:17] <Laney> The desktop team should be asked if they can work on making that less flaky
[13:17] <Laney> I don't think ignoring the tests there is a good answer
[13:18] <Laney> Reducing the coverage would be preferable if that gets it to be more stable, for example, assuming the flakiness can't be easily fixed

I have a simple test stats gatherer [1] that also shows the subtests which helps a lot in those cases to identify potential candidates to skip.

I attached the output, best viewed in monospace
AFAICS up until recently we had <15% on all architectures.
But in Hirsute on arm* I see a bump to 22% of the subtest "uicheck-sw".
Could be bad-luck / noise - but maybe it is worth considering to skip that on arm?

[1]: https://git.launchpad.net/~ubuntu-server/ubuntu-helpers/tree/cpaelzer/check-autopkgtest-stats.sh

Revision history for this message
Heather Ellsworth (hellsworth) wrote :

I agree that the uicheck-sw test has increased in flakiness. How about instead of removing the test or sinking the resources into actually fixing them right now, I can mark them flaky so that at least a failed test looks like a skipped test:

https://salsa.debian.org/ci-team/autopkgtest/blob/master/doc/README.package-tests.rst

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1785262] Re: arm64 autopkgtests are flaky

On Fri, Nov 20, 2020 at 9:00 PM Heather Ellsworth
<email address hidden> wrote:
>
> I agree that the uicheck-sw test has increased in flakiness. How about
> instead of removing the test or sinking the resources into actually
> fixing them right now, I can mark them flaky so that at least a failed
> test looks like a skipped test:

Would work for me, can we mark flaky-on-armhf only (I'd not be aware)?

Changed in libreoffice (Ubuntu):
assignee: nobody → Heather Ellsworth (hellsworth)
Revision history for this message
Sebastien Bacher (seb128) wrote :

Tagging notfixing, the bug is assigned and we will try to get it fixed rather sooner than later but we don't believe it's a release issue

tags: added: rls-hh-notfixing
removed: rls-hh-incoming
Revision history for this message
Iain Lane (laney) wrote :

> we don't believe it's a release issue

It would be good to elaborate when making such comments.

It's a drag on the archive in general when we have flaky tests, and even more so when they take many hours and potentially multiple retries to succeed.

What's the criteria you're using here? Are we saying that "release" is only for user facing issues and bugs which have a negative impact on development of Ubuntu itself aren't to go through this process? If so, I think the team should come up with a way of being able to raise such issues.

That would seem to rule out FTBFS for example too.

Revision history for this message
Iain Lane (laney) wrote :

> Would work for me, can we mark flaky-on-armhf only (I'd not be aware)?

The only way I could think to do this would be to duplicate the test and use "Architecture:" to run the "Restrictions: flaky" one on armhf arm64 only and the other one otherwise.

In this case the failure is a *timeout* though so I'm not sure that Restrictions: flaky works since it is being killed from the outside by autopkgtest itself (the test runner). That would need to be tested to make sure it does. You could do that by making a simple (in another package, doesn't have to be libreoffice) which has "Restricitons: flaky" and runs something like "sleep infinity". See what happens when autopkgtest times out the test - you can pass --timeout-test to reduce this value for testing.

Onto the test itself a bit.

When the "uicheck-sw" test passes, it takes in the region of 20-25 minutes. Picking two recent results:

armhf:

autopkgtest [05:43:20]: test uicheck-sw: [-----------------------
autopkgtest [06:03:54]: test uicheck-sw: -----------------------]
autopkgtest [06:03:58]: test uicheck-sw: - - - - - - - - - - results - - - - - - - - - -
uicheck-sw PASS

arm64:

autopkgtest [02:52:01]: test uicheck-sw: [-----------------------
autopkgtest [03:05:33]: test uicheck-sw: -----------------------]
autopkgtest [03:05:34]: test uicheck-sw: - - - - - - - - - - results - - - - - - - - - -
uicheck-sw PASS

but the timeout we have (this is a "large" test, it runs with higher timeouts than we use by default) is after over 5 *hours* 30 minutes. It's clearly wildly variable, that's more than 10× as much time and it still timed out.

If this is to do with the test runners being "loaded" (slower) vs "unloaded" (faster), then it could be a race condition that happens when the test runs slowly that causes it to get stuck?

I'm not sure if you want to go down the route of investigating the actual failure, though, but it does feel to me like this "shouldn't" be failing. We're not usually close to the limits and just being tipped over them sometimes, does look like the test is actually getting stuck. But it would require someone to actually look at the tests themselves to determine that.

Clearly having the tests working would be ideal; we want assurance that Libreoffice works on armhf and arm64.

In the absence of this, since we're just mashing the retry button and not actually trying to fix them (ATM), making this stop blocking us one way or another would be better than the status quo.

Revision history for this message
Sebastien Bacher (seb128) wrote :

> That would seem to rule out FTBFS for example too.

We have been rls-not-fixing build failures in the past as well yes, especially in non lts series

Revision history for this message
Heather Ellsworth (hellsworth) wrote :

After some more investigation, here are the main points that make this issue a tricky one to debug:
1. The test failure is not isolated. While the uicheck-sw test is the culprit, it is just an umbrella for many other tests. And it's those smaller tests that the failure occurs all over. This makes it hard to just "fix the tests".
2. I have not been able to reproduce this locally on a rpi4. Laney pointed out that perhaps having multiple containers running the same test might generate the load needed (assuming that plays a key role in the failures seen on LP) and I'll certainly give this a shot.
3. Currently if you look at the recent hirsute armhf and arm64 LO test results, you'll see that the rate of failure is way down (about 1/10 times). But just a couple of weeks ago the failure rate was much higher. This suggests that it's load on the nodes causing the problem because that's when the infra was quite busy.
4. Marking the test simply as "flaky" won't work because it's actually autopkgtest killing the process

The approach going forward is to try and be clever about letting this uicheck-sw test timeout and not be considered a failure:
https://git.launchpad.net/~hellsworth/+git/libreoffice/commit/?h=ubuntu-hirsute-7.0-workaround-arm-flakiness&id=63170c4e1824dcee45b670738ff47a6174a991a8

A test build on both armhf and arm64 to try out these changes:
https://launchpad.net/~hellsworth/+archive/ubuntu/lo-arm-flaky-testing

Revision history for this message
Rico Tzschichholz (ricotz) wrote :

This issue is not restricted to arm* but have been seen occasionally on amd64.
It effects the binary package build too, where these tests are part of "make check".

There are a lot tests disabled in 'sw/qa/uitest/writer_dialogs/openDialogs.py' for similar reasons.

So while it locks up at "test_21_.uno:EditGlossary (openDialogs.openDialogs) ... " I would give that a try:
https://git.launchpad.net/~libreoffice/ubuntu/+source/libreoffice/commit/?id=0136e1c7a8f8823cbd0e5027ed203ee0d2f71262

Changed in df-libreoffice:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
Iain Lane (laney) wrote : Re: [Bug 1785262] Re: arm64 autopkgtests are flaky

On Wed, Dec 09, 2020 at 10:46:52PM -0000, Rico Tzschichholz wrote:
> This issue is not restricted to arm* but have been seen occasionally on amd64.
> It effects the binary package build too, where these tests are part of "make check".
>
> There are a lot tests disabled in
> 'sw/qa/uitest/writer_dialogs/openDialogs.py' for similar reasons.
>
> So while it locks up at "test_21_.uno:EditGlossary (openDialogs.openDialogs) ... " I would give that a try:
> https://git.launchpad.net/~libreoffice/ubuntu/+source/libreoffice/commit/?id=0136e1c7a8f8823cbd0e5027ed203ee0d2f71262

Thanks for the bug reference - I had an unconfirmed theory that the test
was deadlocking somewhere!

I think Heather mentioned seeing this in other tests too - it would be
good to grep the failing logs to see... seems like the string when it
starts a test doesn't contain a \n, so it looks like:

  test_21_.uno:EditGlossary (openDialogs.openDialogs) ... autopkgtest [22:36:52]: ERROR: timed out on command "...

If it's only this one, or only a small number of tests which can hang
(on slow / loaded machines particularly it seems) always, then skipping
those few would be a reasonable solution, indeed.

But if this can apply to any of the uicheck tests, then I think I'm
still in favour of using timeout(1) to paper over the deadlock, by
setting our own timeout, lower than the autopkgtest one, and skipping if
we hit it.

Cheers,

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libreoffice - 1:7.0.4~rc2-0ubuntu2

---------------
libreoffice (1:7.0.4~rc2-0ubuntu2) hirsute; urgency=medium

  * Revert "Re-enable coinmp (LP: #1905422)" until MIR is sorted out

 -- Rico Tzschichholz <email address hidden> Mon, 14 Dec 2020 10:05:19 +0100

Changed in libreoffice (Ubuntu):
status: Triaged → Fix Released
Changed in libreoffice (Ubuntu):
status: Fix Released → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libreoffice - 1:7.1.1~rc2-2ubuntu1

---------------
libreoffice (1:7.1.1~rc2-2ubuntu1) hirsute; urgency=medium

  * Fix autopkgtest failures

 -- Rico Tzschichholz <email address hidden> Mon, 01 Mar 2021 11:17:32 +0100

Changed in libreoffice (Ubuntu):
status: Confirmed → Fix Released
description: updated
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Olivier, or anyone else affected,

Accepted libreoffice into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libreoffice/1:7.0.5-0ubuntu0.20.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in libreoffice (Ubuntu Groovy):
status: New → Fix Committed
tags: added: verification-needed verification-needed-groovy
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Groovy verified per LP: #1913353.

tags: added: verification-done-groovy
removed: verification-needed-groovy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libreoffice - 1:7.0.5-0ubuntu0.20.10.1

---------------
libreoffice (1:7.0.5-0ubuntu0.20.10.1) groovy; urgency=medium

  * New upstream release (LP: #1913353)

  [ Rico Tzschichholz ]
  * Update yaru icon style "2021-01-31"
  * Move libreoffice-sdbc-hsqldb to Recommends (LP: #1916786)
  * Fix Calc crash in ScSelectionTransferObj (LP: #1897784)
  * Drop fix-bluez-external.diff and unowinreg-static-libgcc.diff,
    applied upstream

  [ Heather Ellsworth ]
  * Add timeout to uicheck-sw test to allow it to handle timeouts under
    load better (LP: #1785262)

libreoffice (1:7.0.4-3) unstable; urgency=medium

  * debian/tests/control.in: *really* add libreoffice-writer dependency
    to uicheck-sc test

libreoffice (1:7.0.4-2) unstable; urgency=medium

  * debian/test/control: make uicheck-sc depend on libreoffice-writer, too
    (the openDialogs/uno.Show:License Dialog test opens a new "Writer/Web"
    document...)

libreoffice (1:7.0.4-1) unstable; urgency=medium

  * LibreOffice 7.0.4 final release (identical to rc2)

  * debian/patches/pdfium-m68k.diff: fix pdfium build on m68k

  * debian/rules, debian/control*in: s/noinsttests/noinsttest/, thanks
    lintian
  * debian/rules:
    - revert clang (>= 1:11) build-dep for buster-backports; doesn't exist in
      buster and we resort back to gcc
    - don't rm LICENSE.html, it is used by
      Help -> License Information -> Show License
  * debian/control.mediawiki.in: remove Homepage: (closes: #978713)
  * debian/*.mime: stop quoting %s (closes: #950319)

libreoffice (1:7.0.4~rc2-1) unstable; urgency=medium

  * New upstream release candidate

libreoffice (1:7.0.4~rc1-1) unstable; urgency=medium

  * New upstream release candidate

  * debian/patches/unowinreg-static-libgcc.diff: use -static-libgcc, see
    LO master commit 01241113947fc7bd7f7b765dd897bb023c8ca99c

  * debian/rules:
    - set MYSQL_FLAVOUR=mariadb (as it's used anyway, and upstreams internal
      copy is mariadb-connector-c anyway) and build-depend on libmariadb-dev
      instead of libmariadbclient-dev-compat (closes: #975481)
    - build-depend on clang (>= 1:11) on armel
    - fix nocheck -b builds...
    - use nowindows build profile for (not) building unowinreg
  * debian/rules, debian/control*in:
    - honour noinsttests build profile for (not) building
      -subsequentcheckbase/-smoketest-test-data
  * debian/rules, debian/control*in:
    - move unowinreg.dll into /usr/lib/i686-w64-mingw32 (the mingw g++
      compiler installs its .dlls into /usr/lib/gcc/i686-w64-mingw32)

 -- Rico Tzschichholz <email address hidden> Mon, 15 Mar 2021 19:01:12 +0100

Changed in libreoffice (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for libreoffice has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
In , Qa-admin-q (qa-admin-q) wrote :

Dear Michael Stahl (allotropia),

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3
. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.