CPU Load Avg calculation gets very confused by multiple recv()s on the same PF_UNIX/SOCK_DGRAM socket

Bug #379452 reported by Yann Leboulanger
This bug affects 1 person
Affects: linux (Ubuntu) | Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

The load average goes high (around 4) when using this module. To reproduce, just open a Python console and type:
>>> import libasyncns
>>> asyncns = libasyncns.Asyncns()

Now, in another console, watch your load average increase.

Revision history for this message
Damien Thébault (damien-thebault) wrote :

That's right, it happens with the attached simple C code too, which is equivalent to the given python code.
So the problem is in the underlying libasyncns library.

Revision history for this message
Damien Thébault (damien-thebault) wrote : Re: [Bug 379452] Re: High load

On Fri, May 22, 2009 at 18:16, Damien Thébault
<email address hidden> wrote:
> That's right, it happens with the attached simple C code too, which is equivalent to the given python code.
> So the problem is in the underlying libasyncns library.
>
> ** Attachment added: "asyncns-0.7-load.c"
>   http://launchpadlibrarian.net/27057398/asyncns-0.7-load.c

CC'ing libasyncns' author Lennart Poettering about this bug.
Lennart, what do you think about this?

--
Damien Thebault

Revision history for this message
Yann Leboulanger (asterix) wrote :

Lennart Poettering wrote:
> On Fri, 22.05.09 18:27, Damien Thébault (<email address hidden>) wrote:
>
>> CC'ing libasyncns' author Lennart Poettering about this bug.
>> Lennart, what do you think about this?
>
> Hmm, no clue.
>
> If you can get me a simple test use case using just the C API I'd be
> happy to take a closer look!
>
> Lennart

There is a link in this mail, just a few lines above:
http://launchpadlibrarian.net/27057398/asyncns-0.7-load.c

--
Yann

Revision history for this message
Damien Thébault (damien-thebault) wrote :

It looks like recv() shouldn't be called from multiple threads on the same socket.
(See the attached file for a similar behaviour.)

If I understand it correctly, some worker threads are created, and they all wait
for a message with recv() on the same fd.
Then, when a new request comes in, send() is used on the connected socket,
and the message is picked up by one of the worker threads.

Maybe the worker thread system could be implemented with queues
and pthread mutexes/condition variables, but that's less generic (harder to keep
compatible with process workers).
Another way might be to use pipe() and select()?
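
For illustration only (this is not libasyncns code, and all the names here are invented), a minimal mutex/cond-protected queue that worker threads could block on, instead of all recv()ing on the same datagram socket, might look like this:

    /* Illustrative sketch only -- not libasyncns code; all names are invented. */
    #include <pthread.h>
    #include <stddef.h>

    struct work_item {
        struct work_item *next;
        void *payload;                     /* e.g. the query to resolve */
    };

    struct work_queue {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        struct work_item *head, *tail;
    };

    /* Called by the main thread to hand a request to the workers. */
    static void queue_push(struct work_queue *q, struct work_item *w) {
        w->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail)
            q->tail->next = w;
        else
            q->head = w;
        q->tail = w;
        pthread_cond_signal(&q->cond);     /* wakes one waiting worker */
        pthread_mutex_unlock(&q->lock);
    }

    /* Called by a worker thread; blocks until work arrives. */
    static struct work_item *queue_pop(struct work_queue *q) {
        struct work_item *w;
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->cond, &q->lock);
        w = q->head;
        q->head = w->next;
        if (!q->head)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return w;
    }

Threads sleeping in pthread_cond_wait() block interruptibly (S state), so they should not inflate the load average the way the D-state recv() waiters do.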

The strange thing is that the load increases the same way with the fork() method too,
but I haven't looked at that part.

Regards,
--
Damien Thebault

Revision history for this message
Yann Leboulanger (asterix) wrote :

Will there be any attempt to fix this or is the development of this
library dead?
--
Yann Leboulanger

Revision history for this message
Damien Thébault (damien-thebault) wrote :

On Wed, Jun 24, 2009 at 23:28, Yann Leboulanger <email address hidden> wrote:
> Will there be any attempt to fix this or is the development of this
> library dead?

I was thinking about fixing it myself, but I don't have a lot of time these days,
and I don't really know the libasyncns code.
I started anyway; maybe it'll produce something working in the end.

--
Damien Thebault

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

Created attachment 364906
the test case

Consider the attached example code. All it does is create a PF_UNIX/SOCK_DGRAM socket, spawn 4 threads, and call recv() on the socket in each of those threads. Nothing else. Because nobody is sending anything on the socket, the program basically just hangs and rightly does not appear in top -- except that the CPU load average top shows starts to go up and up. Which it shouldn't, of course.
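
(The attachment itself is not reproduced in this report; a minimal sketch along the described lines -- an AF_UNIX/SOCK_DGRAM socketpair, four threads, each blocking in recv() -- would look roughly like this:)

    /* Rough sketch of the described test case -- not the actual attachment 364906.
     * Creates an AF_UNIX/SOCK_DGRAM socketpair, spawns 4 threads, and has each
     * of them block in recv() on the same fd.  Nothing is ever sent, so the
     * program just hangs -- yet the load average climbs. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static void *worker(void *arg) {
        int fd = *(int *)arg;
        char buf[64];

        /* Blocks forever; on affected kernels all but one of these threads
         * end up sleeping in D (uninterruptible) state. */
        recv(fd, buf, sizeof(buf), 0);
        return NULL;
    }

    int main(void) {
        int fds[2];
        pthread_t threads[4];
        int i;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0) {
            perror("socketpair");
            return 1;
        }

        for (i = 0; i < 4; i++)
            pthread_create(&threads[i], NULL, worker, &fds[0]);

        for (i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);

        return 0;
    }

Build with gcc -pthread; watching top in another terminal (or checking per-thread states with ps -eLo stat,comm) while it hangs should show the climbing load average and, on affected kernels, all but one of the threads in D state.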

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

use case written by the ubuntians btw.

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

s/use case/test case/

Martin Pitt (pitti)
Changed in linux (Ubuntu):
status: New → Triaged
affects: libasyncns-python → linux
summary: - High load
+ CPU Load Avg calculation gets very confused by multiple recv()s on the
+ same PF_UNIX/SOCK_DGRAM socket
Changed in linux (Fedora):
status: Unknown → Confirmed
Revision history for this message
In , Matthew (matthew-redhat-bugs) wrote :

Load average seems to go to 4 for me, which is what I'd expect. Do you see different behaviour?

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

Yes, that's what I see too, but not what I expected. recv() hangs in D state, and it should be in S state, I believe, given that the sleep actually *is* interruptible with a simple C-c, which causes EINTR on the recv().

Revision history for this message
Yann Leboulanger (asterix) wrote :

So it's only the load calculation that is wrong? Nothing more that could hurt?

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

The problem seems to be related to the simultaneous recv() calls in multiple threads:

One of the threads will be hanging in S state, and the others in D. The load avg should hence go up to n-1 if we have n threads calling recv() on the same socket.

I would say this is a bug.

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

Doing the same thing with a pipe instead of an AF_UNIX socket, btw, works properly: all threads will hang in S.

Whether read() or recv() is used on the fd makes no real difference for the AF_UNIX case.

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

If this code is done with AF_UNIX/SOCK_STREAM then all threads will hang in S. As it should be.

Summarizing:

- On pipes, all waiting threads will be in S state.
- On AF_UNIX/SOCK_STREAM, all waiting threads will be in S state.
- On AF_UNIX/SOCK_DGRAM, one thread will be in S state and the others in D state (BROKEN!)

So, yes, this is definitely a bug in the socket handling code.

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

I am now setting this as F12Target. glib now uses libasyncns for the resolver and we probably shouldn't show a completely bogus loadavg when the user runs a glib app that uses the resolver.

It's admittedly not high prio though, given that only the statistics are wrong but everything else seems to be fine.

Revision history for this message
In , Matěj (matj-redhat-bugs) wrote :

(In reply to comment #8)
> It's admittedly not high prio though, given that only the statistics are wrong
> but everything else seems to be fine.

Are you sure about this? When my Gajim tries to use python-libasyncns, it makes the whole computer react pretty slowly.

Revision history for this message
In , Michal (michal-redhat-bugs) wrote :

Created attachment 365146
modified test case with processes instead of threads

I modified the test case slightly to use full processes instead of threads, just to demonstrate that it's not a threads-only issue. The high loadavg is reproducible just as well here and the processes are really in D state.
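
(Attachment 365146 is not reproduced here, but a process-based variant along those lines might look roughly like this -- a sketch, not the actual attachment:)

    /* Sketch of a process-based variant -- not the actual attachment 365146.
     * Same idea as the threaded test case, but with fork()ed children all
     * blocking in recv() on the same AF_UNIX/SOCK_DGRAM socket. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[64];
        int i;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0) {
            perror("socketpair");
            return 1;
        }

        for (i = 0; i < 4; i++) {
            if (fork() == 0) {
                /* Child: blocks forever; shows up in D state on affected kernels. */
                recv(fds[0], buf, sizeof(buf), 0);
                _exit(0);
            }
        }

        /* Parent just waits; nothing is ever sent on the socket. */
        for (i = 0; i < 4; i++)
            wait(NULL);

        return 0;
    }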

I can't reproduce the system slowdown (which Matěj is seeing) with this. I'll try Gajim later.

I wonder if we can just use mutex_lock_interruptible(&u->readlock) in unix_dgram_recvmsg...
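
As a rough illustration of that idea (simplified pseudo-kernel code, not the actual net/unix/af_unix.c source):

    /* Simplified sketch only -- not actual kernel source.  mutex_lock() sleeps
     * in TASK_UNINTERRUPTIBLE (D state, counted in the load average), while
     * mutex_lock_interruptible() sleeps in TASK_INTERRUPTIBLE (S state) and
     * returns -EINTR if a signal arrives while waiting. */
    static int unix_dgram_recvmsg_sketch(struct sock *sk /* ... other args ... */)
    {
        struct unix_sock *u = unix_sk(sk);
        int err;

        /* was: mutex_lock(&u->readlock); -- extra waiters end up in D state */
        err = mutex_lock_interruptible(&u->readlock);
        if (err)
            return err;                    /* interrupted by a signal */

        /* ... receive the datagram as before ... */

        mutex_unlock(&u->readlock);
        return 0;
    }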

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

(In reply to comment #9)
> Are you sure about this? When my Gajim tries to use python-libasyncns, it
> makes the whole computer react pretty slowly.

Hmm, no, never seen that. Everyone else reports as if this is only a statistics issue. Is the process actually showing up as CPU time consuming in top?

Revision history for this message
In , Lennart (lennart-redhat-bugs) wrote :

*** Bug 529504 has been marked as a duplicate of this bug. ***

Revision history for this message
In , Matěj (matj-redhat-bugs) wrote :

(In reply to comment #11)
> Hmm, no, never seen that. Everyone else reports as if this is only a statistics
> issue. Is the process actually showing up as CPU time consuming in top?

Yes, Gajim is then a pretty active process (in the low tens of percent CPU).

Revision history for this message
In , Michal (michal-redhat-bugs) wrote :

*** Bug 570323 has been marked as a duplicate of this bug. ***

Revision history for this message
In , Michal (michal-redhat-bugs) wrote :

Note to self: The mutex was added in 2.6.10 by DaveM in "[AF_UNIX]: Serialize dgram read using semaphore just like stream".
It fixed an exploitable race condition (http://www.securityfocus.com/archive/1/381689).
Using mutex_lock_interruptible() almost works, except that SO_RCVTIMEO will still work badly in this situation.

Revision history for this message
In , Bug (bug-redhat-bugs) wrote :

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Revision history for this message
In , Bug (bug-redhat-bugs) wrote :

This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 13 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Revision history for this message
In , Bug (bug-redhat-bugs) wrote :

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Revision history for this message
penalvch (penalvch) wrote :

Yann Leboulanger, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily kernel folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.12-rc2

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Fedora):
importance: Unknown → Medium
status: Confirmed → Won't Fix
Revision history for this message
penalvch (penalvch) wrote :

No response from the original reporter to the information requested in 2013.

no longer affects: linux (Ubuntu)
affects: linux → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Invalid
no longer affects: linux (Ubuntu)
affects: linux (Fedora) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Medium → Undecided
status: Won't Fix → New
status: New → Invalid