oopses do not gather environmental data(load, thread-cpu-time, ...)

Bug #243554 reported by Diogo Matsubara
Affects            Status   Importance  Assigned to  Milestone
Launchpad itself   Triaged  High        Unassigned
OOPS model         Triaged  High        Unassigned
python-oops-tools  Triaged  High        Unassigned

Bug Description

When timeouts occur, they can be caused by a) inefficient code or b) external influences.

We should gather enough data that we don't spend time debugging the wrong things.

Specifically we should gather:
 - system load average
 - number of CPU cores (to normalise the load average)
 - process memory & physical memory (to guesstimate whether we're hitting swap)
 - *process* time since the request started. As each request is in a separate thread, the OS's system accounting can tell us whether 5 seconds of wall clock time was 5 seconds of CPU time, or 1 second of CPU time.

canonical.mem.resident() and canonical.mem.memory() will help in implementing this. os.getloadavg() will give us load averages. We can grep /proc/cpuinfo, as bzr does, for the CPU count, and time.clock() will give us CPU usage ('per process', which may be equivalent to per-thread when we call it from a non-main thread; testing will be needed). If time.clock() does not suffice, a small extension could call clock_gettime(CLOCK_THREAD_CPUTIME_ID, ....)
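
A minimal sketch of gathering the system-level items from the list above, using only the standard library. The function name gather_environment() and the dictionary keys are hypothetical, not part of any existing OOPS code. It assumes a Unix-like host, and it uses resource.getrusage() peak RSS as a stand-in for canonical.mem.resident(), which is not reproduced here; peak RSS is not the same as current resident size.

```python
import os
import resource


def gather_environment():
    """Collect load, core count and memory figures (hypothetical helper)."""
    load1, load5, load15 = os.getloadavg()
    # cpu_count() may return None if the count cannot be determined.
    cores = os.cpu_count() or 1

    # Physical memory in bytes; sysconf names vary by platform, so allow failure.
    try:
        physical = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError):
        physical = None

    # Peak resident set size of this process: KiB on Linux, bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    return {
        "load_average": (load1, load5, load15),
        "cpu_cores": cores,
        # A value near or above 1.0 suggests the machine is saturated.
        "normalised_load": load1 / cores,
        "physical_memory_bytes": physical,
        "process_peak_rss": peak_rss,
    }


if __name__ == "__main__":
    for key, value in gather_environment().items():
        print(key, value)
```

Dividing the one-minute load average by the core count gives a figure that can be compared across machines of different sizes, which is the normalisation the list asks for.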

We are hitting many questions we cannot answer today as a result of not knowing these things.

Alternatively:
#RUSAGE_THREAD = 1 on my linux system - we'd want a C extension to get the right constant
resource.getrusage(1).ru_utime
should give us what we need.
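
A rough sketch of the getrusage approach, comparing wall-clock time with per-thread CPU time for a single request. It assumes Linux, where RUSAGE_THREAD is 1 as noted above; the code uses resource.RUSAGE_THREAD when the resource module exposes it and falls back to the literal 1 otherwise. The helper names thread_cpu_seconds() and timed_request() are hypothetical, and system time (ru_stime) is added to user time here as a design choice, not something stated above.

```python
import resource
import time

# Use the module constant when available; 1 is the Linux value (an assumption).
RUSAGE_THREAD = getattr(resource, "RUSAGE_THREAD", 1)


def thread_cpu_seconds():
    """CPU seconds (user + system) consumed by the calling thread."""
    usage = resource.getrusage(RUSAGE_THREAD)
    return usage.ru_utime + usage.ru_stime


def timed_request(work):
    """Run work() and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.monotonic()
    cpu_start = thread_cpu_seconds()
    result = work()
    wall = time.monotonic() - wall_start
    cpu = thread_cpu_seconds() - cpu_start
    return result, wall, cpu


if __name__ == "__main__":
    # Sleeping uses wall-clock time but almost no CPU time.
    _, wall, cpu = timed_request(lambda: time.sleep(0.2))
    print("wall=%.3fs cpu=%.3fs" % (wall, cpu))
```

A request whose wall time is far larger than its CPU time was mostly waiting (on I/O, locks, or a loaded machine) rather than running inefficient code in its own thread.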

Joey Stanford (joey)
Changed in launchpad:
importance: Undecided → High
Christian Reis (kiko) wrote :

AIUI Francis' team is in the best position to actually store this information, and he already has put work into capturing this data into data structures we can output in the OOPS dump.

Changed in oops-tools:
status: New → Triaged
Changed in launchpad:
importance: Undecided → High
status: New → Triaged
summary: - oops report should record information about the running process
+ oops report should record information about the running environment
description: updated
description: updated
Gary Poster (gary) wrote : Re: oops report should record information about the running environment

adding to Foundations kanban backlog.

Gary Poster (gary)
tags: added: oops-infrastructure
removed: infrastructure oops-tools
description: updated
Robert Collins (lifeless) wrote :

This now looks doable without needing a C module at all (short term). Gary, could we look at slotting this in in the near future? I think it would pay for itself pretty quickly.

Robert Collins (lifeless) wrote :

See http://bugs.python.org/issue10440 for a request for the constant.

summary: - oops report should record information about the running environment
+ oopses do not gather environmental data(load, thread-cpu-time, ...)
Changed in python-oops:
status: New → Triaged
importance: Undecided → High
affects: oops-tools → python-oops-tools