Distribution:+assignments is taking a long time maybe 20% of the time

Bug #618367 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Robert Collins

Bug Description

Distribution:+assignments is showing up on https://devpad.canonical.com/~stub/ppr/lpnet/latest-daily-timeout-candidates.html as taking a very long time 20% of the time. At only 5 hits a day, and with most hits < 1 second, its likely not going to show on our top-oops reports, but it is still timing out consistently. I wonder about contention - we should dig up an oops for it and investigate.

https://launchpad.net/ubuntu/+assignments

https://lp-oops.canonical.com/oops.py/?oopsid=1702D705

SQL time: 6799 ms
Non-sql time: 11123 ms
Total time: 17922 ms
Statement Count: 208

1 187 6132 32 6100 launchpad-main-slave SELECT ValidPersonCache.id FROM ValidPersonCache WHERE ValidPersonCache.id = %s LIMIT 1

is the most duplicated query, and seems to be most of the sql time.

Related branches

description: updated
description: updated
Changed in blueprint:
assignee: nobody → Robert Collins (lifeless)
Revision history for this message
Robert Collins (lifeless) wrote :

This is now down to 76 queries and 13 seconds : not brilliant, but we have some breathing room.

Changed in blueprint:
status: Triaged → Fix Committed
Revision history for this message
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in blueprint:
milestone: none → 10.09
tags: added: qa-needstesting
Changed in blueprint:
status: Fix Committed → Triaged
Revision history for this message
Robert Collins (lifeless) wrote :

Updated staging oops: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1707S287

https://staging.launchpad.net/ubuntu/+assignments

SQL time: 875 ms
Non-sql time: 12600 ms
Total time: 13475 ms
Statement Count: 20

The sql overhead is Not-A-Problem now, so we need to get a kcachegrind profile and see whats going on.

Changed in blueprint:
assignee: Robert Collins (lifeless) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

(Unless, of course, its in the incremental fetches which we don't yet profile - see 'oops may be underrepresenting sql time'.

Curtis Hovey (sinzui)
tags: added: qa-ok
removed: qa-needstesting
tags: removed: qa-ok
Changed in blueprint:
milestone: 10.09 → none
Revision history for this message
Robert Collins (lifeless) wrote :

Just used Gary's shiny new ++profile++show,log on this.

16 seconds are spent in _load_objects_storm in 30K calls : we're pulling 30K full objects out of the DB, which is moderately high. The page was killed at 25 seconds by doing a query.

3000 calls to _load_objects
3000 to load_objects
30K to _load_object
  4.5 K to add_to_alive (3.5%)
  4.5K to get_object_info (20%)
  11K to set_values (42%)

One possibility is that the 'for value in values' check for non-null rows is expensive.

Another possibility is that that is cheap and what we're seeing is mainly coming from the other cases - given the set_values case should run always when we get past the null-row check, we can say that 65% of the time is dealing with found objects, and nearly 30% ?! inline or dealing with NULL objects. Looks like we should be able to short circuit the NULL check by looking at the primary key: if those rows are NULL, the whole thing is NULL.

Revision history for this message
Robert Collins (lifeless) wrote :

We also may be pulling too much data back still; am going to recheck the previous query and make sure its not returning too many rows. That wouldn't alter the efficiency of the storm code but it would mean we're overworking it.

Revision history for this message
Robert Collins (lifeless) wrote :

The query returns 3000 rows; for all specifications in Ubuntu ever, thats not unreasonable. A quick check shows they are not duplicated.

The 30K objects come from this:
3000 rows
10 classes per row (spec, (person, email, address):drafter, (person, email, address):approver (person, email, address):assignee)

So, making storm more efficient here is probably a big win, but fairly hard to do. One thing we can do is tune the preloading stuff to not load the objects, just the ids - enough to be sure its sufficient. I'll start on this now, and separately file a bug on storm about efficient in load_object.

Revision history for this message
Robert Collins (lifeless) wrote :
Revision history for this message
Robert Collins (lifeless) wrote :

https://lp-oops.canonical.com/oops.py/?oopsid=1711S6 the oops (ignore the totals on the left, that was a temporary bug, fixed in edge, not deployed on staging yet). The log lines are correct though.

Revision history for this message
Robert Collins (lifeless) wrote :

I tried changing this to a 2-query form.

https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1711S60 is an example of the result.

In a different oops (-59) it does execute both queries, and we could see that on the SQL side it is slower (2 seconds vs 800ms) and on the python side it is /slightly/ faster.

In thie linked oops, (ignore the SQL time: 1001297 ms, Non-sql time: -981318 ms) figures - they are due to the sentinel showing the non-run query at the end - there is a clear 12 second gap between:
 - the execute returning to storm
 - the query for Person being executed.

So parsing the 3000 narrower rows(Specification only) took 12 seconds (including pulling out their related people IDs and building the next query) under the profiler (vs 16 for the very fat rows). This suggests a storm bug to me and I'm going to follow up on the storm bug I filed as a result of the first profile.

Revision history for this message
Robert Collins (lifeless) wrote :

the patch - http://paste.ubuntu.com/489589/ - applies to db-stable cleanly. Or my lp:~lifeless/launchpad/bug-618367 branch.

Revision history for this message
Robert Collins (lifeless) wrote :

So this page grows indefinitely, but its not really valuable to look at all releases at once. I'm going to exercise my new found batching skillz to batch it in 500 row blocks.

Changed in blueprint:
status: Triaged → In Progress
Changed in blueprint:
assignee: nobody → Robert Collins (lifeless)
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
Changed in blueprint:
milestone: none → 10.10
tags: added: qa-needstesting
Changed in blueprint:
status: In Progress → Fix Committed
Revision history for this message
Robert Collins (lifeless) wrote :

Verified on edge.

tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui)
Changed in blueprint:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.