CodeImportSchedulerApplication:CodeImportSchedulerAPI timeouts

Bug #637758 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Unassigned

Bug Description

https://lp-oops.canonical.com/oops.py/?oopsid=1717XMLP125

https://launchpad.net/codeimportscheduler/getJobForMachine

Branch: launchpad-rev-9760
Revno: 9760
SQL time: 7502 ms
Non-sql time: 9025 ms
Total time: 16527 ms
Statement Count: 5

The key bit is this:
4. 29 7474ms SQL-launchpad-main-master UPDATE CodeImportMachine SET heartbeat=CURRENT_TIMESTAMP AT TIME ZONE 'UTC' WHERE CodeImportMachine.id = %s
5. 16518 9ms SQL-launchpad-main-master SELECT CodeImportJob.code_import, CodeImportJob.date_created, CodeImportJob.date_due, CodeImportJob.date_started, CodeImportJob.heartbeat, CodeImportJob.id, CodeImportJob.logtail, CodeImportJob.machine, CodeImportJob.ordering, CodeImportJob.requesting_user, CodeImportJob.state FROM CodeImportJob WHERE id IN (SELECT id FROM CodeImportJob WHERE date_due <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC' AND state = 10 ORDER BY requesting_user IS NULL, date_due LIMIT 1)

Notes: one 7.5 second update (bad)
and a further 8 seconds unallocated - and we've check the code; there is almost nothing there. Its highly likely this is a scheduler issue and a single-threaded appserver will mitigate it. We need to find out in a controlled way whats up with the 8 second gap, but its not the fault of this code, so filing a separate bug. This bug is about the 7.5 second update.

Revision history for this message
Tim Penhey (thumper) wrote :

Robert and I talked through this for quite a while today.

CodeImportJobSet.getJobForMachine (lp.code.model.codeimportjob) is where the fun happens.

The heartbeat update is happening in
    machine.shouldLookForJob(worker_limit)

After the heartbeat, the only things are comparing a local db field with an enum, and comparing two integers. Neither of these would trigger the flush (we think). The next thing that happens is the CodeImportJob.selectOne call, which does trigger the flush. It is between the flush which sets the heartbeat, and the execution of the select that we are losing a lot of time.

Revision history for this message
Robert Collins (lifeless) wrote :

OOPS-1725XMLP10, OOPS-1725XMLP101, OOPS-1725XMLP104, OOPS-1725XMLP105, OOPS-1725XMLP113 OOPS-1725XMLP6 OOPS-1725XMLP112, OOPS-1725XMLP117, OOPS-1725XMLP147, OOPS-1725XMLP151, OOPS-1725XMLP160

Revision history for this message
Robert Collins (lifeless) wrote :

times:
0030
0615
0631
0633
0710
0015
0706
0735
1031
1046
1115

tags: added: pg83
Revision history for this message
Robert Collins (lifeless) wrote :

Was the xmlrpc server being overloaded, fixed by clustering it.

Changed in launchpad-code:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.