Comment 6 for bug 302798

Jeroen T. Vermeulen (jtv) wrote :

It turns out the OOPS reports may be wrong about the non-SQL time. In the cases where most of the time is logged as non-SQL time, the report doesn't actually log the query that the timeout error is for.

Some possibilities I'm looking at:
 * Two-thirds of the oops reports end up not logging that final, long query for some reason and this is probably screwing up the SQL-time computation. Reported as bug 310818.
 * The time is spent somewhere where we can't see it. Stuart suggested as one possibility: C/Pyrex code holding Python's Global Interpreter Lock for too long and effectively serializing the app server. Something to discuss with Gustavo or Gary.
 * I'm assured that the databases are still running entirely from memory, in which case this is not a repetition of our old I/O timeouts. But could the app servers be exhausting their memory because of aggressive caching? Setting up the Jaunty translations must have increased the working set.
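
To illustrate the first possibility, here is a minimal sketch of how a missing final query could inflate the reported non-SQL time. It assumes non-SQL time is computed as total request time minus the summed durations of the logged queries; that formula, the names, and the numbers are all hypothetical and not taken from the actual OOPS reporting code.

```python
from dataclasses import dataclass


@dataclass
class LoggedQuery:
    """Hypothetical record of one logged SQL statement (times in ms)."""
    start_ms: int
    end_ms: int


def split_request_time(total_ms, queries):
    """Return (sql_ms, non_sql_ms), assuming non-SQL time is simply
    whatever part of the request is not covered by a logged query."""
    sql_ms = sum(q.end_ms - q.start_ms for q in queries)
    return sql_ms, total_ms - sql_ms


# Made-up request that hits a timeout after 15 seconds.
complete_log = [
    LoggedQuery(0, 200),
    LoggedQuery(300, 500),
    LoggedQuery(600, 15000),  # the final, long query
]
# Same request, but the final query never made it into the log.
truncated_log = complete_log[:-1]

print(split_request_time(15000, complete_log))   # (14800, 200)
print(split_request_time(15000, truncated_log))  # (400, 14600)
```

With the final query missing from the log, its 14.4 seconds are booked as non-SQL time, which would match what the affected reports show.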

If working-set size is the root cause, there are some short-term things we can do:
 * We're going to close the translation UIs for obsolete Ubuntu releases. A branch for this is in PQM right now. This won't do much in itself, but it also means we can delete the translation messages for these obsolete series. That accounts for about 30% of the data set.
 * The main cost of rendering these pages is in fetching suggestions. We can set an age limit on those suggestions if that helps the query plan. It's mainly the test code that would suffer from such a limit.