pofile stats daily takes ages to complete

Bug #622668 reported by Данило Шеган
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Benji York

Bug Description

cronscripts/rosetta-pofile-stats-daily.py takes ages to complete, either due to DBLoopTuner being insane (bug 622670), or due to bad usage of it.

This is how a typical run looks like:

2010-08-14 06:37:04 INFO Creating lockfile: /var/lock/launchpad-pofile-stats-daily.lock
2010-08-14 06:37:15 INFO Verifying stats of POFiles updated in the last 7 days.
2010-08-14 06:37:35 INFO Verifying a total of 98599 POFiles.
2010-08-14 06:37:46 DEBUG Iteration 0 (size 1.0): 11.037 seconds
2010-08-14 06:37:46 INFO Blocked on 6:26:44.992256 old xact postgres@launchpad_prod_3/18668 - <HIDDEN>.
2010-08-14 06:37:46 INFO Blocked on 6:25:08.238912 old xact slony@launchpad_prod_3/29559 - <HIDDEN>.
2010-08-14 06:37:46 INFO Blocked on 1:22:44.646834 old xact fiera@launchpad_prod_3/19300 - <IDLE> in transaction.
2010-08-14 06:37:46 INFO Sleeping for up to 10 minutes.
2010-08-14 06:47:47 INFO Blocked on 6:36:45.967862 old xact postgres@launchpad_prod_3/18668 - <HIDDEN>.
2010-08-14 06:47:47 INFO Blocked on 6:35:09.214518 old xact slony@launchpad_prod_3/29559 - <HIDDEN>.
2010-08-14 06:47:47 INFO Blocked on 1:32:45.622440 old xact fiera@launchpad_prod_3/19300 - <IDLE> in transaction.
2010-08-14 06:47:47 INFO Sleeping for up to 10 minutes.
2010-08-14 06:57:48 INFO Blocked on 6:46:46.757742 old xact postgres@launchpad_prod_3/18668 - <HIDDEN>.
2010-08-14 06:57:48 INFO Blocked on 6:45:10.004398 old xact slony@launchpad_prod_3/29559 - <HIDDEN>.
2010-08-14 06:57:48 INFO Blocked on 1:42:46.412320 old xact fiera@launchpad_prod_3/19300 - <IDLE> in transaction.
2010-08-14 06:57:48 INFO Sleeping for up to 10 minutes.
...

2010-08-14 08:37:57 INFO Sleeping for up to 10 minutes.
2010-08-14 08:47:58 INFO Blocked on 1:47:57.103012 old xact postgres@launchpad_prod_3/31251 - <HIDDEN>.
2010-08-14 08:47:58 INFO Sleeping for up to 10 minutes.
2010-08-14 08:58:00 INFO Blocked on 1:57:58.284515 old xact postgres@launchpad_prod_3/31251 - <HIDDEN>.
2010-08-14 08:58:00 INFO Sleeping for up to 10 minutes.
2010-08-14 09:08:01 INFO Blocked on 2:07:59.294789 old xact postgres@launchpad_prod_3/31251 - <HIDDEN>.
2010-08-14 09:08:01 INFO Sleeping for up to 10 minutes.
2010-08-14 09:13:43 DEBUG Iteration 1 (size 1.0): 42.150 seconds
2010-08-14 09:13:55 DEBUG Iteration 2 (size 1.0): 11.925 seconds
2010-08-14 09:14:06 DEBUG Iteration 3 (size 1.0): 10.319 seconds
2010-08-14 09:14:15 DEBUG Iteration 4 (size 1.0): 9.151 seconds
2010-08-14 09:14:24 DEBUG Iteration 5 (size 1.0): 9.456 seconds
2010-08-14 09:14:35 DEBUG Iteration 6 (size 1.0): 10.665 seconds
...

After it gets unblocked, it takes ages to process stuff, doing one item per iteration. For reference, regular pofile-stats script which runs over all our POFiles (all 900k of them) finishes in roughly 24h.

Revision history for this message
Данило Шеган (danilo) wrote :

We should disable pofile-stats-daily script until this is fixed.

description: updated
Revision history for this message
Данило Шеган (danilo) wrote :

FWIW, the number of POFiles that need processing is unusually large in the above run: we expect it mainly to be around 1-10k POFiles every day.

Revision history for this message
Stuart Bishop (stub) wrote :

The loop tuner algorithm fails when an iteration with a batch size of one takes longer than the target transaction time. It either needs to be sped up, or the target transaction time increased.

Revision history for this message
Данило Шеган (danilo) wrote :

I wonder if it makes sense to make it possible for DBLoopTuner users to choose to ignore the lag after it's been stuck for, say, 1h or so.

Revision history for this message
Stuart Bishop (stub) wrote :

Ignoring lag, or not using DBLoopTuner at all, will not help. The overhead per chunk is higher than the target time, so the chunk size will decrease to the minimum chunk size. This is currently set to 1.

A quick fix would seem to be to increase the minimum chunk size. However, we are trying to limit our transactions to <2s to avoid contention issues.

We need to fix the per-chunk overhead.

Maybe calculating everything upfront into a temporary table could help? I'd like to avoid temporary tables, but maybe we really need them.

Maybe it is just an slow query to fix.

Maybe we need a better index.

Revision history for this message
Данило Шеган (danilo) wrote :

Most of the transaction time is read time: calculating the relevant data. For each of the different counts we do a separate query over TranslationMessage table, and some of them are pretty complex (especially the unreviewed_count). I'd like us to investigate why has pofile-stats script runtime gone from 24h to 8 days for roughly the same data size (i.e. 1.2M POFiles on 70M TranslationMessage table compared to the old days and 900K POFiles on roughly 100M row table).

If it is indeed a slow query, we should look at pofile.updateStatistics().

Revision history for this message
Данило Шеган (danilo) wrote :

Oh, actually, looking at this now, it's not a single query per count anymore (probably for a while now).

Revision history for this message
Данило Шеган (danilo) wrote :

This would be "solved" by removing the need for this script in the first place: bug 877195.

Revision history for this message
Benji York (benji) wrote :

This has been fixed by removing the need for the job. I've opened bug 898648 to keep track of removing it.

Changed in launchpad:
status: Triaged → Fix Released
assignee: nobody → Benji York (benji)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Related questions

Remote bug watches

Bug watches keep track of this bug in other bug trackers.