Distribution:+ppas timeout in PPA search

Bug #659129 reported by Armindo Silva
32
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Julian Edwards

Bug Description

https://edge.launchpad.net/ubuntu/+ppas?name_filter=munin

OOPS-1746EB682

        Time Reps Database id Statement
1 13001.0 1 SQL-launchpad-main-slave SELECT COUNT(*) FROM Archive, Person, ValidPersonOrTeamCache WHERE Archive.purpose = 2 AND Archive.distribution = 1 AND Person.id = Archive.owner AND Person.id = ValidPersonOrTeamCache.id AND Archive.id IN ( SELECT DISTINCT archive FROM SourcepackagePublishingHistory WHERE status IN (2, 1)) AND Archive.fti @@ ftq('munin') AND Archive.private = FALSE AND Archive.enabled = TRUE AND (1=1)

and
(Error ID: OOPS-1747ED330)

Related branches

Revision history for this message
Armindo Silva (deathon2legs) wrote :

I disabled the redirection o edge.launchpad and when i press search on https://launchpad.net/ubuntu/+ppas i get this error:

Timeout error
Sorry, something just went wrong in Launchpad.

We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.

Trying again in a couple of minutes might work.

(Error ID: OOPS-1746B1391)

summary: - Launchpad Timeout error
+ Distribution:+ppas timeout in PPA search
description: updated
tags: added: dba timeout
Changed in launchpad:
status: New → Triaged
importance: Undecided → High
affects: launchpad → soyuz
Revision history for this message
maxadamo (massimilianoadamo) wrote :

just to clarify that this issue, is not affecting us from "16 hours", but from around one week.
I filed my bug report nr. 657409, 4 days ago, when the server was not responding from two days.

description: updated
Revision history for this message
Stuart Bishop (stub) wrote :

Two causes of slowdown in slow COUNT query: Joining with Person is unnecessary, and the DISTINCT in the subselect is unnecessary. The DISTINCT is the big slowdown. The following query performs in under a second, as opposed to around 45 seconds on a production slave:

SELECT COUNT(*)
FROM Archive, ValidPersonOrTeamCache
WHERE
    Archive.purpose = 2
    AND Archive.distribution = 1
    AND Archive.owner = ValidPersonOrTeamCache.id
    AND Archive.id IN (
        SELECT archive
        FROM SourcepackagePublishingHistory
        WHERE status IN (2, 1))
    AND Archive.fti @@ ftq('munin')
    AND Archive.private = FALSE
    AND Archive.enabled = TRUE AND (1=1);

Revision history for this message
Robert Collins (lifeless) wrote :

Dropping the use of validpersonorteamcache in favour of person validity checks might be better still (the views have routinely had trouble)

tags: removed: dba
Changed in soyuz:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress
tags: added: qa-needstesting
tags: added: qa-bad
removed: qa-needstesting
Revision history for this message
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in soyuz:
milestone: none → 10.10
tags: added: qa-needstesting
removed: qa-bad
Changed in soyuz:
status: In Progress → Fix Committed
Revision history for this message
Ursula Junque (ursinha) wrote :

<lifeless> Ursinha: qa-ok for that bug, if you're touching it
<lifeless> Ursinha: the fix wasn't /wrong/ it just wasn't deep enough (a later query times out instead)

tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui)
Changed in soyuz:
status: Fix Committed → Fix Released
Changed in soyuz:
status: Fix Released → Triaged
tags: removed: qa-ok
Revision history for this message
Stuart Bishop (stub) wrote :

OOPS-1748EC129 for an example timeout with the current round of fixes.

Revision history for this message
Stuart Bishop (stub) wrote :

The fixed query in comment #3 does not seem to be live - edge is still timing out on the old query containing the unnecessary DISTINCT in the subquery.

Revision history for this message
Robert Collins (lifeless) wrote :

It was marked QA-bad and only the preceeding commits were deployed (because it didn't fix the problem).

On staging we can see the effect with rev 9886 deployed.

This OOPS was it with a cowboy:

https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1747S227

I'll get a deployment of tip underway to get the count improvement done.

Changed in soyuz:
assignee: Julian Edwards (julian-edwards) → nobody
milestone: 10.10 → none
Revision history for this message
Stuart Bishop (stub) wrote :

The slow query in that OOPS is:

SELECT COUNT(*) FROM Archive, Person, ValidPersonOrTeamCache
WHERE
    Archive.purpose = 2 AND Archive.distribution = 1
    AND Person.id = Archive.owner
    AND Person.id = ValidPersonOrTeamCache.id
    AND Archive.private = FALSE AND Archive.enabled = TRUE;

This can be sped up by removing the unnecessary join with Person:

SELECT COUNT(*) FROM Archive, ValidPersonOrTeamCache
WHERE
    Archive.purpose = 2 AND Archive.distribution = 1
    AND Archive.owner = ValidPersonOrTeamCache.id
    AND Archive.private = FALSE AND Archive.enabled = TRUE;

If you need Person in there later, we need to not join with ValidPersonOrTeamCache and add that logic to the query:

SELECT COUNT(*)
FROM Archive
JOIN Person ON Person.id = Archive.owner
LEFT OUTER JOIN EmailAddress ON EmailAddress.person = Person.id
LEFT OUTER JOIN Account ON Account.id = EmailAddress.account
WHERE
    Archive.purpose = 2 AND Archive.distribution = 1
    AND Archive.private = FALSE AND Archive.enabled = TRUE
    AND (
        (Person.teamowner IS NOT NULL AND Person.merged IS NULL)
        OR (Person.teamowner IS NULL AND Account.status = 20
            AND EmailAddress.status = 4));

If you choose the last option, please cite Bug #660400 with an XXX.

Revision history for this message
Stuart Bishop (stub) wrote :

On staging at the moment r9886:

At least 42 queries/external actions issued in 2.07 seconds OOPS-1748S128

Changed in soyuz:
assignee: nobody → Julian Edwards (julian-edwards)
Revision history for this message
Julian Edwards (julian-edwards) wrote :

The patch was cowboyed on to staging and it seems to do the trick. What was taking 10 seconds or more (and OOPSing) is now coming back in 1-2 seconds.

Changed in soyuz:
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
Changed in soyuz:
milestone: none → 10.11
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-ok
removed: qa-needstesting
Changed in soyuz:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.