revisions' rss feed timing out for some projects

Bug #260140 reported by Ursula Junque
18
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Tim Penhey

Bug Description

As seen on OOPS-964A3852, OOPS-964C2854, OOPS-964C3459, OOPS-964E4041, OOPS-964F3608, OOPS-964E2078, OOPS-964E2140, revisions.atom is timing out, several times.

I was able to reproduce doing the following:
1) Click on http://feeds.launchpad.net/liveusb/revisions.atom or try to add it to a rss reader - i've tried with akregator
2) It will lead to a timeout error page, and in akregator it will return that there is no feed available (the page returned to akregator is the timeout one, presumably).

Ursula Junque (ursinha)
Changed in launchpad-bazaar:
assignee: nobody → stub
importance: Undecided → High
milestone: none → 2.1.8
status: New → Confirmed
Revision history for this message
Tim Penhey (thumper) wrote :

This appears to be due to bad query plan selection by the postgresql server.

Funnily enough it appears only to affect projects with a small number of revisions.

Revision history for this message
Stuart Bishop (stub) wrote :

It will affect all projects when there are a lot of commits from other projects that have happened more recently.

Revision history for this message
Stuart Bishop (stub) wrote :

The current approach cannot be done efficiently. We are joining three tables - Revision, BranchRevision and Branch. Going right to left, we first get the set of branches that might be interesting. We then join with BranchRevision with no filters, so get all rows for all those branches (Nearly 4.5 million for bzr). We then join with Revision, ordering by revision_date so we can pick the most recent. This is slow no matter how we juggle things. Going from left to right is worse as we start with all 3.5 million revisions, join with BranchRevision giving us all 15 million rows in our set before finally filtering this by joining with Branch. We then need to order and pull out the interesting rows at the top.

We need to filter on BranchRevision. We could mirror Revision.revision_date into Branch, but I suspect that using the existing BranchRevision.id will actually give us better results.

A Revision needs to appear in the feed when Launchpad becomes aware of it, not possibly skipped altogether because the actual commit was made some time in the past and the branch only just pushed to somewhere Launchpad can see it. So the feed should display the most recently uploaded revisions rather the most recent commits that happen to have been mirrored to Launchpad. This is Revision.date_created, and we can achieve the same ordering by relying on Revision.id or BranchRevision.revision.

So if we are happy showing the last 25 revisions pushed/pulled into Launchpad then we can do:

SELECT DISTINCT ON (BranchRevision.revision) Revision.*
FROM Revision, BranchRevision, Branch
WHERE BranchRevision.branch = Branch.id
    AND BranchRevision.revision = Revision.id
    AND Branch.private IS FALSE AND Branch.product = 1186
ORDER BY BranchRevision.revision DESC LIMIT 25;

or

SELECT Revision.*
FROM Revision, (
    SELECT DISTINCT ON (BranchRevision.revision) BranchRevision.revision
    FROM BranchRevision, Branch
    WHERE BranchRevision.branch = Branch.id
        AND Branch.private IS FALSE AND Branch.product = 1186
    ORDER BY BranchRevision.revision) AS BR
WHERE Revision.id = BR.revision
ORDER BY Revision.revision_date DESC LIMIT 25;

The first is preferable, but the latter might be easier to put into Storm syntax - I can't recall if there is DISTINCT ON support there yet.

Revision history for this message
Paul Hummer (rockstar) wrote :

I've started work on this, and am about 60% done.

Changed in launchpad-bazaar:
assignee: stub → rockstar
Revision history for this message
Tim Penhey (thumper) wrote :

Firstly I'd like to document the intent of this feed. It is not to give a complete view into the order of revisions being noticed by Launchpad. It is used to indicate I view of who is doing what. We definitely want to order the feed by revision_date, not Revision.datecreated or some BranchRevision.id. It isn't really going to matter if we miss some. Perhaps we will have to return more than 25 revisions for some projects or project groups.

> A Revision needs to appear in the feed when Launchpad becomes aware of it,
> not possibly skipped altogether because the actual commit was made some
> time in the past and the branch only just pushed to somewhere Launchpad
> can see it. So the feed should display the most recently uploaded revisions
> rather the most recent commits that happen to have been mirrored to Launchpad.

I strongly disagree with this. The intent is to show what has happened recently, not necessarily what happened six months ago that Launchpad has just been told about.

The primary issue is that the feed is timing out for products without many revisions. The question becomes "how can we stop this?" One thing to identify is what the cut off is before the postgresql query analyser chooses what we consider the correct plan. If it is on number of branches, or number of revisions, we could do object traversal directly from the branches and it would be quick enough.

Given that the "correct" query plan is fast, how can we identify what the "cut off" is?

As far as how to structure a query, can we do something like this:

1) Filter the Branch table on product, private, whatever to get a reduced set.
2) Using the reduced Branch set, join with BranchRevision, but get destinct revision ids. Hopefully this would reduce the number of revisions for something like bzr to the 20k or so in the ancestry.
3) Those are then joined across Revision, and ordered.

Can this be done?

Revision history for this message
Stuart Bishop (stub) wrote :

If we have to order by revision_date, we can get the query down to 14sec with the current schema - this is what we had a previous cycle that wasn't good enough. So we will need to mirror revision_date in the BranchRevision table. I'll do some tests on staging to see what this gets us down to.

Revision history for this message
Stuart Bishop (stub) wrote :

That 14sec is our current worse case - bzr. It has lots of branches and lots of revisions.

Revision history for this message
Tim Penhey (thumper) wrote :

I have a branch that implements what I think is a "good enough" fix. This branch does an initial query to count the number of BranchRevisions for a product or project. If this is less than some magic number, we do two queryies, one to get the revision ids, and another to get the revisions. This will force the use of indices. If there are more than this magic number, then postgresql "should" use the fast query plan.

bzr+ssh://devpad.canonical.com/code/tim/launchpad/revision-feed-speed

Revision history for this message
Christian Reis (kiko) wrote :

It sounds like you two are talking across each other. Stuart, do you agree with the diagnostic Tim is giving that there is an issue with PostgreSQL mis-selecting a query plan when there are few results?

Revision history for this message
Stuart Bishop (stub) wrote :
Download full text (3.5 KiB)

Tims diagnosis was incorrect - the query was not performing badly for projects with a few number of revisions, but when there are a lot of revisions in the system with a more recent commit date than those in your branch. The currently deployed query basically iterates over Revision in order of revision_date, issuing a subquery for each row. This generally performs quickly for the most active projects because they will nearly always have 25 recent revisions near the top of the heap of Revisions. It sucks for all the other projects. If you look at the query plans, they are identical and have identical estimated cost - 0 to 8.85 million. The worst case is triggered by a Branch with < 25 commits, as the entire Revision table will be searched.

Tim's approach will work for our current data, as the number of BranchRevisions currently happens to be an indicator of commit activity too. It is possible to tell if there are more than a certain threshold of BranchRevision rows using a LIMIT trick (without the trick, you are stuck at 14secs worst case for doing the check):

SELECT COUNT(*) >= 500000 FROM (SELECT * FROM Branch, BranchRevision WHERE Branch.private IS FALSE AND Branch.id = BranchRevision.branch AND Branch.product=1186 LIMIT 500000) AS Whatever;

This check takes about 2 seconds worst case.

We only have four products with > 1 million BranchRevisions. I've checked all of them and the currently deployed query all runs very fast for them. There is then a big gap as the next highest count of BranchRevisions is 302,000. These branches do not perform well with the currently deployed query.

So if you want to do the check, the cutoff in BranchRevision counts should be 500,000.

I don't see this as a long term fix though, as the data will change over time - if a project with a large number of BranchRevisions becomes inactive, then some time later the query will start timing out. Similarly if more projects switch to bzr, we will end up with a lot more BranchRevisions competing for that sweet spot at the top of the ordered set of Revisions. Because this isn't a long term fix, we might as well avoid the count() check altogether saving a few seconds and hard code the list of products - bzr, mysql-server, python, evolution

There does seem to be an alternative approach though. If we are prepared to change the feed from 'last 25 revisions' to 'last 25 revisions if they where made in the last two weeks' then the following seems to perform acceptably:

SELECT Revision.* FROM (
    SELECT * FROM Revision
    WHERE revision_date > CURRENT_TIMESTAMP AT TIME ZONE 'UTC'
        - interval '2 weeks'
    ) AS Revision
WHERE EXISTS (
    SELECT TRUE FROM Branch, BranchRevision
    WHERE BranchRevision.revision = Revision.id
        AND BranchRevision.branch = Branch.id
        AND Branch.private IS FALSE AND Branch.product = 1186)
ORDER BY Revision.revision_date DESC LIMIT 25 OFFSET 0;

This works at the moment because there are only 23,000 revisions in that date range, so that is only a maximum of 23k checks PostgreSQL needs to make in the worst case rather than the current worst case of 3.5 million. The worst case here is triggered by a Branch with < 25 commit...

Read more...

Changed in launchpad-bazaar:
milestone: 2.1.8 → 2.1.9
Revision history for this message
Tim Penhey (thumper) wrote :

I agree with stub's analysis. I'll look to fix this this afternoon.

I'm hesitant to have all revisions in the last month as for some products and projects there will be *many* revisions.

Changed in launchpad-bazaar:
assignee: rockstar → thumper
Revision history for this message
Stuart Bishop (stub) wrote :

If we use revision_date we need to cope with revision_dates in the future. At the moment it is possible to kill the feed by pushing 25 future revisions.

The simple fix is to add a WHERE revision_date <= CURRENT_TIMESTAMP clause, although then it is possible to add revisions that will never be seen in the feed then (a possible way of hiding trojans if we use this pattern elsewhere).

Tim Penhey (thumper)
Changed in launchpad-bazaar:
status: Confirmed → In Progress
Revision history for this message
Ursula Junque (ursinha) wrote :

<thumper> bug 260140 bounced pqm due to conflict, resolving as we speak and resubmitting

Revision history for this message
Tim Penhey (thumper) wrote :

Fixed in RF 6972.

Changed in launchpad-bazaar:
status: In Progress → Fix Committed
Revision history for this message
Christian Reis (kiko) wrote : Re: [Bug 260140] Re: revisions' rss feed timing out for some projects

Cherry-pick candidate?
--
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125

Paul Hummer (rockstar)
Changed in launchpad-bazaar:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.