Comment 10 for bug 260140

Stuart Bishop (stub) wrote :

Tim's diagnosis was incorrect - the query was not performing badly for projects with a small number of revisions, but for projects where there are a lot of revisions in the system with a more recent commit date than those in your branch. The currently deployed query basically iterates over Revision in order of revision_date, issuing a subquery for each row. This generally performs quickly for the most active projects because they will nearly always have 25 recent revisions near the top of the heap of Revisions. It sucks for all the other projects. If you look at the query plans, they are identical and have an identical estimated cost of 0 to 8.85 million. The worst case is triggered by a Branch with < 25 commits, as the entire Revision table will be searched.

Tim's approach will work for our current data, as the number of BranchRevisions currently happens to be an indicator of commit activity too. It is possible to tell if there are more than a certain threshold of BranchRevision rows using a LIMIT trick (without the trick, you are stuck at a 14 second worst case for doing the check):

SELECT COUNT(*) >= 500000 FROM (
    SELECT * FROM Branch, BranchRevision
    WHERE Branch.private IS FALSE
        AND Branch.id = BranchRevision.branch
        AND Branch.product = 1186
    LIMIT 500000) AS Whatever;

This check takes about 2 seconds worst case.
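The point of the trick is that the inner LIMIT lets the scan stop as soon as the threshold is reached, instead of counting every matching row. A minimal sqlite3 sketch of the same idea - the table name, ids and the tiny threshold here are illustrative stand-ins, not Launchpad's real schema or the 500,000 cutoff:

```python
# Illustrative sketch of the LIMIT-capped count trick (toy schema, not
# Launchpad's; THRESHOLD stands in for the real 500,000 cutoff).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branch_revision (branch INTEGER, revision INTEGER)")
conn.executemany(
    "INSERT INTO branch_revision VALUES (?, ?)",
    [(1, i) for i in range(1000)],
)

THRESHOLD = 500

# Count rows from a subquery capped at THRESHOLD rather than counting
# everything: the scan can stop as soon as the cap is hit.
row = conn.execute(
    """
    SELECT COUNT(*) >= ? FROM (
        SELECT 1 FROM branch_revision WHERE branch = ? LIMIT ?
    )
    """,
    (THRESHOLD, 1, THRESHOLD),
).fetchone()
print(bool(row[0]))  # True: branch 1 has at least THRESHOLD revisions
```

The capped subquery is what keeps the worst case bounded: the cost is proportional to the threshold, not to the total number of matching BranchRevision rows.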

We only have four products with > 1 million BranchRevisions. I've checked all of them, and the currently deployed query runs very fast for each. There is then a big gap: the next highest BranchRevision count is 302,000. Branches for these less active products do not perform well with the currently deployed query.

So if you want to do the check, the cutoff in BranchRevision counts should be 500,000.

I don't see this as a long term fix though, as the data will change over time - if a project with a large number of BranchRevisions becomes inactive, then some time later the query will start timing out. Similarly, if more projects switch to bzr, we will end up with a lot more BranchRevisions competing for that sweet spot at the top of the ordered set of Revisions. Because this isn't a long term fix, we might as well avoid the count() check altogether, saving a few seconds, and hard code the list of products - bzr, mysql-server, python and evolution.

There does seem to be an alternative approach though. If we are prepared to change the feed from 'last 25 revisions' to 'last 25 revisions, if they were made in the last two weeks', then the following seems to perform acceptably:

SELECT Revision.* FROM (
    SELECT * FROM Revision
    WHERE revision_date > CURRENT_TIMESTAMP AT TIME ZONE 'UTC'
        - interval '2 weeks'
    ) AS Revision
WHERE EXISTS (
    SELECT TRUE FROM Branch, BranchRevision
    WHERE BranchRevision.revision = Revision.id
        AND BranchRevision.branch = Branch.id
        AND Branch.private IS FALSE AND Branch.product = 1186)
ORDER BY Revision.revision_date DESC LIMIT 25 OFFSET 0;

This works at the moment because there are only 23,000 revisions in that date range, so PostgreSQL needs to make at most 23k checks in the worst case, rather than the current worst case of 3.5 million. The worst case here is again triggered by a Branch with < 25 commits. Testing using the 'espresso' product, with a total of 12 commits between March and August this year, shows this performs OK, and we could increase the date range to a couple of months if necessary. We may also want to drop the LIMIT 25 entirely (or better yet, change it to a LIMIT 2000) and present this as 'commits made in the last 2 weeks' rather than 'last 25 commits' - it is just as expensive to do from the database side.
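To show the shape of the date-window filter, here is a small sqlite3 sketch. The schema, ids and dates are made up for illustration (sqlite has no `AT TIME ZONE` or interval arithmetic, so the cutoff is computed in Python), but the structure mirrors the query above: restrict Revision to the window first, then apply the EXISTS check:

```python
# Illustrative sketch of the two-week-window query (toy schema and data;
# product id 1186 is carried over from the discussion above).
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (id INTEGER PRIMARY KEY, revision_date TEXT);
    CREATE TABLE branch (id INTEGER PRIMARY KEY, product INTEGER, private INTEGER);
    CREATE TABLE branch_revision (branch INTEGER, revision INTEGER);
""")

now = datetime(2008, 9, 1)
# One revision inside the two-week window, one well outside it.
conn.execute("INSERT INTO revision VALUES (1, ?)",
             ((now - timedelta(days=3)).isoformat(),))
conn.execute("INSERT INTO revision VALUES (2, ?)",
             ((now - timedelta(days=90)).isoformat(),))
conn.execute("INSERT INTO branch VALUES (10, 1186, 0)")
conn.executemany("INSERT INTO branch_revision VALUES (?, ?)",
                 [(10, 1), (10, 2)])

# sqlite lacks interval arithmetic, so compute the cutoff in Python;
# ISO-8601 strings compare correctly as text.
cutoff = (now - timedelta(weeks=2)).isoformat()
rows = conn.execute("""
    SELECT r.id FROM revision r
    WHERE r.revision_date > ?
      AND EXISTS (
        SELECT 1 FROM branch b, branch_revision br
        WHERE br.revision = r.id AND br.branch = b.id
          AND b.private = 0 AND b.product = 1186)
    ORDER BY r.revision_date DESC LIMIT 25
""", (cutoff,)).fetchall()
print([r[0] for r in rows])  # [1]: only the revision inside the window
```

The date filter is what bounds the work: only revisions inside the window are candidates for the per-row EXISTS check, which is exactly why the worst case drops from the whole Revision table to the 23k recent rows.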