DistroSeries.getBuildRecords often timing out

Bug #590708 reported by William Grant
Affects: Launchpad itself | Status: Fix Released | Importance: High | Assigned to: William Grant

Bug Description

Since the rollout of 10.05, DistroSeries.getBuildRecords has begun to time out frequently like OOPS-1619ED582.

tags: added: api soyuz-build
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
tags: added: oops timeout
Changed in soyuz:
assignee: nobody → Michael Nelson (michael.nelson)
Changed in soyuz:
milestone: none → 10.06
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This was most probably caused by the snafu with the master/slave setup, it seems to be ok now.

Changed in soyuz:
status: Triaged → Invalid
Revision history for this message
Michael Nelson (michael.nelson) wrote :

From http://irclogs.ubuntu.com/2010/06/10/%23launchpad-dev.html#t10:19
{{{
11:19 < wgrant> noodles775: Have the getBuildRecords timeouts been fixed? I've only had a few emails from cron today, rather than the usual several dozen.
11:19 < wgrant> It still times out sometimes, but not as much.
...
11:26 < noodles775> wgrant: no, well not specifically. There has been some changes since yesterday that reduce the db load, so that might be the reason you're seeing less timeouts.
}}}

Revision history for this message
William Grant (wgrant) wrote :

It's still happening several times a day.

Changed in soyuz:
status: Invalid → New
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Gar. Can you post an up-to-date OOPS, please?

Changed in soyuz:
status: New → Triaged
Changed in soyuz:
milestone: 10.07 → 10.08
Revision history for this message
William Grant (wgrant) wrote :

There are two slow queries: a COUNT and an actual retrieval with similar conditions. http://paste.ubuntu.com/465800/ is an EXPLAIN ANALYZE of the latter query, from yesterday.

Revision history for this message
Robert Collins (lifeless) wrote :

https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1172
query plan
http://paste.ubuntu.com/465800/
https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1192
and more - 78 yesterday

https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1116 also is a separate reported thing, same actual cause. 29 of those.
so - on edge - 127 yesterday.

Revision history for this message
Robert Collins (lifeless) wrote :

18. 100 6485ms launchpad-main-master SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarmJob.date_finished IS NOT NULL) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND BuildFarmJob.status=2 AND Archive.purpose IN (1,4) AND Archive.id = PackageBuild.archive AND (1=1)
19. 6585 6168ms launchpad-main-master SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_build, BinaryPackageBuild.source_package_release FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarmJob.date_finished IS NOT NULL) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND BuildFarmJob.status=2 AND Archive.purpose IN (1,4) AND Archive.id = PackageBuild.archive AND (1=1) ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0

are the two specific problematic queries. I'm going to subscribe stub for insight.

I'm also told that lazr.restful is doing the count - we can save 50% of the SQL time (roughly) by not doing that - so subscribing leonard for input on that.

Revision history for this message
Данило Шеган (danilo) wrote :

Looking at the code, it seems Postgres doesn't choose an optimal query plan when looking through PackageBuild table. To help it out with that, one can do a subselect. The following query improves the situation immensely for me with warm caches on staging:

explain analyze
SELECT
  BinaryPackageBuild.*
  FROM BinaryPackageBuild
  JOIN PackageBuild
    ON BinaryPackageBuild.package_build = PackageBuild.id
  JOIN BuildFarmJob
    ON PackageBuild.build_farm_job = BuildFarmJob.id
  WHERE
    distro_arch_series IN (109, 110, 111, 112, 113, 114) AND
    (BuildFarmJob.status <> 1 OR
     BuildFarmJob.date_finished IS NOT NULL) AND
    BuildFarmJob.status=2 AND
    PackageBuild.archive IN (
      SELECT Archive.id
        FROM PackageBuild
        JOIN Archive
          ON Archive.id = PackageBuild.archive
        WHERE Archive.purpose IN (1,4) AND
              PackageBuild.id = BinaryPackageBuild.package_build)
    AND (1=1)
  ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0
;
On cold caches, it's even slower (~8s), but right after that it goes down to 250ms on staging (the original query takes 7s on cold caches, but warm caches don't improve it much as it only goes down to 3-4s). The count is still slow, but takes 2s on staging with warm caches (whereas it stays around 4s on the old count query).

This trick would make use of the fact that any reasonably often used indexes would end up somewhere in the cache of Postgres on our 128GB RAM DB servers. It works well for translations. There are probably even better ways to force Postgres to use a query plan which uses better indexes when joining PackageBuild and BinaryPackageBuild with the Archive.

Revision history for this message
Robert Collins (lifeless) wrote :

We will likely want to review a lot of our schemas as we improve performance, but I think the sort of tradeoff you are suggesting is reasonable in the short term.

The count seems a total waste to me in the case where the returned content is less than one page, so I think we're going to want to fix that.

However, dropping down to 250ms + 2000ms would be awesome.

Care to work up a patch to make it query in this fashion?

Revision history for this message
Leonard Richardson (leonardr) wrote :

Here's an overview of what happens:

The tagged interface method is IHasBuildRecords.getBuildRecords(). I presume the method that's timing out is the one in DistroArchSeries, not the method of the same name in Archive.

When a GET request comes in triggering the named operation getBuildRecords, lazr.restful code calls DistroArchSeries.getBuildRecords(). This calls BinaryPackageBuildSet.getBuildsByArchIds, which returns a ResultSet based on a complicated query.

Control then returns to lazr.restful, which puts the ResultSet into a subclass of lazr.batchnavigator BatchNavigator. We then iterate over navigator.batch, which runs the complicated query the first time. And then we look at navigator.listlength, which runs the query again.

= QUESTION =

It sounds like one of these queries would be okay on its own, but both of them together exceed the timeout. Is this true? If someone really wanted the count, and made an HTTP request whose only purpose was to calculate the count, would the count query on its own exceed the timeout?

= SUGGESTED SOLUTION =

We provide the total length of a collection in a batched response as a convenience for the end-user. When the length is expensive to calculate or it's unlikely the end-user will want the length, we can instead provide a _link_ to the length of the collection. That is, instead of the JSON response including a key like this:

  'total_size' : 3000

It can include a key like this:

  'total_size_link' : 'https://edge.launchpad.net/api/1.0/ubuntu/maverick?ws.op=getBuildRecords&ws.show=total_size'

Sending a GET to that link would return a JSON document containing just the size, e.g.

  3000
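The client side of this scheme can be sketched in a few lines. The helper below is hypothetical (it is not the real lazr.restfulclient API): it reads the inline total_size when present and only follows total_size_link, with an extra GET, when the size was exported as a link.

```python
def collection_size(batch, fetch=None):
    """Return the total size of a batched collection response.

    `batch` is the decoded JSON dict of one batch; `fetch` is a
    callable that GETs a URL and returns the decoded JSON body.
    It is only invoked when total_size_link is present, so the
    extra request (and the expensive COUNT) happens on demand.
    """
    if 'total_size' in batch:
        return batch['total_size']
    if 'total_size_link' in batch:
        if fetch is None:
            raise ValueError('need a fetch callable to follow total_size_link')
        return fetch(batch['total_size_link'])
    raise KeyError('response carries no size information')
```

A client that never asks for the length never pays for the count; one that does pays for it in a separate, dedicated request.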

Right now, getBuildRecords is published with annotations including this one:

    # Really a IBuild see _schema_circular_imports.
    @operation_returns_collection_of(Interface)

To switch from total_size to total_size_link, we would change that annotation:

    # Really a IBuild; see _schema_circular_imports.
    @operation_returns_collection_of(Interface, include_total_size=False)

We would change the WADL so that 'total_size_link' was published as a parameter to this named operation instead of 'total_size'.
We would change lazr.restfulclient so that its __len__ implementation understood what to do when total_size_link was present instead of total_size.

Once everyone is using versions of lazr.restfulclient that understand total_size_link (this will take a long time), we can even experiment with making include_total_size=False the default.

= DOWNSIDES =

This fix breaks backwards compatibility. Old versions of lazr.restfulclient will not be able to obtain the length of a collection with include_total_size=False, because they won't know how to handle total_size_link.

We could maintain a bit of backwards compatibility by exporting this operation for 'beta' with include_total_size=True, un-exporting it for '1.0', and then re-exporting it for '1.0' with include_total_size=False. This would give 'beta' the old behavior and '1.0' the new behavior. (Or, mutatis mutandis, to give 'beta' and '1.0' the old behavior and 'devel' the new behavior.)

This is a good solution when we remove total_size to make a named operation m...


Revision history for this message
Leonard Richardson (leonardr) wrote :

I'd like to know which OAuth consumer keys are associated with the requests for this named operation, so that I can talk to the authors/examine the code and see if they need the total_size. If they are using recent versions of lazr.restfulclient, the consumer keys will be in the User-Agent strings. Otherwise, I don't know how to acquire them or if it's even possible. For clients run within Canonical, we can often figure out which client is running the code by looking at the IP address and seeing who's running code on that server.

Revision history for this message
Robert Collins (lifeless) wrote :

https://pastebin.canonical.com/34829/ - run on a slave

Still doing some table scans, but better. Not a 250ms query either, which may be load or statistics, I guess.

Revision history for this message
Robert Collins (lifeless) wrote :

More data - we expect the common case to have result sets around 2K, so a quick hack will help but not fix.

Julian suggests an index on (date_finished, status) to avoid the table scans (makes sense to me); vet it on staging, but +1 from my db review hat.

And if we can make APIs more tolerant here, that would rock.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I think a more general approach would be to run an "explain analyze" on all new queries to make sure the appropriate indexes are in place. We should only be scanning small sets of data. (I don't know how small that should be, though; stub would have better info.)

Revision history for this message
Leonard Richardson (leonardr) wrote :

Robert, what do you mean when you say make APIs more tolerant?

AFAICT we have two users of getBuildRecords. I examined their code. William Grant doesn't use the total_size, but Leann Ogasawara does. Leann reports that when the script worked, total_size was in the 2K range. So I doubt the optimization "don't run the count(*) query when the collection size is less than the batch size" will work.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Let me explain the optimization I was investigating, since it hasn't been recorded here.

The idea is to avoid running the count(*) query when the total size of the collection is less than the batch size. If we know this bit of information, we know what value to put for total_size, and we know not to include next_collection_link or prev_collection_link.

The trick is to find the total size of the collection without triggering _Batch.listlength, which will run the count(*) query. This turns out to be fairly difficult in our existing system. The _Batch constructor calls .listlength if no value for it is passed in, and the BatchNavigator constructor (which is what lazr.restful uses) invokes _Batch() without passing in that argument. So right now, the count(*) query is run as soon as we create the BatchNavigator object.

This optimization should not be incredibly difficult, but it's not nearly as easy as I thought, and it may require changing lazr.batchnavigator as well as lazr.restful. Since neither of our users will benefit from this optimization, I suggest we shelve it.
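The shape of the change being described can be sketched as follows. Names are illustrative, not the real lazr.batchnavigator classes, and `count_query` stands in for the slow SELECT COUNT(*). The two key ideas are fetching one row beyond the batch size (so a short first page reveals the total for free) and hiding the count behind a property (so merely constructing the batch never runs it):

```python
class LazyBatch:
    """Sketch of a batch that defers the expensive COUNT(*) query."""

    def __init__(self, results, start, size, count_query):
        # Fetch one row beyond the batch: if we start at 0 and get
        # size rows or fewer, the slice *is* the whole collection.
        self._slice = list(results[start:start + size + 1])
        self._count_query = count_query
        self._length = None
        self.start = start
        self.size = size

    @property
    def batch(self):
        return self._slice[:self.size]

    @property
    def has_next(self):
        return len(self._slice) > self.size

    @property
    def listlength(self):
        # Only computed on demand, unlike a constructor-time count.
        if self._length is None:
            if self.start == 0 and not self.has_next:
                # Whole collection fits in one batch: no COUNT(*) needed.
                self._length = len(self._slice)
            else:
                self._length = self._count_query()
        return self._length
```

With this structure, a response that never touches listlength (or one whose first batch is short) skips the count entirely; the existing _Batch runs it unconditionally at construction time.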

Revision history for this message
Robert Collins (lifeless) wrote :

BuildFarmJob.status <> 1
^- this causes a table scan doesn't it? can we switch it to an IN list ?

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 590708] Re: DistroSeries.getBuildRecords often timing out

On Tuesday 20 July 2010 17:17:03 you wrote:
> BuildFarmJob.status <> 1
> ^- this causes a table scan doesn't it? can we switch it to an IN list ?

Hmmm, I assumed that it would be fine with an index, but maybe not?

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The status <> 1 shouldn't cause a table scan... it would probably be a bitmap scan if it were done directly on a table rather than an intermediate result. In any case it doesn't seem to be where the cost is.

The planner goes horribly wrong in estimating the number of PackageBuilds it needs to take into account: it expects about 500 and gets about 28K. The subquery Danilo suggests would fix that, I guess. BTW that version still contains a redundant join to a second PackageBuild instance. Getting rid of that may speed things up further.

Revision history for this message
Данило Шеган (danilo) wrote :

About the difference in timings on production: my guess is that yes, it's partly statistics and fragmentation (staging, being restored the way it is, has basically zero fragmentation and is fully packed). However, since our entire DB is larger than the RAM on the server, it could also be a case of cold caches (i.e. no queries were using this particular index, for instance). That means that the query would take that long roughly once per DB server restart, which shouldn't be too often. Note that this is just from the perspective of an LP developer with lots of Postgres query optimization behind him; I am not claiming to know exactly what happens in Postgres beyond what explain analyze and \timing (in psql) tell me (the rest is just educated guessing :).

FWIW, this query was hitting mostly indexes even in the original version, but the biggest waste of time was in joining all PackageBuilds with Archives, which is unnecessary for this particular query (i.e. you want to join only the PackageBuilds that you care about). Using a subselect makes Postgres prefer such a query plan.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I don't know why I took so long to realize this, but here it is. Getting rid of total_size for certain named operations won't stop the query from being run, for exactly the reasons I outlined above: that query is run as soon as the _Batch object is created. So the first step to any kind of optimization is to get a _Batch-like object that doesn't need to know the total size of the collection unless you explicitly ask for it.

I can implement include_total_size and it will work fine, but it won't have any effect on performance unless we solve this other problem.

Revision history for this message
Данило Шеган (danilo) wrote :

Warm caches seem to help even on production, and my suggested query runs much faster after caches are warm: http://pastebin.ubuntu.com/466604/

Revision history for this message
Stuart Bishop (stub) wrote :

On Wed, Jul 21, 2010 at 1:38 AM, Данило Шеган <email address hidden> wrote:
> Warm caches seem to help even on production, and my suggested query runs
> much faster after caches are warm: http://pastebin.ubuntu.com/466604/

I have so far been unable to beat Danilo's optimization.

You might want to confirm that the query is actually doing what you
want, btw. Looking at the COUNT query lifeless cites (the first), not
only are there two repeated joins in the WHERE clause, but there is
also a spurious status clause ('status <> 1 OR date_finished IS NOT
NULL' is irrelevant, as it is overridden by the more restrictive
'status=2' clause). This might indicate some missing brackets
somewhere.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message
Gary Poster (gary) wrote :
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Benji York (benji)
Revision history for this message
Robert Collins (lifeless) wrote :

Additionally, lazr.restful *appears* to be causing a count(*) on *every* page of a paginated API, not just the first one. This may require a separate fix? See https://lp-oops.canonical.com/oops.py/?oopsid=1681EB1345 query 6 - takes 4 seconds on its own.

Changed in soyuz:
milestone: 10.08 → none
assignee: Michael Nelson (michael.nelson) → nobody
William Grant (wgrant)
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → William Grant (wgrant)
Revision history for this message
Ursula Junque (ursinha) wrote : Bug fixed by a commit
tags: added: qa-needstesting
tags: removed: oops
Changed in soyuz:
status: In Progress → Fix Committed
milestone: none → 10.08
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in soyuz:
status: Fix Committed → Fix Released
Revision history for this message
Gary Poster (gary) wrote : Re: [Bug 590708] Re: DistroSeries.getBuildRecords often timing out

On Aug 9, 2010, at 8:02 PM, Robert Collins wrote:

> Additionally, lazr.restful *appears* to be causing a count(*) on *every*
> page of a paginated API, not just the first one. This may require a
> separate fix? See https://lp-
> oops.canonical.com/oops.py/?oopsid=1681EB1345 query 6 - takes 4 seconds
> on its own.

I believe it is the same fix in lazr.restful. The count was done every time a batch was instantiated, AIUI.

Changed in launchpad-foundations:
status: Triaged → Fix Released