DistroSeries.getBuildRecords often timing out

Bug #590708 reported by William Grant
Affects: Launchpad itself | Status: Fix Released | Importance: High | Assigned to: William Grant

Bug Description

Since the rollout of 10.05, DistroSeries.getBuildRecords has begun to time out frequently like OOPS-1619ED582.

tags: added: api soyuz-build
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
tags: added: oops timeout
Changed in soyuz:
assignee: nobody → Michael Nelson (michael.nelson)
Changed in soyuz:
milestone: none → 10.06
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This was most probably caused by the snafu with the master/slave setup, it seems to be ok now.

Changed in soyuz:
status: Triaged → Invalid
Revision history for this message
Michael Nelson (michael.nelson) wrote :

From http://irclogs.ubuntu.com/2010/06/10/%23launchpad-dev.html#t10:19
{{{
11:19 < wgrant> noodles775: Have the getBuildRecords timeouts been fixed? I've only had a few emails from cron today, rather than the usual several dozen.
11:19 < wgrant> It still times out sometimes, but not as much.
...
11:26 < noodles775> wgrant: no, well not specifically. There has been some changes since yesterday that reduce the db load, so that might be the reason you're seeing less timeouts.
}}}

Revision history for this message
William Grant (wgrant) wrote :

It's still happening several times a day.

Changed in soyuz:
status: Invalid → New
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Gar. Can you post an up-to-date OOPS, please?

Changed in soyuz:
status: New → Triaged
Changed in soyuz:
milestone: 10.07 → 10.08
Revision history for this message
William Grant (wgrant) wrote :

There are two slow queries: a COUNT and an actual retrieval with similar conditions. http://paste.ubuntu.com/465800/ is an EXPLAIN ANALYZE of the latter query, from yesterday.

Revision history for this message
Robert Collins (lifeless) wrote :

https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1172
query plan
http://paste.ubuntu.com/465800/
https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1192
and more - 78 yesterday

https://lp-oops.canonical.com/oops.py/?oopsid=1661EA1116 also is a separate reported thing, same actual cause. 29 of those.
so - on edge - 127 yesterday.

Revision history for this message
Robert Collins (lifeless) wrote :

18. 100 6485ms launchpad-main-master SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarmJob.date_finished IS NOT NULL) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND BuildFarmJob.status=2 AND Archive.purpose IN (1,4) AND Archive.id = PackageBuild.archive AND (1=1)
19. 6585 6168ms launchpad-main-master SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_build, BinaryPackageBuild.source_package_release FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarmJob.date_finished IS NOT NULL) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND BuildFarmJob.status=2 AND Archive.purpose IN (1,4) AND Archive.id = PackageBuild.archive AND (1=1) ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0

are the two specific problematic queries. I'm going to subscribe stub for insight.

I'm also told that lazr.restful is doing the count - we can save 50% of the SQL time (roughly) by not doing that - so subscribing leonard for input on that.

Revision history for this message
Данило Шеган (danilo) wrote :

Looking at the code, it seems Postgres doesn't choose an optimal query plan when looking through PackageBuild table. To help it out with that, one can do a subselect. The following query improves the situation immensely for me with warm caches on staging:

explain analyze
SELECT
  BinaryPackageBuild.*
  FROM BinaryPackageBuild
  JOIN PackageBuild
    ON BinaryPackageBuild.package_build = PackageBuild.id
  JOIN BuildFarmJob
    ON PackageBuild.build_farm_job = BuildFarmJob.id
  WHERE
    distro_arch_series IN (109, 110, 111, 112, 113, 114) AND
    (BuildFarmJob.status <> 1 OR
     BuildFarmJob.date_finished IS NOT NULL) AND
    BuildFarmJob.status=2 AND
    PackageBuild.archive IN (
      SELECT Archive.id
        FROM PackageBuild
        JOIN Archive
          ON Archive.id = PackageBuild.archive
        WHERE Archive.purpose IN (1,4) AND
              PackageBuild.id = BinaryPackageBuild.package_build)
    AND (1=1)
  ORDER BY BuildFarmJob.date_finished DESC LIMIT 50 OFFSET 0
;
On cold caches, it's even slower (~8s), but right after that it goes down to 250ms on staging (the original query takes 7s on cold caches, but warm caches don't improve it much as it only goes down to 3-4s). The count is still slow, but takes 2s on staging with warm caches (whereas it stays around 4s on the old count query).

This trick would make use of the fact that any reasonably often used indexes would end up somewhere in the cache of Postgres on our 128GB RAM DB servers. It works well for translations. There are probably even better ways to force Postgres to use a query plan which uses better indexes when joining PackageBuild and BinaryPackageBuild with the Archive.

Revision history for this message
Robert Collins (lifeless) wrote :

We will likely want to review a lot of our schemas as we improve performance, but I think the sort of tradeoff you are suggesting is reasonable in the short term.

The count seems a total waste to me in the case where the returned content is less than one page, so I think we're going to want to fix that.

However, dropping down to 250ms + 2000ms would be awesome.

Care to work up a patch to make it query in this fashion?

Revision history for this message
Leonard Richardson (leonardr) wrote :

Here's an overview of what happens:

The tagged interface method is IHasBuildRecords.getBuildRecords(). I presume the method that's timing out is the one in DistroArchSeries, not the method of the same name in Archive.

When a GET request comes in triggering the named operation getBuildRecords, lazr.restful code calls DistroArchSeries.getBuildRecords(). This calls BinaryPackageBuildSet.getBuildsByArchIds, which returns a ResultSet based on a complicated query.

Control then returns to lazr.restful, which puts the ResultSet into a subclass of lazr.batchnavigator BatchNavigator. We then iterate over navigator.batch, which runs the complicated query the first time. And then we look at navigator.listlength, which runs the query again.

= QUESTION =

It sounds like one of these queries would be okay on its own, but both of them together exceed the timeout. Is this true? If someone really wanted the count, and made an HTTP request whose only purpose was to calculate the count, would the count query on its own exceed the timeout?

= SUGGESTED SOLUTION =

We provide the total length of a collection in a batched response as a convenience for the end-user. When the length is expensive to calculate or it's unlikely the end-user will want the length, we can instead provide a _link_ to the length of the collection. That is, instead of the JSON response including a key like this:

  'total_size' : 3000

It can include a key like this:

  'total_size_link' : 'https://edge.launchpad.net/api/1.0/ubuntu/maverick?ws.op=getBuildRecords&ws.show=total_size'

Sending a GET to that link would return a JSON document containing just the size, e.g.

  3000
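The client side of this scheme can be sketched in a few lines. The helper below is hypothetical (it is not the real lazr.restfulclient API): it reads the inline total_size when present and only follows total_size_link, with an extra GET, when the size was exported as a link.

```python
def collection_size(batch, fetch=None):
    """Return the total size of a batched collection response.

    `batch` is the decoded JSON dict of one batch; `fetch` is a
    callable that GETs a URL and returns the decoded JSON body.
    It is only invoked when total_size_link is present, so the
    extra request (and the expensive COUNT) happens on demand.
    """
    if 'total_size' in batch:
        return batch['total_size']
    if 'total_size_link' in batch:
        if fetch is None:
            raise ValueError('need a fetch callable to follow total_size_link')
        return fetch(batch['total_size_link'])
    raise KeyError('response carries no size information')
```

A client that never asks for the length never pays for the count; one that does pays for it in a separate, dedicated request.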

Right now, getBuildRecords is published with annotations including this one:

    # Really a IBuild see _schema_circular_imports.
    @operation_returns_collection_of(Interface)

To switch from total_size to total_size_link, we would change that annotation:

    # Really a IBuild; see _schema_circular_imports.
    @operation_returns_collection_of(Interface, include_total_size=False)

We would change the WADL so that 'total_size_link' was published as a parameter to this named operation instead of 'total_size'.
We would change lazr.restfulclient so that its __len__ implementation understood what to do when total_size_link was present instead of total_size.

Once everyone is using versions of lazr.restfulclient that understand total_size_link (this will take a long time), we can even experiment with making include_total_size=False the default.

= DOWNSIDES =

This fix breaks backwards compatibility. Old versions of lazr.restfulclient will not be able to obtain the length of a collection with include_total_size=False, because they won't know how to handle total_size_link.

We could maintain a bit of backwards compatibility by exporting this operation for 'beta' with include_total_size=True, un-exporting it for '1.0', and then re-exporting it for '1.0' with include_total_size=False. This would give 'beta' the old behavior and '1.0' the new behavior. (Or, mutatis mutandis, to give 'beta' and '1.0' the old behavior and 'devel' the new behavior.)

This is a good solution when we remove total_size to make a named operation m...


Revision history for this message
Leonard Richardson (leonardr) wrote :

I'd like to know which OAuth consumer keys are associated with the requests for this named operation, so that I can talk to the authors/examine the code and see if they need the total_size. If they are using recent versions of lazr.restfulclient, the consumer keys will be in the User-Agent strings. Otherwise, I don't know how to acquire them or if it's even possible. For clients run within Canonical, we can often figure out which client is running the code by looking at the IP address and seeing who's running code on that server.

Revision history for this message
Robert Collins (lifeless) wrote :

https://pastebin.canonical.com/34829/ - run on a slave

Still doing some table scans, but better. Not a 250ms query either, which may be load or statistics, I guess.

Revision history for this message
Robert Collins (lifeless) wrote :

More data - we expect the common case to have result sets around 2K, so a quick hack will help but not fix.

Julian suggests an index on (date_finished, status) to avoid the table scans (makes sense to me); vet it on staging, but +1 from my db review hat.

And if we can make APIs more tolerant here, that would rock.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I think a more general approach would be to run an "explain analyze" on all new queries to make sure the appropriate indexes are in place. We should only be scanning small sets of data. (I don't know how small that should be, though; stub would have better info.)

Revision history for this message
Leonard Richardson (leonardr) wrote :

Robert, what do you mean when you say make APIs more tolerant?

AFAICT we have two users of getBuildRecords. I examined their code. William Grant doesn't use the total_size, but Leann Ogasawara does. Leann reports that when the script worked, total_size was in the 2K range. So I doubt the optimization "don't run the count(*) query when the collection size is less than the batch size" will work.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Let me explain the optimization I was investigating, since it hasn't been recorded here.

The idea is to avoid running the count(*) query when the total size of the collection is less than the batch size. If we know this bit of information, we know what value to put for total_size, and we know not to include next_collection_link or prev_collection_link.

The trick is to find the total size of the collection without triggering _Batch.listlength, which will run the count(*) query. This turns out to be fairly difficult in our existing system. The _Batch constructor calls .listlength if no value for it is passed in, and the BatchNavigator constructor (which is what lazr.restful uses) invokes _Batch() without passing in that argument. So right now, the count(*) query is run as soon as we create the BatchNavigator object.

This optimization should not be incredibly difficult, but it's not nearly as easy as I thought, and it may require changing lazr.batchnavigator as well as lazr.restful. Since neither of our users will benefit from this optimization, I suggest we shelve it.
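The shape of the change being described can be sketched as follows. Names are illustrative, not the real lazr.batchnavigator classes, and `count_query` stands in for the slow SELECT COUNT(*). The two key ideas are fetching one row beyond the batch size (so a short first page reveals the total for free) and hiding the count behind a property (so merely constructing the batch never runs it):

```python
class LazyBatch:
    """Sketch of a batch that defers the expensive COUNT(*) query."""

    def __init__(self, results, start, size, count_query):
        # Fetch one row beyond the batch: if we start at 0 and get
        # size rows or fewer, the slice *is* the whole collection.
        self._slice = list(results[start:start + size + 1])
        self._count_query = count_query
        self._length = None
        self.start = start
        self.size = size

    @property
    def batch(self):
        return self._slice[:self.size]

    @property
    def has_next(self):
        return len(self._slice) > self.size

    @property
    def listlength(self):
        # Only computed on demand, unlike a constructor-time count.
        if self._length is None:
            if self.start == 0 and not self.has_next:
                # Whole collection fits in one batch: no COUNT(*) needed.
                self._length = len(self._slice)
            else:
                self._length = self._count_query()
        return self._length
```

With this structure, a response that never touches listlength (or one whose first batch is short) skips the count entirely; the existing _Batch runs it unconditionally at construction time.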

Revision history for this message
Robert Collins (lifeless) wrote :

BuildFarmJob.status <> 1
^- this causes a table scan doesn't it? can we switch it to an IN list ?

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 590708] Re: DistroSeries.getBuildRecords often timing out

On Tuesday 20 July 2010 17:17:03 you wrote:
> BuildFarmJob.status <> 1
> ^- this causes a table scan doesn't it? can we switch it to an IN list ?

Hmmm, I assumed that it would be fine with an index, but maybe not?

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The status <> 1 shouldn't cause a table scan... it would probably be a bitmap scan if it were done directly on a table rather than an intermediate result. In any case it doesn't seem to be where the cost is.

The planner goes horribly wrong in estimating the number of PackageBuilds it needs to take into account: it expects about 500 and gets about 28K. The subquery Danilo suggests would fix that, I guess. BTW that version still contains a redundant join to a second PackageBuild instance. Getting rid of that may speed things up further.

Revision history for this message
Данило Шеган (danilo) wrote :

About the difference in timings on production: my guess is that yes, it's partly statistics and fragmentation (staging, being restored the way it is, has basically zero fragmentation and is fully packed). However, since our entire DB is larger than the RAM on the server, it could also be a case of cold caches (i.e. no queries were using this particular index, for instance). That means that the query would take that long roughly once per DB server restart, which shouldn't be too often. Note that this is just from the perspective of an LP developer with lots of Postgres query optimization behind him; I am not claiming to know exactly what happens in Postgres beyond what explain analyze and \timing (in psql) tell me (the rest is just educated guessing :).

FWIW, this query was hitting mostly indexes even in the original version, but the biggest waste of time was in joining all PackageBuilds with Archives, which is unnecessary for this particular query (i.e. you want to join only the PackageBuilds that you care about). Using a subselect makes Postgres prefer such a query plan.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I don't know why I took so long to realize this, but here it is. Getting rid of total_size for certain named operations won't stop the query from being run, for exactly the reasons I outlined above: that query is run as soon as the _Batch object is created. So the first step to any kind of optimization is to get a _Batch-like object that doesn't need to know the total size of the collection unless you explicitly ask for it.

I can implement include_total_size and it will work fine, but it won't have any effect on performance unless we solve this other problem.

Revision history for this message
Данило Шеган (danilo) wrote :

Warm caches seem to help even on production, and my suggested query runs much faster after caches are warm: http://pastebin.ubuntu.com/466604/

Revision history for this message
Stuart Bishop (stub) wrote :

On Wed, Jul 21, 2010 at 1:38 AM, Данило Шеган <email address hidden> wrote:
> Warm caches seem to help even on production, and my suggested query runs
> much faster after caches are warm: http://pastebin.ubuntu.com/466604/

I have so far been unable to beat Danilo's optimization.

You might want to confirm that the query is actually doing what you
want, btw. Looking at the COUNT query lifeless cites (the first), not
only are there two repeated joins in the WHERE clause, but there is
also a spurious status clause ('status <> 1 OR date_finished IS NOT
NULL' is irrelevant, as it is overridden by the more restrictive
'status=2' clause). This might indicate some missing brackets
somewhere.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message
Gary Poster (gary) wrote :
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Benji York (benji)
Revision history for this message
Robert Collins (lifeless) wrote :

Additionally, lazr.restful *appears* to be causing a count(*) on *every* page of a paginated API, not just the first one. This may require a separate fix? See https://lp-oops.canonical.com/oops.py/?oopsid=1681EB1345 query 6 - takes 4 seconds on its own.

Changed in soyuz:
milestone: 10.08 → none
assignee: Michael Nelson (michael.nelson) → nobody
William Grant (wgrant)
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → William Grant (wgrant)
Revision history for this message
Ursula Junque (ursinha) wrote : Bug fixed by a commit
tags: added: qa-needstesting
tags: removed: oops
Changed in soyuz:
status: In Progress → Fix Committed
milestone: none → 10.08
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in soyuz:
status: Fix Committed → Fix Released
Revision history for this message
Gary Poster (gary) wrote : Re: [Bug 590708] Re: DistroSeries.getBuildRecords often timing out

On Aug 9, 2010, at 8:02 PM, Robert Collins wrote:

> Additionally, lazr.restful *appears* to be causing a count(*) on *every*
> page of a paginated API, not just the first one. This may require a
> separate fix? See https://lp-
> oops.canonical.com/oops.py/?oopsid=1681EB1345 query 6 - takes 4 seconds
> on its own.

I believe it is the same fix in lazr.restful. The count was done every time a batch was instantiated, AIUI.

Changed in launchpad-foundations:
status: Triaged → Fix Released