wadl generation timeout?

Bug #607961 reported by Robert Collins
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Gary Poster

Bug Description

This oops - https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1662EA447 seems odd. One explanation is that its getting the wadl, but I really don't know whats up.

I hope foundations is a good place to start looking :)

SQL time: 381 ms
Non-sql time: 122201 ms
Total time: 122582 ms
Statement Count: 11

https://api.edge.launchpad.net/beta is the url.

Related branches

Revision history for this message
Robert Collins (lifeless) wrote :

Sorry for the status noise, realised it was soft shortly after filing: at this point I think sleep is needed ;)

Changed in launchpad-foundations:
importance: High → Medium
Revision history for this message
Gary Poster (gary) wrote :

WADL is served from memory, and should be super-duper fast. The SQL is just authorization, I believe.

If I had to guess, I'd say that this is a really slow client--but Apache should be slurping this up from Zope quickly and serving it at whatever speed the client wants. So I don't know.

I feel like the wadl ought to be served from Apache, but maybe there's a good reason not to. If we were to do that, though, this OOPS would go away.

Revision history for this message
Robert Collins (lifeless) wrote :
Revision history for this message
Robert Collins (lifeless) wrote :

I'm +1000 on serving the wadl from icing.

How do we make that happen? - It would help to have just a few pointers here :)

Revision history for this message
Gary Poster (gary) wrote :

How so we put wadl in icing: I was imprecise. it actually can't be in +icing because it is supposed to be served from a non-icing URL. I meant that it should be served from Apache. My first guess is that it's a matter of checking in generated wadl (already in the plans for another reason) and asking the LOSAs to set up Apache specifically for this purpose. I believe that should be sufficient.

The first step in making this happen is asking Leonard if there are any surprises to this we should know. Then we should ask ask LOSAs where they might want to find the WADL in the LP tree. Then we should do it.

Leonard, for now at least, I'm assigning this to you just for a comment as to the plan I described.

Thanks

Changed in launchpad-foundations:
assignee: nobody → Leonard Richardson (leonardr)
Revision history for this message
Leonard Richardson (leonardr) wrote :

There's no reason not to serve the service root through Apache. It's the same for every user and we don't want to deny access to it to anyone. I can think of four potential stumbling blocks.

1. Apropos checking in the WADL. I think I mentioned this before, but just in case. The WADL has a URL in it, https://api.launchpad.dev/beta/ or similar. This WADL is different for every site. Whose WADL gets checked in? We either need to check in all the WADL, or we need to check in a generic WADL and customize it as part of the build process.

2. There are multiple versions of the WADL at different URLs. /beta/, /1.0/, etc. Each will correspond to a different file on disk. I don't think this will cause much of a problem; we can use a rewrite rule to map /(.*)/$ to /path/to/wadl/\1.xml if that file exists.

3. The WADL doesn't have a URL all to itself. It's the WADL representation of the service root, and we use content negotiation to determine whether to serve WADL or JSON. I don't know if it's possible to do the content negotiation and send the request to Apache if it's WADL and to Zope if it's JSON.

We'd have a simpler and faster system if we also pre-generated the JSON, checked *that* in, and served both representations from Apache. Any argument for serving the WADL from Apache applies just as strongly to the JSON, it's just that the body of the JSON is much smaller. Since the JSON representation is full of URLs, we'd have to generate a generic JSON as well as a generic WADL and customize it at build time.

4. We've implemented optimizations for spending as little time as possible sending out WADL. We serve an ETag so the client can make conditional requests later, and we use mod_compress to compress the representation if the client asks for it. These optimizations need to be preserved.

This shouldn't be difficult: the compression happens in Apache already, and Apache knows how to generate an ETag for a file on disk. There will be a point where the ETag suddenly changes format and everyone re-downloads the WADL, but that will probably happen during a Launchpad upgrade, when the ETag would have changed anyway.

Revision history for this message
Gary Poster (gary) wrote :

Thank you Leonard.

- This is more work than I had hoped, but still something we need to pursue. I'll confer with Francis as to where this might fit in with our other priorities.

- The ETag changing will become a bigger deal when we have more frequent rollouts.

- FWIW, as is often the case, I have a crazy idea (divide up wadl data, pointing to individual parts of it via URIs such as in <link> rel attributes). It probably won't pan out after discussion with Leonard, because of underlying problems with the idea or estimated size of the effort involved, but it would make these mongo get-everything-first requests go away (and move us away from WADL, which appears to have little or no uptake beyond Launchpad).

Changed in launchpad-foundations:
assignee: Leonard Richardson (leonardr) → nobody
Revision history for this message
Leonard Richardson (leonardr) wrote :

If we serve WADL from Apache then it will only change when the web service actually changes (since Apache generates ETags based on the content of a file). As long as we don't put any revision number in the WADL (I don't think we do) we'll be fine. The ETag generated by lazr.restful changes at every rollout because it's based on the revision number.

This is not really the place to discuss such a thing, but what are you thinking of replacing WADL with? We could probably use HTML5 + URI Templates.

Revision history for this message
Benji York (benji) wrote :

Gary and I investigated whether or not soft timeouts can be caused by
very slow clients. They can not. The soft timeout mechanism only
counts time spend in the app (not time spent sending the response to the
client or time reading the request from the client). Therefore there
must be some other reason that the request took so long to process.

We perused the various graphs of machine metrics and couldn't discern an
external reason why the response would take so long (CPU contention,
swap, etc.).

Garry suggested that we instrument the suspect parts of the code
responsible for returning the WADL with timers that will generate an
informational OOPS if any of the timed parts of the operation exceed a
given time. Two good candidates are: when generating WADL and when
reading the WADL from disk.

Also, since we don't expect the WADL to ever be generated in prodcution
(other than at build time), perhaps we should generate an informational
OOPS whenever generating WADL in production.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 607961] Re: wadl generation timeout?

Just to note, reading from disk is a candidate culprit if we are
approach memory limits, seeks and reads are relatively slow.

Revision history for this message
Robert Collins (lifeless) wrote :

High as per zero-oops-policy

Changed in launchpad-foundations:
importance: Medium → High
tags: added: pg83
removed: oops
Gary Poster (gary)
tags: added: oops
removed: pg83
Gary Poster (gary)
tags: removed: oops
Changed in launchpad:
importance: High → Critical
Revision history for this message
Martin Pool (mbp) wrote :

Per discussion in <https://bugs.launchpad.net/launchpad/+bug/548429>, this is quite reproducible with

   curl -v https://api.launchpad.net/devel/ -H "Accept: application/vd.sun.wadl+xml"

Curl obviously is not going to be reading slowly, and I have a high bandwidth connection.

For me this manifests as a 502 gateway timeout, with no visible oops.

Revision history for this message
Robert Collins (lifeless) wrote :

Back to wadl and icing.

I propose the following non-checking approach:
 - generate the wadl to the icing dir
 - have an apache rule to pickup the wadl from the icing dir.

Revision history for this message
Gary Poster (gary) wrote :
Download full text (3.5 KiB)

We need to serve the wadl and json conditionally based on the Accept request header. http://httpd.apache.org/docs/current/content-negotiation.html (http://httpd.apache.org/docs/current/mod/mod_negotiation.html) seems to handle that fine.

The type-map approach would mean that we would need to have explicit type maps for each version of the api. While we freeze the API infrequently enough that this will probably not be a problem often, it would be nice if we could make a solution that was static within our Apache configuration, and the other Apache approach, multiviews, does that.

Here's my understanding so far of what to do, then.

We probably need this in /etc/apache2/mods-enabled/mime.conf (http://httpd.apache.org/docs/current/mod/mod_mime.html).

AddType application/vd.sun.wadl+xml .wadl

This too.

AddType application/json .json

Our wadl generation would be changed. Right now it creates wadl-development-1.0.xml, wadl-development-beta.xml and wadl-development-devel.xml in lib/canonical/launchpad/apidoc . Instead we'd create a structure like this.

  version/index.[wadl|json]

In other words, for the three revisions we have now, you'd see

  beta/index.wadl
  beta/index.json
  1.0/index.wadl
  1.0/index.json
  devel/index.wadl
  devel/index.json

Maybe we'd also symlink the .json to .txt in case we have a json client that sends the wrong mime type.

I'd put it in lib/canonical/launchpad/apidoc because it won't be served the same as icing, and because people are used to finding it there. I think the apidoc stuff needs to be by itself in any case.

In our Apache Launchpad site config for production, equivalent to /etc/apache2/sites-enabled/local-launchpad locally, we would want a new section that turned on MultiViews. Something like this might work:

<VirtualHost 127.0.0.88:443>
  ServerName api.launchpad.dev
  <Proxy *>
    Order deny,allow
    Allow from 127.0.0.0/255.0.0.0
  </Proxy>
  SSLEngine On
  SSLCertificateFile /etc/apache2/ssl/launchpad.crt
  SSLCertificateKeyFile /etc/apache2/ssl/launchpad.key

  ProxyPassMatch ^(/[^/]+/.+)$+ http://localhost:8086/$1 retry=1

  DocumentRoot /var/tmp/api.launchpad.dev/static/
  <Directory />
    Options MultiViews
    DirectoryIndex index
  </Directory>

  <Location />
    # Insert filter
    SetOutputFilter DEFLATE

    # Don't compress images
    SetEnvIfNoCase Request_URI \
    \.(?:gif|jpe?g|png)$ no-gzip dont-vary

    # Don't gzip anything that starts /@@/ and doesn't end .js (ie images)
    SetEnvIfNoCase Request_URI ^/@@/ no-gzip dont-vary
    SetEnvIfNoCase Request_URI ^/@@/.*\.js$ !no-gzip !dont-vary
  </Location>

</VirtualHost>

I'm not sure what we would really do for dev boxes. We could do what +icing does now, and defer to Zope when you are on the dev, and implement the same kind of behavior in Zope; or, as implied in the snippet above, we could make starting an instance stuff a symlink from the current tree's lib/canonical/launchpad/apidoc into /var/tmp/api.launchpad.dev/static/ for when you run in devel or testing mode.

The desire to continue to use ETags would be a little trickier, given load balanced servers. Apache by default incudes the inode in the etag calculati...

Read more...

Revision history for this message
Martin Pool (mbp) wrote :

I'm pretty sure you could do this using mod_rewrite in a way that does
not need per-api-version configuration.

It looks like etags are being attached to a lot of files that have
per-revision file names and that could equally well be just said to
never need revalidation. What's the desire to still use it?

Revision history for this message
Robert Collins (lifeless) wrote :

Gary, that sounds fine though I haven't looked into the precise details.

Martin, the wadl urls are static but content changes daily; etag is probably much better than a very short cache halflife for clients.

Some thoughts:
 - we need this to be robust on rollouts (as icing is)
 - I don't care if its in a new dir or not
 - we release new API versions very rarely so having a manual step there is fine (but lets capture the know-how somewhere!)
 - writing /anything/ to well known files on disk for testing is a step backwards in the parallel testing efforts (as would requiring apache during test runs) - please avoid that (e.g. by retaining the serve-from-zope facilities we have today for development environments).

Revision history for this message
Gary Poster (gary) wrote :

@Martin
Thank you for the thoughts.

"I'm pretty sure you could do this using mod_rewrite in a way that does
not need per-api-version configuration." OK, but neither does what I outlined. Is there an advantage of using mod_rewrite instead? Would it be shorter or simpler somehow?

"It looks like etags are being attached to a lot of files that have per-revision file names and that could equally well be just said to never need revalidation. What's the desire to still use it?" Yes, the only file that should need it is the "head" devel version.

The importance of wadl cacheing on devel might be minimal. The only practical usage of devel I know about other than early testing is for our JS, which doesn't need wadl. Conclusively determining that devel doesn't need efficient wadl cacheing in the client would probably require a bit of log sniffing, but if we did, that would radically simplify that requirement.

Revision history for this message
Gary Poster (gary) wrote :

@Robert
Cool, good points. Thanks!

Revision history for this message
Martin Pool (mbp) wrote :

The approach you describe sounds fine.

The issue with etags being apparently unnecessarily set is really a separate
bug, i was just curious about your comment.

Gary Poster (gary)
Changed in launchpad:
assignee: nobody → Gary Poster (gary)
Revision history for this message
Gary Poster (gary) wrote :

Because we use rsync to move code around, mthaddon believes that we should have synced mtimes across machines, which means that "FileETag MTime Size" (or even simply "FileETag MTime") should do the right thing. Yay.

My next planned steps are to see if I can hack things to see if the described configuration will work, and if so, ask the LOSAs for their thoughts before actually implementing.

Gary Poster (gary)
Changed in launchpad:
status: Triaged → In Progress
Revision history for this message
Gary Poster (gary) wrote :

I have most of this working; it looks like, with some tweaks, what I proposed will do the job.

Apparently there's a misspelled wadl type we also need to support because old launchpadlibs sent it.

application/vd.sun.wadl+xml

I've tried what appears to be a relatively elegant solution to try and have Apache change the Accept string before it tries to find the correct representation (a variant of example 4 from http://httpd.apache.org/docs/2.0/mod/mod_headers.html):

    SetEnvIf Accept \Qapplication/vd.sun.wadl+xml\E MISSPELLED_WADL_ACCEPT_HEADER
    RequestHeader set Accept "application/vnd.sun.wadl+xml" env=MISSPELLED_WADL_ACCEPT_HEADER

This does not work.

I'm inclined now to add a separate mime type for the misspelling, mapping it to an extension to "brokenwadl" (apache won't work properly with different mimetypes pointing to the same extension, in my experiments) and making an index.brokenwadl symlink. That works fine, though it is more kludgy than I wanted. I'm going to present the approach to the losas, now that I have something that works, and see what they have to say.

Revision history for this message
Gary Poster (gary) wrote :

Got approval from mthaddon of the basic approach as outlined in http://pastebin.ubuntu.com/631820/

Revision history for this message
Gary Poster (gary) wrote :

Updated/corrected version of Apache based on further local tests: http://pastebin.ubuntu.com/631952/

Most pertinent parts of this are lines 3-5 and lines 25-33.

Revision history for this message
Gary Poster (gary) wrote :

I thought I was done and was preparing an MP, but no: we do not generate the WADL on the central box before we push out. This means that the ETag scheme I had planned won't work. The obvious fix is to check in the WADL, which will be a win for building new branches, and we are already generating wadl in test runs so we can use that sunk cost to also check that the WADL that is checked in is the same as what would be generated.

Revision history for this message
Robert Collins (lifeless) wrote :

On Sat, Jun 25, 2011 at 8:39 AM, Gary Poster <email address hidden> wrote:
> I thought I was done and was preparing an MP, but no: we do not generate
> the WADL on the central box before we push out.  This means that the
> ETag scheme I had planned won't work.  The obvious fix is to check in
> the WADL, which will be a win for building new branches, and we are
> already generating wadl in test runs so we can use that sunk cost to
> also check that the WADL that is checked in is the same as what would be
> generated.

The wadl is pretty big; I'd -really- rather not check it in.

Generating the wadl on our central box is easy.

Revision history for this message
Gary Poster (gary) wrote :

On Jun 24, 2011, at 4:53 PM, Robert Collins wrote:

> On Sat, Jun 25, 2011 at 8:39 AM, Gary Poster <email address hidden> wrote:
>> I thought I was done and was preparing an MP, but no: we do not generate
>> the WADL on the central box before we push out. This means that the
>> ETag scheme I had planned won't work. The obvious fix is to check in
>> the WADL, which will be a win for building new branches, and we are
>> already generating wadl in test runs so we can use that sunk cost to
>> also check that the WADL that is checked in is the same as what would be
>> generated.
>
> The wadl is pretty big; I'd -really- rather not check it in.

Why not? I don't feel strongly about it, but I don't understand why it's a bad idea yet.

> Generating the wadl on our central box is easy.

I had hoped that might be the case, but that was not what I saw.

Right now, we only build the eggs on the central box. The way we build the wadl right now requires much more of the build to exist on the central box (sourcecode and mailman, for a start).

Other parts of the build, such as mailman integration, have caused problems when copied from the central box. In my experience, changing what is built and copied is difficult to qa properly, and has been a cause of problems in the past. Our staging/qa infrastructure has been insufficient for these changes, at least the way I and others have handled them in the past. I consider these sorts of changes to be a high risk. I want to make this change as small as possible.

We could generate the wadl on our central box and then try to remove the risky parts that we created for the wadl generation, but that's fragile and not compelling.

Options such as changing how our wadl is initially created, or how our build is done, might be valuable, but in my opinion should be well out of scope for fixing this critical bug.

Maybe you can change my mind? :-)

Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (4.3 KiB)

On Sun, Jun 26, 2011 at 7:58 AM, Gary Poster <email address hidden> wrote:
>> The wadl is pretty big; I'd -really- rather not check it in.
>
> Why not?  I don't feel strongly about it, but I don't understand why
> it's a bad idea yet.

IIRC benji landed a change to check it in last year some time; that
got backed out because of fallout ( I don't remember the fallout
specifics ).

I have a strong learnt allergy to checking in build products in
primary source trees; the sorts of things I have seen go wrong are:
 - datestamp based build system confusion [an update that pulls in a
build product touches it and makes it equal-to-or-newer-than a
dependency]
 - conflicts in build products [two developers make independent
changes, and now have to manually resolve conflicts in multiple xml
files]
 - (often a silent failure mode) not adding new build products won't
break anything until it gets all the way through the pipeline to the
intended user-of-the-checked-in-products, when it will break. In our
case the symptoms will be 404 replies on whatever wadl/json file we
added a need for but didn't check-in. This would in principle be
caught in qa...

I think those three modes are the minimum set: all other
things-have-gone-wrong can be reached my mixing and matching. (For
instance, you can have a dev environment that can't build but thinks
it can due to the first one masking a real problem).

I think introducing the potential for these problems into the system
is a bad idea, and the cost of compensating in some way included when
comparing the costs of (check in the file | build-and-deploy in a
different way).

>> Generating the wadl on our central box is easy.
>
> I had hoped that might be the case, but that was not what I saw.
>
> Right now, we only build the eggs on the central box.  The way we build
> the wadl right now requires much more of the build to exist on the
> central box (sourcecode and mailman, for a start).

The wadl doesn't depend on mailman does it ? Is it because we want to
import our mailman interfaces?

> Other parts of the build, such as mailman integration, have caused
> problems when copied from the central box.  In my experience, changing
> what is built and copied is difficult to qa properly, and has been a
> cause of problems in the past.  Our staging/qa infrastructure has been
> insufficient for these changes, at least the way I and others have
> handled them in the past.  I consider these sorts of changes to be a
> high risk.  I want to make this change as small as possible.

Its a bit of a side note, but if its high risk now, its never going to
be lower risk until we improve on it: if we become allergic to
changing things that are high risk, they stop evolving and we
accumulate more and more accommodations to those things in our
environment - making them higher risk than they were before.

> We could generate the wadl on our central box and then try to remove the
> risky parts that we created for the wadl generation, but that's fragile
> and not compelling.

Some other options that may be easier:
 - We could generate the wadl on the apache frontend; rsync between
the two servers to harmonise datestamps;
 - use a date inse...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :
Revision history for this message
Gary Poster (gary) wrote :

Thanks, Robert, that's convincing. That Apache patch sounds nice for the future.

For this work, I'll set the mtime of the generated wadl files to correspond with the last bazaar commit time for the branch. That's safe because the wadl derives from the code in the branch; relatively apt, for the same reason; and will work with the Apache we have deployed now, with a minimum of fuss.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Revision history for this message
Gary Poster (gary) wrote :

qa notes: https://launchpad.net/+apidoc/ is the expected place for the html docs, not https://api.launchpad.net/ . qastaging still works this way, before the Apache config is made. We'll need to check this again after the Apache change on qastaging. I suspect we should maybe make that address served from Apache too (or even redirect to api.launchpad.net, because that seems to be a better address to me) but that's not critical to addressing this bug.

Revision history for this message
Gary Poster (gary) wrote :

I marked this qa-ok. This is for stage one of the deployment, in which the code is ready for Apache to serve the files, but we are still serving from Zope. I suspect the second stage will be handled via RT.

tags: added: qa-ok
removed: qa-needstesting
Revision history for this message
Gary Poster (gary) wrote :

See RT 46782

William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → In Progress
Revision history for this message
Gary Poster (gary) wrote :

This will also fix critical bug 754038.

Gary Poster (gary)
Changed in launchpad:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.