Bug #380504 “Handle HTTP Error 502: Bad Gateway automatically” : Bugs : lazr.restfulclient

Revision history for this message

William Grant (wgrant) wrote on 2009-05-26:

#1

I get a few complaints from cron about my launchpadlib scripts sporadically failing every day. It has been happening for a while.

Francis J. Lacoste (flacoste) on 2009-05-26

Changed in launchpadlib:
assignee:	nobody → Leonard Richardson (leonardr)

Revision history for this message

Leonard Richardson (leonardr) wrote on 2009-05-27:

#2

I need to see information about these 502 requests. Can you (either one of you, but preferably both) put this code in your Python before starting the Launchpad client?

httplib2.debuglevel = 1

Let's see the header and body dump of the HTTP request that caused the 502, and the 502 response itself. From what I can tell from Bryce's code he was trying to make a pretty standard GET to a bug.

In particular I want to see any Via headers, because the 502 response code is generally caused by the failure of an intermediary. OOPS headers would also be nice. The intermediary can't be Squid because web service requests all include an Authentication header. It might be Apache.

If the debuglevel thing isn't working, catch the HTTPError, print its .contents, and look at the headers in its .response object.

Revision history for this message

William Grant (wgrant) wrote on 2009-05-27: Re: [Bug 380504] Re: HTTP Error 502: Bad Gateway

#3

On Wed, 2009-05-27 at 18:21 +0000, Leonard Richardson wrote:
> I need to see information about these 502 requests. Can you (either one
> of you, but preferably both) put this code in your Python before
> starting the Launchpad client?
>
> httplib2.debuglevel = 1
>
> Let's see the header and body dump of the HTTP request that caused the
> 502, and the 502 response itself. From what I can tell from Bryce's code
> he was trying to make a pretty standard GET to a bug.
>
> In particular I want to see any Via headers, because the 502 response
> code is generally caused by the failure of an intermediary. OOPS headers
> would also be nice. The intermediary can't be Squid because web service
> requests all include an Authentication header. It might be Apache.
>
> If the debuglevel thing isn't working, catch the HTTPError, print its
> .contents, and look at the headers in its .response object.

I've added that to my scripts - I'll see if any 502s show up.

--
William Grant

Revision history for this message

William Grant (wgrant) wrote on 2009-05-30: Re: HTTP Error 502: Bad Gateway

#4

After many megabytes of debug output with no failures, this appeared this morning:

send: 'GET /beta/ubuntu/karmic?build_state=%22Failed+to+build%22&ws.op=getBuildRecords&ws.start=600&ws.size=50 HTTP/1.1\r\nHost: api.edge.launchpad.net\r\nAccept-Encoding: identity\r\nte: deflate, gzip\r\nAuthorization: OAuth realm="https://api.launchpad.net", oauth_nonce="49411926", oauth_timestamp="1243642364", oauth_consumer_key="qa-ftbfs", oauth_signature_method="PLAINTEXT", oauth_version="1.0", oauth_token="7X9pG3LSFxQ79svQ138W", oauth_signature="[REDACTED]"\r\naccept: application/json\r\nuser-agent: Python-httplib2/$Rev: 259 $\r\n\r\n'
reply: 'HTTP/1.1 502 Bad Gateway\r\n'
header: Date: Sat, 30 May 2009 00:12:44 GMT
header: Server: Apache/2.2.8 (Ubuntu) mod_ssl/2.2.8 OpenSSL/0.9.8g
header: Last-Modified: Tue, 26 May 2009 19:15:11 GMT
header: ETag: "44004c-399-46ad58bb02dc0"
header: Accept-Ranges: bytes
header: Content-Length: 921
header: Content-Type: text/html

Revision history for this message

Leonard Richardson (leonardr) wrote on 2009-06-01:

#5

This is an error from Apache. Every day at midnight GMT, edge goes down for an upgrade. During that time requests will be handled by Apache alone, and it won't be able to handle them. Maybe "Bad Gateway" refers to the absence of a Zope server.

The time stamp on the error is close to midnight GMT, so everything fits that theory. Did the error go away shortly thereafter or did it persist?

Before I close this bug I really want to see that 921 bytes of HTML. Can you catch the HTTPError and print out the .contents? I hope it says something like "Launchpad is down for maintenance." That would explain this error perfectly. People get this error on the website all the time, but they see "Launchpad is down for maintance", not "HTTP Error 502: Bad Gateway", so they don't file bugs.

Revision history for this message

Bryce Harrington (bryce) wrote on 2009-06-02:

#6

Interesting. I do have my cronjob set to run 1 min after the hour, so it would presumably get run at the time when edge is down. I should be able to adjust my cronjobs to avoid this window.

However, in looking at the time for my most recent 502 error, it doesn't correspond to midnight GMT:

Date: Mon, 01 Jun 2009 21:03:45 -0700
From: Cron Daemon <email address hidden>
To: <email address hidden>
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-symptoms-intel

Traceback (most recent call last):
  File "/home/bryce/src/Arsenal/arsenal/scripts//bugs-symptoms.py", line 51, in <module>
    bug = ArsenalBug(bugtask.bug, arsenal.launchpad)
  File "/home/bryce/src/Arsenal/arsenal/scripts/arsenal_lib.py", line 137, in __init__
    self.owner_firstname = self.owner.name.split(' ')[0]
  File "/usr/lib/python2.6/dist-packages/launchpadlib/resource.py", line 457, in __getattr__
    return super(Entry, self).__getattr__(name)
  File "/usr/lib/python2.6/dist-packages/launchpadlib/resource.py", line 295, in __getattr__
    return self.lp_get_parameter(attr)
  File "/usr/lib/python2.6/dist-packages/launchpadlib/resource.py", line 186, in lp_get_parameter
    self._ensure_representation()
  File "/usr/lib/python2.6/dist-packages/launchpadlib/resource.py", line 316, in _ensure_representation
    representation = self._root._browser.get(self._wadl_resource)
  File "/usr/lib/python2.6/dist-packages/launchpadlib/_browser.py", line 222, in get
    response, content = self._request(url, extra_headers=headers)
  File "/usr/lib/python2.6/dist-packages/launchpadlib/_browser.py", line 212, in _request
    raise HTTPError(response, content)
launchpadlib.errors.HTTPError: HTTP Error 502: Bad Gateway
Error: (1) Could not create symptoms_intel.json

Revision history for this message

Leonard Richardson (leonardr) wrote on 2009-06-03:

#7

I'd really like to see the .content of one of those exception objects (It's .content, not .contents as I said earlier).

No matter when you get this error, it's probably because the Zope server is down, and it's almost certain that web browser requests will get the same error. Getting a look at the .content could confirm this.

Revision history for this message

Bryce Harrington (bryce) wrote on 2009-06-04: Re: [Bug 380504] Re: HTTP Error 502: Bad Gateway

#8

FWIW, Launchpad is down for scheduled maintenance right now, and my
scripts get a 503 error, not 502:

launchpadlib.errors.HTTPError: HTTP Error 503: Service Unavailable

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2009-06-05: Re: HTTP Error 502: Bad Gateway

#9

Right, 503 is the code raised when doing maintenance, because the LOSA set ups apache to return that. The daily edge update is automatic and I'm pretty sure they don't reconfigure apache for the "down for maintenance" page.

The reason is that the automatic is done one app server at a time, so should in theory be transparent to user. Like this bug shows, it doesn't seem to be so.

I think apache returns 502 when the app server closes the connection abrutly or other similar conditions. I'll check with the LOSAs.

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2009-06-05:

#10

After investigation with the sys admins, these errors do correlates to either the daily edge update or to other incident where the app server was restarted manually. We need to revise our procedure to first disable the app server in the load balancer, wait a little for pending requests to be processed, and then restarting the app server. (And then re-enable it in the load balancer.)

Changed in launchpadlib:
assignee:	Leonard Richardson (leonardr) → Tom Haddon (mthaddon)
importance:	Undecided → High

Revision history for this message

Steve McInerney (spm) wrote on 2009-06-05:

#11

Rather than doing a major system architecture fiddle on every logrotation/app restart, is it possible to fix bug https://bugs.edge.launchpad.net/launchpad-foundations/+bug/287304 such that the problem naturally goes away?

Such that is no longer a need to flip app servers in/out of a load balancer at all?
Manually adjusting the LB seems like a major violation of the KISS principle... :-/

We're also looking at some options to reduce the interruption on edge rollouts.

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2009-06-06:

#12

Unfortunately, solving that bug wouldn't help unless the load balancer can automatically detect that an app server is unavailable and stop sending connections to it. If that's the case, then right we should probably handle a signal telling the app server to:

1) stop accepting new connections
2) wait for existing ones to complete
3) restart

If the load balancer can detect 1) and 3) and take automatically in and out of the rotation the app server, then we have a solution. Otherwise, there is no alternative but to fiddle the load balancer config.

Revision history for this message

Steve McInerney (spm) wrote on 2009-06-07:

#13

The existing loadbalancer can detect, and does stop sending requests to hard failed app servers.

Unf it can take up to 60 secs to detect and fail; and
The app servers frequently fail soft: Dead, but still responding to HTTP requests.

Being able to rotate logs without restarting the app servers would solve a similar problem across all of lpnet and edge; where with every rotation, we will fail around 1/4 to 1/5 of all requests until the LB detects the given apps are down.
The edge updates are simply more noticable as all 4 edge servers are updated in a single block. Taking around 15 minutes from start to finish.

Revision history for this message

Bryce Harrington (bryce) wrote on 2009-08-07:

#14

Fwiw, I still have been experiencing this exception thrown daily. I'm working around it by just catching and ignoring the error, but for others it might make launchpadlib look more reliable if there were a cleaner solution.

Revision history for this message

Bryce Harrington (bryce) wrote on 2010-03-23:

#15

I'm still seeing this bug quite a bit. We're expanding use of my scripts to the rest of the desktop team, as well as the kernel team.

The workaround of lacing scripts with lots and lots of exception handling code is quite a drag, it would be really nice to get a proper solution.

It would be nice if the launchpadlib api handled the exceptions internally where possible. A couple random ideas:

* Provide an API call which simply checks if the Launchpad service is up and will continue being up for $time.

* For highly transient exceptions, have launchpadlib try one additional time before tossing an exception up to the client, perhaps with a (configurable) sleep between the retries.

Francis J. Lacoste (flacoste) on 2010-03-23

summary:

- HTTP Error 502: Bad Gateway
+ Handle HTTP Error 502: Bad Gateway automatically

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-03-24:

#16

Since we have an HTML notification when Launchpad is about to go down, we should be able to publish the same information through the web service. However, if Launchpad is actually down, the part of the web service that tells you whether Launchpad is down will itself be down.

How transient are these transient exceptions? I'm open to the idea of having launchpadlib deal with 5xx errors by retrying a configurable number of times using exponential backoff.

Gary also wanted to ask me why you are using edge (which has much more planned downtime) for what appears to be production work, but I don't see anywhere you explicitly state that you're using edge. Are you?

Revision history for this message

Bryce Harrington (bryce) wrote on 2010-03-24: Re: [Bug 380504] Re: Handle HTTP Error 502: Bad Gateway automatically

#17

On Wed, Mar 24, 2010 at 08:41:26PM -0000, Leonard Richardson wrote:
> Since we have an HTML notification when Launchpad is about to go down,
> we should be able to publish the same information through the web
> service. However, if Launchpad is actually down, the part of the web
> service that tells you whether Launchpad is down will itself be down.
>
> How transient are these transient exceptions? I'm open to the idea of
> having launchpadlib deal with 5xx errors by retrying a configurable
> number of times using exponential backoff.

I generally get them one or two times a day on scripts that are run
hourly, so say less than 5% of the time.

> Gary also wanted to ask me why you are using edge (which has much more
> planned downtime) for what appears to be production work, but I don't
> see anywhere you explicitly state that you're using edge. Are you?

No, last time I reported, it was suggested to production instead of
edge, so I updated all my scripts accordingly. But I still see the
issue at about the same level of frequency so I decided I should follow
up.

Bryce

Revision history for this message

Gary Poster (gary) wrote on 2010-03-25:

#18

On Mar 24, 2010, at 6:31 PM, Bryce Harrington wrote:
> On Wed, Mar 24, 2010 at 08:41:26PM -0000, Leonard Richardson wrote:
>>
>> How transient are these transient exceptions? I'm open to the idea of
>> having launchpadlib deal with 5xx errors by retrying a configurable
>> number of times using exponential backoff.
>
> I generally get them one or two times a day on scripts that are run
> hourly, so say less than 5% of the time.

That production would have daily problems like this is an unpleasant surprise.

When we last looked into these reports, Francis and the LOSAs found that the reports coincided with nightly edge updates, deployments and other planned downtimes.

Have you noticed any potentially helpful patterns for these? For instance, are they often at a particular time, or on particular launchpadlib calls?

I suspect that we'll want to ask you to institute httplib2 logging again, as Leonard described earlier. Leonard, could you confirm?

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-03-25:

#19

> I generally get them one or two times a day on scripts that are run
> hourly, so say less than 5% of the time.

How long do you have to wait before the 5xx error goes away? Do you just try again an hour later?

Revision history for this message

Bryce Harrington (bryce) wrote on 2010-03-25:

#20

On Thu, Mar 25, 2010 at 02:19:00PM -0000, Gary Poster wrote:
> On Mar 24, 2010, at 6:31 PM, Bryce Harrington wrote:
> > On Wed, Mar 24, 2010 at 08:41:26PM -0000, Leonard Richardson wrote:
> >>
> >> How transient are these transient exceptions? I'm open to the idea of
> >> having launchpadlib deal with 5xx errors by retrying a configurable
> >> number of times using exponential backoff.
> >
> > I generally get them one or two times a day on scripts that are run
> > hourly, so say less than 5% of the time.
>
>
> That production would have daily problems like this is an unpleasant surprise.
>
> When we last looked into these reports, Francis and the LOSAs found that
> the reports coincided with nightly edge updates, deployments and other
> planned downtimes.

Yes, when they pointed that out I noticed a pattern of time when the
failures would occur. I subsequently adjusted my scripts to not run
during that 1-2 hour period. But there were also some outliers that
didn't fit into the pattern.

Then I thought maybe the issues were due to edge being restarted and so
on, so I moved away from using edge to lpnet. It took some time to
update all my scripts to parameterize this rather than reference the
edge url directly but all scripts have been changed over now.

> Have you noticed any potentially helpful patterns for these? For
> instance, are they often at a particular time, or on particular
> launchpadlib calls?

Unfortunately no. Sometimes they happen when setting up the launchpad
connection initially, but most of the time they occur on some later call
almost arbitrarily. Typically I'll run through a loop making the same
sequence of calls for a bunch of times before the error triggers.

I have a number of scripts which run for long periods of time (due I
think to slowness in attachment handling on the launchpadlib side, which
I've reported separately). I might speculate that the longer a
launchpad connection is kept open the higher the chance it has of
triggering this error. But for all I know it's just random.

> I suspect that we'll want to ask you to institute httplib2 logging
> again, as Leonard described earlier. Leonard, could you confirm?

Can do. (Might take a few days to collect it though.)

Bryce

On Thu, Mar 25, 2010 at 02:19:00PM -0000, Gary Poster wrote:
> On Mar 24, 2010, at 6:31 PM, Bryce Harrington wrote:
> > On Wed, Mar 24, 2010 at 08:41:26PM -0000, Leonard Richardson wrote:
> >> 
> >> How transient are these transient exceptions? I'm open to the idea of
> >> having launchpadlib deal with 5xx errors by retrying a configurable
> >> number of times using exponential backoff.
> > 
> > I generally get them one or two times a day on scripts that are run
> > hourly, so say less than 5% of the time.
> 
> 
> That production would have daily problems like this is an unpleasant surprise.
> 
> When we last looked into these reports, Francis and the LOSAs found that
> the reports coincided with nightly edge updates, deployments and other
> planned downtimes.

Yes, when they pointed that out I noticed a pattern of time when the
failures would occur.  I subsequently adjusted my scripts to not run
during that 1-2 hour period.  But there were also some outliers that
didn't fit into the pattern.

Then I thought maybe the issues were due to edge being restarted and so
on, so I moved away from using edge to lpnet.  It took some time to
update all my scripts to parameterize this rather than reference the
edge url directly but all scripts have been changed over now.
 
> Have you noticed any potentially helpful patterns for these?  For
> instance, are they often at a particular time, or on particular
> launchpadlib calls?

Unfortunately no.  Sometimes they happen when setting up the launchpad
connection initially, but most of the time they occur on some later call
almost arbitrarily.  Typically I'll run through a loop making the same
sequence of calls for a bunch of times before the error triggers.

I have a number of scripts which run for long periods of time (due I
think to slowness in attachment handling on the launchpadlib side, which
I've reported separately).  I might speculate that the longer a
launchpad connection is kept open the higher the chance it has of
triggering this error.  But for all I know it's just random.

> I suspect that we'll want to ask you to institute httplib2 logging
> again, as Leonard described earlier.  Leonard, could you confirm?

Can do.  (Might take a few days to collect it though.)

Bryce

Revision history for this message

Bryce Harrington (bryce) wrote on 2010-03-25:

#21

On Thu, Mar 25, 2010 at 02:21:08PM -0000, Leonard Richardson wrote:
> > I generally get them one or two times a day on scripts that are run
> > hourly, so say less than 5% of the time.
>
> How long do you have to wait before the 5xx error goes away? Do you just
> try again an hour later?

Pretty much. I've seen a few cases where I'll get a group of 2-3 errors
across several scripts. What's funny is that they often are not the
same error. Like, I'll get an IncompleteRead error followed by one or
more 503 or 502 errors within an hour or two later. This makes me think
that while different error codes are being reported, that they are
interrelated in some fashion.

For example, here are the cron reports I've received over the past day
or so, for several different scripts cronned to run hourly:

# Crontab
20 * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-xorg-packagebugs
22 * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-upgrade
24 * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals
26 * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-milestone-bugs

Date: Wed, 24 Mar 2010 10:04:39 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

...
httplib.IncompleteRead: IncompleteRead(15687 bytes read, 40586 more expected)

Date: Wed, 24 Mar 2010 19:05:43 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-patches

Exception: code 0
reason 'HTTP Error 503: Service Unavailable
Response headers:

Date: Wed, 24 Mar 2010 21:53:45 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

httplib.IncompleteRead: IncompleteRead(57991 bytes read, 4914 more expected)

Date: Wed, 24 Mar 2010 21:58:53 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception: code 0
reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 01:16:02 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception: code 0
reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 01:16:44 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-milestone-bugs

Exception: code 0
reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 06:31:52 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception: code 0
reason 'HTTP Error 503: Service Unavailable
Response headers:

Since I get back html error pages in the emails I assume these are
errors on the launchpad side and not just network flakiness. Is that
an accurate assumption? I mean, if I'm the only person seeing these
issues maybe I should be looking for a faulty router or something?

Bryce

On Thu, Mar 25, 2010 at 02:21:08PM -0000, Leonard Richardson wrote:
> > I generally get them one or two times a day on scripts that are run
> > hourly, so say less than 5% of the time.
> 
> How long do you have to wait before the 5xx error goes away? Do you just
> try again an hour later?

Pretty much.  I've seen a few cases where I'll get a group of 2-3 errors
across several scripts.  What's funny is that they often are not the
same error.  Like, I'll get an IncompleteRead error followed by one or
more 503 or 502 errors within an hour or two later.  This makes me think
that while different error codes are being reported, that they are
interrelated in some fashion.

For example, here are the cron reports I've received over the past day
or so, for several different scripts cronned to run hourly:

# Crontab
 20  * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-xorg-packagebugs
 22  * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-upgrade
 24  * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals
 26  * * * * /home/bryce/src/Arsenal/arsenal/hooks/data-milestone-bugs

Date: Wed, 24 Mar 2010 10:04:39 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

...
httplib.IncompleteRead: IncompleteRead(15687 bytes read, 40586 more expected)

Date: Wed, 24 Mar 2010 19:05:43 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-patches

Exception:  code 0
            reason 'HTTP Error 503: Service Unavailable
Response headers:

Date: Wed, 24 Mar 2010 21:53:45 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

httplib.IncompleteRead: IncompleteRead(57991 bytes read, 4914 more expected)

Date: Wed, 24 Mar 2010 21:58:53 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception:  code 0
            reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 01:16:02 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception:  code 0
            reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 01:16:44 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-milestone-bugs

Exception:  code 0
            reason 'HTTP Error 502: Bad Gateway
Response headers:

Date: Thu, 25 Mar 2010 06:31:52 -0700
Subject: Cron <bryce@chideok> /home/bryce/src/Arsenal/arsenal/hooks/data-bugtotals

Exception:  code 0
            reason 'HTTP Error 503: Service Unavailable
Response headers:

Since I get back html error pages in the emails I assume these are
errors on the launchpad side and not just network flakiness.  Is that
an accurate assumption?  I mean, if I'm the only person seeing these
issues maybe I should be looking for a faulty router or something?

Bryce

Revision history for this message

James Westby (james-w) wrote on 2010-03-25:

#22

On Thu, 25 Mar 2010 19:46:09 -0000, Bryce Harrington <email address hidden> wrote:
> Since I get back html error pages in the emails I assume these are
> errors on the launchpad side and not just network flakiness.

The incomplete read is probably the appserver you are talking to
breaking the connection.

The 502 and 503 are likely to be served by the machines in front of the
LP code (HAProxy and Apache?) when they can't proxy your request.

> Is that
> an accurate assumption? I mean, if I'm the only person seeing these
> issues maybe I should be looking for a faulty router or something?

You are not the only person to see them.

Using the network is never going to be reliable. The libarary should
retry transient errors, and should have an API that encourages the user
to consider this and code to expect it.

Thanks,

James

Gary Poster (gary) on 2010-03-25

Changed in launchpadlib:
assignee:	Tom Haddon (mthaddon) → nobody
status:	New → Triaged
Changed in launchpad-foundations:
status:	New → Triaged
importance:	Undecided → Low

Revision history for this message

Gary Poster (gary) wrote on 2010-03-25:

#23

I agree that this needs to be addressed. Prioritization is the question. I have added this as one of the bugs for discussion on our usability effort (see bottom of https://dev.launchpad.net/Foundations/Webservice and yes I should probably start using bug tags for this list :-) ). That discussion will come after our webservice performance work, which is just starting now.

Revision history for this message

Bryce Harrington (bryce) wrote on 2010-03-25:

#24

On Thu, Mar 25, 2010 at 09:43:09PM -0000, Gary Poster wrote:
> ** Changed in: launchpad-foundations
> Importance: Undecided => Low

Yikes! Why low priority?

Revision history for this message

James Westby (james-w) wrote on 2010-03-25:

#25

On Thu, 25 Mar 2010 21:51:00 -0000, Gary Poster <email address hidden> wrote:
> I agree that this needs to be addressed. Prioritization is the
> question. I have added this as one of the bugs for discussion on our
> usability effort (see bottom of
> https://dev.launchpad.net/Foundations/Webservice and yes I should
> probably start using bug tags for this list :-) ). That discussion will
> come after our webservice performance work, which is just starting now.

Hi,

A patch for basic retry support client-side is already available. I
believe that Leonard has already discussed it with Kees.

I also have battle-tested retry code available, so I think some
improvement wouldn't take much effort at all.

If you are thinking of more extensive changes then consideration of
scheduling is understandable.

Thanks,

James

Gary Poster (gary) on 2010-03-29

Changed in launchpad-foundations:
importance:	Low → High

Revision history for this message

Gary Poster (gary) wrote on 2010-03-29:

#26

Why "Low": Because I didn't think we would work on it immediately.

Why did I just change it to "High": because I think we will work on it in the next three cycles, which I find is our definition (https://dev.launchpad.net/BugTriage).

The bug with a patch from kees is https://bugs.edge.launchpad.net/lazr.restfulclient/+bug/524980 . I am marking that bug as a duplicate of this one.

Revision history for this message

Gary Poster (gary) wrote on 2010-03-29:

#27

In regards to kees' patch to lazr.restfulclient, it seems reasonable to me, though it is certainly unfortunate we need it, and we should consider keeping a bug around for the situation(s) that causes this kind of code to be necessary.

Thoughts:
- I'd probably like to have the number of retries and the seconds between retries configurable.
- Would incremental delays be of value?
- I'd like it if we logged the problem.

Gary

Revision history for this message

James Westby (james-w) wrote on 2010-03-30:

#28

On Mon, 29 Mar 2010 22:56:18 -0000, Gary Poster <email address hidden> wrote:
> In regards to kees' patch to lazr.restfulclient, it seems reasonable to
> me, though it is certainly unfortunate we need it, and we should
> consider keeping a bug around for the situation(s) that causes this kind
> of code to be necessary.

Please fix all the bugs that we can find that do this, but having the
library retry is good, even if we fix all of the known bugs. It's just
sensible design given then the unreliability of communication across the
network.

> Thoughts:
> - I'd probably like to have the number of retries and the seconds
> between retries configurable.

Yes, but there a few layers to pull these options up through to still
allow people to easily get a Launchpad object.

> - Would incremental delays be of value?

What do you mean by this?

> - I'd like it if we logged the problem.

Yep.

Thanks,

James

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-03-30:

#29

> > - Would incremental delays be of value?
> What do you mean by this?

I think this means exponential backoff.

kees' patch looks good (http://launchpadlibrarian.net/39453235/lazr.restfulclient_0.9.11-1ubuntu1.debdiff) but it assumes that the problem will go away immediately. I think exponential backoff would be smarter.

I'm also not sure how to test it. Maybe I could rig up a fake WSGI app that fails a configurable number of times.

Revision history for this message

Gary Poster (gary) wrote on 2010-03-30:

#30

Yeah, exponential backoff. I was blanking on the term, and hoped people would catch my drift.

And yeah, a fake WSGI app that fails, and maybe making a stub for time.sleep, sounds reasonable.

Revision history for this message

Gary Poster (gary) wrote on 2010-03-31:

#31

The following is in regards to the underlying problem of the frequent 502 errors, rather than the lazr.restfulclient changes we've been discussing, of which I am still in favor.

James offered yesterday to set up a dedicated launchpadlib instance to catch the errors if I could tell him what logs to capture. I had in mind the logs that Leonard had suggested in comment #2, but I wanted to verify that these would be useful, so I said I'd get back to him.

I then talked with Francis about his earlier investigation described in comment #10, and what we might be able to investigate now. He proposed a reasonable hypothesis, which we could test.

HAProxy will queue requests up to 30 seconds, and send no more than 8 at once to the appservers. Busy appservers will mean that HAProxy may drop requests. This could be the cause of the 502 errors.

Things to test for this hypothesis that come to mind:
- does a 502 typically indicate HAProxy complaining, and 503 indicate Apache complaining? Can we find logs for HAProxy and Apache that correlate errors there with errors reported in the clients?
- when clients get 502s and 503s, are they typically accompanied with request times of >30 seconds?

Therefore, for client-side logs for this problem, I'd be inclined to get logs of what Leonard requests (which is what wgrant shows in Comment #4, I think) along with logs of how long the request took (which I do not see in that example output). Is that reasonable, and reasonably easy to gather?

Additionally and separately, we should investigate HAProxy logs and see what we can discover.

Revision history for this message

James Westby (james-w) wrote on 2010-03-31:

#32

On Wed, 31 Mar 2010 15:42:51 -0000, Gary Poster <email address hidden> wrote:
> The following is in regards to the underlying problem of the frequent
> 502 errors, rather than the lazr.restfulclient changes we've been
> discussing, of which I am still in favor.
>
> James offered yesterday to set up a dedicated launchpadlib instance to
> catch the errors if I could tell him what logs to capture. I had in
> mind the logs that Leonard had suggested in comment #2, but I wanted to
> verify that these would be useful, so I said I'd get back to him.
>
> I then talked with Francis about his earlier investigation described in
> comment #10, and what we might be able to investigate now. He proposed
> a reasonable hypothesis, which we could test.
>
> HAProxy will queue requests up to 30 seconds, and send no more than 8 at
> once to the appservers. Busy appservers will mean that HAProxy may drop
> requests. This could be the cause of the 502 errors.

Yes, I very much think that is one of the causes. I have also seen it
sending the offline.html response.

> Things to test for this hypothesis that come to mind:
> - does a 502 typically indicate HAProxy complaining, and 503 indicate Apache complaining? Can we find logs for HAProxy and Apache that correlate errors there with errors reported in the clients?
> - when clients get 502s and 503s, are they typically accompanied with request times of >30 seconds?
>
> Therefore, for client-side logs for this problem, I'd be inclined to get
> logs of what Leonard requests (which is what wgrant shows in Comment #4,
> I think) along with logs of how long the request took (which I do not
> see in that example output). Is that reasonable, and reasonably easy to
> gather?

Yes, pretty simple.

Do we also want to know which server we are talking to as well? That
could be trickier to get hold of. We could just point at an IP address
instead of a hostname to narrow this down if the arrangement is such
that this would help.

Thanks,

James

On Wed, 31 Mar 2010 15:42:51 -0000, Gary Poster <gary.poster@canonical.com> wrote:
> The following is in regards to the underlying problem of the frequent
> 502 errors, rather than the lazr.restfulclient changes we've been
> discussing, of which I am still in favor.
> 
> James offered yesterday to set up a dedicated launchpadlib instance to
> catch the errors if I could tell him what logs to capture.  I had in
> mind the logs that Leonard had suggested in comment #2, but I wanted to
> verify that these would be useful, so I said I'd get back to him.
> 
> I then talked with Francis about his earlier investigation described in
> comment #10, and what we might be able to investigate now.  He proposed
> a reasonable hypothesis, which we could test.
> 
> HAProxy will queue requests up to 30 seconds, and send no more than 8 at
> once to the appservers.  Busy appservers will mean that HAProxy may drop
> requests.  This could be the cause of the 502 errors.

Yes, I very much think that is one of the causes. I have also seen it
sending the offline.html response.

> Things to test for this hypothesis that come to mind:
> - does a 502 typically indicate HAProxy complaining, and 503 indicate Apache complaining?  Can we find logs for HAProxy and Apache that correlate errors there with errors reported in the clients?
> - when clients get 502s and 503s, are they typically accompanied with request times of >30 seconds?
> 
> Therefore, for client-side logs for this problem, I'd be inclined to get
> logs of what Leonard requests (which is what wgrant shows in Comment #4,
> I think) along with logs of how long the request took (which I do not
> see in that example output).  Is that reasonable, and reasonably easy to
> gather?

Yes, pretty simple.

Do we also want to know which server we are talking to as well? That
could be trickier to get hold of. We could just point at an IP address
instead of a hostname to narrow this down if the arrangement is such
that this would help.

Thanks,

James

Revision history for this message

Gary Poster (gary) wrote on 2010-03-31:

#33

Note: Analyzing the HAProxy logs is blocked by Canonical RT #38406.

Per the discussion James and I had with mbarnett, we don't have an easy way of identifying servers. We're hopeful that we'll be able to correlate with logs without too much trouble. We can verify once we have access to the logs.

Revision history for this message

Stuart Bishop (stub) wrote on 2010-04-01:

#34

You should be able to tell what is generating a particular error page
by looking at the contents of the document and maybe headers - a
Launchpad error page contains different text to an Apache error page
contains different text to a HAProxy error page contains different
text to a Squid error page. We could log this is a debug mode, or
perhaps even extend launchpadlib to be able to report where the error
came from since it is designed to work against our web service.

We should be graphing errors from HAProxy and send nagios alerts - if
the appservers are not able to provide enough throughput we need to
know about it.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message

Robert Collins (lifeless) wrote on 2010-04-21:

#35

I've just me-too'd this; we're starting to use launchpadlib in our CI environment - this is running inside the datacentre, no 'internets' to affect things, and its breaking several times a day.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-04-22:

#36

According to jml, lp:udd and lp:ubuntu-qa-tools include their own code to work around this problem.

Revision history for this message

Gary Poster (gary) wrote on 2010-04-22:

#37

We're targeting Leonard to put in the client-side work-around to this (see the discussion of kees' patch) next week.

Changed in launchpadlib:
assignee:	nobody → Leonard Richardson (leonardr)

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-04-22:

#38

Looking into this. The algorithm called "exponential backoff" was designed with a random element to avoid collisions, which isn't really the point here--when we 502 it's not because we got a lot of simultaneous hits. But I do like the idea of waiting for 2 seconds, 4 seconds, etc. and timing out for real after 16 or 32 seconds, and that's what I mean by "exponential backoff" here.

However, if we had real data about how long these 502 episodes last, 1/4 of that time would be a better starting point than 2 seconds. We could time out after waiting a total of 2x the length of a typical episode.

Revision history for this message

Kees Cook (kees) wrote on 2010-04-22:

#39

In my experience, instantly retrying works most of the time. So,
perhaps use back-off times of 0, 1, 2, 4, 8, 16, 32, *fail*

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-05-03:

#40

lazr.restfulclient 0.9.16 includes retry code based on Kees's branch.

affects:

launchpadlib → lazr.restfulclient

Revision history for this message

Gary Poster (gary) wrote on 2010-05-07:

#41

The lazr.restfulclient part of this bug is now Fix Released. LOSAs now have access to provide me filtered data on HAProxy logs, so we should investigate the other half of this soon.

Changed in lazr.restfulclient:
status:	Triaged → Fix Released

Tom Haddon (mthaddon) on 2010-05-28

tags:

added: canonical-losa-lp

Revision history for this message

Robert Collins (lifeless) wrote on 2010-09-16:

#42

So, this bug has gotten pretty big.

I'm going to split it into N:
- lazr restful to be a bit more resilient.
- appservers & deployment to not interrupt live requests.

Revision history for this message

James Westby (james-w) wrote on 2010-09-16:

#43

Hi,

I have some source data which you might be able to cross-reference with the logs.

All requests will originate from jubany.canonical.com (and I believe don't go through
the proxy), and are via launchpadlib, so will use whatever user-agent that sends these
days.

There will have been a lot of requests over this time, as it is ~20 minutes, and each fails
after doing a number of retries, so I don't think this will be a needle in a haystack.

The first I have recorded is:

Failed at 2010-09-15 22:04:59.142456

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Please try again</title>
    <style type="text/css" media="screen, print">
      @import url(https://launchpad.net/+icing/rev5/combo.css);
    </style>
  </head>
  <body>
    <div id="topline"></div>
    <div id="locationbar"></div>
    <div class="offline">

        <h1>Please try again</h1>
        <p>
          Sorry, there was a problem connecting to the Launchpad server.
        </p>
        <p>
          Try reloading this page in a minute or two.
          If the problem persists, let us know in
          <a href="irc://chat.freenode.net/#launchpad"
          >the #launchpad IRC channel on Freenode</a>.
        </p>
        <p>Thanks for your patience.</p>

</div>
</body>
</html>

The last is

Failed at 2010-09-15 22:27:44.672453

Traceback (most recent call last):
  File "/srv/package-import.canonical.com/new/scripts/import_package.py", line 172, in <module>
    lp = icommon.get_lp(options.lp_cache)
  File "/srv/package-import.canonical.com/new/scripts/icommon.py", line 96, in get_lp
    lp = Launchpad(creds, service_root=SERVICE_ROOT, cache=cache)
  File "/usr/lib/python2.5/site-packages/launchpadlib/launchpad.py", line 95, in __init__
    credentials, service_root, cache, timeout, proxy_info)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/resource.py", line 379, in __init__
    self._wadl = self._browser.get_wadl_application(self._root_uri)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 222, in get_wadl_application
    url, media_type='application/vd.sun.wadl+xml')
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 204, in _request
    raise HTTPError(response, content)
lazr.restfulclient.errors.HTTPError: HTTP Error 502: Bad Gateway

and there were 31 jobs that failed in the middle, so we are talking at least 90 failed requests.

Times are UTC.

Thanks,

James

Hi,

I have some source data which you might be able to cross-reference with the logs.

All requests will originate from jubany.canonical.com (and I believe don't go through
the proxy), and are via launchpadlib, so will use whatever user-agent that sends these
days.

There will have been a lot of requests over this time, as it is ~20 minutes, and each fails
after doing a number of retries, so I don't think this will be a needle in a haystack.

The first I have recorded is:

Failed at 2010-09-15 22:04:59.142456

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Please try again</title>
    <style type="text/css" media="screen, print">
      @import url(https://launchpad.net/+icing/rev5/combo.css);
    </style>
  </head>
  <body>
    <div id="topline"></div>
    <div id="locationbar"></div>
    <div class="offline">

<h1>Please try again</h1>
        <p>
          Sorry, there was a problem connecting to the Launchpad server.
        </p>
        <p>
          Try reloading this page in a minute or two.
          If the problem persists, let us know in
          <a href="irc://chat.freenode.net/#launchpad"
          >the #launchpad IRC channel on Freenode</a>.
        </p>
        <p>Thanks for your patience.</p>

</div>
  </body>
</html>

The last is

Failed at 2010-09-15 22:27:44.672453

Traceback (most recent call last):
  File "/srv/package-import.canonical.com/new/scripts/import_package.py", line 172, in <module>
    lp = icommon.get_lp(options.lp_cache)
  File "/srv/package-import.canonical.com/new/scripts/icommon.py", line 96, in get_lp
    lp = Launchpad(creds, service_root=SERVICE_ROOT, cache=cache)
  File "/usr/lib/python2.5/site-packages/launchpadlib/launchpad.py", line 95, in __init__
    credentials, service_root, cache, timeout, proxy_info)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/resource.py", line 379, in __init__
    self._wadl = self._browser.get_wadl_application(self._root_uri)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 222, in get_wadl_application
    url, media_type='application/vd.sun.wadl+xml')
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 204, in _request
    raise HTTPError(response, content)
lazr.restfulclient.errors.HTTPError: HTTP Error 502: Bad Gateway

and there were 31 jobs that failed in the middle, so we are talking at least 90 failed requests.

Times are UTC.

Thanks,

James

Robert Collins (lifeless) on 2010-09-16

Changed in launchpad-foundations:
status:	Triaged → Invalid

Revision history for this message

Robert Collins (lifeless) wrote on 2010-09-16:

#44

@James see bug 640067

There was a rollout during that time, I strongly suspect you had many
in-progress requests terminated abruptly, which will generate 502s.

-Rob

Revision history for this message

Gary Poster (gary) wrote on 2011-01-12:

#45

Leonard asked me to review this bug, because of my comment in #41 that this was only half done.

the other half I meant is to investigate flacoste's hypothesis I described in comment #31--haproxy. Investigating was blocked (#33) but then was unblocked (#40).

I don't believe that Robert's dismissal (comment #44) is accurate: the bug describes daily problems, which could be explained by the HAproxy hypothesis, but not by rollouts. That said, the bug lasted a long time, and covered too much ground. Robert's reply is pertinent to comment #43, just not to the whole bug.

Per Robert's comment #42, I think this bug could be regarded to be fix released for the lazr.restfulclient change. We probably want to open a Launchpad bug about the daily 502s described, and investigate the HAProxy hypothesis. I'll ask Robert to confirm or deny; I may be misunderstanding his position.

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-12:

#46

HI Gary, yes I was talking only about comment 43 - I was replying to James. I filed a separate bug about the request interruption stuff (I forget its number offhand)

Revision history for this message

Gary Poster (gary) wrote on 2011-01-12:

#47

I filed bug 702134 for part of these symptoms. Robert had already filed bug 640065, which is also related.

lazr.restfulclient

Handle HTTP Error 502: Bad Gateway automatically

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Invalid	High	Unassigned
	lazr.restfulclient	Fix Released	High	Leonard Richardson