timeout submitting request - can't tell if its lplib or lp

Bug #567180 reported by Robert Collins
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
launchpadlib
New
Undecided
Unassigned

Bug Description

We're seeing the below error, and don't know quite what it indicates - its rather opaque :(. Is it an issue connecting to the API server? An issue with a server side timeout? Any guidance on this would be great; likewise, I think launchpadlib should give more information in this case.

    self.lp_mp.createComment(content=message, subject="PQM failure")
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/resource.py", line 460, in __call__
    url, in_representation, http_method, extra_headers=extra_headers)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 201, in _request
    str(url), method=method, body=data, headers=headers)
  File "/var/lib/python-support/python2.5/httplib2/__init__.py", line 1050, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/lib/python2.5/site-packages/launchpadlib/launchpad.py", line 262, in _request
    redirections, cachekey)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 108, in _request
    redirections, cachekey)
  File "/var/lib/python-support/python2.5/httplib2/__init__.py", line 854, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/lib/python2.5/site-packages/lazr/restfulclient/_browser.py", line 113, in _conn_request
    conn, request_uri, method, body, headers)
  File "/var/lib/python-support/python2.5/httplib2/__init__.py", line 823, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/lib/python2.5/httplib.py", line 866, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.5/httplib.py", line 892, in _send_request
    self.send(body)
  File "/usr/lib/python2.5/httplib.py", line 711, in send
    self.sock.sendall(str)
  File "/usr/lib/python2.5/httplib.py", line 1108, in send
    return self._ssl.write(stuff)
error: (110, 'Connection timed out')

Revision history for this message
Robert Collins (lifeless) wrote :

Another possibility is an idle connection being dropped in a harsh way, if launchpadlib has a connection pool. If so, launchpadlib should really retry this.

Revision history for this message
Robert Collins (lifeless) wrote :

This is causing us severe problems, do you have any thoughts?

Revision history for this message
James Westby (james-w) wrote : Re: [Bug 567180] [NEW] timeout submitting request - can't tell if its lplib or lp

On Tue, 20 Apr 2010 10:55:04 -0000, Robert Collins <email address hidden> wrote:
> Public bug reported:
>
> We're seeing the below error, and don't know quite what it indicates -
> its rather opaque :(. Is it an issue connecting to the API server? An
> issue with a server side timeout? Any guidance on this would be great;
> likewise, I think launchpadlib should give more information in this
> case.

You suspected that it was due to writing a large comment, have you
confirmed that?

Thanks,

James

Revision history for this message
Leonard Richardson (leonardr) wrote :

The connection might really be timing out. I noticed yesterday that edge was unusually slow. If you give Launchpad a complex task when it's slow (I don't know how complex creating a bug comment it), it might legitimately time out. You can specify a 'timeout' option when you create the Launchpad object. But the real question in this model is: why is edge so slow?

Revision history for this message
Leonard Richardson (leonardr) wrote :

For the record, here's what I get when I ran my timing script against edge just now.

Import cost: 0.19 sec
Startup cost: 32.92 sec
First fetch took 12.03 sec

That's 33 seconds to get the WADL and JSON service root representations, and 12 seconds to get a collection of fifty bugs. That's incredibly slow, and I don't know what's behind it unless our bandwidth is maxed out because of the upcoming Ubuntu release. The WADL representation, in particular, is cached in memory and just dumped out through the socket.

I don't know what the default socket timeout is--supposedly it's a Python default, but near as I can tell the Python default is None, which implies you'd _never_ get a socket timeout. Maybe there's a separate SSL timeout?

I was messing around trying to duplicate your script on staging, and I got a timeout on the web site (OOPS-1572S195). I think we've got a much more general problem than launchpadlib or the web service being slow.

Revision history for this message
Leonard Richardson (leonardr) wrote :

A couple updates:

1. Although the exact error is different, this seems very similar to bug 512552. That was a timeout error that occured when a script invoked createComment() and passed in a huge (400k) test log. The difference is that they got a server-side timeout (502 or 503) instead of a client-side timeout.

2. I added a 68k comment to a merge proposal on edge. It took about 20 minutes to process, but it did complete eventually.

How big a document are you sending in createComment?

Conclusion: regardless of how slow edge is right now, createComment() is _really_ slow on big datasets. As I mentioned in 512552, the best solution would be to publish 'merge proposal attachments' similar to bug attachments, so that these huge logs and whatnot can go into the librarian instead of the database. Until we do that, or optimize createComment in some other way, or something, you'll have this problem.

There's also a huge overhead involved in sending a large string through POST. It has to be encoded and then decoded on the server side. Doing this as a 'merge proposal attachment' would fix this because you'd be able to PUT the data without encoding it.

Since this has happened twice I strongly suggest we start working on the 'merge proposal attachment' feature.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I set up a local Launchpad instance and made a client invoke createComment with an 8 megabyte comment. The total elapsed time was 21 seconds, but the way the time was distributed was very interesting:

T Event
000 Client sends 8 megabyte POST
002 Request logged to Zope access log
002 Zope tracelog "B"
002 Zope tracelog "I", POST size identified
004 Zope tracelog "-", target URL identified
007 Zope tracelog "A", response code identified
007 Control returns to client

Then the client sends a GET for the newly created comment.

007 Zope tracelog "B"
021 Zope tracelog "A"

The POST isn't exactly fast, but the GET afterwards is twice as slow as the POST. When a POST has a return value of 201, launchpadlib automatically fetches the newly created object. I've verified that subsequent GETs to the comment URL are also very slow, and it's probably also going to be slow to get the comment collection. I haven't profiled it, but it doesn't help that that 8 megabyte string has to be encoded as JSON.

There are a couple solutions. This is yet another argument for "merge proposal attachment", since anything this huge should really be its own resource, and we've optimized file-type resources to handle the "big blob of unprocessed data" case. We could also stop launchpadlib from making that subsequent GET, and return a shim object on 201 that will fetch a representation only if you end up using the object.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Question: does the script succeed in adding a comment to its merge proposal? That would tell us whether the error is happening in the POST or in the subsequent GET. Setting httplib2.debuglevel = 1 would also help.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I just re-ran my "68k comment" test to see if the GET or the POST was to blame. This time the operation completed in 11 seconds!

I thought: "maybe it wasn't really 68k, maybe it was 680k", so I tried again with a payload ten times as big. This time I got a timeout (503 error) and OOPS-1572EC720. This was in the POST, not the GET.

I don't know what this means, just putting it out there. I certainly did _something_ earlier today that took 20 minutes but successfully completed.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 567180] Re: timeout submitting request - can't tell if its lplib or lp

I don't know how big the comments triggering this are; they could be large -
its test run output, but we are filtering out successes. 10's of kb -
likely. 100's - maybe. Mb's - unlikely. Does that debug setting using
logging to output stuff, or write to stderr or what? Will look at putting it
on, just need to tell our sysadmins what to expect to capture its info [as
this is running in the dc in a controlled environment].
Thank you very much for looking into this! I certainly would like to see
merge proposal attachments too ;)

Revision history for this message
Leonard Richardson (leonardr) wrote :

I experimented with setting a very short timeout when creating the Launchpad object. Here's the error I got.

  File "/home/leonardr/.buildout/eggs/httplib2-0.6.0-py2.6.egg/httplib2/__init__.py", line 901, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/leonardr/.buildout/eggs/lazr.restfulclient-0.9.14-py2.6.egg/lazr/restfulclient/_browser.py", line 166, in _conn_request
    conn, request_uri, method, body, headers)
  File "/home/leonardr/.buildout/eggs/httplib2-0.6.0-py2.6.egg/httplib2/__init__.py", line 871, in _conn_request
    response = conn.getresponse()
  File "/home/leonardr/bin/python/lib/python2.6/httplib.py", line 974, in getresponse
    response.begin()
  File "/home/leonardr/bin/python/lib/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/home/leonardr/bin/python/lib/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/home/leonardr/bin/python/lib/python2.6/socket.py", line 397, in readline
    data = recv(1)
  File "/home/leonardr/bin/python/lib/python2.6/ssl.py", line 96, in <lambda>
    self.recv = lambda buflen=1024, flags=0: SSLSocket.recv(self, buflen, flags)
  File "/home/leonardr/bin/python/lib/python2.6/ssl.py", line 222, in recv
    raise x
ssl.SSLError: The read operation timed out

Of course, this was during the read phase, not the write phase.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Setting httplib2.debuglevel=1 will cause information about the HTTP requests you make to be written to standard output (not error).

I forced a timeout to occur during the initial POST, and I got the same ssl.SSLError I pasted in the previous comment.

I should mention that when a timeout occurs when the server is sending data to the client, this error is written to the server-side log:

2010-04-21T15:13:37 ERROR root Exception during task
Traceback (most recent call last):
  File "/home/leonardr/canonical/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/taskthreads.py", line 48, in handlerThread
    task.service()
  File "/home/leonardr/canonical/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/serverchannelbase.py", line 207, in service
    task.service()
  File "/home/leonardr/canonical/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/http/httptask.py", line 79, in service
    self.channel.close_when_done()
  File "/home/leonardr/canonical/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/dualmodechannel.py", line 195, in close_when_done
    while self._flush_some():
  File "/home/leonardr/canonical/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/dualmodechannel.py", line 187, in _flush_some
    num_sent = self.send(chunk)
  File "/usr/lib/python2.5/asyncore.py", line 331, in send
    result = self.socket.send(data)

When a timeout occurs when the client is sending data to the server, no error message is written to the server-side log.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I investigated the possibility that the different exceptions we're seeing are caused by python 2.5 versus 2.6. Python 2.6 has a lot of new features for handling socket timeouts, so maybe the exceptions are different.

Well, I was able to get different exceptions, but they're not _that_ different. Here's a client-side socket timeout on write in python 2.5:

  File "/usr/lib/python2.5/httplib.py", line 1108, in send
    return self._ssl.write(stuff)
socket.sslerror: The write operation timed out

Here it is in Python 2.6:

  File "/usr/lib/python2.6/httplib.py", line 755, in send
    self.sock.sendall(str)
  File "/usr/lib/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib/python2.6/ssl.py", line 94, in <lambda>
    self.send = lambda data, flags=0: SSLSocket.send(self, data, flags)
  File "/usr/lib/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
ssl.SSLError: The write operation timed out

Note that the 2.5 error occurs on the same line as the one Robert originally reported. Some code underneath that is deciding to raise a generic 'error' instead of a Python socket.sslerror.

I don't know enough about sockets to distinguish between "The write operation timed out" and "Connection timed out". "Connection timed out" might be a sign that the server gave up on you.

You might run this code:

import socket
print socket.getdefaulttimeout()

That's None for me; it may have a value for you. If it's None, I don't see how you can have a client-side timeout since lp:pqm doesn't set a timeout.

At this point I'm running out of ideas. We know that createComment is prone to timeouts, and the problem we're seeing is a a timeout. The only thing that doesn't fit is that the exception looks like a much lower-level exception than I'd expect--as through the server had simply dropped the connection rather than sending a 503 error.

Revision history for this message
Gary Poster (gary) wrote :

There are at least two issues here.

1. launchpadlib should have given information on what exact request was causing the time out. I created bug 569106 for this.

2. bug 512552 (MPs should have attachments and they should be exposed in the API).

I think Leonard's research seems to show that there's not a badly-behaving connection pool on the client side.

I'll mark this as a dupe of 512552.

Revision history for this message
Frank Heimes (fheimes) wrote :

Well, I'm not sure if this is really a duplicate of 512552 ("Large POST fails with createComment on merge proposal"), because it was mentioned here - and it also happened to me - that the 'read' op times out.
So I get this from time to time: "The read operation timed out"
And it doesn't matter if I'm using an authenticated access to LP or not (anonymous).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.