Comment 4 for bug 687653

Revision history for this message
Andrew Bennetts (spiv) wrote : Re: codehost error handling infinite recursion death spiral of doom

The recursion chain is pretty long, here's what it appears to be:

repr(NotBranchError) -> calls open_repository -> calls into codehosting vfs -> XMLRPC -> uses Twisted callbacks -> something breaks -> Twisted stringifies the error/traceback -> repr(NotBranchError) -> ...

It's a new instance of NotBranchError in every round of the recursion, although presumably with the same parameters.

It's hard for bzrlib to mitigate this without losing functionality (it calls open_repository there to give more helpful errors to users in the case of "bzr branch shared_repo_with_no_branch", which happens sometimes and can be confusing for new users, and it doesn't call open_repository until error formatting time to avoid paying the price of the IO unless the error is being displayed somewhere). bzrlib could do something ugly like set a global flag while in NotBranchError's _format method, and if it is already set skip the open_repository call. That would probably work, although that's what I thought about the previous mitigation added to bzrlib.

It would also paper over the other factors here: why is NotBranchError being triggered and propagated? Can we fix Twisted to avoid expensive and dangerous stringifying here?

I spent a lot of time yesterday looking over the codehosting code and I don't see an obvious way to trigger this sort of problem, but clearly there is one lurking somewhere.

We don't seem to have a reliable way to reproduce this problem. Part of what confounded us yesterday was the branches on disk were changing while we were investigating as the user worked on them. e.g. at one point there was an empty bzrdir (no repo, branch or tree), then a few minutes later it had a branch and repo.

Other facts from the pasted traceback:
 * there's a NoSuchFile for /srv/bazaar.launchpad.net/mirrors/00/06/43/cc/.bzr/repository/format in some frames. Presumably the result of the open_repository call.
 * The branch involved appears to be <lp:~goadict/tsumego20k/release-1.0>.
 * the XML-RPC call is almost certainly translatePath

Other observations:
 * some of the processes appeared to be connected to /+branch/...
 * some processes appeared to be for non-existent branches under +branch, e.g. BRANCH:%2Bbranch/tsumego20k/1.0
 * the ps listing of all lp-serve processes showed both /+branch/... and /%2Bbranch/... which seems a bit odd
 * the branches for this project are small, and several lp-serve processes were stuck on it. I think that suggests it's not a race condition with e.g. removing a branch via the web UI (or if it is a race, it's remarkably easy to hit repeatedly)

Many of these observations are likely to be red herrings :(