Comment 41 for bug 405251

John A Meinel (jameinel) wrote : Re: [Bug 405251] Re: Huge data transfers/bad performance OVERALL

Frits Jalvingh wrote:
>> There should be no case that we take out a write lock on the source repository in order to read data.
> I think you misunderstood; it is a push TO the target, and the target server has the lock. That is the server that the push data is being written to.
>

So the server has a lock on the repository you are pushing to. Fair enough.

>> 91% of the time is spent in "UNTIL_NO_EINTR" which is just waiting for data on a socket.
> This is what I see: the client is deadlocked waiting for the server to send data (it is in a recv).
>
>> so that is 94s or 1m34s sending data.
> That seems correct; it started to transmit 15MB of data to the server and then it blocked. Sending that data will take that time, the server is on a remote link. So the server was not packing at that time probably. (There is some huge problem here where bzr sends or receives way too much data for push/pull; the 200+ MB pull is not an exception but was seen a few times before. I get the impression that it reloads revisions it has already seen from the server time and time again.)
>
> When the client (doing the push) was blocked (after the 2min) I checked
> the server and did the strace there; that server did not do anything at
> all. I tried again using bzr+ssh protocol which actually starts a bzr
> serve remote using an ssh pipe; that one hung at exactly the same time
and doing an strace there revealed that it did completely nothing - it
> hung in a recv like the client (it did not poll or anything and did NO
> IO at all, the strace showed a single line for 10 minutes, then I
> cancelled). So this is 100% a protocol error somewhere - the server is
> *not* packing at all but blocking on recv.

Hmm... I was certainly starting to lean that way. It could be something
about 1.17 not being 100% compliant with 1.16.1. I don't know off-hand
any reason for that, but it is certainly a possibility.

It would have been nice if we could have gotten a backtrace of what both
the client and server were trying to do at the moment they were stalled.

For future reference, you can send SIGQUIT (Ctrl+\) to a running
process, and it should drop you into a (pdb) debugging shell. You can
use something like "bt" at that point to print the backtrace.

(If you are on Windows, then in the next release [1.18] you should be
able to do Ctrl+Pause, aka SIGBREAK, since SIGQUIT doesn't exist on
Windows.)
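As an illustration of the mechanism (a minimal sketch only - this is not bzrlib's actual handler, which drops into an interactive (pdb) shell rather than recording the stack):

```python
import os
import signal
import traceback

frames_seen = []

def dump_backtrace(signum, frame):
    # Capture the formatted stack of the interrupted frame; bzr's
    # real SIGQUIT handler instead enters pdb, where "bt" prints
    # this same information.
    frames_seen.append("".join(traceback.format_stack(frame)))

signal.signal(signal.SIGQUIT, dump_backtrace)

def busy_work():
    # Simulate a stalled operation; in real use you would press
    # Ctrl+\ in the terminal, or run "kill -QUIT <pid>" from
    # another shell. Here we self-signal for demonstration.
    os.kill(os.getpid(), signal.SIGQUIT)

busy_work()
```

The captured stack names the function the process was stalled in (here, busy_work), which is exactly the information that would have helped diagnose which side of the push was wedged.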

>
> Ok, back to the other stuff:
>> You mentioned that this repository:
>> 1) Used to have a lot more than ~10k files
> No, when experimenting with bzr I found out that the #of files was a huge problem under Windows. So before I created this repository (from scratch, no import or whatever) I removed all files *before* creating the initial import. It has never seen the bigger number of files.
>

k

>> 2) Had multiple independent ancestries pulled together.
> It has been "merged with history" with a single other independent branch. That branch's bzr info (when in a standalone repo, zeroes removed) says:
> In the working tree:
> 1369 unchanged
> 168 versioned subdirectories
> Branch history:
> 190 revisions
> 127 days old
> Repository:
> 202 revisions

so 1.4k files and 200 revisions. Fairly small, overall.

>
>> 3) Has about 10-20k revisions.
> I don't know if we are talking about the same thing, but bzr info -v for this repo tells me:
> Branch history:
> 1380 revisions
> 426 days old
> Repository:
> 7265 revisions

Right approx 7k revisions.

>
>> that means that you average almost 80 files changing for every commit
> Well we do not work that hard at all 8-), actually your metrics sound very much like what we do (5..10 files average) - it is nowhere close to 80. There have been only a few (2 or so, but including merges) that changed 3143 files in one commit. But those were exceptional ones (CRLF fixes and source formatting fixes).
>
> So all in all I do not understand the metrics we see here. Is there a
> way to associate those 100K text changes with their commits so that we
> can see if this is actually the case?
>

So something like "bzr log -v" would show the files we considered
changed in each commit. That command will be a bit slow with your format
(it is *much* faster in the new --2a format, which was part of the
reason for the update).

You could probably do "bzr log -v -n1 -r 1..-1" to give a rough overview.
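To turn that log output into per-commit counts, something like this rough sketch could work. The sample text below is a hand-written approximation of the "bzr log -v" layout, not real output; in practice you would feed it the result of "bzr log -v -n1 -r 1..-1 > log.txt":

```python
# Tally how many file entries each revision touches in
# "bzr log -v"-style output, to find the commits responsible
# for the bulk of the text changes.
sample = """\
------------------------------------------------------------
revno: 2
message:
  CRLF fixes
modified:
  a.txt
  b.txt
------------------------------------------------------------
revno: 1
message:
  initial import
added:
  a.txt
"""

def files_per_revision(log_text):
    counts = {}
    revno = None
    in_files = False
    for line in log_text.splitlines():
        if line.startswith("revno:"):
            revno = line.split(":", 1)[1].strip()
            counts[revno] = 0
            in_files = False
        elif line.rstrip().endswith(":") and line.rstrip(": ") in (
                "added", "removed", "renamed", "modified"):
            # Entering a section that lists changed files.
            in_files = True
        elif line.rstrip().endswith(":"):
            # Some other section header (e.g. "message:").
            in_files = False
        elif in_files and line.startswith("  "):
            counts[revno] += 1
    return counts

print(files_per_revision(sample))
```

Sorting the resulting dict by value would immediately show whether a couple of mass-change commits (like the CRLF fixes mentioned above) account for the surprising totals.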

It is almost 800,000 text changes (not 100k), which is why it is so
surprising. Given the info so far, I would have estimated maybe 70-100k
changes, with a more likely average around 30-50k. Note that bzr itself
has almost 80k changes and I certainly have never seen the performance
problems you are mentioning here.

The other really important thing to check is whether the local and
remote repositories are exactly the same format.

Run 'bzr info -v' locally on both repositories. (Unfortunately, 'bzr
info -v' against a bzr:// or bzr+ssh:// URL just reports
"RemoteRepository", the last I checked.)

If you are incurring *conversion* costs on every push and pull, then you
would be much more likely to see stuff like pushes that have to read a
lot of remote data. (it is reading the context info in order to figure
out what needs to be written to convert to the new format.)

If that is the case, and you cannot convert the formats to match, we
*do* have a patch that will likely land in the next release that should
make this a lot better:
https://code.edge.launchpad.net/~spiv/bzr/inventory-delta/+merge/9676

John
=:->