Comment 3 for bug 374726

John A Meinel (jameinel) wrote:

So was the source you were using completely unpacked, which would mean the grouping was very suboptimal? Was it a recent conversion, which would mean the grouping was poor but not terrible? Etc., etc.

If this is the mysql source, a plain conversion will have 5 10k-revision packs and 6 1k-revision packs. The distribution that is actually interesting is more like 1 56k pack, 5 100-item packs, 5 10-item packs, and then some small ones. Because honestly, post-conversion we should pack, and from then on someone will either be doing a single big fetch or slowly committing new revisions.
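For what it's worth, a quick way to eyeball the grouping is just to look at the pack files on disk. A rough Python sketch (it assumes the usual .bzr/repository/packs layout, and uses pack size as a proxy for revision count, which is enough to tell "5 roughly equal packs" apart from "1 big one plus a tail"):

  import os
  import sys

  # Point this at the repository root; default to the current directory.
  repo = sys.argv[1] if len(sys.argv) > 1 else '.'
  pack_dir = os.path.join(repo, '.bzr', 'repository', 'packs')

  sizes = sorted(
      (os.path.getsize(os.path.join(pack_dir, name)), name)
      for name in os.listdir(pack_dir)
      if name.endswith('.pack'))

  for size, name in sizes:
      print('%10d KiB  %s' % (size // 1024, name))
  print('%d packs total' % len(sizes))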

I strongly expect the distribution to be skewed relative to what would happen from a simple "bzr commit 50k times".

I *am* surprised if the 'much much worse' is truly because of the BTree code, and not because of much more expensive PatienceDiff calls. Are you running on files like sql/sql_parse.cc? ISTR that disabling the knit fast path was 60x slower for that file.
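To separate the two, I would profile an annotate of one of the heavy files directly, something like the sketch below. It assumes bzr's --lsprof-file option and that you run it from inside the converted mysql tree; the exact symbol names in the profile output may differ, but patiencediff frames vs BTreeGraphIndex._get_nodes should make the split obvious.

  import os
  import subprocess
  import time

  TARGET = 'sql/sql_parse.cc'   # one of the deep-history files

  devnull = open(os.devnull, 'w')

  # Plain wall-clock timing first.
  start = time.time()
  subprocess.call(['bzr', 'annotate', TARGET], stdout=devnull)
  print('plain annotate: %.1fs' % (time.time() - start))

  # Same command under lsprof; a .txt target should give a text dump you
  # can grep for the diff code vs the index code.
  subprocess.call(['bzr', '--lsprof-file', 'annotate.txt',
                   'annotate', TARGET], stdout=devnull)
  devnull.close()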

If the problem *is* chk nodes, something is wrong, as 'bzr annotate foo' shouldn't touch the CHK nodes *at all*.

Now, there is also the factor that GC doesn't pre-fetch multiple blocks at a time. So if you have tons of small groups, rather than a properly packed grouping, I would expect a lot more index hits.

I don't quite understand:
 "Issuing a bzr pack bring that done to x3."
Brings the x20 down to x3?
And:
 "BTreeGraphIndex._get_nodes() from ~3500 to 300.000 in the 3x case"
I assume that is 300k requests down to 3.5k requests when it brings the time down from x20 to x3.

So *if* the overhead is in BTree, then it is definitely something worth spending time looking at, as I certainly thought the problem would be in the diff code, not the text extraction code.

By the way, the file that seems to have the most history is 'sql/mysqld.cc', which by my quick check has 3719 revisions, but others have just about as many (a rough sketch for reproducing this count follows the list):
[(2060, 'sp1f-sql_table.cc-1
 (2155, 'sp1f-mysqltestrun.pl-2004
 (2229, 'sp1f-manual.texi-19
 (2397, 'sp1f-ha_ndbcluster.cc-200
 (2655, 'sp1f-mysql_priv.h-1
 (2840, 'sp1f-sql_yacc.yy-19
 (2860, 'sp1f-configure.in-1
 (3355, 'sp1f-sql_select.cc-
 (3683, 'sp1f-sql_parse.cc-1
 (3719, 'sp1f-mysqld.cc-1970
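(Something along these lines should reproduce that count. It is only a sketch; it assumes the keys of repo.texts are (file_id, revision_id) pairs, which is true for the pack-based formats of this era.)

  from bzrlib.branch import Branch

  b = Branch.open('.')          # run from the converted mysql branch
  repo = b.repository
  repo.lock_read()
  try:
      counts = {}
      # Count how many text versions exist per file_id.
      for file_id, revision_id in repo.texts.keys():
          counts[file_id] = counts.get(file_id, 0) + 1
  finally:
      repo.unlock()

  # Print the ten most-edited file_ids in the same (count, file_id) form.
  for count, file_id in sorted(
          (count, file_id) for file_id, count in counts.items())[-10:]:
      print((count, file_id))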

If the solution is simply 'bzr pack' in the 'I just converted' case, I'm fine with that, as long as you don't have to pack when you have the 1x56k + 2x1000 + 5x100 + 5x10 + 5x1 case. The latter is certainly a lot harder to simulate. In the past, I think I approximated it by doing a conversion and then pulling the various branches into another repository, so you got 5.0, then 5.1, then 5.1-ndb, then 6.0, then 6.0-ndb, etc. It isn't perfect, but it at least gives you *some* distribution.
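Roughly, scripting that approach looks like the sketch below (the ../converted/* source paths and the target directory name are placeholders for wherever the converted series branches actually live):

  import subprocess

  # Build a shared repository and pull the series in one at a time, so
  # each fetch creates its own pack(s) rather than the uniform set a
  # straight conversion gives you.
  SERIES = ['5.0', '5.1', '5.1-ndb', '6.0', '6.0-ndb']   # pull order matters

  subprocess.check_call(['bzr', 'init-repo', '--no-trees', 'pack-dist-test'])
  for name in SERIES:
      subprocess.check_call(
          ['bzr', 'branch', '../converted/%s' % name,
           'pack-dist-test/%s' % name])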

You could also play some tricks with selecting revisions from 'bzr ancestry' and branching them across at appropriate times. (In general, the important thing is that real-world copy code means we are likely to have 1x50k rather than 5x10k, even though a plain conversion generates 5x10k.)