Merge lp:~jameinel/bzr/2.4-too-much-walking-388269 into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 6099
Proposed branch: lp:~jameinel/bzr/2.4-too-much-walking-388269
Merge into: lp:bzr
Diff against target: 345 lines (+244/-26)
5 files modified
bzrlib/graph.py (+158/-3)
bzrlib/remote.py (+12/-20)
bzrlib/tests/test_graph.py (+71/-0)
bzrlib/tests/test_remote.py (+2/-2)
bzrlib/vf_repository.py (+1/-1)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-too-much-walking-388269
Reviewer: John A Meinel (review status: Approve)
Review via email: mp+71852@code.launchpad.net

Commit message

Bug #388269, when walking to find revisions to transmit, only send a local part of the search graph, not the whole graph walked for every step.

Description of the change

This was written against bzr-2.4 code, but it involves a change to how the client makes queries, so I'm proposing it for trunk first. The nice thing is that it is a client-only change, so it doesn't require updating the server. (And I've been testing it against production Launchpad.)

The overall idea is that during '_walk_to_common_revisions' we call get_parent_map a bunch of times. To reduce the round-trips, we send the tip revisions we are currently at, and the server then walks extra ancestors and returns more information than just the direct parents. However, because our graphs are not sorted, we can easily cover the same location multiple times. For example:

A
|
B
|
C
|
D
|\
E F
| |
G |
| |
H |
| |
I |
|/
J

If we start walking at J, we'll get to D relatively quickly via F; however, we'll get to D *again* via the E-I chain after a while.

A naive implementation would then ask for the parents of E and A, and we'd get back all of B-D again.
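
For concreteness, here is the graph above expressed as the {child: (parents,)} dict that get_parent_map deals in (purely illustrative; the letters stand in for revision ids):

    parent_map = {
        'B': ('A',), 'C': ('B',), 'D': ('C',),
        'E': ('D',), 'F': ('D',),
        'G': ('E',), 'H': ('G',), 'I': ('H',),
        'J': ('I', 'F'),
    }
    # Walking from J: D is first reached at step 2 (via F), and then
    # reached *again* at step 5 (via I-H-G-E).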

So what we do is send a SearchRecipe description that is meant to describe all of the revisions we've already seen. (Since we're walking the server's history from the client, the server can walk it all on its own.)
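
Concretely, the recipe boils down to the ('manual', start_keys, stop_keys, key_count) tuple you can see being built in remote.py in the diff below. Illustrative values for the example graph, after walking from J down to D:

    # "Start at start_keys, walk the ancestry, stop at stop_keys; that
    # covers exactly key_count revisions."  Here that is J, I, F, H, G,
    # E and D, stopping before C.
    recipe = ('manual', set(['J']), set(['C']), 7)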

If our RPCs weren't stateless, the server could have already been holding the search on its side.

However, we try to keep RPCs stateless, so we have to send the state back to the server, which then faithfully walks it to ensure it knows which keys it doesn't need to send again.

The failure mode is that when the search gets *big* (like gcc-linaro's 100k revisions), it takes the server a *long* time to walk all of that history to rebuild its state (something like 10s for each get_parent_map call).

So this patch changes the client. Instead of sending a complete SearchRecipe, it starts at the current tips, walks towards children for DEPTH steps, and then builds a search recipe from those heads.

The idea is that, in the above case, we want to make sure we send a search query that includes D. (So if we are currently at the tips A & E, we want to walk back to at least D/F.)
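
In terms of the helpers this branch adds to bzrlib/graph.py, the client-side logic is roughly the following (a sketch against the illustrative parent_map above; the real call site is in remote.py in the diff below, which passes the keys currently being queried as tip_keys):

    from bzrlib import graph

    # Walk from the current tips towards children for up to `depth`
    # steps, find heads there, then run a local search from those heads
    # to build an exact (start, stop, count) recipe.
    start_keys, stop_keys, key_count = (
        graph.limited_search_result_from_parent_map(
            parent_map, missing_keys=set(),
            tip_keys=['A', 'E'], depth=100))
    recipe = ('manual', start_keys, stop_keys, key_count)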

I played around with a lot of possible settings for DEPTH: 1, 10, 20, 100, 1000, 10000, and infinite. First, a graph showing why we want this at all:
http://tinypic.com/r/2hdyyqe/7

So without this, it takes ~5 minutes to walk all of gcc-linaro's history; with any reasonable DEPTH it is more like 70s, almost 5x faster (seen in the blue line).

The other factor to consider, though, is bandwidth cost. When we don't eliminate a sub-graph, the server will re-send data that it has already sent to us. (Seen in the orange line.)

You can see that there is a slight decrease in the data sent as DEPTH grows, except that when sending the complete search data there is a very large jump.

So part of the tradeoff is time spent by the server walking the extra history vs. the extra bandwidth of re-sending redundant data. You can also see that for gcc-linaro, the time has already started growing significantly at DEPTH=1000. But what about other projects?

I used lp:bzr, lp:mysql-server and lp:gcc-linaro as my test cases: mysql is a large but very 'bushy' project, bzr is a more modestly sized project, and gcc-linaro is the "very long history converted from svn/cvs" sort of project.
http://tinypic.com/r/2l8ukvb/7

I fixed the axis rather than letting the gcc-linaro blowout shrink everything else.

What is very interesting is that both bzr and mysql are pretty insensitive to the DEPTH value, which is probably why we never bothered fixing this before. My guess is that this is because of these projects' habit of merging across major versions. The tip revision of bzr-2.1 is *way* farther back in ancestry than the tip of bzr-2.4, but you can get there pretty quickly because of the 2.1=>2.2=>2.3=>2.4 merges. As such, walking towards children takes only about 5 steps before you've described the full history of everything that changed between bzr-2.4 and bzr-2.1.

The next check is how much extra data is being transferred. At DEPTH=1, bzr and mysql transfer 10%-15% extra data; at DEPTH=10 this drops to 1.2%-3%, and at DEPTH=100 we are at 0.3%-1.3%. gcc-linaro is a different story: at every DEPTH from 1-10000, it transfers almost 40% extra data. I'm guessing there is one query late in the game that covers a very large span. If you normalize against the DEPTH=10000 case, then it is 1.8% extra at DEPTH=1, down to 1% at DEPTH=100.

So overall, I'm happy with setting DEPTH=100. We could make it configurable, but I don't think we'd see much benefit from that, and any gain would probably be negated by a round-trip to read the configuration.

Revision history for this message
Martin Pool (mbp) wrote :

Thanks for the clear explanation and quite cool description of it.

> We could make it configurable, but I don't think we'd see much benefit of that, and probably negated by a round-trip to read the configuration.

Well, we could say it comes only from the global configuration, which would avoid any network round-trip time and still give the option to experiment with it later.

> +def ignore():

I think that could definitely do with a docstring because the name and the code are not obvious.

Actually it seems to be dead entirely...?

> +def _run_search(parent_map, heads, exclude_keys):

This too could do with a docstring (and it is actually called).

Revision history for this message
John A Meinel (jameinel) wrote :

On 8/17/2011 2:45 PM, Martin Pool wrote:
> Thanks for the clear explanation and quite cool description of it.
>
>> We could make it configurable, but I don't think we'd see much
>> benefit of that, and probably negated by a round-trip to read the
>> configuration.
>
> Well, we could say it comes from only the global configuration, which
> would avoid any network roundtrip time and still give the option to
> experiment with it later.
>
>> +def ignore():

Yeah, this is cruft. I was moving code around and wanted to hold on to
it a bit, but it should just be removed.

>
> I think that could definitely do with a docstring because the name
> and the code are not obvious.
>
> Actually it seems to be dead entirely...?
>
> +def _run_search(parent_map, heads, exclude_keys):
>
> This too could do with a docstring (and it is actually called.)

Yeah, I added one and removed ignore(), thanks for catching it.

I also re-looked at the diff and found a bit more debugging cruft
(get_parent_map_logging), which I've cleaned up.

Having talked about it a bit on IRC, I'm thinking to punt for now on a
config item. I like the idea of 'turn it off' if it doesn't work, but
more for a stable series. For trunk, I'd rather just get feedback on how
it works for people and tweak from there.

John
=:->

Revision history for this message
John A Meinel (jameinel) wrote :

Martin said this was approved with the tweaks applied.

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/graph.py'
--- bzrlib/graph.py 2011-06-20 11:03:53 +0000
+++ bzrlib/graph.py 2011-08-24 09:03:40 +0000
@@ -61,7 +61,7 @@
     def get_parent_map(self, keys):
         """See StackedParentsProvider.get_parent_map"""
         ancestry = self.ancestry
-        return dict((k, ancestry[k]) for k in keys if k in ancestry)
+        return dict([(k, ancestry[k]) for k in keys if k in ancestry])


 class StackedParentsProvider(object):
@@ -1419,13 +1419,14 @@
         parents_of_found = set()
         # revisions may contain nodes that point to other nodes in revisions:
         # we want to filter them out.
-        self.seen.update(revisions)
+        seen = self.seen
+        seen.update(revisions)
         parent_map = self._parents_provider.get_parent_map(revisions)
         found_revisions.update(parent_map)
         for rev_id, parents in parent_map.iteritems():
             if parents is None:
                 continue
-            new_found_parents = [p for p in parents if p not in self.seen]
+            new_found_parents = [p for p in parents if p not in seen]
             if new_found_parents:
                 # Calling set.update() with an empty generator is actually
                 # rather expensive.
@@ -1891,6 +1892,160 @@
                                      limit=self.limit)


+def invert_parent_map(parent_map):
+    """Given a map from child => parents, create a map of parent=>children"""
+    child_map = {}
+    for child, parents in parent_map.iteritems():
+        for p in parents:
+            # Any given parent is likely to have only a small handful
+            # of children, many will have only one. So we avoid mem overhead of
+            # a list, in exchange for extra copying of tuples
+            if p not in child_map:
+                child_map[p] = (child,)
+            else:
+                child_map[p] = child_map[p] + (child,)
+    return child_map
+
+
+def _find_possible_heads(parent_map, tip_keys, depth):
+    """Walk backwards (towards children) through the parent_map.
+
+    This finds 'heads' that will hopefully succinctly describe our search
+    graph.
+    """
+    child_map = invert_parent_map(parent_map)
+    heads = set()
+    current_roots = tip_keys
+    walked = set(current_roots)
+    while current_roots and depth > 0:
+        depth -= 1
+        children = set()
+        children_update = children.update
+        for p in current_roots:
+            # Is it better to pre- or post- filter the children?
+            try:
+                children_update(child_map[p])
+            except KeyError:
+                heads.add(p)
+        # If we've seen a key before, we don't want to walk it again. Note that
+        # 'children' stays relatively small while 'walked' grows large. So
+        # don't use 'difference_update' here which has to walk all of 'walked'.
+        # '.difference' is smart enough to walk only children and compare it to
+        # walked.
+        children = children.difference(walked)
+        walked.update(children)
+        current_roots = children
+    if current_roots:
+        # We walked to the end of depth, so these are the new tips.
+        heads.update(current_roots)
+    return heads
+
+
+def _run_search(parent_map, heads, exclude_keys):
+    """Given a parent map, run a _BreadthFirstSearcher on it.
+
+    Start at heads, walk until you hit exclude_keys. As a further improvement,
+    watch for any heads that you encounter while walking, which means they were
+    not heads of the search.
+
+    This is mostly used to generate a succinct recipe for how to walk through
+    most of parent_map.
+
+    :return: (_BreadthFirstSearcher, set(heads_encountered_by_walking))
+    """
+    g = Graph(DictParentsProvider(parent_map))
+    s = g._make_breadth_first_searcher(heads)
+    found_heads = set()
+    while True:
+        try:
+            next_revs = s.next()
+        except StopIteration:
+            break
+        for parents in s._current_parents.itervalues():
+            f_heads = heads.intersection(parents)
+            if f_heads:
+                found_heads.update(f_heads)
+        stop_keys = exclude_keys.intersection(next_revs)
+        if stop_keys:
+            s.stop_searching_any(stop_keys)
+    for parents in s._current_parents.itervalues():
+        f_heads = heads.intersection(parents)
+        if f_heads:
+            found_heads.update(f_heads)
+    return s, found_heads
+
+
+def limited_search_result_from_parent_map(parent_map, missing_keys, tip_keys,
+                                          depth):
+    """Transform a parent_map that is searching 'tip_keys' into an
+    approximate SearchResult.
+
+    We should be able to generate a SearchResult from a given set of starting
+    keys, that covers a subset of parent_map that has the last step pointing at
+    tip_keys. This is to handle the case that really-long-searches shouldn't be
+    started from scratch on each get_parent_map request, but we *do* want to
+    filter out some of the keys that we've already seen, so we don't get
+    information that we already know about on every request.
+
+    The server will validate the search (that starting at start_keys and
+    stopping at stop_keys yields the exact key_count), so we have to be careful
+    to give an exact recipe.
+
+    Basic algorithm is:
+        1) Invert parent_map to get child_map (todo: have it cached and pass it
+           in)
+        2) Starting at tip_keys, walk towards children for 'depth' steps.
+        3) At that point, we have the 'start' keys.
+        4) Start walking parent_map from 'start' keys, counting how many keys
+           are seen, and generating stop_keys for anything that would walk
+           outside of the parent_map.
+
+    :param parent_map: A map from {child_id: (parent_ids,)}
+    :param missing_keys: parent_ids that we know are unavailable
+    :param tip_keys: the revision_ids that we are searching
+    :param depth: How far back to walk.
+    """
+    if not parent_map:
+        # No search to send, because we haven't done any searching yet.
+        return [], [], 0
+    heads = _find_possible_heads(parent_map, tip_keys, depth)
+    s, found_heads = _run_search(parent_map, heads, set(tip_keys))
+    _, start_keys, exclude_keys, key_count = s.get_result().get_recipe()
+    if found_heads:
+        # Anything in found_heads are redundant start_keys, we hit them while
+        # walking, so we can exclude them from the start list.
+        start_keys = set(start_keys).difference(found_heads)
+    return start_keys, exclude_keys, key_count
+
+
+def search_result_from_parent_map(parent_map, missing_keys):
+    """Transform a parent_map into SearchResult information."""
+    if not parent_map:
+        # parent_map is empty or None, simple search result
+        return [], [], 0
+    # start_set is all the keys in the cache
+    start_set = set(parent_map)
+    # result set is all the references to keys in the cache
+    result_parents = set()
+    for parents in parent_map.itervalues():
+        result_parents.update(parents)
+    stop_keys = result_parents.difference(start_set)
+    # We don't need to send ghosts back to the server as a position to
+    # stop either.
+    stop_keys.difference_update(missing_keys)
+    key_count = len(parent_map)
+    if (revision.NULL_REVISION in result_parents
+        and revision.NULL_REVISION in missing_keys):
+        # If we pruned NULL_REVISION from the stop_keys because it's also
+        # in our cache of "missing" keys we need to increment our key count
+        # by 1, because the reconsitituted SearchResult on the server will
+        # still consider NULL_REVISION to be an included key.
+        key_count += 1
+    included_keys = start_set.intersection(result_parents)
+    start_set.difference_update(included_keys)
+    return start_set, stop_keys, key_count
+
+
 def collapse_linear_regions(parent_map):
     """Collapse regions of the graph that are 'linear'.


=== modified file 'bzrlib/remote.py'
--- bzrlib/remote.py 2011-08-17 01:19:17 +0000
+++ bzrlib/remote.py 2011-08-24 09:03:40 +0000
@@ -48,6 +48,9 @@
 from bzrlib.trace import mutter, note, warning


+_DEFAULT_SEARCH_DEPTH = 100
+
+
 class _RpcHelper(object):
     """Mixin class that helps with issuing RPCs."""

@@ -1745,26 +1748,15 @@
         if parents_map is None:
             # Repository is not locked, so there's no cache.
             parents_map = {}
-        # start_set is all the keys in the cache
-        start_set = set(parents_map)
-        # result set is all the references to keys in the cache
-        result_parents = set()
-        for parents in parents_map.itervalues():
-            result_parents.update(parents)
-        stop_keys = result_parents.difference(start_set)
-        # We don't need to send ghosts back to the server as a position to
-        # stop either.
-        stop_keys.difference_update(self._unstacked_provider.missing_keys)
-        key_count = len(parents_map)
-        if (NULL_REVISION in result_parents
-            and NULL_REVISION in self._unstacked_provider.missing_keys):
-            # If we pruned NULL_REVISION from the stop_keys because it's also
-            # in our cache of "missing" keys we need to increment our key count
-            # by 1, because the reconsitituted SearchResult on the server will
-            # still consider NULL_REVISION to be an included key.
-            key_count += 1
-        included_keys = start_set.intersection(result_parents)
-        start_set.difference_update(included_keys)
+        if _DEFAULT_SEARCH_DEPTH <= 0:
+            (start_set, stop_keys,
+             key_count) = graph.search_result_from_parent_map(
+                parents_map, self._unstacked_provider.missing_keys)
+        else:
+            (start_set, stop_keys,
+             key_count) = graph.limited_search_result_from_parent_map(
+                parents_map, self._unstacked_provider.missing_keys,
+                keys, depth=_DEFAULT_SEARCH_DEPTH)
         recipe = ('manual', start_set, stop_keys, key_count)
         body = self._serialise_search_recipe(recipe)
         path = self.bzrdir._path_for_remote_call(self._client)

=== modified file 'bzrlib/tests/test_graph.py'
--- bzrlib/tests/test_graph.py 2011-06-20 11:03:53 +0000
+++ bzrlib/tests/test_graph.py 2011-08-24 09:03:40 +0000
@@ -1744,3 +1744,74 @@
         self.assertEqual(set([NULL_REVISION, 'tip', 'tag', 'mid']), recipe[2])
         self.assertEqual(0, recipe[3])
         self.assertTrue(result.is_empty())
+
+
+class TestSearchResultFromParentMap(TestGraphBase):
+
+    def assertSearchResult(self, start_keys, stop_keys, key_count, parent_map,
+                           missing_keys=()):
+        (start, stop, count) = _mod_graph.search_result_from_parent_map(
+            parent_map, missing_keys)
+        self.assertEqual((sorted(start_keys), sorted(stop_keys), key_count),
+                         (sorted(start), sorted(stop), count))
+
+    def test_no_parents(self):
+        self.assertSearchResult([], [], 0, {})
+        self.assertSearchResult([], [], 0, None)
+
+    def test_ancestry_1(self):
+        self.assertSearchResult(['rev4'], [NULL_REVISION], len(ancestry_1),
+                                ancestry_1)
+
+    def test_ancestry_2(self):
+        self.assertSearchResult(['rev1b', 'rev4a'], [NULL_REVISION],
+                                len(ancestry_2), ancestry_2)
+        self.assertSearchResult(['rev1b', 'rev4a'], [],
+                                len(ancestry_2)+1, ancestry_2,
+                                missing_keys=[NULL_REVISION])
+
+    def test_partial_search(self):
+        parent_map = dict((k,extended_history_shortcut[k])
+                          for k in ['e', 'f'])
+        self.assertSearchResult(['e', 'f'], ['d', 'a'], 2,
+                                parent_map)
+        parent_map.update((k,extended_history_shortcut[k])
+                          for k in ['d', 'a'])
+        self.assertSearchResult(['e', 'f'], ['c', NULL_REVISION], 4,
+                                parent_map)
+        parent_map['c'] = extended_history_shortcut['c']
+        self.assertSearchResult(['e', 'f'], ['b'], 6,
+                                parent_map, missing_keys=[NULL_REVISION])
+        parent_map['b'] = extended_history_shortcut['b']
+        self.assertSearchResult(['e', 'f'], [], 7,
+                                parent_map, missing_keys=[NULL_REVISION])
+
+
+class TestLimitedSearchResultFromParentMap(TestGraphBase):
+
+    def assertSearchResult(self, start_keys, stop_keys, key_count, parent_map,
+                           missing_keys, tip_keys, depth):
+        (start, stop, count) = _mod_graph.limited_search_result_from_parent_map(
+            parent_map, missing_keys, tip_keys, depth)
+        self.assertEqual((sorted(start_keys), sorted(stop_keys), key_count),
+                         (sorted(start), sorted(stop), count))
+
+    def test_empty_ancestry(self):
+        self.assertSearchResult([], [], 0, {}, (), ['tip-rev-id'], 10)
+
+    def test_ancestry_1(self):
+        self.assertSearchResult(['rev4'], ['rev1'], 4,
+                                ancestry_1, (), ['rev1'], 10)
+        self.assertSearchResult(['rev2a', 'rev2b'], ['rev1'], 2,
+                                ancestry_1, (), ['rev1'], 1)
+
+
+    def test_multiple_heads(self):
+        self.assertSearchResult(['e', 'f'], ['a'], 5,
+                                extended_history_shortcut, (), ['a'], 10)
+        # Note that even though we only take 1 step back, we find 'f', which
+        # means the described search will still find d and c.
+        self.assertSearchResult(['f'], ['a'], 4,
+                                extended_history_shortcut, (), ['a'], 1)
+        self.assertSearchResult(['f'], ['a'], 4,
+                                extended_history_shortcut, (), ['a'], 2)

=== modified file 'bzrlib/tests/test_remote.py'
--- bzrlib/tests/test_remote.py 2011-08-12 09:49:24 +0000
+++ bzrlib/tests/test_remote.py 2011-08-24 09:03:40 +0000
@@ -2147,8 +2147,8 @@
         parents = repo.get_parent_map([rev_id])
         self.assertEqual(
             [('call_with_body_bytes_expecting_body',
-              'Repository.get_parent_map', ('quack/', 'include-missing:',
-              rev_id), '\n\n0'),
+              'Repository.get_parent_map',
+              ('quack/', 'include-missing:', rev_id), '\n\n0'),
              ('disconnect medium',),
              ('call_expecting_body', 'Repository.get_revision_graph',
               ('quack/', ''))],

=== modified file 'bzrlib/vf_repository.py'
--- bzrlib/vf_repository.py 2011-08-02 11:18:43 +0000
+++ bzrlib/vf_repository.py 2011-08-24 09:03:40 +0000
@@ -70,7 +70,7 @@
     )

 from bzrlib.trace import (
-    mutter,
+    mutter
     )
