Merge lp:~jameinel/bzr/2.4-too-much-walking-388269 into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 6099
Proposed branch: lp:~jameinel/bzr/2.4-too-much-walking-388269
Merge into: lp:bzr
Diff against target: 345 lines (+244/-26)
5 files modified
bzrlib/graph.py (+158/-3)
bzrlib/remote.py (+12/-20)
bzrlib/tests/test_graph.py (+71/-0)
bzrlib/tests/test_remote.py (+2/-2)
bzrlib/vf_repository.py (+1/-1)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-too-much-walking-388269
Reviewer: John A Meinel (review status: Approve)
Review via email: mp+71852@code.launchpad.net

Commit message

Bug #388269, when walking to find revisions to transmit, only send a local part of the search graph, not the whole graph walked for every step.

Description of the change

This was written against bzr-2.4 code, but it involves a change to how the client makes queries, so I'm proposing it for trunk first. The nice thing is that it is a client-only change, so it doesn't require updating the server. (And I've been testing it against production Launchpad.)

The overall idea is that during '_walk_to_common_revisions' we call get_parent_map a bunch of times. To reduce the round-trips, we send the tip revisions we are currently at, and the server then walks extra ancestors and returns more information than just the direct parents. However, because our graphs are not sorted, we can easily cover the same location multiple times. For example:

A
|
B
|
C
|
D
|\
E F
| |
G |
| |
H |
| |
I |
|/
J

If we start walking at J, we'll get to D relatively quickly via F; however, we'll get to D *again* via the E-I chain after a while.

A naive implementation would then ask for the parents of E and A, and we'd get back all of B-D again.
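
For concreteness, here is the graph above expressed as the {child: (parents,)} dict that get_parent_map deals in (purely illustrative; the letters stand in for revision ids):

    parent_map = {
        'B': ('A',), 'C': ('B',), 'D': ('C',),
        'E': ('D',), 'F': ('D',),
        'G': ('E',), 'H': ('G',), 'I': ('H',),
        'J': ('I', 'F'),
    }
    # Walking from J: D is first reached at step 2 (via F), and then
    # reached *again* at step 5 (via I-H-G-E).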

So what we do is send a SearchRecipe description that is meant to describe all of the revisions we've already seen. (Since we're walking the server's history from the client, the server can walk it all on its own.)
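
Concretely, the recipe boils down to the ('manual', start_keys, stop_keys, key_count) tuple you can see being built in remote.py in the diff below. Illustrative values for the example graph, after walking from J down to D:

    # "Start at start_keys, walk the ancestry, stop at stop_keys; that
    # covers exactly key_count revisions."  Here that is J, I, F, H, G,
    # E and D, stopping before C.
    recipe = ('manual', set(['J']), set(['C']), 7)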

If our RPCs weren't stateless, the server could have already been holding the search on its side.

However, we try to keep RPCs stateless, so we have to send the state back to the server, which then faithfully walks it to ensure it knows which keys it doesn't need to send again.

The failure mode is that when the search gets *big* (like gcc-linaro's 100k revisions), it takes the server a *long* time to walk all of that history to rebuild its state (something like 10s for each get_parent_map call).

So this patch changes the client. Instead of sending a complete SearchRecipe, it starts at the current tips, walks towards children for DEPTH steps, and then builds a search recipe from those heads.

The idea is that, in the above case, we want to make sure we send a search query that includes D. (So if we are currently at the tips A & E, we want to walk back to at least D/F.)
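
In terms of the helpers this branch adds to bzrlib/graph.py, the client-side logic is roughly the following (a sketch against the illustrative parent_map above; the real call site is in remote.py in the diff below, which passes the keys currently being queried as tip_keys):

    from bzrlib import graph

    # Walk from the current tips towards children for up to `depth`
    # steps, find heads there, then run a local search from those heads
    # to build an exact (start, stop, count) recipe.
    start_keys, stop_keys, key_count = (
        graph.limited_search_result_from_parent_map(
            parent_map, missing_keys=set(),
            tip_keys=['A', 'E'], depth=100))
    recipe = ('manual', start_keys, stop_keys, key_count)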

I played around with a lot of possible settings for DEPTH: 1, 10, 20, 100, 1000, 10000, and infinite. First, a graph showing why we want this at all:
http://tinypic.com/r/2hdyyqe/7

So without this, it takes ~5 minutes to walk all of gcc-linaro's history; with any reasonable DEPTH it is more like 70s, almost 5x faster (seen in the blue line).

The other factor to consider, though, is bandwidth cost. When we don't eliminate a sub-graph, the server will re-send data that it has already sent to us. (Seen in the orange line.)

You can see that there is a slight decrease in the data sent as DEPTH grows, except that when sending the complete search data there is a very large jump.

So part of the tradeoff is time spent by the server walking the extra history vs. the extra bandwidth of re-sending redundant data. You can also see that for gcc-linaro, the time has already started growing significantly at DEPTH=1000. But what about other projects?

I used lp:bzr, lp:mysql-server and lp:gcc-linaro as my test cases: mysql is a large but very 'bushy' project, bzr is a more modestly sized project, and gcc-linaro is the "very long history converted from svn/cvs" sort of project.
http://tinypic.com/r/2l8ukvb/7

I fixed the axis rather than letting the gcc-linaro blowout shrink everything else.

What is very interesting is that both bzr and mysql are pretty insensitive to the DEPTH value, which is probably why we never bothered fixing this before. My guess is that this is because of these projects' habit of merging across major versions. The tip revision of bzr-2.1 is *way* farther back in ancestry than the tip of bzr-2.4, but you can get there pretty quickly because of the 2.1=>2.2=>2.3=>2.4 merges. As such, walking towards children takes only about 5 steps before you've described the full history of everything that changed between bzr-2.4 and bzr-2.1.

The next check is how much extra data is being transferred. At DEPTH=1, bzr and mysql transfer 10%-15% extra data; at DEPTH=10 this drops to 1.2%-3%, and at DEPTH=100 we are at 0.3%-1.3%. gcc-linaro is a different story: at every DEPTH from 1-10000, it transfers almost 40% extra data. I'm guessing there is one query late in the game that covers a very large span. If you normalize against the DEPTH=10000 case, then it is 1.8% extra at DEPTH=1, down to 1% at DEPTH=100.

So overall, I'm happy with setting DEPTH=100. We could make it configurable, but I don't think we'd see much benefit from that, and any gain would probably be negated by a round-trip to read the configuration.

Revision history for this message
Martin Pool (mbp) wrote :

Thanks for the clear explanation and quite cool description of it.

> We could make it configurable, but I don't think we'd see much benefit of that, and probably negated by a round-trip to read the configuration.

Well, we could say it comes only from the global configuration, which would avoid any network round-trip time and still give the option to experiment with it later.

> +def ignore():

I think that could definitely do with a docstring because the name and the code are not obvious.

Actually it seems to be dead entirely...?

> +def _run_search(parent_map, heads, exclude_keys):

This too could do with a docstring (and it is actually called).

Revision history for this message
John A Meinel (jameinel) wrote :

On 8/17/2011 2:45 PM, Martin Pool wrote:
> Thanks for the clear explanation and quite cool description of it.
>
>> We could make it configurable, but I don't think we'd see much
>> benefit of that, and probably negated by a round-trip to read the
>> configuration.
>
> Well, we could say it comes from only the global configuration, which
> would avoid any network roundtrip time and still give the option to
> experiment with it later.
>
>> +def ignore():

Yeah, this is cruft. I was moving code around and wanted to hold on to
it a bit, but it should just be removed.

>
> I think that could definitely do with a docstring because the name
> and the code are not obvious.
>
> Actually it seems to be dead entirely...?
>
> +def _run_search(parent_map, heads, exclude_keys):
>
> This too could do with a docstring (and it is actually called.)

Yeah, I added one and removed ignore(), thanks for catching it.

I also re-looked at the diff and found a bit more debugging cruft
(get_parent_map_logging), which I've cleaned up.

Having talked about it a bit on IRC, I'm thinking to punt for now on a
config item. I like the idea of 'turn it off' if it doesn't work, but
more for a stable series. For trunk, I'd rather just get feedback on how
it works for people and tweak from there.

John
=:->

Revision history for this message
John A Meinel (jameinel) wrote :

Martin said this was approved with the tweaks applied.

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/graph.py'
--- bzrlib/graph.py 2011-06-20 11:03:53 +0000
+++ bzrlib/graph.py 2011-08-24 09:03:40 +0000
@@ -61,7 +61,7 @@
     def get_parent_map(self, keys):
         """See StackedParentsProvider.get_parent_map"""
         ancestry = self.ancestry
-        return dict((k, ancestry[k]) for k in keys if k in ancestry)
+        return dict([(k, ancestry[k]) for k in keys if k in ancestry])


 class StackedParentsProvider(object):
@@ -1419,13 +1419,14 @@
         parents_of_found = set()
         # revisions may contain nodes that point to other nodes in revisions:
         # we want to filter them out.
-        self.seen.update(revisions)
+        seen = self.seen
+        seen.update(revisions)
         parent_map = self._parents_provider.get_parent_map(revisions)
         found_revisions.update(parent_map)
         for rev_id, parents in parent_map.iteritems():
             if parents is None:
                 continue
-            new_found_parents = [p for p in parents if p not in self.seen]
+            new_found_parents = [p for p in parents if p not in seen]
             if new_found_parents:
                 # Calling set.update() with an empty generator is actually
                 # rather expensive.
@@ -1891,6 +1892,160 @@
                                      limit=self.limit)


+def invert_parent_map(parent_map):
+    """Given a map from child => parents, create a map of parent=>children"""
+    child_map = {}
+    for child, parents in parent_map.iteritems():
+        for p in parents:
+            # Any given parent is likely to have only a small handful
+            # of children, many will have only one. So we avoid mem overhead of
+            # a list, in exchange for extra copying of tuples
+            if p not in child_map:
+                child_map[p] = (child,)
+            else:
+                child_map[p] = child_map[p] + (child,)
+    return child_map
+
+
+def _find_possible_heads(parent_map, tip_keys, depth):
+    """Walk backwards (towards children) through the parent_map.
+
+    This finds 'heads' that will hopefully succinctly describe our search
+    graph.
+    """
+    child_map = invert_parent_map(parent_map)
+    heads = set()
+    current_roots = tip_keys
+    walked = set(current_roots)
+    while current_roots and depth > 0:
+        depth -= 1
+        children = set()
+        children_update = children.update
+        for p in current_roots:
+            # Is it better to pre- or post- filter the children?
+            try:
+                children_update(child_map[p])
+            except KeyError:
+                heads.add(p)
+        # If we've seen a key before, we don't want to walk it again. Note that
+        # 'children' stays relatively small while 'walked' grows large. So
+        # don't use 'difference_update' here which has to walk all of 'walked'.
+        # '.difference' is smart enough to walk only children and compare it to
+        # walked.
+        children = children.difference(walked)
+        walked.update(children)
+        current_roots = children
+    if current_roots:
+        # We walked to the end of depth, so these are the new tips.
+        heads.update(current_roots)
+    return heads
+
+
+def _run_search(parent_map, heads, exclude_keys):
+    """Given a parent map, run a _BreadthFirstSearcher on it.
+
+    Start at heads, walk until you hit exclude_keys. As a further improvement,
+    watch for any heads that you encounter while walking, which means they were
+    not heads of the search.
+
+    This is mostly used to generate a succinct recipe for how to walk through
+    most of parent_map.
+
+    :return: (_BreadthFirstSearcher, set(heads_encountered_by_walking))
+    """
+    g = Graph(DictParentsProvider(parent_map))
+    s = g._make_breadth_first_searcher(heads)
+    found_heads = set()
+    while True:
+        try:
+            next_revs = s.next()
+        except StopIteration:
+            break
+        for parents in s._current_parents.itervalues():
+            f_heads = heads.intersection(parents)
+            if f_heads:
+                found_heads.update(f_heads)
+        stop_keys = exclude_keys.intersection(next_revs)
+        if stop_keys:
+            s.stop_searching_any(stop_keys)
+    for parents in s._current_parents.itervalues():
+        f_heads = heads.intersection(parents)
+        if f_heads:
+            found_heads.update(f_heads)
+    return s, found_heads
+
+
+def limited_search_result_from_parent_map(parent_map, missing_keys, tip_keys,
+                                          depth):
+    """Transform a parent_map that is searching 'tip_keys' into an
+    approximate SearchResult.
+
+    We should be able to generate a SearchResult from a given set of starting
+    keys, that covers a subset of parent_map that has the last step pointing at
+    tip_keys. This is to handle the case that really-long-searches shouldn't be
+    started from scratch on each get_parent_map request, but we *do* want to
+    filter out some of the keys that we've already seen, so we don't get
+    information that we already know about on every request.
+
+    The server will validate the search (that starting at start_keys and
+    stopping at stop_keys yields the exact key_count), so we have to be careful
+    to give an exact recipe.
+
+    Basic algorithm is:
+        1) Invert parent_map to get child_map (todo: have it cached and pass it
+           in)
+        2) Starting at tip_keys, walk towards children for 'depth' steps.
+        3) At that point, we have the 'start' keys.
+        4) Start walking parent_map from 'start' keys, counting how many keys
+           are seen, and generating stop_keys for anything that would walk
+           outside of the parent_map.
+
+    :param parent_map: A map from {child_id: (parent_ids,)}
+    :param missing_keys: parent_ids that we know are unavailable
+    :param tip_keys: the revision_ids that we are searching
+    :param depth: How far back to walk.
+    """
+    if not parent_map:
+        # No search to send, because we haven't done any searching yet.
+        return [], [], 0
+    heads = _find_possible_heads(parent_map, tip_keys, depth)
+    s, found_heads = _run_search(parent_map, heads, set(tip_keys))
+    _, start_keys, exclude_keys, key_count = s.get_result().get_recipe()
+    if found_heads:
+        # Anything in found_heads are redundant start_keys, we hit them while
+        # walking, so we can exclude them from the start list.
+        start_keys = set(start_keys).difference(found_heads)
+    return start_keys, exclude_keys, key_count
+
+
+def search_result_from_parent_map(parent_map, missing_keys):
+    """Transform a parent_map into SearchResult information."""
+    if not parent_map:
+        # parent_map is empty or None, simple search result
+        return [], [], 0
+    # start_set is all the keys in the cache
+    start_set = set(parent_map)
+    # result set is all the references to keys in the cache
+    result_parents = set()
+    for parents in parent_map.itervalues():
+        result_parents.update(parents)
+    stop_keys = result_parents.difference(start_set)
+    # We don't need to send ghosts back to the server as a position to
+    # stop either.
+    stop_keys.difference_update(missing_keys)
+    key_count = len(parent_map)
+    if (revision.NULL_REVISION in result_parents
+        and revision.NULL_REVISION in missing_keys):
+        # If we pruned NULL_REVISION from the stop_keys because it's also
+        # in our cache of "missing" keys we need to increment our key count
+        # by 1, because the reconsitituted SearchResult on the server will
+        # still consider NULL_REVISION to be an included key.
+        key_count += 1
+    included_keys = start_set.intersection(result_parents)
+    start_set.difference_update(included_keys)
+    return start_set, stop_keys, key_count
+
+
 def collapse_linear_regions(parent_map):
     """Collapse regions of the graph that are 'linear'.


=== modified file 'bzrlib/remote.py'
--- bzrlib/remote.py 2011-08-17 01:19:17 +0000
+++ bzrlib/remote.py 2011-08-24 09:03:40 +0000
@@ -48,6 +48,9 @@
 from bzrlib.trace import mutter, note, warning


+_DEFAULT_SEARCH_DEPTH = 100
+
+
 class _RpcHelper(object):
     """Mixin class that helps with issuing RPCs."""

@@ -1745,26 +1748,15 @@
         if parents_map is None:
             # Repository is not locked, so there's no cache.
             parents_map = {}
-        # start_set is all the keys in the cache
-        start_set = set(parents_map)
-        # result set is all the references to keys in the cache
-        result_parents = set()
-        for parents in parents_map.itervalues():
-            result_parents.update(parents)
-        stop_keys = result_parents.difference(start_set)
-        # We don't need to send ghosts back to the server as a position to
-        # stop either.
-        stop_keys.difference_update(self._unstacked_provider.missing_keys)
-        key_count = len(parents_map)
-        if (NULL_REVISION in result_parents
-            and NULL_REVISION in self._unstacked_provider.missing_keys):
-            # If we pruned NULL_REVISION from the stop_keys because it's also
-            # in our cache of "missing" keys we need to increment our key count
-            # by 1, because the reconsitituted SearchResult on the server will
-            # still consider NULL_REVISION to be an included key.
-            key_count += 1
-        included_keys = start_set.intersection(result_parents)
-        start_set.difference_update(included_keys)
+        if _DEFAULT_SEARCH_DEPTH <= 0:
+            (start_set, stop_keys,
+             key_count) = graph.search_result_from_parent_map(
+                parents_map, self._unstacked_provider.missing_keys)
+        else:
+            (start_set, stop_keys,
+             key_count) = graph.limited_search_result_from_parent_map(
+                parents_map, self._unstacked_provider.missing_keys,
+                keys, depth=_DEFAULT_SEARCH_DEPTH)
         recipe = ('manual', start_set, stop_keys, key_count)
         body = self._serialise_search_recipe(recipe)
         path = self.bzrdir._path_for_remote_call(self._client)

=== modified file 'bzrlib/tests/test_graph.py'
--- bzrlib/tests/test_graph.py 2011-06-20 11:03:53 +0000
+++ bzrlib/tests/test_graph.py 2011-08-24 09:03:40 +0000
@@ -1744,3 +1744,74 @@
         self.assertEqual(set([NULL_REVISION, 'tip', 'tag', 'mid']), recipe[2])
         self.assertEqual(0, recipe[3])
         self.assertTrue(result.is_empty())
+
+
+class TestSearchResultFromParentMap(TestGraphBase):
+
+    def assertSearchResult(self, start_keys, stop_keys, key_count, parent_map,
+                           missing_keys=()):
+        (start, stop, count) = _mod_graph.search_result_from_parent_map(
+            parent_map, missing_keys)
+        self.assertEqual((sorted(start_keys), sorted(stop_keys), key_count),
+                         (sorted(start), sorted(stop), count))
+
+    def test_no_parents(self):
+        self.assertSearchResult([], [], 0, {})
+        self.assertSearchResult([], [], 0, None)
+
+    def test_ancestry_1(self):
+        self.assertSearchResult(['rev4'], [NULL_REVISION], len(ancestry_1),
+                                ancestry_1)
+
+    def test_ancestry_2(self):
+        self.assertSearchResult(['rev1b', 'rev4a'], [NULL_REVISION],
+                                len(ancestry_2), ancestry_2)
+        self.assertSearchResult(['rev1b', 'rev4a'], [],
+                                len(ancestry_2)+1, ancestry_2,
+                                missing_keys=[NULL_REVISION])
+
+    def test_partial_search(self):
+        parent_map = dict((k,extended_history_shortcut[k])
+                          for k in ['e', 'f'])
+        self.assertSearchResult(['e', 'f'], ['d', 'a'], 2,
+                                parent_map)
+        parent_map.update((k,extended_history_shortcut[k])
+                          for k in ['d', 'a'])
+        self.assertSearchResult(['e', 'f'], ['c', NULL_REVISION], 4,
+                                parent_map)
+        parent_map['c'] = extended_history_shortcut['c']
+        self.assertSearchResult(['e', 'f'], ['b'], 6,
+                                parent_map, missing_keys=[NULL_REVISION])
+        parent_map['b'] = extended_history_shortcut['b']
+        self.assertSearchResult(['e', 'f'], [], 7,
+                                parent_map, missing_keys=[NULL_REVISION])
+
+
+class TestLimitedSearchResultFromParentMap(TestGraphBase):
+
+    def assertSearchResult(self, start_keys, stop_keys, key_count, parent_map,
+                           missing_keys, tip_keys, depth):
+        (start, stop, count) = _mod_graph.limited_search_result_from_parent_map(
+            parent_map, missing_keys, tip_keys, depth)
+        self.assertEqual((sorted(start_keys), sorted(stop_keys), key_count),
+                         (sorted(start), sorted(stop), count))
+
+    def test_empty_ancestry(self):
+        self.assertSearchResult([], [], 0, {}, (), ['tip-rev-id'], 10)
+
+    def test_ancestry_1(self):
+        self.assertSearchResult(['rev4'], ['rev1'], 4,
+                                ancestry_1, (), ['rev1'], 10)
+        self.assertSearchResult(['rev2a', 'rev2b'], ['rev1'], 2,
+                                ancestry_1, (), ['rev1'], 1)
+
+
+    def test_multiple_heads(self):
+        self.assertSearchResult(['e', 'f'], ['a'], 5,
+                                extended_history_shortcut, (), ['a'], 10)
+        # Note that even though we only take 1 step back, we find 'f', which
+        # means the described search will still find d and c.
+        self.assertSearchResult(['f'], ['a'], 4,
+                                extended_history_shortcut, (), ['a'], 1)
+        self.assertSearchResult(['f'], ['a'], 4,
+                                extended_history_shortcut, (), ['a'], 2)

=== modified file 'bzrlib/tests/test_remote.py'
--- bzrlib/tests/test_remote.py 2011-08-12 09:49:24 +0000
+++ bzrlib/tests/test_remote.py 2011-08-24 09:03:40 +0000
@@ -2147,8 +2147,8 @@
         parents = repo.get_parent_map([rev_id])
         self.assertEqual(
             [('call_with_body_bytes_expecting_body',
-              'Repository.get_parent_map', ('quack/', 'include-missing:',
-              rev_id), '\n\n0'),
+              'Repository.get_parent_map',
+              ('quack/', 'include-missing:', rev_id), '\n\n0'),
              ('disconnect medium',),
              ('call_expecting_body', 'Repository.get_revision_graph',
               ('quack/', ''))],

=== modified file 'bzrlib/vf_repository.py'
--- bzrlib/vf_repository.py 2011-08-02 11:18:43 +0000
+++ bzrlib/vf_repository.py 2011-08-24 09:03:40 +0000
@@ -70,7 +70,7 @@
     )

 from bzrlib.trace import (
-    mutter,
+    mutter
     )
