Provide pdiffs for apt-get update

Bug #214612 reported by Johan Kiviniemi
This bug affects 12 people
Affects           Status    Importance  Assigned to  Milestone
Launchpad itself  Triaged   Low         Unassigned
Raspbian          New       Undecided   Unassigned

Bug Description

Debian has been using pdiffs for apt-get update for a while now. Instead of downloading megabytes of package lists when a tiny part has changed, a number of diffs are downloaded and applied to the local package lists. That makes apt-get update faster for the end user and possibly much cheaper for the mirrors.
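As a rough illustration of what "applying a diff to the local package list" involves: Debian's pdiffs are ed-style scripts (the kind `diff --ed` produces), and the sketch below applies such a script to a list of lines. The function name `apply_ed_script` and the sample data are made up for illustration; it handles only the `a`, `c` and `d` commands and is not apt's actual implementation.

```python
import re

# Matches ed commands such as "5c", "2,3d" or "0a".
CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply an ed-style script to a list of lines and return the result.

    Hypothetical helper: supports only append (a), change (c) and delete (d).
    `diff --ed` emits commands from the bottom of the file upwards, so each
    command's line numbers are still valid when it is applied.
    """
    out = list(lines)
    it = iter(script)
    for raw in it:
        m = CMD.match(raw)
        if m is None:
            raise ValueError(f"unsupported ed command: {raw!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        new_text = []
        if op in "ac":
            # Inserted text follows the command and ends with a lone ".".
            for text in it:
                if text == ".":
                    break
                new_text.append(text)
        if op == "a":
            out[start:start] = new_text      # insert after line `start`
        elif op == "c":
            out[start - 1:end] = new_text    # replace lines start..end
        else:
            del out[start - 1:end]           # delete lines start..end
    return out


# Made-up package list and a two-command patch that bumps both versions.
old = ["Package: a", "Version: 1", "", "Package: b", "Version: 1"]
patch = ["5c", "Version: 2", ".", "2c", "Version: 1.1", "."]
new = apply_ed_script(old, patch)
```

The patch here is six short lines, whereas a full re-download would transfer the whole list again; that is the saving this bug is asking for.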

Ubuntu repositories do not provide pdiffs and apt in Ubuntu doesn’t try to download them by default.

While we’re waiting for https://blueprints.edge.launchpad.net/ubuntu/+spec/apt-sync, it would be nice to use pdiffs in the meantime, since that functionality has already been implemented and tested.

Siegfried Gevatter (rainct) wrote :

I'm also interested in this. Currently an "apt-get update" having just the "deb" lines for current release and a "deb-src" for the development release can take over 5 minutes here using a 3G connection.

Julian Edwards (julian-edwards) wrote :

This would be a nice change, let's see what we can do.

Changed in soyuz:
importance: Undecided → Medium
status: New → Triaged
Julian Edwards (julian-edwards) wrote :

Lars Wirzenius has recently tested updates using zsync, we'll look into that.

tags: added: soyuz-publish
Magnes (magnesus2) wrote :

So, it's 2011 now. Did you look into that? Because it's really slow to download Packages.gz even on fast connections.

Julian Edwards (julian-edwards) wrote :

We would love to implement this but we're really busy with more important fixes, like performance enhancements and fixing OOPSes. However, Launchpad is open source so if anyone wants to help fix this they'd get mentoring help from the developers.

Curtis Hovey (sinzui)
Changed in launchpad:
importance: Medium → Low
Clint Byrum (clint-fewbar) wrote :

So this may be more important than its "bug importance" would imply.

It would seem that the Ubuntu archive sets Expires: headers about 25 minutes after the modified time of the file it is serving:

[ ] Release 17-Apr-2012 06:53 48K

$ HEAD http://archive.ubuntu.com/ubuntu/dists/precise/Release
200 OK
Cache-Control: max-age=1762, s-maxage=3300, proxy-revalidate
Connection: close
Date: Tue, 17 Apr 2012 07:18:51 GMT
Accept-Ranges: bytes
ETag: "c19b-4bdda64880a80"
Server: Apache/2.2.14 (Ubuntu)
Content-Length: 49563
Content-Type: text/plain
Expires: Tue, 17 Apr 2012 07:48:14 GMT
Last-Modified: Tue, 17 Apr 2012 06:53:14 GMT
Client-Date: Tue, 17 Apr 2012 07:18:51 GMT
Client-Peer: 91.189.92.170:80
Client-Response-Num: 1
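For reference, the cache lifetimes implied by the headers pasted above can be worked out with a few lines of Python. This is only a sketch using the three date values from that response; the helper name `minutes_between` is made up. With these particular values, Expires falls 55 minutes after Last-Modified (matching s-maxage=3300), and a little over 29 minutes remained at the time of the request (matching max-age=1762 to within a second).

```python
from email.utils import parsedate_to_datetime

# Date values copied from the HEAD response above.
headers = {
    "Date": "Tue, 17 Apr 2012 07:18:51 GMT",
    "Expires": "Tue, 17 Apr 2012 07:48:14 GMT",
    "Last-Modified": "Tue, 17 Apr 2012 06:53:14 GMT",
}


def minutes_between(earlier, later):
    """Minutes elapsed between two HTTP date strings (hypothetical helper)."""
    delta = parsedate_to_datetime(later) - parsedate_to_datetime(earlier)
    return delta.total_seconds() / 60


# Total lifetime the server grants, counted from the file's mtime.
lifetime = minutes_between(headers["Last-Modified"], headers["Expires"])
# Freshness left for a cache that stored the response at request time.
remaining = minutes_between(headers["Date"], headers["Expires"])

print(f"{lifetime:.1f}")   # 55.0
print(f"{remaining:.1f}")  # 29.4
```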

Any skew between the Release, Release.gpg, and especially Packages.gz files means that apt reports a 'hash sum mismatch'. This is particularly frustrating while testing deployment/automation against the dev release, because the files keep changing.
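To make the skew concrete: Release lists a checksum and size for each index file, so a Release from one publisher run paired with a Packages.gz from another fails verification. The sketch below mimics that check with made-up data; the helper names are hypothetical, it only reads the SHA256 section, and it is not apt's code.

```python
import hashlib


def sha256_entries(release_text):
    """Parse the 'SHA256:' section of a Release file into {path: (hash, size)}."""
    entries = {}
    in_section = False
    for line in release_text.splitlines():
        if line.startswith("SHA256:"):
            in_section = True
        elif in_section and line.startswith(" "):
            digest, size, path = line.split()
            entries[path] = (digest, int(size))
        else:
            in_section = False
    return entries


def matches_release(release_text, path, payload):
    """True if `payload` has the size and SHA256 that Release records for `path`."""
    digest, size = sha256_entries(release_text)[path]
    return len(payload) == size and hashlib.sha256(payload).hexdigest() == digest


# Made-up stand-ins for two consecutive versions of an index file.
old_index = b"Package: a\nVersion: 1\n"
new_index = b"Package: a\nVersion: 2\n"
path = "main/binary-amd64/Packages.gz"

# A Release generated for the OLD index, e.g. one still sitting in a cache.
cached_release = (
    "Origin: Ubuntu\n"
    "SHA256:\n"
    f" {hashlib.sha256(old_index).hexdigest()} {len(old_index)} {path}\n"
)

consistent = matches_release(cached_release, path, old_index)
skewed = matches_release(cached_release, path, new_index)
```

Here `consistent` is true and `skewed` is false: the stale Release paired with the fresh index is the "hash sum mismatch" case.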

With the influx of more apt sources (backports, multiarch, extras), the potential for running into this skew gets larger and larger. The Expires: header means that if you happen to cache the responses mid-update (pretty easy, with 5 to 8 minutes seeming to be the average gap between Packages.gz and Release being written), then you will have a broken cache for 25 minutes.

With the pdiff format, it would seem that the window for archive skew is, if nothing else, smaller and less painful to repeat. It's also conceivable that Expires: can be relaxed a bit, perhaps to just 10 minutes, if the pdiff format is used, since in theory requesting Release and the pdiff index twice is not as bad as re-requesting Packages.gz.

Nikolaus Waxweiler (madleser) wrote :

Anybody working on this? I'm on a very slow link at home and it's very annoying to have the connection blocked for 5-10 minutes or something while megabytes of package lists are downloaded :(

hackel (hackel) wrote :

Ubuntu really needs to implement this or another solution. apt-sync appears to be dead. I have a decent 20M connection, but most of the time I can't pull list or package updates at more than 100 KiB/s. It's not just users on a slow or limited bandwidth link that are affected by this!

Michael (michaelraspi) wrote :

It is mid-2018 now and nothing has changed. That's quite a shame, especially considering they (the Raspberry Pi Foundation) have just finally added better Ethernet, but the new Pi 3 isn't worth it at all.

Raspbian is popular (because it's there?) but not as much time and energy is put into it as into Debian.

It's even weirder considering everything should be provided from upstream and pdiffs are not related to hardware at all, AFAIK.

Colin Watson (cjwatson) wrote :

The "hash sum mismatch" problems that Clint mentioned were fixed differently (see https://www.chiark.greenend.org.uk/~cjwatson/blog/no-more-hash-sum-mismatch-errors.html).

I sympathise with the slow-link issues; at home I get 2.5Mbps down at the best of times. However, the Debian archive (where pdiffs were designed) cycles once every six hours, whereas the Ubuntu archive cycles much more frequently, potentially as often as once every five minutes if there are changes to be published. IIRC this is throttled for mirroring purposes so in practice most users don't see it cycling that frequently even in development releases, but from the point of view of publishing this means that we might well have to publish a very large number of small pdiffs; and this in turn means that apt would end up making lots of small requests to catch up, which could itself end up being pretty slow. A naïve implementation could well make things worse for a lot of people.
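A back-of-the-envelope version of that concern, as a sketch: assume the worst case where every archive cycle publishes one pdiff and nothing is merged, and count how many patches a client would have to fetch after being offline for a while. The function name `patches_to_fetch` and the 24-hour gap are illustrative assumptions; the cycle times are the ones quoted above.

```python
def patches_to_fetch(hours_behind, cycle_minutes):
    """Upper bound on pdiffs to download, assuming one unmerged pdiff per cycle."""
    return int(hours_behind * 60 // cycle_minutes)


# A client that last ran "apt-get update" 24 hours ago.
debian_style = patches_to_fetch(24, cycle_minutes=6 * 60)  # six-hour cycle
ubuntu_worst = patches_to_fetch(24, cycle_minutes=5)       # five-minute cycle

print(debian_style)  # 4
print(ubuntu_worst)  # 288
```

Under these assumptions the same one-day gap costs 4 requests per index file on a six-hour cycle and up to 288 on a five-minute one, which is why a straight port could be slower than a single large download.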

So, I think this project would likely involve working out how to merge some patches to produce a good trade-off between download size and number of requests; perhaps we can get away with just publishing merged diffs to the current version from every version since some time threshold ago (although we'd need to work out how to compute that efficiently), or perhaps we can have some coarser-grained patches as well and work out how to get apt to follow the resulting shorter patch path. It's not going to be just a matter of lifting and shifting the Debian implementation, although it should be possible to use the same on-disk format. https://wiki.debian.org/DebianRepository/Format#indices_difference_files_.28diffs.29 will probably be helpful to anyone working on this.
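One way to picture the "coarser-grained patches" idea is as a shortest-path problem: each published patch is an edge from one index version to a later one, and the client wants the route with the fewest downloads. The sketch below uses a breadth-first search over made-up integer version numbers; the function name and the patch layout are hypothetical, and nothing here reflects how apt or the Launchpad publisher actually behave.

```python
from collections import deque


def shortest_patch_path(patches, have, want):
    """Return the fewest (src, dst) patches leading from `have` to `want`.

    `patches` is an iterable of (src, dst) version pairs. Returns [] if the
    client is already up to date and None if no chain of patches exists.
    """
    graph = {}
    for src, dst in patches:
        graph.setdefault(src, []).append(dst)

    prev = {have: None}          # version -> version we reached it from
    queue = deque([have])
    while queue:
        node = queue.popleft()
        if node == want:
            path = []
            while prev[node] is not None:
                path.append((prev[node], node))
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Fine-grained patches for every cycle, plus two coarser merged patches.
fine = [(i, i + 1) for i in range(8)]
coarse = [(0, 4), (4, 8)]

fine_only = shortest_patch_path(fine, have=1, want=8)
with_coarse = shortest_patch_path(fine + coarse, have=1, want=8)
```

In this toy layout a client at version 1 needs seven requests with fine-grained patches alone and four once the coarse patches exist; the trade-off is that the archive has to compute and store the extra merged patches.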
