Ubuntu Distributed Development

import_package re-downloads files multiple times

Bug #524123 reported by John A Meinel on 2010-02-18

This bug affects 1 person

Affects                          Status         Importance   Assigned to     Milestone
Ubuntu Distributed Development   Fix Released   Medium       John A Meinel

Bug Description

I'm trying to run an 'import_package' locally, and it seems to try to be smart about branch history by using shared repositories, etc.

However, while attempting a local import of "gnome-panel", I see >700MB of data transferred. (This is backed up by running with -Dhttp and grepping the Content-Length: headers.)
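A minimal sketch of how such a total could be computed from a saved log. It assumes only that each response size appears somewhere on a line as "Content-Length: <bytes>"; the exact -Dhttp log layout is not shown in this report, and the function name is made up for illustration.

```python
import re

# Assumption: sizes show up as "Content-Length: <bytes>" on a log line.
CONTENT_LENGTH = re.compile(r"Content-Length:\s*(\d+)", re.IGNORECASE)


def total_content_length(log_lines):
    """Sum every Content-Length value found in the given log lines."""
    total = 0
    for line in log_lines:
        match = CONTENT_LENGTH.search(line)
        if match:
            total += int(match.group(1))
    return total
```

Calling it on an open log file (file objects iterate line by line) gives the total number of bytes announced by the server.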

When I look at the log of files downloaded, I see:

3765.689 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu1/+files/gnome-panel_2.20.0.1.orig.tar.gz
3795.972 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu2/+files/gnome-panel_2.20.0.1.orig.tar.gz
3824.341 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu3/+files/gnome-panel_2.20.0.1.orig.tar.gz
3875.749 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu4/+files/gnome-panel_2.20.0.1.orig.tar.gz
3900.679 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu5/+files/gnome-panel_2.20.0.1.orig.tar.gz
3924.091 fetching https+urllib://launchpad.net/ubuntu/gutsy/+source/gnome-panel/1:2.20.0.1-0ubuntu6/+files/gnome-panel_2.20.0.1.orig.tar.gz
3958.858 fetching https+urllib://launchpad.net/ubuntu/hardy/+source/gnome-panel/1:2.20.0.1-0ubuntu6/+files/gnome-panel_2.20.0.1.orig.tar.gz

If you look at it, this is because there are potentially many Ubuntu packages based on the same orig.tar.gz. It doesn't seem to care that the file exists locally with the same name. (It would overwrite it each time.)
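The repetition can be made visible by counting fetches per file name. This is only a sketch: it assumes log lines of the form "<timestamp> fetching <url>" as in the excerpt above, and the helper name is hypothetical.

```python
from collections import Counter
import posixpath


def count_fetches_by_filename(log_lines):
    """Count how often each file name (last URL component) was fetched."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        # Lines of interest look like "<timestamp> fetching <url>".
        if len(parts) >= 3 and parts[1] == "fetching":
            counts[posixpath.basename(parts[2])] += 1
    return counts
```

For the seven log lines quoted above, this reports gnome-panel_2.20.0.1.orig.tar.gz with a count of 7.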

This is also being run from within a single process, so it would likely be able to say "I just downloaded that, trust that it is accurate".

Now the full url is different, so maybe we can't trust it?

Related branches

lp:~jameinel/udd/single-download-524123

Merged into lp:udd

Ubuntu Distributed Development Developers: Pending requested 2010-02-19

James Westby (james-w) wrote on 2010-02-18: Re: [Bug 524123] [NEW] import_package re-downloads files multiple times

On Thu, 18 Feb 2010 23:13:53 -0000, John A Meinel <email address hidden> wrote:
> If you look at it, this is because there are potentially many Ubuntu
> packages based on the same orig.tar.gz. It doesn't seem to care that the
> file exists locally with the same name. (It would overwrite it each
> time.)

Oops.

> This is also being run from within a single process, so it would likely
> be able to say "I just downloaded that, trust that it is accurate".
>
> Now the full url is different, so maybe we can't trust it?

We can as long as we don't trust it to be the same from different
distributions.

A cache based on hashes would be perfectly safe, but a little more work.

Without looking I don't know whether the cross-distribution requirement
means it would be just as easy to do the cache.

Thanks,

James
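For illustration, a cache based on hashes could look roughly like the sketch below. It is not code from the udd branch: the class and function names are invented, SHA-256 is an arbitrary choice, and it is written for Python 3. Files are stored under their content hash, so identical tarballs are kept once regardless of which distribution or URL they came from.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashCache:
    """Hypothetical cache that stores each file under its SHA-256."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def lookup(self, expected_sha256):
        """Return the cached path for this hash, or None on a miss."""
        path = os.path.join(self.cache_dir, expected_sha256)
        return path if os.path.exists(path) else None

    def add(self, source_path):
        """Copy a downloaded file into the cache and return its hash."""
        digest = sha256_of(source_path)
        dest = os.path.join(self.cache_dir, digest)
        if not os.path.exists(dest):
            shutil.copyfile(source_path, dest)
        return digest
```

A caller would check lookup() with the hash it expects before downloading, and call add() after a successful download.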

John A Meinel (jameinel) wrote on 2010-02-19:

So if we can trust it by distribution, I could just do:

=== modified file 'import_package.py'
--- import_package.py 2010-02-18 20:26:19 +0000
+++ import_package.py 2010-02-19 02:54:41 +0000
@@ -558,7 +558,7 @@
         extract_upstream_branch(update_db, upstream_dir)
         dl_dir = tempfile.mkdtemp()
         try:
-            local_dsc_path = dget(importp.get_url(), temp_dir,
+            local_dsc_path = dget(importp.get_url(), temp_dir + distro,
                                   possible_transports=possible_transports)
             update_db.import_package(local_dsc_path,
                     use_time_from_changelog=True)

And then change 'dget' so that if the file already exists in that directory, it skips the download.
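A rough sketch of that "skip if already present" behaviour, separate from the real dget: the function name and the `download` callable are hypothetical stand-ins, the code is Python 3, and the per-distribution directory is built with os.path.join rather than the string concatenation shown in the diff.

```python
import os


def fetch_once(url, target_dir, download):
    """Download url into target_dir unless that file name is already there.

    `download(url, path)` is a hypothetical callable doing the transfer.
    """
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(local_path):
        download(url, local_path)
    return local_path
```

Passing something like os.path.join(dl_dir, distro) as target_dir keeps files from different distributions apart, which is the trust boundary discussed above.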

Alternatively, we could always download the .dsc file (it's pretty small and likely different each time), and just check the hash of the file on disk against the one listed for the requested file. That is likely to be pretty easy, given that the .dsc already includes the hash.
It means re-reading the file on disk (unless we cache that), but that is better than downloading and *writing* that file again.

It wouldn't require changing anything on disk that way, just reading an already present file.
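The .dsc-based check might be sketched as follows. It assumes the usual "Files:" field of a .dsc, where each continuation line holds an MD5 sum, a size and a file name; the function names are invented for illustration and this is not the code that was merged.

```python
import hashlib
import os


def parse_dsc_files(dsc_text):
    """Return {filename: (md5_hex, size)} from the "Files:" field of a .dsc."""
    entries = {}
    in_files = False
    for line in dsc_text.splitlines():
        if line.startswith("Files:"):
            in_files = True
        elif in_files and line.startswith(" "):
            parts = line.split()
            if len(parts) == 3:
                md5_hex, size, name = parts
                entries[name] = (md5_hex, int(size))
        else:
            # Any other line (new field, blank line) ends the Files field.
            in_files = False
    return entries


def already_downloaded(path, md5_hex, size):
    """True if path exists and matches the size and MD5 listed in the .dsc."""
    if not os.path.isfile(path) or os.path.getsize(path) != size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hex
```

Checking the size first avoids re-reading a large tarball when it obviously does not match.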

John A Meinel (jameinel) wrote on 2010-02-19:

Marking as medium because this makes a huge difference in local testing. Specifically, importing gnome-panel downloads 1GB of data but only ends up with 350MB. So about 3:1, but I've seen as much as 6:1 downloads of the same file.

Changed in udd:
assignee:	nobody → John A Meinel (jameinel)
importance:	Undecided → Medium
status:	New → In Progress

James Westby (james-w) wrote on 2010-02-19: Re: [Bug 524123] Re: import_package re-downloads files multiple times

On Fri, 19 Feb 2010 02:59:17 -0000, John A Meinel <email address hidden> wrote:
> So if we can trust it by distribution, I could just do:
>
> === modified file 'import_package.py'
> --- import_package.py 2010-02-18 20:26:19 +0000
> +++ import_package.py 2010-02-19 02:54:41 +0000
> @@ -558,7 +558,7 @@
>          extract_upstream_branch(update_db, upstream_dir)
>          dl_dir = tempfile.mkdtemp()
>          try:
> -            local_dsc_path = dget(importp.get_url(), temp_dir,
> +            local_dsc_path = dget(importp.get_url(), temp_dir + distro,
>                                    possible_transports=possible_transports)
>              update_db.import_package(local_dsc_path,
>                      use_time_from_changelog=True)
>
>
> And then change 'dget' so that if the file already exists in that directory, it skips the download.

That would work fine.

> Alternatively, we could always download the .dsc file (it's pretty small and likely different each time), and just check the hash of the file on disk versus the requested file. That is likely to be pretty easy, given that the .dsc already includes the hash.
> It means re-reading the file on disk (unless we cache that), but that is better than downloading and *writing* that file again.
>
> It wouldn't require changing anything on disk that way, just reading an
> already present file.

This is the more elegant solution though. It's how things are supposed
to work :-)

I would be happy to see either, really, to save the headache that you
talk about.

Thanks,

James

James Westby (james-w) on 2010-02-19

Changed in udd:
status:	In Progress → Fix Released
