It's impossible to check a librarian file's hash once it's expired

Bug #687752 reported by Julian Edwards
This bug affects 1 person
Affects: Launchpad itself
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

The librarian hashes files on the LibraryFileContent record, which disappears when the files are garbage collected after they expire.

This means that it's impossible to check the hash once the file has expired. This is important for Soyuz, because it needs to prevent re-uploading of old files with a different hash even after they have expired.
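A minimal sketch of the problem, using an illustrative schema (not the real Launchpad tables; column names here are simplified): the hash lives on LibraryFileContent (LFC), so when garbage collection removes the LFC row on expiry, no hash remains reachable through the LibraryFileAlias (LFA).

```python
# Illustrative only: hashes stored on LFC disappear when the LFC is
# garbage-collected, leaving the LFA with nothing to check against.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE library_file_content (
        id INTEGER PRIMARY KEY,
        sha1 TEXT NOT NULL,
        filesize INTEGER NOT NULL
    );
    CREATE TABLE library_file_alias (
        id INTEGER PRIMARY KEY,
        content INTEGER REFERENCES library_file_content(id),  -- nullable
        filename TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO library_file_content VALUES (1, 'abc123', 42)")
conn.execute("INSERT INTO library_file_alias VALUES (10, 1, 'foo_1.0.tar.gz')")

# Garbage collection after expiry: unlink the alias, delete the content row.
conn.execute("UPDATE library_file_alias SET content = NULL WHERE id = 10")
conn.execute("DELETE FROM library_file_content WHERE id = 1")

# The hash is now unrecoverable through the alias.
row = conn.execute("""
    SELECT lfc.sha1 FROM library_file_alias lfa
    LEFT JOIN library_file_content lfc ON lfa.content = lfc.id
    WHERE lfa.id = 10
""").fetchone()
print(row)  # (None,)
```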

Revision history for this message
Stuart Bishop (stub) wrote :

I'm thinking this is outside the scope of the Librarian, and the one-upload-only hashes need to be duplicated in Soyuz. In particular, consider that the file had been uploaded previously as a bug attachment or a blob or similar.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 687752] Re: It's impossible to check a librarian file's hash once it's expired

On Thursday 09 December 2010 10:52:54 you wrote:
> I'm thinking this is outside the scope of the Librarian, and the one-
> upload-only hashes need to be duplicated in Soyuz. In particular,
> consider that the file had been uploaded previously as a bug attachment
> or a blob or similar.

Why not just move the hash columns from lfc to lfa?

Revision history for this message
William Grant (wgrant) wrote :

It sort of sounds like we need to move the hashes onto a LibraryFile, which sits between the two. So we don't have to duplicate the hashes, but we can still delete the LFC.

Revision history for this message
Stuart Bishop (stub) wrote :

Moving the hash from LFC to LFA is duplicating the hash - there can be multiple LFAs linked to an LFC. It will complicate garbage collection logic, and the data model is broken (well... denormalized) in that you could have multiple LFAs with differing hashes pointing to an LFC.

I'm not sure how your intended work flow plays with other users of the Librarian. If a file is uploaded into the Librarian from some other part of Launchpad, will the publisher refuse to publish it? The alternative of having a table of hashes that Soyuz will block uploads for seems simpler for everyone.
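A rough sketch of the alternative Stuart describes, with entirely hypothetical names: Soyuz keeps its own record of (archive, filename) to hash mappings, independent of the Librarian's lifecycle, and blocks any re-upload whose contents differ.

```python
# Hypothetical sketch of a Soyuz-side blocked-hashes check; none of these
# names come from the actual codebase.
known = {}  # (archive_id, filename) -> sha1 of the first accepted upload

def check_upload(archive_id, filename, sha1):
    key = (archive_id, filename)
    if key in known and known[key] != sha1:
        raise ValueError(
            f"{filename} was previously uploaded with a different hash")
    known[key] = sha1

check_upload(1, "foo_1.0.tar.gz", "abc123")
check_upload(1, "foo_1.0.tar.gz", "abc123")  # identical re-upload is fine
try:
    check_upload(1, "foo_1.0.tar.gz", "def456")  # different contents: blocked
except ValueError as exc:
    blocked = str(exc)
```

Because this table never participates in Librarian garbage collection, the check keeps working after the file itself has expired.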

Revision history for this message
William Grant (wgrant) wrote :

We retain references to the expired LFAs, so we know which ones are relevant.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Friday 10 December 2010 06:50:03 you wrote:
> Moving the hash from LFC to LFA is duplicating the hash - there can be
> multiple LFAs linked to an LFC. It will complicate garbage collection
> logic, and the data model is broken (well... denormalized) in that you
> could have multiple LFAs with differing hashes pointing to an LFC.

I think the multiple-LFAs-per-LFC model is a little weird, personally. If
it's the same file, it should be the same LFA. Then you'd not have this
denormalization.

> I'm not sure how your intended work flow plays with other users of the
> Librarian. If a file is uploaded into the Librarian from some other part
> of Launchpad, will the publisher refuse to publish it?

That will never be a problem because Soyuz looks for files that only it
uploaded, linked through an archive and publication of some sort.

> The alternative
> of having a table of hashes that Soyuz will block uploads for seems
> simpler for everyone.

You're talking about not denormalizing the librarian model, yet you're
advocating Soyuz effectively do it anyway? I don't understand that.

However, I do concede that adding a hash to one of the Soyuz tables will
indeed be very easy. But is it the right thing to do?

Revision history for this message
Stuart Bishop (stub) wrote :

LFC represents a file on disk. Multiple LFAs will reference a single LFC, saving us disk space when identical files get uploaded multiple times.

Metadata associated with the blob on disk belongs in LFC. Hash, size, path (calculated from the id).

Stuff independent of the blob on disk is in LFA. Name, mimetype etc. You can upload an identical file multiple times with different names and mimetypes.

We used to have a deleted flag on the LFC. We would keep the records around forever, even after we removed the blob from disk. This allowed us to support resurrection of files - upload a fresh copy of a file to the system, and any old expired LFAs would become accessible again. This feature ended up not serving any real purpose in Launchpad, except causing security concerns when a bug attachment that was 'deleted' from a public bug was resurrected when an identical file was attached to a private bug. So we changed things - dropped the deleted flag, removed LFC records when the blob was removed from disk, made LFA.content nullable. That we forgot what blob of data used to be attached to a LFA was a desirable feature.

Moving the hash to LFA is denormalizing the data. I don't think we should do this unless we have to for performance reasons. If we want to take this approach, parts of the Librarian client, the Librarian itself, and the Librarian Garbage Collector need to be rewritten (addFile will need to calculate all of the hashes on the client side as it inserts the LFA record - we should still be able to support streaming). All the code that accesses the hashes on the LFC needs to be refactored to use the hash on the LFA (just Soyuz?) - going back from an LFC to one of its LFAs requires a db query, so we can't use compatibility code.
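The client-side hashing Stuart mentions is straightforward to do while streaming: feed each chunk to all the digest objects as it is sent, so the file never has to be held in memory. A sketch, where upload_chunk is a hypothetical stand-in for the wire send:

```python
# Sketch of streaming upload with incremental hashing; upload_chunk is an
# invented callback, not a real Librarian client API.
import hashlib

def add_file(chunks, upload_chunk):
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    size = 0
    for chunk in chunks:
        for digest in (md5, sha1, sha256):
            digest.update(chunk)  # hash incrementally as we stream
        size += len(chunk)
        upload_chunk(chunk)
    return size, md5.hexdigest(), sha1.hexdigest(), sha256.hexdigest()

sent = []
size, md5_hex, sha1_hex, sha256_hex = add_file([b"hello ", b"world"], sent.append)
```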

We could consider putting back the deleted column on LFC. If we take this approach, we need to be careful when refactoring the Librarian garbage collector to not allow resurrection. Things that are deleted get LFA.content set to NULL, things that are expired remain linked to the LFC. Uploading an identical blob will resurrect the LFC, but LFAs will not be relinked and we avoid the security concerns.
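The resurrection-safe scheme above can be sketched as follows (hypothetical names, not the real model): a deleted alias loses its content link permanently; garbage-collecting the blob only flags the LFC as deleted; a fresh upload of identical content undeletes the LFC but links only the new alias.

```python
# Illustrative model of deletion vs. expiry with resurrection of the LFC
# but never of old aliases; all names are invented for this sketch.
class LFC:
    def __init__(self, sha1):
        self.sha1 = sha1
        self.deleted = False  # True once the blob is gone from disk

contents = {}  # sha1 -> LFC (rows kept forever)
aliases = {}   # alias id -> LFC or None

def delete_alias(alias_id):
    aliases[alias_id] = None  # content set to NULL; never restored

def gc_blob(sha1):
    contents[sha1].deleted = True  # blob removed, hash retained

def upload(alias_id, sha1):
    lfc = contents.get(sha1)
    if lfc is None:
        lfc = contents[sha1] = LFC(sha1)
    lfc.deleted = False          # fresh blob resurrects the LFC...
    aliases[alias_id] = lfc      # ...but only the new alias links to it

upload(1, "abc")
delete_alias(1)   # alias 1 is gone for good
gc_blob("abc")
upload(2, "abc")  # identical content: LFC resurrected, alias 1 stays NULL
```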

We should discuss all this in Dallas anyway. I'm not clear on your original use case. Why do we have LFAs not linked to LFCs that we want to block from being reuploaded? I suspect this is abusing the expiry system and you are better off with an explicit list of hashes that should not be reuploaded, rather than blocking a set of files that have expired for some reason or other. Overloading the expired state of a file to block uploads seems dangerous as it becomes impossible to turn off this behavior if necessary.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Tuesday 04 January 2011 13:29:31 you wrote:
> We should discuss all this in Dallas anyway.

That'd be great.

> I'm not clear on your original use case.

It's to prevent re-uploading/publishing of the same file (name) in the same
archive context with different contents.

> Why do we have LFAs not linked to LFCs that we want
> to block from being reuploaded?

We never have that, I'm not sure why you think so. The only case is where the
LFC is deleted when the file is removed from disk (which we have to do or
there would not be enough librarian space).

> I suspect this is abusing the expiry
> system and you are better off with an explicit list of hashes that
> should not be reuploaded, rather than blocking a set of files that have
> expired for some reason or other.

We do a combination of both. It's perfectly acceptable to resurrect deleted
files as long as they have the same content.

I am pretty sure that the librarian was designed like this in the first place
with Soyuz in mind.

> Overloading the expired state of a
> file to block uploads seems dangerous as it becomes impossible to turn
> off this behavior if necessary.

We don't do that so it's moot.

Cheers.

Revision history for this message
Stuart Bishop (stub) wrote :

So in summary, Soyuz blocks uploading of some files based on the hashes of files previously uploaded into Soyuz. When we changed how deleted and expired files were handled, we broke this detection, as older files are aggressively expired and their hashes removed from the database. Soyuz can no longer reliably ask 'was this file previously uploaded into Soyuz with a different hash?'.

This change was made quite some time ago and recovering the hashes from backups will be time consuming. Do we need to do this?

Revision history for this message
William Grant (wgrant) wrote :

It would be ideal, but it's far too much work to be worth it.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Wednesday 05 January 2011 06:04:02 you wrote:
> This change was made quite some time ago and recovering the hashes from
> backups will be time consuming. Do we need to do this?

It would be nice but it's not worth it. We could do with a solution for the
future though, we can chat in Dallas about it.

Cheers.

Tim Penhey (thumper)
tags: added: soyuz-upload
Changed in launchpad:
status: New → Triaged
importance: Undecided → Low