Comment 7 for bug 687752

Revision history for this message
Stuart Bishop (stub) wrote :

LFC represents a file on disk. Multiple LFAs will reference a single LFC, saving us disk space when identical files get uploaded multiple times.

Metadata associated with the blob on disk belongs in LFC. Hash, size, path (calculated from the id).

Stuff independent of the blob on disk is in LFA. Name, mimetype etc. You can upload an identical file multiple times with different names and mimetypes.

We used to have a deleted flag on the LFC. We would keep the records around forever, even after we removed the blob from disk. This allowed us to support resurrection of files - upload a fresh copy of a file to the system, and any old expired LFAs would become accessible again. This feature ended up not serving any real purpose in Launchpad, except causing security concerns when a bug attachment that was 'deleted' from a public bug was resurrected when an identical file was attached to a private bug. So we changed things - dropped the deleted flag, removed LFC records when the blob was removed from disk, made LFA.content nullable. That we forgot what blob of data used to be attached to a LFA was a desirable feature.

Moving the hash to LFA is denormalizing the data. I don't think we should do this unless we have to for performance reasons. If we want to take this approach, parts of the Librarian client need to be rewritten (addFile will need to calculate all of the hashes on the client side as it is inserting the LFA record - we should still be able to support streaming), the Librarian, and the Librarian Garbage Collector. All the code that accesses the hashes on the LFC needs to be refactored to use the hash on the LFA (just soyuz?) - going back from LFC to one of its LFAs will require a db query so we can't use compatibility code.

We could consider putting back the deleted column on LFC. If we take this approach, we need to be careful when refactoring the Librarian garbage collector to not allow resurrection. Things that are deleted get LFA.content set to NULL, things that are expired remain linked to the LFC. Uploading an identical blob will resurrect the LFC, but LFAs will not be relinked and we avoid the security concerns.

We should discuss all this in Dallas anyway. I'm not clear on your original use case. Why do we have LFAs not linked to LFCs that we want to block from being reuploaded? I suspect this is abusing the expiry system and you are better off with an explicit list of hashes that should not be reuploaded, rather than blocking a set of files that have expired for some reason or other. Overloading the expired state of a file to block uploads seems dangerous as it becomes impossible to turn off this behavior if necessary.