librarian trusts user browser supplied mime type for bug attachments (can set type to text/html though it should be text/plain)

Bug #34758 reported by StefanPotyra
8
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Low
Unassigned

Bug Description

If a browser supplies an incorrect mime type when uploading a bug attachment, we honour that and don't allow it to be corrected.

E.g.
http://librarian.launchpad.net/1699905/ChangeLog
get's its content type set to text/html, which makes it quite unreadable in the browser.

strange enough, the diffstat (http://librarian.launchpad.net/1699903/clisp.diffstat) I also attached to the bugreport shows as content-type text/plain.

Workarounds
===========

Get the attachment with wget

Or do that and reupload it with a better browser.

Revision history for this message
James Henstridge (jamesh) wrote :

The librarian is serving those attachments with the mime type it was told to. So we need to work out why it was uploaded as text/html in the first place.

What bug is that file attached to?

Revision history for this message
StefanPotyra (sistpoty) wrote :
Revision history for this message
Andrew Bennetts (spiv) wrote :

The mimetype is easily changeable by clicking the "edit" link in the attachments portlet on the right, which takes you to https://launchpad.net/distros/ubuntu/+source/clisp/+bug/34668/attachments/7592.

I think the bug here is that the +addattachment page doesn't ask for a mimetype, so has to rely on what the browser guesses the file's mimetype is. This is a Malone bug, rather than a Librarian bug.

Changed in launchpad:
status: Unconfirmed → Confirmed
Revision history for this message
Björn Tillenius (bjornt) wrote :

Either we should ask for a mime type when adding the attachment, or we should improve the guessing of the mime type.

Changed in malone:
assignee: nobody → bjornt
Revision history for this message
Kees Cook (kees) wrote :

Just a note; I've run into this a few times when people attach a ".debdiff" file.

Revision history for this message
Matt Zimmerman (mdz) wrote :

I see this fairly regularly as well. It just happened with the logfile attachment in bug 126797.

Is the file type provided by the browser or guessed by launchpad?

Revision history for this message
Stuart Bishop (stub) wrote :

We trust the mime type sent by the browser. So at the moment it is garbage in, garbage out.

If it is only a case of text/plain being sent as text/html, can we special case this in the Librarian? I can't think of a use case where we *want* to store HTML in the Librarian and have it served up as HTML. So we can make the Librarian serve HTML mime types as text/plain or better yet store them in the database as text/plain on upload. I think we need to do this one day anyway, as if the Librarian starts doing authentication it will become a source of attacks.

Revision history for this message
Matt Zimmerman (mdz) wrote : Re: [Bug 34758] Re: librarian will set type to text/html though it should be text/plain

On Tue, Aug 21, 2007 at 04:04:52AM -0000, Stuart Bishop wrote:
> We trust the mime type sent by the browser. So at the moment it is
> garbage in, garbage out.

I wonder what's confusing Firefox, then. These files don't look like HTML
to file(1), for example.

Alexander, can you tell us how the detection works?

> If it is only a case of text/plain being sent as text/html, can we
> special case this in the Librarian? I can't think of a use case where we
> *want* to store HTML in the Librarian and have it served up as HTML. So
> we can make the Librarian serve HTML mime types as text/plain or better
> yet store them in the database as text/plain on upload. I think we need
> to do this one day anyway, as if the Librarian starts doing
> authentication it will become a source of attacks.

It seems like it should be possible to store HTML in the librarian and
record it as such, though I agree that it doesn't make sense to serve it
that way.

--
 - mdz

Revision history for this message
Alexander Sack (asac) wrote : Re: [Bug 34758] Re: librarian will set type to text/html though it should be text/plain

On Tue, Aug 21, 2007 at 08:36:45AM +0100, Matt Zimmerman wrote:
> On Tue, Aug 21, 2007 at 04:04:52AM -0000, Stuart Bishop wrote:
> > We trust the mime type sent by the browser. So at the moment it is
> > garbage in, garbage out.
>
> I wonder what's confusing Firefox, then. These files don't look like HTML
> to file(1), for example.
>
> Alexander, can you tell us how the detection works?

I think its a mix from file-extension and file-introspection .... for
details I would have to investigate ... which I didn't because read
below ...

>
> > If it is only a case of text/plain being sent as text/html, can we
> > special case this in the Librarian? I can't think of a use case where we
> > *want* to store HTML in the Librarian and have it served up as HTML. So
> > we can make the Librarian serve HTML mime types as text/plain or better
> > yet store them in the database as text/plain on upload. I think we need
> > to do this one day anyway, as if the Librarian starts doing
> > authentication it will become a source of attacks.
>

It looks like that this is not true. What I did: I downloaded your
gdm.log from bug 126797 ... uploaded the .log file to some random test
bug 121740 and captured the headers.

And see: launchpad thinks its text/html ... but the headers log reveals
that firefox sends the gdm.log mime-part as text/x-log (e.g. no
text/html) ... so I suspect that launchpad falls back to text/html if
it doesn't know about the mimetype send by the browser (however just
guessing).

FYI, attached the captured header log.

> It seems like it should be possible to store HTML in the librarian and
> record it as such, though I agree that it doesn't make sense to serve it
> that way.
>

IMHO, all mime-types in the text/* hierachy should just be served as
text/plain ... I might miss some corner cases where we want
something different though.

 - Alexander

 �c��

Revision history for this message
Gavin Panella (allenap) wrote : Re: librarian will set type to text/html though it should be text/plain

The debdiff case was dealt with recently in bug 229040, but it looks like a more general fix is necessary. The problem, afaict, is in the `zope.app.content_types.text_type` function, which does some really awful guessing:

def text_type(s):
    s = s.strip()
    # Yuk. See if we can figure out the type by content.
    if s.lower().startswith('<html>') or '</' in s:
        return 'text/html'
    elif s.startswith('<?xml'):
        return 'text/xml'
    else:
        return 'text/plain'

We could call `z.a.content_types.guess_content_type` with an explicit default, then `text_type` will never be called. Some people may rely on the fact that real XML and HTML files are currently detected, so a refined version of `text_type` may be needed instead.

Revision history for this message
Matt Zimmerman (mdz) wrote : Re: [Bug 34758] Re: librarian will set type to text/html though it should be text/plain

On Mon, Jul 07, 2008 at 10:14:04AM -0000, Gavin Panella wrote:
> The debdiff case was dealt with recently in bug 229040, but it looks
> like a more general fix is necessary. The problem, afaict, is in the
> `zope.app.content_types.text_type` function, which does some really
> awful guessing:
>
> def text_type(s):
> s = s.strip()
> # Yuk. See if we can figure out the type by content.
> if s.lower().startswith('<html>') or '</' in s:
> return 'text/html'
> elif s.startswith('<?xml'):
> return 'text/xml'
> else:
> return 'text/plain'
>
> We could call `z.a.content_types.guess_content_type` with an explicit
> default, then `text_type` will never be called. Some people may rely on
> the fact that real XML and HTML files are currently detected, so a
> refined version of `text_type` may be needed instead.

I think simply replacing the first test with something a bit less liberal,
e.g.:

s = s.strip().lower()
if (s.startswith('<html') or s.startswith('<!doctype html') or
    s.startswith('<title') or s.startswith('<head') ):

would be fine. FYI, this example code reflects what file(1) does today,
which seems reasonable enough.

A more comprehensive option might be to validate the file using an HTML
parser, which I'm sure is around somewhere in launchpad already.

--
 - mdz

Gavin Panella (allenap)
Changed in malone:
assignee: bjornt → allenap
milestone: none → 1.99
Gavin Panella (allenap)
Changed in malone:
status: Confirmed → In Progress
Gavin Panella (allenap)
Changed in malone:
milestone: 1.99 → 2.1.8
Gavin Panella (allenap)
Changed in malone:
milestone: 2.1.8 → 2.1.9
Revision history for this message
Björn Tillenius (bjornt) wrote : Re: librarian will set type to text/html though it should be text/plain

Pushing to next milestone, since python-magic still isn't installed on the servers.

Changed in malone:
milestone: 2.1.9 → 2.1.10
Revision history for this message
Gavin Panella (allenap) wrote :

In Hardy, XHTML content is reported as simply text/xml, which breaks a few things. I set about implementing a sniffer for XML files to try and return something more convincing. It works, on Hardy.

However, on Intrepid, the behaviour of things based on libmagic seems to be quite different from Hardy. In particular, bug 285031 is causing serious issues. I'm suspending work on this until I understand the situation more, and until Intrepid moves out of beta.

Changed in malone:
milestone: 2.1.10 → none
Gavin Panella (allenap)
Changed in malone:
assignee: Gavin Panella (allenap) → nobody
status: In Progress → Triaged
Revision history for this message
Robert Collins (lifeless) wrote :

Content sniffing is -evil-. I'm inclined to wontfix this and say that browsers should supply better metadata. That may be impractical, so we could allow folk to set the mime type explicitly after upload.

summary: - librarian will set type to text/html though it should be text/plain
+ librarian trusts user browser supplied mime type for bug attachments
+ (can set type to text/html though it should be text/plain)
Changed in launchpad:
importance: Medium → Low
description: updated
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.