document_new_from_data() arg1 must be without null bytes

Bug #312462 reported by Carl Karsten
44
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Poppler Python Bindings
In Progress
Medium
Gian Mario Tagliaretti
web2-Conf
In Progress
Medium
Unassigned
python-poppler (Ubuntu)
In Progress
Undecided
Unassigned

Bug Description

on the poppler list Albert says:
"the argument 1 must be string without null bytes" problem is a bug
either in the python binding or in the glib binding the python sits on top
of.

>>> import poppler
Xlib: extension "RANDR" missing on display ":0.0".
>>> from gtk.gdk import Pixbuf,COLORSPACE_RGB

>>> pdf=open('pycon08_b.pdf','rb').read()
>>> pixbuf = Pixbuf(COLORSPACE_RGB,False,8,286,225)

>>> p = poppler.document_new_from_data(pdf,len(pdf),password='')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: document_new_from_data() argument 1 must be string without null bytes, not str

Just to be sure it was really looking for 0's:
>>> pdf=(pdf.replace('\x00','a'))
>>> p = poppler.document_new_from_data(pdf,len(pdf),password='')
(no error)

Revision history for this message
Carl Karsten (carlfk) wrote :

I think the answer is around here somewhere:

https://code.fluendo.com/pigment/trac/browser/trunk/libs/codegen/argtypes.py?rev=966#L114

    # allows strings with embedded NULLs.

Revision history for this message
Carl Karsten (carlfk) wrote :
Revision history for this message
Carl Karsten (carlfk) wrote :

http://svn.python.org/view/python/trunk/Python/getargs.c?rev=67905&view=markup

   if ((Py_ssize_t)strlen(*p) != PyString_Size(arg))
    return converterr("string without null bytes",
        arg, msgbuf, bufsize);

probably the code that is catching the null.

Revision history for this message
Carl Karsten (carlfk) wrote :

fixed.

There may be a problem with codegen/argtypes.py not detecting it right, or maybe poppler.c/h poppler_document_new_from_data needs a better signature. don't care, that isn't part of the python-poppler build process.

Revision history for this message
Carl Karsten (carlfk) wrote :

please review my patch and update trunk if it looks reasonable.

Changed in poppler-python:
status: New → In Progress
Changed in web2conf:
importance: Undecided → Medium
status: New → In Progress
Carl Karsten (carlfk)
Changed in python-poppler:
status: New → In Progress
Revision history for this message
Carl Karsten (carlfk) wrote :
Revision history for this message
Carl Karsten (carlfk) wrote :

simple test:

fail:
carl@asus17:~/tss$ python -c "import poppler; poppler.document_new_from_data('\x00',1,password='')"
Xlib: extension "RANDR" missing on display "localhost:13.0".
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: document_new_from_data() argument 1 must be string without null bytes, not str

pass:
juser@cp666:~/temp$ python -c "import poppler; poppler.document_new_from_data('\x00',1,password='')"
Xlib: extension "RANDR" missing on display "localhost:10.0".
Error: May not be a PDF file (continuing anyway)
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Traceback (most recent call last):
  File "<string>", line 1, in <module>
glib.GError: PDF document is damaged

yeah, 0 is not a valid pdf. that's not the point.

Revision history for this message
Gian Mario Tagliaretti (gianmt) wrote :

Hi Carl,

making this change will raise a compiler warning about differ in signdess.

I had a look at glib-poppler and the function initializes a new MemStream which expects a char*, so I don't know how feasible a change will be in poppler itself, can you ask them maybe?

thanks for all of your effort!

Revision history for this message
Brian Murray (brian-murray) wrote :

Looking at the attachments in this bug report, I noticed that "poppler.defs.diff" was not flagged as a patch. A patch contains changes to an Ubuntu package that will resolve a bug and this attachment is one! Subsequently, I've checked the patch flag for it. In the future when submitting patches please use the patch checkbox as there are some Launchpad searches that use this feature. Thanks for your contribution Carl Karsten!

Changed in poppler-python:
assignee: nobody → Gian Mario Tagliaretti (gianmt)
importance: Undecided → Medium
milestone: none → development
Revision history for this message
BenjaminBerg (benjamin-sipsolutions) wrote :

Hm, just a note. I don't think that the python version of the function should even have the "length" parameter, unfortunately, it might be hard to remove it at this point (it could instead be just ignored, and made optional).

The bindings should just call len(data) themselves, instead of leaving it to the binding user. This is a) less work for the user of the library and b) fixes potential problems where the passed in length is wrong (which could potentially cause crashes inside Poppler, I expect).

Revision history for this message
Pete (pt.pete.pt) wrote :

Did somebody find a solution to this? In Scientific Linux 6.2 and Debian Squeeze the bug is still present.

Revision history for this message
Pete (pt.pete.pt) wrote :

Ok,

as this bug is open for 4 years now, I spent the afternoon fixing it. The solution was quite easy.

The auto generated wrapping from the .defs file is not sufficient. I added a method to the .override file, which solves the problem.

The problem was, that the C function requires a char pointer (which is not \x00 terminated) and the length. But the automatic mapping interpreted it as a null terminated string and therefore threw a TypeError (because pdf documents typically contain null bytes).

I fixed the mapping by changing "si" to "s#" (http://docs.python.org/c-api/arg.html), which allows null bytes and also renders the length parameter redundant. Therefore I removed it as Benjamin suggested.

As this method probably never worked, there should be no fear to remove the length parameter when thinking about backward compatibility.

Have a nice evening.

Revision history for this message
Pete (pt.pete.pt) wrote :

My first patch didnt contain code to free the memory of the poppler code.

Revision history for this message
Paul Melis (paul-floorball-flamingos) wrote :

If this patch is valid, can it be merged, please?

Revision history for this message
Peter Waller (peter.waller) wrote :

Ping. I really want to use poppler-python and stuff like this makes me cringe.

Revision history for this message
BenjaminBerg (benjamin-sipsolutions) wrote :

I would strongly suggest everyone to move to the GObject Introspection based bindings. These bindings basically make the static bindings obsolete.

Though it looks like the "array length" annotation is missing for the data input. But it does work already.

>>> from gi.repository import Poppler
>>> Poppler.Document.new_from_data('\0', 1, None)
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/gi/types.py", line 137, in constructor
    return info.invoke(cls, *args, **kwargs)
gi._glib.GError: PDF document is damaged
>>>

Revision history for this message
Peter Waller (peter.waller) wrote :

BenjaminBerg, this wasn't obvious. I'm glad to hear there is something replacing the bindings and it's not just that it is totally dead. However, if I try and use it:

>>> from gi.repository import Poppler
ERROR:root:Could not find any typelib for Poppler

I can't see any obvious packages that I might be missing (`apt-cache search poppler-glib`). Suggestions?

Revision history for this message
Peter Waller (peter.waller) wrote :

Oh, duh, `gir1.2-poppler-0.18` on ubuntu. Was blind to it.

Revision history for this message
Martin Manns (mmanns) wrote :

Is there already a patched poppler deb package with static bindings?

Importing the gi based version results in (other modules in namespace: cStringIO, math, warnings, cairo, wx, wx.lib.wxcairo, rsvg):

from gi.repository import Poppler
  File "/usr/lib/python2.7/dist-packages/gi/__init__.py", line 39, in <module>
    raise ImportError(_static_binding_error)
ImportError: When using gi.repository you must not import static modules like "gobject". Please change all occurrences of "import gobject" to "from gi.repository import GObject". See: https://bugzilla.gnome.org/show_bug.cgi?id=709183

Revision history for this message
BenjaminBerg (benjamin-sipsolutions) wrote :

Well, you could try to use the introspection bindings for the other packages too (in this case, I think it is only rsvg that needs to be changed)

Revision history for this message
Florian Aspart (florian-aspart) wrote :

I would really need the state function poppler.document_new_from_data to work. Is there any advance on this?

Using the GObject Inspection somehow does not wokr me (I get a segmentation fault)...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.