Okular can't find words with 'fi'.

Bug #411538 reported by Rolando Garza
This bug affects 4 people
Affects               Status        Importance  Assigned to  Milestone
KDE Graphics          Fix Released  Medium
kdegraphics (Ubuntu)  Fix Released  Low         Unassigned

Bug Description

Binary package hint: okular

1) Using Ubuntu 9.04.
2) okular version: 4:4.2.2-0ubuntu2
3) I expected to find, incrementally, 'infi' in a PDF created with the following words: infinity, infinite, infiltration.
4) Okular didn't find the words. After looking for a cause, I discovered that they were not found because they had 'fi' in them.

More details:
When searching for words like infinity, infinite, infiltrate -- anything with 'fi' -- Okular cannot find them (I even tried incremental search, adding letter by letter).

Take 'infiltrate', for example. Okular can find 'ltrate' and 'in', but cannot find the 'i' in 'fi' (by looking for 'iltrate'). This doesn't happen when using Document Viewer or Acroread, so it is not a latex or kile bug.

ProblemType: Bug
Architecture: i386
DistroRelease: Ubuntu 9.04
Package: okular 4:4.2.2-0ubuntu2
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: kdegraphics
Uname: Linux 2.6.28-15-generic i686

In , David D Short (chameleondave) wrote :

Version: (using KDE 4.1.4)
OS: Linux
Installed from: Ubuntu Packages

Here are two words: ﬁre, fire.

They are identical, except that the first one contains a ligature bringing the f and the i together. The second one has no ligature. The first is the standard in proper typesetting, and it is the default output of LaTeX.

If I search for "fire" in Okular, the word will not be found, because Okular doesn't understand the ligature. By way of comparison, Adobe Reader for Linux does understand the ligature, and finds the word.

This can lead to great frustration.

I imagine that this applies to all documents in Okular, rather than being specific to the PDF backend. There are a handful of other common ligatures that this applies to (see http://en.wikipedia.org/wiki/Typographic_ligature).
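
To illustrate the mismatch: assuming the text extracted from the document stores the ligature as the single code point U+FB01 (an assumption about the text layer, not something stated in this report), a plain substring search in Python fails in the same way:

```python
# U+FB01 is the single-code-point "fi" ligature.
with_ligature = "\ufb01re"  # displayed like "fire", but only 3 code points
plain = "fire"              # 4 code points: f, i, r, e

# The two strings look alike on screen but are different sequences,
# so a literal search for the plain spelling does not match.
print(len(with_ligature), len(plain))
print(plain in with_ligature)
```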

In , Jarauh (jarauh) wrote :

I have the same problem with Version 0.8 on KDE 4.2.

In case anyone needs an example-PDF:
Try to search for "config" in
http://www.nd.edu/~sommese/bertini/BertiniUsersManual.pdf

Rolando Garza (rolandog) wrote :
skorasaurus (skoraw) wrote :

I can also confirm this bug as well.

okular 0.8.2.
ubuntu 9.04
kde 4.2.2

If you search for 'config' on the PDF http://www.nd.edu/~sommese/bertini/BertiniUsersManual.pdf the first
result returned will be on page 7. However, the first result returned should be
on the bottom of page 3, where 'configurations' is written.

I made sure the 'use case sensitive' and 'from current page' options were NOT
enabled.

Changed in okular (Ubuntu):
status: New → Confirmed
Changed in okular:
status: Unknown → New
In , skorasaurus (skoraw) wrote :

I can also confirm this bug as well.

okular 0.8.2.
ubuntu 9.04
kde 4.2.2

If you search for 'config' in the PDF mentioned in Jarauh's link, the first result returned will be on page 7. However, the first result returned should be at the bottom of page 3, where 'configurations' is written.

I made sure the 'use case sensitive' and 'from current page' options were NOT enabled.

In , Jens (io-ta) wrote :

I can confirm this bug.

KDE Version 0.8.2 (KDE 4.2.2 (KDE 4.2.2), Kubuntu packages)
Application Universal document viewer
Operating System Linux (x86_64) release 2.6.28.9j2
Compiler cc

This always happens with pdf files produced by pdflatex as it makes use of ligatures.

flying sheep (flying-sheep) wrote :

think about ligatures.
i think your pdf contains them and okular doesn’t find the letters “f” and “i” in this order since there is in fact only the ligature glyph “ﬁ”.
i think okular should convert every search query to a regular expression searching for either the ligature or the single letters:
gadaffi → gada(ffi|ﬀi|ﬃ)
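
A minimal sketch of this idea in Python follows. It is not the script attached to this report: the function name `ligature_pattern` and the `ALTERNATIVES` table are hypothetical, and only the common Latin f-ligatures (U+FB00 to U+FB04) are covered.

```python
import re

# Hypothetical table: plain letters -> every spelling that should match.
# Longer sequences come first so "ffi" is tried before "ff" and "fi".
ALTERNATIVES = {
    "ffi": ["ffi", "\ufb00i", "f\ufb01", "\ufb03"],
    "ffl": ["ffl", "\ufb00l", "f\ufb02", "\ufb04"],
    "ff": ["ff", "\ufb00"],
    "fi": ["fi", "\ufb01"],
    "fl": ["fl", "\ufb02"],
}


def ligature_pattern(query):
    """Build a case-insensitive regex matching query with or without ligatures."""
    parts = []
    i = 0
    while i < len(query):
        for letters, spellings in ALTERNATIVES.items():
            if query[i:i + len(letters)].lower() == letters:
                group = "|".join(re.escape(s) for s in spellings)
                parts.append("(?:" + group + ")")
                i += len(letters)
                break
        else:
            # No ligature can start here: match the character literally.
            parts.append(re.escape(query[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


pattern = ligature_pattern("config")
print(bool(pattern.search("Con\ufb01gurations")))
```

A query that begins in the middle of a ligature (for example 'iltrate') would still not match with this approach, because the pattern never splits a ligature glyph in the document text.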

flying sheep (flying-sheep) wrote :

i have written a python function correcting the bug. an okular developer has to port it to c++ (or whatever okular is written in) and call the method ligre on the search string before performing a regular expression search with it.

included in the script is a test string which is transformed perfectly with the method.

Changed in okular (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Low
affects: okular (Ubuntu) → kdegraphics (Ubuntu)
In , Pino Toscano (pinotree) wrote :

*** Bug 213086 has been marked as a duplicate of this bug. ***

Patrick Wigmore (patrick-wigmore) wrote :

Discovered bug while searching a PDF that had been output by LyX. This could be reasonably frustrating in very long documents. I'm fairly sure this bug was not present in kpdf on KDE 3, as I remember being able to search for words with ligatures in them when I first discovered ligatures and wanted to zoom in on a nice, PDF rendering of one. I will check to see if the bug is present in kpdf.

Not knowing how these things work but having previously given it a passing thought, I had always assumed that the text in the document must be converted internally by the viewing application into a plain text version without ligatures, and that this version is what gets used for copying to the clipboard or searching within the document.
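
One way to obtain such a plain text version, sketched in Python under the assumption that the extracted text contains Unicode ligature code points such as U+FB01, is compatibility normalization (NFKC). The helper name `to_plain_text` is hypothetical and is not taken from KPDF or Okular.

```python
import unicodedata


def to_plain_text(text):
    """Replace compatibility characters, including f-ligatures, by plain letters."""
    # NFKC applies compatibility decomposition (U+FB01 -> "f" + "i")
    # and then recomposes canonical sequences such as accented letters.
    return unicodedata.normalize("NFKC", text)


print(to_plain_text("in\ufb01nity"))
print(to_plain_text("Ka\ufb00ee"))
```

NFKC also rewrites other compatibility characters (for example a superscript two becomes a plain "2"), which may or may not be wanted for copied text.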

Patrick Wigmore (patrick-wigmore) wrote :

I can confirm that this bug is not present in KPDF 0.5.10. Perhaps the solution used there can be reintroduced.

In , skorasaurus (skoraw) wrote :

A user [flying sheep] on launchpad has written a patch to fix this, in python, which can be found at https://bugs.launchpad.net/okular/+bug/411538/comments/4

In , skorasaurus (skoraw) wrote :

Created attachment 40447
patch to search for ligatures [written in python]

written by flying sheep [launchpad], https://bugs.launchpad.net/okular/+bug/411538/comments/4

In , Albert Astals Cid (aacid) wrote :

Just for the record, if anyone thinks that patch is useful, it is not.

Also, for the record, Adobe Reader 9.3 is not able to find the word "configurations" in the document from comment #1.

In , Gelefisk (gelefisk) wrote :

I too can confirm this bug for okular 0.9.5, kubuntu 9.10 and kde 4.3.5. Also, the copy function should separate ligatures, like Evince does.

In , Psychonaut (psychonaut) wrote :

Confirming this bug still exists in KDE 4.4.1.

Also, this was previously reported for kpdf as Bug 103621, so more information can be found there.

In , Albert Astals Cid (aacid) wrote :

*** Bug 230274 has been marked as a duplicate of this bug. ***

Krister Swenson (thekswenson) wrote :

This bug manifests itself with cut and paste as well...
  if you cut "infinity" and paste it you don't get the expected sequence of characters.

Krister Swenson (thekswenson) wrote :

Try to cut and paste "infinity"...
   the paste is not what you expect.

Krister Swenson (thekswenson) wrote :

Sorry about the multiple posts...
   this was due to a BUG! in launchpad.

Rolando Garza (rolandog) wrote :

I copy-pasted this word using skorasaurus' PDF and Krister's suggestion: Conﬁgurations

The f and the i are pasted as one single character, so there is a problem with the handling of ligatures as flying sheep suggested.

In , Glad-deschrijver (glad-deschrijver) wrote :

*** This bug has been confirmed by popular vote. ***

Changed in okular:
status: New → Confirmed
In , flying sheep (flying-sheep) wrote :

reply to comment 7:
i’m sorry my “patch” isn’t useful, but at least it would be a way to quickly circumvent the problem until a better solution is found. and “program x does it equally wrong” is no excuse if we can do it better.

In , Pino Toscano (pinotree) wrote :

*** Bug 258515 has been marked as a duplicate of this bug. ***

Changed in okular:
importance: Unknown → Medium
In , Albert Astals Cid (aacid) wrote :

SVN commit 1225994 by aacid:

"Normalize" strings so searching for ligatures like "fi" works
Patch by Christopher Reichert
BUGS: 181828

 M +11 -3 textpage.cpp

WebSVN link: http://websvn.kde.org/?view=rev&revision=1225994
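
The commit message does not spell out how the strings are normalized. One plausible reading is Unicode compatibility normalization of both the page text and the query before comparing them. The Python sketch below illustrates that reading only; it is not Okular's C++ code, and the name `find_normalized` is hypothetical. It also maps each match back to offsets in the original text, which a viewer would need in order to highlight the result.

```python
import unicodedata
from typing import List, Tuple


def find_normalized(page_text: str, query: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans in the ORIGINAL page_text that match query."""
    norm_chars = []  # characters of the normalized, case-folded text
    origin = []      # origin[k] = index in page_text that produced norm_chars[k]
    for index, ch in enumerate(page_text):
        # Normalizing one character at a time keeps the index mapping simple.
        for out in unicodedata.normalize("NFKC", ch).casefold():
            norm_chars.append(out)
            origin.append(index)
    haystack = "".join(norm_chars)
    needle = unicodedata.normalize("NFKC", query).casefold()

    spans = []
    if not needle:
        return spans
    pos = haystack.find(needle)
    while pos != -1:
        start = origin[pos]
        end = origin[pos + len(needle) - 1] + 1
        spans.append((start, end))
        pos = haystack.find(needle, pos + 1)
    return spans


text = "in\ufb01ltrate"
for start, end in find_normalized(text, "iltrate"):
    print(start, end, text[start:end])
```

Because the ligature is expanded before matching, a query that starts inside it (such as 'iltrate') is found, and the reported span covers the whole ligature glyph. Normalizing character by character does not recombine a base letter with a following combining mark, so a fuller implementation would need to handle that case.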

Changed in okular:
status: Confirmed → Fix Released
In , Foo0m (foo0m) wrote :

Created attachment 60622
Try to search for 'Kaffee' - the ff ligature is the problem

Maarten Bezemer (veger) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. We are sorry that we do not always have the capacity to look at all reported bugs in a timely manner. There have been many changes in Ubuntu since the time you reported the bug, and your problem may have been fixed by some of the updates.

At least, I cannot reproduce the issue using the example PDF from comment 2. The first 'config' found is at the bottom of page 3.

It would help us a lot if you could test it on a currently supported Ubuntu version. When you test it and it is still an issue, kindly upload the updated logs by running apport-collect 411538 and any other logs that are relevant for this particular issue.

Changed in kdegraphics (Ubuntu):
status: Triaged → Incomplete
Krister Swenson (thekswenson) wrote :

This bug seems to have been fixed in okular 0.13.3 on ubuntu 11.10.

Changed in kdegraphics (Ubuntu):
status: Incomplete → Fix Released
Changed in kdegraphics:
importance: Unknown → Medium
status: Unknown → Fix Released