Highlighting (review tool) crosses columns in double-column PDFs

Bug #199468 reported by Everthon Valadão
This bug affects 7 people
Affects: KDE Graphics | Status: Fix Released | Importance: Wishlist
Affects: okular (Ubuntu) | Status: Incomplete | Importance: Undecided | Assigned to: Unassigned

Bug Description

Binary package hint: okular

When I try to highlight text that spans more than one line in a double-column PDF file, the corresponding line in the previous column is also highlighted.

This makes it very difficult to review PDFs, because I have to highlight line by line, clicking the highlight button again for every single line (this should be "fixed" too).

==================
package version: 0.5.82-0ubuntu3
ubuntu release: 7.10

Everthon Valadão (valadao) wrote :
In , Alvaro-aguilera (alvaro-aguilera) wrote :

Version: 0.6.3 (using 4.0.3 (KDE 4.0.3) "release 19.2", compiled sources)
Compiler: gcc
OS: Linux (x86_64) release 2.6.22.17-0.1-default

I find Okular's yellow highlighter very useful. Something missing, however, is the ability to recognize a layout with columns. Almost every PDF I read has such a format, and Okular forces me to highlight line by line instead of allowing me to mark the whole paragraph.

In , Pino Toscano (pinotree) wrote :

Give a better title, as it's a general "problem".

In , Kde2eran (kde2eran) wrote :

In the general case this seems to require a layout analysis, such as OCRopus.

In , Pino Toscano (pinotree) wrote :

*** Bug 162957 has been marked as a duplicate of this bug. ***

In , Bui Arantsson (bui-foss) wrote :

This feature indeed needs to be fixed if Okular's highlighter tool is to become useful for scientific work, seeing as almost all journals use text layouts with columns. However, I am not a programmer, and thus have no idea how it should be implemented, or whether analysis of PDF layouts is easy or not. If not, another possibility might be to allow the user to subdivide documents himself, i.e., to allow the user to "draw" borders to which the highlighter will limit itself. Almost like setting margins in a word processor, although of course purely for internal use.
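
The "draw borders" idea could look roughly like the sketch below. It is only an illustration under assumptions: `Rect` and `clip_highlight` are names invented here, not anything from Okular, and coordinates are taken to grow rightwards and downwards from the top-left corner of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box; hypothetical type, not an Okular class."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def clip_highlight(word_boxes, region):
    """Keep only the word boxes that lie entirely inside the user-drawn region."""
    return [box for box in word_boxes if region.contains(box)]
```

With a region drawn around the left column, a highlight that sweeps across the whole line would then only keep the boxes from that column.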

In , Jospoortvliet (jospoortvliet) wrote :

I just bumped into this when trying to do a screencast about Okular (for an upcoming KDE promo site). I must say it is rather unfortunate, and I don't think I will demo this feature as it is - it will only make people feel betrayed if they find out it doesn't work as it should. This is no stab at you guys developing this app - Okular is way cool. It's just that this issue somehow has to be solved. I have no idea if this even works properly in, for example, Adobe Acrobat Reader or any other app - I suspect this is pretty hard to do, given the little I know about layout stuff in PDFs. Pity...

Anyway. I hope this can be solved someday - somehow. Meanwhile, keep up the good work. Okular is really nice, but still has many small issues...

I do wanna say the selection mechanism you guys made (right mouseclick - select an area - copy text/picture) works SOO GOOD :D

Changed in okular:
status: Unknown → New
In , Pino Toscano (pinotree) wrote :

*** Bug 170102 has been marked as a duplicate of this bug. ***

Jonathan Thomas (echidnaman) wrote :

Okular lives in kdegraphics from Intrepid on.

Changed in okular:
status: New → Triaged
In , Pino Toscano (pinotree) wrote :

*** Bug 175377 has been marked as a duplicate of this bug. ***

In , James-rivett-carnac (james-rivett-carnac) wrote :

*** This bug has been confirmed by popular vote. ***

Changed in okular:
status: New → Confirmed
In , Michal Witkowski (neuro-o2) wrote :

The same can be said about the text select tool. It highlights text from both columns.

Is there any hope that this might get resolved any time soon?

In , Albert Astals Cid (aacid) wrote :

Pattern recognition of what is a column and what is not based on coordinates of each character is something your brain can do very easily but programming an algorithm that does that is not trivial by far, so I guess the answer is no.

In , Michal Witkowski (neuro-o2) wrote :

Well, the thing is that both Adobe Reader and Foxit Reader are able to detect columns just fine (text selection, text highlight) so it's possible for sure. Maybe Okular's PDF backend is limited and doesn't provide text-layout information and that's what makes it hard. But saying that it's a hard problem solvable to a computer is just not true.

In , Michal Witkowski (neuro-o2) wrote :

Just as I thought, it's a poppler bug. A similar problem is seen in evince (gnome pdf viewer)

https://bugs.launchpad.net/poppler/+bug/33288

In , Robert Knight (robertknight) wrote :

> Pattern recognition of what is a column and what is not based on
> coordinates of each character is something your brain can do very easily
> but programming an algorithm that does that is not trivial by far,
> so i guess the answer is no

It is certainly possible but not trivial - Ocropus provides a free software C++ implementation of algorithms to do this if you're interested. The basic approach is to try to the largest columns of whitespace in the page and divide the text into columns based on that.

In , Robert Knight (robertknight) wrote :

> The basic approach is to try to the largest columns of whitespace
> in the page and divide the text into columns based on that.

Sorry, that should read:

The basic approach seems to be finding the largest columns of whitespace in the page and dividing the text into columns based on that.
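
A minimal sketch of that whitespace-gap idea follows. It is not OCRopus's or Okular's code: the `Word` type, the function names and the `min_gap` threshold are assumptions made up for the illustration, and a full-width heading or figure spanning both columns would defeat it as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its horizontal extent; hypothetical type for this sketch."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position, measured from the top of the page


def find_column_gaps(words, min_gap=20.0):
    """Return (start, end) x-ranges covered by no word and at least min_gap wide."""
    gaps = []
    covered_to = None  # right-most x covered by the words seen so far
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if covered_to is not None and x0 - covered_to >= min_gap:
            gaps.append((covered_to, x0))
        covered_to = x1 if covered_to is None else max(covered_to, x1)
    return gaps


def split_into_columns(words, min_gap=20.0):
    """Group words into columns (left to right), each sorted top to bottom."""
    cuts = [(start + end) / 2 for start, end in find_column_gaps(words, min_gap)]
    columns = [[] for _ in range(len(cuts) + 1)]
    for w in words:
        centre = (w.x0 + w.x1) / 2
        # The column index is the number of cut lines to the left of the word.
        columns[sum(1 for c in cuts if centre > c)].append(w)
    for column in columns:
        column.sort(key=lambda w: (w.y, w.x0))
    return columns
```

A highlight or selection could then be restricted to the column in which the drag started, instead of taking every word whose vertical position falls in the dragged range.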

In , Albert Astals Cid (aacid) wrote :

<quote>
Maybe Okular's PDF backend is limited and doesn't provide text-layout
information and that's what makes it hard.
</quote>

I like when people speak as if they knew how PDF works. If you know that PDF provides text-layout information, please go to the poppler project (which by coincidence I am the maintainer of) and send a patch.

<quote>
But saying that it's a hard problem solvable to a computer is just not true.
</quote>

I also like when people decide that something is not hard because someone else is able to do it. What about painting the Mona Lisa? It should not be that difficult, someone did it 500 years ago! How dare you say that painting it is something difficult!

In , Isaac Puch Rojo (puchrojo) wrote :

(In reply to comment #15)

OK, the discussion could be more diplomatic, and the comments are not very constructive. But if you want to work with scientific papers, this bug is very important.

I only want to ask whether the Okular team wants to work on this problem or not. I would respect it if they don't want to.

Thanks for the great Program!

In , Pino Toscano (pinotree) wrote :

*** Bug 194120 has been marked as a duplicate of this bug. ***

In , acrocephalus (dani-valverde) wrote :

What about adding a text highlight tool (just as the selection tool, but to highlight instead of selecting)? It may be an easier solution while looking for a fancier way ...

In , Isaac Puch Rojo (puchrojo) wrote :

The solution from Dani Valverde is not perfect, but it would work.
I give it my virtual vote ;-)

Bye, Isaac

In , Chosunsk (chosunsk) wrote :
In , Albert Astals Cid (aacid) wrote :

Three years don't make it easier to solve, we still welcome people with knowledge on how to fix it.

Krister Swenson (thekswenson) wrote :

I agree that this is very annoying.

In , Pino Toscano (pinotree) wrote :

*** Bug 225267 has been marked as a duplicate of this bug. ***

In , Ekin-0 (ekin-0) wrote :

I agree that three years do not make it easier to solve but it definitely makes it a must feature that needs to be implemented. By the way, what do developers do between the releases apart from fixing bugs?

In , Michal Witkowski (neuro-o2) wrote :

From: https://bugs.freedesktop.org/show_bug.cgi?id=3188

"Comment #45 From Praveen Thirukonda 2009-12-27 00:42:00 PST -------

it seems this bug now has a working patch and yet there has not been any
activity for the past few weeks.
It would really be great if this is committed soon as this is a really annoying
bug for many. "

It seems that there's hope :)

In , Albert Astals Cid (aacid) wrote :

#23: What do we do? Well, personally i sleep 7 hours a day, work 8 hours a day, spend 2 eating and preparing things to eat, 1 travelling to and from work, 1 going to shop things to eat and the rest of the 3 hours i try to code things for KDE, but then some user demands to know what i do with my life and that 3 hours become 2.5 hours. You should be happy i have no friends, otherwise that 2.5 hours would be a 0

In , Albert Astals Cid (aacid) wrote :

#24 This is not going to help okular at all since we do not use poppler text algorithms since we support text selection for more formats than just PDF

In , Ekin-0 (ekin-0) wrote :

I did not mean to be rude when making above statement. I really appreciate KDE and its applications in terms of the approach they have taken, i.e. abundant configurability and capability of the application. If only Okular had this feature.

claudio@ubuntu (claudio.ubuntu) wrote :

This bug also applies to text selection (Tools - Text Selection Tool).

In , Luigi Toscano (ltosky) wrote :

*** Bug 235531 has been marked as a duplicate of this bug. ***

In , yuval aviel (yuval-aviel) wrote :

(In reply to comment #26)
> #24 This is not going to help okular at all since we do not use poppler text
> algorithms since we support text selection for more formats than just PDF

I guess that 90% of Okular users that also use annotation, use it for reading PDF files.

Maybe solving this issue with Poppler solution is not such a bad way to go.

In , Chosunsk (chosunsk) wrote :

(In reply to comment #29)
> (In reply to comment #26)
> > #24 This is not going to help okular at all since we do not use poppler text
> > algorithms since we support text selection for more formats than just PDF
>
> I guess that 90% of Okular users that also use annotation, use it for reading
> PDF files.
>
> Maybe solving this issue with Poppler solution is not such a bad way to go.

Indeed, evince, which uses poppler algorithms, supports column selection.

In , Albert Astals Cid (aacid) wrote :

Indeed, evince does not support text selection in the horde of document formats that Okular does, our selection might be better or worse but it is [mostly] consistent among document formats.

But you don't really care, you like bashing developers because you think that will make them realize that you are right.

In , Peter Hedlund (peter-peterandlinda) wrote :

(In reply to comment #31)
> Indeed, evince does not support text selection in the horde of document formats
> that Okular does, our selection might be better or worse but it is [mostly]
> consistent among document formats.
>
> But you don't really care, you like bashing developers because you think that
> will make them realize that you are right.

Albert, relax. But still, for many users Okular = pdf and for many users pdf = two-column scientific papers. Okular uses the poppler backend for pdf and if the backend now supports column selection, so should Okular.

I am sure there are already some if... then to handle all the formats you say Okular supports. Please consider making it a priority to add use of the poppler backend if the format where selection is happening is pdf.

Thanks,
Peter

In , Albert Astals Cid (aacid) wrote :

Let me tell you a secret: i don't need advanced text selection in okular, so obviously it's not my priority.

Now let me tell you another secret: Okular is free software! So all you that need advanced text selection are very welcome to improve okular text selection algorithm send a patch and then not only pdf text selection would be better but all the other formats too! For free!

In , Peter Hedlund (peter-peterandlinda) wrote :

(In reply to comment #33)
> Let me tell you a secret: i don't need advanced text selection in okular, so
> obviously it's not my priority.
>
> Now let me tell you another secret: Okular is free software! So all you that
> need advanced text selection are very welcome to improve okular text selection
> algorithm send a patch and then not only pdf text selection would be better but
> all the other formats too! For free!

Here is my secret: I have never needed anything in the program I maintain (KWordQuiz), but I think it is fun when people show interest and tell me about features they would like. If they are reasonable I see it as a challenge to my limited self-taught programming skills to try to implement them. That's why I use free software.

I have actually looked into PDF development, as I had some interest in page manipulation features like adding and removing pages, so I know it is no walk in the park. Still, it seems someone has already done a significant part of the work in this (selection) case. Now is the time to step up to the final challenge, or is programming not fun anymore?

Well, back to Adobe Reader...

In , Kde2eran (kde2eran) wrote :

Does poppler guess the text layout using some generic heuristic algorithm, or use some explicit information on text ordering embedded in the PDF format? If it's the latter, then Okular ought to use that embedded information, via poppler, instead of discarding it and taking a wild guess instead.

In , Alvaro-aguilera (alvaro-aguilera) wrote :

I like the idea of supporting multiple file formats, but I guess that 99% of the people (myself included) use Okular exclusively as a PDF reader. It's a pity that the format independence gets in the way of implementing features that would actually be useful for the majority of its users. I'd bet that if someone revamped KPDF, people would make the switch from one day to the next.

In , Robert Knight (robertknight) wrote :

> Does poppler guess the text layout using some generic heuristic algorithm, or
> use some explicit information on text ordering embedded in the PDF format?

PDFs do not contain layout information about how text is structured into paragraphs and columns. As I understand it, what PDF provides is essentially a list of commands that say "draw string S at position P with font F".

I haven't looked into recent versions of Poppler but older versions had some fairly complex heuristic algorithms to try to piece together the layout given the input. These algorithms had some interesting flaws. If I remember correctly, due to numerical instability the order of paragraphs in the output text could differ significantly depending on the processor on which you ran the code.
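
To make the "draw string S at position P with font F" point concrete, here is a small sketch of why a viewer has to guess. The `DrawOp` type, the coordinates and the fixed `column_boundary` are assumptions invented for the example; this is not how Poppler or Okular represent text.

```python
from typing import NamedTuple


class DrawOp(NamedTuple):
    """One 'draw string at position with font' command; hypothetical type."""
    text: str
    x: float  # left edge of the string
    y: float  # distance from the top of the page
    font: str


def naive_reading_order(ops):
    """Sort strictly top to bottom, then left to right, ignoring columns."""
    return [op.text for op in sorted(ops, key=lambda op: (op.y, op.x))]


def column_aware_order(ops, column_boundary):
    """Read everything left of column_boundary first, then everything right of it."""
    left = [op for op in ops if op.x < column_boundary]
    right = [op for op in ops if op.x >= column_boundary]
    by_position = lambda op: (op.y, op.x)
    ordered = sorted(left, key=by_position) + sorted(right, key=by_position)
    return [op.text for op in ordered]


# The commands may appear in the file in any order.
page = [
    DrawOp("right line 1", 320.0, 100.0, "Times"),
    DrawOp("left line 2", 50.0, 115.0, "Times"),
    DrawOp("left line 1", 50.0, 100.0, "Times"),
    DrawOp("right line 2", 320.0, 115.0, "Times"),
]
print(naive_reading_order(page))
print(column_aware_order(page, column_boundary=300.0))
```

The naive order interleaves the two columns line by line, which matches the behaviour reported in this bug; the second function only works here because the boundary is handed to it, and finding that boundary is the hard, heuristic part.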

In , Uetsah (uetsah) wrote :

(In reply to comment #31)
> Indeed, evince does not support text selection in the horde of document formats
> that Okular does, our selection might be better or worse but it is [mostly]
> consistent among document formats.

Supporting multiple document formats consistently is great, but won't it be possible to still allow certain features to only be supported by some document formats and not others? Or to be implemented differently for each backend, where it makes sense?

The text selection user interface could still stay the same for every format, but in the background it could use whatever algorithms each respective backend provides for reading or guessing text layout structure.

So in case of PDF documents, the backend would use Poppler's heuristic algorithms. In case of OpenDocument documents, the backend would use the structural information already available in the document file. And so on...

Of course there could also be a generic algorithm that guesses the text structure independently of the document format, but as I understand it, that would be much more work...
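
Roughly what I have in mind, as a sketch only: every class and function name below is made up for the illustration and none of it is Okular's real generator API.

```python
from typing import List, Optional


class Backend:
    """Base class: a backend with no format-specific layout knowledge."""

    def reading_order(self, strings: List[str]) -> Optional[List[str]]:
        return None  # None means "fall back to the generic algorithm"


class StructuredBackend(Backend):
    """Stand-in for a format that already stores its text in document order."""

    def reading_order(self, strings: List[str]) -> Optional[List[str]]:
        return list(strings)


def generic_reading_order(strings: List[str]) -> List[str]:
    """Placeholder for a format-independent heuristic; here it merely sorts."""
    return sorted(strings)


def select_text(backend: Backend, strings: List[str]) -> List[str]:
    """Single entry point for every format; the backend decides how to order text."""
    order = backend.reading_order(strings)
    return order if order is not None else generic_reading_order(strings)
```

The user interface would only ever call `select_text`, so the behaviour stays the same across formats while each backend is free to do better where it can.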

Btw, I personally think that even with this feature missing, Okular is still best PDF viewer out there, so thanks for the great work and for giving it away for free... :-)
If this feature is not on your priority list, that's of course totally fine, but please maybe still consider it for the future, just in case one day you're bored and don't know anything else to implement... ;-)

In , yves hennequin (yves-hennequin) wrote :

hi
sorry not sure this is the right place for me to comment..
As a user of Okular I would also benefit from the double column recognition for annotations, etc...
My workaround is to cover the text with an inline comment without text and lower the opacity, or to put an ellipse and change it into a rectangle.
With this in mind I would then also be happy if I could set the parameters of the annotations (opacity, colors etc...) once and for all, so that when I place a new one it already has the look I want.
I have no idea how difficult that is to do...
cheers
y.

In , Albert Astals Cid (aacid) wrote :

yves, you are asking for something (I would be happy if I could set the parameters of the annotations) that has nothing to do with this bug. Please open a separate issue for it.

Changed in okular:
importance: Unknown → Wishlist
In , Pino Toscano (pinotree) wrote :

*** Bug 268334 has been marked as a duplicate of this bug. ***

In , Albert Astals Cid (aacid) wrote :

*** Bug 276580 has been marked as a duplicate of this bug. ***

In , Albert Astals Cid (aacid) wrote :

This has been implemented in this year's GSoC and will be available in Okular as of KDE 4.8.

You can find more info at http://tsdgeos.blogspot.com/2011/08/okular-selection-gsoc-in-depth-analysis.html

You are all encouraged to give a try to the current git master code (if you know how to compile) and give back constructive feedback.

Changed in okular-old:
status: Confirmed → Fix Released
Maarten Bezemer (veger) wrote :

Since Oneiric, okular has its own package.

affects: kdegraphics (Ubuntu) → okular (Ubuntu)
Maarten Bezemer (veger) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. We are sorry that we do not always have the capacity to look at all reported bugs in a timely manner.
According to the upstream report, this issue should have been fixed since KDE 4.8, so Precise should include this fix as well.
It would help us a lot if you could test it on a currently supported Ubuntu version. When you test it and it is still an issue, kindly upload the updated logs by running apport-collect 199468 and any other logs that are relevant for this particular issue.

Changed in okular (Ubuntu):
status: Triaged → Incomplete
In , Champignoom (champignoom) wrote :

I'm reading a two-column pdf (https://dl.acm.org/doi/pdf/10.1145/3477113.3487272), for which the selection still doesn't work properly.

Okular Version: 23.08.5
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.115.0
Qt Version: 5.15.12

Is there any chance to further improve the column recognition algorithm?

Changed in kdegraphics:
importance: Unknown → Wishlist
status: Unknown → Confirmed
Changed in okular-old:
status: Fix Released → Confirmed
In , Albert Astals Cid (aacid) wrote :

I am going to close this, please open a new bug.

This has been marked as fixed for 13 years and has more than 20 users who get notified when things change here. My guess is that they really don't want to be bothered about this particular PDF that fails, because for them it works; if it didn't, they would have reopened this bug at some point in the 13 years that it has been marked as fixed.

Changed in kdegraphics:
status: Confirmed → Fix Released
Changed in okular-old:
status: Confirmed → Fix Released