Goal: make PDFs generated by Tiki via mPDF look nicer, by adding support for Widows and Orphans, and if possible Bottom Balancing, as explained at: https://guide.pressbooks.com/chapter/widows-orphans-and-bottom-balancing/
Tiki uses the widows and orphans CSS properties, but mPDF doesn't yet support them. So once support is added in mPDF 7.x, there should be such a pref in Tiki, defaulted to on and overridable per page via PluginPDF.
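For reference, this is roughly what the CSS in question looks like (the values here are illustrative, not necessarily Tiki's actual defaults):

```css
/* Require at least 2 lines of a paragraph at the bottom of a page (orphans)
   and at least 2 lines at the top of the next page (widows). */
p {
  orphans: 2;
  widows: 2;
}

/* Keep headings together with the paragraph that follows them. */
h1, h2, h3 {
  page-break-after: avoid;
}
```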
Since the mPDF project doesn't (yet) have a wiki, we'll coordinate here.
- PDFlib cookbook has some PHP code, but no license: https://www.pdflib.com/pdflib-cookbook/text-output/widows-and-orphans/php-widows-and-orphans/
- Prince is a proprietary solution that Pressbooks uses instead of mPDF: https://www.princexml.com/doc/paged/#widows-and-orphans
- wkhtmltopdf added support: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2457
- WeasyPrint added support: https://github.com/Kozea/WeasyPrint/commit/1d6c94828e3a9bc79dc0a9d4e2377619f0de78ac
- This thread indicates that LibreOffice and Word support widow and orphan control by default: https://bugs.documentfoundation.org/show_bug.cgi?id=89714#c9
- Wikipublisher: widow and orphan control and headings always kept with their following paragraphs
- Jonny added a non-breaking space between the last two words of headers in wiki pages: https://sourceforge.net/p/tikiwiki/code/HEAD/tree/trunk/lib/parser/parserlib.php#l3034
Needs to be added: page-break-after: avoid doesn't seem to be respected.
Perhaps try slightly adjusting the spacing between letters?
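The non-breaking-space trick mentioned above (Jonny's change is in PHP in parserlib.php) can be sketched in Python; glue_last_words is a hypothetical name for illustration, not Tiki's actual function:

```python
def glue_last_words(heading: str) -> str:
    """Join the last two words of a heading with a non-breaking space
    (U+00A0) so the final word can never wrap alone onto the next line."""
    parts = heading.rsplit(" ", 1)
    if len(parts) < 2:
        return heading  # single-word heading: nothing to glue
    return parts[0] + "\u00a0" + parts[1]
```

For example, glue_last_words("Widows and Orphans in mPDF") joins "in" and "mPDF" with a non-breaking space, so the heading can never end with "mPDF" stranded on its own line.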
- Scope the project
- Present plan to Matěj (lead dev of mPDF)
- Once approved, get to work!
We'll attempt to solve this via Rubix ML.
Asked in the Rubix ML chat room:
Marc Laporte, [16.06.20 17:52]: I am wondering if Rubix ML would be a good tool (now or in the future) to help with widows and orphans in desktop publishing.
Marc Laporte, [16.06.20 17:52]: Context: https://en.wikipedia.org/wiki/Widows_and_orphans
Marc Laporte, [16.06.20 17:57]: It is actually a lot harder to do than some might think
Rich Davis, [16.06.20 17:58]: I don't think that would require the use of ML. What is the file format?
Marc Laporte, [16.06.20 18:09]: We have wiki pages (wiki syntax which renders to HTML) which then go to mPDF: https://dev.tiki.org/Widows-and-Orphans-in-mPDF
Marc Laporte, [16.06.20 18:10]: At least two developers worked really hard to add widows and orphans to mPDF: https://github.com/mpdf/mpdf/issues/48
Marc Laporte, [16.06.20 18:13]: One of them reportedly said it was an "intractable problem"
Rich Davis, [16.06.20 18:19]: It would become even more difficult with ML. Are you using something like pdf.js to convert from HTML to PDF?
Marc Laporte, [16.06.20 18:20]: We are using mPDF, which is in PHP: https://packagist.org/packages/mpdf/mpdf
Rich Davis, [16.06.20 18:20]: There's another old library that does that and I can't think of the name. Converting HTML to canvas objects is ludicrously annoying
Rich Davis, [16.06.20 18:21]: Is there any way to detect when a new page has begun and if the current paragraph is spilling over?
Marc Laporte, [16.06.20 18:22]: "This package identifies all widows and orphans in a document to help a user to get rid of them. The act of resolving still needs to be done manually: by rewriting text, running some paragraph long or short, or explicitly breaking in some strategic place." Source: https://ctan.org/pkg/widows-and-orphans?lang=en
Rich Davis, [16.06.20 18:23]: PDF page sizes are fixed length, I believe, so testing if a paragraph will spill over given the contextual position in the current page shouldn't be too much of a task
Rich Davis, [16.06.20 18:23]: Thinking in ML terms, those are the features I would likely need to create a model for it
Rich Davis, [16.06.20 18:24]: But if I have those then I can test the conditions for each paragraph and proactively edit them to make it work.
Marc Laporte, [16.06.20 18:26]: Imagine a 200-page document, and there are several dozen widows or orphans. When you fix one (ex.: by tweaking font spacing on a portion of the text), you can perhaps resolve or create another issue.
Rich Davis, [16.06.20 18:26]: But when we do it proactively, it updates the relative position of the proceeding HTML chunk and the PDF file
Rich Davis, [16.06.20 18:27]: So we're compiling all of this HTML that's being transcribed into PDF, page by page.
Marc Laporte, [16.06.20 18:28]: And it also compiles a table of contents with page numbers
Rich Davis, [16.06.20 18:28]: If we just have a list of p tags with paragraphs in between, then I first get the position where I'm at in the PDF doc. Let's say we've filled half of it; since I know the max size of the PDF, and I know the character count of the text in the p tag, then I can test whether we need to insert line breaks onto the page until we get a flush insert.
Marc Laporte, [16.06.20 18:29]: It has a huge feature set. The mpdfmanual.pdf (8 MB download) is over 600 pages! https://github.com/IanNBack/mpdf/raw/master/mpdfmanual.pdf
Rich Davis, [16.06.20 18:29]: That shouldn't matter
Rich Davis, [16.06.20 18:29]: It may take longer but has no impact on the efficacy of the process
Rich Davis, [16.06.20 18:30]: Beats doing it manually 😂
Rich Davis, [16.06.20 18:31]: The only logical hurdle would be if there is a paragraph that extends longer than a full page. Even in that situation the process would hold.
Marc Laporte, [16.06.20 18:31]: hahahahahah, yes
Rich Davis, [16.06.20 18:32]: All you need to set it up is the max character count per PDF page.
Marc Laporte, [16.06.20 18:34]: Well, there are images too!
Marc Laporte, [16.06.20 18:34]: And tables, and columns
Rich Davis, [16.06.20 18:36]: The tables and columns would be handled in a similar way. I'm sure there is a way to get the character-equivalent size for each of them.
Rich Davis, [16.06.20 18:38]: In fact, there has to be, since PDFs do it.
Marc Laporte, [16.06.20 18:40]: https://texfaq.org/FAQ-widows "(La)TeX takes some precautions to avoid them, but completely automatic prevention is often impossible. If you are typesetting your own text, consider whether you can bring yourself to change the wording slightly so that the page break will fall differently."
Marc Laporte, [16.06.20 18:41]: https://en.wikipedia.org/wiki/LaTeX has been unable to fully solve this since 1984.
Marc Laporte, [16.06.20 18:42]: Which is why I was wondering if AI could help 😊
Rich Davis, [16.06.20 18:44]: I don't see how it would without having the information that the solution requires.
Rich Davis, [16.06.20 18:45]: It's maybe worth trying, as it could be trained unsupervised using PDFs with the corresponding HTML. I just think the solution could be dealt with in a simpler way.
Rich Davis, [16.06.20 18:46]: One alternative is the use of both characters and pixel dimensions
Marc Laporte, [16.06.20 18:46]: ok
Rich Davis, [16.06.20 18:47]: We know the pixel dimensions of the tables and images, and we know the character count of the paragraphs
Rich Davis, [16.06.20 18:47]: So then we'd equate character count to pixel dimensions
Rich Davis, [16.06.20 18:47]: Add up the dimensions of the text block with the image or table and see if there is space.
Andrew DalPino, [16.06.20 19:10]: Sounds like a pretty nuanced problem ... I'll give it some thought and get back to you guys
Andrew DalPino, [16.06.20 19:19]: [In reply to Marc Laporte] I trust that if LaTeX couldn't do it by now, then a traditional rule-based approach is infeasible ... Sequence learning is all the rage these days; it sounds like what we'd need is some form of semi/self-supervised sequence classification ... This is a bit different from the Transformer architecture such as with GPT-1/2/3, which outputs another sequence. In this case we'd want to output a class label such as 'full sentence', 'widow', or 'orphan'. Having said that, we do not support sequence learning yet in Rubix. It's on our roadmap, though, once we solve the problem of efficient storage and computing on matrices (this is the focus of Henrique's CArray project).
Andrew DalPino, [16.06.20 19:22]: An LSTM (long short-term memory) network might work, but you'd need labeled data for the classification task
Rich Davis, [16.06.20 19:23]: That would require CArray
Andrew DalPino, [16.06.20 19:24]: Right, that's too much computation for PHP to handle natively
Rich Davis, [16.06.20 19:24]: LSTM is too expensive
Rich Davis, [16.06.20 19:24]: Same problem stands: the network would basically just learn the math required, which is pretty simple if we have the feature data
Rich Davis, [16.06.20 19:25]: Recurrent LSTMs also aren't the most accurate for text. I wonder if they have been used in a regression context
Andrew DalPino, [16.06.20 19:26]: The LSTM would be trained to identify the language concepts of sentences, widows, and orphans from word sequences
Andrew DalPino, [16.06.20 19:26]: And someone would have to look at word sequences and label them in order to get the training data (unless someone has already done this)
Rich Davis, [16.06.20 19:27]: Isn't the whole goal though just to prevent a paragraph from spilling over onto a new PDF page?
Andrew DalPino, [16.06.20 19:28]: I suppose if you can detect an orphan or widow, then you can apply a rule-based method to either move the sentence to the next page or do something to squeeze it into the current page
Rich Davis, [16.06.20 19:29]: But you don't have to detect
Andrew DalPino, [16.06.20 19:29]: Right, that is where the classifier comes into play
Rich Davis, [16.06.20 19:29]: You can know firsthand when a paragraph will spill over
Rich Davis, [16.06.20 19:30]: Smh, this chat when someone tosses us a lax but interesting problem 🤦‍♂️
Andrew DalPino, [16.06.20 19:31]: Now that I think of it, there may be a sort of 'self-supervised' approach we could use ... for example ... automatically labeling a sample based on the presence of punctuation??? Just thinking out loud 🧐
Rich Davis, [16.06.20 19:31]: I feel like I'm missing something
Rich Davis, [16.06.20 19:32]: So let's say the max character count of a PDF page is 2,000 characters
Rich Davis, [16.06.20 19:32]: And that equates to a window size of like 1,960 x 1080 pixels
Rich Davis, [16.06.20 19:33]: Why can't we just iterate through each paragraph, count the characters, equate that to dimensions, and test if it fits the current context?
Rich Davis, [16.06.20 19:33]: And ending the page transcription, opening a new one if it doesn't
Rich Davis, [16.06.20 19:33]: That's what I don't understand
Andrew DalPino, [16.06.20 19:34]: Let's have @marclaporte and team (@bush243) clarify the problem when they have the chance
Rich Davis, [16.06.20 19:35]: Because we're tossing out recurrent LSTMs. Even with Python, training an LSTM with 600 pages of text would take about 6-12 hours depending on accuracy requirements, and LSTMs require a shit ton of epochs
Marc Laporte, [16.06.20 21:46]: Ok, we'll prepare something. Thanks!
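The character-count heuristic Rich describes in the chat can be sketched as a small fit check. All the numbers below (page capacity, characters per line) are made-up illustrations, and the function name is hypothetical; a real implementation would need actual font metrics from mPDF rather than character counts:

```python
def fits_without_widow(used_chars: int, para_len: int,
                       page_capacity: int = 2000,
                       line_chars: int = 80) -> bool:
    """Decide whether a paragraph can start on the current page without
    creating a widow or orphan, using character counts as a crude proxy
    for rendered height.

    - If the whole paragraph fits in the remaining space, it's fine.
    - If it must split, both the part left on this page and the part
      carried over must be at least ~2 lines (the classic rule).
    - Otherwise the caller should push the paragraph to the next page.
    """
    free = page_capacity - used_chars
    if para_len <= free:
        return True  # whole paragraph fits on the current page

    min_block = 2 * line_chars       # ~2 lines minimum on each side
    head = free                      # characters left on this page
    tail = para_len - free           # characters spilling to next page
    return head >= min_block and tail >= min_block
```

For example, with the defaults, a 120-character paragraph starting on a page that already holds 1,900 characters would be pushed to the next page (only 100 characters of space remain, less than two lines), while an 800-character paragraph starting at 1,500 characters may split, since both halves exceed two lines.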