Goal: make PDFs generated by Tiki via mPDF look nicer, by adding support for Widows and Orphans, and if possible Bottom Balancing, as explained at: https://guide.pressbooks.com/chapter/widows-orphans-and-bottom-balancing/
Tiki uses the widows and orphans CSS properties, but mPDF doesn't yet support them. So once support is added in mPDF 7.x, there should be such a pref in Tiki, defaulted to on and overridable per page via PluginPDF.
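For reference, this is roughly what the CSS in question looks like (the values here are illustrative, not necessarily Tiki's actual defaults):

```css
/* Require at least 2 lines of a paragraph at the bottom of a page (orphans)
   and at least 2 lines at the top of the next page (widows). */
p {
  orphans: 2;
  widows: 2;
}

/* Keep headings together with the paragraph that follows them. */
h1, h2, h3 {
  page-break-after: avoid;
}
```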
Since the mPDF project doesn't (yet) have a wiki, we'll coordinate here.
- PDFlib cookbook has some PHP code, but no license: https://www.pdflib.com/pdflib-cookbook/text-output/widows-and-orphans/php-widows-and-orphans/
- Prince is a proprietary solution that Pressbooks uses instead of mPDF: https://www.princexml.com/doc/paged/#widows-and-orphans
- wkhtmltopdf added support: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2457
- WeasyPrint added support: https://github.com/Kozea/WeasyPrint/commit/1d6c94828e3a9bc79dc0a9d4e2377619f0de78ac
- This thread indicates that LibreOffice and Word support widow and orphan control by default: https://bugs.documentfoundation.org/show_bug.cgi?id=89714#c9
- Wikipublisher: widow and orphan control and headings always kept with their following paragraphs
- Jonny added a non-breaking space between the last two words of headers in wiki pages: https://sourceforge.net/p/tikiwiki/code/HEAD/tree/trunk/lib/parser/parserlib.php#l3034
Needs to be added: page-break-after: avoid doesn't seem to be respected.
Perhaps try slightly adjusting the spacing between letters?
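The non-breaking-space trick mentioned above (Jonny's change is in PHP in parserlib.php) can be sketched in Python; glue_last_words is a hypothetical name for illustration, not Tiki's actual function:

```python
def glue_last_words(heading: str) -> str:
    """Join the last two words of a heading with a non-breaking space
    (U+00A0) so the final word can never wrap alone onto the next line."""
    parts = heading.rsplit(" ", 1)
    if len(parts) < 2:
        return heading  # single-word heading: nothing to glue
    return parts[0] + "\u00a0" + parts[1]
```

For example, glue_last_words("Widows and Orphans in mPDF") joins "in" and "mPDF" with a non-breaking space, so the heading can never end with "mPDF" stranded on its own line.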
- Scope the project
- Present plan to Matěj (lead dev of mPDF)
- Once approved, get to work!
We'll attempt to solve this via Rubix ML.
Asked in the Rubix ML chat room:
Marc Laporte, [16.06.20 17:52]: I am wondering if Rubix ML would be a good tool (now or in the future) to help with widows and orphans in desktop publishing.
Marc Laporte, [16.06.20 17:52]: Context: https://en.wikipedia.org/wiki/Widows_and_orphans
Marc Laporte, [16.06.20 17:57]: It is actually a lot harder to do than some might think
Rich Davis, [16.06.20 17:58]: I don't think that would require the use of ML. What is the file format?
Marc Laporte, [16.06.20 18:09]: We have wiki pages (wiki syntax which renders to HTML) which then go to mPDF: https://dev.tiki.org/Widows-and-Orphans-in-mPDF
Marc Laporte, [16.06.20 18:10]: At least two developers worked really hard to add widows and orphans to mPDF: https://github.com/mpdf/mpdf/issues/48
Marc Laporte, [16.06.20 18:13]: One of them reportedly said it was an "intractable problem"
Rich Davis, [16.06.20 18:19]: It would become even more difficult with ML. Are you using something like pdf.js to convert from HTML to PDF?
Marc Laporte, [16.06.20 18:20]: We are using mPDF, which is in PHP: https://packagist.org/packages/mpdf/mpdf
Rich Davis, [16.06.20 18:20]: There's another old library that does that and I can't think of the name. Converting HTML to canvas objects is ludicrously annoying
Rich Davis, [16.06.20 18:21]: Is there any way to detect when a new page has begun and if the current paragraph is spilling over?
Marc Laporte, [16.06.20 18:22]: "This package identifies all widows and orphans in a document to help a user to get rid of them. The act of resolving still needs to be done manually: by rewriting text, running some paragraph long or short, or explicitly breaking in some strategic place." Source: https://ctan.org/pkg/widows-and-orphans?lang=en
Rich Davis, [16.06.20 18:23]: PDF page sizes are fixed length, I believe, so testing if a paragraph will spill over given the contextual position in the current page shouldn't be too much of a task
Rich Davis, [16.06.20 18:23]: Thinking in ML terms, those are the features I would likely need to create a model for it
Rich Davis, [16.06.20 18:24]: But if I have those then I can test the conditions for each paragraph and proactively edit them to make it work.
Marc Laporte, [16.06.20 18:26]: Imagine a 200-page document, and there are several dozen widows or orphans. When you fix one (ex.: by tweaking font spacing on a portion of the text), you can perhaps resolve or create another issue.
Rich Davis, [16.06.20 18:26]: But when we do it proactively, it updates the relative position of the proceeding HTML chunk and the PDF file
Rich Davis, [16.06.20 18:27]: So we're compiling all of this HTML that's being transcribed into PDF, page by page.
Marc Laporte, [16.06.20 18:28]: And it also compiles a table of contents with page numbers
Rich Davis, [16.06.20 18:28]: If we just have a list of p tags with paragraphs in between, then I first get the position where I'm at in the PDF doc. Let's say we've filled half of it; since I know the max size of the PDF, and I know the character count of the text in the p tag, then I can test whether we need to insert line breaks onto the page until we get a flush insert.
Marc Laporte, [16.06.20 18:29]: It has a huge feature set. The mpdfmanual.pdf (8 MB download) is over 600 pages! https://github.com/IanNBack/mpdf/raw/master/mpdfmanual.pdf
Rich Davis, [16.06.20 18:29]: That shouldn't matter
Rich Davis, [16.06.20 18:29]: It may take longer but has no impact on the efficacy of the process
Rich Davis, [16.06.20 18:30]: Beats doing it manually 😂
Rich Davis, [16.06.20 18:31]: The only logical hurdle would be if there is a paragraph that extends longer than a full page. Even in that situation the process would hold.
Marc Laporte, [16.06.20 18:31]: hahahahahah, yes
Rich Davis, [16.06.20 18:32]: All you need to set it up is the max character count per PDF page.
Marc Laporte, [16.06.20 18:34]: Well, there are images too!
Marc Laporte, [16.06.20 18:34]: And tables, and columns
Rich Davis, [16.06.20 18:36]: The tables and columns would be handled in a similar way. I'm sure there is a way to get the character-equivalent size for each of them.
Rich Davis, [16.06.20 18:38]: In fact, there has to be, since PDFs do it.
Marc Laporte, [16.06.20 18:40]: https://texfaq.org/FAQ-widows "(La)TeX takes some precautions to avoid them, but completely automatic prevention is often impossible. If you are typesetting your own text, consider whether you can bring yourself to change the wording slightly so that the page break will fall differently."
Marc Laporte, [16.06.20 18:41]: https://en.wikipedia.org/wiki/LaTeX has been unable to fully solve this since 1984.
Marc Laporte, [16.06.20 18:42]: Which is why I was wondering if AI could help 😊
Rich Davis, [16.06.20 18:44]: I don't see how it would without having the information that the solution requires.
Rich Davis, [16.06.20 18:45]: It's maybe worth trying, as it could be trained unsupervised using PDFs with the corresponding HTML. I just think the solution could be dealt with in a simpler way.
Rich Davis, [16.06.20 18:46]: One alternative is the use of both characters and pixel dimensions
Marc Laporte, [16.06.20 18:46]: ok
Rich Davis, [16.06.20 18:47]: We know the pixel dimensions of the tables and images, and we know the character count of the paragraphs
Rich Davis, [16.06.20 18:47]: So then we'd equate character count to pixel dimensions
Rich Davis, [16.06.20 18:47]: Add up the dimensions of the text block with the image or table and see if there is space.
Andrew DalPino, [16.06.20 19:10]: Sounds like a pretty nuanced problem ... I'll give it some thought and get back to you guys
Andrew DalPino, [16.06.20 19:19]: [In reply to Marc Laporte] I trust that if LaTeX couldn't do it by now, then a traditional rule-based approach is infeasible ... Sequence learning is all the rage these days; it sounds like what we'd need is some form of semi/self-supervised sequence classification ... This is a bit different from the Transformer architecture such as with GPT-1/2/3, which outputs another sequence. In this case we'd want to output a class label such as 'full sentence', 'widow', or 'orphan'. Having said that, we do not support sequence learning yet in Rubix. It's on our roadmap, though, once we solve the problem of efficient storage and computing on matrices (this is the focus of Henrique's CArray project).
Andrew DalPino, [16.06.20 19:22]: An LSTM (long short-term memory) network might work, but you'd need labeled data for the classification task
Rich Davis, [16.06.20 19:23]: That would require CArray
Andrew DalPino, [16.06.20 19:24]: Right, that's too much computation for PHP to handle natively
Rich Davis, [16.06.20 19:24]: LSTM is too expensive
Rich Davis, [16.06.20 19:24]: Same problem stands: the network would basically just learn the math required, which is pretty simple if we have the feature data
Rich Davis, [16.06.20 19:25]: Recurrent LSTMs also aren't the most accurate for text. I wonder if they have been used in a regression context
Andrew DalPino, [16.06.20 19:26]: The LSTM would be trained to identify the language concepts of sentences, widows, and orphans from word sequences
Andrew DalPino, [16.06.20 19:26]: And someone would have to look at word sequences and label them in order to get the training data (unless someone has already done this)
Rich Davis, [16.06.20 19:27]: Isn't the whole goal though just to prevent a paragraph from spilling over onto a new PDF page?
Andrew DalPino, [16.06.20 19:28]: I suppose if you can detect an orphan or widow, then you can apply a rule-based method to either move the sentence to the next page or do something to squeeze it into the current page
Rich Davis, [16.06.20 19:29]: But you don't have to detect
Andrew DalPino, [16.06.20 19:29]: Right, that is where the classifier comes into play
Rich Davis, [16.06.20 19:29]: You can know firsthand when a paragraph will spill over
Rich Davis, [16.06.20 19:30]: Smh, this chat when someone tosses us a lax but interesting problem 🤦‍♂️
Andrew DalPino, [16.06.20 19:31]: Now that I think of it, there may be a sort of 'self-supervised' approach we could use ... for example ... automatically labeling a sample based on the presence of punctuation??? Just thinking out loud 🧐
Rich Davis, [16.06.20 19:31]: I feel like I'm missing something
Rich Davis, [16.06.20 19:32]: So let's say the max character count of a PDF page is 2,000 characters
Rich Davis, [16.06.20 19:32]: And that equates to a window size of like 1,960 x 1080 pixels
Rich Davis, [16.06.20 19:33]: Why can't we just iterate through each paragraph, count the characters, equate that to dimensions, and test if it fits the current context?
Rich Davis, [16.06.20 19:33]: And ending the page transcription, opening a new one if it doesn't
Rich Davis, [16.06.20 19:33]: That's what I don't understand
Andrew DalPino, [16.06.20 19:34]: Let's have @marclaporte and team (@bush243) clarify the problem when they have the chance
Rich Davis, [16.06.20 19:35]: Because we're tossing out recurrent LSTMs. Even with Python, training an LSTM with 600 pages of text would take about 6-12 hours depending on accuracy requirements, and LSTMs require a shit ton of epochs
Marc Laporte, [16.06.20 21:46]: Ok, we'll prepare something. Thanks!
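The character-count heuristic Rich describes in the chat can be sketched as a small fit check. All the numbers below (page capacity, characters per line) are made-up illustrations, and the function name is hypothetical; a real implementation would need actual font metrics from mPDF rather than character counts:

```python
def fits_without_widow(used_chars: int, para_len: int,
                       page_capacity: int = 2000,
                       line_chars: int = 80) -> bool:
    """Decide whether a paragraph can start on the current page without
    creating a widow or orphan, using character counts as a crude proxy
    for rendered height.

    - If the whole paragraph fits in the remaining space, it's fine.
    - If it must split, both the part left on this page and the part
      carried over must be at least ~2 lines (the classic rule).
    - Otherwise the caller should push the paragraph to the next page.
    """
    free = page_capacity - used_chars
    if para_len <= free:
        return True  # whole paragraph fits on the current page

    min_block = 2 * line_chars       # ~2 lines minimum on each side
    head = free                      # characters left on this page
    tail = para_len - free           # characters spilling to next page
    return head >= min_block and tail >= min_block
```

For example, with the defaults, a 120-character paragraph starting on a page that already holds 1,900 characters would be pushed to the next page (only 100 characters of space remain, less than two lines), while an 800-character paragraph starting at 1,500 characters may split, since both halves exceed two lines.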