PSA: An offline Windows workflow for Balabolka audiobook TTS from PDF
From
Maria Sophia@mariasophia@comprehension.com to
alt.comp.os.windows-10,comp.text.pdf on Wed Mar 4 11:38:41 2026
From Newsgroup: comp.text.pdf
PSA:
An offline Windows workflow for audiobook TTS from PDF
(Based 1/2 on logic but the other 1/2 is based on trial-&-error results.)
Software:
I. Windows 10 (almost everything will be FOSS, offline)
II. Balabolka 2.15.0.811
III. Calibre 8.10.0
IV. PDF Shaper Free 8.9
V. Adobe Acrobat 6 (writer)
Main issues are we don't want to create an audiobook that speaks the...
a. headers/footers
b. images
c. end of line (versus end of sentence, which causes unnatural pauses)
d. index, foreword, tables, bibliography, etc.
Main software logic employed was to use the best tool/format for the job...
A. Balabolka needs the cleanest text input you can give it. (Balabolka
can OCR with the Tesseract plugin, but we're not doing OCR here.)
B. Acrobat tries to preserve the visual layout of the PDF; while
Calibre does not try to preserve layout (which is better for TTS).
C. Calibre tries to extract logical-reading order (which TTS needs).
D. RTF is the better input for Calibre since an RTF from Acrobat has
i. Real text (not embedded PDF objects)
ii. Fewer hard line breaks (than TXT)
iii. Predictable header/footer patterns (that Calibre can strip)
iv. No images (so nothing confuses the reading order)
v. Consistent encoding (to rebuild paragraphs cleanly)
E. But Acrobat 6 embedded JPEGs as hex blobs inside the RTF.
F. So PDF Shaper Free was used to remove images.
G. And PDF Shaper Free's PDF->RTF conversion is cleaner than Acrobat's
(no hex-blob images in the RTF).
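For anyone stuck with an image-laden RTF, here is a minimal Python sketch of a different route than the one used in this post (which removed the images in PDF Shaper Free instead). It assumes the images are stored the usual RTF way, as hex text inside {\pict ...} groups; the function name strip_pict_groups is made up for illustration, and raw \bin image data is not handled.

```python
def strip_pict_groups(rtf: str) -> str:
    """Return the RTF source with every {\\pict ...} group removed."""
    out = []
    i, n = 0, len(rtf)
    while i < n:
        if rtf.startswith("{\\pict", i):
            # Skip to the brace that closes this picture group.
            depth = 0
            while i < n:
                ch = rtf[i]
                if ch == "\\":
                    i += 2  # skip a control symbol such as \{ or \}
                    continue
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        i += 1
                        break
                i += 1
        elif rtf[i] == "\\":
            # Copy a backslash and the character after it together,
            # so an escaped brace is never mistaken for a group start.
            out.append(rtf[i:i + 2])
            i += 2
        else:
            out.append(rtf[i])
            i += 1
    return "".join(out)
```

Since hex encoding spends two text characters per image byte, stripping these groups is also a quick way to see how much of an RTF's size is pictures rather than text.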
But there are still stray artifacts (e.g., the first letter of the first
word of every chapter is a big font and then a space and then the rest of
the word, and some lines are still chopped, and of course the header &
footer are still in the text output).
Problem Statement:
College-aged grandchild wants me to "read" a 212-page 10MB book she has
in PDF so she and I can discuss it over the phone; but I want to "listen"
to that book PDF because reading scanned/text is miserable for my eyes.
Problem Document:
The book "looks" scanned (i.e., sloppy fonts) but Acrobat can select text
and Acrobat search can find a given word, so it's weird text + scanned???
At least that means Tesseract OCR plugins to Balabolka are not needed.
Test Flow:
Adobe Acrobat 6 (Writer) was used to convert the original PDF twice.
1. First the 10MB (original) PDF was "File > Reduce File Size" to 27MB
(pages 70 & 135 took almost forever, but presumably cleaned artifacts)
2. Then the 27MB "reduced-size" PDF was "File > Save as" a 177MB RTF;
but the RTF had page-end line breaks, so the RTF was discarded.
3. The 27MB reduced-size PDF was fed into PDF Shaper Free
"File > Remove Images", which saved a 17MB image-free PDF
(which, coincidentally, had *much cleaner* text fonts!).
4. Calibre read in the RTF and converted it to TXT but each line still
had a linebreak at the right side of the visible page. Drat.
5. I tried Sigil to make a cleaner EPUB from the Calibre EPUB; but even
though the EPUB has no artificial line breaks, the TXT still had them.
6. Merge lines that are not real paragraph breaks
Balabolka:Edit > Replace [x]Use Regular Expressions
Find: ([^\n])\n([^\n])
Replace: \1 \2
Save as text
7. Fix broken words
Find: ([a-z])\xAD([a-z])
(\xAD is the soft hyphen, U+00AD; paste the literal character if the
escape is not accepted)
Replace: \1\2
8. Remove tab characters
Find: \t+
Replace: (empty)
9. Remove page headers (such as "10 WORLD HISTORY")
Find: ^\s*\d+\s+THE DEVIL’S HORSEMEN\s*$
Replace: (empty)
10. Normalize paragraph spacing (collapse runs of blank lines into one)
Find: \n{3,}
Replace: \n\n
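11. Remove soft-hyphen & ligature debris
Find: \xAD
(the soft hyphen, U+00AD)
Replace: (empty)
Find: ([a-zA-Z])\x{FFFD}([a-zA-Z])
(U+FFFD is the replacement character left behind by a broken ligature)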
Replace:
12. Tell Balabolka to pause slightly before & after italicized words
Find: \t([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)
Replace: \1\2
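For those who would rather script the cleanup than click through Edit > Replace, the following is a rough Python sketch of steps 6 through 11, not a drop-in replacement for the Balabolka recipe. The function name clean_for_tts and its header_title parameter are made up for illustration. It differs from the steps above in two ways, both assumptions on my part: headers are removed before lines are merged (so each header is still on its own line when the pattern runs), and the line merge uses lookarounds so that no character is consumed by the match.

```python
import re

SOFT_HYPHEN = "\u00ad"       # invisible hyphenation hint left by the PDF
REPLACEMENT_CHAR = "\ufffd"  # debris left where a glyph failed to decode


def clean_for_tts(text: str, header_title: str) -> str:
    """Clean extracted book text so a TTS engine reads whole sentences."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Step 9: drop running headers such as "10 WORLD HISTORY".
    header = re.compile(
        r"^[ \t]*\d+[ \t]+" + re.escape(header_title) + r"[ \t]*$\n?",
        re.MULTILINE,
    )
    text = header.sub("", text)

    # Steps 7 and 11: remove soft hyphens, rejoining split words.
    text = text.replace(SOFT_HYPHEN, "")

    # Step 11: remove replacement characters stranded between letters.
    text = re.sub(
        "(?<=[A-Za-z])" + REPLACEMENT_CHAR + "(?=[A-Za-z])", "", text
    )

    # Step 8: remove tab characters.
    text = re.sub(r"\t+", "", text)

    # Step 6: a lone newline is a page-width line wrap, not a paragraph
    # break, so turn it into a space.
    text = re.sub(r"(?<!\n)\n(?=[^\n])", " ", text)

    # Step 10: collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Usage would be along the lines of clean_for_tts(open("book.txt", encoding="utf-8").read(), "THE DEVIL’S HORSEMEN"), where book.txt stands for whatever TXT file Calibre produced.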
Once you have the clean text, the Balabolka step is trivial.
Save as MP3.
Bitrate: 48 kbps or 64 kbps
Mode: Mono
Sample rate: 22 kHz or 32 kHz
Copy to your mobile device and play whenever you want to listen
to the audio book.
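To get a feel for how much phone storage those bitrates need, here is a small Python sketch of the arithmetic. It assumes constant-bitrate MP3 and ignores tags and container overhead; the 8-hour running time is a made-up figure for illustration, not the length of this book.

```python
def mp3_size_mb(bitrate_kbps: float, hours: float) -> float:
    """Approximate size in MB of a constant-bitrate MP3."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kilobits/s -> bytes/s
    return bytes_per_second * hours * 3600 / 1_000_000


for kbps in (48, 64):
    print(f"{kbps} kbps, 8 h: about {mp3_size_mb(kbps, 8):.0f} MB")
```

Under those assumptions, 48 kbps works out to roughly 173 MB and 64 kbps to roughly 230 MB for eight hours of speech.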
--
This is a work in progress but I figured I'd document the steps.