Forum: Too Lazy BBS

pdf grep?

From db@dieterhansbritz@gmail.com to comp.text.pdf on Wed Apr 3 12:45:20 2024

From Newsgroup: comp.text.pdf

Under Linux, I can use grep to search a bunch of
files for a character string. Is there an equivalent
command for searching pdf files?
--- Synchronet 3.21d-Linux NewsLink 1.2

From Robert Heller@heller@deepsoft.com to comp.text.pdf on Wed Apr 3 14:03:37 2024

Grep may sort of work with pdf files too. You might also want to use the strings command to get "clean" strings. Note: *some* pdf files are just images (no actual text). These would be PDFs created by scanning a document (without using OCR). Also, many typesetting programs (TeX/LaTeX, word processors, etc.) might do some typesetting "magic" (eg ligitures, etc.) that might make things hard for grep.
xpdf includes a text search button as part of its UI.
At Wed, 3 Apr 2024 12:45:20 -0000 (UTC) db <dieterhansbritz@gmail.com> wrote:

> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
heller@deepsoft.com -- Webhosting Services

From Tim Landscheidt@tim@tim-landscheidt.de to comp.text.pdf on Wed Apr 3 14:22:18 2024

db <dieterhansbritz@gmail.com> wrote:

> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?

You can use pdfgrep (https://pdfgrep.org/) for that. It is
available as a package in Fedora and Debian as well.
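
For the "which pdf was that phrase in?" case, something along these lines may be a reasonable starting point. It is only a sketch: the function name find_in_pdfs is made up for illustration, and it assumes pdfgrep is installed and on the PATH. The flags used are the ones described in the pdfgrep man page.

```shell
# Hypothetical helper (name invented for this example):
# search every PDF below a directory for a pattern.
# Usage: find_in_pdfs PATTERN [DIR]
find_in_pdfs() {
    pattern=$1
    dir=${2:-.}          # default to the current directory
    # -r: recurse into DIR     -i: ignore case
    # -H: print the file name  -n: print the page number of each match
    pdfgrep -r -i -H -n "$pattern" "$dir"
}
```

Calling `find_in_pdfs blabla ~/papers` (the directory is just an example) should then list file name, page number and matching line for each hit.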

Tim

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.text.pdf on Wed Apr 3 14:17:22 2024

Robert Heller <heller@deepsoft.com> wrote or quoted:

> might do some typesetting "magic" (eg ligitures, etc.) that might make things

"ligatures"

Text in PDFs is sometimes compressed. So one can either use
programs like "Agent Ransack" to search for text in PDFs or
tools like "pdftotext" to first create a text file for every
PDF file and then grep those text files.

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.text.pdf on Wed Apr 3 14:29:40 2024

ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:

> Text in PDFs is sometimes compressed. So one can either use
> programs like "Agent Ransack" to search for text in PDFs or
> tools like "pdftotext" to first create a text file for every
> PDF file and then grep those text files.

PS: "Agent Ransack" is Windows software. "pdftotext" is also
available for Linux. Converting all PDFs to text files needs
to be done only once, and then search operations on those
text files are faster than scanning the PDF files for text
on every search!
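
The convert-once idea could be sketched roughly as below. The function name refresh_pdf_text is hypothetical, and the sketch assumes pdftotext is installed and that the .txt copies may live next to the PDFs. A PDF is converted again only if its text copy is missing or older than the PDF.

```shell
# Hypothetical helper (name invented for this example):
# keep a .txt copy next to every PDF in DIR, converting only
# when the copy is missing or older than the PDF.
# Usage: refresh_pdf_text [DIR]
refresh_pdf_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # no PDFs: the glob stays literal
        txt=${pdf%.pdf}.txt
        # -nt ("newer than") is understood by bash and dash
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext "$pdf" "$txt"
        fi
    done
}
```

After `refresh_pdf_text ~/papers`, a plain `grep -il blabla ~/papers/*.txt` names the text copies that contain the word (directory and word are only examples).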

From db@dieterhansbritz@gmail.com to comp.text.pdf on Wed Apr 3 15:19:24 2024

On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:

> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>
>> Text in PDFs is sometimes compressed. So one can either use programs
>> like "Agent Ransack" to search for text in PDFs or tools like
>> "pdftotext" to first create a text file for every PDF file and then
>> grep those text files.
>
> PS: "Agent Ransack" is Windows software. "pdftotext" is also available
> for Linux. Converting all PDFs to text files needs to be done only
> once, and then search operations on those text files are faster than
> scanning the PDF files for text on every search!

I should maybe have elaborated a bit. Sometimes I
remember a certain phrase or word but forget which
pdf it is in. With text files I can do
grep blabla *.txt
and I wanted an equivalent. Using pdftotext would
mean using it for every suspect pdf. Since a lot of
pdf files are searchable, I figured that such a
command might exist.
But if there really is a pdfgrep command, that might
do the job. I will do some googling, thanks.
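
For comparison, a one-off search with pdftotext alone, without keeping any text files around, might look like the sketch below. The function name pdfs_containing is made up for illustration, and pdftotext is assumed to be installed. It behaves roughly like grep -l for PDFs.

```shell
# Hypothetical helper (name invented for this example):
# print the names of the PDFs whose extracted text contains PATTERN.
# Usage: pdfs_containing PATTERN FILE...
pdfs_containing() {
    pattern=$1
    shift
    for pdf in "$@"; do
        # "-" as the output name makes pdftotext write to standard output
        if pdftotext "$pdf" - 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

`pdfs_containing blabla *.pdf` would then be the rough counterpart of `grep -l blabla *.txt`, at the cost of re-extracting every PDF on each search.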

From db@dieterhansbritz@gmail.com to comp.text.pdf on Thu Apr 4 09:50:46 2024

On Wed, 3 Apr 2024 15:19:24 -0000 (UTC), db wrote:

> On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:
>
>> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>>
>>> Text in PDFs is sometimes compressed. So one can either use programs
>>> like "Agent Ransack" to search for text in PDFs or tools like
>>> "pdftotext" to first create a text file for every PDF file and then
>>> grep those text files.
>>
>> PS: "Agent Ransack" is Windows software. "pdftotext" is also
>> available for Linux. Converting all PDFs to text files needs to be
>> done only once, and then search operations on those text files are
>> faster than scanning the PDF files for text on every search!
>
> I should maybe have elaborated a bit. Sometimes I remember a certain
> phrase or word but forget which pdf it is in. With text files I can do
> grep blabla *.txt
> and I wanted an equivalent. Using pdftotext would mean using it for
> every suspect pdf. Since a lot of pdf files are searchable,
> I figured that such a command might exist.
> But if there really is a pdfgrep command, that might do the job. I will
> do some googling, thanks.

I installed pdfgrep in my Kubuntu system, but it is
not happy. Although the man file is there, even help
doesn't work:

pdfgrep --help

terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
Aborted (core dumped)

??
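
That message is what the C++ runtime reports when the locale named in the environment (LANG or one of the LC_* variables) is not available on the system, so the locale setup may be worth checking before blaming pdfgrep itself. This is a guess from the error text, not a confirmed diagnosis for this machine. The helper name locale_installed below is hypothetical.

```shell
# Hypothetical check (name invented for this example):
# succeed if a locale name appears in the output of "locale -a".
# Usage: locale_installed NAME    e.g. locale_installed "$LANG"
locale_installed() {
    # Normalise spellings such as "UTF-8" vs "utf8" before comparing
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' \
        | grep -qxF -e "$want"
}

# Possible follow-up, to be run by hand:
#   locale                      # shows LANG and the LC_* settings in use
#   LC_ALL=C pdfgrep --help     # if this works, the locale setting is the trigger
```

If `locale_installed "$LANG"` fails, the locale named in LANG is probably not generated on the system.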

From Peter Flynn@peter@silmaril.ie to comp.text.pdf on Thu Apr 4 16:57:49 2024

On 04/04/2024 10:50, db wrote:
[...]

> I installed pdfgrep in my Kubuntu system, but it is
> not happy. Although the man file is there, even help
> doesn't work:

I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
seems to work OK. What version is the Kubuntu one?

Peter

From db@dieterhansbritz@gmail.com to comp.text.pdf on Fri Apr 5 12:31:04 2024

On Thu, 4 Apr 2024 16:57:49 +0100, Peter Flynn wrote:

> On 04/04/2024 10:50, db wrote:
> [...]
>
>> I installed pdfgrep in my Kubuntu system, but it is not happy. Although
>> the man file is there, even help doesn't work:
>
> I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
> seems to work OK. What version is the Kubuntu one?
>
> Peter

The man file for pdfgrep says V. 2.1.1. My Kubuntu
is 23.04.