• pdf grep?

    From db@dieterhansbritz@gmail.com to comp.text.pdf on Wed Apr 3 12:45:20 2024
    From Newsgroup: comp.text.pdf

    Under Linux, I can use grep to search a bunch of
    files for a character string. Is there an equivalent
    command for searching pdf files?
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Robert Heller@heller@deepsoft.com to comp.text.pdf on Wed Apr 3 14:03:37 2024
    From Newsgroup: comp.text.pdf

    Grep may sort of work on PDF files too. You might also want to run the
    strings command first to get "clean" strings. Note: *some* PDF files are
    just images (no actual text). These are PDFs created by scanning a
    document without OCR. Also, many typesetting programs (TeX/LaTeX, word
    processors, etc.) might do some typesetting "magic" (eg ligitures, etc.)
    that might make things hard for grep.

    xpdf includes a text search button as part of its UI.
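
One way to tell image-only PDFs from ones that carry real text is to extract the text and check whether anything non-blank comes out. This is a minimal sketch, assuming pdftotext (shipped with poppler-utils or xpdf) is installed. The name has_text_layer is a hypothetical helper, not part of any tool.

```shell
# has_text_layer FILE: succeed if FILE yields at least one non-blank character.
# Assumes pdftotext is on PATH; "-" as the output name sends text to stdout.
has_text_layer() {
    # Page breaks (form feeds) count as whitespace, so a pure scan prints nothing useful.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Something like `for f in *.pdf; do has_text_layer "$f" || echo "no text: $f"; done` would then list the files that need OCR before any text search can find anything in them.
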
    At Wed, 3 Apr 2024 12:45:20 -0000 (UTC) db <dieterhansbritz@gmail.com> wrote:

    > Under Linux, I can use grep to search a bunch of
    > files for a character string. Is there an equivalent
    > command for searching pdf files?


    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    heller@deepsoft.com -- Webhosting Services
  • From Tim Landscheidt@tim@tim-landscheidt.de to comp.text.pdf on Wed Apr 3 14:22:18 2024
    From Newsgroup: comp.text.pdf

    db <dieterhansbritz@gmail.com> wrote:

    > Under Linux, I can use grep to search a bunch of
    > files for a character string. Is there an equivalent
    > command for searching pdf files?

    You can use pdfgrep (https://pdfgrep.org/) for that. It is
    available as a package in Fedora and Debian as well.

    Tim
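
For the "which file was that phrase in" use case, a call along these lines is a reasonable starting point. It is a sketch that assumes pdfgrep 2.x with its -r (recurse into directories), -i (ignore case) and -l (print only the names of matching files) options. The wrapper name which_pdf is hypothetical.

```shell
# which_pdf PATTERN [DIR]: list the PDFs under DIR (default: current
# directory) that contain PATTERN, ignoring case.
which_pdf() {
    # "--" stops option parsing so a pattern starting with "-" is safe.
    pdfgrep -r -i -l -- "$1" "${2:-.}"
}
```

Dropping -l and adding -n instead should print each matching line prefixed with its page number, which is closer to what plain grep shows for text files.
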
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.text.pdf on Wed Apr 3 14:17:22 2024
    From Newsgroup: comp.text.pdf

    Robert Heller <heller@deepsoft.com> wrote or quoted:
    > might do some typesetting "magic" (eg ligitures, etc.) that might make things

    "ligatures"

    Text in PDFs is sometimes compressed. So one can either use
    programs like "Agent Ransack" to search for text in PDFs or
    tools like "pdftotext" to first create a text file for every
    PDF file and then grep those text files.
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.text.pdf on Wed Apr 3 14:29:40 2024
    From Newsgroup: comp.text.pdf

    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    > Text in PDFs is sometimes compressed. So one can either use
    > programs like "Agent Ransack" to search for text in PDFs or
    > tools like "pdftotext" to first create a text file for every
    > PDF file and then grep those text files.

    PS: "Agent Ransack" is Windows software. "pdftotext" is also
    available for Linux. Converting all PDFs to text files needs
    to be done only once, and then search operations on those
    text files are faster than scanning the PDF files for text
    on every search!
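
The convert-once-then-grep idea could be sketched as follows. It assumes pdftotext is installed and that writing a .txt file next to each PDF is acceptable. The function names index_pdfs and search_pdfs are hypothetical, and the `-nt` file test is widely supported (bash, dash, ksh) but not strictly POSIX.

```shell
# index_pdfs [DIR]: create or refresh FILE.txt for every FILE.pdf in DIR.
index_pdfs() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # glob matched nothing
        txt=${pdf%.pdf}.txt
        # Convert only when the text copy is missing or older than the PDF.
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext "$pdf" "$txt"
        fi
    done
}

# search_pdfs STRING [DIR]: print the PDFs whose text copy contains STRING.
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    for pdf in "$dir"/*.pdf; do
        txt=${pdf%.pdf}.txt
        [ -e "$txt" ] || continue            # not indexed yet
        # -F: fixed string, -q: quiet, only the exit status is used.
        if grep -q -F -- "$pattern" "$txt"; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

After one `index_pdfs ~/papers`, repeated calls such as `search_pdfs blabla ~/papers` only read the plain text copies, and rerunning index_pdfs later converts just the new or changed PDFs.
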
  • From db@dieterhansbritz@gmail.com to comp.text.pdf on Wed Apr 3 15:19:24 2024
    From Newsgroup: comp.text.pdf

    On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:

    > ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    >> Text in PDFs is sometimes compressed. So one can either use programs
    >> like "Agent Ransack" to search for text in PDFs or tools like
    >> "pdftotext" to first create a text file for every PDF file and then grep
    >> those text files.
    >
    > PS: "Agent Ransack" is Windows software. "pdftotext" is also available
    > for Linux. Converting all PDFs to text files needs to be done only
    > once, and then search operations on those text files are faster than
    > scanning the PDF files for text on every search!

    I should maybe have elaborated a bit. Sometimes I
    remember a certain phrase or word but forget which
    pdf it is in. With text files I can do
    grep blabla *.txt
    and I wanted an equivalent. Using pdftotext would
    mean using it for every suspect pdf. Since a lot of
    pdf files are searchable, I figured that such a
    command might exist.
    But if there really is a pdfgrep command, that might
    do the job. I will do some googling, thanks.
  • From db@dieterhansbritz@gmail.com to comp.text.pdf on Thu Apr 4 09:50:46 2024
    From Newsgroup: comp.text.pdf

    On Wed, 3 Apr 2024 15:19:24 -0000 (UTC), db wrote:

    > On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:
    >
    >> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    >>> Text in PDFs is sometimes compressed. So one can either use programs
    >>> like "Agent Ransack" to search for text in PDFs or tools like
    >>> "pdftotext" to first create a text file for every PDF file and then
    >>> grep those text files.
    >>
    >> PS: "Agent Ransack" is Windows software. "pdftotext" is also
    >> available for Linux. Converting all PDFs to text files needs to be
    >> done only once, and then search operations on those text files are
    >> faster than scanning the PDF files for text on every search!
    >
    > I should maybe have elaborated a bit. Sometimes I remember a certain
    > phrase or word but forget which pdf it is in. With text files I can do
    > grep blabla *.txt and I wanted an equivalent. Using pdftotext would
    > mean using it for every suspect pdf. Since a lot of pdf files are
    > searchable, I figured that such a command might exist.
    > But if there really is a pdfgrep command, that might do the job. I will
    > do some googling, thanks.

    I installed pdfgrep in my Kubuntu system, but it is
    not happy. Although the man file is there, even help
    doesn't work:

    pdfgrep --help
    terminate called after throwing an instance of 'std::runtime_error'
    what(): locale::facet::_S_create_c_locale name not valid
    Aborted (core dumped)

    ??
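
That message comes from the C++ runtime and usually means that a locale named in LANG or one of the LC_* variables is not installed on the system, rather than that pdfgrep itself is broken. Assuming that is the cause here (nothing in the thread confirms it), forcing the always-available C locale for a single command is a quick check. The name run_in_c_locale is a hypothetical helper.

```shell
# run_in_c_locale CMD [ARGS...]: run CMD with every locale category forced to "C".
# LC_ALL overrides LANG and the individual LC_* variables for that one command.
run_in_c_locale() {
    env LC_ALL=C "$@"
}
```

If `run_in_c_locale pdfgrep --help` prints the help text, comparing the output of `locale` (what is configured) with `locale -a` (what is installed) should show which locale is missing. Under the C locale, patterns containing non-ASCII characters may not match as expected, so this is a diagnostic step more than a permanent fix.
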
  • From Peter Flynn@peter@silmaril.ie to comp.text.pdf on Thu Apr 4 16:57:49 2024
    From Newsgroup: comp.text.pdf

    On 04/04/2024 10:50, db wrote:
    [...]
    > I installed pdfgrep in my Kubuntu system, but it is
    > not happy. Although the man file is there, even help
    > doesn't work:

    I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
    seems to work OK. What version is the Kubuntu one?

    Peter

  • From db@dieterhansbritz@gmail.com to comp.text.pdf on Fri Apr 5 12:31:04 2024
    From Newsgroup: comp.text.pdf

    On Thu, 4 Apr 2024 16:57:49 +0100, Peter Flynn wrote:

    > On 04/04/2024 10:50, db wrote:
    > [...]
    >> I installed pdfgrep in my Kubuntu system, but it is not happy. Although
    >> the man file is there, even help doesn't work:
    >
    > I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
    > seems to work OK. What version is the Kubuntu one?
    >
    > Peter

    The man file for pdfgrep says V. 2.1.1. My Kubuntu
    is 23.04.