• Re: How to extract TABULAR data from a PDF document?

    From Richard Owlett@21:1/5 to David Wright on Thu Apr 17 21:30:01 2025
    On 4/16/25 8:35 AM, David Wright wrote:
    On Wed 16 Apr 2025 at 07:21:07 (-0500), Richard Owlett wrote:
    On 4/15/25 11:01 AM, Kent West wrote:
    On Tue, Apr 15, 2025 at 10:32 AM Nicolas George wrote:
    Richard Owlett (HE12025-04-15):
    I don't know how to approach the problem.
    What I would like to end up with is a CSV formatted file containing the two
    left columns of Table A4.14 (pages 106&107) of
    [ https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf ].

    Suggestions?

    Have you tried starting with pdftotext -layout and then adding the CSV delimiters using a powerful editor? The rectangle selection of Vim might be useful.

    Riffing off of Nicolas' suggestion, here's what I would do:

    $ pdftotext -f 106 -l 107 TFP2021.pdf TFP2021.txt

    As I replied to Nicolas I'll try both that and also a run with the
    "-layout" option.

    BTW I would add 10 to those pagenumbers (physical vs logical
    pages). Otherwise you get the wrong table.
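
A minimal sketch of that page-number correction, for anyone scripting the extraction. It only builds the pdftotext argument list; the offset of 10 is taken from the remark above, and the function name is made up for illustration. The -f, -l and -layout options are regular pdftotext options.

```python
# Assumption from the thread: PDF (physical) page = printed (logical) page + 10.
PAGE_OFFSET = 10


def pdftotext_command(pdf, printed_first, printed_last, out_txt, layout=True):
    """Build the pdftotext argument list for a range of printed page numbers."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical column layout
    cmd += [
        "-f", str(printed_first + PAGE_OFFSET),  # first physical page
        "-l", str(printed_last + PAGE_OFFSET),   # last physical page
        pdf,
        out_txt,
    ]
    return cmd


command = pdftotext_command("TFP2021.pdf", 106, 107, "TFP2021.txt")
print(" ".join(command))
```

To actually run it, pass the list to subprocess.run(command, check=True) on a machine where pdftotext is installed.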

    OOPS! Was so focused on format I missed the content problem ;/


    Ironically, a copy/paste from xpdf seems to do a better job
    than -layout at preserving the columns widths over the page break.
    (Perhaps the text at the bottom of the second page messes with -layout.)

    I liked the text file you attached. Was that the default output of xpdf itself? [Intend to experiment with it this weekend.]


    Then open LibreCalc, and File/Open this file. When the import options
    window appears, change the selection criteria to "Fixed width", and then in the "ruler" bar above the text, click where you want a column divider (like at columns 39, 60, and 76; just eyeball it). Finish importing the document, and now you have a spreadsheet with the info you want that should be pretty easy to massage into the form you want.
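
The same fixed-width split can be sketched in a few lines of Python, as an alternative to clicking in the ruler. This is only an illustration: the cut positions 39, 60 and 76 are the eyeballed values from above, the sample line is invented (it is not from the PDF), and the function names are hypothetical.

```python
import csv
import io


def split_fixed_width(line, cuts):
    """Cut one line at the given 0-based character offsets; strip each piece."""
    bounds = [0, *cuts, None]  # None means "to the end of the line"
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def fixed_width_to_csv(text, cuts, keep=None):
    """Convert fixed-width text to CSV, keeping only the first `keep` columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(split_fixed_width(line, cuts)[:keep])
    return out.getvalue()


# Invented sample line, padded so the columns start at offsets 0, 39, 60, 76.
sample = "Item one".ljust(39) + "1.0".ljust(21) + "2.0".ljust(16) + "3.0"
print(fixed_width_to_csv(sample, [39, 60, 76], keep=2))
```

The csv module quotes any label that contains a comma, so such labels survive the conversion intact. With keep=2 only the two left columns are written.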

    Any particularly relevant tutorials?

    Perhaps your own thread at:

    https://lists.debian.org/debian-user/2025/02/msg00493.html

    is worth rereading. It seems to be the same operation on the
    same report from 15 years earlier.

    Yes BUT NOT in way you may be expecting ;/
    Someone recalled you saying xpdf was your default PDF viewer.
    So I installed it from the Debian repository via Synaptic.
    [ I'm running Debian 12.8 with MATE 2.53.20 desktop. ]
    In Caja I right click on TFP2021.pdf & choose open with xpdf.
    So far so good ;)
    I navigate to Table A4.14 without problem.
    No problem selecting a rectangular area of interest.

    BUT how do I copy it somewhere useful?

    http://www.xpdfreader.com/xpdf-man.html#CONTROLS says in part


    Toolbar
    toggle sidebar button

    Toggles (i.e., shows or hides) the sidebar.

    status indicator

    This icon is animated while Xpdf is rendering a page. It turns red when an error or warning has been issued. Clicking on it opens the error dialog.

    selection mode

    This icon is an "I-beam" in linear selection mode, and an arrow in block selection mode. Clicking on it toggles between the two selection modes.


    FURTHER DOWN it says
    Text selection
    In block selection mode, dragging the mouse with the left button held down will highlight an arbitrary rectangle. Shift-clicking will extend the selection.

    In linear selection mode, dragging with the left button will highlight text in reading order. Double-clicking or triple-clicking will select a word or a line, respectively. Shift-clicking will extend the selection.

    Selected text can be copied to the clipboard (with the edit/copy menu item). On X11, selected text will be available in the X selection buffer.


    Where is a Toolbar with a sidebar button?
    Where is "edit/copy menu"?

    Does xpdf have illustrations somewhere?

    I suspect xpdf is itself the tool I'm looking for.

    TIA




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Detlef Vollmann@21:1/5 to Richard Owlett on Thu Apr 17 23:30:01 2025
    On 4/17/25 21:24, Richard Owlett wrote:

    Selected text can be copied to the clipboard (with the edit/copy menu
    item). On X11, selected text will be available in the X selection buffer.


    Where is a Toolbar with a sidebar button?

    I've never seen such a "sidebar button".
    However, on the left margin of Xpdf there's a fine line and near
    its lower end there's a small square button. You can drag it to
    the right to open the sidebar.

    Where is "edit/copy menu"?

    Again, I've never seen any "edit/copy menu".
    But in an X11 application (like gedit if running on X11) you can
    use the middle mouse button to paste.

    Detlef

  • From tomas@tuxteam.de@21:1/5 to jeremy ardley on Fri Apr 18 07:20:01 2025
    On Fri, Apr 18, 2025 at 11:09:52AM +0800, jeremy ardley wrote:

    [...]

    I'm not sure if it is mentioned but just take a picture of each page and ask a good Large Language Model to give you a table.

    After this, I'd double-check each individual number. You'll never know
    if they are being made up, otherwise.

    Cheers
    --
    t


  • From tomas@tuxteam.de@21:1/5 to jeremy ardley on Fri Apr 18 09:50:02 2025
    On Fri, Apr 18, 2025 at 01:35:19PM +0800, jeremy ardley wrote:

    On 18/4/25 13:10, tomas@tuxteam.de wrote:
    I'm not sure if it is mentioned but just take a picture of each page and ask
    a good Large Language Model to give you a table.
    After this, I'd double-check each individual number. You'll never know
    if they are being made up, otherwise.


    I've been doing this for a couple of years now scanning bank statements etc.

    [...]

    I'm not going to argue with a believer. Do as you see fit.

    Usually, the reasoning is that humans make errors. But we humans have
    had quite a long while to learn to debug human made errors. Those
    made by LLMs look significantly different to me: I'm pretty sure
    this is going to be an issue.

    I see my colleagues now writing programs with LLMs. I don't look
    forward to the day I'll have to debug a larger corpus of this mess.

    Cheers
    --
    t


  • From Richard Owlett@21:1/5 to jeremy ardley on Fri Apr 18 12:10:02 2025
    On 4/17/25 10:09 PM, jeremy ardley wrote:

    On 15/4/25 22:19, Richard Owlett wrote:
    I don't know how to approach the problem.
    What I would like to end up with is a CSV formatted file containing
    the two left columns of Table A4.14 (pages 106&107) of
    [ https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf ].

    Suggestions?

    TIA

    I'm not sure if it is mentioned but just take a picture of each page and
    ask a good Large Language Model to give you a table.

    I could not find 4.14 but converted table 4.12 instead into markdown and
    csv using Claude 3.7 Sonata

    Synaptic finds neither "Claude" nor "Sonata" in the Debian repository.
    I'll not pursue this track further for the current project.
    Thank you.




    # Table A4.12. Thrifty Food Plan Market Basket for males age 51–70, June 2021: quantities, costs, and cost shares of Market Basket Categories in weekly amounts

    | Market Basket Categories | Quantity of each Market Basket Category (lbs) | Cost of each Market Basket Category ($) | Cost share of each Market Basket Category (%) |
    |--------------------------|---------------------------------------------|--------------------------------------|---------------------------------------------|
    | Vegetables | 10.65 | 12.09 | 23.05 |
    | Dark-green vegetables | 1.02 | 1.06 | 15.41 |
    | Red and orange vegetables | 2.26 | 3.14 | 25.97 |
    | Beans, peas, lentils | 1.79 | 1.59 | 13.17 |
    | Starchy vegetables | 3.68 | 2.97 | 24.56 |
    | Other vegetables | 1.91 | 2.52 | 20.88 |
    | Fruits | 7.10 | 6.62 | 12.64 |
    | Whole fruit | 3.16 | 3.46 | 52.30 |
    | 100% fruit juice | 3.94 | 3.16 | 47.70 |
    | Grains | 4.38 | 8.93 | 17.03 |
    | Whole-grain staple grains (e.g., rice, pasta, breads, tortillas) | 2.32 | 5.61 | 62.78 |
    | Whole-grain cereals (e.g., oatmeal, ready-to-eat cereal) | 0.00 | 0.00 | 0.00 |
    | Refined-grain staple grains (e.g., rice, pasta, breads, tortillas) | 1.96 | 3.04 | 34.09 |
    | Refined-grain other (e.g., cereals, crackers, snacks) | 0.10 | 0.28 | 3.14 |
    | Dairy | 11.76 | 7.00 | 13.34 |
    | Low- and non-fat milk, yogurt, soy alternatives | 11.67 | 6.87 | 98.22 |
    | Higher fat milk, yogurt, soy alternatives | 0.088 | 0.12 | 1.72 |
    | Cheese | <0.01 | <0.01 | 0.07 |
    | Protein foods | 4.90 | 14.27 | 27.21 |
    | Meats | 0.79 | 3.24 | 22.71 |
    | Poultry | 1.54 | 3.88 | 27.17 |
    | Eggs | 0.85 | 1.32 | 9.24 |
    | Seafood | 0.79 | 3.44 | 24.10 |
    | Nuts, seeds, soy products | 0.93 | 2.40 | 16.79 |
    | Miscellaneous | 1.45 | 3.52 | 6.72 |
    | Pre-prepared entrees and side dishes (e.g., soups, frozen entrees, pizza) | 0.24 | 0.51 | 14.377 |
    | Coffee and tea | 0.17 | 0.87 | 24.80 |
    | Table fats and oils | 0.67 | 1.61 | 45.71 |
    | Sauces, condiments, jams, honey, sugars, spices | 0.28 | 0.36 | 10.12 |
    | Other foods and beverages (e.g., soft drinks, fruit drinks, ice cream, pudding, cookies, candy bars) | 0.09 | 0.18 | 5.01 |
    | Total (Vegetables, Fruits, Grains, Dairy, Protein Foods, Miscellaneous) | 40.24 | 52.43 | 100.00 |
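
For turning a Markdown table like the one above into CSV without losing the commas inside the category names, a sketch along these lines might do. It uses only Python's standard csv module; the function name is hypothetical, and the two sample rows are copied from the table above.

```python
import csv
import io


def markdown_rows_to_csv(markdown):
    """Convert the pipe-delimited rows of a Markdown table to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # the |---|---| separator row
        writer.writerow(cells)  # fields containing commas get quoted
    return out.getvalue()


sample = """\
| Beans, peas, lentils | 1.79 | 1.59 | 13.17 |
| Cheese | <0.01 | <0.01 | 0.07 |
"""
print(markdown_rows_to_csv(sample))
```

This assumes every table row sits on one line, so rows that were wrapped by a mail client have to be rejoined first.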



    Market Basket Categories,Quantity of each Market Basket Category (lbs),Cost of each Market Basket Category ($),Cost share of each Market Basket Category (%)
    Vegetables,10.65,12.09,23.05
    Dark-green vegetables,1.02,1.06,15.41
    Red and orange vegetables,2.26,3.14,25.97
    Beans peas lentils,1.79,1.59,13.17
    Starchy vegetables,3.68,2.97,24.56
    Other vegetables,1.91,2.52,20.88
    Fruits,7.10,6.62,12.64
    Whole fruit,3.16,3.46,52.30
    100% fruit juice,3.94,3.16,47.70
    Grains,4.38,8.93,17.03
    Whole-grain staple grains (e.g. rice pasta breads tortillas),2.32,5.61,62.78
    Whole-grain cereals (e.g. oatmeal ready-to-eat cereal),0.00,0.00,0.00
    Refined-grain staple grains (e.g. rice pasta breads tortillas),1.96,3.04,34.09
    Refined-grain other (e.g. cereals crackers snacks),0.10,0.28,3.14
    Dairy,11.76,7.00,13.34
    Low- and non-fat milk yogurt soy alternatives,11.67,6.87,98.22
    Higher fat milk yogurt soy alternatives,0.088,0.12,1.72
    Cheese,<0.01,<0.01,0.07
    Protein foods,4.90,14.27,27.21
    Meats,0.79,3.24,22.71
    Poultry,1.54,3.88,27.17
    Eggs,0.85,1.32,9.24
    Seafood,0.79,3.44,24.10
    Nuts seeds soy products,0.93,2.40,16.79
    Miscellaneous,1.45,3.52,6.72
    Pre-prepared entrees and side dishes (e.g. soups frozen entrees pizza),0.24,0.51,14.377
    Coffee and tea,0.17,0.87,24.80
    Table fats and oils,0.67,1.61,45.71
    Sauces condiments jams honey sugars spices,0.28,0.36,10.12
    Other foods and beverages (e.g. soft drinks fruit drinks ice cream pudding cookies candy bars),0.09,0.18,5.01
    Total (Vegetables Fruits Grains Dairy Protein Foods Miscellaneous),40.24,52.43,100.00
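
One cheap sanity check on a transcription like this is to add up the six top-level category rows and compare them with the Total row. The sketch below does that with the values as posted above; the function names and the 0.05 tolerance are arbitrary choices for illustration. Passing only shows the rows are internally consistent; it says nothing about whether they match the PDF.

```python
from decimal import Decimal

# Top-level rows as posted: (category, quantity lbs, cost $, cost share %)
ROWS = [
    ("Vegetables", "10.65", "12.09", "23.05"),
    ("Fruits", "7.10", "6.62", "12.64"),
    ("Grains", "4.38", "8.93", "17.03"),
    ("Dairy", "11.76", "7.00", "13.34"),
    ("Protein foods", "4.90", "14.27", "27.21"),
    ("Miscellaneous", "1.45", "3.52", "6.72"),
]
TOTAL = ("40.24", "52.43", "100.00")  # the posted Total row


def column_sums(rows):
    """Sum the three numeric columns exactly, using Decimal."""
    return tuple(sum(Decimal(row[i]) for row in rows) for i in (1, 2, 3))


def mismatches(rows, total, tolerance=Decimal("0.05")):
    """Name the columns whose sum differs from the total by more than tolerance."""
    names = ("quantity", "cost", "share")
    sums = column_sums(rows)
    return [n for n, s, t in zip(names, sums, total)
            if abs(s - Decimal(t)) > tolerance]


print(column_sums(ROWS))
print(mismatches(ROWS, TOTAL))
```

The tolerance allows for rounding in the published figures. Checking each number against the source page is still needed.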



  • From Richard Owlett@21:1/5 to David Wright on Fri Apr 18 12:40:01 2025
    On 4/17/25 9:45 PM, David Wright wrote:
    On Thu 17 Apr 2025 at 14:24:35 (-0500), Richard Owlett wrote:
    On 4/16/25 8:35 AM, David Wright wrote:
    Ironically, a copy/paste from xpdf seems to do a better job
    than -layout at preserving the columns widths over the page break.
    (Perhaps the text at the bottom of the second page messes with -layout.)

    I liked the text file you attached. Was that the default output of
    xpdf itself? [Intend to experiment with it this weekend.]

    In as much as I carried out the actions below, and the
    attached text was what accumulated in the editor's buffer.
    (Two pages of the report, so copy, paste, copy, paste.)

    Someone recalled you saying xpdf was your default PDF viewer.

    Correct. In mc, if I press Return (≡Open) when the cursor is on
    a PDF, it opens in xpdf. If I press F3 (≡View), then it opens
    in zathura. F4 (≡Edit) can toggle between the raw text and a
    hex representation thereof. (I configured these choices in mc.)

    A quick read of https://midnight-commander.org/ suggests that the
    combination of xpdf *and* mc plays a significant role in your success.
    As mc is in the Debian repository, I've just installed it.
    How can I exactly duplicate your default mc settings?
    Is there perhaps a configuration file?
    [ I'll continue with Caja for my routine usage. ]

    So I installed it from the Debian repository via Synaptic.
    [ I'm running Debian 12.8 with MATE 2.53.20 desktop. ]
    In Caja I right click on TFP2021.pdf & choose open with xpdf.
    So far so good ;)
    I navigate to Table A4.14 without problem.
    No problem selecting a rectangular area of interest.

    Yes, it should create a black (?inverse) rectangle over the area.

    BUT how do I copy it somewhere useful?

    I can either press the middle (paste) mouse button, or I can
    press Shift-Insert. The latter may be a default key combination
    as I don't immediately see where it's configured. DEs might
    behave differently, especially when they try to ape Windows;
    so you might try ^V.

    I'll try after I post this.


    So, where to press those keys: in your favourite editor's buffer.
    Don't paste into a shell/command line by accident (unless you've
    got bracketed-paste set: then it doesn't matter).

    Whether anything is pasted depends on there being some text in the
    selection buffer. Not all PDFs will let you copy stuff out like that.
    Also bear in mind that some PDF pages that look like text may
    actually be scanned images.

    With xpdf, the contents of the rectangle is copied, and I've always
    found the boundaries quite precise. OTOH with zathura, the rectangle
    only helps you remember where you started dragging from, as it copies
    in line-mode; and the boundaries are fuzzy compared with typical text selection (where the selection gets highlighted).

    Where is a Toolbar with a sidebar button?

    Detlef posted about a sidebar. The only thing I've seen that used for
    is a navigation tree (like the ones in history and bookmarks for FF).

    Where is "edit/copy menu"?

    Does xpdf have illustrations somewhere?

    I suspect xpdf is itself the tool I'm looking for.

    AFAIK dragging a mouse is just assumed knowledge nowadays. I haven't
    found it necessary to, say, press a key like ^C to copy, which a
    browser might require. And pasting into an edit buffer is again
    something one does as a matter of routine.

    Cheers,
    David.



  • From Titus Newswanger@21:1/5 to jeremy ardley on Sat Apr 19 17:40:02 2025
    On 4/18/25 02:53, jeremy ardley wrote:

    Obviously you've never had to herd junior developers. I have had to.
    It sucks and productivity is woeful due to all the checking and unit
    testing and such, plus they quite often have comprehension problems
    and are unable to follow instructions - and I'm talking about honours graduates in IT.

    Now I am more productive than literally 10 junior developers. Me and
    Claude that is. The difference is I know how to code and how to craft instructions to smart LLMs like Claude.
    I agree. To a certain extent I know my way around python. Java I never
    studied. If I enlist LLM, in python it multiplies my productivity
    because I know what to ask for and usually notice immediately when it
    drops into a rabbit hole. But coding something like java, LLM takes me
    through every rabbit hole and I can hardly tell the difference but sure
    don't get productive with my project. I'm not implying that LLM is
    better at python than java. My point is, LLM does not make me an instant
    pro on a subject I haven't studied.

    Your point is more valid if you are referring to the crap that is
    produced by junior to intermediate programmers being
    assisted by Code Pilot. It's truly awful and yet another evil trick by Micro$oft

    --
    Thank You!

    Titus Newswanger

  • From David Wright@21:1/5 to jeremy ardley on Tue Apr 22 04:50:01 2025
    On Fri 18 Apr 2025 at 11:09:52 (+0800), jeremy ardley wrote:
    On 15/4/25 22:19, Richard Owlett wrote:
    I don't know how to approach the problem.
    What I would like to end up with is a CSV formatted file
    containing the two left columns of Table A4.14 (pages 106&107) of
    [ https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
    ].

    Suggestions?

    TIA

    I'm not sure if it is mentioned but just take a picture of each page
    and ask a good Large Language Model to give you a table.

    Assuming you took a photograph of the screen, I wouldn't be too
    surprised about its confusing 8 and 0. Everything is so grey on
    grey nowadays …

    I could not find 4.14 but converted table 4.12 instead into markdown
    and csv using Claude 3.7 Sonata

    … but not being able to find 4.14! That's remarkable :)

    # Table A4.12. Thrifty Food Plan Market Basket for males age 51–70,
    June 2021: quantities, costs, and cost shares of Market Basket
    Categories in weekly amounts

    Cheers,
    David.

  • From tomas@tuxteam.de@21:1/5 to jeremy ardley on Thu Apr 24 08:50:01 2025
    On Thu, Apr 24, 2025 at 11:32:23AM +0800, jeremy ardley wrote:

    On 24/4/25 10:31, Max Nikulin wrote:

    By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best
    source for data extraction.


    ISO 14289 is an accessibility standard for PDF. It allows for the creation
    of a "Tagged PDF" where semantic information, including table structures (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure tree
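
A very crude way to guess whether a given PDF carries such a structure tree is to look for the /StructTreeRoot and /MarkInfo catalog keys in the raw bytes. The sketch below does only that, and the function names are made up. It is a heuristic under a strong assumption: if the catalog is stored in a compressed object stream, the keys are not visible this way and the check reports a false negative.

```python
def looks_tagged(pdf_bytes):
    """Heuristic: True if the raw bytes mention a logical structure tree."""
    # /StructTreeRoot is the catalog key pointing at the structure tree;
    # /MarkInfo is the catalog dictionary that flags a file as Tagged PDF.
    return b"/StructTreeRoot" in pdf_bytes and b"/MarkInfo" in pdf_bytes


def file_looks_tagged(path):
    """Read a PDF file from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_tagged(handle.read())


# Hand-written catalog fragments, only for demonstration.
tagged = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
untagged = b"<< /Type /Catalog /Pages 2 0 R >>"
print(looks_tagged(tagged), looks_tagged(untagged))
```

A positive result only says a structure tree is declared; whether the tables are actually marked up with <Table>, <TR>, <TH> and <TD> needs a proper PDF library or inspection tool.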

    You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/

    Oh, thanks for this one :)

    Cheers
    --
    tomás


  • From Andrew M.A. Cater@21:1/5 to tomas@tuxteam.de on Thu Apr 24 10:10:01 2025
    On Thu, Apr 24, 2025 at 08:48:43AM +0200, tomas@tuxteam.de wrote:
    On Thu, Apr 24, 2025 at 11:32:23AM +0800, jeremy ardley wrote:

    On 24/4/25 10:31, Max Nikulin wrote:

    By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best source for data extraction.


    ISO 14289 is an accessibility standard for PDF. It allows for the creation of a "Tagged PDF" where semantic information, including table structures (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure
    tree


    Disclaimer: I deal with some accessibility documentation in my day job.
    The problem is that very few authors know this - and very few tools support tagging. Adobe Acrobat is about the best, but only in the paid ($$) versions.

    Informal advice is always "Write it in Word, then let Word convert it to
    PDF". That works if the author is disciplined and knows about tagging,
    heading orders and so on - but it can still produce tagged PDFs that
    are nominally accessible to screen readers but practically unusable.

    The result is that PDFs may well be completely fine as a secure archival format, non-modifiable, readable everywhere - and useless to a segment
    of the population which is blind or visually impaired.
    Deque University - deque.com - has a whole series of accessibility
    courses and a couple of *long* ones on how to write a PDF :(

    This also goes for HTML, which has to be well written, with images
    tagged with alt-text and so on. There is an ARIA standard which
    helps make the web more accessible but that's an adjunct, to
    be used over and above well-written HTML and CSS.

    All best, as ever,

    Andy
    (amacater@debian.org)

    You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/

    Oh, thanks for this one :)

    Cheers
    --
    tomás
