On Wed 16 Apr 2025 at 07:21:07 (-0500), Richard Owlett wrote:
On 4/15/25 11:01 AM, Kent West wrote:
On Tue, Apr 15, 2025 at 10:32 AM Nicolas George wrote:
Richard Owlett (HE12025-04-15):
I don't know how to approach the problem.
What I would like to end up with is a CSV formatted file containing the two
left columns of Table A4.14 (pages 106&107) of
[
https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
].
Suggestions?
Have you tried starting with pdftotext -layout and then adding the CSV delimiters using a powerful editor? The rectangle selection of Vim might be useful.
Riffing off of Nicolas' suggestion, here's what I would do:
$ pdftotext -f 106 -l 107 TFP2021.pdf TFP2021.txt
As I replied to Nicolas I'll try both that and also a run with the
"-layout" option.
BTW I would add 10 to those page numbers (physical vs logical
pages). Otherwise you get the wrong table.
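The page-offset remark can be sketched in Python as below. This is only an illustration: the helper names `pdftotext_cmd` and `extract_pages` are made up for this example, and the offset of 10 is taken from the remark above and is an assumption for any other PDF. The `-f`, `-l` and `-layout` options are the ones already used in this thread, and `-` as the output name sends the text to stdout.

```python
import subprocess


def pdftotext_cmd(pdf, first_printed, last_printed, offset=10, layout=True):
    """Build a pdftotext command line for a range of printed page numbers.

    offset is the difference between the number printed on the page and the
    physical page index in the file (10 for this report, per the thread;
    an assumption for any other document).
    """
    cmd = [
        "pdftotext",
        "-f", str(first_printed + offset),
        "-l", str(last_printed + offset),
    ]
    if layout:
        cmd.append("-layout")
    # "-" as the output file name makes pdftotext write to stdout.
    cmd += [pdf, "-"]
    return cmd


def extract_pages(pdf, first_printed, last_printed, **kwargs):
    """Run pdftotext (must be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(pdf, first_printed, last_printed, **kwargs),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

With these assumptions, `extract_pages("TFP2021.pdf", 106, 107)` would run `pdftotext -f 116 -l 117 -layout TFP2021.pdf -`.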
Ironically, a copy/paste from xpdf seems to do a better job
than -layout at preserving the columns widths over the page break.
(Perhaps the text at the bottom of the second page messes with -layout.)
Then open LibreCalc, and File/Open this file. When the import options
window appears, change the selection criteria to "Fixed width", and then in the "ruler" bar above the text, click where you want a column divider (like at columns 39, 60, and 76; just eyeball it). Finish importing the document, and now you have a spreadsheet with the info you want that should be pretty easy to massage into the form you want.
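For anyone who would rather script the fixed-width step than click it, here is a rough Python equivalent. It is a sketch under assumptions: `fixed_width_to_csv` is a hypothetical helper, the cut positions are zero-based character offsets that have to be eyeballed from the actual pdftotext output just like the ruler clicks, and `keep=2` reflects the wish for only the two left columns.

```python
import csv
import io


def fixed_width_to_csv(text, cuts, keep=None):
    """Split each non-blank line at the character offsets in cuts.

    cuts: increasing zero-based offsets where a new column starts.
    keep: if given, only the first `keep` columns are written.
    Returns the result as CSV text; fields containing commas are quoted.
    """
    bounds = [0, *cuts, None]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        fields = [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]
        if keep is not None:
            fields = fields[:keep]
        writer.writerow(fields)
    return out.getvalue()
```

Called as `fixed_width_to_csv(text, [39, 60, 76], keep=2)`, it keeps the category name and the first numeric column. Header lines and footnotes from the page would still need removing by hand.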
Any particularly relevant tutorials?
Perhaps your own thread at:
https://lists.debian.org/debian-user/2025/02/msg00493.html
is worth rereading. It seems to be the same operation on the
same report from 15 years earlier.
Toolbar
toggle sidebar button
Toggles (i.e., shows or hides) the sidebar.
status indicator
This icon is animated while Xpdf is rendering a page. It turns red when an error or warning has been issued. Clicking on it opens the error dialog.
selection mode
This icon is an "I-beam" in linear selection mode, and an arrow in block selection mode. Clicking on it toggles between the two selection modes.
Text selection
In block selection mode, dragging the mouse with the left button held down will highlight an arbitrary rectangle. Shift-clicking will extend the selection.
In linear selection mode, dragging with the left button will highlight text in reading order. Double-clicking or triple-clicking will select a word or a line, respectively. Shift-clicking will extend the selection.
Selected text can be copied to the clipboard (with the edit/copy menu item). On X11, selected text will be available in the X selection buffer.
Where is a Toolbar with a sidebar button?
Where is "edit/copy menu"?
I'm not sure if it is mentioned but just take a picture of each page and ask a good Large Language Model to give you a table.
On 18/4/25 13:10, tomas@tuxteam.de wrote:
I'm not sure if it is mentioned but just take a picture of each page and ask
a good Large Language Model to give you a table.
After this, I'd double-check each individual number. You'll never know
if they are being made up, otherwise.
I've been doing this for a couple of years now scanning bank statements etc.
On 15/4/25 22:19, Richard Owlett wrote:
I don't know how to approach the problem.
What I would like to end up with is a CSV formatted file containing
the two left columns of Table A4.14 (pages 106&107) of
[
https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
].
Suggestions?
TIA
I'm not sure if it is mentioned but just take a picture of each page and
ask a good Large Language Model to give you a table.
I could not find Table A4.14, so I converted Table A4.12 instead into Markdown and
CSV using Claude 3.7 Sonnet.
# Table A4.12. Thrifty Food Plan Market Basket for males age 51–70, June 2021: quantities, costs, and cost shares of Market Basket Categories in weekly amounts
| Market Basket Categories | Quantity of each Market Basket Category (lbs) | Cost of each Market Basket Category ($) | Cost share of each Market Basket Category (%) |
|---|---|---|---|
| Vegetables | 10.65 | 12.09 | 23.05 |
| Dark-green vegetables | 1.02 | 1.06 | 15.41 |
| Red and orange vegetables | 2.26 | 3.14 | 25.97 |
| Beans, peas, lentils | 1.79 | 1.59 | 13.17 |
| Starchy vegetables | 3.68 | 2.97 | 24.56 |
| Other vegetables | 1.91 | 2.52 | 20.88 |
| Fruits | 7.10 | 6.62 | 12.64 |
| Whole fruit | 3.16 | 3.46 | 52.30 |
| 100% fruit juice | 3.94 | 3.16 | 47.70 |
| Grains | 4.38 | 8.93 | 17.03 |
| Whole-grain staple grains (e.g., rice, pasta, breads, tortillas) | 2.32 | 5.61 | 62.78 |
| Whole-grain cereals (e.g., oatmeal, ready-to-eat cereal) | 0.00 | 0.00 | 0.00 |
| Refined-grain staple grains (e.g., rice, pasta, breads, tortillas) | 1.96 | 3.04 | 34.09 |
| Refined-grain other (e.g., cereals, crackers, snacks) | 0.10 | 0.28 | 3.14 |
| Dairy | 11.76 | 7.00 | 13.34 |
| Low- and non-fat milk, yogurt, soy alternatives | 11.67 | 6.87 | 98.22 |
| Higher fat milk, yogurt, soy alternatives | 0.088 | 0.12 | 1.72 |
| Cheese | <0.01 | <0.01 | 0.07 |
| Protein foods | 4.90 | 14.27 | 27.21 |
| Meats | 0.79 | 3.24 | 22.71 |
| Poultry | 1.54 | 3.88 | 27.17 |
| Eggs | 0.85 | 1.32 | 9.24 |
| Seafood | 0.79 | 3.44 | 24.10 |
| Nuts, seeds, soy products | 0.93 | 2.40 | 16.79 |
| Miscellaneous | 1.45 | 3.52 | 6.72 |
| Pre-prepared entrees and side dishes (e.g., soups, frozen entrees, pizza) | 0.24 | 0.51 | 14.377 |
| Coffee and tea | 0.17 | 0.87 | 24.80 |
| Table fats and oils | 0.67 | 1.61 | 45.71 |
| Sauces, condiments, jams, honey, sugars, spices | 0.28 | 0.36 | 10.12 |
| Other foods and beverages (e.g., soft drinks, fruit drinks, ice cream, pudding, cookies, candy bars) | 0.09 | 0.18 | 5.01 |
| Total (Vegetables, Fruits, Grains, Dairy, Protein Foods, Miscellaneous) | 40.24 | 52.43 | 100.00 |
Market Basket Categories,Quantity of each Market Basket Category (lbs),Cost of each Market Basket Category ($),Cost share of each Market Basket Category (%)
Vegetables,10.65,12.09,23.05
Dark-green vegetables,1.02,1.06,15.41
Red and orange vegetables,2.26,3.14,25.97
Beans peas lentils,1.79,1.59,13.17
Starchy vegetables,3.68,2.97,24.56
Other vegetables,1.91,2.52,20.88
Fruits,7.10,6.62,12.64
Whole fruit,3.16,3.46,52.30
100% fruit juice,3.94,3.16,47.70
Grains,4.38,8.93,17.03
Whole-grain staple grains (e.g. rice pasta breads tortillas),2.32,5.61,62.78
Whole-grain cereals (e.g. oatmeal ready-to-eat cereal),0.00,0.00,0.00
Refined-grain staple grains (e.g. rice pasta breads tortillas),1.96,3.04,34.09
Refined-grain other (e.g. cereals crackers snacks),0.10,0.28,3.14
Dairy,11.76,7.00,13.34
Low- and non-fat milk yogurt soy alternatives,11.67,6.87,98.22
Higher fat milk yogurt soy alternatives,0.088,0.12,1.72
Cheese,<0.01,<0.01,0.07
Protein foods,4.90,14.27,27.21
Meats,0.79,3.24,22.71
Poultry,1.54,3.88,27.17
Eggs,0.85,1.32,9.24
Seafood,0.79,3.44,24.10
Nuts seeds soy products,0.93,2.40,16.79
Miscellaneous,1.45,3.52,6.72
Pre-prepared entrees and side dishes (e.g. soups frozen entrees pizza),0.24,0.51,14.377
Coffee and tea,0.17,0.87,24.80
Table fats and oils,0.67,1.61,45.71
Sauces condiments jams honey sugars spices,0.28,0.36,10.12
Other foods and beverages (e.g. soft drinks fruit drinks ice cream pudding cookies candy bars),0.09,0.18,5.01
Total (Vegetables Fruits Grains Dairy Protein Foods Miscellaneous),40.24,52.43,100.00
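In the spirit of the advice earlier in the thread to double-check each number, a cheap first pass is to see whether the rows add up to their stated totals. The sketch below is only that: the figures are copied from the CSV above, the helper names and the 0.02 rounding tolerance are assumptions, and a mismatch only says which rows to re-read against the PDF, not which side is wrong.

```python
from decimal import Decimal

# (name, quantity in lbs, cost in $) for the six top-level rows of the CSV.
CATEGORIES = [
    ("Vegetables", "10.65", "12.09"),
    ("Fruits", "7.10", "6.62"),
    ("Grains", "4.38", "8.93"),
    ("Dairy", "11.76", "7.00"),
    ("Protein foods", "4.90", "14.27"),
    ("Miscellaneous", "1.45", "3.52"),
]
TOTAL = ("40.24", "52.43")  # quantity and cost from the "Total" row

# Sub-rows listed under "Vegetables" in the CSV.
VEGETABLES = [
    ("Dark-green vegetables", "1.02", "1.06"),
    ("Red and orange vegetables", "2.26", "3.14"),
    ("Beans peas lentils", "1.79", "1.59"),
    ("Starchy vegetables", "3.68", "2.97"),
    ("Other vegetables", "1.91", "2.52"),
]


def column_sum(rows, index):
    """Exact decimal sum of one column of string-valued rows."""
    return sum((Decimal(row[index]) for row in rows), Decimal("0"))


def check_totals(rows, stated, tolerance=Decimal("0.02")):
    """Return (column, summed, stated) for each column that misses its total.

    tolerance absorbs rounding in the published figures; 0.02 is a guess.
    """
    problems = []
    for label, index, value in (("quantity", 1, stated[0]), ("cost", 2, stated[1])):
        summed = column_sum(rows, index)
        if abs(summed - Decimal(value)) > tolerance:
            problems.append((label, summed, Decimal(value)))
    return problems
```

`check_totals(CATEGORIES, TOTAL)` comes back empty, so the six categories agree with the Total row. `check_totals(VEGETABLES, ("10.65", "12.09"))` flags the cost column (11.28 summed against 12.09 stated), so those rows are worth comparing with the PDF by eye.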
On Thu 17 Apr 2025 at 14:24:35 (-0500), Richard Owlett wrote:
On 4/16/25 8:35 AM, David Wright wrote:
Ironically, a copy/paste from xpdf seems to do a better job
than -layout at preserving the columns widths over the page break.
(Perhaps the text at the bottom of the second page messes with -layout.)
I liked the text file you attached. Was that the default output of
xpdf itself? [Intend to experiment with it this weekend.]
Yes, inasmuch as I carried out the actions below and the
attached text was what accumulated in the editor's buffer.
(Two pages of the report, so copy, paste, copy, paste.)
Someone recalled you saying xpdf was your default PDF viewer.
Correct. In mc, if I press Return (≡Open) when the cursor is on
a PDF, it opens in xpdf. If I press F3 (≡View), then it opens
in zathura. F4 (≡Edit) can toggle between the raw text and a
hex representation thereof. (I configured these choices in mc.)
So I installed it from the Debian repository via Synaptic.
[ I'm running Debian 12.8 with MATE 2.53.20 desktop. ]
In Caja I right click on TFP2021.pdf & choose open with xpdf.
So far so good ;)
I navigate to Table A4.14 without problem.
No problem selecting a rectangular area of interest.
Yes, it should create a black (?inverse) rectangle over the area.
BUT how do I copy it somewhere useful?
I can either press the middle (paste) mouse button, or I can
press Shift-Insert. The latter may be a default key combination
as I don't immediately see where it's configured. DEs might
behave differently, especially when they try to ape Windows;
so you might try ^V.
So, where to press those keys: in your favourite editor's buffer.
Don't paste into a shell/command line by accident (unless you've
got bracketed-paste set: then it doesn't matter).
Whether anything is pasted depends on there being some text in the
selection buffer. Not all PDFs will let you copy stuff out like that.
Also bear in mind that some PDF pages that look like text may
actually be scanned images.
With xpdf, the contents of the rectangle are copied, and I've always
found the boundaries quite precise. OTOH with zathura, the rectangle
only helps you remember where you started dragging from, as it copies
in line-mode; and the boundaries are fuzzy compared with typical text
selection (where the selection gets highlighted).
Where is a Toolbar with a sidebar button?
Detlef posted about a sidebar. The only thing I've seen that used for
is a navigation tree (like the ones in history and bookmarks for FF).
Where is "edit/copy menu"?
Does xpdf have illustrations somewhere?
I suspect xpdf is itself the tool I'm looking for.
AFAIK dragging a mouse is just assumed knowledge nowadays. I haven't
found it necessary to, say, press a key like ^C to copy, which a
browser might require. And pasting into an edit buffer is again
something one does as a matter of routine.
Cheers,
David.
I agree. To a certain extent I know my way around python. Java I never
Obviously you've never had to herd junior developers. I have had to.
It sucks and productivity is woeful due to all the checking and unit
testing and such, plus they quite often have comprehension problems
and are unable to follow instructions - and I'm talking about honours graduates in IT.
Now I am more productive than literally 10 junior developers. Me and
Claude that is. The difference is I know how to code and how to craft instructions to smart LLMs like Claude.
Your point is more valid if you are referring to the crap that is
produced by junior to intermediate programmers being
assisted by Code Pilot. It's truly awful and yet another evil trick by Micro$oft.
On 24/4/25 10:31, Max Nikulin wrote:
By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best
source for data extraction.
ISO 14289 is an accessibility standard for PDF. It allows for the creation
of a "Tagged PDF" where semantic information, including table structures (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure tree.
You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/
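Whether a given PDF carries such tags can be checked before investing in a tag-based extraction. The sketch below is an assumption-laden illustration: it relies on the `pdfinfo` tool printing a `Tagged:` line with `yes` or `no`, and the helper names are made up here. A tagged file does not necessarily mark up its tables, so this is only a first filter.

```python
import subprocess


def is_tagged(pdfinfo_output):
    """Read the 'Tagged:' line from pdfinfo output.

    Returns True or False, or None if no such line is present.
    """
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Tagged":
            return value.strip().lower() == "yes"
    return None


def pdf_is_tagged(path):
    """Run pdfinfo (must be installed) on path and interpret its output."""
    result = subprocess.run(
        ["pdfinfo", path], check=True, capture_output=True, text=True
    )
    return is_tagged(result.stdout)
```

For example, `pdf_is_tagged("TFP2021.pdf")` would report whether the report in this thread is tagged at all.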
On Thu, Apr 24, 2025 at 11:32:23AM +0800, jeremy ardley wrote:
On 24/4/25 10:31, Max Nikulin wrote:
By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best source for data extraction.
ISO 14289 is an accessibility standard for PDF. It allows for the creation of a "Tagged PDF" where semantic information, including table structures (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure tree
You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/
Oh, thanks for this one :)
Cheers
--
tomás