On 8/25/2024 1:00 AM, Janis Papanagnou wrote:
Myself I'm usually not using CSV format(s), but recently I advertised
GNU Awk (given that newer versions support CSV data processing) to a
friend seeking CSV solutions.
I was quite astonished when I stumbled across a StackOverflow article
about CSV processing with contemporary versions of GNU Awk and read
that you are restricted to comma as separator and double quotes to
enclose strings. The workarounds provided at SO were extremely clumsy.
Given that using ',', ';', '|' (or other delimiters) and also various
types of quotes are just a lexical (no functional) difference I wonder
whether it would be sensible to be able to define them, say, through
setting a PROCINFO element?
Janis
https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk
FYI gawk just inherited those behaviors (plus mandatory stripping of the quotes from quoted fields, see https://lists.gnu.org/archive/html/bug-gawk/2023-11/msg00018.html) from Kernighan's awk.
On 26/8/24 at 14:54, Janis Papanagnou wrote:
My opinion on this is that I wouldn't expect GNU Awk to become a (yet
another) CSV-processor. It's very convenient to have an easy input of
CSV data to be processed like other tabular data with Awk. So removal
of the (outer) quotes, transforming "inner" quotes of fields according
to the CSV-standard(s), and handling the escape symbol, would serve my
expectations. (I don't need CSV-output formatting, but I understand if
there is such a demand.)
Perhaps you could try my gawk-csvio pure gawk library. Just include a
first unconditional rule calling csvimport(...) and the CSV input data
will be automatically converted to a regular OFS-delimited record ready
to be processed.
Please find the library at http://mcollado.z15.es/gawk-extras/. The documentation is also available there to read before downloading.
On 8/26/2024 7:54 AM, Janis Papanagnou wrote:
<snip>
I'd have liked to provide more concrete information here, but I'm at
the moment even unable to reproduce Awk's behavior as documented in
its manual; I've tried the following command with various locales
$ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
-| 5,321
but always got just 5 as result.
You need to specifically TELL gawk to use your locale to read input
numbers:
$ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
5
$ echo 4,321 | POSIXLY_CORRECT=1 LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
5,321
$ echo 4,321 | LC_ALL=en_DK.utf-8 gawk -N '{ print $1 + 1 }'
5,321
See https://www.gnu.org/software/gawk/manual/gawk.html#Locale-influences-conversions
for more info on that.
Regards,
Ed
On 8/26/2024 8:39 PM, Janis Papanagnou wrote:
I've missed that there was an explicit
$ export POSIXLY_CORRECT=1
set on the very top of these examples. Gee!
POSIXLY_CORRECT=1 (or equivalently `--posix` aka `-P`) affects numbers
in the input your script reads (as shown in the previous post) and
strings being converted to numbers in your code; it doesn't affect
literal numbers in the source code for your script that awk reads.
In the source code the decimal separator for a literal number (as
opposed to a string being converted to a number) is always `.`.
You can't use, say, a comma as the decimal separator in a literal number
because a comma already means something in the awk syntax, e.g.
`print 4,321` means the same as `print 4 OFS 321`.
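
To make that last point concrete, here is a small sketch using only portable awk (nothing gawk-specific). The shell variable names are just for illustration, and LC_ALL=C is an assumption added to keep the output independent of whatever locale you happen to run it in:

```shell
# The comma in a print statement separates expressions; awk joins them
# on output with OFS, which is a single space by default.
default_ofs=$(LC_ALL=C awk 'BEGIN { print 4,321 }')

# Changing OFS changes what is printed between the two numbers.
custom_ofs=$(LC_ALL=C awk 'BEGIN { OFS = "-"; print 4,321 }')

# A literal number in the source code always uses "." as decimal separator.
literal_sum=$(LC_ALL=C awk 'BEGIN { print 4.321 + 1 }')

printf '%s\n' "$default_ofs" "$custom_ofs" "$literal_sum"
```

This should print `4 321`, `4-321` and `5.321`, i.e. in the first two cases awk sees two separate numbers, 4 and 321, never the single value 4,321.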
[examples and explanations]
On 27/8/24 at 2:31, Janis Papanagnou wrote:
If I understand correctly that the library you mention would address
the two topics (field separator and quoting) then there's even less a
point (I suppose) to use the new '--csv' option in GNU Awk; just use
your library instead?
A user decision.
HTH