On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
import polars as pl
pl.read_json("file.json")
This is not going to work unless the computer has a lot more than 60GiB of RAM.
As later suggested, a streaming parser is required.
On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
Whether and to what degree you can stream JSON depends on JSON
structure. In general, however, JSON cannot be streamed (but commonly
it can be).
Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still valid JSON (the format places no limit on how many digits a
number can have). And you cannot parse this number in a streaming way
because in order to do that, you need to start with the least
significant digit.
Which is how arabic numbers were originally parsed, but when
westerners adopted them from a R->L written language, they didn't flip
them around to match the L->R written language into which they were
being adopted.
So now long numbers can't be parsed as a stream in software. They
should have anticipated this problem back in the 13th century and
flipped the numbers around.
On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:
On 9/30/2024 11:30 AM, Barry via Python-list wrote:
On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
import polars as pl
pl.read_json("file.json")
This is not going to work unless the computer has a lot more than 60GiB of RAM.
As later suggested, a streaming parser is required.
Streaming won't work because the file is gzipped. You have to receive
the whole thing before you can unzip it. Once unzipped it will be even
larger, and all in memory.
Streaming gzip is perfectly possible. You may be thinking of PKZip
which has its EOCD at the end of the file (although it may still be
possible to stream-decompress if you work at it).
ChrisA
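To illustrate the point about gzip, here is a minimal sketch using only the standard library. The helper name `iter_gunzip`, the 64-byte chunk size, and the sample payload are made up for the example; it assumes a single-member gzip stream and is not meant as a complete solution.

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks.

    Hypothetical helper: assumes a single-member gzip stream.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    # Emit anything still buffered once the input is exhausted.
    tail = decomp.flush()
    if tail:
        yield tail


# Stand-in for a large download: compress some data, then feed it back
# 64 bytes at a time, as if it were arriving from a socket.
payload = b'{"n": 1}\n' * 10_000
compressed = gzip.compress(payload)
pieces = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(iter_gunzip(pieces))
```

Each decompressed piece is available as soon as its input chunk arrives, so the consumer never needs the whole compressed file in hand. For a file on disk, `gzip.open(path, "rb")` followed by repeated `read(n)` calls gives the same effect with less code.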
On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:
But why do I need to start with the least
significant digit?
If you start from the most significant, you don't know anything about
the number until you finish parsing it. There's almost nothing you can
say about a number given that it starts with a particular sequence
(since you don't know how MANY digits there are). However, if you know
the LAST digits, you can make certain statements about it (trivial
examples being whether it's odd or even).
It's not very, well, significant. But there's something to it. And it
extends nicely to p-adic numbers, which can have an infinite number of nonzero digits to the left of the decimal:
https://en.wikipedia.org/wiki/P-adic_number
On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:
On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:
In Common Lisp, integers can be written in any integer base from two
to thirty six, inclusive. So knowing the last digit doesn't tell
you whether an integer is even or odd until you know the base
anyway.
I had to think about that for an embarrassingly long time before it
clicked.
The only part I'm not clear on is what identifies the base. If you're
going to write numbers little-endian, it's not that hard to also write
them with a base indicator before the digits [...]
2QdxY4RzWzUUiLuE@potatochowder.com writes:
[...]
In Common Lisp, you can write integers as #nnR[digits], where nn is the decimal representation of the base (possibly without a leading zero),
the # and the R are literal characters, and the digits are written in
the intended base. So the input #16fFFFF is read as the integer 65535.
Typo: You meant #16RFFFF, not #16fFFFF.
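For readers who don't have a Lisp handy, the notation described above can be mimicked in Python. This is only a sketch: `parse_radix_literal` is a hypothetical helper, it accepts just the uppercase `#nnRdigits` form mentioned in this thread, and it leans on the fact that Python's built-in `int()` also handles bases 2 through 36.

```python
import re

# '#', a decimal base, a literal 'R', then the digits in that base.
RADIX_LITERAL = re.compile(r"#([0-9]+)R([0-9A-Za-z]+)")


def parse_radix_literal(text):
    """Parse a string such as '#16RFFFF' into an int (hypothetical helper)."""
    match = RADIX_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a radix literal: {text!r}")
    base = int(match.group(1))
    if not 2 <= base <= 36:
        raise ValueError(f"base out of range: {base}")
    # int() raises ValueError if a digit is not valid in the given base.
    return int(match.group(2), base)


value = parse_radix_literal("#16RFFFF")  # 65535
```

Note that the base comes first, so a reader knows how to interpret the digits before seeing any of them; the mistyped `#16fFFFF` is rejected by this sketch with a `ValueError`.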
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated result
by 10 (or the appropriate base) and add the next value. Oh, and handle scientific notation as a special case, and perhaps fail spectacularly instead of recovering gracefully in certain edge cases. And in the pathological case of a single number with 60 billion digits, run out of memory (and complain loudly to the person who claimed that the file contained a "dataset"). But why do I need to start with the least significant digit?
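The accumulation described above can be sketched in a few lines of Python. The generator name `running_values` is made up for illustration; it yields the accumulated value after every digit, which makes it easy to see what a consumer would have at each step.

```python
def running_values(chars, base=10):
    """Yield the accumulated value after each digit, most significant first.

    Hypothetical helper: no sign, fraction, or exponent handling.
    """
    value = 0
    for ch in chars:
        # int() raises ValueError if ch is not a valid digit in this base.
        value = value * base + int(ch, base)
        yield value


steps = list(running_values("123"))              # [1, 12, 123]
*_, hex_value = running_values(iter("FFFF"), 16)  # 65535
```

The input can be any iterator of characters, so the digits never need to be held in memory at once. The intermediate values (1, then 12) are only provisional, though: until the last digit arrives, the caller cannot tell the magnitude of the final number.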
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the magnitude yet. What about two digits? -- Same thing. You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).
So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
Consider also an interesting
consequence of SCSI not being able to have infinite words: this means, among other things, that fsync() is nonsense! :) If you aren't
familiar with the concept: the UNIX filesystem API suggests that it's
possible to destage an arbitrarily large file (or a chunk of a file) to disk.
But SCSI is built of finite "words", and to describe an arbitrarily large
file you'd need to list all the blocks that constitute the file!
the only way to implement fsync() in compliance with the
standard is to sync _everything_
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the magnitude yet.
If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.
And what is that external function going to do with this information?
The point is you didn't parse anything if you just sent the digit.
You just delegated the parsing further. Parsing is only meaningful if
you extracted some information, but your idea is, essentially "what if
I do nothing?".
Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?
Nobody says that parsing a number is the only pathological case. You, however, exaggerate by saying you cannot parse _anything_. You can
parse booleans or null, for example. There's no problem there.
In principle, any language that has infinite words will have the same
problem with streaming [...]
[...] If you ever pondered h/w or low-level
protocols such as SCSI or IP [...]
The real problem is how the JSON is set up. If you take umpteen data structures and wrap them all in something like a list, then it may be a tad hard to stream as you may not necessarily be examining the contents till the list finishes gigabytes later.
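For the "everything wrapped in one big list" layout, here is a rough standard-library sketch that hands out the elements one at a time instead of waiting for the closing bracket. The function name `iter_array_items` and the chunk size are assumptions for the example. It is lenient about stray commas, and any single element (including a pathologically long number) still has to fit in memory.

```python
import io
import json

WS = " \t\r\n"  # the four whitespace characters JSON allows


def iter_array_items(fp, chunk_size=65536):
    """Yield elements of a top-level JSON array read from text stream fp.

    Hypothetical helper, not a validating parser.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while not buf:  # skip leading whitespace, find the opening bracket
        chunk = fp.read(chunk_size)
        if not chunk:
            raise ValueError("empty input")
        buf = chunk.lstrip(WS)
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    eof = False
    while True:
        buf = buf.lstrip(WS + ",")  # lenient: tolerates repeated commas
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
            rest = buf[end:].lstrip(WS)
            # Accept the element only once its delimiter has been seen;
            # otherwise "12" might be the first half of "123" or "12e5".
            complete = rest[:1] in (",", "]")
        except json.JSONDecodeError:
            complete = False  # element is still incomplete (or malformed)
        if complete:
            yield item
            buf = rest
            continue
        if eof:
            raise ValueError("truncated or malformed JSON array")
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


sample = '[{"id": 1}, {"id": 2, "tags": ["a", "b"]}, 345, 1e5, "x]y", null]'
items = list(iter_array_items(io.StringIO(sample), chunk_size=4))
```

The tiny `chunk_size=4` is only there to exercise the chunk-boundary handling. With a real file one could pass the object returned by `gzip.open(path, "rt")`. A third-party incremental parser would likely be a better choice for production use, since this sketch re-parses a partial element every time more data arrives.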