• Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB

    From Barry@21:1/5 to All on Mon Sep 30 16:30:19 2024
    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Barry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Barry via Python-list on Mon Sep 30 12:11:46 2024
    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Grant Edwards via Python-list on Mon Sep 30 14:28:33 2024
    On 2024-09-30 at 11:44:50 -0400,
    Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
    Whether and to what degree you can stream JSON depends on JSON
    structure. In general, however, JSON cannot be streamed (but commonly
    it can be).

    Imagine a pathological case of this shape: 1... <60GB of digits>. This
    is still valid JSON (it doesn't have any limits on how many digits a
    number can have). And you cannot parse this number in a streaming way
    because in order to do that, you need to start with the least
    significant digit.

    Which is how arabic numbers were originally parsed, but when
    westerners adopted them from a R->L written language, they didn't flip
    them around to match the L->R written language into which they were
    being adopted.

    Interesting.

    So now long numbers can't be parsed as a stream in software. They
    should have anticipated this problem back in the 13th century and
    flipped the numbers around.

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?
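
    The most-significant-digit-first accumulation described above could be
    sketched in Python roughly as follows. The name parse_integer and the
    (value, lookahead) return shape are illustrative assumptions, not part of
    any JSON library; signs, fractions, and scientific notation are
    deliberately left out.

```python
def parse_integer(chars):
    """Read leading decimal digits from an iterator of characters.

    Returns (value, lookahead), where lookahead is the first non-digit
    character consumed, or None if the input ended first.
    """
    value = 0
    seen_digit = False
    for ch in chars:
        if "0" <= ch <= "9":
            # Shift the accumulated value one decimal place and add the digit.
            value = value * 10 + (ord(ch) - ord("0"))
            seen_digit = True
        else:
            if not seen_digit:
                raise ValueError("expected at least one digit")
            # The caller needs this character to continue parsing.
            return value, ch
    if not seen_digit:
        raise ValueError("expected at least one digit")
    return value, None


print(parse_integer(iter("12345, 678")))  # (12345, ',')
```

    The only state carried between characters is the running value, which in
    Python grows with the number of digits, so the 60-billion-digit case
    still ends in running out of memory.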

  • From Thomas Passin@21:1/5 to Chris Angelico via Python-list on Mon Sep 30 13:57:05 2024
    On 9/30/2024 1:00 PM, Chris Angelico via Python-list wrote:
    On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    Streaming gzip is perfectly possible. You may be thinking of PKZip
    which has its EOCD at the end of the file (although it may still be
    possible to stream-decompress if you work at it).

    ChrisA

    You're right, that's what I was thinking of.
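
    For reference, incremental gzip decompression can be sketched with the
    standard library alone. The helper name iter_gunzip and the in-memory
    payload are assumptions for illustration; in practice the chunks would
    come from the HTTP response or file being read. This sketch handles a
    single gzip member only.

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks."""
    # Adding 16 to the window size tells zlib to expect a gzip header/trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail


# Stand-in for a download: compress some data, then feed it back 64 bytes
# at a time, as if it were arriving over the network.
payload = b"".join(b'{"id": %d}\n' % i for i in range(1000))
compressed = gzip.compress(payload)
pieces = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(iter_gunzip(pieces))
print(restored == payload)
```

    At no point does the decompressor need the whole compressed file; each
    output piece can be handed to a streaming JSON parser as it appears.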

  • From Thomas Passin@21:1/5 to Barry via Python-list on Mon Sep 30 14:05:36 2024
    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    There is also the json-stream library, on PyPi at

    https://pypi.org/project/json-stream/

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Chris Angelico via Python-list on Mon Sep 30 18:16:03 2024
    On 2024-10-01 at 04:46:35 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:

    But why do I need to start with the least
    significant digit?

    If you start from the most significant, you don't know anything about
    the number until you finish parsing it. There's almost nothing you can
    say about a number given that it starts with a particular sequence
    (since you don't know how MANY digits there are). However, if you know
    the LAST digits, you can make certain statements about it (trivial
    examples being whether it's odd or even).

    But that wasn't the question. Sure, under certain circumstances and for
    specific use cases and/or requirements, there might be arguments to read
    potential numbers as strings and possibly not have to parse them
    completely before accepting or rejecting them.

    And if I start with the least significant digit and the number happens
    to be written in scientific notation and/or has a decimal point, then I
    can't tell whether it's odd or even until I further process the whole
    thing anyway.

    It's not very, well, significant. But there's something to it. And it
    extends nicely to p-adic numbers, which can have an infinite number of
    nonzero digits to the left of the decimal:

    https://en.wikipedia.org/wiki/P-adic_number

    In Common Lisp, integers can be written in any integer base from two to
    thirty six, inclusive. So knowing the last digit doesn't tell you
    whether an integer is even or odd until you know the base anyway.
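
    Python's int() accepts the same range of bases (2 through 36), so the
    point can be checked directly. The helper is_even is a hypothetical name
    used only for this illustration.

```python
def is_even(digits, base):
    """Parity of the integer written as `digits` in the given base."""
    return int(digits, base) % 2 == 0


# The same digit string, read in three different bases.
for base in (3, 10, 16):
    print(base, int("11", base), is_even("11", base))

# In an odd base the last digit alone does not decide parity:
# "11" in base 3 is 4 (even), while "12" in base 3 is 5 (odd).
print(is_even("11", 3), is_even("12", 3))  # True False
```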

    Curiously, we agree: if you move the goal posts arbitrarily, then
    some algorithms that parse JSON numbers will fail.

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Chris Angelico via Python-list on Mon Sep 30 20:06:57 2024
    On 2024-10-01 at 09:09:07 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    In Common Lisp, integers can be written in any integer base from two
    to thirty six, inclusive. So knowing the last digit doesn't tell
    you whether an integer is even or odd until you know the base
    anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.

    The only part I'm not clear on is what identifies the base. If you're
    going to write numbers little-endian, it's not that hard to also write
    them with a base indicator before the digits [...]

    In Common Lisp, you can write integers as #nnR[digits], where nn is the
    decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    You can also set or bind the global variable *read-base* (yes, the
    asterisks are part of the name) to an integer between 2 and 36, and then
    anything that looks like an integer in that base is interpreted as such
    (including literals in programs). The literals I described above are
    still handled correctly no matter the current value of *read-base*. So
    if the value of *read-base* is 16, then the input FFFF is read as the
    integer 65535 (as is the input #16rFFFF).
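
    As a rough Python analogue of the radix-prefixed literal syntax, here is
    a sketch that reads the unsigned form only. The function name
    read_radix_literal is made up for this example, and it is not a faithful
    reimplementation of the Common Lisp reader (no signs, no *read-base*).

```python
import re

# "#", a decimal base, the letter R (either case), then the digits.
_RADIX_LITERAL = re.compile(r"#([0-9]+)[rR]([0-9A-Za-z]+)")


def read_radix_literal(text):
    """Parse an unsigned literal written as #<base>R<digits>."""
    match = _RADIX_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a radix literal: {text!r}")
    base = int(match.group(1))
    if not 2 <= base <= 36:
        raise ValueError(f"base out of range: {base}")
    # int() raises ValueError if a digit is not valid in the given base.
    return int(match.group(2), base)


print(read_radix_literal("#16rFFFF"))  # 65535
print(read_radix_literal("#2R101"))    # 5
```

    Because the base comes before the digits, a reader working left to right
    knows how to weigh each digit as soon as it arrives.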

    (Pedants may point out details I omitted. I admit to omitting them.)

    IIRC, certain [old 8080 and Z-80?] assemblers used to put the base
    indicator at the end. So 10 meant, well, 10, but 10H meant 16 and 10b
    meant 2 (IDK; the capital H and the lower case b both look right to me).

    I don't recall numbers written from least significant digit to most
    significant digit (big and little endian *storage*, yes, but not the
    digits when presented to or read from a human).

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Keith Thompson via Python-list on Tue Oct 1 11:34:45 2024
    On 2024-09-30 at 18:48:02 -0700,
    Keith Thompson via Python-list <python-list@python.org> wrote:

    2QdxY4RzWzUUiLuE@potatochowder.com writes:
    [...]
    In Common Lisp, you can write integers as #nnR[digits], where nn is the decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    Typo: You meant #16RFFFF, not #16fFFFF.

    Yep. Sorry.

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Left Right via Python-list on Tue Oct 1 11:47:24 2024
    On 2024-09-30 at 21:34:07 +0200,
    Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
    Left Right via Python-list <python-list@python.org> wrote:

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?

    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot
    leave the parser code until you know the magnitude (otherwise the
    information is useless to the external code).

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    How much state can a parser maintain (before it invokes an external
    function) and still be considered streaming? I fear that we may be
    getting hung up on terminology rather than solving the problem at hand.

  • From Greg Ewing@21:1/5 to Left Right on Wed Oct 2 11:07:41 2024
    On 2/10/24 10:03 am, Left Right wrote:
    Consider also an interesting
    consequence of SCSI not being able to have infinite words: this means,
    among other things, that fsync() is nonsense! :) If you aren't
    familiar with the concept: the UNIX filesystem API suggests that it's
    possible to destage an arbitrarily large file (or a chunk of a file) to
    disk. But SCSI is built of finite "words", and to describe an arbitrarily
    large file you'd need to list all the blocks that constitute the file!

    I don't follow. What fsync() does is ensure that any data buffered
    in the kernel relating to the file is sent to the storage device.
    It can send as many blocks of data over SCSI as required to
    achieve this. There's no requirement for it to be atomic at the
    level of the interface between the kernel and the hardware.

    Some devices do their own buffering in ways that are invisible to
    the software, so fsync() can't guarantee that the data is actually
    written to the storage medium. But that's a problem stemming from
    the design of the hardware, not the design of the protocol for
    communicating with the hardware.
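
    From Python, the sequence described above looks roughly like this. The
    helper name write_durably and the file name are assumptions for
    illustration. As noted, a successful return still depends on the device
    honouring the request.

```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes to path and ask the kernel to push them to the device."""
    with open(path, "wb") as f:
        f.write(data)
        # Move data from Python's userspace buffer into the kernel.
        f.flush()
        # Ask the kernel to send its buffered data for this file to storage.
        os.fsync(f.fileno())


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "chunk.json")
    write_durably(target, b'{"id": 1}\n')
    size = os.path.getsize(target)
    print(size)  # 10
```

    The kernel is free to issue as many separate write commands as it needs
    to satisfy the one fsync() call.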

    the only way to implement fsync() in compliance with the
    standard is to sync _everything_

    Again I'm not sure what you mean here. It may be difficult for the
    kernel to track down exactly what data is relevant to a particular file,
    and so the kernel programmers take the easy way out and just implement
    fsync() as sync(). But again that has nothing to do with the protocol.

    --
    Greg

  • From Greg Ewing@21:1/5 to Left Right on Wed Oct 2 10:48:24 2024
    On 1/10/24 8:34 am, Left Right wrote:
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet.

    By that definition of "streaming", no parser can ever be streaming,
    because there will be some constructs that must be read in their
    entirety before a suitably-structured piece of output can be
    emitted.

    The context of this discussion about integers is the claim that
    they *could* be parsed incrementally if they were written little
    endian instead of big endian, but the same argument applies either
    way.

    --
    Greg

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Left Right on Tue Oct 1 20:20:59 2024
    On 2024-10-01 at 23:03:01 +0200,
    Left Right <olegsivokon@gmail.com> wrote:

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    And what is that external function going to do with this information?
    The point is you didn't parse anything if you just sent the digit.
    You just delegated the parsing further. Parsing is only meaningful if
    you extracted some information, but your idea is, essentially "what if
    I do nothing?".

    If the parser detects the first digit of a number, then the parser can
    read digits one at a time (i.e., "streaming"), assimilate and accumulate
    the value of the number being parsed, and successfully finish parsing
    the number when it reads a non-digit. Whether the function that
    accumulates the value during the process is internal or external isn't
    relevant; the point is that it is possible to parse integers from most
    significant digit to least significant digit under a streaming model
    (and if you're sufficiently clever, you can even write partial results
    to external storage and/or another transmission protocol, thus allowing
    for numbers bigger (as measured by JSON or your internal representation)
    than your RAM).

    At most, the parser has to remember the non-digit character it read so
    that it (the parser) can begin to parse whatever comes after the number.
    Does that break your notion of "streaming"?

    Why do I have to start with the least significant digit?

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    Nobody says that parsing a number is the only pathological case. You,
    however, exaggerate by saying you cannot parse _anything_. You can
    parse booleans or null, for example. There's no problem there.

    My intent was only to repeat what you implied: that any parser that
    reads its input until it has parsed a value is not streaming.

    So how much information can the parser keep before you consider it not
    to be "streaming"?

    [...]

    In principle, any language that has infinite words will have the same
    problem with streaming [...]

    So what magic allows anyone to stream any JSON file over SCSI or IP?
    Let alone some kind of "live stream" that by definition is indefinite,
    even if it only lasts a few tenths of a second?

    [...] If you ever pondered h/w or low-level
    protocols s.a. SCSI or IP [...]

    I spent a good deal of my career designing and implementing all manner
    of communications protocols, from transmitting and receiving single
    bits over a wire all the way up to what are now known as session and
    presentation layers. Some imposed maximum lengths in certain places;
    some allowed for indefinite amounts of data to be transferred from one
    end to the other without stopping, resetting, or overflowing. And yet
    somehow, the universe never collapsed.

    If you believe that some implementation of fsync fails to meet a
    specification, or fails to work correctly on files containing JSON, then
    file a bug report.

  • From Greg Ewing@21:1/5 to avi.e.gross@gmail.com on Wed Oct 2 18:27:54 2024
    On 2/10/24 12:26 pm, avi.e.gross@gmail.com wrote:
    The real problem is how the JSON is set up. If you take umpteen data
    structures and wrap them all in something like a list, then it may be a
    tad hard to stream as you may not necessarily be examining the contents
    till the list finishes gigabytes later.

    Yes, if you want to process the items as they come in, you might
    be better off sending a series of separate JSON strings, rather than
    one JSON string containing a list.
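
    If the sender can be changed to emit one JSON document per line, the
    consumer becomes very small. This is a sketch under that assumption; the
    name iter_json_documents and the sample data are illustrative only.

```python
import io
import json


def iter_json_documents(lines):
    """Yield one parsed value per non-empty line of input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open file or a decoded network stream.
sample = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
ids = [doc["id"] for doc in iter_json_documents(sample)]
print(ids)  # [1, 2, 3]
```

    Memory use is bounded by the largest single line rather than by the
    size of the whole file.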

    Or, use a specialised JSON parser that processes each item of the
    list as soon as it's finished parsing it, instead of collecting the
    whole list first.
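
    To make the second option concrete, here is a standard-library sketch
    that yields the elements of one top-level array as each is completed.
    The name iter_array_items is hypothetical, the code is lenient about
    comma placement, and it re-parses an element from its start whenever
    more input is needed, so it is an illustration rather than a replacement
    for a real streaming parser such as json-stream.

```python
import io
import json

# A number can be cut off up to two characters short (for example "1e+"),
# so an item is only trusted once this much text follows it.
_LOOKAHEAD = 3


def iter_array_items(stream, chunk_size=65536):
    """Yield elements of a top-level JSON array read from a text stream."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False

    def refill():
        nonlocal buf, eof
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True

    # Find the opening bracket of the array.
    while not buf.strip() and not eof:
        refill()
    buf = buf.lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        # Skip whitespace and separators between elements (lenient).
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            if eof:
                raise  # truncated or malformed input
            refill()   # element is incomplete; read more and retry
            continue
        if len(buf) - end < _LOOKAHEAD and not eof:
            refill()   # the element might continue in the next chunk
            continue
        yield item
        buf = buf[end:]  # drop the consumed element from the buffer


text = '[1.5e+10, 12345, {"a": [1, 2]}, "x,]y", null, true]'
for item in iter_array_items(io.StringIO(text), chunk_size=8):
    print(item)
```

    Only the element currently being parsed, plus at most one chunk of
    lookahead, is held in memory at a time.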

    --
    Greg

  • From Ethan Furman@21:1/5 to All on Wed Oct 2 18:57:51 2024
    This thread is derailing.

    Please consider it closed.

    --
    ~Ethan~
    Moderator
