• Script stops running with no error

    From Daniel@21:1/5 to All on Wed Aug 28 22:09:56 2024
    As you all have seen in my intro post, I am working on a project in
    Python (which I'm learning as I go) that uses the Wikimedia API to pull
    data from wiktionary.org. I want to parse the JSON and output, for now,
    just the definition of a word.

    Wiktionary is wikimedia's dictionary.

    My requirements for v1:

    Query the API for the definition of "table" (hard-coded in the Python script).
    Pull the proper JSON.
    Parse the JSON.
    Output the definition only.

    What's happening?

    I run the script and, maybe I don't know shit from shinola, but it
    appears I composed it properly. I wrote the script to do the above.
    The Wiktionary wikitext (inside the JSON) denotes a list item with the
    character # and sub-list items with ##, but renders them as numbers.

    On Wiktionary, the definitions are denoted like:

    1. blablabla
    1. blablabla
    2. blablablablabla
    2. balbalbla
    3. blablabla
    1. blablabla


    I wrote my script to alter it so that the sublists use letters

    1. blablabla
    a. blablabla
    b. blablabla
    2. blablabla and so on
    /snip

    At this point, the script stops after it assesses the first main_counter
    and sub_counter. The code is below; please tell me which stupid mistake
    I made (I'm sure it's simple).

    Am I taking a bad approach? Is there an easier method of parsing JSON
    than the way I'm doing it? I'm all ears.

    Be kind, I'm really new at Python. My environment is Emacs.

    import requests
    import re

    search_url = 'https://api.wikimedia.org/core/v1/wiktionary/en/search/page'
    search_query = 'table'
    parameters = {'q': search_query}

    response = requests.get(search_url, params=parameters)
    data = response.json()

    page_id = None

    if 'pages' in data:
        for page in data['pages']:
            title = page.get('title', '').lower()
            if title == search_query.lower():
                page_id = page.get('id')
                break

    if page_id:
        content_url = f'https://api.wikimedia.org/core/v1/wiktionary/en/page/{search_query}'
        response = requests.get(content_url)
        page_data = response.json()
        if 'source' in page_data:
            content = page_data['source']
            cases = {'noun': r'\{en-noun\}(.*?)(?=\{|\Z)',
                     'verb': r'\{en-verb\}(.*?)(?=\{|\Z)',
                     'adjective': r'\{en-adj\}(.*?)(?=\{|\Z)',
                     'adverb': r'\{en-adv\}(.*?)(?=\{|\Z)',
                     'preposition': r'\{en-prep\}(.*?)(?=\{|\Z)',
                     'conjunction': r'\{en-con\}(.*?)(?=\{|\Z)',
                     'interjection': r'\{en-intj\}(.*?)(?=\{|\Z)',
                     'determiner': r'\{en-det\}(.*?)(?=\{|\Z)',
                     'pronoun': r'\{en-pron\}(.*?)(?=\{|\Z)'
                     # make sure there aren't more word types
                     }

            def clean_definition(text):
                text = re.sub(r'\[\[(.*?)\]\]', r'\1', text)
                text = text.lstrip('#').strip()
                return text

            print(f"\n*** Definition for {search_query} ***")
            for word_type, pattern in cases.items():
                match = re.search(pattern, content, re.DOTALL)
                if match:
                    lines = [line.strip() for line in match.group(1).split('\n')
                             if line.strip()]
                    definition = []
                    main_counter = 0
                    sub_counter = 'a'

                    for line in lines:
                        if line.startswith('##*') or line.startswith('##:'):
                            continue

                        if line.startswith('# ') or line.startswith('#\t'):
                            main_counter += 1
                            sub_counter = 'a'
                            cleaned_line = clean_definition(line)
                            definition.append(f"{main_counter}. {cleaned_line}")
                        elif line.startswith('##'):
                            cleaned_line = clean_definition(line)
                            definition.append(f"   {sub_counter}. {cleaned_line}")
                            sub_counter = chr(ord(sub_counter) + 1)

                    if definition:
                        print(f"\n{word_type.capitalize()}\n")
                        print("\n".join(definition))
                        break
    else:
        print("try again beotch")

    Thanks,

    Daniel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Daniel via Python-list on Wed Aug 28 18:32:16 2024
    On 8/28/2024 5:09 PM, Daniel via Python-list wrote:
    As you all have seen on my intro post, I am in a project using Python
    (which I'm learning as I go) using the wikimedia API to pull data from
    wiktionary.org. I want to parse the json and output, for now, just the
    definition of the word.
    /snip

    You need to check at each part of the code to see if you are getting or producing what you think you are. You also should create a text
    constant containing the JSON input you expect to get. Make sure you can process that. Start simple - one main item. Then two main items. Then
    two main items with one sub item. And so on.

    I'm not sure what you want to produce in the end but this seems awfully
    complex to be starting with. Also you aren't taking advantage of the
    structure inherent in the JSON. If the data response isn't too big, you
    can probably take it as is and use the Python JSON reader to produce a
    Python data structure. It should be much easier (and faster) to process
    the data structure than to repeatedly scan all those lines of data with regexes.
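    For instance, a minimal sketch of that approach (the payload shape here is an
    assumption, copied from what Daniel's script expects -- a top-level "pages" list):

```python
import json

# Hypothetical search response, shaped like the one Daniel's script expects.
SAMPLE = '{"pages": [{"id": 123, "title": "table"}, {"id": 456, "title": "Table tennis"}]}'

data = json.loads(SAMPLE)           # one call parses the whole payload
for page in data.get("pages", []):  # then walk the dict/list structure directly
    print(page["id"], page["title"])
```

    No regexes are needed to get this far; they only become tempting once you
    reach the wikitext inside the payload.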

  • From dn@21:1/5 to Thomas Passin via Python-list on Thu Aug 29 12:07:07 2024
    On 29/08/24 10:32, Thomas Passin via Python-list wrote:
    On 8/28/2024 5:09 PM, Daniel via Python-list wrote:
    As you all have seen on my intro post, I am in a project using Python
    (which I'm learning as I go) using the wikimedia API to pull data from
    wiktionary.org.
    /snip


    You need to check at each part of the code to see if you are getting or producing what you think you are.  You also should create a text
    constant containing the JSON input you expect to get.  Make sure you can process that.  Start simple - one main item.  Then two main items.  Then two main items with one sub item.  And so on.

    I'm not sure what you want to produce in the end but this seems awfully complex to be starting with.  Also you aren't taking advantage of the structure inherent in the JSON.  If the data response isn't too big, you
    can probably take it as is and use the Python JSON reader to produce a
    Python data structure.  It should be much easier (and faster) to process
    the data structure than to repeatedly scan all those lines of data with regexes.


    Good effort so far!


    Further to @Thomas: the code does seem to be taking the long way around!
    How can we illustrate that, and improve life?


    The Wiktionary docs at https://developer.wikimedia.org/use-content/
    discuss how to use their "Developer Portal". Worth reading!

    As part of the above, we find the "API:Data formats" page (https://www.mediawiki.org/wiki/API:Data_formats) which offers a simple
    example (more simple than your objectives):

    api.php?action=query&titles=Main%20page&format=json

    which produces:

    {
        "query": {
            "pages": {
                "217225": {
                    "pageid": 217225,
                    "ns": 0,
                    "title": "Main page"
                }
            }
        }
    }

    Does this look like a Python dict[ionary's] output to you?

    It is, (more discussion at the web.ref)
    - but it is wrapped into a JSON payload.

    There are various ways of dealing with JSON-formatted data. You're
    already using requests. Perhaps leave such research until later.


    So, as soon as "page_data" is realised from "response", print() it (per
    above: make sure you're actually seeing what you're expecting to see). Computers have this literal habit of doing what we ask, not what we want!

    PS the pprint/pretty printer library offers a neater way of outputting a "nested" data-structure (https://docs.python.org/3/library/pprint.html).


    Thereafter, make as much use of the returned dict/list structure as you can.
    At each stage of the 'drilling-down' process, again, print() it (to make
    sure ...)


    In this way the code will step-through the various 'layers' of data-organisation. That observation and stepping-through of 'layers' is
    a hint that the code should (probably) also be organised by 'layer'! For example, the first for-loop finds a page which matches the search-key.
    This could be abstracted into a (well-named) function.

    Thus, you can write a test-harness which provides the function with some
    sample input (which you know from earlier print-outs!) and can ensure
    (with yet another print()) that the returned-result is as-expected!

    NB the test-data and check-print() should be outside the function.
    Please take these steps as-read or as 'rules'. Once your skills expand,
    you will likely become ready to learn about unit-testing, pytest, etc.
    At which time, such ideas will 'fall into place'.
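    A sketch of that shape (the function name and the sample pages are invented
    for illustration):

```python
def find_page_id(pages, search_query):
    """Return the id of the first page whose title matches search_query, else None."""
    wanted = search_query.lower()   # computed once, outside the loop
    for page in pages:
        if page.get("title", "").lower() == wanted:
            return page.get("id")
    return None

# Test-harness: the sample input and the check-print() live OUTSIDE the function.
sample_pages = [{"id": 1, "title": "Table tennis"},
                {"id": 2, "title": "Table"}]
print(find_page_id(sample_pages, "table"))  # expect 2
print(find_page_id(sample_pages, "chair"))  # expect None
```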


    BTW/whilst that 'unit' is in-focus: how many times will the current
    code compute search_query.lower()? How many times (per function call)
    will "search_query" be any different from previous calls? So, should
    that computation be elsewhere?
    (won't make much difference to execution time, but a coding-skill:
    consider whether to leave computation until the result is actually
    needed (lazy-evaluation), or if early-computation will save unnecessary repeated-computation)
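    Concretely, with toy data (in Daniel's loop the .lower() call runs once per
    page; hoisted, it runs once in total):

```python
search_query = "Table"
titles = ["chair", "table", "stool", "table"]

# Early computation: the lowered query cannot change between iterations,
# so compute it once before the loop instead of once per title.
wanted = search_query.lower()
matches = [title for title in titles if title == wanted]
print(matches)  # ['table', 'table']
```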


    Similarly, 'lift' constants such as "cases" out of (what will become)
    functions and put them towards the top of the script. This means that
    all such 'definition' and 'configuration' settings will be found
    together in one easy-to-find location AND makes the functional code
    easier to read.
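    A sketch of that layout (only two of Daniel's patterns are shown, and the
    function name is invented):

```python
import re

# 'Definition'/'configuration' data lifted to the top of the script,
# in one easy-to-find location.
CASES = {
    "noun": r"\{en-noun\}(.*?)(?=\{|\Z)",
    "verb": r"\{en-verb\}(.*?)(?=\{|\Z)",
}

def first_matching_type(content):
    """Return (word_type, body) for the first CASES pattern that matches."""
    for word_type, pattern in CASES.items():
        match = re.search(pattern, content, re.DOTALL)
        if match:
            return word_type, match.group(1)
    return None, None

print(first_matching_type("{en-noun} # a piece of furniture"))
```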


    Now, back to the question: where is the problem arising? Do you know or
    do you only know that what comes-out at the end is
    unattractive/unacceptable?

    The idea of splitting the code into functions (or "units") is not only
    that you could test each and thereby narrow-down the location of the
    problem (and so that we don't have to read so much code in a bid to
    help) but that when you do ask for assistance you will be able to
    provide only the pertinent code AND some sample input-data with expected-results!
    (although, if all our dreams come true, you will answer your own question!)


    OK, is that enough by way of coding-tactics (not to mention the
    web-research) to keep you on-track for a while?

    --
    Regards,
    =dn

  • From rbowman@21:1/5 to Daniel on Thu Aug 29 01:33:33 2024
    On Wed, 28 Aug 2024 22:09:56 +0100, Daniel wrote:

    if definition:
    print(f"\n{word_type.capitalize()}\n")
    print("\n".join(definition))
    break

    I don't know if that was intended but the 'break' kicks you out of

    for word_type, pattern in cases.items():

    I added a little debugging to show the cases iteration and commented out
    the break. 'noun' has five lines and appears to be correct. 'verb' has
    two lines, neither of which match the if/else. The others aren't in the
    return from https://api.wikimedia.org/core/v1/wiktionary/en/page/table.
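    A stripped-down illustration of that break (the data is invented; only the
    loop shape matters):

```python
cases = {"noun": "N", "verb": "V", "adjective": "A"}

seen = []
for word_type, pattern in cases.items():
    seen.append(word_type)
    definition = ["something"]  # pretend this word_type produced a definition
    if definition:
        break  # exits the whole for-loop: 'verb' and 'adjective' are never tried

print(seen)  # ['noun']
```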

    I have to admit I sometimes miss C where I can bounce between curlies.

    Output:

    python wiki.py

    *** Definition for table ***

    word_type noun pattern: \{en-noun\}(.*?)(?=\{|\Z)
    line }
    line # Furniture with a top surface to accommodate a variety of uses.
    line ## An item of [[furniture]] with a [[flat]] [[top]] [[surface]]
    raised above the ground, usually on one or more legs.
    line ##: ''Set that dish on the '''table''' over there, please.''
    line ##*

    Noun

    1. Furniture with a top surface to accommodate a variety of uses.
       a. An item of furniture with a flat top surface raised above the
    ground, usually on one or more legs.

    word_type verb pattern: \{en-verb\}(.*?)(?=\{|\Z)
    line }
    line #

    word_type adjective pattern: \{en-adj\}(.*?)(?=\{|\Z)

    word_type adverb pattern: \{en-adv\}(.*?)(?=\{|\Z)

    word_type preposition pattern: \{en-prep\}(.*?)(?=\{|\Z)

    word_type conjunction pattern: \{en-con\}(.*?)(?=\{|\Z)

    word_type interjection pattern: \{en-intj\}(.*?)(?=\{|\Z)

    word_type determiner pattern: \{en-det\}(.*?)(?=\{|\Z)

    word_type pronoun pattern: \{en-pron\}(.*?)(?=\{|\Z)

  • From Thomas Passin@21:1/5 to dn via Python-list on Wed Aug 28 22:58:02 2024
    On 8/28/2024 8:07 PM, dn via Python-list wrote:
    On 29/08/24 10:32, Thomas Passin via Python-list wrote:
    /snip

    The Wiktionary docs at https://developer.wikimedia.org/use-content/
    discuss how to use their "Developer Portal". Worth reading!

    As part of the above, we find the "API:Data formats" page
    (https://www.mediawiki.org/wiki/API:Data_formats) which offers a
    simple example (simpler than your objectives):

    api.php?action=query&titles=Main%20page&format=json

    which produces:

    {
      "query": {
        "pages": {
          "217225": {
            "pageid": 217225,
            "ns": 0,
            "title": "Main page"
          }
        }
      }
    }

    Does this look like a Python dict[ionary's] output to you?

    It is, (more discussion at the web.ref)
    - but it is wrapped into a JSON payload.

    To give more detail:

    import json
    from pprint import pprint

    DATA = """{
        "query": {
            "pages": {
                "217225": {
                    "pageid": 217225,
                    "ns": 0,
                    "title": "Main page"
                }
            }
        }
    }"""

    data_dict = json.loads(DATA)
    pprint(data_dict)

    Easy. If you have a really big file it can be fearfully slow so it may
    or may not be a good approach for this problem.

    Or you could parse out the data with JSONpath (which I have never used
    but it's the right kind of approach):

    https://pypi.org/project/jsonpath-ng/

    Another possibility: JMESPath:

    https://python.land/data-processing/working-with-json/jmespath

    These kinds of approaches also handle the parsing for you and help in
    constructing queries.
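    To show the flavour without installing anything, here is a toy, stdlib-only
    imitation of such a query (it handles only dotted keys plus a '*' wildcard;
    the real libraries are far more capable):

```python
def jsonish_path(data, path):
    """Toy JSONPath-like lookup: 'query.pages.*.title' -> list of values."""
    results = [data]
    for key in path.split("."):
        next_results = []
        for node in results:
            if key == "*" and isinstance(node, dict):
                next_results.extend(node.values())  # wildcard: every value
            elif isinstance(node, dict) and key in node:
                next_results.append(node[key])      # plain key lookup
        results = next_results
    return results

doc = {"query": {"pages": {"217225": {"pageid": 217225, "title": "Main page"}}}}
print(jsonish_path(doc, "query.pages.*.title"))  # ['Main page']
```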
