• Diagnostics

    From Don Y@21:1/5 to All on Sat Oct 12 12:58:02 2024
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.
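
    As a minimal sketch of such an invariant in C -- read_battery_mv(),
    charger_enable() and delay_ms() are hypothetical HAL calls invented
    for illustration, and the 20 mV threshold is a placeholder:

        #include <stdbool.h>
        #include <stdint.h>

        extern uint16_t read_battery_mv(void);   /* battery voltage, in mV */
        extern void charger_enable(bool on);
        extern void delay_ms(uint32_t ms);

        /* Invariant: enabling the charger must measurably raise the
           battery terminal voltage; threshold is application-specific. */
        bool charger_invariant_ok(void)
        {
            uint16_t before = read_battery_mv();
            charger_enable(true);
            delay_ms(50);                         /* let voltage settle */
            uint16_t during = read_battery_mv();
            charger_enable(false);
            return during > before + 20;
        }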

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

  • From David Brown@21:1/5 to George Neuner on Sat Oct 19 13:57:30 2024
    On 18/10/2024 23:42, George Neuner wrote:
    On Fri, 18 Oct 2024 20:30:06 -0000 (UTC), antispam@fricas.org (Waldek Hebisch) wrote:

    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations. Even if some safety-critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    However, only a minor percentage of all devices must comply with such
    safety regulations.

    As I understand it, Don is working on tech for "smart home"
    implementations ... devices that may be expected to run nearly
    constantly (though perhaps not 365/24 with 6 9's reliability), but
    which, for the most part, are /not/ safety critical.

    WRT Don's question, I don't know the answer, but I suspect runtime
    diagnostics are /not/ routinely implemented for devices that are not
    safety critical. Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing. Even if the test is at low(est)
    priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real-time situation. To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non-safety-critical
    device.


    That brings up one of the critical points about any kind of runtime
    diagnostics - what do you do if there is a failure? Until you can
    answer that question, any effort on diagnostics is not just pointless,
    but worse than useless because you are adding more stuff that could go
    wrong.

    I think bad or useless diagnostics are a more common problem than
    missing diagnostics. People feel pressured into having them even when
    they can't measure anything useful and can't do anything sensible with
    the results.

    I have seen first-hand how the insistence on having all sorts of
    diagnostics added to a product so that it could be "safety" certified
    actually resulted in a less reliable and less safe product. The only
    "safety" they provided was legal safety, so that people could claim it
    wasn't their fault if it failed, because they had added all the
    self-tests required by the so-called safety experts.

  • From David Brown@21:1/5 to Don Y on Sat Oct 19 14:07:07 2024
    On 19/10/2024 00:30, Don Y wrote:
    Hi George,

    [Hope all is well with you and at home]

    On 10/18/2024 2:42 PM, George Neuner wrote:
    WRT Don's question, I don't know the answer, but I suspect runtime
    diagnostics are /not/ routinely implemented for devices that are not
    safety critical.  Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing.  Even if the test is at low(est)
    priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real time situation.

    But, if you *know* when certain aspects of a device will be "called on",
    you can take advantage of that to schedule diagnostics when the device is
    not "needed".  And, in the event that some unexpected "need" arises,
    can terminate or suspend the testing (possibly rendering the effort
    moot if it hasn't yet run to a conclusion).

    E.g., I scrub freed memory pages (zero fill) so information doesn't
    leak across protection domains.  As long as some minimum number
    of *scrubbed* pages are available for use "on demand", why can't
    I *test* the pages yet to be scrubbed?
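
    A test-then-scrub pass over one page might look like this C sketch;
    PAGE_WORDS is an assumed page geometry, and on real hardware caches
    and write buffers may need to be bypassed for the patterns to
    actually reach the DRAM cells:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        #define PAGE_WORDS 1024u       /* assumed 4 KiB page, 32-bit words */

        /* Walk the page with complementary patterns to catch stuck-at
           bits, leaving it zero-filled (scrubbed) as a side effect. */
        bool test_and_scrub_page(volatile uint32_t *page)
        {
            bool ok = true;
            for (size_t i = 0; i < PAGE_WORDS; i++) {
                page[i] = 0xAAAAAAAAu;
                if (page[i] != 0xAAAAAAAAu) ok = false;
                page[i] = 0x55555555u;
                if (page[i] != 0x55555555u) ok = false;
                page[i] = 0;           /* final state: scrubbed */
            }
            return ok;                 /* false => quarantine the page */
        }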

    You /could/ do that, but what is the point?

    What are you checking for? What is the realistic likelihood of finding
    a problem, and what are the consequences of such a problem? How do you
    test your test routines - are you able to simulate the problem you are
    testing in a good enough manner? What are the circumstances that could
    lead to a fault that you detect with the tests but where you would not
    already see the problem in other ways? Is it realistic to assume that
    your diagnostic test and reporting systems are able to run properly when
    this problem occurs? If some kind of problem actually occurs, will
    your tests realistically identify it?

    /Those/ are the kinds of questions you should be asking before putting
    in some kind of tests. They are the important questions. Asking "why
    can't I do a test now?" is peanuts in comparison.

  • From Waldek Hebisch@21:1/5 to Don Y on Sat Oct 19 13:53:33 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations.

    Not all devices are covered by "regulations".

    Well, if the device matters then there is implied liability
    and nobody wants to admit doing a bad job. If the device
    does not matter, then the answer to the original question
    also does not matter.

    In the US, ANYTHING can result in a lawsuit. But, "due diligence"
    can insulate the manufacturer, to some extent. No one ever
    *admits* to "doing a bad job".

    If your doorbell malfunctions, what "damages" are you going
    to claim? If your garage door doesn't open when commanded?
    If your yard doesn't get watered? If you weren't promptly
    notified that the mail had just been delivered? Or, that
    the compressor in the freezer had failed and your foodstuffs
    had spoiled, as a result?

    The costs of litigation are reasonably high. Lawyers want
    to see the LIKELIHOOD of a big payout before entertaining
    such litigation.

    Each item above may contribute to a significant loss. And
    there could be a push to litigation (say by a consumer advocacy
    group), basically to establish a precedent. So, better to have a
    record of due diligence.

    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    What to test is really domain-specific. Traditional thinking
    is that computer hardware is _much_ more reliable than
    software and software bugs are major source of misbehaviour.

    That hasn't been *proven*. And, "misbehavior" is not the same
    as *failure*.

    First, I mean relevant hardware, that is, hardware inside an MCU.
    I think that there are strong arguments that such hardware is
    more reliable than software. I have seen a claim, based on analysis
    of discovered failures, that software written to rigorous development
    standards exhibits on average about 1 bug (that leads to failure) per
    1000 lines of code. This means that even a small MCU has enough
    space for a handful of bugs. And for bigger systems it gets worse.

    And among hardware failures transient upsets, like flipped
    bit are more likely than permanent failure. For example,

    That used to be the thinking with DRAM but studies have shown
    that *hard* failures are more common. These *can* be found...
    *if* you go looking for them!

    In another place I wrote that one of the studies I saw claimed that
    a significant number of the errors they detected (they monitored
    changes to a memory area that was supposed to be unmodified) were
    due to buggy software. And DRAM is special.

    E.g., if you load code into RAM (from FLASH) for execution,
    are you sure the image *in* the RAM is the image from the FLASH?
    What about "now"? And "now"?!

    You are supposed to regularly verify sufficiently strong checksum.
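
    For instance, a low-priority task might periodically re-CRC the
    loaded image against the value computed at load time. A sketch in C,
    assuming text_start/text_end are linker-provided symbols bounding
    the TEXT segment:

        #include <stddef.h>
        #include <stdint.h>

        extern const uint8_t text_start[], text_end[];  /* linker symbols */

        /* Standard reflected CRC-32 (polynomial 0xEDB88320), bitwise. */
        static uint32_t crc32(const uint8_t *p, size_t n)
        {
            uint32_t crc = 0xFFFFFFFFu;
            while (n--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                    crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
            }
            return ~crc;
        }

        /* Call from an idle/low-priority task; 'expected' was recorded
           when the image was copied out of FLASH. */
        int text_segment_intact(uint32_t expected)
        {
            return crc32(text_start, (size_t)(text_end - text_start)) == expected;
        }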

    at low safety level you may assume that hardware of a counter
    generating PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values.

    Why would you expect the registers to lose their settings?
    Would you expect the CPU's registers to be similarly flakey?

    First, such checking is not my idea, but one point from checklist for
    low safety devices. Registers may change due to bugs, EMC events,
    cosmic rays and similar.
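
    Such a readback check can be as small as the C sketch below; the
    REG() macro and the shadow table are illustrative, not any
    particular vendor's API:

        #include <stddef.h>
        #include <stdint.h>

        #define REG(addr) (*(volatile uint32_t *)(addr))  /* illustrative */

        struct reg_shadow {
            uintptr_t addr;      /* configuration register address */
            uint32_t  expected;  /* value written during init */
            uint32_t  mask;      /* ignore status/self-clearing bits */
        };

        /* Recorded at init with whatever the application configured;
           contents are hypothetical and device-specific. */
        extern const struct reg_shadow shadows[];
        extern const size_t n_shadows;

        int config_registers_ok(void)
        {
            for (size_t i = 0; i < n_shadows; i++)
                if ((REG(shadows[i].addr) & shadows[i].mask) !=
                    (shadows[i].expected & shadows[i].mask))
                    return 0;    /* drifted: rewrite, log, or fail safe */
            return 1;
        }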

    Historically OS-es had a map of bad blocks on the disc and
    avoided allocating them. In principle on system with paging
    hardware the same could be done for DRAM, but I do not think
    anybody is doing this (if domain is serious enough to worry
    about DRAM failures, then it probably has redundant independent
    computers with ECC DRAM).

    Using ECC DRAM doesn't solve the problem. If you see errors
    reported by your ECC RAM (corrected errors), then when do
    you decide you are seeing too many and losing confidence that
    the ECC is actually *detecting* all multibit errors?

    ECC is part of the solution. It may reduce the probability of
    error so that you consider errors not serious enough to matter.
    And if you really care you may try to increase the error rate
    (say by putting RAM chips at increased temperature) and test that
    your detection and recovery strategy works OK.
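
    One hedged way to turn corrected-error reports into a "too many"
    decision is a leaky-bucket monitor, as in this C sketch; LIMIT and
    the one-minute drain rate are placeholders to be tuned from field
    data:

        /* Each corrected-ECC-error report adds a token; each periodic
           tick drains one. Staying above LIMIT suggests the raw error
           rate is high enough that uncorrectable (and possibly
           undetected) multibit errors are no longer negligible. */
        #define LIMIT 16u

        static unsigned bucket;

        void on_corrected_ecc_error(void) { if (bucket < 2u * LIMIT) bucket++; }

        void on_tick_1min(void) { if (bucket > 0) bucket--; }

        int dram_confidence_lost(void) { return bucket >= LIMIT; }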

    --
    Waldek Hebisch

  • From Don Y@21:1/5 to Waldek Hebisch on Sat Oct 19 09:55:35 2024
    On 10/19/2024 6:53 AM, Waldek Hebisch wrote:
    In the US, ANYTHING can result in a lawsuit. But, "due diligence"
    can insulate the manufacturer, to some extent. No one ever
    *admits* to "doing a bad job".

    If your doorbell malfunctions, what "damages" are you going
    to claim? If your garage door doesn't open when commanded?
    If your yard doesn't get watered? If you weren't promptly
    notified that the mail had just been delivered? Or, that
    the compressor in the freezer had failed and your foodstuffs
    had spoiled, as a result?

    The costs of litigation are reasonably high. Lawyers want
    to see the LIKELIHOOD of a big payout before entertaining
    such litigation.

    Each item above may contribute to a significant loss. And

    Significant loss? From a doorbell failing to ring? Are you
    sure YOUR doorbell has rung EVERY time someone pressed the button?

    there could push to litigation (say by a consumer advocacy group)
    basically to establish a precedent. So, better have
    record of due diligence.

    But things can *still* "fail to perform". That's the whole point of
    runtime diagnostics: to notice a failure that the user may NOT!
    If you can take remedial action, then you have a notification that
    it is needed. If this requires the assistance of the user, then
    you can REQUEST that. If you can offload some of the responsibilities
    of the device (something that I can do, dynamically), then you
    can elect to do so. If you can do nothing to keep the device in
    service, then you can alert the user of the need for replacement.

    *NOT* knowing of a fault means you gleefully keep operating as
    if everything was fine.

    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    What to test is really domain-specific. Traditional thinking
    is that computer hardware is _much_ more reliable than
    software and software bugs are major source of misbehaviour.

    That hasn't been *proven*. And, "misbehavior" is not the same
    as *failure*.

    First, I mean relevant hardware, that is, hardware inside an MCU.
    I think that there are strong arguments that such hardware is
    more reliable than software. I have seen a claim, based on analysis
    of discovered failures, that software written to rigorous development
    standards exhibits on average about 1 bug (that leads to failure) per
    1000 lines of code. This means that even a small MCU has enough
    space for a handful of bugs. And for bigger systems it gets worse.

    But bugs need not be consequential. They may be undesirable or
    even annoying but need not have associated "costs".

    I have a cassette deck (Nakamichi Dragon) that has a design flaw.
    When the tape reaches the end of side "A", it is supposed to
    autoreverse and play the "back side". So, the revolutions
    counter counts *up* while playing side A and then back down
    while playing side B.

    However, if you eject the tape just as side A finishes and
    physically flip it over (so side B is the "front" side)
    pressing FORWARD PLAY (which is the direction that the
    reels were moving while the tape counter was counting UP),
    the tape will move FORWARD but the counter will count backwards.

    If you had removed the tape and placed some OTHER tape in
    the mechanism, the same behavior results (obviously) -- moving
    forward but counting backwards. If you turn the deck OFF
    and then back ON, the tape counter moves correctly.

    How am I harmed by this? To what monetary extent? It's
    a race in the hardware & software (the tape counter is
    implemented in a separate MCU). I can avoid the problem
    by NOT ejecting the tape just after the completion of
    side A...

    And among hardware failures transient upsets, like flipped
    bit are more likely than permanent failure. For example,

    That used to be the thinking with DRAM but studies have shown
    that *hard* failures are more common. These *can* be found...
    *if* you go looking for them!

    In another place I wrote that one of the studies I saw claimed that
    a significant number of the errors they detected (they monitored
    changes to a memory area that was supposed to be unmodified) were
    due to buggy software. And DRAM is special.

    If you have memory protection hardware (I do), then such changes
    can't casually occur; the software has to make a deliberate
    attempt to tell the memory controller to allow such a change.

    E.g., if you load code into RAM (from FLASH) for execution,
    are you sure the image *in* the RAM is the image from the FLASH?
    What about "now"? And "now"?!

    You are supposed to regularly verify sufficiently strong checksum.

    Really? Wanna bet that doesn't happen? How many Linux-based devices
    load applications and start a process to continuously verify the
    integrity of the TEXT segment?

    What are they going to do if they notice a discrepancy? Reload
    the application and hope it avoids any "soft spots" in memory?

    at low safety level you may assume that hardware of a counter
    generating PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values.

    Why would you expect the registers to lose their settings?
    Would you expect the CPU's registers to be similarly flakey?

    First, such checking is not my idea, but one point from checklist for
    low safety devices. Registers may change due to bugs, EMC events,
    cosmic rays and similar.

    Then you are dealing with high reliability designs. Do you
    really think my microwave oven, stove, furnace, telephone,
    etc. are designed to be resilient to those types of faults?
    Do you think the user could detect such an occurrence?

    Historically OS-es had a map of bad blocks on the disc and
    avoided allocating them. In principle on system with paging
    hardware the same could be done for DRAM, but I do not think
    anybody is doing this (if domain is serious enough to worry
    about DRAM failures, then it probably has redundant independent
    computers with ECC DRAM).

    Using ECC DRAM doesn't solve the problem. If you see errors
    reported by your ECC RAM (corrected errors), then when do
    you decide you are seeing too many and losing confidence that
    the ECC is actually *detecting* all multibit errors?

    ECC is part of the solution. It may reduce the probability of
    error so that you consider errors not serious enough to matter.
    And if you really care you may try to increase the error rate
    (say by putting RAM chips at increased temperature) and test that
    your detection and recovery strategy works OK.

    Studies suggest that temperature doesn't play the role that
    was suspected. What ECC does is give you *data* about faults.
    Without it, you have no way to know about faults /as they
    occur/.

    Testing tries to address faults at different points in their
    lifespans. Predictive Failure Analysis tries to alert to the
    likelihood of *impending* failures BEFORE they occur. So,
    whatever remedial action you might take can happen BEFORE
    something has failed. POST serves a similar role but tries to
    catch failures that have *occurred* before they can affect the
    operation of the device. BIST gives the user a way of making
    that determination (or receiving reassurance) "on demand".
    Run time diagnostics address testing while the device wants
    to remain in operation.

    What you *do* about a failure is up to you, your market and the
    expectations of your users. If a battery fails in SOME of my
    UPSs, they simply won't power on (and, if the periodic run-time
    test is enabled, that test will cause them to unceremoniously
    power themselves OFF as they try to switch to battery power).
    Other UPSs will provide an alert (audible/visual/log message)
    of the fact but give me the option of continuing to POWER
    those devices in the absence of backup protection.

    The latter is far more preferable to me as I can then decide
    when/if I want to replace the batteries without being forced
    to do so, *now*.

    The same is not true of smoke/CO detectors; when they detect
    a failed (or failING) battery, they are increasingly annoying
    in their insistence that the problem be addressed, now.
    So much so, that it leads to deaths due to the detector
    being taken out of service to stop the damn bleating.

    I have a great deal of latitude in how I handle failures.
    For example, I can busy-out more than 90% of the RAM in a device
    (if something suggested that it was unreliable) and *still*
    provide the functionality of that node -- by running the code
    on another node and leaving just the hardware drivers associated
    with *this* node in place. So, I can alert a user that a
    particular device is in need of service -- yet, continue
    to provide the services that were associated with that device.
    IMO, this is the best of all possible "failure" scenarios;
    the worst being NOT knowing that something is misbehaving.

  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Sat Oct 19 15:58:13 2024
    On Fri, 18 Oct 2024 21:05:14 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    In the US, ANYTHING can result in a lawsuit.

    Yes.

    But, "due diligence" can insulate the manufacturer, to some extent.
    No one ever *admits* to "doing a bad job".

    Actually due diligence /can't/ insulate a manufacturer if the issue
    goes to trial. Members of a jury may feel sorry for the litigant(s),
    or conclude that the manufacturer can afford whatever they award ...
    or maybe they just don't like the manufacturer's lawyer.

    Unlike judges, juries do /not/ have to justify their decisions.
    Moreover, in some US jurisdictions, the decision in a civil case need
    not be unanimous but only that of a quorum.


    If your doorbell malfunctions, what "damages" are you going
    to claim? If your garage door doesn't open when commanded?
    If your yard doesn't get watered? If you weren't promptly
    notified that the mail had just been delivered? Or, that
    the compressor in the freezer had failed and your foodstuffs
    had spoiled, as a result?

    The costs of litigation are reasonably high. Lawyers want
    to see the LIKELIHOOD of a big payout before entertaining
    such litigation.

    So they created the "class action", where all the litigants
    individually may have very small claims, but when put together the
    total becomes significant.

  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Sat Oct 19 15:25:43 2024
    On Fri, 18 Oct 2024 15:30:54 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Hi George,

    [Hope all is well with you and at home]


    Hi Don,

    Same ol', same ol'. Nothing much new to report.


    On 10/18/2024 2:42 PM, George Neuner wrote:
    WRT Don's question, I don't know the answer, but I suspect runtime
    diagnostics are /not/ routinely implemented for devices that are not
    safety critical. Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing. Even if the test is at low(est)
    priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real time situation.

    But, if you *know* when certain aspects of a device will be "called on",
    you can take advantage of that to schedule diagnostics when the device is
    not "needed". And, in the event that some unexpected "need" arises,
    can terminate or suspend the testing (possibly rendering the effort
    moot if it hasn't yet run to a conclusion).

    If you "know" a priori when some component will be needed, then you
    can do whatever you want when it is not. The problem is that many
    uses can't be easily anticipated.

    Which circles back to testing priority: if the test is interruptible
    and/or resumable, then it may be done whenever the component is
    available ... as long as it won't tie up the component if and when it
    becomes needed for something else.
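
    An interruptible/resumable test reduces to a state machine that does
    one bounded slice of work per call, as in this C sketch, where
    component_available(), test_step() and report_fault() stand in for
    whatever arbitration, test body and reporting the system provides:

        #include <stdbool.h>
        #include <stddef.h>

        extern bool component_available(void);  /* stand-in arbitration */
        extern bool test_step(size_t step);     /* one bounded unit of work */
        extern void report_fault(size_t step);  /* hypothetical hook */

        #define TOTAL_STEPS 1000u

        /* Called from the idle loop / lowest-priority task. It keeps its
           place across calls, so a preempted test resumes rather than
           restarts. */
        void diagnostic_slice(void)
        {
            static size_t next_step;
            if (!component_available())
                return;                 /* component busy: yield for now */
            if (!test_step(next_step))
                report_fault(next_step);
            next_step = (next_step + 1) % TOTAL_STEPS;
        }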


    E.g., I scrub freed memory pages (zero fill) so information doesn't
    leak across protection domains. As long as some minimum number
    of *scrubbed* pages are available for use "on demand", why can't
    I *test* the pages yet to be scrubbed?

    If you're testing memory pages, most likely you are tying up bandwidth
    in the memory system and slowing progress of the real applications.

    Also because you can't accurately judge the "minimum" needed. BSD and
    Linux both have this problem where a sudden burst of allocations
    exhausts the pool of zeroed pages, forcing demand zeroing of new pages
    prior to their re-assignment. Slows the system to a crawl when it
    happens.


    If there is no anticipated short term need for irrigation, why
    can't I momentarily activate individual valves and watch to see that
    the expected amount of water is flowing?

    Because then you are watering (however briefly) when it is not
    expected. What if there was a pesticide application that should not
    be wetted? What if a person is there and gets sprayed by your test?

    Properly, valve testing should be done concurrently with a scheduled
    watering: check that water is flowing when the valve should be open,
    and not flowing when the valve should be closed.
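
    In C, that concurrent check is little more than comparing the
    commanded valve state against a flow sensor after a settling delay.
    A sketch, with invented names (valve_is_commanded_open(), flow_lph())
    and a placeholder 5 L/h threshold:

        #include <stdbool.h>
        #include <stdint.h>

        extern bool valve_is_commanded_open(int zone);  /* hypothetical HAL */
        extern uint16_t flow_lph(void);                 /* flow, litres/hour */

        typedef enum { FLOW_OK, STUCK_CLOSED, STUCK_OPEN } flow_fault_t;

        /* Call a few seconds after each commanded state change so
           hydraulic transients have settled. */
        flow_fault_t check_zone(int zone)
        {
            uint16_t f = flow_lph();
            if (valve_is_commanded_open(zone) && f < 5)
                return STUCK_CLOSED;   /* commanded open, but no flow */
            if (!valve_is_commanded_open(zone) && f > 5)
                return STUCK_OPEN;     /* commanded closed, still flowing */
            return FLOW_OK;
        }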


    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    Automotive is going the way of aircraft: standby running lockstep with
    the primary and monitoring its data flow - able to reset the system if
    they disagree, or take over if the primary fails.



    The point here is that there is no "one fits all" philosophy you can
    follow ... what is proper to do depends on what the (sub)system does,
    its criticality, and on the components involved that may need to be
    tested.

  • From Don Y@21:1/5 to George Neuner on Sat Oct 19 14:32:48 2024
    On 10/19/2024 12:25 PM, George Neuner wrote:
    Same ol', same ol'. Nothing much new to report.

    No news is good news!

    On 10/18/2024 2:42 PM, George Neuner wrote:
    But, if you *know* when certain aspects of a device will be "called on",
    you can take advantage of that to schedule diagnostics when the device is
    not "needed". And, in the event that some unexpected "need" arises,
    can terminate or suspend the testing (possibly rendering the effort
    moot if it hasn't yet run to a conclusion).

    If you "know" a priori when some component will be needed, then you
    can do whatever you want when it is not. The problem is that many
    uses can't be easily anticipated.

    Granted, I can't know when a user *might* want to do some
    asynchronous task. But, the whole point of my system is to
    watch and anticipate needs based on observed behaviors.

    E.g., if the house is unoccupied, then it's not likely that
    anyone will want to watch TV -- unless they have *scheduled*
    a recording of a broadcast (in which case, I would know it).

    If the occupants are asleep, then it's not likely they will be
    going out for a drive.

    Which circles back to testing priority: if the test is interruptible
    and/or resumable, then it may be done whenever the component is
    available ... as long as it won't tie up the component if and when it
    becomes needed for something else.

    Exactly. I already have to deal with that in my decisions to
    power down nodes. If my actions are incorrect, then it introduces
    a delay in getting "back" to whatever state I should have been in.

    E.g., I scrub freed memory pages (zero fill) so information doesn't
    leak across protection domains. As long as some minimum number
    of *scrubbed* pages are available for use "on demand", why can't
    I *test* the pages yet to be scrubbed?

    If you're testing memory pages, most likely you are tying up bandwidth
    in the memory system and slowing progress of the real applications.

    But, they wouldn't be scrubbed if there were higher "priority"
    tasks demanding resources. I.e., some other "lower priority"
    task would have been accessing memory.

    Also because you can't accurately judge the "minimum" needed. BSD and
    Linux both have this problem where a sudden burst of allocations
    exhausts the pool of zeroed pages, forcing demand zeroing of new pages
    prior to their re-assignment. Slows the system to a crawl when it
    happens.

    Yes, but you have live users arbitrarily deciding they "need" those
    resources. And, have considerably more pages at risk for use.
    I've only got ~1G per node and (theoretically), a usage model of
    what resources are needed, when (where).

    *Not* clearing the pages leaves a side channel open for information
    leakage, so *that* isn't negotiable. Having some "deliberately
    dirty" pages could be an issue but, even "dirty", they are wiped of
    their previous contents after a single pass through the test.

    If there is no anticipated short term need for irrigation, why
    can't I momentarily activate individual valves and watch to see that
    the expected amount of water is flowing?

    Because then you are watering (however briefly) when it is not
    expected. What if there was a pesticide application that should not
    be wetted? What if a person is there and gets sprayed by your test?

    Irrigation, here, is not airborne. The ground may be wetted in the
    *immediate* vicinity of the emitters activated. But, they operate at
    very low flow rates (liters per HOUR).

    Your goal is to verify that the master valve(s) operate (I do that by
    opening the purge valve(s) and letting water drain into a sump); that
    the individual valves are operable; and that water *flows* when
    commanded.

    Properly, valve testing should be done concurrently with a scheduled
    watering: check that water is flowing when the valve should be open,
    and not flowing when the valve should be closed.

    That happens as part of normal operation. But, NOT knowing until that
    time can lead to plant death. E.g., if the roses don't get watered
    twice a day, they are toast (in this environment). If the cacti valves
    don't *close*, they are toast. If a line has "failed open", then
    you've got a geyser in the yard (and *no* irrigation to those plants).

    Repairs of this nature can be time consuming, depending on the nature
    of the failure (and cost thousands of dollars in labor). The more I
    can deduce about the nature of the failure, the quicker the service
    can be brought back up to par and the less the "diagnostic cost"
    of having someone do so, manually (digging up a yard to determine where
    a line has been punctured; inspecting individual emitters to determine
    which are blocked; visually monitoring for water flow per zone; etc.)

    [Amazing how much these "minimum wage jobs" actually end up costing
    when you have to hire someone! E.g., $160/month to have your "yard
    cleaned" -- *if* you can find someone to do it at that rate! Irrigation
    work starts at kilobucks and is relatively open-ended (as no one can
    assess the nature of the job until they start on it)]

    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    Automotive is going the way of aircraft: standby running lockstep with
    the primary and monitoring its data flow - able to reset the system if
    they disagree, or take over if the primary fails.

    The point here is that there is no "one fits all" philosophy you can
    follow ... what is proper to do depends on what the (sub)system does,
    its criticality, and on the components involved that may need to be
    tested.

    I am, rather, looking for ideas as to how (others) may have approached
    it. Most of the research I've uncovered deals with servers and their
    ilk. Or, historical information (e.g., MULTICS' "computing as a
    service" philosophy). E.g., *scheduling* testing vs. opportunistic
    testing.

  • From Don Y@21:1/5 to George Neuner on Sat Oct 19 16:26:48 2024
    On 10/19/2024 12:58 PM, George Neuner wrote:
    On Fri, 18 Oct 2024 21:05:14 -0700, Don Y
    But, "due diligence" can insulate the manufacturer, to some extent.
    No one ever *admits* to "doing a bad job".

    Actually due diligence /can't/ insulate a manufacturer if the issue
    goes to trial. Members of a jury may feel sorry for the litigant(s),
    or conclude that the manufacturer can afford whatever they award ...
    or maybe they just don't like the manufacturer's lawyer.

    You missed my "to some extent" qualifier. It allows you to make
    the case /to that jury/ that you *thought* about the potential
    problems and made a concerted attempt to address them. Contrast
    that with "Didn't it occur to the manufacturer that a customer
    might LIKELY use their device in THIS manner, resulting in
    THESE sorts of problems?"

    You can never anticipate every way a device can be "misapplied". But,
    not making ANY attempt to address "off label" uses is sure to
    result in a hostile attitude from those judging your behavior:
    "So, you made HOW MUCH money off this product and still
    couldn't afford the time/effort to have considered these issues?"

    "Small fish" are seldom targeted by such lawsuits as they have
    few assets and can fold without consequences to their owners.

    Unlike judges, juries do /not/ have to justify their decisions.
    Moreover, in some US jurisdictions, the decision in a civil case need
    not be unanimous but only that of a quorum.

    If your doorbell malfunctions, what "damages" are you going
    to claim? If your garage door doesn't open when commanded?
    If your yard doesn't get watered? If you weren't promptly
    notified that the mail had just been delivered? Or, that
    the compressor in the freezer had failed and your foodstuffs
    had spoiled, as a result?

    The costs of litigation are reasonably high. Lawyers want
    to see the LIKELIHOOD of a big payout before entertaining
    such litigation.

    So they created the "class action", where all the litigants
    individually may have very small claims, but when put together the
    total becomes significant.

    But you still have to demonstrate a loss. And, be able to
    argue some particular value to that loss. "Well, MAYBE
    Publishers' Clearinghouse came to my house to give me that
    oversized/cartoon check but the doorbell MIGHT not have
    rung. So, I want '$3000/week for life' as compensation..."

  • From Nioclásán Caileán de Ghlostéir@21:1/5 to All on Sun Oct 20 19:11:31 2024
    On Fri, 18 Oct 2024, Don Y wrote:
    "The costs of litigation are reasonably high."


    Hi Don,

    Court cases' costs are unreasonably high.

  • From Nioclásán Caileán de Ghlostéir@21:1/5 to All on Sun Oct 20 20:08:50 2024
    On Sat, 19 Oct 2024, Don Y wrote:
    "[. . .]
    [. . .] "Well, MAYBE
    Publishers' Clearinghouse came to my house to give me that
    oversized/cartoon check but the doorbell MIGHT not have
    rung. So, I want '$3000/week for life' as compensation...""


    Hi Don,

    Grady Booch confessed at an International Conference on Software
    Engineering in 2000 that his anti-engineering UML policy produced a
    doorbell which was not rung, so a friend spent money on a cellphone
    call to him to open a door! Fools pretend that Grady Booch is an
    engineering hero!

    Mister Fabio Bertella (then at Optisoft Srl,
    Via A. Bertoloni, 15,
    19038 Sarzana (SP),
    Italy) said in 2007 that Grady Booch "has never written a real program."

    Regards.

  • From Nioclásán Caileán de Ghlostéir@21:1/5 to Don Y on Sun Oct 20 19:15:29 2024
    On Fri, 18 Oct 2024, Don Y wrote:
    "> To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!"


    Hi Don,

    We were lectured by a person whom an expensive-German-car manufacturer
    paid to help make cars, but he told us that this rich German company
    did not pay him enough to buy 1 of these cars, so he drives a cheap car
    from a different company. He might endanger a client's clients' lives,
    but less so his own (though of course a malfunctioning German car could
    crash into him).

  • From George Neuner@21:1/5 to Master_Fontaine_is_dishonest@Strand on Sun Oct 20 15:44:33 2024
    On Sun, 20 Oct 2024 19:11:31 +0200, Nioclásán Caileán de Ghlostéir <Master_Fontaine_is_dishonest@Strand_in_London.Gov.UK> wrote:

    On Fri, 18 Oct 2024, Don Y wrote:
    "The costs of litigation are reasonably high."


    Hi Don,

    Court cases' costs are unreasonably high.

    Courts need not be involved ... just lawyers.

    A few weeks ago there was an article in Forbes magazine saying that
    the /average/ billing by lawyers at moderately sized firms in the US
    is now $1500..$1800/hr, and billing by lawyers from top firms is now $2600..$3000/hr.

    Litigating a product liability case routinely takes hundreds to
    thousands of hours, depending on the case. True, many of those hours will be for
    paralegals unless/until the case gets to arbitration or court - but
    even small firms now bill their paralegals at over $200/hr.

  • From Nioclásán Caileán de Ghlostéir@21:1/5 to All on Sun Oct 20 23:14:52 2024
    On Sun, 20 Oct 2024, George Neuner wrote:
    "A few weeks ago there was an article in Forbes magazine saying that
    the /average/ billing by lawyers at moderately sized firms in the US
    is now $1500..$1800/hr,"

    Yikes!

  • From Don Y@21:1/5 to Don Y on Wed Oct 23 05:53:44 2024
    On 10/19/2024 2:32 PM, Don Y wrote:
    The point here is that there is no "one fits all" philosophy you can
    follow ... what is proper to do depends on what the (sub)system does,
    its criticality, and on the components involved that may need to be
    tested.

    I am, rather, looking for ideas as to how (others) may have approached
    it.  Most of the research I've uncovered deals with servers and their
    ilk.  Or, historical information (e.g., MULTICS' "computing as a
    service" philosophy).  E.g., *scheduling* testing vs. opportunistic
    testing.

    "Opportunistic" seems to work well -- *if* you declare the resources
    you will need and wait until you can acquire them.

    The downside is that you may NEVER be able to acquire them,
    based on what processes are active on a node. You wouldn't want
    the diagnostic task to have to KNOW those things!

    As different tests may require different resources, this
    becomes problematic; do you request the largest set? A
    smaller set? Or, design a mechanism to allow for arbitrarily
    complex combinations to be specified <frown>
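
    One sketch of such a declaration in C: each test states its full
    resource set up front and acquires it all-or-nothing (the resource
    IDs and try_acquire_all() are invented for illustration):

        #include <stdbool.h>
        #include <stdint.h>

        /* Each diagnostic declares, up front, everything it will touch. */
        enum { RES_DRAM = 1u << 0, RES_DMA = 1u << 1, RES_TIMER2 = 1u << 2 };

        extern bool try_acquire_all(uint32_t mask);  /* all-or-nothing, no wait */
        extern void release_all(uint32_t mask);

        bool run_diagnostic(uint32_t needs, bool (*test)(void))
        {
            if (!try_acquire_all(needs))
                return false;          /* something busy: retry later */
            bool ok = test();
            release_all(needs);
            return ok;
        }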

    This became apparent when running the DRAM test using the
    DRAM emulator (non-production board designed to validate the
    DRAM test by allowing arbitrary fault injection, on demand).
    While it was known that *some* tests could NOT be run out of
    DRAM (which limits their efficacy in a running system), there
    were other system resources that were "silently" called upon
    that would have impacted other coexecuting tasks. <frown>

    The good news (wrt DRAM testing) is that checking for "stuck at"
    faults -- the most prevalent described in published research -- makes
    no special demands on resources, beyond access to DRAM!
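
    A minimal march-style sketch of such a stuck-at test follows; real
    March C-/March B algorithms add further elements to also catch
    coupling faults, and the routine must run from code and stack that
    do not live in the region under test:

        #include <stddef.h>
        #include <stdint.h>

        /* Write 0s ascending; verify-and-flip to 1s ascending;
           verify-and-flip back to 0s descending. Catches stuck-at
           faults using nothing but access to the RAM itself. */
        int march_stuck_at(volatile uint32_t *m, size_t n)
        {
            size_t i;
            for (i = 0; i < n; i++) m[i] = 0;
            for (i = 0; i < n; i++) {
                if (m[i] != 0) return 0;
                m[i] = 0xFFFFFFFFu;
            }
            for (i = n; i-- > 0; ) {
                if (m[i] != 0xFFFFFFFFu) return 0;
                m[i] = 0;
            }
            return 1;
        }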

    Moral of story: CAREFULLY enumerate (and declare) ALL such
    resources. And, consider how realistic it is to expect
    ALL of them to be available serendipitously in a given node.

    Else, resort to *scheduling* the diagnostic ("maintenance period")

  • From Waldek Hebisch@21:1/5 to Don Y on Thu Oct 24 16:34:11 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/19/2024 6:53 AM, Waldek Hebisch wrote:
    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    What to test is really domain-specific. Traditional thinking
    is that computer hardware is _much_ more reliable than
    software and software bugs are major source of misbehaviour.

    That hasn't been *proven*. And, "misbehavior" is not the same
    as *failure*.

    First, I mean relevant hardware, that is, hardware inside an MCU.
    I think that there are strong arguments that such hardware is
    more reliable than software. I have seen a claim, based on analysis
    of discovered failures, that software written to rigorous development
    standards exhibits on average about 1 bug (that leads to failure) per
    1000 lines of code. This means that even a small MCU has enough
    space for a handful of bugs. And for bigger systems it gets worse.

    But bugs need not be consequential. They may be undesirable or
    even annoying but need not have associated "costs".

    The point is that you cannot eliminate all bugs. Rather, you
    should have simple code, with the aim of limiting the "cost" of bugs.

    And among hardware failures transient upsets, like flipped
    bit are more likely than permanent failure. For example,

    That used to be the thinking with DRAM but studies have shown
    that *hard* failures are more common. These *can* be found...
    *if* you go looking for them!

    In another place I wrote that one of the studies I saw claimed that
    a significant number of the errors they detected (they monitored
    changes to a memory area that was supposed to be unmodified) were
    due to buggy software. And DRAM is special.

    If you have memory protection hardware (I do), then such changes
    can't casually occur; the software has to make a deliberate
    attempt to tell the memory controller to allow such a change.

    The tests were run on Linux boxes with normal memory protection.
    Memory protection does not prevent troubles due to bugs in
    privileged code. Of course, you can think that you can do
    better than Linux programmers.

    E.g., if you load code into RAM (from FLASH) for execution,
    are you sure the image *in* the RAM is the image from the FLASH?
    What about "now"? And "now"?!

    You are supposed to regularly verify sufficiently strong checksum.

    Really? Wanna bet that doesn't happen? How many Linux-based devices
    load applications and start a process to continuously verify the
    integrity of the TEXT segment?

    Using something like Linux means that you do not care about rare
    problems (or are prepared to resolve them without the help of the OS).

    What are they going to do if they notice a discrepancy? Reload
    the application and hope it avoids any "soft spots" in memory?

    AFAICS the rules about checking the image were originally intended
    for devices executing code directly from flash; if your "primary
    truth" fails, possibilities are limited. With DRAM failures one
    can do much better. The question is mainly one of probabilities and
    effort.

    at low safety level you may assume that hardware of a counter
    generating PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values.

    Why would you expect the registers to lose their settings?
    Would you expect the CPU's registers to be similarly flakey?

    First, such checking is not my idea, but one point from checklist for
    low safety devices. Registers may change due to bugs, EMC events,
    cosmic rays and similar.

    Then you are dealing with high reliability designs. Do you
    really think my microwave oven, stove, furnace, telephone,
    etc. are designed to be resilient to those types of faults?
    Do you think the user could detect such an occurrence?

    IIUC microwave, stove and furnace should be. In a cell phone the
    BMS should be safe and the core radio is tightly regulated. Other
    parts seem to be at the quality/reliability level of PCs.

    You clearly want to make your devices more reliable. Bugs
    and various events happen and extra checking is actually
    quite cheap. It is for you to decide if you need/want
    it.

    Historically OS-es had a map of bad blocks on the disc and
    avoided allocating them. In principle on system with paging
    hardware the same could be done for DRAM, but I do not think
    anybody is doing this (if domain is serious enough to worry
    about DRAM failures, then it probably has redundant independent
    computers with ECC DRAM).

    Using ECC DRAM doesn't solve the problem. If you see errors
    reported by your ECC RAM (corrected errors), then when do
    you decide you are seeing too many and losing confidence that
    the ECC is actually *detecting* all multibit errors?

    ECC is part of the solution. It may reduce the probability of
    error so that you consider errors not serious enough to matter.
    And if you really care you may try to increase the error rate
    (say by putting RAM chips at increased temperature) and test that
    your detection and recovery strategy works OK.

    Studies suggest that temperature doesn't play the role that
    was suspected. What ECC does is give you *data* about faults.
    Without it, you have no way to know about faults /as they
    occur/.

    Well, there is evidence that increased temperature increases the
    chance of errors. More precisely, expect errors when you
    operate DRAM close to its maximum allowed temperature. The point is
    that you can cause errors and in that way test your recovery
    strategy (untested recovery code is likely to fail when/if
    it is needed).

    Testing tries to address faults at different points in their
    lifespans. Predictive Failure Analysis tries to alert to the
    likelihood of *impending* failures BEFORE they occur. So,
    whatever remedial action you might take can happen BEFORE
    something has failed. POST serves a similar role but tries to
    catch failures that have *occurred* before they can affect the
    operation of the device. BIST gives the user a way of making
    that determination (or receiving reassurance) "on demand".
    Run time diagnostics address testing while the device wants
    to remain in operation.

    What you *do* about a failure is up to you, your market and the
    expectations of your users. If a battery fails in SOME of my
    UPSs, they simply won't power on (and, if the periodic run-time
    test is enabled, that test will cause them to unceremoniously
    power themselves OFF as they try to switch to battery power).
    Other UPSs will provide an alert (audible/visual/log message)
    of the fact but give me the option of continuing to POWER
    those devices in the absence of backup protection.

    The latter is far more preferable to me as I can then decide
    when/if I want to replace the batteries without being forced
    to do so, *now*.

    The same is not true of smoke/CO detectors; when they detect
    a failed (or failING) battery, they are increasingly annoying
    in their insistence that the problem be addressed, now.
    So much so, that it leads to deaths due to the detector
    being taken out of service to stop the damn bleating.

    I have a great deal of latitude in how I handle failures.
    For example, I can busy-out more than 90% of the RAM in a device
    (if something suggested that it was unreliable) and *still*
    provide the functionality of that node -- by running the code
    on another node and leaving just the hardware drivers associated
    with *this* node in place. So, I can alert a user that a
    particular device is in need of service -- yet, continue
    to provide the services that were associated with that device.
    IMO, this is the best of all possible "failure" scenarios;
    the worst being NOT knowing that something is misbehaving.

    Good.

    --
    Waldek Hebisch

  • From Waldek Hebisch@21:1/5 to Don Y on Thu Oct 24 17:52:02 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
    One of the FETs that controls the shifting of the automatic
    transmission has failed open. How do you detect that /and recover
    from it/?

    Detecting such a thing looks easy. Recovery is tricky, because
    if you have a spare FET and activate it there is a good chance that
    it will fail for the same reason that the first FET failed.
    OTOH, if you have a properly designed circuit around the FET, a
    disturbance strong enough to kill the FET is likely to kill
    the controller too.

    The immediate goal is to *detect* that a problem exists.
    If you can't detect, then attempting to recover is a moot point.

    In a car you have signals from the wheels and engine; you can use
    those to compute the transmission ratio and check that it is the
    expected one. Or simply have extra inputs which monitor the FET
    output.
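
    A plausibility check of that sort might look like the C sketch
    below; the ratio table and the 10% tolerance are invented for
    illustration:

        /* Measured ratio (engine rpm / output-shaft rpm) should match
           the ratio of the commanded gear. Table values are fictitious. */
        static const float gear_ratio[] = { 0.0f, 3.5f, 2.1f, 1.4f, 1.0f, 0.8f };

        int gear_engaged_ok(int gear, float engine_rpm, float shaft_rpm)
        {
            if (gear < 1 || gear > 5 || shaft_rpm < 100.0f)
                return 1;              /* neutral, or too slow to judge */
            float measured = engine_rpm / shaft_rpm;
            float expected = gear_ratio[gear];
            return measured > 0.9f * expected && measured < 1.1f * expected;
        }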

    The camera/LIDAR that the self-drive feature uses is providing
    incorrect data... etc.

    Use 3 (or more) and voting. Of course, this increases cost and one
    has to judge if the increase in cost is worth the increase in safety

    As well as the reliability of the additional "voting logic".
    If not a set of binary signals, determining what the *correct*
    signal may be can be problematic.

    Matching images is now a standard technology. And in this case
    the "voting logic" is likely to be software, and the main trouble is
    possible bugs.
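
    For near-equal (non-binary) readings, a 2-of-3 voter can accept the
    median when at least two channels agree within a tolerance. A C
    sketch, with "close enough" left as an application-specific
    parameter:

        /* Median of three values. */
        static float med3(float a, float b, float c)
        {
            if ((a >= b) == (a <= c)) return a;
            if ((b >= a) == (b <= c)) return b;
            return c;
        }

        /* Returns 1 and the voted value if a quorum (2 of 3) agrees
           within 'tol'; returns 0 when the channels disagree and the
           sensor set should be flagged. */
        int vote3(float a, float b, float c, float tol, float *out)
        {
            float m = med3(a, b, c);
            int agree = (a > m - tol && a < m + tol)
                      + (b > m - tol && b < m + tol)
                      + (c > m - tol && c < m + tol);
            if (agree >= 2) { *out = m; return 1; }
            return 0;
        }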

    (in a self-driving car using multiple sensors looks like a no-brainer,
    but if this is just an assist to increase driver comfort then the
    result may be different).

    It is different only in the sense of liability and exposure to
    loss. I am not assigning values to those consequences but,
    rather, looking to address the issue of run-time testing, in
    general.

    I doubt there are general solutions. Various parts of your system
    may have enough common features to allow a single strategy
    within your system. But it is unlikely to generalize to other
    systems. To put it differently, there are probabilities
    of various events and associated costs. Even if you
    refuse to quantify probabilities and costs, your design
    decisions (assuming they are rational) will give some
    estimate of them.

    Even if NONE of the failures can result in injury or loss,
    it is unlikely that a user WANTS to have a defective product.
    If the user is technically unable to determine when the
    product is "at fault" (vs. his own misunderstanding of how it
    is *supposed* to work), then those failures contribute to
    the users' frustrations with the product.

    There are innumerable failures that can occur to compromise
    the "system" and no *easy*/inexpensive/reliable way to detect
    and recover from *all* of them.

    Sure. But for common failures, or serious failures having non-negligible
    probability, redundancy may offer a cheap way to increase reliability.

    For critical functions a car could have 3 processors with
    voting circuitry. With separate chips this would be more expensive
    than single processor, but increase of cost probably would be
    negligible compared to cost of the whole car. And when integrated
    on a single chip cost difference would be tiny.

    IIUC a car controller may "reboot" during a ride. Instead of
    rebooting it could hand over work to a backup controller.

    How do you know the circuitry (and other mechanisms) that
    implement this hand-over are operational?

    It does not matter if handover _always_ works. What matters is
    whether a system with handover has a lower chance of failure than a
    system without handover. With statistics of actual failures
    (which I do not have but manufacturers should have) and
    after some testing, one can estimate the failure probability of
    different designs and possibly decide to use handover.

    Again, I am not interested in "recovery" as that varies with
    the application and risk assessment. What I want to concentrate
    on is reliably *detecting* faults before they lead to product
    failures.

    I contend that the hardware in many devices has that capability
    (to some extent) but that it is underutilized; that the issue
    of detecting faults *after* POST is one that doesn't see much
    attention. The likely thinking being that POST will flag it the
    next time the device is restarted.

    And, that's not acceptable in long-running devices.

    Well, you write that you are not trying to build a high-reliability
    device. However, a device which correctly operates for years
    without interruption is considered a "high availability" device,
    which is a kind of high reliability. And techniques for high
    reliability seem appropriate here.

    It is VERY difficult to design reliable systems. I am not
    attempting that. Rather, I am trying to address the fact that
    the reassurances of POST (and, at the user's prerogative, BIST)
    are not guaranteed when a device runs "for long periods of time".

    You may have tests essentially as part of normal operation.

    I suspect most folks have designed devices with UARTs. And,
    having written a driver for it, have noted that framing, parity
    and overrun errors are possible.

    Ask yourself how many of those systems ever *use* that information!
    Is there even a means of propagating it up out of the driver?

    Well, I always use no-parity transmission mode. The standard way is
    to use checksums and acknowledgments. That way you know if
    transmission is working correctly. What extra info do you expect
    from looking at detailed error info from the UART?
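
    A minimal framed send with checksum, acknowledgment and retry might
    look like the C sketch below; uart_putc()/uart_getc(), the CRC-8
    polynomial and the retry count are illustrative choices, not any
    particular protocol:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        extern void uart_putc(uint8_t c);                    /* assumed driver */
        extern bool uart_getc(uint8_t *c, uint32_t timeout_ms);

        /* CRC-8, polynomial 0x07, no reflection. */
        static uint8_t crc8(const uint8_t *p, size_t n)
        {
            uint8_t crc = 0;
            while (n--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                    crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
            }
            return crc;
        }

        /* Frame: [len][payload...][crc]. Peer replies ACK (0x06) on a
           good frame; anything else (or silence) triggers a resend. */
        bool send_reliable(const uint8_t *msg, uint8_t len)
        {
            for (int attempt = 0; attempt < 3; attempt++) {
                uart_putc(len);
                for (uint8_t i = 0; i < len; i++) uart_putc(msg[i]);
                uart_putc(crc8(msg, len));
                uint8_t reply;
                if (uart_getc(&reply, 100) && reply == 0x06)
                    return true;
            }
            return false;   /* link or peer misbehaving: report upstream */
        }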

    --
    Waldek Hebisch

  • From Don Y@21:1/5 to Waldek Hebisch on Thu Oct 24 14:49:42 2024
    On 10/24/2024 10:52 AM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
    One of the FETs that controls the shifting of the automatic
    transmission has failed open. How do you detect that /and recover
    from it/?

    Detecting such a thing looks easy. Recovery is tricky, because
    if you have a spare FET and activate it there is a good chance that
    it will fail for the same reason that the first FET failed.
    OTOH, if you have a properly designed circuit around the FET, a
    disturbance strong enough to kill the FET is likely to kill
    the controller too.

    The immediate goal is to *detect* that a problem exists.
    If you can't detect, then attempting to recover is a moot point.

    In a car you have signals from the wheels and engine; you can use
    those to compute the transmission ratio and check if it is the
    expected one. Or simply have extra inputs which monitor the FET output.
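
    A sketch of that kind of plausibility check (the ratios, tolerance
    and signal names are invented for illustration):

        #include <math.h>

        /* Nominal ratios for a hypothetical 4-speed automatic. */
        static const double gear_ratio[4] = { 2.80, 1.55, 1.00, 0.70 };

        /* engine_rpm and wheel_rpm come from sensors the car already has;
           'gear' (assumed 0..3) is the gear the controller believes it has
           commanded. Torque-converter slip makes this a coarse check,
           hence the wide tolerance. Returns nonzero on disagreement. */
        int transmission_fault(double engine_rpm, double wheel_rpm,
                               int gear, double final_drive)
        {
            if (wheel_rpm < 60.0)      /* too slow for a meaningful ratio */
                return 0;
            double observed = engine_rpm / (wheel_rpm * final_drive);
            double expected = gear_ratio[gear];
            return fabs(observed - expected) > 0.15 * expected;
        }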

    But a *user* can't do that. They can only claim "something doesn't
    feel right about the drive"...

    So, if the controller doesn't do it, what recourse?

    The camera/LIDAR that the self-drive feature uses is providing
    incorrect data... etc.

    Use 3 (or more) and voting. Of course, this increases cost and one
    has to judge if the increase in cost is worth the increase in safety

    As well as the reliability of the additional "voting logic".
    If not a set of binary signals, determining what the *correct*
    signal may be can be problematic.

    Matching images is now a standard technology. And in this case
    the "voting logic" is likely to be software and the main trouble
    is possible bugs.

    The data must be available concurrently in order to "vote" on
    them. And, must be "close enough" to not consider them to differ.
    For high reliability applications, you often *compute* the results
    in different ways / algorithms -- to highlight any issues in
    one implementation over the other. So, the temporal path to
    "their solutions" isn't the same.

    (in a self-driving car using multiple sensors looks like a no-brainer,
    but if this is just an assist to increase driver comfort then
    the result may be different).

    It is different only in the sense of liability and exposure to
    loss. I am not assigning values to those consequences but,
    rather, looking to address the issue of run-time testing, in
    general.

    I doubt there are general solutions. Various parts of your system
    may have enough common features to allow a single strategy
    in your system. But it is unlikely to generalize to other
    systems. To put it differently, there are probabilities
    of various events and associated costs. Even if you
    refuse to quantify probabilities and costs, your design
    decisions (assuming they are rational) will give some
    estimate of them.

    I've asked for other people's experiences. I've not expected
    them to have solved *my* problem. Nor do I expect my solution
    to solve theirs. Likewise, why something like Linux wouldn't
    have "the" solution.

    Again, I am not interested in "recovery" as that varies with
    the application and risk assessment. What I want to concentrate
    on is reliably *detecting* faults before they lead to product
    failures.

    I contend that the hardware in many devices has that capability
    (to some extent) but that it is underutilized; that the issue
    of detecting faults *after* POST is one that doesn't see much
    attention. The likely thinking being that POST will flag it the
    next time the device is restarted.

    And, that's not acceptable in long-running devices.

    Well, you write that you do not try to build a high reliability
    device. However, a device which correctly operates for years
    without interruption is considered a "high availability" device,
    which is a kind of high reliability. And techniques for high
    reliability seem appropriate here.

    No. Most devices can't afford the cost/complexity of a true
    high reliability/redundant solution.

    Your car has SOME redundancy in how it handles braking (two
    chamber master cylinder plus "emergency/parking" brake PLUS
    using the engine to slow the vehicle). Yet, absolutely NO
    protection against a catastrophic failure of the steering!

    Redundancy in braking is relatively easy to provide -- esp
    in the volumes produced. So, adds little to the cost and
    complexity of the vehicle. Adding a redundant steering
    mechanism... where would you even BEGIN to address that?

    Cars are a great example of the tradeoffs involved. You invest
    a *little* to detect and report problems instead of a LOT to
    continue operating in their presence. Why not have duplicate
    turn signal indicators (front, rear and side) to guard against
    a bulb failure? Much easier and cheaper to detect that a filament
    has opened and report that to the driver (and HOPE he gets around
    to fixing it).

    If the driver can't be TOLD of faults and failures, then he
    is in the dark as to how effectively his device is performing its
    required actions. "CHECK ENGINE" really does mean that the
    engine *needs* attention. What difference, "CHECK DRAM"?

    It is VERY difficult to design reliable systems. I am not
    attempting that. Rather, I am trying to address the fact that
    the reassurances provided by POST (and, at the user's prerogative,
    BIST) are not guaranteed when a device runs "for long periods of time".

    You may have tests essentially as part of normal operation.

    I suspect most folks have designed devices with UARTs. And,
    having written a driver for it, have noted that framing, parity
    and overrun errors are possible.

    Ask yourself how many of those systems ever *use* that information!
    Is there even a means of propagating it up out of the driver?

    Well, I always use no-parity transmission mode. The standard way is
    to use checksums and acknowledgments. That way you know if
    transmission is working correctly. What extra info do you expect
    from looking at the detailed error info from the UART?

    That assumes you can control the messages exchanged. If I
    attach a TTY to the console -- routed through a serial port -- on
    my computer, what should the checksum be when I see the "login: "
    message? When I type my name, what checksum should I append
    to the identifier?

    I.e., serial port protocols don't *require* these things.
    If the computer sees "do~n" -- where '~' indicates an overrun
    error in the preceding character's reception -- it KNOWS
    that I haven't typed exactly three characters: d, o, n.
    So, it shouldn't even ASK for my password, choosing, instead,
    to reissue the login: banner (because it wouldn't know which
    password to validate).

    Likewise, if I saw "log~in: " on my TTY, I *know* that it
    isn't saying "login: " because AT LEAST one character has
    been omitted in that '~'.

    This is easy to fix -- in ALL interactions. But, requires
    the driver to propagate these errors up the stack and the
    application layer to act on them. I.e., if the application
    layer encounters lots of overrun (or parity/framing) errors,
    SOMETHING is wrong with the link and/or the driver.
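
    One way to do that is to widen the driver's per-character interface
    so status travels with the data; a sketch (the status-bit layout and
    the application hooks are hypothetical):

        #include <stdint.h>

        #define RX_OVERRUN 0x01    /* flags returned with each character */
        #define RX_FRAMING 0x02
        #define RX_PARITY  0x04

        typedef struct {
            uint8_t ch;            /* the character, as captured */
            uint8_t err;           /* RX_* flags; 0 means clean */
        } rx_event_t;

        extern uint8_t uart_status(void);   /* hypothetical hardware access */
        extern uint8_t uart_data(void);
        extern void report_fault(const char *what);
        extern void discard_line_and_reprompt(void);
        extern void accept_char(uint8_t ch);

        /* Driver layer: never silently discards the error bits. */
        rx_event_t uart_read(void)
        {
            rx_event_t ev = { 0, 0 };
            uint8_t s = uart_status();
            if (s & 0x08) ev.err |= RX_OVERRUN;   /* bit positions invented */
            if (s & 0x10) ev.err |= RX_FRAMING;
            if (s & 0x20) ev.err |= RX_PARITY;
            ev.ch = uart_data();
            return ev;
        }

        /* Application layer: a corrupted line is re-requested instead of
           acted upon, and a rising error rate is itself reported. */
        void handle_input(void)
        {
            static unsigned errors;
            rx_event_t ev = uart_read();
            if (ev.err) {
                if (++errors > 100)
                    report_fault("serial link degraded");
                discard_line_and_reprompt();      /* don't act on "do~n" */
                return;
            }
            accept_char(ev.ch);
        }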

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Waldek Hebisch on Thu Oct 24 14:28:44 2024
    On 10/24/2024 9:34 AM, Waldek Hebisch wrote:
    That hasn't been *proven*. And, "misbehavior" is not the same
    as *failure*.

    First, I mean relevant hardware, that is, hardware inside an MCU.
    I think that there are strong arguments that such hardware is
    more reliable than software. I have seen a claim, based on analysis
    of discovered failures, that software written to rigorous development
    standards exhibits on average about 1 bug (that leads to failure) per
    1000 lines of code. This means that even a small MCU has enough
    space for a handful of bugs. And for bigger systems it gets worse.

    But bugs need not be consequential. They may be undesirable or
    even annoying but need not have associated "costs".

    The point is that you can not eliminate all bugs. Rather, you
    should have simple code with the aim of preventing the "cost" of bugs.

    Code need only be "as simple as possible, /but no simpler/".
    The problem defines the complexity of the solution.

    And among hardware failures transient upsets, like a flipped
    bit, are more likely than permanent failures. For example,

    That used to be the thinking with DRAM but studies have shown
    that *hard* failures are more common. These *can* be found...
    *if* you go looking for them!

    In another place I wrote that one of the studies I saw claimed that
    a significant number of the errors they detected (they monitored changes
    to a memory area that was supposed to be unmodified) were due to buggy
    software. And DRAM is special.

    If you have memory protection hardware (I do), then such changes
    can't casually occur; the software has to make a deliberate
    attempt to tell the memory controller to allow such a change.

    The tests were run on Linux boxes with normal memory protection.
    Memory protection does not prevent troubles due to bugs in
    privileged code. Of course, you can think that you can do
    better than Linux programmers.

    Linux code is far from "as simple as possible". They are constantly
    trying to make a GENERAL PURPOSE solution for a wide variety of
    applications that THEY envision.

    E.g., if you load code into RAM (from FLASH) for execution,
    are you sure the image *in* the RAM is the image from the FLASH?
    What about "now"? And "now"?!

    You are supposed to regularly verify a sufficiently strong checksum.

    Really? Wanna bet that doesn't happen? How many Linux-based devices
    load applications and start a process to continuously verify the
    integrity of the TEXT segment?
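
    A sketch of what such a process might look like -- a low-priority
    task that re-checks the loaded image a slice at a time (the linker
    symbols and the incremental-CRC API are assumptions, not any
    particular system's):

        #include <stdint.h>
        #include <stddef.h>

        extern const uint8_t __text_start[], __text_end[]; /* linker script */

        extern uint32_t crc32_init(void);                  /* assumed CRC API */
        extern uint32_t crc32_update(uint32_t c, const uint8_t *p, size_t n);
        extern uint32_t crc32_final(uint32_t c);
        extern void     report_fault(const char *what);
        extern void     sleep_ms(unsigned ms);

        static uint32_t text_crc_at_load;  /* captured once, right after load */

        void text_scrubber_init(void)
        {
            uint32_t c = crc32_init();
            c = crc32_update(c, __text_start,
                             (size_t)(__text_end - __text_start));
            text_crc_at_load = crc32_final(c);
        }

        /* Re-verify in small chunks so the check steals only a bounded
           amount of time from normal operation. */
        void text_scrubber_task(void)
        {
            const size_t CHUNK = 4096;
            for (;;) {
                uint32_t c = crc32_init();
                for (const uint8_t *p = __text_start; p < __text_end;
                     p += CHUNK) {
                    size_t n = (size_t)(__text_end - p);
                    c = crc32_update(c, p, n < CHUNK ? n : CHUNK);
                    sleep_ms(10);               /* yield to real work */
                }
                if (crc32_final(c) != text_crc_at_load)
                    report_fault("TEXT no longer matches the loaded image");
                sleep_ms(60u * 1000u);          /* one full pass per minute */
            }
        }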

    Using something like Linux means that you do not care about rare
    problems (or are prepared to resolve them without help of OS).

    Using <anything> means you don't care about any of the issues
    that the <anything> developers considered unimportant or were
    unable or unwilling to address.

    What are they going to do if they notice a discrepancy? Reload
    the application and hope it avoids any "soft spots" in memory?

    AFAICS the rule about checking the image was originally intended
    for devices executing code directly from flash; if your "primary
    truth" fails, possibilities are limited. With DRAM failures one
    can do much better. The question is mainly probabilities and
    effort.

    One typically doesn't assume flash fails WHILE in use (though, of course,
    it does). DRAM is documented to fail while in use. If you have
    "little enough" of it then you can hope the failures are far enough
    apart, in time, that they just look like "bugs". This is especially
    true if your device is only running "part time" or is unobserved
    for long stretches of time as that "bug" can manifest in numerous
    ways depending on the nature of the DRAM fault and *if* the CPU
    happens to encounter it. E.g., just like a RAID array, absent
    patrol reads, you never know if a file that hasn't been referenced
    in months has suffered any corruption.

    In many applications, there are large swaths of code that get
    executed once or infrequently. E.g., how often does the first
    line of code after main() get executed? If it was corrupted
    after that initial execution/examination, would you know? or care?
    Ah, but if you don't NOTICE that it has been corrupted, then you
    will proceed gleefully ignorant of the fact that your memory
    system is encountering problem(s) and, thus, won't take any steps
    to address them.

    at a low safety level you may assume that the hardware of a counter
    generating a PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values.

    Why would you expect the registers to lose their settings?
    Would you expect the CPU's registers to be similarly flakey?

    First, such checking is not my idea, but one point from a checklist
    for low-safety devices. Registers may change due to bugs, EMC events,
    cosmic rays and similar.

    Then you are dealing with high reliability designs. Do you
    really think my microwave oven, stove, furnace, telephone,
    etc. are designed to be resilient to those types of faults?
    Do you think the user could detect such an occurrence?

    IIUC microwave, stove and furnace should be. In a cell phone the
    BMS should be safe and the core radio is tightly regulated. Other
    parts seem to be at the quality/reliability level of PCs.

    You clearly want to make your devices more reliable. Bugs
    and various events happen and extra checking is actually
    quite cheap. It is for you to decide if you need/want
    it.

    Unlike a phone or "small appliance" that you can carry in to a
    service center -- or, return to the store where purchased -- I
    can't expect a user to just pick up a failed/suspect device and
    exchange it for a "new" one. Could you remove the PCB that
    controls your furnace and bring it <somewhere> to have someone
    tell you if it is "acting up" and in need of replacement?
    Would you even THINK to do this?

    Instead, if you were having problems (or suspected you were)
    with your "furnace", you would call a service man to come to
    *it* and have a look. Here (US), that is costly -- they aren't
    going to drive to you without some sort of compensation
    (and they aren't going to do so for "Uber rates").

    E.g., a few winters past, the natural gas supply to our city was
    "compromised"; it was unusually cold and demand exceeded the
    ability of the system to deliver gas at sufficient pressure to
    the entire city.

    Most furnaces rely on a certain flow of fuel to operate. And,
    contain sensors to shut down the furnace if they sense an
    inadequate fuel supply. So, much of the city had no heat.
    This resulted in most plumbing contractors being overwhelmed
    with calls for service.

    Of course, there was nothing they could *do* to correct the problem.
    But, that didn't stop them from taking orders for service,
    dispatching their trucks and BILLING each of these customers.

    One could argue that a more moral industry might have recommended
    callers "wait a while as there is a citywide problem with the
    gas supply". But, maybe there was something ELSE at fault
    as some of those callers? And, it's a great opportunity to
    get into their homes and try to make a sale to upgrade your
    "old equipment" to unsuspecting homeowners (an HVAC system is
    in the $10K price range for a nominal home).

    Bring your phone into a store complaining of a problem and
    they will likely show you what you are doing wrong. They *may*
    suggest you upgrade that 3-year-old model -- but, if that
    is the sole reason for their suggested upgrade, you will likely
    decline and walk away. "It's just a phone; if it keeps giving
    me problems, THEN I will upgrade". That's not the case with
    something "magical" (in the eyes of a homeowner) like an
    HVAC system. Upgrading later may be even less convenient than
    it is now! And, it actually *is* an old system... (who defines
    old?)

    Studies suggest that temperature doesn't play the role that
    was suspected. What ECC does is give you *data* about faults.
    Without it, you have no way to know about faults /as they
    occur/.

    Well, there is evidence that increased temperature increases the
    chance of errors. More precisely, expect errors when you
    operate DRAM close to the max allowed temperature. The point is
    that you can cause errors and that way test your recovery
    strategy (untested recovery code is likely to fail when/if
    it is needed).

    That was debunked by the data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Don Y on Fri Oct 18 20:30:06 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations. Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Hebisch on Fri Oct 18 17:42:44 2024
    On Fri, 18 Oct 2024 20:30:06 -0000 (UTC), antispam@fricas.org (Waldek
    Hebisch) wrote:

    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations. Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    However, only a minor percentage of all devices must comply with such
    safety regulations.

    As I understand it, Don is working on tech for "smart home"
    implementations ... devices that may be expected to run nearly
    constantly (though perhaps not 365/24 with 6 9's reliability), but
    which, for the most part, are /not/ safety critical.

    WRT Don's question, I don't know the answer, but I suspect runtime
    diagnostics are /not/ routinely implemented for devices that are not
    safety critical. Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing. Even if the test is at low(est)
    priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real time situation. To ensure 100% functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    YMMV.
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Waldek Hebisch on Fri Oct 18 15:15:30 2024
    On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations.

    Not all devices are covered by "regulations".

    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    [I.e., how long a device remains "up" is a function of the device,
    its application, environment and user]

    Do you just *hope* the device "happens" to fail in a noticeable
    manner so a user is left with no doubt but that the device is
    no longer operational?

    Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    So, if a bit in a RAM in said device *dies* some time after power on,
    is the device going to *know* that has happened? And, signal its
    unwillingness to continue operating? What is going to detect that
    failure?

    What if the bit's failure is inconsequential to the operation
    of the device? E.g., if the bit is part of some not-used
    feature? *Or*, if it has failed in the state it was *supposed*
    to be in??!

    With a "good" POST design, you can reassure the user that the
    device *appears* to be functional. That the data/code stored in it
    are intact (since last time they were accessed). That the memory
    is capable of storing any values it is called on to preserve.
    That the hardware I/Os can control and sense as intended, etc.

    /But, you have no guarantee that this condition will persist!/
    If it WAS guaranteed to persist, then the simple way to make high
    reliability devices would be just to /never turn them off/ to
    take advantage of this "guarantee"!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Fri Oct 18 15:30:54 2024
    Hi George,

    [Hope all is well with you and at home]

    On 10/18/2024 2:42 PM, George Neuner wrote:
    WRT Don's question, I don't know the answer, but I suspect runtime diagnostics are /not/ routinely implemented for devices that are not
    safety critical. Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing. Even if the test is at low(est) priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real time situation.

    But, if you *know* when certain aspects of a device will be "called on",
    you can take advantage of that to schedule diagnostics when the device is
    not "needed". And, in the event that some unexpected "need" arises,
    can terminate or suspend the testing (possibly rendering the effort
    moot if it hasn't yet run to a conclusion).

    E.g., I scrub freed memory pages (zero fill) so information doesn't
    leak across protection domains. As long as some minimum number
    of *scrubbed* pages are available for use "on demand", why can't
    I *test* the pages yet to be scrubbed?
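
    A sketch of folding a memory test into that scrubbing pass -- a
    simple march-style pattern whose final pass leaves the page zeroed
    (page size and calling convention invented for illustration):

        #include <stdint.h>
        #include <stddef.h>

        #define PAGE_WORDS (4096 / sizeof(uint32_t))

        /* The page holds no live data (it is awaiting zero-fill anyway),
           so the test costs nothing but memory cycles. Returns -1 on a
           stuck or coupled bit, in which case the page is retired from
           the free pool -- just as bad disk blocks are mapped out. */
        int test_and_scrub_page(volatile uint32_t *page)
        {
            static const uint32_t pattern[4] = { 0xAAAAAAAAu, 0x55555555u,
                                                 0xFFFFFFFFu, 0x00000000u };
            for (size_t p = 0; p < 4; p++) {
                for (size_t i = 0; i < PAGE_WORDS; i++)
                    page[i] = pattern[p];
                for (size_t i = 0; i < PAGE_WORDS; i++)
                    if (page[i] != pattern[p])
                        return -1;
            }
            return 0;  /* last pattern was 0: page is scrubbed AND tested */
        }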

    If I don't *expect* a car to pull up and call for the garage door
    to be opened, why can't I play with the lighting to verify that
    the cameras located within *notice* changes?

    If there is no anticipated short term need for irrigation, why
    can't I momentarily activate individual valves and watch to see that
    the expected amount of water is flowing?

    If a node is powered down due to lack of expected immediate need,
    why not power it *up* and run diagnostics on it? Powering it back
    down once completed -- *or*, aborting the diagnostics if the node
    is called on to be powered up?
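
    The common shape of all of these is an opportunistic test that is
    abandoned the moment real work appears; a sketch (the scheduler
    hooks are hypothetical):

        #include <stdbool.h>

        typedef enum { DIAG_RUNNING, DIAG_PASS, DIAG_FAIL } diag_state_t;

        extern bool demand_pending(void);    /* e.g., a car pulling up */
        extern diag_state_t diag_step(void); /* one short, bounded slice */
        extern void diag_abort(void);        /* undo any disturbed state */
        extern void record_result(diag_state_t r);

        /* Run from the idle loop: a diagnostic makes progress only while
           nothing else wants the hardware. An abort wastes the effort
           (the result is unknown) but harms nothing. */
        void idle_hook(void)
        {
            if (demand_pending()) {
                diag_abort();
                return;
            }
            diag_state_t r = diag_step();
            if (r != DIAG_RUNNING)
                record_result(r);
        }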

    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Don Y on Sat Oct 19 01:50:34 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 2:42 PM, George Neuner wrote:

    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    TI for several years has had nice processors with two cores, which
    are almost in sync, but one is something like one cycle behind
    the other. And there is circuitry to compare that both cores
    produce the same result. This does not cover failures of the
    whole chip, but dramatically lowers the chance of undetected errors
    due to some transient condition.

    For critical functions a car could have 3 processors with
    voting circuitry. With separate chips this would be more expensive
    than a single processor, but the increase in cost would probably be
    negligible compared to the cost of the whole car. And when integrated
    on a single chip the cost difference would be tiny.

    IIUC a car controller may "reboot" during a ride. Instead of
    rebooting it could hand work over to a backup controller.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to George Neuner on Sat Oct 19 01:25:54 2024
    George Neuner <gneuner2@comcast.net> wrote:
    On Fri, 18 Oct 2024 20:30:06 -0000 (UTC), antispam@fricas.org (Waldek Hebisch) wrote:

    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations. Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    However, only a minor percentage of all devices must comply with such
    safety regulations.

    Maybe, if you mean domain-specific regulations. But there are
    general EC directives. One may spend money on lawyers and
    research to conclude that some protections are not required
    by law. Or one may implement things as for a regulated domain.
    On a small scale the second is likely to be cheaper.

    Anyway, I do not know if there is anything specific about
    washing machines, but software for them is clearly written
    as if they were regulated. The same for ovens, heaters etc.

    As I understand it, Don is working on tech for "smart home"
    implementations ... devices that may be expected to run nearly
    constantly (though perhaps not 365/24 with 6 9's reliability), but
    which, for the most part, are /not/ safety critical.

    IMO, "smart home" which matter have safety implications. Even
    if they are not regulated now there is potential for liabilty.
    And new requlations appear quite frequently.

    WRT Don's question, I don't know the answer, but I suspect runtime diagnostics are /not/ routinely implemented for devices that are not
    safety critical. Reason: diagnostics interfere with operation of
    <whatever> they happen to be testing. Even if the test is at low(est) priority and is interruptible by any other activity, it still might
    cause an unacceptable delay in a real time situation. To ensure 100% functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    IIUC at low levels requirements are not that hard to satisfy,
    especially since in most cases a non-working device is deemed "safe".

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Waldek Hebisch on Fri Oct 18 19:38:18 2024
    On 10/18/2024 6:50 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 2:42 PM, George Neuner wrote:

    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    TI for several years has had nice processors with two cores, which
    are almost in sync, but one is something like one cycle behind
    the other. And there is circuitry to compare that both cores
    produce the same result. This does not cover failures of the
    whole chip, but dramatically lowers the chance of undetected errors
    due to some transient condition.

    The 4th bit in memory location XYZ has failed "stuck at zero".
    How are you going to detect that?

    One of the FETs that controls the shifting of the automatic
    transmission has failed open. How do you detect that /and recover
    from it/?

    The camera/LIDAR that the self-drive feature uses is providing
    incorrect data... etc.

    There are innumerable failures that can occur to compromise
    the "system" and no *easy*/inexpensive/reliable way to detect
    and recover from *all* of them.

    For critical functions a car could have 3 processors with
    voting circuitry. With separate chips this would be more expensive
    than a single processor, but the increase in cost would probably be
    negligible compared to the cost of the whole car. And when integrated
    on a single chip the cost difference would be tiny.

    IIUC a car controller may "reboot" during a ride. Instead of
    rebooting it could hand work over to a backup controller.

    How do you know the circuitry (and other mechanisms) that
    implement this hand-over are operational?

    It is VERY difficult to design reliable systems. I am not
    attempting that. Rather, I am trying to address the fact that
    the reassurances provided by POST (and, at the user's prerogative,
    BIST) are not guaranteed when a device runs "for long periods of time".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Don Y on Sat Oct 19 03:00:48 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    Typically, one performs some limited "confidence tests"
    at POST to catch gross failures. As this activity is
    "in series" with normal operation, it tends to be brief
    and not very thorough.

    Many products offer a BIST capability that the user can invoke
    for more thorough testing. This allows the user to decide
    when he can afford to live without the normal functioning of the
    device.

    And, if you are a "robust" designer, you often include invariants
    that verify hardware operations (esp to I/Os) are actually doing
    what they should -- e.g., verifying battery voltage increases
    when you activate the charging circuit, loopbacks on DIOs, etc.

    But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
    And, BIST might not always be convenient (as well as requiring the
    user's consent and participation).

    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations.

    Not all devices are covered by "regulations".

    Well, if the device matters then there is implied liability
    and nobody wants to admit doing a bad job. If the device
    does not matter, then the answer to the original question
    also does not matter.

    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) hours later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    What to test is really domain-specific. Traditional thinking
    is that computer hardware is _much_ more reliable than
    software and software bugs are the major source of misbehaviour.
    And among hardware failures transient upsets, like a flipped
    bit, are more likely than permanent failures. For example,
    at a low safety level you may assume that the hardware of a counter
    generating a PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values. IIUC crystal oscillators are likely to fail,
    so you are supposed to regularly check for the presence of the clock
    and its frequency (this assumes a hardware design with a backup
    clock).
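
    A sketch of that periodic register check: route every configuration
    write through a shadow copy, then audit hardware against the shadow
    (the register addresses here are invented for illustration):

        #include <stdint.h>

        #define NREGS 2
        static volatile uint32_t * const cfg_reg[NREGS] = {
            (volatile uint32_t *)0x40010000u,  /* e.g., timer control */
            (volatile uint32_t *)0x40010004u,  /* e.g., PWM period    */
        };
        static uint32_t cfg_shadow[NREGS];

        extern void report_fault(const char *what);

        /* All configuration writes funnel through here, so the shadow
           always holds the intended value. */
        void cfg_write(int idx, uint32_t val)
        {
            cfg_shadow[idx] = val;
            *cfg_reg[idx]   = val;
        }

        /* Run periodically: catches registers disturbed by bugs, EMC/ESD
           events or bit flips, and restores the intended configuration. */
        void cfg_audit(void)
        {
            for (int i = 0; i < NREGS; i++)
                if (*cfg_reg[i] != cfg_shadow[i]) {
                    report_fault("config register drifted from shadow");
                    *cfg_reg[i] = cfg_shadow[i];
                }
        }

    (Registers with write-only or self-clearing bits need a mask applied
    before the comparison.)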

    Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    So, if a bit in a RAM in said device *dies* some time after power on,
    is the device going to *know* that has happened? And, signal its unwillingness to continue operating? What is going to detect that
    failure?

    I do not know how PLC manufacturers implement checks. Small
    PLCs are based on MCUs with static, parity-protected RAM.
    This may be deemed adequate. PLCs work in cycles and some
    percentage of the cycle is dedicated to self-test. So a big
    PLC may divide memory into smallish regions and in each
    cycle check a single region, walking through the whole memory.
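
    That incremental walk might look like the sketch below. A destructive
    test of live memory must save the region and mask interrupts while it
    runs (addresses and sizes invented; a real version must also skip the
    region holding its own stack and save buffer):

        #include <stdint.h>

        #define RAM_START    ((volatile uint32_t *)0x20000000u)
        #define RAM_WORDS    (64u * 1024u / 4u)
        #define REGION_WORDS 64u            /* small => short blackout */

        extern uint32_t disable_irq(void);  /* assumed platform primitives */
        extern void restore_irq(uint32_t state);
        extern void report_fault(const char *what);

        /* Call once per cycle; tests one region, then advances. */
        void ram_test_step(void)
        {
            static uint32_t offset;
            uint32_t save[REGION_WORDS];
            volatile uint32_t *r = RAM_START + offset;

            uint32_t irq = disable_irq();   /* region is live memory */
            for (uint32_t i = 0; i < REGION_WORDS; i++)
                save[i] = r[i];
            for (uint32_t i = 0; i < REGION_WORDS; i++) {
                r[i] = 0xAAAAAAAAu;
                if (r[i] != 0xAAAAAAAAu) report_fault("RAM stuck bit");
                r[i] = 0x55555555u;
                if (r[i] != 0x55555555u) report_fault("RAM stuck bit");
            }
            for (uint32_t i = 0; i < REGION_WORDS; i++)
                r[i] = save[i];
            restore_irq(irq);

            offset = (offset + REGION_WORDS) % RAM_WORDS;
        }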

    What if the bit's failure is inconsequential to the operation
    of the device? E.g., if the bit is part of some not-used
    feature? *Or*, if it has failed in the state it was *supposed*
    to be in??!

    I am afraid that usually an inconsequential failure gets
    promoted to a complete failure. Before 2000, checking showed
    that several BIOSes "validated" the date and an "incorrect" (that
    is, after 1999) date prevented boot.

    Historically OSes had a map of bad blocks on the disk and
    avoided allocating them. In principle, on a system with paging
    hardware the same could be done for DRAM, but I do not think
    anybody is doing this (if the domain is serious enough to worry
    about DRAM failures, then it probably has redundant independent
    computers with ECC DRAM).

    With a "good" POST design, you can reassure the user that the
    device *appears* to be functional. That the data/code stored in it
    are intact (since last time they were accessed). That the memory
    is capable of storing any values it is called on to preserve.
    That the hardware I/Os can control and sense as intended, etc.

    /But, you have no guarantee that this condition will persist!/
    If it WAS guaranteed to persist, then the simple way to make high
    reliability devices would be just to /never turn them off/ to
    take advantage of this "guarantee"!

    Everything here is domain-specific. In a cheap MCU-based device the
    main source of failures is overvoltage/ESD on MCU pins. This may
    kill the whole chip, in which case no software protection can
    help. Or some pins fail; sometimes this may be detected by reading
    the appropriate port. If you control an electric motor then you
    probably do not want to send test signals during normal motor
    operation. But you are likely to have some feedback and can verify
    if the feedback agrees with expected values. If you get unexpected
    readings you probably will stop the motor.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Don Y on Sat Oct 19 03:53:40 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 6:50 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 2:42 PM, George Neuner wrote:

    To ensure 100%
    functionality at all times effectively requires use of redundant
    hardware - which generally is too expensive for a non safety critical
    device.

    Apparently, there is noise about incorporating such hardware into
    *automotive* designs (!). I would have thought the time between
    POSTs would have rendered that largely ineffective. OTOH, if
    you imagine a failure can occur ANY time, then "just after
    putting the car in gear" is as good (bad!) a time as any!

    TI for several years has had nice processors with two cores, which
    are almost in sync, but one is something like one cycle behind
    the other. And there is circuitry to compare that both cores
    produce the same result. This does not cover failures of the
    whole chip, but dramatically lowers the chance of undetected errors
    due to some transient condition.

    The 4th bit in memory location XYZ has failed "stuck at zero".
    How are you going to detect that?

    The chips that I mentioned use static memory with ECC. Of course,
    the ECC circuitry may fail. There may be errors undetected by ECC.
    The two cores may have the same error, or the comparison circuitry
    may fail to detect the difference. Each may happen, but each
    is much less likely to happen than a simple transient error.

    One of the FETs that controls the shifting of the automatic
    transmission has failed open. How do you detect that /and recover
    from it/?

    Detecting such a thing looks easy. Recovery is tricky, because
    if you have a spare FET and activate it there is a good chance that
    it will fail for the same reason that the first FET failed.
    OTOH, if you have a properly designed circuit around the FET, a
    disturbance strong enough to kill the FET is likely to kill
    the controller too.

    The camera/LIDAR that the self-drive feature uses is providing
    incorrect data... etc.

    Use 3 (or more) and voting. Of course, this increases cost and one
    has to judge if the increase in cost is worth the increase in safety
    (in a self-driving car using multiple sensors looks like a no-brainer,
    but if this is just an assist to increase driver comfort then
    the result may be different).

    There are innumerable failures that can occur to compromise
    the "system" and no *easy*/inexpensive/reliable way to detect
    and recover from *all* of them.

    Sure. But for common failures, or serious failures having non-negligible probability, redundancy may offer a cheap way to increase reliability.

    For critical functions a car could have 3 processors with
    voting circuitry. With separate chips this would be more expensive
    than a single processor, but the increase in cost would probably be
    negligible compared to the cost of the whole car. And when integrated
    on a single chip the cost difference would be tiny.

    IIUC a car controller may "reboot" during a ride. Instead of
    rebooting it could hand work over to a backup controller.

    How do you know the circuitry (and other mechanisms) that
    implement this hand-over are operational?

    It does not matter if handover _always_ works. What matters is
    whether a system with handover has a lower chance of failure than
    a system without handover. Having statistics of actual failures
    (which I do not have but manufacturers should have) and
    after some testing, one can estimate the failure probability of
    different designs and possibly decide to use handover.

    It is VERY difficult to design reliable systems. I am not
    attempting that. Rather, I am trying to address the fact that
    the reassurances provided by POST (and, at the user's prerogative,
    BIST) are not guaranteed when a device runs "for long periods of time".

    You may have tests essentially as part of normal operation.
    Of course, if you have a single-tasked design with a task which
    must be "always" ready to respond, then running tests becomes
    more complicated. But in most designs you can spare enough
    time slots to run tests during normal operation. Tests may
    interfere with normal operation, but here we are in domain-
    specific territory: sometimes the results of operation give enough
    assurance that the device is operating correctly. And if testing
    for correct operation is impossible, then there is nothing to
    do; I certainly do not promise to deliver the impossible.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Waldek Hebisch on Fri Oct 18 21:05:14 2024
    On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    There, runtime diagnostics are the only alternative for hardware
    revalidation, PFA and diagnostics.

    How commonly are such mechanisms implemented? And, how thoroughly?

    This is a strange question. AFAIK automatically run diagnostics/checks
    are part of safety regulations.

    Not all devices are covered by "regulations".

    Well, if the device matters then there is implied liability
    and nobody wants to admit doing a bad job. If the device
    does not matter, then the answer to the original question
    also does not matter.

    In the US, ANYTHING can result in a lawsuit. But, "due diligence"
    can insulate the manufacturer, to some extent. No one ever
    *admits* to "doing a bad job".

    If your doorbell malfunctions, what "damages" are you going
    to claim? If your garage door doesn't open when commanded?
    If your yard doesn't get watered? If you weren't promptly
    notified that the mail had just been delivered? Or, that
    the compressor in the freezer had failed and your foodstuffs
    had spoiled, as a result?

    The costs of litigation are reasonably high. Lawyers want
    to see the LIKELIHOOD of a big payout before entertaining
    such litigation.

    And, the *extent* to which testing is done is the subject
    addressed; if I ensure "stuff" *WORKED* when the device was
    powered on (preventing it from continuing on to its normal
    functionality in the event that some failure was detected),
    what assurance does that give me that the device's integrity
    is still intact 8760 hours (1 yr) later? 720 hours
    (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

    What to test is really domain-specific. Traditional thinking
    is that computer hardware is _much_ more reliable than
    software and software bugs are the major source of misbehaviour.

    That hasn't been *proven*. And, "misbehavior" is not the same
    as *failure*.

    And among hardware failures transient upsets, like a flipped
    bit, are more likely than permanent failures. For example,

    That used to be the thinking with DRAM but studies have shown
    that *hard* failures are more common. These *can* be found...
    *if* you go looking for them!

    E.g., if you load code into RAM (from FLASH) for execution,
    are you sure the image *in* the RAM is the image from the FLASH?
    What about "now"? And "now"?!

    at a low safety level you may assume that the hardware of a counter
    generating a PWM-ed signal works correctly, but you are
    supposed to periodically verify that configuration registers
    keep expected values.

    Why would you expect the registers to lose their settings?
    Would you expect the CPU's registers to be similarly flakey?

    IIUC crystal oscillators are likely to fail,
    so you are supposed to regularly check for the presence of the clock
    and its frequency (this assumes a hardware design with a backup
    clock).

    Even if some safety critical software
    does not contain them, nobody is going to admit violating regulations.
    And things like PLCs are "dual use": they may be used in a non-safety
    role, but vendors claim compliance to safety standards.

    So, if a bit in a RAM in said device *dies* some time after power on,
    is the device going to *know* that has happened? And, signal its
    unwillingness to continue operating? What is going to detect that
    failure?

    I do not know how PLC manufacturers implement checks. Small
    PLCs are based on MCUs with static, parity-protected RAM.
    This may be deemed adequate. PLCs work in cycles and some
    percentage of the cycle is dedicated to self-test. So a big
    PLC may divide memory into smallish regions and in each
    cycle check a single region, walking through the whole memory.

    What if the bit's failure is inconsequential to the operation
    of the device? E.g., if the bit is part of some not-used
    feature? *Or*, if it has failed in the state it was *supposed*
    to be in??!

    I am afraid that usually an inconsequential failure gets
    promoted to a complete failure. Before 2000, checking showed
    that several BIOSes "validated" the date and an "incorrect" (that
    is, after 1999) date prevented boot.

    If *a* failure resulted in a catastrophic failure, things would
    be "acceptable" in that the user would KNOW that something is
    wrong without the device having to tell them.

    But, too often, faults can be "absorbed" or lead to unobservable
    errors in operation. What then?

    Somewhere, I have a paper where the researchers simulated faults
    *in* various OS kernels to see how "tolerant" the OS was of these
    faults (which we know *happen*). One would think that *any*
    fault would cause a crash. Yet, MANY faults are sufferable
    (depending on the OS).

    Consider, if a single bit error converts a "JUMP" to a "JUMP IF CARRY"
    but the carry happens to be set, then there is no difference in the
    execution path. If that bit error converts a "saturday" into a
    "sunday", then something that is intended to execute on weekdays (or
    weekends) won't care. Etc.

    Historically OSes had a map of bad blocks on the disk and
    avoided allocating them. In principle, on a system with paging
    hardware the same could be done for DRAM, but I do not think
    anybody is doing this (if the domain is serious enough to worry
    about DRAM failures, then it probably has redundant independent
    computers with ECC DRAM).

    Using ECC DRAM doesn't solve the problem. If you see errors
    reported by your ECC RAM (corrected errors), then when do
    you decide you are seeing too many and losing confidence that
    the ECC is actually *detecting* all multibit errors?
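
    One common answer is to treat the corrected-error count itself as
    the data and alarm on its rate; a sketch (the counter accessor and
    threshold are illustrative assumptions):

        #include <stdint.h>

        extern uint32_t ecc_corrected_count(void); /* assumed HW counter */
        extern void report_fault(const char *what);

        /* Called once per hour. A trickle of corrected single-bit errors
           is expected; a sustained rise means the margin against
           uncorrectable -- and possibly undetected -- multi-bit errors
           is eroding. */
        void ecc_health_check(void)
        {
            static uint32_t last;
            uint32_t now = ecc_corrected_count();
            uint32_t per_hour = now - last;
            last = now;

            if (per_hour > 10)   /* illustrative threshold, not a standard */
                report_fault("ECC correction rate high; suspect memory");
        }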

    With a "good" POST design, you can reassure the user that the
    device *appears* to be functional. That the data/code stored in it
    are intact (since last time they were accessed). That the memory
    is capable of storing any values it is called on to preserve.
    That the hardware I/Os can control and sense as intended, etc.

    /But, you have no guarantee that this condition will persist!/
    If it WAS guaranteed to persist, then the simple way to make high
    reliability devices would be just to /never turn them off/ to
    take advantage of this "guarantee"!

    Everything here is domain-specific. In a cheap MCU-based device the
    main source of failures is overvoltage/ESD on MCU pins. This may
    kill the whole chip, in which case no software protection can
    help. Or some pins fail; sometimes this may be detected by reading
    the appropriate port. If you control an electric motor then you
    probably do not want to send test signals during normal motor operation.

    That depends on HOW you generate your test signals, what the hardware
    actually looks like and how sensitive the "mechanism" is to such "disturbances". Remember, "you" can see things faster than a mechanism
    can often respond. I.e., if applying power to the motor doesn't
    result in an observable load current (or "micromotion"), then the
    motor is likely not responding.
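
    A sketch of that invariant: command the motor, then require the
    feedback to appear within a deadline far shorter than anything the
    mechanism (or user) would notice (names, deadline and threshold are
    invented):

        #include <stdint.h>
        #include <stdbool.h>

        extern void     motor_enable(bool on);
        extern uint16_t motor_current_mA(void); /* existing current sense */
        extern uint32_t millis(void);
        extern void     report_fault(const char *what);

        /* On each commanded start: if no load current appears within a
           few milliseconds, the drive path is broken -- detected long
           before the missing motion itself would be visible. */
        bool motor_start_checked(void)
        {
            motor_enable(true);
            uint32_t t0 = millis();
            while (millis() - t0 < 5) {        /* 5 ms: invented deadline */
                if (motor_current_mA() > 50)   /* invented threshold */
                    return true;
            }
            motor_enable(false);
            report_fault("motor commanded but no load current observed");
            return false;
        }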

    But you are likely to have some feedback and can verify if the feedback
    agrees with expected values. If you get unexpected readings
    you probably will stop the motor.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Waldek Hebisch on Fri Oct 18 21:17:21 2024
    On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
    One of the FETs that controls the shifting of the automatic
    transmission has failed open. How do you detect that /and recover
    from it/?

    Detecting such a thing looks easy. Recovery is tricky, because
    if you have a spare FET and activate it there is a good chance that
    it will fail for the same reason that the first FET failed.
    OTOH, if you have a properly designed circuit around the FET, a
    disturbance strong enough to kill the FET is likely to kill
    the controller too.

    The immediate goal is to *detect* that a problem exists.
    If you can't detect, then attempting to recover is a moot point.

    The camera/LIDAR that the self-drive feature uses is providing
    incorrect data... etc.

    Use 3 (or more) and voting. Of course, this increases cost and one
    has to judge if the increase in cost is worth the increase in safety

    As well as the reliability of the additional "voting logic".
    If not a set of binary signals, determining what the *correct*
    signal may be can be problematic.

    (in a self-driving car using multiple sensors looks like a no-brainer,
    but if this is just an assist to increase driver comfort then
    the result may be different).

    It is different only in the sense of liability and exposure to
    loss. I am not assigning values to those consequences but,
    rather, looking to address the issue of run-time testing, in
    general.

    Even if NONE of the failures can result in injury or loss,
    it is unlikely that a user WANTS to have a defective product.
    If the user is technically unable to determine when the
    product is "at fault" (vs. his own misunderstanding of how it
    is *supposed* to work), then those failures contribute to
    the user's frustration with the product.

    There are innumerable failures that can occur to compromise
    the "system" and no *easy*/inexpensive/reliable way to detect
    and recover from *all* of them.

    Sure. But for common failures, or serious failures having non-negligible probability, redundancy may offer a cheap way to increase reliability.

    For critical functions a car could have 3 processors with
    voting circuitry. With separate chips this would be more expensive
    than a single processor, but the increase in cost would probably be
    negligible compared to the cost of the whole car. And when integrated
    on a single chip the cost difference would be tiny.

    IIUC a car controller may "reboot" during a ride. Instead of
    rebooting it could hand work over to a backup controller.

    How do you know the circuitry (and other mechanisms) that
    implement this hand-over are operational?

    It does not matter if handover _always_ works. What matters is
    whether a system with handover has a lower chance of failure than
    a system without handover. Having statistics of actual failures
    (which I do not have but manufacturers should have) and
    after some testing, one can estimate the failure probability of
    different designs and possibly decide to use handover.

    Again, I am not interested in "recovery" as that varies with
    the application and risk assessment. What I want to concentrate
    on is reliably *detecting* faults before they lead to product
    failures.

    I contend that the hardware in many devices has that capability
    (to some extent) but that it is underutilized; that the issue
    of detecting faults *after* POST is one that doesn't see much
    attention. The likely thinking being that POST will flag it the
    next time the device is restarted.

    And, that's not acceptable in long-running devices.

    It is VERY difficult to design reliable systems. I am not
    attempting that. Rather, I am trying to address the fact that
    the reassurances provided by POST (and, at the user's prerogative,
    BIST) are not guaranteed when a device runs "for long periods of time".

    You may have tests essentially as part of normal operation.

    I suspect most folks have designed devices with UARTs. And,
    having written a driver for it, have noted that framing, parity
    and overrun errors are possible.

    Ask yourself how many of those systems ever *use* that information!
    Is there even a means of propagating it up out of the driver?

    Of course, if you have a single-tasked design with a task which
    must be "always" ready to respond, then running tests becomes
    more complicated. But in most designs you can spare enough
    time slots to run tests during normal operation. Tests may
    interfere with normal operation, but here we are in domain-
    specific territory: sometimes the results of operation give enough
    assurance that the device is operating correctly. And if testing
    for correct operation is impossible, then there is nothing to
    do; I certainly do not promise to deliver the impossible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)