On Fri, 18 Oct 2024 20:30:06 -0000 (UTC), antispam@fricas.org (Waldek Hebisch) wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
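Such an invariant might be sketched in C roughly as below. The function name and the minimum-rise threshold are illustrative assumptions, not taken from any particular product; the point is just that the check is a pure predicate over two voltage samples taken before and after enabling the charger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Verify the charge-path invariant: after enabling the charger,
   battery voltage should rise by at least some minimum delta.
   A pure check, so it can be driven by any sampling scheme. */
bool charge_path_ok(uint16_t mv_before, uint16_t mv_after,
                    uint16_t min_rise_mv)
{
    /* A *drop* (or no change) is also a failure of the invariant. */
    if (mv_after <= mv_before)
        return false;
    return (uint16_t)(mv_after - mv_before) >= min_rise_mv;
}
```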
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety-critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
However, only a minor percentage of all devices must comply with such
safety regulations.
As I understand it, Don is working on tech for "smart home"
implementations ... devices that may be expected to run nearly
constantly (though perhaps not 365/24 with 6 9's reliability), but
which, for the most part, are /not/ safety critical.
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation. To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Hi George,
[Hope all is well with you and at home]
On 10/18/2024 2:42 PM, George Neuner wrote:
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation.
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
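That opportunistic test might look something like the sketch below (a hypothetical illustration, not Don's actual code): before a freed page is zero-filled, write and verify complementary patterns to catch stuck-at bits, then leave the page zeroed for reuse, retiring any page that fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opportunistic page test folded into the scrubber: before a freed
   page is zero-filled, write/read complementary patterns to catch
   stuck-at faults, then leave the page zeroed for reuse.
   Returns false if the page should be retired from the free pool. */
bool test_and_scrub_page(uint8_t *page, size_t len)
{
    static const uint8_t pat[2] = { 0x55, 0xAA };

    for (int p = 0; p < 2; p++) {
        for (size_t i = 0; i < len; i++)
            page[i] = pat[p];
        for (size_t i = 0; i < len; i++)
            if (page[i] != pat[p])
                return false;          /* stuck bit: retire page */
    }
    for (size_t i = 0; i < len; i++)   /* normal scrub: zero fill */
        page[i] = 0;
    return true;
}
```

A real scrubber would also have to worry about caches and write buffers masking the actual DRAM cells; this only shows the control flow.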
On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if the device matters then there is implied liability
and nobody wants to admit doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and software bugs are major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
And among hardware failures transient upsets, like flipped
bit are more likely than permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
at low safety level you may assume that hardware of a counter
generating PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPUs registers to be similarly flakey?
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle on system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
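One way to make that "when do I lose confidence?" question concrete is a windowed threshold on the corrected-error count. The sketch below is purely illustrative (the structure, window length, and limit are all my assumptions): each poll reports errors seen since the last one, and the monitor flags when the count within a window exceeds the limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Corrected-ECC-error bookkeeping: a simple windowed threshold.
   Each call reports errors seen since the last poll; if the count
   within the current window exceeds the limit, confidence in the
   ECC's coverage of multi-bit errors is deemed lost. */
struct ecc_monitor {
    uint32_t window_s;      /* window length, seconds */
    uint32_t limit;         /* max corrected errors per window */
    uint32_t window_start;  /* start time of current window */
    uint32_t count;         /* corrected errors seen in window */
};

bool ecc_report(struct ecc_monitor *m, uint32_t now_s, uint32_t errs)
{
    if (now_s - m->window_start >= m->window_s) {
        m->window_start = now_s;       /* start a new window */
        m->count = 0;
    }
    m->count += errs;
    return m->count <= m->limit;       /* false: confidence lost */
}
```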
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Each item above may contribute to a significant loss. And
there could be a push to litigation (say by a consumer advocacy group)
basically to establish a precedent. So, better to have a
record of due diligence.
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and software bugs are major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean relevant hardware, that is hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
And among hardware failures transient upsets, like flipped
bit are more likely than permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that a
significant number of errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) was due to buggy
software. And DRAM is special.
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify sufficiently strong checksum.
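For the "sufficiently strong checksum", a plain CRC-32 over the loaded image is a common choice. A minimal sketch (the function names are mine; the bitwise form trades speed for not needing a lookup table in RAM): the reference CRC is taken at load time and a background task periodically recomputes and compares.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-32 (reflected, polynomial 0xEDB88320) over a code image,
   computed bitwise to avoid a lookup table in RAM. */
uint32_t crc32_image(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

/* One background pass: true iff the in-RAM image still matches
   the CRC recorded when the image was loaded from FLASH. */
bool image_intact(const uint8_t *img, size_t len, uint32_t ref_crc)
{
    return crc32_image(img, len) == ref_crc;
}
```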
at low safety level you may assume that hardware of a counter
generating PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPUs registers to be similarly flakey?
First, such checking is not my idea, but one point from checklist for
low safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
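The usual mechanization of that checklist item is a shadow table: remember what was written to each configuration register at init, and periodically read the registers back. A sketch (the structure and names are illustrative, not from any standard's sample code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shadow-copy check for peripheral configuration registers: each
   entry pairs a register address with the value written at init.
   Called periodically; returns false on the first mismatch so the
   caller can log the fault and re-initialize the peripheral. */
struct reg_shadow {
    volatile uint32_t *reg;    /* register address */
    uint32_t           expect; /* value written at configuration time */
};

bool regs_intact(const struct reg_shadow *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (*tab[i].reg != tab[i].expect)
            return false;
    return true;
}
```

Registers with hardware-updated status bits need a mask per entry; this sketch assumes pure configuration registers.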
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle on system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of solution. It may reduce probability of error
so that you consider them not serious enough. And if you
really care you may try to increase error rate (say by putting
RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
In the US, ANYTHING can result in a lawsuit.
But, "due diligence" can insulate the manufacturer, to some extent.
No one ever *admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Hi George,
[Hope all is well with you and at home]
On 10/18/2024 2:42 PM, George Neuner wrote:
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation.
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Same ol', same ol'. Nothing much new to report.
On 10/18/2024 2:42 PM, George Neuner wrote:
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
If you "know" a priori when some component will be needed, then you
can do whatever you want when it is not. The problem is that many
uses can't be easily anticipated.
Which circles back to testing priority: if the test is interruptible
and/or resumable, then it may be done whenever the component is
available ... as long as it won't tie up the component if and when it
becomes needed for something else.
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
If you're testing memory pages, most likely you are tying up bandwidth
in the memory system and slowing progress of the real applications.
Also because you can't accurately judge the "minimum" needed. BSD and
Linux both have this problem where a sudden burst of allocations
exhausts the pool of zeroed pages, forcing demand zeroing of new pages
prior to their re-assignment. Slows the system to a crawl when it
happens.
If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?
Because then you are watering (however briefly) when it is not
expected. What if there was a pesticide application that should not
be wetted? What if a person is there and gets sprayed by your test?
Properly, valve testing should be done concurrently with a scheduled
watering. Check water is flowing when the valve should be open, and
not flowing when the valve should be closed.
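That check during a scheduled cycle reduces to comparing the flow sensor against the commanded valve state at both edges. A sketch (function name and the two flow thresholds are assumptions for illustration):

```c
#include <stdbool.h>

/* Valve check folded into a scheduled watering cycle: compare the
   measured flow against the commanded valve state.  Thresholds are
   illustrative; real values depend on the zone's plumbing. */
bool valve_behaving(bool valve_commanded_open, float flow_lpm)
{
    const float MIN_OPEN_FLOW = 2.0f;  /* expect at least this when open */
    const float MAX_LEAK_FLOW = 0.1f;  /* tolerate at most this when closed */

    if (valve_commanded_open)
        return flow_lpm >= MIN_OPEN_FLOW;  /* stuck closed / blocked? */
    else
        return flow_lpm <= MAX_LEAK_FLOW;  /* stuck open / leaking? */
}
```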
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Automotive is going the way of aircraft: standby running lockstep with
the primary and monitoring its data flow - able to reset the system if
they disagree, or take over if the primary fails.
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
On Fri, 18 Oct 2024 21:05:14 -0700, Don Y
But, "due diligence" can insulate the manufacturer, to some extent.
No one ever *admits* to "doing a bad job".
Actually due diligence /can't/ insulate a manufacturer if the issue
goes to trial. Members of a jury may feel sorry for the litigant(s),
or conclude that the manufacturer can afford whatever they award ...
or maybe they just don't like the manufacturer's lawyer.
Unlike judges, juries do /not/ have to justify their decisions.
Moreover, in some US jurisdictions, the decision of a civil case need
not be unanimous but only that of a quorum.
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
So they created the "class action", where all the litigants
individually may have very small claims, but when put together the
total becomes significant.
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
On Fri, 18 Oct 2024, Don Y wrote:
"The costs of litigation are reasonably high."
Hi Don,
Court cases' costs are unreasonably high.
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
I am, rather, looking for ideas as to how (others) may have approached
it. Most of the research I've uncovered deals with servers and their
ilk. Or, historical information (e.g., MULTICS' "computing as a
service" philosophy). E.g., *scheduling* testing vs. opportunistic
testing.
On 10/19/2024 6:53 AM, Waldek Hebisch wrote:
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and software bugs are major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean relevant hardware, that is hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
But bugs need not be consequential. They may be undesirable or
even annoying but need not have associated "costs".
And among hardware failures transient upsets, like flipped
bit are more likely than permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that a
significant number of errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) was due to buggy
software. And DRAM is special.
If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify sufficiently strong checksum.
Really? Wanna bet that doesn't happen? How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?
What are they going to do if they notice a discrepancy? Reload
the application and hope it avoids any "soft spots" in memory?
at low safety level you may assume that hardware of a counter
generating PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPUs registers to be similarly flakey?
First, such checking is not my idea, but one point from checklist for
low safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
Then you are dealing with high reliability designs. Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle on system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of solution. It may reduce probability of error
so that you consider them not serious enough. And if you
really care you may try to increase error rate (say by putting
RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
Studies suggest that temperature doesn't play the role that
was suspected. What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.
Testing tries to address faults at different points in their
lifespans. Predictive Failure Analysis tries to alert to the
likelihood of *impending* failures BEFORE they occur. So,
whatever remedial action you might take can happen BEFORE
something has failed. POST serves a similar role but tries to
catch failures that have *occurred* before they can affect the
operation of the device. BIST gives the user a way of making
that determination (or receiving reassurance) "on demand".
Run time diagnostics address testing while the device wants
to remain in operation.
What you *do* about a failure is up to you, your market and the
expectations of your users. If a battery fails in SOME of my
UPSs, they simply won't power on (and, if the periodic run-time
test is enabled, that test will cause them to unceremoniously
power themselves OFF as they try to switch to battery power).
Other UPSs will provide an alert (audible/visual/log message)
of the fact but give me the option of continuing to POWER
those devices in the absence of backup protection.
The latter is far more preferable to me as I can then decide
when/if I want to replace the batteries without being forced
to do so, *now*.
The same is not true of smoke/CO detectors; when they detect
a failed (or failING) battery, they are increasingly annoying
in their insistence that the problem be addressed, now.
So much so, that it leads to deaths due to the detector
being taken out of service to stop the damn bleating.
I have a great deal of latitude in how I handle failures.
For example, I can busy-out more than 90% of the RAM in a device
(if something suggested that it was unreliable) and *still*
provide the functionality of that node -- by running the code
on another node and leaving just the hardware drivers associated
with *this* node in place. So, I can alert a user that a
particular device is in need of service -- yet, continue
to provide the services that were associated with that device.
IMO, this is the best of all possible "failure" scenarios;
the worst being NOT knowing that something is misbehaving.
On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such thing looks easy. Recovery is tricky, because
if you have spare FET and activate it there is good chance that
it will fail due to the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET,
disturbance strong enough to kill the FET is likely to kill
the controller too.
The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase of cost is worth the increase in safety
As well as the reliability of the additional "voting logic".
If not a set of binary signals, determining what the *correct*
signal may be can be problematic.
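For non-binary (analog) channels, a common answer to the "which value is correct?" problem is a median voter rather than exact-match voting. A sketch (not anyone's particular implementation): the median of three readings tolerates one arbitrarily wrong input without having to decide which channel is lying.

```c
/* Median-of-three voter for analog sensor channels: picks the
   middle value, so one arbitrarily faulty input cannot drag the
   result outside the range spanned by the two good ones. */
float vote3(float a, float b, float c)
{
    if (a > b) { float t = a; a = b; b = t; }
    if (b > c) { float t = b; b = c; c = t; }
    if (a > b) { float t = a; a = b; b = t; }
    return b;   /* middle value after the three-element sort */
}
```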
(in self-driving car using multiple sensors looks like no-brainer,
but if this is just an assist to increase driver comfort then
result may be different).
It is different only in the sense of liability and exposure to
loss. I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.
Even if NONE of the failures can result in injury or loss,
it is unlikely that a user WANTS to have a defective product.
If the user is technically unable to determine when the
product is "at fault" (vs. his own misunderstanding of how it
is *supposed* to work), then those failures contribute to
the users' frustrations with the product.
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Sure. But for common failures, or serious failures having non-negligible
probability, redundancy may offer a cheap way to increase reliability.
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than a single processor, but the increase of cost probably would be
negligible compared to the cost of the whole car. And when integrated
on a single chip the cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand work over to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It does not matter if handover _always_ works. What matters is
whether a system with handover has a lower chance of failure than a
system without handover. Having statistics of actual failures
(which I do not have, but manufacturers should have) and
after some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
Again, I am not interested in "recovery" as that varies with
the application and risk assessment. What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.
I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention. The likely thinking being that POST will flag it the
next time the device is restarted.
And, that's not acceptable in long-running devices.
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances of POST (and, at the user's prerogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
I suspect most folks have designed devices with UARTs. And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.
Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
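One cheap way to propagate that information is simple per-port counters, tallied in the ISR and read by a health task. A sketch (the structure and names are illustrative): treat the error *rate* as a PFA signal for a degrading link.

```c
#include <stdint.h>

/* Per-port UART line-error accounting: instead of discarding the
   framing/parity/overrun flags in the ISR, tally them so a health
   task can watch the trend and flag a degrading link. */
struct uart_stats {
    uint32_t frames;    /* characters received */
    uint32_t framing;   /* framing errors */
    uint32_t parity;    /* parity errors */
    uint32_t overrun;   /* overrun errors */
};

/* Called from the RX ISR with the status flags read alongside data. */
void uart_account(struct uart_stats *s,
                  int framing_err, int parity_err, int overrun_err)
{
    s->frames++;
    if (framing_err) s->framing++;
    if (parity_err)  s->parity++;
    if (overrun_err) s->overrun++;
}

/* Error rate in parts-per-million, for threshold checks upstream. */
uint32_t uart_error_ppm(const struct uart_stats *s)
{
    uint32_t errs = s->framing + s->parity + s->overrun;
    return s->frames
        ? (uint32_t)(((uint64_t)errs * 1000000u) / s->frames)
        : 0;
}
```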
Don Y <blockedofcourse@foo.invalid> wrote:
On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such thing looks easy. Recovery is tricky, because
if you have spare FET and activate it there is good chance that
it will fail due to the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET,
disturbance strong enough to kill the FET is likely to kill
the controller too.
The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
In a car you have signals from wheels and engine; you can use
those to compute the transmission ratio and check if it is the
expected one. Or simply have extra inputs which monitor the FET output.
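That ratio check might be sketched as below (names and the tolerance are illustrative assumptions): compare the ratio implied by engine and wheel speeds against the ratio of the gear the controller believes it selected, with a tolerance to absorb torque-converter slip and sensor noise.

```c
#include <stdbool.h>

/* Plausibility check for transmission actuation: does the measured
   engine/wheel speed ratio agree with the commanded gear's ratio?
   tol_frac is a fractional tolerance (e.g. 0.10 for +/-10%). */
bool gear_engaged_ok(float engine_rpm, float wheel_rpm,
                     float expected_ratio, float tol_frac)
{
    if (wheel_rpm <= 0.0f)      /* cannot judge while (nearly) stopped */
        return true;
    float actual = engine_rpm / wheel_rpm;
    float err = actual - expected_ratio;
    if (err < 0.0f)
        err = -err;
    return err <= expected_ratio * tol_frac;
}
```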
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
have to judge if increase of cost is worth increase in safety
As well as the reliability of the additional "voting logic".
If not a set of binary signals, determining what the *correct*
signal may be can be problematic.
Matching images is now a standard technology. And in this case
the "voting logic" is likely to be software and the main trouble is
possible bugs.
(in self-driving car using multiple sensors looks like no-brainer,
but if this is just an assist to increase driver comfort then
result may be different).
It is different only in the sense of liability and exposure to
loss. I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.
I doubt in general solutions. Various parts of your system
may have enough common features to allow a single strategy
in your system. But it is unlikely to generalize to other
systems. To put it differently, there are probabilities
of various events and associated costs. Even if you
refuse to quantify probabilities and costs, your design
decisions (assuming they are rational) will give some
estimate of them.
Again, I am not interested in "recovery" as that varies with
the application and risk assessment. What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.
I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention. The likely thinking being that POST will flag it the
next time the device is restarted.
And, that's not acceptable in long-running devices.
Well, you write that you do not try to build a high reliability
device. However, a device which correctly operates for years
without interruption is considered a "high availability" device,
which is a kind of high reliability. And techniques for high
reliability seem appropriate here.
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances of POST (and, at the user's prerogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
I suspect most folks have designed devices with UARTs. And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.
Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
Well, I always use no-parity transmission mode. The standard way is
to use checksums and acknowledgments. That way you know if the
transmission is working correctly. What extra info do you expect
from looking at detailed error info from the UART?
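The checksum-and-acknowledge scheme can be illustrated with something as simple as a Fletcher-16 over the frame payload (a sketch; function names are mine): the sender appends the two sum bytes, and the receiver recomputes and ACKs only on a match, catching line errors regardless of parity settings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 over a frame payload: two running sums modulo 255,
   packed as (s2 << 8) | s1.  Cheaper than a CRC and catches most
   burst and bit errors on a short frame. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t s1 = 0, s2 = 0;
    for (size_t i = 0; i < len; i++) {
        s1 = (uint16_t)((s1 + data[i]) % 255);
        s2 = (uint16_t)((s2 + s1) % 255);
    }
    return (uint16_t)((s2 << 8) | s1);
}

/* Receiver side: verify the payload against the received checksum;
   the caller sends an ACK only when this returns true. */
bool frame_ok(const uint8_t *payload, size_t len, uint16_t rx_sum)
{
    return fletcher16(payload, len) == rx_sum;
}
```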
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
But bugs need not be consequential. They may be undesirable or
even annoying but need not have associated "costs".
The point is that you cannot eliminate all bugs. Rather, you
should have simple code with the aim of limiting the "cost" of bugs.
And among hardware failures, transient upsets, like a flipped
bit, are more likely than permanent failures. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies that I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) was due to buggy
software. And DRAM is special.
If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
The tests were run on Linux boxes with normal memory protection.
Memory protection does not prevent troubles due to bugs in
privileged code. Of course, you can think that you can do
better than Linux programmers.
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify sufficiently strong checksum.
Really? Wanna bet that doesn't happen? How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?
Using something like Linux means that you do not care about rare
problems (or are prepared to resolve them without help of OS).
What are they going to do if they notice a discrepancy? Reload
the application and hope it avoids any "soft spots" in memory?
AFAICS the rule about checking the image was originally intended
for devices executing code directly from flash; if your "primary
truth" fails, possibilities are limited. With DRAM failures one
can do much better. The question is mainly probabilities and
effort.
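One cheap way to keep re-earning that assurance is to spread a checksum pass over idle time instead of checking only at load. A sketch (CRC32 with the standard reflected polynomial; the region symbols and chunking policy are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC32 (reflected polynomial 0xEDB88320); calls chain, starting
 * from crc = 0. */
uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Walk a (nominally read-only) image a chunk at a time; call from
 * the idle loop.  Returns 0 normally, -1 when a completed pass
 * disagrees with the CRC recorded at load time. */
typedef struct {
    const uint8_t *base;
    size_t len, pos;
    uint32_t partial;   /* CRC accumulated so far this pass */
    uint32_t expected;  /* CRC recorded when the image was loaded */
} image_check;

int image_check_step(image_check *ic, size_t chunk)
{
    size_t n = ic->len - ic->pos;
    if (n > chunk) n = chunk;
    ic->partial = crc32_update(ic->partial, ic->base + ic->pos, n);
    ic->pos += n;
    if (ic->pos < ic->len)
        return 0;                 /* pass still in progress */
    int ok = (ic->partial == ic->expected);
    ic->pos = 0;
    ic->partial = 0;              /* begin the next pass */
    return ok ? 0 : -1;
}
```

What to *do* on a mismatch is the domain-specific part being argued here; the detection itself is a few lines and a sliver of idle time.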
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPUs registers to be similarly flakey?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
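That checklist item reduces to a small pattern: write configuration registers through a shadow copy, and have a background task audit the shadows. (The table layout and mask idea here are illustrative, not from any standard.)

```c
#include <stdint.h>

/* One audited register: the live peripheral register, the value we
 * last intentionally wrote, and a mask selecting the bits we own
 * (so read-only/status bits are ignored by the comparison). */
typedef struct {
    volatile uint32_t *reg;
    uint32_t           shadow;
    uint32_t           mask;
} reg_audit;

/* Write through this helper so the shadow always tracks intent. */
void audited_write(reg_audit *a, uint32_t value)
{
    a->shadow = value & a->mask;
    *a->reg = value;
}

/* Periodic audit: returns the index of the first mismatching entry,
 * or -1 if all registers still hold what was written.  A mismatch
 * (bug, EMC event, SEU) can then be repaired from the shadow. */
int audit_scan(reg_audit *tbl, int n)
{
    for (int i = 0; i < n; i++)
        if ((*tbl[i].reg & tbl[i].mask) != tbl[i].shadow)
            return i;
    return -1;
}
```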
Then you are dealing with high reliability designs. Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
IIUC microwave, stove and furnace should be. In a cell phone the
BMS should be safe and the core radio is tightly regulated. Other
parts seem to be at the quality/reliability level of PCs.
You clearly want to make your devices more reliable. Bugs
and various events happen and extra checking is actually
quite cheap. It is for you to decide if you need/want
it.
Studies suggest that temperature doesn't play the role that
was suspected. What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.
Well, there is evidence that increased temperature increases the
chance of errors. More precisely, expect errors when you
operate DRAM close to its maximum allowed temperature. The point is
that you can cause errors and that way test your recovery
strategy (untested recovery code is likely to fail when/if
it is needed).
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
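That battery-charger invariant boils down to a small piece of decision logic. A sketch (the caller does the sequencing: sample, enable the charger, wait a settling time, sample again; thresholds are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Invariant: after the charging circuit is activated, battery
 * voltage should rise by at least min_rise_mv within the settling
 * window.  If it doesn't, the charge path (FET, fuse, connector,
 * ADC channel...) deserves suspicion. */
bool charger_invariant_ok(uint16_t before_mv, uint16_t after_mv,
                          uint16_t min_rise_mv)
{
    return after_mv >= before_mv &&
           (uint16_t)(after_mv - before_mv) >= min_rise_mv;
}
```

The same shape works for loopbacks on DIOs: command an output, read back the paired input, and compare.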
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety-critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real-time situation. To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non-safety-critical
device.
On 10/18/2024 2:42 PM, George Neuner wrote:
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
However, only a minor percentage of all devices must comply with such
safety regulations.
As I understand it, Don is working on tech for "smart home"
implementations ... devices that may be expected to run nearly
constantly (though perhaps not 365/24 with 6 9's reliability), but
which, for the most part, are /not/ safety critical.
TI has for several years had nice processors with two cores which
run almost in sync, but one is something like one cycle behind
the other. And there is circuitry to compare that both cores
produce the same result. This does not cover failures of the
whole chip, but dramatically lowers the chance of undetected errors
due to some transient condition.
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than a single processor, but the increase in cost probably would be
negligible compared to the cost of the whole car. And when integrated
on a single chip the cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand over work to a backup controller.
On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
Even if some safety-critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
On 10/18/2024 6:50 PM, Waldek Hebisch wrote:
TI has for several years had nice processors with two cores which
run almost in sync, but one is something like one cycle behind
the other. And there is circuitry to compare that both cores
produce the same result. This does not cover failures of the
whole chip, but dramatically lowers the chance of undetected errors
due to some transient condition.
The 4th bit in memory location XYZ has failed "stuck at zero".
How are you going to detect that?
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than a single processor, but the increase in cost probably would be
negligible compared to the cost of the whole car. And when integrated
on a single chip the cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand over work to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances of POST (and, at the user's prerogative, BIST)
are not guaranteed when a device runs "for long periods of time".
Don Y <blockedofcourse@foo.invalid> wrote:
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if the device matters then there is implied liability
and nobody wants to admit doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and software bugs are the major source of misbehaviour.
And among hardware failures, transient upsets, like a flipped
bit, are more likely than permanent failures. For example,
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
IIUC crystal oscillators are likely to fail,
so you are supposed to regularly check for presence of the clock
and its frequency (this assumes a hardware design with a backup
clock).
Even if some safety-critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
I do not know how PLC manufacturers implement checks. Small
PLCs are based on MCUs with static parity-protected RAM.
This may be deemed adequate. PLCs work in cycles and some
percentage of the cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
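A minimal sketch of that walking self-test, assuming the region under test can be briefly taken over (interrupts masked, or the region momentarily idle); the patterns and region size are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Destructive test of one small RAM region: save each word, write
 * alternating patterns and read them back, then restore the
 * application data.  Run against a different region each cycle to
 * walk the whole memory over time.
 * Returns 0 if the RAM held the patterns, -1 on the first failure. */
int ram_region_test(volatile uint32_t *base, size_t n)
{
    static const uint32_t pat[2] = { 0x55555555u, 0xAAAAAAAAu };
    for (size_t i = 0; i < n; i++) {
        uint32_t saved = base[i];
        for (int p = 0; p < 2; p++) {
            base[i] = pat[p];
            if (base[i] != pat[p]) {
                base[i] = saved;
                return -1;          /* stuck or coupled bit found */
            }
        }
        base[i] = saved;            /* restore application data */
    }
    return 0;
}
```

The 0x55/0xAA pair catches stuck-at faults on every bit; a fuller march test would also exercise coupling faults, at more cost per cycle.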
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
I am afraid that usually an inconsequential failure gets
promoted to complete failure. Before 2000, checking showed
that several BIOSes "validated" the date and an "incorrect" (that
is, after 1999) date prevented boot.
Historically OSes had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
Everything here is domain specific. In a cheap MCU-based device the main
source of failures is overvoltage/ESD on MCU pins. This may
kill the whole chip, in which case no software protection can
help. Or some pins fail; sometimes this may be detected by reading
the appropriate port. If you control an electric motor then you probably
do not want to send test signals during normal motor operation.
But you are likely to have some feedback and can verify if the feedback
agrees with expected values. If you get unexpected readings
you probably will stop the motor.
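That feedback cross-check can be as small as this (a sketch; the tolerance and debounce count are illustrative, and the debouncing matters so that one noisy sample does not stop the motor):

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare commanded vs measured motor speed each control cycle and
 * trip only after the disagreement persists for several samples. */
typedef struct {
    uint16_t tol;    /* allowed |commanded - measured| */
    uint8_t  limit;  /* consecutive bad samples before trip */
    uint8_t  bad;    /* current run of bad samples */
} fb_monitor;

/* Returns true when the motor should be stopped. */
bool fb_check(fb_monitor *m, uint16_t commanded, uint16_t measured)
{
    uint16_t diff = commanded > measured ? commanded - measured
                                         : measured - commanded;
    if (diff > m->tol) {
        if (++m->bad >= m->limit)
            return true;            /* persistent disagreement */
    } else {
        m->bad = 0;                 /* healthy sample resets the run */
    }
    return false;
}
```

This is testing "as part of normal operation" in exactly the sense discussed above: no test signals are injected; the plant's own feedback is the test.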
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such a thing looks easy. Recovery is tricky, because
if you have a spare FET and activate it there is a good chance that
it will fail due to the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET, a
disturbance strong enough to kill the FET is likely to kill
the controller too.
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then the
result may be different).
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Sure. But for common failures, or serious failures having
non-negligible probability, redundancy may offer a cheap way to
increase reliability.
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than a single processor, but the increase in cost probably would be
negligible compared to the cost of the whole car. And when integrated
on a single chip the cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand over work to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It does not matter if handover _always_ works. What matters is
whether a system with handover has a lower chance of failure than
a system without handover. Having statistics of actual failures
(which I do not have but manufacturers should have) and
after some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances of POST (and, at the user's prerogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
Of course, if you have a single-tasked design with a task which
must be "always" ready to respond, then running tests becomes
more complicated. But in most designs you can spare enough
time slots to run tests during normal operation. Tests may
interfere with normal operation, but here we are in domain-
specific territory: sometimes the results of operation give enough
assurance that the device is operating correctly. And if testing
for correct operation is impossible, then there is nothing to
do; I certainly do not promise to deliver the impossible.