On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if the ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C.
{
memmove( p, q, size );
}
Where the compiler produces the MM instruction itself. Looks damn
close to standard C to me !!
OR
for( int i = 0; i < size; i++ )
p[i] = q[i];
Which gets compiled to memcpy()--also looks to be standard C.
OR
p_struct = q_struct;
gets compiled to::
memmove( &p_struct, &q_struct, sizeof( q_struct ) );
also looks to be std C.
On 10/10/24 2:21 PM, David Brown wrote:
[ SNIP]
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
If the compiler generates the memmove instruction, then one doesn't
have to write memmove() in C - it is never called/used.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
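[ A minimal illustration of that point, not from the original posts: when the size is a compile-time constant, mainstream compilers typically expand the call inline instead of calling the library function. ]

/* Sketch: a copy whose size is known at compile time.  Compilers such
   as gcc and clang usually inline this as a few register moves rather
   than emitting a call to memcpy(). */
#include <string.h>

typedef struct { double x, y, z; } vec3;

void copy_vec3(vec3 *dst, const vec3 *src)
{
    memcpy(dst, src, sizeof *src);   /* constant size: normally inlined */
}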
The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.
In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries that
can work better if they are optimised for each new generation of
processor. Sometimes you just need to re-compile the library with a
newer compiler and appropriate flags, other times you need to modify
the library source code. None of this is specific to memmove().
But it is true that you get an easier and more future-proof memmove()
and memcpy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.
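[ For illustration, a sketch of how an explicitly sized SIMD width gets baked into the source; GCC's vector_size extension is used here, with 32 bytes standing in for the current SIMD width.  A wider future ISA would need this rewritten, which is the point being made. ]

/* Fixed-width chunked copy: the 32-byte chunk size is a compile-time
   choice, which is exactly what scalable vector ISAs avoid. */
#include <stddef.h>
#include <string.h>

typedef unsigned char v32 __attribute__((vector_size(32)));

void copy_chunks(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t i = 0;
    for (; i + sizeof(v32) <= n; i += sizeof(v32)) {
        v32 tmp;
        memcpy(&tmp, s + i, sizeof tmp);   /* unaligned-safe load  */
        memcpy(d + i, &tmp, sizeof tmp);   /* unaligned-safe store */
    }
    for (; i < n; i++)                     /* scalar tail */
        d[i] = s[i];
}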
Not applicable.
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
Brian probably meant to say that it is not applicable to his my66k
LLVM back end.
But I am pretty sure that what you suggest is applicable, but a bad idea
for a memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification, i.e.
exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
effort.
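[ A minimal sketch of the dynamic-dispatch idea; the feature probe and both variants are hypothetical stand-ins, not a real library API. ]

#include <stddef.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Portable fallback variant. */
static void *memcpy_scalar(void *d, const void *s, size_t n)
{
    unsigned char *dp = d;
    const unsigned char *sp = s;
    while (n--) *dp++ = *sp++;
    return d;
}

/* Placeholder for a SIMD-tuned variant. */
static void *memcpy_wide(void *d, const void *s, size_t n)
{
    return memcpy_scalar(d, s, n);
}

/* Stand-in: a real build would probe CPUID, HWCAP or similar. */
static int cpu_has_wide_simd(void) { return 0; }

static memcpy_fn memcpy_impl;

void init_memcpy(void)   /* run once at startup, or via an ifunc resolver */
{
    memcpy_impl = cpu_has_wide_simd() ? memcpy_wide : memcpy_scalar;
}

void *fast_memcpy(void *d, const void *s, size_t n)
{
    return memcpy_impl(d, s, n);
}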
On 10/9/2024 1:20 PM, David Brown wrote:
There are lots of parts of the standard C library that cannot be
written completely in portable standard C. (How would you write
a function that handles files? You need non-portable OS calls.)
That's why these things are in the standard library in the first
place.
I agree with everything you say up until the last sentence. There
are several languages, mostly older ones like Fortran and COBOL,
where file handling and I/O are defined portably within the
language proper, not in a separate library. It just moves the
non-portable stuff from the library writer (as in C) to the
compiler writer (as in Fortran, COBOL, etc.)
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
Stefan Monnier <monnier@iro.umontreal.ca> writes:
In the VMS/WinNT way, each memory section is defined as either shared
or private when created and cannot be changed. This allows
optimizations in page table and page file handling.
Interesting. Do you happen to have a pointer for further reading
about it?
*nix needs to maintain various data structures to support forking
memory just in case it happens.
I can't imagine what those data structures would be (which might be
just another way to say that I was brought up on POSIX and can't
imagine the world differently).
http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
[ SNIP]
On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.
You are moving the goalposts.
One does not need a "good implementation" in the sense you have in mind.
All one needs is an implementation that the pattern-matching logic of
the compiler unmistakably recognizes as memmove/memcpy. That is very
easily done in standard C. For memmove, I had shown how to do it in one
of the posts below. For memcpy it's very obvious, so no need to show it.
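[ For concreteness, a sketch of such a pattern-matchable memmove - a reconstruction, not necessarily the code from the earlier post.  Note the d < s comparison is only well defined by the standard when both pointers address the same object, i.e. when the blocks can overlap. ]

#include <stddef.h>

void *my_memmove(void *s1, const void *s2, size_t n)
{
    unsigned char *d = s1;
    const unsigned char *s = s2;
    if (d < s) {                        /* copy forwards */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (d > s) {                 /* copy backwards, overlap-safe */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return s1;
}

[ Current gcc and clang typically recognize both loops and emit their built-in block-move sequences, so no assembly source is needed. ]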
[...] I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.
On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
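[ The usual workaround today is an integer comparison, as in the sketch below; the result is implementation-defined, which is exactly why a fully standard way would be convenient. ]

#include <stdint.h>

/* Total order on arbitrary pointers, assuming the implementation maps
   uintptr_t monotonically over a flat address space - not guaranteed
   by the C standard, and the assumption breaks on segmented machines. */
int ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}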
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 used string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
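[ To make the layout concrete, a small sketch - purely illustrative - that unpacks the 8-digit "S s OOOOOO" pointer described above, treating each BCD digit as one nibble. ]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t p = 0xC2001234u;                /* example: sign C, seg 2, offset 001234 */
    unsigned sign    = (p >> 28) & 0xF;      /* sign digit, C or D */
    unsigned segment = (p >> 24) & 0xF;      /* segment number, 0-7 */
    unsigned offset  =  p        & 0xFFFFFF; /* six BCD digits */
    printf("sign=%X segment=%u offset=%06X\n", sign, segment, offset);
    return 0;
}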
What was the size of the physical address space ?
I would suppose more than 1,000,000 words?
Michael S <already5chosen@yahoo.com> writes:
[ SNIP]
What was the size of the physical address space ?
I would suppose more than 1,000,000 words?
It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.
In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.
In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.
Binaries compiled in 1966 ran on all
generations without recompilation.
There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).
Unisys discontinued that line of systems in 1992.
I don't see an advantage in being able to implement them in standard C.
I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require specific
time constraints on these functions. In such cases, you are not
interested in writing fully portable software - it will already contain
many implementation-specific features or use compiler extensions.
On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
[ SNIP]
So, can it be said that at least some of the B6500-compatible models
suffered from the same problem as the 80286 - the segment of maximal
size didn't cover all of the linear (or physical) address space?
Or was their index register width increased to accommodate 1e9 digits
in a single segment?
Unisys discontinued that line of systems in 1992.
I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.
On 16/10/2024 08:21, David Brown wrote:
I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in
non-standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require
specific time constraints on these functions. In such cases, you are
not interested in writing fully portable software - it will already
contain many implementation-specific features or use compiler extensions.
I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely in C is
quite feasible there.
But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.
_That_ does require assembler, or compiler extensions, not standard C.
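[ A sketch of the compiler-extension route - GCC-style inline assembly for x86, assuming non-overlapping buffers; the ABI guarantees the direction flag is clear, so the copy runs forwards. ]

#include <stddef.h>

/* Copy n 16-bit words with the string-move instruction.  DI, SI and
   CX are the registers rep movsw implicitly uses, hence the "+D",
   "+S" and "+c" constraints. */
static void copy_words(void *dst, const void *src, size_t nwords)
{
    __asm__ volatile ("rep movsw"
                      : "+D" (dst), "+S" (src), "+c" (nwords)
                      :
                      : "memory");
}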
antispam@fricas.org (Waldek Hebisch) writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view the main drawbacks of the 286 are poor support
for large arrays and problems for Lisp-like systems which have a lot
of small data structures and traverse them via pointers.
Yes. In the first case the segments are too small, in the latter case
there are too few segments (if you have one segment per object).
In the second case one can pack several objects into a single
segment, so except for lost security properties this is not
a big problem.
If you go that way, you lose all the benefits of segments, and run
into the "segments too small" problem. Which you then want to
circumvent by using segment and offset in your addressing of the small
data structures, which leads to:
But there is a lot of loading of segment registers,
and slow loading is a problem.
...
Using 16-bit offsets for jumps inside a procedure and a
segment-offset pair for calls is likely to lead to better
or similar performance as a purely 32-bit machine.
With the 80286's segments and their slowness, that is very doubtful.
The 8086 has branches with 8-bit offsets and branches and calls with
16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
offsets and branches and calls with 32-bit offsets; if 16-bit offsets
for branches would be useful enough for performance, they could
instead have designed the longer branch length to be 16 bits, and
maybe a prefix for 32-bit branch offsets.
At that time Intel apparently wanted to avoid having too many
instructions.
Looking in my Pentium manual, the section on CALL has a 20 lines for
"call intersegment", "call gate" (with priviledge variants) and "call
to task" instructions, 10 of which probably already existed on the 286 (compared to 2 lines for "call near" instructions that existed on the
286), and the "Operation" section (the specification in pseudocode)
consumes about 4 pages, followed by a 1.5 page "Description" section.
9 of these 10 far call variants deal with protected-mode things, so
Intel obviously had no qualms about adding instruction variants. If
they instead had no protected mode, but some 32-bit support, including
the near call with 32-bit offset that I suggest, that would have
reduced the number of instruction variants.
I used Xenix on a 286 in 1986 or 1987; my impression is that programs
were limited to 64KB code and 64KB data size, exactly the PDP-11 model
you denounce.
Maybe. I have seen many cases where software essentially "wastes"
good things offered by hardware.
Which "good things offered by hardware" do you see "wasted" by this
usage in Xenix?
To me this seems to be the only workable way to use
the 286 protected mode. Ok, the medium model (near data, far code)
may also have been somewhat workable, but looking at the cycle counts
for the protected-mode far calls on the Pentium (and on the 286 they
were probably even more costly), which start at 22 cycles for a "call
gate, same privilege" (compared to 1 cycle on the Pentium for a
direct call near), one would strongly prefer the small model.
Every successful software product used direct access to hardware because of
performance; the rest waned. Using BIOS calls was just too slow.
Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
writing directly to video.
For most early graphic cards direct screen access could be allowed
just by allocating an appropriate segment. And most non-games
could gain good performance with a better system interface.
I think that the variety of tricks used in games and their
popularity made protected-mode systems much less appealing
to vendors. And that discouraged work on better interfaces
for non-games.
MicroSoft and IBM invested lots of work in a 286 protected-mode
interface: OS/2 1.x. It was limited to the 286 at the insistence of
IBM, even though work started in August 1985, when they already knew
that the 386 was coming soon. OS/2 1.0 was released in April 1987,
1.5 years after the 386.
OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
too late, so the 286 killed OS/2; here we have a case of a software
project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
few years).
Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
addition to the base (8086) variant of Windows 2.0, which was released
in December 1987), which used 386 protected mode and virtual 8086 mode
(which was missing in the "brain-damaged" (Bill Gates) 286). So
Windows completely ignored 286 protected mode. Windows eventually
became a big success.
Also, Microsoft started NT OS/2 in November 1988 to target the 386
while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
parted ways, NT OS/2 became Windows NT, which is the starting point of
all remaining Windowses from Windows XP onwards.
Xenix, apart from OS/2 the only other notable protected-mode OS for
the 286, was ported to the 386 in 1987, after SCO secured "knowledge
from Microsoft insiders that Microsoft was no longer developing
Xenix", so SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped
the 286 ship ASAP.
The verdict is: The only good use of the 286 is as a faster 8086;
small memory model multi-tasking use is possible, but the 64KB
segments are so limiting that everybody who understood software either decided to skip this twist (MicroSoft, except on their OS/2 death
march), or jumped ship ASAP (SCO).
More generally, vendors could release separate versions of
programs for 8086 and 286 but few did so.
Were there any who released software in both 8086 and protected-mode
80286 variants? Microsoft/SCO with Xenix, anyone else?
And users having
only binaries wanted to run 8086 programs on their new systems, which
led to heroic efforts like the OS/2 DOS box and later Linux
dosemu. But integration of 8086 programs with protected
mode was solved too late for 286 model to gain traction
(and on 286 "DOS box" had to run in real mode, breaking
normal system protection).
Linux never ran on an 80286, and DOSemu uses the virtual 8086 mode,
which does not require heroic efforts AFAIK.
There was various segmented hardware around, first and foremost (for
the designers of the 80286), the iAPX432. And as you write, all the
good reasons that resulted in segments on the iAPX432 also persisted
in the 80286. However, given the slowness of segmentation, only the
tiny (all in one segment), small (one segment for code and one for
data), and maybe medium memory models (one data segment) are
competitive in protected mode compared to real mode.
AFAICS that covered the vast majority of programs during the eighties.
The "vast majority" is not enough; if a key application like Lotus
1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
did not limit themselves to 64KB of data.
Turbo Pascal offered only medium memory model
According to Terje Mathiesen, it also offered the large memory model.
On its Wikipedia page, I find: "Besides allowing applications larger
than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
Turbo Pascal 4.0 introduced support for the large memory model in
1988.
Intel apparently assumed that programmers are willing to spend
extra work to get good performance and IMO this was right
as a general statement. Intel probably did not realize that
programmers would be very reluctant to spend work on security
features and in particular to spend work on making programs
fast in 286 protected mode.
80286 protected mode is never faster than real mode on the same CPU,
so the way to make programs fast on the 286 is to stick with real
mode; using the small memory model is an alternative, but as
mentioned, the memory limits are too restrictive.
Intel probably assumed that the 286 would cover most needs,
As far as protected mode was concerned, they hardly could have been
more wrong.
especially
given that most systems had much less memory than the 16 MB
theoretically allowed by the 286.
They provided 24 address pins, so they obviously assumed that there
would be 80286 systems with >8MB. 64KB segments are already too
limiting on systems with 1MB (which was supported by the 8086),
probably even for anything beyond 128KB.
IMO this is partially true: there
is a class of programs which with some work fit into the medium
model, but using a flat address space is easier. I think that
on the 286 (that is, with a 16-bit bus) those programs (assuming
enough tuning) run faster than a flat 32-bit version.
Maybe in real mode. Certainly not in protected mode. Just run your
tuned large-model protected-mode program against a 32-bit small-model
program for the same task on a 386SX (which is reported as having a
very similar speed to the 80286 on 16-bit programs).
And even if you
find one case where the protected-mode program wins, nobody found it
worth their time to do this nonsense.
And so OS/2 flopped despite
being backed by IBM and, until 1990, Microsoft.
But I think that Intel segmentation had some
attractive features during eighties.
You are one of a tiny minority. Even Intel finally saw the light, as
did everybody else, and nowadays segments are just a bad memory.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
addition to the base (8086) variant of Windows 2.0, which was released
in December 1987), which used 386 protected mode and virtual 8086 mode
(which was missing in the "brain-damaged" (Bill Gates) 286). So
Windows completely ignored 286 protected mode. Windows eventually
became a big success.
What I recall is a bit different. IIRC the first successful version of
Windows, that is Windows 3.0, had 3 modes of operation: 8086-compatible,
286 protected mode and 386 protected mode. Only later did Microsoft
drop the requirement for 8086 compatibility.
I think still later it dropped 286 support.
Windows 95 was supposed to be 32-bit, but contained quite a lot
of 16-bit code.
IIRC the system interface to Windows 3.0 and 3.1 was 16-bit and only
later Microsoft released an extension allowing 32-bit system calls.
I have no information about Windows internals except for some
public statements by Microsoft and other people, but I think
it reasonable to assume that Windows was actually a successful
example of 8086/286/386 compatibility. That is, their 16-bit
code could use real-mode segmentation or protected-mode
segmentation, the latter both for the 286 and the 386. For the
32-bit version they added a translation layer to transform
arguments between the 16-bit world and the 32-bit world. It is
possible that this translation layer involved a lot of effort.
16 bit dispatching "thunk" DLL to translate calls for every function of every board that we might possibly want to use ...
Anyway, it seems that Windows was at least as tied to the 286
as OS/2 when it became successful, and dropped 286 support
later. And for a long time after dropping 286 support
Windows massively used 16-bit segments.
IIUC Microsoft Windows up to 3.0, and probably everybody who wanted
to say "supported on Windows". That is, Windows 3.0 on a 286 almost
surely used 286 protected mode and probably ran "Windows" programs
in protected mode. But Windows also supported the 8086, and Microsoft
guidelines insisted that a proper "Windows program" should run on
the 8086.
... Even Intel finally saw the light, as
did everybody else, and nowadays segments are just a bad memory.
Well, 16-bit segments clearly are too limited when one has several
megabytes of memory. And a consistently 32-bit segmented system
increases memory use, which is a nontrivial cost. OTOH there is the
question of how much customers are going to pay for security
features. I think recent times show that security has significant
costs. But lack of security may lead to big losses. So
there is no easy choice.
Now people talk more about capabilities. AFAICS capabilities
offer more than segments, but are going to have a higher cost.
So abstractly, for some systems segments still may look
attractive. OTOH we now understand that the software ecosystem
is much more varied than the prevalent view in the seventies,
so systems that fit segments well are a tiny part.
And speaking of bad memories, do you remember PAE? That had a
similar spirit to 8086 segmentation. I guess that due
to the bad feeling for segments among programmers (and possibly
more relevant compatibility troubles) Intel did not extend
this to segments, but the spirit was still there.