On Fri, Nov 15, 2024 at 10:35 AM Michael <confabulate@kintzios.com> wrote:
Host managed SMRs (HM-SMR) require the OS and FS to be aware of the need for sequential writes and manage submitted data sympathetically to this limitation of the SMR drive, by queuing up random writes in batches and submitting these as a sequential stream.
I understand the ext4-lazy option and some patches on btrfs have improved performance of these filesystems on SMR drives, but perhaps f2fs will perform better? :-/
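To make the batching idea concrete, here is a minimal sketch (hypothetical names, not any real filesystem's code) of what queuing random writes and flushing them at a zone's write pointer as one sequential stream could look like:

# Illustrative only: coalesce random writes in memory and submit them to a
# host-managed SMR zone as one sequential stream. All names are made up.
class ZoneWriteBatcher:
    def __init__(self, zone_start_lba, block_size=4096):
        self.write_pointer = zone_start_lba   # HM-SMR zones only accept writes here
        self.block_size = block_size
        self.pending = []                     # buffered (logical_block, payload) pairs
        self.mapping = {}                     # logical block -> physical LBA in the zone

    def queue_random_write(self, logical_block, payload):
        # Random writes are only buffered; nothing touches the shingled media yet.
        self.pending.append((logical_block, payload))

    def flush_sequential(self, submit_io):
        # Drain the batch as one stream at the zone's write pointer.
        for logical_block, payload in self.pending:
            submit_io(self.write_pointer, payload)         # strictly increasing LBAs
            self.mapping[logical_block] = self.write_pointer
            self.write_pointer += self.block_size // 512   # one block, 512-byte LBAs assumed
        self.pending.clear()

The mapping table is the part the host then has to persist without random writes of its own, which is roughly what the ext4-lazy and f2fs approaches above are trying to achieve.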
IMO a host-managed solution is likely to be the only thing that will
work reliably.
If the drive supports discard/trim MAYBE a dumber
drive might be able to be used with the right filesystem. Even if
you're doing "write-once" workloads any kind of metadata change is
going to cause random writes unless the filesystem was designed for
SMR. Ideally you'd store metadata on a non-SMR device, though it
isn't strictly necessary with a log-based approach.
If the SMR drive tries really hard to not look like an SMR drive and
doesn't support discard/trim then even an SMR-aware solution probably
won't be able to use it effectively. The drive is going to keep doing read-before-write cycles to preserve data even if there is nothing
useful to preserve.
--
Rich
I assume (simplistically) with DM-SMRs the
discard-garbage collection is managed wholly by the onboard drive controller, while with HM-SMRs the OS will signal the drive to start trimming when the workload is low in order to distribute the timing overheads to the system's idle time.
On Sat, Nov 16, 2024 at 6:02 AM Michael <confabulate@kintzios.com> wrote:
I assume (simplistically) with DM-SMRs the
discard-garbage collection is managed wholly by the onboard drive controller, while with HM-SMRs the OS will signal the drive to start trimming when the workload is low in order to distribute the timing overheads to the system's idle time.
I'll admit I haven't looked into the details as I have no need for SMR
and there aren't any good FOSS solutions for using it that I'm aware
of (just a few that might be slightly less terrible). However, this
doesn't seem correct for two reasons:
First, I'm not sure why HM-SMR would even need a discard function.
The discard command is used to tell a drive that a block is safe to
overwrite without preservation. A host-managed SMR drive doesn't need
to know what data is disposable and what data is not. It simply needs
to write data when the host instructs it to do so, destroying other
data in the process, and it is the host's job to not destroy anything
it cares about. If a write requires a prior read, then the host needs
to first do the read, then adjust the written data appropriately so
that nothing is lost.
Second, there is no reason that any drive of any kind (SMR or SSD)
NEEDS to do discard/trim operations when the drive is idle, because discard/trim is entirely a metadata operation that doesn't require IO
with the drive data itself. Now, some drives might CHOOSE to
implement it that way, but they don't have to. On an SSD, a discard
command does not mean that the drive needs to erase or move any data
at all. It just means that if there is a subsequent erase that would
impact that block, it isn't necessary to first read the data and
re-write it afterwards. A discard could be implemented entirely in non-volatile metadata storage, such as with a bitmap. For a DM-SMR, using flash for this purpose would make a lot of sense - you wouldn't
need much of it.
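To illustrate the point that a discard can be pure metadata, here is a tiny sketch of a per-block "discarded" bitmap of the kind a DM-SMR could keep in a little flash - entirely hypothetical, not how any particular drive does it:

# A discard just flips a bit; no data is read, moved, or erased.
class DiscardBitmap:
    def __init__(self, total_blocks):
        self.bits = bytearray((total_blocks + 7) // 8)   # 1 bit per block

    def discard(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def must_preserve(self, block):
        # During a later band rewrite, discarded blocks can simply be skipped.
        return not ((self.bits[block // 8] >> (block % 8)) & 1)

At 4 KiB granularity the bitmap for a many-TB drive runs to a few hundred MiB, so real firmware would presumably track it more coarsely (per band, say), but either way it stays a metadata structure that never forces IO on the shingled data itself.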
This is probably why you have endless arguing online about whether discard/trim is helpful for SSDs. It completely depends on how the
drive implements the command. The drives I've owned can discard
blocks without any impact on IO, but I've heard some have a terrible
impact on IO. It is just like how you can complete the same sort
operation in seconds or hours depending on how dumb your sorting
algorithm is.
In any case, to really take advantage of SMR the OS needs to
understand exactly how to structure its writes so as to not take a
penalty, and that requires information about the implementation of the storage that isn't visible in a DM-SMR.
Sure, some designs will do
better on SMR even without this information, but I don't think they'll
ever be all that efficient. It is no different from putting f2fs on a
flash drive with a brain-dead discard implementation - even if the OS
does all its discards in nice consolidated contiguous operations it
doesn't mean that the drive will handle that in milliseconds instead
of just blocking all IO for an hour - sure, the drive COULD do the
operation quickly, but that doesn't mean that the firmware designers
didn't just ignore the simplest use case in favor of just optimizing
around the assumption that NTFS is the only filesystem in the world.
As I understand it from reading various articles, the constraint of having to rewrite a whole band sequentially when a random block changes within it works the same on both HM-SMR and the more common DM-SMR. What differs with HM-SMR is that the host is meant to take over the management of random writes and submit these as sequential whole-band streams to the drive, to be committed without a read-modify-write penalty. I suppose having the host read the whole band from the drive first, modify it, and then submit it back to be written as a whole band will be faster than letting the drive manage this operation internally and filling up its internal cache.
This will not absolve the drive firmware from having to manage its own trim operations and the impact metadata changes could have on the drive, but some timing optimisation is perhaps reasonable.
I don't know if SMRs use flash to record their STL status and data allocation between their persistent cache and shingled storage space. I would think yes, or at least they ought to. Without metadata written to different media, for such a small random write to take place atomically a whole SMR band will be read, modified in memory, written to a new temporary location and finally written back over the original SMR band.
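A quick back-of-the-envelope figure for that read-modify-write penalty, assuming a 256 MiB band (the band size is an assumption for illustration only):

band_size    = 256 * 1024 * 1024   # bytes in one shingled band (assumed)
random_write = 4 * 1024            # one 4 KiB block actually changed

bytes_read    = band_size          # read the whole band
bytes_written = band_size          # rewrite the whole band (plus any staged temp copy)

print(bytes_written / random_write)   # 65536x write amplification for that one block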
Well, drive-managed SMR drives typically have CMR regions for data
caching, and they could also be used to store the bitmap. Cheap
drives might not support trim at all, and would just preserve all data
on write. After all, it isn't performance that is driving the
decision to sneak SMR into consumer drives. Flash would be the most
sensible way to do it though.
On Sat, Nov 16, 2024 at 2:47 PM Michael <confabulate@kintzios.com> wrote:
What differs with HM-SMR is that the host is meant to take over the management of random writes and submit these as sequential whole-band streams to the drive, to be committed without a read-modify-write penalty. I suppose having the host read the whole band from the drive first, modify it, and then submit it back to be written as a whole band will be faster than letting the drive manage this operation internally and filling up its internal cache.
I doubt this would be any faster with a host-managed drive. The same
pattern of writes is going to incur the same penalties.
The idea of a host-managed drive is to avoid the random writes in the
first place, and the need to do the random reads. For this to work
the host has to know where the boundaries of the various regions are
and where it is safe to begin writes in each region.
Sure, a host could just use software to make the host-managed drive
behave the same as a drive-managed drive, but there isn't much benefit
there. You'd want to use a log-based storage system/etc to just avoid
the random writes entirely. You might not even want to use a POSIX filesystem on it.
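As a sketch of what "log-based" means here (hypothetical code, not a real system): never rewrite in place, always append at the tail, and keep an index from key to log offset, so an "overwrite" is just a newer record further along the zone:

import json

class AppendOnlyLog:
    def __init__(self, path):
        self.f = open(path, "ab+")     # appends map naturally onto sequential SMR zones
        self.index = {}                # key -> byte offset of the latest record

    def put(self, key, value):
        self.f.seek(0, 2)              # always write at the tail
        offset = self.f.tell()
        self.f.write(json.dumps({"k": key, "v": value}).encode() + b"\n")
        self.index[key] = offset

    def get(self, key):
        self.f.seek(self.index[key])
        return json.loads(self.f.readline())["v"]

Stale records pile up until a cleaner streams the still-live ones into a fresh zone, which is exactly the kind of work you would rather schedule from the host than leave to drive firmware.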
This will not absolve the drive firmware from having to manage its own trim operations and the impact metadata changes could have on the drive, but some timing optimisation is perhaps reasonable.
Why would a host-managed SMR drive have ANY trim operations? What
does trimming even mean on a host-managed drive?
Trimming is the act of telling the drive that it is safe to delete a
block without preserving it. A host-managed drive shouldn't need to
be concerned with preserving any data during a write operation. If it
is told to write something, it will just overwrite the data in the
subsequent overlapping cylinders.
On Saturday 16 November 2024 20:13:30 GMT Rich Freeman wrote:
The idea of a host-managed drive is to avoid the random writes in the
first place, and the need to do the random reads. For this to work
the host has to know where the boundaries of the various regions are
and where it is safe to begin writes in each region.
The random reads do not incur a time penalty; it is the R-M-W ops that cost time.
The host doesn't need to know where bands start and finish; it only needs to submit data in whole sequential streams, so they can be written directly to the disk as on a CMR. As long as data and metadata are submitted and written directly, the SMR would be like a CMR in terms of its performance.
I assumed, maybe wrongly, that there is still an STL function performed by the controller on HM-SMRs, to de-allocate deleted data bands whenever files are deleted, perform secure data deletions via its firmware, etc. However, I can see that if this is managed at the fs journal layer the drive controller could be dumb in this respect.
It would be interesting to see how different fs types perform on DM-SMRs.
On Sun, Nov 17, 2024 at 6:22 AM Michael <confabulate@kintzios.com> wrote:
On Saturday 16 November 2024 20:13:30 GMT Rich Freeman wrote:
[snip ....]
It would be interesting to see how different fs types perform on DM-SMRs.
Not that interesting, for me personally. That's like asking how well different filesystems would perform on tape. If I'm storing data on
tape, I'll use an algorithm designed to work on tape, and a tape
drive that actually has a command set that doesn't try to pretend
that it is useful for random writes.
SMR is pretty analogous to tape, with the benefit of being as fast as
CMR for random reads.
What about DEC-Tape? :-) (https://en.wikipedia.org/wiki/DECtape) (I
may even have a few left in a closet somewhere, if only I could find
someone to read them.)
I've looked into it for backup but you need to store a LOT of
data for it to make sense. The issue is that the drives are just super-expensive. You can get much older generation drives used for reasonable prices, but then the tapes have a very low capacity and aren't that cheap, so your cost per TB is pretty high, and of
course you have the inconvenience of long backup times and lots of
tape changes.
The newer generation drives are very reasonable in
terms of cost per TB, but the drives themselves cost thousands of
dollars. Unless you're archiving hundreds of TB it is cheaper to just
buy lots of USB3 hard drives at $15/TB, and then you get the random IO performance as a bonus.
The main downside to HDD at smaller scales is
that the drives themselves are more fragile, but that is mostly if
you're dropping them - in terms of storage conditions tape needs
better care than many appreciate for it to remain reliable.
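A rough break-even sketch of that trade-off. Only the $15/TB HDD figure comes from the paragraph above; the tape drive and media prices are assumptions for illustration:

hdd_cost_per_tb  = 15.0     # USD/TB, from above
tape_drive_cost  = 4000.0   # USD, assumed price of a current-generation tape drive
tape_cost_per_tb = 8.0      # USD/TB, assumed media cost

for tb in (50, 100, 300, 600, 1000):
    hdd  = hdd_cost_per_tb * tb
    tape = tape_drive_cost + tape_cost_per_tb * tb
    print(f"{tb:5d} TB: HDD ${hdd:,.0f} vs tape ${tape:,.0f}")

With those made-up numbers the crossover sits somewhere around 550-600 TB, which is why it only starts to make sense at hundreds of TB.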
My question is this. Given they cost about $20 more, from what I've
found anyway, is it worth it? Is there a downside to this new set of
heads being added? I'm thinking a higher failure rate, more risk to
data or something like that. I think this is a fairly new thing, last
couple years or so maybe. We all know how some new things don't work out.
Just looking for thoughts and opinions, facts if someone has some.
Failure rate compared to single actuator drives if there is such data.
My searches didn't help me find anything useful.
P. S. My greens are growing like weeds. Usually they ready to pick by
now but having to wait for the tree to be cut down and cut up delayed
that. Should be ready by Christmas, I hope. Oh, planted oats, clover,
kale and some other extra seeds I had in open area. I saw a LARGE buck
deer the other night snacking on the oats. My neighbor would rather see
it in his freezer tho. o_0
Howdy,
One of my PVs is about 83% full. Time to add more space, soon anyway.
I try not to go past 90%. Anyway, I was looking at hard drives and
noticed something new. I think I saw one a while back but didn't look
into it at the time. I'm looking at 18TB drives, right now. Some new Seagate drives have dual actuators. Basically, they have two sets of
heads. In theory, if circumstances are right, it could read data twice
as fast. Of course, most of the time that won't be the case but it can happen often enough to make it get data a little faster. Even a 25% or
30% increase gives Seagate something to brag about. Another sales tool.
Some heavy data users wouldn't mind either.
My question is this. Given they cost about $20 more, from what I've
found anyway, is it worth it? Is there a downside to this new set of
heads being added? I'm thinking a higher failure rate, more risk to
data or something like that. I think this is a fairly new thing, last
couple years or so maybe. We all know how some new things don't work out.
Just looking for thoughts and opinions, facts if someone has some.
Failure rate compared to single actuator drives if there is such data.
My searches didn't help me find anything useful.
Thanks.
Dale
:-) :-)
On Thursday, 14 November 2024, 20:12:25 Central European Standard Time
The only Seagate 7200RPM disk I have started playing up a month ago. I now have to replace it. :-(
The German tech bubble has a saying when it’s about Seagate: “Sie geht oder
sie geht nicht”. It plays on the fact that “sie geht” (literally “she runs”¹, meaning “it works”) sounds very similar to “Seagate”. So the
literal joke is “Either it works or it doesn’t”, and the meta joke is “Seagate or not Seagate”.
Michael wrote:
On Wednesday 13 November 2024 23:10:10 GMT Dale wrote:
Howdy,
One of my PVs is about 83% full. Time to add more space, soon anyway.
I try not to go past 90%. Anyway, I was looking at hard drives and
noticed something new. I think I saw one a while back but didn't look
into it at the time. I'm looking at 18TB drives, right now. Some new
Seagate drives have dual actuators. Basically, they have two sets of
heads. In theory, if circumstances are right, it could read data twice
as fast. Of course, most of the time that won't be the case but it can
happen often enough to make it get data a little faster. Even a 25% or
30% increase gives Seagate something to brag about. Another sales tool.
Some heavy data users wouldn't mind either.
My question is this. Given they cost about $20 more, from what I've
found anyway, is it worth it? Is there a downside to this new set of
heads being added? I'm thinking a higher failure rate, more risk to
data or something like that. I think this is a fairly new thing, last
couple years or so maybe. We all know how some new things don't work
out.
Just looking for thoughts and opinions, facts if someone has some.
Failure rate compared to single actuator drives if there is such data.
My searches didn't help me find anything useful.
Thanks.
Dale
:-) :-)
I don't know much about these drives beyond what the OEM claims. From
what I read, I can surmise the following hypotheses:
These drives draw more power from your PSU and although they are filled with helium to mitigate against higher power/heat, they will require
better cooling at the margin than a conventional drive.
Your system will use dev-libs/libaio to read the whole disk as a single SATA drive (a SAS port will read it as two separate LUNs). The first 50% of LBAs will be accessed by the first head and the last 50% by the other head. So far, so good.
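If that LBA split is right (an assumption on my part, not vendor documentation), picking the actuator for a given LBA is trivial:

# Lower half of the LBA range -> actuator 0, upper half -> actuator 1 (assumed layout).
def actuator_for_lba(lba, total_lbas):
    return 0 if lba < total_lbas // 2 else 1

total = 18_000_000_000_000 // 512          # 18 TB drive, 512-byte LBAs
print(actuator_for_lba(10, total))         # 0
print(actuator_for_lba(total - 10, total)) # 1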
Theoretically, I suspect this creates a higher probability of failure. In the hypothetical scenario of a large sequential write where both heads
are writing data of a single file, both heads must succeed in their write operation. The probability that both head A and head B succeed is P(A∩B) = P(A) × P(B), assuming the heads fail independently. As an example, if the probability of a successful write by each head were 80%, the probability of both heads succeeding would be only 64%:
0.8 * 0.8 = 0.64
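Spelled out, with the same purely illustrative 80% figure (real per-write success rates are of course vastly higher):

p_head = 0.8
print(p_head * p_head)   # 0.64 - P(A∩B) = P(A) * P(B) for independent heads

p_head = 0.999999        # made-up but more realistic-looking success rate
print(p_head ** 2)       # ~0.999998 - the penalty shrinks fast as p approaches 1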
As long as I didn't make any glaring errors, this simplistic thought experiment assumes all else being equal with a conventional single head drive, but it never is. The reliability of a conventional non-helium filled drive may be lower to start with. Seagate claim their Exos 2 reliability is comparable to other enterprise-grade hard drives, but I don't have any real world experience to share here. I expect by the time enough reliability statistics are available, the OEMs would have moved on to different drive technologies.
When considering buying this drive you could look at the market segment needs and use cases Seagate/WD could have tried to address by developing and marketing this technology. These drives are for cloud storage implementations, where higher IOPS, data density and speed of read/write is desired, while everything is RAID'ed and backed up. The trade-off is power usage and heat.
Personally, I tend to buy n-1 versions of storage solutions, for the following reasons:
1. Price per GB is cheaper.
2. Any bad news and rumours about novel failing technologies or unsuitable implementations (e.g. unmarked SMRs being used in NAS) tend to spread far and wide over time.
3. High volume sellers start offering discounts for older models.
However, I don't have a need to store the amount of data you do. Most of my drives stay empty. Here's a 4TB spinning disk with 3 OS and 9 partitions:
~ # gdisk -l /dev/sda | grep TiB
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Total free space is 6986885052 sectors (3.3 TiB)
HTH
Sounds like my system may not can even handle one of these. I'm not
sure my SATA ports support that stuff.
It sounds like this is not something I really need anyway.
After all, I'm already spanning my data
over three drives. I'm sure some data is coming from each drive. No
way to really know for sure but makes sense.
Do you have a link or something to a place that explains what parts of
the Seagate model number means? I know ST is for Seagate. The size is
next. After that, everything I find is old and outdated. I looked on
the Seagate website too but had no luck. I figure someone made one, somewhere. A link would be fine.
Thanks.
Dale
:-) :-)
I've had a Seagate, a Maxtor from way back and a Western Digital go bad. This is one reason I don't knock any drive maker. Any of them can produce a bad drive.
It's one thing that kinda gets on my nerves about SMR. It seems,
sounds, like they tried to hide it from people to make money. Thing is,
as some learned, they don't do well in a RAID and some other
situations. Heck, they do OK reading but when writing, they can get
real slow when writing a lot of data. Then you have to wait until it
gets done redoing things so that it is complete.
Lol, writing the above text gave me the strange feeling of having written it before. So I looked into my archive and I have indeed: in June 2014 *and*
in December 2020. 🫣
The biggest downside to the large drives available now is that even if SMART tells you a drive is failing, you likely won't have time to copy the data over to a new drive before it fails. On an 18TB drive, using pvmove, it can take a long time to move data.
I don't even want to think what it would cost to put
all my 100TBs or so on SSD or NVME drives. WOW!!!
On 14/11/2024 20:33, Dale wrote:
It's one thing that kinda gets on my nerves about SMR. It seems,
sounds, like they tried to hide it from people to make money. Thing is,
as some learned, they don't do well in a RAID and some other
situations. Heck, they do OK reading but when writing, they can get
real slow when writing a lot of data. Then you have to wait until it
gets done redoing things so that it is complete.
Incidentally, when I looked up HAMR (I didn't know what it was) it's
touted as making SMR obsolete. I can see why ...
And dual actuator? I would have thought that would be good for SMR
drives. Not that I have a clue how they work internally, but I would
have thought it made sense to have zones and a streaming log-structured layout. So when the user is using it, you're filling up the zones, and
then when the drive has "free time", it takes a full zone that has the largest "freed/dead space" and streams it to the current zone, one
actuator to read and one to write. Indeed, it could possibly do that
while the drive is being used ...
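A literal rendering of that idea as a sketch (hypothetical structures, nothing to do with real firmware): pick the full zone with the most freed/dead space and stream its live blocks into the open zone, one actuator doing the sequential read and the other the sequential write:

class Zone:
    def __init__(self, blocks):
        self.blocks = blocks          # {logical_block: data} still live in this zone
        self.dead = 0                 # count of freed/overwritten blocks

def reclaim_one_zone(full_zones, open_zone):
    victim = max(full_zones, key=lambda z: z.dead)    # largest dead space wins
    for logical, data in victim.blocks.items():       # sequential read (actuator 1)
        open_zone.blocks[logical] = data              # sequential append (actuator 2)
    victim.blocks.clear()                             # whole victim zone is reusable now
    victim.dead = 0
    full_zones.remove(victim)
    return victim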
Cheers,
Wol
The thing about my data, it's mostly large video files. If I were
storing documents or something, then SSD or something would be a good
option. Plus, I mostly write once, then it either sits there a while or
gets read on occasion.
I checked and I have some 56,000 videos. That
doesn't include Youtube videos. This is also why I wanted to use that checksum script.
I do wish there was an easy way to make columns work when we copy and
paste into email. :/
Dale
:-) :-)
Rich Freeman wrote:
On Thu, Nov 14, 2024 at 6:10 PM Dale <rdalek1967@gmail.com> wrote:
The biggest downside to the large drives available now is that even if SMART tells you a drive is failing, you likely won't have time to copy the data over to a new drive before it fails. On an 18TB drive, using pvmove, it can take a long time to move data.
[…]
I think I did some math on this once. I'm not positive on this and it could vary depending on the system's ability to move data. I think about
8TB is as large as you want if you get a 24 hour notice from SMART and
see that notice fairly quickly to act on. Anything beyond that and you
may not have enough time to move data, if the data is even good still.
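For what it's worth, the arithmetic roughly backs that up, assuming something like 100 MB/s of sustained pvmove throughput (the throughput is an assumption; real numbers vary a lot):

rate = 100e6                          # bytes/s, assumed sustained move rate
print(8e12 / rate / 3600)             # ~22 hours to move 8 TB
print(18e12 / rate / 3600)            # ~50 hours to move 18 TB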
I don't even want to think what it would cost to put
all my 100TBs or so on SSD or NVME drives. WOW!!!
# kubectl rook-ceph ceph osd df class ssd
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL    %USE   VAR   PGS  STATUS
 8  ssd    6.98630  1.00000   7.0 TiB  1.7 TiB  1.7 TiB  63 MiB  3.9 GiB  5.3 TiB  24.66  1.04  179  up
[…]
I do wish there was an easy way to make columns work when we copy and
paste into email. :/
Dale
:-) :-)
Michael wrote:
On Friday 15 November 2024 05:53:53 GMT Dale wrote:
The thing about my data, it's mostly large video files. If I were
storing documents or something, then SSD or something would be a good
option. Plus, I mostly write once, then it either sits there a while or gets read on occasion.
For a write-once, read-often use case, the SMR drives are a good solution. They were designed for this purpose. Because of their shingled layers they provide higher storage density than comparable CMR drives.
True, but I don't like it when I'm told a write is done and it kinda isn't. I recall a while back I reorganized some stuff, mostly renamed directories but also moved some files. Some were Youtube videos. It took about 30 minutes to update the data on the SMR backup drive. The part I see anyway.
It sat there for an hour at least doing that bumpy thing before it finally finished. I realize if I just turn the drive off, the data is still there. Still, I don't like it appearing to be done when it really is still working on it.
Another thing, I may switch to RAID one
of these days. If I do, that drive isn't a good option.
When I update my backups, I start the one I do with my NAS setup first.
Then I start the home directory backup with the SMR drive. I then
backup everything else I backup on other drives. I do that so that I
can leave the SMR drive at least powered on while it does its bumpy
thing and I do other backups. Quite often, the SMR drive is the last
one I put back in the safe. That bumpy thing can take quite a while at times.
Host managed SMRs (HM-SMR) require the OS and FS to be aware of the need for sequential writes and manage submitted data sympathetically to this limitation
of the SMR drive, by queuing up random writes in batches and submitting these as a sequential stream.
I understand the ext4-lazy option and some patches on btrfs have improved performance of these filesystems on SMR drives, but perhaps f2fs will perform
better? :-/