• RX2800 sporadic disk I/O slowdowns

    From Richard Jordan@21:1/5 to All on Fri Oct 18 13:26:53 2024
    RX2800 i4 server, 64GB RAM, 4 processors, P410i controller with 10 each
    2TB disks in RAID 6, broken down into volumes.

    We periodically (sometimes steadily once a week, sometimes more often)
    see one overnight batch job take much longer than normal to run.
    Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    hours. Several images called by that job all run much slower than
    normal. At the end, the overall CPU and I/O counts are very close
    between a normal and a long job.

    The data files are very large indexed files. Records are read and
    updated but not added in this job; output is just tabulated reports.

    We've run MONITOR (all classes, plus disk) and also built polling
    snapshot jobs that check for locked/busy files and other active batch
    jobs, and we've auto-checked through the system analyzer (SDA) for any
    other processes accessing the busy files at the same time as the
    problem batch. Two data files show long busy periods, but we do not
    see any other process with channels to those files at the same time,
    except for backup (see next).
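
    (The snapshot jobs amount to roughly this kind of DCL loop; the device
    name and interval here are placeholders, not the exact production
    procedure:)

      $! Poll which processes hold channels to files on the busy data disk
      $ LOOP:
      $   SHOW TIME
      $   SHOW DEVICE/FILES DKA100:      ! list open files and their owners
      $   SHOW SYSTEM/BATCH              ! what other batch jobs are active
      $   WAIT 00:05:00                  ! sleep 5 minutes, then repeat
      $   GOTO LOOP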

    The backups start at the same time, but do not get to the data disks
    until well after the problem job normally completes. That does cause
    concurrent access to the problem files, but it occurs only when the
    job has already run long, so it is not the cause. Overall backup time
    is about the same regardless of how long the problem batch takes.

    MONITOR during a long run shows average and peak I/O rates to the
    disks with the busy files at about half of what they are for normal
    runs. We can see that in the process snapshots too; the direct I/O
    count on a slow run increases much more slowly than on a normal run,
    but both normal and long runs end up with close to the same CPU time
    and total I/Os.
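
    (For the record, the disk watching is basically plain MONITOR along
    these lines; the interval and recording file name are arbitrary
    choices, not anything special:)

      $! Sample I/O rate and queue length on all disks every 10 seconds
      $! and record them for later playback
      $ MONITOR DISK/ITEM=ALL/INTERVAL=10/RECORD=BATCH_DISK.DAT
      $! After the run, summarize the recording
      $ MONITOR DISK/ITEM=ALL/INPUT=BATCH_DISK.DAT/SUMMARY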

    Other jobs in monitor are somewhat slowed down but nowhere near as much
    (and they do much less access).

    Before anyone asks, the indexed files could probably use a
    cleanup/rebuild, but if that's the cause, would we see periodic
    performance issues? I would expect them to be constant.

    There is a backup server available, so I'm going to restore backups of
    the two problem files to it and do rebuilds to see how long it takes;
    that will determine how/when we can do it on the production server.
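
    (The rebuild test will be something like the following; the file names
    are placeholders, and the FDL would get tuned before a real production
    convert:)

      $! Generate an FDL describing the current file, then rebuild with CONVERT
      $ ANALYZE/RMS_FILE/FDL BIGFILE.IDX          ! writes BIGFILE.FDL
      $ CONVERT/FDL=BIGFILE.FDL/STATISTICS BIGFILE.IDX BIGFILE_NEW.IDX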



    So something is apparently causing the job to be I/O constrained, but
    so far we can't find it. The concurrent processes are the same, and
    other jobs don't appear to be slowed down much (but they may be much
    less I/O-sensitive or may use data on other disks; I've thrown that
    question to the devs).

    Is there anything in the background below VMS that could cause this?
    The controller doing drive checks or other maintenance activities?

    Thanks for any ideas.

  • From Craig A. Berry@21:1/5 to Richard Jordan on Fri Oct 18 17:09:44 2024
    On 10/18/24 1:26 PM, Richard Jordan wrote:
    > MONITOR during a long run shows average and peak I/O rates to the
    > disks with the busy files at about half of what they are for normal
    > runs.

    That is exactly what happens when the cache battery on a RAID controller
    dies. Maybe yours is half-dead and sometimes takes a charge and
    sometimes doesn't? MSA$UTIL should show the status of your P410.
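
    (From memory, and treat the exact commands inside the utility as an
    assumption since I don't have a P410 box in front of me, checking it
    looks roughly like:)

      $ RUN SYS$SYSTEM:MSA$UTIL
      MSA> SHOW CONTROLLER      ! controller status, including cache/battery
      MSA> EXIT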

  • From Lawrence D'Oliveiro@21:1/5 to Craig A. Berry on Sat Oct 19 00:07:21 2024
    On Fri, 18 Oct 2024 17:09:44 -0500, Craig A. Berry wrote:

    > That is exactly what happens when the cache battery on a RAID
    > controller dies.

    I hate hardware RAID. Has VMS still not got any equivalent to mdraid?

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 20:39:33 2024
    On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>
    >>> Has VMS still not got any equivalent to mdraid?
    >>
    >> VMS got volume shadowing in 1986 I believe.
    >
    > Relevance being?

    It is OS provided software RAID.

    Isn't that what you are asking for?

    Arne

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 19 00:56:51 2024
    On Fri, 18 Oct 2024 20:39:33 -0400, Arne Vajhøj wrote:
    > On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    >> On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >>> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>>
    >>>> Has VMS still not got any equivalent to mdraid?
    >>>
    >>> VMS got volume shadowing in 1986 I believe.
    >>
    >> Relevance being?
    >
    > It is OS provided software RAID.

    Does “volume shadowing” mean just RAID 1?

    <https://manpages.debian.org/8/mdadm.8.en.html>

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 21:12:30 2024
    On 10/18/2024 8:56 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 20:39:33 -0400, Arne Vajhøj wrote:
    >> On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    >>> On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >>>> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>>> Has VMS still not got any equivalent to mdraid?
    >>>>
    >>>> VMS got volume shadowing in 1986 I believe.
    >>>
    >>> Relevance being?
    >>
    >> It is OS provided software RAID.
    >
    > Does “volume shadowing” mean just RAID 1?

    I believe so.

    RAID 0, 5, 6 and 10 require a RAID controller.
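
    (For completeness, a host-based shadow set - the software RAID 1 - is
    set up with a MOUNT along these lines; the device names and label are
    made up, and it requires the volume-shadowing license:)

      $! Mount a two-member shadow set; DSA1: is the virtual unit
      $ MOUNT/SYSTEM DSA1: /SHADOW=(DKA100:,DKB100:) DATADISK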

    Arne

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 19 00:35:43 2024
    On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    > On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>
    >> Has VMS still not got any equivalent to mdraid?
    >
    > VMS got volume shadowing in 1986 I believe.

    Relevance being?

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 20:22:23 2024
    On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 17:09:44 -0500, Craig A. Berry wrote:
    >> That is exactly what happens when the cache battery on a RAID
    >> controller dies.
    >
    > I hate hardware RAID. Has VMS still not got any equivalent to mdraid?

    ????

    VMS got volume shadowing in 1986 I believe.

    Arne

  • From Volker Halle@21:1/5 to All on Sat Oct 19 09:02:57 2024
    Rich,

    this would be a perfect opportunity to run T4 - and look at the disk
    response times.

    Volker.

  • From Richard Jordan@21:1/5 to All on Mon Nov 4 17:03:36 2024
    Followup on this. I'm looking at one of Hein's presentations on RMS
    indexed files, tuning, etc.

    Presuming the system has plenty of memory and, per AUTOGEN, its state
    of tune is pretty close to what AUTOGEN wants, is there any downside
    to setting a global buffer count on the large indexed data files
    involved in this issue (the ones that show extended 'busy' channels in
    the system analyzer)? Can it cause any problems that would impact
    production?
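
    (What I have in mind is just the standard per-file setting; the count
    and file name below are placeholders, not a recommendation:)

      $! Assign RMS global buffers to one of the big indexed files
      $ SET FILE/GLOBAL_BUFFERS=500 DKA100:[DATA]BIGFILE.IDX
      $! DIRECTORY/FULL shows the "Global buffer count" now in effect
      $ DIRECTORY/FULL DKA100:[DATA]BIGFILE.IDX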

    We already tested setting a modest process RMS buffer count for
    indexed files on the accounts used for batch operations, and that
    seems to give a modest improvement in runtime and a significant
    reduction in direct I/Os: it saved 3-4 minutes on a 32-34 minute
    runtime, and DIOs dropped from ~5.1 million to ~4.3 million.
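
    (That test was simply the per-process RMS default; the count shown is
    illustrative, not the value we settled on:)

      $! In the batch account's LOGIN.COM: raise the default RMS buffer
      $! count for indexed files for this process
      $ SET RMS_DEFAULT/INDEXED/BUFFER_COUNT=32
      $ SHOW RMS_DEFAULT      ! confirm the multi-buffer counts in effect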

    Unfortunately we still had two jobs run long: one went over 7 hours,
    so they killed it; the other took about 4.5 hours but with the same
    reduced ~4.3M DIO count. So the change helped in general but made no
    difference to the problem. I don't expect the global buffers to fix
    the problem either, but it's worth testing for performance reasons.

    Thanks

  • From abrsvc@21:1/5 to All on Tue Nov 5 12:22:31 2024
    Note: global buffers can be an advantage, but they are not used when
    dealing with duplicate secondary keys; those are handled in local
    buffers. I have seen drastic differences in performance when changing
    bucket sizes, more so with secondary keys that have many duplicates
    than with primary keyed access. Hein has some tools that analyze the
    statistics of indexed files and report the number of I/Os per
    operation. High values there can indicate inefficient use of buckets,
    or buckets that are too small, forcing more I/Os to retrieve them.
    Increasing the bucket size can significantly reduce I/Os, resulting in
    better overall stats.
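
    (Even without Hein's tools, an ANALYZE pass gives a first look at that
    sort of thing; the file name is a placeholder:)

      $! Per-key statistics: bucket fill, index depth, duplicates, etc.
      $ ANALYZE/RMS_FILE/STATISTICS/OUTPUT=BIGFILE_STATS.LIS BIGFILE.IDX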

    This won't directly address the reported slowdown, but might be a
    trigger for it depending upon data locality.

    Dan

  • From Volker Halle@21:1/5 to All on Tue Nov 5 17:32:02 2024
    On 18.10.2024 at 20:26, Richard Jordan wrote:

    > We periodically (sometimes steadily once a week, sometimes more often)
    > see one overnight batch job take much longer than normal to run.
    > Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    > hours. Several images called by that job all run much slower than
    > normal. At the end, the overall CPU and I/O counts are very close
    > between a normal and a long job.

    If the 'overall CPU and I/O counts' are about the same, please
    re-consider my advice to run T4. Look at the disk response times and
    the I/O queue length, and compare a 'good' and a 'slow' run.

    If 'the problem' is somewhere in the disk I/O sub-system, changing RMS
    buffers will only 'muddy the waters'.

    Volker.

  • From Richard Jordan@21:1/5 to abrsvc on Tue Nov 5 13:59:30 2024
    On 11/5/24 6:22 AM, abrsvc wrote:
    > Note: global buffers can be an advantage, but they are not used when
    > dealing with duplicate secondary keys; those are handled in local
    > buffers. I have seen drastic differences in performance when changing
    > bucket sizes, more so with secondary keys that have many duplicates
    > than with primary keyed access. Hein has some tools that analyze the
    > statistics of indexed files and report the number of I/Os per
    > operation. High values there can indicate inefficient use of buckets,
    > or buckets that are too small, forcing more I/Os to retrieve them.
    > Increasing the bucket size can significantly reduce I/Os, resulting
    > in better overall stats.
    >
    > This won't directly address the reported slowdown, but might be a
    > trigger for it depending upon data locality.
    >
    > Dan

    Dan,
    Apparently the name of Hein's tools changed, and I just found the one
    referred to in the presentation. I will try it on backup copies of the
    files (on the backup server) and see what it says.

    We tested doing a plain convert on all of the files involved in this
    situation on the backup server, and that task may be doable one file
    per weekend. But if the tuning apps call for changes that mean doing
    an unload/reload of the file, we're going to have to find out how long
    that takes; backup windows are tight, and except for rare VMS upgrade
    days (or when we moved from the RX3600 to these new servers), downtime
    is very hard to get.

  • From Richard Jordan@21:1/5 to Volker Halle on Tue Nov 5 13:51:20 2024
    On 11/5/24 10:32 AM, Volker Halle wrote:
    > On 18.10.2024 at 20:26, Richard Jordan wrote:
    >
    >> We periodically (sometimes steadily once a week, sometimes more often)
    >> see one overnight batch job take much longer than normal to run.
    >> Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    >> hours. Several images called by that job all run much slower than
    >> normal. At the end, the overall CPU and I/O counts are very close
    >> between a normal and a long job.
    >
    > If the 'overall CPU and I/O counts' are about the same, please
    > re-consider my advice to run T4. Look at the disk response times and
    > the I/O queue length, and compare a 'good' and a 'slow' run.
    >
    > If 'the problem' is somewhere in the disk I/O sub-system, changing RMS
    > buffers will only 'muddy the waters'.
    >
    > Volker.

    Volker,
    We are getting T4 running on the backup server to re-learn it; it's
    been more than 10 years since we played with it on another box.

    I have MONITOR running and have been checking the I/O rates and queue
    lengths during the 30+ minute runs and the multi-hour runs, and the
    only difference there is that the overall I/O rates to the two disks
    are much lower on the long runs than on the normal short ones.

    But we'll try T4 and see what it shows once I'm happy with it on the
    backup server.

    This stuff is interfering with getting the 8.4-2L3 testing done so we
    can upgrade the production server asap.

  • From Volker Halle@21:1/5 to All on Thu Nov 7 11:15:48 2024
    On 05.11.2024 at 20:51, Richard Jordan wrote:
    > On 11/5/24 10:32 AM, Volker Halle wrote:
    >> On 18.10.2024 at 20:26, Richard Jordan wrote:
    >> ...
    >
    > I have MONITOR running and have been checking the I/O rates and queue
    > lengths during the 30+ minute runs and the multi-hour runs, and the
    > only difference there is that the overall I/O rates to the two disks
    > are much lower on the long runs than on the normal short ones.

    Rich,

    Did you consider running some disk-I/O benchmarking tool? On the two
    disks sometimes affected by the problem? And on other disks on this
    RAID controller?

    This could provide some baseline achievable I/O rates and response
    times. You could then run those tests while 'the problem' exists and
    during the 'short runs'.

    If you also see the problem with a standard disk-I/O benchmark,
    considerations about local/global buffers may be less important.

    There is the DISKBLOCK tool on the Freeware CDs, but I also have a more
    current version at: https://eisner.decuserve.org/~halle/#diskblock

    DISKBLOCK has a 'TEST' command to perform disk performance testing
    (read-only and/or read-write).
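
    (Even before setting up DISKBLOCK, a very crude read-rate baseline is
    possible with plain DCL; this is only a rough sketch, the device and
    file names are placeholders, and it measures sequential reads only:)

      $! Time a large sequential read from the suspect disk by copying it
      $! to the null device; compare elapsed times for good and bad periods
      $ SHOW TIME
      $ COPY DKA100:[DATA]BIGFILE.DAT NL:
      $ SHOW TIME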

    Volker.
