• RX2800 sporadic disk I/O slowdowns

    From Richard Jordan@21:1/5 to All on Fri Oct 18 13:26:53 2024
    RX2800 i4 server, 64GB RAM, 4 processors, P410i controller with 10 each
    2TB disks in RAID 6, broken down into volumes.

    We periodically (sometimes steadily once a week, sometimes more often)
    see one overnight batch job take much longer than normal to run.
    Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    hours. Several images called by that job all run much slower than
    normal. At the end, the overall CPU and I/O counts are very close
    between a normal and a long job.

    The data files are very large indexed files. Records are read and
    updated but not added in this job; output is just tabulated reports.

    We've run MONITOR (all classes, plus disk) and also built polling
    snapshot jobs that check for locked/busy files and other active batch
    jobs, and we've auto-checked through the system analyzer (SDA) for any
    other processes accessing the busy files at the same time as the
    problem batch. Two data files show long busy periods, but we do not
    see any other process with channels to those files at the same time,
    except for backup (see next).
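
    (The snapshot jobs amount to roughly this kind of DCL loop; the device
    name and interval here are placeholders, not the exact production
    procedure:)

      $! Poll which processes hold channels to files on the busy data disk
      $ LOOP:
      $   SHOW TIME
      $   SHOW DEVICE/FILES DKA100:      ! list open files and their owners
      $   SHOW SYSTEM/BATCH              ! what other batch jobs are active
      $   WAIT 00:05:00                  ! sleep 5 minutes, then repeat
      $   GOTO LOOP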

    The backups start at the same time, but do not get to the data disks
    until well after the problem job normally completes. That does cause
    concurrent access to the problem files, but it occurs only when the
    job has already run long, so it is not the cause. Overall backup time
    is about the same regardless of how long the problem batch takes.

    MONITOR during a long run shows average and peak I/O rates to the
    disks with the busy files at about half of what they are for normal
    runs. We can see that in the process snapshots too; the direct I/O
    count on a slow run increases much more slowly than on a normal run,
    but both normal and long runs end up with close to the same CPU time
    and total I/Os.
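
    (For the record, the disk watching is basically plain MONITOR along
    these lines; the interval and recording file name are arbitrary
    choices, not anything special:)

      $! Sample I/O rate and queue length on all disks every 10 seconds
      $! and record them for later playback
      $ MONITOR DISK/ITEM=ALL/INTERVAL=10/RECORD=BATCH_DISK.DAT
      $! After the run, summarize the recording
      $ MONITOR DISK/ITEM=ALL/INPUT=BATCH_DISK.DAT/SUMMARY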

    Other jobs in monitor are somewhat slowed down but nowhere near as much
    (and they do much less access).

    Before anyone asks, the indexed files could probably use a
    cleanup/rebuild, but if that's the cause, would we see periodic
    performance issues? I would expect them to be constant.

    There is a backup server available, so I'm going to restore backups of
    the two problem files to it and do rebuilds to see how long it takes;
    that will determine how/when we can do it on the production server.
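
    (The rebuild test will be something like the following; the file names
    are placeholders, and the FDL would get tuned before a real production
    convert:)

      $! Generate an FDL describing the current file, then rebuild with CONVERT
      $ ANALYZE/RMS_FILE/FDL BIGFILE.IDX          ! writes BIGFILE.FDL
      $ CONVERT/FDL=BIGFILE.FDL/STATISTICS BIGFILE.IDX BIGFILE_NEW.IDX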



    So something is apparently causing the job to be I/O constrained, but
    so far we can't find it. The concurrent processes are the same, and
    other jobs don't appear to be slowed down much (but they may be much
    less I/O-sensitive or may use data on other disks; I've thrown that
    question to the devs).

    Is there anything in the background below VMS that could cause this?
    The controller doing drive checks or other maintenance activities?

    Thanks for any ideas.

  • From Craig A. Berry@21:1/5 to Richard Jordan on Fri Oct 18 17:09:44 2024
    On 10/18/24 1:26 PM, Richard Jordan wrote:
    > MONITOR during a long run shows average and peak I/O rates to the
    > disks with the busy files at about half of what they are for normal
    > runs.

    That is exactly what happens when the cache battery on a RAID controller
    dies. Maybe yours is half-dead and sometimes takes a charge and
    sometimes doesn't? MSA$UTIL should show the status of your P410.
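
    (From memory, and treat the exact commands inside the utility as an
    assumption since I don't have a P410 box in front of me, checking it
    looks roughly like:)

      $ RUN SYS$SYSTEM:MSA$UTIL
      MSA> SHOW CONTROLLER      ! controller status, including cache/battery
      MSA> EXIT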

  • From Lawrence D'Oliveiro@21:1/5 to Craig A. Berry on Sat Oct 19 00:07:21 2024
    On Fri, 18 Oct 2024 17:09:44 -0500, Craig A. Berry wrote:

    > That is exactly what happens when the cache battery on a RAID
    > controller dies.

    I hate hardware RAID. Has VMS still not got any equivalent to mdraid?

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 20:39:33 2024
    On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>
    >>> Has VMS still not got any equivalent to mdraid?
    >>
    >> VMS got volume shadowing in 1986 I believe.
    >
    > Relevance being?

    It is OS provided software RAID.

    Isn't that what you are asking for?

    Arne

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 19 00:56:51 2024
    On Fri, 18 Oct 2024 20:39:33 -0400, Arne Vajhøj wrote:
    > On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    >> On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >>> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>>
    >>>> Has VMS still not got any equivalent to mdraid?
    >>>
    >>> VMS got volume shadowing in 1986 I believe.
    >>
    >> Relevance being?
    >
    > It is OS provided software RAID.

    Does “volume shadowing” mean just RAID 1?

    <https://manpages.debian.org/8/mdadm.8.en.html>

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 21:12:30 2024
    On 10/18/2024 8:56 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 20:39:33 -0400, Arne Vajhøj wrote:
    >> On 10/18/2024 8:35 PM, Lawrence D'Oliveiro wrote:
    >>> On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    >>>> On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>>>> Has VMS still not got any equivalent to mdraid?
    >>>>
    >>>> VMS got volume shadowing in 1986 I believe.
    >>>
    >>> Relevance being?
    >>
    >> It is OS provided software RAID.
    >
    > Does “volume shadowing” mean just RAID 1?

    I believe so.

    RAID 0, 5, 6 and 10 require a RAID controller.
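
    (For completeness, a host-based shadow set - the software RAID 1 - is
    set up with a MOUNT along these lines; the device names and label are
    made up, and it requires the volume-shadowing license:)

      $! Mount a two-member shadow set; DSA1: is the virtual unit
      $ MOUNT/SYSTEM DSA1: /SHADOW=(DKA100:,DKB100:) DATADISK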

    Arne

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 19 00:35:43 2024
    On Fri, 18 Oct 2024 20:22:23 -0400, Arne Vajhøj wrote:
    > On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    >>
    >> Has VMS still not got any equivalent to mdraid?
    >
    > VMS got volume shadowing in 1986 I believe.

    Relevance being?

  • From =?UTF-8?Q?Arne_Vajh=C3=B8j?=@21:1/5 to Lawrence D'Oliveiro on Fri Oct 18 20:22:23 2024
    On 10/18/2024 8:07 PM, Lawrence D'Oliveiro wrote:
    > On Fri, 18 Oct 2024 17:09:44 -0500, Craig A. Berry wrote:
    >> That is exactly what happens when the cache battery on a RAID
    >> controller dies.
    >
    > I hate hardware RAID. Has VMS still not got any equivalent to mdraid?

    ????

    VMS got volume shadowing in 1986 I believe.

    Arne

  • From Volker Halle@21:1/5 to All on Sat Oct 19 09:02:57 2024
    Rich,

    this would be a perfect opportunity to run T4 - and look at the disk
    response times.

    Volker.

  • From Richard Jordan@21:1/5 to All on Mon Nov 4 17:03:36 2024
    Followup on this. I'm looking at one of Hein's presentations on RMS
    indexed files, tuning, etc.

    Presuming the system has plenty of memory and, per AUTOGEN, its state
    of tune is pretty close to what AUTOGEN wants, is there any downside
    to setting a global buffer count on the large indexed data files
    involved in this issue (the ones that show extended 'busy' channels in
    the system analyzer)? Can it cause any problems that would impact
    production?
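
    (What I have in mind is just the standard per-file setting; the count
    and file name below are placeholders, not a recommendation:)

      $! Assign RMS global buffers to one of the big indexed files
      $ SET FILE/GLOBAL_BUFFERS=500 DKA100:[DATA]BIGFILE.IDX
      $! DIRECTORY/FULL shows the "Global buffer count" now in effect
      $ DIRECTORY/FULL DKA100:[DATA]BIGFILE.IDX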

    We already tested setting a modest process RMS buffer count for
    indexed files on the accounts used for batch operations, and that
    seems to give a modest improvement in runtime and a significant
    reduction in direct I/Os: it saved 3-4 minutes on a 32-34 minute
    runtime, and DIOs dropped from ~5.1 million to ~4.3 million.
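
    (That test was simply the per-process RMS default; the count shown is
    illustrative, not the value we settled on:)

      $! In the batch account's LOGIN.COM: raise the default RMS buffer
      $! count for indexed files for this process
      $ SET RMS_DEFAULT/INDEXED/BUFFER_COUNT=32
      $ SHOW RMS_DEFAULT      ! confirm the multi-buffer counts in effect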

    Unfortunately we still had two jobs run long: one went over 7 hours,
    so they killed it; the other took about 4.5 hours but with the same
    reduced ~4.3M DIO count. So the change helped in general but made no
    difference to the problem. I don't expect the global buffers to fix
    the problem either, but it's worth testing for performance reasons.

    Thanks

  • From abrsvc@21:1/5 to All on Tue Nov 5 12:22:31 2024
    Note: global buffers can be an advantage, but they are not used when
    dealing with duplicate secondary keys; those are handled in local
    buffers. I have seen drastic differences in performance when changing
    bucket sizes, more so with secondary keys that have many duplicates
    than with primary keyed access. Hein has some tools that analyze the
    statistics of indexed files and report the number of I/Os per
    operation. High values there can indicate inefficient use of buckets,
    or buckets that are too small, forcing more I/Os to retrieve them.
    Increasing the bucket size can significantly reduce I/Os, resulting in
    better overall stats.
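
    (Even without Hein's tools, an ANALYZE pass gives a first look at that
    sort of thing; the file name is a placeholder:)

      $! Per-key statistics: bucket fill, index depth, duplicates, etc.
      $ ANALYZE/RMS_FILE/STATISTICS/OUTPUT=BIGFILE_STATS.LIS BIGFILE.IDX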

    This won't directly address the reported slowdown, but might be a
    trigger for it depending upon data locality.

    Dan

  • From Volker Halle@21:1/5 to All on Tue Nov 5 17:32:02 2024
    On 18.10.2024 at 20:26, Richard Jordan wrote:

    > We periodically (sometimes steadily once a week, sometimes more often)
    > see one overnight batch job take much longer than normal to run.
    > Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    > hours. Several images called by that job all run much slower than
    > normal. At the end, the overall CPU and I/O counts are very close
    > between a normal and a long job.

    If the 'overall CPU and I/O counts' are about the same, please
    re-consider my advice to run T4. Look at the disk response times and
    the I/O queue length, and compare a 'good' and a 'slow' run.

    If 'the problem' is somewhere in the disk I/O sub-system, changing RMS
    buffers will only 'muddy the waters'.

    Volker.

  • From Richard Jordan@21:1/5 to abrsvc on Tue Nov 5 13:59:30 2024
    On 11/5/24 6:22 AM, abrsvc wrote:
    > Note: global buffers can be an advantage, but they are not used when
    > dealing with duplicate secondary keys; those are handled in local
    > buffers. I have seen drastic differences in performance when changing
    > bucket sizes, more so with secondary keys that have many duplicates
    > than with primary keyed access. Hein has some tools that analyze the
    > statistics of indexed files and report the number of I/Os per
    > operation. High values there can indicate inefficient use of buckets,
    > or buckets that are too small, forcing more I/Os to retrieve them.
    > Increasing the bucket size can significantly reduce I/Os, resulting
    > in better overall stats.
    >
    > This won't directly address the reported slowdown, but might be a
    > trigger for it depending upon data locality.
    >
    > Dan

    Dan,
    Apparently the name of Hein's tools changed, and I just found the one
    referred to in the presentation. I will try it on backup copies of the
    files (on the backup server) and see what it says.

    We tested doing a plain convert on all of the files involved in this
    situation on the backup server, and that task may be doable one file
    per weekend. But if the tuning apps call for changes that mean doing
    an unload/reload of the file, we're going to have to find out how long
    that takes; backup windows are tight, and except for rare VMS upgrade
    days (or when we moved from the RX3600 to these new servers), downtime
    is very hard to get.

  • From Richard Jordan@21:1/5 to Volker Halle on Tue Nov 5 13:51:20 2024
    On 11/5/24 10:32 AM, Volker Halle wrote:
    > On 18.10.2024 at 20:26, Richard Jordan wrote:
    >
    >> We periodically (sometimes steadily once a week, sometimes more often)
    >> see one overnight batch job take much longer than normal to run.
    >> Normal runtime is about 30-35 minutes; a long run takes 4.5 - 6.5
    >> hours. Several images called by that job all run much slower than
    >> normal. At the end, the overall CPU and I/O counts are very close
    >> between a normal and a long job.
    >
    > If the 'overall CPU and I/O counts' are about the same, please
    > re-consider my advice to run T4. Look at the disk response times and
    > the I/O queue length, and compare a 'good' and a 'slow' run.
    >
    > If 'the problem' is somewhere in the disk I/O sub-system, changing RMS
    > buffers will only 'muddy the waters'.
    >
    > Volker.

    Volker,
    We are getting T4 running on the backup server to re-learn it; it's
    been more than 10 years since we played with it on another box.

    I have MONITOR running and have been checking the I/O rates and queue
    lengths during the 30+ minute runs and the multi-hour runs, and the
    only difference there is that the overall I/O rates to the two disks
    are much lower on the long runs than on the normal short ones.

    But we'll try T4 and see what it shows once I'm happy with it on the
    backup server.

    This stuff is interfering with getting the 8.4-2L3 testing done so we
    can upgrade the production server asap.

  • From Volker Halle@21:1/5 to All on Thu Nov 7 11:15:48 2024
    On 05.11.2024 at 20:51, Richard Jordan wrote:
    > On 11/5/24 10:32 AM, Volker Halle wrote:
    >> On 18.10.2024 at 20:26, Richard Jordan wrote:
    >> ...
    >
    > I have MONITOR running and have been checking the I/O rates and queue
    > lengths during the 30+ minute runs and the multi-hour runs, and the
    > only difference there is that the overall I/O rates to the two disks
    > are much lower on the long runs than on the normal short ones.

    Rich,

    Did you consider running some disk-I/O benchmarking tool? On the two
    disks sometimes affected by the problem? And on other disks on this
    RAID controller?

    This could provide some baseline achievable I/O rates and response
    times. You could then run those tests while 'the problem' exists and
    during the 'short runs'.

    If you also see the problem with a standard disk-I/O benchmark,
    considerations about local/global buffers may be less important.

    There is the DISKBLOCK tool on the Freeware CDs, but I also have a more
    current version at: https://eisner.decuserve.org/~halle/#diskblock

    DISKBLOCK has a 'TEST' command to perform disk performance testing
    (read-only and/or read-write).
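
    (Even before setting up DISKBLOCK, a very crude read-rate baseline is
    possible with plain DCL; this is only a rough sketch, the device and
    file names are placeholders, and it measures sequential reads only:)

      $! Time a large sequential read from the suspect disk by copying it
      $! to the null device; compare elapsed times for good and bad periods
      $ SHOW TIME
      $ COPY DKA100:[DATA]BIGFILE.DAT NL:
      $ SHOW TIME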

    Volker.
