(This might be blindingly obvious to most regulars, but I thought
I'd post this, just in case, for some discussion.)

SIMD is not always about vectorizing loops; it can also be used
for tree-shaped reductions (not sure what the canonical name is).

Consider the following problem: you have 128 consecutive bytes and
want to find the minimum value, and you have 512-bit SIMD registers.

You load the values into two 512-bit = 64-byte registers A and B,
into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

You do a SIMD 8-bit "min" operation. The target register C contains
the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ...
min(A<504:511>,B<504:511>). C need not be distinct from A or B.
You then have 64 values.

You then move the upper half of C into register D (which need not
be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
etc. You then do the min operation on half the length of the registers,
giving you 32 values.

And so on, until you have a single value, which is reached in
seven steps.
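The halving scheme above can be sketched behaviorally (plain Python standing in for the SIMD operations; the names `simd_min` and `tree_min` are illustrative, not any real intrinsic):

```python
# Behavioral sketch of the tree ("log-step") reduction described above.
# 128 bytes -> one lane-wise min over the two 64-byte halves, then
# repeated halving of the surviving lanes: seven min steps in total.

def simd_min(a, b):
    """Lane-wise byte min of two equal-length byte vectors."""
    return [min(x, y) for x, y in zip(a, b)]

def tree_min(data):
    """Minimum of 128 bytes via 1 + log2(64) = 7 halving steps."""
    assert len(data) == 128
    a, b = data[:64], data[64:]          # registers A and B
    c = simd_min(a, b)                   # step 1: 64 survivors
    steps = 1
    while len(c) > 1:                    # steps 2..7: halve each time
        half = len(c) // 2
        c = simd_min(c[:half], c[half:])
        steps += 1
    return c[0], steps

vals = [(i * 37 + 11) % 256 for i in range(128)]
result, nsteps = tree_min(vals)
# result == min(vals); nsteps == 7
```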
This kind of thing is, AFAIK, used in high-performance code such
as JSON parsers or regexp matchers. An example (in principle)
can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

Just wondering... is this something that VVM or similar could
also do? Or does this actually require SIMD and the necessary
shift, rotate, or permute instructions?
RISC-V, IIRC, has several reduction operations, including a min that
finds the minimum of all the values in a vector register, so I think
it does not need a tree.
Thomas Koenig <tkoenig@netcologne.de> posted:
(This might be blindingly obvious to most regulars, but I thought
I'd post this, just in case for some discussion)
SIMD is not always about vectorizing loops, they can also be used
for tree-shaped reductions (not sure what the canonical name is).
I use the term "treeification" for this. It is useful in both SW and
HW applications.
Consider the following problem: You have 128 consecutive bytes and
want to find the minimum value, and you have 512-bit SIMD registers.
You load the values into two 512-bit = 64-byte registers A and B, into
positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.
Since B is A+512, and 512 is half of the size, it is akin to how Cooley
and Tukey turned the DFT into the FFT--AND THIS is the key to many
large-scale reduction processes. It also happens to be the way to
vectorize FFTs for even faster performance on machines with vector
capabilities.
[...]
Just wondering... is this something that VVM or similar could
also do? Or does this actually require SIMD and the necessary
shift, rotate, or permute instructions?
I will get back to you on this.
Without the logarithmic attempts, the general word used to describe
these things is "reduction".
VVM has the ability to choose execution width (based on HW resources
and based on data recurrence). In the past I have given examples
where VVM is executing at "width" and then because of a memory
"alias" has to drop back to 1-wide until the pointers cross before
reverting back to full width.
This algorithm (reduction) is a modification to dynamic width control,
where width is constant until the final K iterations and then decreases
by half each iteration thereafter. So, fundamentally, VVM does not have
a problem with reductions "expressed right".

However, the given problem of 512 bits (64 bytes) might not find much
if any speedup, due to initialization, and a potential stutter step
on each DIV-2 iteration.
It might be better to allow the HW to recognize that some instructions
have certain properties and integrate those into VVM recognition, so
that VVM performs a wide calculation; roughly akin to the following:

    for( ... )
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k += 8;
    for( ... )
        global_minimum = MIN( local_minimum[i,i+7] ); i += 8;
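Spelled out as runnable scalar Python (a `min` over an 8-element slice stands in for the hypothetical 8-wide hardware MIN; the 64-byte input size is taken from the example):

```python
# Two-level block-min scheme, modeled in plain Python: an 8-wide MIN
# over each block of 8 bytes, then an 8-wide MIN over the 8 block
# results. The slice-min stands in for the wide hardware operation.

a = [(i * 29 + 3) % 256 for i in range(64)]   # 64 input bytes

local_minimum = [0] * 8
for k in range(0, 64, 8):                     # first loop: 8 lanes at a time
    local_minimum[k >> 3] = min(a[k:k + 8])

global_minimum = min(local_minimum[0:8])      # second loop: one 8-wide MIN
# global_minimum == min(a)
```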
For sizes as small as 512 bits, VVM might not have an advantage. On
the other hand, if HW knew certain things about some instructions, the
top loop might be performed simultaneously with the bottom loop--more
or less like having an adder that performs {8|u64, 16|u32, 32|u16,
64|u8} calculations simultaneously in reduction form {exact for
integer, single rounding for 2^n FPs}, and this wide adder feeds the
second calculation 1 K|u reduction per cycle in a single merged loop.

Needs more thought. {Known problem}
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Single rounding for 2^n FPs
Given that the usual FP arithmetic does not satisfy the associative
law, one cannot transform easily-written reductions like

  double r = 0.0;
  for (...)
    r += a[i];

into any more efficient form (e.g., one with several computation
strands or one that uses a complete tree evaluation). Compilers use
the allowance given by -ffast-math to reassociate the computations and
to vectorize such code. Of course, if you implement FP operations in
general without inexact results, your FP operations are associative,
but I fail to see how that can be achieved in general.
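The non-associativity is easy to exhibit with IEEE 754 doubles (a classic absorption example, not specific to any machine discussed here):

```python
# Double-precision addition is not associative: the same three values
# summed under two different associations give different results,
# because the small term is absorbed by the large-magnitude operand.

a, b, c = 1e16, -1e16, 1.0
left  = (a + b) + c     # exact cancellation first, then + 1.0 -> 1.0
right = a + (b + c)     # -1e16 + 1.0 rounds back to -1e16 -> 0.0
```

This is exactly why a compiler may not reassociate `r += a[i]` into strands or a tree without -ffast-math.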
- anton
--- Synchronet 3.21a-Linux NewsLink 1.2
Anton, I am surprised you have not heard of Quires!!! where the
accumulator is the full exponent and fraction width.
Also, given a Quire, the compiler becomes FREE to reorder arithmetic
terms in a reduction.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Anton, I am surprised you have not heard of Quires!!! where the
accumulator is the full exponent and fraction width.
Also, given a Quire, the compiler becomes FREE to reorder arithmetic
terms in a reduction.
Am I surprised that you did not read all of what I wrote?
Unfortunately not.
Anyway, maybe this time you will read it, and I will spell it out more clearly:
With two 8-wide SIMD DP FP adders and three cycles of latency, you can
use six strands of SIMD addition (48 strands of scalar addition) to
make full use of the SIMD units.
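The strand idea, reduced to its reassociation core (plain Python; the strand count of six is taken from the text, but nothing here models latency or SIMD width -- it only shows the round-robin split into independent accumulation chains):

```python
# "Six strands" accumulation: element i goes to strand i % 6, each
# strand keeps its own accumulator (an independent dependence chain,
# which is what hides the FP-add latency), and the strand totals are
# combined at the end. With exactly-representable values the result
# matches the sequential sum; with general FP data it may differ,
# which is why compilers need -ffast-math to do this transformation.

NSTRANDS = 6
data = [float(i % 97) for i in range(1000)]   # small integers: exact in FP

acc = [0.0] * NSTRANDS
for i, x in enumerate(data):
    acc[i % NSTRANDS] += x

strand_sum = sum(acc)

sequential_sum = 0.0
for x in data:
    sequential_sum += x
```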
How many accumulators or (if you really need to use that term) quires
do you have in your architecture and in your microarchitecture; how
many DP FP additions to one accumulator/quire can be started per
cycle?
And how long does the next bunch of additions then have to
wait before being started?
If the end result cannot compete with SIMD units on the machines where
the programs run, I fear that the accumulator/quire will stay a
feature for a certain niche of users, while another group of users
will ignore it for performance reasons.
- anton
On 12/26/2025 1:57 PM, Thomas Koenig wrote:
(This might be blindingly obvious to most regulars, but I thought
I'd post this, just in case for some discussion)
SIMD is not always about vectorizing loops, they can also be used
for tree-shaped reductions (not sure what the canonical name is).
Consider the following problem: You have 128 consecutive bytes and
want to find the minimum value, and you have 512-bit SIMD registers.
Thomas, this is an excellent "test case", as it brings out at least
two issues. There has been discussion in this thread about the
"reduction" problem. Let me start on the other problem, the one I
call ALU underutilization. It is caused by requiring lots of simple
operations on small data elements. For this example, I assume a
four-wide My 66000.
Let's look at just the first pass. I think the simplest coding would
have the VVM loop consisting of two load instructions, two add
instructions to increment the addresses, and a min instruction.
Letting VVM do its magic, this would generate 4 byte-min operations
at a time (one per ALU), and thus the loop would be executed
64/4 = 16 times. I don't know how your hypothetical SIMD machine
would do this, but it might do all 64 min operations in a single
operation, or perhaps 2. This puts VVM at a substantial performance
disadvantage.
I have a possible suggestion to help this. I don't claim it is the best solution.
The problem stems from using only 8 bits of the 64-bit integer ALU
for each operation, leading to more operations. So one possible
solution would be to add a new instruction modifier that tells the
system that any relevant operations under its mask will do a whole
register's worth of operations, using the size already specified in
the operation. Since the min instruction would already have specified
bytes, with the modification the instruction would do 8 byte-min
operations at once, thus reducing the loop count by a factor of 8.
Of course, this generalizes to half-words and words as well, and to
similar "simple" instructions such as add/subtract, etc. Note that
this already "fits" in the existing 64-bit ALUs, with the addition of
a little logic to suppress carries, etc., to allow the simultaneous
use of all the ALU bits.
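A behavioral model of what such a "whole register" byte-min would compute (Python; the little-endian lane packing and the per-lane loop are illustrative stand-ins for what the ALU would do in parallel -- this shows the semantics, not the gate-level carry-suppression trick):

```python
# Eight u8 lanes packed into one 64-bit word (lane 0 in the low byte),
# and a min that produces all eight lane results in one "instruction".
# The Python per-lane loop models what the ALU would do simultaneously.

def pack8(bytes8):
    """Pack eight u8 values into a u64, lane 0 in the low byte."""
    w = 0
    for i, b in enumerate(bytes8):
        w |= (b & 0xFF) << (8 * i)
    return w

def unpack8(w):
    return [(w >> (8 * i)) & 0xFF for i in range(8)]

def min8u(wa, wb):
    """Lane-wise unsigned byte min of two packed u64 registers."""
    return pack8([min(x, y) for x, y in zip(unpack8(wa), unpack8(wb))])

ra = pack8([5, 200, 7, 0, 255, 13, 13, 128])
rb = pack8([9, 100, 7, 1, 254, 12, 14, 127])
rc = min8u(ra, rb)
# unpack8(rc) == [5, 100, 7, 0, 254, 12, 13, 127]
```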
Comments?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 12/26/2025 1:57 PM, Thomas Koenig wrote:
(This might be blindingly obvious to most regulars, but I thought
I'd post this, just in case for some discussion)
SIMD is not always about vectorizing loops, they can also be used
for tree-shaped reductions (not sure what the canonical name is).
Consider the following problem: You have 128 consecutive bytes and
want to find the minimum value, and you have 512-bit SIMD registers.
Thomas, this is an excellent "test case" as it brings out at least two
issues. There has been discussion in this thread about the "reduction"
problem. Let me start on the other problem, that I call ALU
underutilization. It is caused by requiring lots of simple operations
on small data elements. For this example, I assume a four wide My 66000.
Lets look at just the first pass. I think the simplest coding would
have the VVM loop consisting of two load instructions, two add
instructions to increment the addresses and a min instruction. Letting
VVM do its magic, this would generate 4 byte min operations at a time,
(one per ALU) and thus the loop would be executed 64/4 = 16 times. I
don't know how your hypothetical SIMD machine would do this, but it
might do all 64 min operations in a single operation, or perhaps 2.
This puts VVM at a substantial performance disadvantage.
I have a possible suggestion to help this. I don't claim it is the best
solution.
The problem stems from using only 8 bits of the 64 bit integer ALU for
each operation, leading to more operations. So one possible solution
would be to add a new instruction modifier that tells the system that
any relevant operations under its mask will do the whole register worth
of operations using the size already specified in the operation.
This is exactly what VVM does, BTW. Smaller than register widths are
SIMDed into single "units of work" up to register width and performed
with the carry-chains clipped.
Since the min instruction would already have specified bytes,
It is the memory instruction that specifies data width.
On 12/29/2025 11:59 AM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
[...]
The problem stems from using only 8 bits of the 64 bit integer ALU for
each operation, leading to more operations. So one possible solution
would be to add a new instruction modifier that tells the system that
any relevant operations under its mask will do the whole register worth
of operations using the size already specified in the the operation.
This is exactly what VVM does, BTW. Smaller than register widths are
SIMDed into single "units of work" up to register width and performed
with the carry-chains clipped.
Oh, I didn't realize that VVM already did that. Bravo!
Since the min instruction would already have specified bytes,
It is the memory instruction that specifies data width.
I thought with your latest modifications to the ISA that instructions
like min specified a data width. But using the width specified in the
memory-reference instruction seems fine. I can't think of a useful
case where the two would be different.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 12/29/2025 11:59 AM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
[...]
This is exactly what VVM does, BTW. Smaller than register widths are
SIMDed into single "units of work" up to register width and performed
with the carry-chains clipped.
Oh, I didn't realize that VVM already did that. Bravo!
It is how the HW works::

An integer adder is comprised of eight 9-bit sections. You feed 8 data
bits into each section, and you feed into the 9th bit::

  00 if you want the carry clipped,
  01 if you want the carry propagated,
  11 if you want a carry generated for the next section.

An adder comprised of 9-bit sections is no more gates of delay than
one comprised of 8-bit sections.
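A behavioral model of the sectioned adder (Python; the boolean `clip_carries` flag abstracts the per-section 9th-bit control described above, collapsing the clip/propagate choice into one switch for the whole adder):

```python
# Sectioned 64-bit adder: eight 8-bit sections whose inter-section
# carry can be clipped (eight independent wrapping byte adds, the VVM
# sub-register SIMD case) or propagated (one ordinary 64-bit add).

def sectioned_add(wa, wb, clip_carries):
    result, carry = 0, 0
    for i in range(8):                            # section 0 = low byte
        x = (wa >> (8 * i)) & 0xFF
        y = (wb >> (8 * i)) & 0xFF
        s = x + y + carry
        result |= (s & 0xFF) << (8 * i)
        carry = 0 if clip_carries else (s >> 8)   # the 9th-bit control
    return result

a, b = 0x00FF_00FF_00FF_00FF, 0x0001_0001_0001_0001
wide  = sectioned_add(a, b, clip_carries=False)   # ordinary 64-bit add
lanes = sectioned_add(a, b, clip_carries=True)    # eight wrapping byte adds
# wide  == 0x0100_0100_0100_0100
# lanes == 0: in each active lane, 0xFF + 0x01 wraps to 0x00 and the
# carry is discarded instead of rippling into the next byte.
```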
Since the min instruction would already have specified bytes,
It is the memory instruction that specifies data width.
I thought with your latest modifications to the ISA that instructions
like min specified a data width. But using the width specified in the
memory reference instruction seems fine. I can't think of a useful case
where the two would be different.
Whereas the current ISA does have size/calculation, I have not had
the time to go back and examine VVM to the extent necessary to make
any statements on how VVM would work with these.
Note also that this works well for min (and max); but for add, etc.,
if the data is unsigned, only if you don't care where in the sequence
an overflow occurred, and if the data is signed, only if you don't
even care whether overflow occurred.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Note also that this works well for min (and max); but for add, etc.,
if the data is unsigned, only if you don't care where in the sequence
an overflow occurred, and if the data is signed, only if you don't
even care whether overflow occurred.
Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.
If you have a sticky overflow bit (Power has something in that
direction), or if you trap on overflow (MIPS and Alpha have such
instructions for signed addition), you will certainly notice if an
overflow occurred for signed additions for the particular evaluation
used on the machine. If there are evaluations with and without
overflow, I think that those without are preferable (although probably
not enough to pay the performance cost for ensuring that).

Which programming language are you thinking of where reduction uses
some addition with overflow detection?
On 1/1/2026 2:21 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Note also that this works well for min (and max); but for add, etc.,
if the data is unsigned, only if you don't care where in the sequence
an overflow occurred, and if the data is signed, only if you don't
even care whether overflow occurred.
Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.
Yes. You have given a more precise mathematical statement of what I
said above.
If you have a sticky overflow bit (Power has something in that
direction), or if you trap on overflow (MIPS and Alpha have such instructions for signed addition), you will certainly notice if an
overflow occurred for signed additions for the particular evaluation
used on the machine. If there are evaluations with and without
overflow, I think that those without are preferable (although probably
not enough to pay the performance cost for ensuring that).
Which programming language are you thinking of where reduction uses
some addition with overflow detection?
I wasn't thinking of any programming language in particular. I was
trying to clarify my "astonishment" and work through the details of
what Mitch said earlier about SIMDifying the calculations in the
context of VVM. I was trying to understand exactly what the hardware
was doing in the various cases.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
I was trying to clarify my "astonishment" and work through the
details of what Mitch said earlier about SIMDifying the calculations
in the context of VVM. I was trying to understand exactly what the
hardware was doing in the various cases.
I don't know how My 66000 deals with overflow detection. I assume
that My 66000 has instructions that add modulo 2^n. If the code uses
these instructions (as code compiled from typical C, C++, or Java
code does), this is no obstacle to reassociation and to vectorizing
the reduction.
A bigger problem may be that in C (and a number of other languages)
all shorter integers are implicitly converted to int before any
operation, and the user is likely to use at least int, but maybe even
long for the accumulator in the reduction.

- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
However, last week someone suggested a 3-input addition, which I gave
some thought--exactly what is the overflow condition of an ADD3 when
2 operands have one sign bit and the other has the opposite sign bit??

In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>

Is there a similar 3-input equation??
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
However, last week someone suggested a 3-input addition, which I gave
some thought--exactly what is the overflow condition of an ADD3 when
2 operands have one sign bit and the other has the opposite sign bit??
In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
Is there a similar 3-input equation ??
I doubt it. If you really care for overflow, just produce a 66-bit
result, and check if it is in the range [-2^63,2^63).
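The wide-result check can be modeled directly with Python's unbounded integers (the function name `add3_overflows` is illustrative; the 66-bit bound comes from |sum| <= 3 * 2^63 < 2^65):

```python
# Anton's check: compute the exact 3-input sum (at most a 66-bit
# signed quantity) and flag overflow when it falls outside the
# signed 64-bit range [-2**63, 2**63).

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def add3_overflows(x, y, z):
    """True iff x + y + z does not fit in a signed 64-bit result."""
    s = x + y + z                     # exact; 66 bits always suffice
    return not (INT64_MIN <= s <= INT64_MAX)

# Mitch's mixed-sign case: opposite signs can cancel, so the final
# sum need not overflow even when a partial sum would.
no_ovf_mixed = add3_overflows(INT64_MAX, INT64_MAX, INT64_MIN)   # False
ovf_simple   = add3_overflows(INT64_MAX, 1, 0)                   # True
```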
- anton
However, last week someone suggested a 3-input addition, which I gave
some thought--Exactly what is the overflow condition of a ADD3 when 2-operands have one sign bit and the other has the opposite sign bit ??
In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
Is there a similar 3-input equation ??
How 'bout sign-extending your 64 bits to 65 bits, then doing your ADD3
and then making sure the top 2 bits of the result are equal?
Stefan