• non-recursive fft

    From MatthewA@matthewaudio@gmail.com to comp.dsp on Fri Nov 22 08:37:32 2019
    From Newsgroup: comp.dsp

    Is there any info on how to implement fft and ifft algorithms that aren't recursive? I'd like to possibly calculate a really long on inside a signal chain so attempting to compute a 2 second fft in one function call would probably be disastrous.

    -Matt
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From spope384@spope384@gmail.com (Steve Pope) to comp.dsp on Fri Nov 22 17:38:30 2019
    From Newsgroup: comp.dsp

    MatthewA <matthewaudio@gmail.com> wrote:

    Is there any info on how to implement fft and ifft algorithms that
    aren't recursive? I'd like to possibly calculate a really long on
    inside a signal chain so attempting to compute a 2 second fft in one
    function call would probably be disastrous.

    They usually are not recursive.

    I think what you're saying though is you want a lower-latency FFT/IFFT.
    There can be a complexity/tradeoff issue but you can by throwing
    compute resources at it make the latency as close to zero as you wish.

    Steve
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MatthewA@matthewaudio@gmail.com to comp.dsp on Fri Nov 22 10:04:15 2019
    From Newsgroup: comp.dsp

    Gosh, Steve... apologies for my naivety. The only implementation I've seen has been recursive. I'm a bit of a noob in this regard. I'm definitely looking to compute these frames with a high degree of latency. The idea here is that we trade immediacy for spectral accuracy because, in my use case, latency isn't a huge problem as long as it stays under a second or two. In my current implementation I use a low priority thread and do it on the graphics processor since it could be up to a few seconds of 44.1k audio.
    The point of the original question is, how can I spread the computation of an FFT out over multiple vector-based performance routine calls. (not sure the exact term for that.)
    ra|
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From spope384@spope384@gmail.com (Steve Pope) to comp.dsp on Fri Nov 22 18:19:28 2019
    From Newsgroup: comp.dsp

    MatthewA <matthewaudio@gmail.com> wrote:

    Gosh, Steve... apologies for my naivety. The only implementation I've
    seen has been recursive. I'm a bit of a noob in this regard.

    No problem.

    In terms of DSP, an FFT is not recursive, it is a FIR (finite impulse
    response) filter, not an IIR (infinite impulse response, recursive) filter.

    In terms of being sometimes a piece of software, sure an FFT would likely
    be "recursive", or at least re-entrant, by building up bigger FFT's
    from smaller ones in a recursion.

    I could have guessed this is what you meant.

    Any, "unrolling" loops and recursion in software can speed things up,
    but not maybe to the extent you need.

    Steve
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Piergiorgio Sartor@piergiorgio.sartor.this.should.not.be.used@nexgo.REMOVETHIS.de to comp.dsp on Fri Nov 22 19:16:18 2019
    From Newsgroup: comp.dsp

    On 22/11/2019 19.04, MatthewA wrote:
    Gosh, Steve... apologies for my naivety. The only implementation I've seen has been recursive. I'm a bit of a noob in this regard.

    Uh?
    Honestly, maybe there is a misunderstanding
    about "recursive".

    I'm definitely looking to compute these frames with a high degree of
    latency. The idea here is that we trade immediacy for spectral accuracy because, in my use case, latency isn't a huge problem as long as it
    stays under a second or two. In my current implementation I use a low priority thread and do it on the graphics processor since it could be up
    to a few seconds of 44.1k audio.

    The point of the original question is, how can I spread the computation of an FFT out over multiple vector-based performance routine calls. (not sure the exact term for that.)

    Usually, at the core of the FFT, there is
    the "butterfly" operation.
    Typically a FFT is a combination / cascade
    of these "butterfly".

    So, depending on what you mean with
    "multiple vector-based performance routine
    calls" (and you should be sure of the exact
    term for that) there could be different
    structures combining the "butterfly".

    For example:

    https://archive.cnx.org/contents/d5a89353-c8ac-428e-8fe3-e7ebf72881b8@6/alternate-fft-structures

    I would use FFTW, anyway...

    bye,
    --

    piergiorgio
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MatthewA@matthewaudio@gmail.com to comp.dsp on Fri Nov 22 10:33:54 2019
    From Newsgroup: comp.dsp

    Ouch!
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Piergiorgio Sartor@piergiorgio.sartor.this.should.not.be.used@nexgo.REMOVETHIS.de to comp.dsp on Fri Nov 22 19:38:47 2019
    From Newsgroup: comp.dsp

    On 22/11/2019 19.33, MatthewA wrote:
    Ouch!

    Ops, sorry, I did not meant to be
    rude or offend, or I misunderstood...

    bye,
    --

    piergiorgio
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Christian Gollwitzer@auriocus@gmx.de to comp.dsp on Sat Nov 23 00:05:05 2019
    From Newsgroup: comp.dsp

    Am 22.11.19 um 19:04 schrieb MatthewA:
    Gosh, Steve... apologies for my naivety. The only implementation I've seen has been recursive. I'm a bit of a noob in this regard. I'm definitely looking to compute these frames with a high degree of latency. The idea here is that we trade immediacy for spectral accuracy because, in my use case, latency isn't a huge problem as long as it stays under a second or two. In my current implementation I use a low priority thread and do it on the graphics processor since it could be up to a few seconds of 44.1k audio.

    The point of the original question is, how can I spread the computation of an FFT out over multiple vector-based performance routine calls. (not sure the exact term for that.)


    Ah, so the "real question" is: How can I efficiently compute an FFT on a parallel vector processor like a GPU?

    Unfortunately, that question is nontrivial, because, as you have
    discovered, there is a lot of dopendency between the input and output
    values. Each input value in a FFT has an influence onto each output
    value, which makes parallel processing hard.

    You have several options, depending on your input length, that might
    speed up things.


    1) Do you really need the full FFT, or just a few single bins? If only a
    few bins are required, a direct convolution can be quite efficient on
    the GPU. You can precompute the sine/cosine for these frequency bins and compute the direct dot product.

    "True" FFT only pays off if you want the full transform. There is even a special algorithm for the computation of single FFT bins, the Goertzel algorithm.

    2) Do you have maybe multiple FFTs which you can process in parallel? In particular, if you compute 2D FFTs, then the FFT runs on each row and
    then on each column, which can all be processed in parallel

    3) There are open source libraries for FFTs on GPUs, e.g. https://github.com/clMathLibraries/clFFT for OpenCL, based on some early examples by Apple. The code is not that straight forward and it requires multiple passes on the host compiler, so your mileage may vary.

    Christian

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MatthewA@matthewaudio@gmail.com to comp.dsp on Fri Nov 22 16:06:47 2019
    From Newsgroup: comp.dsp

    Gosh I had no clue I could be interpreted so many ways. I am very sorry. Let me put it in pseudocode:

    cnt = 0
    maxCnt= 2000

    audioPerform(double *in, double *out){

    //this will interrupt audio!!!!
    if(cnt==0)
    Compute an entire 16777216 bin fft

    cnt =(cnt+1)%maxCnt
    }


    audioPerform(double *in, double *out){

    //this is presumable will not stop audio!!!!
    Compute (16777216/maxCnt) bins of a 16777216 Bin fft.


    }
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Tauno Voipio@tauno.voipio@notused.fi.invalid to comp.dsp on Sat Nov 23 10:04:11 2019
    From Newsgroup: comp.dsp

    On 23.11.19 02:06, MatthewA wrote:
    Gosh I had no clue I could be interpreted so many ways. I am very sorry. Let me put it in pseudocode:

    cnt = 0
    maxCnt= 2000

    audioPerform(double *in, double *out){

    //this will interrupt audio!!!!
    if(cnt==0)
    Compute an entire 16777216 bin fft

    cnt =(cnt+1)%maxCnt
    }


    audioPerform(double *in, double *out){

    //this is presumable will not stop audio!!!!
    Compute (16777216/maxCnt) bins of a 16777216 Bin fft.


    }


    The FFT cannot be partitioned in the way you ask,
    but you can compute one butterfly at a time. It does
    not, however, produce any useful results until the
    last butterfly is computed.
    --

    -TV


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Marcel Mueller@news.5.maazl@spamgourmet.org to comp.dsp on Sat Nov 23 09:32:18 2019
    From Newsgroup: comp.dsp

    Am 23.11.19 um 00:05 schrieb Christian Gollwitzer:
    Am 22.11.19 um 19:04 schrieb MatthewA:
    The point of the original question is, how can I spread the
    computation of an FFT out over multiple vector-based performance
    routine calls.-a (not sure the exact term for that.)

    Ah, so the "real question" is: How can I efficiently compute an FFT on a parallel vector processor like a GPU?

    Unfortunately, that question is nontrivial, because, as you have
    discovered, there is a lot of dopendency between the input and output values. Each input value in a FFT has an influence onto each output
    value, which makes parallel processing hard.

    In fact it is possible and it is implemented e.g. by the hello_fft
    example for the Raspberry Pi. (Even Pi Model 1 is by far fast enough to
    do real time audio processing with FFT on the GPU, e.g a 64 k FIR filter.)

    But you are right, you have to synchronize several times depending on
    the capacity of your single stages. The basic idea is that any FFT can
    be divided into cascaded smaller FFTs of arbitrary size.
    It depends on the actual architecture how many synchronizations are
    required. If for instance your hardware can process a 4 bit FFT (16
    values) in a single shader unit and you want to do a 16 bit FFT (64k)
    then you need 4 stages. After each stage the result is written back to
    memory and each stage processes different tuples of 16 values in a
    shader unit. In the example you need 4 synchronizations to do an 64 k
    FFT, 3 between the stages an the last one when everything is done.

    You should take care of caching issues during processing because the
    memory accesses of some stages aligned at higher powers of 2 might
    impact cache efficiency largely. So basically FFT is mostly memory bound rather than CPU/GPU bound on modern hardware.
    It may simplify things significantly if you can pass either the input or
    the output array in bit reversed order, e.g. store the frequency domain
    values always bit reversed.


    Marcel
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From radams2000@radams2000@gmail.com to comp.dsp on Sat Nov 23 04:48:26 2019
    From Newsgroup: comp.dsp

    In the past I have coded an FFT where you calculate one column per function call (there are log2(fftsize) columns). There is a variant of the standard FFT where the arrows that connect one column to the next are always the same. ItrCOs in Oppenheim and Schaefer if you have it. This would allow you to call the same function (with different twiddle factors passed to the function) on each call. The drawback is that the input and output vectors are in a strange order.
    Alternatively you could call a different function for each column calculation.
    Typically in an audio application you are using overlapping time-domain windows separated by a rCLhoprCY factor. If the hop amount is greater than the number of FFT columns then you might be able to execute one FFT column per input sample, and still keep up with the input data rate.
    Bob
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From theman@theman@ericjacobsen.org (Eric Jacobsen) to comp.dsp on Sat Nov 23 15:28:12 2019
    From Newsgroup: comp.dsp

    On Fri, 22 Nov 2019 08:37:32 -0800 (PST), MatthewA
    <matthewaudio@gmail.com> wrote:

    Is there any info on how to implement fft and ifft algorithms that aren't recursive? I'd like to possibly calculate a really long on inside a signal chain so attempting to compute a 2 second fft in one function call would probably be disastrous.

    -Matt

    Multiple FFTs can be pipelined, since independent processors can work
    on each stage (there are log2(N) stages in an N-point FFT), so that a
    new FFT output is produced after every stage-length computation rather
    than waiting for an entire FFT computation.

    A single, large-N FFT could be computed in parallel by using a DFT
    instead of an FFT. There would be more computational load overall,
    but you can exploit parallelism to potentially reduce latency. For
    an N-point DFT you could have N separate processors compute each
    output, since there is no interdepence between outputs. If you only
    have M processors, then each processor computes N/M outputs.

    Not sure if that's what you're looking for, but thought I'd throw it
    out there.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Greg Berchin@gjberchin@charter.net to comp.dsp on Mon Dec 2 15:54:40 2019
    From Newsgroup: comp.dsp

    On Friday, November 22, 2019 at 12:04:20 PM UTC-6, MatthewA wrote:

    The point of the original question is, how can I spread the computation of an FFT out over multiple vector-based performance routine calls. (not sure the exact term for that.)
    If I understand your question correctly, you want to spread the computation of an N-point DFT across P processors, with each processor computing a unique portion of the transform, such that they can all be combined afterward.
    Going back to the defining equation:
    N-1
    X(k) = SUM {x(n)-+exp(-j2PIkn/N)}, k = 0 ... N
    n=0
    You could have one processor compute, for example, X(k) for k = 0 ... N/P-1, another compute for k = N/P ... 2N/P-1, etc. Search for "pruned FFT".
    Alternatively, you could have one processor compute
    N/P-1
    X0(k) = SUM {x(n)-+exp(-j2PIkn/N)}, k = 0 ... N,
    n=0
    another processor compute
    2N/P-1
    X1(k) = SUM {x(n)-+exp(-j2PIkn/N)}, k = 0 ... N,
    n=N/P
    and so on. It should be fairly easy to devise a fast algorithm for this.
    You could even combine both techniques.
    - Greg
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.dsp on Mon Jan 27 01:53:21 2020
    From Newsgroup: comp.dsp

    On Friday, November 22, 2019 at 8:37:37 AM UTC-8, MatthewA wrote:
    Is there any info on how to implement fft and ifft algorithms
    that aren't recursive? I'd like to possibly calculate a really
    long on inside a signal chain so attempting to compute
    a 2 second fft in one function call would probably be disastrous.

    The FFT is defined through recursion, but rarely implemented that way.

    The recursive definition makes it easy to see where the log(N)
    comes from, though.

    It is rarely most efficient to write recursive algorithms using
    actual recursion, the recursive descent compiler possibly being
    an exception.

    As for FFT, it isn't hard to describe in parallel terms, but the
    array element access pattern complicates the implementation on
    some hardware.

    --- Synchronet 3.22a-Linux NewsLink 1.2