Paul Brown
2003-10-06 15:50:12 UTC
I am confused - I have gone through a two-day exercise tuning up an
FFT routine, and at the last stage I find that, where the biggest gains
were expected, all my conventional understanding of efficient C seems
to be turned on its head.
Up until this point I have achieved good speed improvements following
an evolutionary approach: starting from the routine entry point, I
modify the code, then run, validate and time a test case, and follow a
steepest-descent optimisation approach.
When I started, with a Cooley-Tukey routine, I was at 2.7 ms per
1024-point complex FFT (1 GHz Celeron, Win2000). I am still using a
baby C system (Turbo C), but I would be surprised if this is the cause
of the problems.
I had reduced this time to 0.6 ms per FFT call when I reached the
heart of the code, the innermost of three loops, where the correlation
between the complex exponentials and the data occurs:
/* radix-2 butterfly on the interleaved (re, im) pairs at ixj and ixData */
tempr = wr * data[ixj]     - wi * data[ixj + 1];
tempi = wr * data[ixj + 1] + wi * data[ixj];
data[ixj]         = data[ixData]     - tempr;
data[ixj + 1]     = data[ixData + 1] - tempi;
data[ixData]     += tempr;
data[ixData + 1] += tempi;
This piece of code is executed 5120 times (i.e. (N/2)*log2(N) for
N = 1024), so there is a lot of opportunity to improve things. I have
removed the superfluous array indexing and changed to full 64-bit
(double) arithmetic, which has proved faster than 32-bit in previous
tweaking, as follows:
#if defined(FFT_DOUBLE)
#  define srcPtr0_ doublePtr_1__
#  define srcPtr1_ doublePtr_2__
#  define tgtPtr0_ doublePtr_3__
#  define tgtPtr1_ doublePtr_4__
#endif
#if defined(FFT_FLOAT)
#  define srcPtr0_ floatPtr_1__
#  define srcPtr1_ floatPtr_2__
#  define tgtPtr0_ floatPtr_3__
#  define tgtPtr1_ floatPtr_4__
#endif
/* srcPtr0_/srcPtr1_ -> real/imag at ixj, tgtPtr0_/tgtPtr1_ -> real/imag at ixData */
srcPtr1_ = srcPtr0_ = &(data[ixj]);    ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &(data[ixData]); ++tgtPtr1_;
curReal = wr * *srcPtr0_ - wi * *srcPtr1_;
curImag = wr * *srcPtr1_ + wi * *srcPtr0_;
*srcPtr0_  = *tgtPtr0_ - curReal;
*srcPtr1_  = *tgtPtr1_ - curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;
#undef srcPtr0_
#undef srcPtr1_
#undef tgtPtr0_
#undef tgtPtr1_
However, I now find the time per call has gone up significantly.
Using the FFT_FLOAT option above, the call time increased by 39%
compared with the original code. Using the FFT_DOUBLE option it is a
bit quicker, but still 20% longer. So by rights I should just use the
original code with all variables changed to doubles, but I am loath to
do so just yet, until I understand why the current bottleneck exists.
Why should accessing the vectorised data through pointers be so much
slower than indexing it? Surely Turbo C cannot have got pointer
dereferencing this badly wrong?
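To separate the two code shapes from the rest of the routine, a standalone comparison along the following lines should help. This is only a minimal sketch - the harness, the butterfly_indexed/butterfly_pointers names and the N/REPS values are illustrative, not the real routine - but it times the indexed and pointer forms of the same butterfly side by side:

/* Minimal, illustrative harness (names and sizes are mine, not the FFT
   itself): time the same butterfly written with array indexing and with
   pointers, so the two code shapes can be compared in isolation. */
#include <stdio.h>
#include <time.h>

#define N    2048       /* 1024 complex points, interleaved re/im */
#define REPS 5000L

static double data[N];
static double wr = 0.7071067811865476, wi = -0.7071067811865476;

static void butterfly_indexed(int ixj, int ixData)
{
    double tempr = wr * data[ixj]     - wi * data[ixj + 1];
    double tempi = wr * data[ixj + 1] + wi * data[ixj];
    data[ixj]         = data[ixData]     - tempr;
    data[ixj + 1]     = data[ixData + 1] - tempi;
    data[ixData]     += tempr;
    data[ixData + 1] += tempi;
}

static void butterfly_pointers(int ixj, int ixData)
{
    double *s = &data[ixj], *t = &data[ixData];
    double tempr = wr * s[0] - wi * s[1];
    double tempi = wr * s[1] + wi * s[0];
    s[0]  = t[0] - tempr;
    s[1]  = t[1] - tempi;
    t[0] += tempr;
    t[1] += tempi;
}

int main(void)
{
    clock_t t0, t1;
    long r;
    int i;

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N / 2; i += 2)
            butterfly_indexed(i, i + N / 2);
    t1 = clock();
    printf("indexed : %ld ticks\n", (long)(t1 - t0));

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N / 2; i += 2)
            butterfly_pointers(i, i + N / 2);
    t1 = clock();
    printf("pointers: %ld ticks\n", (long)(t1 - t0));
    return 0;
}

If the pointer version still loses in a harness like this, the cost is presumably in the code Turbo C generates for the extra pointer loads and stores rather than in the FFT bookkeeping around it.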
Some other ancillary points that I would be grateful for feedback on:
I have written a software timer that squeezes better resolution out of
clock() than the normally available CLK_TCK tick allows (CLK_TCK is
nominally 18.2 ticks per second, i.e. roughly 55 ms per tick).
I have timed some runs against a stopwatch, and this is what I got:
/* Standard value: nominally 18.2 clock() ticks per second (about 55 ms per tick) */
#undef  CLK_TCK
/* Measured over 45 s on a 1 GHz Celeron: milliseconds per clock() tick */
#define CLK_TCK 56.12
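The calibration itself is nothing more elaborate than counting clock() ticks against a hand-held stopwatch. A rough sketch of the idea (the prompts and the 45 s interval here are illustrative, not the exact program used):

/* Rough calibration sketch: count clock() ticks over a hand-timed interval
   and report the measured milliseconds per tick (illustrative only). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start, stop;
    double stopwatch_ms = 45000.0;   /* interval timed by hand, in ms */

    printf("press Enter and start the stopwatch at the same moment\n");
    getchar();
    start = clock();

    printf("press Enter again when the stopwatch reads 45 s\n");
    getchar();
    stop = clock();

    printf("%ld ticks in %.0f ms -> %.2f ms per tick\n",
           (long)(stop - start), stopwatch_ms,
           stopwatch_ms / (double)(stop - start));
    return 0;
}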
On an 850 MHz Athlon (Win-98) system I can get 600 ns resolution;
however, with all the system overheads in Win-2k, even using a 1 GHz
Celeron, I only achieve around 1 ms resolution.
Is there any easy way to disable or reduce a lot of the Win-2k system
workload to get more accurate timing? I remember reading, about 5
years ago, about a "software stopwatch" that could read a far
higher-resolution CPU timer - does anyone have a reference to it?
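For what it is worth, a minimal sketch of such a stopwatch using the Win32 high-resolution counter (QueryPerformanceCounter / QueryPerformanceFrequency) would look like the following - assuming a compiler that ships the Win32 headers, which my DOS-era Turbo C may well not:

/* Sketch of a high-resolution "software stopwatch" using the Win32
   performance counter (assumes a compiler with the Win32 headers). */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    volatile long sink = 0;
    long i;
    double elapsed_us;

    QueryPerformanceFrequency(&freq);     /* counter ticks per second */
    QueryPerformanceCounter(&start);

    for (i = 0; i < 1000000L; i++)        /* the work being timed */
        sink += i;

    QueryPerformanceCounter(&stop);
    elapsed_us = (double)(stop.QuadPart - start.QuadPart)
                 * 1.0e6 / (double)freq.QuadPart;
    printf("elapsed: %.3f us (sink = %ld)\n", elapsed_us, sink);
    return 0;
}

On Pentium-class machines the underlying counter is typically far finer than 1 ms, so this should sidestep the clock() resolution problem; the RDTSC instruction is the other route I have seen mentioned.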
These are typical of the sort of results I am getting:
FFT_NRC starting to execute 5000 loops
FFT_NRC time = 934456 ±88(3.5s) ns
FFT_NRC_prev starting to execute 5000 loops
FFT_NRC_prev time = 776079 ±88(3.5s) ns
FFT_NRC time change : FFT_NRC_Prev = 158.377 ± 0.124(3.5s) µs (20.41%)
FFT_NRC_prev starting to execute 1000 loops
FFT_NRC_prev time = 777115 ±451(3.5s) ns
FFT_NRC starting to execute 1000 loops
FFT_NRC time = 935104 ±451(3.5s) ns
Thanks,
Paul.