Paul Brown
2003-10-06 15:50:12 UTC
I am confused - I have gone through a two-day exercise tuning up an
FFT routine, and at the last stage I find that, where the biggest gains
were expected, all my conventional understanding of efficient C seems
to be turned on its head.
Up until this point I have achieved good speed improvements following
an evolutionary approach: starting from the routine entry point, I
modify the code, then run, validate and time a test case, and follow a
steepest-descent optimisation approach.
When I started, with a Cooley-Tukey routine, I was at 2.7 ms per
1024-point complex FFT (1 GHz Celeron, Win2000). I am still using a
baby C system (Turbo C), but I would be surprised if this is the cause
of the problems.
I had reduced this time to 0.6 ms per FFT call when I reached the
heart of the code, the innermost of three loops, where the correlation
between the complex exponentials and the data occurs:
/* radix-2 butterfly on the interleaved (re, im) pairs at ixj and ixData */
tempr = wr * data[ixj]     - wi * data[ixj + 1];
tempi = wr * data[ixj + 1] + wi * data[ixj];
data[ixj]         = data[ixData]     - tempr;
data[ixj + 1]     = data[ixData + 1] - tempi;
data[ixData]     += tempr;
data[ixData + 1] += tempi;
This piece of code is executed 5120 times (i.e. (N/2)*log2(N) for
N = 1024), so there is a lot of opportunity to improve things. I have
removed the superfluous array indexing and changed to full 64-bit
(double) arithmetic, which has proved faster than 32-bit in previous
tweaking, as follows:
#if defined(FFT_DOUBLE)
#  define srcPtr0_ doublePtr_1__
#  define srcPtr1_ doublePtr_2__
#  define tgtPtr0_ doublePtr_3__
#  define tgtPtr1_ doublePtr_4__
#endif
#if defined(FFT_FLOAT)
#  define srcPtr0_ floatPtr_1__
#  define srcPtr1_ floatPtr_2__
#  define tgtPtr0_ floatPtr_3__
#  define tgtPtr1_ floatPtr_4__
#endif
/* srcPtr0_/srcPtr1_ -> real/imag at ixj, tgtPtr0_/tgtPtr1_ -> real/imag at ixData */
srcPtr1_ = srcPtr0_ = &(data[ixj]);    ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &(data[ixData]); ++tgtPtr1_;
curReal = wr * *srcPtr0_ - wi * *srcPtr1_;
curImag = wr * *srcPtr1_ + wi * *srcPtr0_;
*srcPtr0_  = *tgtPtr0_ - curReal;
*srcPtr1_  = *tgtPtr1_ - curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;
#undef srcPtr0_
#undef srcPtr1_
#undef tgtPtr0_
#undef tgtPtr1_
However, I now find the time per call has gone up significantly.
Using the FFT_FLOAT option above, the call time increased by 39%
compared with the original code. Using the FFT_DOUBLE option it is a
bit quicker, but still 20% longer. So by rights I should just use the
original code with all variables changed to doubles, but I am loath to
do so just yet, until I understand why the current bottleneck exists.
Why should accessing the vectorised data through pointers be so much
slower than indexing it? Surely Turbo C cannot have got pointer
dereferencing this badly wrong?
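To separate the two code shapes from the rest of the routine, a standalone comparison along the following lines should help. This is only a minimal sketch - the harness, the butterfly_indexed/butterfly_pointers names and the N/REPS values are illustrative, not the real routine - but it times the indexed and pointer forms of the same butterfly side by side:

/* Minimal, illustrative harness (names and sizes are mine, not the FFT
   itself): time the same butterfly written with array indexing and with
   pointers, so the two code shapes can be compared in isolation. */
#include <stdio.h>
#include <time.h>

#define N    2048       /* 1024 complex points, interleaved re/im */
#define REPS 5000L

static double data[N];
static double wr = 0.7071067811865476, wi = -0.7071067811865476;

static void butterfly_indexed(int ixj, int ixData)
{
    double tempr = wr * data[ixj]     - wi * data[ixj + 1];
    double tempi = wr * data[ixj + 1] + wi * data[ixj];
    data[ixj]         = data[ixData]     - tempr;
    data[ixj + 1]     = data[ixData + 1] - tempi;
    data[ixData]     += tempr;
    data[ixData + 1] += tempi;
}

static void butterfly_pointers(int ixj, int ixData)
{
    double *s = &data[ixj], *t = &data[ixData];
    double tempr = wr * s[0] - wi * s[1];
    double tempi = wr * s[1] + wi * s[0];
    s[0]  = t[0] - tempr;
    s[1]  = t[1] - tempi;
    t[0] += tempr;
    t[1] += tempi;
}

int main(void)
{
    clock_t t0, t1;
    long r;
    int i;

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N / 2; i += 2)
            butterfly_indexed(i, i + N / 2);
    t1 = clock();
    printf("indexed : %ld ticks\n", (long)(t1 - t0));

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N / 2; i += 2)
            butterfly_pointers(i, i + N / 2);
    t1 = clock();
    printf("pointers: %ld ticks\n", (long)(t1 - t0));
    return 0;
}

If the pointer version still loses in a harness like this, the cost is presumably in the code Turbo C generates for the extra pointer loads and stores rather than in the FFT bookkeeping around it.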
Some other ancillary points that I would be grateful for feedback on:
I have written a software timer that squeezes better resolution out of
clock() than the normally available CLK_TCK tick allows (CLK_TCK is
nominally 18.2 ticks per second, i.e. roughly 55 ms per tick).
I have timed some runs against a stopwatch, and this is what I got:
/* Standard value: nominally 18.2 clock() ticks per second (about 55 ms per tick) */
#undef  CLK_TCK
/* Measured over 45 s on a 1 GHz Celeron: milliseconds per clock() tick */
#define CLK_TCK 56.12
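The calibration itself is nothing more elaborate than counting clock() ticks against a hand-held stopwatch. A rough sketch of the idea (the prompts and the 45 s interval here are illustrative, not the exact program used):

/* Rough calibration sketch: count clock() ticks over a hand-timed interval
   and report the measured milliseconds per tick (illustrative only). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start, stop;
    double stopwatch_ms = 45000.0;   /* interval timed by hand, in ms */

    printf("press Enter and start the stopwatch at the same moment\n");
    getchar();
    start = clock();

    printf("press Enter again when the stopwatch reads 45 s\n");
    getchar();
    stop = clock();

    printf("%ld ticks in %.0f ms -> %.2f ms per tick\n",
           (long)(stop - start), stopwatch_ms,
           stopwatch_ms / (double)(stop - start));
    return 0;
}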
On an 850 MHz Athlon (Win-98) system I can get 600 ns resolution;
however, with all the system overheads in Win-2k, even using a 1 GHz
Celeron, I only achieve around 1 ms resolution.
Is there any easy way to disable or reduce a lot of the Win-2k system
workload to get more accurate timing? I remember reading, about 5
years ago, about a "software stopwatch" that could read a far
higher-resolution CPU timer - does anyone have a reference to it?
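For what it is worth, a minimal sketch of such a stopwatch using the Win32 high-resolution counter (QueryPerformanceCounter / QueryPerformanceFrequency) would look like the following - assuming a compiler that ships the Win32 headers, which my DOS-era Turbo C may well not:

/* Sketch of a high-resolution "software stopwatch" using the Win32
   performance counter (assumes a compiler with the Win32 headers). */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    volatile long sink = 0;
    long i;
    double elapsed_us;

    QueryPerformanceFrequency(&freq);     /* counter ticks per second */
    QueryPerformanceCounter(&start);

    for (i = 0; i < 1000000L; i++)        /* the work being timed */
        sink += i;

    QueryPerformanceCounter(&stop);
    elapsed_us = (double)(stop.QuadPart - start.QuadPart)
                 * 1.0e6 / (double)freq.QuadPart;
    printf("elapsed: %.3f us (sink = %ld)\n", elapsed_us, sink);
    return 0;
}

On Pentium-class machines the underlying counter is typically far finer than 1 ms, so this should sidestep the clock() resolution problem; the RDTSC instruction is the other route I have seen mentioned.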
These are typical of the sort of results I am getting:
FFT_NRC starting to execute 5000 loops
FFT_NRC time = 934456 ±88(3.5s) ns
FFT_NRC_prev starting to execute 5000 loops
FFT_NRC_prev time = 776079 ±88(3.5s) ns
FFT_NRC time change : FFT_NRC_Prev = 158.377 ± 0.124(3.5s) µs (20.41%)
FFT_NRC_prev starting to execute 1000 loops
FFT_NRC_prev time = 777115 ±451(3.5s) ns
FFT_NRC starting to execute 1000 loops
FFT_NRC time = 935104 ±451(3.5s) ns
Thanks,
Paul.