I managed to keep this repo private past the official release date, so to make sure there are still two clear weeks to work on it, I'm pushing the submission date back by two days to:
Wed Nov 11th, 22:00
The following one I will keep on schedule (Monday submission), and will actually just prepare and release it this week. No real point keeping the first four CWs private.
- Overview ===========
This coursework explores parallelism using threaded-building blocks. You should (if you're doing it right), see at least some examples of linear speedup here, i.e. the performance rises in proportion to the number of CPU cores.
This distribution contains a basic object framework for creating and using fourier transforms, as well as two implementations:
- A direct fourier transform taking O(n^2) steps.
- A recursive fast-fourier transform taking O(n log n) steps.
Also included within the package is a very simple test suite to check that the transforms work, and a registry which allows new transforms to be added to the package.
Your job in this coursework is to explore a number of basic (but effective) ways of using TBB to accelerate existing code, and to compare and contrast the performance benefits of them.
Dislaimer: Never write your own fourier transform for production use. Everything you do here will be slower and less accurate than existing FFT libraries.
- Environment and setup =======================
You can download Threaded Building Blocks from the TBB Website or it is available for many package managers
At the time of writing the stable version is 4.2, which is what I will be using in the tests for the coursework. That said, I have no expectations of incompatibility with the later version which I now see is available. You have the choice of downloading and building them from source (you have knowledge of how to do that via configure/make), or you can install the binary packages on your machine.
The target environment for this coursework is Ubuntu Server 14.04, specifically the Amazon AWS AMI version of Ubuntu 14.04. However, Ubuntu in a virtual machine should be almost identical, and most linuxes (Linii?) will work fine. TBB also works well under Visual Studio, cygwin32, and mingw, and according to previous students under MacOS as well - part of the point of using TBB is that it has good cross-platform support. The only common platform I know where it doesn't work is cygwin64.
What you choose to do the development in is up to you, but the assessed compilation and evaluation will be done under Ubuntu in AWS. My plan is to use a c4.8xlarge instance, but you certainly shouldn't optimise your code specifically for that machine. This shouldn't matter to you now (as you're not relying on anything apart from TBB), but is more encouragement to try out AWS before you have to in the next coursework.
When I build your code, I will not be using your makefiles, and will be compiling your .cpp files directly. There will be a copy of TBB 4.2 available in the environment that your code will use, but exactly how that happens is opaque to you.
Submission will be via github, so one thing you need to do is become part of the HPCE github organisation. This will give you access to a private repository that you can push to, and I can pull from. During submission I will be doing a git clone of your repository, then building it in place. To get access (do this now):
-
Get a github account (if you don't have one).
-
Note down your github id and your imperial login (e.g. mine is dt10).
-
Send an email to me, from an Imperial email account, with the title "[HPCE-github-request]". The body should consist of your github id and your imperial login, so something like:
github: m8pple imperial: dt10
I will then add you to the HPCE organisation, and you will get a private repository. I'll be doing this manually and in batches, so it may take a day or so to turn it around. So do it when you read this.
Note: Again, it is not a disaster if submission via github doesn't work, I will also be getting people to do a blackboard submission as backup so I get the code. However, I did this for the first time last year, and it worked fine.
Note: This section is new for this year, as setting up TBB was confusing for some students before. All the commands are based on ones I've types, but let me know if there are improvements to be made.
First get the CW3 files and move into that directory:
git clone https://github.com/HPCE/hpce-2015-cw3
cd hpce-2015-cw3
To check your TBB setup, there is a small program called
src/test_tbb.cpp
. The build command is set up as (linux/cygwin/mingw):
make bin/test_tbb
or (visual studio command-line:
nmake -f makefile.mk bin/test_tbb
but you'll probably need to customise the build a bit, depending on your environment and installation.
Use your package manager to install TBB. For example, if you are in Ubuntu or Debian, use:
sudo apt-cache search tbb
to find the available packages, then install them. Make sure you also include the "dev" packages. A typical installation on Ubuntu 14 would be:
sudo apt-get install libtbb-dev libtbb2
(I have no idea where the "2" comes from, it just seems to be needed).
You should then be able to find the headers and libraries in standard system places. On my Ubuntu, I found them at:
# Headers
ls /usr/include/tbb
# Libraries (compile-time and run-time)
ls /usr/lib/libtbb.so
If you are lucky, then you will be able to build straight off:
make bin/test_tbb
bin/test_tbb
Otherwise you may need to manually set paths:
make bin/test_tbb TBB_INC_DIR=<PathToHeaders> TBB_LIB_DIR=<PathToLibraries>
Note: Make sure that does not contain the final
tbb
.
Once this builds, you may want to edit the makefile in order to hard-code the two variables (though it is up to you).
When you try running it, you may also need to explicitly tell the loader where to look for run-time libraries:
export LD_LIBRARY_PATH=<AbsolutePathToLibraries>
bin/test_tbb
Note: The path to LD_LIBRARY_PATH needs to be absolute, rather than
relative (remember the same problem occurred with --prefix
in CW2).
Note: I don't know of a way to make this work in cygwin64 at the moment.
For cygwin32 you'll need a slightly patched version of TBB, which is available here:
http://cas.ee.ic.ac.uk/people/dt10/teaching/2015/hpce/cw3/tbb42_20140601oss_src_cygwin.tgz
For other platforms I'd suggest getting the version off the web-site.
Build steps should be:
wget http://cas.ee.ic.ac.uk/people/dt10/teaching/2015/hpce/cw3/tbb42_20140601oss_src_cygwin.tgz
tar -xf tbb42_20140601oss_src_cygwin.tgz
cd tbb42_20140601oss
make -j 4 # OMG, parallelism!
cd ..
If you now look around, you should be able to find the various directories:
# The include files
ls */include/tbb/*.h
# The compile-time libraries (cygwin)
ls */build/*/*.dll.a
# The run-time libraries (cgwin)
ls */build/*_release/*.dll
# The compile-time and run-time libraries (linux)
ls */build/*_release/*.so
The paths to these things are what you need to in order to build. For example, in my cygwin it was:
make bin/test_tbb TBB_INC_DIR=tbb42_20140601oss/include TBB_LIB_DIR=tbb42_20140601oss/build/cygwin_ia32_gcc_cc4.9.3_newlib_release
Note: the include directory path should not include the final tbb
component.
Once this builds, you may want to edit the makefile in order to hard-code the two variables.
If you now try running:
bin/test_tbb
you will probably see an error message like:
<SomePath>/hpce-2015-cw3/bin/test_tbb.exe: error while loading shared libraries:
libtbb.dll: cannot open shared object file: No such file or directory
There are two common solutions. First, add the directory containing the dll to your path:
export PATH=$PATH:$(pwd)/tbb42_20140601oss/build/PATH_TO_YOUR_RUN_TIME_LIBRARIES/
or copy the the dll to the same directory as the binaries:
cp tbb42_20140601oss/build/PATH_TO_YOUR_RUN_TIME_LIBRARIES/*.dll bin
You should now be able to run the program, and all your cores should light up.
For visual studio builds it is easiest to work on the command line (though of course, I would say that).
Rather than building from source, I suggest you download the windows binary builds from the TBB website. Once you unzip it you'll find a directory structure that contains:
-
Header files in
include
-
Compile-time libraries in
lib
-
Run-time libraries in
bin
Depending on which processor you have (ia32 vs intel64), and which version of visual studio you have (vc10, vc11, vc12), you should choose the appropriate version of TBB. For example I have Visual Studio 2010 installed, so I started a command prompt using:
- Start
- Microsoft Visual Studio 2010
- Visual Studio Tools
- Visual Studio Command Prompt (2010)
I then get a console that explicitly tells me that I'm running the x86 (32-bit) version of the 2010 tools.
I chose to unzip TBB into tbb_win
, so that means my paths
for this toolchain are:
-
Headers:
tbb_win\include
-
Compile-time libraries:
tbb_win\lib\ia32\vc10
-
Run-time libraries:
tbb_win\bin\ia32\vc10
The magic compilation incantation for me was:
nmake -f makefile.mk bin\test_tbb.exe TBB_INC_DIR=tbb_win\include TBB_LIB_DIR=tbb_win\lib\ia32\vc10
but you'll want to adjust for your environment. Once it's setup, you may want to adjust the makefile.
In order to run, you'll need to make sure that
tbb.dll
can be found when the program starts. Techniques
include copying the dll to the bin directory:
copy tbb_win\YOUR_CPU\YOUR_TOOLS\tbb.dll bin
copying it to the current directory:
copy tbb_win\YOUR_CPU\YOUR_TOOLS\tbb.dll .
or adding it to the path:
setenv PATH=%PATH%;ABSOLUTE_PATH_TO_THE_DLL_DIRECTORY
The package you have been given is a slightly overblown
framework for implementing fourier transforms. The file
include\fourier_transform.hpp
defines a generic
fourier transform interface, allowing forwards and
backwards transforms.
In the src directory, there are two fourier transform implementations, and two programs:
-
test_fourier_transform, which will run a number of simple tests on a given transform to check it works. The level of acceptable error in the results is defined as 1e-9 (which is quite high).
-
time_fourier_transform, which will time a given transform for increasing transform sizes with a given level of allowed parallelism.
If you build the two programs (using the appropriate makefile, or some other mechanism), then you should be able to test and time the two existing transforms. For example, you can do:
bin/test_fourier_transform
which will list the two existing transforms, and:
bin/time_fourier_transform hpce.fast_fourier_transform
to see the effect of increasing transform size on
execution time (look in time_fourier_transform.cpp
to
see what the columns mean).
Even though there is no parallelism in the default framework, it still relies on TBB to control parallelism, so it will not build, link, or execute, without TBB being available.
Note: the direct fourier transform is very likely to fail some tests, while the fast fourier transform (should) pass all of them. This demonstrates that even though two algorithms calculate the same result, the order of calculation can change the numerical accuracy significantly. In case you are not aware, floating point is not exact, and if you add together long sequences the error adds up quickly.
You may wish to explore the algorithmic tradeoffs between the sequential direct and fast fourier transforms - remember that parallelism and optimisation are never a substitute for an algorithm with a better intrinsic complexity.
- Using tbb::parallel_for in direct_fourier_transform ======================================================
The file src/direct_fourier_transform.cpp
contains a classic
discrete fourier transform, which takes O(n^2) operations to
do an n-point fourier transform. The structure of this
code is essentially:
for a=0:n-1
out[a]=0;
for b=0:n-1
out[a]=out[a] + in[i] * exp(-i * 2*pi * a * b / n);
end
end
This should remind you of roots of unity and such-like from way back when.
There are two loops here, with a loop-carried dependency through the inner loop, so a natural approach is to try to parallelise over the outer loop. In matlab this would be inefficient using parfor, except for very large transforms, but in TBB the overhead is fairly small.
The parfor equivalent is given by tbb::parallel_for, which takes a functor (function-like object) and applies it over a loop. The simplest approach is to take a loop which looks like:
for(unsigned i=0;i<n;i++){
f(i);
}
and turn it into:
tbb::parallel_for(0u, n, f);
This approach works, but with C++11 lambda functions we can make it look nicer. So we can take:
for(unsigned i=0;i<n;i++){
y[i]=f(x[i]);
}
and turn it into:
#include "tbb/parallel_for.h"
tbb::parallel_for(0u, n, [=](unsigned i){
y[i]=f(x[i]);
});
If you have problems compiling this, and get loads of cryptic error
messages, it may be because the types passed to the function
don't agree. In particular, both the begin and end of the iteration
range should have the same type. In the example above, 0u
has
the type unsigned
, so if n
had the type int
, there might
be a type error (as the iteration type TI can't be both int
and
unsigned
). To resolve it you can either make sure both parts
have the same time:
tbb::parallel_for((unsigned)0, (unsigned)n, [=](unsigned i){
y[i]=f(x[i]);
});
or you can explicitly state the iteration type:
tbb::parallel_for<unsigned>(0, n, [=](unsigned i){
y[i]=f(x[i]);
});
Despite the way it initially appears, parallel_for
is
still a normal function, but we just happen to be passing
in a variable which is code. The [=] syntax is introducing
a lambda, in a very similar way to the @ syntax in matlab,
so we could also do it in a much more matlab way:
auto loop_body = [=](unsigned i){
y[i]=f(x[i]);
};
tbb::parallel_for(0u, n, loop_body);
This is not the only form of parallel_for
, but it is
the easiest and most direct. Other forms allow for
more efficient execution, but require more thought.
_Note: I've been inconsistent between the use of unsigned
as
used in this document, and size_t
as used in the fourier
transform framework. Really I should have chosen one of size_t
or unsigned
everywhere, as we might have a 32-bit unsigned
but a 64-bit size_t
. I'm not going to update now, as I don't
want to make large-scale changes.
Thanks to @raymond-williams in (Issue 7)[HPCE/hpce-2018-cw2#7].
TODO: update post-deadline.
_
The framework is designed to support multiple fourier
transforms which you can select between, so we'll
need a way of distinguishing your transform from
anyone else's (in principle I should be able to create
one giant executable containing everyone in the
class's transforms). The basic framework uses the namespace
hpce
, but your classes will live in the namespace
hpce::your_login
, and the source files in src/your_login
.
For example, my namespace is hpce::dt10
, and my
source files go in src/dt10
.
There are five steps in this process:
- Creating the new fourier transform class
- Registering the class with the fourier framework
- Adding the parallel constructs to the new class
- Testing the parallel functionality (up to you)
- Finding the new execution time (up to you, see notes at the end).
Copy src/direct_fourier_transform.cpp
into a new
file called src/your_login/direct_fourier_transform_parfor.cpp
.
Modify the new file so that the contained class is called
hpce::your_login::direct_fourier_transform_parfor
, and reports
hpce.your_login.direct_fourier_transform_parfor
from name()
. Apart
from renaming, you don't need to change any functionality yet.
To declare something in a nested namespace, simply
insert another namespace declaration inside the existing
one. For example, if you currently have hpce::my_class
:
namespace hpce{
class my_class{
...
};
};
you could get it into a new namespace called bobble, by changing it to:
namespace hpce{
namespace bobble{
class my_class{
...
};
};
};
which would result in a class with the name hpce::bobble::my_class
.
Add your new file to the set of objects (either by adding it to FOURIER_IMPLEMENTATION_OBJS in the makefile, or by adding it to your visual studio project), and check that it still compiles.
As part of the modifications, you should have found
a function at the bottom of the file called std::shared_ptr<fourier_transform> Create_direct_fourier_transform()
,
which you modified to std::shared_ptr<fourier_transform> Create_direct_fourier_transform_parfor()
.
This is the factory function, used by the rest of the
framework to create transforms by name, without knowing
how they are implemented.
If you look in src/fourier_transform_register_factories.cpp
, you'll
see a function called fourier_transform::RegisterDefaultFactories
,
which is where you can register new transforms. To minimise
compile-time dependencies, the core framework knows nothing
about the transforms - all it knows is how to create them.
Towards the top is a space to declare your external factory
function, which can be uncommented. Then at the bottom
of RegisterDefaultFactories
, uncomment the call which
registers the factory.
At this point, you should find that your new implementation
is listed if you build test_fourier_transform
and do:
test_fourier_transform hpce.your_login.direct_fourier_transform_parfor
Hopefully your implementation still works, in so far as the execution will be identical.
If your transform doesn't turn up at run-time, or the code won't compile,
make very sure that you have renamed everything within
src/your_login/direct_fourier_transform_parfor.cpp
to the
new name. Also make sure that the factory function is declared
as std::shared_ptr<fourier_transform> hpce::your_login::Create_direct_fourier_transform_parfor()
,
both in src/your_login/direct_fourier_transform_parfor.cpp
, and in
src/fourier_transform_register_factories.cpp
(particularly if you get a linker error).
You need to rewrite the outer loop in both forwards_impl
and backwards_impl
,
using the transformation of for loop to tbb::parallel_for
shown previously. I would
suggest doing one, running the tests, and then doing the other. You'll
need to make sure that you include the appropriate header for parallel_loop from
TBB at the top of the file, so that the function can be found. The linker path
will also need to be setup using whatever build tool you are using (e.g. the
settings in visual studio, or LDFLAGS in a makefile).
- Using tbb::task_group in fast_fourier_transform ==================================================
The file src/fast_fourier_transform.cpp
contains a radix-2
fast fourier transform, which takes O(n log n) operations to
do a transform. There are two main opportunities for parallelism
in this algorithm, both in forwards_impl()
:
- The recursive splitting, where the function makes two recursive calls to itself.
- The iterative joining, where the results from the two recursive calls are joined together.
We will first exploit just the recursive splitting using
tbb::task_group
.
Task groups allow us to specify sets of heterogenous
tasks that can run in parallel - by heterogenous, we
mean that each of the tasks can run different code and
perform a different function, unlike parallel_for
where
one function is used for the whole iteration space.
Task groups are declared as an object on the stack:
#include "tbb/task_group.h"
// Within some function, create a group
tbb::task_group group;
you can then add tasks to the group dynamically, using anything which looks like a nullary (zero-input) function as the starting point:
unsigned x=1, y=2;
group.run( [&x](){ x=x*2; } );
group.run( [&y](){ y=y+2; } );
After this code runs, we can't say anything about the values of x and y, as each one has been captured by reference (the [&]) but we don't know if they have been modified yet. It is is possible that zero, one, or both of the tasks have completed, so to rejoin the tasks and synchronise we need to do:
group.wait();
After this function executes, all tasks in the group must have finished, so we know that x==2 and y==4.
An illegal use of this construct would be for both tasks to have a reference to x:
// don't do this
unsigned x=7;
group.run( [&x](){ x=x*2; } );
group.run( [&x](){ x=x+2; } );
group.wait();
Because both tasks have a reference to x, any changes in one task are visible in the other task. We don't know what order the two tasks will run in, so the output could be one of:
- x == (7*2)+2
- x == (7+2)*2
- x == 7*2
- x == 7+2
This would be a case of a data-race condition, which is why you should never have two threads sharing the same memory.
Copy src/fast_fourier_transform.cpp
into a new
file called src/your_login/fast_fourier_transform_taskgroup.cpp
.
Modify the new file so that the contained class is called
hpce::your_login::fast_fourier_transform_taskgroup
, and reports
hpce.your_login.fast_fourier_transform_taskgroup
from name(). Apart
from renaming, you don't need to change any functionality yet.
As before, register the implementation with the implementation
in src/fourier_transform_register_factories.cpp
, and check that
the transform still passes the test cases.
In the fast fourier transform there is a natural splitting recursion in the section:
forwards_impl(m,wn*wn,pIn,2*sIn,pOut,sOut);
forwards_impl(m,wn*wn,pIn+sIn,2*sIn,pOut+sOut*m,sOut);
Modify the code to use tbb::task_group to turn the two calls into child tasks. Don't worry about efficiency yet, and keep splitting the tasks down to the point of individual elements - splitting down to individual elements is the wrong thing to do, as it introduces masses of overhead, but we are establishing a base-case here.
As before, test the implementation to make sure it still works.
- Exploring the grain size of parallel_for ===========================================
We are now going to explore tuning the grain size, which is essentially adjusting the computation to parallel overhead ration.
By default, tbb::parallel_for
uses something called
the auto_partitioner
, which is used to partition
your iteration space into sequential segments. The auto-partitioner
attempts to balance the number of parallel tasks created
against the amount of computation at each iteration
point. Internally it tries to split the iteration space
up recursively into parallel chunks, and then switches
to serial execution within each chunk once it has enough
parallel tasks for the number of processors.
In our direct FFT the amount of work within each parallel iteration will depend on the transform size, as the total cost is O(n^2), and the cost per inner loop iteration is O(n). As the transform gets larger, the total amount of true work is scaling as O(n^2), while the potential overhead due to parallelism is scaling at O(n). Putting numbers on this, if we assume t_c is the cost per data-point, and t_p is the extra overhead per parallel task, then the overall cost is t_p n + t_c n^2. Asymptotically this becomes O(n^2), but for finite sizes we know that t_p >> t_c (by a factor of around 1000), so for n~1000 the overhead will equal the actual compute time.
We can explore this and control this by using manual grain size control, which explicitly says how big each parallel task should be. There is an alternate form of parallel_for, which describes a chunked iteration space:
tbb::parallel_for(tbb::blocked_range<unsigned>(i,j,K), [&](const tbb::blocked_range<unsigned> &chunk){
for(unsigned i=chunk.begin(); i!=chunk.end(); i++ ){
y[i]=myLoop(i);
}
}, tbb::simple_partitioner());
this is still equivalent both to the original loop, but now we have more control. If we unpack it a bit, we could say:
// Our iteration space is over the unsigneds
typedef tbb::blocked_range<unsigned> my_range_t;
// I want to iterate over the half-open space [i,j),
// and the parallel chunk size should be K.
my_range_t range(i,j,K);
// This will apply my function over the half-open
// range [chunk.begin(),chunk.end) sequentially.
auto f=[&](const my_range_t &chunk){
for(unsigned i=chunk.begin(); i!=chunk.end(); i++ ){
y[i]=myLoop(i);
}
};
We now have the choice of executing it directly:
f(range); // Apply f over range [i,j) sequentially
or in parallel with chunk size of K:
tbb::parallel_for(range, f, tbb::simple_partitioner());
The final tbb::simple_partitioner()
argument is telling
TBB "I know what I am doing; I have decided that K is the
best chunk size." We could alternatively leave it blank,
which would default to the auto partitioner, which would
tell TBB "I know that the chunk size should not be less than
K, but if you want to do larger chunks, go for it.".
We want to set a good chunk size, but we also want it to be user tunable for a specific machine. So we are going to allow the user to choose a value K at run-time using an environment variable.
The function getenv
allows a program to read environment variables
at run-time. So if I choose an environment variable called HPCE_X, I could
create a C++ program read_x
:
#include <cstdlib>
int main()
{
char *v=getenv("HPCE_X");
if(v==NULL){
printf("HPCE_X is not set.\n");
}else{
printf("HPCE_X = %s\n", v);
}
return 0;
}
then on the command line I could do:
> ./read_x
HPCE_X is not set.
> export HPCE_X=wibble
> ./read_x
HPCE_X = wibble
> export HPCE_X=100
> ./read_x
HCPE_X = 100
Environment variables are a way of defining ambient properties, and are often used for tuning the execution of program for the specific machine they are on (I'm not saying it's the best way, but it happens a lot). I used the "HPCE_" prefix as I want to try to avoid clashes with other programs that might want a variable called "X".
Create a new class src/your_login/direct_fourier_transform_chunked.cpp
based
on src/your_login/direct_fourier_transform_parfor.cpp
with class name hpce::your_login::direct_fourier_transform_chunked
, and name
hpce.your_login.direct_fourier_transform_chunked
.
Change the parfor loop to use a chunk size K, where K is either
specified as a decimal integer using the environment
variable HPCE_DIRECT_OUTER_K
, or if it is not set, you should use
a sensible default. For example, if the user does:
export HPCE_DIRECT_OUTER_K=16
you would use a chunk size of 16 (which will work well for larger n, but less well as n becomes smaller).
You should think about what "sensible default" means: the TBB user guide gives some guidance in the form of a "rule of thumb". Your rule of thumb should probably take into account the size of the inner loop: remember that the work is O(n^2), and the work within each original parfor iteration is O(n). But on the other side, you always want enough tasks to keep the processors occupied (no matter how many there are), so you can't set the chunk size to K=n.
- Adjustable grain size for the FFT ====================================
Our recursive parallel FFT is currently splitting down to individual tasks, which goes completely against the TBB advice about the minimum work per task.
Go back into src/your_login/fast_fourier_transform_taskgroup.cpp
and make the base-case adjustable using the environment variable
HPCE_FFT_RECURSION_K
. This should adjust things at run-time,
so that if I choose:
export HPCE_FFT_RECURSION_K=1
then the work is split down to individual tasks, so the run-time should be very similar to the original, while if we do:
export HPCE_FFT_RECURSION_K=16
then the implementation will stop parallel recursion for at a size of 16, and switch to serial recursion (i.e. normal function calls).
You don't want the overhead of calling getenv
all over the
place, so I would suggest calling it once at construction
time and caching it in a member variable for the lifetime
of the FFT instance.
Note: this is an in-place modification, rather than a new class.
- Using parallel iterations in the FFT =======================================
The FFT contains a for loop which at first glance appears to be impossible to parallelise, due to the loop carried dependency through w:
std::complex<double> w=std::complex<double>(1.0, 0.0);
for (size_t j=0;j<m;j++){
std::complex<double> t1 = w*pOut[m+j];
std::complex<double> t2 = pOut[j]-t1;
pOut[j] = pOut[j]+t1;
pOut[j+m] = t2;
w = w*wn;
}
However, we can in fact parallelise this loop as long as we exploit some mathematical properties and batch things carefully. On each iteration the loop calculates w=w*wn, so if we look at the value of w at the start of each iteration we have:
- j=0, w=1
- j=1, w=1 * wn
- j=2, w=1 * wn * wn
Generalising, we find that for iteration j, w=wn^j.
Hopefully it is obvious that raising something to the power of i takes substantially less than i operations. Try calculating (1+1e-8)^1e8 in matlab, and notice:
- It is almost equal to e. Nice.
- It clearly does not take anything like 1e8 operations to calculate.
In C++ the std::complex
class supports the std::pow
operator,
which will raise a complex number to a real scalar, which
can be used to jump ahead in the computation. In principle
we could use this property to make the loops completely
independent, but this will likely slow things down, as
powering is quite expensive (compared to one multiply). Instead we can use the idea
of agglomeration, which means instead of reducing the
code to the finest grain parallel tasks we can, we'll
group some of them into sequential tasks to reduce
overhead. Agglomeration is actually the same principle as the chunking
above, but now we have to think more carefully about
how to split into chunks.
So within each chunk [i,j) we need to:
- Skip wn ahead to wn^i
- Iterate as before over [i,j)
Possible factors of K are easy to choose in this case, because m is always a power of two, so K can be any power of two less than or equal to m (including K=1).
Create a new class based on src/fast_fourier_transform.cpp
, with:
- File name:
src/your_login/fast_fourier_transform_parfor.cpp
- Class name:
hpce::your_login::fast_fourier_transform_parfor
- Display name:
hpce.your_login.fast_fourier_transform_parfor
This class should be based on the sequential version, not on the task based version, so there is only one kind of parallelism.
First apply the loop transformation described above,
without introducing any parallelism, and check it works
with various values of K, via the environment variable
HPCE_FFT_LOOP_K
.
Note that m gets smaller and smaller as it splits, so you need to worry about m < K at the leaves of the recursion. A simple solution is to use a guarded version, such that if m < = K the original code is used, and if m > K the new code is used.
Once you have got it working with a non parallel chunked
loop, replace the outer loop with a parallel_for
loop
using simple_partitioner
, and check that it still works
for different values of HPCE_FFT_LOOP_K
. You will
probably not see much speed-up here, as the dominant
cost tends to be the recursive part.
As before, if HPCE_FFT_LOOP_K
is not set, choose a sensible
default based on your analysis of the scaling with n, and/or
experiments. Though remember, it should be a sensible default
for all machines (even those with 64 cores).
- Combine both types of parallelism ====================================
We now have two types of parallelism that we know work,
so the natural choice is to combine them together.
Create a new implementation called fast_fourier_transform_combined
,
using the conventions for naming from before, and integrate
both forms of parallelism. This version should check both
the HPCE_FFT_LOOP_K
and HPCE_FFT_RECURSION_K
variables, and
fall back on a default if either or both is not set.
Exploring the performance of this one is quite interesting, particularly the interaction between the loop and recursion chunk sizes.
- Optimisation ===============
Our final implementation is a tweaking pass of the parallelised version: we have introduced and exploited two types of parallelism, but there are other optimisations that could be done.
Create a new implementation called fast_fourier_transform_opt
,
and apply any optimisations you think reasonable (I'm not
talking about low-level tweaks, these should be low-hanging
fruit). Particular things to consider are:
-
You may wish to stop creating separate tasks during the splitting, once the sub-tasks drop below a certain size.
-
There is some overhead involved in the serial version because it is radix2 (i.e. there is a base-case where n==2). You may want to look at stopping sequential splitting when n==4.
-
The timing program measures a forward and then backwards transform. Are there any serial bottlenecks that can be removed, and if so, what chunk size should be used?
To re-iterate, the aim here is not low-level optimisation and tweaking, it is for tuning and re-organising at a high-level, while keeping reasonably readable and portable code. By applying these techniques, I got about a 6 times speed-up over the original code using a 8-core (hyper-threaded) machine for largish n.
- Measuring execution time and scalability ===========================================
I am not going to dictate how you explore scalability, as I don't want to over-specify, but it is something you should naturally be doing. Particular parameters to look at are:
-
Parallel overhead. How fast is the TBB version with 1 processor versus the serial processor?
-
Number of processors versus execution time. If you double the processor count, is it twice as fast?
-
The various values of K versus execution time. You would typically expect a bathtub curve, where K=1 is very slow due to parallel overhead, and K=max is slow because there is no parallelism, but how does it behave inbetween?
The program time_fourier_transform
will allow you to
measure scalability against n, and to vary the number
of processors. You could copy and modify this to automatically
explore other properties, by noting that setenv
can be
used to change an environment variable from within a
program.
Alternatively, you can exploit shell scripting, for example, something like:
for K="1 2 4 8 16 32"; do
echo $K;
export HPCE_FFT_RECURSION_K=$K;
./time_fourier_transform hpce.your_login.fast_fourier_transform_taskgroup
end
So formally I require nothing from you to demonstrate experiments in scaling. However, I encourage you to commit and submit a few notes or documents about what you did, for example some graphs or results, as I may ask you about this coursework in the oral and it would be useful for you to know (or show) what your did.
- Submission =============
Double-check your names all match up, as I'll be trying
to create your transforms both by direct instantiation,
and by pulling them out of the transform registry. Also,
don't forget that "your_login" or $LOGIN
does actually mean your
login (your imperial login) needs to be substituted in
wherever it is mentioned.
Specific files (with appropriately named classes) which should exist are:
src/$LOGIN/direct_fourier_transform_parfor.cpp
src/$LOGIN/direct_fourier_transform_chunked.cpp
src/$LOGIN/fast_fourier_transform_combined.cpp
src/$LOGIN/fast_fourier_transform_parfor.cpp
src/$LOGIN/fast_fourier_transform_taskgroup.cpp
src/$LOGIN/fast_fourier_transform_opt.cpp
However, you can put other files in if you want. Performance results are always interesting, though not required.
and specific transforms that should be registered are:
hpce.$LOGIN.direct_fourier_transform_parfor
hpce.$LOGIN.direct_fourier_transform_chunked
hpce.$LOGIN.fast_fourier_transform_combined
hpce.$LOGIN.fast_fourier_transform_parfor
hpce.$LOGIN.fast_fourier_transform_taskgroup
hpce.$LOGIN.fast_fourier_transform_opt
Hopefully this should all still be in a git repository, and you will also have a private remote repository in the HPCE organisation. "Push" your local repository to your private remote repository, making sure all the source files have been staged and added. (You can do a push whenever you want as you go (it's actually a good idea), but make sure you do one for "submission"). You may want to do a "clone" into a separate directory to make sure that everything has made it.
After pushing to your private remote repository, zip up your directory, including the .git, but excluding any binary files (executables, objects)
I would suggest compiling and running your submission in AWS, just to get the feel of it. A correctly written submission should compile anywhere, with no real dependency on the environment, but it is good to try things out.