Frequently Asked Questions about FastFlow
=========================================
Questions:
1. What's FastFlow?
FastFlow adopts an algorithmic-skeleton-based approach to multicore
programmability, addressing two problems: 1) implementing efficient
shared-memory management mechanisms, and 2) raising the level of
programming abstraction.
FastFlow provides full support for an important class of applications,
namely streaming applications. In this respect, it provides the user
with a set of stream-parallel skeletons: pipeline and farm, possibly
combined with feedback loops to build cyclic networks.
Skeletons embody most of the cumbersome and error-prone details
related to shared-memory handling in multicore code.
In particular, the FastFlow run-time support takes care of all the
synchronizations needed for the communication among the
different parallel entities resulting from the compilation of
the FastFlow skeletons used in an application.
Furthermore, skeletons can be arbitrarily nested to model increasingly
complex parallelism exploitation patterns.
The FastFlow implementation guarantees efficient execution of the
skeletons on currently available multicore systems by building the
skeletons themselves on top of a library of very efficient,
lock-free producer/consumer queues.
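For instance, a minimal two-stage pipeline looks roughly like this (a
sketch against the classic FastFlow C++ API; names such as ff_node,
ff_send_out and GO_ON may differ slightly across FastFlow versions):

    #include <cstdio>
    #include <ff/pipeline.hpp>
    using namespace ff;

    // First stage: generates the stream of tasks.
    struct Source: ff_node {
        void *svc(void *) {
            for (long i = 0; i < 10; ++i)
                ff_send_out(new long(i));   // emit one task per stream item
            return NULL;                    // NULL signals end-of-stream
        }
    };

    // Second stage: consumes the stream.
    struct Sink: ff_node {
        void *svc(void *task) {
            long *t = (long *)task;
            printf("received %ld\n", *t);
            delete t;
            return GO_ON;                   // no output, keep processing
        }
    };

    int main() {
        ff_pipeline pipe;
        pipe.add_stage(new Source);
        pipe.add_stage(new Sink);
        if (pipe.run_and_wait_end() < 0) return -1;
        return 0;
    }

Note that the user code contains no locks or condition variables: all
the synchronization between the two stages is handled by the run-time
support through the underlying queues.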
2. What's the difference between FastFlow and the FastFlow accelerator?
The FastFlow accelerator is an extension of the FastFlow framework
aiming at simplifying the porting of existing sequential code to
multicore. A FastFlow accelerator is a software device defined as a
composition of FastFlow patterns (e.g. pipe(S1,S2), farm(S),
pipe(S1,farm(S2)), ...) that can be started independently of the
main flow of control; one or more accelerators can be (dynamically)
started in one application. Each accelerator exhibits a well-defined
parallel semantics that depends on its particular pattern
composition. Tasks can be asynchronously offloaded (so-called
self-offloading) onto an accelerator. Results from an accelerator can
be returned to the caller thread in either a blocking or a non-blocking
fashion. FastFlow accelerators enable programmers to 1) create a stream
of tasks from a loop or a recursive call, and 2) parallelize kernels of
code by changing the original code in a very local way (for example, a
part of a loop body). A FastFlow accelerator typically works in
non-blocking fashion on a subset of the cores of the CPUs, but can be
transiently suspended to release hardware resources in order to
efficiently manage non-contiguous bursts of tasks.
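A typical self-offloading pattern looks like the following (a sketch
based on the classic accelerator API; the accelerator flag of the farm
constructor, offload(), load_result() and FF_EOS may vary across
versions; the worker count of 4 is just illustrative):

    #include <vector>
    #include <ff/farm.hpp>
    using namespace ff;

    // The kernel to accelerate: squares the offloaded value.
    struct Worker: ff_node {
        void *svc(void *task) {
            long *t = (long *)task;
            *t = (*t) * (*t);
            return t;                        // send the result back
        }
    };

    int main() {
        ff_farm<> farm(true);                // true = accelerator mode
        std::vector<ff_node *> workers;
        for (int i = 0; i < 4; ++i)
            workers.push_back(new Worker);
        farm.add_workers(workers);
        farm.add_collector(NULL);            // default collector gathers results
        farm.run();                          // start the accelerator

        for (long i = 0; i < 100; ++i)
            farm.offload(new long(i));       // self-offload tasks from the loop
        farm.offload((void *)FF_EOS);        // signal end-of-stream

        void *res = NULL;
        while (farm.load_result(&res))       // blocking retrieval of results
            delete (long *)res;
        farm.wait();                         // release the accelerator threads
        return 0;
    }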
3. Using 1-to-1 FIFO queues (i.e. Single-Writer/Single-Reader
queues, or just SWSR) means potentially n^2 queues.
How big are the queues? How much memory may be consumed on a
many-core system? Is this approach scalable?
An empty SWSR queue on a 64-bit platform has a size of 144 bytes.
A 1-to-1 FIFO queue may be bounded in size (i.e. just a circular buffer)
or unbounded (i.e. the queue allocates/deallocates buffer space on
demand and in chunks). The unbounded queue supports the implementation
of deadlock-free cyclic networks. The queues store memory pointers, so
in general they are quite small, typically just a few KB.
Since FastFlow programs mainly use compositions of the farm and pipeline
skeletons, which do not require a complete connection among the skeletons'
stages, the resulting streaming network is far sparser than n^2 queues.
Thus the approach is as scalable as the underlying streaming network
being modeled.
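Instantiating one of these queues directly looks roughly like this (a
sketch assuming the SWSR_Ptr_Buffer class from ff/buffer.hpp; the exact
class and method names may differ across versions):

    #include <ff/buffer.hpp>
    using namespace ff;

    int main() {
        // A bounded SWSR circular buffer with 1024 slots. It stores
        // void* pointers, so its footprint stays small (the slot array
        // plus the ~144 bytes of bookkeeping mentioned above),
        // regardless of the task payload size.
        SWSR_Ptr_Buffer queue(1024);
        if (!queue.init()) return -1;       // allocate the buffer space

        long task = 42;
        queue.push(&task);                  // producer side (single writer)

        void *p = NULL;
        queue.pop(&p);                      // consumer side (single reader)
        return (*(long *)p == 42) ? 0 : 1;
    }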
4. In the matrix multiplication example, we start N^2 tasks.
Does it ever make sense to start more tasks than cores?
It mainly depends on the definition of tasks. In the matrix multiplication
we have N^2 tasks at the finest grain, but this does not translate into
N^2 threads. Generally, a small number of threads executes the tasks in
parallel (see the sketch below).
The very simple matrix multiplication application is an example of
parallelization through streamization, as opposed to classical
data-parallel parallelization; in this respect, it should be taken as
proof that the approach can be applied, with good performance results,
even in such worst cases.
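In hypothetical form (matrix size N and the worker count of 4 are
illustrative; the farm API follows the classic FastFlow style used in
the sketches above):

    #include <vector>
    #include <ff/farm.hpp>
    using namespace ff;

    const int N = 256;
    static double A[N][N], B[N][N], C[N][N];

    struct Task { int i, j; };               // one task per element of C

    // Emitter: generates the stream of N^2 fine-grained tasks.
    struct Emitter: ff_node {
        void *svc(void *) {
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    ff_send_out(new Task{i, j});
            return NULL;                     // end-of-stream
        }
    };

    // Worker: computes a single element of C.
    struct Worker: ff_node {
        void *svc(void *t) {
            Task *task = (Task *)t;
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[task->i][k] * B[k][task->j];
            C[task->i][task->j] = sum;
            delete task;
            return GO_ON;
        }
    };

    int main() {
        ff_farm<> farm;
        farm.add_emitter(new Emitter);
        std::vector<ff_node *> workers;
        for (int i = 0; i < 4; ++i)          // N^2 tasks, only 4 threads
            workers.push_back(new Worker);
        farm.add_workers(workers);
        return farm.run_and_wait_end();
    }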
5. How to choose task granularity in FastFlow?
FastFlow's lower-level mechanisms are quite efficient: the framework
demonstrates good speedups even when computing tasks that last just a few
microseconds. So choosing the right granularity should not be a big issue.
6. How are composition and split-merge of the streams handled by FastFlow?
7. What is the actual benefit of FastFlow in terms of reduced programming
effort when compared with OpenMP or Intel Threading Building Blocks (TBB)?
Of course, the argument for programmability will only be fully 'proven'
via a large study in which programmers with equal starting knowledge of
the differing technologies develop a range of applications in each of
them and compare experiences. Such a study is difficult to arrange and
execute. For the moment, the argument for programmability can only be
based on a subjective assessment of the abstraction levels of the
differing technologies and on limited empirical experience.
For the latter, we can report that the entire YaDT-FF parallelization
required just a few days of work, which the data mining experts report is
significantly less than the time needed to parallelize the same
application with OpenMP or TBB; this is mainly due to the fact that
FastFlow provides a native way of implementing a Divide&Conquer (D&C)
pattern that can be used to structure the YaDT accelerator.
8. How about Single-Writer/Multiple-Readers (SWMR) and
Multiple-Writers/Single-Reader (MWSR) queues?