CIS565-Fall-2017 · lostink · Sep 19, 2017
diff --git a/INSTRUCTION.md b/INSTRUCTION.md
@@ -242,7 +242,6 @@ The title should be "Project 2: YOUR NAME".
 The template of the comment section of your pull request is attached below, you can do some copy and paste:  
 
 * [Repo Link](https://link-to-your-repo)
-* `Your PENNKEY`
 * (Briefly) Mentions features that you've completed. Especially those bells and whistles you want to highlight
     * Feature 0
     * Feature 1

diff --git a/README.md b/README.md
@@ -3,11 +3,109 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Name: Jiahao Liu
+* Tested on: Windows 10, i7-3920XM CPU @ 2.90GHz 3.10 GHz 16GB, GTX 980m SLI 8192MB (personal computer)
 
-### (TODO: Your README)
+Project Description and features implemented
+======================
+
+### Project Description
+
+This project compares the running time of prefix-sum (scan) computation across four implementations: CPU scan, naive GPU scan, work-efficient GPU scan, and Thrust scan.
+
+### Features implemented
+
+* CPU scan
+
+* Naive GPU scan with shared memory
+
+* Efficient GPU scan
+
+* Thrust scan
+
+Performance Analysis
+======================
+
+I tried several block sizes and found that the program performs the same with all of them, so I chose the largest power of two as my block size.
+
+![](img/1.png)
+
+![](img/2.png)
+
+When the time to copy data from host to device is included, the Thrust scan and the naive scan run no faster than the CPU scan.
+
+![](img/3.png)
+
+Even with the data copy included, the efficient scan runs faster than the CPU scan.
+
+![](img/4.png)
+
+The graph above includes only the time spent computing. The Thrust scan may perform some built-in operations that cost significant time. By a rough estimate, the naive scan makes more memory accesses than the efficient scan (n^2/2 vs. 3*n) but performs fewer arithmetic and comparison operations, which suggests that the bottleneck is memory I/O. We cannot evaluate the Thrust scan further because we do not know what happens inside it.
+
+One interesting observation is that the power-of-two Thrust scan takes noticeably longer than the non-power-of-two one. Based on what I know about accessing arrays of power-of-two length in RAM, this may be a consequence of access conflicts, that is, the hardware cost of locating the next memory unit.
+
+### Running Result
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  28  13   1  37  12  43  45  30  45  30  16  35  30 ...  33   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.001324ms    (std::chrono Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9519 9552 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.001655ms    (std::chrono Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9421 9449 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.006976ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9519 9552 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.00688ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ...   0   0 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9519 9552 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9421 9449 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.025152ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9519 9552 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.0216ms    (CUDA Measured)
+    [   0  28  41  42  79  91 134 179 209 254 284 300 335 ... 9421 9449 ]
+    passed
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   0   1   3   3   1   2   2   1   2   0   2   3   3 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.001655ms    (std::chrono Measured)
+    [   1   3   3   1   2   2   1   2   2   3   3   1   2 ...   3   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.001655ms    (std::chrono Measured)
+    [   1   3   3   1   2   2   1   2   2   3   3   1   2 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.001655ms    (std::chrono Measured)
+    [   1   3   3   1   2   2   1   2   2   3   3   1   2 ...   3   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.474944ms    (CUDA Measured)
+    [   1   3   3   1   2   2   1   2   2   3   3   1   2 ...   3   2 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.416544ms    (CUDA Measured)
+    [   1   3   3   1   2   2   1   2   2   3   3   1   2 ...   3   1 ]
+    passed
 
+```
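Note on the README analysis: the comparison between the naive and the efficient scan can be cross-checked on the CPU. The naive and work-efficient kernels are not part of the hunks shown here, so the following is only a sketch that assumes a Hillis-Steele style naive scan and a Blelloch style work-efficient scan (up-sweep followed by down-sweep). The names `ScanResult`, `naiveScanCounted`, and `efficientScanCounted` are hypothetical, and the sketch counts additions rather than memory accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the exclusive scan plus the number of additions used.
struct ScanResult {
    std::vector<int> out;
    std::size_t adds;
};

// Hillis-Steele style scan (assumed naive variant), run sequentially on the CPU.
// Each pass adds the element `offset` positions to the left.
ScanResult naiveScanCounted(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> cur(in), next(n);
    std::size_t adds = 0;
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            if (i >= offset) {
                next[i] = cur[i - offset] + cur[i];
                ++adds;
            } else {
                next[i] = cur[i];
            }
        }
        cur.swap(next);
    }
    // The passes above produce an inclusive scan; shift right to make it exclusive.
    std::vector<int> out(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        out[i] = cur[i - 1];
    }
    return {out, adds};
}

// Blelloch style scan (assumed work-efficient variant).
// Assumption: in.size() is a power of two (or zero).
ScanResult efficientScanCounted(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> a(in);
    std::size_t adds = 0;
    // Up-sweep: build partial sums in place.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            a[i] += a[i - stride];
            ++adds;
        }
    }
    if (n > 0) {
        a[n - 1] = 0;  // identity element for the exclusive scan
    }
    // Down-sweep: push partial sums back down the tree.
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = a[i - stride];
            a[i - stride] = a[i];
            a[i] += left;
            ++adds;
        }
    }
    return {a, adds};
}
```

For the eight-element input `{1, 2, 3, 4, 5, 6, 7, 8}` both functions return `{0, 1, 3, 6, 10, 15, 21, 28}`; the naive sketch uses 17 additions and the work-efficient sketch uses 14. Under these assumptions the addition counts are n*log2(n) - (n - 1) and 2*(n - 1) for a power-of-two n, so the gap widens as n grows.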
diff --git a/img/1.png b/img/1.png
diff --git a/img/2.png b/img/2.png
diff --git a/img/3.png b/img/3.png
diff --git a/img/4.png b/img/4.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 384; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int a[SIZE], b[SIZE], c[SIZE];
 
@@ -49,42 +49,42 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -129,14 +129,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+			int index = blockIdx.x * blockDim.x + threadIdx.x;
+			if (index >= n) {
+				return;
+			}
+			bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -32,8 +37,12 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
-        }
+			int index = blockIdx.x * blockDim.x + threadIdx.x;
+			if (index >= n || bools[index] == 0) {
+				return;
+			}
+			odata[indices[index]] = idata[index];
+		}
 
     }
 }
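Note on `kernMapToBoolean` and `kernScatter`: the two kernels can be mirrored on the CPU to check their per-element logic without a GPU. This is a minimal sketch, not the project's code: `mapToBoolean` and `scatter` are hypothetical host-side names, each loop iteration stands in for one GPU thread, and the `indices` array is assumed to be the exclusive scan of `bools`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side mirror of kernMapToBoolean: 1 for non-zero elements, 0 otherwise.
std::vector<int> mapToBoolean(const std::vector<int>& idata) {
    std::vector<int> bools(idata.size());
    for (std::size_t i = 0; i < idata.size(); ++i) {
        bools[i] = idata[i] == 0 ? 0 : 1;
    }
    return bools;
}

// Host-side mirror of kernScatter: every kept element is written to the
// slot given by indices[i]. Assumption: indices is the exclusive scan of
// bools and odata is large enough to hold every kept element.
void scatter(std::vector<int>& odata,
             const std::vector<int>& idata,
             const std::vector<int>& bools,
             const std::vector<int>& indices) {
    for (std::size_t i = 0; i < idata.size(); ++i) {
        if (bools[i] == 0) {
            continue;
        }
        odata[indices[i]] = idata[i];
    }
}
```

With `idata = {0, 1, 3, 0, 2}`, `mapToBoolean` yields `{0, 1, 1, 0, 1}`. Its exclusive scan is `{0, 0, 1, 2, 2}`, and scattering with those indices into a three-element output gives `{1, 3, 2}`. Each kept element has a distinct destination, which is why the GPU version needs no synchronization between threads.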
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
 #include <cstdio>
 #include "cpu.h"
 
-#include "common.h"
+#include "common.h"
 
 namespace StreamCompaction {
     namespace CPU {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-	        static PerformanceTimer timer;
-	        return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+	        static PerformanceTimer timer;
+	        return timer;
         }
 
         /**
@@ -20,6 +20,10 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
 	        timer().startCpuTimer();
             // TODO
+			odata[0] = 0;
+			for (int i(1); i < n; ++i) {
+				odata[i] = odata[i - 1] + idata[i - 1];
+			}
 	        timer().endCpuTimer();
         }
 
@@ -29,10 +33,14 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
-	        timer().startCpuTimer();
-            // TODO
-	        timer().endCpuTimer();
-            return -1;
+			int compactSum = 0;
+
+			for (int i(0); i < n; ++i) {
+				if (idata[i] == 0) continue;
+				odata[compactSum++] = idata[i];
+			}
+
+			return compactSum;
         }
 
         /**
@@ -41,10 +49,24 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
-	        timer().startCpuTimer();
-	        // TODO
-	        timer().endCpuTimer();
-            return -1;
+			int *flag = new int[n];
+			int m = 0;
+
+			for (int i = 0; i < n; ++i) {
+				flag[i] = idata[i] == 0 ? 0 : 1;
+			}
+
+			scan(n, odata, flag);
+			m = odata[n - 1] + flag[n - 1]; // exclusive scan omits the last flag
+
+			for (int i = 0; i < n; ++i) {
+				if (flag[i] == 0) continue;
+				odata[odata[i]] = idata[i];
+			}
+
+			delete[] flag;
+
+			return m;
         }
     }
 }
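Note on `compactWithScan`: because `scan` is exclusive, the last scanned value does not include the last flag, so the number of kept elements is the last scanned index plus the last flag. The sketch below illustrates this with `std::vector`; `exclusiveScan` and `keptCount` are hypothetical names used only for illustration, and the recurrence is the same one used in `CPU::scan`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[0] = 0, out[i] = out[i - 1] + in[i - 1].
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Number of elements that survive compaction, given 0/1 flags.
// The last scanned value counts the flags before the final element,
// so the final flag has to be added separately.
int keptCount(const std::vector<int>& flags) {
    if (flags.empty()) {
        return 0;
    }
    const std::vector<int> indices = exclusiveScan(flags);
    return indices.back() + flags.back();
}
```

For flags `{1, 0, 1}` the exclusive scan is `{0, 1, 1}`, so taking only the last scanned value would report 1 kept element, while `keptCount` returns 2. The difference is invisible whenever the input ends in a zero, as the test array in the Running Result does.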