CIS565-Fall-2018 · emily-vo · Sep 19, 2018
diff --git a/README.md b/README.md
@@ -3,12 +3,81 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Emily Vo
+  * [LinkedIn](linkedin.com/in/emilyvo), [personal website](emilyhvo.com)
+* Tested on: Windows 10, i7-7700HQ @ 2.8GHz 16GB, GTX 1060 6GB (Personal Computer)
+* Updated `CMakeLists.txt` to target `sm_61`.
 
-### (TODO: Your README)
+### PERFORMANCE ANALYSIS
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
+![Scan runtime vs. array size](img/runtime_vs_size.PNG)
 
+* To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy), take a look at the Nsight timeline for its execution. Your analysis here doesn't have to be detailed, since you aren't even looking at the code for the implementation. Write a brief explanation of the phenomena you see here.
+
+The first Thrust run (power-of-two) is much slower than the second (non-power-of-two), and much slower than every other implementation at all array sizes except those greater than 2^20. This is probably a one-time cost: the first call into Thrust may need extra time to set up the library and any utility classes, so the measurement reflects initialization rather than the scan itself.
+
+The CPU implementation is surprisingly fast compared to the parallel implementations. This is likely because it pays no kernel-launch overhead and makes good use of the cache, since the CPU processes the array sequentially.
+
+The work-efficient implementation is slower than the naive one. This is surprising, since the naive scan performs more additions overall (O(n log n) versus O(n)). It appears that, as with the CPU implementation, overhead dominates at these sizes: the naive scan needs fewer kernel invocations and avoids copying memory between the host and device. As the arrays get larger, I would expect this overhead to matter less relative to the arithmetic, so the work-efficient implementation should eventually be faster.
+
+Looking at the curves, the CPU implementation's runtime grows linearly with array size (the line looks exponential only because the array size on the x axis increases exponentially). This is expected, as the number of operations is linear. The other curves grow much more slowly. The Thrust curve stays almost constant, which suggests that most of its measured time is not spent computing the scan.
+
+* Paste the output of the test program into a triple-backtick block in your README.
+```
+****************
+** SCAN TESTS **
+****************
+    [   0   1   2   3   4   5   6   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.000365ms    (std::chrono Measured)
+    [   0   0   1   3   6  10  15  21 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0ms    (std::chrono Measured)
+    [   0   0   1   3   6 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.014464ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.012608ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.116896ms    (CUDA Measured)
+    [   0   0   1   3   6  10  15  21 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.08192ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 4.4583ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.014304ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   0   3   2   1   3   2   0   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0ms    (std::chrono Measured)
+    [   3   2   1   3   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.000365ms    (std::chrono Measured)
+    [   3   2   1   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 2.09687ms    (std::chrono Measured)
+    [   3   2   1   3   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.088352ms    (CUDA Measured)
+    [   3   2   1   3   2 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.081344ms    (CUDA Measured)
+    [   3   2   1   3 ]
+    passed
+```
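The README's comparison of the naive and work-efficient scans can be made concrete with a small host-side sketch. It is not taken from this commit: `exclusiveScan`, `ScanCost`, `naiveCost`, and `efficientCost` are hypothetical names, the counts assume a power-of-two length, and the "one kernel launch per pass" reading is an assumption about how the GPU versions are structured, since the kernels are not part of this diff.

```cpp
#include <cstddef>
#include <vector>

// Sequential exclusive scan: out[i] is the sum of in[0..i-1], so out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Operation counts for a scan over n elements (n assumed to be a power of two).
struct ScanCost {
    long long adds;  // total additions across all passes
    int passes;      // number of passes (assumed: one kernel launch each)
};

// Naive scan: for each offset d = 1, 2, 4, ..., every element with
// index >= d performs one addition.
ScanCost naiveCost(long long n) {
    ScanCost cost{0, 0};
    for (long long d = 1; d < n; d *= 2) {
        cost.adds += n - d;
        ++cost.passes;
    }
    return cost;
}

// Work-efficient scan: an up-sweep and a down-sweep, each processing
// n/2, n/4, ..., 1 pairs per pass with one addition per pair.
ScanCost efficientCost(long long n) {
    ScanCost cost{0, 0};
    for (long long pairs = n / 2; pairs >= 1; pairs /= 2) {
        cost.adds += 2 * pairs;  // one add in the up-sweep, one in the down-sweep
        cost.passes += 2;
    }
    return cost;
}
```

For n = 256 (the `SIZE` used in `main.cpp`), `naiveCost` reports 1793 additions over 8 passes, while `efficientCost` reports 510 additions over 16 passes. Under the stated assumption, the work-efficient scan does less arithmetic but twice as many launches, which is consistent with it losing at small sizes where per-launch overhead dominates.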
diff --git a/img/runtime_vs_size.PNG b/img/runtime_vs_size.PNG
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,83 +13,208 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int *a = new int[SIZE];
-int *b = new int[SIZE];
-int *c = new int[SIZE];
 
 int main(int argc, char* argv[]) {
-    // Scan tests
+	//for (int i = 12; i <= 20; i++) {
+		int SIZE = 1 << 8; // feel free to change the size of array
+		int NPOT = SIZE - 3; // Non-Power-Of-Two
+		int *a = new int[SIZE];
+		int *b = new int[SIZE];
+		int *c = new int[SIZE];
+
+		genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+		a[SIZE - 1] = 0;
+		//printArray(SIZE, a, true);
+
+		// initialize b using StreamCompaction::CPU::scan you implement
+		// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
+		// At first all cases passed because b && c are all zeroes.
+		zeroArray(SIZE, b);
+		//printDesc("cpu scan, power-of-two");
+		StreamCompaction::CPU::scan(SIZE, b, a);
+		printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+		//printArray(SIZE, b, true);
+		//printCmpResult(NPOT, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("cpu scan, non-power-of-two");
+		StreamCompaction::CPU::scan(NPOT, c, a);
+		printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+		//printArray(NPOT, b, true);
+		//printCmpResult(NPOT, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("naive scan, power-of-two");
+		StreamCompaction::Naive::scan(SIZE, c, a);
+		printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(SIZE, c, true);
+		//printCmpResult(SIZE, b, c);
+
+
+		zeroArray(SIZE, c);
+		//printDesc("naive scan, non-power-of-two");
+		StreamCompaction::Naive::scan(NPOT, c, a);
+		printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(SIZE, c, true);
+		//printCmpResult(NPOT, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("work-efficient scan, power-of-two");
+		StreamCompaction::Efficient::scan(SIZE, c, a);
+		printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(SIZE, c, true);
+		//printCmpResult(SIZE, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("work-efficient scan, non-power-of-two");
+		StreamCompaction::Efficient::scan(NPOT, c, a);
+		printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(NPOT, c, true);
+		//printCmpResult(NPOT, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("thrust scan, power-of-two");
+		StreamCompaction::Thrust::scan(SIZE, c, a);
+		printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(SIZE, c, true);
+		//printCmpResult(SIZE, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("thrust scan, non-power-of-two");
+		StreamCompaction::Thrust::scan(NPOT, c, a);
+		printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(NPOT, c, true);
+		//printCmpResult(NPOT, b, c);
+
+
+		// Compaction tests
+
+		genArray(SIZE - 1, a, 4);  // Leave a 0 at the end to test that edge case
+		a[SIZE - 1] = 0;
+		//printArray(SIZE, a, true);
 
+		int count, expectedCount, expectedNPOT;
+
+		// initialize b using StreamCompaction::CPU::compactWithoutScan you implement
+		// We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
+		zeroArray(SIZE, b);
+		//printDesc("cpu compact without scan, power-of-two");
+		count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+		printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+		expectedCount = count;
+		//printArray(count, b, true);
+		//printCmpLenResult(count, expectedCount, b, b);
+
+		zeroArray(SIZE, c);
+		//printDesc("cpu compact without scan, non-power-of-two");
+		count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+		printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+		expectedNPOT = count;
+		//printArray(count, c, true);
+		//printCmpLenResult(count, expectedNPOT, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("cpu compact with scan");
+		count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
+		printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+		//printArray(count, c, true);
+		//printCmpLenResult(count, expectedCount, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("work-efficient compact, power-of-two");
+		count = StreamCompaction::Efficient::compact(SIZE, c, a);
+		printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(count, c, true);
+		//printCmpLenResult(count, expectedCount, b, c);
+
+		zeroArray(SIZE, c);
+		//printDesc("work-efficient compact, non-power-of-two");
+		count = StreamCompaction::Efficient::compact(NPOT, c, a);
+		printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+		//printArray(count, c, true);
+		//printCmpLenResult(count, expectedNPOT, b, c);
+		std::cout << std::endl;
+		delete[] a;
+		delete[] b;
+		delete[] c;
+
+	//}
+
+	// Scan tests
+//int a[SIZE] = { 0, 1, 2, 3, 4, 5, 6, 7 };
+
+
+
+	system("pause"); // stop Win32 console from closing on exit
+
+
+	/*
+    // Scan tests
+	//int a[SIZE] = { 0, 1, 2, 3, 4, 5, 6, 7 };
     printf("\n");
     printf("****************\n");
     printf("** SCAN TESTS **\n");
     printf("****************\n");
 
-    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
+    //genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+	a[SIZE - 1] = 0;
+    //printArray(SIZE, a, true);
 
     // initialize b using StreamCompaction::CPU::scan you implement
     // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
     // At first all cases passed because b && c are all zeroes.
     zeroArray(SIZE, b);
-    printDesc("cpu scan, power-of-two");
+    //printDesc("cpu scan, power-of-two");
     StreamCompaction::CPU::scan(SIZE, b, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(SIZE, b, true);
+    //printArray(SIZE, b, true);
+	//printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("cpu scan, non-power-of-two");
+    //printDesc("cpu scan, non-power-of-two");
     StreamCompaction::CPU::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(NPOT, b, true);
+    //printArray(NPOT, b, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("naive scan, power-of-two");
+    //printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-	/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-	onesArray(SIZE, c);
-	printDesc("1s array for finding bugs");
-	StreamCompaction::Naive::scan(SIZE, c, a);
-	printArray(SIZE, c, true); */
 
     zeroArray(SIZE, c);
-    printDesc("naive scan, non-power-of-two");
+    //printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("work-efficient scan, power-of-two");
+    //printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+//printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("work-efficient scan, non-power-of-two");
+    //printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("thrust scan, power-of-two");
+    //printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("thrust scan, non-power-of-two");
+    //printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(NPOT, c, true);
@@ -111,37 +236,37 @@ int main(int argc, char* argv[]) {
     // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
     // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
     zeroArray(SIZE, b);
-    printDesc("cpu compact without scan, power-of-two");
+    //printDesc("cpu compact without scan, power-of-two");
     count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedCount = count;
-    printArray(count, b, true);
+    //printArray(count, b, true);
     printCmpLenResult(count, expectedCount, b, b);
 
     zeroArray(SIZE, c);
-    printDesc("cpu compact without scan, non-power-of-two");
+    //printDesc("cpu compact without scan, non-power-of-two");
     count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedNPOT = count;
-    printArray(count, c, true);
+    //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("cpu compact with scan");
+    //printDesc("cpu compact with scan");
     count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(count, c, true);
+    //printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("work-efficient compact, power-of-two");
+    //printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
-    printDesc("work-efficient compact, non-power-of-two");
+    //printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(count, c, true);
@@ -151,4 +276,5 @@ int main(int argc, char* argv[]) {
 	delete[] a;
 	delete[] b;
 	delete[] c;
+	*/
 }
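Since the README attributes the slow first Thrust measurement to one-time setup, one way to keep that cost out of the numbers is to run each operation a few times untimed before measuring, then average over several trials. The helper below is a minimal sketch of that idea, not code from this commit: `averageMs` is a hypothetical name, and it times with `std::chrono` on the host rather than with the project's own CPU and GPU timers.

```cpp
#include <chrono>

// Runs `op` `warmups` times untimed, then `trials` times timed, and
// returns the mean wall-clock duration of a timed run in milliseconds.
// Assumes trials > 0.
template <typename F>
double averageMs(F&& op, int warmups, int trials) {
    for (int i = 0; i < warmups; ++i) {
        op();  // absorbs one-time costs such as library initialization
    }
    double totalMs = 0.0;
    for (int i = 0; i < trials; ++i) {
        const auto start = std::chrono::steady_clock::now();
        op();
        const auto end = std::chrono::steady_clock::now();
        totalMs += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return totalMs / trials;
}
```

A call such as `averageMs([&] { StreamCompaction::CPU::scan(SIZE, b, a); }, 1, 10)` would fit inside the commented-out size loop in `main`. For the GPU paths, host wall-clock time is only meaningful if the call has finished its device work before returning, so the existing CUDA-measured timers remain the better source for those numbers; the warm-up call is the part that carries over.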