diff --git a/README.md b/README.md
index cad1abd..1a4f51d 100644
--- a/README.md
+++ b/README.md
@@ -5,16 +5,137 @@ CUDA Rasterizer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Rishabh Shah
+* Tested on: Windows 10, i7-6700HQ @ 2.6GHz 16GB, GTX 960M 4096MB (Laptop)
 
-### (TODO: Your README)
+## Overview
+In this project, I implemented a Rasterizer using CUDA. Rasterization is a technique used to draw 3D geometry on the screen in real-time. The geometry is in the form of vertices which are to be converted into geometry primitives (like triangles), and plotted on the screen.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+![duck](renders/duck_gif.gif)
 
+*the artefacts are due to gif compression and is not seen during actual run of the program*
+
+## Functionalities Implemented
+
+* Pipeline Stages:
+    * Vertex shading (`_vertexTransformAndAssembly` in `rasterize.cu`)
+    * Primitive assembly (`_primitiveAssembly` in `rasterize.cu`)
+    * Rasterization (`_rasterizeTriangles` in `rasterize.cu`)
+    * Fragment shading (Partly in `_rasterizeTriangles`, partly in `render` in `rasterize.cu`)
+    * Depth buffer for depth testing
+    * Fragment-to-depth-buffer writing with atomics for race avoidance.
+    * Lambert shading. (Inside `render` in `rasterize.cu`)
+
+* Additional Features:
+    * Post processing shader: Gaussian blur, with and without shared memory
+    * Additional pipeline stage:
+        * Backface culling, with and without stream compaction
+        * Blend for simulating depth of field
+    * UV texture mapping with bilinear texture filtering and perspective
+      correct texture coordinates
+    * Supersample antialiasing (SSAA)
+    * Support for rasterizing additional primitives:
+       * Lines
+       * Points
+
+
+#### Minor optimizations
+
+* Using `constant memory` for storing constant variables
+* Avoid if-else statements wherever possible
+* As few global memory reads as possible
+* Ignore the triangles during rasterization if the bounding box is outside the camera's view frustum
+
+
+## Results
+
+
+#### Lambertian Shading and Debug Views
+
+I create simple implementations for a few debug views to help me visualize what's going on.
+
+| Lambertian | Depth | Normals |
+| ----- | ----- | ----- |
+| ![](renders/diffuseOnly.png) | ![](renders/debug_depth.png) | ![](renders/debug_normals.png) |
+
+
+#### Texturing
+
+The first image is of naive UV mapping. The checkerboard looks incorrect because the UVs are interpolated without taking into account the perspective diminishing along the Z-axis. Performing perspective correct z-interpolation results in a correct result. Bilinear filtering further enhances the output by smoothing out sharp edges in the texture.
+
+| Texture | Perspective correction | Bilinear filtering |
+| ----- | ----- | ----- |
+| ![](renders/texture_check.png) | ![](renders/tex_perspcorr_check.png) | ![](renders/texture_persp_bil_check.png) |
+
+
+#### Point and Line Rendering, and Backface Culling
+
+I chose to implement Point and Line rendering in the rasterizer and not in the primitive assembly. So, it would be easy to change a few lines of code and enable rendering of multiple primitives simultaneously. This can be useful in visual debugging (like wireframe over shaded view in Maya). Line rendering is done using [Digital Differential Analyser (DDA) algorithm](http://www.geeksforgeeks.org/dda-line-generation-algorithm-computer-graphics/).
+
+Also, as can be seen in the third image, I implemented backface culling. Here, we basically remove the triangles that are not facing the camera, to reduce the computational cost. The code has two versions of it. One removes the triangles using stream compaction (thrust::copy_if), and the other one just skips the triangles that are not facing the camera, during rasterization.
+
+| Points | Lines | Backface culling |
+| ----- | ----- | ----- |
+| ![](renders/duck_points.png) | ![](renders/duck_line.png) | ![](renders/duck_backfacecull.png) |
+
+** Performance evaluation for backface culling**
+
+|  | Compaction (using thrust) | Rasterization |
+| ----- | ----- | ----- |
+|Without culling|0.0|1.216|
+|Cull without compaction|0.0|1.10222|
+|Cull with compaction|1.71259|0.499752|
+
+![chart](renders/chart.png)
+
+* Backface culling has some performance boost, but I think it would be much larger with large scenes with complecated meshes.
+* After compaction, rasterization is dramatically faster, even for simple meshes. But this comes at the cost of the large compaction overhead. This will only be useful when the scene is too large and the output resolution is also very high.
+
+#### Super Sample Anti-Aliasing
+
+In SSAA, we render the scene at a higher resolution, and then downsample it by averaging neighborhoods to get a smooth image free of jaggies. It is a very inefficient process, and just minimum anti-aliasing would require rendering 4 times larger image.
+
+| No SSAA | SSAA |
+| ----- | ----- |
+| ![](renders/truck_noSSAA.png) | ![](renders/truck_SSAA_4x.png) |
+
+
+#### Depth of Field (using Gaussian Blur with shared-memory and Blending)
+
+I imaplemented Gaussian Blur as a two pass process. The first pass blurs along the X-direction, and the second pass blurs along the Y-direction. This gives correct result as Gaussian blur is a seperable operation. Also, applying Gaussian blur multiple times is more efficient than applying bigger kernels. So, my implementation does Gaussian blur over 9x9 pixels twice.
+
+I am using shared memory to store the image so that I dont have to do 9 global memory reads for every thread. I have also implemented a version without shared memory for performance comparison.
+
+Depth of field can be obtained by adding a blend stage between the blur result and output. Here, we interpolate between the original render and the blurred result using depth. The results are not bad, but not realistic. Also, in some scenes, the ringing effect due to Gaussian blur can be seen when the mesh is too close and blurred.
+
+*the grainy result is due to gif compression and is not seen during actual run of the program*
+
+![](renders/dof_truck.gif)
+
+** Performance evaluation for shared memory blurring**
+
+| Without shared memory | With shared memory |
+| ----- | ----- |
+|6.93293|4.63802|
+
+![chart](renders/chart_1.png)
+
+* As expected, blurring is much faster with shared memory.
+* This performance gain would double when we apply the blurring twice for DOF.
+
+### Legendary Bloopers
+
+![](renders/blooper-duck-Capture.png)
+
+![](renders/duck_painting.png)
 
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+* [Digital Differential Analyser (DDA) algorithm](http://www.geeksforgeeks.org/dda-line-generation-algorithm-computer-graphics/)
+* [Depth of Field](https://mynameismjp.wordpress.com/the-museum/samples-tutorials-tools/depth-of-field-sample/)
+* [Shared Memory](https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/)
+* [Bilinear Texture Filtering](https://en.wikipedia.org/wiki/Bilinear_filtering)
+* [Gaussian Blur 1](http://www.sunsetlakesoftware.com/2013/10/21/optimizing-gaussian-blurs-mobile-gpu)
+* [Gaussian Blur 2](http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/)
diff --git a/renders/blooper-duck-Capture.png b/renders/blooper-duck-Capture.png
new file mode 100644
index 0000000..56fb38f
Binary files /dev/null and b/renders/blooper-duck-Capture.png differ
diff --git a/renders/chart.png b/renders/chart.png
new file mode 100644
index 0000000..9f04f83
Binary files /dev/null and b/renders/chart.png differ
diff --git a/renders/chart_1.png b/renders/chart_1.png
new file mode 100644
index 0000000..8ec47b5
Binary files /dev/null and b/renders/chart_1.png differ
diff --git a/renders/cow-points-normals.PNG b/renders/cow-points-normals.PNG
new file mode 100644
index 0000000..c01497a
Binary files /dev/null and b/renders/cow-points-normals.PNG differ
diff --git a/renders/cow-points-white.PNG b/renders/cow-points-white.PNG
new file mode 100644
index 0000000..548944c
Binary files /dev/null and b/renders/cow-points-white.PNG differ
diff --git a/renders/cow-wire.PNG b/renders/cow-wire.PNG
new file mode 100644
index 0000000..49480e6
Binary files /dev/null and b/renders/cow-wire.PNG differ
diff --git a/renders/debug_depth.png b/renders/debug_depth.png
new file mode 100644
index 0000000..eeae883
Binary files /dev/null and b/renders/debug_depth.png differ
diff --git a/renders/debug_normals.png b/renders/debug_normals.png
new file mode 100644
index 0000000..2f7042f
Binary files /dev/null and b/renders/debug_normals.png differ
diff --git a/renders/diffuseOnly.png b/renders/diffuseOnly.png
new file mode 100644
index 0000000..1cd7cf9
Binary files /dev/null and b/renders/diffuseOnly.png differ
diff --git a/renders/dof_truck.gif b/renders/dof_truck.gif
new file mode 100644
index 0000000..13baa36
Binary files /dev/null and b/renders/dof_truck.gif differ
diff --git a/renders/duck-textured.PNG b/renders/duck-textured.PNG
new file mode 100644
index 0000000..a046794
Binary files /dev/null and b/renders/duck-textured.PNG differ
diff --git a/renders/duck_DOF.gif b/renders/duck_DOF.gif
new file mode 100644
index 0000000..437607f
Binary files /dev/null and b/renders/duck_DOF.gif differ
diff --git a/renders/duck_backfacecull.png b/renders/duck_backfacecull.png
new file mode 100644
index 0000000..8fa1aac
Binary files /dev/null and b/renders/duck_backfacecull.png differ
diff --git a/renders/duck_diffuse.png b/renders/duck_diffuse.png
new file mode 100644
index 0000000..4a381b3
Binary files /dev/null and b/renders/duck_diffuse.png differ
diff --git a/renders/duck_gif.gif b/renders/duck_gif.gif
new file mode 100644
index 0000000..2cc6bd4
Binary files /dev/null and b/renders/duck_gif.gif differ
diff --git a/renders/duck_line.png b/renders/duck_line.png
new file mode 100644
index 0000000..31ba076
Binary files /dev/null and b/renders/duck_line.png differ
diff --git a/renders/duck_painting.gif b/renders/duck_painting.gif
new file mode 100644
index 0000000..57a93c3
Binary files /dev/null and b/renders/duck_painting.gif differ
diff --git a/renders/duck_painting.png b/renders/duck_painting.png
new file mode 100644
index 0000000..1399930
Binary files /dev/null and b/renders/duck_painting.png differ
diff --git a/renders/duck_points.png b/renders/duck_points.png
new file mode 100644
index 0000000..2b26b49
Binary files /dev/null and b/renders/duck_points.png differ
diff --git a/renders/2-cylinder-engine-load-correct.png b/renders/old/2-cylinder-engine-load-correct.png
similarity index 100%
rename from renders/2-cylinder-engine-load-correct.png
rename to renders/old/2-cylinder-engine-load-correct.png
diff --git a/renders/duck-diffuse-texture.png b/renders/old/duck-diffuse-texture.png
similarity index 100%
rename from renders/duck-diffuse-texture.png
rename to renders/old/duck-diffuse-texture.png
diff --git a/renders/triangle.png b/renders/old/triangle.png
similarity index 100%
rename from renders/triangle.png
rename to renders/old/triangle.png
diff --git a/renders/tex_perspcorr_check.png b/renders/tex_perspcorr_check.png
new file mode 100644
index 0000000..b49ccae
Binary files /dev/null and b/renders/tex_perspcorr_check.png differ
diff --git a/renders/texture.png b/renders/texture.png
new file mode 100644
index 0000000..6cf3e16
Binary files /dev/null and b/renders/texture.png differ
diff --git a/renders/texture_check.png b/renders/texture_check.png
new file mode 100644
index 0000000..1ee6967
Binary files /dev/null and b/renders/texture_check.png differ
diff --git a/renders/texture_persp_bil_check.png b/renders/texture_persp_bil_check.png
new file mode 100644
index 0000000..db503a2
Binary files /dev/null and b/renders/texture_persp_bil_check.png differ
diff --git a/renders/texture_perspcorr.png b/renders/texture_perspcorr.png
new file mode 100644
index 0000000..022ee0b
Binary files /dev/null and b/renders/texture_perspcorr.png differ
diff --git a/renders/truck-line.PNG b/renders/truck-line.PNG
new file mode 100644
index 0000000..b23f427
Binary files /dev/null and b/renders/truck-line.PNG differ
diff --git a/renders/truck_SSAA_4x.png b/renders/truck_SSAA_4x.png
new file mode 100644
index 0000000..3613737
Binary files /dev/null and b/renders/truck_SSAA_4x.png differ
diff --git a/renders/truck_noSSAA.png b/renders/truck_noSSAA.png
new file mode 100644
index 0000000..72ff696
Binary files /dev/null and b/renders/truck_noSSAA.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..dc2e868 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,117 +14,118 @@
 #define TINYGLTF_LOADER_IMPLEMENTATION
 #include <util/tiny_gltf_loader.h>
 
-//-------------------------------
-//-------------MAIN--------------
-//-------------------------------
+ //-------------------------------
+ //-------------MAIN--------------
+ //-------------------------------
 
 int main(int argc, char **argv) {
-    if (argc != 2) {
-        cout << "Usage: [gltf file]. Press Enter to exit" << endl;
-		getchar();
-        return 0;
-    }
-
-	tinygltf::Scene scene;
-	tinygltf::TinyGLTFLoader loader;
-	std::string err;
-	std::string input_filename(argv[1]);
-	std::string ext = getFilePathExtension(input_filename);
-
-	bool ret = false;
-	if (ext.compare("glb") == 0) {
-		// assume binary glTF.
-		ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
-	} else {
-		// assume ascii glTF.
-		ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
-	}
-
-	if (!err.empty()) {
-		printf("Err: %s\n", err.c_str());
-	}
-
-	if (!ret) {
-		printf("Failed to parse glTF\n");
-		return -1;
-	}
-
-
-    frame = 0;
-    seconds = time(NULL);
-    fpstracker = 0;
-
-    // Launch CUDA/GL
-    if (init(scene)) {
-        // GLFW main loop
-        mainLoop();
-    }
-
+  if (argc != 2) {
+    cout << "Usage: [gltf file]. Press Enter to exit" << endl;
+    getchar();
     return 0;
+  }
+
+  tinygltf::Scene scene;
+  tinygltf::TinyGLTFLoader loader;
+  std::string err;
+  std::string input_filename(argv[1]);
+  std::string ext = getFilePathExtension(input_filename);
+
+  bool ret = false;
+  if (ext.compare("glb") == 0) {
+    // assume binary glTF.
+    ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
+  }
+  else {
+    // assume ascii glTF.
+    ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
+  }
+
+  if (!err.empty()) {
+    printf("Err: %s\n", err.c_str());
+  }
+
+  if (!ret) {
+    printf("Failed to parse glTF\n");
+    return -1;
+  }
+
+
+  frame = 0;
+  seconds = time(NULL);
+  fpstracker = 0;
+
+  // Launch CUDA/GL
+  if (init(scene)) {
+    // GLFW main loop
+    mainLoop();
+  }
+
+  return 0;
 }
 
 void mainLoop() {
-    while (!glfwWindowShouldClose(window)) {
-        glfwPollEvents();
-        runCuda();
+  while (!glfwWindowShouldClose(window)) {
+    glfwPollEvents();
+    runCuda();
 
-        time_t seconds2 = time (NULL);
+    time_t seconds2 = time(NULL);
 
-        if (seconds2 - seconds >= 1) {
+    if (seconds2 - seconds >= 1) {
 
-            fps = fpstracker / (seconds2 - seconds);
-            fpstracker = 0;
-            seconds = seconds2;
-        }
-
-        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
-        glfwSetWindowTitle(window, title.c_str());
+      fps = fpstracker / (seconds2 - seconds);
+      fpstracker = 0;
+      seconds = seconds2;
+    }
 
-        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
-        glBindTexture(GL_TEXTURE_2D, displayImage);
-        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
-        glClear(GL_COLOR_BUFFER_BIT);
+    string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+    glfwSetWindowTitle(window, title.c_str());
 
-        // VAO, shader program, and texture already bound
-        glDrawElements(GL_TRIANGLES, 6,  GL_UNSIGNED_SHORT, 0);
-        glfwSwapBuffers(window);
-    }
-    glfwDestroyWindow(window);
-    glfwTerminate();
+    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+    glBindTexture(GL_TEXTURE_2D, displayImage);
+    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
+    glClear(GL_COLOR_BUFFER_BIT);
+
+    // VAO, shader program, and texture already bound
+    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);
+    glfwSwapBuffers(window);
+  }
+  glfwDestroyWindow(window);
+  glfwTerminate();
 }
 
 //-------------------------------
 //---------RUNTIME STUFF---------
 //-------------------------------
-float scale = 1.0f;
+float scale = 1.f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
-    // Map OpenGL buffer object for writing from CUDA on a single GPU
-    // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
-    dptr = NULL;
+  // Map OpenGL buffer object for writing from CUDA on a single GPU
+  // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
+  dptr = NULL;
 
-	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
-		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+  glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
+    scale * ((float)width / (float)height),
+    -scale, scale, 1.0, 1000.0);
 
-	glm::mat4 V = glm::mat4(1.0f);
+  glm::mat4 V = glm::mat4(1.0f);
 
-	glm::mat4 M =
-		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
-		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+  glm::mat4 M =
+    glm::translate(glm::vec3(x_trans, y_trans, z_trans))
+    * glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
+    * glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
 
-	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
-	glm::mat4 MV = V * M;
-	glm::mat4 MVP = P * MV;
+  glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
+  glm::mat4 MV = V * M;
+  glm::mat4 MVP = P * MV;
 
-    cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
-    cudaGLUnmapBufferObject(pbo);
+  cudaGLMapBufferObject((void **)&dptr, pbo);
+  rasterize(dptr, MVP, MV, MV_normal);
+  cudaGLUnmapBufferObject(pbo);
 
-    frame++;
-    fpstracker++;
+  frame++;
+  fpstracker++;
 }
 
 //-------------------------------
@@ -132,149 +133,149 @@ void runCuda() {
 //-------------------------------
 
 bool init(const tinygltf::Scene & scene) {
-    glfwSetErrorCallback(errorCallback);
+  glfwSetErrorCallback(errorCallback);
 
-    if (!glfwInit()) {
-        return false;
-    }
-
-    width = 800;
-    height = 800;
-    window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
-    if (!window) {
-        glfwTerminate();
-        return false;
-    }
-    glfwMakeContextCurrent(window);
-    glfwSetKeyCallback(window, keyCallback);
+  if (!glfwInit()) {
+    return false;
+  }
 
-    // Set up GL context
-    glewExperimental = GL_TRUE;
-    if (glewInit() != GLEW_OK) {
-        return false;
+  width = 800;
+  height = 800;
+  window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
+  if (!window) {
+    glfwTerminate();
+    return false;
+  }
+  glfwMakeContextCurrent(window);
+  glfwSetKeyCallback(window, keyCallback);
+
+  // Set up GL context
+  glewExperimental = GL_TRUE;
+  if (glewInit() != GLEW_OK) {
+    return false;
+  }
+
+  // Initialize other stuff
+  initVAO();
+  initTextures();
+  initCuda();
+  initPBO();
+
+  // Mouse Control Callbacks
+  glfwSetMouseButtonCallback(window, mouseButtonCallback);
+  glfwSetCursorPosCallback(window, mouseMotionCallback);
+  glfwSetScrollCallback(window, mouseWheelCallback);
+
+  {
+    std::map<std::string, std::vector<std::string> >::const_iterator it(
+      scene.scenes.begin());
+    std::map<std::string, std::vector<std::string> >::const_iterator itEnd(
+      scene.scenes.end());
+
+    for (; it != itEnd; it++) {
+      for (size_t i = 0; i < it->second.size(); i++) {
+        std::cout << it->second[i]
+          << ((i != (it->second.size() - 1)) ? ", " : "");
+      }
+      std::cout << " ] " << std::endl;
     }
+  }
 
-    // Initialize other stuff
-    initVAO();
-    initTextures();
-    initCuda();
-	initPBO();
 
-	// Mouse Control Callbacks
-	glfwSetMouseButtonCallback(window, mouseButtonCallback);
-	glfwSetCursorPosCallback(window, mouseMotionCallback);
-	glfwSetScrollCallback(window, mouseWheelCallback);
+  rasterizeSetBuffers(scene);
 
-	{
-		std::map<std::string, std::vector<std::string> >::const_iterator it(
-			scene.scenes.begin());
-		std::map<std::string, std::vector<std::string> >::const_iterator itEnd(
-			scene.scenes.end());
+  GLuint passthroughProgram;
+  passthroughProgram = initShader();
 
-		for (; it != itEnd; it++) {
-			for (size_t i = 0; i < it->second.size(); i++) {
-				std::cout << it->second[i]
-					<< ((i != (it->second.size() - 1)) ? ", " : "");
-			}
-			std::cout << " ] " << std::endl;
-		}
-	}
+  glUseProgram(passthroughProgram);
+  glActiveTexture(GL_TEXTURE0);
 
-
-	rasterizeSetBuffers(scene);
-
-    GLuint passthroughProgram;
-    passthroughProgram = initShader();
-
-    glUseProgram(passthroughProgram);
-    glActiveTexture(GL_TEXTURE0);
-
-    return true;
+  return true;
 }
 
 void initPBO() {
-    // set up vertex data parameter
-    int num_texels = width * height;
-    int num_values = num_texels * 4;
-    int size_tex_data = sizeof(GLubyte) * num_values;
+  // set up vertex data parameter
+  int num_texels = width * height;
+  int num_values = num_texels * 4;
+  int size_tex_data = sizeof(GLubyte) * num_values;
 
-    // Generate a buffer ID called a PBO (Pixel Buffer Object)
-    glGenBuffers(1, &pbo);
+  // Generate a buffer ID called a PBO (Pixel Buffer Object)
+  glGenBuffers(1, &pbo);
 
-    // Make this the current UNPACK buffer (OpenGL is state-based)
-    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+  // Make this the current UNPACK buffer (OpenGL is state-based)
+  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
 
-    // Allocate data for the buffer. 4-channel 8-bit image
-    glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
-    cudaGLRegisterBufferObject(pbo);
+  // Allocate data for the buffer. 4-channel 8-bit image
+  glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
+  cudaGLRegisterBufferObject(pbo);
 
 }
 
 void initCuda() {
-    // Use device with highest Gflops/s
-    cudaGLSetGLDevice(0);
+  // Use device with highest Gflops/s
+  cudaGLSetGLDevice(0);
 
-    rasterizeInit(width, height);
+  rasterizeInit(width, height);
 
-    // Clean up on program exit
-    atexit(cleanupCuda);
+  // Clean up on program exit
+  atexit(cleanupCuda);
 }
 
 void initTextures() {
-    glGenTextures(1, &displayImage);
-    glBindTexture(GL_TEXTURE_2D, displayImage);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
-    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
-                  GL_UNSIGNED_BYTE, NULL);
+  glGenTextures(1, &displayImage);
+  glBindTexture(GL_TEXTURE_2D, displayImage);
+  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
+  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
+  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
+    GL_UNSIGNED_BYTE, NULL);
 }
 
 void initVAO(void) {
-    GLfloat vertices[] = {
-        -1.0f, -1.0f,
-        1.0f, -1.0f,
-        1.0f,  1.0f,
-        -1.0f,  1.0f,
-    };
-
-    GLfloat texcoords[] = {
-        1.0f, 1.0f,
-        0.0f, 1.0f,
-        0.0f, 0.0f,
-        1.0f, 0.0f
-    };
-
-    GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
-
-    GLuint vertexBufferObjID[3];
-    glGenBuffers(3, vertexBufferObjID);
-
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
-    glEnableVertexAttribArray(positionLocation);
-
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
-    glEnableVertexAttribArray(texcoordsLocation);
-
-    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
-    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
+  GLfloat vertices[] = {
+      -1.0f, -1.0f,
+      1.0f, -1.0f,
+      1.0f,  1.0f,
+      -1.0f,  1.0f,
+  };
+
+  GLfloat texcoords[] = {
+      1.0f, 1.0f,
+      0.0f, 1.0f,
+      0.0f, 0.0f,
+      1.0f, 0.0f
+  };
+
+  GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
+
+  GLuint vertexBufferObjID[3];
+  glGenBuffers(3, vertexBufferObjID);
+
+  glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
+  glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
+  glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+  glEnableVertexAttribArray(positionLocation);
+
+  glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
+  glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
+  glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+  glEnableVertexAttribArray(texcoordsLocation);
+
+  glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
+  glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
 }
 
 
 GLuint initShader() {
-    const char *attribLocations[] = { "Position", "Tex" };
-    GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
-    GLint location;
+  const char *attribLocations[] = { "Position", "Tex" };
+  GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
+  GLint location;
 
-    glUseProgram(program);
-    if ((location = glGetUniformLocation(program, "u_image")) != -1) {
-        glUniform1i(location, 0);
-    }
+  glUseProgram(program);
+  if ((location = glGetUniformLocation(program, "u_image")) != -1) {
+    glUniform1i(location, 0);
+  }
 
-    return program;
+  return program;
 }
 
 //-------------------------------
@@ -282,38 +283,38 @@ GLuint initShader() {
 //-------------------------------
 
 void cleanupCuda() {
-    if (pbo) {
-        deletePBO(&pbo);
-    }
-    if (displayImage) {
-        deleteTexture(&displayImage);
-    }
+  if (pbo) {
+    deletePBO(&pbo);
+  }
+  if (displayImage) {
+    deleteTexture(&displayImage);
+  }
 }
 
 void deletePBO(GLuint *pbo) {
-    if (pbo) {
-        // unregister this buffer object with CUDA
-        cudaGLUnregisterBufferObject(*pbo);
+  if (pbo) {
+    // unregister this buffer object with CUDA
+    cudaGLUnregisterBufferObject(*pbo);
 
-        glBindBuffer(GL_ARRAY_BUFFER, *pbo);
-        glDeleteBuffers(1, pbo);
+    glBindBuffer(GL_ARRAY_BUFFER, *pbo);
+    glDeleteBuffers(1, pbo);
 
-        *pbo = (GLuint)NULL;
-    }
+    *pbo = (GLuint)NULL;
+  }
 }
 
 void deleteTexture(GLuint *tex) {
-    glDeleteTextures(1, tex);
-    *tex = (GLuint)NULL;
+  glDeleteTextures(1, tex);
+  *tex = (GLuint)NULL;
 }
 
 void shut_down(int return_code) {
-    rasterizeFree();
-    cudaDeviceReset();
+  rasterizeFree();
+  cudaDeviceReset();
 #ifdef __APPLE__
-    glfwTerminate();
+  glfwTerminate();
 #endif
-    exit(return_code);
+  exit(return_code);
 }
 
 //------------------------------
@@ -321,22 +322,22 @@ void shut_down(int return_code) {
 //------------------------------
 
 void errorCallback(int error, const char *description) {
-    fputs(description, stderr);
+  fputs(description, stderr);
 }
 
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) {
-    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
-        glfwSetWindowShouldClose(window, GL_TRUE);
-    }
+  if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
+    glfwSetWindowShouldClose(window, GL_TRUE);
+  }
 }
 
 //----------------------------
 //----- util -----------------
 //----------------------------
 static std::string getFilePathExtension(const std::string &FileName) {
-	if (FileName.find_last_of(".") != std::string::npos)
-		return FileName.substr(FileName.find_last_of(".") + 1);
-	return "";
+  if (FileName.find_last_of(".") != std::string::npos)
+    return FileName.substr(FileName.find_last_of(".") + 1);
+  return "";
 }
 
 
@@ -349,52 +350,52 @@ enum ControlState { NONE = 0, ROTATE, TRANSLATE };
 ControlState mouseState = NONE;
 void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods)
 {
-	if (action == GLFW_PRESS)
-	{
-		if (button == GLFW_MOUSE_BUTTON_LEFT)
-		{
-			mouseState = ROTATE;
-		}
-		else if (button == GLFW_MOUSE_BUTTON_RIGHT)
-		{
-			mouseState = TRANSLATE;
-		}
-
-	}
-	else if (action == GLFW_RELEASE)
-	{
-		mouseState = NONE;
-	}
+  if (action == GLFW_PRESS)
+  {
+    if (button == GLFW_MOUSE_BUTTON_LEFT)
+    {
+      mouseState = ROTATE;
+    }
+    else if (button == GLFW_MOUSE_BUTTON_RIGHT)
+    {
+      mouseState = TRANSLATE;
+    }
+
+  }
+  else if (action == GLFW_RELEASE)
+  {
+    mouseState = NONE;
+  }
 }
 
 double lastx = (double)width / 2;
 double lasty = (double)height / 2;
 void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 {
-	const double s_r = 0.01;
-	const double s_t = 0.01;
-
-	double diffx = xpos - lastx;
-	double diffy = ypos - lasty;
-	lastx = xpos;
-	lasty = ypos;
-
-	if (mouseState == ROTATE)
-	{
-		//rotate
-		x_angle += (float)s_r * diffy;
-		y_angle += (float)s_r * diffx;
-	}
-	else if (mouseState == TRANSLATE)
-	{
-		//translate
-		x_trans += (float)(s_t * diffx);
-		y_trans += (float)(-s_t * diffy);
-	}
+  const double s_r = 0.01;
+  const double s_t = 0.01;
+
+  double diffx = xpos - lastx;
+  double diffy = ypos - lasty;
+  lastx = xpos;
+  lasty = ypos;
+
+  if (mouseState == ROTATE)
+  {
+    //rotate
+    x_angle += (float)s_r * diffy;
+    y_angle += (float)s_r * diffx;
+  }
+  else if (mouseState == TRANSLATE)
+  {
+    //translate
+    x_trans += (float)(s_t * diffx);
+    y_trans += (float)(-s_t * diffy);
+  }
 }
 
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
-	const double s = 1.0;	// sensitivity
-	z_trans += (float)(s * yoffset);
+  const double s = 1.0;	// sensitivity
+  z_trans += (float)(s * yoffset);
 }
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..b1bb545 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -18,6 +18,33 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+#include <thrust/execution_policy.h>
+#include <thrust/partition.h>
+
+#include <chrono>
+
+#define POINTS 0
+#define WIREFRAME 0
+
+#define DEBUG_DEPTH 0
+#define DEBUG_NORMALS 0
+
+#define TEXTURE 1
+#define TEXTURE_PERSP_CORRECT 1
+#define TEXTURE_BILINEAR_FILT 1
+
+#define BLUR 0
+#define BLUR_SHARED 0
+
+#define SSAO 0 // SCREEN SPACE AMBIENT OCCLUSION : WORK UNDER PROGRESS 
+
+#define BBOX_OPTIMIZATIONS 1
+#define BACK_FACE_CULLING 0
+#define BACK_FACE_CULLING_WITHOUT_COMPACTION 0
+
+#define SSAA 2 // SUPERSAMPLE ANTIALIASING
+               // 1 (min value : no AA), 2, 4, ...
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -28,31 +55,32 @@ namespace {
 
 	typedef unsigned char BufferByte;
 
-	enum PrimitiveType{
+	enum PrimitiveType {
 		Point = 1,
 		Line = 2,
 		Triangle = 3
 	};
 
 	struct VertexOut {
-		glm::vec4 pos;
-
-		// TODO: add new attributes to your VertexOut
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		 glm::vec3 eyePos;	// eye space position used for shading
-		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
-		// ...
+    glm::vec4 pos;
+
+    // TODO: add new attributes to your VertexOut
+    // The attributes listed below might be useful, 
+    // but always feel free to modify on your own
+
+    glm::vec3 eyePos;	// eye space position used for shading
+    glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
+    //glm::vec3 col;
+    glm::vec2 texcoord0;
+    TextureData* dev_diffuseTex = NULL;
+    int texWidth, texHeight;
+    // ...
 	};
 
 	struct Primitive {
 		PrimitiveType primitiveType = Triangle;	// C++ 11 init
 		VertexOut v[3];
+    bool back = false; // used for back face culling
 	};
 
 	struct Fragment {
@@ -62,10 +90,11 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;
+		VertexAttributeTexcoord texcoord0;
+		TextureData* dev_diffuseTex = NULL;
+    int texWidth, texHeight;
 		// ...
 	};
 
@@ -100,16 +129,70 @@ namespace {
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
-
 static int width = 0;
 static int height = 0;
 
+static int totalNumPrimitivesCompact = 0; // updated after compaction by culling
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static int * dev_mutex = NULL;
+
+static glm::vec3 *dev_postBuffer1 = NULL;
+static glm::vec3 *dev_postBuffer2 = NULL;
+
+static Primitive *dev_primitives_compact = NULL;
+
+// gaussian kernel
+// source: http://www.sunsetlakesoftware.com/2013/10/21/optimizing-gaussian-blurs-mobile-gpu
+// source: http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/
+__constant__ float mat[5] =
+  { 0.2270270270, 0.1945945946, 0.1216216216, 0.0540540541, 0.0162162162 };
+
+__constant__ float light[3] = {100.f, 100.f, 100.f};
+
+/**
+* Called once at the beginning of the program to allocate memory.
+*/
+void rasterizeInit(int w, int h) {
+  width = w * SSAA;
+  height = h * SSAA;
+  cudaFree(dev_fragmentBuffer);
+  cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  cudaFree(dev_framebuffer);
+  cudaMalloc(&dev_framebuffer, width * height * sizeof(glm::vec3));
+  cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+
+  cudaMalloc(&dev_postBuffer1, width * height * sizeof(glm::vec3));
+  cudaMemset(dev_postBuffer1, 0, width * height * sizeof(glm::vec3));
+  cudaMalloc(&dev_postBuffer2, width * height * sizeof(glm::vec3));
+  cudaMemset(dev_postBuffer2, 0, width * height * sizeof(glm::vec3));
+
+  cudaFree(dev_depth);
+  cudaMalloc(&dev_depth, width * height * sizeof(int));
+  cudaFree(dev_mutex);
+  cudaMalloc(&dev_mutex, width * height * sizeof(int));
+
+  checkCUDAError("rasterizeInit");
+}
+
+__global__
+void initDepth(int w, int h, int *depth)
+{
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x < w && y < h)
+  {
+    int index = x + (y * w);
+    depth[index] = INT_MAX;
+  }
+}
+
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
@@ -118,13 +201,22 @@ __global__
 void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
+    int width = w / SSAA;
+    int height = h / SSAA;
+    int index = x + (y * width);
 
-    if (x < w && y < h) {
+    if (x < width && y < height) {
         glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+        for (int i = 0; i < SSAA; i++) {
+          for (int j = 0; j < SSAA; j++) {
+            int ind = (x *SSAA) + i + (y * SSAA + j) * w;
+            color.x += glm::clamp(image[ind].x, 0.0f, 1.0f) * 255.0;
+            color.y += glm::clamp(image[ind].y, 0.0f, 1.0f) * 255.0;
+            color.z += glm::clamp(image[ind].z, 0.0f, 1.0f) * 255.0;
+          }
+        }
+        color /= (SSAA*SSAA);
+
         // Each thread writes one pixel location in the texture (textel)
         pbo[index].w = 0;
         pbo[index].x = color.x;
@@ -133,6 +225,52 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     }
 }
 
+
+/**
+* Returns texture color for a fragment
+*/
+__device__
+glm::vec3 getColor(Fragment &fragment, glm::vec2 uv) {
+  // the color is stored in a 1D array of floats..
+  //    convert to 1D index
+  //    scale by 3
+
+  int index = ((int)uv.x + (int)uv.y * fragment.texWidth) * 3;
+  glm::vec3 col(fragment.dev_diffuseTex[index],
+    fragment.dev_diffuseTex[index + 1],
+    fragment.dev_diffuseTex[index + 2]);
+  return col / 255.f; // map colors to 0-1 range..
+}
+
+
+/**
+* Returns Bilinear Filtered texture color for a fragment
+* Reference: https://en.wikipedia.org/wiki/Bilinear_filtering
+*/
+__device__
+glm::vec3 getBilinearFilteredColor(Fragment &fragment, glm::vec2 uv) {
+  // get 4 valid indices..
+  int intX0 = uv.x;
+  int intY0 = uv.y;
+  int intX1 = glm::clamp(intX0 + 1, 0, fragment.texWidth - 1);
+  int intY1 = glm::clamp(intY0 + 1, 0, fragment.texHeight - 1);
+
+  // get colors at 4 texels..
+  glm::vec3 col00 = getColor(fragment, glm::vec2(intX0, intY0));
+  glm::vec3 col01 = getColor(fragment, glm::vec2(intX0, intY1));
+  glm::vec3 col10 = getColor(fragment, glm::vec2(intX1, intY0));
+  glm::vec3 col11 = getColor(fragment, glm::vec2(intX1, intY1));
+
+  // lerp based on fractional parts..
+  float fracX = uv.x - intX0;
+  float fracY = uv.y - intY0;
+
+  glm::vec3 col0001 = glm::mix(col00, col01, fracY);
+  glm::vec3 col1011 = glm::mix(col10, col11, fracY);
+  return glm::mix(col0001, col1011, fracX);
+}
+
+
 /** 
 * Writes fragment colors to the framebuffer
 */
@@ -140,49 +278,223 @@ __global__
 void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
+    
     if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+      int index = x + (y * w);
+
+      glm::vec3 &outPix = framebuffer[index];
+      Fragment &frag = fragmentBuffer[index];
+
+      if (frag.color == glm::vec3()) {
+        outPix *= 0.f;
+        return;
+      }
+		  
+#if POINTS || WIREFRAME
+      outPix = frag.color;
+
+#elif DEBUG_DEPTH
+      outPix = frag.color; // maybe scale this to look better???
+
+#elif DEBUG_NORMALS
+      outPix = frag.color;
+
+#elif TEXTURE
+      if (frag.dev_diffuseTex != NULL) {
+
+  #if TEXTURE_BILINEAR_FILT
+        glm::vec2 uv(frag.texcoord0.x * frag.texWidth, frag.texcoord0.y * frag.texHeight);
+        outPix = getBilinearFilteredColor(frag, uv);
+  #else
+        glm::vec2 uv(frag.texcoord0.x * frag.texWidth, frag.texcoord0.y * frag.texHeight);
+        outPix = getColor(frag, uv);
+  #endif
+
+        // LAMBERT SHADING:
+        glm::vec3 lightDir = glm::normalize(frag.eyePos - glm::vec3(1.f));
+        outPix *= glm::max(fabs(glm::dot(lightDir, frag.eyeNor)), 0.2f);
+      }
+      else {
+        outPix = glm::vec3(); // reset color..
+      }
+
+#else
+
+      // LAMBERT SHADING:
+
+      glm::vec3 lightDir = frag.eyePos - glm::vec3(light[0], light[1], light[2]);
+      lightDir = glm::normalize(lightDir);
+      float dot = glm::dot(-lightDir, frag.eyeNor);
+      if (dot <= 0.1f) {
+        dot = 0.1f;
+      }
+      glm::vec3 col = frag.color * dot;
+      outPix = col;
+
+#endif
+    }
+}
 
-		// TODO: add your fragment shader code here
 
+
+// Screen space ambient occlusion
+// https://www.gamedev.net/articles/programming/graphics/a-simple-and-practical-approach-to-ssao-r2753/
+__global__
+void shaderSSAO(int w, int h, Fragment *fragmentBuffer) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x < w && y < h) {
+    int index = x + (y * w);
+
+    Fragment &frag = fragmentBuffer[index];
+    float ao = 0.f;
+    for (int i = -2; i < 3; i++) {
+      for (int j = -2; j < 3; j++) {
+        int xi = x + i;
+        int yi = y + i;
+
+        if (xi > 0 && yi > 0 && xi < w && yi < h && i != 0 && j != 0) {
+          int idxi = xi + (yi * w);
+          Fragment &fragi = fragmentBuffer[idxi];
+          glm::vec3 N = fragi.eyeNor;
+          glm::vec3 V = fragi.eyePos - frag.eyePos;
+          float d = glm::length(V);
+          ao += glm::max(0.f, glm::dot(N, glm::normalize(V))) * (1.f / (1.f + d));
+        }
+      }
     }
+
+    frag.color = ao * glm::vec3(1.f);
+  }
 }
 
+
+
 /**
- * Called once at the beginning of the program to allocate memory.
- */
-void rasterizeInit(int w, int h) {
-    width = w;
-    height = h;
-	cudaFree(dev_fragmentBuffer);
-	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-    cudaFree(dev_framebuffer);
-    cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
-    cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
-	cudaFree(dev_depth);
-	cudaMalloc(&dev_depth, width * height * sizeof(int));
+* Post Processing Shader
+* operates on framebuffer
+* uses the __constant__ gaussian kernel defined above..
+*/
+__global__
+void postProcess(bool dirX, int w, int h, glm::vec3 *frameBuffer, int* depthBuffer, glm::vec3 *postBuffer) {
+  int x = dirX ? threadIdx.x : blockIdx.x;
+  int y = dirX ? blockIdx.x : threadIdx.x;
+  int index = x + (y * w);
+
+  if (x < w && y < h) {
+
+    glm::vec3 col = mat[0] * frameBuffer[index];
+    if (dirX) {
+      for (int i = 1; i < 5; i++) {
+        if (x + i < w) {
+          col += mat[i] * frameBuffer[(x + i) + y * w];
+        }
+        if (x - i >= 0) {
+          col += mat[i] * frameBuffer[(x - i) + y * w];
+        }
+      }
+    }
+    else {
+      for (int i = 1; i < 5; i++) {
+        if (y + i < h) {
+          col += mat[i] * frameBuffer[x + (y + i) * w];
+        }
+        if (y - i >= 0) {
+          col += mat[i] * frameBuffer[x + (y - i) * w];
+        }
+      }
+    }
 
-	checkCUDAError("rasterizeInit");
+    postBuffer[index] = col;
+  }
+  else {
+    postBuffer[index] = frameBuffer[index];
+  }
 }
 
 __global__
-void initDepth(int w, int h, int * depth)
-{
-	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+void postProcessShared(bool dirX, int w, int h, glm::vec3 *frameBuffer, int* depthBuffer, glm::vec3 *postBuffer) {
+  int x = dirX ? threadIdx.x : blockIdx.x;
+  int y = dirX ? blockIdx.x : threadIdx.x;
+  int index = x + (y * w);
+
+  if (x >= w || y >= h) {
+    return;
+  }
+
+  // https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
+  extern __shared__ glm::vec3 sm[];
+  if (dirX) {
+    sm[x] = frameBuffer[index];
+  }
+  else {
+    sm[y] = frameBuffer[index];
+  }
+  __syncthreads();
+
+  int depth = depthBuffer[index];
+  glm::vec3 col;
+  if (dirX) {
+    col = mat[0] * sm[x];
+    for (int i = 1; i < 5; i++) {
+      if (x + i < w) {
+        if (abs(depthBuffer[(x + i) + y * w] - depth) < 10000) {
+          col += mat[i] * sm[x + i];
+
+        }
+      }
+      if (x - i >= 0) {
+        if (abs(depthBuffer[(x - i) + y * w] - depth) < 10000) {
+          col += mat[i] * sm[x - i];
+        }
+      }
+    }
+  }
+  else {
+    col = mat[0] * sm[y];
+    for (int i = 1; i < 5; i++) {
+      if (y + i < h) {
+        if (abs(depthBuffer[x + (y + i) * w] - depth) < 10000) {
+          col += mat[i] * sm[y + i];
+        }
+      }
+      if (y - i >= 0) {
+        if (abs(depthBuffer[x + (y - i) * w] - depth) < 10000) {
+          col += mat[i] * sm[y - i];
+        }
+      }
+    }
+  }
 
-	if (x < w && y < h)
-	{
-		int index = x + (y * w);
-		depth[index] = INT_MAX;
-	}
+  postBuffer[index] = col;
 }
 
 
+/**
+* Blend kernel for DOF
+* reference: https://mynameismjp.wordpress.com/the-museum/samples-tutorials-tools/depth-of-field-sample/
+*/
+__global__
+void postBlend(int w, int h, glm::vec3 *frameBuffer, int *depthBuffer, glm::vec3 *postBuffer) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+
+  if (x < w && y < h && frameBuffer[index]!=glm::vec3()) {
+    float z = depthBuffer[index] * .0001f;
+    //z = z < 0.3 ? fabs(z - 0.3) * 2.f : fabs(z - 0.3) * 2.f;
+    //postBuffer[index] *= z;
+    //postBuffer[index] += (1 - z) * frameBuffer[index];
+
+    z = fabs(z - 0.5) * 2.f;
+    postBuffer[index] *= z;
+    postBuffer[index] += (1 - z) * frameBuffer[index];
+  }
+}
+
+
+
 /**
 * kern function with support for stride to sometimes replace cudaMemcpy
 * One thread is responsible for copying one component
@@ -600,6 +912,9 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 	// 3. Malloc for dev_primitives
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
+#if BACK_FACE_CULLING
+    cudaMalloc(&dev_primitives_compact, totalNumPrimitives * sizeof(Primitive));
+#endif
 	}
 	
 
@@ -639,16 +954,37 @@ void _vertexTransformAndAssembly(
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
 
+    VertexOut &outVert = primitive.dev_verticesOut[vid];
+    
+    glm::vec4 inPos(primitive.dev_position[vid], 1.f);
+    glm::vec4 outPos;
+
+    outPos = MVP * inPos; // transform
+    outPos /= outPos.w; // rehomogenize
+
+    outVert.pos.x = (1.f - outPos.x) * width * 0.5f;
+    outVert.pos.y = (1.f - outPos.y) * height * 0.5f;
+    outVert.pos.z = -outPos.z;
+
+    outVert.eyePos = glm::vec3(MV * inPos);
+    outVert.eyeNor = glm::normalize(MV_normal * primitive.dev_normal[vid]);
+
+#if TEXTURE
+
+    outVert.texcoord0 = primitive.dev_texcoord0[vid];
+    outVert.dev_diffuseTex = primitive.dev_diffuseTex;
+    outVert.texWidth = primitive.diffuseTexWidth;
+    outVert.texHeight = primitive.diffuseTexHeight;
+
+#endif
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
-		
 	}
 }
 
 
 
 static int curPrimitiveBeginId = 0;
-
 __global__ 
 void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
 
@@ -660,21 +996,210 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+      VertexOut &vout = primitive.dev_verticesOut[primitive.dev_indices[iid]];
+      dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+        = vout;
+
+#if BACK_FACE_CULLING
 
+      if (glm::dot(vout.eyeNor, glm::vec3(0.f, 0.f, 1.f)) < 0.f) {
+        dev_primitives[pid + curPrimitiveBeginId].back = true;
+      }
+      else {
+        dev_primitives[pid + curPrimitiveBeginId].back = false;
+      }
+
+#endif
+		}
 
 		// TODO: other primitive types (point, line)
+    // point, line handled in rasterization..
 	}
 	
 }
 
 
 
+
+/**
+* Creates a line from two points (A and B) using Digital Differential Analyzer (DDA) Algorithm
+*
+* Reference: http://www.geeksforgeeks.org/dda-line-generation-algorithm-computer-graphics/
+*/
+__device__
+void drawLine(glm::vec3 A, glm::vec3 B, Fragment *fragBuf, int width) {
+  float dx = B.x - A.x;
+  float dy = B.y - A.y;
+  float steps = fabs(dx) > fabs(dy) ? dx : dy;
+  dx /= fabs(steps);
+  dy /= fabs(steps);
+
+  float x = A.x, y = A.y;
+  for (int i = 0; i < steps; i++) {
+    int index = (int)x + (int)y * width;
+    fragBuf[index].color = glm::vec3(0.98);
+    x += dx;
+    y += dy;
+  }
+}
+
+
+
+__global__
+void _rasterizeTriangles(int numTris, Primitive *dev_primitives, Fragment *fragBuf, int *depthBuf, int *mutex, int width, int height) {
+
+  // index id
+  int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (tid < numTris) {
+    Primitive &prim = dev_primitives[tid];
+    glm::vec3 tri[3] = { glm::vec3(prim.v[0].pos), 
+                         glm::vec3(prim.v[1].pos), 
+                         glm::vec3(prim.v[2].pos) };
+    
+#if BACK_FACE_CULLING_WITHOUT_COMPACTION
+    if (glm::dot(prim.v[0].eyeNor, glm::vec3(0.f, 0.f, 1.f)) < 0.f) {
+      return;
+    }
+#endif
+
+#if POINTS
+
+    for (int i = 0; i < 3; i++) {
+      int x = tri[i].x;
+      int y = tri[i].y;
+      int index = x + (y * width);
+
+  #if DEBUG_DEPTH
+      fragBuf[index].color = glm::abs(glm::vec3(1.f + tri[i].z));
+  #elif DEBUG_NORMALS
+      fragBuf[index].color = prim.v[i].eyeNor;
+  #else
+      fragBuf[index].color = glm::vec3(0.98);
+  #endif
+    
+    }
+
+#elif WIREFRAME
+
+    drawLine(tri[0], tri[1], fragBuf, width);
+    drawLine(tri[1], tri[2], fragBuf, width);
+    drawLine(tri[2], tri[0], fragBuf, width);
+
+#else
+
+    AABB bbox = getAABBForTriangle(tri);
+
+#if BBOX_OPTIMIZATIONS
+    // return when the entire bbox is outside screen..
+    if (bbox.min.x > width || bbox.max.x<0 || bbox.min.y>height || bbox.max.y < 0) {
+      return;
+    }
+
+    // clip bounding boxes to the screen size.. 
+    // won't cause much divergence as most of the threads 
+    // in a warp would fall in the same category most of the time..
+    if (bbox.min.x < 0) {
+      bbox.min.x = 0.f;
+    }
+    if (bbox.min.y < 0) {
+      bbox.min.y = 0.f;
+    }
+    if (bbox.max.x > width) {
+      bbox.min.x = width;
+    }
+    if (bbox.max.y > height) {
+      bbox.min.x = height;
+    }
+    
+#endif
+
+    for (int y = bbox.min.y; y <= bbox.max.y; y++) {
+      for (int x = bbox.min.x; x <= bbox.max.x; x++) {
+
+        glm::vec2 point(x, y);
+        glm::vec3 baryCoord = calculateBarycentricCoordinate(tri, point);
+
+        if (!isBarycentricCoordInBounds(baryCoord)) {
+          continue;
+        }
+
+        int baryZ = getZAtCoordinate(baryCoord, tri) * 10000;
+
+        int index = x + (y * width);
+
+        glm::vec3 eyePosition = prim.v[0].eyePos * baryCoord.x
+          + prim.v[1].eyePos * baryCoord.y
+          + prim.v[2].eyePos * baryCoord.z;
+
+        bool isSet;
+        do {
+          isSet = (atomicCAS(&mutex[index], 0, 1) == 0);
+          if (isSet) {
+
+            if (depthBuf[index] > baryZ) {
+
+              depthBuf[index] = baryZ;
+
+              fragBuf[index].eyePos = eyePosition;
+
+              fragBuf[index].eyeNor = prim.v[0].eyeNor * baryCoord.x
+                + prim.v[1].eyeNor * baryCoord.y
+                + prim.v[2].eyeNor * baryCoord.z;
+
+#if DEBUG_DEPTH
+
+              fragBuf[index].color = glm::abs(glm::vec3(1.f - baryZ / 10000.f)); 
+
+#elif DEBUG_NORMALS
+
+              fragBuf[index].color = fragBuf[index].eyeNor;
+
+#elif TEXTURE
+#if TEXTURE_PERSP_CORRECT
+              float z0 = prim.v[0].eyePos.z, z1 = prim.v[1].eyePos.z, z2 = prim.v[2].eyePos.z;
+              float z = baryCoord.x / z0 + baryCoord.y / z1 + baryCoord.z / z2;
+              fragBuf[index].texcoord0 = (prim.v[0].texcoord0 / z0 * baryCoord.x
+                + prim.v[1].texcoord0 / z1 * baryCoord.y
+                + prim.v[2].texcoord0 / z2 * baryCoord.z) / z;
+#else
+              fragBuf[index].texcoord0 = prim.v[0].texcoord0 * baryCoord.x
+                + prim.v[1].texcoord0 * baryCoord.y
+                + prim.v[2].texcoord0 * baryCoord.z;
+#endif
+              fragBuf[index].dev_diffuseTex = prim.v[0].dev_diffuseTex;
+              fragBuf[index].texWidth = prim.v[0].texWidth;
+              fragBuf[index].texHeight = prim.v[0].texHeight;
+
+              fragBuf[index].color = glm::vec3(0.98);
+#else // lambert
+              fragBuf[index].color = glm::vec3(0.98);
+#endif
+            }
+            mutex[index] = 0;
+          }
+        } while (!isSet);
+      }
+    }
+
+#endif
+
+  }
+}
+
+
+/**
+* predicate struct for thrust::copy_if
+*/
+struct isNotBack {
+  __host__ __device__
+  bool operator()(const Primitive &prim) {
+    return !prim.back;
+  }
+};
+
 /**
  * Perform rasterization.
  */
@@ -702,10 +1227,10 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly <<<numBlocksForVertices, numThreadsPerBlock>>> (p->numVertices, *p, MVP, MV, MV_normal, width, height);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
+				_primitiveAssembly <<<numBlocksForIndices, numThreadsPerBlock>>>
 					(p->numIndices, 
 					curPrimitiveBeginId, 
 					dev_primitives, 
@@ -718,20 +1243,110 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
 
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  cudaMemset(dev_mutex, 0, width * height * sizeof(int));
+  initDepth << <blockCount2d, blockSize2d >> > (width, height, dev_depth);
+
+#if BACK_FACE_CULLING
+
+  cudaMemset(dev_primitives_compact, 0, totalNumPrimitives * sizeof(Primitive));
+  Primitive* end = thrust::copy_if(thrust::device, dev_primitives, dev_primitives + totalNumPrimitives, dev_primitives_compact, isNotBack());
+  checkCUDAError("Back face culling: thrust::partition");
+  totalNumPrimitivesCompact = end - dev_primitives_compact;
+
+  //// TODO: rasterize
+  dim3 numThreadsPerBlock(128);
+  dim3 numBlocksPerTriangle((totalNumPrimitivesCompact + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+  _rasterizeTriangles << <numBlocksPerTriangle, numThreadsPerBlock >> >
+    (totalNumPrimitivesCompact, dev_primitives_compact, dev_fragmentBuffer, dev_depth, dev_mutex, width, height);
+  checkCUDAError("rasterize triangles");
+
+#else
+
+  //// TODO: rasterize
+  dim3 numThreadsPerBlock(128);
+  dim3 numBlocksPerTriangle((totalNumPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+  _rasterizeTriangles <<<numBlocksPerTriangle, numThreadsPerBlock>>>
+    (totalNumPrimitives, dev_primitives, dev_fragmentBuffer, dev_depth, dev_mutex, width, height);
+  checkCUDAError("rasterize triangles");
+
+
+#endif
+
+#if SSAO
+
+  // SSAO
+  shaderSSAO << <blockCount2d, blockSize2d >> > (width, height, dev_fragmentBuffer);
+  checkCUDAError("SSAO");
+
+#endif
+
+  // Copy fragmentbuffer colors into framebuffer [frame buffer is sent for post processing]
+  render <<<blockCount2d, blockSize2d>>> (width, height, dev_fragmentBuffer, dev_framebuffer);
+  checkCUDAError("fragment shader");
+
+#if BLUR
+  #if BLUR_SHARED
+
+    // Doing seperate blur in X and then in Y to make things faster.. 
+    // Using 2 postbuffers to avoid over-writing and collision between threads while doing blurring in Y dir..
+
+    // blur in x dir
+    dim3 threadsX(width), blocksX(height), threadsY(height), blocksY(width);
+    bool dirX = true;
+    postProcessShared <<<blocksX, threadsX, width * sizeof(glm::vec3)>>> (dirX, width, height, dev_framebuffer, dev_depth, dev_postBuffer1);
+    checkCUDAError("post shader X");
 
+    // blur in y dir
+    dirX = false;
+    postProcessShared <<<blocksY, threadsY, height * sizeof(glm::vec3)>>> (dirX, width, height, dev_postBuffer1, dev_depth, dev_postBuffer2);
+    checkCUDAError("post shader Y");
+
+    // 2nd pass for more blurring..
+    dirX = true;
+    postProcessShared <<<blocksX, threadsX, width * sizeof(glm::vec3)>>> (dirX, width, height, dev_postBuffer2, dev_depth, dev_postBuffer1);
+    checkCUDAError("post shader X");
+
+    // blur in y dir
+    dirX = false;
+    postProcessShared <<<blocksY, threadsY, height * sizeof(glm::vec3)>>> (dirX, width, height, dev_postBuffer1, dev_depth, dev_postBuffer2);
+    checkCUDAError("post shader Y");
+
+  #else
+
+    // This version doesn't use shared memory. It is implemented only for the sake of comparison
+    // blur in x dir
+    dim3 threadsX(width), threadsY(height), blocksX(height), blocksY(width);
+    bool dirX = true;
+    postProcess <<<blocksX, threadsX>>> (dirX, width, height, dev_framebuffer, dev_depth, dev_postBuffer1);
+    checkCUDAError("post shader X");
+
+    // blur in y dir
+    dirX = false;
+    postProcess <<<blocksY, threadsY>>> (dirX, width, height, dev_postBuffer1, dev_depth, dev_postBuffer2);
+    checkCUDAError("post shader Y");
+
+  #endif
+
+    // blend for DOF
+    postBlend <<<blockCount2d, blockSize2d>>> (width, height, dev_framebuffer, dev_depth, dev_postBuffer2);
+    checkCUDAError("copy post render result to pbo");
 
-    // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
-	checkCUDAError("fragment shader");
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
+    sendImageToPBO <<<blockCount2d, blockSize2d>>> (pbo, width, height, dev_postBuffer2);
+    checkCUDAError("copy post render result to pbo");
+
+#else
+
+  blockSize2d = dim3(sideLength2d, sideLength2d);
+  blockCount2d = dim3((width/SSAA - 1) / blockSize2d.x + 1, (height/SSAA - 1) / blockSize2d.y + 1);
+
+  // Copy framebuffer into OpenGL buffer for OpenGL previewing
+  sendImageToPBO <<<blockCount2d, blockSize2d>>> (pbo, width, height, dev_framebuffer);
+  checkCUDAError("copy render result to pbo");
+
+#endif
 }
 
 /**
@@ -741,36 +1356,48 @@ void rasterizeFree() {
 
     // deconstruct primitives attribute/indices device buffer
 
-	auto it(mesh2PrimitivesMap.begin());
-	auto itEnd(mesh2PrimitivesMap.end());
-	for (; it != itEnd; ++it) {
-		for (auto p = it->second.begin(); p != it->second.end(); ++p) {
-			cudaFree(p->dev_indices);
-			cudaFree(p->dev_position);
-			cudaFree(p->dev_normal);
-			cudaFree(p->dev_texcoord0);
-			cudaFree(p->dev_diffuseTex);
+  auto it(mesh2PrimitivesMap.begin());
+  auto itEnd(mesh2PrimitivesMap.end());
+  for (; it != itEnd; ++it) {
+    for (auto p = it->second.begin(); p != it->second.end(); ++p) {
+      cudaFree(p->dev_indices);
+      cudaFree(p->dev_position);
+      cudaFree(p->dev_normal);
+      cudaFree(p->dev_texcoord0);
+      cudaFree(p->dev_diffuseTex);
 
-			cudaFree(p->dev_verticesOut);
+      cudaFree(p->dev_verticesOut);
 
-			
-			//TODO: release other attributes and materials
-		}
-	}
 
-	////////////
+      //TODO: release other attributes and materials
+    }
+  }
+
+  ////////////
+
+  cudaFree(dev_primitives);
+  dev_primitives = NULL;
+
+  cudaFree(dev_fragmentBuffer);
+  dev_fragmentBuffer = NULL;
+
+  cudaFree(dev_framebuffer);
+  dev_framebuffer = NULL;
+
+  cudaFree(dev_postBuffer1);
+  dev_postBuffer1 = NULL;
 
-    cudaFree(dev_primitives);
-    dev_primitives = NULL;
+  cudaFree(dev_postBuffer2);
+  dev_postBuffer2 = NULL;
 
-	cudaFree(dev_fragmentBuffer);
-	dev_fragmentBuffer = NULL;
+  cudaFree(dev_depth);
+  dev_depth = NULL;
 
-    cudaFree(dev_framebuffer);
-    dev_framebuffer = NULL;
+  cudaFree(dev_mutex);
+  dev_mutex = NULL;
 
-	cudaFree(dev_depth);
-	dev_depth = NULL;
+  cudaFree(dev_primitives_compact);
+  dev_primitives = NULL;
 
-    checkCUDAError("rasterize Free");
+  checkCUDAError("rasterize Free");
 }
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..f57f1b7 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -8,5 +8,5 @@ set(SOURCE_FILES
 
 cuda_add_library(util
    ${SOURCE_FILES}
-   OPTIONS -arch=sm_52
+   OPTIONS -arch=sm_20
    )