Project 2: Matt Elser #15

Open · wants to merge 15 commits into main
61 changes: 55 additions & 6 deletions README.md
@@ -3,12 +3,61 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Matt Elser
* [LinkedIn](https://www.linkedin.com/in/matt-elser-97b8151ba/), [twitter](twitter.com/__mattelser__)
* Tested on: Ubuntu 20.04, i3-10100F @ 3.6GHz 16GB, GeForce 1660 Super 6GB

### (TODO: Your README)
![timing data table](img/timingData.png)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Main Features
This project implements an exclusive scan (for each output element, applying an operation, in this case a sum,
to all of the preceding elements of an array) and stream compaction (removing elements from an array based on a
condition) on the CPU, on the GPU with CUDA, and with the CUDA-powered library `Thrust`. A small sketch of the
two operations follows the list below.
- All required algorithms work
- The work-efficient scan and compaction are faster than the CPU implementations for arrays of sufficient size
- Radix sort has been implemented (without tiling or bitonic merge)
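
To make the two operations concrete, here is a small standalone CPU sketch (illustrative only, not the project's CUDA kernels) of an exclusive sum-scan and a compaction that keeps the nonzero elements:

```cpp
#include <cstdio>

int main() {
    const int n = 8;
    int in[n] = {3, 0, 1, 0, 4, 1, 0, 2};
    int scan[n], out[n];

    // Exclusive scan: scan[i] is the sum of all elements before index i.
    scan[0] = 0;
    for (int i = 1; i < n; i++) {
        scan[i] = scan[i - 1] + in[i - 1];
    }
    // scan = {0, 3, 3, 4, 4, 8, 9, 9}

    // Stream compaction: keep only the nonzero elements, preserving order.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (in[i] != 0) {
            out[count++] = in[i];
        }
    }
    // out = {3, 1, 4, 1, 2}, count = 5

    printf("compacted %d elements, total sum %d\n", count, scan[n - 1] + in[n - 1]);
    return 0;
}
```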

### Time Comparison
The test setup in `main.cpp` has been co-opted to run repeated timings and reduce noise in the measurements.

![scan comparison plot](img/scanPlot.png)
![compact comparison plot](img/compactPlot.png)

CPU scan and compaction are faster for smaller arrays, but the work-efficient GPU implementations overtake the
CPU at array size 2^16 for compaction and 2^17 for scan.

### Block Size Comparison
Changing the number of threads per block did not have a noticeable impact on timing. Minor differences can
be seen in the graph below, but almost all are within about one standard deviation of one another, so this may
just be noise. Data was gathered by timing 100 runs of each algorithm with an array of size 2^22; see the bottom
of the readme for the data table.

![block comparison plot](img/blocksizePlot.png)

### Known limitations
- [FIXED] The Naive implementation fails for array sizes greater than 2^25.
  - Naive was launching more threads than needed, leading to higher-than-needed thread index values. When the
    block and thread indices were combined to compute `index`, the result overflowed a 32-bit int and yielded a
    negative index. Logic around indices (reasonably) assumed positive values and therefore caused an
    out-of-bounds write. A sketch of this failure mode follows this list.
- Compaction with scan fails for array sizes greater than 2^28 due to running out of CPU memory on the (16 GB) test machine.
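
For illustration, here is a hypothetical sketch (not the repository's actual naive kernel) of how an oversized launch can overflow the 32-bit index computation, along with a launch configuration that avoids it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the overflow described above, not the repository's
// actual naive kernel.
__global__ void kernCopy(int n, int *odata, const int *idata) {
    // blockIdx.x * blockDim.x is computed in 32-bit arithmetic. If far more
    // blocks are launched than the array needs, the result can exceed
    // INT_MAX; stored in a signed int it becomes negative on typical
    // platforms, slips past the `index >= n` guard, and writes out of bounds.
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n) {
        return;
    }
    odata[index] = idata[index];
}

int main() {
    const int n = 1 << 20;
    const int blockSize = 128;
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(in, 0, n * sizeof(int));

    // The fix: launch exactly enough blocks to cover n elements, so the
    // index computation stays well within the int range for valid sizes.
    dim3 blocks((n + blockSize - 1) / blockSize);
    kernCopy<<<blocks, blockSize>>>(n, out, in);
    cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```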

### Extra Credit
- The work-efficient GPU algorithms are faster than the CPU implementations (for large array sizes). This was achieved
by removing divergent threads from the `upsweep` and `downsweep` kernel calls. Prior to the in-class discussion of
modifying the indexing for optimal thread scheduling, these algorithms were implemented to achieve the same end
iteratively. Each layer is a separate kernel call, and each kernel call spawns only one thread per index being
written (i.e. n/2 for the first call, n/4 for the next, etc.). This does not have the same benefit of contiguous
memory reads, however. Of note, this iterative method does simplify synchronization: no thread on any given layer
reads or writes any index touched by another thread on that layer, and since all layers are separate kernel calls
(on the default stream, and therefore with an implicit sync between launches), no explicit syncs are needed.
A sketch of this per-layer launch pattern appears after this list.
- Radix sort has been implemented, though without tiling or bitonic merge. It sorts correctly for all array sizes (power-of-two
and non-power-of-two) up to 2^27, at which point the test machine runs out of memory. Radix sorting has been
validated against `Thrust`'s sort (though the timings of the two differ by several orders of magnitude).
The algorithm has not been optimized to use shared memory or contiguous memory reads, and would fail for arrays with
negative values. Here is a plot comparing the timing of the radix sort implementation with `Thrust`'s sort:
![sort runtime comparison](img/sortPlot.png)
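
For reference, here is a minimal sketch of the per-layer upsweep launch pattern described above. The kernel and helper names are illustrative (they are not the repository's exact code), and it assumes a power-of-two array size and an already-allocated device array:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the iterative, per-layer upsweep described above.
// Names and launch details are illustrative, not the repository's exact code.
__global__ void kernUpsweepLayer(int numWrites, int stride, int *data) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= numWrites) {
        return;
    }
    // Each thread owns exactly one written index, so no two threads on a
    // layer touch the same element and no intra-layer sync is needed.
    int right = (i + 1) * stride - 1;
    int left = right - stride / 2;
    data[right] += data[left];
}

// Assumes n is a power of two and dev_data already lives on the device.
void upsweep(int n, int *dev_data, int blockSize) {
    // One kernel launch per layer; launches on the default stream execute
    // in order, so each layer implicitly waits for the previous one.
    for (int stride = 2; stride <= n; stride *= 2) {
        int numWrites = n / stride;  // n/2 threads, then n/4, ..., then 1
        dim3 blocks((numWrites + blockSize - 1) / blockSize);
        kernUpsweepLayer<<<blocks, blockSize>>>(numWrites, stride, dev_data);
    }
}

int main() {
    const int n = 1 << 10;
    int host[n];
    for (int i = 0; i < n; i++) {
        host[i] = 1;  // the final element should hold the sum, i.e. n
    }

    int *dev_data;
    cudaMalloc(&dev_data, n * sizeof(int));
    cudaMemcpy(dev_data, host, n * sizeof(int), cudaMemcpyHostToDevice);

    upsweep(n, dev_data, 128);

    int total;
    cudaMemcpy(&total, dev_data + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("upsweep total: %d (expected %d)\n", total, n);

    cudaFree(dev_data);
    return 0;
}
```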



![block comparison table](img/blockComparison.png)
Binary file added img/blockComparison.png
Binary file added img/blocksizePlot.png
Binary file added img/compactPlot.png
Binary file added img/scanPlot.png
Binary file added img/sortPlot.png
Binary file added img/timingData.png
136 changes: 132 additions & 4 deletions src/main.cpp
@@ -10,10 +10,12 @@
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/radix.h>
#include <stream_compaction/thrust.h>
#include <vector>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -54,11 +56,13 @@ int main(int argc, char* argv[]) {
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
/*
//For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
printArray(SIZE, c, true);
*/

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
@@ -78,7 +82,6 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -147,8 +150,133 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

// --- repeated timing ---
printf("\n");
printf("******************************\n");
printf("** SCAN & COMPACTION TIMING **\n");
printf("******************************\n");

int NUM_TIMINGS = 100;
std::vector<float> data;
float stdDev;
float mean;

printf(" Data gathered from %i runs with array size %i (2^%i)\n", NUM_TIMINGS, SIZE, ilog2(SIZE));
printf("--------------------------------------------------------------\n\n");
printf("------------------------------| mean (ms) |--| stdDev (ms) |--\n");
printf("------ Scan ------\n");

// CPU
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::scan(SIZE, c, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// Naive
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Naive::scan(SIZE, c, a);
data.push_back(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Naive GPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// work efficient
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Efficient::scan(SIZE, c, a);
data.push_back(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Work Efficient GPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// Thrust
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Thrust::scan(SIZE, c, a);
data.push_back(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Thrust Library Scan \t%f\t%f\n", mean, stdDev);
data.clear();

printf("----- Compact -----\n");

// CPU
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU compact without Scan \t%f\t%f\n", mean, stdDev);
data.clear();

for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::compactWithScan(SIZE, b, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU compact with Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// work efficient
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Efficient::compact(SIZE, c, a);
data.push_back(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Work Efficient GPU compact\t%f\t%f\n", mean, stdDev);
data.clear();

// --- Radix Sort ---

printf("\n");
printf("******************************\n");
printf("********* RADIX SORT *********\n");
printf("******************************\n");

genArray(SIZE, a, 50);
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("thrust sort, power-of-two");
StreamCompaction::Thrust::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::Radix::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(NPOT, b);
printDesc("thrust sort, non-power-of-two");
StreamCompaction::Thrust::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");

printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
}

17 changes: 17 additions & 0 deletions src/testing_helpers.hpp
@@ -5,6 +5,7 @@
#include <iostream>
#include <string>
#include <ctime>
#include <vector>
#include <cmath> // for sqrt() in tabulate()

template<typename T>
int cmpArrays(int n, T *a, T *b) {
@@ -74,3 +75,19 @@ void printElapsedTime(T time, std::string note = "")
{
std::cout << " elapsed time: " << time << "ms " << note << std::endl;
}

// Computes the mean and (population) standard deviation of a set of timing samples.
void tabulate(std::vector<float>* data, float* mean, float* stdDev) {
float sum = 0;
for (auto i : *data) {
sum += i;
}
*mean = sum / data->size();

float variance = 0;
for (auto i : *data) {
variance += (i - *mean) * (i - *mean);
}
variance /= data->size();

*stdDev = sqrt(variance);
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
16 changes: 14 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) {
return;
}

bools[index] = idata[index] != 0;
}

/**
@@ -32,7 +37,14 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) {
return;
}

if (bools[index]) {
odata[indices[index]] = idata[index];
}
}

}
42 changes: 37 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,12 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;

for (int i = 1; i < n; i++) {
odata[i] = idata[i - 1] + odata[i - 1];
}

timer().endCpuTimer();
}

@@ -30,9 +35,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int oi = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[oi] = idata[i];
oi++;
}
}
timer().endCpuTimer();
return -1;
return oi;
}

/**
@@ -42,9 +53,30 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

int* shouldInclude = new int[n];
int* scan = new int[n];

// Map each element to a 0/1 flag: 1 if it should be kept, 0 otherwise.
for (int i = 0; i < n; i++) {
shouldInclude[i] = (idata[i] != 0) ? 1 : 0;
}

// Exclusive scan of the flags gives each kept element's output index.
scan[0] = 0;
for (int i = 1; i < n; i++) {
scan[i] = shouldInclude[i-1] + scan[i - 1];
}

// Scatter: write each kept element to its scanned output index.
int lastIndex = 0;
for (int i = 0; i < n; i++) {
if (shouldInclude[i] != 0) {
lastIndex = scan[i];
odata[lastIndex] = idata[i];
}
}
delete[] shouldInclude;
delete[] scan;
timer().endCpuTimer();
return -1;
return lastIndex+1;
}
}
}