Project 2: Matt Elser #15

Open · wants to merge 15 commits into main
61 changes: 55 additions & 6 deletions README.md
@@ -3,12 +3,61 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Matt Elser
* [LinkedIn](https://www.linkedin.com/in/matt-elser-97b8151ba/), [twitter](twitter.com/__mattelser__)
* Tested on: Ubuntu 20.04, i3-10100F @ 3.6GHz 16GB, GeForce 1660 Super 6GB

### (TODO: Your README)
![timing data table](img/timingData.png)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Main Features
This project implements an exclusive scan (for each output element, applying an operation, in this case a sum,
to all of the preceding elements of an array) and stream compaction (removing elements from an array based on a
condition) on the CPU, on the GPU with CUDA, and with the CUDA-powered library `Thrust`. A small sketch of the
two operations follows the list below.
- All required algorithms work
- The work-efficient scan and compaction are faster than the CPU implementations for arrays of sufficient size
- Radix sort has been implemented (without tiling or bitonic merge)
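
To make the two operations concrete, here is a small standalone CPU sketch (illustrative only, not the project's CUDA kernels) of an exclusive sum-scan and a compaction that keeps the nonzero elements:

```cpp
#include <cstdio>

int main() {
    const int n = 8;
    int in[n] = {3, 0, 1, 0, 4, 1, 0, 2};
    int scan[n], out[n];

    // Exclusive scan: scan[i] is the sum of all elements before index i.
    scan[0] = 0;
    for (int i = 1; i < n; i++) {
        scan[i] = scan[i - 1] + in[i - 1];
    }
    // scan = {0, 3, 3, 4, 4, 8, 9, 9}

    // Stream compaction: keep only the nonzero elements, preserving order.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (in[i] != 0) {
            out[count++] = in[i];
        }
    }
    // out = {3, 1, 4, 1, 2}, count = 5

    printf("compacted %d elements, total sum %d\n", count, scan[n - 1] + in[n - 1]);
    return 0;
}
```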

### Time Comparison
The test setup in `main.cpp` has been co-opted to run repeated timings and reduce noise in the measurements.

![scan comparison plot](img/scanPlot.png)
![compact comparison plot](img/compactPlot.png)

CPU scan and compaction are faster for smaller arrays, but the work-efficient GPU implementations overtake the
CPU at array size 2^16 for compaction and 2^17 for scan.

### Block Size Comparison
Changing the number of threads per block did not have a noticeable impact on timing. Minor differences can
be seen in the graph below, but almost all are within about one standard deviation of one another, so this may
just be noise. Data was gathered by timing 100 runs of each algorithm with an array of size 2^22; see the bottom
of the readme for the data table.

![block comparison plot](img/blocksizePlot.png)

### Known limitations
- [FIXED] The Naive implementation fails for array sizes greater than 2^25.
  - Naive was launching more threads than needed, leading to higher-than-needed thread index values. When the
    block and thread indices were combined to compute `index`, the result overflowed a 32-bit int and yielded a
    negative index. Logic around indices (reasonably) assumed positive values and therefore caused an
    out-of-bounds write. A sketch of this failure mode follows this list.
- Compaction with scan fails for array sizes greater than 2^28 due to running out of CPU memory on the (16 GB) test machine.
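
For illustration, here is a hypothetical sketch (not the repository's actual naive kernel) of how an oversized launch can overflow the 32-bit index computation, along with a launch configuration that avoids it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the overflow described above, not the repository's
// actual naive kernel.
__global__ void kernCopy(int n, int *odata, const int *idata) {
    // blockIdx.x * blockDim.x is computed in 32-bit arithmetic. If far more
    // blocks are launched than the array needs, the result can exceed
    // INT_MAX; stored in a signed int it becomes negative on typical
    // platforms, slips past the `index >= n` guard, and writes out of bounds.
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n) {
        return;
    }
    odata[index] = idata[index];
}

int main() {
    const int n = 1 << 20;
    const int blockSize = 128;
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(in, 0, n * sizeof(int));

    // The fix: launch exactly enough blocks to cover n elements, so the
    // index computation stays well within the int range for valid sizes.
    dim3 blocks((n + blockSize - 1) / blockSize);
    kernCopy<<<blocks, blockSize>>>(n, out, in);
    cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```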

### Extra Credit
- The work-efficient GPU algorithms are faster than the CPU implementations (for large array sizes). This was achieved
by removing divergent threads from the `upsweep` and `downsweep` kernel calls. Prior to the in-class discussion of
modifying the indexing for optimal thread scheduling, these algorithms were implemented to achieve the same end
iteratively. Each layer is a separate kernel call, and each kernel call spawns only one thread per index being
written (i.e. n/2 for the first call, n/4 for the next, etc.). This does not have the same benefit of contiguous
memory reads, however. Of note, this iterative method does simplify synchronization: no thread on any given layer
reads or writes any index touched by another thread on that layer, and since all layers are separate kernel calls
(on the default stream, and therefore with an implicit sync between launches), no explicit syncs are needed.
A sketch of this per-layer launch pattern appears after this list.
- Radix sort has been implemented, though without tiling or bitonic merge. It sorts correctly for all array sizes (power-of-two
and non-power-of-two) up to 2^27, at which point the test machine runs out of memory. Radix sorting has been
validated against `Thrust`'s sort (though the timings of the two differ by several orders of magnitude).
The algorithm has not been optimized to use shared memory or contiguous memory reads, and would fail for arrays with
negative values. Here is a plot comparing the timing of the radix sort implementation with `Thrust`'s sort:
![sort runtime comparison](img/sortPlot.png)
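
For reference, here is a minimal sketch of the per-layer upsweep launch pattern described above. The kernel and helper names are illustrative (they are not the repository's exact code), and it assumes a power-of-two array size and an already-allocated device array:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the iterative, per-layer upsweep described above.
// Names and launch details are illustrative, not the repository's exact code.
__global__ void kernUpsweepLayer(int numWrites, int stride, int *data) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= numWrites) {
        return;
    }
    // Each thread owns exactly one written index, so no two threads on a
    // layer touch the same element and no intra-layer sync is needed.
    int right = (i + 1) * stride - 1;
    int left = right - stride / 2;
    data[right] += data[left];
}

// Assumes n is a power of two and dev_data already lives on the device.
void upsweep(int n, int *dev_data, int blockSize) {
    // One kernel launch per layer; launches on the default stream execute
    // in order, so each layer implicitly waits for the previous one.
    for (int stride = 2; stride <= n; stride *= 2) {
        int numWrites = n / stride;  // n/2 threads, then n/4, ..., then 1
        dim3 blocks((numWrites + blockSize - 1) / blockSize);
        kernUpsweepLayer<<<blocks, blockSize>>>(numWrites, stride, dev_data);
    }
}

int main() {
    const int n = 1 << 10;
    int host[n];
    for (int i = 0; i < n; i++) {
        host[i] = 1;  // the final element should hold the sum, i.e. n
    }

    int *dev_data;
    cudaMalloc(&dev_data, n * sizeof(int));
    cudaMemcpy(dev_data, host, n * sizeof(int), cudaMemcpyHostToDevice);

    upsweep(n, dev_data, 128);

    int total;
    cudaMemcpy(&total, dev_data + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("upsweep total: %d (expected %d)\n", total, n);

    cudaFree(dev_data);
    return 0;
}
```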



![block comparison table](img/blockComparison.png)
Binary file added img/blockComparison.png
Binary file added img/blocksizePlot.png
Binary file added img/compactPlot.png
Binary file added img/scanPlot.png
Binary file added img/sortPlot.png
Binary file added img/timingData.png
136 changes: 132 additions & 4 deletions src/main.cpp
@@ -10,10 +10,12 @@
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/radix.h>
#include <stream_compaction/thrust.h>
#include <vector>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -54,11 +56,13 @@ int main(int argc, char* argv[]) {
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
/*
//For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
printArray(SIZE, c, true);
*/

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
@@ -78,7 +82,6 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -147,8 +150,133 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

// --- repeated timing ---
printf("\n");
printf("******************************\n");
printf("** SCAN & COMPACTION TIMING **\n");
printf("******************************\n");

int NUM_TIMINGS = 100;
std::vector<float> data;
float stdDev;
float mean;

printf(" Data gathered from %i runs with array size %i (2^%i)\n", NUM_TIMINGS, SIZE, ilog2(SIZE));
printf("--------------------------------------------------------------\n\n");
printf("------------------------------| mean (ms) |--| stdDev (ms) |--\n");
printf("------ Scan ------\n");

// CPU
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::scan(SIZE, c, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// Naive
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Naive::scan(SIZE, c, a);
data.push_back(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Naive GPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// work efficient
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Efficient::scan(SIZE, c, a);
data.push_back(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Work Efficient GPU Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// Thrust
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Thrust::scan(SIZE, c, a);
data.push_back(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Thrust Library Scan \t%f\t%f\n", mean, stdDev);
data.clear();

printf("----- Compact -----\n");

// CPU
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU compact without Scan \t%f\t%f\n", mean, stdDev);
data.clear();

for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::CPU::compactWithScan(SIZE, b, a);
data.push_back(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("CPU compact with Scan \t%f\t%f\n", mean, stdDev);
data.clear();

// work efficient
for (int i = 0; i < NUM_TIMINGS; i++) {
zeroArray(SIZE, c);
StreamCompaction::Efficient::compact(SIZE, c, a);
data.push_back(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation());
}
tabulate(&data, &mean, &stdDev);
printf("Work Efficient GPU compact\t%f\t%f\n", mean, stdDev);
data.clear();

// --- Radix Sort ---

printf("\n");
printf("******************************\n");
printf("********* RADIX SORT *********\n");
printf("******************************\n");

genArray(SIZE, a, 50);
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("thrust sort, power-of-two");
StreamCompaction::Thrust::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::Radix::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(NPOT, b);
printDesc("thrust sort, non-power-of-two");
StreamCompaction::Thrust::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");

printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
}

17 changes: 17 additions & 0 deletions src/testing_helpers.hpp
@@ -5,6 +5,7 @@
#include <iostream>
#include <string>
#include <ctime>
#include <vector>
#include <cmath> // for sqrt() in tabulate()

template<typename T>
int cmpArrays(int n, T *a, T *b) {
@@ -74,3 +75,19 @@ void printElapsedTime(T time, std::string note = "")
{
std::cout << " elapsed time: " << time << "ms " << note << std::endl;
}

// Computes the mean and (population) standard deviation of a set of timing samples.
void tabulate(std::vector<float>* data, float* mean, float* stdDev) {
float sum = 0;
for (auto i : *data) {
sum += i;
}
*mean = sum / data->size();

float variance = 0;
for (auto i : *data) {
variance += (i - *mean) * (i - *mean);
}
variance /= data->size();

*stdDev = sqrt(variance);
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
16 changes: 14 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) {
return;
}

bools[index] = idata[index] != 0;
}

/**
@@ -32,7 +37,14 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) {
return;
}

if (bools[index]) {
odata[indices[index]] = idata[index];
}
}

}
42 changes: 37 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,12 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;

for (int i = 1; i < n; i++) {
odata[i] = idata[i - 1] + odata[i - 1];
}

timer().endCpuTimer();
}

@@ -30,9 +35,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int oi = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[oi] = idata[i];
oi++;
}
}
timer().endCpuTimer();
return -1;
return oi;
}

/**
@@ -42,9 +53,30 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

int* shouldInclude = new int[n];
int* scan = new int[n];

// Map each element to a 0/1 flag: 1 if it should be kept, 0 otherwise.
for (int i = 0; i < n; i++) {
shouldInclude[i] = (idata[i] != 0) ? 1 : 0;
}

// Exclusive scan of the flags gives each kept element's output index.
scan[0] = 0;
for (int i = 1; i < n; i++) {
scan[i] = shouldInclude[i-1] + scan[i - 1];
}

// Scatter: write each kept element to its scanned output index.
int lastIndex = 0;
for (int i = 0; i < n; i++) {
if (shouldInclude[i] != 0) {
lastIndex = scan[i];
odata[lastIndex] = idata[i];
}
}
delete[] shouldInclude;
delete[] scan;
timer().endCpuTimer();
return -1;
return lastIndex+1;
}
}
}