Project 2: Lindsay Smith #9

Open · wants to merge 6 commits into main
100 changes: 94 additions & 6 deletions README.md
@@ -3,12 +3,100 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Lindsay Smith
* [LinkedIn](https://www.linkedin.com/in/lindsay-j-smith/), [personal website](https://lindsays-portfolio-d6aa5d.webflow.io/).
* Tested on: Windows 10, i7-11800H, 16GB RAM, GeForce RTX 3060 (personal laptop with 144Hz display and 512GB SSD)

For this project I implemented scan (prefix-sum) and stream compaction algorithms and compared their runtimes on the CPU and GPU, testing a range of array sizes and analyzing the efficiency of each implementation. A small worked example of the two operations follows the list below.

The implementation includes:
* CPU Scan + Stream Compaction
* GPU Naive Scan
* GPU Work-Efficient Scan
* GPU Work-Efficient Compaction
* Thrust Scan (mainly for comparison purposes)
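To make the two operations concrete, here is a tiny host-side illustration (my own sketch, not code from this repo) of the map-scan-scatter pipeline that compaction builds on:

```
#include <cstdio>

int main() {
    const int n = 6;
    int in[n] = {3, 1, 7, 0, 4, 0};
    int flags[n], scan[n], out[n];

    // Map: flag 1 for elements to keep (non-zero), 0 otherwise.
    for (int i = 0; i < n; i++) {
        flags[i] = (in[i] != 0) ? 1 : 0;       // {1, 1, 1, 0, 1, 0}
    }

    // Exclusive scan: each entry is the sum of all flags before it.
    scan[0] = 0;
    for (int i = 1; i < n; i++) {
        scan[i] = scan[i - 1] + flags[i - 1];  // {0, 1, 2, 3, 3, 4}
    }

    // Scatter: the scan gives each kept element its output index.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (flags[i]) {
            out[scan[i]] = in[i];
            count++;
        }
    }
    // out = {3, 1, 7, 4}, count = 4
    printf("kept %d of %d elements\n", count, n);
    return 0;
}
```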

For all of my analysis I used a block size of 256, which I found to be optimal.

![](img/ScanTimes.png)

In this graph, lower times are better. The CPU is actually faster than the GPU for most array sizes. The exception was the range of roughly 2^16 - 2^18, where the GPU scans were faster. The Thrust scan is faster than all of the implementations I wrote, again with an exception at 2^16. I am not sure of the reason, but an array size around 2^16 seems to be close to optimal for my implementations. One plausible factor is that the naive GPU scan performs O(n log n) additions versus the CPU's O(n), so the GPU's parallelism has to outweigh both the extra work and the kernel launch overhead. If I continue with array sizes into the millions, past what is shown in the graph, the discrepancy in execution times only widens.

It is very clear that Thrust has optimizations far beyond my own implementations. With a small data set the differences are not as apparent, but once the array reaches thousands and then millions of elements, it is clear how much more efficient the Thrust algorithm is. I assume Thrust takes better advantage of memory; I know my implementation could be made more efficient by writing to contiguous memory and reusing the threads that are not doing work. With that change I could halve the number of active threads at each kernel call, leaving more threads available to do work for other calls. A sketch of that idea follows.
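Here is a rough illustration of that optimization (hypothetical code of my own, not part of this PR): instead of launching one thread per element and masking most of them off with a modulo test, each up-sweep level launches only the threads that actually do work and maps each one onto its strided pair.

```
// Hypothetical compacted up-sweep: thread tid owns the pair that ends at
// index (tid + 1) * 2 * offset - 1, so no thread idles on a modulo test.
__global__ void kernUpSweepCompacted(int numActive, int* data, int offset) {
    int tid = threadIdx.x + (blockDim.x * blockIdx.x);
    if (tid >= numActive) {
        return;
    }
    int idx = (tid + 1) * 2 * offset - 1;
    data[idx] += data[idx - offset];
}

// Host-side sketch: halve the number of launched threads at every level.
// for (int offset = 1; offset < numObj; offset *= 2) {
//     int numActive = numObj / (2 * offset);
//     dim3 blocks((numActive + blockSize - 1) / blockSize);
//     kernUpSweepCompacted<<<blocks, blockSize>>>(numActive, dev_buf, offset);
// }
```

The down-sweep can be remapped the same way, so the remaining warps stay densely packed with working threads instead of being mostly predicated off.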

![](img/CompactTimes.png)

In this graph, too, lower values are faster runtimes and therefore "better". Interestingly, although the CPU was consistently faster than the GPU for scan, once we add the compact step the CPU is no longer faster. The GPU is only slower than the CPU for very small array sizes; once we get into arrays with thousands and especially millions of elements, the GPU is definitively faster. This is probably because the GPU spreads the work across a very large number of threads, whereas the CPU must execute each step of the compact algorithm sequentially and takes far longer.

```
****************
** SCAN TESTS **
****************
[ 46 29 32 19 41 43 12 31 4 10 2 6 39 ... 0 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.3919ms (std::chrono Measured)
[ 0 46 75 107 126 167 210 222 253 257 267 269 275 ... 6427122 6427122 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.3764ms (std::chrono Measured)
[ 0 46 75 107 126 167 210 222 253 257 267 269 275 ... 6427059 6427087 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.100352ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.098304ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.206848ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.205824ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.171008ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.17728ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 3 0 0 0 3 1 3 2 0 0 2 0 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.6072ms (std::chrono Measured)
[ 3 3 3 1 3 2 2 3 3 3 3 1 3 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.572ms (std::chrono Measured)
[ 3 3 3 1 3 2 2 3 3 3 3 1 3 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1.4129ms (std::chrono Measured)
[ 3 3 3 1 3 2 2 3 3 3 3 1 3 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.219264ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.219136ms (CUDA Measured)
passed
```


Binary file added img/CompactTimes.png
Binary file added img/ScanTimes.png
17 changes: 17 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,16 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    // TODO (DONE)
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx >= n) {
        return;
    }
    // Flag 1 for elements to keep (non-zero), 0 otherwise.
    if (idata[idx] != 0) {
        bools[idx] = 1;
    }
    else {
        bools[idx] = 0;
    }
}

/**
@@ -33,6 +43,13 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
        const int *idata, const int *bools, const int *indices) {
    // TODO (DONE)
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx >= n) {
        return;
    }
    // Write each kept element to its scanned (compacted) index.
    if (bools[idx] == 1) {
        odata[indices[idx]] = idata[idx];
    }
}

}
38 changes: 34 additions & 4 deletions stream_compaction/cpu.cu
@@ -19,7 +19,11 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO (DONE)
    // Exclusive scan: odata[i] holds the sum of idata[0..i-1].
    odata[0] = 0;
    for (int i = 0; i < n - 1; i++) {
        odata[i + 1] = idata[i] + odata[i];
    }
    timer().endCpuTimer();
}

@@ -30,9 +34,16 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO (DONE)
    // Copy each non-zero element forward into the next free output slot.
    int num = 0;
    for (int i = 0; i < n; i++) {
        if (idata[i]) {
            odata[num] = idata[i];
            num++;
        }
    }
    timer().endCpuTimer();
    return num;
}

/**
@@ -43,8 +54,27 @@
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO (DONE)
    // Map each element to a 0/1 keep flag.
    int *arr = new int[n];
    for (int i = 0; i < n; i++) {
        arr[i] = idata[i] ? 1 : 0;
    }

    // Exclusive scan of the flags, inlined rather than calling scan()
    // so the CPU timer is not started a second time.
    int* scanArr = new int[n];
    scanArr[0] = 0;
    for (int i = 0; i < n - 1; i++) {
        scanArr[i + 1] = arr[i] + scanArr[i];
    }

    // Scatter: the scan result is each kept element's output index.
    int num = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i]) {
            odata[scanArr[i]] = idata[i];
            num++;
        }
    }
    timer().endCpuTimer();
    delete[] arr;
    delete[] scanArr;
    return num;
}
}
}
101 changes: 98 additions & 3 deletions stream_compaction/efficient.cu
@@ -12,13 +12,65 @@ namespace StreamCompaction {
return timer;
}

int* dev_buf;
int* dev_bools;
int* dev_idata;
int* dev_odata;
int* dev_scanned;

__global__ void kernUpSweep(int N, int* data, int offset) {
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx >= N) {
        return;
    }

    // One thread per element; only threads at multiples of 2*offset do work,
    // adding the left partial sum into the right node of each pair.
    if (idx % (2 * offset) == 0) {
        data[idx + offset * 2 - 1] += data[idx + offset - 1];
    }
}

__global__ void kernDownSweep(int N, int* data, int offset) {
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx >= N) {
        return;
    }

    // Traverse back down the implicit tree: swap the left child with the
    // current node and accumulate, producing an exclusive scan.
    if (idx % (2 * offset) == 0) {
        int temp = data[idx + offset - 1];
        data[idx + offset - 1] = data[idx + offset * 2 - 1];
        data[idx + offset * 2 - 1] += temp;
    }
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
    // TODO (DONE)

    // Pad the working size up to the next power of two for the sweeps.
    int blockSize = 256;
    int max = ilog2ceil(n);
    int numObj = 1 << max;
    dim3 numBlocks((numObj + blockSize - 1) / blockSize);

    // Allocate, zero the padding, and copy only the n valid elements;
    // copying numObj elements here would read past the end of idata.
    cudaMalloc((void**)&dev_buf, sizeof(int) * numObj);
    cudaMemset(dev_buf, 0, sizeof(int) * numObj);
    cudaMemcpy(dev_buf, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // Up-sweep builds partial sums in place...
    for (int i = 0; i < max; i++) {
        kernUpSweep<<<numBlocks, blockSize>>>(numObj, dev_buf, 1 << i);
    }
    // ...then zero the root and down-sweep to distribute the sums.
    cudaMemset(dev_buf + numObj - 1, 0, sizeof(int));
    for (int i = max - 1; i >= 0; i--) {
        kernDownSweep<<<numBlocks, blockSize>>>(numObj, dev_buf, 1 << i);
    }
    timer().endGpuTimer();

    cudaMemcpy(odata, dev_buf, sizeof(int) * n, cudaMemcpyDeviceToHost);
    cudaFree(dev_buf);
}

/**
@@ -31,10 +83,53 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
    // Pad the working size up to the next power of two, as in scan().
    int blockSize = 256;
    int max = ilog2ceil(n);
    int numObj = 1 << max;
    dim3 numBlocks((numObj + blockSize - 1) / blockSize);

    // Allocate device buffers.
    cudaMalloc((void**)&dev_bools, sizeof(int) * numObj);
    cudaMalloc((void**)&dev_idata, sizeof(int) * numObj);
    cudaMalloc((void**)&dev_odata, sizeof(int) * numObj);
    cudaMalloc((void**)&dev_scanned, sizeof(int) * numObj);

    // Zero the padding and copy only the n valid input elements;
    // copying numObj elements would read past the end of idata.
    cudaMemset(dev_idata, 0, sizeof(int) * numObj);
    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // TODO (DONE)
    // Map to 0/1 flags, exclusive-scan the flags, then scatter.
    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, blockSize>>>(numObj, dev_bools, dev_idata);
    cudaMemcpy(dev_scanned, dev_bools, sizeof(int) * numObj, cudaMemcpyDeviceToDevice);

    for (int i = 0; i < max; i++) {
        kernUpSweep<<<numBlocks, blockSize>>>(numObj, dev_scanned, 1 << i);
    }
    cudaMemset(dev_scanned + numObj - 1, 0, sizeof(int));
    for (int i = max - 1; i >= 0; i--) {
        kernDownSweep<<<numBlocks, blockSize>>>(numObj, dev_scanned, 1 << i);
    }

    StreamCompaction::Common::kernScatter<<<numBlocks, blockSize>>>(numObj, dev_odata, dev_idata, dev_bools, dev_scanned);
    timer().endGpuTimer();

    // Copy the flags and compacted data back to the host.
    int* arr = new int[numObj];
    cudaMemcpy(arr, dev_bools, sizeof(int) * numObj, cudaMemcpyDeviceToHost);
    cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);

    cudaFree(dev_idata);
    cudaFree(dev_bools);
    cudaFree(dev_scanned);
    cudaFree(dev_odata);

    // Count the kept elements from the boolean flags.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] == 1) {
            count++;
        }
    }
    delete[] arr;
    return count;
}
}
}
45 changes: 43 additions & 2 deletions stream_compaction/naive.cu
@@ -11,15 +11,56 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
// TODO: (DONE)
__global__ void kernNaiveScan(int N, int* odata, int* idata, int offset) {
    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (idx >= N) {
        return;
    }
    // Hillis-Steele step: add in the element `offset` positions back, if any.
    if (idx >= offset) {
        odata[idx] = idata[idx - offset] + idata[idx];
    }
    else {
        odata[idx] = idata[idx];
    }
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
    // TODO (DONE)
    const int blockSize = 128;
    dim3 blocksPerGrid((n + blockSize - 1) / blockSize);
    int max = ilog2ceil(n);

    // Ping-pong buffers.
    int* buf1;
    int* buf2;

    cudaMalloc((void**)&buf1, n * sizeof(int));
    cudaMalloc((void**)&buf2, n * sizeof(int));

    // Fill the input buffer.
    cudaMemcpy(buf1, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

    // One Hillis-Steele pass per level, swapping buffers between passes.
    timer().startGpuTimer();
    for (int i = 1; i <= max; i++) {
        kernNaiveScan<<<blocksPerGrid, blockSize>>>(n, buf2, buf1, 1 << (i - 1));
        std::swap(buf1, buf2);
    }
    timer().endGpuTimer();

    // The kernel produces an inclusive scan; shift right by one element
    // (copying only n - 1 ints, not n, to stay in bounds) to make it exclusive.
    odata[0] = 0;
    cudaMemcpy(odata + 1, buf1, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost);

    cudaFree(buf1);
    cudaFree(buf2);
}
}
}
8 changes: 5 additions & 3 deletions stream_compaction/thrust.cu
@@ -18,11 +18,13 @@
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
    // TODO (DONE)
    // Copy the input to the device, let Thrust do the exclusive scan,
    // then copy the result back.
    thrust::device_vector<int> dv_in(idata, idata + n);
    thrust::device_vector<int> dv_out(n);
    timer().startGpuTimer();
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    timer().endGpuTimer();
    thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
}
}