Project 2: Raymond Yang #5

Open
wants to merge 11 commits into
base: main
61 changes: 54 additions & 7 deletions README.md
@@ -1,14 +1,61 @@
Project 2 CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Raymond Yang
* [LinkedIn](https://www.linkedin.com/in/raymond-yang-b85b19168)
* Tested on:
  * 09/21/2021
  * Windows 10
  * NVIDIA GeForce GTX 1080 Ti
* Submitted on: 09/21/2021

## Introduction
The objective of this assignment was to implement Stream Compaction. Stream compaction involves three main processes:
* Boolean Mapping: Given an input array `idata`, each element is mapped to an entry in a boolean array `dev_bool` that records whether that element is desirable. Desirable elements are kept during Scatter; all others are removed.
* Scan: Given the boolean array `dev_bool`, Scan outputs its exclusive prefix sum in `dev_dataPadded`. Each entry of this output array is the index at which the corresponding desirable input value will be stored during Scatter.

<p align="center">
<img src="img/ss0.PNG" alt="drawing" width="500" />
</p>

* Scatter: Given `dev_bool` and `dev_dataPadded`, Scatter will output a final array `dev_odata` that contains only desirable values.
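
The three steps above can be sketched sequentially on the host. This is a minimal illustration, not the project's GPU code: the function names are hypothetical, and "desirable" is assumed to mean non-zero, which the text above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1 (Boolean Mapping): 1 = keep, 0 = discard.
// Assumption: an element is "desirable" when it is non-zero.
std::vector<int> mapToBoolean(const std::vector<int>& idata) {
    std::vector<int> bools(idata.size());
    for (std::size_t i = 0; i < idata.size(); ++i) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }
    return bools;
}

// Step 2 (Scan): exclusive prefix sum.
// indices[i] = number of kept elements strictly before position i.
std::vector<int> exclusiveScan(const std::vector<int>& bools) {
    std::vector<int> indices(bools.size());
    int running = 0;
    for (std::size_t i = 0; i < bools.size(); ++i) {
        indices[i] = running;
        running += bools[i];
    }
    return indices;
}

// Step 3 (Scatter): write each kept element to its scanned index.
std::vector<int> scatter(const std::vector<int>& idata,
                         const std::vector<int>& bools,
                         const std::vector<int>& indices) {
    assert(idata.size() == bools.size() && idata.size() == indices.size());
    // Output length = last exclusive-scan value + last boolean flag.
    const std::size_t count =
        idata.empty() ? 0 : static_cast<std::size_t>(indices.back() + bools.back());
    std::vector<int> odata(count);
    for (std::size_t i = 0; i < idata.size(); ++i) {
        if (bools[i]) {
            odata[static_cast<std::size_t>(indices[i])] = idata[i];
        }
    }
    return odata;
}

// Full pipeline: map, scan, scatter.
std::vector<int> compact(const std::vector<int>& idata) {
    const std::vector<int> bools = mapToBoolean(idata);
    const std::vector<int> indices = exclusiveScan(bools);
    return scatter(idata, bools, indices);
}
```

On the GPU, steps 1 and 3 map naturally to one thread per element; only the scan in step 2 carries a dependency between elements.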

### CPU Implementation
`stream_compaction/cpu.cu` implements scan and scatter on the host machine without interacting with the GPU. This implementation follows a simple for loop that iterates through the input data. The output of our CPU implementation becomes the baseline for comparison with respect to time performance and output accuracy of Naive and Work Efficient.
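
A baseline of this kind might look like the sketch below. The names and signatures are illustrative assumptions rather than the contents of `cpu.cu`, and elements are assumed to be kept when they are non-zero.

```cpp
#include <cassert>

// Exclusive prefix sum with a single running total.
// Hypothetical signature; odata and idata must each hold n ints.
void cpuScan(int n, int* odata, const int* idata) {
    assert(n >= 0);
    int running = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = running;   // sum of idata[0..i-1]
        running += idata[i];
    }
}

// Single-pass compaction without a scan: copy every non-zero element
// to the next free output slot and return how many were kept.
int cpuCompactWithoutScan(int n, int* odata, const int* idata) {
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (idata[i] != 0) {
            odata[count++] = idata[i];
        }
    }
    return count;
}
```

Because both loops are strictly sequential and easy to verify by hand, their outputs are a convenient reference when checking the GPU results element by element.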

### Naive Implementation
`stream_compaction/naive.cu` implements a naive approach to scan. This approach exposes parallelism by performing all additions for a given stride concurrently. It needs `O(log n)` passes with respect to input data size, although the total number of additions is `O(n log n)`, which is more than the `O(n)` of a sequential scan. In each pass, every element is summed with the element one stride to its left, and the stride doubles from one pass to the next. The current implementation does not take advantage of any optimizations to improve performance.
<p align="center">
<img src="img/ss1.PNG" alt="drawing" width="500" />
</p>
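
The pass structure can be emulated on the host as follows. This is a sequential sketch of the general naive (stride-doubling) scan, not the kernel in `naive.cu`; the function name is hypothetical, and the two vectors stand in for the ping-pong device buffers a GPU version would typically use.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential emulation of a naive parallel scan.
// Each iteration of the outer loop corresponds to one kernel launch;
// each iteration of the inner loop corresponds to one GPU thread.
std::vector<int> naiveExclusiveScan(const std::vector<int>& idata) {
    const std::size_t n = idata.size();
    std::vector<int> in(idata);
    std::vector<int> out(n);

    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t k = 0; k < n; ++k) {
            // Read only from `in`, write only to `out`, so no "thread"
            // observes a value already updated during this pass.
            out[k] = (k >= stride) ? in[k - stride] + in[k] : in[k];
        }
        std::swap(in, out);  // ping-pong the buffers between passes
    }

    // `in` now holds the inclusive scan; shift right by one and
    // insert the identity (0) to obtain the exclusive scan.
    std::vector<int> result(n);  // value-initialized to 0
    for (std::size_t k = 1; k < n; ++k) {
        result[k] = in[k - 1];
    }
    return result;
}
```

Writing into a second buffer matters on the GPU: updating in place would let one thread read a neighbor's value after it had already been overwritten in the same pass.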

### Work Efficient Implementation
`stream_compaction/efficient.cu` implements a work-efficient approach to scan and compact. The scan process is broken into two parts:
* Upsweep:
<p align="center">
<img src="img/ss2.PNG" alt="drawing" width="500" />
</p>
* Downsweep:
<p align="center">
<img src="img/ss3.PNG" alt="drawing" width="500" />
</p>
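
A host-side emulation of the two phases is sketched below. It follows the standard upsweep/downsweep (Blelloch) formulation and assumes the input is zero-padded to the next power of two, which the name `dev_dataPadded` suggests but the text does not confirm. The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sequential emulation of a work-efficient exclusive scan.
// Each inner-loop iteration corresponds to one GPU thread in that pass.
std::vector<int> workEfficientExclusiveScan(const std::vector<int>& idata) {
    const std::size_t n = idata.size();
    if (n == 0) {
        return {};
    }

    // Assumption: pad with zeros up to the next power of two so the
    // implicit balanced tree is complete.
    std::size_t padded = 1;
    while (padded < n) {
        padded *= 2;
    }
    std::vector<int> data(padded, 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = idata[i];
    }

    // Upsweep (reduce): build partial sums up the tree.
    for (std::size_t stride = 1; stride < padded; stride *= 2) {
        for (std::size_t k = 0; k < padded; k += 2 * stride) {
            data[k + 2 * stride - 1] += data[k + stride - 1];
        }
    }

    // Downsweep: clear the root, then push prefix sums down the tree.
    data[padded - 1] = 0;
    for (std::size_t stride = padded / 2; stride >= 1; stride /= 2) {
        for (std::size_t k = 0; k < padded; k += 2 * stride) {
            const int left = data[k + stride - 1];
            data[k + stride - 1] = data[k + 2 * stride - 1];  // left child takes parent's value
            data[k + 2 * stride - 1] += left;                 // right child = parent + old left
        }
    }

    data.resize(n);  // drop the padding
    return data;
}
```

Both phases perform `O(n)` additions in total, but the number of active "threads" halves or doubles with every pass, so a launch that starts one thread per element leaves most of them idle in the later upsweep and earlier downsweep passes.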

## Data Analysis
Benchmarks were run on three implementations (CPU, Naive, Work-Eff) and the Thrust API. Timings are reported in milliseconds and were recorded after 10 successive runs. Two data sets were used. The first data set (PoT = Power of Two) is an array of size 2<sup>20</sup> populated by random values. The second data set (NPoT = Not Power of Two) is an array of size 2<sup>20</sup> - 3 populated by random values.

| SCAN | CPU (ms) | Naive (ms) | Work-Eff (ms) | Thrust (ms) |
|------|----------|------------|---------------|-------------|
| PoT | 0.613 | 2.026 | 2.884 | 0.272 |
| NPoT | 0.514 | 2.030 | 2.876 | 0.299 |

<p align="center">
<img src="img/ss5.PNG" alt="drawing" width="600" />
</p>
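
For host-side code, a repeated-run timing helper could look like the sketch below. It is an illustration only: the helper name is hypothetical, and averaging over the runs is an assumption, since the text above does not say how the 10 runs were combined. GPU kernels are usually timed with CUDA events (`cudaEventRecord`, `cudaEventElapsedTime`) instead of a host clock.

```cpp
#include <chrono>

// Run `fn` `runs` times and return the mean wall-clock time in milliseconds.
// Assumption: the mean is the statistic of interest.
template <typename Fn>
double averageMilliseconds(Fn&& fn, int runs) {
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    double totalMs = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = Clock::now();
        fn();
        const auto stop = Clock::now();
        totalMs += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return runs > 0 ? totalMs / runs : 0.0;
}
```

A call such as `averageMilliseconds([&] { /* CPU scan here */ }, 10)` would then time only the work inside the lambda, excluding input generation.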

## Limitation of Current Design
The implementations do not take advantage of the additional optimizations specified by the <a href="/INSTRUCTION.md">instructions</a>.
These include:
* [Optimizing thread and block usage](https://github.com/CIS565-Fall-2021/Project2-Stream-Compaction/blob/main/INSTRUCTION.md#part-5-why-is-my-gpu-approach-so-slow-extra-credit-5).
* [Utilizing shared memory to drastically reduce memory read and write time](https://github.com/CIS565-Fall-2021/Project2-Stream-Compaction/blob/main/INSTRUCTION.md#part-7-gpu-scan-using-shared-memory--hardware-optimizationextra-credit-10).
Binary file added img/ss0.PNG
Binary file added img/ss1.PNG
Binary file added img/ss2.PNG
Binary file added img/ss3.PNG
Binary file added img/ss4.PNG
Binary file added img/ss5.PNG