Description
I am trying to reproduce the latency and throughput reported in the paper, but my results differ significantly even though I use the network configuration the paper describes.
Configuration
Platform
Ultra96-V2
Fold configuration
The paper reports the "per-layer total fold" of CNV-max, so I configured PE and SIMD so that each layer has the same total fold as listed in Table 2 of the paper:
Since the accelerator computes one fold per clock cycle, I used the estimated cycle count of each layer to confirm that the per-layer total fold matches what is reported in the paper:
Estimated Performance
The experiments in the paper use a Zynq-7000 platform running at 200 MHz, while I am using an Ultra96-V2 running at 100 MHz.
The estimated throughput is 12k, about half of the 21.9k reported in the paper, which I think is expected given the halved clock. However, the performance measured on the board is much lower.
Measured Performance
I used the same driver code and test API provided here: FINN-example.
The test code on the Ultra96-V2:
The measured performance is much lower than expected:
If I turn on asynchronous execution, the throughput is higher but still lower than expected:
Resource Usage
I checked the resource usage, and my build uses more LUTs and less BRAM than reported in the paper:

| Resource | LUT | BRAM |
| --- | --- | --- |
| Paper | 46,253 | 186 |
| Reproduce | 52,265 | 113.5 |
Question
Is this result expected? Why is the measured performance so much lower than the estimate? Is there anything I should change in the configuration? Please let me know if you need any further information.
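To make the reasoning behind the estimate explicit, here is a minimal sketch of the arithmetic. It assumes the accelerator behaves as a streaming pipeline whose steady-state throughput is limited by the slowest layer, with one fold computed per clock cycle. The function names and the per-layer cycle counts are hypothetical placeholders, not values from the paper or from the build reports.

```python
def estimated_throughput(layer_cycles, clock_hz):
    """Estimate steady-state throughput (inferences per second).

    Assumption: layers run as a pipeline, so the layer with the most
    cycles (folds) per inference sets the initiation interval.
    """
    if not layer_cycles:
        raise ValueError("layer_cycles must not be empty")
    bottleneck_cycles = max(layer_cycles)
    return clock_hz / bottleneck_cycles


def scale_by_clock(reference_throughput, reference_clock_hz, target_clock_hz):
    """Scale a reported throughput linearly with clock frequency.

    Assumption: same folding on both platforms, so only the clock differs.
    """
    return reference_throughput * target_clock_hz / reference_clock_hz


# Placeholder cycle counts for illustration only (not the real per-layer folds).
example_cycles = [1_000, 5_000, 2_500]
print(estimated_throughput(example_cycles, 100e6))  # 20000.0

# The paper's 21.9k at 200 MHz, scaled to the 100 MHz clock used here.
print(scale_by_clock(21_900, 200e6, 100e6))  # 10950.0
```

Under these assumptions the paper's figure scales to roughly 10.95k at 100 MHz, which is in the same range as the 12k estimate. The estimate covers only the accelerator pipeline, so it does not include any driver or data-movement time on the board.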