[FINN Examples] Unable to reproduce CNV-max performance reported in paper #721

zzzDavid · 2022-12-27T23:32:32Z

Description

I am trying to reproduce the latency and throughput reported in the paper for CNV-max. I use the same network configuration described in the paper, but my measured results are significantly lower.

Configuration

Platform

Ultra96-V2

Fold configuration

The paper reports the "per-layer total fold" of CNV-max, so I chose PE and SIMD for each layer such that its total fold matches the value listed in Table 2 of the paper:

{
  "Defaults": {},
  "Thresholding_Batch_0": {
    "PE": 1,
    "ram_style": "distributed"
  },
  "ConvolutionInputGenerator_0": {
    "SIMD": 3,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_0": {
    "PE": 64,
    "SIMD": 3,
    "ram_style": "auto"
  },
  "ConvolutionInputGenerator_1": {
    "SIMD": 64,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_1": {
    "PE": 64,
    "SIMD": 64,
    "ram_style": "auto"
  },
  "ConvolutionInputGenerator_2": {
    "SIMD": 64,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_2": {
    "PE": 32,
    "SIMD": 64,
    "ram_style": "auto"
  },
  "ConvolutionInputGenerator_3": {
    "SIMD": 64,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_3": {
    "PE": 32,
    "SIMD": 64,
    "ram_style": "auto"
  },
  "ConvolutionInputGenerator_4": {
    "SIMD": 16,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_4": {
    "PE": 32,
    "SIMD": 16,
    "ram_style": "auto"
  },
  "ConvolutionInputGenerator_5": {
    "SIMD": 64,
    "ram_style": "distributed"
  },
  "MatrixVectorActivation_5": {
    "PE": 2,
    "SIMD":64,
    "ram_style": "auto"
  },
  "MatrixVectorActivation_6": {
    "PE": 4,
    "SIMD": 4,
    "ram_style": "auto"
  },
  "MatrixVectorActivation_7": {
    "PE": 4,
    "SIMD": 8,
    "ram_style": "auto"
  },
  "MatrixVectorActivation_8": {
    "PE": 5,
    "SIMD": 1,
    "ram_style": "auto"
  },
  "LabelSelect_Batch_0": {
    "PE": 1
  }
}

Since the accelerator processes one fold per clock cycle, I used the estimated cycle count of each layer to confirm that the per-layer total fold matches what is reported in the paper:

{
  "Thresholding_Batch_0": 3072,
  "ConvolutionInputGenerator_0": 8196,
  "MatrixVectorActivation_0": 8100,
  "ConvolutionInputGenerator_1": 7146,
  "MatrixVectorActivation_1": 7056,
  "StreamingMaxPool_Batch_0": 980,
  "ConvolutionInputGenerator_2": 1338,
  "MatrixVectorActivation_2": 5184,
  "ConvolutionInputGenerator_3": 1872,
  "MatrixVectorActivation_3": 7200,
  "StreamingMaxPool_Batch_1": 125,
  "ConvolutionInputGenerator_4": 768,
  "MatrixVectorActivation_4": 5184,
  "ConvolutionInputGenerator_5": 72,
  "MatrixVectorActivation_5": 4608,
  "MatrixVectorActivation_6": 8192,
  "MatrixVectorActivation_7": 8192,
  "MatrixVectorActivation_8": 1024,
  "LabelSelect_Batch_0": 10
}
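
As a sanity check, the cycle counts of the MatrixVectorActivation layers can be recomputed from the folding factors. The sketch below assumes that each layer needs (MW / SIMD) * (MH / PE) cycles per output pixel, and it assumes the matrix shapes of the usual CNV topology for 32x32 CIFAR-10 inputs. Neither the formula nor the shapes are stated in this post, so treat both as assumptions; under them, the results agree with the estimated cycle counts listed above.

```python
# (MW, MH, output_pixels) per layer.
# ASSUMPTION: shapes follow the usual CNV topology (3x3 convolutions without
# padding, 32x32x3 input); they are not given in the post.
SHAPES = {
    "MatrixVectorActivation_0": (27, 64, 900),
    "MatrixVectorActivation_1": (576, 64, 784),
    "MatrixVectorActivation_2": (576, 128, 144),
    "MatrixVectorActivation_3": (1152, 128, 100),
    "MatrixVectorActivation_4": (1152, 256, 9),
    "MatrixVectorActivation_5": (2304, 256, 1),
    "MatrixVectorActivation_6": (256, 512, 1),
    "MatrixVectorActivation_7": (512, 512, 1),
    "MatrixVectorActivation_8": (512, 10, 1),
}

# (PE, SIMD) per layer, copied from the folding configuration above.
FOLDING = {
    "MatrixVectorActivation_0": (64, 3),
    "MatrixVectorActivation_1": (64, 64),
    "MatrixVectorActivation_2": (32, 64),
    "MatrixVectorActivation_3": (32, 64),
    "MatrixVectorActivation_4": (32, 16),
    "MatrixVectorActivation_5": (2, 64),
    "MatrixVectorActivation_6": (4, 4),
    "MatrixVectorActivation_7": (4, 8),
    "MatrixVectorActivation_8": (5, 1),
}


def expected_cycles(mw, mh, pixels, pe, simd):
    """Total fold of one layer: (MW/SIMD) * (MH/PE) cycles per output pixel."""
    # PE must divide MH and SIMD must divide MW for a valid folding.
    assert mw % simd == 0 and mh % pe == 0
    return (mw // simd) * (mh // pe) * pixels


cycles = {
    name: expected_cycles(*SHAPES[name], *FOLDING[name]) for name in SHAPES
}

for name, value in cycles.items():
    print(f"{name}: {value}")
```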

Estimated Performance

The experiments in the paper use a Zynq-7000 platform running at 200 MHz, whereas I am using an Ultra96-V2 running at 100 MHz.

The estimated throughput is about 12.2k FPS, roughly half of the 21.9k reported in the paper, which I would expect given the halved clock frequency. However, the performance measured on the board is much lower.

{
  "critical_path_cycles": 78319,
  "max_cycles": 8196,
  "max_cycles_node_name": "ConvolutionInputGenerator_0",
  "estimated_throughput_fps": 12201.073694485114,
  "estimated_latency_ns": 783190.0
}
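
The estimate above can be reproduced from the per-layer cycle counts with simple arithmetic. The sketch below assumes a 100 MHz clock, takes the slowest layer as the throughput bottleneck, and takes the sum of all layer cycles as the critical path. That this is how the estimate is derived is an assumption, but it matches the reported numbers.

```python
# Estimated cycles per layer, copied from the report above.
CYCLES = {
    "Thresholding_Batch_0": 3072,
    "ConvolutionInputGenerator_0": 8196,
    "MatrixVectorActivation_0": 8100,
    "ConvolutionInputGenerator_1": 7146,
    "MatrixVectorActivation_1": 7056,
    "StreamingMaxPool_Batch_0": 980,
    "ConvolutionInputGenerator_2": 1338,
    "MatrixVectorActivation_2": 5184,
    "ConvolutionInputGenerator_3": 1872,
    "MatrixVectorActivation_3": 7200,
    "StreamingMaxPool_Batch_1": 125,
    "ConvolutionInputGenerator_4": 768,
    "MatrixVectorActivation_4": 5184,
    "ConvolutionInputGenerator_5": 72,
    "MatrixVectorActivation_5": 4608,
    "MatrixVectorActivation_6": 8192,
    "MatrixVectorActivation_7": 8192,
    "MatrixVectorActivation_8": 1024,
    "LabelSelect_Batch_0": 10,
}

FCLK_HZ = 100e6  # ASSUMPTION: 100 MHz clock, as stated for the Ultra96-V2 build

# The slowest layer limits steady-state throughput of the pipeline.
bottleneck = max(CYCLES, key=CYCLES.get)
max_cycles = CYCLES[bottleneck]

# Summing all layers gives a conservative single-image latency.
critical_path_cycles = sum(CYCLES.values())

estimated_throughput_fps = FCLK_HZ / max_cycles
estimated_latency_ns = critical_path_cycles * 1e9 / FCLK_HZ

print(bottleneck, max_cycles, critical_path_cycles)
print(estimated_throughput_fps, estimated_latency_ns)
```

Under these assumptions the bottleneck is ConvolutionInputGenerator_0, so raising PE or SIMD on other layers would not change the estimated throughput.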

Measured Performance

I used the same driver code and test API provided in FINN-example.

The test code on the Ultra96-V2:

import json

import finn_examples
from finn_examples import models
# instantiate the accelerator
accel = models.cnv_w1a1_cifar10()
# run the built-in throughput test and print the results as JSON
res = accel.throughput_test()
print(json.dumps(res, indent=2))

The measured performance is much lower than expected:

{
  "runtime[ms]": 0.5919933319091797,
  "throughput[images/s]": 1689.2082158679018,
  "DRAM_in_bandwidth[MB/s]": 5.189247639146195,
  "DRAM_out_bandwidth[MB/s]": 0.0016892082158679017,
  "fclk[mhz]": 99.999,
  "batch_size": 1,
  "fold_input[ms]": 0.05412101745605469,
  "pack_input[ms]": 0.05745887756347656,
  "copy_input_data_to_device[ms]": 1.0628700256347656,
  "copy_output_data_from_device[ms]": 0.2636909484863281,
  "unpack_output[ms]": 0.4413127899169922,
  "unfold_output[ms]": 0.0896453857421875
}
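
The reported figures are consistent with dividing the batch size and the transferred bytes by the measured runtime. The sketch below recomputes them, assuming 3072 input bytes per image (32x32x3 8-bit pixels) and 1 output byte per image. These sizes and the formulas are assumptions that happen to match the numbers above, not something taken from the driver source.

```python
# Values copied from the measured results above.
runtime_ms = 0.5919933319091797
batch_size = 1

# ASSUMPTION: one CIFAR-10 image is 32*32*3 bytes in, one class label byte out.
BYTES_IN_PER_IMAGE = 32 * 32 * 3
BYTES_OUT_PER_IMAGE = 1

runtime_s = runtime_ms / 1000.0

# Throughput is the number of images divided by the time for the whole batch.
throughput = batch_size / runtime_s

# Bandwidth in MB/s is bytes moved (in units of 1e6 bytes) per second.
dram_in_mb_s = batch_size * BYTES_IN_PER_IMAGE * 1e-6 / runtime_s
dram_out_mb_s = batch_size * BYTES_OUT_PER_IMAGE * 1e-6 / runtime_s

print(throughput, dram_in_mb_s, dram_out_mb_s)
```

With batch_size equal to 1, only one image is in flight at a time, so this throughput figure is essentially the inverse of the single-image runtime and may say little about the steady-state rate of a full pipeline.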

If I turn on asynchronous execution, the throughput is higher but still well below the estimate:

{
  "runtime[ms]": 0.2238750457763672,
  "throughput[images/s]": 4466.777422790203,
  "DRAM_in_bandwidth[MB/s]": 13.721940242811502,
  "DRAM_out_bandwidth[MB/s]": 0.004466777422790202,
  "fclk[mhz]": 99.999,
  "batch_size": 1,
  "fold_input[ms]": 0.05412101745605469,
  "pack_input[ms]": 0.05626678466796875,
  "copy_input_data_to_device[ms]": 0.9977817535400391,
  "copy_output_data_from_device[ms]": 0.2548694610595703,
  "unpack_output[ms]": 0.43082237243652344,
  "unfold_output[ms]": 0.08940696716308594
}

Resource Usage

I checked the resource usage. My build uses more LUTs and fewer BRAMs than reported in the paper:

Resource	LUT	BRAM
Paper	46,253	186
Reproduced	52,265	113.5

Question

Are these results expected? Why is the measured performance so much lower than the estimate? Is there anything I should change in the configuration? Please let me know if you need any further information.

Replies: 0 comments