
[RTL SWG] Support SIMD < C in window-parallel mode #922

Merged: 3 commits into Xilinx:dev on Jan 8, 2024

Conversation

@fpjentzsch (Collaborator) commented on Nov 20, 2023

Previously, full SWG SIMD parallelism (SIMD = # Channels) was required before enabling the window-parallel mode. Due to the depthwise data layout, this prevented VVAU SIMD unfolding (across the kernel dimensions) unless VVAU PE (across the channel dimension) was maxed out.

This adds support for SWG SIMD < C when the SWG is in window-parallel and depthwise mode.
Note that the SIMD of the SWG must match the PE of the following VVAU.
VVAU SIMD < K is supported via a normal DWC, which is inserted automatically by the compiler.

The experimental HLS DWC component previously introduced as a workaround for this problem should now be obsolete: Xilinx/finn-hlslib#134

SWG SIMD < C is also allowed in the 1x1 kernel case (regardless of whether parallel_window is set), which should fix #895.

@auphelia requested a review from mmrahorovic on Nov 21, 2023
@maltanar (Collaborator) commented on Nov 26, 2023

Many thanks for this feature @fpjentzsch! Not a full review, but while testing this on a larger network I came across one issue during FIFO sizing with Verilator. Although it is not part of the changes in this PR itself, I think the kinds of SWGs enabled by the PR are more likely to trigger it: the DEPTH of the swg_reg_buffer is sometimes large enough to run into particular Verilator coding-style limitations.

Specifically, this is the error message I observed:

%Error-BLKLOOPINIT: /scratch/finn/verilator_fifosim_iletty6g/finn_design_wrapper.v:3250:17: Unsupported: Delayed assignment to array inside for loops (non-delayed is ok - see docs)
 3250 |             Data[i] <= Data[i-1];

The offending piece of code is shown in context below; the DEPTH of the swg_reg_buffer instances in this example can be as large as 832, which appears to exceed Verilator's default loop-unrolling limit in this scenario.

[screenshot omitted: swg_reg_buffer shift-loop code in finn_design_wrapper.v]
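
Since the screenshot itself is not reproduced here, the snippet below is a hypothetical reconstruction of the coding style that triggers this error, pieced together from the quoted line and the signal names used in the suggested fix further down; WIDTH, clk, and the exact declarations are assumptions.

// Hypothetical reconstruction, not the actual generated code:
logic [WIDTH-1:0] Data [DEPTH-1:0];

always @(posedge clk) begin
  if (shift_enable) begin
    // Non-blocking ("delayed") assignment to an array element inside a for
    // loop: Verilator only accepts this if it can fully unroll the loop,
    // which apparently does not happen here for large DEPTH (BLKLOOPINIT).
    for (int i = DEPTH-1; i > 0; i--)
      Data[i] <= Data[i-1];
    Data[0] <= shift_in;
  end
end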

I'll give this a try with --unroll-count 1000 to see if it makes a difference for my use-case, but perhaps there is a better solution here, either by changing the coding style or getting the FINN compiler to generate code with those statements unrolled. (update: that did not help, unfortunately)

@maltanar (Collaborator) commented:

A suggestion from @preusser is to use a sliced vector assignment instead of the for loop; Verilator is happy with this, since it no longer has to unroll a per-element loop:

if(shift_enable) begin
  if(DEPTH > 1)  Data[DEPTH-1:1] <= Data[DEPTH-2:0];
  Data[0] <= shift_in;
end

@fpjentzsch (Collaborator, Author) commented:

Thanks @maltanar, I incorporated this fix; from my side, this PR is ready to merge.

To increase resource efficiency in cases like this, I'm currently experimenting with a "depth threshold" setting, which would split up deep shift registers where not all elements need to be accessed in parallel (such as for large dilation_w or large #Channels/SIMD) into smaller shift registers and LUTRAM/BRAM buffers. I prefer to avoid yet another configurable attribute for the custom_op, so I'm doing some benchmarking in an attempt to find a reasonable value for this threshold and the overall resource impact that this could have.
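
For illustration only, a minimal SystemVerilog sketch of that idea follows; it is not part of this PR, and the module name, parameters, and interface are invented. The bulk of the buffer becomes a RAM-backed circular delay line without parallel taps (so it can map to LUTRAM/BRAM), and only a short register tail keeps its elements individually accessible; overall the module behaves like a shift register of depth RAM_DEPTH + TAP_DEPTH + 1.

// Illustrative sketch only: one deep shift register split into a RAM delay
// line (no parallel taps) plus a short, fully tapped register tail.
module split_shift_buffer #(
  parameter WIDTH     = 8,
  parameter RAM_DEPTH = 768,  // elements buffered in RAM, not readable in parallel
  parameter TAP_DEPTH = 64    // elements kept in registers, readable in parallel
)(
  input  logic             clk,
  input  logic             shift_enable,
  input  logic [WIDTH-1:0] shift_in,
  output logic [WIDTH-1:0] taps [TAP_DEPTH-1:0]
);
  // Circular buffer emulating a fixed delay of RAM_DEPTH+1 cycles
  // (the registered read adds one extra stage).
  logic [WIDTH-1:0]             ram [RAM_DEPTH-1:0];
  logic [$clog2(RAM_DEPTH)-1:0] ptr = 0;
  logic [WIDTH-1:0]             ram_out;

  always @(posedge clk) begin
    if (shift_enable) begin
      ram_out  <= ram[ptr];      // oldest element leaves the delay line
      ram[ptr] <= shift_in;      // newest element takes its slot
      ptr      <= (ptr == RAM_DEPTH-1) ? 0 : ptr + 1;
    end
  end

  // Short register tail, using the sliced-assignment style suggested above
  always @(posedge clk) begin
    if (shift_enable) begin
      if (TAP_DEPTH > 1) taps[TAP_DEPTH-1:1] <= taps[TAP_DEPTH-2:0];
      taps[0] <= ram_out;
    end
  end
endmodule

Whether a split like this actually saves resources, and where the depth threshold should sit, is exactly what the benchmarking mentioned above is meant to determine.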

@preusser requested review from preusser and auphelia on Jan 5, 2024

@preusser (Collaborator) left a comment:

Thanks, @fpjentzsch!

@auphelia merged commit e3cb226 into Xilinx:dev on Jan 8, 2024 (2 checks passed)
@mmrahorovic mentioned this pull request on Mar 4, 2024 and again on Mar 14, 2024.