CenterPoint Backbone preprocessing optimization #83

angry-crab · 2022-12-19T06:30:44Z

The current implementation of scatter has some limitations:

The GPU implementation hard-codes the iterator thread bindings (shown below), which might not work on certain devices. For example, with the OpenCL backend, a GPU may support only a one-dimensional global work size.

        for j in T.thread_binding(0, 560, thread = "blockIdx.x"):
            for k in T.thread_binding(0, 560, thread = "blockIdx.y"):
                for i in T.thread_binding(0, 32, thread = "threadIdx.x"):

Because the bindings are hard-coded, there is no room for optimization. Normally, we would create a schedule from the IRModule and define optimization strategies on it.
We need to create an optimization schedule and measure its performance.
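
To illustrate the one-dimensional work-size case, here is a minimal pure-Python sketch of the index arithmetic that would let a single flat global id stand in for the three hard-coded bindings. The extents 560, 560 and 32 come from the snippet above; the names `flatten` and `unflatten` are hypothetical helpers for illustration only and are not part of TVM or of the current implementation.

```python
from typing import Tuple

# Extents taken from the hard-coded thread bindings quoted in this issue.
GRID_J, GRID_K, THREADS_I = 560, 560, 32
TOTAL = GRID_J * GRID_K * THREADS_I  # number of work items in a flat 1-D launch


def flatten(j: int, k: int, i: int) -> int:
    """Map (blockIdx.x, blockIdx.y, threadIdx.x) style indices to one flat id."""
    return (j * GRID_K + k) * THREADS_I + i


def unflatten(gid: int) -> Tuple[int, int, int]:
    """Recover (j, k, i) from a flat one-dimensional global id."""
    if not 0 <= gid < TOTAL:
        raise ValueError(f"global id {gid} is outside [0, {TOTAL})")
    jk, i = divmod(gid, THREADS_I)  # innermost extent varies fastest
    j, k = divmod(jk, GRID_K)
    return j, k, i
```

In a real schedule the same effect would normally come from fusing the loops and re-splitting them rather than from hand-written index math; this sketch only shows that the iteration space itself does not require three launch dimensions.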

angry-crab · 2022-12-19T06:43:11Z

Limitation 1 (the hard-coded thread bindings) can be solved by implementing scatter at a higher level, i.e. in TE or Relay.

angry-crab added the enhancement New feature or request label Dec 19, 2022
