[pre-RFC][BYOC] Marvell ML/AI Accelerator Integration #9730
Conversation
Thanks for your contribution. When you have a chance, could you update the description of the PR with more details and potentially link to the pre-RFC discussion? Thanks!
Hello @tqchen, @jroesch, or any other reviewer: We have also posted a pre-RFC (11691) with the same title for discussion, along with this POC PR and code changes. Once we get some feedback, we (Marvell) are going to request an RFC via the other site https://github.com/apache/tvm-rfcs. Regarding the Jenkins failure, I looked at the console log of this commit's Jenkins BUILD:cpu stage, and I will need some help from a reviewer. FYI - in this pre-RFC BYOC-Marvell change set, we need to define a new dmlc.json.Write() function for the writer, which is called and used by the Save() function of the newly defined GraphInputNodeMrvlExt class. This is in the src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file.
We can see the above from the nm and c++filt commands below... but in the libtvm.so file, the symbols are defined... Thus, the tests/python/unittest/test_runtime_module_load.py::test_dso_module_load runtime test failed during the Jenkins BUILD:cpu stage.
Hello @jroesch,
Wondering if someone can help provide some insights and pointers related to the failing "rust" tests. 2021-12-15 18:46:35.799320: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH:
Hello @jroesch, We have followed the tvm-RFC steps and converted the contents of our pre-RFC posting into an RFC, plus added the additional required sub-sections as explained by the RFC template format.
RFC.md below
Summary
Integrate Marvell's ML/AI accelerator with the TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
Motivation
Marvell MLIP is an ML/AI inference accelerator and is embedded in our ARM Neoverse N2-based OCTEON 10 processor.
Guide-level explanation
Based on what the Marvell ML/AI inference accelerator does best, a given pre-trained network model ...
STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
STEP (4) Use TVM-llvm Compiler & Runtime to run inference
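To make STEP (1) concrete, here is a toy, self-contained sketch of the partitioning idea behind the flow above. This is not the actual TVM BYOC API; the supported-op set, function name, and graph encoding are invented purely for illustration:

```python
# Hypothetical op set the "mrvl" backend supports; real support is decided
# by per-operator pattern checks in the BYOC annotation pass.
MRVL_SUPPORTED = {"conv2d", "dense", "relu", "maxpool2d"}

def partition(ops):
    """Split a linear chain of ops into at most one leading accelerator
    ("mrvl") subgraph and one fallback ("llvm") subgraph."""
    cut = 0
    while cut < len(ops) and ops[cut] in MRVL_SUPPORTED:
        cut += 1
    return {"mrvl": ops[:cut], "llvm": ops[cut:]}

graph = ["conv2d", "relu", "maxpool2d", "dense", "softmax"]
parts = partition(graph)
print(parts["mrvl"])  # ['conv2d', 'relu', 'maxpool2d', 'dense']
print(parts["llvm"])  # ['softmax']
```

The real flow then hands the "mrvl" subgraph to the Mrvl-ML backend compiler (STEP 2) and the remainder to the TVM-llvm path (STEP 4).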
Reference-level explanation
Illustration using an MNIST model
Let's use the Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
We can train the above MNIST fashion model using the following train_images dataset and save
As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object are returned from the aot_build_and_json_code() call. Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at most one non-Mrvl (LLVM) subgraph.
Drawbacks
Rationale and alternatives
Unresolved questions
Future possibilities
Hi @jroesch, We have filed an RFC, PR-48 ([RFC][BYOC] Marvell ML/AI Accelerator Integration), at the TVM GitHub https://github.com/apache/tvm-rfcs repo. Thanks,
Hello @jroesch, Trying to refresh this pre-RFC as it seems to have become dormant (perhaps due to the holidays). Would appreciate your help, as it seems to be blocked by rust/cargo-related issues that we are not familiar with. Thank you!
Thanks for contributing to TVM! Please refer to the guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @-mentioning them in the pull request thread.