
Releases: NVIDIA-Merlin/HugeCTR

Merlin: HugeCTR V3.7 (Merlin 22.06)

16 Jun 04:52
8a91274

What's New in Version 3.7

  • 3G Embedding Developer Preview:
    Version 3.7 introduces our next generation of embedding as a developer preview feature. We call it 3G embedding because it is the third generation of the HugeCTR embedding interface and implementation, following the original embedding and the unified embedding introduced in v3.1.
    Compared with the previous embedding, the embedding collection introduces three main changes.

    • First, it allows users to fuse embedding tables with different embedding vector sizes. The previous embedding can only fuse embedding tables with the same embedding vector size.
      The enhancement boosts both flexibility and performance.
    • Second, it extends the functionality of embedding by supporting the concat combiner and lookups from different slots on the same embedding table.
    • Finally, the embedding collection is powerful enough to support arbitrary embedding table placement, including data parallel and model parallel.
      By providing a plan JSON file, you can configure the table placement strategy.
      See the dlrm_train.py file in the embedding_collection_test directory of the repository for a more detailed usage example.
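    As a purely illustrative sketch of the plan-file idea, the snippet below writes a hypothetical placement plan from Python. The key names are invented for illustration only; the actual schema is defined by the developer-preview code referenced above.

        import json

        # Hypothetical plan structure for illustration only; consult dlrm_train.py
        # in embedding_collection_test for the schema the preview code expects.
        plan = [
            {"table_ids": [0, 1, 2], "placement": "model_parallel"},  # shard across GPUs
            {"table_ids": [3], "placement": "data_parallel"},         # replicate per GPU
        ]

        with open("plan.json", "w") as f:
            json.dump(plan, f, indent=2)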
  • HPS Performance Improvements:

    • Kafka: Model parameters are now stored in Kafka in a bandwidth-saving multiplexed data format.
      This data format vastly increases throughput. In our lab, we measured transfer speeds up to 1.1 Gbps for each Kafka broker.
    • HashMap backend: Parallel and single-threaded hashmap implementations have been replaced by a new unified implementation.
      This new implementation uses a new memory-pool based allocation method that vastly increases upsert performance without diminishing recall performance.
      Compared with the previous implementation, you can expect a 4x speed improvement for large-batch insertion operations.
    • Suppressed and simplified logging: Most HPS-related log messages have had their log level changed to TRACE, rather than INFO or DEBUG, to reduce logging verbosity.
  • Offline Inference Usability Enhancements:

    • The thread pool size is configurable in the Python interface, which is useful for studying embedding cache performance in asynchronous update scenarios. Previously, it was fixed at the minimum of 16 and std::thread::hardware_concurrency(). For more information, please refer to Hierarchical Parameter Server Configuration.
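    A minimal sketch, assuming the setting is exposed as a thread_pool_size argument on hugectr.inference.InferenceParams (the argument name and paths below are our assumptions; consult the Hierarchical Parameter Server Configuration documentation for the authoritative interface):

        from hugectr.inference import InferenceParams

        # Sketch: the thread pool size was previously fixed at
        # min(16, std::thread::hardware_concurrency()); now it is configurable.
        inference_params = InferenceParams(
            model_name="dlrm",
            max_batchsize=1024,
            hit_rate_threshold=0.9,
            dense_model_file="/models/dlrm/_dense_2000.model",        # placeholder path
            sparse_model_files=["/models/dlrm/0_sparse_2000.model"],  # placeholder path
            device_id=0,
            use_gpu_embedding_cache=True,
            cache_size_percentage=0.5,
            i64_input_key=True,
            thread_pool_size=32,  # assumed argument name, based on this note
        )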
  • DataGenerator Performance Improvements:
    You can specify the num_threads parameter to parallelize a Norm dataset generation.
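    For example, a parallel Norm dataset generation might look like the following sketch, based on the documented DataGeneratorParams interface (shapes and paths are placeholders):

        import hugectr
        from hugectr.tools import DataGeneratorParams, DataGenerator

        # Sketch: generate a synthetic Norm dataset using 8 generation threads.
        params = DataGeneratorParams(
            format=hugectr.DataReaderType_t.Norm,
            label_dim=1,
            dense_dim=13,
            num_slot=4,
            i64_input_key=False,
            source="./norm_data/file_list.txt",
            eval_source="./norm_data/file_list_test.txt",
            slot_size_array=[10000, 10000, 10000, 10000],  # placeholder cardinalities
            check_type=hugectr.Check_t.Sum,
            num_threads=8,  # new in v3.7: parallelizes Norm dataset generation
        )
        DataGenerator(params).generate()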

  • Evaluation Metric Improvements:

    • Average loss performance improvement in multi-node environments.
    • AUC performance optimization and safer memory management.
    • Addition of NDCG and SMAPE.
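    As a sketch, assuming the new metrics are exposed through the existing metrics_spec mechanism of CreateSolver and that the enum members are named hugectr.MetricsType.NDCG and hugectr.MetricsType.SMAPE (both names are assumptions):

        import hugectr

        # Sketch: request AUC plus the newly added NDCG metric during evaluation.
        solver = hugectr.CreateSolver(
            max_eval_batches=300,
            batchsize_eval=16384,
            batchsize=16384,
            lr=0.001,
            vvgpu=[[0]],
            repeat_dataset=True,
            metrics_spec={hugectr.MetricsType.AUC: 1.0,
                          hugectr.MetricsType.NDCG: 0.0},  # assumed enum name
        )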
  • Embedding Training Cache Parquet Demo:
    Created a keyset extractor script to generate keyset files for Parquet datasets.
    Provided users with an end-to-end demo of how to train a Parquet dataset using the embedding cache mode.
    See the Embedding Training Cache Example notebook.

  • Documentation Enhancements:
    The documentation details for HugeCTR Hierarchical Parameter Server Database Backend are updated for consistency and clarity.

  • Issues Fixed:

    • If slot_size_array is specified, workspace_size_per_gpu_in_mb is no longer required (a sketch follows this list).
    • If you build and install HugeCTR from scratch, you can specify the CMAKE_INSTALL_PREFIX CMake variable to identify the installation directory for HugeCTR.
    • Fixed an SOK hang issue when calling sok.Init() with a large number of GPUs. See GitHub issues 261 and 302 for more details.
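    Regarding the first fix above, a minimal sketch of an embedding configured through slot_size_array alone (cardinalities are placeholders):

        import hugectr

        # Sketch: with slot_size_array given, workspace_size_per_gpu_in_mb
        # can now be omitted.
        embedding = hugectr.SparseEmbedding(
            embedding_type=hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash,
            slot_size_array=[10000, 10000, 10000, 10000],  # per-slot cardinalities
            embedding_vec_size=16,
            combiner="sum",
            sparse_embedding_name="sparse_embedding1",
            bottom_name="data1",
            optimizer=hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam),
        )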
  • Known Issues:

    • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.
      If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

        --shm-size=1g --ulimit memlock=-1

      See also the NCCL known issue and the GitHub issue.

    • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
      To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be greater than or equal to the number of data reader workers.
      Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

    • Joint loss training with a regularizer is not supported.

    • The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable.
      Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator.
      For more information, see the Getting Started section of the repository README file.

    • The Parquet data generator produces inconsistent file names between _metadata.json and the actual dataset files, which results in a core dump when the synthetic dataset is used.

Merlin: HugeCTR V3.6

11 May 14:29

What's New in Version 3.6

  • Concat 3D Layer:
    In previous releases, the Concat layer could handle two-dimensional (2D) input tensors only.
    Now, the input can be three-dimensional (3D) and you can concatenate the inputs along axis 1 or 2.
    For more information, see the API documentation for the Concat Layer.
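    A minimal sketch, assuming the concatenation axis is selected through an axis argument on hugectr.DenseLayer (tensor names are placeholders):

        import hugectr

        # Sketch: concatenate two 3D tensors of shape (batch, seq_len, width)
        # along axis 2; axis 1 is supported as well.
        concat = hugectr.DenseLayer(
            layer_type=hugectr.Layer_t.Concat,
            bottom_names=["item_embedding", "context_embedding"],  # placeholders
            top_names=["concat1"],
            axis=2,
        )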

  • Dense Column List Support in Parquet DataReader:
    In previous releases, HugeCTR assumed that each dense feature had a single value of the scalar data type float32.
    Now, you can mix float32 or list[float32] for dense columns.
    This enhancement means that each dense feature can have more than one value.
    For more information, see the API documentation for the Parquet dataset format.

  • Support for HDFS is Re-enabled in Merlin Containers:
    Support for HDFS in Merlin containers is an optional dependency now.
    For more information, see HDFS Support.

  • Evaluation Metric Enhancements:
    In previous releases, HugeCTR computes AUC for binary classification only.
    Now, HugeCTR supports AUC for multi-label classification.
    The implementation is inspired by sklearn.metrics.roc_auc_score and performs the unweighted macro-averaging strategy that is the default for scikit-learn.
    You can specify a value for the label_dim parameter of the input layer to enable multi-label classification and HugeCTR will compute the multi-label AUC.
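    For example, a multi-label input might be declared as in the following sketch (dimensions are placeholders):

        import hugectr

        # Sketch: label_dim > 1 enables multi-label classification, and HugeCTR
        # then computes the macro-averaged multi-label AUC during evaluation.
        input_layer = hugectr.Input(
            label_dim=5,  # five binary labels
            label_name="label",
            dense_dim=13,
            dense_name="dense",
            data_reader_sparse_param_array=[
                hugectr.DataReaderSparseParam("data1", 1, True, 26)
            ],
        )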

  • Log Output Format Change:
    The default log format now includes milliseconds.

  • Documentation Enhancements:

    • These release notes are included in the documentation and are available at https://nvidia-merlin.github.io/HugeCTR/v3.6/release_notes.html.
    • The Configuration section of the Hierarchical Parameter Server information is updated with more information about the parameters in the configuration file.
    • The example notebooks that demonstrate how to work with multi-modal data are reorganized in the navigation.
      The notebooks are now available under the heading Multi-Modal Example Notebooks.
      This change is intended to make it easier to find the notebooks.
    • The documentation in the sparse_operation_kit directory of the repository on GitHub is updated with several clarifications about SOK.
  • Issues Fixed:

    • The dlrm_kaggle_fp32.py file in the samples/dlrm/ directory of the repository is updated to show the correct number of samples.
      The num_samples value is now set to 36672493.
      This fixes GitHub issue 301.
    • Hierarchical Parameter Server (HPS) would produce a runtime error when the GPU cache was turned off.
      This issue is now fixed.
  • Known Issues:

    • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.
      If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

        --shm-size=1g --ulimit memlock=-1

      See also the NCCL known issue and the GitHub issue.

    • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
      To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be greater than or equal to the number of data reader workers.
      Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

    • Joint loss training with a regularizer is not supported.

    • The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable.
      Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator.
      For more information, see the Getting Started section of the repository README file.

Merlin: HugeCTR V3.5

01 Apr 13:54

What's New in Version 3.5

  • HPS interface encapsulation and export as a library: We encapsulate the Hierarchical Parameter Server (HPS) interfaces and deliver them as a standalone library. In addition, we provide HPS Python APIs and demonstrate their usage with a notebook. For more information, please refer to Hierarchical Parameter Server and HPS Demo.

  • Hierarchical Parameter Server Triton Backend: The HPS Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. For more information, please refer to Hierarchical Parameter Server.

  • SOK pip release: SOK is now released on PyPI at https://pypi.org/project/merlin-sok/. Users can install SOK via pip install merlin-sok.

  • Joint loss and multi-task training support: We support joint loss in training so that users can train with multiple labels and tasks with different weights. An MMoE sample is added to demonstrate the usage.

  • HugeCTR documentation web page: Users can now visit our web documentation.

  • ONNX converter enhancement: We enable converting the MultiCrossEntropyLoss and CrossEntropyLoss layers to ONNX to support multi-label inference. For more information, please refer to HugeCTR to ONNX Converter.

  • HDFS python API enhancement:

    • Simplified DataSourceParams so that users do not need to provide all the paths before they are actually necessary. Users now only have to pass DataSourceParams once when creating a solver (see the sketch below).
    • Subsequent paths are automatically treated as local paths or HDFS paths depending on the DataSourceParams setting. See the notebook for usage.
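    A sketch of the simplified flow, assuming DataSourceParams carries the HDFS connection settings and is passed once to CreateSolver (the field and argument names below are assumptions; see the notebook for the authoritative interface):

        import hugectr

        # Sketch only: field names are assumptions based on this note.
        data_source_params = hugectr.DataSourceParams(
            use_hdfs=True,         # treat subsequent paths as HDFS paths
            namenode="localhost",  # HDFS namenode host (placeholder)
            port=9000,             # HDFS namenode port (placeholder)
        )
        solver = hugectr.CreateSolver(
            batchsize=16384,
            lr=0.001,
            vvgpu=[[0]],
            repeat_dataset=True,
            data_source_params=data_source_params,  # passed once at solver creation
        )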
  • HPS performance optimization: We use a better method to determine the number of partitions in the HPS database backends.

  • Bug fixing:

    • The HugeCTR input layer can now take a dense_dim greater than 1000.

Merlin: HugeCTR V3.4.1

01 Mar 12:46

What's New in Version 3.4.1

  • Support mixed precision inference for datasets with multiple labels: We enable FP16 for the Softmax layer and support mixed precision for multi-label inference. For more information, please refer to Inference API.

  • Support multi-GPU offline inference with Python API: We support multi-GPU offline inference with the Python interface, which can leverage Hierarchical Parameter Server and enable concurrent execution on multiple devices. For more information, please refer to Inference API and Multi-GPU Offline Inference Notebook.
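    A sketch of multi-GPU offline inference with the Python interface (paths and shapes are placeholders; see the Inference API documentation for details):

        import hugectr
        from hugectr.inference import InferenceModel, InferenceParams

        # Sketch: deploy the model on four GPUs, each holding an embedding cache
        # backed by the Hierarchical Parameter Server.
        inference_params = InferenceParams(
            model_name="dlrm",
            max_batchsize=1024,
            hit_rate_threshold=1.0,
            dense_model_file="/models/dlrm/_dense_2000.model",
            sparse_model_files=["/models/dlrm/0_sparse_2000.model"],
            deployed_devices=[0, 1, 2, 3],
            use_gpu_embedding_cache=True,
            cache_size_percentage=0.5,
            i64_input_key=True,
        )
        model = InferenceModel("/models/dlrm/dlrm.json", inference_params)
        predictions = model.predict(
            100,                         # number of batches to predict
            "/data/file_list_test.txt",  # evaluation data source
            hugectr.DataReaderType_t.Norm,
            hugectr.Check_t.Sum,
        )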

  • Introduction to _metadata.json: We add an introduction to _metadata.json for Parquet datasets. For more information, please refer to Parquet.

  • Documentation and tool for workspace size per GPU estimation: We add a tool named embedding_workspace_calculator to help calculate the workspace_size_per_gpu_in_mb required by hugectr.SparseEmbedding. For more information, please refer to embedding_workspace_calculator/README.md and QA 24.

  • Improved Debugging Capability: The old logging system, which had been flagged as deprecated for some time, has been removed. All remaining log messages and outputs have been revised and migrated to the new logging system (base/debug/logging.hpp/cpp). During this revision, we also adjusted log levels for log messages throughout the entire codebase to improve the visibility of relevant information.

  • Support HDFS Parameter Server in Training:

    • Decoupled HDFS from Merlin containers to make HDFS support more flexible. Users can now optionally compile HDFS-related functionality.
    • Loading and dumping models and optimizer states from/to HDFS is now supported.
    • Added a notebook to show how to use HugeCTR with HDFS.
  • Support Multi-hot Inference on the HugeCTR Backend: We support categorical input in multi-hot format for HugeCTR Backend inference.

  • Multi-label inference with mixed precision: Mixed precision training is enabled for the Softmax layer.

  • Python script and documentation demonstrating how to analyze model files: In this release, we provide a script to retrieve vocabulary information from a model file. Please find more details in the README.

  • Bug Fixing:

    • Mirror strategy bug in SOK (see #291)
    • Unable to import Sparse Operation Kit in nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.03 (see #296)
    • HPS: Fixed an access violation that could occur during initialization when no volatile DB was configured.

Known Issues

  • HugeCTR uses NCCL to share data between ranks, and NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing: --shm-size=1g --ulimit memlock=-1
    See also NCCL's known issue and the GitHub issue.

  • KafkaProducers startup will succeed even if the target Kafka broker is unresponsive. In order to avoid data loss in conjunction with streaming model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are up, working properly, and reachable from the node where you run HugeCTR.

  • The number of data files in the file list should be no less than the number of data reader workers. Otherwise, different workers will be mapped to the same file and data loading does not progress as expected.

Merlin: HugeCTR V3.4

28 Jan 07:42
9e6cbf5

What's New in Version 3.4

  • Supporting HugeCTR Development with the Merlin Unified Container: Starting with Merlin v22.02, we encourage you to develop HugeCTR in the Merlin Unified Container (release container) according to the instructions in the Contributor Guide to stay consistent.

  • Hierarchical Parameter Server (HPS) Enhancements:

    • Missing key insertion feature: Via a simple flag, it is now possible to configure HugeCTR such that missed embedding-table entries during lookup are automatically inserted into volatile database layers such as the Redis and Hashmap backends.
    • Asynchronous timestamp refresh: In the last release we introduced the passing-of-time-aware eviction policies. These are policies that are applied to shrink database partitions through dropping keys if they grow beyond certain limits. However, the time-information utilized by these eviction policies represented the update time. Hence, an embedding was evicted based on the time passed since its last update. If you operate HugeCTR in inference mode, the embedding table is typically immutable. With the above-described missing key insertion feature we now support actively tuning the contents of volatile database layers to the data distribution during lookup. To allow time-based eviction to take place, it is now possible to enable timestamp refreshing for frequently used embeddings. Once enabled, refreshing is handled asynchronously using background threads. Hence, it won’t block your inference jobs. For most applications, the associated performance impact from enabling this feature is barely noticeable.
    • Support HDFS (Hadoop Distributed File System) Parameter Server in Training:
      • A new Python API, DataSourceParams, is used to specify the file system and paths to data and model files.
      • Support loading data from HDFS to the local file system for HugeCTR training.
      • Support dumping trained model and optimizer states into HDFS.
    • Online seamless update of the parameters of the dense part of the model: The HugeCTR Backend now supports online model version updating via the Load API of Triton (including seamless updates of the dense part and the corresponding embedding inference cache for the same model), and the Load API remains fully compatible with the online deployment of new models.
  • Sparse Operation Kit Enhancements:

    • Mixed Precision Training: Mixed precision training can be enabled via TensorFlow's pattern to enhance training performance and reduce memory usage.
    • DLRM Benchmark: DLRM is a standard benchmark for recommendation model training. A notebook is added in this release to address the performance of SOK on this benchmark.
    • Support uint32_t / int64_t key dtype in SOK: int64 or uint32 can be used as the key data type for SOK's embeddings. By default, it is int64.
    • Add TensorFlow initializers support: TensorFlow native initializers can now be used in SOK, e.g., sok.All2AllDenseEmbedding(embedding_initializer=tf.keras.initializers.RandomUniform()).
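    A short sketch combining the two SOK items above; the key-dtype argument name (key_dtype) is an assumption, while the initializer usage follows the example given in the item above:

        import tensorflow as tf
        import sparse_operation_kit as sok

        sok.Init(global_batch_size=8192)

        # Sketch: key_dtype is an assumed argument name based on this note;
        # int64 is the default, uint32 is also supported.
        embedding = sok.All2AllDenseEmbedding(
            max_vocabulary_size_per_gpu=1024,
            embedding_vec_size=16,
            slot_num=26,
            nnz_per_slot=1,
            key_dtype=tf.int64,
            embedding_initializer=tf.keras.initializers.RandomUniform(),
        )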
  • User Experience Enhancements

    • We have revised several notebooks and readme files to clarify instructions and make HugeCTR more accessible in general.
    • Thanks to GitHub user @MuYu-zhi, who brought to our attention that configuring too little shared memory can impact the proper operation of HugeCTR. We extended the SOK Docker setup instructions to explain how such issues can be resolved using the --shm-size setting of Docker.
    • Although HugeCTR is designed for scalability, a beefy machine is not necessary for smaller workloads and testing. We added information about the required specs for notebook testing environments to the README.
  • Inference for Multi-tasking: We support HugeCTR inference for multiple tasks. When the label dimension is the number of binary classification tasks and MultiCrossEntropyLoss is employed during training, the shape of inference results will be (batch_size*num_batches, label_dim). For more information, please refer to Inference API.

  • Fixed the Embedding Cache Issue for Very Small Embedding Tables

Merlin: HugeCTR V3.3.1

11 Jan 05:59

What's New in Version 3.3.1

  • Hierarchical Parameter Server Enhancements:
    • Online deployment of new models and recycling of old models: In this release, the HugeCTR Backend is fully compatible with the model control protocol of Triton. By adding the configuration of a new model to the HPS configuration file, the HugeCTR Backend supports online deployment of new models via the Load API of Triton. Old models can also be recycled online via the Unload API.
    • Simplified database backend: Multi-node, single-node, and all other kinds of volatile database backends can now be configured using the same configuration object.
    • Multi-threaded optimization of Redis code: ~2.3x speedup over HugeCTR v3.3.
    • Fixes for some issues: built an HPS test environment and implemented unit tests for each component; fixed an access violation issue with online Kafka updates; fixed the Parquet data reader incorrectly parsing the index of categorical features in the case of multiple embedding tables; fixed the HPS Redis Backend overflow handling not being invoked upon single insertions.
  • New group of fused fully connected layers: We support adding a group of fused fully connected layers when constructing the model graph. A concise Python interface is provided for users to adjust the number of layers and to specify the output dimensions of each layer, which makes it easy to leverage the highly optimized fused fully connected layers in HugeCTR. For more information, please refer to GroupDenseLayer (a sketch follows this list).
  • Fix to some issues:
    • A warning is added for the case where users forget to import MPI before launching a multi-process job.
    • Removed massive logging when running with the embedding training cache.
    • Removed legacy Conda-related information from the documentation.
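    Returning to the fused fully connected group above, a minimal sketch of the interface (argument names follow our reading of the GroupDenseLayer documentation; dimensions are placeholders, and model is an existing hugectr.Model instance):

        import hugectr

        # Sketch: add three fused fully connected layers in one call; num_outputs
        # sets the output dimension of each layer in the group.
        model.add(
            hugectr.GroupDenseLayer(
                group_layer_type=hugectr.GroupLayer_t.GroupFusedInnerProduct,
                bottom_name_list=["dense_input"],  # placeholder tensor name
                top_name_list=["fc1", "fc2", "fc3"],
                num_outputs=[1024, 512, 256],
                last_act_type=hugectr.Activation_t.Relu,
            )
        )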

Merlin: HugeCTR V3.3

07 Dec 11:37

What's New in Version 3.3

  • Hierarchical Parameter Server:

    • Support Incremental Model Updating from Online Training: HPS now supports iterative model updating via Kafka message queues. It is now possible to connect HugeCTR with Apache Kafka deployments to update the model in place in real time. This feature is supported in both phases: training and inference. Please refer to the Demo Notebook.
    • Support Embedding Key Eviction Mechanism: In-memory databases such as Redis or CPU-memory-backed storage are now used for feature memory management. Hence, when performing iterative updating, they automatically evict infrequently used embeddings as training progresses.
    • Support Embedding Cache Asynchronous Refresh Mechanism: We now support asynchronously refreshing incremental embedding keys into the embedding cache. The refresh operation is triggered upon completing a model version iteration or outputting incremental parameters from online training. The distributed database and persistent database are updated by the distributed event streaming platform (Kafka), and the GPU embedding cache then refreshes the values of the existing embedding keys, replacing them with the latest incremental embedding vectors. Please refer to the HPS README.
    • Other Improvements: Backend implementations for databases are now fully configurable. The JSON interface parser copes better with inaccurate parameterization. Less, and hopefully more meaningful, log output: based on your requests, we revised the log levels throughout the entire database backend API of the parameter server; selected configuration options are now printed completely and uniformly to the log; errors provide more verbose information on the matter at hand. Improved performance of the Redis cluster backend. Improved performance of the CPU memory database backend.
  • SOK TF 1.15 Support: In this version, SOK can be used along with TensorFlow 1.15. See the README. A dedicated CUDA stream is used for SOK's ops, and kernel interleaving is thereby eliminated. Users can now install SOK via pip install SparseOperationKit, which no longer requires root access to compile SOK and no longer requires copying Python scripts. There was a hanging issue in tf.distribute.MirroredStrategy with TensorFlow versions greater than 2.4; this issue is fixed for TensorFlow 2.5+.

  • MLPerf v1.1 integration

    • Hybrid-embedding indices pre-computing: The indices needed for hybrid embedding are pre-computed ahead of time and are overlapped with previous iterations.
    • Cached evaluation indices: The hybrid-embedding indices for evaluation are cached when applicable, eliminating the re-computation of the indices at every evaluation iteration.
    • MLP weight/data gradients calculation overlap: The weight gradients of the MLP are calculated asynchronously with respect to the data gradients, enabling overlap between these two computations.
    • Better compute-communication overlap: Better overlap between compute and communication has been enabled to improve training throughput.
    • Fused weight conversion: The FP32-to-FP16 conversion of the weights is now fused into the SGD optimizer, saving trips to memory.
    • GraphScheduler: A GraphScheduler was added to control the timing of CUDA graph launches. With GraphScheduler, the gap between adjacent CUDA graphs is eliminated.
  • Multi-node training support on clusters without RDMA: We now support multi-node training without RDMA. You can specify the all-reduce algorithm as AllReduceAlgo.NCCL to support non-RDMA hardware. For more information, please refer to all_reduce_algo in the CreateSolver API.
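    For example, a sketch with a placeholder two-node topology:

        import hugectr

        # Sketch: NCCL-based all-reduce enables multi-node training on
        # non-RDMA hardware.
        solver = hugectr.CreateSolver(
            batchsize=16384,
            lr=0.001,
            vvgpu=[[0, 1], [0, 1]],  # two nodes with two GPUs each (placeholder)
            repeat_dataset=True,
            all_reduce_algo=hugectr.AllReduceAlgo.NCCL,
        )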

  • SOK supports device setting with tf.config: tf.config.set_visible_devices can be used to set the visible GPUs for each process. Alternatively, CUDA_VISIBLE_DEVICES can be used to achieve the same purpose. When tf.distribute.Strategy is used, the device argument must not be set.

  • User-defined names are supported in model dumping: We support specifying the model name with the training API CreateSolver; the name is dumped to the JSON configuration file with the API Model.graph_to_json. This feature facilitates the Triton deployment of saved HugeCTR models and helps distinguish between models when Kafka sends parameters from the training side to the inference side.

  • Fine-grained control of the embedding layers: We support fine-grained control of the embedding layers. Users can freeze or unfreeze the weights of a specific embedding layer with the APIs Model.freeze_embedding and Model.unfreeze_embedding. In addition, the weights of multiple embedding layers can be loaded independently, which enables the use case of loading pre-trained embeddings for a particular layer. For more information, please refer to the Model API and Section 3.4 of the HugeCTR Criteo Notebook.
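    A sketch of the fine-grained control, assuming an existing hugectr.Model instance named model with an embedding layer named "sparse_embedding1":

        # Sketch: freeze one embedding layer so its weights are not updated,
        # train the rest of the model, then unfreeze the layer again.
        model.freeze_embedding("sparse_embedding1")
        model.fit(max_iter=1000, display=200, eval_interval=500)
        model.unfreeze_embedding("sparse_embedding1")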

Merlin: HugeCTR V3.3 (Pre-release)

06 Dec 15:33

Merge branch 'sparse_op_kit_integration' into 'v3.3-integration' (Sparse Operation Kit integration). See merge request dl/hugectr/hugectr!587.

Merlin: HugeCTR V3.2.1

02 Nov 12:24

What's New in Version 3.2.1

  • Performance optimization on GPU embedding cache: We have optimized the performance of the GPU embedding cache stand-alone module. Performance is now significantly improved for small to medium batch sizes; for large batch sizes, it remains unchanged. This feature does not introduce any mandatory changes to the interface of the GPU embedding cache, so existing code that uses this module does not need to change. For more information, please refer to the documentation for the GPU embedding cache in the gpu_cache folder.

  • Host memory cache for HugeCTR embedding training cache: We have introduced a host memory cache (HMEM-Cache) based PS for incremental training, which is a component of the Embedding Training Cache (also known as model oversubscription, MOS) and is responsible for handling the case when the embedding table is too large to fit into host memory. We provided an SSD-based PS for this scenario in former releases, but the SSD-based PS will be deprecated from the v3.3 release due to its unsatisfactory performance. Please check Host Memory Cache in MOS for a detailed introduction.
    Compared with the former SSD-based PS, the loading and dumping bandwidth of the HMEM-Cache based PS can be substantially improved if it is properly configured, which contributes to the incremental training of models with huge embedding tables when using the MOS feature in HugeCTR.
    To ease the use of the MOS feature, we have also simplified its Python interface. Specifically, we drop the use_host_memory_ps entry in favor of a ps_types entry for choosing between the HMEM-based PS and the HMEM-Cache based PS, and a unified sparse_models entry is introduced so that you no longer need different entries to indicate whether a pre-trained embedding table exists. For a detailed explanation of the Python interface, please check the HugeCTR Python Interface.

  • Debugging Capability Improvement: We have introduced a set of new debugging capabilities, including multi-level logging and more informative throws and checks. We also provide a set of kernel debugging functions. Based on these features, we are actively working on making the information and error messages from HugeCTR cleaner, so that our users are well informed about what is happening with their training and inference code at the desired level. Stay tuned! For more detailed information, check out the comments in the header files located at HugeCTR/include/base/debug.

  • Embedding cache asynchronous insertion mechanism: We now support the asynchronous insertion of missing embedding keys into the embedding cache. This feature is activated automatically through a user-defined hit rate threshold in the configuration file. When the real hit rate of the embedding cache is higher than the user-defined threshold, the embedding cache inserts missing keys asynchronously; otherwise, keys are still inserted synchronously to ensure the high accuracy of inference requests. With the asynchronous insertion method, the real hit rate of the embedding cache can be further improved, compared with the previous synchronous method, after the embedding cache reaches the user-defined threshold.

  • Performance optimization of the Parameter Server: We have added support for multiple database interfaces to our parameter server. In particular, we added an "in-memory" database that utilizes the local CPU memory for storing and recalling embeddings and uses multi-threading to accelerate lookup and storage.
    Further, we revised the support for "distributed" storage of embeddings in a Redis cluster. This way, you can use the combined CPU-accessible memory of your cluster for storing embeddings. The new implementation is up to more than two orders of magnitude faster than the previous one.
    Further, we performance-optimized the support for "persistent" storage and retrieval of embeddings via RocksDB through the structured use of column families.
    Creating a hierarchical storage (i.e., using Redis as a distributed cache and RocksDB as a fallback) is supported as well. These advantages are free to end users, as there is no need to adjust the PS configuration.
    We plan to further integrate the hierarchical parameter server with other features, such as the GPU-backed embedding caches, in upcoming releases. Stay tuned!

  • Graph Analysis to internalize the Slice layer: The branch topology is inherently supported by the HugeCTR model graph, but previously required users to explicitly insert a Slice layer with the Python APIs to enable it. To simplify usage, the Slice layer for the branch topology can now be abstracted away in the Python interface. A graph analysis is conducted to resolve the tensor dependencies, and the Slice layer is internally inserted if the same tensor is consumed more than once to form the branch topology. The previous usage of explicitly adding the Slice layer is still supported, but using this new feature to internalize it is strongly recommended. Please refer to Getting Started to see how to construct a model graph with branches without the Slice layer. You can refer to Slice Layer for more details.
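    For illustration, a sketch in which the same tensor feeds two branches without an explicit Slice layer (model is an existing hugectr.Model instance; names are placeholders):

        import hugectr

        # Sketch: "concat1" is consumed by two layers; the graph analysis now
        # inserts the required Slice layer internally to form the branch topology.
        model.add(hugectr.DenseLayer(
            layer_type=hugectr.Layer_t.InnerProduct,
            bottom_names=["concat1"], top_names=["fc_left"], num_output=1024))
        model.add(hugectr.DenseLayer(
            layer_type=hugectr.Layer_t.InnerProduct,
            bottom_names=["concat1"], top_names=["fc_right"], num_output=512))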

Merlin: HugeCTR V3.2

22 Sep 06:40
3cf91dc

What's New in Version 3.2

  • New HugeCTR to ONNX Converter: We’re introducing a new HugeCTR to ONNX converter in the form of a Python package. The graph configuration file and model weights are required as inputs, and you can specify where to save the converted ONNX model. Sparse embedding models can also be converted. For more information, refer to HugeCTR to ONNX Converter and the HugeCTR2ONNX Demo Notebook.
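    A sketch of the converter call (paths are placeholders; sparse models only need to be passed when convert_embedding is True):

        import hugectr2onnx

        # Sketch: convert a trained HugeCTR model, including its sparse
        # embeddings, to ONNX.
        hugectr2onnx.converter.convert(
            onnx_model_path="dlrm.onnx",            # where to save the ONNX model
            graph_config="dlrm.json",               # graph configuration file
            dense_model="_dense_2000.model",        # dense weights
            convert_embedding=True,
            sparse_models=["0_sparse_2000.model"],  # sparse embedding weights
        )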

  • New Hierarchical Storage Mechanism on the Parameter Server (POC): We’ve implemented a hierarchical storage mechanism between local SSDs and CPU memory. As a result, embedding tables no longer have to be stored entirely in local CPU memory. The distributed Redis cluster is implemented as a CPU cache to store larger embedding tables and interact with the GPU embedding cache directly. The local RocksDB serves as a query engine to back up the complete embedding table on the local SSDs and assist the Redis cluster with looking up missing embedding keys. Please find more information here.

  • Parquet Format Support Within the Data Generator: The HugeCTR data generator now supports the Parquet format, which can be configured easily using the Python API. For more information, refer to Data Generator API.

  • Python Interface Support for the Data Generator: The data generator has been enabled within the HugeCTR Python interface. The parameters associated with the data generator have been encapsulated into the DataGeneratorParams struct, which is required to initialize the DataGenerator instance. You can use the data generator's Python APIs to easily generate the Norm, Parquet, or Raw dataset formats with the desired distribution of sparse keys. For more information, refer to Data Generator API and Data Generator Samples.
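    A sketch of generating a Parquet dataset whose sparse keys follow a power-law distribution (shapes and paths are placeholders):

        import hugectr
        from hugectr.tools import DataGeneratorParams, DataGenerator

        # Sketch: synthesize a Parquet dataset with power-law distributed keys.
        params = DataGeneratorParams(
            format=hugectr.DataReaderType_t.Parquet,
            label_dim=1,
            dense_dim=13,
            num_slot=4,
            i64_input_key=True,
            source="./parquet_data/file_list.txt",
            eval_source="./parquet_data/file_list_test.txt",
            slot_size_array=[10000, 10000, 10000, 10000],  # placeholder cardinalities
            check_type=hugectr.Check_t.Non,
            dist_type=hugectr.Distribution_t.PowerLaw,
            power_law_type=hugectr.PowerLaw_t.Short,  # alpha = 1.3 (Long 0.9, Medium 1.1)
        )
        DataGenerator(params).generate()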

  • Improvements to the Formula of the Power Law Simulator within the Data Generator: We've modified the formula of the power law simulator within the data generator so that a positive alpha value is always produced, which will be needed for most use cases. The alpha values for Long, Medium, and Short within the power law distribution are 0.9, 1.1, and 1.3 respectively. For more information, refer to Data Generator API.

  • Support for Arbitrary Input and Output Tensors in the Concat and Slice Layers: The Concat and Slice layers now support any number of input and output tensors. Previously, these layers were limited to a maximum of four tensors.

  • New Continuous Training Notebook: We’ve added a new notebook to demonstrate how to perform continuous training using the model oversubscription (also referred to as Embedding Training Cache) feature. For more information, refer to HugeCTR Continuous Training.

  • New HugeCTR Contributor Guide: We've added a new HugeCTR Contributor Guide that explains how to contribute to HugeCTR, which may involve reporting and fixing a bug, introducing a new feature, or implementing a new or pending feature.

  • Enhancements to the Sparse Operation Kit (SOK): SOK now supports TensorFlow 2.5 and 2.6. We also added support for identity hashing, dynamic input, and Horovod within SOK. Lastly, we added a new SOK docs set to help you get started with SOK.
