Add WholeGraph Support #6

Status: Closed. This pull request wanted to merge 24 commits.

Commits:
- 2189eb3: wg option (alexbarghi-nv, Jan 3, 2024)
- b6b939f: add ability to construct from wg embedding (alexbarghi-nv, Jan 3, 2024)
- 9d893ec: [BUG] Fix non-type template parameter to cugraph::relabel (#4064) (naimnv, Jan 4, 2024)
- 4366f9f: wholegraph (alexbarghi-nv, Jan 5, 2024)
- 4a5712d: generators (alexbarghi-nv, Jan 5, 2024)
- afb000c: style (alexbarghi-nv, Jan 5, 2024)
- cff6cdf: reformat (alexbarghi-nv, Jan 5, 2024)
- a59bd76: Remove Experimental Wrappers from GNN Code (#4070) (alexbarghi-nv, Jan 5, 2024)
- c7b720d: [FEA]: Add DASK edgelist and graph support to the Dataset API (#4035) (huiyuxie, Jan 9, 2024)
- cd5fc6f: build wheels for `cugraph-dgl` and `cugraph-pyg` (#4075) (tingyu66, Jan 9, 2024)
- 5e8e9b5: Fix MG weighted similarity test failure (#4054) (seunghwak, Jan 10, 2024)
- ae25ea1: Adds `nx-cugraph` benchmarks for 23.12 algos (SSSP, pagerank, hits, k… (rlratzel, Jan 10, 2024)
- 35ae8ef: Correct `cugraph-pyg` package name used in wheels and fix test script… (tingyu66, Jan 10, 2024)
- b22dd99: refactor CUDA versions in dependencies.yaml (#4084) (jameslamb, Jan 11, 2024)
- 88c3884: nx-cugraph: indicate which plc algorithms are used and version_added… (eriknw, Jan 11, 2024)
- c09db10: Sampling Performance Testing (#3584) (alexbarghi-nv, Jan 12, 2024)
- 24d02a5: Fix OOB error, BFS C API should validate that the source vertex is a… (ChuckHastings, Jan 12, 2024)
- 5d4ba38: MNMG ECG (#4030) (naimnv, Jan 12, 2024)
- aa66a32: Remove usages of rapids-env-update (#4090) (KyleFromNVIDIA, Jan 12, 2024)
- 4748ca1: nx-cugraph: PLC now handles isolated nodes; clean up our workarounds… (eriknw, Jan 16, 2024)
- 8672534: nx-cugraph: add weakly connected components (#4071) (eriknw, Jan 17, 2024)
- eacdf58: Provide explicit pool sizes and avoid RMM detail APIs (#4086) (harrism, Jan 17, 2024)
- c5d2a9a: `nx-cugraph`: add `to_undirected` method; add reciprocity algorithms… (eriknw, Jan 18, 2024)
- 5e5094a: fix merge conflicts (alexbarghi-nv, Jan 18, 2024)
40 changes: 40 additions & 0 deletions .github/workflows/build.yaml
@@ -133,3 +133,43 @@ jobs:
       sha: ${{ inputs.sha }}
       date: ${{ inputs.date }}
       package-name: nx-cugraph
+  wheel-build-cugraph-dgl:
+    needs: wheel-publish-cugraph
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: ${{ inputs.build_type || 'branch' }}
+      branch: ${{ inputs.branch }}
+      sha: ${{ inputs.sha }}
+      date: ${{ inputs.date }}
+      script: ci/build_wheel_cugraph-dgl.sh
+  wheel-publish-cugraph-dgl:
+    needs: wheel-build-cugraph-dgl
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: ${{ inputs.build_type || 'branch' }}
+      branch: ${{ inputs.branch }}
+      sha: ${{ inputs.sha }}
+      date: ${{ inputs.date }}
+      package-name: cugraph-dgl
+  wheel-build-cugraph-pyg:
+    needs: wheel-publish-cugraph
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: ${{ inputs.build_type || 'branch' }}
+      branch: ${{ inputs.branch }}
+      sha: ${{ inputs.sha }}
+      date: ${{ inputs.date }}
+      script: ci/build_wheel_cugraph-pyg.sh
+  wheel-publish-cugraph-pyg:
+    needs: wheel-build-cugraph-pyg
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: ${{ inputs.build_type || 'branch' }}
+      branch: ${{ inputs.branch }}
+      sha: ${{ inputs.sha }}
+      date: ${{ inputs.date }}
+      package-name: cugraph-pyg
34 changes: 34 additions & 0 deletions .github/workflows/pr.yaml
@@ -25,6 +25,10 @@ jobs:
       - wheel-tests-cugraph
       - wheel-build-nx-cugraph
       - wheel-tests-nx-cugraph
+      - wheel-build-cugraph-dgl
+      - wheel-tests-cugraph-dgl
+      - wheel-build-cugraph-pyg
+      - wheel-tests-cugraph-pyg
       - devcontainer
     secrets: inherit
     uses: rapidsai/shared-workflows/.github/workflows/[email protected]
@@ -127,6 +131,36 @@ jobs:
     with:
       build_type: pull-request
       script: ci/test_wheel_nx-cugraph.sh
+  wheel-build-cugraph-dgl:
+    needs: wheel-tests-cugraph
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: pull-request
+      script: ci/build_wheel_cugraph-dgl.sh
+  wheel-tests-cugraph-dgl:
+    needs: wheel-build-cugraph-dgl
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: pull-request
+      script: ci/test_wheel_cugraph-dgl.sh
+      matrix_filter: map(select(.ARCH == "amd64"))
+  wheel-build-cugraph-pyg:
+    needs: wheel-tests-cugraph
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: pull-request
+      script: ci/build_wheel_cugraph-pyg.sh
+  wheel-tests-cugraph-pyg:
+    needs: wheel-build-cugraph-pyg
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: pull-request
+      script: ci/test_wheel_cugraph-pyg.sh
+      matrix_filter: map(select(.ARCH == "amd64" and .CUDA_VER == "11.8.0"))
   devcontainer:
     secrets: inherit
     uses: rapidsai/shared-workflows/.github/workflows/[email protected]
18 changes: 18 additions & 0 deletions .github/workflows/test.yaml
@@ -57,3 +57,21 @@ jobs:
       date: ${{ inputs.date }}
       sha: ${{ inputs.sha }}
       script: ci/test_wheel_nx-cugraph.sh
+  wheel-tests-cugraph-dgl:
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: nightly
+      branch: ${{ inputs.branch }}
+      date: ${{ inputs.date }}
+      sha: ${{ inputs.sha }}
+      script: ci/test_wheel_cugraph-dgl.sh
+  wheel-tests-cugraph-pyg:
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/[email protected]
+    with:
+      build_type: nightly
+      branch: ${{ inputs.branch }}
+      date: ${{ inputs.date }}
+      sha: ${{ inputs.sha }}
+      script: ci/test_wheel_cugraph-pyg.sh
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
         pass_filenames: false
         additional_dependencies: [gitpython]
   - repo: https://github.com/rapidsai/dependency-file-generator
-    rev: v1.5.1
+    rev: v1.8.0
     hooks:
       - id: rapids-dependency-file-generator
         args: ["--clean"]
2 changes: 1 addition & 1 deletion benchmarks/cugraph/standalone/bulk_sampling/README.md
@@ -143,7 +143,7 @@ You will need to modify the bash scripts to run appropriately for your environment and
 desired training workflow. The standard sbatch arguments are at the top of the script, such as
 job name, queue, etc. These will need to be modified for your SLURM cluster.
 
-Next are arguments for the container image (which is currently set to the current DLFW image),
+Next are arguments for the container image (required),
 and directories where the data and outputs are stored. The directories default to subdirectories
 of the current working directory. But if there is a high-throughput storage system available,
 using that storage for the samples and datasets is highly recommended.
Changes to an additional file (filename not captured):

@@ -16,7 +16,7 @@
 os.environ["RAPIDS_NO_INITIALIZE"] = "1"
 os.environ["CUDF_SPILL"] = "1"
 os.environ["LIBCUDF_CUFILE_POLICY"] = "KVIKIO"
-os.environ["KVIKIO_NTHREADS"] = "64"
+os.environ["KVIKIO_NTHREADS"] = "8"
 
 import argparse
 import json
@@ -123,6 +123,13 @@ def parse_args():
         required=True,
     )
 
+    parser.add_argument(
+        "--use_wholegraph",
+        action="store_true",
+        help="Whether to use WholeGraph feature storage",
+        required=False,
+    )
+
     parser.add_argument(
         "--model",
         type=str,
@@ -162,6 +169,13 @@ def parse_args():
         required=False,
     )
 
+    parser.add_argument(
+        "--skip_download",
+        action="store_true",
+        help="Whether to skip downloading",
+        required=False,
+    )
+
     return parser.parse_args()
 
 
@@ -186,16 +200,37 @@ def main(args):

     world_size = int(os.environ["SLURM_JOB_NUM_NODES"]) * args.gpus_per_node
 
+    if args.use_wholegraph:
+        # TODO support DGL too
+        # TODO support WG without cuGraph
+        if args.framework not in ["cuGraphPyG"]:
+            raise ValueError("WG feature store only supported with cuGraph backends")
+        from pylibwholegraph.torch.initialize import (
+            get_global_communicator,
+            get_local_node_communicator,
+        )
+
+        logger.info("initializing WG comms...")
+        wm_comm = get_global_communicator()
+        get_local_node_communicator()
+
+        wm_comm = wm_comm.wmb_comm
+        logger.info(f"rank {global_rank} successfully initialized WG comms")
+        wm_comm.barrier()
+
     dataset = OGBNPapers100MDataset(
         replication_factor=args.replication_factor,
         dataset_dir=args.dataset_dir,
         train_split=args.train_split,
         val_split=args.val_split,
         load_edge_index=(args.framework == "PyG"),
+        backend="wholegraph" if args.use_wholegraph else "torch",
     )
 
-    if global_rank == 0:
+    # Note: this does not generate WG files
+    if global_rank == 0 and not args.skip_download:
         dataset.download()
 
     dist.barrier()
 
     fanout = [int(f) for f in args.fanout.split("_")]
@@ -234,6 +269,7 @@ def main(args):
             replace=False,
             num_neighbors=fanout,
             batch_size=args.batch_size,
+            backend="wholegraph" if args.use_wholegraph else "torch",
         )
     else:
         raise ValueError("unsupported framework")
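For reviewers unfamiliar with pylibwholegraph, here is a minimal sketch of the communicator bootstrap the trainer now performs, pulled out of main() for clarity. It is a sketch under two assumptions: torch.distributed is already initialized (the trainer does this elsewhere), and init_wholegraph_comms is a hypothetical helper name; only calls that appear in the diff above are used.

# Sketch only: the WholeGraph communicator bootstrap mirrored from the
# trainer diff above. Assumes torch.distributed.init_process_group() has
# already been called; init_wholegraph_comms is a hypothetical helper.
from pylibwholegraph.torch.initialize import (
    get_global_communicator,
    get_local_node_communicator,
)


def init_wholegraph_comms():
    # The global communicator spans every rank in the job. The local-node
    # communicator is created for its side effect (intra-node setup), so
    # its handle is deliberately not kept, matching the trainer code.
    wm_comm = get_global_communicator()
    get_local_node_communicator()

    # Drop to the underlying wmb communicator and synchronize so that no
    # rank proceeds before all ranks have finished initialization.
    wm_comm = wm_comm.wmb_comm
    wm_comm.barrier()
    return wm_comm

Whether creating the local-node communicator purely for its side effect is required before building embeddings is worth confirming during review.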
Changes to an additional file (filename not captured):

@@ -24,6 +24,10 @@
 import os
 import json
 
+from cugraph.utilities.utils import import_optional
+
+wgth = import_optional("pylibwholegraph.torch")
+
 
 class OGBNPapers100MDataset(Dataset):
     def __init__(
@@ -34,6 +38,7 @@ def __init__(
         train_split=0.8,
         val_split=0.5,
         load_edge_index=True,
+        backend="torch",
     ):
         self.__replication_factor = replication_factor
         self.__disk_x = None
@@ -43,6 +48,7 @@ def __init__(
         self.__train_split = train_split
         self.__val_split = val_split
         self.__load_edge_index = load_edge_index
+        self.__backend = backend
 
     def download(self):
         import logging
@@ -152,6 +158,27 @@ def download(self):
             )
             ldf.to_parquet(node_label_file_path)
 
+        # WholeGraph
+        wg_bin_file_path = os.path.join(dataset_path, "wgb", "paper")
+        if self.__replication_factor == 1:
+            wg_bin_rep_path = os.path.join(wg_bin_file_path, "node_feat.d")
+        else:
+            wg_bin_rep_path = os.path.join(
+                wg_bin_file_path, f"node_feat_{self.__replication_factor}x.d"
+            )
+
+        if not os.path.exists(wg_bin_rep_path):
+            os.makedirs(wg_bin_rep_path)
+            if dataset is None:
+                from ogb.nodeproppred import NodePropPredDataset
+
+                dataset = NodePropPredDataset(
+                    name="ogbn-papers100M", root=self.__dataset_dir
+                )
+            node_feat = dataset[0][0]["node_feat"]
+            for k in range(self.__replication_factor):
+                node_feat.tofile(os.path.join(wg_bin_rep_path, f"{k:04d}.bin"))
+
     @property
     def edge_index_dict(
         self,
@@ -224,21 +251,52 @@ def edge_index_dict(

     @property
     def x_dict(self) -> Dict[str, torch.Tensor]:
+        if self.__disk_x is None:
+            if self.__backend == "wholegraph":
+                self.__load_x_wg()
+            else:
+                self.__load_x_torch()
+
+        return self.__disk_x
+
+    def __load_x_torch(self) -> None:
         node_type_path = os.path.join(
             self.__dataset_dir, "ogbn_papers100M", "npy", "paper"
         )
+        if self.__replication_factor == 1:
+            full_path = os.path.join(node_type_path, "node_feat.npy")
+        else:
+            full_path = os.path.join(
+                node_type_path, f"node_feat_{self.__replication_factor}x.npy"
+            )
 
-        if self.__disk_x is None:
-            if self.__replication_factor == 1:
-                full_path = os.path.join(node_type_path, "node_feat.npy")
-            else:
-                full_path = os.path.join(
-                    node_type_path, f"node_feat_{self.__replication_factor}x.npy"
-                )
+        self.__disk_x = {"paper": torch.as_tensor(np.load(full_path, mmap_mode="r"))}
 
-            self.__disk_x = {"paper": np.load(full_path, mmap_mode="r")}
+    def __load_x_wg(self) -> None:
+        node_type_path = os.path.join(
+            self.__dataset_dir, "ogbn_papers100M", "wgb", "paper"
+        )
+        if self.__replication_factor == 1:
+            full_path = os.path.join(node_type_path, "node_feat.d")
+        else:
+            full_path = os.path.join(
+                node_type_path, f"node_feat_{self.__replication_factor}x.d"
+            )
 
-        return self.__disk_x
+        file_list = [os.path.join(full_path, f) for f in os.listdir(full_path)]
+
+        x = wgth.create_embedding_from_filelist(
+            wgth.get_global_communicator(),
+            "chunked",  # TODO support other options
+            "cpu",  # TODO support GPU
+            file_list,
+            torch.float32,
+            128,
+        )
+
+        print("created x wg embedding", x)
+
+        self.__disk_x = {"paper": x}
 
     @property
     def y_dict(self) -> Dict[str, torch.Tensor]:
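The wgb format written by download() above is raw float32 rows emitted with numpy.tofile, one .bin shard per replication copy; __load_x_wg then maps the shard list into a chunked WholeGraph embedding. Below is a minimal sketch of that round trip. The /tmp/wgb_demo path and the random 1000x128 feature matrix are illustrative stand-ins for the real ogbn-papers100M data, and WholeGraph comms are assumed to be initialized already.

# Sketch only: write raw float32 shards the way download() does, then load
# them the way __load_x_wg does. Paths and sizes here are illustrative.
import os

import numpy as np
import torch
import pylibwholegraph.torch as wgth

shard_dir = "/tmp/wgb_demo/node_feat.d"  # hypothetical shard directory
os.makedirs(shard_dir, exist_ok=True)

# 1000 nodes x 128 features; 128 matches the width hard-coded in the PR.
node_feat = np.random.rand(1000, 128).astype("float32")
node_feat.tofile(os.path.join(shard_dir, "0000.bin"))  # raw rows, no header

# Sorted for determinism; note the PR reads os.listdir() order directly.
file_list = sorted(os.path.join(shard_dir, f) for f in os.listdir(shard_dir))

x = wgth.create_embedding_from_filelist(
    wgth.get_global_communicator(),  # requires WG comms initialized
    "chunked",  # memory layout; other options are a TODO in the PR
    "cpu",  # host memory placement; GPU placement is a TODO in the PR
    file_list,
    torch.float32,  # dtype of the raw shards
    128,  # feature dimension (row width of each shard)
)

One point worth raising in review: because __load_x_wg builds file_list from os.listdir() without sorting, multi-shard replicated layouts could be assembled in filesystem order rather than shard order.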
13 changes: 0 additions & 13 deletions benchmarks/cugraph/standalone/bulk_sampling/run_sampling.sh
@@ -36,8 +36,6 @@ export CUDF_SPILL=1
 export LIBCUDF_CUFILE_POLICY="OFF"
 export GPUS_PER_NODE=8
 
-PATCH_CUGRAPH=1
-
 export SCHEDULER_FILE=$SCHEDULER_FILE
 export LOGS_DIR=$LOGS_DIR
 
@@ -60,17 +58,6 @@ else
     ${MG_UTILS_DIR}/run-dask-process.sh workers &
 fi
 
-if [[ $PATCH_CUGRAPH == 1 ]]; then
-    mkdir /opt/cugraph-patch
-    git clone https://github.com/alexbarghi-nv/cugraph -b dlfw-patch-24.01 /opt/cugraph-patch
-
-    rm /opt/rapids/cugraph/python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py
-    cp /opt/cugraph-patch/python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py /opt/rapids/cugraph/python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py
-    rm /usr/local/lib/python3.10/dist-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py
-    cp /opt/cugraph-patch/python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py /usr/local/lib/python3.10/dist-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py
-
-fi
-
 echo "properly waiting for workers to connect"
 NUM_GPUS=$(python -c "import os; print(int(os.environ['SLURM_JOB_NUM_NODES'])*int(os.environ['GPUS_PER_NODE']))")
 handleTimeout 120 python ${MG_UTILS_DIR}/wait_for_workers.py \
Changes to an additional file (filename not captured):

@@ -18,7 +18,7 @@
 #SBATCH -N 1
 #SBATCH -t 00:25:00
 
-CONTAINER_IMAGE="/lustre/fsw/rapids/abarghi/dlfw_patched.squash"
+CONTAINER_IMAGE=${CONTAINER_IMAGE:="please_specify_container"}
 SCRIPTS_DIR=$(pwd)
 LOGS_DIR=${LOGS_DIR:=$(pwd)"/logs"}
 SAMPLES_DIR=${SAMPLES_DIR:=$(pwd)/samples}