Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[clang module] Current Working Directory Pruning #124786

Merged
merged 5 commits into from
Feb 5, 2025

Conversation

qiongsiwu
Copy link
Contributor

When computing the context hash, clang always includes the compiler's working directory. This can lead to situations when the only difference between two compilations is the working directory, different module variants are generated. These variants are redundant. This PR implements an optimization that ignores the working directory when computing the context hash when safe.

Specifically, clang checks if it is safe to ignore the working directory in isSafeToIgnoreCWD. The check involves going through compile command options to see if any paths specified are relative. The definition of relative path used here is that the input path is not empty, and llvm::sys::path::is_absolute is false. If all the paths examined are not relative, clang considers it safe to ignore the current working directory and does not consider the working directory when computing the context hash.

@qiongsiwu qiongsiwu marked this pull request as draft January 28, 2025 16:41
@llvmbot llvmbot added the clang Clang issues not falling into any other category label Jan 28, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 28, 2025

@llvm/pr-subscribers-clang

Author: Qiongsi Wu (qiongsiwu)

Changes

When computing the context hash, clang always includes the compiler's working directory. This can lead to situations when the only difference between two compilations is the working directory, different module variants are generated. These variants are redundant. This PR implements an optimization that ignores the working directory when computing the context hash when safe.

Specifically, clang checks if it is safe to ignore the working directory in isSafeToIgnoreCWD. The check involves going through compile command options to see if any paths specified are relative. The definition of relative path used here is that the input path is not empty, and llvm::sys::path::is_absolute is false. If all the paths examined are not relative, clang considers it safe to ignore the current working directory and does not consider the working directory when computing the context hash.


Full diff: https://github.com/llvm/llvm-project/pull/124786.diff

5 Files Affected:

  • (modified) clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h (+4-1)
  • (modified) clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp (+89-3)
  • (added) clang/test/ClangScanDeps/modules-context-hash-cwd.c (+123)
  • (modified) clang/test/ClangScanDeps/working-dir.m (+1-1)
  • (modified) clang/tools/clang-scan-deps/ClangScanDeps.cpp (+2)
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4a343f2872d8d9..9ad8e68c33eb10 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -63,7 +63,10 @@ enum class ScanningOptimizations {
   /// Canonicalize -D and -U options.
   Macros = 8,
 
-  DSS_LAST_BITMASK_ENUM(Macros),
+  /// Ignore the compiler's working directory if it is safe.
+  IgnoreCWD = 0x10,
+
+  DSS_LAST_BITMASK_ENUM(IgnoreCWD),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp b/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
index 2e97cac0796cee..714efb86fa3796 100644
--- a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
+++ b/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
@@ -397,9 +397,92 @@ void ModuleDepCollector::applyDiscoveredDependencies(CompilerInvocation &CI) {
   }
 }
 
+static bool isSafeToIgnoreCWD(const CowCompilerInvocation &CI) {
+  // Check if the command line input uses relative paths.
+  // It is not safe to ignore the current working directory if any of the
+  // command line inputs use relative paths.
+#define IF_RELATIVE_RETURN_FALSE(PATH)                                         \
+  do {                                                                         \
+    if (!PATH.empty() && !llvm::sys::path::is_absolute(PATH))                  \
+      return false;                                                            \
+  } while (0)
+
+#define IF_ANY_RELATIVE_RETURN_FALSE(PATHS)                                    \
+  do {                                                                         \
+    if (std::any_of(PATHS.begin(), PATHS.end(), [](const auto &P) {            \
+          return !P.empty() && !llvm::sys::path::is_absolute(P);               \
+        }))                                                                    \
+      return false;                                                            \
+  } while (0)
+
+  // Header search paths.
+  const auto &HeaderSearchOpts = CI.getHeaderSearchOpts();
+  IF_RELATIVE_RETURN_FALSE(HeaderSearchOpts.Sysroot);
+  for (auto &Entry : HeaderSearchOpts.UserEntries)
+    if (Entry.IgnoreSysRoot)
+      IF_RELATIVE_RETURN_FALSE(Entry.Path);
+  IF_RELATIVE_RETURN_FALSE(HeaderSearchOpts.ResourceDir);
+  IF_RELATIVE_RETURN_FALSE(HeaderSearchOpts.ModuleCachePath);
+  IF_RELATIVE_RETURN_FALSE(HeaderSearchOpts.ModuleUserBuildPath);
+  for (auto I = HeaderSearchOpts.PrebuiltModuleFiles.begin(),
+            E = HeaderSearchOpts.PrebuiltModuleFiles.end();
+       I != E;) {
+    auto Current = I++;
+    IF_RELATIVE_RETURN_FALSE(Current->second);
+  }
+  IF_ANY_RELATIVE_RETURN_FALSE(HeaderSearchOpts.PrebuiltModulePaths);
+  IF_ANY_RELATIVE_RETURN_FALSE(HeaderSearchOpts.VFSOverlayFiles);
+
+  // Preprocessor options.
+  const auto &PPOpts = CI.getPreprocessorOpts();
+  IF_ANY_RELATIVE_RETURN_FALSE(PPOpts.MacroIncludes);
+  IF_ANY_RELATIVE_RETURN_FALSE(PPOpts.Includes);
+  IF_RELATIVE_RETURN_FALSE(PPOpts.ImplicitPCHInclude);
+
+  // Frontend options.
+  const auto &FrontendOpts = CI.getFrontendOpts();
+  for (const FrontendInputFile &Input : FrontendOpts.Inputs) {
+    if (Input.isBuffer())
+      continue; // FIXME: Can this happen when parsing command-line?
+
+    IF_RELATIVE_RETURN_FALSE(Input.getFile());
+  }
+  IF_RELATIVE_RETURN_FALSE(FrontendOpts.CodeCompletionAt.FileName);
+  IF_ANY_RELATIVE_RETURN_FALSE(FrontendOpts.ModuleMapFiles);
+  IF_ANY_RELATIVE_RETURN_FALSE(FrontendOpts.ModuleFiles);
+  IF_ANY_RELATIVE_RETURN_FALSE(FrontendOpts.ModulesEmbedFiles);
+  IF_ANY_RELATIVE_RETURN_FALSE(FrontendOpts.ASTMergeFiles);
+  IF_RELATIVE_RETURN_FALSE(FrontendOpts.OverrideRecordLayoutsFile);
+  IF_RELATIVE_RETURN_FALSE(FrontendOpts.StatsFile);
+
+  // Filesystem options.
+  const auto &FileSystemOpts = CI.getFileSystemOpts();
+  IF_RELATIVE_RETURN_FALSE(FileSystemOpts.WorkingDir);
+
+  // Codegen options.
+  const auto &CodeGenOpts = CI.getCodeGenOpts();
+  IF_RELATIVE_RETURN_FALSE(CodeGenOpts.DebugCompilationDir);
+  IF_RELATIVE_RETURN_FALSE(CodeGenOpts.CoverageCompilationDir);
+
+  // Sanitizer options.
+  IF_ANY_RELATIVE_RETURN_FALSE(CI.getLangOpts().NoSanitizeFiles);
+
+  // Coverage mappings.
+  IF_RELATIVE_RETURN_FALSE(CodeGenOpts.ProfileInstrumentUsePath);
+  IF_RELATIVE_RETURN_FALSE(CodeGenOpts.SampleProfileFile);
+  IF_RELATIVE_RETURN_FALSE(CodeGenOpts.ProfileRemappingFile);
+
+  // Dependency output options.
+  for (auto &ExtraDep : CI.getDependencyOutputOpts().ExtraDeps)
+    IF_RELATIVE_RETURN_FALSE(ExtraDep.first);
+
+  return true;
+}
+
 static std::string getModuleContextHash(const ModuleDeps &MD,
                                         const CowCompilerInvocation &CI,
                                         bool EagerLoadModules,
+                                        bool IgnoreCWD,
                                         llvm::vfs::FileSystem &VFS) {
   llvm::HashBuilder<llvm::TruncatedBLAKE3<16>, llvm::endianness::native>
       HashBuilder;
@@ -410,7 +493,7 @@ static std::string getModuleContextHash(const ModuleDeps &MD,
   HashBuilder.add(getClangFullRepositoryVersion());
   HashBuilder.add(serialization::VERSION_MAJOR, serialization::VERSION_MINOR);
   llvm::ErrorOr<std::string> CWD = VFS.getCurrentWorkingDirectory();
-  if (CWD)
+  if (CWD && !IgnoreCWD)
     HashBuilder.add(*CWD);
 
   // Hash the BuildInvocation without any input files.
@@ -443,8 +526,11 @@ static std::string getModuleContextHash(const ModuleDeps &MD,
 
 void ModuleDepCollector::associateWithContextHash(
     const CowCompilerInvocation &CI, ModuleDeps &Deps) {
-  Deps.ID.ContextHash = getModuleContextHash(
-      Deps, CI, EagerLoadModules, ScanInstance.getVirtualFileSystem());
+  bool IgnoreCWD = any(OptimizeArgs & ScanningOptimizations::IgnoreCWD) &&
+                   isSafeToIgnoreCWD(CI);
+  Deps.ID.ContextHash =
+      getModuleContextHash(Deps, CI, EagerLoadModules, IgnoreCWD,
+                           ScanInstance.getVirtualFileSystem());
   bool Inserted = ModuleDepsByID.insert({Deps.ID, &Deps}).second;
   (void)Inserted;
   assert(Inserted && "duplicate module mapping");
diff --git a/clang/test/ClangScanDeps/modules-context-hash-cwd.c b/clang/test/ClangScanDeps/modules-context-hash-cwd.c
new file mode 100644
index 00000000000000..45be72301c635d
--- /dev/null
+++ b/clang/test/ClangScanDeps/modules-context-hash-cwd.c
@@ -0,0 +1,123 @@
+// Test current directory pruning when computing the context hash.
+
+// REQUIRES: shell
+
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+// RUN: sed -e "s|DIR|%/t|g" %t/cdb0.json.in > %t/cdb0.json
+// RUN: sed -e "s|DIR|%/t|g" %t/cdb1.json.in > %t/cdb1.json
+// RUN: sed -e "s|DIR|%/t|g" %t/cdb2.json.in > %t/cdb2.json
+// RUN: clang-scan-deps -compilation-database %t/cdb0.json -format experimental-full > %t/result0.json
+// RUN: clang-scan-deps -compilation-database %t/cdb1.json -format experimental-full > %t/result1.json
+// RUN: clang-scan-deps -compilation-database %t/cdb2.json -format experimental-full -optimize-args=header-search,system-warnings,vfs,canonicalize-macros > %t/result2.json
+// RUN: cat %t/result0.json %t/result1.json | FileCheck %s
+// RUN: cat %t/result0.json %t/result2.json | FileCheck %s -check-prefix=SKIPOPT
+
+//--- cdb0.json.in
+[{
+  "directory": "DIR",
+  "command": "clang -c DIR/tu.c -fmodules -fmodules-cache-path=DIR/cache -IDIR/include/ -o DIR/tu.o",
+  "file": "DIR/tu.c"
+}]
+
+//--- cdb1.json.in
+[{
+  "directory": "DIR/a",
+  "command": "clang -c DIR/tu.c -fmodules -fmodules-cache-path=DIR/cache -IDIR/include/ -o DIR/tu.o",
+  "file": "DIR/tu.c"
+}]
+
+//--- cdb2.json.in
+[{
+  "directory": "DIR/a/",
+  "command": "clang -c DIR/tu.c -fmodules -fmodules-cache-path=DIR/cache -IDIR/include/ -o DIR/tu.o",
+  "file": "DIR/tu.c"
+}]
+
+//--- include/module.modulemap
+module mod {
+  header "mod.h"
+}
+
+//--- include/mod.h
+
+//--- tu.c
+#include "mod.h"
+
+// Check that result0 and result1 compute the same hash with optimization
+// on. The only difference between result0 and result1 is the compiler's
+// working directory.
+// CHECK:     {
+// CHECK-NEXT:  "modules": [
+// CHECK-NEXT:   {
+// CHECK-NEXT:     "clang-module-deps": [],
+// CHECK:          "context-hash": "[[HASH:.*]]",
+// CHECK:        }
+// CHECK:       "translation-units": [
+// CHECK:        {
+// CHECK:          "commands": [
+// CHECK:          {
+// CHECK-NEXT:        "clang-context-hash": "{{.*}}",
+// CHECK-NEXT:        "clang-module-deps": [
+// CHECK-NEXT:          {
+// CHECK-NEXT:            "context-hash": "[[HASH]]",
+// CHECK-NEXT:            "module-name": "mod"
+// CHECK:               }
+// CHECK:             ],
+// CHECK:     {
+// CHECK-NEXT:   "modules": [
+// CHECK-NEXT:    {
+// CHECK-NEXT:      "clang-module-deps": [],
+// CHECK:           "context-hash": "[[HASH]]",
+// CHECK:         }
+// CHECK:        "translation-units": [
+// CHECK:         {
+// CHECK:           "commands": [
+// CHECK:           {
+// CHECK-NEXT:         "clang-context-hash": "{{.*}}",
+// CHECK-NEXT:         "clang-module-deps": [
+// CHECK-NEXT:           {
+// CHECK-NEXT:             "context-hash": "[[HASH]]",
+// CHECK-NEXT:             "module-name": "mod"
+// CHECK:               }
+// CHECK:              ],
+
+// Check that result0 and result2 compute different hashes because
+// the working directory optmization is turned off for result2.
+// SKIPOPT:      {
+// SKIPOPT-NEXT:   "modules": [
+// SKIPOPT-NEXT:    {
+// SKIPOPT-NEXT:      "clang-module-deps": [],
+// SKIPOPT:           "context-hash": "[[HASH0:.*]]",
+// SKIPOPT:         }
+// SKIPOPT:        "translation-units": [
+// SKIPOPT:         {
+// SKIPOPT:            "commands": [
+// SKIPOPT:             {
+// SKIPOPT-NEXT:          "clang-context-hash": "{{.*}}",
+// SKIPOPT-NEXT:          "clang-module-deps": [
+// SKIPOPT-NEXT:            {
+// SKIPOPT-NEXT:              "context-hash": "[[HASH0]]",
+// SKIPOPT-NEXT:              "module-name": "mod"
+// SKIPOPT:            }
+// SKIPOPT:          ],
+// SKIPOPT:      {
+// SKIPOPT-NEXT:   "modules": [
+// SKIPOPT-NEXT:     {
+// SKIPOPT-NEXT:       "clang-module-deps": [],
+// SKIPOPT-NOT:        "context-hash": "[[HASH0]]",
+// SKIPOPT:            "context-hash": "[[HASH2:.*]]",
+// SKIPOPT:          }
+// SKIPOPT:       "translation-units": [
+// SKIPOPT:         {
+// SKIPOPT:           "commands": [
+// SKIPOPT:             {
+// SKIPOPT-NEXT:          "clang-context-hash": "{{.*}}",
+// SKIPOPT-NEXT:          "clang-module-deps": [
+// SKIPOPT-NEXT:            {
+// SKIPOPT-NOT:              "context-hash": "[[HASH0]]",
+// SKIPOPT-NEXT:             "context-hash": "[[HASH2]]"
+// SKIPOPT-NEXT:              "module-name": "mod"
+// SKIPOPT:            }
+// SKIPOPT:          ],
+
diff --git a/clang/test/ClangScanDeps/working-dir.m b/clang/test/ClangScanDeps/working-dir.m
index a04f8c2486b98d..c6b7b1988d3cf7 100644
--- a/clang/test/ClangScanDeps/working-dir.m
+++ b/clang/test/ClangScanDeps/working-dir.m
@@ -2,7 +2,7 @@
 // RUN: split-file %s %t
 // RUN: sed -e "s|DIR|%/t|g" %t/build/compile-commands.json.in > %t/build/compile-commands.json
 // RUN: clang-scan-deps -compilation-database %t/build/compile-commands.json \
-// RUN:   -j 1 -format experimental-full --optimize-args=all > %t/deps.db
+// RUN:   -j 1 -format experimental-full --optimize-args=header-search,system-warnings,vfs,canonicalize-macros > %t/deps.db
 // RUN: cat %t/deps.db | sed 's:\\\\\?:/:g' | FileCheck %s -DPREFIX=%/t
 
 // Check that there are two separate modules hashes. One for each working dir.
diff --git a/clang/tools/clang-scan-deps/ClangScanDeps.cpp b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
index 709dc513be2811..8d429534a20073 100644
--- a/clang/tools/clang-scan-deps/ClangScanDeps.cpp
+++ b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
@@ -164,6 +164,8 @@ static void ParseArgs(int argc, char **argv) {
             .Case("system-warnings", ScanningOptimizations::SystemWarnings)
             .Case("vfs", ScanningOptimizations::VFS)
             .Case("canonicalize-macros", ScanningOptimizations::Macros)
+            .Case("ignore-current-working-dir",
+                  ScanningOptimizations::IgnoreCWD)
             .Case("all", ScanningOptimizations::All)
             .Default(std::nullopt);
     if (!Optimization) {

Copy link

github-actions bot commented Jan 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@Bigcheese Bigcheese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the right approach, but I think it would be good to have a test for the relative path checking. Not every option needs a test, just one should be fine.

@qiongsiwu qiongsiwu self-assigned this Jan 30, 2025
@qiongsiwu qiongsiwu marked this pull request as ready for review January 30, 2025 19:06
@qiongsiwu qiongsiwu requested a review from Bigcheese January 30, 2025 19:06
Copy link
Collaborator

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better if this optimization happens really early in the process since you only visit all the options in CI? In that case, you can just reset the CurrentWorkingDirectory in the VFS so all the searching is done without CWD. This avoids any hard to debug issues if some options are not taken care of (needs CWD but not checked) but the trade off is more explicit errors during scanning.

@Bigcheese
Copy link
Contributor

It has to happen after the header search optimization in case that removes relative header search paths.

@qiongsiwu
Copy link
Contributor Author

Gentle ping for review. Thanks!

Copy link
Contributor

@Bigcheese Bigcheese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

Copy link
Collaborator

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some small style comments.

@qiongsiwu qiongsiwu merged commit 54acda2 into llvm:main Feb 5, 2025
8 checks passed
qiongsiwu added a commit to qiongsiwu/llvm-project that referenced this pull request Feb 5, 2025
When computing the context hash, `clang` always includes the compiler's
working directory. This can lead to situations when the only difference
between two compilations is the working directory, different module
variants are generated. These variants are redundant. This PR implements
an optimization that ignores the working directory when computing the
context hash when safe.

Specifically, `clang` checks if it is safe to ignore the working
directory in `isSafeToIgnoreCWD`. The check involves going through
compile command options to see if any paths specified are relative. The
definition of relative path used here is that the input path is not
empty, and `llvm::sys::path::is_absolute` is false. If all the paths
examined are not relative, `clang` considers it safe to ignore the
current working directory and does not consider the working directory
when computing the context hash.

(cherry picked from commit 54acda2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants