Implementation of EXISTS and NOT EXISTS #1703

Merged
merged 40 commits into from
Feb 15, 2025
Changes from 19 commits
b455626
Add the required workflow files...
joka921 Oct 8, 2024
3205fa2
A dummy file for the workflow run thingy.
joka921 Oct 8, 2024
215a927
Merge pull request #1 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
b7bedba
Another test...
joka921 Oct 8, 2024
dfca875
Merge pull request #2 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
8bd6299
More sparql conformance stuff...
joka921 Oct 8, 2024
31cb0d5
Merge pull request #3 from joka921/qlever-conformance-tests
joka921 Oct 8, 2024
5cb6a0e
Backup in the middle.
joka921 Jan 7, 2025
e356ee1
Add some parsing and add some thoughts.
joka921 Jan 7, 2025
fc20174
Also implement NOT EXISTS
joka921 Jan 7, 2025
dde296b
Fix a small warning, to feed this to the tool.
joka921 Jan 7, 2025
0d1c788
Some cleanups and fixes.
joka921 Jan 8, 2025
7ff49c9
Fix compilation.
joka921 Jan 8, 2025
7ec8947
Fix the many many segfaults.
joka921 Jan 8, 2025
c03f3e5
Fix another bug.
joka921 Jan 8, 2025
2da52ab
Fix another bug.
joka921 Jan 8, 2025
cbbc771
Fix another bug.
joka921 Jan 8, 2025
91e5802
blub.
joka921 Jan 8, 2025
c3a9a7d
Added some more tests.
joka921 Jan 8, 2025
0adbfa6
Add some tests at least for the parser and query planner.
joka921 Jan 8, 2025
babd294
Some more tests.
joka921 Jan 9, 2025
6766af3
Added some comments.
joka921 Jan 9, 2025
f2524a8
Merge branch 'master' into exists
joka921 Jan 9, 2025
3a574ea
This is commented and very clean.
joka921 Jan 9, 2025
5809be2
better tests.
joka921 Jan 9, 2025
5294357
Made a pass over `ExistsJoin.h` and `ExistsJoin.cpp`
Jan 10, 2025
0917636
Merge remote-tracking branch 'origin/master' into exists
Feb 4, 2025
2bc5bdf
Changes by Hannah improving documentation and comments
Feb 5, 2025
c2abadd
Fix typo
Feb 5, 2025
c3f0e88
Merge remote-tracking branch 'origin/master' into exists
joka921 Feb 14, 2025
a6842f2
Merged and everything.
joka921 Feb 14, 2025
893f64f
Merge remote-tracking branch 'origin/exists' into exists
joka921 Feb 14, 2025
ee495f4
The test is currently not compiling, as we still have to apply severa…
joka921 Feb 14, 2025
ca30b5a
Also test different datasets.
joka921 Feb 14, 2025
d48d76b
Fix the name of the conformance test-suite
joka921 Feb 14, 2025
87ba7ca
Merge remote-tracking branch 'origin/master' into exists
Feb 14, 2025
cfe3c17
Minor improvements from Hannah's review
Feb 14, 2025
0cd71ac
Merge branch 'master' into exists
hannahbast Feb 14, 2025
608d0ea
Re-insert the `baseIri_` declaration in `SparqlQleverVisitor.h`
Feb 14, 2025
092e0d9
Revert changes in .github/workflows
Feb 15, 2025
10 changes: 10 additions & 0 deletions src/engine/Bind.cpp
@@ -5,12 +5,22 @@
#include "Bind.h"

#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/QueryExecutionTree.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "engine/sparqlExpressions/SparqlExpressionGenerators.h"
#include "util/ChunkedForLoop.h"
#include "util/Exception.h"

// _____________________________________________________________________________
Bind::Bind(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> subtree, parsedQuery::Bind b)
: Operation(qec), _subtree(std::move(subtree)), _bind(std::move(b)) {
_subtree = ExistsJoin::addExistsJoinsToSubtree(
_bind._expression, std::move(_subtree), getExecutionContext(),
cancellationHandle_);
}

// BIND adds exactly one new column
size_t Bind::getResultWidth() const { return _subtree->getResultWidth() + 1; }

6 changes: 3 additions & 3 deletions src/engine/Bind.h
@@ -8,14 +8,14 @@
#include "engine/sparqlExpressions/SparqlExpressionPimpl.h"
#include "parser/ParsedQuery.h"

/// BIND operation, currently only supports a very limited subset of expressions
// BIND operation.
class Bind : public Operation {
public:
static constexpr size_t CHUNK_SIZE = 10'000;

// ____________________________________________________________________________
Bind(QueryExecutionContext* qec, std::shared_ptr<QueryExecutionTree> subtree,
parsedQuery::Bind b)
: Operation(qec), _subtree(std::move(subtree)), _bind(std::move(b)) {}
parsedQuery::Bind b);

private:
std::shared_ptr<QueryExecutionTree> _subtree;
2 changes: 1 addition & 1 deletion src/engine/CMakeLists.txt
@@ -14,5 +14,5 @@ add_library(engine
CartesianProductJoin.cpp TextIndexScanForWord.cpp TextIndexScanForEntity.cpp
TextLimit.cpp LazyGroupBy.cpp GroupByHashMapOptimization.cpp SpatialJoin.cpp
CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp ExecuteUpdate.cpp
Describe.cpp GraphStoreProtocol.cpp)
Describe.cpp ExistsJoin.cpp GraphStoreProtocol.cpp)
qlever_target_link_libraries(engine util index parser sparqlExpressions http SortPerformanceEstimator Boost::iostreams s2)
199 changes: 199 additions & 0 deletions src/engine/ExistsJoin.cpp
@@ -0,0 +1,199 @@
// Copyright 2025, University of Freiburg
// Chair of Algorithms and Data Structures
// Author: Johannes Kalmbach <[email protected]>

#include "engine/ExistsJoin.h"

#include "engine/QueryPlanner.h"
#include "engine/sparqlExpressions/ExistsExpression.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "util/JoinAlgorithms/JoinAlgorithms.h"

// _____________________________________________________________________________
ExistsJoin::ExistsJoin(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> left,
std::shared_ptr<QueryExecutionTree> right,
Variable existsVariable)
: Operation{qec},
left_{std::move(left)},
right_{std::move(right)},
joinColumns_{QueryExecutionTree::getJoinColumns(*left_, *right_)},
existsVariable_{std::move(existsVariable)} {
// Make sure that the left and right input are sorted on the join columns.
std::tie(left_, right_) = QueryExecutionTree::createSortedTrees(
std::move(left_), std::move(right_), joinColumns_);
}

// _____________________________________________________________________________
string ExistsJoin::getCacheKeyImpl() const {
return absl::StrCat("EXISTS JOIN left: ", left_->getCacheKey(),
" right: ", right_->getCacheKey());
}

// _____________________________________________________________________________
string ExistsJoin::getDescriptor() const { return "Exists Join"; }

// ____________________________________________________________________________
VariableToColumnMap ExistsJoin::computeVariableToColumnMap() const {
auto res = left_->getVariableColumns();
AD_CONTRACT_CHECK(
!res.contains(existsVariable_),
"The target variable of an EXISTS join must be a new variable");
res[existsVariable_] = makeAlwaysDefinedColumn(getResultWidth() - 1);
return res;
}

// ____________________________________________________________________________
size_t ExistsJoin::getResultWidth() const {
// We add one column to the input.
return left_->getResultWidth() + 1;
}

// ____________________________________________________________________________
vector<ColumnIndex> ExistsJoin::resultSortedOn() const {
// We add one column to `left_`, but do not change the order of the rows.
return left_->resultSortedOn();
}

// ____________________________________________________________________________
float ExistsJoin::getMultiplicity(size_t col) {
// The multiplicities of all columns except the last one are the same as in
// `left_`.
if (col < getResultWidth() - 1) {
return left_->getMultiplicity(col);
}
// For the added (Boolean) column we take a dummy value, assuming that it
// will not be used for subsequent joins or other operations that make use of
// the multiplicities.
return 1;
}

// ____________________________________________________________________________
uint64_t ExistsJoin::getSizeEstimateBeforeLimit() {
return left_->getSizeEstimate();
}

// ____________________________________________________________________________
size_t ExistsJoin::getCostEstimate() {
// The implementation is a linear zipper join.
return left_->getCostEstimate() + right_->getCostEstimate() +
left_->getSizeEstimate() + right_->getSizeEstimate();
}

// ____________________________________________________________________________
ProtoResult ExistsJoin::computeResult([[maybe_unused]] bool requestLaziness) {
auto leftRes = left_->getResult();
auto rightRes = right_->getResult();
const auto& left = leftRes->idTable();
const auto& right = rightRes->idTable();

// We reuse the generic `zipperJoinWithUndef` function, which has two
// callbacks: one for each matching pair of rows from `left` and `right`, and
// one for rows in the left input that have no matching counterpart in the
// right input. The first callback can be a noop, and the second callback
// gives us exactly those rows where the value in the to-be-added result
// column should be `false`.

// Extract the join columns from both inputs to make the following code
// easier.
ad_utility::JoinColumnMapping joinColumnData{joinColumns_, left.numColumns(),
right.numColumns()};
IdTableView<0> joinColumnsLeft =
left.asColumnSubsetView(joinColumnData.jcsLeft());
IdTableView<0> joinColumnsRight =
right.asColumnSubsetView(joinColumnData.jcsRight());
checkCancellation();

// Compute `isCheap`, which is true iff there are no UNDEF values in the join
// columns (in which case we can use a simpler and cheaper join algorithm).
//
// TODO<joka921> There are many other cases where a cheaper implementation can
// be chosen, but we leave those for another PR, this is the most common case.
namespace stdr = ql::ranges;
size_t numJoinColumns = joinColumnsLeft.numColumns();
AD_CORRECTNESS_CHECK(numJoinColumns == joinColumnsRight.numColumns());
bool isCheap = stdr::none_of(
ad_utility::integerRange(numJoinColumns), [&](const auto& col) {
return (stdr::any_of(joinColumnsRight.getColumn(col),
&Id::isUndefined)) ||
(stdr::any_of(joinColumnsLeft.getColumn(col), &Id::isUndefined));
});

// Nothing to do for the actual matches.
auto noopRowAdder = ad_utility::noop;

// Store the indices of rows for which the value of the `EXISTS` (in the added
// Boolean column) should be `false`.
std::vector<size_t, ad_utility::AllocatorWithLimit<size_t>> notExistsIndices{
allocator()};
// The callback is called with iterators, so we convert them back to indices.
auto actionForNotExisting =
[&notExistsIndices, begin = joinColumnsLeft.begin()](const auto& itLeft) {
notExistsIndices.push_back(itLeft - begin);
};

// Run `zipperJoinWithUndef` with the described callbacks and the mentioned
// optimization in case we know that there are no UNDEF values in the join
// columns.
auto checkCancellationLambda = [this] { checkCancellation(); };
auto runZipperJoin = [&](auto findUndef) {
[[maybe_unused]] auto numOutOfOrder = ad_utility::zipperJoinWithUndef(
joinColumnsLeft, joinColumnsRight, ql::ranges::lexicographical_compare,
noopRowAdder, findUndef, findUndef, actionForNotExisting,
checkCancellationLambda);
};
if (isCheap) {
runZipperJoin(ad_utility::noop);
} else {
runZipperJoin(ad_utility::findSmallerUndefRanges);
}

// Add the result column from the computed `notExistsIndices` (which tell us
// where the value should be `false`).
IdTable result = left.clone();
result.addEmptyColumn();
decltype(auto) existsCol = result.getColumn(getResultWidth() - 1);
ql::ranges::fill(existsCol, Id::makeFromBool(true));
for (size_t notExistsIndex : notExistsIndices) {
existsCol[notExistsIndex] = Id::makeFromBool(false);
}

// The added column only contains Boolean values, and adds no new words to the
// local vocabulary, so we can simply copy the local vocab from `leftRes`.
return {std::move(result), resultSortedOn(), leftRes->getCopyOfLocalVocab()};
}

// _____________________________________________________________________________
std::shared_ptr<QueryExecutionTree> ExistsJoin::addExistsJoinsToSubtree(
const sparqlExpression::SparqlExpressionPimpl& expression,
std::shared_ptr<QueryExecutionTree> subtree, QueryExecutionContext* qec,
const ad_utility::SharedCancellationHandle& cancellationHandle) {
// Extract all `EXISTS` functions from the given `expression`.
std::vector<const sparqlExpression::SparqlExpression*> existsExpressions;
expression.getPimpl()->getExistsExpressions(existsExpressions);

// For each `EXISTS` function, add the corresponding `ExistsJoin`.
for (auto* expr : existsExpressions) {
const auto& exists =
dynamic_cast<const sparqlExpression::ExistsExpression&>(*expr);
// Currently some FILTERs are applied multiple times (in particular, this
// happens when there are OPTIONAL joins in the query). In these cases we
// have to make sure that the `ExistsJoin` is added only once.
//
// TODO(question from Hannah's review): Why does the following implement
// what the preceding comment says?
if (subtree->isVariableCovered(exists.variable())) {
continue;
}

QueryPlanner qp{qec, cancellationHandle};
auto pq = exists.argument();
auto tree =
std::make_shared<QueryExecutionTree>(qp.createExecutionTree(pq));
subtree = ad_utility::makeExecutionTree<ExistsJoin>(
qec, std::move(subtree), std::move(tree), exists.variable());
}
return subtree;
}
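
The core idea of `computeResult` above (fill the new column with `true`, then flip the rows that have no partner on the right) can be illustrated without any QLever types. The following is only a minimal sketch under simplifying assumptions: a single integer join column, both inputs sorted, and no UNDEF values (the `isCheap` case). The name `existsColumn` is hypothetical and does not appear in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of QLever: for each key in `left`, report
// whether an equal key exists in `right`. Both inputs must be sorted in
// ascending order and must not contain UNDEF values.
std::vector<bool> existsColumn(const std::vector<int>& left,
                               const std::vector<int>& right) {
  assert(std::is_sorted(left.begin(), left.end()));
  assert(std::is_sorted(right.begin(), right.end()));

  // Start with `true` everywhere, then flip the rows without a match.
  std::vector<bool> result(left.size(), true);
  std::size_t j = 0;
  for (std::size_t i = 0; i < left.size(); ++i) {
    // Advance the right cursor past all keys smaller than the current left
    // key. The cursor never moves backwards, so the scan is linear overall.
    while (j < right.size() && right[j] < left[i]) {
      ++j;
    }
    if (j == right.size() || right[j] != left[i]) {
      result[i] = false;
    }
  }
  return result;
}
```

Because each cursor only moves forward, the work is proportional to the sum of the two input sizes, which matches the size terms in `getCostEstimate`. The real implementation additionally handles multiple join columns and UNDEF values through `zipperJoinWithUndef`.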
78 changes: 78 additions & 0 deletions src/engine/ExistsJoin.h
@@ -0,0 +1,78 @@
// Copyright 2025, University of Freiburg
// Chair of Algorithms and Data Structures
// Author: Johannes Kalmbach <[email protected]>

#pragma once

#include "engine/Operation.h"
#include "engine/QueryExecutionTree.h"

// The implementation of an "EXISTS join", which we use to realize the semantics
// of the SPARQL `EXISTS` function. The join takes two subtrees as input
// and returns the left subtree with an additional boolean column that is `true`
// iff at least one matching row is contained in the right subtree.
class ExistsJoin : public Operation {
private:
// The left and right child.
std::shared_ptr<QueryExecutionTree> left_;
std::shared_ptr<QueryExecutionTree> right_;
std::vector<std::array<ColumnIndex, 2>> joinColumns_;

// The variable of the added (Boolean) result column.
Variable existsVariable_;

public:
// Constructor. The `existsVariable` (the variable for the added column) must
// not yet be bound by `left`.
ExistsJoin(QueryExecutionContext* qec,
std::shared_ptr<QueryExecutionTree> left,
std::shared_ptr<QueryExecutionTree> right,
Variable existsVariable);

// For a given subtree and a given expression, extract all the
// `ExistsExpression`s from the expression and add one `ExistsJoin` per
// `ExistsExpression` to the subtree. The left side of the `ExistsJoin` is
// the input subtree, the right hand side of the `ExistsJoin` as well as the
// variable to which the result is bound are extracted from the
// `ExistsExpression`. The returned subtree can then be used to evaluate the
// `expression`.
//
// NOTE: `ExistsExpression` is a dummy that only reads the values of the
// column that is added by the `ExistsJoin`. The main work is done by the
// latter and not by the former.
static std::shared_ptr<QueryExecutionTree> addExistsJoinsToSubtree(
const sparqlExpression::SparqlExpressionPimpl& expression,
std::shared_ptr<QueryExecutionTree> subtree, QueryExecutionContext* qec,
const ad_utility::SharedCancellationHandle& cancellationHandle);

// All following functions are inherited from `Operation`, see there for
// comments.
protected:
string getCacheKeyImpl() const override;

public:
string getDescriptor() const override;

size_t getResultWidth() const override;

vector<ColumnIndex> resultSortedOn() const override;

bool knownEmptyResult() override { return left_->knownEmptyResult(); }

float getMultiplicity(size_t col) override;

private:
uint64_t getSizeEstimateBeforeLimit() override;

public:
size_t getCostEstimate() override;

vector<QueryExecutionTree*> getChildren() override {
return {left_.get(), right_.get()};
}

private:
ProtoResult computeResult([[maybe_unused]] bool requestLaziness) override;

VariableToColumnMap computeVariableToColumnMap() const override;
};
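
The NOTE in the header says that `ExistsExpression` only reads the column added by the `ExistsJoin`. A rough illustration of that division of labor, using hypothetical stand-in types (`Row` and `filterByExistsColumn` are not QLever names): once the Boolean column exists, a `FILTER EXISTS` or `FILTER NOT EXISTS` reduces to comparing that column against the wanted value.

```cpp
#include <vector>

// Hypothetical row type: one payload value plus the Boolean column that an
// EXISTS join would have appended.
struct Row {
  int value;
  bool exists;
};

// Keep the payload of all rows whose precomputed flag equals `wanted`.
// `wanted == true` corresponds to FILTER EXISTS, `wanted == false` to
// FILTER NOT EXISTS. No join work happens here, only a column lookup.
std::vector<int> filterByExistsColumn(const std::vector<Row>& rows,
                                      bool wanted) {
  std::vector<int> kept;
  for (const Row& row : rows) {
    if (row.exists == wanted) {
      kept.push_back(row.value);
    }
  }
  return kept;
}
```

This separation is presumably why `Bind`, `Filter`, and `GroupBy` all call `addExistsJoinsToSubtree` in their constructors: the column has to be in the subtree before the expression is evaluated.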
4 changes: 4 additions & 0 deletions src/engine/Filter.cpp
@@ -10,6 +10,7 @@

#include "backports/algorithm.h"
#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/QueryExecutionTree.h"
#include "engine/sparqlExpressions/SparqlExpression.h"
#include "engine/sparqlExpressions/SparqlExpressionGenerators.h"
@@ -28,6 +29,9 @@ Filter::Filter(QueryExecutionContext* qec,
: Operation(qec),
_subtree(std::move(subtree)),
_expression{std::move(expression)} {
_subtree = ExistsJoin::addExistsJoinsToSubtree(
_expression, std::move(_subtree), getExecutionContext(),
cancellationHandle_);
setPrefilterExpressionForChildren();
}

8 changes: 7 additions & 1 deletion src/engine/GroupBy.cpp
@@ -9,6 +9,7 @@
#include <absl/strings/str_join.h>

#include "engine/CallFixedSize.h"
#include "engine/ExistsJoin.h"
#include "engine/IndexScan.h"
#include "engine/Join.h"
#include "engine/LazyGroupBy.h"
@@ -52,6 +53,12 @@ GroupBy::GroupBy(QueryExecutionContext* qec, vector<Variable> groupByVariables,
ql::ranges::sort(_groupByVariables, std::less<>{}, &Variable::name);

auto sortColumns = computeSortColumns(subtree.get());

for (const auto& alias : _aliases) {
subtree = ExistsJoin::addExistsJoinsToSubtree(
alias._expression, std::move(subtree), getExecutionContext(),
cancellationHandle_);
}
_subtree =
QueryExecutionTree::createSortedTree(std::move(subtree), sortColumns);
}
@@ -1527,7 +1534,6 @@ Result GroupBy::computeGroupByForHashMapOptimization(
// NOTE: If the input blocks have very similar or even identical non-empty
// local vocabs, no deduplication is performed.
localVocab.mergeWith(std::span{&inputLocalVocab, 1});

// Setup the `EvaluationContext` for this input block.
sparqlExpression::EvaluationContext evaluationContext(
*getExecutionContext(), _subtree->getVariableColumns(), inputTable,
12 changes: 4 additions & 8 deletions src/engine/MultiColumnJoin.cpp
@@ -237,12 +237,6 @@
rowAdder.addRow(itLeft - beginLeft, itRight - beginRight);
};

auto findUndef = [](const auto& row, auto begin, auto end,
bool& resultMightBeUnsorted) {
return ad_utility::findSmallerUndefRanges(row, begin, end,
resultMightBeUnsorted);
};

// `isCheap` is true iff there are no UNDEF values in the join columns. In
// this case we can use a much cheaper algorithm.
// TODO<joka921> There are many other cases where a cheaper implementation can
@@ -265,8 +259,10 @@
} else {
return ad_utility::zipperJoinWithUndef(
leftJoinColumns, rightJoinColumns,
ql::ranges::lexicographical_compare, addRow, findUndef, findUndef,
ad_utility::noop, checkCancellationLambda);
ql::ranges::lexicographical_compare, addRow,
ad_utility::findSmallerUndefRanges,
ad_utility::findSmallerUndefRanges, ad_utility::noop,
checkCancellationLambda);

}
}();
*result = std::move(rowAdder).resultTable();