Redact sensitive information in catalog queries #24563

Draft
wants to merge 10 commits into
base: master

Member

@piotrrzysko piotrrzysko commented Dec 23, 2024

Description

This is a follow-up to #24562 that introduces redacting of security-sensitive information in statements containing connector properties, specifically:

  • CREATE CATALOG
  • EXPLAIN CREATE CATALOG
  • PREPARE CREATE CATALOG

The current approach is as follows:

  • For syntactically valid statements, only properties containing sensitive information are masked.
  • If a valid query references a nonexistent connector, all properties are masked.
  • If a query fails before or during parsing, nothing is masked.

Redacted queries are returned through the REST API, the system.runtime.queries table, and query events (QueryCreatedEvent and QueryCompletedEvent).
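A minimal sketch of the masking rules listed above, assuming a hypothetical in-memory registry of sensitive property names per connector. The class name, the method name, and the `postgresql` entry are illustrative and not taken from the PR; only the `***` placeholder matches the PR's REDACTED_VALUE. Statements that fail before or during parsing never reach this step, so their text stays unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class RedactionPolicySketch
{
    static final String REDACTED_VALUE = "***";

    // Hypothetical registry: connector name -> names of its security-sensitive properties.
    static final Map<String, Set<String>> SENSITIVE_BY_CONNECTOR = Map.of(
            "postgresql", Set.of("connection-password"));

    // Masks the properties of an already parsed CREATE CATALOG statement.
    static Map<String, String> redact(String connectorName, Map<String, String> properties)
    {
        Optional<Set<String>> sensitive = Optional.ofNullable(SENSITIVE_BY_CONNECTOR.get(connectorName));
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            // Known connector: mask only its sensitive properties.
            // Unknown connector: no list is available, so every value is masked.
            boolean mask = sensitive
                    .map(names -> names.contains(entry.getKey()))
                    .orElse(true);
            result.put(entry.getKey(), mask ? REDACTED_VALUE : entry.getValue());
        }
        return result;
    }
}
```

Masking everything for an unknown connector errs on the side of hiding too much, which seems to be the intent of the second rule.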

Note that this PR currently includes two commits from #24562.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Redact sensitive information in statements containing connector properties. ({issue}`23106`)

Member

@hashhar hashhar left a comment

Looks mostly good to me. Some comments.

return statementRedactingEnabled;
}

@Config("statement-redacting-enabled")
Member

@mosabua for suggestions about config naming. 😄

Member

I'm not sure we want an option to disable this. Maybe as a temporary kill switch, but we should remove this as soon as we are happy with this feature

Member

Agreed. In that case we can prefix it with experimental., as we have done in the past, to make this clear. Or maybe deprecated. from the beginning.

Member Author

Added deprecated. prefix.

}

@Override
protected Node visitCreateCatalog(CreateCatalog createCatalog, Void context)
Member

Is there some way to notice when we need to add new node visitors here?

Should this be a "wrapper" like the various Forwarding*** classes, with a test asserting that the full set of methods is overridden? That way, once new methods get added, we'll explicitly need to override each one, either as a no-op or to redact.

WDYT? Might be overkill for now, so no need to change anything; I just wanted to start a discussion.

Member Author

That's a good point.

Perhaps this test could verify that all (minus exclusions) visit methods are implemented only for Statement nodes and not for all possible Node types.

I think adding such a test is feasible, but I'll hold off for now to ensure we’ve reached agreement on the core parts of this functionality (e.g., the SPI, where the redacting is performed, etc.).
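One way such a coverage check could look, as a rough sketch only. BaseVisitor and RedactingVisitor are hypothetical stand-ins, not the real Trino visitor or redactor classes, and the exclusion set is an assumed mechanism for visit methods that intentionally need no redaction.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

class VisitorCoverageSketch
{
    // Hypothetical stand-in for the AST visitor base class.
    static class BaseVisitor
    {
        protected String visitCreateCatalog(String node) { return node; }
        protected String visitExplain(String node) { return node; }
        protected String visitQuery(String node) { return node; }
    }

    // Hypothetical stand-in for the redactor; it does not override visitQuery.
    static class RedactingVisitor extends BaseVisitor
    {
        @Override
        protected String visitCreateCatalog(String node) { return "redacted"; }

        @Override
        protected String visitExplain(String node) { return "redacted"; }
    }

    // Returns the visit* methods declared by the base class that the subclass
    // neither overrides nor lists in the exclusions.
    static Set<String> missingOverrides(Class<?> base, Class<?> subclass, Set<String> exclusions)
    {
        Set<String> missing = new TreeSet<>();
        for (Method method : base.getDeclaredMethods()) {
            if (!method.getName().startsWith("visit")
                    || Modifier.isStatic(method.getModifiers())
                    || method.isSynthetic()
                    || exclusions.contains(method.getName())) {
                continue;
            }
            try {
                // Succeeds only if the subclass itself declares the same signature.
                subclass.getDeclaredMethod(method.getName(), method.getParameterTypes());
            }
            catch (NoSuchMethodException e) {
                missing.add(method.getName());
            }
        }
        return missing;
    }
}
```

A test would then assert that the returned set is empty. Restricting the check to Statement nodes, as suggested above, could be done by additionally filtering on the method's parameter type.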

@@ -240,7 +248,7 @@ private <C> void createQueryInternal(QueryId queryId, Span querySpan, Slug slug,
DispatchQuery dispatchQuery = dispatchQueryFactory.createDispatchQuery(
Member

this automatically also handles things like event listener and QueryResource right?

Might be worth explicitly calling it out in the commit message (although you do imply that by mentioning anything using QueryInfo/BasicQueryInfo).

Member Author

this automatically also handles things like event listener and QueryResource right?

Correct.

I extracted the tests confirming that into separate commits to avoid distracting from the core functionality of redacting.

I refined the commit message and included your suggestion.


public class SensitiveStatementRedactor
{
public static final String REDACTED_VALUE = "***";
Member

We should consider a better value here than just ***. We could also consider using a special function like $redacted$(), which just throws an exception if you try to actually call it.

Member

*** seems to be almost what everyone uses for redaction.

Can you expand on the function idea? Is that to make it so that the output of SHOW CREATE CATALOG (as an example) is valid but still fails when you try to run it?

@piotrrzysko piotrrzysko force-pushed the redact-sensitive-queries branch from ed595a1 to 654d3e2 Compare January 9, 2025 17:51
@github-actions github-actions bot added hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Jan 9, 2025
@piotrrzysko piotrrzysko force-pushed the redact-sensitive-queries branch 3 times, most recently from e049e24 to c31d6e1 Compare January 13, 2025 15:28
@piotrrzysko piotrrzysko force-pushed the redact-sensitive-queries branch 2 times, most recently from df20a77 to 2490789 Compare January 20, 2025 08:34
The SPI will be used by the engine to redact security-sensitive
information in statements that manage catalogs. It has been added at the
connector factory level, rather than the connector level, to allow more
flexibility in retrieving properties. In some cases, we want to perform
redacting before a connector is initiated. For example, when we create a
new catalog by issuing the CREATE CATALOG statement.

Exposed properties fall into one of the following categories: they are
either explicitly marked as security-sensitive or are unknown. The
connector assumes that unknown properties might be misspelled
security-sensitive properties.

This preparatory commit enables bootstrapping HDFS to retrieve its
security-sensitive properties.

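The property classification described in the SPI commit message above (explicitly sensitive or unknown) might be sketched as follows. This is an illustration under stated assumptions, not the SPI added by the PR: the class, the method name and signature, and the property names are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SensitivePropertiesSketch
{
    // Hypothetical connector metadata: every property the connector knows,
    // and the subset explicitly marked as security-sensitive.
    static final Set<String> KNOWN = Set.of("connection-url", "connection-user", "connection-password");
    static final Set<String> SENSITIVE = Set.of("connection-password");

    // Hypothetical factory-level method: it works from the raw catalog
    // properties, so no connector instance has to exist yet.
    static Set<String> securitySensitivePropertyNames(Map<String, String> catalogProperties)
    {
        Set<String> result = new TreeSet<>();
        for (String name : catalogProperties.keySet()) {
            // Exposed if explicitly sensitive, or if unknown and therefore
            // possibly a misspelled sensitive property.
            if (SENSITIVE.contains(name) || !KNOWN.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

With this rule, a misspelled name such as `connection-pasword` would be reported alongside `connection-password`, while `connection-url` would not.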
This commit introduces redacting of security-sensitive information in
the following statements:

* CREATE CATALOG
* EXPLAIN CREATE CATALOG
* PREPARE CREATE CATALOG

The current approach is as follows:

* For syntactically valid statements, only properties containing
sensitive information are masked.
* If a query is syntactically valid but retrieving security-sensitive
properties fails for any reason (e.g., the query references a
nonexistent connector or catalog property evaluation fails), all
properties are masked.
* If a query fails before or during parsing, nothing is masked.

The redacted form is created right before initialization of the
QueryStateMachine and is propagated to all places that create QueryInfo
and BasicQueryInfo (e.g., REST endpoints, query events, and
the system.runtime.queries table). Before this change,
QueryInfo/BasicQueryInfo stored the raw query text received from the end
user. From now on, the text will be altered for the cases listed above.
@piotrrzysko piotrrzysko force-pushed the redact-sensitive-queries branch from 2490789 to 7eec53c Compare January 20, 2025 08:50
@piotrrzysko
Member Author

A few questions/suggestions:

  1. For now, I’m not masking syntactically invalid or unsupported queries (e.g., EXPLAIN ANALYZE CREATE CATALOG) in any way. Initially, I handled this by replacing the entire query text with ***. However, this seems like a significant change from the user’s perspective. I suggest starting a separate discussion about it and addressing it as a follow-up if needed.

  2. Regarding $redacted$() vs. ***, I propose creating a GitHub issue to start a discussion. I believe we need to involve more people in this conversation. If we decide to go with $redacted$(), input from Martin would be necessary, as this is a syntax-related change.

  3. I noticed an inconsistency around PREPARE CREATE CATALOG. While issuing PREPARE CREATE CATALOG is allowed, executing the prepared statement is not. Please take a look at this test: link.
    Currently, I’m not masking EXECUTE arguments because I’m unsure which direction we prefer:

    • Forbid PREPARE CREATE CATALOG, or
    • Add full support for it.

@dain @hashhar I'd appreciate your feedback.

@JsonConstructor for TrimmedBasicQueryInfo was introduced to facilitate
the deserialization of server responses in tests.
@piotrrzysko piotrrzysko force-pushed the redact-sensitive-queries branch from 7eec53c to 26978c8 Compare January 20, 2025 09:28
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

Redact properties from CREATE CATALOG in query info, so they are not present in any outputs
3 participants