- Added support for format_datetime function in presto to Databricks (#1250). A new `format_datetime` function has been added to the `Parser` class in the `presto.py` file to provide support for formatting datetime values in Presto on Databricks. This function utilizes the `DateFormat.from_arg_list` method from the `local_expression` module to format datetime values according to a specified format string. To ensure compatibility and consistency between Presto and Databricks, a new test file `test_format_datetime_1.sql` has been added, containing SQL queries that demonstrate the usage of the `format_datetime` function in Presto and its equivalent in Databricks, `DATE_FORMAT`. This standalone change adds new functionality without modifying any existing code.
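  A minimal illustration of the mapping, assuming a hypothetical `events` table with a timestamp column `ts` and a format pattern that is valid in both dialects (patterns specific to Presto's Joda-style formats may need further adjustment):

  ```sql
  -- Presto
  SELECT format_datetime(ts, 'yyyy-MM-dd HH:mm:ss') FROM events;

  -- Databricks SQL equivalent
  SELECT DATE_FORMAT(ts, 'yyyy-MM-dd HH:mm:ss') FROM events;
  ```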
- Added support for SnowFlake `SUBSTR` (#1238). This commit enhances the library's Snowflake support by adding the `SUBSTR` function, which was previously unsupported and existed only as an alternative to `SUBSTRING`. The project now fully supports both functions, and the `SUBSTRING` function can be used interchangeably with `SUBSTR` via the new `withConversionStrategy(SynonymOf("SUBSTR"))` method. Additionally, this commit supersedes a previous pull request that lacked a GPG signature and includes a test for the `SUBSTR` function. The `ARRAY_SLICE` function has also been updated to match Snowflake's behavior, and the project now supports a more comprehensive list of SQL functions with their corresponding arity.
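  A small illustration of the synonym handling (the `customers` table and `name` column are hypothetical); both spellings are accepted on the Snowflake side, and both are also valid Databricks SQL, so the call can pass through essentially unchanged:

  ```sql
  SELECT SUBSTR(name, 1, 4)    AS prefix FROM customers;
  SELECT SUBSTRING(name, 1, 4) AS prefix FROM customers;
  ```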
- Added support for json_size function in presto (#1236). A new `json_size` function for Presto has been added, which determines the size of a JSON object or array and returns an integer. Two new methods, `_build_json_size` and `get_json_object`, have been implemented to handle JSON objects and arrays differently, and the Parser and Tokenizer classes of the Presto class have been updated to include the new `json_size` function. An alternative implementation for Databricks using SQL functions is provided, and a test case has been added to cover a fixed `is not null` error for `json_extract` in the Databricks generator. Additionally, a new test file for Presto has been added to test the functionality of the `json_extract` function in Presto, and a new method `GetJsonObject` is introduced to extract a JSON object from a given path. The `json_extract` function has also been updated to extract the value associated with a specified key from JSON data in both Presto and Databricks.
- Enclosed subqueries in parenthesis (#1232). This PR introduces changes to the ExpressionGenerator and LogicalPlanGenerator classes to ensure that subqueries are correctly enclosed in parentheses during code generation. Previously, subqueries were not always enclosed in parentheses, leading to incorrect code. This issue has been addressed by enclosing subqueries in parentheses in the `in` and `scalarSubquery` methods, and by adding new match cases for `ir.Filter` in the `LogicalPlanGenerator` class. The changes also take care to avoid doubling enclosing parentheses in the `.. IN(SELECT...)` pattern. No new methods have been added; existing functionality has been modified to ensure that subqueries are correctly enclosed in parentheses, leading to the generation of correct SQL code. Test cases have been included in a separate PR. These changes improve the correctness of the generated code, avoiding issues such as `SELECT * FROM SELECT * FROM t WHERE a > 'b' a WHERE a > 'b'` and ensuring that the generated code includes parentheses around subqueries.
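  The difference in generated output, using the example from the description (the corrected form shown second is an illustration of the parenthesised output):

  ```sql
  -- Before the fix: the derived table is not parenthesised, which is invalid SQL
  SELECT * FROM SELECT * FROM t WHERE a > 'b' a WHERE a > 'b';

  -- After the fix
  SELECT * FROM (SELECT * FROM t WHERE a > 'b') a WHERE a > 'b';
  ```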
- Fixed serialization of MultipleErrors (#1177). In the latest release, the encoding of errors in the `com.databricks.labs.remorph.coverage` package has been improved with an update to the `encoders.scala` file. The change fixes the serialization of `MultipleErrors` instances by calling the `asJson` method on each error instead of serializing just the message. This modification ensures that all relevant information about each error is included in the encoded output, improving the accuracy of serialization for the `MultipleErrors` class. Users who handle multiple errors and require a precise serialization representation will benefit from this enhancement, as it guarantees comprehensive information encoding for each error instance.
- Fixed presto strpos and array_average functions (#1196). This PR introduces new classes `Locate` and `NamedStruct` in the `local_expression.py` file to handle the `STRPOS` and `ARRAY_AVERAGE` functions in a Databricks environment, ensuring compatibility with Presto SQL. The `STRPOS` function, used to locate the position of a substring within a string, now uses the `Locate` class and emits a warning regarding differences in implementation between Presto and Databricks SQL. A new method `_build_array_average` has been added to handle the `ARRAY_AVERAGE` function in Databricks, which calculates the average of an array while accommodating nulls, integers, and doubles. Two SQL test cases have been added to demonstrate the use of the `ARRAY_AVERAGE` function with arrays containing integers and doubles. These changes promote compatibility and consistent behavior between Presto and Databricks when dealing with the `STRPOS` and `ARRAY_AVERAGE` functions, enhancing the ability to migrate between the systems smoothly.
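  A sketch of the Presto-to-Databricks mapping, assuming a hypothetical table `t` with a string column `s` and an array column `vals`. Note that Presto's `STRPOS(string, substring)` and Databricks' `LOCATE(substring, string)` take their arguments in opposite order, and the `ARRAY_AVERAGE` expansion shown is only one possible form of the generated expression:

  ```sql
  -- Presto
  SELECT strpos(s, 'abc'), array_average(vals) FROM t;

  -- Databricks SQL (illustrative)
  SELECT LOCATE('abc', s),
         aggregate(
           filter(vals, x -> x IS NOT NULL),
           named_struct('sum', CAST(0 AS DOUBLE), 'cnt', 0L),
           (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
           acc -> acc.sum / acc.cnt
         )
  FROM t;
  ```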
- Handled presto Unnest cross join to Databricks lateral view (#1209). This release introduces new features and updates for handling Presto UNNEST cross joins in Databricks, utilizing the lateral view feature. New methods have been added to improve efficiency and robustness when handling UNNEST cross joins. Additionally, new test cases have been implemented for Presto and Databricks to ensure compatibility and consistency between the two systems in handling UNNEST cross joins, array construction and flattening, and parsing JSON data. Some limitations and issues remain, which will be addressed in future work. The acceptance tests have also been updated, with certain tests now expected to pass, while others may still fail. This release aims to improve the functionality and compatibility of Presto and Databricks when handling UNNEST cross joins and JSON data.
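  A representative translation, assuming a hypothetical `orders` table with an array column `items` (the generated output may differ in aliasing):

  ```sql
  -- Presto
  SELECT o.id, t.item
  FROM orders o
  CROSS JOIN UNNEST(o.items) AS t(item);

  -- Databricks SQL
  SELECT o.id, t.item
  FROM orders o
  LATERAL VIEW EXPLODE(o.items) t AS item;
  ```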
- Implemented remaining TSQL set operations (#1227). This pull request enhances the TSql parser by adding support for parsing and converting the set operations `UNION [ALL]`, `EXCEPT`, and `INTERSECT` to the Intermediate Representation (IR). Initially, the grammar recognized these operations, but they were not being converted to the IR. This change resolves issues #1126 and #1102 and includes new unit, transpiler, and functional tests, ensuring the correct behavior of these set operations, including precedence rules. The commit also introduces a new test file, `union-all.sql`, demonstrating the correct handling of simple `UNION ALL` operations and ensuring consistent output across TSQL and Databricks SQL platforms.
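  A simple input that now converts to the IR and back, exercising all three operations (table names are hypothetical); the same text is valid Databricks SQL, and the new tests also cover precedence, such as `INTERSECT` binding more tightly than `UNION`:

  ```sql
  SELECT id FROM a
  UNION ALL
  SELECT id FROM b
  EXCEPT
  SELECT id FROM c
  INTERSECT
  SELECT id FROM d;
  ```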
- Supported multiple columns in order by clause for ARRAYAGG (#1228). This commit enhances the ARRAYAGG and LISTAGG functions by adding support for multiple columns in the ORDER BY clause and sorting in both ascending and descending order. A new method, `sortArray`, has been introduced to handle multiple sort orders. The changes also improve the ARRAYAGG function in the Snowflake dialect by supporting multiple columns in the ORDER BY clause, with an optional DESC keyword for each column. The `WithinGroupParams` dataclass has been updated in the local expression module to include a list of tuples for the order columns and their sorting direction. These changes provide increased flexibility and control over the output of the ARRAYAGG and LISTAGG functions.
- Added TSQL parser support for `(LHS) UNION RHS` queries (#1211). In this release, we have implemented support for a new form of UNION in the TSQL parser, specifically for queries formatted as `(SELECT a from b) UNION [ALL] SELECT x from y`. This allows the union of two SELECT queries, with an optional ALL keyword to include duplicate rows. The implementation includes a new case statement in the `TSqlRelationBuilder` class that handles this form of UNION, creating a `SetOperation` object with the left-hand side and right-hand side of the union and an `is_all` flag based on the presence of the ALL keyword. Additionally, we have added support for parsing right-associative UNION clauses in TSQL queries, enhancing the flexibility and expressiveness of the TSQL parser for more complex and nuanced queries. The commit also includes new test cases to verify the correct translation of TSQL set operations to Databricks SQL, resolving issue #1127. This enhancement allows for more accurate parsing of TSQL queries that use the UNION operator in various formats.
- Added support for inline columns in CTEs (#1184). In this release, we have added support for inline columns in Common Table Expressions (CTEs) in Snowflake across various components of our open-source library. This includes updates to the AST (Abstract Syntax Tree) for better TSQL translation and the introduction of the new case class `KnownInterval` for handling intervals. We have also implemented a new method, `DealiasInlineColumnExpressions`, in the `SnowflakePlanParser` class to parse inline columns in CTEs, and modified the class constructor to include this new method. Additionally, a new private case class `InlineColumnExpression` has been introduced to allow for more efficient processing of Snowflake CTEs. The SnowflakeToDatabricksTranspiler has also been updated to support inline columns in CTEs, as demonstrated by a new test case. These changes improve compatibility, precision, and usability of the codebase, providing a better overall experience for software engineers working with CTEs in Snowflake.
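  A hypothetical before/after showing what de-aliasing inline CTE columns means (the transpiler's exact output may differ):

  ```sql
  -- Snowflake: column names declared inline on the CTE
  WITH cte (a, b) AS (SELECT 1, 2)
  SELECT a, b FROM cte;

  -- Databricks SQL: the inline column names become aliases inside the CTE body
  WITH cte AS (SELECT 1 AS a, 2 AS b)
  SELECT a, b FROM cte;
  ```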
- Implemented AST for positional column identifiers (#1181). The recent change introduces an Abstract Syntax Tree (AST) for positional column identifiers in the Snowflake project, specifically in the `ExpressionGenerator` class. The new `NameOrPosition` type represents a column identifier, either by name or by position. The `Id` and `Position` classes inherit from `NameOrPosition`, and the `nameOrPosition` method has been added to check and return the appropriate SQL representation. However, due to Databricks' lack of positional column identifier support, the generator side does not yet support this feature; the schema of the table is required to properly translate queries involving positional column identifiers. This enhancement increases the system's flexibility in handling Snowflake's query structures, with the potential for more comprehensive generator-side support in the future.
- Implemented GROUP BY ALL (#1180). The `GROUP BY ALL` clause is now supported in the LogicalPlanGenerator class of the remorph project, with the addition of a new case to handle the GroupByAll type and an updated implementation for the Pivot type. A new case object called `GroupByAll` has been added to the sealed trait `GroupType` in relations.scala. A new test case has been implemented in the SnowflakeToDatabricksTranspilerTest class to check the correct transpilation of the `GROUP BY ALL` clause from Snowflake SQL syntax to Databricks SQL syntax. These changes allow for more flexibility and control in grouping operations and enable the implementation of specific functionality for the GROUP BY ALL clause in Snowflake, improving compatibility with Snowflake SQL syntax.
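  `GROUP BY ALL` groups by every non-aggregate expression in the select list and is accepted by both dialects, so the clause can be carried through directly (table and column names are illustrative):

  ```sql
  SELECT region, product, SUM(amount) AS total
  FROM sales
  GROUP BY ALL;
  ```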
- Added IR for stored procedures (#1161). In this release, we have made significant enhancements to the project by adding support for stored procedures. We have introduced a new `CreateVariable` case class to manage variable creation within the intermediate representation (IR), and removed the `SetVariable` case class as it is now redundant. A new `CaseStatement` class has been added to represent SQL case statements with value match, and a `CompoundStatement` class has been implemented to enable encapsulation of a sequence of logical plans within a single compound statement. The `DeclareCondition`, `DeclareContinueHandler`, and `DeclareExitHandler` case classes have been introduced to handle conditional logic and exit handlers in stored procedures. New classes `DeclareVariable`, `ElseIf`, `ForStatement`, `If`, `Iterate`, `Leave`, `Loop`, `RepeatUntil`, `Return`, `SetVariable`, and `Signal` have been added to provide more comprehensive support for procedural language features and control flow management in stored procedures. We have also included SnowflakeCommandBuilder support for stored procedures and updated the `visitExecuteTask` method to handle stored procedure calls using the `SetVariable` method.
- Added Variant Support (#998). In this commit, support for the Variant datatype has been added to the create table functionality, enhancing the system's compatibility with Snowflake's datatypes. A new VariantType has been introduced, which allows for more comprehensive handling of data during create table operations. Additionally, a `remarks VARIANT` line is added in the CREATE TABLE statement and the corresponding spec test has been updated. The Variant datatype is a flexible datatype that can store different types of data, such as arrays, objects, and strings, offering increased functionality for users working with variant data. Furthermore, this change enables the use of the Variant datatype in Snowflake tables and improves the data modeling capabilities of the system.
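  A minimal CREATE TABLE exercising the new datatype, in the spirit of the updated spec test (the table name is illustrative):

  ```sql
  CREATE TABLE raw_events (
      id      INT,
      remarks VARIANT  -- semi-structured data: arrays, objects, strings, ...
  );
  ```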
- Added `PySpark` generator (#1026). The engineering team has developed a new `PySpark` generator for the `com.databricks.labs.remorph.generators` package. This addition introduces a new parameter, `logical`, of type `Generator[ir.LogicalPlan, String]`, in the `SQLGenerator` for SQL queries. A new abstract class `BasePythonGenerator` has been added, which extends the `Generator` class and generates Python code. An `ExpressionGenerator` class has also been added, which extends `BasePythonGenerator` and is responsible for generating Python code for `ir.Expression` objects. A new `LogicalPlanGenerator` class has been added, which extends `BasePythonGenerator` and is responsible for generating Python code for a given `ir.LogicalPlan`. A new `StatementGenerator` class has been implemented, which converts `Statement` objects into Python code. A new Python-generating class, `PythonGenerator`, has been added, which includes the implementation of an abstract syntax tree (AST) for Python in Scala. This AST includes classes for various Python language constructs. Additionally, new implicit classes for `PythonInterpolator`, `PythonOps`, and `PythonSeqOps` have been added to allow for the creation of PySpark code using the Remorph framework. The `AndOrToBitwise` rule has been implemented to convert `And` and `Or` expressions to their bitwise equivalents. The `DotToFCol` rule has been implemented to transform code that references columns using dot notation on a DataFrame to use the `col` function with a string literal of the column name instead. A new `PySparkStatements` object and a `PySparkExpressions` class have been added, which provide functionality for transforming expressions in a data processing pipeline to PySpark equivalents. The `SnowflakeToPySparkTranspiler` class has been added to transpile Snowflake queries to PySpark code. A new `PySpark` generator has been added to the `Transpiler` class, which is implemented as an instance of the `SqlGenerator` class. This change enhances the `Transpiler` class with a new `PySpark` generator and improves serialization efficiency.
- Added `debug-bundle` command for folder-to-folder translation (#1045). In this release, we have introduced a `debug-bundle` command to the remorph project's CLI, specifically added to the `proxy_command` function, which already includes the `debug-script`, `debug-me`, and `debug-coverage` commands. This new command enhances the tool's debugging capabilities, allowing developers to generate a bundle of translated queries for folder-to-folder translation tasks. The `debug-bundle` command accepts three flags: `dialect`, `src`, and `dst`, specifying the SQL dialect, source directory, and destination directory, respectively. Furthermore, the update includes refactoring the `FileSetGenerator` class in the `orchestration` package of the `com.databricks.labs.remorph.generators` package, adding a `debug-bundle` command to the `Main` object, and updating the `FileQueryHistoryProvider` method in the `ApplicationContext` trait. These improvements focus on providing a convenient way to convert folder-based SQL scripts to other formats like SQL and PySpark, enhancing the translation capabilities of the project.
- Added `ruff` Python formatter proxy (#1038). In this release, we have added support for the `ruff` Python formatter in our project's continuous integration and development workflow. We have also introduced a new `FORMAT` stage in the `WorkflowStage` object in the `Result` Scala object to include formatting as a separate step in the workflow. A new `RuffFormatter` class has been added to format Python code using the `ruff` tool, and a `StandardInputPythonSubprocess` class has been included to run a Python subprocess and capture its output and errors. Additionally, we have added a proxy for the `ruff` formatter to the SnowflakeToPySparkTranspilerTest for Scala to improve the readability of the transpiled Python code generated by the SnowflakeToPySparkTranspiler. Lastly, we have introduced a new `ruff` formatter proxy in the test code for the transpiler library to enforce format and style conventions in Python code. These changes aim to improve the development and testing experience for the project and ensure that the code follows the desired formatting and style standards.
- Added baseline for translating workflows (#1042). In this release, several new features have been added to the open-source library to improve the translation of workflows. A new dependency for the Jackson YAML data format library, version 2.14.0, has been added to the pom.xml file to enable processing YAML files and converting them to Java objects. A new `FileSet` class has been introduced, which provides an in-memory data structure to manage a set of files, allowing users to add, retrieve, and remove files by name and persist the contents of the files to the file system. A new `FileSetGenerator` class has been added that generates a `FileSet` object from a `JobNode` object, enabling the translation of workflows by generating all necessary files for a workspace. A new `DefineJob` class has been developed to define a new rule for processing `JobNode` objects in the Remorph system, converting instances of `SuccessPy` and `SuccessSQL` into `PythonNotebookTask` and `SqlNotebookTask` objects, respectively. Additionally, various new classes, such as `GenerateBundleFile`, `QueryHistoryToQueryNodes`, `ReformatCode`, `TryGeneratePythonNotebook`, `TryGenerateSQL`, `TrySummarizeFailures`, `InformationFile`, `SuccessPy`, `SuccessSQL`, `FailedQuery`, `Migration`, `PartialQuery`, `QueryPlan`, `RawMigration`, `Comment`, and `PlanComment`, have been introduced to provide a more comprehensive and nuanced job orchestration framework. The `Library` case class has been updated to better separate concerns between library configuration and code assets. These changes address issue #1042 and provide a more robust and flexible workflow translation solution.
- Added correct generation of `databricks.yml` for `QueryHistory` (#1044). The FileSet class in the FileSet.scala file has been updated to include a new method that correctly generates the `databricks.yml` file for the `QueryHistory` feature. This file is used for orchestrating cross-compiled queries, creating three files in total: two SQL notebooks with translated and formatted queries, and a `databricks.yml` file that defines an asset bundle for the queries. The new method in the FileSet class writes the content to the file using the `Files.write` method from the `java.nio.file` package instead of the previously used `PrintWriter`. The FileSetGenerator class has been updated to include the new `databricks.yml` file generation, and new rules and methods have been added to improve the accuracy and consistency of schema definitions in the generated orchestration files. Additionally, the DefineJob and DefineSchemas classes have been introduced to simplify the orchestration generation process.
- Added documentation around Transformation (#1043). In this release, the Transformation class in our open-source library has been enhanced with detailed documentation, type parameters, and new methods. The class represents a stateful computation that produces an output of type Out while managing a state of type State. The new methods include map and flatMap for modifying the output and chaining transformations, as well as run and runAndDiscardState for executing the computation with a given initial state and producing a Result containing the output and state. Additionally, we have introduced a new trait called TransformationConstructors that provides constructors for successful transformations, error transformations, lifted results, state retrieval, replacement, and updates. The CodeGenerator trait in our code generation library has also been updated with several new methods for more control and flexibility in the code generation process. These include commas and spaces for formatting output, updateGenCtx for updating the GeneratorContext, nest and unnest for indentation, withIndentedBlock for producing nested blocks of code, and withGenCtx for creating transformations that use the current GeneratorContext.
- Added tests for Snow ARRAY_REMOVE function (#979). In this release, we have added tests for the Snowflake ARRAY_REMOVE function in the SnowflakeToDatabricksTranspilerTest. The tests, currently ignored, demonstrate the usage of the ARRAY_REMOVE function with different data types, such as integers and doubles. A TODO comment is included for a test case involving VARCHAR casting, to be enabled once the necessary casting functionality is implemented. This update enhances the library's capabilities and ensures that the ARRAY_REMOVE function can handle a variety of data types. Software engineers can refer to these tests to understand the usage of the ARRAY_REMOVE function in the transpiler and the planned casting functionality.
- Avoid non local return (#1052). In this release, the `render` method of the `generators` package object in the `com.databricks.labs.remorph` package has been updated to avoid using non-local returns and follow recommended coding practices. Instead of returning early from the method, it now uses `Option` to track failures and a `try-catch` block to handle exceptions. In case of an exception during string concatenation, the method sets the `failureOpt` variable to `Some(lift(KoResult(WorkflowStage.GENERATE, UncaughtException(e))))`. Additionally, the test file `CodeInterpolatorSpec.scala` has been modified to fix an issue with exception handling: in the updated code, new variables are introduced for each argument, and the problematic code is placed within an interpolated string, allowing for proper exception handling. This release enhances the robustness and reliability of the code interpolator and ensures that the method follows recommended coding practices.
- Collect errors in `Phase` (#1046). The open-source library Remorph has received significant updates, focusing on enhancing error collection and simplifying the Transformation class. The changes include a new method `recordError` in the abstract Phase trait and its concrete implementations for collecting errors during each phase. The Transformation class has been simplified by removing the unused Phase parameter, while the Generator, CodeGenerator, and FileSetGenerator have been specialized to use Transformation without the Phase parameter. The TryGeneratePythonNotebook, TryGenerateSQL, CodeInterpolator, and TBASeqOps classes have been updated for a more concise and focused state. The imports have been streamlined, and the PySparkGenerator, SQLGenerator, and PlanParser have been modified to remove the unused Phase type parameter. A new test file, TransformationTest, has been added to check the error collection functionality in the Transformation class. Overall, these enhancements improve the reliability, readability, and maintainability of the Remorph library.
- Correctly generate `F.fn_name` for builtin PySpark functions (#1037). This commit introduces changes to the generation of `F.fn_name` for builtin PySpark functions in the PySparkExpressions.scala file, specifically for PySpark's builtin functions (`fn`). It includes a new case to handle these functions by converting them to their lowercase equivalent in Python using `Locale.getDefault`. Additionally, changes have been made to handle window specifications more accurately, such as using `ImportClassSideEffect` with `windowSpec` and generating and applying a window function (`fn`) over it. The `LAST_VALUE` function has been modified to `LAST` in the SnowflakeToDatabricksTranspilerTest.scala file, and new methods such as `First`, `Last`, `Lag`, `Lead`, and `NthValue` have been added to the SnowflakeCallMapper class. These changes improve the accuracy, flexibility, and compatibility of PySpark when working with built-in functions and window specifications, making the codebase more readable, maintainable, and efficient.
- Create Command Extended (#1033). In this release, the open-source library has been updated with several new features related to table management and SQL code generation. A new method `replaceTable` has been added to the `LogicalPlanGenerator` class, which generates SQL code for a `ReplaceTableCommand` IR node and replaces an existing table with the same name if it already exists. Additionally, support has been added for generating SQL code for an `IdentityConstraint` IR node, which specifies whether a column is an auto-incrementing identity column. The `CREATE TABLE` statement has been updated to include the `AUTOINCREMENT` and `REPLACE` constraints, and a new `IdentityConstraint` case class has been introduced to extend the capabilities of the `UnnamedConstraint` class. The `TSqlDDLBuilder` class has also been updated to handle the `IDENTITY` keyword more effectively. A new command implementation with `AUTOINCREMENT` and `REPLACE` constraints has been added, and a new SQL script has been included in the functional tests for testing CREATE DDL statements with identity columns. Finally, the SQL transpiler has been updated to support the `CREATE OR REPLACE PROCEDURE` syntax, providing more flexibility and convenience for users working with stored procedures in their SQL code. These updates aim to improve the functionality and ease of use of the open-source library for software engineers working with SQL code and table management.
- Don't draft automated releases (#995). In this release, we have made a modification to the release.yml file in the .github/workflows directory by removing the `draft: true` line. This change removes the creation of draft releases in the automated release process, simplifying it and making it more straightforward for users to access new versions of the software. The job section of the release.yml file now only includes the `release` job, with `release-signing-artifacts: true` still enabled, ensuring that the artifacts are signed. This improvement enhances the overall release process, making it more efficient and user-friendly.
- Enhance the Snow ARRAY_SORT function support (#994). With this release, the Snowflake ARRAY_SORT function now supports Boolean literals as parameters, improving its functionality. The changes include validating Boolean parameters in SnowflakeCallMapper.scala, throwing a TranspileException for unsupported arguments, and simplifying the IR using the DBSQL SORT_ARRAY function. Additionally, new test cases have been added to SnowflakeCallMapperSpec for the ARRAY_SORT and ARRAY_SLICE functions. The SnowflakeToDatabricksTranspilerTest class has also been updated with new test cases that cover the enhanced ARRAY_SORT function's various overloads and combinations of Boolean literals, NULLs, and a custom sorting function. This ensures that invalid usage is caught during transpilation, providing better error handling and supporting more use cases.
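  A sketch of the simplification for one of the supported overloads, assuming an array column `arr` on a hypothetical table `t`; which overloads are accepted, and how NULL ordering is handled, is governed by the validation described above, so the actual generated expression may differ:

  ```sql
  -- Snowflake: explicit Boolean literal requesting an ascending sort
  SELECT ARRAY_SORT(arr, TRUE) FROM t;

  -- Databricks SQL
  SELECT SORT_ARRAY(arr, TRUE) FROM t;
  ```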
- Ensure that unparsable text is not lost in the generated output (#1012). In this release, we have implemented an enhancement to the error handling strategy in the ANTLR-generated parsers for SQL. This change records where parsing failed and gathers un-parsable input, preserving it as custom error nodes in the ParseTree at strategic points. The new custom error strategy allows visitors for higher-level rules, such as `sqlCommand` in Snowflake and `sqlClauses` in TSQL, to check for an error node in the children and generate an IR node representing the un-parsed text. Additionally, new methods have been introduced to handle error recovery, find the highest context in the tree for the particular parser, and recursively find the context. A separate improvement is planned to ensure the PlanParser no longer stops when syntax errors are discovered, allowing safe traversal of the ParseTree. This feature is targeted towards software engineers working with SQL parsing and aims to improve error handling and recovery.
- Fetch table definitions for TSQL (#986). This pull request introduces a new `TableDefinition` case class that encapsulates metadata properties for tables in TSQL, such as catalog name, schema name, table name, location, table format, view definition, columns, table size, and comments. A `TSqlTableDefinitions` class has been added with methods to retrieve table definitions, all schemas, and all catalogs from TSQL. The `SnowflakeTypeBuilder` is updated to parse data types from TSQL. The `SnowflakeTableDefinitions` class has been refactored to use the new `TableDefinition` case class and retrieve table definitions more efficiently. The changes also include adding two new test cases to verify the correct retrieval of table definitions and catalogs for TSQL.
- Fixed handling of projected expressions in `TreeNode` (#1159). In this release, we have addressed the handling of projected expressions in the `TreeNode` class, resolving issues #1072 and #1159. The `expressions` method in the `Plan` abstract class has been modified to include the `final` keyword, restricting overriding in subclasses. This method now returns all expressions present in a query from the current plan operator and its descendants. Additionally, we have introduced a new private method, `seqToExpressions`, used for recursively finding all expressions from a given sequence. The `Project` class, representing a relational algebra operation that projects a subset of columns in a table, now utilizes a new `columns` parameter instead of `expressions`. Similar changes have been applied to other classes extending `UnaryNode`, such as `Join`, `Deduplicate`, and `Hint`. The `values` parameter of the `Values` class has also been altered to accurately represent input values. A new test class, `JoinTest`, has been introduced to verify the correct propagation of expressions in join nodes, ensuring intended data transformations.
- Handling any_keys_match from presto (#1048). In this commit, we have added support for the `any_keys_match` Presto function in Databricks by implementing it using existing Databricks functions. The `any_keys_match` function checks whether any keys in a map match a given condition. Specifically, we have introduced two new classes, `MapKeys` and `ArrayExists`, which allow us to extract the keys from the input map and check whether any of them satisfy the given condition using the `exists` function. This is accomplished by renaming `exists` to `array_exists` to better reflect its purpose. Additionally, we have provided a Databricks SQL query that mimics the behavior of the `any_keys_match` function in Presto and added tests to ensure that it works as expected. These changes enable users to perform equivalent operations with a consistent syntax in Databricks and Presto.
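  A representative mapping, assuming a hypothetical table `t` with a map column `m`:

  ```sql
  -- Presto: true if any key of the map satisfies the predicate
  SELECT any_keys_match(m, k -> k LIKE 'prod_%') FROM t;

  -- Databricks SQL: the same check built from existing functions
  SELECT exists(map_keys(m), k -> k LIKE 'prod_%') FROM t;
  ```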
- Improve IR for job nodes (#1041). The open-source library has undergone improvements to the Intermediate Representation (IR) for job nodes. This release introduces several significant changes, including: refactoring of the `JobNode` class to extend the `TreeNode` class and the addition of a new abstract class `LeafJobNode` that overrides the `children` method to always return an empty `Seq`; enhancements to the `ClusterSpec` case class, which now includes a `toSDK` method that properly initializes and sets the values of the fields in the SDK `ClusterSpec` object; improvements to the `NewClusterSpec` class, which updates the types of several optional fields and introduces changes to the `toSDK` method for better conversion to the SDK format; removal of the `Job` class, which previously represented a job in the IR of workflows; changes to the `JobCluster` case class, which updates the `newCluster` attribute from `ClusterSpec` to `NewClusterSpec`; an update to the `JobEmailNotifications` class, which now extends `LeafJobNode` and adds new methods and overrides existing ones from `LeafJobNode`; improvements to the `JobNotificationSettings` class, which replaces the original `toSDK` method with a new implementation for a more accurate SDK representation of job notification settings; and refactoring of the `JobParameterDefinition` class, which updates the `toSDK` method for more efficient conversion to the SDK format. These changes simplify the representation of job nodes, align the codebase more closely with the underlying SDK, and improve overall code maintainability and compatibility with other Remorph components.
- Query History From Folder (#991). The Estimator class in the Remorph project has been updated to enhance the query history interface by adding metadata read from a folder, improving its ability to handle queries from different users and increasing the accuracy of estimation reports. The Anonymizer class has also been updated to handle cases where the user field is missing, ensuring the anonymization process can proceed smoothly and preventing potential errors. A new FileQueryHistory class has been added to provide query history functionality by reading metadata from a specified folder. The SnowflakeQueryHistory class has been updated to directly implement the history() method and include new fields in the ExecutedQuery objects, such as 'id', 'source', 'timestamp', 'duration', 'user', and 'filename'. A new ExecutedQuery case class has been introduced, which now includes optional `user` and `filename` fields, and a new QueryHistoryProvider trait has been added with a history() method that returns a QueryHistory object containing a sequence of ExecutedQuery objects, enhancing the query history interface's flexibility and power. Test suites and test data for the Anonymizer and TableGraph classes have been updated to accommodate these changes, allowing for more comprehensive testing of query history functionality. A FileQueryHistorySpec test file has been added to test the FileQueryHistory class's ability to correctly extract queries from SQL files, ensuring the class works as expected.
- Rework serialization using circe+jackson (#1163). In pull request #1163, the serialization mechanism in the project has been refactored to use the Circe and Jackson libraries, replacing the existing ujson library. This change includes the addition of the Circe, circe-generic-extras, and circe-jackson libraries, which are licensed under the Apache 2.0 license. The project now includes the copyright notices and license information for all open-source projects that have contributed code to it, ensuring compliance with open-source licenses. The `CoverageTest` class has been updated to incorporate error encoding using the Circe and Jackson libraries, and the `EstimationReport` case classes no longer have implicit `ReadWriter` instances defined using macroRW; instead, Circe and Jackson encode and decode instances are likely defined elsewhere in the codebase. The `BaseQueryRunner` abstract class has been updated to handle both parsing and transpilation errors in a more uniform way, using a `failures` field instead of `transpilation_error` or `parsing_error`. Additionally, a new file, `encoders.scala`, has been introduced, which defines encoders for serializing objects to JSON using the Circe and Jackson libraries. These changes aim to improve serialization and deserialization performance and capabilities, simplify the codebase, and ensure consistent and readable output.
- Some window functions do not support window frame conditions (#999). The Snowflake expression builder has been updated to correct the default window frame specifications for certain window functions and to modify the behavior of the ORDER BY clause in these functions. This change ensures that the expression builder generates the correct SQL syntax for functions that do not support frame specifications, such as `LAG`, `DENSE_RANK`, `LEAD`, `PERCENT_RANK`, `RANK`, and `ROW_NUMBER`, improving the compatibility and reliability of the generated queries. Additionally, a unit test for the `SnowflakeExpressionBuilder`
has been updated to account for changes in the way window functions are handled, enhancing the accuracy of the builder in generating valid SQL for window functions in Snowflake.
- Split workflow definitions into sensible packages (#1039). The AutoScale class has been refactored and moved to a new package, `com.databricks.labs.remorph.intermediate.workflows.clusters`, extending `JobNode` from `com.databricks.labs.remorph.intermediate.workflows`. It now includes a case class for auto-scaling that takes optional integer arguments `maxWorkers` and `minWorkers`, and a single `apply` method that creates and configures a cluster using the Databricks SDK's `ComputeService`. The AwsAttributes and AzureAttributes classes have also been moved to the `com.databricks.labs.remorph.intermediate.workflows.clusters` package and now extend `JobNode`. These classes manage AWS- and Azure-related attributes for compute resources in a workflow. The ClientsTypes case class has been moved to a new clusters sub-package within the workflows package and now extends `JobNode`, and the ClusterLogConf class has been moved to a new clusters package. The JobDeployment class has been refactored and moved to `com.databricks.labs.remorph.intermediate.workflows.jobs`, and the JobEmailNotifications, JobsHealthRule, and WorkspaceStorageInfo classes have been moved to new packages and now import `JobNode`. These changes improve the organization and maintainability of the codebase, making it easier to understand and navigate.
- TO_NUMBER/TO_DECIMAL/TO_NUMERIC without precision and scale (#1053). This pull request introduces improvements to the transpilation process for handling cases where precision and scale are not specified for the TO_NUMBER, TO_DECIMAL, or TO_NUMERIC Snowflake functions. The updated transpiler now automatically applies default values when these parameters are omitted, with precision set to the maximum allowed value of 38 and scale set to 0. A new method has been added to manage these cases, and four new test cases have been included to verify the transpilation of TO_NUMBER and TO_DECIMAL functions without specified precision and scale, and with various input formats. This change ensures consistent behavior across different SQL dialects for cases where precision and scale are not explicitly defined in the conversion functions.
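  An illustrative translation with the new defaults applied; precision 38 and scale 0 come from the description, while the exact output shape (a plain `CAST`) is an assumption:

  ```sql
  -- Snowflake: no precision or scale supplied
  SELECT TO_NUMBER(amount) FROM orders;

  -- Databricks SQL with the defaults applied
  SELECT CAST(amount AS DECIMAL(38, 0)) FROM orders;
  ```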
- Table comments captured as part of Snowflake Table Definition (#989). In this release, we have added support for capturing table comments as part of Snowflake Table Definitions in the remorph library. This includes modifying the TableDefinition case class to include an optional comment field, and updating the SQL query in the SnowflakeTableDefinitions class to retrieve table comments. A new integration test for Snowflake table definitions has also been introduced to ensure the proper functioning of the new feature. This test creates a connection to the Snowflake database, retrieves a list of table definitions, and checks for the presence of table comments. These changes are part of our ongoing efforts to improve metadata capture for Snowflake tables (Note: The commit message references issue #945 on GitHub, which this pull request is intended to close.)
- Use Transformation to get rid of the ctx parameter in generators (#1040). The `Generating` class has undergone significant changes, removing the `ctx` parameter and introducing two new phases, `Parsing` and `BuildingAst`, in the sealed trait `Phase`. The `Parsing` phase extends `Phase` with a previous phase of `Init` and contains the source code and filename. The `BuildingAst` phase extends `Phase` with a previous phase of `Parsing` and contains the parsed tree and the previous phase. The `Optimizing` phase now contains the unoptimized plan and the previous phase. The `Generating` phase now contains the optimized plan, the current node, the total statements, the transpiled statements, the `GeneratorContext`, and the previous phase. Additionally, the `TransformationConstructors` trait has been updated to allow for the creation of Transformation instances specific to a certain phase of a workflow. The `runQuery` method in the `BaseQueryRunner` abstract class has been updated to use a new `transpile` method provided by the `Transpiler` trait, and the `Estimator` class in the `Estimation` module has undergone changes to remove the `ctx` parameter in generators. Overall, these changes simplify the implementation, improve code maintainability, and enable better separation of concerns in the codebase.
- With Recursive (#1000). In this release, we have introduced several enhancements for `With Recursive` statements in SQL parsing and processing for the Snowflake database. A new IR (Intermediate Representation) for With Recursive CTEs (Common Table Expressions) has been implemented in the SnowflakeAstBuilder.scala file. A new case class, WithRecursiveCTE, has been added to the SnowflakeRelationBuilder class in the databricks/labs/remorph project, which extends RelationCommon and includes two members: ctes and query. The buildColumns method in the SnowflakeRelationBuilder class has been updated to handle cases where columnList is null and to extract column names differently. Additionally, a new test has been added in SnowflakeAstBuilderSpec.scala that verifies the correct handling of a recursive CTE query. These enhancements improve the support for recursive queries in the Snowflake database, enabling more powerful and flexible querying capabilities for developers and data analysts working with complex data structures.
- [chore] fixed query coverage report (#1160). In this release, we have addressed issue #1160, related to the query coverage report. We have implemented changes to the `QueryRunner` abstract class in the `com.databricks.labs.remorph.coverage` package. The `ReportEntryReport` constructor now accepts a new parameter `parsed`, which is set to 1 if there is no transpilation error and 0 otherwise. Previously, `parsed` was always set to 1, regardless of the presence of a transpilation error. We also updated the `extractQueriesFromFile` and `extractQueriesFromFolder` methods in the `FileQueryHistory` class to return a single `ExecutedQuery` instance, allowing for better query coverage reporting. Additionally, we modified the behavior of the `history()` method of the `fileQueryHistory` object in the `FileQueryHistorySpec` test case. The method now returns a query history object with a single query whose `source` includes the text "SELECT * FROM table1;" and "SELECT * FROM table2;", effectively merging the previous two queries into one. These changes ensure that the report accurately reflects whether the query was successfully transpiled, parsed, and stored in the query history. It is crucial to thoroughly test any parts of the code that rely on the `history()` method returning separate queries, as the behavior of the method has changed.
- Added private key authentication for sf (#917). This commit adds support for private key authentication to the Snowflake data source connector, providing users with more flexibility and security. The `cryptography` library is used to process the user-provided private key, with priority given to the `pem_private_key` secret, followed by the `sfPassword` secret. If neither secret is found, an exception is raised. However, password-based authentication is still used when JDBC options are provided, as Spark JDBC does not currently support private key authentication. A new exception class, `InvalidSnowflakePemPrivateKey`, has been added for handling invalid or malformed private keys. Additionally, new tests have been included for reading data with private key authentication, handling malformed private keys, and checking for missing authentication keys. The notice has been updated to include the `cryptography` library's copyright and license information.
- Added support for `PARSE_JSON` and `VARIANT` datatype (#906). This commit introduces support for the `PARSE_JSON` function and the `VARIANT` datatype in the Snowflake parser, addressing issue #894. The implementation removes the experimental dialect, enabling support for the `VARIANT` datatype and using `PARSE_JSON` for it. The `variant_explode` function is also utilized. During transpilation, whenever the `:` operator is encountered in a Snowflake SELECT statement, everything will be treated as a `VARIANT` on the Databricks side to handle differences between Snowflake and Databricks in accessing variant types. These changes are implemented using ANTLR.
- Added upgrade script and modified metrics sql (#990). In this release, the open-source library has been updated with several improvements to the metrics system, database upgrades, and setup process. The `setup_spark_remote.sh` script now checks if the Spark server is running by pinging localhost:4040, rather than sleeping for a fixed time, allowing for faster execution and more accurate server status determination. The metrics table's insert statement has been updated to cast values to `BIGINT` for better handling of larger counts. An upgrade script has been added to facilitate required modifications, and the `setup_spark_remote.sh` script has been modified to validate URLs. A new Python file for upgrading the metrics table's data types has been added, which includes a function to recreate the table with the correct data types for specific columns. The `upgrade_common` module now includes several functions for upgrading database tables, and a new unit test file, `test_upgrade_common.py`, has been added with test cases for these functions. Lastly, the upgrade script for v0.4.0 has been updated to simplify the process of checking if the main table in the reconcile metadata needs to be recreated and to add an `operation_name` column. These changes improve the library's functionality, accuracy, and robustness, particularly for larger datasets and upgrading processes, enhancing the overall user experience.
- Basic CTAS Implementation (#968). This pull request adds basic support for the CREATE TABLE AS SELECT (CTAS) statement in Snowflake, enabling users to create a new table by selecting data from an existing table or query. In the `LogicalPlanGenerator` class, a new method has been implemented to handle CTAS statements, which generates the appropriate SQL command for creating a table based on the result of a select query. The `SnowflakeDDLBuilder` class now includes a `relationBuilder` class member for building relations based on Snowflake DDL input, and the `visitCreateTableAsSelect` method has been overridden to parse CTAS statements and construct corresponding IR objects. The test suite has been expanded to include a new spec for CTAS statements and a test case for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;". Additionally, a new test file, `test_ctas_complex.sql`, has been added, containing SQL statements for creating a new table by selecting columns from multiple tables and computing new columns through various data manipulations. The implementation also includes adding new SQL statements for CTAS in both Snowflake and Databricks dialects, allowing the CTAS functionality to be tested for both.
- Create repeatable estimator for Snowflake query history (#924). This commit introduces a new coverage estimation tool for analyzing query history in a database, initially implemented for Snowflake. The tool parses and transpiles query history into Databricks SQL and reports on the percentage of query history it can process. It includes a new `SnowflakePlanParser` class that handles Snowflake query plans, a `SqlGenerator` class that generates Databricks SQL from optimized logical plans, and a `dialect` method that returns the dialect string. The long-term plan is to extend this functionality to other supported databases and dialects and to include a report on SQL complexity. Additionally, test cases have been added to the `AnonymizerTest` class to ensure the correct functionality of the `Anonymizer` class, which anonymizes executed queries when provided with a `PlanParser` object. The `Anonymizer` class is intended to be used as part of the coverage estimation tool, which will provide analysis of query history for various databases.
- Created a mapping dict for algo for each dialect at layer level (#934). A series of changes have been implemented to improve the reconciliation process and the handling of hash algorithms in the open-source library. A mapping dictionary from algorithm to dialect has been introduced at the layer level to enhance the reconciliation process. The `get_hash_transform` function now accepts a new `layer` argument and returns a list of hash algorithms from the `HashAlgoMapping` dictionary. A new `HashAlgoMapping` class has been added to map algorithms to a dialect for hashing, replacing the previous `DialectHashConfig` class. A new function `get_dialect` has been introduced to retrieve the dialect based on the layer. The `_hash_transform` function and the `build_query` method have been updated to use the `layer` parameter when determining the dialect. These changes provide more precise control over the algorithm used for hash transformation based on the source layer and the target dialect, resulting in improved reconciliation performance and accuracy.
- Fetch TableDefinitions from Snowflake (#904). A new `SnowflakeTableDefinitions` class has been added to simplify the discovery of Snowflake table metadata. This class establishes a connection with a Snowflake database through a Connection object and provides methods such as `getDataType` and `getTableDefinitionQuery` to parse data types and generate queries for table definitions. Moreover, it includes a `getTableDefinitions` method to retrieve all table definitions in a Snowflake database as a sequence of `TableDefinition` objects, which encapsulate various properties of each table. The class also features methods for retrieving all catalogs and schemas in a Snowflake database. Alongside `SnowflakeTableDefinitions`, a new test class, `SnowflakeTableDefinitionTest`, has been introduced to verify the behavior of `getTableDefinitions` and ensure that the class functions as intended, adhering to the desired behavior.
- Guide user on missing configuration file (#930). In this commit, the `_verify_recon_table_config` method in the `runner.py` file of the `databricks/labs/remorph` package has been updated to handle missing reconcile table configurations during installation. When the reconcile table configuration is not found, an error message will now display the name of the requested configuration file. This enhancement helps users identify the specific configuration file they need to provide in their workspace, addressing issue #919. This commit is co-authored by Ludovic Claude.
- Implement more missing visitor functions for Snowflake and TSQL (#975). In this release, we have added several missing visitor methods for the Snowflake and TSQL builder classes to improve the reliability and maintainability of our parser. Previously, when a visitor method was missing, the default visitor was called, causing all children of the ParseTree to be visited, which was not ideal: a slight modification to the ANTLR grammar could inadvertently break the visitor and lead to incorrect results. The new visitor methods for both the Snowflake and TSQL builder classes include the `visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the `visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods ensure that the ParseTree is traversed correctly and that the correct IR node is returned. The `visitDdlCommand` method checks for different types of DDL commands, such as create, alter, drop, and undrop, and calls the appropriate method for each type. The `visitDdlClause` method contains a sequence of methods corresponding to different DDL clauses and calls the first non-null method in the sequence. These changes significantly improve the robustness of our parser and enhance the reliability of our code.
- Introduce typed errors (#981). This commit introduces typed errors in the form of a new class, `UnexpectedNode`, and several case classes including `ParsingError`, `UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`, and `UnsupportedDateTimePart` in various packages, as part of the ongoing effort to replace exception throwing with returning `Result` types in future pull requests. These changes will improve error handling and provide more context and precision for errors, facilitating debugging and maintenance of the remorph library and the data type generation functionality. The `TranspileException` class is now constructed with specific typed error instances, and the `ErrorCollector` and `ErrorDetail` classes have been updated to use `ParsingError`. Additionally, the `SnowflakeCallMapper` and `SnowflakeTimeUnits` classes have been updated to use the new typed error mechanism, providing more precise error handling for Snowflake-specific functions and expressions.
- Miscellaneous improvements to Snowflake parser (#952). This diff brings several miscellaneous improvements to the Snowflake parser in the open-source library, targeting increased parse and transpilation success rates. The modifications include updating the `colDecl` rule to allow optional data types, introducing an `objectField` rule, and enabling date and timestamp literals as strings. Additionally, the parser has been refined to handle identifiers more efficiently, such as hashes within the AnonymizerTest. The expected AST for certain test cases has also been updated to improve parser accuracy. These changes aim to create a more robust and streamlined Snowflake parser, minimizing parsing errors and enhancing the overall user experience for project adopters. Furthermore, the error handling and reporting capabilities of the Snowflake parser have been improved with new case classes, `IndividualError` and `ErrorsSummary`, and updated error messages.
- Moved intermediate package out of parsers (#972). In this release, the `intermediate` package has been refactored out of the `parsers` package, aligning with the design principle that parsers should depend on the intermediate representation rather than the other way around. This change affects various classes and methods across the project, all of which have been updated to import the `intermediate` package from its new location. No new functionality has been introduced, but the refactoring improves the package structure and dependency management. The `EstimationAnalyzer` class in the `coverage/estimation` package has been updated to import classes from the new location of the `intermediate` package, and its `evaluateTree` method has been updated to use the new import path for `LogicalPlan` and `Expression`. Other affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`, `SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`, `DataTypeGenerator`, `IRHelpers`, and multiple test files.
- Patch Function without Brackets (#907). This commit introduces new lexer and parser rules to handle Snowflake SQL functions without parentheses, specifically impacting CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new rules allow these functions to be used without parentheses, consistent with Snowflake SQL. This change fixes functional tests and includes documentation for the affected functions. However, there is a pending task to add or fix more test cases to ensure comprehensive testing of the new rules. Additionally, the syntax of the SELECT statement for the CURRENT_TIMESTAMP function has been updated, removing the need for the parameter 'col1'. This change simplifies the syntax for certain SQL functions in the codebase and improves the consistency and reliability of the functional tests.
- Root Table (#936). This PR introduces a new class `TableGraph` that extends `DependencyGraph` and implements the `LazyLogging` trait. The class builds a graph of tables and their dependencies based on query history and table definitions. It provides methods to add nodes and edges, build the graph, and retrieve root, upstream, and downstream tables. The `DependencyGraph` trait offers a more structured and flexible way to handle table dependencies. This change is part of the Root Table feature (issue #936), which identifies root tables in a graph of table dependencies, and closes issue #23. The PR includes a new `TableGraphTest` class that demonstrates the use of these methods and verifies their behavior, enabling better data flow understanding and optimization.
- Snowflake Merge Implementation (#964). In this release, we have implemented the MERGE statement for the Snowflake parser, which enables updating or deleting rows in a target table based on matches with a source table, and inserting new rows into the target table when there are no matches. This feature includes updates to the SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for proper handling of column names and MERGE queries. Additionally, we have added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec, and SnowflakeToDatabricksTranspiler to ensure the accurate translation and execution of MERGE statements for the Snowflake dialect. These changes bring important database migration and synchronization capabilities to our open-source library, improving its functionality and usability for software engineers.
- TSQL: Implement CREATE TABLE (#911). This commit implements the TSQL CREATE TABLE command and its various options and forms, including CTAS, graph node syntax, and analytics variants, as well as syntactical differences for SQL Server. The DDL and DML visitors have been moved from the AST and Relation visitors to separate classes for better responsibility segregation. The LogicalPlanGenerator class has been updated to generate unique constraints, primary keys, foreign keys, check constraints, default value constraints, and identity constraints for the CREATE TABLE command. Additionally, new classes for generating SQL options and handling unresolved options during transpilation have been added to enhance the parser's capability to manage various options and forms. These changes improve the transpilation of TSQL code and the organization of the codebase, making it easier to maintain and extend.
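
  For context, a hedged sketch of the kind of statement covered here and one plausible Databricks-flavoured rendering (schema, table, and constraint names are illustrative, not the generator's exact output):

  ```sql
  -- T-SQL
  CREATE TABLE dbo.orders (
    order_id    INT            NOT NULL PRIMARY KEY,
    customer_id INT            NOT NULL,
    amount      DECIMAL(10, 2) DEFAULT 0,
    CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES dbo.customers (customer_id)
  );

  -- Databricks SQL (illustrative; constraints are informational on Unity Catalog tables)
  CREATE TABLE orders (
    order_id    INT            NOT NULL,
    customer_id INT            NOT NULL,
    amount      DECIMAL(10, 2),
    CONSTRAINT orders_pk   PRIMARY KEY (order_id),
    CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
  );
  ```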
- Transpile Snow ARRAY_SORT function (#973). In this release, we have implemented support for the Snowflake ARRAY_SORT function in our open-source library. This feature has been added as part of issue #973, and it involves the addition of two new private methods, `arraySort` and `makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort` method takes a sequence of expressions as input and sorts the array using the `makeArraySort` method. The `makeArraySort` method handles both null and non-null values, sorts the array in ascending or descending order based on the provided parameter, and determines the position of null or small values based on the nulls first parameter. The sorted array is then returned as an `ir.ArraySort` expression. This functionality allows for the sorting of arrays in Snowflake SQL to be translated to equivalent code in the target language. This enhancement simplifies the process of working with arrays in Snowflake SQL and provides users with a more streamlined experience.
- Transpile Snow MONTHS_BETWEEN function correctly (#963). In this release, the remorph library's SnowflakeCallMapper class in the com/databricks/labs/remorph/parsers/snowflake/rules package has been updated to handle the MONTHS_BETWEEN function. A new case has been added that creates a MonthsBetween object with the first two arguments of the function call and a boolean value of true. This change enhances compatibility and ensures that the output accurately reflects the intended functionality. Additionally, new test cases have been introduced to the SnowflakeCallMapperSpec for the transpilation of the MONTHS_BETWEEN function. These test cases demonstrate accurate mapping of the function to the MonthsBetween class and proper casting of inputs as dates or timestamps, improving the reliability and precision of date and time calculations.
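
  A hedged sketch of the mapping described above (literal dates are illustrative; the explicit third argument on the Databricks side reflects the boolean mentioned here, though the generated text may differ):

  ```sql
  -- Snowflake
  SELECT MONTHS_BETWEEN('2024-03-31'::DATE, '2024-01-31'::DATE);

  -- Databricks SQL (months_between takes an optional roundOff flag)
  SELECT MONTHS_BETWEEN(CAST('2024-03-31' AS DATE), CAST('2024-01-31' AS DATE), TRUE);
  ```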
- Updated Installation to handle install errors (#962). In this release, we've made significant improvements to the `remorph` project, addressing and resolving installation errors that were occurring during the installation process in development mode. We've introduced a new `ProductInfo` class in the `wheels` module, which provides information about the products being installed. This change replaces the use of `WheelsV2` in two test functions. Additionally, we've updated the `workspace_installation` method in `application.py` to handle installation errors more effectively, addressing the dependency on workspace `.remorph` due to wheels. We've also added new methods to `installation.py` to manage local and remote version files, and updated the `_upgrade_reconcile_workflow` function to ensure the correct wheel path is used during installation. These changes improve the overall quality of the codebase, making it easier for developers to adopt and maintain the project, and ensure a more seamless installation experience for users.
- Updated catalog operations logging (#910). In this release, the setup process for the catalog, schema, and volume in the configurator module has been simplified and improved. The previous implementation repeatedly prompted the user for input until the correct input was provided or a maximum number of attempts was reached. The updated code now checks if the catalog, schema, or volume already exists and either uses it or prompts the user to create it once. If the user does not have the necessary privileges to use the catalog, schema, or volume, an error message is logged and the installation is aborted. New methods have been added to check for necessary privileges, such as `has_necessary_catalog_access`, `has_necessary_schema_access`, and `has_necessary_volume_access`, which return a boolean indicating whether the user has the necessary privileges and log an error message with the missing privileges if not. The logging for catalog operations in the install.py file has also been updated to check for privileges at the end of the process and list any missing privileges for each catalog object. Additionally, changes have been made to the unit tests for the ResourceConfigurator class to ensure that the system handles cases where the user does not have the necessary permissions to access catalogs, schemas, or volumes, preventing unauthorized access and maintaining the security and integrity of the system.
- Updated remorph reconcile workflow to use wheels instead of pypi (#884). In this release, the installation process for the Remorph library has been updated to allow for the use of locally uploaded wheel files instead of downloading the package from PyPI. This change includes updates to the `install` and `_deploy_jobs` methods in the `recon.py` file to accept a new `wheel_paths` argument, which is used to pass the path of the Remorph wheel file to the `deploy_recon_job` method. The `_upgrade_reconcile_workflow` function in the `v0.4.0_add_main_table_operation_name_column.py` file has also been updated to upload the wheel package to the workspace and pass its path to the `deploy_reconcile_job` method. Additionally, the `deploy_recon_job` method in the `JobDeployment` class now accepts a new `wheel_file` argument, which represents the name of the wheel file for the remorph library. These changes address issues faced by customers with no public internet access and enable the use of new features before they are released on PyPI. The `test_recon.py` file in the `tests/unit/deployment` directory has also been updated to reflect these changes.
- Upgrade script Implementation (#777). In this release, we've implemented an upgrade script as part of pull request #777, which resolves issue #769. This change introduces a new `Upgrades` class in `application.py` that accepts `product_info` and `installation` as parameters and includes a cached property `wheels` for improved performance. Additionally, we've added new methods to the `WorkspaceInstaller` class for handling upgrade-related tasks, including the creation of a `ProductInfo` object, interacting with the Databricks SDK, and handling potential errors. We've also added a test case to ensure that upgrades are applied correctly on more recent versions. These changes are part of our ongoing effort to enhance the management and application of upgrades to installed products.
- bug fix for to_array function (#961). A bug fix has been implemented to improve the `TO_ARRAY` function in our open-source library. Previously, this function expected only one parameter, but it has been updated to accept two parameters, with the second being optional. This change brings the function in line with other functions in the class, improving flexibility and ensuring backward compatibility. The `TO_ARRAY` function is used to convert a given expression to an array if it is not null and return null otherwise. The commit also includes updates to the `Generator` class, where a new entry for the `ToArray` expression has been added to the `expression_map` dictionary. Additionally, a new `ToArray` class has been introduced as a subclass of `Func`, allowing the function to handle a variable number of arguments more gracefully. Relevant updates have been made to the functional tests for the `to_array` function for both Snowflake and Databricks SQL, demonstrating its handling of null inputs and comparing it with the corresponding ARRAY function in each SQL dialect. Overall, these changes enhance the functionality and adaptability of the `TO_ARRAY` function.
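
  A hedged sketch of the semantics described above (column and table names are invented; the actual generated expression may differ):

  ```sql
  -- Snowflake: NULL stays NULL, otherwise the value is wrapped in an array
  SELECT TO_ARRAY(col1) FROM t;

  -- One possible Databricks SQL rendering of the same behaviour
  SELECT IF(col1 IS NULL, NULL, ARRAY(col1)) FROM t;
  ```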
- feat: Implement all of TSQL predicates except for SOME ALL ANY (#922). In this commit, we have implemented the IR generation for several TSQL predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT, thereby improving the parser's ability to handle a wider range of TSQL syntax. The `TSqlParser` class has been updated with new methods and changes to existing ones, including the addition of new labeled expressions to the `predicate` rule. Additionally, we have corrected an error in the LIKE predicate's implementation, allowing the ESCAPE character to accept a full expression that evaluates to a single character at runtime, rather than assuming it to be a single character at parse time. These changes provide more flexibility and adherence to the TSQL standard, enhancing the overall functionality of the project for our adopters.
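
  For illustration, the sort of predicate mix this covers, including an ESCAPE clause (table and column names are invented; the same shape is also valid Databricks SQL):

  ```sql
  SELECT *
  FROM products
  WHERE code LIKE '10!%%' ESCAPE '!'
    AND price BETWEEN 1 AND 10
    AND category IN ('a', 'b', 'c')
    AND discontinued_at IS NULL
    AND EXISTS (SELECT 1 FROM stock s WHERE s.code = products.code);
  ```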
- Added query history retrieval from Snowflake (#874). This release introduces query history retrieval from Snowflake, enabling expanded compatibility and data source options for the system. The update includes adding the Snowflake JDBC driver and its dependencies to the `pom.xml` file, and the implementation of a new `SnowflakeQueryHistory` class to retrieve query history from Snowflake. The `Anonymizer` object is also added to anonymize query histories by fingerprinting queries based on their structure. Additionally, several case classes are added to represent various types of data related to query execution and table definitions in a Snowflake database. A new `EnvGetter` class is also included to retrieve environment variables for use in testing. Test files for the `Anonymizer` and `SnowflakeQueryHistory` classes are added to ensure proper functionality.
- Added support for `ALTER TABLE`: `ADD COLUMNS`, `DROP COLUMNS`, `RENAME COLUMNS`, and `DROP CONSTRAINTS` (#861). In this release, support for various `ALTER TABLE` SQL commands has been added to our open-source library, including `ADD COLUMNS`, `DROP COLUMNS`, `RENAME COLUMNS`, and `DROP CONSTRAINTS`. These features have been implemented in the `LogicalPlanGenerator` class, which now includes a new private method `alterTable` that takes a context and an `AlterTableCommand` object and returns an `ALTER TABLE` SQL statement. Additionally, a new sealed trait `TableAlteration` has been introduced, with four case classes extending it to handle specific table alteration operations. The `SnowflakeTypeBuilder` class has also been updated to parse and build Snowflake-specific SQL types for these commands. These changes provide improved functionality for managing and manipulating tables in Snowflake, making it easier for users to work with and modify their data. The new functionality has been tested using the `SnowflakeToDatabricksTranspilerTest` class, which specifies Snowflake `ALTER TABLE` commands and the expected transpiled results.
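
  A rough sketch of the statements now handled (table, column, and constraint names are made up; the transpiled Databricks output is expected to keep much the same shape, with dialect-specific types such as `STRING`):

  ```sql
  -- Snowflake
  ALTER TABLE sales ADD COLUMN region VARCHAR;
  ALTER TABLE sales DROP COLUMN legacy_flag;
  ALTER TABLE sales RENAME COLUMN amt TO amount;
  ALTER TABLE sales DROP CONSTRAINT sales_pk;
  ```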
- Added support for `STRUCT` types and conversions (#852). This change adds support for `STRUCT` types and conversions in the system by implementing new `StructType`, `StructField`, and `StructExpr` classes for parsing, data type inference, and code generation. It also maps the `OBJECT_CONSTRUCT` from Snowflake and introduces updates to various case classes such as `JsonExpr`, `Struct`, and `Star`. These improvements enhance the system's capability to handle complex data structures, ensuring better compatibility with external data sources and expanding the range of transformations available for users. Additionally, the changes include the addition of test cases to verify the functionality of generating SQL data types for `STRUCT` expressions and handling JSON literals more accurately.
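
  As a hedged illustration of the Snowflake-to-Databricks mapping involved (field names and values are invented; the generator's exact output may differ):

  ```sql
  -- Snowflake
  SELECT OBJECT_CONSTRUCT('a', 1, 'b', 'x') AS obj;

  -- One possible Databricks SQL rendering
  SELECT STRUCT(1 AS a, 'x' AS b) AS obj;
  -- or, equivalently
  SELECT NAMED_STRUCT('a', 1, 'b', 'x') AS obj;
  ```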
- Minor upgrades to Snowflake parameter processing (#871). This commit includes minor upgrades to Snowflake parameter processing, enhancing the consistency and readability of the code. The changes normalize parameter generation to use `${}` syntax for clarity and to align with Databricks notebook examples. An extra coverage test for variable references within strings has been added. The specific changes include updating a SELECT statement in a Snowflake SQL query to use ${} for parameter processing. The commit also introduces a new SQL file for functional tests related to Snowflake's parameter processing, which includes commented out and alternate syntax versions of a query. This commit is part of continuous efforts to improve the functionality, reliability, and usability of the Snowflake parameter processing feature.
- Patch/reconcile support temp views (#901). The latest update to the remorph-reconcile library adds support for temporary views, a new feature that was not previously available. With this change, the system can now handle `global_temp` for temporary views by modifying the `_get_schema_query` method to return a query for the `global_temp` schema if the schema name is set as such. Additionally, the `read_data` method was updated to correctly handle the namespace and catalog for temporary views. The new variable `namespace_catalog` has been introduced, which is set to `hive_metastore` if the catalog is not set, and to the original catalog with the added schema otherwise. The `table_with_namespace` variable is then updated to use the `namespace_catalog` and table name, allowing for correct querying of temporary views. These modifications enable remorph-reconcile to work seamlessly with temporary views, enhancing its flexibility and functionality. The updated unit tests reflect these changes, with assertions to ensure that the correct SQL statements are being generated and executed for temporary views.
- Reconcile Table Recon JSON filename updates (#866). The Remorph project has implemented a change to the naming convention and placement of the configuration file for the table reconciliation process. The configuration file, previously named according to individual preference, must now follow the pattern `recon_config_<DATA_SOURCE>_<SOURCE_CATALOG_OR_SCHEMA>_<REPORT_TYPE>.json` and be placed in the `.remorph` directory within the Databricks Workspace. Examples of Table Recon filenames for Snowflake, Oracle, and Databricks source systems have been provided for reference. Additionally, the `data_source` field in the config file has been updated to accurately reflect the data source. The case of the filename should now match the case of `SOURCE_CATALOG_OR_SCHEMA` as defined in the config. Compliance with this new naming convention and placement is required for the successful execution of the table reconciliation process.
- [snowflake] parse parameters (#855). The open-source library has undergone changes related to Scalafmt configuration, Snowflake SQL parsing, and the introduction of a new `ExpressionGenerator` class method. The Scalafmt configuration change introduces a new `docstrings.wrap` option set to `false`, disabling docstring wrapping at the specified column limit. The `danglingParentheses.preset` option is also set to `false`, disabling the formatting rule for unnecessary parentheses. In Snowflake SQL parsing, new token types, lexer modes, and parser rules have been added to improve the parsing of string literals and other elements. A new `variable` method in the `ExpressionGenerator` class generates SQL expressions for `ir.Variable` objects. A new `Variable` case class has been added to represent a variable in an expression, and the `SchemaReference` case class now takes a single child expression. The `SnowflakeDDLBuilder` class has a new method, `extractString`, to safely extract strings from ANTLR4 context objects. The `SnowflakeErrorStrategy` object now includes new parameters for parsing Snowflake syntax, and the Snowflake LexerSpec test class has new methods for filling tokens from an input string and dumping the token list. Tests have been added for various string literal scenarios, and the SnowflakeAstBuilderSpec includes a new test case for handling the `translate amps` functionality. The Snowflake SQL queries in the test file have been updated to standardize parameter referencing syntax, improving consistency and readability.
- fixed current_date() generation (#890). This release includes a fix for an issue with the generation of the `current_date()` function in SQL queries, specifically for the Snowflake dialect. A test case in the `sqlglot-incorrect` category has been updated to use the correct syntax for the `CURRENT_DATE` function, which includes parentheses (`SELECT CURRENT_DATE() FROM tabl;`). Additionally, the `current_date()` function is now called consistently throughout the tests, either as `CURRENT_DATE` or `CURRENT_DATE()`, depending on the syntax required by Snowflake. No new methods were added, and the existing functionality was changed only to correct the `current_date()` generation. This improvement ensures accurate and consistent generation of the `current_date()` function across different SQL dialects, enhancing the reliability and accuracy of the tests.
- Added Translation Support for `!` as `commands` and `&` for `Parameters` (#771). This commit adds translation support for using "!" as commands and "&" as parameters in Snowflake code within the remorph tool, enhancing compatibility with Snowflake syntax. The "!set exit_on_error=true" command, which previously caused an error, is now treated as a comment and prepended with `--` in the output. The "&" symbol, previously unrecognized, is converted to its Databricks equivalent "$", which represents parameters, allowing for proper handling of Snowflake SQL code containing "!" commands and "&" parameters. These changes improve the compatibility and robustness of remorph with Snowflake code and enable more efficient processing of Snowflake SQL statements. Additionally, the commit introduces a new test suite for Snowflake commands, enhancing code coverage and ensuring proper functionality of the transpiler.
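
  A hedged sketch of the rewriting described above (the query and the parameter name are invented; exact whitespace and formatting of the real output may differ):

  ```sql
  -- Snowflake script input
  !set exit_on_error=true
  SELECT * FROM orders WHERE region = '&region';

  -- Output sketch: the command becomes a comment, & becomes $
  --!set exit_on_error=true
  SELECT * FROM orders WHERE region = '$region';
  ```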
- Added `LET` and `DECLARE` statements parsing in Snowflake PL/SQL procedures (#548). This commit introduces support for parsing `DECLARE` and `LET` statements in Snowflake PL/SQL procedures, enabling variable declaration and assignment. It adds new grammar rules, refactors code using ScalaSubquery, and implements IR visitors for `DECLARE` and `LET` statements with Variable Assignment and ResultSet Assignment. The `RETURN` statement and parameterized expressions are also now supported. Note that `CURSOR` is not yet covered. These changes allow for improved processing and handling of Snowflake PL/SQL code, enhancing the overall functionality of the library.
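
  For illustration, a small Snowflake Scripting block of the kind this parsing work targets (variable names and values are invented; cursors are explicitly out of scope per the note above):

  ```sql
  DECLARE
    total NUMBER DEFAULT 0;
  BEGIN
    LET rate NUMBER := 0.2;
    total := 100 * rate;
    RETURN total;
  END;
  ```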
- Added logger statements in get_schema function (#756). In this release, enhanced logging has been implemented in the Metadata (Schema) fetch functions, specifically in the `get_schema` function and other metadata fetch functions within Oracle, SnowflakeDataSource modules. The changes include logger statements that log the schema query, start time, and end time, providing better visibility into the performance and behavior of these functions during debugging or monitoring. The logging functionality is implemented using the built-in `logging` module and timestamps are obtained using the `datetime` module. In the SnowflakeDataSource class, RuntimeError or PySparkException will be raised if the user's current role lacks the necessary privileges to access the specified Information Schema object. The INFORMATION_SCHEMA table in Snowflake is used to fetch the schema, with the query modified to handle unquoted and quoted identifiers and the ordinal position of columns. The `get_schema_query` function has also been updated for better formatting for the SQL query used to fetch schema information. The schema fetching method remains unchanged, but these enhancements provide more detailed logging for debugging and monitoring purposes.
- Aggregates Reconcile CLI Implementation (#770). The `Aggregates Reconcile CLI Implementation` commit introduces a new command-line interface (CLI) for reconcile jobs, specifically for aggregated data. This change adds a new parameter, "operation_name", to the run method in the runner.py file, which determines the type of reconcile operation to perform. A new function, _trigger_reconcile_aggregates, has been implemented to reconcile aggregate data based on provided configurations and log the reconciliation process outcome. Additionally, new methods for defining job parameters and settings, such as `max_concurrent_runs` and "parameters", have been included. This CLI implementation enhances the customizability and control of the reconciliation process for users, allowing them to focus on specific use cases and data aggregations. The changes also include new test cases in test_runner.py to ensure the proper behavior of the ReconcileRunner class when the `aggregates-reconcile` operation_name is set.
- Aggregates Reconcile Updates (#784). This commit introduces significant updates to the `Table Deployment` feature, enabling it to support `Aggregate Tables` deployment and modifying the persistence logic for tables. Notable changes include the addition of a new `aggregates` attribute to the `Table` class in the configuration, which allows users to specify aggregate functions and optionally group by specific columns. The reconcile process now captures mismatch data, missing rows in the source, and missing rows in the target in the recon metrics tables. Furthermore, the aggregates reconcile process supports various aggregate functions like min, max, count, sum, avg, median, mode, percentile, stddev, and variance. The documentation has been updated to reflect these improvements. The commit also removes the `percentile` function from the reconciliation configuration and modifies the `aggregate_metrics` SQL query, enhancing the flexibility of the `Table Deployment` feature for `Aggregate Tables`. Users should note that the `percentile` function is no longer a valid option and should update their code accordingly.
- Aggregates Reconcile documentation (#779). In this commit, the Aggregates Reconcile utility has been enhanced with new documentation and visualizations for improved understanding and usability. The utility now includes a flow diagram, visualization, and README file illustrating how it compares specific aggregate metrics between source and target data residing on Databricks. A new configuration sample is added, showcasing the reconciliation of two tables using various aggregate functions, join columns, transformations, filters, and JDBC ReaderOptions configurations. The commit also introduces two Mermaid flowchart diagrams, depicting the reconciliation process with and without a `group by` operation. Additionally, new flow diagram visualizations in PNG and GIF formats have been added, aiding in understanding the process flow of the Aggregates Reconcile feature. The reconcile configuration samples in the documentation have also been updated with a spelling correction for clarity.
- Bump sqlglot from 25.6.1 to 25.8.1 (#749). In this version update, the `sqlglot` dependency has been bumped from 25.6.1 to 25.8.1, bringing several bug fixes and new features related to various SQL dialects such as BigQuery, DuckDB, and T-SQL. Notable changes include support for BYTEINT in BigQuery, improved parsing and transpilation of StrToDate in ClickHouse, and support for SUMMARIZE in DuckDB. Additionally, there are bug fixes for DuckDB and T-SQL, including wrapping left IN clause json extract arrow operand and handling JSON_QUERY with a single argument. The update also includes refactors and changes to the ANNOTATORS and PARSER modules to improve dialect-aware annotation and consistency. This pull request is compatible with `sqlglot` version 25.6.1 and below and includes a detailed list of commits and their corresponding changes.
- Generate window functions (#772). In this release, we have added support for generating SQL `WINDOW` and `SortOrder` expressions in the `ExpressionGenerator` class. This enhancement includes the ability to generate a `WINDOW` expression with a window function, partitioning and ordering clauses, and an optional window frame, using the `window` and `frameBoundary` methods. The `sortOrder` method now generates the SQL `SortOrder` expression, which includes the expression to sort by, sort direction, and null ordering. Additional methods `orNull` and `doubleQuote` return a string representing a NULL value and a string enclosed in double quotes, respectively. These changes provide increased flexibility for handling more complex expressions in SQL. Additionally, new test cases have been added to the `ExpressionGeneratorTest` to ensure the correct generation of SQL window functions, specifically the `ROW_NUMBER()` function with various partitioning, ordering, and framing specifications. These updates improve the robustness and functionality of the `ExpressionGenerator` class for generating SQL window functions.
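
  For illustration, the kind of windowed SQL this generator work covers (table, column, and alias names are invented; not the generator's literal output):

  ```sql
  SELECT
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC NULLS LAST) AS rn,
    SUM(salary)  OVER (PARTITION BY department ORDER BY hire_date
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
  FROM employees;
  ```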
- Implement TSQL specific function call mapper (#765). This commit introduces several new features to enhance compatibility between TSQL and Databricks SQL. A new method, `interval`, has been added to generate a Databricks SQL compatible string for intervals in a TSQL expression. The `expression` method has been updated to handle certain functions directly, improving translation efficiency. Specifically, the DATEADD function is now translated to Databricks SQL's DATE_ADD, ADD_MONTHS, and xxx + INTERVAL n {days|months|etc} constructs. The changes also include a new sealed trait `KnownIntervalType`, a new case class `KnownInterval`, and a new class `TSqlCallMapper` for mapping TSQL functions to Databricks SQL equivalents. Furthermore, the commit introduces new tests for TSQL specific function call mappers, ensuring proper translation of TSQL functions to Databricks SQL compatible constructs. These improvements collectively facilitate better integration and compatibility between TSQL and Databricks SQL.
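
  A hedged sketch of the DATEADD mapping described above (column names are invented; the generated text may differ in detail):

  ```sql
  -- T-SQL
  SELECT DATEADD(day, 7, order_date),
         DATEADD(month, 3, order_date),
         DATEADD(hour, 2, order_ts)
  FROM orders;

  -- Databricks SQL, following the three constructs mentioned above
  SELECT DATE_ADD(order_date, 7),
         ADD_MONTHS(order_date, 3),
         order_ts + INTERVAL 2 HOURS
  FROM orders;
  ```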
- Improve TSQL and Snowflake parser and lexer (#757). In this release, the open-source library's Snowflake and TSQL lexers and parsers have been improved for better functionality and robustness. For the Snowflake lexer, unnecessary escape sequence processing has been removed, and various options have been corrected to be simple strings. The lexer now accepts a question mark as a placeholder for prepared statements in Snowflake statements. The TSQL lexer has undergone minor improvements, such as aligning the catch-all rule name with Snowflake. The Snowflake parser now accepts the question mark as a `PARAM` placeholder and simplifies the `typeFileformat` rule to accept a single `STRING` token. Additionally, several new keywords have been added to the TSQL lexer, improving consistency and clarity. These changes aim to simplify lexer and parser rules, enhance option handling and placeholders, and ensure consistency between Snowflake and TSQL.
- Patch Information Schema Predicate Pushdown for Snowflake (#764). In this release, we have implemented Information Schema Predicate Pushdown for Snowflake, resolving issue #7
- TSQL: Implement correct grammar for CREATE TABLE in all forms (#796). In this release, the TSqlLexer's CREATE TABLE statement grammar has been updated and expanded to support new keywords and improve accuracy. The newly added keywords 'EDGE', 'FILETABLE', 'NODE', and `NODES` enable correct parsing of CREATE TABLE statements using graph nodes and FILETABLE functionality. Existing keywords such as 'DROP_EXISTING', 'DYNAMIC', 'FILENAME', and `FILTER` have been refined for better precision. Furthermore, the introduction of the `tableIndices` rule standardizes the order of columns in the table. These enhancements improve the T-SQL parser's robustness and consistency, benefiting users in creating and managing tables in their databases.
- TSQL: Implement grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION (#788). In this release, we have implemented the TSQL grammar for `CREATE DATABASE` and `CREATE DATABASE SCOPED OPTION` statements, addressing inconsistencies with TSQL documentation. The implementation was initially intended to cover the entire process from grammar to code generation. However, to simplify other DDL statements, the work was split into separate grammar-only pull requests. The diff introduces new methods such as `createDatabaseScopedCredential`, `createDatabaseOption`, and `databaseFilestreamOption`, while modifying the existing `createDatabase` method. The `createDatabaseScopedCredential` method handles the creation of a database scoped credential, which was previously part of `createDatabaseOption`. The `createDatabaseOption` method now focuses on handling individual options, while `databaseFilestreamOption` deals with filesystem specifications. Note that certain options, like `DEFAULT_LANGUAGE`, `DEFAULT_FULLTEXT_LANGUAGE`, and more, have been marked as TODO and will be addressed in future updates.
- TSQL: Improve transpilation coverage (#766). In this update, various enhancements have been made to improve the coverage of TSQL transpilation and address bugs in code generation, particularly for the `ExpressionGenerator` class in the `com/databricks/labs/remorph/generators/sql` package, and the `TSqlExpressionBuilder`, `TSqlFunctionBuilder`, `TSqlCallMapper`, and `QueryRunner` classes. Changes include adding support for new cases, modifying code generation behavior, improving test coverage, and updating existing tests for better TSQL code generation. Specific additions include new methods for handling bitwise operations, converting CHECKSUM_AGG calls to a sequence of MD5 function calls, and handling Fn instances. The `QueryRunner` class has been updated to include both the actual and expected outputs in error messages for better debugging purposes. Additionally, the test file for the `DATEADD` function has been updated to ensure proper syntax and consistency. All these modifications aim to improve the reliability, accuracy, and compatibility of TSQL transpilation, ensuring better functionality and coverage for the Remorph library's transformation capabilities.
- [chore] speedup build process by not running unit tests twice (#842). In this commit, the build process for the open-source library has been optimized by removing the execution of unit tests during the build phase in the Maven build process. A new plugin for the Apache Maven Surefire Plugin has been added, with the group ID set to "org.apache.maven.plugins", artifact ID set to "maven-surefire-plugin", and version set to "3.1.2". The configuration for this plugin includes a `skipTests` attribute set to "true", ensuring that tests are not run twice, thereby improving the build process speed. The existing ScalaTest Maven plugin configuration remains unchanged, allowing Scala tests to still be executed during the test phase. Additionally, the Maven Compiler Plugin has been upgraded to version 3.11.0, and the release parameter has been set to 8, ensuring that the Java compiler used during the build process is compatible with Java 8. The version numbers for several libraries, including os-lib, mainargs, ujson, scalatest, and exec-maven-plugin, are now being defined using properties, allowing Maven to manage and cache these libraries more efficiently. These changes improve the build process's performance and reliability without affecting the existing functionality.
- [internal] better errors for call mapper (#816). In this release, the `ExpressionGenerator` class in the `com.databricks.labs.remorph.generators.sql` package has been updated to handle exceptions during the conversion of input functions to Databricks expressions. A try-catch block has been added to catch `IndexOutOfBoundsException` and provide a more descriptive error message, including the name of the problematic function and the error message associated with the exception. A `TranspileException` with the message `not implemented` is now thrown when encountering a function for which a translation to Databricks expressions is not available. The `IsTranspiledFromSnowflakeQueryRunner` class in the `com.databricks.labs.remorph.coverage` package has also been updated to include the name of the exception class in the error message for better error identification when a non-fatal error occurs during parsing. Additionally, the import statement for `Formatter` has been moved to ensure alphabetical order. These changes improve error handling and readability, thereby enhancing the overall user experience for developers interacting with the codebase.
- [snowflake] map more functions to Databricks SQL (#826). This commit introduces new private methods `andPredicate` and `orPredicate` to the ExpressionGenerator class in the `com.databricks.labs.remorph.generators.sql` package, enhancing the generation of SQL expressions for AND and OR logical operators, and improving readability and correctness of complex logical expressions. The LogicalPlanGenerator class in the `sql` package now supports more flexibility in inserting data into a target relation, enabling users to choose between overwriting the existing data or appending to it. The `FROM_JSON` function in the CallMapper class has been updated to accommodate an optional third argument, providing more flexibility in handling JSON-related transformations. A new class, `CastParseJsonToFromJson`, has been introduced to improve the performance of data processing pipelines that involve parsing JSON data in Snowflake using the `PARSE_JSON` function. Additional Snowflake SQL functions have been mapped to Databricks SQL IR, enhancing compatibility and functionality. The ExpressionGeneratorTest class now generates predicates without parentheses, simplifying and improving readability. Mappings for several Snowflake functions to Databricks SQL have been added, enhancing compatibility with Databricks SQL. The `sqlFiles` sequence in the `NestedFiles` class is now sorted before being mapped to `AcceptanceTest` objects, ensuring consistent order for testing or debugging purposes. A semicolon has been added to the end of a SQL query in a test file for Snowflake DML insert functionality, ensuring proper query termination.
- [sql] generate `INSERT INTO ...` (#823). In this release, we have made significant updates to our open-source library. The ExpressionGenerator.scala file has been updated to convert boolean values to lowercase instead of uppercase when generating INSERT INTO statements, ensuring SQL code consistency. A new method `insert` has been added to the `LogicalPlanGenerator` class to generate INSERT INTO SQL statements based on the `InsertIntoTable` input. We have introduced a new case class `InsertIntoTable` that extends `Modification` to simplify the API for DML operations other than SELECT. The SQL ExpressionGenerator now generates boolean literals in lowercase, and new test cases have been added to ensure the correct generation of INSERT and JOIN statements. Lastly, we have added support for generating INSERT INTO statements in SQL for specified database tables, improving cross-platform compatibility. These changes aim to enhance the library's functionality and ease of use for software engineers.
- [sql] generate basic JSON access (#835). In this release, we have added several new features and improvements to our open-source library. The `ExpressionGenerator` class now includes a new method, `jsonAccess`, which generates SQL code to access a JSON object's properties, handling different types of elements in the path. The `TO_JSON` function in the `StructsToJson` class has been updated to accept an optional expression as an argument, enhancing its flexibility. The `SnowflakeCallMapper` class now includes a new method, `lift`, and a new feature to generate basic JSON access, with corresponding updates to test cases and methods. The SQL logical plan generator has been refined to generate star projections with escaped identifiers, handling complex table and database names. We have also added new methods and test cases to the `SnowflakeCallMapper` class to convert Snowflake structs into JSON strings and cast Snowflake values to specific data types. These changes improve the library's ability to handle complex JSON data structures, enhance functionality, and ensure the quality of generated SQL code.
- [sql] generate basic `CREATE TABLE` definition (#829). In this release, the open-source library's SQL generation capabilities have been enhanced with the addition of a new `createTable` method to the `LogicalPlanGenerator` class. This method generates a `CREATE TABLE` definition for a given `ir.CreateTableCommand`, producing a SQL statement with a comma-separated list of column definitions. Each column definition includes the column name, data type, and any applicable constraints, generated using the `DataTypeGenerator.generateDataType` method and the newly-introduced `constraint` method. Additionally, the `project` method has been updated to incorporate a `FROM` clause in the generated SQL statement when the input of the project node is not `ir.NoTable()`. These improvements extend the functionality of the `LogicalPlanGenerator` class, allowing it to generate `CREATE TABLE` statements for input catalog ASTs, thereby better supporting data transformation use cases. A new test for the `CreateTableCommand` has been added to the `LogicalPlanGeneratorTest` class to validate the correct transpilation of the `CreateTableCommand` to a `CREATE TABLE` SQL statement.
- [sql] generate basic `TABLESAMPLE` (#830). In this commit, the open-source library's `LogicalPlanGenerator` class has been updated to include a new method, `tableSample`, which generates SQL representations of table sampling operations. Previously, the class only handled `INSERT`, `DELETE`, and `CREATE TABLE` commands. With this enhancement, the generator can now produce SQL statements using the `TABLESAMPLE` clause, allowing for the selection of a sample of data from a table based on various sampling methods and a seed value for repeatable sampling. The newly supported sampling methods include row-based probabilistic, row-based fixed amount, and block-based sampling. Additionally, a new test case has been added for the `LogicalPlanGenerator` related to the `TableSample` class, validating the correct transpilation of named tables and fixed row sampling into the `TABLESAMPLE` clause with specified parameters. This improvement ensures that the generated SQL code accurately represents the desired table sampling settings.
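
  For illustration, the kinds of clauses this maps to (table name and seed are invented; exact formatting of the generated clause may differ):

  ```sql
  -- row-based probabilistic sampling with a repeatable seed
  SELECT * FROM events TABLESAMPLE (10 PERCENT) REPEATABLE (42);

  -- row-based fixed-amount sampling
  SELECT * FROM events TABLESAMPLE (50 ROWS);
  ```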
Dependency updates:
- Bump sqlglot from 25.6.1 to 25.8.1 (#749).
- Aggregate Queries Reconciliation (#740). This release introduces several changes to enhance the functionality of the project, including the implementation of Aggregate Queries Reconciliation, addressing issue #503. A new property, `aggregates`, has been added to the base class of the query builder module to support aggregate queries reconciliation. A `generate_final_reconcile_aggregate_output` function has been added to generate the final reconcile output for aggregate queries. A new SQL file creates a table called `aggregate_details` to store details about aggregate reconciles, and a new column, `operation_name`, has been added to the `main` table in the `installation` reconciliation query. Additionally, new classes and methods have been introduced for handling aggregate queries and their reconciliation, and new SQL tables and columns have been created for storing and managing rules for aggregating data in the context of query reconciliation. Unit tests have been added to ensure the proper functioning of aggregate queries reconciliation and reconcile aggregate data in the context of missing records.
- Generate GROUP BY / PIVOT (#747). The LogicalPlanGenerator class in the remorph library has been updated to support generating GROUP BY and PIVOT clauses for SQL queries. A new private method, "aggregate", has been added to handle two types of aggregates: GroupBy and Pivot. For GroupBy, it generates a GROUP BY clause with specified grouping expressions. For Pivot, it generates a PIVOT clause where the specified column is used as the pivot column and the specified values are used as the pivot values, compatible with Spark SQL. If the aggregate type is unsupported, a TranspileException is thrown. Additionally, new test cases have been introduced for the LogicalPlanGenerator class in the com.databricks.labs.remorph.generators.sql package to support testing the transpilation of Aggregate expressions with GROUP BY and PIVOT clauses, ensuring proper handling and transpilation of these expressions.
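
  For illustration, the Spark SQL shapes being generated here (table, column, and pivot values are invented):

  ```sql
  -- GROUP BY
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region;

  -- PIVOT (Spark SQL / Databricks SQL)
  SELECT *
  FROM sales
  PIVOT (SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'));
  ```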
- Implement error strategy for Snowflake parsing and use error strategy for all parser instances (#760). In this release, we have developed an error strategy specifically for Snowflake parsing that translates raw token names and parser rules into more user-friendly SQL error messages. This strategy is applied consistently across all parser instances, ensuring a unified error handling experience. Additionally, we have refined the DBL_DOLLAR rule in the SnowflakeLexer grammar to handle escaped dollar signs correctly. These updates improve the accuracy and readability of error messages for SQL authors, regardless of the parsing tool or transpiler used. Furthermore, we have updated the TSQL parsing error strategy to match the new Snowflake error strategy implementation, providing a consistent error handling experience across dialects.
- Incremental improvement to error messages - article selection (#711). In this release, we have implemented an incremental improvement to the error messages generated during T-SQL code parsing. This change introduces a new private method, `articleFor`, which determines whether to use `a` or `an` in the generated messages based on the first letter of the following word. The `generateMessage` method has been updated to use this new method when constructing the initial error message and subsequent messages when there are multiple expected tokens. This improvement ensures consistent use of articles `a` or `an` in the error messages, enhancing their readability for software engineers working with T-SQL code.
- TSQL: Adds tests and support for SELECT OPTION(...) generation (#755). In this release, we have added support for generating code for the TSQL `SELECT ... OPTION(...)` clause in the codebase. This new feature includes the ability to transpile any query hints supplied with a SELECT statement as comments in the output code, allowing for easier assessment of query performance after transpilation. The OPTION clause is now generated as comments, including MAXRECURSION, string options, boolean options, and auto options. Additionally, we have added new tests and updated the TSqlAstBuilderSpec test class with new and updated test cases to cover the new functionality. The implementation is focused on generating code for the OPTION clause, and does not affect the actual execution of the query. The changes are limited to the ExpressionGenerator class and its associated methods, and the TSqlRelationBuilder class, without affecting other parts of the codebase.
- TSQL: IR implementation of MERGE (#719). The open-source library has been updated to include a complete implementation of the TSQL MERGE statement's IR (Intermediate Representation), bringing it in line with Spark SQL. The `LogicalPlanGenerator` class now includes a `generateMerge` method, which generates the SQL code for the MERGE statement, taking a `MergeIntoTable` object containing the target and source tables, merge condition, and merge actions as input. The `MergeIntoTable` class has been added as a case class to represent the logical plan of the MERGE INTO command and extends the `Modification` trait. The `LogicalPlanGenerator` class also includes a new `generateWithOptions` method, which generates SQL code for the WITH OPTIONS clause, taking a `WithOptions` object containing the input and options as children. Additionally, the `TSqlRelationBuilder` class has been updated to handle the MERGE statement's parsing, introducing new methods and updating existing ones, such as `visitMerge`. The `TSqlToDatabricksTranspiler` class has been updated to include support for the TSQL MERGE statement, and the `ExpressionGenerator` class has new tests for options, columns, and arithmetic expressions. A new optimization rule, `TrapInsertDefaultsAction`, has been added to handle the behavior of the DEFAULT keyword during INSERT statements. The commit also includes test cases for the `MergeIntoTable` logical operator and the T-SQL merge statement in the `TSqlAstBuilderSpec`.
- Added TSql transpiler (#734). In this release, we have developed a new open-source library feature that enhances the transpilation of T-SQL code to Databricks-compatible code. The new TSqlToDatabricksTranspiler class has been added, which extends the Transpiler abstract class and defines the transpile method. This method converts T-SQL code to Databricks-compatible code by creating a lexer, token stream, parser, and parsed tree from the input string using TSqlLexer, CommonTokenStream, and tSqlFile. The parsed tree is then passed to the TSqlAstBuilder's visit method to generate a logical plan, which is optimized using an optimizer object with rules such as PullLimitUpwards and TopPercentToLimitSubquery. The optimized plan is then passed to the LogicalPlanGenerator's generate method to generate the final transpiled code. Additionally, a new class, IsTranspiledFromTSqlQueryRunner, has been added to the QueryRunner object to transpile T-SQL queries into the format expected by the Databricks query runner using the new TSqlToDatabricksTranspiler. The AcceptanceTestRunner class in the com.databricks.labs.remorph.coverage package has been updated to replace TSqlAstBuilder with IsTranspiledFromTSqlQueryRunner in the TSqlAcceptanceSuite class, indicating a change in the code responsible for handling TSql queries. This new feature aims to provide a smooth and efficient way to convert T-SQL code to Databricks-compatible code for further processing and execution.
- Added the missing info for the reconciliation documentation (#520). In this release, we have made significant improvements to the reconciliation feature of our open-source library. We have added a configuration folder and provided a template for creating configuration files for specific table sources. The config files will contain necessary configurations for table-specific reconciliation. Additionally, we have included a note in the transformation section detailing the usage of user-defined functions (UDFs) in transformation expressions, with an example UDF called `sort_array_input()` provided. The reconcile configuration sample documentation has also been added, with a JSON configuration for reconciling tables using various operations like drop, join, transformation, threshold, filter, and JDBC ReaderOptions. The commit also includes examples of source and target tables, data overviews, and reconciliation configurations for various scenarios, such as basic config and column mapping, user transformations, explicit select, explicit drop, filters, and thresholds comparison, among others. These changes aim to make it easier for users to set up and execute the reconciliation process for specific table sources and provide clear and concise information about using UDFs in transformation expressions for reconciliation. This commit is co-authored by Vijay Pavan Nissankararao and SundarShankar89.
- Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#555). In this pull request, the sigstore/gh-action-sigstore-python dependency is being updated from version 2.1.1 to 3.0.0. This new version includes several changes and improvements, such as the addition of recursive globbing with ** to the inputs and the removal of certain settings like fulcio-url, rekor-url, ctfe, and rekor-root-pubkey. The signature, certificate, and bundle output settings have also been removed. Furthermore, the way inputs are parsed has been changed, and they are now made optional under certain conditions. The default suffix has been updated to .sigstore.json. The 3.0.0 version also resolves various deprecations present in sigstore-python's 2.x series and supports CI runners that use PEP 668 to constrain global package prefixes.
- Bump sqlglot from 25.1.0 to 25.5.1 (#534). In the latest release, the `sqlglot` package has been updated from version 25.1.0 to 25.5.1, which includes bug fixes, breaking changes, new features, and refactors for parsing, analyzing, and rewriting SQL queries. The new version introduces optimizations for coalesced USING columns, preserves EXTRACT(date_part FROM datetime) calls, decouples NVL() from COALESCE(), and supports FROM CHANGES in Snowflake. It also provides configurable transpilation of Snowflake VARIANT, and supports view schema binding options for Spark and Databricks. The update addresses several issues, such as the use of timestamp with time zone over timestamptz, switch off table alias columns generation, and parse rhs of x::varchar(max) into a type. Additionally, the update cleans up CurrentTimestamp generation logic for Teradata. The `sqlglot` dependency has also been updated in the 'experimental.py' file of the 'databricks/labs/remorph/snow' module, along with the addition of a new private method `_parse_json` to the `Generator` class for parsing JSON data. Software engineers should review the changes and update their code accordingly, as conflicts with existing code will be resolved automatically by Dependabot, as long as the pull request is not altered manually.
- Bump sqlglot from 25.5.1 to 25.6.1 (#585). In this release, the `sqlglot` dependency is updated from version 25.5.1 to 25.6.1 in the 'pyproject.toml' file. This update includes bug fixes, breaking changes, new features, and improvements. Breaking changes consist of updates to the QUALIFY clause in queries and the canonicalization of struct and array inline constructor. New features include support for ORDER BY ALL, FROM ROWS FROM (...), RPAD & LPAD functions, and exp.TimestampAdd. Bug fixes address issues related to the QUALIFY clause in queries, expansion of SELECT * REPLACE, RENAME, transpiling UDFs from Databricks, and more. The pull request also includes a detailed changelog, commit history, instructions for triggering Dependabot actions, and commands for reference, with the exception of the compatibility score for the new version, which is not taken into account in this pull request.
- Feature/reconcile table mismatch threshold (#550). This commit enhances the reconciliation process in the open-source library with several new features, addressing issue #504. A new `get_record_count` method is added to the `Reconcile` class, providing record count data for source and target tables, facilitating comprehensive analysis of table mismatches. A `CountQueryBuilder` class is introduced to build record count queries for different layers and SQL dialects, ensuring consistency in data processing. The `Thresholds` class is refactored into `ColumnThresholds` and `TableThresholds`, allowing for more granular control over comparisons and customizable threshold settings. New methods `_is_mismatch_within_threshold_limits` and `_insert_into_metrics_table` are added to the `recon_capture.py` file, improving fine-grained control over the reconciliation process and preventing false positives. Additionally, new classes, methods, and data structures have been implemented in the `execute` module to handle reconciliation queries and data more efficiently. These improvements contribute to a more accurate and robust reconciliation system.
- Feature: introduce core transpiler (#715). A new core transpiler, `SnowflakeToDatabricksTranspiler`, has been introduced to convert Snowflake queries into Databricks SQL, streamlining integration and compatibility between the two systems. This transpiler is integrated into the coverage test suites for thorough testing, and is used to convert various types of logical plans, handling cases such as `Batch`, `WithCTE`, `Project`, `NamedTable`, `Filter`, and `Join`. The `SnowflakeToDatabricksTranspiler` class tokenizes input Snowflake query strings, initializes a `SnowflakeParser` instance, parses the input Snowflake query, generates a logical plan, and applies the `LogicalPlanGenerator` to the logical plan to generate the equivalent Databricks SQL query. Additionally, the `SnowflakeAstBuilder` class has been updated to alter the way `Batch` logical plans are built and improve overall functionality of the transpiler.
- Fixed LEFT and RIGHT JOIN syntax in Snowflake ANTLR grammar (#526). A fix has been implemented to address issues with the Snowflake ANTLR grammar related to the proper parsing of LEFT and RIGHT JOIN statements. Previously, the keywords LEFT and RIGHT were incorrectly allowed as identifiers, but they are hard keywords that must be escaped to be used as column names. This change updates the grammar to escape these keywords in JOIN statements, improving the overall parsing of queries that include LEFT and RIGHT JOINs. Additionally, semantic predicates have been suggested to handle cases where LEFT or RIGHT are used as column names without escaping, although this is not yet implemented. To ensure the correctness of the updated grammar, new tests have been added to the SnowflakeAstBuilderSpec for LEFT and RIGHT JOINs, which check that the Abstract Syntax Tree (AST) is built correctly for these queries.
- Fixed Snowflake Acceptance Testcases Failures (#531). In this release, updates have been made to the acceptance testcases for various SQL functions in the open-source library. The DENSE RANK function's testcase has been updated with a window specification and ORDER BY clause in both Snowflake and Databricks SQL syntaxes, ensuring accurate test results. The LAG function's testcase now includes a PARTITION BY and ORDER BY clause, as well as the NULLS LAST keyword in Databricks SQL, for improved accuracy and consistency. The SQL queries in the Snowflake testcase for the `last_value` function have been updated with a window specification, ORDER BY clause, and NULLS LAST directive for Databricks SQL. Test case failures in the Snowflake acceptance testsuite have been addressed with updates to the LEAD function, MONTH_NAME to MONTHNAME renaming, and DATE_FORMAT to TO_DATE conversion, improving reliability and consistency. The ntile function's testcase has been updated with PARTITION BY and ORDER BY clauses, and the NULLS LAST keyword has been added to the Databricks SQL query. The SQL query for null-safe equality comparison has been updated with a conditional expression compatible with Snowflake. The ranking function's testcase has been improved with the appropriate partition and order by clauses, and the NULLS LAST keyword has been added to the Databricks SQL query, enhancing accuracy and consistency. Lastly, updates have been made to the ROW_NUMBER function's testcase, ensuring accurate and consistent row numbering for both Snowflake and Databricks.
- Fixed TSQL transpiler (#735). In this release, we have implemented a fix for the TSQL transpiler, addressing the issue #7. This enhancement allows the library to accurately convert TSQL code into an equivalent format that is compatible with other databases. The fix resolves reported bugs related to incorrect syntax interpretation, thereby improving the overall reliability and functionality of the transpiler. Software engineers and developers relying on TSQL compatibility for cross-database operations will benefit from this improvement. We encourage users to test and provide feedback on this updated feature.
- Fixed `SELECT TOP X PERCENT` IR translation for TSQL (#733). In this release, we have made several enhancements to the open-source library to improve compatibility with T-SQL and Catalyst. We have added a new dependency, `pprint_${scala.binary.version}` version 0.8.1 from the `com.lihaoyi` group, to provide advanced pretty-printing functionality for Scala. We have also fixed the translation of the TSQL `SELECT TOP X PERCENT` feature in the parser for intermediate expressions, addressing the difference in syntax between TSQL and SQL for limiting the number of rows returned by a query. Additionally, we have modified the implementation of the `WITH` clause and added a new expression for the `SELECT TOP` clause in T-SQL, improving the compatibility of the codebase with T-SQL and aligning it with Catalyst. We also introduced a new abstract class `Rule` and a case class `Rules` in the `com.databricks.labs.remorph.parsers.intermediate` package to fix the `SELECT TOP X PERCENT` IR translation for TSQL by adding new rules. Furthermore, we have added a new Scala file, `subqueries.scala`, containing abstract class `SubqueryExpression` and two case classes that extend it, and made changes to the `trees.scala` file to improve the tree string representation for better readability and consistency in the codebase. These changes aim to improve the overall functionality of the library and make it easier for new users to understand and adopt the project.
- Fixed invalid null constraint and FQN (#517). In this change, we addressed issues #516 and #517, which involved resolving an invalid null constraint and correcting a fully qualified name (FQN) in our open-source library. The `read_data` function in the `databricks.py` file was updated to improve null constraint handling and ensure the FQN is valid. Previously, the catalog was always appended to the table name, potentially resulting in an invalid FQN and null constraint issues. Now, the code checks if the catalog exists before appending it to the table name, and if not provided, the schema and table name are concatenated directly. Additionally, we removed the NOT NULL constraint from the catalog field in the source_table and target_table structs in the main SQL file, allowing null values for this field. These changes maintain backward compatibility and enhance the overall functionality and robustness of the project, ensuring accurate query results and avoiding potential errors.
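  As a rough illustration of the catalog handling described above (not the actual `read_data` code; the helper name below is hypothetical), a minimal Python sketch:

  ```python
  from typing import Optional

  def build_fqn(catalog: Optional[str], schema: str, table: str) -> str:
      """Build a fully qualified name, omitting the catalog when it is not set,
      so a missing catalog no longer yields an invalid name like 'None.schema.table'."""
      parts = [catalog, schema, table] if catalog else [schema, table]
      return ".".join(parts)

  # Hypothetical usage:
  assert build_fqn("main", "recon", "orders") == "main.recon.orders"
  assert build_fqn(None, "recon", "orders") == "recon.orders"
  ```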
- Generate SQL for arithmetic operators (#726). In this release, we have introduced a new private method `arithmetic` to the `ExpressionGenerator` class that generates SQL for arithmetic operations, including unary minus, unary plus, multiplication, division, modulo, addition, and subtraction. This improves the readability and maintainability of the code by separating concerns and making the functionality more explicit. Additionally, we have introduced a new trait named `Arithmetic` to group arithmetic expressions together, which enables easier manipulation and identification of arithmetic expressions in the code. A new test suite has also been added for arithmetic operations in the `ExpressionGenerator` class, which improves test coverage and ensures the correct SQL is generated for these operations. These changes provide a welcome addition for developers looking to extend the functionality of the `ExpressionGenerator` class for arithmetic operations.
- Generate SQL for bitwise operators. The `ExpressionGenerator` class in the remorph project has been updated to support generating SQL for bitwise operators (OR, AND, XOR, NOT) through the addition of a new private method `bitwise` that converts bitwise operations to equivalent SQL expressions. The `expression` method has also been updated to utilize the new `bitwise` method for any input `ir.Expression` that is a bitwise operation. To facilitate this change, a new trait called `Bitwise` and updated case classes for bitwise operations, including `BitwiseNot`, `BitwiseAnd`, `BitwiseOr`, and `BitwiseXor`, have been implemented. The updated case classes extend the new `Bitwise` trait and include the `dataType` override method to return the data type of the left expression. A new test case in `ExpressionGeneratorTest` for bitwise operators has been added to validate the functionality, and the `expression` method in `ExpressionGenerator` now utilizes `GeneratorContext()` instead of `new GeneratorContext()`. These changes enable the `ExpressionGenerator` to generate SQL code for bitwise operations, expanding its capabilities.
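  A minimal Python sketch of the operator mapping involved (the real implementation is the Scala `ExpressionGenerator`; the mini-IR below is hypothetical):

  ```python
  from dataclasses import dataclass

  @dataclass
  class BitwiseOp:
      op: str          # one of "and", "or", "xor", "not"
      left: str        # already-generated SQL for the left operand
      right: str = ""  # empty for the unary NOT case

  def bitwise_sql(e: BitwiseOp) -> str:
      """Map a bitwise operation to its SQL operator form."""
      if e.op == "not":
          return f"~{e.left}"
      symbol = {"and": "&", "or": "|", "xor": "^"}[e.op]
      return f"{e.left} {symbol} {e.right}"

  print(bitwise_sql(BitwiseOp("xor", "a", "b")))  # a ^ b
  print(bitwise_sql(BitwiseOp("not", "flags")))   # ~flags
  ```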
- Generate `.. LIKE ..` (#723). In this commit, the `ExpressionGenerator` class has been enhanced with a new method, `like`, which generates SQL `LIKE` expressions. The method takes a `GeneratorContext` object and an `ir.Like` object as arguments, and returns a string representation of the `LIKE` expression. It uses the `expression` method to generate the left and right sides of the `LIKE` operator, and also handles the optional escape character. Additionally, the `timestampLiteral` and `dateLiteral` methods have been updated to take an `ir.Literal` object and better handle `NULL` values. A new test case has also been added for the `ExpressionGenerator` class, which checks the `like` function and includes examples for basic usage and usage with an escape character. This commit improves the functionality of the `ExpressionGenerator` class, allowing it to handle `LIKE` expressions and better handle `NULL` values for timestamps and dates.
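  A hypothetical Python sketch of how a `LIKE` expression with an optional escape character can be rendered (illustrative only; the project's `like` method is implemented in Scala):

  ```python
  from typing import Optional

  def like_sql(left: str, pattern: str, escape: Optional[str] = None) -> str:
      """Render a LIKE predicate, appending the ESCAPE clause only when an
      escape character is supplied."""
      sql = f"{left} LIKE {pattern}"
      if escape is not None:
          sql += f" ESCAPE {escape}"
      return sql

  print(like_sql("name", "'%bob%'"))               # name LIKE '%bob%'
  print(like_sql("code", "'10!%'", escape="'!'"))  # code LIKE '10!%' ESCAPE '!'
  ```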
- Generate `DISTINCT`, `*` (#739). In this release, we've enhanced the `ExpressionGenerator` class to support generating `DISTINCT` and `*` (star) expressions in SQL queries. Previously, the class did not handle these cases, resulting in incomplete or incorrect SQL queries. With the introduction of the `distinct` method, the class can now generate the `DISTINCT` keyword followed by the expression it applies to, and the `star` method produces the `*` symbol, optionally followed by the name of the object (table or subquery) to which it applies. These improvements make the `ExpressionGenerator` class more robust and compatible with various SQL dialects, resulting in more accurate query outcomes. We've also added new test cases for the `ExpressionGenerator` class to ensure that `DISTINCT` and `*` expressions are generated correctly. Additionally, support for generating SQL `DISTINCT` and `*` (wildcard) has been added to the transpilation of logical plans to SQL, specifically in the `ir.Project` class. This ensures that the correct SQL `SELECT * FROM table` syntax is generated when a wildcard is used in the expression list. These enhancements significantly improve the functionality and compatibility of our open-source library.
- Generate `LIMIT` (#732). A new method has been added to generate a SQL LIMIT clause for a logical plan in the data transformation tool. A new `case` branch has been implemented in the `generate` method of the `LogicalPlanGenerator` class to handle the `ir.Limit` case, which generates the SQL LIMIT clause. If a percentage limit is specified, the tool will throw an exception as it is not currently supported. The `generate` method of the `ExpressionGenerator` class has been replaced with a new `generate` method in the `ir.Project`, `ir.Filter`, and new `ir.Limit` cases to ensure consistent expression generation. A new case class `Limit` has been added to the `com.databricks.labs.remorph.parsers.intermediate.relations` package, which extends the `UnaryNode` class and has four parameters: `input`, `limit`, `is_percentage`, and `with_ties`. This new class enables limiting the number of rows returned by a query, with the ability to specify a percentage of rows or include ties in the result set. Additionally, a new test case has been added to the `LogicalPlanGeneratorTest` class to verify the transpilation of a `Limit` node to its SQL equivalent, ensuring that the `Limit` node is correctly handled during transpilation.
- Generate `OFFSET` SQL clauses (#736). The Remorph project's latest update introduces a new OFFSET clause generation feature for SQL queries in the LogicalPlanGenerator class. This change adds support for skipping a specified number of rows before returning results in a SQL query, enhancing the library's query generation capabilities. The implementation includes a new case in the match statement of the generate function to handle ir.Offset nodes, creating a string representation of the OFFSET clause using the provided offset expression. Additionally, the commit includes a new test case in the LogicalPlanGeneratorTest class to validate the OFFSET clause generation, ensuring that the LogicalPlanGenerator can translate Offset AST nodes into corresponding SQL statements. Overall, this update enables the generation of more comprehensive SQL queries with OFFSET support, providing software engineers with greater flexibility for pagination and other data processing tasks.
- Generate `ORDER BY` SQL clauses (#737). This commit introduces new classes and enumerations for sort direction and null ordering, as well as an updated SortOrder case class, enabling the generation of ORDER BY SQL clauses. The LogicalPlanGenerator and SnowflakeExpressionBuilder classes have been modified to utilize these changes, allowing for more flexible and customizable sorting and null ordering when generating SQL queries. Additionally, the TSqlRelationBuilderSpec test suite has been updated to reflect these changes, and new test cases have been added to ensure the correct transpilation of ORDER BY clauses. Overall, these improvements enhance the Remorph project's capability to parse and generate SQL expressions with various sorting scenarios, providing a more robust and maintainable codebase.
- Generate `UNION`/`EXCEPT`/`INTERSECT` (#731). In this release, we have introduced support for the `UNION`, `EXCEPT`, and `INTERSECT` set operations in our data processing system's generator. A new `unknown` method has been added to the `Generator` trait to return a `TranspileException` when encountering unsupported operations, allowing for better error handling and more informative error messages. The `LogicalPlanGenerator` class in the remorph project has been extended to support generating `UNION`, `EXCEPT`, and `INTERSECT` SQL operations with the addition of a new parameter, `explicitDistinct`, to enable explicit specification of `DISTINCT` for these set operations. A new test suite has been added to the `LogicalPlanGenerator` to test the generation of these operations using the `SetOperation` class, which now has four possible set operations: `UnionSetOp`, `IntersectSetOp`, `ExceptSetOp`, and `UnspecifiedSetOp`. With these changes, our system can handle a wider range of input and provide meaningful error messages for unsupported operations, making it more versatile in handling complex SQL queries.
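  A hypothetical Python sketch of how an `explicitDistinct`-style flag can affect the generated set operation (illustrative only, not the project's Scala implementation):

  ```python
  def set_op_sql(left: str, op: str, right: str, explicit_distinct: bool = False) -> str:
      """Join two generated sub-queries with a set operator. When explicit_distinct
      is set, DISTINCT is spelled out even though it is the default semantics of
      UNION/EXCEPT/INTERSECT."""
      if op not in ("UNION", "EXCEPT", "INTERSECT"):
          raise ValueError(f"unsupported set operation: {op}")
      keyword = f"{op} DISTINCT" if explicit_distinct else op
      return f"{left} {keyword} {right}"

  print(set_op_sql("SELECT a FROM t1", "UNION", "SELECT a FROM t2"))
  print(set_op_sql("SELECT a FROM t1", "UNION", "SELECT a FROM t2", explicit_distinct=True))
  ```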
- Generate `VALUES` SQL clauses (#738). The latest commit introduces a new feature to generate `VALUES` SQL clauses in the context of the logical plan generator. A new case branch has been implemented in the `generate` method to manage `ir.Values` expressions, converting input data (lists of lists of expressions) into a string representation compatible with `VALUES` clauses. The existing functionality remains unchanged. Additionally, a new test case has been added for the `LogicalPlanGenerator` class, which checks the correct transpilation to `VALUES` SQL clauses. This test case ensures that the `ir.Values` method, which takes a sequence of sequences of literals, generates the corresponding `VALUES` SQL clause, specifically checking the input `Seq(Seq(ir.Literal(1), ir.Literal(2)), Seq(ir.Literal(3), ir.Literal(4)))` against the SQL clause `"VALUES (1,2), (3,4)"`. This change enables testing the functionality of generating `VALUES` SQL clauses using the `LogicalPlanGenerator` class.
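  The expected shape of the output can be illustrated with a small Python sketch that reproduces the documented test case (illustrative only; the actual generator is written in Scala):

  ```python
  def values_sql(rows):
      """Render rows of literals as a VALUES clause, e.g.
      [[1, 2], [3, 4]] -> "VALUES (1,2), (3,4)"."""
      rendered = ", ".join("(" + ",".join(str(v) for v in row) + ")" for row in rows)
      return f"VALUES {rendered}"

  assert values_sql([[1, 2], [3, 4]]) == "VALUES (1,2), (3,4)"
  ```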
- Generate predicate expressions. This commit introduces the generation of predicate expressions as part of the SQL `ExpressionGenerator` in the `com.databricks.labs.remorph.generators.sql` package, enabling the creation of more complex SQL expressions. The changes include the addition of a new private method, `predicate(ctx: GeneratorContext, expr: Expression)`, to handle predicate expressions, and the introduction of two new predicate expression types, `LessThan` and `LessThanOrEqual`, for comparing the relative ordering of two expressions. Existing predicate expression types have been updated with consistent naming. Additionally, the commit incorporates improvements to the handling of comparison operators in the `SnowflakeExpressionBuilder` and `TSqlExpressionBuilder` classes, addressing bugs and ensuring precise predicate expression generation. The `ParserTestCommon` trait has also been updated to reorder certain operators in a logical plan, maintaining consistent comparison results in tests. New test cases have been added to several test suites to ensure the correct interpretation and generation of predicate expressions involving different data types and search conditions. Overall, these enhancements provide more fine-grained comparison of expressions, enable more nuanced condition checking, and improve the robustness and accuracy of the SQL expression generation process.
- Merge remote-tracking branch 'origin/main'. In this update, the `ExpressionGenerator` class in the `com.databricks.labs.remorph.generators.sql` package has been enhanced with two new private methods: `dateLiteral` and `timestampLiteral`. These methods are designed to generate the SQL literal representation of `DateType` and `TimestampType` expressions, respectively. The introduction of these methods addresses the previous limitations of formatting date and timestamp values directly, which lacked extensibility and required duplicated code for handling null values. By extracting the formatting logic into separate methods, this commit significantly improves code maintainability and reusability, enhancing the overall readability and understandability of the `ExpressionGenerator` class for developers. The `dateLiteral` method handles `DateType` values by formatting them using the `dateFormat` `SimpleDateFormat` instance, returning `NULL` if the value is missing. Likewise, the `timestampLiteral` method formats `TimestampType` values using the `timeFormat` `SimpleDateFormat` instance, returning `NULL` if the value is missing. These methods will enable developers to grasp the code's functionality more easily and make future enhancements to the class.
- Modified dataclass for table threshold and added documentation (#714). A series of modifications have been implemented to enhance the threshold configuration and validation for table reconciliation in the open-source library. The `TableThresholds` dataclass has been updated to accept a string for the `model` attribute, replacing the previously used `TableThresholdModel` Enum. Additionally, a new `validate_threshold_model` method has been added to `TableThresholds` to ensure proper validation of the `model` attribute. A new exception class, `InvalidModelForTableThreshold`, has been introduced to handle invalid settings. Column-specific thresholds can now be set using the `ColumnThresholds` configuration option. The `recon_capture.py` and `recon_config.py` files have been updated accordingly, and the documentation has been revised to clarify these changes. These improvements offer greater flexibility and control for users configuring thresholds while also refining validation and error handling.
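  A hypothetical Python sketch of the described validation; the accepted model names used here are assumptions, not the library's actual values:

  ```python
  from dataclasses import dataclass

  class InvalidModelForTableThreshold(Exception):
      """Raised when a table threshold is configured with an unknown model."""

  @dataclass
  class TableThresholds:
      lower_bound: str
      upper_bound: str
      model: str  # plain string instead of an Enum

      def __post_init__(self):
          self.validate_threshold_model()

      def validate_threshold_model(self):
          allowed = {"mismatch"}  # assumed set of valid model names
          if self.model not in allowed:
              raise InvalidModelForTableThreshold(
                  f"expected one of {allowed}, got {self.model!r}"
              )

  TableThresholds(lower_bound="0%", upper_bound="5%", model="mismatch")   # passes
  # TableThresholds(lower_bound="0%", upper_bound="5%", model="row")      # would raise
  ```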
- Support CTAS in TSQL Grammar and add more comparison operators (#545). In this release, we have added support for the CTAS (CREATE TABLE AS) statement in the TSQL (T-SQL) grammar, as well as introduced new comparison operators: !=, !<, and !>. The CTAS statement allows for the creation of a new table by selecting from an existing table or query, potentially improving code readability and performance. The new comparison operators provide alternative ways of expressing inequalities, increasing flexibility for developers. The keyword `REPLICATE` has also been added for creating a full copy of a database or availability group. These changes enhance the overall functionality of the TSQL grammar and improve the user's ability to express various operations in TSQL. The CTAS statement is implemented as a new rule, and the new comparison operators are added as methods in the TSqlExpressionBuilder class. These changes provide increased capability and flexibility for TSQL parsing and query handling. The new methods do not add any new external dependencies, and the project remains self-contained. The additions have been tested with the TSqlExpressionBuilderSpec test suite, ensuring the functionality and compatibility of the TSQL parser.
- Support translation of TSQL IGNORE NULLS clause in windowing functions (#511). The latest change introduces support for translating the TSQL IGNORE NULLS and RESPECT NULLS clauses in windowing functions to their equivalents in Databricks SQL. In TSQL, these clauses appear after the function name and before the OVER clause, affecting how the functions handle null values. Databricks SQL represents this functionality with an optional trailing boolean parameter for specific windowing functions. With this update, when the IGNORE NULLS clause is specified in TSQL, a boolean option is appended to the corresponding Databricks SQL windowing functions, with RESPECT NULLS as the default. This enhancement is facilitated by a new private method, `buildNullIgnore`, which adds the boolean parameter to the original expression when IGNORE NULLS is specified in the OVER clause. The alteration is exemplified in new test examples for the TSqlFunctionSpec, which include testing the LEAD function with and without the IGNORE NULLS clause, and updates to the translation of functions with non-standard syntax.
- TSQL: Implement TSQL UPDATE/DELETE statements (#540). In this release, we have added support for TSQL UPDATE and DELETE statements in all syntactical forms, including UDF column transformations, in the TSqlParser.g4 file. The implementation of both statements is done in a single file and they share many common clauses. We have also introduced two new case classes, UpdateTable and MergeTables, in the extensions.scala file to implement the TSQL UPDATE and DELETE statements, respectively. Additionally, we have added new methods to handle various clauses and elements associated with these statements in the TSqlErrorStrategy class. A series of tests have been included to ensure the correct translation of various UPDATE and DELETE queries to their respective Abstract Syntax Trees (ASTs). These changes bring TSQL UPDATE and DELETE statement functionality to the project and allow for their use in a variety of contexts, providing developers with more flexibility and control when working with TSQL UPDATE and DELETE statements in the parser.
- TSQL: Implement translation of INSERT statement (#515). In this release, we have implemented the TSQL INSERT statement in its entirety, including all target options, optional clauses, and Common Table Expressions (CTEs) in our open-source library. The change includes updates to the TSqlParser.g4 file to support the INSERT statement's various clauses, such as TOP, INTO, WITH TABLE HINTS, outputClause, and optionClause. We have also added new case classes to the TSQL AST to handle various aspects of the INSERT statement, including LocalVarTable, Output, InsertIntoTable, DerivedRows, DefaultValues, and Default. The TSqlExpressionBuilder and TSqlRelationBuilder classes have been updated to support the new INSERT statement, including handling output column lists, aliases, and JSON clauses. We have added specification tests to TSqlAstBuilderSpec.scala to demonstrate the various ways that the INSERT statement can be written and to ensure the correct translation of TSQL INSERT statements into the remorph project.
- TSQL: Remove the SIGN fragment from numerical tokens (#547). In this release, we have made changes to the TSQL expression builder in the remorph project that affect how negative and positive numerical literals are parsed. Negative literals, such as -2, will now be parsed as UMinus(Literal(2)) instead of Literal(-2), and positive literals, such as +1, will be parsed as UPlus(Literal(1)) instead of Literal(1). This change was made to address issue #546, but it is not an ideal solution as it may cause inconvenience in downstream processes. The affected numerical tokens include INT, HEX, FLOAT, REAL, and MONEY, which have been simplified by removing the SIGN fragment. We have updated the buildPrimitive method to handle INT, REAL, and FLOAT token types and ensure that numerical tokens are parsed correctly. We intend to keep issue #546 open for further exploration of a better solution. The tests have been updated to reflect these changes.
- TSQL: Simplifies named table tableSource, implements columnAlias list (#512). This change introduces significant updates to the TSqlParser's grammar for tableSource, simplifying and consolidating rules related to table aliases and column alias lists. A new Relation called TableWithHints has been added to collect and process table hints, some of which have direct counterparts in the Catalyst optimizer or can be used as comments for migration purposes. The TSQLExpressionBuilder and TSqlRelationBuilder classes have been modified to handle table hints and column aliases, and the TSqlAstBuilderSpec test suite has been updated to include new tests for table hints in T-SQL SELECT statements. These changes aim to improve parsing, handling, and optimization of table sources, table hints, and column aliases in TSQL queries.
- TSQL: Support generic FOR options (#525). In this release, we have added support for parsing T-SQL (Transact-SQL) options that contain the keyword `FOR` using the standard syntax `[FOR]`. This change is necessary as `FOR` cannot be used directly as an identifier without escaping, as it would otherwise be seen as a table alias in a `SELECT` statement. The ANTLR rule for parsing generic options has been expanded to handle these special cases correctly, allowing for the proper parsing of options such as `OPTIMIZE FOR UNKNOWN` in a `SELECT` statement. Additionally, a new case has been added to the OptionBuilder class to handle the `FOR` keyword and elide it, converting `OPTIMIZE FOR UNKNOWN` to `OPTIMIZE` with an id of 'UNKNOWN'. This ensures the proper handling of options containing `FOR` and avoids any conflicts with the `FOR` clause in T-SQL statements. This change was implemented by Valentin Kasas and involves adding a new alternative to the `genericOption` rule in the TSqlParser.g4 file, but no new methods have been added.
- Updated Dialect Variable Name (#535). In this release, the `source` variable name in the `QueryBuilder` class, which refers to the `Dialect` instance, has been updated to `engine` to accurately reflect its meaning as referring to either `Source` or `Target`. This change includes updating the usage of `source` to `engine` in the `build_query`, `_get_with_clause`, and `build_threshold_query` methods, as well as removing unnecessary parentheses in a list. These changes improve the code's clarity, accuracy, and readability, while maintaining the overall functionality of the affected methods.
- Use Oracle library only if the recon source is Oracle (#532). In this release, we have added a new `ReconcileConfig` configuration object and a `SourceType` enumeration in the `databricks.labs.remorph.config` and `databricks.labs.remorph.reconcile.constants` modules, respectively. These objects are introduced to determine whether to include the Oracle JDBC driver library in a reconciliation job's task libraries. The `deploy_job` method and the `_job_recon_task` method have been updated to use the `_recon_config` attribute to decide whether to include the Oracle JDBC driver library. Additionally, the `_deploy_reconcile_job` method in the `install.py` file has been modified to include a new parameter called `reconcile`, which is passed as an argument from the `_config` object. This change enhances the flexibility and customization of the reconcile job deployment. Furthermore, new fixtures `oracle_recon_config` and `snowflake_reconcile_config` have been introduced for `ReconcileConfig` objects with Oracle and Snowflake specific configurations, respectively. These fixtures are used in the test functions for deploying jobs, ensuring that the tests are more focused and better reflect the actual behavior of the code.
- [chore] Make singletons for the relevant `DataType` instances (#705). This commit introduces case objects for various data types, such as NullType, StringType, and others, effectively making them singletons. This change simplifies the creation of data type instances and ensures that each type has a single instance throughout the application. The new case objects are utilized in building SQL expressions, affecting functions such as Cast and TRY_CAST in the TSqlExpressionBuilder class. Additionally, the test file TSqlExpressionBuilderSpec.scala has been updated to include the new case objects and handle errors by returning null. The data types tested include integer types, decimal types, date and time types, string types, binary types, and JSON. The primary goal of this change is to improve the management and identification of data types in the codebase, as well as to enhance code readability and maintainability.
Dependency updates:
- Bump sqlglot from 25.1.0 to 25.5.1 (#534).
- Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#555).
- Bump sqlglot from 25.5.1 to 25.6.1 (#585).
- Added Oracle ojdbc8 dependent library during reconcile Installation (#474). In this release, the `deployment.py` file in the `databricks/labs/remorph/helpers` directory has been updated to add the `ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function, enabling the reconciliation process to access the Oracle data source and pull data for reconciliation between Oracle and Databricks. The `JDBCReaderMixin` class in the `jdbc_reader.py` file has also been updated to include the Oracle ojdbc8 dependent library for reconciliation during the `reconcile` process. This involves installing the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent library and updating the driver class to `oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary `driver_class` has been added, which maps the driver name to the corresponding class name, allowing for dynamic driver class selection during the `_get_jdbc_reader` method call. The `test_read_data_with_options` unit test has been updated to test the Oracle connector for reading data with specific options, including the use of the correct driver class and specifying the database table for data retrieval, improving the accuracy and reliability of the reconciliation process.
- Added TSQL coverage tests in the generated report artifact (#452). In this release, we have added new TSQL coverage tests and Snowflake coverage tests to the generated report artifact in the CI/CD pipeline. These tests are executed using Maven with the updated commands `mvn --update-snapshots -B test -pl coverage --file pom.xml --fail-at-end` and `mvn --update-snapshots -B exec:java -pl coverage --file pom.xml --fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o coverage-result.json"` respectively, and the `continue-on-error: true` option is added to allow the pipeline to proceed even if the tests fail. Additionally, we have introduced a new constructor to the `CommentBasedQueryExtractor` class, which accepts a `dialect` parameter and allows for easier configuration of the start and end comments for different SQL dialects. We have also updated the `CommentBasedQueryExtractor` for Snowflake and added two TSQL coverage tests to the generated report artifact to ensure that the `QueryExtractor` is working correctly for TSQL queries. These changes will help ensure thorough testing and identification of TSQL and Snowflake queries during the CI/CD process.
- Added full support for analytical windowing functions (#401). In this release, full support for analytical windowing functions has been implemented, addressing issue #401. The functions were previously specified in the parser grammar but have been moved to the standard function lookup table for more consistent handling. This enhancement allows for the use of analytical aggregate functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN GROUP` syntax and an `OVER` clause, enabling more complex queries and data analysis. The `FixedArity` and `VariableArity` classes have been updated with new methods for the supported functions, and appropriate examples have been provided to demonstrate their usage in SQL.
- Added parsing for STRPOS in presto (#462). A new feature has been added to the remorph/snow package's presto module to parse the STRPOS function in SQL code. This has been achieved by importing the `locate_to_strposition` function from `sqlglot.dialects.dialect` and incorporating it into the FUNCTIONS dictionary in the Parser class. This change enables the parsing of the STRPOS function, which returns the position of the first occurrence of a substring in a string. The implementation has been tested with a SQL file containing two queries for Presto SQL using STRPOS and Databricks SQL using LOCATE, both aimed at finding the position of the letter `l` in the string 'Hello world', starting the search from the second position. This feature is particularly relevant for software engineers working on data processing and analytics projects involving both Presto and Databricks SQL, as it ensures compatibility and consistent behavior between the two for string manipulation functions. The commit is part of issue #462, and the diff provided includes a new SQL file with test cases for the STRPOS function in Presto and the LOCATE function in Databricks SQL. The test cases confirm if the `hello` string is present in the greeting_message column of the greetings_table. This feature allows users to utilize the STRPOS function in Presto to determine if a specific substring is present in a string.
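  The intended equivalence can be sketched in Python; note that the argument order differs between the two functions, while both return a 1-based position and 0 when the substring is absent:

  ```python
  def strpos(string: str, substring: str) -> int:
      """Presto-style STRPOS(string, substring)."""
      return string.find(substring) + 1

  def locate(substring: str, string: str, pos: int = 1) -> int:
      """Databricks-style LOCATE(substr, str[, pos])."""
      return string.find(substring, pos - 1) + 1

  greeting = "Hello world"
  assert strpos(greeting, "l") == locate("l", greeting) == 3
  assert strpos(greeting, "z") == locate("z", greeting) == 0
  ```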
- Added validation for join columns for all query builders and limiting rows for reports (#413). In this release, we've added validation for join columns in all query builders, ensuring consistent and accurate data joins. A limit on the number of rows displayed for reports has been implemented with a default of 50. The `compare.py` and `execute.py` files have been updated to include validation, and the `QueryBuilder` and `HashQueryBuilder` classes have new methods for validating join columns. The `SamplingQueryBuilder`, `ThresholdQueryBuilder`, and `recon_capture.py` files have similar updates for validation and limiting rows for reports. The `recon_config.py` file now has a new return type for the `get_join_columns` method, and a new method `test_no_join_columns_raise_exception()` has been added in the `test_threshold_query.py` file. These changes aim to enhance data consistency, accuracy, and efficiency for software engineers.
- Adds more coverage tests for functions to TSQL coverage (#420). This commit adds new coverage tests for various TSQL functions, focusing on the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are identical in Databricks SQL. The tests include cases with and without the DISTINCT keyword to ensure consistent behavior between TSQL and Databricks. For the GROUPING and GROUPING_ID functions, which have some differences, tests and examples of TSQL and Databricks SQL code are provided. The CHECKSUM_AGG function, not directly supported in Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The CUME_DIST function, identical in both systems, is also tested. Additionally, a new test file for the STDEV function and updated tests for the VAR function are introduced, enhancing the reliability and robustness of TSQL conversions in the project.
- Catalog, Schema Permission checks (#492). This release introduces enhancements to the Catalog and Schema functionality, with the addition of permission checks that raise explicit `Permission Denied` exceptions. The logger messages have been updated for clarity and a new variable, README_RECON_REPO, has been created to reference the readme file for the recon_config repository. The ReconcileUtils class has been modified to handle scenarios where the recon_config file is not found or corrupted during loading, providing clear error messages and guidance for users. The unit tests for the install feature have been updated with permission checks for Catalog and Schema operations, ensuring robust handling of permission denied errors. These changes improve the system's error handling and provide clearer guidance for users encountering permission issues.
- Changing the secret name acc to install script (#432). In this release, the `recon` function in the `execute.py` file of the `databricks.labs.remorph.reconcile` package has been updated to dynamically generate the secret name instead of hardcoding it as "secret_scope". This change utilizes the new `get_key_form_dialect` function to create a secret name specific to the source dialect being used in the reconciliation process. The `get_dialect` function, along with `DatabaseConfig`, `TableRecon`, and the newly added `get_key_form_dialect`, have been imported from `databricks.labs.remorph.config`. This enhancement improves the security and flexibility of the reconciliation process by generating dynamic and dialect-specific secret names.
- Feature/recon documentation (#395). This commit introduces a new reconciliation process, enhancing data consistency between sources, co-authored by Ganesh Dogiparthi, ganeshdogiparthi-db, and SundarShankar89. The README.md file provides detailed documentation for the reconciliation process. A new binary file, docs/transpile-install.gif, offers installation instructions or visual aids, while a mermaid flowchart in `report_types_visualisation.md` illustrates report generation for data, rows, schema, and overall reconciliation. No existing functionality was modified, ensuring the addition of valuable features for software engineers adopting this project.
- Fixing issues in sample query builder to handle Null's and zero (#457). This commit introduces improvements to the sample query builder's handling of nulls and zeroes, addressing bug #450. The changes include updated SQL queries in the test threshold query file with COALESCE and TRIM functions to replace null values with a specified string, ensuring consistent comparison of datasets. The query store in test_execute.py has also been enhanced to handle NULL and zero values using COALESCE, improving overall robustness and consistency. Additionally, new methods such as build_join_clause, trim, and coalesce have been added to enhance null handling in the query builder. The commit also introduces the MockDataSource class, a likely test implementation of a data source, and updates the log_and_throw_exception function for clearer error messaging.
- Implement Lakeview Dashboard Publisher (#405). In this release, we've introduced the `DashboardPublisher` class in the `dashboard_publisher.py` module to streamline the process of creating and publishing dashboards in Databricks Workspace. This class simplifies dashboard creation by accepting an instance of `WorkspaceClient` and `Installation` and providing methods for creating and publishing dashboards with optional parameter substitution. Additionally, we've added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json', which contains a dashboard definition for a data reconciliation feature. This dashboard includes various widgets for filtering and displaying reconciliation results. We've also added a test file for the Lakeview Dashboard Publisher feature, which includes tests to ensure that the `DashboardPublisher` can create dashboards using specified file paths and parameters. These new features and enhancements are aimed at improving the user experience and streamlining the process of creating and publishing dashboards in Databricks Workspace.
- Integrate recon metadata reconcile cli (#444). A new CLI command, `databricks labs remorph reconcile`, has been added to initiate the Data Reconciliation process, loading `reconcile.yml` and `recon_config.json` configuration files from the Databricks Workspace. If these files are missing, the user is prompted to reinstall the `reconcile` module and the command exits. The command then triggers the `Remorph_Reconciliation_Job` based on the Job ID stored in the `reconcile.yml` file. This simplifies the reconcile execution process, requiring users to first configure the `reconcile` module and generate the `recon_config_<SOURCE>.json` file using the `databricks labs remorph install` and `databricks labs remorph generate-recon-config` commands. The new CLI command has been manually tested and includes unit tests. Integration tests and verification on the staging environment are pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi, and SundarShankar89.
- Introduce coverage tests (#382). This commit introduces coverage tests and updates the GitHub Actions workflow to use Java 11 with Corretto distribution, improving testing and coverage analysis for the project. Coverage tests are added as part of the remorph project with the introduction of a new module for coverage and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is modified to change the parent project version to 0.2.0-SNAPSHOT, ensuring accurate assessment and maintenance of code coverage during development. In addition, a new Main object within the com.databricks.labs.remorph.coverage package is implemented for running coverage tests using command-line arguments, along with the addition of a new file QueryRunner.scala and case classes for ReportEntryHeader, ReportEntryReport, and ReportEntry for capturing and reporting on the status and results of parsing and transpilation processes. The `Cache Maven packages` step is removed and replaced with two new steps: `Run Unit Tests with Maven` and `Run Coverage Tests with Maven`. The former executes unit tests and generates a test coverage report, while the latter downloads remorph-core jars as artifacts, executes coverage tests with Maven, and uploads coverage test results as JSON artifacts. The `coverage-tests` job runs after the `test-core` job and uses the same environment, checking out the code with full history, setting up Java 11 with Corretto distribution, downloading remorph-core-jars artifacts, and running coverage tests with Maven, even if there are errors. The JUnit report is also published, and the coverage test results are uploaded as JSON artifacts, providing better test coverage and more reliable code for software engineers adopting the project.
- Presto approx percentile func fix (#411). The remorph library has been updated to support the Presto database system, with a new module added to the config.py file to enable robust and maintainable interaction. An `APPROX_PERCENTILE` function has been implemented in the `presto.py` file of the `sqlglot.dialects.presto` package, allowing for approximate percentile calculations in Presto and Databricks SQL. A test file has been included for both SQL dialects, with queries calculating the approximate median of the height column in the people table. The new functionality enhances the compatibility and versatility of the remorph library in working with Presto databases and improves overall project functionality. Additionally, a new test file for Presto in the snowflakedriver project has been introduced to test expected exceptions, further ensuring robustness and reliability.
- Raise exception if reconciliation fails for any table (#412). In this release, we have implemented significant changes to improve exception handling and raise meaningful exceptions when reconciliation fails for any table in our open-source library. A new exception class, `ReconciliationException`, has been added as a child of the `Exception` class, which takes two optional parameters in its constructor, `message` and `reconcile_output`. The `ReconcileOutput` property has been created for accessing the reconcile output object. The `InvalidInputException` class now inherits from `ValueError`, making the code more explicit about the type of errors being handled. A new method, `_verify_successful_reconciliation`, has been introduced to check the reconciliation output status and raise a `ReconciliationException` if any table fails reconciliation. The `test_execute.py` file has been updated to raise a `ReconciliationException` if reconciliation for a specific report type fails, and new tests have been added to the test suite to ensure the correct behavior of the `reconcile` function with and without raising exceptions.
- Removed USE catalog/schema statement as lsql has added the feature (#465). In this release, the usage of `USE` statements for selecting a catalog and schema has been removed in the `get_sql_backend` function, thanks to the new feature provided by the lsql library. This enhancement improves code readability and maintainability, and enables better integration with the SQL backend. The commit also includes changes to the installation process for reconciliation metadata tables, providing more clarity and simplicity in the code. Additionally, several test functions have been added or modified to ensure the proper functioning of the `get_sql_backend` function in various scenarios, including cases where a warehouse ID is not provided or when executing SQL statements in a notebook environment. An error simulation test has also been added for handling `DatabricksError` exceptions when executing SQL statements using the `DatabricksConnectBackend` class.
- Sampling with clause query to have `from dual` in from clause for oracle source (#464). In this release, we've added the `get_key_from_dialect` function, replacing the previous `get_key_form_dialect` function, to retrieve the key associated with a given dialect object, serving as a unique identifier for the dialect. This improvement enhances the flexibility and readability of the codebase, making it easier to locate and manipulate dialect objects. Additionally, we've modified the `sampling_query.py` file to include `from dual` in the `from` clause for Oracle sources in a sampling query with a clause, enabling sampling from Oracle databases. The `_insert_into_main_table` method in the `recon_capture.py` file of the `databricks.labs.remorph.reconcile` module has been updated to ensure accurate key retrieval for the specified dialect, thereby improving the reconciliation process. These changes resolve issues #458 and #464, enhancing the functionality of the sampling query builder and providing better support for various databases.
- Support function translation to Databricks SQL in TSql and Snowflake (#414). This commit introduces a dialect-aware FunctionBuilder system and a ConversionStrategy system to enable seamless translation of SQL functions between TSQL, Snowflake, and Databricks SQL IR. The new FunctionBuilder system can handle both simple name translations and more complex conversions when there is no direct equivalent. For instance, TSQL's ISNULL function translates to IFNULL in Databricks SQL, while Snowflake's ISNULL remains unchanged. The commit also includes updates to the TSqlExpressionBuilder and new methods for building and visiting various contexts, enhancing compatibility and expanding the range of supported SQL dialects. Additionally, new tests have been added in the FunctionBuilderSpec to ensure the correct arity and function type for various SQL functions.
- TSQL: Create coverage tests for TSQL -> Databricks functions (#415). This commit introduces coverage tests for T-SQL functions and their equivalent Databricks SQL implementations, focusing on the DATEADD function's `yy` keyword. The DATEADD function is translated to the ADD_MONTHS function in Databricks SQL, with the number of months multiplied by 12. This ensures functional equivalence between T-SQL and Databricks SQL for date addition involving years. The tests are written as SQL scripts and are located in the `tests/resources/functional/tsql/functions` directory, covering various scenarios and possible engine differences between T-SQL and Databricks SQL. The conversion process is documented, and future automation of this documentation is considered.
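  A minimal Python sketch of the year-to-month rewrite described above (illustrative only; the real translation is performed by the transpiler, not by this helper):

  ```python
  def dateadd_yy_to_databricks(years: int, date_expr: str) -> str:
      """Rewrite T-SQL DATEADD(yy, years, date) as Databricks ADD_MONTHS(date, years * 12)."""
      return f"ADD_MONTHS({date_expr}, {years * 12})"

  # T-SQL:      DATEADD(yy, 2, col1)
  # Databricks: ADD_MONTHS(col1, 24)
  print(dateadd_yy_to_databricks(2, "col1"))
  ```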
- TSQL: Implement WITH CTE (#443). With this commit, we have extended the TSQL functionality by adding support for Common Table Expressions (CTEs). CTEs are temporary result sets that can be defined within a single execution of a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement, allowing for more complex and efficient queries. The implementation includes the ability to create a CTE with an optional name and a column list, followed by a SELECT statement that defines the CTE. CTEs can be self-referential and can be used to simplify complex queries, improving code readability and performance. This feature is particularly useful for cases where multiple queries rely on the same intermediate result set, as it enables reusing the results without having to repeat the query.
- TSQL: Implement functions with specialized syntax (#430). This commit introduces new data type conversion functions and JSON manipulation capabilities to T-SQL, addressing issue #430. The newly implemented features include `NEXT VALUE FOR sequence`, `CAST(col TO sometype)`, `TRY_CAST(col TO sometype)`, `JSON_ARRAY`, and `JSON_OBJECT`. These functions support specialized syntax for handling data type conversions and JSON operations, including NULL value handling using `NULL ON NULL` and `ABSENT ON NULL` syntax. The `TSqlFunctionBuilder` class has been updated to accommodate these changes, and new test cases have been added to the `TSqlFunctionSpec` test class in Scala. This enhancement enables SQL-based querying and data manipulation with increased functionality for T-SQL parser and function evaluations.
- TSQL: Support DISTINCT in SELECT list and aggregate functions (#400). This commit adds support for the `DISTINCT` keyword in T-SQL for use in the `SELECT` list and aggregate functions such as `COUNT`. When used in the `SELECT` list, `DISTINCT` ensures unique values of the specified expression are returned, and in aggregate functions like `COUNT`, it considers only distinct values of the specified argument. This change aligns with the SQL standard and enhances the functionality of the T-SQL parser, providing developers with greater flexibility and control when using `DISTINCT` in complex queries and aggregate functions. The default behavior in SQL, `ALL`, remains unchanged, and the parser has been updated to accommodate these improvements.
- TSQL: Update the SELECT statement to support XML workspaces (#451). This release introduces updates to the TSQL Select statement grammar to correctly support XMLWORKSPACES in accordance with the latest specification. Although Databricks SQL does not currently support XMLWORKSPACES, this change is a syntax-only update to enable compatibility with other platforms that do support it. Newly added components include 'xmlNamespaces', 'xmlDeclaration', 'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection', 'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives', and 'xmlColumnDefinition'. These additions enable the creation, configuration, and usage of XML schemas and indexes, as well as the specification of XML namespaces and directives. A new test file for functional tests has been included to demonstrate the use of XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While this update does not affect the existing codebase's functionality, it does enable support for XMLWORKSPACES syntax in TSQL, facilitating easier integration with other platforms that support it. Please note that Databricks SQL does not currently support XML workspaces.
- Test merge queue (#424). In this release, the Scalafmt configuration has been updated to version 3.8.0, with changes to the formatting of Scala code. The `danglingParentheses` preset option has been set to `false`, removing dangling parentheses from the code. Additionally, the `configStyleArguments` option has been set to `false` under `optIn`. These modifications to the configuration file are likely to affect the formatting and style of the Scala code in the project, ensuring consistent and organized code. This change aims to enhance the readability and maintainability of the codebase.
- Updated bug and feature yml to support reconcile (#390). The open-source library has been updated to improve issue and feature categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new options for TranspileParserError, TranspileValidationError, and TranspileLateralColumnAliasError have been added to the `label: Category of Bug / Issue` field. Additionally, a new option for ReconcileError has been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE` directory has also been updated, introducing a required dropdown menu labeled "Category of feature request." This dropdown offers options for Transpile, Reconcile, and Other categories, ensuring accurate classification and organization of incoming feature requests. The modifications aim to enhance clarity for maintainers in reviewing and prioritizing issue resolutions and feature implementations related to reconciliation functionality.
- Updated the documentation with json config examples (#486). In this release, the Remorph Reconciliation tool on Databricks has been updated to include JSON config examples for various config elements such as jdbc_reader_options, column_mapping, transformations, thresholds, and filters. These config elements enable users to define source and target data, join columns, JDBC reader options, select and drop columns, column mappings, transformations, thresholds, and filters. The update also provides examples in both Python and JSON formats, as well as instructions for installing the necessary Oracle JDBC library on a Databricks cluster. This update enhances the tool's functionality, making it easier for software engineers to reconcile source data with target data on Databricks.
- Updated uninstall flow (#476). In this release, the `uninstall` functionality of the `databricks labs remorph` tool has been updated to align with the latest changes made to the `install` refactoring. The `uninstall` flow now utilizes a new `MockInstallation` class, which handles the uninstallation process and takes a dictionary of configuration files and their corresponding contents as input. The `uninstall` function has been modified to return `False` in two cases, either when there is no remorph directory or when the user decides not to uninstall. A `MockInstallation` object is created for the reconcile.yml file, and appropriate exceptions are raised in the aforementioned cases. The `uninstall` function now uses a `WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on the input arguments, to handle the uninstallation process. Additionally, the `MockPrompts` class is used to prompt the user for confirmation before uninstalling remorph.
- Updates to developer documentation and add grammar formatting to maven (#490). The developer documentation has been updated to include grammar formatting instructions and support for dialects other than Snowflake. The Maven build cycle has been modified to format grammars before ANTLR processes them, enhancing readability and easing conflict resolution during maintenance. The TSqlLexer.g4 file has been updated with formatting instructions and added dialect recognition. These changes ensure that grammars are consistently formatted and easily resolvable during merges. Engineers adopting this project should reformat the grammar file before each commit, following the provided formatting instructions and reference link. Grammar modifications in the TSqlParser.g4 file, such as alterations in partitionFunction and freetextFunction rules, improve structure and readability.
- Upgrade sqlglot from 23.13.7 to 25.1.0 (#473). In the latest release, the sqlglot package has been upgraded from version 23.13.7 to 25.1.0, offering potential new features, bug fixes, and performance improvements for SQL processing. The package dependency for numpy has been updated to version 1.26.4, which may introduce new functionality, improve existing features, or fix numpy integration issues. Furthermore, the addition of the types-pytz package as a dependency provides type hints for pytz, enhancing codebase type checking and static analysis capabilities. Specific modifications to the test_sql_transpiler.py file include updating the expected result in the test_parse_query function and removing unnecessary whitespaces in the transpiled_sql assertion in the test_procedure_conversion function. Although the find_root_tables function remains unchanged, the upgrade to sqlglot promises overall functionality enhancements, which software engineers can leverage in their projects.
- Use default_factory in recon_config.py (#431). In this release, the default value handling for the `status` field in the `DataReconcileOutput` and `ReconcileTableOutput` classes has been improved to comply with Python 3.11. Previously, a mutable default value was used, causing a `ValueError` issue. This has been addressed by implementing the `default_factory` argument in the `field` function to ensure a new instance of `StatusOutput` is created for each class. Additionally, the `MismatchOutput` and `ThresholdOutput` classes now also utilize `default_factory` for consistent and robust default value handling, enhancing the overall code quality and preventing potential issues arising from mutable default values.
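  A minimal reproduction of the underlying Python behaviour and the fix (class and field names are simplified, not the exact library definitions):

  ```python
  from dataclasses import dataclass, field

  @dataclass
  class StatusOutput:
      row: bool = False
      column: bool = False
      schema: bool = False

  # A mutable default such as `status: StatusOutput = StatusOutput()` raises
  # "ValueError: mutable default ... use default_factory" on Python 3.11,
  # because instances of a regular dataclass are unhashable. default_factory
  # builds a fresh StatusOutput per instance instead:
  @dataclass
  class ReconcileTableOutput:
      target_table_name: str = ""
      status: StatusOutput = field(default_factory=StatusOutput)

  a, b = ReconcileTableOutput(), ReconcileTableOutput()
  assert a.status is not b.status  # each instance gets its own StatusOutput
  ```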
- edit distance (#501). In this release, we have implemented an `edit distance` feature for calculating the difference between two strings using the LEVENSHTEIN function. This has been achieved by adding a new method, `anonymous_sql`, to the `Generator` class in the `databricks.py` file. The method takes expressions of the `Anonymous` type as arguments and calls the `LEVENSHTEIN` function if the `this` attribute of the expression is equal to "EDITDISTANCE". Additionally, a new test file has been introduced for the anonymous user in the functional snowflake test suite to ensure the accurate calculation of string similarity using the EDITDISTANCE function. This change includes examples of using the EDITDISTANCE function with different parameters and compares it with the LEVENSHTEIN function available in Databricks. It addresses issue #500, which was related to testing the edit distance functionality.
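  For reference, the quantity both functions compute is the classic Levenshtein edit distance; a small, self-contained Python implementation (not the project's code) shows what is being compared:

  ```python
  def levenshtein(s: str, t: str) -> int:
      """Minimum number of single-character insertions, deletions, and
      substitutions needed to turn s into t (classic dynamic programming)."""
      prev = list(range(len(t) + 1))
      for i, cs in enumerate(s, start=1):
          curr = [i]
          for j, ct in enumerate(t, start=1):
              cost = 0 if cs == ct else 1
              curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
          prev = curr
      return prev[-1]

  # Snowflake EDITDISTANCE('kitten', 'sitting') and Databricks
  # LEVENSHTEIN('kitten', 'sitting') both evaluate to 3.
  assert levenshtein("kitten", "sitting") == 3
  ```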
- Capture Reconcile metadata in delta tables for dashboards (#369). In this release, changes have been made to improve version control management, reduce repository size, and enhance build times. A new directory, "spark-warehouse/", has been added to the Git ignore file to prevent unnecessary files from being tracked and included in the project. The `WriteToTableException` class has been added to the `exception.py` file to raise an error when a runtime exception occurs while writing data to a table. A new `ReconCapture` class has been implemented in the `reconcile` package to capture and persist reconciliation metadata in delta tables. The `recon` function has been updated to initialize this new class, passing in the required parameters. Additionally, a new file, `recon_capture.py`, has been added to the reconcile package, which implements the `ReconCapture` class responsible for capturing metadata related to data reconciliation. The `recon_config.py` file has been modified to introduce a new class, `ReconcileProcessDuration`, and restructure the classes `ReconcileOutput`, `MismatchOutput`, and `ThresholdOutput`. The commit also captures reconcile metadata in delta tables for dashboards in the context of unit tests in the `test_execute.py` file and includes a new file, `test_recon_capture.py`, to test the reconcile capture functionality of the `ReconCapture` class.
- Expand translation of Snowflake `expr` (#351). In this release, the translation of the `expr` category in the Snowflake language has been significantly expanded, addressing uncovered grammar areas, incorrect interpretations, and duplicates. The `subquery` is now excluded as a valid `expr`, and new case classes such as `NextValue`, `ArrayAccess`, `JsonAccess`, `Collate`, and `Iff` have been added to the `Expression` class. These changes improve the comprehensiveness and accuracy of the Snowflake parser, allowing for a more flexible and accurate translation of various operations. Additionally, the `SnowflakeExpressionBuilder` class has been updated to handle previously unsupported cases, enhancing the parser's ability to parse Snowflake SQL expressions.
- Fixed Oracle missing datatypes (#333). In the latest release, the Oracle class of the Tokenizer in the open-source library has undergone a fix to address missing datatypes. Previously, the KEYWORDS mapping did not require Tokens for keys, which led to unsupported Oracle datatypes. This issue has been resolved by modifying the test_schema_compare.py file to ensure that all Oracle datatypes, including LONG, NCLOB, ROWID, UROWID, ANYTYPE, ANYDATA, ANYDATASET, XMLTYPE, SDO_GEOMETRY, SDO_TOPO_GEOMETRY, and SDO_GEORASTER, are now mapped to the TEXT TokenType. This improvement enhances the compatibility of the code with Oracle datatypes and increases the reliability of the schema comparison functionality, as demonstrated by the test function test_schema_compare, which now returns is_valid as True and a count of 0 for is_valid = `false` in the resulting dataframe.
- Fixed the recon_config functions to handle null values (#399). In this release, the recon_config functions have been enhanced to manage null values and provide more flexible column mapping for reconciliation purposes. A `__post_init__` method has been added to certain classes to convert specified attributes to lowercase and handle null values. A new helper method, `_get_is_string`, has been introduced to determine if a column is of string type. Additionally, new functions such as `get_tgt_to_src_col_mapping_list`, `get_layer_tgt_to_src_col_mapping`, `get_src_to_tgt_col_mapping_list`, and `get_layer_src_to_tgt_col_mapping` have been added to retrieve column mappings, enhancing the overall functionality and robustness of the reconciliation process. These improvements will benefit software engineers by ensuring more accurate and reliable configuration handling, as well as providing more flexibility in mapping source and target columns during reconciliation.
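  A hypothetical Python sketch of the `__post_init__` normalisation pattern described above (class and field names are illustrative, not the library's actual dataclasses):

  ```python
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class ColumnMapping:
      source_name: str
      target_name: Optional[str] = None

      def __post_init__(self):
          # Normalise to lowercase and fall back to the source name when the
          # target side is not provided (i.e. the value is null in the config).
          self.source_name = self.source_name.lower()
          self.target_name = (self.target_name or self.source_name).lower()

  m = ColumnMapping(source_name="CUSTOMER_ID")
  assert (m.source_name, m.target_name) == ("customer_id", "customer_id")
  ```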
- Improve Exception handling (#392). The commit titled `Improve Exception Handling` enhances error handling in the project, addressing issues #388 and #392. Changes include refactoring the `create_adapter` method in the `DataSourceAdapter` class, updating method arguments in test functions, and adding new methods in the `test_execute.py` file for better test doubles. The `DataSourceAdapter` class is replaced with the `create_adapter` function, which takes the same arguments and returns an instance of the appropriate `DataSource` subclass based on the provided `engine` parameter. The diff also modifies the behavior of certain test methods to raise more specific and accurate exceptions. Overall, these changes improve exception handling, streamline the codebase, and provide clearer error messages for software engineers.
- Introduced morph_sql and morph_column_expr functions for inline transpilation and validation (#328). Two new classes, `TranspilationResult` and `ValidationResult`, have been added to the config module of the remorph package to store the results of transpilation and validation. The `morph_sql` and `morph_column_exp` functions have been introduced to support inline transpilation and validation of SQL code and column expressions. A new class, `Validator`, has been added to the validation module to handle validation, and the `validate_format_result` method within this class has been updated to return a `ValidationResult` object. The `_query` method has also been added to the class; it executes a given SQL query and returns a tuple containing a boolean indicating success, any exception message, and the result of the query. Unit tests for these new functions have been updated to ensure proper functionality; a hedged sketch of the `_query` return shape follows below.
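
  A minimal sketch of the `_query` tuple contract described above; the class body and the `fetch_one` helper are assumptions, not the project's implementation:

  ```python
  class Validator:
      def __init__(self, connection):
          # Assumed: any object exposing a fetch_one(sql) method.
          self._connection = connection

      def _query(self, sql: str) -> tuple[bool, str | None, object | None]:
          # Execute the query and report (success flag, exception message, result).
          try:
              result = self._connection.fetch_one(sql)
              return True, None, result
          except Exception as exc:  # the real code would catch narrower exceptions
              return False, str(exc), None
  ```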
- Output for the reconcile function (#389). A new function `get_key_form_dialect` has been added to the `config.py` module, which takes a `Dialect` object and returns the corresponding key used in the `SQLGLOT_DIALECTS` dictionary. Additionally, the `MorphConfig` dataclass has been updated to include a new attribute `__file__`, which sets the filename to "config.yml". The `get_dialect` function remains unchanged. Two new exceptions, `WriteToTableException` and `InvalidInputException`, have been introduced, and the existing `DataSourceRuntimeException` has been modified in the same module to improve error handling. The `execute.py` file's reconcile function has undergone several changes, including adding imports for `InvalidInputException`, `ReconCapture`, and `generate_final_reconcile_output` from the `recon_exception` and `recon_capture` modules, and modifying the `ReconcileOutput` type. The `hash_query.py` file's reconcile function has been updated to include a new `_get_with_clause` method, which returns a `Select` object for a given DataFrame, and the `build_query` method has been updated to include a new query construction step using the `with_clause` object. The `threshold_query.py` file's reconcile function's output has been updated to include query and logger statements, a new method for allowing user transformations on threshold aliases, and the dialect specified in the `sql` method. A new `generate_final_reconcile_output` function has been added to the `recon_capture.py` file, which generates a reconcile output given a recon_id and a SparkSession. New classes and dataclasses, including `SchemaReconcileOutput`, `ReconcileProcessDuration`, `StatusOutput`, `ReconcileTableOutput`, and `ReconcileOutput`, have been introduced in the `reconcile/recon_config.py` file. The `tests/unit/reconcile/test_execute.py` file has been updated to include new test cases for the `recon` function, including tests for different report types and scenarios, such as data, schema, and all report types, exceptions, and incorrect report types. A new test case, `test_initialise_data_source`, has been added to test the `initialise_data_source` function, and the `test_recon_for_wrong_report_type` test case has been updated to expect an `InvalidInputException` when an incorrect report type is passed to the `recon` function. The `test_reconcile_data_with_threshold_and_row_report_type` test case has been added to test the `reconcile_data` method of the `Reconciliation` class with a row report type and threshold options. Overall, these changes improve the functionality and robustness of the reconcile process by providing more fine-grained control over the generation of the final reconcile output and better handling of exceptions and errors; a hedged sketch of the dialect-key lookup follows below.
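
  A minimal sketch of the reverse lookup described for `get_key_form_dialect`; the dictionary contents are illustrative, not the project's full registry:

  ```python
  from sqlglot.dialects.databricks import Databricks
  from sqlglot.dialects.oracle import Oracle
  from sqlglot.dialects.snowflake import Snowflake

  # Illustrative subset of a dialect registry keyed by source name.
  SQLGLOT_DIALECTS = {
      "databricks": Databricks,
      "oracle": Oracle,
      "snowflake": Snowflake,
  }


  def get_key_form_dialect(dialect) -> str:
      # Reverse lookup: return the registry key that maps to the given dialect.
      return next(key for key, value in SQLGLOT_DIALECTS.items() if value == dialect)
  ```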
- Threshold Source and Target query builder (#348). In this release, we've introduced a new method, `build_threshold_query`, that constructs a customizable threshold query based on a table's partition, join, and threshold columns configuration. The method identifies the necessary columns, applies specified transformations, and includes a WHERE clause based on the filter defined in the table configuration. The resulting query is then converted to a SQL string using the dialect of the source database. Additionally, we've updated the test file for the threshold query builder in the reconcile package, including refactoring of function names and updated assertions for query comparison. We've added two new test methods: `test_build_threshold_query_with_single_threshold` and `test_build_threshold_query_with_multiple_thresholds`. These changes enhance the library's functionality, providing a more robust and customizable threshold query builder, and improve test coverage for various configurations and scenarios.
- Unpack nested alias (#336). This release introduces a significant update to the `lca_utils.py` file, addressing the limitation of not handling nested aliases in window expressions and where clauses, which resolves issue #334. The `unalias_lca_in_select` method has been implemented to recursively parse nested selects and unalias lateral column aliases, thereby identifying and handling unsupported lateral column aliases. This method is utilized in the `check_for_unsupported_lca` method to handle unsupported lateral column aliases in the input SQL string. Furthermore, the `test_lca_utils.py` file has undergone changes, impacting several test functions and introducing two new ones, `test_fix_nested_lca` and `test_fix_nested_lca_with_no_scope`, to ensure the code's reliability and accuracy. These updates demonstrate our commitment to improving the library's functionality and test coverage.
- Added `Configure Secrets` support to the `databricks labs remorph configure-secrets` CLI command (#254). The `Configure Secrets` feature has been implemented in the `databricks labs remorph` CLI, specifically for the new `configure-secrets` command. This addition allows users to establish Scope and Secrets within their Databricks Workspace, enhancing security and control over resource access. The implementation includes a new `recon_config_utils.py` file in the `databricks/labs/remorph/helpers` directory, which contains classes and methods for managing Databricks Workspace secrets. Furthermore, the `ReconConfigPrompts` helper class has been updated to handle prompts for selecting sources, entering secret scope names, and handling overwrites. The CLI command has also been updated with a new `configure_secrets` function and corresponding tests to ensure correct functionality.
- Added handling for invalid alias usage by manipulating the AST (#219). The recent commit addresses the issue of invalid alias usage in SQL queries by manipulating the Abstract Syntax Tree (AST). It introduces a new method, `unalias_lca_in_select`, which unaliases Lateral Column Aliases (LCA) in the SELECT clause of a query. The `AliasInfo` class is added to manage aliases more effectively, with attributes for the name, expression, and a flag indicating if the alias name is the same as a column. Additionally, the `execute.py` file is modified to check for unsupported LCA using the `lca_utils.check_for_unsupported_lca` method, improving the system's robustness when handling invalid aliases. Test cases are also added in the new file, `test_lca_utils.py`, to validate the behavior of the `check_for_unsupported_lca` function, ensuring that SQL queries are correctly formatted for the Snowflake dialect and avoiding errors due to invalid alias usage.
- Added support for the `databricks labs remorph generate-lineage` CLI command (#238). A new CLI command, `databricks labs remorph generate-lineage`, has been added to generate lineage for input SQL files, taking the source dialect, input, and output directories as arguments. The command uses existing logic to generate a directed acyclic graph (DAG) and then creates a DOT file in the output directory using the DAG. The new command is supported by the new functions `_generate_dot_file_contents` and `lineage_generator`, and by methods in the `RootTableIdentifier` and `DAG` classes. The command has been manually tested and includes unit tests, with plans for adding integration tests in the future. The commit also includes a new method `temp_dirs_for_lineage` and updates to the `configure_secrets_databricks` method to handle a new source type, "databricks". The command handles invalid input and raises appropriate exceptions.
- Custom Oracle tokenizer (#316). In this release, the remorph library has been updated to enhance its handling of Oracle databases. A custom Oracle tokenizer has been developed to map the `LONG` datatype to text (string) in the tokenizer, allowing for more precise parsing and manipulation of `LONG` columns in Oracle databases. The Oracle dialect in the configuration file has also been updated to use the new custom Oracle tokenizer. Additionally, the Oracle class from the snow module has been imported and integrated into the Oracle dialect. These improvements enable the remorph library to manage Oracle databases more efficiently, with a particular focus on the handling of the `LONG` datatype. The commit also includes updates to test files in the `functional/oracle/test_long_datatype` directory, which ensure the proper conversion of the `LONG` datatype to text, and a new test file in the `tests/unit/snow` directory, which checks compatibility with Oracle's long data type. These changes enhance the library's compatibility with Oracle databases, ensuring accurate handling and manipulation of the `LONG` datatype in Oracle SQL and Databricks SQL; a hedged tokenizer sketch follows below.
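
  A minimal sketch of a tokenizer override along the lines described above, built on sqlglot's public classes; the `CustomOracle` name is illustrative and not the project's class:

  ```python
  from sqlglot.dialects.oracle import Oracle
  from sqlglot.tokens import TokenType


  class CustomOracle(Oracle):
      class Tokenizer(Oracle.Tokenizer):
          # Map Oracle's LONG datatype to a text token so downstream parsing
          # treats LONG columns as strings.
          KEYWORDS = {
              **Oracle.Tokenizer.KEYWORDS,
              "LONG": TokenType.TEXT,
          }
  ```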
- Removed strict source dialect checks (#284). In the latest release, the `transpile` and `generate_lineage` functions in `cli.py` have undergone changes to allow for greater flexibility in source dialect selection. Previously, only the `snowflake` or `tsql` dialects were supported, but now any source dialect supported by sqlglot can be used, controlled by the `SQLGLOT_DIALECTS` dictionary. Providing an unsupported source dialect will result in a validation error. Additionally, the input and output folder paths for the `generate_lineage` function are now validated against the file system to ensure their existence and validity. In the `install.py` file of the `databricks/labs/remorph` package, the source dialect selection has been updated to use `SQLGLOT_DIALECTS.keys()`, replacing the previous hardcoded list. This change allows for more flexibility in selecting the source dialect. Furthermore, recent updates to various test functions in the `test_install.py` file suggest that the source selection process has been modified, possibly indicating the addition of new sources or a change in source identification. These modifications provide greater flexibility in testing and potentially in the actual application.
- Set Catalog, Schema from default Config (#312). A new feature has been added to our open-source library that allows users to specify the `catalog` and `schema` configuration options as part of the `transpile` command-line interface (CLI). If these options are not provided, the `transpile` function in the `cli.py` file will now set them to the values specified in `default_config`. This ensures that a default catalog and schema are used if they are not explicitly set by the user. The `labs.yml` file has been updated to reflect these changes, with the addition of the `catalog-name` and `schema-name` options to the `commands` object. The `default` property of the `validation` object has also been updated to `true`, indicating that the validation step will be skipped by default. These changes provide increased flexibility and ease of use for the `transpile` functionality; a hedged sketch of the fallback follows below.
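
  A minimal sketch of the fallback described above; the attribute names on `default_config` are assumptions, not the project's actual fields:

  ```python
  def resolve_catalog_and_schema(catalog, schema, default_config):
      # Fall back to the configured defaults when the CLI options are omitted.
      resolved_catalog = catalog or default_config.catalog_name  # assumed attribute
      resolved_schema = schema or default_config.schema_name  # assumed attribute
      return resolved_catalog, resolved_schema
  ```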
- Support for Null safe equality join for Databricks generator (#280). In this release, we have implemented support for a null-safe equality join in the Databricks generator, addressing issue #280. This feature emits the `<=>` operator in the generated SQL code instead of the `IS NOT DISTINCT FROM` syntax to ensure accurate comparisons when NULL values are present in the columns being joined. The Generator class has been updated with a new method, `NullSafeEQ`, which takes in an expression and returns the binary version of the expression using the `<=>` operator. The `preprocess` method in the Generator class has also been modified to include this new functionality. Note that this change may require users to update existing code to align with the new syntax in the Databricks environment. With this enhancement, the Databricks generator can perform null-safe equality joins, producing consistent results regardless of the presence of NULL values in the join conditions; a hedged generator sketch follows below.
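
  A minimal sketch of how such a rendering override can look on top of sqlglot; the `NullSafeDatabricks` class is illustrative and not the project's generator:

  ```python
  from sqlglot import exp
  from sqlglot.dialects.databricks import Databricks


  class NullSafeDatabricks(Databricks):
      class Generator(Databricks.Generator):
          def nullsafeeq_sql(self, expression: exp.NullSafeEQ) -> str:
              # Render null-safe equality with the "<=>" operator instead of
              # IS NOT DISTINCT FROM.
              return self.binary(expression, "<=>")
  ```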
- Added serverless validation using the lsql library (#176). A `WorkspaceClient` object is now used with the `product` name and `product_version`, along with the corresponding `cluster_id` or `warehouse_id`, as `sdk_config` in the `MorphConfig` object.
- Enhanced install script to enforce usage of a warehouse or cluster when `skip-validation` is set to `False` (#213). In this release, the installation process has been enhanced to mandate the use of a warehouse or cluster when the `skip-validation` parameter is set to `False`. This change has been implemented across various components, including the install script, the `transpile` function, and the `get_sql_backend` function. Additionally, new pytest fixtures and methods have been added to improve test configuration and resource management during testing. Unit tests have been updated to enforce usage of a warehouse or cluster when the `skip-validation` flag is set to `False`, ensuring proper resource allocation and improving the validation process. This development promotes a proper setup and usage of the system, guiding new users towards a correct configuration and improving the overall reliability of the tool.
- Patch subquery with json column access (#190). The open-source library has been updated with new functionality to modify how subqueries with JSON column access are handled in the `snowflake.py` file. This change includes the addition of a check for an opening parenthesis after the `FROM` keyword to detect and break loops when a subquery is found, as opposed to a table name. This improvement enhances the handling of complex subqueries and JSON column access, making the code more robust and adaptable to different query structures. Additionally, a new test method, `test_nested_query_with_json`, has been introduced in the `tests/unit/snow/test_databricks.py` file to test the behavior of nested queries involving JSON column access when using a Snowflake dialect. This new method validates the expected output of a specific nested query when it is transpiled to Snowflake's SQL dialect, allowing for more comprehensive testing of JSON column access and type casting in Snowflake dialects. The existing `test_delete_from_keyword` method remains unchanged.
- Snowflake `UPDATE FROM` to Databricks `MERGE INTO` implementation (#198).
- Use Runtime SQL backend in Notebooks (#211). In this update, the `db_sql.py` file in the `databricks/labs/remorph/helpers` directory has been modified to support the use of the Runtime SQL backend in notebooks. This change includes the addition of a new `RuntimeBackend` class in the `backends` module and an import statement for `os`. The `get_sql_backend` function now returns a `RuntimeBackend` instance when the `DATABRICKS_RUNTIME_VERSION` environment variable is present, allowing for more efficient and secure SQL statement execution in Databricks notebooks. Additionally, a new test case for the `get_sql_backend` function has been added to ensure the correct behavior of the function in various runtime environments. These enhancements improve SQL execution performance and security in Databricks notebooks and increase the project's versatility for different use cases; a hedged sketch of the backend selection follows below.
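
  A minimal sketch of the selection logic described above; the `StatementExecutionBackend` fallback and its arguments are assumptions about the non-notebook path:

  ```python
  import os

  from databricks.labs.lsql.backends import RuntimeBackend, StatementExecutionBackend


  def get_sql_backend(ws, warehouse_id):
      # Inside a Databricks notebook the runtime sets DATABRICKS_RUNTIME_VERSION,
      # so SQL can be executed directly on the attached cluster.
      if "DATABRICKS_RUNTIME_VERSION" in os.environ:
          return RuntimeBackend()
      # Otherwise fall back to executing statements over a SQL warehouse.
      return StatementExecutionBackend(ws, warehouse_id)
  ```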
- Added Issue Templates for bugs, feature and config (#194). Two new issue templates have been added to the project's GitHub repository to improve issue creation and management. The first template, located at `.github/ISSUE_TEMPLATE/bug.yml`, is for reporting bugs and prompts users to provide detailed information about the issue, including the current and expected behavior, steps to reproduce, relevant log output, and a sample query. The second template, added under the path `.github/ISSUE_TEMPLATE/config.yml`, is for configuration-related issues and includes support contact links for general Databricks questions and Remorph documentation, as well as fields for specifying the operating system and software version. A new issue template for feature requests, named "Feature Request", has also been added, providing a structured format for users to submit requests for new functionality for the Remorph project. These templates will help streamline the issue creation process, improve the quality of information provided, and make it easier for the development team to quickly identify and address bugs and feature requests.
- Added Databricks Source Adapter (#185). In this release, the project has been enhanced with several new features for the Databricks Source Adapter. A new `engine` parameter has been added to the `DataSource` class, replacing the original `source` parameter. The `_get_secrets` and `_get_table_or_query` methods have been updated to use the `engine` parameter for key naming and to handle queries with a `select` statement differently, respectively. A Databricks Source Adapter for Oracle databases has been introduced, which includes a new `OracleDataSource` class that provides functionality to connect to an Oracle database using JDBC. A Databricks Source Adapter for Snowflake has also been added, featuring the `SnowflakeDataSource` class that handles data reading and schema retrieval from Snowflake. The `DatabricksDataSource` class has been updated to handle data reading and schema retrieval from Databricks, including a new `get_schema_query` method that generates the query to fetch the schema based on the provided catalog and table name. Exception handling for reading data and fetching schema has been implemented for all new classes. These changes provide increased flexibility for working with various data sources, improved code maintainability, and better support for different use cases; a hedged sketch of a schema query follows below.
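
  A minimal sketch of what a `get_schema_query` helper can produce; the information-schema shape is an assumption, not the project's exact query:

  ```python
  def get_schema_query(catalog: str, schema: str, table: str) -> str:
      # Build a query that lists column names and data types for one table.
      return (
          "SELECT lower(column_name) AS col_name, lower(data_type) AS data_type "
          f"FROM {catalog}.information_schema.columns "
          f"WHERE lower(table_schema) = '{schema.lower()}' "
          f"AND lower(table_name) = '{table.lower()}'"
      )
  ```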
- Added Threshold Query Builder (#188). In this release, the open-source library has added a Threshold Query Builder feature, which includes several changes to the existing functionality in the data source connector. A new import statement adds the `re` module for regular expressions, and new parameters have been added to the `read_data` and `get_schema` abstract methods. The `_get_jdbc_reader_options` method has been updated to accept an `options` parameter of type `JdbcReaderOptions`, and a new static method, `_get_table_or_query`, has been added to construct the table or query string based on the provided parameters. Additionally, a new class, `QueryConfig`, has been introduced in the `databricks.labs.remorph.reconcile` package to configure queries for data reconciliation tasks. A new abstract base class, `QueryBuilder`, has been added to the `query_builder.py` file, along with `HashQueryBuilder` and `ThresholdQueryBuilder` classes to construct SQL queries for generating hash values and selecting columns based on threshold values, transformation rules, and filtering conditions. These changes aim to enhance the functionality of the data source connector, add modularity, customizability, and reusability to the query builder, and improve data reconciliation tasks.
- Added snowflake connector code (#177). In this release, the open-source library has been updated to add a Snowflake connector for data extraction and schema manipulation. The changes include the addition of the `SnowflakeDataSource` class, which is used to read data from Snowflake using PySpark, and has methods for getting the JDBC URL, reading data with and without JDBC reader options, getting the schema, and handling exceptions. A new constant, `SNOWFLAKE`, has been added to the `SourceDriver` enum in `constants.py`, which represents the Snowflake JDBC driver class. The code modifications include updating the constructor of the `DataSource` abstract base class to include a new parameter, `scope`, and updating the `_get_secrets` method to accept a `key_name` parameter instead of `key`. Additionally, a test file, `test_snowflake.py`, has been added to test the functionality of the `SnowflakeDataSource` class. This release also updates the `pyproject.toml` file to version-lock dependencies like black, ruff, and isort, and modifies the coverage report configuration to exclude certain files and lines from coverage checks. These changes were completed by Ravikumar Thangaraj and SundarShankar89.
- `remorph reconcile` baseline for Query Builder and Source Adapter for oracle as source (#150).
Dependency updates:
- Bump sqlglot from 22.4.0 to 22.5.0 (#175).
- Updated databricks-sdk requirement from <0.22,>=0.18 to >=0.18,<0.23 (#178).
- Updated databricks-sdk requirement from <0.23,>=0.18 to >=0.18,<0.24 (#189).
- Bump actions/checkout from 3 to 4 (#203).
- Bump actions/setup-python from 4 to 5 (#201).
- Bump codecov/codecov-action from 1 to 4 (#202).
- Bump softprops/action-gh-release from 1 to 2 (#204).
- Added Pylint Checker (#149). This diff adds a Pylint checker to the project, which is used to enforce a consistent code style, identify potential bugs, and check for errors in the Python code. The configuration for Pylint includes various settings, such as a line length limit, the maximum number of arguments for a function, and the maximum number of lines in a module. Additionally, several plugins have been specified to load, which add additional checks and features to Pylint. The configuration also includes settings that customize the behavior of Pylint's naming conventions checks and handle various types of code constructs, such as exceptions, logging statements, and import statements. By using Pylint, the project can help ensure that its code is of high quality, easy to understand, and free of bugs. This diff includes changes to various files, such as cli.py, morph_status.py, validate.py, and several SQL-related files, to ensure that they adhere to the desired Pylint configuration and best practices for code quality and organization.
- Fixed edge case where column name is same as alias name (#164). A recent commit has introduced fixes for edge cases related to conflicts between column names and alias names in SQL queries, addressing issues #164 and #130. The `check_for_unsupported_lca` function has been updated with two helper functions, `_find_aliases_in_select` and `_find_invalid_lca_in_window`, to detect aliases with the same name as a column in a SELECT expression and to identify invalid lateral column aliases (LCAs) in window functions, respectively. The `find_windows_in_select` function has been refactored and renamed to `_find_windows_in_select` for improved code readability. The `transpile` and `parse` functions in the `sql_transpiler.py` file have been updated with try-except blocks to handle cases where a column name matches the alias name, preventing errors or exceptions such as `ParseError`, `TokenError`, and `UnsupportedError`. A new unit test, `test_query_with_same_alias_and_column_name`, has been added to verify the fix: it passes a SQL query with a subquery having a column alias `ca_zip` that is also used as a column name in the same query, confirming that the function correctly handles the scenario where a column name conflicts with an alias name.
- `TO_NUMBER` without `format` edge case (#172). The `TO_NUMBER without format edge case` commit introduces changes to address an unsupported usage of the `TO_NUMBER` function in the Databricks SQL dialect when the `format` parameter is not provided. The new implementation introduces the constants `PRECISION_CONST` and `SCALE_CONST` (set to 38 and 0 respectively) as default values for the `precision` and `scale` parameters. These changes ensure Databricks SQL dialect requirements are met by modifying the `_to_number` method to incorporate these constants. An `UnsupportedError` will now be raised when `TO_NUMBER` is called without a `format` parameter, improving error handling and ensuring users are aware of the required `format` parameter. Test cases have been added for the `TO_DECIMAL`, `TO_NUMERIC`, and `TO_NUMBER` functions with format strings, covering cases where the format is taken from table columns. The commit also ensures that an error is raised when `TO_DECIMAL` is called without a format parameter; a hedged sketch of the defaults and the error follows below.
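
  A minimal sketch of the behavior described above; the helper's signature is an assumption, not the project's `_to_number`:

  ```python
  from sqlglot.errors import UnsupportedError

  # Assumed defaults applied when precision and scale are not supplied.
  PRECISION_CONST = 38
  SCALE_CONST = 0


  def to_number_sql(value_sql, fmt, precision=None, scale=None):
      # Reject TO_NUMBER calls that omit the format string entirely.
      if not fmt:
          raise UnsupportedError("TO_NUMBER without a format parameter is not supported")
      precision = PRECISION_CONST if precision is None else precision
      scale = SCALE_CONST if scale is None else scale
      # Render a Databricks-compatible expression with explicit precision and scale.
      return f"CAST(TO_NUMBER({value_sql}, {fmt}) AS DECIMAL({precision}, {scale}))"
  ```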
Dependency updates:
- Bump sqlglot from 21.2.1 to 22.0.1 (#152).
- Bump sqlglot from 22.0.1 to 22.1.1 (#159).
- Updated databricks-labs-blueprint[yaml] requirement from ~=0.2.3 to >=0.2.3,<0.4.0 (#162).
- Bump sqlglot from 22.1.1 to 22.2.0 (#161).
- Bump sqlglot from 22.2.0 to 22.2.1 (#163).
- Updated databricks-sdk requirement from <0.21,>=0.18 to >=0.18,<0.22 (#168).
- Bump sqlglot from 22.2.1 to 22.3.1 (#170).
- Updated databricks-labs-blueprint[yaml] requirement from <0.4.0,>=0.2.3 to >=0.2.3,<0.5.0 (#171).
- Bump sqlglot from 22.3.1 to 22.4.0 (#173).
- Added conversion logic for Try_to_Decimal without format (#142).
- Identify Root Table for folder containing SQLs (#124).
- Install Script (#106).
- Integration Test Suite (#145).
Dependency updates:
- Updated databricks-sdk requirement from <0.20,>=0.18 to >=0.18,<0.21 (#143).
- Bump sqlglot from 21.0.0 to 21.1.2 (#137).
- Bump sqlglot from 21.1.2 to 21.2.0 (#147).
- Bump sqlglot from 21.2.0 to 21.2.1 (#148).
- Added support for WITHIN GROUP for ARRAY_AGG and LISTAGG functions (#133).
- Fixed Merge "INTO" for delete from syntax (#129).
- Fixed `DATE TRUNC` parse errors (#131).
- Patched Logger function call during wheel file (#135).
- Patched extra call to root path (#126).
Dependency updates:
- Updated databricks-sdk requirement from ~=0.18.0 to >=0.18,<0.20 (#134).
- Added test_approx_percentile and test_trunc Testcases (#98).
- Updated contributing/developer guide (#97).
- Added baseline for Databricks CLI frontend (#60).
- Added custom Databricks dialect test cases and lateral struct parsing (#77).
- Extended Snowflake to Databricks functions coverage (#72, #69).
- Added `databricks labs remorph transpile` documentation for installation and usage (#73).
Dependency updates:
- Bump sqlglot from 20.8.0 to 20.9.0 (#83).
- Updated databricks-sdk requirement from ~=0.17.0 to ~=0.18.0 (#90).
- Bump sqlglot from 20.9.0 to 20.10.0 (#91).
Initial commit