Version changelog

0.9.0

  • Added support for format_datetime function in presto to Databricks (#1250). A new format_datetime function has been added to the Parser class in presto.py to support formatting datetime values when transpiling Presto to Databricks. The function uses the DateFormat.from_arg_list method from the local_expression module to format datetime values according to a specified format string. To ensure compatibility and consistency between Presto and Databricks, a new test file, test_format_datetime_1.sql, contains SQL queries that demonstrate the usage of format_datetime in Presto and its Databricks equivalent, DATE_FORMAT (an illustrative example appears after this list). This standalone change adds new functionality without modifying any existing code.
  • Added support for Snowflake SUBSTR (#1238). This commit enhances the library's Snowflake support by adding the SUBSTR function, which was previously unsupported and existed only as an alternative to SUBSTRING. The project now fully supports both functions, and SUBSTRING can be used interchangeably with SUBSTR via the new withConversionStrategy(SynonymOf("SUBSTR")) method. This commit supersedes a previous pull request that lacked a GPG signature and includes a test for the SUBSTR function. The ARRAY_SLICE function has also been updated to match Snowflake's behavior, and the project now supports a more comprehensive list of SQL functions with their corresponding arity.
  • Added support for json_size function in presto (#1236). A new json_size function for Presto has been added, which determines the size of a JSON object or array and returns an integer. Two new methods, _build_json_size and get_json_object, handle JSON objects and arrays differently, and the Parser and Tokenizer classes of the Presto class have been updated to include the new json_size function. An alternative implementation for Databricks using SQL functions is provided, and a test case covers a fix for an "is not null" error for json_extract in the Databricks generator. Additionally, a new Presto test file exercises the json_extract function, and a new method, GetJsonObject, is introduced to extract a JSON object from a given path. The json_extract function has also been updated to extract the value associated with a specified key from JSON data in both Presto and Databricks.
  • Enclosed subqueries in parentheses (#1232). This PR introduces changes to the ExpressionGenerator and LogicalPlanGenerator classes to ensure that subqueries are correctly enclosed in parentheses during code generation. Previously, subqueries were not always parenthesized, leading to incorrect code. The issue has been addressed by enclosing subqueries in parentheses in the in and scalarSubquery methods, and by adding new match cases for ir.Filter in the LogicalPlanGenerator class. The changes also avoid doubling the enclosing parentheses in the .. IN(SELECT...) pattern. No new methods have been added; existing functionality has been modified so that subqueries are correctly parenthesized, leading to the generation of correct SQL code. Test cases are included in a separate PR. These changes improve the correctness of the generated code, avoiding issues such as SELECT * FROM SELECT * FROM t WHERE a > a WHERE a > 'b' and ensuring that the generated code includes parentheses around subqueries (an illustrative example appears after this list).
  • Fixed serialization of MultipleErrors (#1177). In the latest release, the encoding of errors in the com.databricks.labs.remorph.coverage package has been improved with an update to the encoders.scala file. The change fixes serialization of MultipleErrors instances by calling the asJson method on each error instead of serializing just the message. This ensures that all relevant information about each error is included in the encoded output, improving the accuracy of serialization for the MultipleErrors class. Users who handle multiple errors and require a precise serialized representation will benefit from this enhancement, as it guarantees comprehensive information encoding for each error instance.
  • Fixed presto strpos and array_average functions (#1196). This PR introduces new classes Locate and NamedStruct in the local_expression.py file to handle the STRPOS and ARRAY_AVERAGE functions in a Databricks environment, ensuring compatibility with Presto SQL. The STRPOS function, used to locate the position of a substring within a string, now uses the Locate class and emits a warning regarding differences in implementation between Presto and Databricks SQL. A new method _build_array_average has been added to handle the ARRAY_AVERAGE function in Databricks, which calculates the average of an array, accommodating nulls, integers, and doubles. Two SQL test cases have been added to demonstrate the use of the ARRAY_AVERAGE function with arrays containing integers and doubles. These changes promote compatibility and consistent behavior between Presto and Databricks when dealing with STRPOS and ARRAY_AVERAGE functions, enhancing the ability to migrate between the systems smoothly.
  • Handled presto Unnest cross join to Databricks lateral view (#1209). This release introduces new features and updates for handling Presto UNNEST cross joins in Databricks, utilizing the lateral view feature. New methods have been added to improve efficiency and robustness when handling UNNEST cross joins. Additionally, new test cases have been implemented for Presto and Databricks to ensure compatibility and consistency between the two systems in handling UNNEST cross joins, array construction and flattening, and parsing JSON data. Some limitations and issues remain, which will be addressed in future work. The acceptance tests have also been updated, with certain tests now expected to pass, while others may still fail. This release aims to improve the functionality and compatibility of Presto and Databricks when handling UNNEST cross joins and JSON data.
  • Implemented remaining TSQL set operations (#1227). This pull request enhances the TSql parser by adding support for parsing and converting the set operations UNION [ALL], EXCEPT, and INTERSECT to the Intermediate Representation (IR). Initially, the grammar recognized these operations, but they were not being converted to the IR. This change resolves issues #1126 and #1102 and includes new unit, transpiler, and functional tests, ensuring the correct behavior of these set operations, including precedence rules. The commit also introduces a new test file, union-all.sql, demonstrating the correct handling of simple UNION ALL operations, ensuring consistent output across TSQL and Databricks SQL platforms.
  • Supported multiple columns in the ORDER BY clause for ARRAYAGG (#1228). This commit enhances the ARRAYAGG and LISTAGG functions by adding support for multiple columns in the ORDER BY clause, sorting in both ascending and descending order. A new method, sortArray, has been introduced to handle multiple sort orders. The changes also improve the ARRAYAGG function in the Snowflake dialect by supporting multiple columns in the ORDER BY clause, with an optional DESC keyword for each column (an illustrative example appears after this list). The WithinGroupParams dataclass in the local expression module has been updated to include a list of tuples for the order columns and their sorting direction. These changes provide increased flexibility and control over the output of the ARRAYAGG and LISTAGG functions.
  • Added TSQL parser support for (LHS) UNION RHS queries (#1211). In this release, we have implemented support for a new form of UNION in the TSQL parser, specifically for queries formatted as (SELECT a from b) UNION [ALL] SELECT x from y. This allows the union of two SELECT queries with an optional ALL keyword to include duplicate rows. The implementation includes a new case statement in the TSqlRelationBuilder class that handles this form of UNION, creating a SetOperation object with the left-hand side and right-hand side of the union, and an is_all flag based on the presence of the ALL keyword. Additionally, we have added support for parsing right-associative UNION clauses in TSQL queries, enhancing the flexibility and expressiveness of the TSQL parser for more complex and nuanced queries. The commit also includes new test cases to verify the correct translation of TSQL set operations to Databricks SQL, resolving issue #1127. This enhancement allows for more accurate parsing of TSQL queries that use the UNION operator in various formats.
  • Added support for inline columns in CTEs (#1184). In this release, we have added support for inline columns in Common Table Expressions (CTEs) in Snowflake across various components of our open-source library. This includes updates to the AST (Abstract Syntax Tree) for better TSQL translation and the introduction of the new case class KnownInterval for handling intervals. We have also implemented a new method, DealiasInlineColumnExpressions, in the SnowflakePlanParser class to parse inline columns in CTEs and modify the class constructor to include this new method. Additionally, a new private case class InlineColumnExpression has been introduced to allow for more efficient processing of Snowflake CTEs. The SnowflakeToDatabricksTranspiler has also been updated to support inline columns in CTEs, as demonstrated by a new test case. These changes improve compatibility, precision, and usability of the codebase, providing a better overall experience for software engineers working with CTEs in Snowflake.
  • Implemented AST for positional column identifiers (#1181). The recent change introduces an Abstract Syntax Tree (AST) for positional column identifiers in the Snowflake project, specifically in the ExpressionGenerator class. The new NameOrPosition type represents a column identifier, either by name or position. The Id and Position classes inherit from NameOrPosition, and the nameOrPosition method has been added to check and return the appropriate SQL representation. However, due to Databricks' lack of positional column identifier support, the generator side does not yet support this feature. This means that the schema of the table is required to properly translate queries involving positional column identifiers. This enhancement increases the system's flexibility in handling Snowflake's query structures, with the potential for more comprehensive generator-side support in the future.
  • Implemented GROUP BY ALL (#1180). The GROUP BY ALL clause is now supported in the LogicalPlanGenerator class of the remorph project, with a new case to handle the GroupByAll type and an updated implementation for the Pivot type. A new case object, GroupByAll, has been added to the sealed trait GroupType in relations.scala. A new test case in the SnowflakeToDatabricksTranspilerTest class checks the correct transpilation of the GROUP BY ALL clause from Snowflake SQL syntax to Databricks SQL syntax (an illustrative example appears after this list). These changes allow for more flexibility and control in grouping operations and improve compatibility with Snowflake SQL syntax.
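
The following sketch illustrates the format_datetime mapping from #1250. The table and column names are hypothetical, and the exact pattern-string handling may differ between Presto (Joda-style patterns) and Databricks DATE_FORMAT:

```sql
-- Presto (hypothetical table/column names)
SELECT format_datetime(order_ts, 'yyyy-MM-dd HH:mm') FROM orders;

-- Databricks equivalent emitted by the transpiler
SELECT DATE_FORMAT(order_ts, 'yyyy-MM-dd HH:mm') FROM orders;
```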
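
For the subquery parenthesization fix in #1232, the difference is roughly as follows (illustrative queries, not taken from the test suite):

```sql
-- Before the fix: inner query emitted without parentheses (invalid SQL)
SELECT * FROM SELECT * FROM t WHERE a > 'b';

-- After the fix: the subquery is enclosed in parentheses
SELECT * FROM (SELECT * FROM t WHERE a > 'b');

-- IN (SELECT ...) is emitted without doubled parentheses
SELECT * FROM t1 WHERE c1 IN (SELECT c1 FROM t2);
```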
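
An illustrative Snowflake query for the multi-column ORDER BY support added to ARRAYAGG/LISTAGG in #1228 (table and column names are hypothetical):

```sql
-- Snowflake: aggregate c1 ordered by two columns with mixed sort directions
SELECT ARRAY_AGG(c1) WITHIN GROUP (ORDER BY c2 ASC, c3 DESC) AS vals
FROM t
GROUP BY c4;
```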
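
A minimal example of the GROUP BY ALL clause handled by #1180; both Snowflake and Databricks SQL accept this form, so the transpiled query can keep the same shape (names are hypothetical):

```sql
-- Groups by every non-aggregated column in the SELECT list (region, product)
SELECT region, product, SUM(amount) AS total_amount
FROM sales
GROUP BY ALL;
```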

Dependency updates:

  • Bump codecov/codecov-action from 4 to 5 (#1210).
  • Bump sqlglot from 25.30.0 to 25.32.1 (#1254).

0.8.0

  • Added IR for stored procedures (#1161). In this release, we have made significant enhancements to the project by adding support for stored procedures. We have introduced a new CreateVariable case class to manage variable creation within the intermediate representation (IR), and removed the SetVariable case class as it is now redundant. A new CaseStatement class has been added to represent SQL case statements with value match, and a CompoundStatement class has been implemented to enable encapsulation of a sequence of logical plans within a single compound statement. The DeclareCondition, DeclareContinueHandler, and DeclareExitHandler case classes have been introduced to handle conditional logic and exit handlers in stored procedures. New classes DeclareVariable, ElseIf, ForStatement, If, Iterate, Leave, Loop, RepeatUntil, Return, SetVariable, and Signal have been added to the project to provide more comprehensive support for procedural language features and control flow management in stored procedures. We have also included SnowflakeCommandBuilder support for stored procedures and updated the visitExecuteTask method to handle stored procedure calls using the SetVariable method.
  • Added Variant Support (#998). In this commit, support for the Variant datatype has been added to the CREATE TABLE functionality, enhancing compatibility with Snowflake's datatypes. A new VariantType has been introduced, allowing more comprehensive handling of data during CREATE TABLE operations, and a remarks VARIANT column has been added to the CREATE TABLE statement used in the corresponding spec test, which has been updated accordingly (an illustrative example appears after this list). The Variant datatype is a flexible type that can store different kinds of data, such as arrays, objects, and strings, offering increased functionality for users working with variant data. This change enables the use of the Variant datatype in Snowflake tables and improves the data modeling capabilities of the system.
  • Added PySpark generator (#1026). The engineering team has developed a new PySpark generator for the com.databricks.labs.remorph.generators package. This addition introduces a new parameter, logical, of type Generator[ir.LogicalPlan, String], in the SQLGenerator for SQL queries. A new abstract class, BasePythonGenerator, extends the Generator class and generates Python code; an ExpressionGenerator class extends BasePythonGenerator and generates Python code for ir.Expression objects; a LogicalPlanGenerator class extends BasePythonGenerator and generates Python code for a given ir.LogicalPlan; and a StatementGenerator class converts Statement objects into Python code. A new Python-generating class, PythonGenerator, has been added, which includes an abstract syntax tree (AST) for Python implemented in Scala, with classes for various Python language constructs. Additionally, new implicit classes for PythonInterpolator, PythonOps, and PythonSeqOps allow the creation of PySpark code using the Remorph framework. The AndOrToBitwise rule converts And and Or expressions to their bitwise equivalents, and the DotToFCol rule transforms code that references columns using dot notation on a DataFrame to use the col function with a string literal of the column name instead. A new PySparkStatements object and PySparkExpressions class provide functionality for transforming expressions in a data processing pipeline to PySpark equivalents, and the SnowflakeToPySparkTranspiler class transpiles Snowflake queries to PySpark code. Finally, a new PySpark generator has been added to the Transpiler class, implemented as an instance of the SqlGenerator class. This change enhances the Transpiler class with a new PySpark generator and improves serialization efficiency.
  • Added debug-bundle command for folder-to-folder translation (#1045). In this release, we have introduced a debug-bundle command to the remorph project's CLI, specifically added to the proxy_command function, which already includes debug-script, debug-me, and debug-coverage commands. This new command enhances the tool's debugging capabilities, allowing developers to generate a bundle of translated queries for folder-to-folder translation tasks. The debug-bundle command accepts three flags: dialect, src, and dst, specifying the SQL dialect, source directory, and destination directory, respectively. Furthermore, the update includes refactoring the FileSetGenerator class in the orchestration package of the com.databricks.labs.remorph.generators package, adding a debug-bundle command to the Main object, and updating the FileQueryHistoryProvider method in the ApplicationContext trait. These improvements focus on providing a convenient way to convert folder-based SQL scripts to other formats like SQL and PySpark, enhancing the translation capabilities of the project.
  • Added ruff Python formatter proxy (#1038). In this release, we have added support for the ruff Python formatter in our project's continuous integration and development workflow. We have also introduced a new FORMAT stage in the WorkflowStage object in the Result Scala object to include formatting as a separate step in the workflow. A new RuffFormatter class has been added to format Python code using the ruff tool, and a StandardInputPythonSubprocess class has been included to run a Python subprocess and capture its output and errors. Additionally, we have added a proxy for the ruff formatter to the SnowflakeToPySparkTranspilerTest for Scala to improve the readability of the transpiled Python code generated by the SnowflakeToPySparkTranspiler. Lastly, we have introduced a new ruff formatter proxy in the test code for the transpiler library to enforce format and style conventions in Python code. These changes aim to improve the development and testing experience for the project and ensure that the code follows the desired formatting and style standards.
  • Added baseline for translating workflows (#1042). In this release, several new features have been added to the open-source library to improve the translation of workflows. A new dependency for the Jackson YAML data format library, version 2.14.0, has been added to the pom.xml file to enable processing YAML files and converting them to Java objects. A new FileSet class has been introduced, which provides an in-memory data structure to manage a set of files, allowing users to add, retrieve, and remove files by name and persist the contents of the files to the file system. A new FileSetGenerator class has been added that generates a FileSet object from a JobNode object, enabling the translation of workflows by generating all necessary files for a workspace. A new DefineJob class has been developed to define a new rule for processing JobNode objects in the Remorph system, converting instances of SuccessPy and SuccessSQL into PythonNotebookTask and SqlNotebookTask objects, respectively. Additionally, various new classes, such as GenerateBundleFile, QueryHistoryToQueryNodes, ReformatCode, TryGeneratePythonNotebook, TryGenerateSQL, TrySummarizeFailures, InformationFile, SuccessPy, SuccessSQL, FailedQuery, Migration, PartialQuery, QueryPlan, RawMigration, Comment, and PlanComment, have been introduced to provide a more comprehensive and nuanced job orchestration framework. The Library case class has been updated to better separate concerns between library configuration and code assets. These changes address issue #1042 and provide a more robust and flexible workflow translation solution.
  • Added correct generation of databricks.yml for QueryHistory (#1044). The FileSet class in the FileSet.scala file has been updated to include a new method that correctly generates the databricks.yml file for the QueryHistory feature. This file is used for orchestrating cross-compiled queries, creating three files in total - two SQL notebooks with translated and formatted queries and a databricks.yml file to define an asset bundle for the queries. The new method in the FileSet class writes the content to the file using the Files.write method from the java.nio.file package instead of the previously used PrintWriter. The FileSetGenerator class has been updated to include the new databricks.yml file generation, and new rules and methods have been added to improve the accuracy and consistency of schema definitions in the generated orchestration files. Additionally, the DefineJob and DefineSchemas classes have been introduced to simplify the orchestration generation process.
  • Added documentation around Transformation (#1043). In this release, the Transformation class in our open-source library has been enhanced with detailed documentation, type parameters, and new methods. The class represents a stateful computation that produces an output of type Out while managing a state of type State. The new methods include map and flatMap for modifying the output and chaining transformations, as well as run and runAndDiscardState for executing the computation with a given initial state and producing a Result containing the output and state. Additionally, we have introduced a new trait called TransformationConstructors that provides constructors for successful transformations, error transformations, lifted results, state retrieval, replacement, and updates. The CodeGenerator trait in our code generation library has also been updated with several new methods for more control and flexibility in the code generation process. These include commas and spaces for formatting output, updateGenCtx for updating the GeneratorContext, nest and unnest for indentation, withIndentedBlock for producing nested blocks of code, and withGenCtx for creating transformations that use the current GeneratorContext.
  • Added tests for Snow ARRAY_REMOVE function (#979). In this release, we have added tests for the Snowflake ARRAY_REMOVE function in the SnowflakeToDatabricksTranspilerTest. The tests, currently ignored, demonstrate the usage of the ARRAY_REMOVE function with different data types, such as integers and doubles. A TODO comment is included for a test case involving VARCHAR casting, to be enabled once the necessary casting functionality is implemented. This update enhances the library's capabilities and ensures that the ARRAY_REMOVE function can handle a variety of data types. Software engineers can refer to these tests to understand the usage of the ARRAY_REMOVE function in the transpiler and the planned casting functionality.
  • Avoid non local return (#1052). In this release, the render method of the generators package object in the com.databricks.labs.remorph package has been updated to avoid using non-local returns and follow recommended coding practices. Instead of returning early from the method, it now uses Option to track failures and a try-catch block to handle exceptions. In cases of exception during string concatenation, the method sets the failureOpt variable to Some(lift(KoResult(WorkflowStage.GENERATE, UncaughtException(e)))). Additionally, the test file "CodeInterpolatorSpec.scala" has been modified to fix an issue with exception handling. In the updated code, new variables for each argument are introduced, and the problematic code is placed within an interpolated string, allowing for proper exception handling. This release enhances the robustness and reliability of the code interpolator and ensures that the method follows recommended coding practices.
  • Collect errors in Phase (#1046). The open-source library Remorph has received significant updates, focusing on enhancing error collection and simplifying the Transformation class. The changes include a new method recordError in the abstract Phase trait and its concrete implementations for collecting errors during each phase. The Transformation class has been simplified by removing the unused Phase parameter, while the Generator, CodeGenerator, and FileSetGenerator have been specialized to use Transformation without the Phase parameter. The TryGeneratePythonNotebook, TryGenerateSQL, CodeInterpolator, and TBASeqOps classes have been updated for a more concise and focused state. The imports have been streamlined, and the PySparkGenerator, SQLGenerator, and PlanParser have been modified to remove the unused Phase type parameter. A new test file, TransformationTest, has been added to check the error collection functionality in the Transformation class. Overall, these enhancements improve the reliability, readability, and maintainability of the Remorph library.
  • Correctly generate F.fn_name for builtin PySpark functions (#1037). This commit introduces changes to the generation of F.fn_name for builtin PySpark functions in the PySparkExpressions.scala file, specifically for PySpark's builtin functions (fn). It includes a new case to handle these functions by converting them to their lowercase equivalent in Python using Locale.getDefault. Additionally, changes have been made to handle window specifications more accurately, such as using ImportClassSideEffect with windowSpec and generating and applying a window function (fn) over it. The LAST_VALUE function has been modified to LAST in the SnowflakeToDatabricksTranspilerTest.scala file, and new methods such as First, Last, Lag, Lead, and NthValue have been added to the SnowflakeCallMapper class. These changes improve the accuracy, flexibility, and compatibility of PySpark when working with built-in functions and window specifications, making the codebase more readable, maintainable, and efficient.
  • Create Command Extended (#1033). In this release, the open-source library has been updated with several new features related to table management and SQL code generation. A new method replaceTable has been added to the LogicalPlanGenerator class, which generates SQL code for a ReplaceTableCommand IR node and replaces an existing table with the same name if it already exists. Additionally, support has been added for generating SQL code for an IdentityConstraint IR node, which specifies whether a column is an auto-incrementing identity column. The CREATE TABLE statement has been updated to include the AUTOINCREMENT and REPLACE constraints, and a new IdentityConstraint case class has been introduced to extend the capabilities of the UnnamedConstraint class. The TSqlDDLBuilder class has also been updated to handle the IDENTITY keyword more effectively. A new command implementation with AUTOINCREMENT and REPLACE constraints has been added, and a new SQL script has been included in the functional tests for testing CREATE DDL statements with identity columns. Finally, the SQL transpiler has been updated to support the CREATE OR REPLACE PROCEDURE syntax, providing more flexibility and convenience for users working with stored procedures in their SQL code. These updates aim to improve the functionality and ease of use of the open-source library for software engineers working with SQL code and table management.
  • Don't draft automated releases (#995). In this release, we have made a modification to the release.yml file in the .github/workflows directory by removing the "draft: true" line. This change removes the creation of draft releases in the automated release process, simplifying it and making it more straightforward for users to access new versions of the software. The job section of the release.yml file now only includes the release job, with the "release-signing-artifacts: true" still enabled, ensuring that the artifacts are signed. This improvement enhances the overall release process, making it more efficient and user-friendly.
  • Enhance the Snow ARRAY_SORT function support (#994). With this release, the Snowflake ARRAY_SORT function now supports Boolean literals as parameters, improving its functionality. The changes include validating Boolean parameters in SnowflakeCallMapper.scala, throwing a TranspileException for unsupported arguments, and simplifying the IR using the DBSQL SORT_ARRAY function. Additionally, new test cases have been added to SnowflakeCallMapperSpec for the ARRAY_SORT and ARRAY_SLICE functions. The SnowflakeToDatabricksTranspilerTest class has also been updated with new test cases that cover the enhanced ARRAY_SORT function's various overloads and combinations of Boolean literals, NULLs, and a custom sorting function. This ensures that invalid usage is caught during transpilation, providing better error handling and supporting more use cases.
  • Ensure that unparsable text is not lost in the generated output (#1012). In this release, we have implemented an enhancement to the error handling strategy in the ANTLR-generated parsers for SQL. This change records where parsing failed and gathers un-parsable input, preserving it as custom error nodes in the ParseTree at strategic points. The new custom error strategy allows visitors for higher-level rules, such as sqlCommand in Snowflake and sqlClauses in TSQL, to check for an error node in the children and generate an IR node representing the un-parsed text. Additionally, new methods have been introduced to handle error recovery, find the highest context in the tree for the particular parser, and recursively find the context. A separate improvement is planned to ensure the PlanParser no longer stops when syntax errors are discovered, allowing safe traversal of the ParseTree. This feature is targeted towards software engineers working with SQL parsing and aims to improve error handling and recovery.
  • Fetch table definitions for TSQL (#986). This pull request introduces a new TableDefinition case class that encapsulates metadata properties for tables in TSQL, such as catalog name, schema name, table name, location, table format, view definition, columns, table size, and comments. A TSqlTableDefinitions class has been added with methods to retrieve table definitions, all schemas, and all catalogs from TSQL. The SnowflakeTypeBuilder is updated to parse data types from TSQL. The SnowflakeTableDefinitions class has been refactored to use the new TableDefinition case class and retrieve table definitions more efficiently. The changes also include adding two new test cases to verify the correct retrieval of table definitions and catalogs for TSQL.
  • Fixed handling of projected expressions in TreeNode (#1159). In this release, we have addressed the handling of projected expressions in the TreeNode class, resolving issues #1072 and #1159. The expressions method in the Plan abstract class has been modified to include the final keyword, restricting overriding in subclasses. This method now returns all expressions present in a query from the current plan operator and its descendants. Additionally, we have introduced a new private method, seqToExpressions, used for recursively finding all expressions from a given sequence. The Project class, representing a relational algebra operation that projects a subset of columns in a table, now utilizes a new columns parameter instead of expressions. Similar changes have been applied to other classes extending UnaryNode, such as Join, Deduplicate, and Hint. The values parameter of the Values class has also been altered to accurately represent input values. A new test class, JoinTest, has been introduced to verify the correct propagation of expressions in join nodes, ensuring intended data transformations.
  • Handling any_keys_match from presto (#1048). In this commit, we have added support for the Presto any_keys_match function in Databricks by implementing it using existing Databricks functions. The any_keys_match function checks whether any key in a map matches a given condition. Specifically, we have introduced two new classes, MapKeys and ArrayExists, which allow us to extract the keys from the input map and check whether any of them satisfy the given condition using the exists function. This is accomplished by renaming exists to array_exists to better reflect its purpose. Additionally, we have provided a Databricks SQL query that mimics the behavior of Presto's any_keys_match and added tests to ensure that it works as expected (an illustrative example appears after this list). These changes enable users to perform equivalent operations with consistent syntax in Databricks and Presto.
  • Improve IR for job nodes (#1041). The open-source library has undergone improvements to the Intermediate Representation (IR) for job nodes. This release introduces several significant changes: the JobNode class has been refactored to extend the TreeNode class, and a new abstract class, LeafJobNode, overrides the children method to always return an empty Seq; the ClusterSpec case class now includes a toSDK method that properly initializes and sets the values of the fields in the SDK ClusterSpec object; the NewClusterSpec class updates the types of several optional fields and changes its toSDK method for better conversion to the SDK format; the Job class, which previously represented a job in the IR of workflows, has been removed; the JobCluster case class updates the newCluster attribute from ClusterSpec to NewClusterSpec; the JobEmailNotifications class now extends LeafJobNode, adding new methods and overriding existing ones from LeafJobNode; the JobNotificationSettings class replaces the original toSDK method with a new implementation for a more accurate SDK representation of job notification settings; and the JobParameterDefinition class updates its toSDK method for more efficient conversion to the SDK format. These changes simplify the representation of job nodes, align the codebase more closely with the underlying SDK, and improve overall code maintainability and compatibility with other Remorph components.
  • Query History From Folder (#991). The Estimator class in the Remorph project has been updated to enhance the query history interface by adding metadata from reading from a folder, improving its ability to handle queries from different users and increasing the accuracy of estimation reports. The Anonymizer class has also been updated to handle cases where the user field is missing, ensuring the anonymization process can proceed smoothly and preventing potential errors. A new FileQueryHistory class has been added to provide query history functionality by reading metadata from a specified folder. The SnowflakeQueryHistory class has been updated to directly implement the history() method and include new fields in the ExecutedQuery objects, such as 'id', 'source', 'timestamp', 'duration', 'user', and 'filename'. A new ExecutedQuery case class has been introduced, which now includes optional user and filename fields, and a new QueryHistoryProvider trait has been added with a method history() that returns a QueryHistory object containing a sequence of ExecutedQuery objects, enhancing the query history interface's flexibility and power. Test suites and test data for the Anonymizer and TableGraph classes have been updated to accommodate these changes, allowing for more comprehensive testing of query history functionality. A FileQueryHistorySpec test file has been added to test the FileQueryHistory class's ability to correctly extract queries from SQL files, ensuring the class works as expected.
  • Rework serialization using circe+jackson (#1163). In pull request #1163, the serialization mechanism in the project has been refactored to use the Circe and Jackson libraries, replacing the existing ujson library. This change includes the addition of the Circe, Circe-generic-extras, and Circe-jackson libraries, which are licensed under the Apache 2.0 license. The project now includes the copyright notices and license information for all open-source projects that have contributed code to it, ensuring compliance with open-source licenses. The CoverageTest class has been updated to incorporate error encoding using Circe and Jackson libraries, and the EstimationReport case classes no longer have implicit ReadWriter instances defined using macroRW. Instead, circe and Jackson encode and decode instances are likely defined elsewhere in the codebase. The BaseQueryRunner abstract class has been updated to handle both parsing and transpilation errors in a more uniform way, using a failures field instead of transpilation_error or parsing_error. Additionally, a new file, encoders.scala, has been introduced, which defines encoders for serializing objects to JSON using the Circe and Jackson libraries. These changes aim to improve serialization and deserialization performance and capabilities, simplify the codebase, and ensure consistent and readable output.
  • Some window functions do not support window frame conditions (#999). The Snowflake expression builder has been updated to correct the default window frame specifications for certain window functions and to modify the behavior of the ORDER BY clause in these functions. This change ensures that the expression builder generates the correct SQL syntax for functions that do not accept a frame clause, such as LAG, DENSE_RANK, LEAD, PERCENT_RANK, RANK, and ROW_NUMBER, improving the compatibility and reliability of the generated queries (an illustrative example appears after this list). Additionally, a unit test for the SnowflakeExpressionBuilder has been updated to account for changes in the way window functions are handled, enhancing the accuracy of the builder in generating valid SQL for window functions in Snowflake.
  • Split workflow definitions into sensible packages (#1039). The AutoScale class has been refactored and moved to a new package, com.databricks.labs.remorph.intermediate.workflows.clusters, extending JobNode from com.databricks.labs.remorph.intermediate.workflows. It now includes a case class for auto-scaling that takes optional integer arguments maxWorkers and minWorkers, and a single method apply that creates and configures a cluster using the Databricks SDK's ComputeService. The AwsAttributes and AzureAttributes classes have also been moved to the com.databricks.labs.remorph.intermediate.workflows.clusters package and now extend JobNode. These classes manage AWS and Azure-related attributes for compute resources in a workflow. The ClientsTypes case class has been moved to a new clusters sub-package within the workflows package and now extends JobNode, and the ClusterLogConf class has been moved to a new clusters package. The JobDeployment class has been refactored and moved to com.databricks.labs.remorph.intermediate.workflows.jobs, and the JobEmailNotifications, JobsHealthRule, and WorkspaceStorageInfo classes have been moved to new packages and now import JobNode. These changes improve the organization and maintainability of the codebase, making it easier to understand and navigate.
  • TO_NUMBER/TO_DECIMAL/TO_NUMERIC without precision and scale (#1053). This pull request introduces improvements to the transpilation process for handling cases where precision and scale are not specified for the TO_NUMBER, TO_DECIMAL, or TO_NUMERIC Snowflake functions. The updated transpiler now automatically applies default values when these parameters are omitted, with precision set to the maximum allowed value of 38 and scale set to 0 (an illustrative example appears after this list). A new method has been added to manage these cases, and four new test cases verify the transpilation of TO_NUMBER and TO_DECIMAL functions without specified precision and scale, and with various input formats. This change ensures consistent behavior across SQL dialects for cases where precision and scale are not explicitly defined in the conversion functions.
  • Table comments captured as part of Snowflake Table Definition (#989). In this release, we have added support for capturing table comments as part of Snowflake Table Definitions in the remorph library. This includes modifying the TableDefinition case class to include an optional comment field, and updating the SQL query in the SnowflakeTableDefinitions class to retrieve table comments. A new integration test for Snowflake table definitions has also been introduced to ensure the proper functioning of the new feature. This test creates a connection to the Snowflake database, retrieves a list of table definitions, and checks for the presence of table comments. These changes are part of our ongoing efforts to improve metadata capture for Snowflake tables (Note: The commit message references issue #945 on GitHub, which this pull request is intended to close.)
  • Use Transformation to get rid of the ctx parameter in generators (#1040). The Generating class has undergone significant changes, removing the ctx parameter and introducing two new phases, Parsing and BuildingAst, in the sealed trait Phase. The Parsing phase extends Phase with a previous phase of Init and contains the source code and filename. The BuildingAst phase extends Phase with a previous phase of Parsing and contains the parsed tree and the previous phase. The Optimizing phase now contains the unoptimized plan and the previous phase. The Generating phase now contains the optimized plan, the current node, the total statements, the transpiled statements, the GeneratorContext, and the previous phase. Additionally, the TransformationConstructors trait has been updated to allow for the creation of Transformation instances specific to a certain phase of a workflow. The runQuery method in the BaseQueryRunner abstract class has been updated to use a new transpile method provided by the Transpiler trait, and the Estimator class in the Estimation module has undergone changes to remove the ctx parameter in generators. Overall, these changes simplify the implementation, improve code maintainability, and enable better separation of concerns in the codebase.
  • With Recursive (#1000). In this release, we have introduced several enhancements for WITH RECURSIVE statements in SQL parsing and processing for the Snowflake database. A new IR (Intermediate Representation) for WITH RECURSIVE CTEs (Common Table Expressions) has been implemented in the SnowflakeAstBuilder.scala file. A new case class, WithRecursiveCTE, has been added to the SnowflakeRelationBuilder class in the databricks/labs/remorph project; it extends RelationCommon and includes two members, ctes and query. The buildColumns method in the SnowflakeRelationBuilder class has been updated to handle cases where columnList is null and to extract column names differently. Additionally, a new test in SnowflakeAstBuilderSpec.scala verifies the correct handling of a recursive CTE query (an illustrative example appears after this list). These enhancements improve support for recursive queries in the Snowflake database, enabling more powerful and flexible querying capabilities for developers and data analysts working with complex data structures.
  • [chore] fixed query coverage report (#1160). In this release, we have addressed the issue #1160 related to the query coverage report. We have implemented changes to the QueryRunner abstract class in the com.databricks.labs.remorph.coverage package. The ReportEntryReport constructor now accepts a new parameter parsed, which is set to 1 if there is no transpilation error and 0 otherwise. Previously, parsed was always set to 1, regardless of the presence of a transpilation error. We also updated the extractQueriesFromFile and extractQueriesFromFolder methods in the FileQueryHistory class to return a single ExecutedQuery instance, allowing for better query coverage reporting. Additionally, we modified the behavior of the history() method of the fileQueryHistory object in the FileQueryHistorySpec test case. The method now returns a query history object with a single query having a source including the text "SELECT * FROM table1;" and "SELECT * FROM table2;", effectively merging the previous two queries into one. These changes ensure that the report accurately reflects whether the query was successfully transpiled, parsed, and stored in the query history. It is crucial to test thoroughly any parts of the code that rely on the history() method to return separate queries, as the behavior of the method has changed.
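
An illustrative CREATE TABLE statement for the Variant support added in #998; the table and most column names are hypothetical, with only the remarks VARIANT column taken from the entry above:

```sql
CREATE TABLE sample_variant_tbl (
  id      INT,
  payload VARIANT,   -- flexible type: can hold arrays, objects, or strings
  remarks VARIANT
);
```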
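
A sketch of the any_keys_match mapping from #1048. The map column and predicate are hypothetical, and the Databricks form shown here uses the built-in map_keys and exists functions rather than the exact SQL the generator emits:

```sql
-- Presto: true if any key of the map satisfies the predicate
SELECT any_keys_match(attrs, k -> k LIKE 'user_%') FROM events;

-- Databricks: the same check expressed with map_keys + exists
SELECT exists(map_keys(attrs), k -> k LIKE 'user_%') FROM events;
```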
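
For #999, ranking and offset functions take an OVER clause with ORDER BY but no window frame; a minimal sketch of the kind of SQL the builder should now emit (hypothetical table and columns):

```sql
SELECT
  ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary) AS rn,
  LAG(salary)  OVER (PARTITION BY dept ORDER BY salary) AS prev_salary
FROM employees;  -- no ROWS/RANGE BETWEEN frame clause for these functions
```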
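
A sketch of the defaulting behavior described in #1053: when precision and scale are omitted, the transpiler applies 38 and 0. The CAST form shown for Databricks is illustrative, and the table and column names are hypothetical:

```sql
-- Snowflake: no precision/scale given
SELECT TO_NUMBER('123.45');
SELECT TO_DECIMAL(col1) FROM t;

-- Databricks: defaults applied (precision 38, scale 0)
SELECT CAST('123.45' AS DECIMAL(38, 0));
SELECT CAST(col1 AS DECIMAL(38, 0)) FROM t;
```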
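
A minimal recursive CTE of the kind now parsed per #1000 (illustrative query, not taken from the test suite):

```sql
WITH RECURSIVE nums (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM nums WHERE n < 5
)
SELECT n FROM nums;
```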

0.7.0

  • Added private key authentication for sf (#917). This commit adds support for private key authentication to the Snowflake data source connector, providing users with more flexibility and security. The cryptography library is used to process the user-provided private key, with priority given to the pem_private_key secret, followed by the sfPassword secret. If neither secret is found, an exception is raised. However, password-based authentication is still used when JDBC options are provided, as Spark JDBC does not currently support private key authentication. A new exception class, InvalidSnowflakePemPrivateKey, has been added for handling invalid or malformed private keys. Additionally, new tests have been included for reading data with private key authentication, handling malformed private keys, and checking for missing authentication keys. The notice has been updated to include the cryptography library's copyright and license information.
  • Added support for PARSE_JSON and VARIANT datatype (#906). This commit introduces support for the PARSE_JSON function and the VARIANT datatype in the Snowflake parser, addressing issue #894. The implementation removes the experimental dialect, enabling support for the VARIANT datatype and using PARSE_JSON for it; the variant_explode function is also utilized. During transpilation, whenever the : operator is encountered in a SELECT statement, everything is treated as a VARIANT on the Databricks side to handle differences between Snowflake and Databricks in accessing variant types (an illustrative example appears after this list). These changes are implemented using ANTLR.
  • Added upgrade script and modified metrics sql (#990). In this release, the open-source library has been updated with several improvements to the metrics system, database upgrades, and setup process. The setup_spark_remote.sh script now checks if the Spark server is running by pinging localhost:4040, rather than sleeping for a fixed time, allowing for faster execution and more accurate server status determination. The metrics table's insert statement has been updated to cast values to Bigint for better handling of larger counts. An upgrade script has been added to facilitate required modifications, and the setup_spark_remote.sh script has been modified to validate URLs. A new Python file for upgrading the metrics table's data types has been added, which includes a function to recreate the table with the correct data types for specific columns. The upgrade_common module now includes several functions for upgrading database tables, and a new unit test file, test_upgrade_common.py, has been added with test cases for these functions. Lastly, the upgrade script for v0.4.0 has been updated to simplify the process of checking if the main table in the reconcile metadata needs to be recreated and to add an operation_name column. These changes improve the library's functionality, accuracy, and robustness, particularly for larger datasets and upgrading processes, enhancing the overall user experience.
  • Basic CTAS Implementation (#968). This pull request adds basic support for the CREATE TABLE AS SELECT (CTAS) statement in Snowflake, enabling users to create a new table by selecting data from an existing table or query. In the LogicalPlanGenerator class, a new method has been implemented to handle CTAS statements, which generates the appropriate SQL command for creating a table based on the result of a select query. The SnowflakeDDLBuilder class now includes a relationBuilder class member for building relations based on Snowflake DDL input, and the visitCreateTableAsSelect method has been overridden to parse CTAS statements and construct corresponding IR objects. The test suite has been expanded to include a new spec for CTAS statements and a test case for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;". Additionally, a new test file "test_ctas_complex.sql" has been added, containing SQL statements for creating a new table by selecting columns from multiple tables and computing new columns through various data manipulations. The implementation also includes adding new SQL statements for CTAS in both Snowflake and Databricks dialects, allowing for testing the CTAS functionality for both.
  • Create repeatable estimator for Snowflake query history (#924). This commit introduces a new coverage estimation tool for analyzing query history in a database, initially implemented for Snowflake. The tool parses and transpiles query history into Databricks SQL and reports on the percentage of query history it can process. It includes a new SnowflakePlanParser class that handles Snowflake query plans, a SqlGenerator class that generates Databricks SQL from optimized logical plans, and a dialect method that returns the dialect string. The long-term plan is to extend this functionality to other supported databases and dialects and include a report on SQL complexity. Additionally, test cases have been added to the AnonymizerTest class to ensure the correct functionality of the Anonymizer class, which anonymizes executed queries when provided with a PlanParser object. The Anonymizer class is intended to be used as part of the coverage estimation tool, which will provide analysis of query history for various databases.
  • Created a mapping dict for algo for each dialect at layer level (#934). A series of changes have been implemented to improve the reconciliation process and the handling of hash algorithms in the open-source library. A mapping dictionary algorithm to dialect has been introduced at the layer level to enhance the reconciliation process. The get_hash_transform function now accepts a new layer argument and returns a list of hash algorithms from the HashAlgoMapping dictionary. A new HashAlgoMapping class has been added to map algorithms to a dialect for hashing, replacing the previous DialectHashConfig class. A new function get_dialect has been introduced to retrieve the dialect based on the layer. The _hash_transform function and the build_query method have been updated to use the layer parameter when determining the dialect. These changes provide more precise control over the algorithm used for hash transformation based on the source layer and the target dialect, resulting in improved reconciliation performance and accuracy.
  • Fetch TableDefinitions from Snowflake (#904). A new SnowflakeTableDefinitions class has been added to simplify the discovery of Snowflake table metadata. This class establishes a connection with a Snowflake database through a Connection object, and provides methods such as getDataType and getTableDefinitionQuery to parse data types and generate queries for table definitions. Moreover, it includes a getTableDefinitions method to retrieve all table definitions in a Snowflake database as a sequence of TableDefinition objects, which encapsulates various properties of each table. The class also features methods for retrieving all catalogs and schemas in a Snowflake database. Alongside SnowflakeTableDefinitions, a new test class, SnowflakeTableDefinitionTest, has been introduced to verify the behavior of getTableDefinitions and ensure that the class functions as intended, adhering to the desired behavior.
  • Guide user on missing configuration file (#930). In this commit, the _verify_recon_table_config method in the runner.py file of the databricks/labs/remorph package has been updated to handle missing reconcile table configurations during installation. When the reconcile table configuration is not found, an error message will now display the name of the requested configuration file. This enhancement helps users identify the specific configuration file they need to provide in their workspace, addressing issue #919. This commit is co-authored by Ludovic Claude.
  • Implement more missing visitor functions for Snowflake and TSQL (#975). In this release, we have added several missing visitor methods for the Snowflake and TSQL builder classes to improve the reliability and maintainability of our parser. Previously, when a visitor method was missing, the default visitor was called, causing the visit of all children of the ParseTree, which was not ideal. This could lead to incorrect results due to a slight modification in the ANTLR grammar inadvertently breaking the visitor. In this release, we have implemented several new visitor methods for both Snowflake and TSQL builder classes, including the visitDdlCommand method in the SnowflakeDDLBuilder class and the visitDdlClause method in the TSqlDDLBuilder class. These new methods ensure that the ParseTree is traversed correctly and that the correct IR node is returned. The visitDdlCommand method checks for different types of DDL commands, such as create, alter, drop, and undrop, and calls the appropriate method for each type. The visitDdlClause method contains a sequence of methods corresponding to different DDL clauses and calls the first non-null method in the sequence. These changes significantly improve the robustness of our parser and enhance the reliability of our code.
  • Introduce typed errors (#981). This commit introduces typed errors in the form of a new class, UnexpectedNode, and several case classes including ParsingError, UnsupportedDataType, WrongNumberOfArguments, UnsupportedArguments, and UnsupportedDateTimePart in various packages, as part of the ongoing effort to replace exception throwing with returning Result types in future pull requests. These changes will improve error handling and provide more context and precision for errors, facilitating debugging and maintenance of the remorph library and data type generation functionality. The TranspileException class is now constructed with specific typed error instances, and the ErrorCollector and ErrorDetail classes have been updated to use ParsingError. Additionally, the SnowflakeCallMapper and SnowflakeTimeUnits classes have been updated to use the new typed error mechanism, providing more precise error handling for Snowflake-specific functions and expressions.
  • Miscellaneous improvements to Snowflake parser (#952). This diff brings several miscellaneous improvements to the Snowflake parser in the open-source library, targeting increased parse and transpilation success rates. The modifications include updating the colDecl rule to allow optional data types, introducing an objectField rule, and enabling date and timestamp literals as strings. Additionally, the parser has been refined to handle identifiers more efficiently, such as hashes within the AnonymizerTest. The expected Ast for certain test cases has also been updated to improve parser accuracy. These changes aim to create a more robust and streamlined Snowflake parser, minimizing parsing errors and enhancing overall user experience for project adopters. Furthermore, the error handling and reporting capabilities of the Snowflake parser have been improved with new case classes, IndividualError and ErrorsSummary, and updated error messages.
  • Moved intermediate package out of parsers (#972). In this release, the intermediate package has been refactored out of the parsers package, aligning with the design principle that parsers should depend on the intermediate representation instead of the other way around. This change affects various classes and methods across the project, all of which have been updated to import the intermediate package from its new location. No new functionality has been introduced, but the refactoring improves the package structure and dependency management. The EstimationAnalyzer class in the coverage/estimation package has been updated to import classes from the new location of the intermediate package, and its evaluateTree method has been updated to use the new import path for LogicalPlan and Expression. Other affected classes include SnowflakeTableDefinitions, SnowflakeLexer, SnowflakeParser, SnowflakeTypeBuilder, GeneratorContext, DataTypeGenerator, IRHelpers, and multiple test files.
  • Patch Function without Brackets (#907). This commit introduces new lexer and parser rules to handle Snowflake SQL functions without parentheses, specifically impacting CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new rules allow these functions to be used without parentheses, consistent with Snowflake SQL. This change fixes functional tests and includes documentation for the affected functions. However, there is a pending task to add or fix more test cases to ensure comprehensive testing of the new rules. Additionally, the syntax of the SELECT statement for the CURRENT_TIMESTAMP function has been updated, removing the need for the parameter 'col1'. This change simplifies the syntax for certain SQL functions in the codebase and improves the consistency and reliability of the functional tests.
  • Root Table (#936). This PR introduces a new class TableGraph that extends DependencyGraph and implements the LazyLogging trait. The class builds a graph of tables and their dependencies based on query history and table definitions, and provides methods to add nodes and edges, build the graph, and retrieve root, upstream, and downstream tables. The DependencyGraph trait offers a more structured and flexible way to handle table dependencies. This change is part of the Root Table feature (issue #936), which identifies root tables in a graph of table dependencies, closing issue #23. The PR includes a new TableGraphTest class that demonstrates the use of these methods and verifies their behavior for better data flow understanding and optimization.
  • Snowflake Merge Implementation (#964). In this release, we have implemented the Merge statement for the Snowflake parser, which enables updating or deleting rows in a target table based on matches with a source table, and inserting new rows into the target table when there are no matches. This feature includes updates to the SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for proper handling of column names and MERGE queries. Additionally, we have added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec, and SnowflakeToDatabricksTranspiler to ensure the accurate translation and execution of MERGE statements for the Snowflake dialect. These changes bring important database migration and synchronization capabilities to our open-source library, improving its functionality and usability for software engineers.
  • TSQL: Implement CREATE TABLE (#911). This commit implements the TSQL CREATE TABLE command and its various options and forms, including CTAS, graph node syntax, and analytics variants, as well as syntactical differences for SQL Server. The DDL and DML visitors have been moved from the AST and Relation visitors to separate classes for better responsibility segregation. The LogicalPlanGenerator class has been updated to generate unique constraints, primary keys, foreign keys, check constraints, default value constraints, and identity constraints for the CREATE TABLE command. Additionally, new classes for generating SQL options and handling unresolved options during transpilation have been added to enhance the parser's capability to manage various options and forms. These changes improve the transpilation of TSQL code and the organization of the codebase, making it easier to maintain and extend.
  • Transpile Snow ARRAY_SORT function (#973). In this release, we have implemented support for the Snowflake ARRAY_SORT function in our open-source library. This feature has been added as part of issue #973, and it involves the addition of two new private methods, arraySort and makeArraySort, to the SnowflakeCallMapper class. The arraySort method takes a sequence of expressions as input and sorts the array using the makeArraySort method. The makeArraySort method handles both null and non-null values, sorts the array in ascending or descending order based on the provided parameter, and determines the position of null or small values based on the nulls first parameter. The sorted array is then returned as an ir.ArraySort expression. This functionality allows for the sorting of arrays in Snowflake SQL to be translated to equivalent code in the target language. This enhancement simplifies the process of working with arrays in Snowflake SQL and provides users with a more streamlined experience.
  • Transpile Snow MONTHS_BETWEEN function correctly (#963). In this release, the remorph library's SnowflakeCallMapper class in the com/databricks/labs/remorph/parsers/snowflake/rules package has been updated to handle the MONTHS_BETWEEN function. A new case has been added that creates a MonthsBetween object with the first two arguments of the function call and a boolean value of true. This change enhances compatibility and ensures that the output accurately reflects the intended functionality. Additionally, new test cases have been introduced to the SnowflakeCallMapperSpec for the transpilation of the MONTHS_BETWEEN function. These test cases demonstrate accurate mapping of the function to the MonthsBetween class and proper casting of inputs as dates or timestamps, improving the reliability and precision of date and time calculations.
  • Updated Installation to handle install errors (#962). In this release, we've made significant improvements to the remorph project, addressing and resolving installation errors that were occurring during the installation process in development mode. We've introduced a new ProductInfo class in the wheels module, which provides information about the products being installed. This change replaces the use of WheelsV2 in two test functions. Additionally, we've updated the workspace_installation method in application.py to handle installation errors more effectively, addressing the dependency on workspace .remorph due to wheels. We've also added new methods to installation.py to manage local and remote version files, and updated the _upgrade_reconcile_workflow function to ensure the correct wheel path is used during installation. These changes improve the overall quality of the codebase, making it easier for developers to adopt and maintain the project, and ensure a more seamless installation experience for users.
  • Updated catalog operations logging (#910). In this release, the setup process for the catalog, schema, and volume in the configurator module has been simplified and improved. The previous implementation repeatedly prompted the user for input until the correct input was provided or a maximum number of attempts was reached. The updated code now checks if the catalog, schema, or volume already exists and either uses it or prompts the user to create it once. If the user does not have the necessary privileges to use the catalog, schema, or volume, an error message is logged and the installation is aborted. New methods have been added to check for necessary privileges, such as has_necessary_catalog_access, has_necessary_schema_access, and has_necessary_volume_access, which return a boolean indicating whether the user has the necessary privileges and log an error message with the missing privileges if not. The logging for catalog operations in the install.py file has also been updated to check for privileges at the end of the process and list any missing privileges for each catalog object. Additionally, changes have been made to the unit tests for the ResourceConfigurator class to ensure that the system handles cases where the user does not have the necessary permissions to access catalogs, schemas, or volumes, preventing unauthorized access and maintaining the security and integrity of the system.
  • Updated remorph reconcile workflow to use wheels instead of pypi (#884). In this release, the installation process for the Remorph library has been updated to allow for the use of locally uploaded wheel files instead of downloading the package from PyPI. This change includes updates to the install and _deploy_jobs methods in the recon.py file to accept a new wheel_paths argument, which is used to pass the path of the Remorph wheel file to the deploy_recon_job method. The _upgrade_reconcile_workflow function in the v0.4.0_add_main_table_operation_name_column.py file has also been updated to upload the wheel package to the workspace and pass its path to the deploy_reconcile_job method. Additionally, the deploy_recon_job method in the JobDeployment class now accepts a new wheel_file argument, which represents the name of the wheel file for the remorph library. These changes address issues faced by customers with no public internet access and enable the use of new features before they are released on PyPI. The test_recon.py file in the tests/unit/deployment directory has also been updated to reflect these changes.
  • Upgrade script Implementation (#777). In this release, we've implemented an upgrade script as part of pull request #777, which resolves issue #769. This change introduces a new Upgrades class in application.py that accepts product_info and installation as parameters and includes a cached property wheels for improved performance. Additionally, we've added new methods to the WorkspaceInstaller class for handling upgrade-related tasks, including the creation of a ProductInfo object, interacting with the Databricks SDK, and handling potential errors. We've also added a test case to ensure that upgrades are applied correctly on more recent versions. These changes are part of our ongoing effort to enhance the management and application of upgrades to installed products.
  • bug fix for to_array function (#961). A bug fix has been implemented to improve the TO_ARRAY function in our open-source library. Previously, this function expected only one parameter, but it has been updated to accept two parameters, with the second being optional. This change brings the function in line with other functions in the class, improving flexibility and ensuring backward compatibility. The TO_ARRAY function is used to convert a given expression to an array if it is not null and return null otherwise. The commit also includes updates to the Generator class, where a new entry for the ToArray expression has been added to the expression_map dictionary. Additionally, a new ToArray class has been introduced as a subclass of Func, allowing the function to handle a variable number of arguments more gracefully. Relevant updates have been made to the functional tests for the to_array function for both Snowflake and Databricks SQL, demonstrating its handling of null inputs and comparing it with the corresponding ARRAY function in each SQL dialect. Overall, these changes enhance the functionality and adaptability of the TO_ARRAY function.
  • feat: Implement all of TSQL predicates except for SOME ALL ANY (#922). In this commit, we have implemented the IR generation for several TSQL predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT, thereby improving the parser's ability to handle a wider range of TSQL syntax. The TSqlParser class has been updated with new methods and changes to existing ones, including the addition of new labeled expressions to the predicate rule. Additionally, we have corrected an error in the LIKE predicate's implementation, allowing the ESCAPE character to accept a full expression that evaluates to a single character at runtime, rather than assuming it to be a single character at parse time. These changes provide more flexibility and adherence to the TSQL standard, enhancing the overall functionality of the project for our adopters.
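
As an illustration of the bracket-less function support described in the Patch Function without Brackets entry (#907) above, the following Snowflake-style statement is a minimal sketch of the syntax the parser now accepts; the column aliases are hypothetical and the repository's functional tests may use different queries.

```sql
-- These ANSI niladic functions may now be written without parentheses,
-- matching Snowflake SQL behaviour.
SELECT
  CURRENT_DATE      AS today,
  CURRENT_TIMESTAMP AS now,
  LOCALTIMESTAMP    AS local_now;
```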
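
To complement the Snowflake Merge Implementation entry (#964) above, here is a minimal, hypothetical MERGE statement of the kind the parser and builders now handle; table and column names are illustrative only and the transpiled Databricks output may differ in formatting.

```sql
-- Update matching rows, insert the rest (illustrative).
MERGE INTO target_tbl AS t
USING source_tbl AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (id, amount) VALUES (s.id, s.amount);
```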
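
For the Transpile Snow ARRAY_SORT function entry (#973) above, the sketch below shows the call shape being translated; the Databricks rendering given here is one plausible mapping under "ascending, nulls first" settings and is not necessarily the transpiler's exact output.

```sql
-- Snowflake source (illustrative): sort ascending with NULLs first.
SELECT ARRAY_SORT(tags, TRUE, TRUE) FROM articles;

-- A comparable Databricks SQL expression; SORT_ARRAY in ascending
-- order also places NULL elements first.
SELECT SORT_ARRAY(tags, TRUE) FROM articles;
```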
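
For the Transpile Snow MONTHS_BETWEEN function entry (#963) above, a hedged before/after sketch; the casts and the trailing round-off flag reflect the behaviour described in the entry, but the generated SQL may differ in detail.

```sql
-- Snowflake source (illustrative):
SELECT MONTHS_BETWEEN(end_date, start_date) FROM subscriptions;

-- One possible Databricks SQL rendering, with explicit casts and
-- the round-off flag set to true:
SELECT MONTHS_BETWEEN(CAST(end_date AS DATE), CAST(start_date AS DATE), TRUE)
FROM subscriptions;
```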
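
For the to_array bug fix entry (#961) above, the following sketch mirrors the null-handling semantics described there; the exact expression emitted for Databricks may differ from this hypothetical form.

```sql
-- Snowflake source (illustrative): NULL stays NULL, anything else is wrapped.
SELECT TO_ARRAY(col1) FROM tbl;

-- Databricks SQL with the same null behaviour:
SELECT IF(col1 IS NULL, NULL, ARRAY(col1)) FROM tbl;
```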

0.6.0

  • Added query history retrieval from Snowflake (#874). This release introduces query history retrieval from Snowflake, enabling expanded compatibility and data source options for the system. The update includes adding the Snowflake JDBC driver and its dependencies to the pom.xml file, and the implementation of a new SnowflakeQueryHistory class to retrieve query history from Snowflake. The Anonymizer object is also added to anonymize query histories by fingerprinting queries based on their structure. Additionally, several case classes are added to represent various types of data related to query execution and table definitions in a Snowflake database. A new EnvGetter class is also included to retrieve environment variables for use in testing. Test files for the Anonymizer and SnowflakeQueryHistory classes are added to ensure proper functionality.
  • Added support for ALTER TABLE: ADD COLUMNS, DROP COLUMNS, RENAME COLUMNS, and DROP CONSTRAINTS (#861). In this release, support for various ALTER TABLE SQL commands has been added to our open-source library, including ADD COLUMNS, DROP COLUMNS, RENAME COLUMNS, and DROP CONSTRAINTS. These features have been implemented in the LogicalPlanGenerator class, which now includes a new private method alterTable that takes a context and an AlterTableCommand object and returns an ALTER TABLE SQL statement. Additionally, a new sealed trait TableAlteration has been introduced, with four case classes extending it to handle specific table alteration operations. The SnowflakeTypeBuilder class has also been updated to parse and build Snowflake-specific SQL types for these commands. These changes provide improved functionality for managing and manipulating tables in Snowflake, making it easier for users to work with and modify their data. The new functionality has been tested using the SnowflakeToDatabricksTranspilerTest class, which specifies Snowflake ALTER TABLE commands and the expected transpiled results.
  • Added support for STRUCT types and conversions (#852). This change adds support for STRUCT types and conversions in the system by implementing new StructType, StructField, and StructExpr classes for parsing, data type inference, and code generation. It also maps the OBJECT_CONSTRUCT from Snowflake and introduces updates to various case classes such as JsonExpr, Struct, and Star. These improvements enhance the system's capability to handle complex data structures, ensuring better compatibility with external data sources and expanding the range of transformations available for users. Additionally, the changes include the addition of test cases to verify the functionality of generating SQL data types for STRUCT expressions and handling JSON literals more accurately.
  • Minor upgrades to Snowflake parameter processing (#871). This commit includes minor upgrades to Snowflake parameter processing, enhancing the consistency and readability of the code. The changes normalize parameter generation to use ${} syntax for clarity and to align with Databricks notebook examples. An extra coverage test for variable references within strings has been added. The specific changes include updating a SELECT statement in a Snowflake SQL query to use ${} for parameter processing. The commit also introduces a new SQL file for functional tests related to Snowflake's parameter processing, which includes commented out and alternate syntax versions of a query. This commit is part of continuous efforts to improve the functionality, reliability, and usability of the Snowflake parameter processing feature.
  • Patch/reconcile support temp views (#901). The latest update to the remorph-reconcile library adds support for temporary views, a new feature that was not previously available. With this change, the system can now handle global_temp for temporary views by modifying the _get_schema_query method to return a query for the global_temp schema if the schema name is set as such. Additionally, the read_data method was updated to correctly handle the namespace and catalog for temporary views. The new variable namespace_catalog has been introduced, which is set to hive_metastore if the catalog is not set, and to the original catalog with the added schema otherwise. The table_with_namespace variable is then updated to use the namespace_catalog and table name, allowing for correct querying of temporary views. These modifications enable remorph-reconcile to work seamlessly with temporary views, enhancing its flexibility and functionality. The updated unit tests reflect these changes, with assertions to ensure that the correct SQL statements are being generated and executed for temporary views.
  • Reconcile Table Recon JSON filename updates (#866). The Remorph project has implemented a change to the naming convention and placement of the configuration file for the table reconciliation process. The configuration file, previously named according to individual preference, must now follow the pattern recon_config_<DATA_SOURCE>_<SOURCE_CATALOG_OR_SCHEMA>_<REPORT_TYPE>.json and be placed in the .remorph directory within the Databricks Workspace. Examples of Table Recon filenames for Snowflake, Oracle, and Databricks source systems have been provided for reference. Additionally, the data_source field in the config file has been updated to accurately reflect the data source. The case of the filename should now match the case of SOURCE_CATALOG_OR_SCHEMA as defined in the config. Compliance with this new naming convention and placement is required for the successful execution of the table reconciliation process.
  • [snowflake] parse parameters (#855). The open-source library has undergone changes related to Scalafmt configuration, Snowflake SQL parsing, and the introduction of a new ExpressionGenerator class method. The Scalafmt configuration change introduces a new docstrings.wrap option set to false, disabling docstring wrapping at the specified column limit. The danglingParentheses.preset option is also set to false, disabling the formatting rule for unnecessary parentheses. In Snowflake SQL parsing, new token types, lexer modes, and parser rules have been added to improve the parsing of string literals and other elements. A new variable method in the ExpressionGenerator class generates SQL expressions for ir.Variable objects. A new Variable case class has been added to represent a variable in an expression, and the SchemaReference case class now takes a single child expression. The SnowflakeDDLBuilder class has a new method, extractString, to safely extract strings from ANTLR4 context objects. The SnowflakeErrorStrategy object now includes new parameters for parsing Snowflake syntax, and the Snowflake LexerSpec test class has new methods for filling tokens from an input string and dumping the token list. Tests have been added for various string literal scenarios, and the SnowflakeAstBuilderSpec includes a new test case for handling the translate amps functionality. The Snowflake SQL queries in the test file have been updated to standardize parameter referencing syntax, improving consistency and readability.
  • fixed current_date() generation (#890). This release includes a fix for an issue with the generation of the current_date() function in SQL queries, specifically for the Snowflake dialect. A test case in the sqlglot-incorrect category has been updated to use the correct syntax for the CURRENT_DATE function, which includes parentheses (SELECT CURRENT_DATE() FROM tabl;). Additionally, the current_date() function is now called consistently throughout the tests, either as CURRENT_DATE or CURRENT_DATE(), depending on the syntax required by Snowflake. No new methods were added, and the existing functionality was changed only to correct the current_date() generation. This improvement ensures accurate and consistent generation of the current_date() function across different SQL dialects, enhancing the reliability and accuracy of the tests.
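
For the ALTER TABLE support entry (#861) above, the statements below sketch the Databricks SQL command shapes now generated; table, column, and constraint names are hypothetical and the generator's exact output may differ.

```sql
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(10, 2));
ALTER TABLE sales DROP COLUMNS (legacy_flag);
ALTER TABLE sales RENAME COLUMN amt TO amount;
ALTER TABLE sales DROP CONSTRAINT sales_pk;
```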
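
For the Snowflake parameter processing entry (#871) above, a hypothetical illustration of the two parameter styles involved; the actual functional-test queries may differ.

```sql
-- Snowflake-style session variable reference (illustrative):
SELECT * FROM sales WHERE region = $region;

-- Normalised ${} style, aligned with Databricks notebook parameters:
SELECT * FROM sales WHERE region = '${region}';
```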

0.5.0

  • Added Translation Support for ! as commands and & for Parameters (#771). This commit adds translation support for using "!" as commands and "&" as parameters in Snowflake code within the remorph tool, enhancing compatibility with Snowflake syntax. The "!set exit_on_error=true" command, which previously caused an error, is now treated as a comment and prepended with -- in the output. The "&" symbol, previously unrecognized, is converted to its Databricks equivalent "$", which represents parameters, allowing for proper handling of Snowflake SQL code containing "!" commands and "&" parameters. These changes improve the compatibility and robustness of remorph with Snowflake code and enable more efficient processing of Snowflake SQL statements. Additionally, the commit introduces a new test suite for Snowflake commands, enhancing code coverage and ensuring proper functionality of the transpiler.
  • Added LET and DECLARE statements parsing in Snowflake PL/SQL procedures (#548). This commit introduces support for parsing DECLARE and LET statements in Snowflake PL/SQL procedures, enabling variable declaration and assignment. It adds new grammar rules, refactors code using ScalaSubquery, and implements IR visitors for DECLARE and LET statements with Variable Assignment and ResultSet Assignment. The RETURN statement and parameterized expressions are also now supported. Note that CURSOR is not yet covered. These changes allow for improved processing and handling of Snowflake PL/SQL code, enhancing the overall functionality of the library.
  • Added logger statements in get_schema function (#756). In this release, enhanced logging has been implemented in the Metadata (Schema) fetch functions, specifically in the get_schema function and other metadata fetch functions within Oracle, SnowflakeDataSource modules. The changes include logger statements that log the schema query, start time, and end time, providing better visibility into the performance and behavior of these functions during debugging or monitoring. The logging functionality is implemented using the built-in logging module and timestamps are obtained using the datetime module. In the SnowflakeDataSource class, RuntimeError or PySparkException will be raised if the user's current role lacks the necessary privileges to access the specified Information Schema object. The INFORMATION_SCHEMA table in Snowflake is used to fetch the schema, with the query modified to handle unquoted and quoted identifiers and the ordinal position of columns. The get_schema_query function has also been updated for better formatting for the SQL query used to fetch schema information. The schema fetching method remains unchanged, but these enhancements provide more detailed logging for debugging and monitoring purposes.
  • Aggregates Reconcile CLI Implementation (#770). The Aggregates Reconcile CLI Implementation commit introduces a new command-line interface (CLI) for reconcile jobs, specifically for aggregated data. This change adds a new parameter, "operation_name", to the run method in the runner.py file, which determines the type of reconcile operation to perform. A new function, _trigger_reconcile_aggregates, has been implemented to reconcile aggregate data based on provided configurations and log the reconciliation process outcome. Additionally, new methods for defining job parameters and settings, such as max_concurrent_runs and "parameters", have been included. This CLI implementation enhances the customizability and control of the reconciliation process for users, allowing them to focus on specific use cases and data aggregations. The changes also include new test cases in test_runner.py to ensure the proper behavior of the ReconcileRunner class when the aggregates-reconcile operation_name is set.
  • Aggregates Reconcile Updates (#784). This commit introduces significant updates to the Table Deployment feature, enabling it to support Aggregate Tables deployment and modifying the persistence logic for tables. Notable changes include the addition of a new aggregates attribute to the Table class in the configuration, which allows users to specify aggregate functions and optionally group by specific columns. The reconcile process now captures mismatch data, missing rows in the source, and missing rows in the target in the recon metrics tables. Furthermore, the aggregates reconcile process supports various aggregate functions like min, max, count, sum, avg, median, mode, percentile, stddev, and variance. The documentation has been updated to reflect these improvements. The commit also removes the percentile function from the reconciliation configuration and modifies the aggregate_metrics SQL query, enhancing the flexibility of the Table Deployment feature for Aggregate Tables. Users should note that the percentile function is no longer a valid option and should update their code accordingly.
  • Aggregates Reconcile documentation (#779). In this commit, the Aggregates Reconcile utility has been enhanced with new documentation and visualizations for improved understanding and usability. The utility now includes a flow diagram, visualization, and README file illustrating how it compares specific aggregate metrics between source and target data residing on Databricks. A new configuration sample is added, showcasing the reconciliation of two tables using various aggregate functions, join columns, transformations, filters, and JDBC ReaderOptions configurations. The commit also introduces two Mermaid flowchart diagrams, depicting the reconciliation process with and without a group by operation. Additionally, new flow diagram visualizations in PNG and GIF formats have been added, aiding in understanding the process flow of the Aggregates Reconcile feature. The reconcile configuration samples in the documentation have also been updated with a spelling correction for clarity.
  • Bump sqlglot from 25.6.1 to 25.8.1 (#749). In this version update, the sqlglot dependency has been bumped from 25.6.1 to 25.8.1, bringing several bug fixes and new features related to various SQL dialects such as BigQuery, DuckDB, and T-SQL. Notable changes include support for BYTEINT in BigQuery, improved parsing and transpilation of StrToDate in ClickHouse, and support for SUMMARIZE in DuckDB. Additionally, there are bug fixes for DuckDB and T-SQL, including wrapping left IN clause json extract arrow operand and handling JSON_QUERY with a single argument. The update also includes refactors and changes to the ANNOTATORS and PARSER modules to improve dialect-aware annotation and consistency. This pull request is compatible with sqlglot version 25.6.1 and below and includes a detailed list of commits and their corresponding changes.
  • Generate window functions (#772). In this release, we have added support for generating SQL WINDOW and SortOrder expressions in the ExpressionGenerator class. This enhancement includes the ability to generate a WINDOW expression with a window function, partitioning and ordering clauses, and an optional window frame, using the window and frameBoundary methods. The sortOrder method now generates the SQL SortOrder expression, which includes the expression to sort by, sort direction, and null ordering. Additional methods orNull and doubleQuote return a string representing a NULL value and a string enclosed in double quotes, respectively. These changes provide increased flexibility for handling more complex expressions in SQL. Additionally, new test cases have been added to the ExpressionGeneratorTest to ensure the correct generation of SQL window functions, specifically the ROW_NUMBER() function with various partitioning, ordering, and framing specifications. These updates improve the robustness and functionality of the ExpressionGenerator class for generating SQL window functions.
  • Implement TSQL specific function call mapper (#765). This commit introduces several new features to enhance compatibility between TSQL and Databricks SQL. A new method, interval, has been added to generate a Databricks SQL compatible string for intervals in a TSQL expression. The expression method has been updated to handle certain functions directly, improving translation efficiency. Specifically, the DATEADD function is now translated to Databricks SQL's DATE_ADD, ADD_MONTHS, and xxx + INTERVAL n {days|months|etc} constructs. The changes also include a new sealed trait KnownIntervalType, a new case class KnownInterval, and a new class TSqlCallMapper for mapping TSQL functions to Databricks SQL equivalents. Furthermore, the commit introduces new tests for TSQL specific function call mappers, ensuring proper translation of TSQL functions to Databricks SQL compatible constructs. These improvements collectively facilitate better integration and compatibility between TSQL and Databricks SQL.
  • Improve TSQL and Snowflake parser and lexer (#757). In this release, the open-source library's Snowflake and TSQL lexers and parsers have been improved for better functionality and robustness. For the Snowflake lexer, unnecessary escape sequence processing has been removed, and various options have been corrected to be simple strings. The lexer now accepts a question mark as a placeholder for prepared statements in Snowflake statements. The TSQL lexer has undergone minor improvements, such as aligning the catch-all rule name with Snowflake. The Snowflake parser now accepts the question mark as a PARAM placeholder and simplifies the typeFileformat rule to accept a single STRING token. Additionally, several new keywords have been added to the TSQL lexer, improving consistency and clarity. These changes aim to simplify lexer and parser rules, enhance option handling and placeholders, and ensure consistency between Snowflake and TSQL.
  • Patch Information Schema Predicate Pushdown for Snowflake (#764). In this release, we have implemented Information Schema Predicate Pushdown for Snowflake, resolving issue #7.
  • TSQL: Implement correct grammar for CREATE TABLE in all forms (#796). In this release, the TSqlLexer's CREATE TABLE statement grammar has been updated and expanded to support new keywords and improve accuracy. The newly added keywords 'EDGE', 'FILETABLE', 'NODE', and NODES enable correct parsing of CREATE TABLE statements using graph nodes and FILETABLE functionality. Existing keywords such as 'DROP_EXISTING', 'DYNAMIC', 'FILENAME', and FILTER have been refined for better precision. Furthermore, the introduction of the tableIndices rule standardizes the order of columns in the table. These enhancements improve the T-SQL parser's robustness and consistency, benefiting users in creating and managing tables in their databases.
  • TSQL: Implement grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION (#788). In this release, we have implemented the TSQL grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION statements, addressing inconsistencies with TSQL documentation. The implementation was initially intended to cover the entire process from grammar to code generation. However, to simplify other DDL statements, the work was split into separate grammar-only pull requests. The diff introduces new methods such as createDatabaseScopedCredential, createDatabaseOption, and databaseFilestreamOption, while modifying the existing createDatabase method. The createDatabaseScopedCredential method handles the creation of a database scoped credential, which was previously part of createDatabaseOption. The createDatabaseOption method now focuses on handling individual options, while databaseFilestreamOption deals with filesystem specifications. Note that certain options, like DEFAULT_LANGUAGE, DEFAULT_FULLTEXT_LANGUAGE, and more, have been marked as TODO and will be addressed in future updates.
  • TSQL: Improve transpilation coverage (#766). In this update, various enhancements have been made to improve the coverage of TSQL transpilation and address bugs in code generation, particularly for the ExpressionGenerator class in the com/databricks/labs/remorph/generators/sql package, and the TSqlExpressionBuilder, TSqlFunctionBuilder, TSqlCallMapper, and QueryRunner classes. Changes include adding support for new cases, modifying code generation behavior, improving test coverage, and updating existing tests for better TSQL code generation. Specific additions include new methods for handling bitwise operations, converting CHECKSUM_AGG calls to a sequence of MD5 function calls, and handling Fn instances. The QueryRunner class has been updated to include both the actual and expected outputs in error messages for better debugging purposes. Additionally, the test file for the DATEADD function has been updated to ensure proper syntax and consistency. All these modifications aim to improve the reliability, accuracy, and compatibility of TSQL transpilation, ensuring better functionality and coverage for the Remorph library's transformation capabilities.
  • [chore] speedup build process by not running unit tests twice (#842). In this commit, the build process for the open-source library has been optimized by removing the execution of unit tests during the build phase in the Maven build process. A new plugin for the Apache Maven Surefire Plugin has been added, with the group ID set to "org.apache.maven.plugins", artifact ID set to "maven-surefire-plugin", and version set to "3.1.2". The configuration for this plugin includes a skipTests attribute set to "true", ensuring that tests are not run twice, thereby improving the build process speed. The existing ScalaTest Maven plugin configuration remains unchanged, allowing Scala tests to still be executed during the test phase. Additionally, the Maven Compiler Plugin has been upgraded to version 3.11.0, and the release parameter has been set to 8, ensuring that the Java compiler used during the build process is compatible with Java 8. The version numbers for several libraries, including os-lib, mainargs, ujson, scalatest, and exec-maven-plugin, are now being defined using properties, allowing Maven to manage and cache these libraries more efficiently. These changes improve the build process's performance and reliability without affecting the existing functionality.
  • [internal] better errors for call mapper (#816). In this release, the ExpressionGenerator class in the com.databricks.labs.remorph.generators.sql package has been updated to handle exceptions during the conversion of input functions to Databricks expressions. A try-catch block has been added to catch IndexOutOfBoundsException and provide a more descriptive error message, including the name of the problematic function and the error message associated with the exception. A TranspileException with the message not implemented is now thrown when encountering a function for which a translation to Databricks expressions is not available. The IsTranspiledFromSnowflakeQueryRunner class in the com.databricks.labs.remorph.coverage package has also been updated to include the name of the exception class in the error message for better error identification when a non-fatal error occurs during parsing. Additionally, the import statement for Formatter has been moved to ensure alphabetical order. These changes improve error handling and readability, thereby enhancing the overall user experience for developers interacting with the codebase.
  • [snowflake] map more functions to Databricks SQL (#826). This commit introduces new private methods andPredicate and orPredicate to the ExpressionGenerator class in the com.databricks.labs.remorph.generators.sql package, enhancing the generation of SQL expressions for AND and OR logical operators, and improving readability and correctness of complex logical expressions. The LogicalPlanGenerator class in the sql package now supports more flexibility in inserting data into a target relation, enabling users to choose between overwriting the existing data or appending to it. The FROM_JSON function in the CallMapper class has been updated to accommodate an optional third argument, providing more flexibility in handling JSON-related transformations. A new class, CastParseJsonToFromJson, has been introduced to improve the performance of data processing pipelines that involve parsing JSON data in Snowflake using the PARSE_JSON function. Additional Snowflake SQL functions have been mapped to Databricks SQL IR, enhancing compatibility and functionality. The ExpressionGeneratorTest class now generates predicates without parentheses, simplifying and improving readability. Mappings for several Snowflake functions to Databricks SQL have been added, enhancing compatibility with Databricks SQL. The sqlFiles sequence in the NestedFiles class is now sorted before being mapped to AcceptanceTest objects, ensuring consistent order for testing or debugging purposes. A semicolon has been added to the end of a SQL query in a test file for Snowflake DML insert functionality, ensuring proper query termination.
  • [sql] generate INSERT INTO ... (#823). In this release, we have made significant updates to our open-source library. The ExpressionGenerator.scala file has been updated to convert boolean values to lowercase instead of uppercase when generating INSERT INTO statements, ensuring SQL code consistency. A new method insert has been added to the LogicalPlanGenerator class to generate INSERT INTO SQL statements based on the InsertIntoTable input. We have introduced a new case class InsertIntoTable that extends Modification to simplify the API for DML operations other than SELECT. The SQL ExpressionGenerator now generates boolean literals in lowercase, and new test cases have been added to ensure the correct generation of INSERT and JOIN statements. Lastly, we have added support for generating INSERT INTO statements in SQL for specified database tables, improving cross-platform compatibility. These changes aim to enhance the library's functionality and ease of use for software engineers.
  • [sql] generate basic JSON access (#835). In this release, we have added several new features and improvements to our open-source library. The ExpressionGenerator class now includes a new method, jsonAccess, which generates SQL code to access a JSON object's properties, handling different types of elements in the path. The TO_JSON function in the StructsToJson class has been updated to accept an optional expression as an argument, enhancing its flexibility. The SnowflakeCallMapper class now includes a new method, lift, and a new feature to generate basic JSON access, with corresponding updates to test cases and methods. The SQL logical plan generator has been refined to generate star projections with escaped identifiers, handling complex table and database names. We have also added new methods and test cases to the SnowflakeCallMapper class to convert Snowflake structs into JSON strings and cast Snowflake values to specific data types. These changes improve the library's ability to handle complex JSON data structures, enhance functionality, and ensure the quality of generated SQL code.
  • [sql] generate basic CREATE TABLE definition (#829). In this release, the open-source library's SQL generation capabilities have been enhanced with the addition of a new createTable method to the LogicalPlanGenerator class. This method generates a CREATE TABLE definition for a given ir.CreateTableCommand, producing a SQL statement with a comma-separated list of column definitions. Each column definition includes the column name, data type, and any applicable constraints, generated using the DataTypeGenerator.generateDataType method and the newly-introduced constraint method. Additionally, the project method has been updated to incorporate a FROM clause in the generated SQL statement when the input of the project node is not ir.NoTable(). These improvements extend the functionality of the LogicalPlanGenerator class, allowing it to generate CREATE TABLE statements for input catalog ASTs, thereby better supporting data transformation use cases. A new test for the CreateTableCommand has been added to the LogicalPlanGeneratorTest class to validate the correct transpilation of the CreateTableCommand to a CREATE TABLE SQL statement.
  • [sql] generate basic TABLESAMPLE (#830). In this commit, the open-source library's LogicalPlanGenerator class has been updated to include a new method, tableSample, which generates SQL representations of table sampling operations. Previously, the class only handled INSERT, DELETE, and CREATE TABLE commands. With this enhancement, the generator can now produce SQL statements using the TABLESAMPLE clause, allowing for the selection of a sample of data from a table based on various sampling methods and a seed value for repeatable sampling. The newly supported sampling methods include row-based probabilistic, row-based fixed amount, and block-based sampling. Additionally, a new test case has been added for the LogicalPlanGenerator related to the TableSample class, validating the correct transpilation of named tables and fixed row sampling into the TABLESAMPLE clause with specified parameters. This improvement ensures that the generated SQL code accurately represents the desired table sampling settings.
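
For the ! commands and & parameters entry (#771) above, a sketch of the documented translation; query text and parameter names are illustrative, and the comment formatting in real output may differ.

```sql
-- Snowflake input (illustrative):
--   !set exit_on_error=true
--   SELECT * FROM orders WHERE region = '&region';
--
-- Transpiled output: the ! command is preserved as a comment and the
-- & parameter becomes the Databricks $ parameter marker.
-- !set exit_on_error=true
SELECT * FROM orders WHERE region = '$region';
```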
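
For the window function generation entry (#772) above, the query below sketches the kind of WINDOW and SortOrder output described, using hypothetical table and column names.

```sql
SELECT
  ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_date DESC NULLS LAST
  ) AS rn,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;
```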
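
For the TSQL specific function call mapper entry (#765) above, a hedged sketch of the DATEADD translations it describes; the choice among DATE_ADD, ADD_MONTHS, and the INTERVAL form depends on the date part, and the real output may vary.

```sql
-- T-SQL source (illustrative):
--   SELECT DATEADD(day, 7, order_date)   FROM orders;
--   SELECT DATEADD(month, 3, order_date) FROM orders;

-- Possible Databricks SQL renderings:
SELECT DATE_ADD(order_date, 7)      FROM orders;  -- day   -> DATE_ADD
SELECT ADD_MONTHS(order_date, 3)    FROM orders;  -- month -> ADD_MONTHS
SELECT order_date + INTERVAL 7 DAYS FROM orders;  -- interval-based form
```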
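
The three [sql] generator entries above (#823, #829, #830) cover INSERT INTO, basic CREATE TABLE, and TABLESAMPLE generation; the statements below are a hypothetical sample of that output style rather than verbatim generator output.

```sql
-- Basic CREATE TABLE definition with column types:
CREATE TABLE customers (id INT, name STRING, active BOOLEAN);

-- INSERT INTO with lowercase boolean literals:
INSERT INTO customers VALUES (1, 'Ada', true);

-- Row-based probabilistic sampling with a repeatable seed:
SELECT * FROM customers TABLESAMPLE (10 PERCENT) REPEATABLE (42);
```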

Dependency updates:

  • Bump sqlglot from 25.6.1 to 25.8.1 (#749).

0.4.1

  • Aggregate Queries Reconciliation (#740). This release introduces several changes to enhance the functionality of the project, including the implementation of Aggregate Queries Reconciliation, addressing issue #503. A new property, aggregates, has been added to the base class of the query builder module to support aggregate queries reconciliation. A generate_final_reconcile_aggregate_output function has been added to generate the final reconcile output for aggregate queries. A new SQL file creates a table called aggregate_details to store details about aggregate reconciles, and a new column, operation_name, has been added to the main table in the installation reconciliation query. Additionally, new classes and methods have been introduced for handling aggregate queries and their reconciliation, and new SQL tables and columns have been created for storing and managing rules for aggregating data in the context of query reconciliation. Unit tests have been added to ensure the proper functioning of aggregate queries reconciliation and reconcile aggregate data in the context of missing records.
  • Generate GROUP BY / PIVOT (#747). The LogicalPlanGenerator class in the remorph library has been updated to support generating GROUP BY and PIVOT clauses for SQL queries. A new private method, "aggregate", has been added to handle two types of aggregates: GroupBy and Pivot. For GroupBy, it generates a GROUP BY clause with specified grouping expressions. For Pivot, it generates a PIVOT clause where the specified column is used as the pivot column and the specified values are used as the pivot values, compatible with Spark SQL. If the aggregate type is unsupported, a TranspileException is thrown. Additionally, new test cases have been introduced for the LogicalPlanGenerator class in the com.databricks.labs.remorph.generators.sql package to support testing the transpilation of Aggregate expressions with GROUP BY and PIVOT clauses, ensuring proper handling and transpilation of these expressions.
  • Implement error strategy for Snowflake parsing and use error strategy for all parser instances (#760). In this release, we have developed an error strategy specifically for Snowflake parsing that translates raw token names and parser rules into more user-friendly SQL error messages. This strategy is applied consistently across all parser instances, ensuring a unified error handling experience. Additionally, we have refined the DBL_DOLLAR rule in the SnowflakeLexer grammar to handle escaped dollar signs correctly. These updates improve the accuracy and readability of error messages for SQL authors, regardless of the parsing tool or transpiler used. Furthermore, we have updated the TSQL parsing error strategy to match the new Snowflake error strategy implementation, providing a consistent error handling experience across dialects.
  • Incremental improvement to error messages - article selection (#711). In this release, we have implemented an incremental improvement to the error messages generated during T-SQL code parsing. This change introduces a new private method, articleFor, which determines whether to use a or an in the generated messages based on the first letter of the following word. The generateMessage method has been updated to use this new method when constructing the initial error message and subsequent messages when there are multiple expected tokens. This improvement ensures consistent use of articles a or an in the error messages, enhancing their readability for software engineers working with T-SQL code.
  • TSQL: Adds tests and support for SELECT OPTION(...) generation (#755). In this release, we have added support for generating code for the TSQL SELECT ... OPTION(...) clause in the codebase. This new feature includes the ability to transpile any query hints supplied with a SELECT statement as comments in the output code, allowing for easier assessment of query performance after transpilation. The OPTION clause is now generated as comments, including MAXRECURSION, string options, boolean options, and auto options. Additionally, we have added new tests and updated the TSqlAstBuilderSpec test class with new and updated test cases to cover the new functionality. The implementation is focused on generating code for the OPTION clause, and does not affect the actual execution of the query. The changes are limited to the ExpressionGenerator class and its associated methods, and the TSqlRelationBuilder class, without affecting other parts of the codebase.
  • TSQL: IR implementation of MERGE (#719). The open-source library has been updated to include a complete implementation of the TSQL MERGE statement's IR (Intermediate Representation), bringing it in line with Spark SQL. The LogicalPlanGenerator class now includes a generateMerge method, which generates the SQL code for the MERGE statement, taking a MergeIntoTable object containing the target and source tables, merge condition, and merge actions as input. The MergeIntoTable class has been added as a case class to represent the logical plan of the MERGE INTO command and extends the Modification trait. The LogicalPlanGenerator class also includes a new generateWithOptions method, which generates SQL code for the WITH OPTIONS clause, taking a WithOptions object containing the input and options as children. Additionally, the TSqlRelationBuilder class has been updated to handle the MERGE statement's parsing, introducing new methods and updating existing ones, such as visitMerge. The TSqlToDatabricksTranspiler class has been updated to include support for the TSQL MERGE statement, and the ExpressionGenerator class has new tests for options, columns, and arithmetic expressions. A new optimization rule, TrapInsertDefaultsAction, has been added to handle the behavior of the DEFAULT keyword during INSERT statements. The commit also includes test cases for the MergeIntoTable logical operator and the T-SQL merge statement in the TSqlAstBuilderSpec.
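
For the Generate GROUP BY / PIVOT entry (#747) above, the queries below sketch the Spark-SQL-compatible output style it describes; table names, the pivot column, and the pivot values are hypothetical.

```sql
-- GROUP BY with explicit grouping expressions:
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;

-- PIVOT: 'quarter' is the pivot column and the listed values become columns.
SELECT *
FROM sales
PIVOT (SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'));
```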
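
For the SELECT OPTION(...) entry (#755) above, a sketch of query hints being carried over as comments; the exact comment text produced by the generator is not shown in this changelog, so the output side below is an assumption.

```sql
-- T-SQL source (illustrative):
--   SELECT name FROM employees OPTION (MAXRECURSION 100, HASH GROUP);

-- Transpiled Databricks SQL: the query is preserved and the hints are
-- emitted as comments for later performance review.
SELECT name FROM employees;
-- Query hints (from OPTION clause): MAXRECURSION 100, HASH GROUP
```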

0.4.0

  • Added TSql transpiler (#734). In this release, we have developed a new open-source library feature that enhances the transpilation of T-SQL code to Databricks-compatible code. The new TSqlToDatabricksTranspiler class has been added, which extends the Transpiler abstract class and defines the transpile method. This method converts T-SQL code to Databricks-compatible code by creating a lexer, token stream, parser, and parsed tree from the input string using TSqlLexer, CommonTokenStream, and tSqlFile. The parsed tree is then passed to the TSqlAstBuilder's visit method to generate a logical plan, which is optimized using an optimizer object with rules such as PullLimitUpwards and TopPercentToLimitSubquery. The optimized plan is then passed to the LogicalPlanGenerator's generate method to generate the final transpiled code. Additionally, a new class, IsTranspiledFromTSqlQueryRunner, has been added to the QueryRunner object to transpile T-SQL queries into the format expected by the Databricks query runner using the new TSqlToDatabricksTranspiler. The AcceptanceTestRunner class in the com.databricks.labs.remorph.coverage package has been updated to replace TSqlAstBuilder with IsTranspiledFromTSqlQueryRunner in the TSqlAcceptanceSuite class, indicating a change in the code responsible for handling TSql queries. This new feature aims to provide a smooth and efficient way to convert T-SQL code to Databricks-compatible code for further processing and execution.
  • Added the missing info for the reconciliation documentation (#520). In this release, we have made significant improvements to the reconciliation feature of our open-source library. We have added a configuration folder and provided a template for creating configuration files for specific table sources. The config files will contain necessary configurations for table-specific reconciliation. Additionally, we have included a note in the transformation section detailing the usage of user-defined functions (UDFs) in transformation expressions, with an example UDF called sort_array_input() provided. The reconcile configuration sample documentation has also been added, with a JSON configuration for reconciling tables using various operations like drop, join, transformation, threshold, filter, and JDBC ReaderOptions. The commit also includes examples of source and target tables, data overviews, and reconciliation configurations for various scenarios, such as basic config and column mapping, user transformations, explicit select, explicit drop, filters, and thresholds comparison, among others. These changes aim to make it easier for users to set up and execute the reconciliation process for specific table sources and provide clear and concise information about using UDFs in transformation expressions for reconciliation. This commit is co-authored by Vijay Pavan Nissankararao and SundarShankar89.
  • Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#555). In this pull request, the sigstore/gh-action-sigstore-python dependency is being updated from version 2.1.1 to 3.0.0. This new version includes several changes and improvements, such as the addition of recursive globbing with ** to the inputs and the removal of certain settings like fulcio-url, rekor-url, ctfe, and rekor-root-pubkey. The signature, certificate, and bundle output settings have also been removed. Furthermore, the way inputs are parsed has been changed, and they are now made optional under certain conditions. The default suffix has been updated to .sigstore.json. The 3.0.0 version also resolves various deprecations present in sigstore-python's 2.x series and supports CI runners that use PEP 668 to constrain global package prefixes.
  • Bump sqlglot from 25.1.0 to 25.5.1 (#534). In the latest release, the sqlglot package has been updated from version 25.1.0 to 25.5.1, which includes bug fixes, breaking changes, new features, and refactors for parsing, analyzing, and rewriting SQL queries. The new version introduces optimizations for coalesced USING columns, preserves EXTRACT(date_part FROM datetime) calls, decouples NVL() from COALESCE(), and supports FROM CHANGES in Snowflake. It also provides configurable transpilation of Snowflake VARIANT, and supports view schema binding options for Spark and Databricks. The update addresses several issues, such as the use of timestamp with time zone over timestamptz, switch off table alias columns generation, and parse rhs of x::varchar(max) into a type. Additionally, the update cleans up CurrentTimestamp generation logic for Teradata. The sqlglot dependency has also been updated in the 'experimental.py' file of the 'databricks/labs/remorph/snow' module, along with the addition of a new private method _parse_json to the Generator class for parsing JSON data. Software engineers should review the changes and update their code accordingly, as conflicts with existing code will be resolved automatically by Dependabot, as long as the pull request is not altered manually.
  • Bump sqlglot from 25.5.1 to 25.6.1 (#585). In this release, the sqlglot dependency is updated from version 25.5.1 to 25.6.1 in the 'pyproject.toml' file. This update includes bug fixes, breaking changes, new features, and improvements. Breaking changes consist of updates to the QUALIFY clause in queries and the canonicalization of struct and array inline constructor. New features include support for ORDER BY ALL, FROM ROWS FROM (...), RPAD & LPAD functions, and exp.TimestampAdd. Bug fixes address issues related to the QUALIFY clause in queries, expansion of SELECT * REPLACE, RENAME, transpiling UDFs from Databricks, and more. The pull request also includes a detailed changelog, commit history, instructions for triggering Dependabot actions, and commands for reference, with the exception of the compatibility score for the new version, which is not taken into account in this pull request.
  • Feature/reconcile table mismatch threshold (#550). This commit enhances the reconciliation process in the open-source library with several new features, addressing issue #504. A new get_record_count method is added to the Reconcile class, providing record count data for source and target tables, facilitating comprehensive analysis of table mismatches. A CountQueryBuilder class is introduced to build record count queries for different layers and SQL dialects, ensuring consistency in data processing. The Thresholds class is refactored into ColumnThresholds and TableThresholds, allowing for more granular control over comparisons and customizable threshold settings. New methods _is_mismatch_within_threshold_limits and _insert_into_metrics_table are added to the recon_capture.py file, improving fine-grained control over the reconciliation process and preventing false positives. Additionally, new classes, methods, and data structures have been implemented in the execute module to handle reconciliation queries and data more efficiently. These improvements contribute to a more accurate and robust reconciliation system.
  • Feature: introduce core transpiler (#715). A new core transpiler, SnowflakeToDatabricksTranspiler, has been introduced to convert Snowflake queries into Databricks SQL, streamlining integration and compatibility between the two systems. This transpiler is integrated into the coverage test suites for thorough testing, and is used to convert various types of logical plans, handling cases such as Batch, WithCTE, Project, NamedTable, Filter, and Join. The SnowflakeToDatabricksTranspiler class tokenizes input Snowflake query strings, initializes a SnowflakeParser instance, parses the input Snowflake query, generates a logical plan, and applies the LogicalPlanGenerator to the logical plan to generate the equivalent Databricks SQL query. Additionally, the SnowflakeAstBuilder class has been updated to alter the way Batch logical plans are built and improve overall functionality of the transpiler.
  • Fixed LEFT and RIGHT JOIN syntax in Snowflake ANTLR grammar (#526). A fix has been implemented to address issues with the Snowflake ANTLR grammar related to the proper parsing of LEFT and RIGHT JOIN statements. Previously, the keywords LEFT and RIGHT were incorrectly allowed as identifiers, but they are hard keywords that must be escaped to be used as column names. This change updates the grammar to escape these keywords in JOIN statements, improving the overall parsing of queries that include LEFT and RIGHT JOINs. Additionally, semantic predicates have been suggested to handle cases where LEFT or RIGHT are used as column names without escaping, although this is not yet implemented. To ensure the correctness of the updated grammar, new tests have been added to the SnowflakeAstBuilderSpec for LEFT and RIGHT JOINs, which check that the Abstract Syntax Tree (AST) is built correctly for these queries.
  • Fixed Snowflake Acceptance Testcases Failures (#531). In this release, updates have been made to the acceptance testcases for various SQL functions in the open-source library. The DENSE RANK function's testcase has been updated with a window specification and ORDER BY clause in both Snowflake and Databricks SQL syntaxes, ensuring accurate test results. The LAG function's testcase now includes a PARTITION BY and ORDER BY clause, as well as the NULLS LAST keyword in Databricks SQL, for improved accuracy and consistency. The SQL queries in the Snowflake testcase for the last_value function have been updated with a window specification, ORDER BY clause, and NULLS LAST directive for Databricks SQL. Test case failures in the Snowflake acceptance testsuite have been addressed with updates to the LEAD function, MONTH_NAME to MONTHNAME renaming, and DATE_FORMAT to TO_DATE conversion, improving reliability and consistency. The ntile function's testcase has been updated with PARTITION BY and ORDER BY clauses, and the NULLS LAST keyword has been added to the Databricks SQL query. The SQL query for null-safe equality comparison has been updated with a conditional expression compatible with Snowflake. The ranking function's testcase has been improved with the appropriate partition and order by clauses, and the NULLS LAST keyword has been added to the Databricks SQL query, enhancing accuracy and consistency. Lastly, updates have been made to the ROW_NUMBER function's testcase, ensuring accurate and consistent row numbering for both Snowflake and Databricks.
  • Fixed TSQL transpiler (#735). In this release, we have implemented a fix for the TSQL transpiler, addressing issue #7. This enhancement allows the library to accurately convert TSQL code into an equivalent format that is compatible with other databases. The fix resolves reported bugs related to incorrect syntax interpretation, thereby improving the overall reliability and functionality of the transpiler. Software engineers and developers relying on TSQL compatibility for cross-database operations will benefit from this improvement. We encourage users to test and provide feedback on this updated feature.
  • Fixed SELECT TOP X PERCENT IR translation for TSQL (#733). In this release, we have made several enhancements to the open-source library to improve compatibility with T-SQL and Catalyst. We have added a new dependency, pprint_${scala.binary.version} version 0.8.1 from the com.lihaoyi group, to provide advanced pretty-printing functionality for Scala. We have also fixed the translation of the TSQL SELECT TOP X PERCENT feature in the parser for intermediate expressions, addressing the difference in syntax between TSQL and SQL for limiting the number of rows returned by a query. Additionally, we have modified the implementation of the WITH clause and added a new expression for the SELECT TOP clause in T-SQL, improving the compatibility of the codebase with T-SQL and aligning it with Catalyst. We also introduced a new abstract class Rule and a case class Rules in the com.databricks.labs.remorph.parsers.intermediate package to fix the SELECT TOP X PERCENT IR translation for TSQL by adding new rules. Furthermore, we have added a new Scala file, subqueries.scala, containing abstract class SubqueryExpression and two case classes that extend it, and made changes to the trees.scala file to improve the tree string representation for better readability and consistency in the codebase. These changes aim to improve the overall functionality of the library and make it easier for new users to understand and adopt the project.
  • Fixed invalid null constraint and FQN (#517). In this change, we addressed issues #516 and #517, which involved resolving an invalid null constraint and correcting a fully qualified name (FQN) in our open-source library. The read_data function in the databricks.py file was updated to improve null constraint handling and ensure the FQN is valid. Previously, the catalog was always appended to the table name, potentially resulting in an invalid FQN and null constraint issues. Now, the code checks if the catalog exists before appending it to the table name, and if not provided, the schema and table name are concatenated directly. Additionally, we removed the NOT NULL constraint from the catalog field in the source_table and target_table structs in the main SQL file, allowing null values for this field. These changes maintain backward compatibility and enhance the overall functionality and robustness of the project, ensuring accurate query results and avoiding potential errors.
  • Generate SQL for arithmetic operators (#726). In this release, we have introduced a new private method arithmetic to the ExpressionGenerator class that generates SQL for arithmetic operations, including unary minus, unary plus, multiplication, division, modulo, addition, and subtraction. This improves the readability and maintainability of the code by separating concerns and making the functionality more explicit. Additionally, we have introduced a new trait named Arithmetic to group arithmetic expressions together, which enables easier manipulation and identification of arithmetic expressions in the code. A new test suite has also been added for arithmetic operations in the ExpressionGenerator class, which improves test coverage and ensures the correct SQL is generated for these operations. These changes make it easier for developers to extend the ExpressionGenerator class with further arithmetic operations; a minimal sketch of the pattern appears after this list.
  • Generate SQL for bitwise operators. The ExpressionGenerator class in the remorph project has been updated to support generating SQL for bitwise operators (OR, AND, XOR, NOT) through the addition of a new private method bitwise that converts bitwise operations to equivalent SQL expressions. The expression method has also been updated to utilize the new bitwise method for any input ir.Expression that is a bitwise operation. To facilitate this change, a new trait called Bitwise and updated case classes for bitwise operations, including BitwiseNot, BitwiseAnd, BitwiseOr, and BitwiseXor, have been implemented. The updated case classes extend the new Bitwise trait and override the dataType method to return the data type of the left expression. A new test case in ExpressionGeneratorTest for bitwise operators has been added to validate the functionality, and the expression method in ExpressionGenerator now uses GeneratorContext() instead of new GeneratorContext(). These changes enable the ExpressionGenerator to generate SQL code for bitwise operations, expanding its capabilities; a minimal sketch of these case classes appears after this list.
  • Generate .. LIKE .. (#723). In this commit, the ExpressionGenerator class has been enhanced with a new method, like, which generates SQL LIKE expressions. The method takes a GeneratorContext object and an ir.Like object as arguments, and returns a string representation of the LIKE expression. It uses the expression method to generate the left and right sides of the LIKE operator, and also handles the optional escape character. Additionally, the timestampLiteral and dateLiteral methods have been updated to take an ir.Literal object and better handle NULL values. A new test case has also been added for the ExpressionGenerator class, which checks the like function and includes examples for basic usage and usage with an escape character. This commit improves the functionality of the ExpressionGenerator class, allowing it to handle LIKE expressions and better handle NULL values for timestamps and dates.
  • Generate DISTINCT, * (#739). In this release, we've enhanced the ExpressionGenerator class to support generating DISTINCT and * (star) expressions in SQL queries. Previously, the class did not handle these cases, resulting in incomplete or incorrect SQL queries. With the introduction of the distinct method, the class can now generate the DISTINCT keyword followed by the expression to be applied to, and the star method produces the * symbol, optionally followed by the name of the object (table or subquery) to which it applies. These improvements make the ExpressionGenerator class more robust and compatible with various SQL dialects, resulting in more accurate query outcomes. We've also added new test cases for the ExpressionGenerator class to ensure that DISTINCT and * expressions are generated correctly. Additionally, support for generating SQL DISTINCT and * (wildcard) has been added to the transpilation of Logical Plans to SQL, specifically in the ir.Project class. This ensures that the correct SQL SELECT * FROM table syntax is generated when a wildcard is used in the expression list. These enhancements significantly improve the functionality and compatibility of our open-source library.
  • Generate LIMIT (#732). A new method has been added to generate a SQL LIMIT clause for a logical plan in the data transformation tool. A new case branch has been implemented in the generate method of the LogicalPlanGenerator class to handle the ir.Limit case, which generates the SQL LIMIT clause. If a percentage limit is specified, the tool will throw an exception as it is not currently supported. The generate method of the ExpressionGenerator class has been replaced with a new generate method in the ir.Project, ir.Filter, and new ir.Limit cases to ensure consistent expression generation. A new case class Limit has been added to the com.databricks.labs.remorph.parsers.intermediate.relations package, which extends the UnaryNode class and has four parameters: input, limit, is_percentage, and with_ties. This new class enables limiting the number of rows returned by a query, with the ability to specify a percentage of rows or include ties in the result set. Additionally, a new test case has been added to the LogicalPlanGeneratorTest class to verify the transpilation of a Limit node to its SQL equivalent, ensuring that the Limit node is correctly handled during transpilation.
  • Generate OFFSET SQL clauses (#736). The Remorph project's latest update introduces a new OFFSET clause generation feature for SQL queries in the LogicalPlanGenerator class. This change adds support for skipping a specified number of rows before returning results in a SQL query, enhancing the library's query generation capabilities. The implementation includes a new case in the match statement of the generate function to handle ir.Offset nodes, creating a string representation of the OFFSET clause using the provided offset expression. Additionally, the commit includes a new test case in the LogicalPlanGeneratorTest class to validate the OFFSET clause generation, ensuring that the LogicalPlanGenerator can translate Offset AST nodes into corresponding SQL statements. Overall, this update enables the generation of more comprehensive SQL queries with OFFSET support, providing software engineers with greater flexibility for pagination and other data processing tasks. A combined sketch of the LIMIT and OFFSET generation described in this and the previous entry appears after this list.
  • Generate ORDER BY SQL clauses (#737). This commit introduces new classes and enumerations for sort direction and null ordering, as well as an updated SortOrder case class, enabling the generation of ORDER BY SQL clauses. The LogicalPlanGenerator and SnowflakeExpressionBuilder classes have been modified to utilize these changes, allowing for more flexible and customizable sorting and null ordering when generating SQL queries. Additionally, the TSqlRelationBuilderSpec test suite has been updated to reflect these changes, and new test cases have been added to ensure the correct transpilation of ORDER BY clauses. Overall, these improvements enhance the Remorph project's capability to parse and generate SQL expressions with various sorting scenarios, providing a more robust and maintainable codebase. A minimal sketch of the direction and null-ordering handling appears after this list.
  • Generate UNION / EXCEPT / INTERSECT (#731). In this release, we have introduced support for the UNION, EXCEPT, and INTERSECT set operations in our data processing system's generator. A new unknown method has been added to the Generator trait to return a TranspileException when encountering unsupported operations, allowing for better error handling and more informative error messages. The LogicalPlanGenerator class in the remorph project has been extended to support generating UNION, EXCEPT, and INTERSECT SQL operations with the addition of a new parameter, explicitDistinct, to enable explicit specification of DISTINCT for these set operations. A new test suite has been added to the LogicalPlanGenerator to test the generation of these operations using the SetOperation class, which now has four possible set operations: UnionSetOp, IntersectSetOp, ExceptSetOp, and UnspecifiedSetOp. With these changes, our system can handle a wider range of input and provide meaningful error messages for unsupported operations, making it more versatile in handling complex SQL queries. A minimal sketch of the set-operation mapping appears after this list.
  • Generate VALUES SQL clauses (#738). The latest commit introduces a new feature to generate VALUES SQL clauses in the context of the logical plan generator. A new case branch has been implemented in the generate method to manage ir.Values expressions, converting input data (lists of lists of expressions) into a string representation compatible with VALUES clauses. The existing functionality remains unchanged. Additionally, a new test case has been added for the LogicalPlanGenerator class, which checks the correct transpilation to VALUES SQL clauses. This test case ensures that the ir.Values method, which takes a sequence of sequences of literals, generates the corresponding VALUES SQL clause, specifically checking the input Seq(Seq(ir.Literal(1), ir.Literal(2)), Seq(ir.Literal(3), ir.Literal(4))) against the SQL clause "VALUES (1,2), (3,4)". This change enables testing the functionality of generating VALUES SQL clauses using the LogicalPlanGenerator class. A minimal sketch reproducing this example appears after this list.
  • Generate predicate expressions. This commit introduces the generation of predicate expressions as part of the SQL ExpressionGenerator in the com.databricks.labs.remorph.generators.sql package, enabling the creation of more complex SQL expressions. The changes include the addition of a new private method, predicate(ctx: GeneratorContext, expr: Expression), to handle predicate expressions, and the introduction of two new predicate expression types, LessThan and LessThanOrEqual, for comparing the relative ordering of two expressions. Existing predicate expression types have been updated with consistent naming. Additionally, the commit incorporates improvements to the handling of comparison operators in the SnowflakeExpressionBuilder and TSqlExpressionBuilder classes, addressing bugs and ensuring precise predicate expression generation. The ParserTestCommon trait has also been updated to reorder certain operators in a logical plan, maintaining consistent comparison results in tests. New test cases have been added to several test suites to ensure the correct interpretation and generation of predicate expressions involving different data types and search conditions. Overall, these enhancements provide more fine-grained comparison of expressions, enable more nuanced condition checking, and improve the robustness and accuracy of the SQL expression generation process.
  • Merge remote-tracking branch 'origin/main'. In this update, the ExpressionGenerator class in the com.databricks.labs.remorph.generators.sql package has been enhanced with two new private methods: dateLiteral and timestampLiteral. These methods are designed to generate the SQL literal representation of DateType and TimestampType expressions, respectively. The introduction of these methods addresses the previous limitations of formatting date and timestamp values directly, which lacked extensibility and required duplicated code for handling null values. By extracting the formatting logic into separate methods, this commit significantly improves code maintainability and reusability, enhancing the overall readability and understandability of the ExpressionGenerator class for developers. The dateLiteral method handles DateType values by formatting them using the dateFormat SimpleDateFormat instance, returning NULL if the value is missing. Likewise, the timestampLiteral method formats TimestampType values using the timeFormat SimpleDateFormat instance, returning NULL if the value is missing. These methods make the formatting logic easier to follow and extend; a minimal sketch of the two helpers appears after this list.
  • Modified dataclass for table threshold and added documentation (#714). A series of modifications have been implemented to enhance the threshold configuration and validation for table reconciliation in the open-source library. The TableThresholds dataclass has been updated to accept a string for the model attribute, replacing the previously used TableThresholdModel Enum. Additionally, a new validate_threshold_model method has been added to TableThresholds to ensure proper validation of the model attribute. A new exception class, InvalidModelForTableThreshold, has been introduced to handle invalid settings. Column-specific thresholds can now be set using the ColumnThresholds configuration option. The recon_capture.py and recon_config.py files have been updated accordingly, and the documentation has been revised to clarify these changes. These improvements offer greater flexibility and control for users configuring thresholds while also refining validation and error handling.
  • Support CTAS in TSQL Grammar and add more comparison operators (#545). In this release, we have added support for the CTAS (CREATE TABLE AS) statement in the TSQL (T-SQL) grammar, as well as introduced new comparison operators: !=, !<, and !>. The CTAS statement allows for the creation of a new table by selecting from an existing table or query, potentially improving code readability and performance. The new comparison operators provide alternative ways of expressing inequalities, increasing flexibility for developers. The keyword REPLICATE has also been added for creating a full copy of a database or availability group. These changes enhance the overall functionality of the TSQL grammar and improve the user's ability to express various operations in TSQL. The CTAS statement is implemented as a new rule, and the new comparison operators are added as methods in the TSqlExpressionBuilder class. These changes provide increased capability and flexibility for TSQL parsing and query handling. The new methods are not adding any new external dependencies, and the project remains self-contained. The additions have been tested with the TSqlExpressionBuilderSpec test suite, ensuring the functionality and compatibility of the TSQL parser.
  • Support translation of TSQL IGNORE NULLS clause in windowing functions (#511). The latest change introduces support for translating the TSQL IGNORE NULLS and RESPECT NULLS clauses in windowing functions to their equivalents in Databricks SQL. In TSQL, these clauses appear after the function name and before the OVER clause, affecting how the functions handle null values. Databricks SQL represents this functionality with an optional trailing boolean parameter for specific windowing functions. With this update, when the IGNORE NULLS clause is specified in TSQL, a boolean option is appended to the corresponding Databricks SQL windowing functions, with RESPECT NULLS as the default. This enhancement is facilitated by a new private method, buildNullIgnore, which adds the boolean parameter to the original expression when IGNORE NULLS is specified in the OVER clause. The alteration is exemplified in new test examples for the TSqlFunctionSpec, which include testing the LEAD function with and without the IGNORE NULLS clause, and updates to the translation of functions with non-standard syntax.
  • TSQL: Implement TSQL UPDATE/DELETE statements (#540). In this release, we have added support for TSQL UPDATE and DELETE statements in all syntactical forms, including UDF column transformations, in the TSqlParser.g4 file. The implementation of both statements is done in a single file and they share many common clauses. We have also introduced two new case classes, UpdateTable and MergeTables, in the extensions.scala file to implement the TSQL UPDATE and DELETE statements, respectively. Additionally, we have added new methods to handle various clauses and elements associated with these statements in the TSqlErrorStrategy class. A series of tests have been included to ensure the correct translation of various UPDATE and DELETE queries to their respective Abstract Syntax Trees (ASTs). These changes bring TSQL UPDATE and DELETE statement functionality to the project and allow for their use in a variety of contexts, providing developers with more flexibility and control when working with TSQL UPDATE and DELETE statements in the parser.
  • TSQL: Implement translation of INSERT statement (#515). In this release, we have implemented the TSQL INSERT statement in its entirety, including all target options, optional clauses, and Common Table Expressions (CTEs) in our open-source library. The change includes updates to the TSqlParser.g4 file to support the INSERT statement's various clauses, such as TOP, INTO, WITH TABLE HINTS, outputClause, and optionClause. We have also added new case classes to the TSQL AST to handle various aspects of the INSERT statement, including LocalVarTable, Output, InsertIntoTable, DerivedRows, DefaultValues, and Default. The TSqlExpressionBuilder and TSqlRelationBuilder classes have been updated to support the new INSERT statement, including handling output column lists, aliases, and JSON clauses. We have added specification tests to TSqlAstBuilderSpec.scala to demonstrate the various ways that the INSERT statement can be written and to ensure the correct translation of TSQL INSERT statements into the remorph project.
  • TSQL: Remove the SIGN fragment from numerical tokens (#547). In this release, we have made changes to the TSQL expression builder in the remorph project that affect how negative and positive numerical literals are parsed. Negative literals, such as -2, will now be parsed as UMinus(Literal(2)) instead of Literal(-2), and positive literals, such as +1, will be parsed as UPlus(Literal(1)) instead of Literal(1). This change was made to address issue #546, but it is not an ideal solution as it may cause inconvenience in downstream processes. The affected numerical tokens include INT, HEX, FLOAT, REAL, and MONEY, which have been simplified by removing the SIGN fragment. We have updated the buildPrimitive method to handle INT, REAL, and FLOAT token types and ensure that numerical tokens are parsed correctly. We intend to keep issue #546 open for further exploration of a better solution. The tests have been updated to reflect these changes.
  • TSQL: Simplifies named table tableSource, implements columnAlias list (#512). This change introduces significant updates to the TSqlParser's grammar for tableSource, simplifying and consolidating rules related to table aliases and column alias lists. A new Relation called TableWithHints has been added to collect and process table hints, some of which have direct counterparts in the Catalyst optimizer or can be used as comments for migration purposes. The TSQLExpressionBuilder and TSqlRelationBuilder classes have been modified to handle table hints and column aliases, and the TSqlAstBuilderSpec test suite has been updated to include new tests for table hints in T-SQL SELECT statements. These changes aim to improve parsing, handling, and optimization of table sources, table hints, and column aliases in TSQL queries.
  • TSQL: Support generic FOR options (#525). In this release, we have added support for parsing T-SQL (Transact-SQL) options that contain the keyword FOR using the standard syntax [FOR]. This change is necessary as FOR cannot be used directly as an identifier without escaping, as it would otherwise be seen as a table alias in a SELECT statement. The ANTLR rule for parsing generic options has been expanded to handle these special cases correctly, allowing for the proper parsing of options such as OPTIMIZE FOR UNKNOWN in a SELECT statement. Additionally, a new case has been added to the OptionBuilder class to handle the FOR keyword and elide it, converting OPTIMIZE FOR UNKNOWN to OPTIMIZE with an id of 'UNKNOWN'. This ensures the proper handling of options containing FOR and avoids any conflicts with the FOR clause in T-SQL statements. This change was implemented by Valentin Kasas and involves adding a new alternative to the genericOption rule in the TSqlParser.g4 file, but no new methods have been added.
  • Updated Dialect Variable Name (#535). In this release, the source variable name in the QueryBuilder class, which refers to the Dialect instance, has been updated to engine to accurately reflect its meaning as referring to either Source or Target. This change includes updating the usage of source to engine in the build_query, _get_with_clause, and build_threshold_query methods, as well as removing unnecessary parentheses in a list. These changes improve the code's clarity, accuracy, and readability, while maintaining the overall functionality of the affected methods.
  • Use Oracle library only if the recon source is Oracle (#532). In this release, we have added a new ReconcileConfig configuration object and a SourceType enumeration in the databricks.labs.remorph.config and databricks.labs.remorph.reconcile.constants modules, respectively. These objects are introduced to determine whether to include the Oracle JDBC driver library in a reconciliation job's task libraries. The deploy_job method and the _job_recon_task method have been updated to use the _recon_config attribute to decide whether to include the Oracle JDBC driver library. Additionally, the _deploy_reconcile_job method in the install.py file has been modified to include a new parameter called reconcile, which is passed as an argument from the _config object. This change enhances the flexibility and customization of the reconcile job deployment. Furthermore, new fixtures oracle_recon_config and snowflake_reconcile_config have been introduced for ReconcileConfig objects with Oracle and Snowflake specific configurations, respectively. These fixtures are used in the test functions for deploying jobs, ensuring that the tests are more focused and better reflect the actual behavior of the code.
  • [chore] Make singletons for the relevant DataType instances (#705). This commit introduces case objects for various data types, such as NullType, StringType, and others, effectively making them singletons. This change simplifies the creation of data type instances and ensures that each type has a single instance throughout the application. The new case objects are utilized in building SQL expressions, affecting functions such as Cast and TRY_CAST in the TSqlExpressionBuilder class. Additionally, the test file TSqlExpressionBuilderSpec.scala has been updated to include the new case objects and handle errors by returning null. The data types tested include integer types, decimal types, date and time types, string types, binary types, and JSON. The primary goal of this change is to improve the management and identification of data types in the codebase, as well as to enhance code readability and maintainability.
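
The sketches below illustrate several of the entries above with small, self-contained Scala examples. They are simplified sketches under stated assumptions, not the project's actual classes or signatures; any name that does not appear in the entries is an illustrative stand-in. First, for the core transpiler entry (#715), a minimal version of the tokenize, parse, and generate flow:

```scala
// Simplified stand-ins for the real parser, IR, and generator used by
// SnowflakeToDatabricksTranspiler; parse and generateSql are hypothetical names.
object TranspilerPipelineSketch {
  final case class LogicalPlan(description: String)

  // Real transpiler: tokenize the input, run SnowflakeParser, and build the plan
  // with SnowflakeAstBuilder; here we just wrap the query text.
  def parse(snowflakeSql: String): LogicalPlan =
    LogicalPlan(s"plan for [$snowflakeSql]")

  // Real transpiler: apply LogicalPlanGenerator to the logical plan.
  def generateSql(plan: LogicalPlan): String =
    s"-- Databricks SQL generated from ${plan.description}"

  def transpile(snowflakeSql: String): String = generateSql(parse(snowflakeSql))

  def main(args: Array[String]): Unit =
    println(transpile("SELECT a FROM t WHERE a > 1"))
}
```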
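
For the arithmetic operators entry (#726), a minimal sketch of an arithmetic-style generator method over a grouping trait. The IR case classes here are simplified assumptions, and the naive string building ignores parenthesization, which a real generator would handle.

```scala
object ArithmeticSketch {
  sealed trait Expression
  final case class Literal(value: Int) extends Expression

  // Grouping trait in the spirit of the Arithmetic trait described above.
  sealed trait Arithmetic extends Expression
  final case class UMinus(child: Expression) extends Arithmetic
  final case class UPlus(child: Expression) extends Arithmetic
  final case class Add(left: Expression, right: Expression) extends Arithmetic
  final case class Subtract(left: Expression, right: Expression) extends Arithmetic
  final case class Multiply(left: Expression, right: Expression) extends Arithmetic
  final case class Divide(left: Expression, right: Expression) extends Arithmetic
  final case class Mod(left: Expression, right: Expression) extends Arithmetic

  // Illustrative counterpart of a private `arithmetic` generator method.
  def arithmetic(expr: Expression): String = expr match {
    case Literal(v)     => v.toString
    case UMinus(c)      => s"-${arithmetic(c)}"
    case UPlus(c)       => s"+${arithmetic(c)}"
    case Add(l, r)      => s"${arithmetic(l)} + ${arithmetic(r)}"
    case Subtract(l, r) => s"${arithmetic(l)} - ${arithmetic(r)}"
    case Multiply(l, r) => s"${arithmetic(l)} * ${arithmetic(r)}"
    case Divide(l, r)   => s"${arithmetic(l)} / ${arithmetic(r)}"
    case Mod(l, r)      => s"${arithmetic(l)} % ${arithmetic(r)}"
  }

  def main(args: Array[String]): Unit =
    println(arithmetic(Add(Literal(1), Multiply(Literal(2), Literal(3))))) // prints: 1 + 2 * 3
}
```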
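
For the bitwise operators entry, a sketch of the Bitwise grouping trait and the "data type of the left expression" convention it describes; the shapes below are assumptions, not the real IR.

```scala
object BitwiseSketch {
  sealed trait DataType
  case object IntType extends DataType

  sealed trait Expression { def dataType: DataType }
  final case class Column(name: String, dataType: DataType = IntType) extends Expression

  sealed trait Bitwise extends Expression
  final case class BitwiseNot(child: Expression) extends Bitwise {
    override def dataType: DataType = child.dataType
  }
  final case class BitwiseAnd(left: Expression, right: Expression) extends Bitwise {
    override def dataType: DataType = left.dataType // data type of the left expression
  }
  final case class BitwiseOr(left: Expression, right: Expression) extends Bitwise {
    override def dataType: DataType = left.dataType
  }
  final case class BitwiseXor(left: Expression, right: Expression) extends Bitwise {
    override def dataType: DataType = left.dataType
  }

  // Illustrative counterpart of a private `bitwise` generator method
  // (parenthesization is omitted for brevity).
  def bitwise(expr: Expression): String = expr match {
    case Column(n, _)     => n
    case BitwiseNot(c)    => s"~${bitwise(c)}"
    case BitwiseAnd(l, r) => s"${bitwise(l)} & ${bitwise(r)}"
    case BitwiseOr(l, r)  => s"${bitwise(l)} | ${bitwise(r)}"
    case BitwiseXor(l, r) => s"${bitwise(l)} ^ ${bitwise(r)}"
  }

  def main(args: Array[String]): Unit =
    println(bitwise(BitwiseXor(Column("a"), BitwiseAnd(Column("b"), Column("c"))))) // a ^ b & c
}
```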
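
For the LIMIT (#732) and OFFSET (#736) entries, a combined sketch of the two generator cases, assuming simplified plan nodes; as in the LIMIT entry, a percentage limit is rejected because it is not supported yet.

```scala
object LimitOffsetSketch {
  sealed trait LogicalPlan
  final case class NamedTable(name: String) extends LogicalPlan
  // Field names mirror the Limit entry above (input, limit, is_percentage, with_ties).
  final case class Limit(input: LogicalPlan, limit: Int,
                         isPercentage: Boolean = false, withTies: Boolean = false) extends LogicalPlan
  final case class Offset(input: LogicalPlan, offset: Int) extends LogicalPlan

  def generate(plan: LogicalPlan): String = plan match {
    case NamedTable(name) => s"SELECT * FROM $name"
    case Limit(_, _, true, _) =>
      throw new UnsupportedOperationException("percentage LIMIT is not supported yet")
    case Limit(input, n, _, _) => s"${generate(input)} LIMIT $n"
    case Offset(input, n)      => s"${generate(input)} OFFSET $n"
  }

  def main(args: Array[String]): Unit =
    println(generate(Offset(Limit(NamedTable("t"), 10), 20))) // SELECT * FROM t LIMIT 10 OFFSET 20
}
```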
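
For the ORDER BY entry (#737), a sketch of sort-direction and null-ordering enumerations feeding a SortOrder case class; the names are simplified stand-ins.

```scala
object OrderBySketch {
  sealed trait SortDirection
  case object Ascending extends SortDirection
  case object Descending extends SortDirection

  sealed trait NullOrdering
  case object NullsFirst extends NullOrdering
  case object NullsLast extends NullOrdering

  final case class SortOrder(expression: String, direction: SortDirection, nulls: NullOrdering)

  def orderBy(orders: Seq[SortOrder]): String = {
    val parts = orders.map { o =>
      val dir  = if (o.direction == Ascending) "ASC" else "DESC"
      val nuls = if (o.nulls == NullsFirst) "NULLS FIRST" else "NULLS LAST"
      s"${o.expression} $dir $nuls"
    }
    s"ORDER BY ${parts.mkString(", ")}"
  }

  def main(args: Array[String]): Unit =
    println(orderBy(Seq(SortOrder("sales", Descending, NullsLast), SortOrder("id", Ascending, NullsFirst))))
  // ORDER BY sales DESC NULLS LAST, id ASC NULLS FIRST
}
```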
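
For the UNION / EXCEPT / INTERSECT entry (#731), a sketch of mapping the four set-operation kinds to SQL with an explicitDistinct flag; the SetOperation shape is an assumption, and the unsupported case simply throws here where the entry describes a TranspileException.

```scala
object SetOperationSketch {
  sealed trait SetOp
  case object UnionSetOp extends SetOp
  case object IntersectSetOp extends SetOp
  case object ExceptSetOp extends SetOp
  case object UnspecifiedSetOp extends SetOp

  final case class SetOperation(left: String, right: String, op: SetOp)

  def generate(s: SetOperation, explicitDistinct: Boolean = false): String = {
    val keyword = s.op match {
      case UnionSetOp       => "UNION"
      case IntersectSetOp   => "INTERSECT"
      case ExceptSetOp      => "EXCEPT"
      case UnspecifiedSetOp =>
        throw new UnsupportedOperationException("unspecified set operation")
    }
    val distinct = if (explicitDistinct) " DISTINCT" else ""
    s"(${s.left}) $keyword$distinct (${s.right})"
  }

  def main(args: Array[String]): Unit =
    println(generate(SetOperation("SELECT a FROM t1", "SELECT a FROM t2", UnionSetOp), explicitDistinct = true))
  // (SELECT a FROM t1) UNION DISTINCT (SELECT a FROM t2)
}
```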
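
For the VALUES entry (#738), a sketch that reproduces the example input and output given above; the Literal case class is a simplified stand-in.

```scala
object ValuesSketch {
  final case class Literal(value: Int)

  // Converts rows of literals into a VALUES clause, e.g. "VALUES (1,2), (3,4)".
  def values(rows: Seq[Seq[Literal]]): String =
    "VALUES " + rows.map(row => row.map(_.value).mkString("(", ",", ")")).mkString(", ")

  def main(args: Array[String]): Unit =
    println(values(Seq(Seq(Literal(1), Literal(2)), Seq(Literal(3), Literal(4))))) // VALUES (1,2), (3,4)
}
```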
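
Finally, for the dateLiteral / timestampLiteral entry, a sketch of formatting with SimpleDateFormat and falling back to NULL for missing values; the Literal shape, the format patterns, and the DATE/TIMESTAMP quoting are assumptions.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object LiteralFormattingSketch {
  // Formatter names mirror the entry above; the patterns themselves are assumptions.
  private val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  private val timeFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  final case class Literal(value: Option[Date])

  def dateLiteral(l: Literal): String =
    l.value.map(d => s"DATE '${dateFormat.format(d)}'").getOrElse("NULL")

  def timestampLiteral(l: Literal): String =
    l.value.map(d => s"TIMESTAMP '${timeFormat.format(d)}'").getOrElse("NULL")

  def main(args: Array[String]): Unit = {
    println(dateLiteral(Literal(Some(new Date(0L)))))  // e.g. DATE '1970-01-01' (timezone dependent)
    println(timestampLiteral(Literal(None)))           // NULL
  }
}
```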

Dependency updates:

  • Bump sqlglot from 25.1.0 to 25.5.1 (#534).
  • Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#555).
  • Bump sqlglot from 25.5.1 to 25.6.1 (#585).

0.3.0

  • Added Oracle ojdbc8 dependent library during reconcile Installation (#474). In this release, the deployment.py file in the databricks/labs/remorph/helpers directory has been updated to add the ojdbc8 library as a MavenLibrary in the _job_recon_task function, enabling the reconciliation process to access the Oracle data source and pull data for reconciliation between Oracle and Databricks. The JDBCReaderMixin class in the jdbc_reader.py file has also been updated to include the Oracle ojdbc8 dependent library for reconciliation during the reconcile process. This involves installing the com.oracle.database.jdbc:ojdbc8:23.4.0.24.05 jar as a dependent library and updating the driver class from oracle to oracle.jdbc.driver.OracleDriver. A new dictionary driver_class has been added, which maps the driver name to the corresponding class name, allowing for dynamic driver class selection during the _get_jdbc_reader method call. The test_read_data_with_options unit test has been updated to test the Oracle connector for reading data with specific options, including the use of the correct driver class and specifying the database table for data retrieval, improving the accuracy and reliability of the reconciliation process.
  • Added TSQL coverage tests in the generated report artifact (#452). In this release, we have added new TSQL coverage tests and Snowflake coverage tests to the generated report artifact in the CI/CD pipeline. These tests are executed using Maven with the updated command "mvn --update-snapshots -B test -pl coverage --file pom.xml --fail-at-end" and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml --fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o coverage-result.json" respectively, and the "continue-on-error: true" option is added to allow the pipeline to proceed even if the tests fail. Additionally, we have introduced a new constructor to the CommentBasedQueryExtractor class, which accepts a dialect parameter and allows for easier configuration of the start and end comments for different SQL dialects. We have also updated the CommentBasedQueryExtractor for Snowflake and added two TSQL coverage tests to the generated report artifact to ensure that the QueryExtractor is working correctly for TSQL queries. These changes will help ensure thorough testing and identification of TSQL and Snowflake queries during the CI/CD process.
  • Added full support for analytical windowing functions (#401). In this release, full support for analytical windowing functions has been implemented, addressing issue #401. The functions were previously specified in the parser grammar but have been moved to the standard function lookup table for more consistent handling. This enhancement allows for the use of analytical aggregate functions, such as FIRST_VALUE and PERCENTILE_CONT, with a WITHIN GROUP syntax and an OVER clause, enabling more complex queries and data analysis. The FixedArity and VariableArity classes have been updated with new methods for the supported functions, and appropriate examples have been provided to demonstrate their usage in SQL.
  • Added parsing for STRPOS in presto (#462). A new feature has been added to the remorph/snow package's presto module to parse the STRPOS function in SQL code. This has been achieved by importing the locate_to_strposition function from sqlglot.dialects.dialect and incorporating it into the FUNCTIONS dictionary in the Parser class. This change enables the parsing of the STRPOS function, which returns the position of the first occurrence of a substring in a string. The implementation has been tested with a SQL file containing two queries for Presto SQL using STRPOS and Databricks SQL using LOCATE, both aimed at finding the position of the letter l in the string 'Hello world', starting the search from the second position. This feature is particularly relevant for software engineers working on data processing and analytics projects involving both Presto and Databricks SQL, as it ensures compatibility and consistent behavior between the two for string manipulation functions. The commit is part of issue #462, and the diff provided includes a new SQL file with test cases for the STRPOS function in Presto and Locate function in Databricks SQL. The test cases confirm if the hello string is present in the greeting_message column of the greetings_table. This feature allows users to utilize the STRPOS function in Presto to determine if a specific substring is present in a string.
  • Added validation for join columns for all query builders and limiting rows for reports (#413). In this release, we've added validation for join columns in all query builders, ensuring consistent and accurate data joins. A limit on the number of rows displayed for reports has been implemented with a default of 50. The compare.py and execute.py files have been updated to include validation, and the QueryBuilder and HashQueryBuilder classes have new methods for validating join columns. The SamplingQueryBuilder, ThresholdQueryBuilder, and recon_capture.py files have similar updates for validation and limiting rows for reports. The recon_config.py file now has a new return type for the get_join_columns method, and a new method test_no_join_columns_raise_exception() has been added in the test_threshold_query.py file. These changes aim to enhance data consistency, accuracy, and efficiency for software engineers.
  • Adds more coverage tests for functions to TSQL coverage (#420). This commit adds new coverage tests for various TSQL functions, focusing on the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are identical in Databricks SQL. The tests include cases with and without the DISTINCT keyword to ensure consistent behavior between TSQL and Databricks. For the GROUPING and GROUPING_ID functions, which have some differences, tests and examples of TSQL and Databricks SQL code are provided. The CHECKSUM_AGG function, not directly supported in Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The CUME_DIST function, identical in both systems, is also tested. Additionally, a new test file for the STDEV function and updated tests for the VAR function are introduced, enhancing the reliability and robustness of TSQL conversions in the project.
  • Catalog, Schema Permission checks (#492). This release introduces enhancements to the Catalog and Schema functionality, with the addition of permission checks that raise explicit Permission Denied exceptions. The logger messages have been updated for clarity and a new variable, README_RECON_REPO, has been created to reference the readme file for the recon_config repository. The ReconcileUtils class has been modified to handle scenarios where the recon_config file is not found or corrupted during loading, providing clear error messages and guidance for users. The unit tests for the install feature have been updated with permission checks for Catalog and Schema operations, ensuring robust handling of permission denied errors. These changes improve the system's error handling and provide clearer guidance for users encountering permission issues.
  • Changing the secret name according to the install script (#432). In this release, the recon function in the execute.py file of the databricks.labs.remorph.reconcile package has been updated to dynamically generate the secret name instead of hardcoding it as "secret_scope". This change utilizes the new get_key_form_dialect function to create a secret name specific to the source dialect being used in the reconciliation process. The get_dialect function, along with DatabaseConfig, TableRecon, and the newly added get_key_form_dialect, have been imported from databricks.labs.remorph.config. This enhancement improves the security and flexibility of the reconciliation process by generating dynamic and dialect-specific secret names.
  • Feature/recon documentation (#395). This commit introduces a new reconciliation process, enhancing data consistency between sources, co-authored by Ganesh Dogiparthi, ganeshdogiparthi-db, and SundarShankar89. The README.md file provides detailed documentation for the reconciliation process. A new binary file, docs/transpile-install.gif, offers installation instructions or visual aids, while a mermaid flowchart in report_types_visualisation.md illustrates report generation for data, rows, schema, and overall reconciliation. No existing functionality was modified, ensuring the addition of valuable features for software engineers adopting this project.
  • Fixing issues in sample query builder to handle Null's and zero (#457). This commit introduces improvements to the sample query builder's handling of Nulls and zeroes, addressing bug #450. The changes include updated SQL queries in the test threshold query file with COALESCE and TRIM functions to replace Null values with a specified string, ensuring consistent comparison of datasets. The query store in test_execute.py has also been enhanced to handle NULL and zero values using COALESCE, improving overall robustness and consistency. Additionally, new methods such as build_join_clause, trim, and coalesce have been added to enhance null handling in the query builder. The commit also introduces the MockDataSource class, a likely test implementation of a data source, and updates the log_and_throw_exception function for clearer error messaging.
  • Implement Lakeview Dashboard Publisher (#405). In this release, we've introduced the DashboardPublisher class in the dashboard_publisher.py module to streamline the process of creating and publishing dashboards in Databricks Workspace. This class simplifies dashboard creation by accepting an instance of WorkspaceClient and Installation and providing methods for creating and publishing dashboards with optional parameter substitution. Additionally, we've added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json', which contains a dashboard definition for a data reconciliation feature. This dashboard includes various widgets for filtering and displaying reconciliation results. We've also added a test file for the Lakeview Dashboard Publisher feature, which includes tests to ensure that the DashboardPublisher can create dashboards using specified file paths and parameters. These new features and enhancements are aimed at improving the user experience and streamlining the process of creating and publishing dashboards in Databricks Workspace.
  • Integrate recon metadata reconcile cli (#444). A new CLI command, databricks labs remorph reconcile, has been added to initiate the Data Reconciliation process, loading reconcile.yml and recon_config.json configuration files from the Databricks Workspace. If these files are missing, the user is prompted to reinstall the reconcile module and exit the command. The command then triggers the Remorph_Reconciliation_Job based on the Job ID stored in the reconcile.yml file. This simplifies the reconcile execution process, requiring users to first configure the reconcile module and generate the recon_config_<SOURCE>.json file using databricks labs remorph install and databricks labs remorph generate-recon-config commands. The new CLI command has been manually tested and includes unit tests. Integration tests and verification on the staging environment are pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi, and SundarShankar89.
  • Introduce coverage tests (#382). This commit introduces coverage tests and updates the GitHub Actions workflow to use Java 11 with Corretto distribution, improving testing and coverage analysis for the project. Coverage tests are added as part of the remorph project with the introduction of a new module for coverage and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is modified to change the parent project version to 0.2.0-SNAPSHOT, ensuring accurate assessment and maintenance of code coverage during development. In addition, a new Main object within the com.databricks.labs.remorph.coverage package is implemented for running coverage tests using command-line arguments, along with the addition of a new file QueryRunner.scala and case classes for ReportEntryHeader, ReportEntryReport, and ReportEntry for capturing and reporting on the status and results of parsing and transpilation processes. The Cache Maven packages step is removed and replaced with two new steps: Run Unit Tests with Maven and "Run Coverage Tests with Maven." The former executes unit tests and generates a test coverage report, while the latter downloads remorph-core jars as artifacts, executes coverage tests with Maven, and uploads coverage tests results as json artifacts. The coverage-tests job runs after the test-core job and uses the same environment, checking out the code with full history, setting up Java 11 with Corretto distribution, downloading remorph-core-jars artifacts, and running coverage tests with Maven, even if there are errors. The JUnit report is also published, and the coverage tests results are uploaded as json artifacts, providing better test coverage and more reliable code for software engineers adopting the project.
  • Presto approx percentile func fix (#411). The remorph library has been updated to support the Presto database system, with a new module added to the config.py file to enable robust and maintainable interaction. An APPROX_PERCENTILE function has been implemented in the presto.py file of the sqlglot.dialects.presto package, allowing for approximate percentile calculations in Presto and Databricks SQL. A test file has been included for both SQL dialects, with queries calculating the approximate median of the height column in the people table. The new functionality enhances the compatibility and versatility of the remorph library in working with Presto databases and improves overall project functionality. Additionally, a new test file for Presto in the snowflakedriver project has been introduced to test expected exceptions, further ensuring robustness and reliability.
  • Raise exception if reconciliation fails for any table (#412). In this release, we have implemented significant changes to improve exception handling and raise meaningful exceptions when reconciliation fails for any table in our open-source library. A new exception class, ReconciliationException, has been added as a child of the Exception class, which takes two optional parameters in its constructor, message and reconcile_output. The ReconcileOutput property has been created for accessing the reconcile output object. The InvalidInputException class now inherits from ValueError, making the code more explicit with the type of errors being handled. A new method, _verify_successful_reconciliation, has been introduced to check the reconciliation output status and raise a ReconciliationException if any table fails reconciliation. The test_execute.py file has been updated to raise a ReconciliationException if reconciliation for a specific report type fails, and new tests have been added to the test suite to ensure the correct behavior of the reconcile function with and without raising exceptions.
  • Removed USE catalog/schema statement as lsql has added the feature (#465). In this release, the usage of USE statements for selecting a catalog and schema has been removed in the get_sql_backend function, thanks to the new feature provided by the lsql library. This enhancement improves code readability, maintainability, and enables better integration with the SQL backend. The commit also includes changes to the installation process for reconciliation metadata tables, providing more clarity and simplicity in the code. Additionally, several test functions have been added or modified to ensure the proper functioning of the get_sql_backend function in various scenarios, including cases where a warehouse ID is not provided or when executing SQL statements in a notebook environment. An error simulation test has also been added for handling DatabricksError exceptions when executing SQL statements using the DatabricksConnectBackend class.
  • Sampling with clause query to have from dual in from clause for oracle source (#464). In this release, we've added the get_key_from_dialect function, replacing the previous get_key_form_dialect function, to retrieve the key associated with a given dialect object, serving as a unique identifier for the dialect. This improvement enhances the flexibility and readability of the codebase, making it easier to locate and manipulate dialect objects. Additionally, we've modified the 'sampling_query.py' file to include from dual in the from clause for Oracle sources in a sampling query with a clause, enabling sampling from Oracle databases. The _insert_into_main_table method in the recon_capture.py file of the databricks.labs.remorph.reconcile module has been updated to ensure accurate key retrieval for the specified dialect, thereby improving the reconciliation process. These changes resolve issues #458 and #464, enhancing the functionality of the sampling query builder and providing better support for various databases.
  • Support function translation to Databricks SQL in TSql and Snowflake (#414). This commit introduces a dialect-aware FunctionBuilder system and a ConversionStrategy system to enable seamless translation of SQL functions between TSQL, Snowflake, and Databricks SQL IR. The new FunctionBuilder system can handle both simple name translations and more complex conversions when there is no direct equivalent. For instance, TSQL's ISNULL function translates to IFNULL in Databricks SQL, while Snowflake's ISNULL remains unchanged. The commit also includes updates to the TSqlExpressionBuilder and new methods for building and visiting various contexts, enhancing compatibility and expanding the range of supported SQL dialects. Additionally, new tests have been added in the FunctionBuilderSpec to ensure the correct arity and function type for various SQL functions. A simplified sketch of this dialect-aware renaming appears after this list.
  • TSQL: Create coverage tests for TSQL -> Databricks functions (#415). This commit introduces coverage tests for T-SQL functions and their equivalent Databricks SQL implementations, focusing on the DATEADD function's yy keyword. The DATEADD function is translated to the ADD_MONTHS function in Databricks SQL, with the number of months multiplied by 12. This ensures functional equivalence between T-SQL and Databricks SQL for date addition involving years. The tests are written as SQL scripts and are located in the tests/resources/functional/tsql/functions directory, covering various scenarios and possible engine differences between T-SQL and Databricks SQL. The conversion process is documented, and future automation of this documentation is considered.
  • TSQL: Implement WITH CTE (#443). With this commit, we have extended the TSQL functionality by adding support for Common Table Expressions (CTEs). CTEs are temporary result sets that can be defined within a single execution of a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement, allowing for more complex and efficient queries. The implementation includes the ability to create a CTE with an optional name and a column list, followed by a SELECT statement that defines the CTE. CTEs can be self-referential and can be used to simplify complex queries, improving code readability and performance. This feature is particularly useful for cases where multiple queries rely on the same intermediate result set, as it enables reusing the results without having to repeat the query.
  • TSQL: Implement functions with specialized syntax (#430). This commit introduces new data type conversion functions and JSON manipulation capabilities to T-SQL, addressing issue #430. The newly implemented features include NEXT VALUE FOR sequence, CAST(col TO sometype), TRY_CAST(col TO sometype), JSON_ARRAY, and JSON_OBJECT. These functions support specialized syntax for handling data type conversions and JSON operations, including NULL value handling using NULL ON NULL and ABSENT ON NULL syntax. The TSqlFunctionBuilder class has been updated to accommodate these changes, and new test cases have been added to the TSqlFunctionSpec test class in Scala. This enhancement enables SQL-based querying and data manipulation with increased functionality for T-SQL parser and function evaluations.
  • TSQL: Support DISTINCT in SELECT list and aggregate functions (#400). This commit adds support for the DISTINCT keyword in T-SQL for use in the SELECT list and aggregate functions such as COUNT. When used in the SELECT list, DISTINCT ensures unique values of the specified expression are returned, and in aggregate functions like COUNT, it considers only distinct values of the specified argument. This change aligns with the SQL standard and enhances the functionality of the T-SQL parser, providing developers with greater flexibility and control when using DISTINCT in complex queries and aggregate functions. The default behavior in SQL, ALL, remains unchanged, and the parser has been updated to accommodate these improvements.
  • TSQL: Update the SELECT statement to support XML workspaces (#451). This release introduces updates to the TSQL Select statement grammar to correctly support XMLWORKSPACES in accordance with the latest specification. Although Databricks SQL does not currently support XMLWORKSPACES, this change is a syntax-only update to enable compatibility with other platforms that do support it. Newly added components include 'xmlNamespaces', 'xmlDeclaration', 'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection', 'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives', and 'xmlColumnDefinition'. These additions enable the creation, configuration, and usage of XML schemas and indexes, as well as the specification of XML namespaces and directives. A new test file for functional tests has been included to demonstrate the use of XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While this update does not affect the existing codebase's functionality, it does enable support for XMLWORKSPACES syntax in TSQL, facilitating easier integration with other platforms that support it. Please note that Databricks SQL does not currently support XML workspaces.
  • Test merge queue (#424). In this release, the Scalafmt configuration has been updated to version 3.8.0, with changes to the formatting of Scala code. The danglingParentheses preset option has been set to "false", removing dangling parentheses from the code. Additionally, the configStyleArguments option has been set to false under "optIn". These modifications to the configuration file are likely to affect the formatting and style of the Scala code in the project, ensuring consistent and organized code. This change aims to enhance the readability and maintainability of the codebase.
  • Updated bug and feature yml to support reconcile (#390). The open-source library has been updated to improve issue and feature categorization. In the .github/ISSUE_TEMPLATE/bug.yml file, new options for TranspileParserError, TranspileValidationError, and TranspileLateralColumnAliasError have been added to the label: Category of Bug / Issue field. Additionally, a new option for ReconcileError has been included. The feature.yml file in the .github/ISSUE_TEMPLATE directory has also been updated, introducing a required dropdown menu labeled "Category of feature request." This dropdown offers options for Transpile, Reconcile, and Other categories, ensuring accurate classification and organization of incoming feature requests. The modifications aim to enhance clarity for maintainers in reviewing and prioritizing issue resolutions and feature implementations related to reconciliation functionality.
  • Updated the documentation with json config examples (#486). In this release, the Remorph Reconciliation tool on Databricks has been updated to include JSON config examples for various config elements such as jdbc_reader_options, column_mapping, transformations, thresholds, and filters. These config elements enable users to define source and target data, join columns, JDBC reader options, select and drop columns, column mappings, transformations, thresholds, and filters. The update also provides examples in both Python and JSON formats, as well as instructions for installing the necessary Oracle JDBC library on a Databricks cluster. This update enhances the tool's functionality, making it easier for software engineers to reconcile source data with target data on Databricks.
  • Updated uninstall flow (#476). In this release, the uninstall functionality of the databricks labs remorph tool has been updated to align with the latest changes made to the install refactoring. The uninstall flow now utilizes a new MockInstallation class, which handles the uninstallation process and takes a dictionary of configuration files and their corresponding contents as input. The uninstall function has been modified to return False in two cases, either when there is no remorph directory or when the user decides not to uninstall. A MockInstallation object is created for the reconcile.yml file, and appropriate exceptions are raised in the aforementioned cases. The uninstall function now uses a WorkspaceUnInstallation or WorkspaceUnInstaller object, depending on the input arguments, to handle the uninstallation process. Additionally, the MockPrompts class is used to prompt the user for confirmation before uninstalling remorph.
  • Updates to developer documentation and add grammar formatting to maven (#490). The developer documentation has been updated to include grammar formatting instructions and support for dialects other than Snowflake. The Maven build cycle has been modified to format grammars before ANTLR processes them, enhancing readability and easing conflict resolution during maintenance. The TSqlLexer.g4 file has been updated with formatting instructions and added dialect recognition. These changes ensure that grammars are consistently formatted and easily resolvable during merges. Engineers adopting this project should reformat the grammar file before each commit, following the provided formatting instructions and reference link. Grammar modifications in the TSqlParser.g4 file, such as alterations in partitionFunction and freetextFunction rules, improve structure and readability.
  • Upgrade sqlglot from 23.13.7 to 25.1.0 (#473). In the latest release, the sqlglot package has been upgraded from version 23.13.7 to 25.1.0, offering potential new features, bug fixes, and performance improvements for SQL processing. The package dependency for numpy has been updated to version 1.26.4, which may introduce new functionality, improve existing features, or fix numpy integration issues. Furthermore, the addition of the types-pytz package as a dependency provides type hints for pytz, enhancing codebase type checking and static analysis capabilities. Specific modifications to the test_sql_transpiler.py file include updating the expected result in the test_parse_query function and removing unnecessary whitespaces in the transpiled_sql assertion in the test_procedure_conversion function. Although the find_root_tables function remains unchanged, the upgrade to sqlglot promises overall functionality enhancements, which software engineers can leverage in their projects.
  • Use default_factory in recon_config.py (#431). In this release, the default value handling for the status field in the DataReconcileOutput and ReconcileTableOutput classes has been improved to comply with Python 3.11. Previously, a mutable default value was used, causing a ValueError issue. This has been addressed by implementing the default_factory argument in the field function to ensure a new instance of StatusOutput is created for each class. Additionally, MismatchOutput and ThresholdOutput classes now also utilize default_factory for consistent and robust default value handling, enhancing the overall code quality and preventing potential issues arising from mutable default values.
  • edit distance (#501). In this release, we have implemented an edit distance feature for calculating the difference between two strings using the LEVENSHTEIN function. This has been achieved by adding a new method, anonymous_sql, to the Generator class in the databricks.py file. The method takes expressions of the Anonymous type as arguments and calls the LEVENSHTEIN function if the this attribute of the expression is equal to "EDITDISTANCE". Additionally, a new test file has been introduced for the anonymous user in the functional snowflake test suite to ensure the accurate calculation of string similarity using the EDITDISTANCE function. This change includes examples of using the EDITDISTANCE function with different parameters and compares it with the LEVENSHTEIN function available in Databricks. It addresses issue #500, which was related to testing the edit distance functionality.
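
A simplified Scala sketch of the dialect-aware function renaming described in the FunctionBuilder / ConversionStrategy entry (#414) above; the lookup table and method shape are assumptions, with only the ISNULL-to-IFNULL example taken from the entry.

```scala
object FunctionTranslationSketch {
  sealed trait SourceDialect
  case object TSql extends SourceDialect
  case object Snowflake extends SourceDialect

  // Per-dialect renames into Databricks SQL; anything not listed passes through unchanged.
  private val renames: Map[SourceDialect, Map[String, String]] = Map(
    TSql      -> Map("ISNULL" -> "IFNULL"),
    Snowflake -> Map.empty // e.g. Snowflake's ISNULL stays as-is
  )

  def translate(dialect: SourceDialect, functionName: String): String = {
    val upper = functionName.toUpperCase
    renames(dialect).getOrElse(upper, upper)
  }

  def main(args: Array[String]): Unit = {
    println(translate(TSql, "isnull"))      // IFNULL
    println(translate(Snowflake, "isnull")) // ISNULL
  }
}
```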

0.2.0

  • Capture Reconcile metadata in delta tables for dashboards (#369). In this release, changes have been made to improve version control management, reduce repository size, and enhance build times. A new directory, "spark-warehouse/", has been added to the Git ignore file to prevent unnecessary files from being tracked and included in the project. The WriteToTableException class has been added to the exception.py file to raise an error when a runtime exception occurs while writing data to a table. A new ReconCapture class has been implemented in the reconcile package to capture and persist reconciliation metadata in delta tables. The recon function has been updated to initialize this new class, passing in the required parameters. Additionally, a new file, recon_capture.py, has been added to the reconcile package, which implements the ReconCapture class responsible for capturing metadata related to data reconciliation. The recon_config.py file has been modified to introduce a new class, ReconcileProcessDuration, and restructure the classes ReconcileOutput, MismatchOutput, and ThresholdOutput. The commit also captures reconcile metadata in delta tables for dashboards in the context of unit tests in the test_execute.py file and includes a new file, test_recon_capture.py, to test the reconcile capture functionality of the ReconCapture class.
  • Expand translation of Snowflake expr (#351). In this release, the translation of the expr category in the Snowflake language has been significantly expanded, addressing uncovered grammar areas, incorrect interpretations, and duplicates. The subquery is now excluded as a valid expr, and new case classes such as NextValue, ArrayAccess, JsonAccess, Collate, and Iff have been added to the Expression class. These changes improve the comprehensiveness and accuracy of the Snowflake parser, allowing for a more flexible and accurate translation of various operations. Additionally, the SnowflakeExpressionBuilder class has been updated to handle previously unsupported cases, enhancing the parser's ability to parse Snowflake SQL expressions.
  • Fixed Oracle missing datatypes (#333). In the latest release, the Oracle class of the Tokenizer in the open-source library has undergone a fix to address missing datatypes. Previously, the KEYWORDS mapping did not require Tokens for keys, which led to unsupported Oracle datatypes. This issue has been resolved by modifying the test_schema_compare.py file to ensure that all Oracle datatypes, including LONG, NCLOB, ROWID, UROWID, ANYTYPE, ANYDATA, ANYDATASET, XMLTYPE, SDO_GEOMETRY, SDO_TOPO_GEOMETRY, and SDO_GEORASTER, are now mapped to the TEXT TokenType. This improvement enhances the compatibility of the code with Oracle datatypes and increases the reliability of the schema comparison functionality, as demonstrated by the test function test_schema_compare, which now returns is_valid as True and a count of 0 for is_valid = false in the resulting dataframe.
  • Fixed the recon_config functions to handle null values (#399). In this release, the recon_config functions have been enhanced to manage null values and provide more flexible column mapping for reconciliation purposes. A __post_init__ method has been added to certain classes to convert specified attributes to lowercase and handle null values. A new helper method, _get_is_string, has been introduced to determine if a column is of string type. Additionally, new functions such as get_tgt_to_src_col_mapping_list, get_layer_tgt_to_src_col_mapping, get_src_to_tgt_col_mapping_list, and get_layer_src_to_tgt_col_mapping have been added to retrieve column mappings, enhancing the overall functionality and robustness of the reconciliation process. These improvements give software engineers more accurate and reliable configuration handling and more flexibility in mapping source and target columns during reconciliation. A minimal sketch of the null-safe, lowercasing __post_init__ pattern appears after this list.
  • Improve Exception handling (#392). The commit titled Improve Exception Handling enhances error handling in the project, addressing issues #388 and #392. Changes include refactoring the create_adapter method in the DataSourceAdapter class, updating method arguments in test functions, and adding new methods in the test_execute.py file for better test doubles. The DataSourceAdapter class is replaced with the create_adapter function, which takes the same arguments and returns an instance of the appropriate DataSource subclass based on the provided engine parameter. The diff also modifies the behavior of certain test methods to raise more specific and accurate exceptions. Overall, these changes improve exception handling, streamline the codebase, and provide clearer error messages for software engineers.
  • Introduced morph_sql and morph_column_expr functions for inline transpilation and validation (#328). Two new classes, TranspilationResult and ValidationResult, have been added to the config module of the remorph package to store the results of transpilation and validation. The morph_sql and morph_column_exp functions have been introduced to support inline transpilation and validation of SQL code and column expressions. A new class, Validator, has been added to the validation module to handle validation, and the validate_format_result method within this class has been updated to return a ValidationResult object. The _query method has also been added to the class, which executes a given SQL query and returns a tuple containing a boolean indicating success, any exception message, and the result of the query. Unit tests for these new functions have been updated to ensure proper functionality.
  • Output for the reconcile function (#389). A new function get_key_form_dialect has been added to the config.py module, which takes a Dialect object and returns the corresponding key used in the SQLGLOT_DIALECTS dictionary. Additionally, the MorphConfig dataclass has been updated to include a new attribute __file__, which sets the filename to "config.yml". The get_dialect function remains unchanged. Two new exceptions, WriteToTableException and InvalidInputException, have been introduced, and the existing DataSourceRuntimeException has been modified in the same module to improve error handling. The execute.py file's reconcile function has undergone several changes, including adding imports for InvalidInputException, ReconCapture, and generate_final_reconcile_output from recon_exception and recon_capture modules, and modifying the ReconcileOutput type. The hash_query.py file's reconcile function has been updated to include a new _get_with_clause method, which returns a Select object for a given DataFrame, and the build_query method has been updated to include a new query construction step using the with_clause object. The threshold_query.py file's reconcile function's output has been updated to include query and logger statements, a new method for allowing user transformations on threshold aliases, and the dialect specified in the sql method. A new generate_final_reconcile_output function has been added to the recon_capture.py file, which generates a reconcile output given a recon_id and a SparkSession. New classes and dataclasses, including SchemaReconcileOutput, ReconcileProcessDuration, StatusOutput, ReconcileTableOutput, and ReconcileOutput, have been introduced in the reconcile/recon_config.py file. The tests/unit/reconcile/test_execute.py file has been updated to include new test cases for the recon function, including tests for different report types and scenarios, such as data, schema, and all report types, exceptions, and incorrect report types. A new test case, test_initialise_data_source, has been added to test the initialise_data_source function, and the test_recon_for_wrong_report_type test case has been updated to expect an InvalidInputException when an incorrect report type is passed to the recon function. The test_reconcile_data_with_threshold_and_row_report_type test case has been added to test the reconcile_data method of the Reconciliation class with a row report type and threshold options. Overall, these changes improve the functionality and robustness of the reconcile process by providing more fine-grained control over the generation of the final reconcile output and better handling of exceptions and errors.
  • Threshold Source and Target query builder (#348). In this release, we've introduced a new method, build_threshold_query, that constructs a customizable threshold query based on a table's partition, join, and threshold columns configuration. The method identifies the necessary columns, applies the specified transformations, and includes a WHERE clause based on the filter defined in the table configuration. The resulting query is then converted to a SQL string using the dialect of the source database. Additionally, we've updated the test file for the threshold query builder in the reconcile package, including refactoring of function names and updated assertions for query comparison, and added two new test methods: test_build_threshold_query_with_single_threshold and test_build_threshold_query_with_multiple_thresholds. These changes provide a more robust and customizable threshold query builder and improve test coverage for various configurations and scenarios. A rough sketch of this style of query construction also appears after this list.
  • Unpack nested alias (#336). This release introduces a significant update to the 'lca_utils.py' file, addressing the previous limitation of not handling nested aliases in window expressions and where clauses, which resolves issue #334. The unalias_lca_in_select method has been implemented to recursively parse nested selects and unalias lateral column aliases, thereby identifying and handling unsupported lateral column aliases. This method is utilized in the check_for_unsupported_lca method to handle unsupported lateral column aliases in the input SQL string. Furthermore, the 'test_lca_utils.py' file has been updated, impacting several test functions and introducing two new ones, test_fix_nested_lca and 'test_fix_nested_lca_with_no_scope', to verify that nested lateral column aliases are detected and unaliased reliably. These updates improve the library's functionality and test coverage.
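
The null-handling pattern described in the recon_config entry above can be illustrated with a small, hypothetical dataclass; the class name and fields below are placeholders, not the project's actual recon_config classes:

```python
# Illustrative only: a __post_init__ that lowercases configured names and
# tolerates null (None) column lists, mirroring the pattern described above.
from dataclasses import dataclass


@dataclass
class TableConfSketch:
    source_name: str
    target_name: str
    join_columns: list[str] | None = None
    select_columns: list[str] | None = None

    def __post_init__(self):
        self.source_name = self.source_name.lower()
        self.target_name = self.target_name.lower()
        # Leave None untouched; lowercase entries only when a list was provided.
        if self.join_columns is not None:
            self.join_columns = [c.lower() for c in self.join_columns]
        if self.select_columns is not None:
            self.select_columns = [c.lower() for c in self.select_columns]
```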
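
The threshold query construction described above can be roughly sketched with sqlglot's builder API; the table, columns, and filter below are placeholders, and the project's ThresholdQueryBuilder layers transformations and configuration on top of this:

```python
# Rough sketch: assemble a threshold-style query and render it in the source
# database's dialect. Table and column names are placeholders.
from sqlglot import select

query = (
    select("s_suppkey", "s_acctbal")   # key and threshold columns
    .from_("supplier")
    .where("s_acctbal > 1000")         # filter taken from the table configuration
)
print(query.sql(dialect="oracle"))
# SELECT s_suppkey, s_acctbal FROM supplier WHERE s_acctbal > 1000
```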

0.1.7

  • Added Configure Secrets support to databricks labs remorph configure-secrets cli command (#254). The Configure Secrets feature has been implemented in the databricks labs remorph CLI, specifically for the new configure-secrets command. This addition allows users to establish Scope and Secrets within their Databricks Workspace, enhancing security and control over resource access. The implementation includes a new recon_config_utils.py file in the databricks/labs/remorph/helpers directory, which contains classes and methods for managing Databricks Workspace secrets. Furthermore, the ReconConfigPrompts helper class has been updated to handle prompts for selecting sources, entering secret scope names, and handling overwrites. The CLI command has also been updated with a new configure_secrets function and corresponding tests to ensure correct functionality. A minimal sketch of the underlying secrets API calls follows this list.
  • Added handling for invalid alias usage by manipulating the AST (#219). The recent commit addresses the issue of invalid alias usage in SQL queries by manipulating the Abstract Syntax Tree (AST). It introduces a new method, unalias_lca_in_select, which unaliases Lateral Column Aliases (LCA) in the SELECT clause of a query. The AliasInfo class is added to manage aliases more effectively, with attributes for the name, expression, and a flag indicating if the alias name is the same as a column. Additionally, the execute.py file is modified to check for unsupported LCA using the lca_utils.check_for_unsupported_lca method, improving the system's robustness when handling invalid aliases. Test cases are also added in the new file, test_lca_utils.py, to validate the behavior of the check_for_unsupported_lca function, ensuring that SQL queries are correctly formatted for Snowflake dialect and avoiding errors due to invalid alias usage.
  • Added support for databricks labs remorph generate-lineage CLI command (#238). A new CLI command, databricks labs remorph generate-lineage, has been added to generate lineage for input SQL files, taking the source dialect, input, and output directories as arguments. The command uses existing logic to generate a directed acyclic graph (DAG) and then creates a DOT file in the output directory using the DAG. The new command is supported by new functions _generate_dot_file_contents, lineage_generator, and methods in the RootTableIdentifier and DAG classes. The command has been manually tested and includes unit tests, with plans for adding integration tests in the future. The commit also includes a new method temp_dirs_for_lineage and updates to the configure_secrets_databricks method to handle a new source type "databricks". The command handles invalid input and raises appropriate exceptions.
  • Custom oracle tokenizer (#316). In this release, the remorph library has been updated to enhance its handling of Oracle databases. A custom Oracle tokenizer has been developed that maps the LONG datatype to text (string), allowing for more precise parsing and manipulation of LONG columns in Oracle databases. The Oracle dialect in the configuration file has been updated to utilize the new custom Oracle tokenizer, and the Oracle class from the snow module has been imported and integrated into the Oracle dialect. The commit also updates test files in the functional/oracle/test_long_datatype directory to ensure the proper conversion of the LONG datatype to text, and adds a new test file to the tests/unit/snow directory that checks compatibility with Oracle's LONG data type. These changes enhance the library's compatibility with Oracle databases, ensuring accurate handling and manipulation of the LONG datatype in Oracle SQL and Databricks SQL. A sketch of this keyword-mapping pattern also follows this list.
  • Removed strict source dialect checks (#284). In the latest release, the transpile and generate_lineage functions in cli.py have undergone changes to allow for greater flexibility in source dialect selection. Previously, only snowflake or tsql dialects were supported, but now any source dialect supported by SQLGLOT can be used, controlled by the SQLGLOT_DIALECTS dictionary. Providing an unsupported source dialect will result in a validation error. Additionally, the input and output folder paths for the generate_lineage function are now validated against the file system to ensure their existence and validity. In the install.py file of the databricks/labs/remorph package, the source dialect selection has been updated to use SQLGLOT_DIALECTS.keys(), replacing the previous hardcoded list. This change allows for more flexibility in selecting the source dialect. Furthermore, recent updates to various test functions in the test_install.py file suggest that the source selection process has been modified, possibly indicating the addition of new sources or a change in source identification. These modifications provide greater flexibility in testing and potentially in the actual application.
  • Set Catalog, Schema from default Config (#312). A new feature has been added to our open-source library that allows users to specify the catalog and schema configuration options as part of the transpile command-line interface (CLI). If these options are not provided, the transpile function in the cli.py file will now set them to the values specified in default_config. This ensures that a default catalog and schema are used if they are not explicitly set by the user. The labs.yml file has been updated to reflect these changes, with the addition of the catalog-name and schema-name options to the commands object. The default property of the validation object has also been updated to true, indicating that the validation step will be skipped by default. These changes provide increased flexibility and ease-of-use for users of the transpile functionality.
  • Support for Null safe equality join for databricks generator (#280). In this release, we have implemented support for a null-safe equality join in the Databricks generator, addressing issue #280. This feature emits the <=> operator in the generated SQL code instead of the IS NOT DISTINCT FROM syntax, ensuring accurate comparisons when NULL values are present in the columns being joined. The Generator class has been updated with a new method, NullSafeEQ, which takes in an expression and returns the binary version of the expression using the <=> operator, and the preprocess method in the Generator class has been modified to include this new functionality. Note that this change may require users to update their existing code to align with the new syntax in the Databricks environment. With this enhancement, the Databricks generator performs null-safe equality joins that produce consistent results regardless of the presence of NULL values in the join conditions. A sketch of this operator mapping also follows this list.
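
The secret handling behind the configure-secrets command can be sketched with the databricks-sdk secrets API; the scope and key names below are placeholders, and the project's recon_config_utils.py adds prompting and overwrite handling around these calls:

```python
# Minimal sketch: create a secret scope and store source credentials in it.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

scope = "remorph_snowflake"      # placeholder scope name
w.secrets.create_scope(scope)    # raises an error if the scope already exists
w.secrets.put_secret(scope, key="sfUser", string_value="<user>")
w.secrets.put_secret(scope, key="sfPassword", string_value="<password>")
```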
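
The custom Oracle tokenizer entry above follows a standard sqlglot pattern of overriding the tokenizer's KEYWORDS mapping; the SketchOracle dialect below is hypothetical and shows only the LONG-to-text mapping:

```python
# Hypothetical dialect illustrating how LONG can be tokenized as a text type.
from sqlglot.dialects.oracle import Oracle
from sqlglot.tokens import TokenType


class SketchOracle(Oracle):
    class Tokenizer(Oracle.Tokenizer):
        KEYWORDS = {
            **Oracle.Tokenizer.KEYWORDS,
            "LONG": TokenType.TEXT,
        }
```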
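
The null-safe equality mapping can be sketched as a generator override that renders NullSafeEQ with the <=> operator; the class below is illustrative rather than the project's actual Databricks generator:

```python
# Illustrative generator: render null-safe equality (IS NOT DISTINCT FROM) as <=>.
from sqlglot import exp
from sqlglot.dialects.databricks import Databricks


class NullSafeEqSketch(Databricks.Generator):
    TRANSFORMS = {
        **Databricks.Generator.TRANSFORMS,
        exp.NullSafeEQ: lambda self, e: self.binary(e, "<=>"),
    }
```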

0.1.6

  • Added serverless validation using lsql library (#176). The WorkspaceClient object is now created with the product name and product_version, and the corresponding cluster_id or warehouse_id is passed as sdk_config in the MorphConfig object.
  • Enhanced install script to enforce usage of a warehouse or cluster when skip-validation is set to False (#213). In this release, the installation process has been enhanced to mandate the use of a warehouse or cluster when the skip-validation parameter is set to False. This change has been implemented across various components, including the install script, transpile function, and get_sql_backend function. Additionally, new pytest fixtures and methods have been added to improve test configuration and resource management during testing. Unit tests have been updated to enforce usage of a warehouse or cluster when the skip-validation flag is set to False, ensuring proper resource allocation and validation process improvement. This development focuses on promoting a proper setup and usage of the system, guiding new users towards a correct configuration and improving the overall reliability of the tool.
  • Patch subquery with json column access (#190). The open-source library has been updated with new functionality to modify how subqueries with JSON column access are handled in the snowflake.py file. This change includes the addition of a check for an opening parenthesis after the FROM keyword to detect and break loops when a subquery is found, as opposed to a table name. This improvement enhances the handling of complex subqueries and JSON column access, making the code more robust and adaptable to different query structures. Additionally, a new test method, test_nested_query_with_json, has been introduced to the tests/unit/snow/test_databricks.py file to test the behavior of nested queries involving JSON column access when using a Snowflake dialect. This new method validates the expected output of a specific nested query when it is transpiled to Snowflake's SQL dialect, allowing for more comprehensive testing of JSON column access and type casting in Snowflake dialects. The existing test_delete_from_keyword method remains unchanged.
  • Snowflake UPDATE FROM to Databricks MERGE INTO implementation (#198).
  • Use Runtime SQL backend in Notebooks (#211). In this update, the db_sql.py file in the databricks/labs/remorph/helpers directory has been modified to support the use of the Runtime SQL backend in Notebooks. This change includes the addition of a new RuntimeBackend class in the backends module and an import statement for os. The get_sql_backend function now returns a RuntimeBackend instance when the DATABRICKS_RUNTIME_VERSION environment variable is present, allowing for more efficient and secure SQL statement execution in Databricks notebooks. Additionally, a new test case for the get_sql_backend function has been added to ensure the correct behavior of the function in various runtime environments. These enhancements improve SQL execution performance and security in Databricks notebooks and increase the project's versatility for different use cases. A sketch of this backend selection appears after this list.
  • Added Issue Templates for bugs, feature and config (#194). Two new issue templates have been added to the project's GitHub repository to improve issue creation and management. The first template, located in .github/ISSUE_TEMPLATE/bug.yml, is for reporting bugs and prompts users to provide detailed information about the issue, including the current and expected behavior, steps to reproduce, relevant log output, and sample query. The second template, added under the path .github/ISSUE_TEMPLATE/config.yml, is for configuration-related issues and includes support contact links for general Databricks questions and Remorph documentation, as well as fields for specifying the operating system and software version. A new issue template for feature requests, named "Feature Request", has also been added, providing a structured format for users to submit requests for new functionality for the Remorph project. These templates will help streamline the issue creation process, improve the quality of information provided, and make it easier for the development team to quickly identify and address bugs and feature requests.
  • Added Databricks Source Adapter (#185). In this release, the project has been enhanced with several new features for the Databricks Source Adapter. A new engine parameter has been added to the DataSource class, replacing the original source parameter. The _get_secrets and _get_table_or_query methods have been updated to use the engine parameter for key naming and handling queries with a select statement differently, respectively. A Databricks Source Adapter for Oracle databases has been introduced, which includes a new OracleDataSource class that provides functionality to connect to an Oracle database using JDBC. A Databricks Source Adapter for Snowflake has also been added, featuring the SnowflakeDataSource class that handles data reading and schema retrieval from Snowflake. The DatabricksDataSource class has been updated to handle data reading and schema retrieval from Databricks, including a new get_schema_query method that generates the query to fetch the schema based on the provided catalog and table name. Exception handling for reading data and fetching schema has been implemented for all new classes. These changes provide increased flexibility for working with various data sources, improved code maintainability, and better support for different use cases.
  • Added Threshold Query Builder (#188). In this release, the open-source library has added a Threshold Query Builder feature, which includes several changes to the existing functionality in the data source connector. A new import statement adds the re module for regular expressions, and new parameters have been added to the read_data and get_schema abstract methods. The _get_jdbc_reader_options method has been updated to accept a options parameter of type "JdbcReaderOptions", and a new static method, "_get_table_or_query", has been added to construct the table or query string based on provided parameters. Additionally, a new class, "QueryConfig", has been introduced in the "databricks.labs.remorph.reconcile" package to configure queries for data reconciliation tasks. A new abstract base class QueryBuilder has been added to the query_builder.py file, along with HashQueryBuilder and ThresholdQueryBuilder classes to construct SQL queries for generating hash values and selecting columns based on threshold values, transformation rules, and filtering conditions. These changes aim to enhance the functionality of the data source connector, add modularity, customizability, and reusability to the query builder, and improve data reconciliation tasks.
  • Added snowflake connector code (#177). In this release, the open-source library has been updated to add a Snowflake connector for data extraction and schema manipulation. The changes include the addition of the SnowflakeDataSource class, which is used to read data from Snowflake using PySpark, and has methods for getting the JDBC URL, reading data with and without JDBC reader options, getting the schema, and handling exceptions. A new constant, SNOWFLAKE, has been added to the SourceDriver enum in constants.py, which represents the Snowflake JDBC driver class. The code modifications include updating the constructor of the DataSource abstract base class to include a new parameter 'scope', and updating the _get_secrets method to accept a key_name parameter instead of 'key'. Additionally, a test file 'test_snowflake.py' has been added to test the functionality of the SnowflakeDataSource class. This release also updates the pyproject.toml file to version lock the dependencies like black, ruff, and isort, and modifies the coverage report configuration to exclude certain files and lines from coverage checks. These changes were completed by Ravikumar Thangaraj and SundarShankar89. A sketch of a JDBC read of this kind also appears after this list.
  • remorph reconcile baseline for Query Builder and Source Adapter for oracle as source (#150).
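
The backend selection described in the Runtime SQL backend entry above can be sketched as follows; the helper name is hypothetical, while RuntimeBackend and StatementExecutionBackend come from the databricks-labs-lsql library:

```python
# Sketch: use the runtime backend inside a Databricks notebook/cluster, otherwise
# execute statements against a SQL warehouse. pick_sql_backend is illustrative.
import os

from databricks.labs.lsql.backends import RuntimeBackend, StatementExecutionBackend
from databricks.sdk import WorkspaceClient


def pick_sql_backend(ws: WorkspaceClient, warehouse_id: str):
    if "DATABRICKS_RUNTIME_VERSION" in os.environ:
        return RuntimeBackend()  # runs SQL via the local Spark session
    return StatementExecutionBackend(ws, warehouse_id)
```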
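
The Snowflake connector entry above boils down to a PySpark JDBC read; the URL, credentials, and query below are placeholders, and the project's SnowflakeDataSource wraps this with secret lookup, reader options, and exception handling:

```python
# Placeholder-only sketch of a JDBC read against Snowflake with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:snowflake://<account>.snowflakecomputing.com/?db=<db>&warehouse=<wh>")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("query", "SELECT s_suppkey, s_acctbal FROM supplier")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)
```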

Dependency updates:

  • Bump sqlglot from 22.4.0 to 22.5.0 (#175).
  • Updated databricks-sdk requirement from <0.22,>=0.18 to >=0.18,<0.23 (#178).
  • Updated databricks-sdk requirement from <0.23,>=0.18 to >=0.18,<0.24 (#189).
  • Bump actions/checkout from 3 to 4 (#203).
  • Bump actions/setup-python from 4 to 5 (#201).
  • Bump codecov/codecov-action from 1 to 4 (#202).
  • Bump softprops/action-gh-release from 1 to 2 (#204).

0.1.5

  • Added Pylint Checker (#149). This diff adds a Pylint checker to the project, which is used to enforce a consistent code style, identify potential bugs, and check for errors in the Python code. The configuration for Pylint includes various settings, such as a line length limit, the maximum number of arguments for a function, and the maximum number of lines in a module. Additionally, several plugins have been specified to load, which add additional checks and features to Pylint. The configuration also includes settings that customize the behavior of Pylint's naming conventions checks and handle various types of code constructs, such as exceptions, logging statements, and import statements. By using Pylint, the project can help ensure that its code is of high quality, easy to understand, and free of bugs. This diff includes changes to various files, such as cli.py, morph_status.py, validate.py, and several SQL-related files, to ensure that they adhere to the desired Pylint configuration and best practices for code quality and organization.
  • Fixed edge case where column name is same as alias name (#164). A recent commit has introduced fixes for edge cases related to conflicts between column names and alias names in SQL queries, addressing issues #164 and #130. The check_for_unsupported_lca function has been updated with two helper functions _find_aliases_in_select and _find_invalid_lca_in_window to detect aliases with the same name as a column in a SELECT expression and identify invalid Least Common Ancestors (LCAs) in window functions, respectively. The find_windows_in_select function has been refactored and renamed to _find_windows_in_select for improved code readability. The transpile and parse functions in the sql_transpiler.py file have been updated with try-except blocks to handle cases where a column name matches the alias name, preventing errors or exceptions such as ParseError, TokenError, and UnsupportedError. A new unit test, "test_query_with_same_alias_and_column_name", has been added to verify the fix, passing a SQL query with a subquery having a column alias ca_zip which is also used as a column name in the same query, confirming that the function correctly handles the scenario where a column name conflicts with an alias name.
  • TO_NUMBER without format edge case (#172). This commit addresses unsupported usage of the TO_NUMBER function in the Databricks SQL dialect when the format parameter is not provided. The new implementation introduces the constants PRECISION_CONST and SCALE_CONST (set to 38 and 0 respectively) as default values for the precision and scale parameters, and the _to_number method has been modified to incorporate these constants so that Databricks SQL dialect requirements are met. An UnsupportedError is now raised when TO_NUMBER is called without a format parameter, improving error handling and ensuring users are aware of the required format parameter. Test cases have been added for the TO_DECIMAL, TO_NUMERIC, and TO_NUMBER functions with format strings, covering cases where the format is taken from table columns, and the commit also ensures that an error is raised when TO_DECIMAL is called without a format parameter. A sketch of this defaulting logic follows below.
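
The defaulting logic for TO_NUMBER can be sketched as below; the helper is illustrative (the project's _to_number lives in its Snowflake-to-Databricks transpiler) and uses ValueError in place of the project's UnsupportedError:

```python
# Illustrative helper: default precision/scale to DECIMAL(38, 0) and require a
# format string, mirroring the behaviour described above.
PRECISION_CONST = 38
SCALE_CONST = 0


def to_number_sketch(expr: str, fmt: str | None, precision: int | None = None, scale: int | None = None) -> str:
    if fmt is None:
        raise ValueError("TO_NUMBER requires a format parameter for this conversion")
    precision = PRECISION_CONST if precision is None else precision
    scale = SCALE_CONST if scale is None else scale
    return f"CAST(TO_NUMBER({expr}, '{fmt}') AS DECIMAL({precision}, {scale}))"
```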

Dependency updates:

  • Bump sqlglot from 21.2.1 to 22.0.1 (#152).
  • Bump sqlglot from 22.0.1 to 22.1.1 (#159).
  • Updated databricks-labs-blueprint[yaml] requirement from ~=0.2.3 to >=0.2.3,<0.4.0 (#162).
  • Bump sqlglot from 22.1.1 to 22.2.0 (#161).
  • Bump sqlglot from 22.2.0 to 22.2.1 (#163).
  • Updated databricks-sdk requirement from <0.21,>=0.18 to >=0.18,<0.22 (#168).
  • Bump sqlglot from 22.2.1 to 22.3.1 (#170).
  • Updated databricks-labs-blueprint[yaml] requirement from <0.4.0,>=0.2.3 to >=0.2.3,<0.5.0 (#171).
  • Bump sqlglot from 22.3.1 to 22.4.0 (#173).

0.1.4

  • Added conversion logic for Try_to_Decimal without format (#142).
  • Identify Root Table for folder containing SQLs (#124).
  • Install Script (#106).
  • Integration Test Suite (#145).

Dependency updates:

  • Updated databricks-sdk requirement from <0.20,>=0.18 to >=0.18,<0.21 (#143).
  • Bump sqlglot from 21.0.0 to 21.1.2 (#137).
  • Bump sqlglot from 21.1.2 to 21.2.0 (#147).
  • Bump sqlglot from 21.2.0 to 21.2.1 (#148).

0.1.3

  • Added support for WITHIN GROUP for ARRAY_AGG and LISTAGG functions (#133).
  • Fixed Merge "INTO" for delete from syntax (#129).
  • Fixed DATE TRUNC parse errors (#131).
  • Patched Logger function call during wheel file (#135).
  • Patched extra call to root path (#126).

Dependency updates:

  • Updated databricks-sdk requirement from ~=0.18.0 to >=0.18,<0.20 (#134).

0.1.2

  • Fixed duplicate LCA warnings (#108).
  • Fixed invalid flagging of LCA usage (#117).

Dependency updates:

  • Bump sqlglot from 20.10.0 to 20.11.0 (#95).
  • Bump sqlglot from 20.11.0 to 21.0.0 (#122).

0.1.1

  • Added test_approx_percentile and test_trunc Testcases (#98).
  • Updated contributing/developer guide (#97).

0.1.0

  • Added baseline for Databricks CLI frontend (#60).
  • Added custom Databricks dialect test cases and lateral struct parsing (#77).
  • Extended Snowflake to Databricks functions coverage (#72, #69).
  • Added databricks labs remorph transpile documentation for installation and usage (#73).

Dependency updates:

  • Bump sqlglot from 20.8.0 to 20.9.0 (#83).
  • Updated databricks-sdk requirement from ~=0.17.0 to ~=0.18.0 (#90).
  • Bump sqlglot from 20.9.0 to 20.10.0 (#91).

0.0.1

Initial commit