[BUG]: parquetwriter fails with text datatype where characters exceed 4000 #584

Open
areddyme986 opened this issue Jan 3, 2025 · 0 comments


Library Version

4.25.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

Code I am using to write the dataset to a Parquet file:

bool rtnVal = false;
const int BatchSize = 10000;
var rowCount = 0;

// Describe each reader column by name and CLR type.
var columns = Enumerable.Range(0, reader.FieldCount)
    .Select(o => new { FieldName = reader.GetName(o), DataType = reader.GetFieldType(o) })
    .ToList();

// Build one nullable Parquet data field per column.
// GetFieldType already returns a System.Type, so it is passed through directly;
// skipping a column here would leave the schema and the row arrays out of step.
var columnSchema = new List<DataField>(columns.Count);
foreach (var item in columns)
{
    columnSchema.Add(new DataField(name: item.FieldName, clrType: item.DataType, isNullable: true));
}

var schema = new ParquetSchema(columnSchema);
var dataset = new Parquet.Rows.Table(schema);
var outputFile = new FileInfo(filePath);

using (var fileStream = outputFile.Create())
{
    using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fileStream, new ParquetOptions { UseDeltaBinaryPackedEncoding = false });
    writer.CompressionMethod = CompressionMethod.Snappy;
    if (reader.HasRows)
    {
        while (await reader.ReadAsync())
        {
            // ToParquetType is a helper that is not shown in this report.
            var row = Enumerable.Range(0, reader.FieldCount)
                .Select(o => ToParquetType(reader.GetValue(o)))
                .ToArray();

            dataset.Add(row);
            rowCount++;

            // Flush on the buffered row count so every batch has the same size.
            if (dataset.Count >= BatchSize)
            {
                await writer.WriteAsync(dataset);
                dataset.Clear();
            }
        }

        // Write whatever is left over after the last full batch.
        if (dataset.Count > 0)
        {
            await writer.WriteAsync(dataset);
            dataset.Clear();
        }
    }
}
rtnVal = true;
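
The batching in the loop above boils down to three steps: buffer rows, flush when the buffer is full, and flush the remainder at the end. Below is a minimal sketch of that pattern on its own, with no Parquet.Net or OleDb dependency. `Batching`, `WriteInBatchesAsync`, and `flushAsync` are hypothetical names chosen for illustration and are not part of any library. Keying the flush on the buffer size, rather than on a separate counter, keeps every full batch the same size.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Batching
{
    // Buffers items from `rows` and calls `flushAsync` each time the buffer
    // holds `batchSize` items, then once more for any remainder.
    // The same list instance is reused and cleared after every flush, so the
    // callback must not keep a reference to it. Returns the total item count.
    public static async Task<int> WriteInBatchesAsync<T>(
        IEnumerable<T> rows,
        int batchSize,
        Func<IReadOnlyList<T>, Task> flushAsync)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var buffer = new List<T>(batchSize);
        int total = 0;

        foreach (T row in rows)
        {
            buffer.Add(row);
            total++;

            if (buffer.Count == batchSize)
            {
                await flushAsync(buffer);
                buffer.Clear();
            }
        }

        // Flush the final partial batch, if any.
        if (buffer.Count > 0)
        {
            await flushAsync(buffer);
            buffer.Clear();
        }

        return total;
    }
}
```

In the code from this report, the callback would be the place where the buffered rows are handed to `writer.WriteAsync`.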

The issue occurs while writing data from a logs table to the Parquet file. One of the table's columns has the text data type and holds more than 8000 characters (nested JSON values). Number of rows in the table: 98k.

If I filter the table by data length and select rows with the condition "select col1, col2, col3 from table where datalength(col3) < 4000", I can write the Parquet file successfully. Record count: 33k.

If I select rows whose data length is greater than 4000 characters, the code fails with the error below. I tried batch sizes of 500, 10000, and 100000. The only time this works is when all the data is written at once instead of in batches.

Could you please advise on how to handle writing this data (columns with values longer than 4000 characters), and please correct me if I am doing something wrong.

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

Repeat 2 times:

--------------------------------

   at System.Data.Common.UnsafeNativeMethods+IRowset.GetNextRows(IntPtr, IntPtr, IntPtr, IntPtr ByRef, IntPtr ByRef)

--------------------------------

   at System.Data.OleDb.OleDbDataReader.GetRowHandles()

   at System.Data.OleDb.OleDbDataReader.ReadRowset()

   at System.Data.OleDb.OleDbDataReader.Read()

   at System.Data.Common.DbDataReader.ReadAsync(System.Threading.CancellationToken)

   at RadarSync.Data.MetaDataDL+<WriteParquetFile>d__22.MoveNext()

   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Boolean, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext(System.Threading.Thread)

   at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)

   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean)

   at System.Threading.Tasks.Task.RunContinuations(System.Object)

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].SetExistingTaskResult(System.Threading.Tasks.Task`1<System.Threading.Tasks.VoidTaskResult>, System.Threading.Tasks.VoidTaskResult)

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.SetResult()

   at Parquet.ParquetExtensions+<WriteAsync>d__2.MoveNext()

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].ExecutionContextCallback(System.Object)

   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext(System.Threading.Thread)

   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext()

   at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)

   at System.Threading.Tasks.AwaitTaskContinuation.System.Threading.IThreadPoolWorkItem.Execute()

   at System.Threading.ThreadPoolWorkQueue.Dispatch()

   at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

   at System.Data.OleDb.OleDbDataReader.DoValueCheck(Int32 ordinal)
   at System.Data.OleDb.OleDbDataReader.get_Item(Int32 index)
   at RadarSync.Data.MetaDataDL.WriteParquetFile(OleDbDataReader reader, String filePath, String operationName) in C:\Users\ameenige\Aravind\repos\Radarsync6-ETL\RadarSync6\RadarSync.Data\MetaDataDL.cs:line 1057
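
The topmost frames of the trace above are in System.Data.OleDb (`IRowset.GetNextRows`, `OleDbDataReader.Read`), not in Parquet code. One way to narrow this down, offered only as a diagnostic sketch, is to drain the same reader without making any Parquet calls and see whether the failure still occurs. `ReaderProbe` and `DrainAsync` are hypothetical names used for illustration. The method accepts any `DbDataReader`, so it can be tried against an in-memory `DataTable` first.

```csharp
using System;
using System.Data;
using System.Data.Common;
using System.Threading.Tasks;

public static class ReaderProbe
{
    // Reads every row from the reader and touches every column value,
    // without writing anything. Returns the number of rows read and the
    // length of the longest string value seen in any column.
    public static async Task<(int Rows, int MaxTextLength)> DrainAsync(DbDataReader reader)
    {
        int rows = 0;
        int maxLen = 0;

        while (await reader.ReadAsync())
        {
            for (int c = 0; c < reader.FieldCount; c++)
            {
                // GetValue returns DBNull.Value for nulls, which is not a string.
                if (reader.GetValue(c) is string s && s.Length > maxLen)
                {
                    maxLen = s.Length;
                }
            }
            rows++;
        }

        return (rows, maxLen);
    }
}
```

If draining the reader alone reproduces the access violation for rows longer than 4000 characters, that would point at the reader or provider side rather than at the Parquet writer. If it completes, the interaction between reading and batched writing is the next thing to look at.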

Failing test

No response

@areddyme986 areddyme986 changed the title [BUG]: [BUG]: parquetwriter fails with text datatype where characters exceed 4000 Jan 7, 2025