Code I am using to write the dataset to a Parquet file:
bool rtnVal = false;
int BatchSize = 10000;
var rowCount = 0;

// Build the Parquet schema from the reader's column metadata.
var columns = Enumerable.Range(0, reader.FieldCount)
    .Select(o => new { FieldName = reader.GetName(o), DataType = reader.GetFieldType(o) })
    .ToList();
var columnSchema = new List<DataField>(columns.Count);
foreach (var item in columns)
{
    string fieldName = item.FieldName;
    if (item.DataType.FullName is not null)
    {
        Type? type = Type.GetType(item.DataType.FullName);
        if (type is not null)
        {
            columnSchema.Add(new DataField(name: fieldName, clrType: type, isNullable: true));
        }
    }
}
var schema = new ParquetSchema(columnSchema);
var dataset = new Parquet.Rows.Table(schema);
var outputFile = new FileInfo(filePath);
using (var fileStream = outputFile.Create())
{
    using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fileStream, new ParquetOptions { UseDeltaBinaryPackedEncoding = false });
    writer.CompressionMethod = CompressionMethod.Snappy;
    if (reader.HasRows)
    {
        while (await reader.ReadAsync())
        {
            // ToParquetType is a helper defined elsewhere (not shown here).
            var row = Enumerable.Range(0, reader.FieldCount)
                .Select(o => ToParquetType(reader.GetValue(o)))
                .ToArray();
            dataset.Add(row);
            // Flush on the buffered row count; the original separate counter was
            // reset to 1 and then incremented, so later batches held BatchSize - 1 rows.
            if (dataset.Count >= BatchSize)
            {
                await writer.WriteAsync(dataset);
                dataset.Clear();
            }
            rowCount++;
        }
        // Write any remaining rows that did not fill a complete batch.
        if (dataset.Count > 0)
        {
            await writer.WriteAsync(dataset);
            dataset.Clear();
        }
    }
}
rtnVal = true;
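To check the batching pattern on its own (buffer rows, flush every `BatchSize` rows, flush the remainder at the end), it can be separated from the database reader and from Parquet.Net. The following is only a minimal sketch under stated assumptions, not the application code: `BatchingSketch` and `InBatches` are hypothetical names, and plain integers stand in for the reader rows and for the `writer.WriteAsync(dataset)` call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchingSketch
{
    // Hypothetical helper: groups a row stream into lists of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> rows, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var buffer = new List<T>(batchSize);
        foreach (var row in rows)
        {
            buffer.Add(row);
            if (buffer.Count == batchSize)
            {
                // A full batch is ready; in the real code this is where the write would happen.
                yield return buffer;
                buffer = new List<T>(batchSize);
            }
        }

        // Emit the final partial batch, if any rows are left over.
        if (buffer.Count > 0) yield return buffer;
    }

    public static void Main()
    {
        // 25 stand-in rows with a batch size of 10.
        var sizes = InBatches(Enumerable.Range(0, 25), 10).Select(b => b.Count);
        Console.WriteLine(string.Join(",", sizes)); // prints 10,10,5
    }
}
```

Keying the flush on the buffer's own count avoids a separate counter that has to be reset in step with the buffer.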
The issue occurs while writing a logs table to the Parquet file. One column in this table has the data type text and holds values longer than 8000 characters (nested JSON values). Number of rows in the table: 98k.
If I filter the table by data length and select rows with a condition such as " select col1, col2, col3 from table where datalength(col3) < 4000 ", I can write the Parquet file successfully. Record count: 33k.
If I select rows with a datalength greater than 4000 characters, the code fails with the error below. I tried batch sizes of 500, 10000, and 100000. The only time it works is when I write all the data at once instead of in batches.
Could you please advise on how to handle writing this data (columns with values longer than 4000 characters), and correct me if I am doing something wrong?
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
--------------------------------
at System.Data.Common.UnsafeNativeMethods+IRowset.GetNextRows(IntPtr, IntPtr, IntPtr, IntPtr ByRef, IntPtr ByRef)
--------------------------------
at System.Data.OleDb.OleDbDataReader.GetRowHandles()
at System.Data.OleDb.OleDbDataReader.ReadRowset()
at System.Data.OleDb.OleDbDataReader.Read()
at System.Data.Common.DbDataReader.ReadAsync(System.Threading.CancellationToken)
at RadarSync.Data.MetaDataDL+<WriteParquetFile>d__22.MoveNext()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Boolean, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext(System.Threading.Thread)
at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean)
at System.Threading.Tasks.Task.RunContinuations(System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].SetExistingTaskResult(System.Threading.Tasks.Task`1<System.Threading.Tasks.VoidTaskResult>, System.Threading.Tasks.VoidTaskResult)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.SetResult()
at Parquet.ParquetExtensions+<WriteAsync>d__2.MoveNext()
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].ExecutionContextCallback(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext(System.Threading.Thread)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.ParquetExtensions+<WriteAsync>d__2, Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext()
at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)
at System.Threading.Tasks.AwaitTaskContinuation.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
at System.Data.OleDb.OleDbDataReader.DoValueCheck(Int32 ordinal)
at System.Data.OleDb.OleDbDataReader.get_Item(Int32 index)
at RadarSync.Data.MetaDataDL.WriteParquetFile(OleDbDataReader reader, String filePath, String operationName) in C:\Users\ameenige\Aravind\repos\Radarsync6-ETL\RadarSync6\RadarSync.Data\MetaDataDL.cs:line 1057
Failing test
No response
areddyme986 changed the title from "[BUG]:" to "[BUG]: parquetwriter fails with text datatype where characters exceed 4000" on Jan 7, 2025
Library Version
4.25.0
OS
Windows
OS Architecture
64 bit