Reading huge (21GB) Oasis file #280

walkerstop · 2024-11-24T05:24:08Z

Hi, I was wondering if anyone has had success reading very large .oas files.
I am using the C++ gdstk library, built and running on a SuSE Linux server that has 512GB of RAM.
My .oas file is 21GB.
The call to gdstk::read_oas() has been running for 2 days and has consumed 110GB of RAM so far, but it still has not returned.
Just wondering if anyone had any ideas to try. I am also running some performance profiling to see where the bottleneck is but haven't looked into the results yet.
I believe (based on limited information so far) that a lot of the CPU cycles are being spent in calls to calloc() coming from allocate_clear().
In my case I actually only need to read a few of the layer/data types (tags) in the file.
I have already added shape_tags and label_tags filters (like gdstk::read_gds() already has) and so I am throwing away MOST of the content of this file.
However, the way that I implemented shape_tags and label_tags was kind of stupid and is probably not helping performance. I let read_oas() allocate the structures and read the elements, and only once I know the tag do I check it: if it's not a tag I want, I free the structures and don't add them to the library.
I know this is wasting a lot of cycles allocating and then immediately freeing memory.
I was thinking it might improve performance if I read elements into temporary structures on the stack until I have read the tag, and then allocate and add them to the library only if I need to keep them. I haven't tried this yet. It's not super straightforward to me how to do it, because of some of the modal pointers and because the space needed to read an element is not always the same.
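One possible way around the "space is not always the same" problem, as a rough sketch only: decode each point list into a single scratch buffer that is reused from record to record, and copy it to the heap only when the tag is wanted. Every name here (`Point`, `Polygon`, `decode_points`, `keep_if_wanted`) is hypothetical, and the sketch uses `std::vector` instead of gdstk's own containers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; none of these names come from gdstk.
struct Point {
    double x, y;
};

struct Polygon {
    uint64_t tag;
    std::vector<Point> points;
};

// Decode one record's point list into a caller-owned scratch buffer.
// clear() keeps the capacity, so once the buffer has grown large enough
// no further allocation happens here.
void decode_points(const double* coords, size_t count, std::vector<Point>& scratch) {
    assert(coords != nullptr || count == 0);
    scratch.clear();
    for (size_t i = 0; i < count; i++) {
        scratch.push_back({coords[2 * i], coords[2 * i + 1]});
    }
}

// Copy the scratch data to the heap only when the tag is wanted.
// Returns true if a polygon was appended to `out`.
bool keep_if_wanted(uint64_t tag, uint64_t wanted_tag, const std::vector<Point>& scratch,
                    std::vector<Polygon>& out) {
    if (tag != wanted_tag) return false;  // discarded: nothing allocated, nothing freed
    out.push_back({tag, scratch});
    return true;
}
```

With this shape, a discarded element costs only the decoding work; the per-element allocate/free pair disappears.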
Anyway, if anyone has any ideas, I'd love to hear them.
If I do manage to support shape_tags and label_tags in read_oas() in a smarter way, I would be happy to open a pull request in case the change is useful to others, but I have not gotten that far.
Something I have tried already is using larger read buffers instead of my system's default 8KB buffers. That didn't seem to help much.
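For anyone who wants to repeat the buffer experiment, a minimal sketch of enlarging the stdio buffer with the standard `setvbuf` call is below. The helper name `open_with_buffer` is made up for illustration, and where such a call would go inside gdstk is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Opens `path` for binary reading with a stdio buffer of `buffer_size` bytes.
// `storage` must outlive the stream, because setvbuf does not copy it.
std::FILE* open_with_buffer(const char* path, std::vector<char>& storage,
                            std::size_t buffer_size) {
    assert(buffer_size > 0);
    std::FILE* in = std::fopen(path, "rb");
    if (!in) return nullptr;
    storage.resize(buffer_size);
    // setvbuf has to be called before any other operation on the stream.
    if (std::setvbuf(in, storage.data(), _IOFBF, storage.size()) != 0) {
        std::fclose(in);
        return nullptr;
    }
    return in;
}
```

A larger buffer only reduces the number of read system calls, so it cannot help if the time is going into allocation rather than I/O.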
Another idea I have not tried yet would be using mmap() instead of fread(), but based on what I've read so far, it's not clear to me whether this would be much faster or not. I kind of doubt that the bottleneck is fread() anyway, but I should know more once I've done more profiling.

tvt173 · 2024-11-24T05:32:16Z

Have you tried doing it with klayout instead? I find that klayout typically behaves better in situations like this

walkerstop · 2024-11-24T08:42:41Z

> Have you tried doing it with klayout instead? I find that klayout typically behaves better in situations like this

Thanks for the suggestion, but unfortunately our app is closed source, so the KLayout copyleft license means I can't use it for our app.

heitzmann · 2024-11-25T20:12:29Z

This is something that can be added, but it will require some work. Basically, we need to add a filter list in read_oas similar to shape_tags in read_gds and change the way it loads all shapes and text by reading the stored data first, updating all modal variables, and then only allocating the structure if the layer/datatype was required.
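The order of operations described here (read the stored data, update the modal variables, allocate only if the tag is wanted) can be sketched on a toy record format. This is not gdstk's code: `Record`, `Modal`, `Shape`, `pack_tag` and `load_filtered` are all hypothetical, and treating an empty filter as "keep everything" is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model only: these types are hypothetical and far simpler than gdstk's.
using Tag = uint64_t;

Tag pack_tag(uint32_t layer, uint32_t type) { return ((uint64_t)type << 32) | layer; }

struct Record {
    bool has_layer;
    uint32_t layer;
    bool has_datatype;
    uint32_t datatype;
    int payload;  // stands in for the geometry data
};

struct Modal {
    uint32_t layer = 0;
    uint32_t datatype = 0;
    bool layer_defined = false;
    bool datatype_defined = false;
};

struct Shape {
    Tag tag;
    int payload;
};

// Keeps only shapes whose tag is in `shape_tags`; an empty set keeps everything.
std::vector<Shape> load_filtered(const std::vector<Record>& records,
                                 const std::set<Tag>& shape_tags) {
    Modal modal;
    std::vector<Shape> kept;
    for (const Record& r : records) {
        // 1. Always update the modal state, even for records that get discarded.
        if (r.has_layer) { modal.layer = r.layer; modal.layer_defined = true; }
        if (r.has_datatype) { modal.datatype = r.datatype; modal.datatype_defined = true; }
        assert(modal.layer_defined && modal.datatype_defined);

        // 2. Only now decide whether the shape is worth storing.
        Tag tag = pack_tag(modal.layer, modal.datatype);
        if (!shape_tags.empty() && shape_tags.count(tag) == 0) continue;
        kept.push_back({tag, r.payload});
    }
    return kept;
}
```

Step 1 has to run for discarded records too: a later record that omits its layer inherits it from the last record that set one, whether or not that record was kept.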

walkerstop · 2024-11-27T04:01:04Z

I will probably do this. However, I am not sure I understand how some of the modal variables work.

I thought that the point of the modal variables was that if we read a record and it doesn't include some of the attributes, they can be re-used from the previous record by using the modal variable again.

So for example if you have a bunch of polygons with mostly the same attributes you could read them once and re-use them over and over.

However, this doesn't seem to be the case for all of the modal variables.

For example, take modal_repetition: we only copy this repetition to the record being read if that record contains a repetition. I don't see anywhere that the code re-uses the modal_repetition that was read by a previous record.

If I look at modal_textlayer and modal_texttype, on the other hand, these variables get used to set the tag of a label even if the label didn't include a layer or type. It looks like I could have 1 label record that included a layer and datatype, and then if every label after that didn't include any layer or datatype, it would just re-use the previous modal values over and over.

This is different from modal_repetition. I can't see what the purpose of modal_repetition is, since it only gets copied if the current record includes a repetition.

Another example is modal_polygon_points: it looks like if I had 1 polygon with a point list and then a bunch with no point list, it would just re-use modal_polygon_points for all of them.

So I guess I don't really understand the inconsistency in the way the modal variables are used.

Then if I look at modal_layer and modal_datatype, those are shared between polygons, paths, trapezoids, circles and XGEOMETRY. Is that how it's supposed to work? But labels don't share the same modal layer or modal datatype as other shapes?

Just want to make sure I understand how it's supposed to work before I go off making code changes.
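For what it's worth, the OASIS format does define layer/datatype and textlayer/texttype as separate modal variables: the first pair is shared by the geometry records, the second is used only by TEXT records. A minimal sketch of that reading is below; `OptionalField`, `ModalPair` and `ModalState` are hypothetical names, not gdstk's internals.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy types; not gdstk's internals.
struct OptionalField {
    bool present;
    uint32_t value;
};

// One (layer, type) pair of modal variables.
struct ModalPair {
    uint32_t layer = 0;
    uint32_t type = 0;
    bool layer_defined = false;
    bool type_defined = false;

    // Applies whatever the record carries, then returns the effective tag.
    uint64_t resolve(OptionalField l, OptionalField t) {
        if (l.present) { layer = l.value; layer_defined = true; }
        if (t.present) { type = t.value; type_defined = true; }
        // Relying on a modal variable that was never set would be an error.
        assert(layer_defined && type_defined);
        return ((uint64_t)type << 32) | layer;
    }
};

struct ModalState {
    ModalPair geometry;  // layer / datatype: polygons, paths, trapezoids, circles, ...
    ModalPair text;      // textlayer / texttype: TEXT records only
};
```

Under this model a label that omits its layer inherits from the previous label, never from the previous polygon, and vice versa.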

walkerstop · 2024-11-27T17:40:43Z

OK so I was missing something about how modal_repetition works before, sorry.
I see now that when read_oas() calls oasis_read_repetition(in, factor, modal_repetition) it will only overwrite modal_repetition if the repetition type it reads is not 0. If it's 0 then modal_repetition stays the same and gets copied to the new polygon or whatever.
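The behaviour described above can be sketched in a toy form. This is not the real oasis_read_repetition(): the `Repetition` here is just a grid size, only one non-zero repetition type is modelled, and the names `ModalRepetition`, `read_repetition` and `element_repetition` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy repetition: just a grid size, not gdstk's Repetition.
struct Repetition {
    uint32_t columns = 1;
    uint32_t rows = 1;
};

struct ModalRepetition {
    Repetition value;
    bool defined = false;
};

// Type 0 means "re-use the previous repetition": the modal value is left alone.
// Any other type (only one kind is modelled here) overwrites it.
void read_repetition(uint8_t type, uint32_t columns, uint32_t rows, ModalRepetition& modal) {
    if (type == 0) {
        assert(modal.defined);  // nothing to re-use otherwise
        return;
    }
    modal.value.columns = columns;
    modal.value.rows = rows;
    modal.defined = true;
}

// Returns the repetition for one element. Records without a repetition field
// neither read nor change the modal value.
Repetition element_repetition(bool has_repetition, uint8_t type, uint32_t columns,
                              uint32_t rows, ModalRepetition& modal) {
    if (!has_repetition) return Repetition{};
    read_repetition(type, columns, rows, modal);
    return modal.value;  // copied into the new element
}
```

So the re-use does happen, but it is triggered by an explicit type-0 repetition in the record rather than by the repetition being absent.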
I think it makes sense now. I implemented shape_tags and label_tags this way (update all the modal variables, then only allocate records on the heap if the tags match those requested).
It seems to be working correctly for me. I need to run some more regression tests, but I think it's working.
After those changes I am able to read this 21GB file in 16 minutes whereas previously it ran for several days without finishing.

Roy-974 · 2024-12-13T07:29:52Z

Hello @walkerstop, if possible, could you create a pull request on your implementation? I face a similar problem. Thanks in advance

walkerstop · 2024-12-19T09:49:47Z

> Hello @walkerstop, if possible, could you create a pull request on your implementation? I face a similar problem. Thanks in advance

I created the pull request with that and also .gds.gz support:
#286

But I need help testing it

I only tested these features in my closed-source app which requires a lot of other code changes. I know it works there.

walkerstop mentioned this issue Dec 19, 2024

Add .gds.gz support to read_gds() and shape tags and layer tags filtering to read_oas() #286

Open
