Reading huge (21GB) Oasis file #280

Open
walkerstop opened this issue Nov 24, 2024 · 7 comments

Comments

@walkerstop

walkerstop commented Nov 24, 2024

Hi, I was wondering if anyone has had success reading very large .oas files.
I am using the C++ gdstk library, built and running on a SuSE Linux server that has 512GB of RAM.
My .oas file is 21GB.
The call to gdstk::read_oas() has been running for 2 days and has consumed 110GB of RAM so far, but it still has not returned.
Just wondering if anyone had any ideas to try. I am also running some performance profiling to see where the bottleneck is but haven't looked into the results yet.
I believe (based on limited information so far) that a lot of the CPU cycles are being spent in calls to calloc() coming from allocate_clear()
In my case I actually only need to read a few of the layer/data types (tags) in the file.
I have already added shape_tags and label_tags filters (like gdstk::read_gds() already has) and so I am throwing away MOST of the content of this file.
However, the way that I implemented shape_tags and label_tags is naive and is probably not helping performance: I let read_oas() allocate the structures and read the elements, and only once I know the tag do I free the structures I don't want instead of adding them to the library.
I know this is wasting a lot of cycles allocating and then immediately freeing memory.
I was thinking it might improve performance if I read elements into temporary structures on the stack until I have read the tag, and then only allocate and add them to the library if I need to keep them. I haven't tried this yet; it's not straightforward to me because of some of the modal pointers, and because the space needed to read an element is not always the same.
Anyway, if anyone has any ideas, I'd love to hear them.
If I do manage to support shape_tags and label_tags in read_oas() in a smarter way, I would be happy to open a pull request in case the change is useful to others, but I have not gotten that far.
Something I have tried already is using larger read buffers instead of my system's default 8KB buffers. That didn't seem to help much.
Another idea I have not tried yet would be using mmap() instead of fread(), but based on what I've read so far, it's not clear to me whether this would be much faster or not. I kind of doubt that the bottleneck is fread() anyway, but I should know more once I've done more profiling.
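For reference, the larger-read-buffer experiment mentioned above can be reproduced with the standard setvbuf() call. This is only a minimal sketch: open_with_buffer and count_bytes are hypothetical helper names, not part of gdstk, and the sketch assumes you control the FILE* that the reader uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper (not a gdstk function): opens `path` for binary reading
// and replaces the default stdio buffer with `buffer`.
// `buffer` must stay alive until the stream is closed, and setvbuf() has to be
// called before any other I/O is performed on the stream.
std::FILE* open_with_buffer(const char* path, std::vector<char>& buffer) {
    assert(!buffer.empty());
    std::FILE* in = std::fopen(path, "rb");
    if (in == nullptr) return nullptr;
    if (std::setvbuf(in, buffer.data(), _IOFBF, buffer.size()) != 0) {
        std::fclose(in);
        return nullptr;
    }
    return in;
}

// Hypothetical helper: reads the whole stream in small chunks and returns the
// number of bytes seen. Each fread() is served from the large stdio buffer.
std::size_t count_bytes(std::FILE* in) {
    char chunk[4096];
    std::size_t total = 0;
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, in)) > 0) total += n;
    return total;
}
```

A larger buffer only reduces the number of read system calls, so it is not expected to help when the time is going into allocation rather than I/O.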

@tvt173

tvt173 commented Nov 24, 2024

Have you tried doing it with klayout instead? I find that klayout typically behaves better in situations like this

@walkerstop
Author

walkerstop commented Nov 24, 2024

> Have you tried doing it with klayout instead? I find that klayout typically behaves better in situations like this

Thanks for the suggestion, but unfortunately our app is closed source, so KLayout's copyleft license means I can't use it.

@heitzmann
Owner

This is something that can be added, but it will require some work. Basically, we need to add a filter list in read_oas similar to shape_tags in read_gds and change the way it loads all shapes and text by reading the stored data first, updating all modal variables, and then only allocating the structure if the layer/datatype was required.
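The order of operations described above (read the stored data, update the modal variables, then allocate only for requested tags) might look roughly like the following. This is a toy sketch and not gdstk code: RawRecord, Modal, Shape and load_filtered are hypothetical names, records are assumed to be already decoded, and modal variables start at zero here for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <utility>
#include <vector>

using Tag = std::pair<std::uint32_t, std::uint32_t>;  // (layer, datatype)

// Hypothetical stand-in for one decoded shape record.
// A field the record did not store is left empty.
struct RawRecord {
    std::optional<std::uint32_t> layer;
    std::optional<std::uint32_t> datatype;
    std::optional<std::vector<std::int64_t>> points;
};

// Modal state carried from one record to the next.
struct Modal {
    std::uint32_t layer = 0;
    std::uint32_t datatype = 0;
    std::vector<std::int64_t> points;
};

struct Shape {
    Tag tag;
    std::vector<std::int64_t> points;
};

std::vector<Shape> load_filtered(const std::vector<RawRecord>& records,
                                 const std::set<Tag>& wanted) {
    Modal modal;
    std::vector<Shape> kept;
    for (const RawRecord& rec : records) {
        // 1. Always update the modal state, even for records that will be
        //    dropped, because later records may rely on these values.
        if (rec.layer) modal.layer = *rec.layer;
        if (rec.datatype) modal.datatype = *rec.datatype;
        if (rec.points) modal.points = *rec.points;

        // 2. Decide whether the record is wanted before allocating anything.
        Tag tag{modal.layer, modal.datatype};
        if (wanted.count(tag) == 0) continue;

        // 3. Only now build the element that goes into the library.
        kept.push_back(Shape{tag, modal.points});
    }
    return kept;
}
```

The point of step 1 is that a skipped record can still define values (here the point list) that a later, wanted record reuses.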

@walkerstop
Author

I will probably do this. However, I am not sure I understand how some of the modal variables work.

I thought that the point of the modal variables was that if we read a record and it doesn't include some of the attributes, they can be re-used from the previous record by using the modal variable again.

So for example if you have a bunch of polygons with mostly the same attributes you could read them once and re-use them over and over.

However, this doesn't seem to be the case for all of the modal variables.

For example, take modal_repetition: we only copy this repetition to the current record if the record contains a repetition. I don't see anywhere that the code re-uses the modal_repetition set by a previous record.

If I look at modal_textlayer and modal_texttype, on the other hand, these variables get used to set the tag of a label even if the label didn't include a layer or type. It looks like I could have 1 label record that included a layer and data type, and then if every label after that didn't include any layer or datatype, it would just re-use the previous modal values over and over.

This is different than the modal_repetition, I can't see what the purpose of modal_repetition is since it only gets copied if the current record includes a repetition.

Another example is modal_polygon_points: it looks like if I had 1 polygon with a point list, then a bunch with no point list, it would just re-use modal_polygon_points for all of them.

So I guess I don't really understand the inconsistency in the way the modal variables are used.

Then if I look at modal_layer and modal_datatype, those are shared between polygons, paths, trapezoids, circles, XGEOMETRY. Is that how it's supposed to work? But labels don't share the same modal layer or modal datatype as other shapes?

Just want to make sure I understand how it's supposed to work before I go off making code changes

@walkerstop
Author

OK so I was missing something about how modal_repetition works before, sorry.
I see now that when read_oas() calls oasis_read_repetition(in, factor, modal_repetition) it will only overwrite modal_repetition if the repetition type it reads is not 0. If it's 0 then modal_repetition stays the same and gets copied to the new polygon or whatever.
I think it makes sense now. I implemented shape_tags and label_tags this way (update all the modal variables, then only allocate records on the heap if the tags match those requested)
It seems to be working correctly for me; I need to run more regression tests, but I think it's working.
After those changes I am able to read this 21GB file in 16 minutes whereas previously it ran for several days without finishing.
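The repetition behaviour described above can be illustrated with a small toy model. It is a sketch under assumptions, not the gdstk implementation: Repetition, Record, read_repetition and resolve are hypothetical names, and the repetition field is assumed to be already decoded rather than read from a byte stream.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified repetition: type 0 means "no new data stored".
struct Repetition {
    unsigned type = 0;
    std::uint64_t count = 0;
    std::int64_t spacing = 0;
};

// One record: `has_repetition` says whether a repetition field is present,
// `stored` is what that field contained.
struct Record {
    bool has_repetition;
    Repetition stored;
};

// Mirrors the behaviour described in the comment above: the modal value is
// overwritten only when the stored type is non-zero; type 0 leaves it as is.
void read_repetition(const Repetition& stored, Repetition& modal) {
    if (stored.type == 0) return;  // reuse the previous repetition
    modal = stored;
}

// Returns the repetition that ends up applied to each record
// (type 0 in the result means the element is not repeated).
std::vector<Repetition> resolve(const std::vector<Record>& records) {
    Repetition modal;  // a real reader would treat "type 0 before any
                       // repetition was defined" as an error
    std::vector<Repetition> applied;
    for (const Record& rec : records) {
        Repetition current;
        if (rec.has_repetition) {
            read_repetition(rec.stored, modal);
            current = modal;  // copy the (possibly reused) modal value
        }
        applied.push_back(current);
    }
    return applied;
}
```

In this model a record without a repetition field gets none, while a record whose field has type 0 inherits the most recent explicit repetition.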

@Roy-974

Roy-974 commented Dec 13, 2024

Hello @walkerstop, if possible, could you create a pull request on your implementation? I face a similar problem. Thanks in advance

@walkerstop
Author

walkerstop commented Dec 19, 2024

> Hello @walkerstop, if possible, could you create a pull request on your implementation? I face a similar problem. Thanks in advance

I created the pull request with that and also .gds.gz support:
#286

But I need help testing it

I only tested these features in my closed-source app which requires a lot of other code changes. I know it works there.
