
Fix simplifies #44

Merged
merged 11 commits into JuliaGeo:main from fix_simplifies
Jan 12, 2024
Conversation


@skygering skygering commented Jan 4, 2024

The simplification function wasn't working for DouglasPeucker when the tol keyword was used. I found that this was because, although the algorithm was comparing squared distances, it wasn't squaring the tolerance, so no points were ever removed.

I decided to move the adaptation of the tolerance values into the constructors, so that the same SimplifyAlg object can be used more than once without the tolerance being changed on every run. This seems important since, for the algorithms other than DouglasPeucker, you do need to use the object explicitly. I also swapped to using @kwdef (is it exported yet?) to get rid of the constructor that does exactly what @kwdef does.
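The constructor-time adjustment described above can be sketched like this (a hypothetical Python mock for illustration only; the actual PR defines Julia structs with @kwdef, and the field names here are made up):

```python
class DouglasPeucker:
    """Hypothetical sketch: store the squared tolerance once at construction.

    Because the algorithm compares *squared* point-to-line distances,
    squaring tol here (rather than inside the algorithm) means the same
    object can be reused on many geometries without mutating its state.
    """
    def __init__(self, tol=None, number=None, ratio=None):
        self.squared_tol = tol * tol if tol is not None else None
        self.number = number
        self.ratio = ratio

alg = DouglasPeucker(tol=1.5)
# Reusing alg on poly1 and poly2 leaves squared_tol at 2.25 both times.
```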

While I was doing this, I noticed that the DouglasPeucker algorithm was wrong. It started at the "bottom": it found the smallest distances between adjacent edges, removed the smallest, and moved "up" the shape. The actual Douglas-Peucker algorithm finds the point with the largest distance from the current edge, marks it to be kept, and then moves "down" the shape. You can see a gif of this here. This gives different, but similar, results, since in Douglas-Peucker there are some points that can't be removed that could be removed by the previous implementation. I implemented it using recursion, but I am not sure how to make it work for the number and ratio keywords. I will think about it, but I would love any suggestions.
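The corrected top-down recursion might look like this (a minimal Python sketch of the textbook Douglas-Peucker algorithm for illustration, not the PR's Julia implementation; it assumes plain (x, y) tuples and a squared tolerance):

```python
def perpendicular_sq_dist(p, a, b):
    # Squared perpendicular distance from point p to the line through a and b,
    # via the cross-product area formula: cross^2 / |b - a|^2.
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return (px - ax) ** 2 + (py - ay) ** 2
    cross = dx * (py - ay) - dy * (px - ax)
    return cross * cross / (dx * dx + dy * dy)

def douglas_peucker(points, sq_tol):
    """Endpoints are always kept; the farthest interior point from the
    endpoint chord is kept (and recursed on) if it exceeds sq_tol."""
    if len(points) < 3:
        return list(points)
    dists = [perpendicular_sq_dist(p, points[0], points[-1])
             for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__)  # farthest interior point
    if dists[k] <= sq_tol:
        return [points[0], points[-1]]       # everything between is dropped
    split = k + 1                            # index into points
    left = douglas_peucker(points[:split + 1], sq_tol)
    right = douglas_peucker(points[split:], sq_tol)
    return left[:-1] + right                 # avoid duplicating the split point
```

For example, a near-collinear middle point is dropped while a far one is kept.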

Here is the before performance in LibGEOS and with GeometryOps (with LG and GI polygons of 35 points reduced to 17 points) for DouglasPeucker:
[screenshot: benchmark results before the update]

And here is the new performance (I think it is faster because we can stop earlier):
[screenshot: benchmark results after the fix]

I also added a file of really wacky polygons that I use in my work for future testing.

@skygering skygering requested a review from rafaqz January 4, 2024 21:58
@skygering (Collaborator, Author)

We could also get this in so that at least it works with tol, and then open up another PR to fix the rest. I am also not convinced by the implementation of the radial distance one.


rafaqz commented Jan 4, 2024

Good catch. I mostly adapted someone else's implementation for that 😅

Would be good to keep number and ratio. It's pretty slick specifying those. You can do it iteratively, like a root finder...

(to clarify we can merge this and add it back later)
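The "iterate like a root finder" suggestion could be sketched as a bisection on the tolerance until the output hits the requested point count (hypothetical Python; the PR ultimately rewrote the algorithm iteratively instead, and the toy simplifier below is only a stand-in to show the monotonicity being exploited):

```python
def tol_for_count(simplify, points, target_n, lo=0.0, hi=1e6, iters=60):
    """Bisection sketch: find the smallest tolerance whose simplification
    keeps at most target_n points. Assumes simplify(points, sq_tol) is
    monotone: a larger tolerance never keeps more points."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if len(simplify(points, mid)) > target_n:
            lo = mid  # still too many points; loosen the tolerance
        else:
            hi = mid  # few enough points; try tightening
    return hi

# Toy monotone "simplifier" for demonstration only (not Douglas-Peucker):
# keep the endpoints plus any interior point whose y^2 exceeds sq_tol.
def toy_simplify(points, sq_tol):
    kept = [points[0]]
    kept += [p for p in points[1:-1] if p[1] * p[1] > sq_tol]
    kept.append(points[-1])
    return kept
```

The same idea would work for ratio by setting target_n from the input length.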


@rafaqz rafaqz left a comment


Thanks for fixing my horrible errors. It would be good to find some known-correct algorithms to test against.

end
#= For each algorithm, apply simplification to all curves, multipoints, and
points, reconstructing everything else around them. =#
_simplify(alg::SimplifyAlg, data; kw...) =
@rafaqz (Member)

Is this function layout from a style guide? I normally write roughly Blue Style. Can totally change that, but it would be good to agree on one; I would habitually change it back to the above!

@skygering (Collaborator, Author)

No, I just use equals when one function is simply equal to another. But I see in the Blue style guide (new to me!) that that isn't preferred unless it fits on one line.

I guess my logic is, we are literally just renaming the function, just like in lines 101 and 103, and those were written the exact same way.


@rafaqz rafaqz Jan 5, 2024


Yeah, style guides are never quite what you want, but they can simplify some things in collaborations like this. There's a more common standard Julia style guide somewhere too. I'm easy either way.

@rafaqz (Member)

This is not so concise:
https://docs.julialang.org/en/v1/manual/style-guide/

But I think it's encoded in JuliaFormatter more precisely as the default:
https://domluna.github.io/JuliaFormatter.jl/stable/style/

end
return result
end

_remove!(s, i) = s[i:end-1] .= s[i+1:end]
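For reference, the _remove! one-liner in the hunk above shifts every element after position i one slot left; a rough 0-based Python equivalent (illustrative only; Julia's version is 1-based and broadcasts in place with .=):

```python
def remove_at(s, i):
    """Python sketch of the Julia one-liner _remove!(s, i):
    shift each element after index i one slot left, leaving the last
    slot as a stale duplicate (the caller shrinks the view separately)."""
    s[i:-1] = s[i + 1:]
    return s
```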

@rafaqz (Member)

Best not to mix changes and moves. We should follow ColPrac; this is a guideline:

https://github.com/SciML/ColPrac

@skygering (Collaborator, Author)

So in the future would you prefer that this be a new PR?


@rafaqz rafaqz Jan 5, 2024


Thanks, that would be nice! But I also posted ColPrac to distance this request from being just a personal preference. See:

PRs that move code should not also change code, so that they are easier to review.

    If only moving code, review for correctness is not required.

    If only changing code, then the diff makes it clear what lines have changed.

PRs with large improvements to style should not also change functionality.

    This is to avoid making large diffs that are not the focus of the PR.
    While it is often helpful to fix a few typos in comments on the way past, it is different from using a regex or formatter on the whole project to fix spacing around operators.

Following ColPrac (roughly) also makes it easier to point out my own bad coding practices

(Resolved review thread on src/transformations/simplify.jl)
previous = point
end
## Never remove the end points
distances[begin] = distances[end] = Inf
## This avoids taking the square root of each distance above

@rafaqz rafaqz Jan 5, 2024


The reason for doing this here is that radial distance can be used to pre-filter for the other algorithms. So squaring beforehand makes it unclear whether it's squared already, or multiplied by 2 in VisvalingamWhyatt.

(basically the prefiltering was half finished code)

@skygering (Collaborator, Author)

The problem with this for me is that if you make an object alg = RadialDistance(; tol = 1.5) and you have two polygons poly1 and poly2, I would expect to be able to do the following:

simplify(alg, poly1)
simplify(alg, poly2)

But in this case, the tolerance would be squared twice, right? I would rather have a converter that takes a DouglasPeucker or a VisvalingamWhyatt algorithm to a RadialDistance algorithm and knows to either square it or halve it or whatever, and then pass that to the pre-filtering.

@skygering (Collaborator, Author)

Actually, it might make sense to just let users provide a pre-filtering algorithm if they so desire, with an optional prefilter_alg argument. This would increase flexibility and would also be more consistent than reusing tol, since the tol for VisvalingamWhyatt is an area while the tol for both of the other two is a distance, which would cause problems if you try to prefilter VisvalingamWhyatt using RadialDistance with the same alg object. This would then allow a different tolerance/number/ratio for the pre-filtering than for the actual simplification algorithm, and would let people hook into the architecture more easily. Thoughts?
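The prefilter_alg idea could be sketched like this (hypothetical Python with toy stand-in algorithms; none of these names are the PR's actual API). Each algorithm carries its own tolerance, so a RadialDistance pre-filter never shares or double-adjusts the main algorithm's tol:

```python
def simplify(points, main_alg, prefilter_alg=None):
    """Run an optional cheap pre-filter before the main algorithm, so the
    geometry only has to be decomposed and rebuilt once."""
    if prefilter_alg is not None:
        points = prefilter_alg(points)  # cheap first pass
    return main_alg(points)

# Toy stand-ins for RadialDistance / DouglasPeucker (illustration only):
def keep_every_other(points):
    # Keep alternating points, always retaining the final point.
    return points[::2] + ([points[-1]] if len(points) % 2 == 0 else [])

def keep_endpoints(points):
    return [points[0], points[-1]]
```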


@rafaqz rafaqz Jan 5, 2024


Yeah, that's a better idea.

I think it's worth having if you have huge detailed polygons, as RadialDistance should be so much faster.

@skygering

I added the pre-filter algorithm, and I have also rewritten the DouglasPeucker algorithm iteratively instead of recursively so that it can also be used with the number and ratio keywords. This led to a pretty nice speedup from above:
[screenshot: benchmark results after the iterative rewrite]

I would like to add a few more tests, but then I think this is good to go.

How do you feel about the new pre-filtering algorithm? My only concern is that users might be confused by it: why not just run _simplify with one algorithm and then call _simplify again with another? The answer is that the built-in pre-filter saves time, since we don't have to re-decompose the polygon. But it is still a valid question whether that benefit is worth an additional keyword; probably yes for really big shapes. So I am happy with it if you are @rafaqz.


rafaqz commented Jan 9, 2024

Maybe we should try it on some huge polygons and see if it's worth it, or if just running it twice is just as good.

@skygering

How many vertices are you thinking? I can test it.


rafaqz commented Jan 10, 2024

Maybe a FeatureCollection/vector of polygons with thousands-millions of points?

@skygering

Can you do me a favor and run this?

import LibGEOS as LG
import GeoInterface as GI
c = [[[1442.811638252033, 37931.54281488228], [1242.0321008084024, 37650.54139544157], [519.3606654818966, 37530.20857244596], [378.859955761543, 37630.59834116777], [258.4147165404589, 37610.542870668505], [-92.8370577604245, 37861.51729247304], [-112.89252825969345, 37981.96253169413], [-393.8939477004001, 38182.74206913776], [-413.949418199669, 38303.18730835884], [-624.7004827801991, 38453.77196144157], [-745.1457220012833, 38433.7164909423], [-815.3960768614602, 38483.91137530321], [-835.4515473607294, 38604.35661452429], [-975.9522570810825, 38704.74638324611], [-1076.229609577428, 39306.972579351524], [-975.8398408556127, 39447.47328907188], [-995.8953113548819, 39567.91852829296], [-895.5055426330664, 39708.41923801332], [-915.5610131323356, 39828.8644772344], [-714.7814756887051, 40109.86589667511], [-734.836946187974, 40230.31113589619], [-433.66764002252853, 40651.81326505725], [-473.77858102106666, 40892.70374349942], [-272.99904357743617, 41173.705162940125], [-293.05451407670535, 41294.15040216121], [-192.6647453548901, 41434.65111188156], [-212.72021585415905, 41555.09635110265], [-162.52533149325131, 41625.346705962824], [2366.8246921495183, 42046.51158644748], [2437.0750470096955, 41996.316702086566], [2336.68527828788, 41855.81599236622], [2416.9071602849563, 41374.03503548188], [2557.4078700053096, 41273.64526676006], [2637.629752002386, 40791.864309875724], [2537.2399832805704, 40651.363600155375], [2577.3509242791088, 40410.47312171321], [2476.9611555572938, 40269.97241199285], [2497.016626056563, 40149.52717277176], [2296.237088612932, 39868.52575333106], [2376.4589706100082, 39386.74479644672], [2175.6794331663777, 39105.743377006016], [2215.790374164916, 38864.85289856385], [2115.400605443101, 38724.35218884349], [2135.45607594237, 38603.90694962241], [2035.0663072205548, 38463.406239902055], [2055.1217777198235, 38342.961000680974], [1904.5371246371008, 38132.209936100444], [1663.6466461949326, 38092.0989951019], [1563.2568774731171, 
37951.59828538155], [1442.811638252033, 37931.54281488228]]]
poly = LG.Polygon(c)
lg_vals = GI.coordinates(LG.simplify(poly, 100.0))[1]
reduced_npoints = length(lg_vals)

Locally I am getting 16, which is what I expect based on my simplify code, but when I run the tests I am getting 15 points. It is deleting the first point. And for the life of me, I can’t understand what is wrong. I put a @show in the tests so that you can see what is going on. I think the versions are the same. I can't figure out what is different. I even copied the above coordinates from a @show call.

@skygering

And I think we should keep the pre-filtering. I made a GeometryCollection of 500 LineStrings that each have 1e6 points and the results are pretty different for included pre-filtering versus an explicit call to another algorithm before calling the "main" algorithm.
[screenshot: timing comparison of built-in pre-filtering vs. an explicit separate pass]

I will say that while both of these return the same results (with RadialDistance as a pre-filter for DouglasPeucker) they do not return the same results as just DouglasPeucker. I think this is due to radial distance removing some points that are "off-limits" to remove in DouglasPeucker. I am not sure if it should, but I think that is out of scope of this PR.


skygering commented Jan 11, 2024

> Can you do me a favor and run this? [code snippet and explanation quoted verbatim from the previous comment]

Okay, I think I know what is going on. I don't know how I was getting 16 yesterday, but I am getting 15 now... All of the examples of the algorithm that I have seen so far say that the first and last points are ALWAYS included. However, I am wondering if that is just for LineStrings, and LibGEOS is also testing whether the first point should be included... I am not sure how to handle that, given that we need to start at a given point... maybe we just test at the end if it should be included?

I will play around with it. If I can't figure it out quickly, maybe we can just merge how it is and open an issue?

Yep - found it: https://github.com/locationtech/jts/blob/master/modules/core/src/main/java/org/locationtech/jts/simplify/DouglasPeuckerLineSimplifier.java#L81
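One way to mimic the JTS behavior linked above (a hedged sketch of the idea, not a port of the JTS code): after simplifying a closed ring, check whether the shared first/last vertex is itself within tolerance of the chord between its neighbors, and if so re-close the ring at the next vertex.

```python
def sq_dist_to_line(p, a, b):
    # Squared perpendicular distance from p to the infinite line through a, b.
    dx, dy = b[0] - a[0], b[1] - a[1]
    cross = dx * (p[1] - a[1]) - dy * (p[0] - a[0])
    return cross * cross / (dx * dx + dy * dy)

def drop_redundant_ring_start(ring, sq_tol):
    """ring is closed (ring[0] == ring[-1]). If the shared start/end
    vertex lies within tolerance of the chord between its neighbors,
    re-close the ring at the next vertex instead (JTS-like behavior)."""
    if len(ring) < 5:
        return ring  # too few vertices to drop anything safely
    prev, start, nxt = ring[-2], ring[0], ring[1]
    if sq_dist_to_line(start, prev, nxt) <= sq_tol:
        return ring[1:-1] + [ring[1]]  # re-close at the next vertex
    return ring
```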


rafaqz commented Jan 11, 2024

Yeah, let's merge; this is a huge improvement as-is.

Well I guess we should fix that test or mark it as broken first!

@skygering

Okay, I fixed the previous error so that the endpoints are removed if needed.

I have now discovered another difference from LibGEOS. I am going to try to debug it but if I can't figure it out by tonight then I will push, merge, and open an issue.


skygering commented Jan 12, 2024

Okay, we don't get the exact same thing as LibGEOS when there are two points in a range that are the same distance from the line. That happens a surprising amount on the test polygons I am using, probably because of how they were generated. But otherwise, we get the same thing. For example,
[screenshot: simplified polygon with the two equidistant points marked in orange]
The two points with the orange lines are the 106th and 107th points of the shape. They are the same distance from the line formed by the 105th and 109th points. In our algorithm, we keep the 106th point. In LibGEOS, they keep the 107th point, which then also leads to keeping the 108th point.

I tried switching the maximum-distance finder to mark a new maximum if the distance is greater than or equal to the current max distance. However, this broke another test case. I feel pretty confident that this doesn't significantly change the outcome of the algorithm, so I am going to merge.

Do you think I still need to open an issue about this @rafaqz?
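The tie-breaking difference comes down to the comparison used in the max-distance scan. Illustrative Python (not the PR's Julia code): a strict > keeps the first of two equidistant points, while >= keeps the last, matching the LibGEOS/JTS behavior described above.

```python
def farthest_index(dists, keep_last_on_tie=False):
    """Index of the maximum distance. On ties, a strict > favors the
    earliest point; >= favors the latest (JTS-like)."""
    best_i, best_d = 0, dists[0]
    for i, d in enumerate(dists):
        better = (d >= best_d) if keep_last_on_tie else (d > best_d)
        if better:
            best_i, best_d = i, d
    return best_i

# With equal maxima at indices 1 and 2, the two policies pick differently.
```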

@skygering skygering marked this pull request as ready for review January 12, 2024 02:25

rafaqz commented Jan 12, 2024

Maybe we should note in the docs that we don't promise anything about the order of removal of equal-distance points.


rafaqz commented Jan 12, 2024

Also, am I right to understand this is now 8x faster than LibGEOS?

@skygering

Not quite 8x. I think things got a bit slower when I added the checks at the end for polygons. Things can probably still be optimized. It also seems to depend on the size of the polygon: for a polygon with 35 points going to ~16 points we are 5x faster, while for a polygon with 146 points going to ~53 points we are 3x faster (but also removing more points, so...).
[screenshots: benchmark results for the 35-point and 146-point polygons]


rafaqz commented Jan 12, 2024

Ok, still pretty good though. And over collections it will be threaded!

It made me think about that JuliaCon talk, if you want to do it! It could be a good motivator to benchmark everything and get the package registered.


rafaqz commented Jan 12, 2024

Merge whenever you like.

After that I might get out the profiler and see what's left to squeeze out of some of these methods.

@skygering skygering merged commit ba1c7ce into JuliaGeo:main Jan 12, 2024
3 checks passed
@skygering skygering deleted the fix_simplifies branch January 12, 2024 18:21