Would caching these changeset diffs help? #28

Closed
iandees opened this issue Apr 26, 2016 · 9 comments

@iandees
Member

iandees commented Apr 26, 2016

I've been poking around with caching OSM data like crazy, and these changeset diffs seem like perfect candidates for caching.

Am I correct in thinking that the changeset diff from Overpass does not change once the changeset has been closed? Do you make frequent requests for the same changeset?

I propose setting up an S3 bucket that this webapp can hit. If the changeset isn't in the S3 bucket, the request is forwarded to a Lambda function (via API Gateway) that makes the same request that you do now, but also saves the result to S3 so that future requests can return the static file immediately.
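The flow proposed above is a read-through cache, and it can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for the S3 bucket, a callable stands in for the Lambda function that makes the existing Overpass request, and the class name and key layout are hypothetical.

```python
from typing import Callable, Dict


class ChangesetDiffCache:
    """Read-through cache: serve a stored diff, or fetch it once and store it."""

    def __init__(self, store: Dict[str, bytes], fetch: Callable[[int], bytes]):
        self.store = store  # stand-in for the S3 bucket (assumption)
        self.fetch = fetch  # stand-in for the existing Overpass request (assumption)

    def get(self, changeset_id: int) -> bytes:
        key = f"changesets/{changeset_id}.xml"  # hypothetical key layout
        if key in self.store:
            return self.store[key]  # cache hit: return the stored copy
        body = self.fetch(changeset_id)  # cache miss: make the original request
        self.store[key] = body  # save it so the next request is a hit
        return body
```

Storing a diff permanently only makes sense if it no longer changes, so a real version would probably skip the store step for changesets that are still open.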

@batpad
Collaborator

batpad commented Apr 27, 2016

@iandees that sounds amazing!

We have been toying with ideas for side-stepping Overpass entirely and building changeset diffs from the minutely replication files, but that is probably a ways off.

Caching this on S3 sounds like a great idea, and the architecture you have outlined seems perfect.

@iandees
Member Author

iandees commented Apr 27, 2016

Sounds good. This is something that I've been wanting to do for quite some time, so it'll be nice to have someone to test it 😄 .

@batpad
Collaborator

batpad commented Apr 27, 2016

Happy to test any time :)

One thing to be aware of: the query to Overpass is a little bit of a hack -- and while it is generally pretty reliable, there is a possibility that it would return wrong data.

As you can see, it works roughly like this:

It queries the OSM API for details on the changeset -- bbox, created_at and closed_at.

It then queries Overpass for all features within that bbox that were modified between those two times. So it is entirely possible that other changes made in that bbox during the same period will get mixed in. Unfortunately, this seems like the best way to get a diff of features in a changeset -- it mimics the technique used by Achavi.

So far we have not noticed any discrepancies, but be aware that some are entirely possible. (If you are hoping to have an authoritative cache of changeset diffs, there may be edge cases where the data is wrong.)
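The two steps described above can be sketched roughly as follows. This is not the project's actual code: the function names are hypothetical, the changeset is assumed to be closed and to have a bbox, and the Overpass QL string is one plausible way to ask for an augmented diff over a bbox and time window. Note that nothing in the query filters by changeset ID, which is exactly why unrelated edits can leak in.

```python
import xml.etree.ElementTree as ET


def parse_changeset(xml_text: str) -> dict:
    """Pull the bbox and timestamps out of an OSM API changeset response."""
    cs = ET.fromstring(xml_text).find("changeset")
    if cs is None:
        raise ValueError("no <changeset> element in response")
    # Overpass expects bbox order: south, west, north, east
    bbox = tuple(cs.get(k) for k in ("min_lat", "min_lon", "max_lat", "max_lon"))
    meta = {
        "created_at": cs.get("created_at"),
        "closed_at": cs.get("closed_at"),
        "bbox": bbox,
    }
    if None in bbox or meta["created_at"] is None or meta["closed_at"] is None:
        raise ValueError("changeset is still open or has no bbox")
    return meta


def build_overpass_query(meta: dict) -> str:
    """Build an augmented-diff query for everything changed in the bbox and window."""
    start, end = meta["created_at"], meta["closed_at"]
    bbox = ",".join(meta["bbox"])
    changed = f'(changed:"{start}","{end}")'
    return (
        f'[out:xml][adiff:"{start}","{end}"];'
        f"(node({bbox}){changed};way({bbox}){changed};relation({bbox}){changed};);"
        "out meta geom;"
    )
```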

@iandees
Member Author

iandees commented May 14, 2016

A little update here: in the interest of being able to build more than just changeset diffs, I'm importing full-history data into a database and will use that to build the diffs. This is why it's taking longer :).

@batpad
Collaborator

batpad commented May 27, 2016

@iandees very curious to know if there's any update here :) No rush of course; what you outlined seems super exciting, so I'm happy to jump onto this if there's anything to test, even if it's a bit raw. Also happy to help if there's anything I can do to make this happen.

@iandees
Member Author

iandees commented May 31, 2016

My progress so far has been slowed by trying to load a full-history dump into a database. I eventually hacked together a libosmium program to dump a TSV file that I then loaded into an RDS PostgreSQL instance.

Based on that I wrote a bit of Python that downloads an OSM changeset and then queries the above database to find the before and after geometry of the data in a changeset. I ended up using a format that's slightly different from the Overpass result. Check out an example output here. It's rather long because it includes geometry for the relations, but it would be fairly simple to (optionally) exclude relations.
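The actual script and schema are not shown in this thread, so the following is only a guess at the general shape of such a before/after lookup. It assumes a hypothetical `history` table with one row per element version, uses SQLite in place of PostgreSQL so that it runs standalone, and stores geometry as plain text. It also ignores the harder part of the problem, which is that a way's geometry depends on which node versions were current at the time.

```python
import sqlite3

# Hypothetical schema: one row per element version from a full-history dump.
SCHEMA = """
CREATE TABLE history (
    type TEXT, id INTEGER, version INTEGER, changeset INTEGER,
    visible INTEGER, geom TEXT
)
"""


def changeset_before_after(conn, changeset_id):
    """Return (type, id, before_geom, after_geom) for each element in a changeset."""
    touched = conn.execute(
        "SELECT type, id, MIN(version), MAX(version) FROM history "
        "WHERE changeset = ? GROUP BY type, id ORDER BY type, id",
        (changeset_id,),
    ).fetchall()
    results = []
    for etype, eid, first_v, last_v in touched:
        # "Before" is the newest version older than anything this changeset wrote.
        before = conn.execute(
            "SELECT geom FROM history WHERE type = ? AND id = ? AND version < ? "
            "ORDER BY version DESC LIMIT 1",
            (etype, eid, first_v),
        ).fetchone()
        # "After" is the last version this changeset wrote; None if it was a delete.
        visible, geom = conn.execute(
            "SELECT visible, geom FROM history WHERE type = ? AND id = ? AND version = ?",
            (etype, eid, last_v),
        ).fetchone()
        results.append((etype, eid, before[0] if before else None,
                        geom if visible else None))
    return results
```

A created element comes back with no "before" geometry and a deleted one with no "after" geometry, which is enough to classify each element as created, modified, or deleted.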

The credits I'm using to pay for the ~$150/mo database expire today, so I will have to throw out this database or find somewhere else to host it. I've thought about putting together a PR for Overpass to better handle this specific changeset diff situation (via caching and a more reliable query), too.

@iandees
Member Author

iandees commented Mar 18, 2017

I noticed that you've recently moved away from using Overpass to get changeset diff data. I've started thinking about this ticket again, and I'm wondering whether you think it'd be useful to break out the code you're writing to build the diff data into a separate module.

@geohacker
Contributor

Hey @iandees! We started caching changesets and augmented diffs. More here: http://www.openstreetmap.org/user/geohacker/diary/40846

@ajithranka
Contributor

> wondering if you think it'd be useful to break out the code you're writing to build the diff data into a separate module

Agreed. Opening a new issue here.

Let's close this ticket. @iandees please reopen if we missed something.
