Archiver UUID and url - are they still relevant? #25

jeffreyliu · 2017-12-01T17:48:49Z

Currently the constructor to Archiver takes two arguments: UUID and url. I'm wondering if they're still relevant.
The UUID is a holdover from the archivers.space workflow - I'm not sure if there's an analog in DT? I think we can just remove this.
Also, in #5, we discussed that a custom crawl could span multiple urls. This raises a couple questions:

Does it still make sense for a scraper to be tied to a 'root' url?
How do we distinguish between "urls that this scraper takes data from" and "child URLs of pages linked to from this page"?

ebenp · 2017-12-02T20:39:02Z

I'm not sure the UUID is as relevant here as it was in archivers.space, since we are using the archivers tool on either child urls or byte files, rather than individual url pages. Maybe the UUID should be the scraper root url, or removed altogether.

Regarding custom crawls spanning multiple urls, it seems that a scraper has to begin at some url. Maybe that's our root url, kept as a reference for future scraper runs and set when the archiver tool is initialized?
For distinguishing the type of url collected, maybe data collection urls should be passed in the add data function and child urls maintained through the add url function?
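
A minimal sketch of that split, to make the proposal concrete. The class name `ArchiverSketch`, the `root_url` argument, and the method names `add_data` and `add_url` are all hypothetical; this is not the existing Archiver interface, just one way the idea could look.

```python
class ArchiverSketch:
    """Hypothetical stand-in for Archiver; names here are assumptions."""

    def __init__(self, root_url):
        # The url the scraper starts from, kept as a reference for
        # future runs in place of a UUID.
        self.root_url = root_url
        # url -> list of payloads the scraper took data from
        self.data_urls = {}
        # urls discovered on crawled pages, in discovery order
        self.child_urls = []

    def add_data(self, source_url, payload):
        # Record data together with the url it was collected from.
        self.data_urls.setdefault(source_url, []).append(payload)

    def add_url(self, url):
        # Record a child url once, ignoring repeats.
        if url not in self.child_urls:
            self.child_urls.append(url)


archiver = ArchiverSketch("https://example.org/data")
archiver.add_data("https://example.org/data/table1", b"a,b\n1,2\n")
archiver.add_url("https://example.org/about")
archiver.add_url("https://example.org/about")  # duplicate, ignored
```

With this shape, the two kinds of url never share a container, so a later run can tell "data came from here" apart from "this was only linked to".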

b5 · 2018-04-10T12:05:46Z

So, you know, it's only been four months, but yes, UUIDs should be ignored whenever possible. I'd favor hashes for blob content and urls for anything that has a clear association to... a URL.
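
As a rough illustration of that rule: the sketch below uses the url as the identifier when there is one and falls back to a content hash for bare blobs. The function names are hypothetical, and SHA-256 is an assumption, since no particular hash was named above.

```python
import hashlib


def content_id(blob: bytes) -> str:
    # Identify blob content by its digest (SHA-256 is an assumed choice).
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def identifier(url=None, blob=None):
    # Prefer the url when the item has a clear association to one.
    if url is not None:
        return url
    # Otherwise fall back to a hash of the content itself.
    if blob is not None:
        return content_id(blob)
    raise ValueError("need either a url or blob content")
```

Identical blobs get the same identifier regardless of where or when they were collected, which a random UUID cannot offer.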

jeffreyliu added the enhancement label Dec 1, 2017

jeffreyliu assigned jeffreyliu, b5 and ebenp Dec 1, 2017
