This repository has been archived by the owner on Oct 29, 2019. It is now read-only.

Archiver UUID and url - are they still relevant? #25

Open
jeffreyliu opened this issue Dec 1, 2017 · 2 comments

@jeffreyliu
Collaborator

jeffreyliu commented Dec 1, 2017

Currently the constructor to Archiver takes two arguments: UUID and url. I'm wondering if they're still relevant.
The UUID is a holdover from the archivers.space workflow. I'm not sure there's an analog in DT, so I think we can just remove it.
Also, in #5, we discussed that a custom crawl could span multiple URLs. This raises a couple of questions:

  • Does it still make sense for a scraper to be tied to a 'root' URL?
  • How do we distinguish between "URLs that this scraper takes data from" and "child URLs of pages linked to from this page"?
@ebenp
Collaborator

ebenp commented Dec 2, 2017

I'm not sure the UUID is as relevant here as it was in archivers.space, since we are using the archivers tool on either child URLs or byte files, rather than individual URL pages. Maybe the UUID should be the scraper root URL, or removed altogether.

Regarding custom crawlers spawning multiple URLs, it seems that a scraper has to begin at some URL. Maybe that's our root URL, kept as a reference for future scraper runs and set when the archiver tool is initialized?
For distinguishing the type of URL collected, maybe data collection URLs should be passed in the add data function, and child URLs maintained through the add url function?
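
To make the suggestion above concrete, here is a rough Python sketch of what it might look like. This is only an illustration, not the current Archiver class: the names `root_url`, `add_data`, `add_url`, the in-memory storage, and the example.org URLs are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Archiver:
    """Hypothetical sketch: no UUID, just the URL the scraper starts from."""

    root_url: str
    # URLs the scraper actually took data from, mapped to the bytes collected.
    data: Dict[str, bytes] = field(default_factory=dict)
    # URLs discovered as links on scraped pages, in order of discovery.
    child_urls: List[str] = field(default_factory=list)

    def add_data(self, source_url: str, content: bytes) -> None:
        """Record content together with the URL it was collected from."""
        self.data[source_url] = content

    def add_url(self, url: str) -> None:
        """Record a child URL once, ignoring repeats."""
        if url not in self.child_urls:
            self.child_urls.append(url)


archiver = Archiver("https://example.org/datasets")
archiver.add_data("https://example.org/datasets/a.csv", b"id,value\n1,2\n")
archiver.add_url("https://example.org/datasets/page2")
archiver.add_url("https://example.org/datasets/page2")  # duplicate, ignored
```

Keeping the two collections separate would answer the "data URLs vs. child URLs" question by construction: whichever function a URL was passed to determines its type.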

@b5
Member

b5 commented Apr 10, 2018

So, you know, it's only been four months, but yes, UUIDs should be ignored whenever possible. I'd favor hashes for blob content and URLs for anything that has a clear association to... a URL.
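
As a rough sketch of that rule in Python (the function name `identifier_for`, the `sha256:` prefix, and the choice of SHA-256 are assumptions for illustration, not something the project has settled on):

```python
import hashlib
from typing import Optional


def identifier_for(content: bytes, url: Optional[str] = None) -> str:
    """Return the URL if the content has one, otherwise a content hash."""
    if url is not None:
        # Anything clearly associated with a URL is identified by that URL.
        return url
    # Blob content with no URL is identified by a hash of its bytes.
    return "sha256:" + hashlib.sha256(content).hexdigest()


page_id = identifier_for(b"<html></html>", url="https://example.org/page")
blob_id = identifier_for(b"raw bytes with no obvious source URL")
```

Unlike a UUID, both kinds of identifier can be recomputed later from the thing itself, so nothing extra needs to be generated or stored.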


3 participants