write crawler/scraper to get the data #34
Having worked on the scraper for a while now, it is pretty stable, extends out to gitlab repositories, and is a bit more reliable about people attached to a project. You can find the current version here (not published to npm). Technology stuff:
I fear the output data format doesn't match the specification quite yet: 2022-06-21T201258_912Z.zip
sweet :-) will check in detail later. thank you so much 😃

ok, checked the code. I guess it is fine. It seems to work differently than what I expected, but there are many ways to write things. Also, it seems the zip file produces a … I checked your example output, which seems to include …

I think the reason I did not include …
The code snippet from the task description imagines the following output:

```js
const valuenetwork = {} // => valuenetwork.json
const projects = {} // => projects.json
const organisations = {} // => organisations.json

function add (package_json_url, package_json, dependencies, dependents) {
  const url = package_json_url // e.g. https://github.com/hypercore-protocol/hypercore
  // @INFO: what we are interested in
  // (bugs, contributors and funding are used below, so they are destructured too)
  const {
    name, version, description, author, contributors = [], homepage,
    keywords = [], bugs, license, funding, repository = {}
  } = package_json
  // `package` is a reserved word in strict mode, hence `pkg`
  const pkg = { name, version, description, author, contributors, homepage, keywords, bugs, license, funding, repository }
  // `dependents` is an array of github repository urls
  const customers = dependents
  // `dependencies` is an array of github repository urls
  const suppliers = dependencies
  const org = url.split('/').slice(0, -1).join('/') // e.g. https://github.com/hypercore-protocol
  const project = {
    name: pkg.name,
    version: pkg.version,
    description: pkg.description,
    keywords: pkg.keywords,
    homepage: pkg.homepage,
    bugs: pkg.bugs,
    license: pkg.license,
    people: [pkg.author, ...pkg.contributors].filter(Boolean),
    funding: pkg.funding,
    repository: pkg.repository
  }
  valuenetwork[url] = { url, customers, suppliers }
  projects[url] = { url, org, blessed: true, project } // blessed: true means the url is listed in `blessed.json`
  // accumulate instead of overwriting earlier projects of the same org
  const orgEntry = organisations[org] = organisations[org] || { url: org, projects: [] }
  orgEntry.projects.push(url)
}
```
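For clarity, a hypothetical invocation might look like this; the argument values are illustrative examples, not real scraper data:

```js
// Hypothetical call; the dependency/dependent urls are illustrative only.
add(
  'https://github.com/hypercore-protocol/hypercore',
  { name: 'hypercore', version: '10.0.0', license: 'MIT' },
  ['https://github.com/mafintosh/flat-tree'], // dependencies
  ['https://github.com/hypercore-protocol/hyperdrive'] // dependents
)
```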
@martinheidegger if you don't mind, it would be really cool if we could standardize the data format of the output, and document it essentially by giving a good example entry for each of the output files instead of a "type definition". I think that is what we need so that we can then base any frontend we might make on that output and know it won't change.
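In that spirit, here is a sketch of what one entry per output file could look like; the values continue the hypothetical hypercore example above and are assumptions, not actual scraper output:

```js
// valuenetwork.json: who consumes (customers) and who feeds (suppliers) a project
const valuenetworkExample = {
  'https://github.com/hypercore-protocol/hypercore': {
    url: 'https://github.com/hypercore-protocol/hypercore',
    customers: ['https://github.com/hypercore-protocol/hyperdrive'],
    suppliers: ['https://github.com/mafintosh/flat-tree']
  }
}

// projects.json: package metadata plus the owning org
const projectsExample = {
  'https://github.com/hypercore-protocol/hypercore': {
    url: 'https://github.com/hypercore-protocol/hypercore',
    org: 'https://github.com/hypercore-protocol',
    blessed: true, // listed in blessed.json
    project: { name: 'hypercore', version: '10.0.0', license: 'MIT' }
  }
}

// organisations.json: org url plus all of its crawled projects
const organisationsExample = {
  'https://github.com/hypercore-protocol': {
    url: 'https://github.com/hypercore-protocol',
    projects: ['https://github.com/hypercore-protocol/hypercore']
  }
}
```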
@martinheidegger Also, could you update the `blessed.json`? I know there are a bunch more important modules, like hyperbee, hyperdrive or autobase, but they are all dependents of hypercore anyway right now, so they will be included:

```json
[
{ "npm": "hypercore", "version": "*" },
{ "repoURL": "git+https://github.com/hypercore-protocol/hypercore-next" },
{ "npm": "@hyperswarm/dht", "version": "*" },
{ "npm": "hyperswarm", "version": "*" },
{ "npm": "@hyperswarm/dht-relay", "version": "*" },
{ "npm": "@hyperswarm/secret-stream", "version": "*" },
{ "npm": "hypercore-strong-link", "version": "*" },
{ "npm": "hyperdrive", "version": "*" },
{ "npm": "hyperbee", "version": "*" },
{ "npm": "autobase", "version": "*" }
]
```

Once that is done, it would be great to run it once to produce the first data set with the fixed output format (people don't need to be included yet this time around) and publish that to a new github repository. Then we can close the task :-)
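As an aside, resolving the `npm` entries of that list to repository URLs can go through the public npm registry; a minimal sketch, assuming the package metadata carries a `repository` field (the function name is mine, not the scraper's):

```js
// Resolve an npm package name to its repository url via the public registry.
async function resolveRepo (npmName) {
  const res = await fetch(`https://registry.npmjs.org/${npmName}/latest`)
  if (!res.ok) throw new Error(`registry lookup failed for ${npmName}: ${res.status}`)
  const { repository } = await res.json()
  const url = typeof repository === 'string' ? repository : repository?.url
  // normalize e.g. "git+https://github.com/hypercore-protocol/hypercore.git"
  return url?.replace(/^git\+/, '').replace(/\.git$/, '')
}

// e.g. await resolveRepo('hypercore') => 'https://github.com/hypercore-protocol/hypercore'
```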
After a lot of experimentation and trying to figure out bugs in the data set, I am thoroughly exhausted by this work.
Hm, I quickly checked, and I am not entirely sure which fields will be included and which won't in all cases, but I imagined the output to be structured in the way shown in the previous comment's code snippet, and to skip people for now, or rather: even if the people are scraped, the output shouldn't include people yet. Now if on top of the above we also already have a … hmmm, that's just a bit confusing.
Following our conversation I added documentation to the scraper and cleaned and changed the output data: https://github.com/dat-ecosystem/dat-garden-rake#dat-garden-rake

Currently there is a github action running with a clear cache that hopefully, once finished, will publish the data through github pages: https://github.com/dat-ecosystem/dat-garden-rake/actions/runs/2597447898

This is the output of a recent, local execution:
Finally I managed to get the scraper to complete on github actions. The gh-pages branch contains the latest data (which means it also keeps previous run results in storage). You will find the published version here: https://dat-ecosystem.org/dat-garden-rake/index.json
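For anyone who wants to consume the published data, a minimal sketch; note that the shape of `index.json` is an assumption to verify against the dat-garden-rake README:

```js
// Fetch the latest published data set index (shape not asserted here).
const res = await fetch('https://dat-ecosystem.org/dat-garden-rake/index.json')
const index = await res.json()
console.log(Object.keys(index)) // inspect what the index actually contains
```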
With the scraper now running weekly and producing versioned data, I am considering my work on this finished. Can we close this issue?
@martinheidegger Thanks for the work on this task. Much appreciated :)
@todo
@input 📦 https://npm.org
@input 📦 https://github.com
@output 📦 (see ##info section below)
@output 📦 (see ##info section below)
@output 📦 screencast video about scraper
@input 📦 (see ##info section below)
@input 📦 screencast video about scraper
@input 📦 ./data/blessed.json
@output 📦 scraper/crawler code
@input 📦 ./data/blessed.json (with [ 'https://github.com/hypercore-protocol/hypercore' ])
@input 📦 scraper/crawler code
@input 📦 scraper/crawler code
@output 📦 ./<timestamp>/valuenetwork.json
@output 📦 ./<timestamp>/packages.json
@output 📦 ./<timestamp>/organisations.json
@output 📦 ./<timestamp>/index.json
@output 📦 ./index.json
@info
estimated duration: 2 days
estimated budget: 640 USD
concept

scraper can be executed locally to scrape `package.json` data from npm and github, crawl for all `dependents` and `dependencies`, and repeat the process, starting from a `blessed.json` list of initial github repositories, until all dependents and dependents of dependents, but also all dependencies and dependencies of dependencies, have been found and saved as timestamped json files to disk, so they can be committed and pushed to a github repository with the results (a sketch of such a crawl loop follows below).

deal with rate limits:
of course, a task can't be resumed if it's not the same day anymore, because that would produce a different timestamped json and therefore needs a fresh run anyway, one that wipes the database before trying to scrape everything from scratch.
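A minimal sketch of such a crawl loop with naive rate-limit handling; `getDependencies` and `getDependents` are hypothetical, injected helpers that would wrap the npm registry and github api calls, not the actual scraper's API:

```js
import { mkdir, writeFile } from 'node:fs/promises'

// Breadth-first crawl starting from the blessed list; the helpers are
// injected stand-ins (hypothetical), each returning repository urls.
async function crawl (blessed, { getDependencies, getDependents }) {
  const seen = new Set()
  const queue = [...blessed] // e.g. [ 'https://github.com/hypercore-protocol/hypercore' ]
  const valuenetwork = {}
  while (queue.length > 0) {
    const url = queue.shift()
    if (seen.has(url)) continue
    seen.add(url)
    const suppliers = await getDependencies(url) // repos this project depends on
    const customers = await getDependents(url) // repos that depend on this project
    valuenetwork[url] = { url, customers, suppliers }
    queue.push(...suppliers, ...customers) // recurse into both directions
  }
  // timestamped output, mirroring names like 2022-06-21T201258_912Z
  const stamp = new Date().toISOString().replace(/:/g, '').replace('.', '_')
  await mkdir(`./${stamp}`, { recursive: true })
  await writeFile(`./${stamp}/valuenetwork.json`, JSON.stringify(valuenetwork, null, 2))
}

// Naive rate-limit handling: when github reports an exhausted limit, wait
// until the reset time advertised in the x-ratelimit-* response headers.
async function fetchWithRateLimit (url, options) {
  const res = await fetch(url, options)
  if (res.status === 403 && res.headers.get('x-ratelimit-remaining') === '0') {
    const resetAt = Number(res.headers.get('x-ratelimit-reset')) * 1000
    await new Promise(resolve => setTimeout(resolve, Math.max(0, resetAt - Date.now())))
    return fetchWithRateLimit(url, options)
  }
  return res
}
```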
basic prototype
- what to store in the files mentioned in the tasks above
- how to query for dependents on github and npm (see the sketch below)
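Querying dependents is the least standardized part: npm has no official dependents endpoint, and github only exposes a per-repository dependents page. A hedged sketch that scrapes that page; the matched HTML attribute is an assumption about the current markup and may break without notice:

```js
// List direct dependents of a github repository by scraping the public
// dependents page (the HTML pattern is an assumption, not a stable api).
async function githubDependents (owner, repo) {
  const res = await fetch(`https://github.com/${owner}/${repo}/network/dependents`)
  const html = await res.text()
  const matches = html.matchAll(/data-hovercard-type="repository" href="(\/[^"]+)"/g)
  return [...new Set([...matches].map(m => `https://github.com${m[1]}`))]
}

// e.g. await githubDependents('hypercore-protocol', 'hypercore')
```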