feat(#63): examples
h1alexbel committed Oct 30, 2024
1 parent a5b47c6 commit ed54dbb
Showing 1 changed file with 134 additions and 37 deletions.
171 changes: 134 additions & 37 deletions README.md
@@ -40,30 +40,11 @@ then, execute:
ghminer --query "stars:2..100" --start "2005-01-01" --end "2024-01-01" --tokens pats.txt
```

### GraphQL Query

ghminer reads two files: `ghminer.graphql`, which holds the GraphQL query
(set another path with `--graphql`), and `ghminer.json`, which holds the schema
for parsing the response from the GitHub API. The query can include any
[GitHub supported fields][Gh Explorer] you want. However, so that ghminer can
keep paging through all matching repositories, the query must have the
following structure:

* `search` with `$searchQuery`, `$first`, `$after` attributes.
* `pageInfo` with `endCursor`, `hasNextPage` attributes.
@@ -73,37 +54,152 @@ Here is an example:

```graphql
query ($searchQuery: String!, $first: Int, $after: String) {
  search(query: $searchQuery, type: REPOSITORY, first: $first, after: $after) {
    repositoryCount
    nodes {
      ... on Repository {
        nameWithOwner
        defaultBranchRef {
          name
        }
        licenseInfo {
          spdxId
        }
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
```
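
The `$after` variable and the `pageInfo` fields are what make cursor-based pagination possible. The sketch below shows the general shape of such a loop in JavaScript. It is an illustration only, not ghminer's actual code: `makeFakeFetch` and `collectAll` are hypothetical names, and the fake fetcher serves canned pages from memory so the example runs without network access or tokens.

```javascript
// Hypothetical stand-in for a GitHub GraphQL call: serves canned pages.
// Each page mimics the `search` object returned by a query of this shape.
// Real GitHub cursors are opaque strings; here the cursor is just an offset.
function makeFakeFetch(allNodes) {
  return async function fetchPage({ first, after }) {
    const start = after === null ? 0 : Number(after);
    const end = Math.min(start + first, allNodes.length);
    return {
      repositoryCount: allNodes.length,
      nodes: allNodes.slice(start, end),
      pageInfo: {
        endCursor: String(end),
        hasNextPage: end < allNodes.length,
      },
    };
  };
}

// Keep requesting pages, feeding `endCursor` back in as `$after`,
// until `hasNextPage` is false.
async function collectAll(fetchPage, searchQuery, first) {
  const collected = [];
  let after = null;
  while (true) {
    const page = await fetchPage({ searchQuery, first, after });
    collected.push(...page.nodes);
    if (!page.pageInfo.hasNextPage) break;
    after = page.pageInfo.endCursor;
  }
  return collected;
}

// Made-up repository names, fetched two at a time.
const fetchPage = makeFakeFetch(["a/one", "b/two", "c/three"]);
collectAll(fetchPage, "stars:2..100", 2).then((repos) => console.log(repos));
```

Without `endCursor` and `hasNextPage` in the response, a loop like this would have no way to know where the next page starts or when to stop.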

### Parsing Schema
Here is the corresponding `ghminer.json`:

```json
{
  "repo": "nameWithOwner",
  "branch": "defaultBranchRef.name",
  "license": "licenseInfo.spdxId"
}
```
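
To make the mapping concrete, here is a minimal sketch of how a schema with plain dotted paths could be applied to one repository node. It assumes each path is followed key by key and that a missing step yields `null`; ghminer's real resolver may behave differently. `resolvePath` and `applySchema` are hypothetical helper names, the sample node is made up, and the `license` path follows the `licenseInfo` field selected in the query.

```javascript
// Follow a dotted path such as "defaultBranchRef.name" through a node.
// Returns null as soon as any step is missing or null.
function resolvePath(node, path) {
  let current = node;
  for (const key of path.split(".")) {
    if (current === null || current === undefined) return null;
    current = current[key];
  }
  return current === undefined ? null : current;
}

// Build one flat record: schema keys become the output field names.
function applySchema(schema, node) {
  const record = {};
  for (const [column, path] of Object.entries(schema)) {
    record[column] = resolvePath(node, path);
  }
  return record;
}

const schema = {
  repo: "nameWithOwner",
  branch: "defaultBranchRef.name",
  license: "licenseInfo.spdxId",
};

// A node shaped like one entry of `search.nodes`; the values are made up.
const node = {
  nameWithOwner: "octocat/hello-world",
  defaultBranchRef: { name: "main" },
  licenseInfo: null, // a repository with no detected license
};

console.log(applySchema(schema, node));
```

Returning `null` for a missing step keeps one record per repository even when a field such as `licenseInfo` is absent.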

To parse the response produced by the [GraphQL Query](#graphql-query), provide
a parsing schema such as the one above. Its keys are the metadata field names
you want in the output, and its values are the paths to that data in the
response.

When the run finishes, you should have a `result.csv` file with all GitHub
repositories that were created in the provided date range.

### Bigger example

Here is a more complex example that shows how to fetch a variety of fields
from each GitHub repository:

`ghminer.graphql`:

```graphql
query ($searchQuery: String!, $first: Int, $after: String) {
  search(query: $searchQuery, type: REPOSITORY, first: $first, after: $after) {
    repositoryCount
    nodes {
      ... on Repository {
        nameWithOwner
        description
        defaultBranchRef {
          name
          target {
            repository {
              object(expression: "HEAD:README.md") {
                ... on Blob {
                  text
                }
              }
            }
            ... on Commit {
              history(first: 1) {
                totalCount
                edges {
                  node {
                    committedDate
                  }
                }
              }
            }
          }
        }
        repositoryTopics(first: 10) {
          edges {
            node {
              topic {
                name
              }
            }
          }
        }
        issues(states: [OPEN]) {
          totalCount
        }
        pullRequests {
          totalCount
        }
        object(expression: "HEAD:.github/workflows/") {
          ... on Tree {
            entries {
              name
              object {
                ... on Blob {
                  byteSize
                }
              }
            }
          }
        }
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
```

`ghminer.json`:

```json
{
  "repo": "nameWithOwner",
  "description": "description",
  "branch": "defaultBranchRef.name",
  "readme": "defaultBranchRef.target.repository.object.text",
  "topics": "repositoryTopics.edges[].node.topic.name",
  "lastCommitDate": "defaultBranchRef.target.history.edges[0].node.committedDate",
  "issues": "issues.totalCount",
  "pulls": "pullRequests.totalCount",
  "commits": "defaultBranchRef.target.history.totalCount",
  "workflows": "object.entries.length"
}
```
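
The paths in this schema use array notation (`edges[]`, `edges[0]`) and a trailing `.length`. One plausible reading is sketched below in JavaScript. The semantics are an assumption, not taken from ghminer's documentation: `key[N]` picks the N-th element, `key[]` applies the rest of the path to every element, and `.length` is ordinary property access on an array. `lookup` and `resolveSegments` are hypothetical names, and the sample node is made up.

```javascript
// Resolve a path whose segments may be "key", "key[N]" or "key[]".
// Assumed semantics (not confirmed by ghminer's docs):
//   key[N] -> the N-th element of the array under key
//   key[]  -> apply the rest of the path to every element
function resolveSegments(value, segments) {
  if (segments.length === 0) return value === undefined ? null : value;
  if (value === null || value === undefined) return null;

  const [head, ...rest] = segments;
  const match = /^([^\[\]]+)(?:\[(\d*)\])?$/.exec(head);
  if (match === null) throw new Error(`Bad path segment: ${head}`);
  const [, key, index] = match;
  const next = value[key];

  if (index === undefined) return resolveSegments(next, rest); // plain key
  if (!Array.isArray(next)) return null;
  if (index === "") return next.map((item) => resolveSegments(item, rest)); // key[]
  return resolveSegments(next[Number(index)], rest); // key[N]
}

function lookup(node, path) {
  return resolveSegments(node, path.split("."));
}

// A trimmed-down node shaped like the response to the query above.
const repoNode = {
  repositoryTopics: {
    edges: [
      { node: { topic: { name: "java" } } },
      { node: { topic: { name: "cli" } } },
    ],
  },
  defaultBranchRef: {
    target: {
      history: {
        totalCount: 42,
        edges: [{ node: { committedDate: "2024-10-30T00:00:00Z" } }],
      },
    },
  },
  object: { entries: [{ name: "ci.yml" }, { name: "lint.yml" }] },
};

console.log(lookup(repoNode, "repositoryTopics.edges[].node.topic.name"));
console.log(lookup(repoNode, "defaultBranchRef.target.history.edges[0].node.committedDate"));
console.log(lookup(repoNode, "object.entries.length"));
```

Under these assumptions, `topics` becomes a list of names, `lastCommitDate` a single value, and `workflows` a count of entries in `.github/workflows/`.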

Also, check [this repo][sr-detection], where ghminer is used to collect
Java repositories from GitHub for a research experiment.

## CLI Options

| Option | Required | Description |
|---------------|----------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| `--query` || [GitHub Search API query] |
| `--graphql` || Path to GitHub API GraphQL query, default is `ghminer.graphql`. |
| `--schema` || Path to parsing schema, default is `ghminer.json`. |
| `--start` || The start date to search the repositories, in [ISO] format; e.g. `2024-01-01`. |
| `--end` || The end date to search the repositories, in [ISO] format; e.g. `2024-01-01`. |
| `--tokens` || Path to a text file containing [GitHub PATs], one per line. They are used to stay within GitHub API rate limits. Add as many tokens as the amount of data requires. |
| `--date` || The type of the date field to search on, you can choose from `created`, `updated` and `pushed`, the default one is `created`. |
| `--batchsize` || Request batch-size value in the range `10..100`. The default value is `10`. |
| `--filename` || The name of the file for the found repos (CSV and JSON files). The default one is `result`. |
| `--json` || Save found repos as JSON file too. |
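
As an illustration of the `--tokens` file format, the sketch below splits file contents into one token per line and hands tokens out in round-robin order. The rotation strategy is an assumption made for the example; how ghminer actually cycles through tokens is not described here. `parseTokens` and `makeRotation` are hypothetical names, and the token values are placeholders.

```javascript
// Split the contents of a tokens file into individual PATs,
// ignoring blank lines and surrounding whitespace.
function parseTokens(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Hand out tokens in round-robin order so requests are spread evenly.
function makeRotation(tokens) {
  if (tokens.length === 0) throw new Error("No tokens provided");
  let i = 0;
  return () => tokens[i++ % tokens.length];
}

// Placeholder values, not real tokens.
const next = makeRotation(parseTokens("token-a\ntoken-b\n\ntoken-c\n"));
console.log(next(), next(), next(), next());
```

Spreading requests over several tokens means each one consumes its own rate-limit quota more slowly.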

## How to contribute

Fork the repository, make changes, and send us a [pull request](https://www.yegor256.com/2014/04/15/github-guidelines.html).
@@ -127,3 +223,4 @@ You will need [Node 20+] installed.
[Node 20+]: https://nodejs.org/en/download/package-manager
[blogpost]: https://h1alexbel.github.io/2024/05/24/ghminer.html
[Gh Explorer]: https://docs.github.com/en/graphql/overview/explorer
[sr-detection]: https://github.com/h1alexbel/sr-detection
