feat(cron): add cron feature flag #153

Merged · 1 commit · Nov 25, 2023
22 changes: 18 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.49.13"
version = "1.50.0"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
version = "1.49.13"
version = "1.50.0"
path = "../spider"
features = ["serde"]

10 changes: 7 additions & 3 deletions spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.49.13"
version = "1.50.0"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -43,12 +43,15 @@ case_insensitive_string = { version = "0.1.7", features = [ "compact", "serde" ]
jsdom = { version = "0.0.11-alpha.1", optional = true, features = [ "hashbrown", "tokio" ] }
chromiumoxide = { version = "0.5.6", optional = true, features = ["tokio-runtime", "bytes"], default-features = false }
sitemap = { version = "0.4.1", optional = true }
+chrono = "0.4.31"
+cron = "0.12.0"
+async-trait = "0.1.74"

[target.'cfg(all(not(windows), not(target_os = "android"), not(target_env = "musl")))'.dependencies]
tikv-jemallocator = { version = "0.5.0", optional = true }

[features]
-default = ["sync"]
+default = ["sync", "cron"]
regex = ["dep:regex"]
glob = ["dep:regex", "dep:itertools"]
ua_generator = ["dep:ua_generator"]
@@ -70,4 +73,5 @@ chrome = ["dep:chromiumoxide"]
chrome_headed = ["chrome"]
chrome_cpu = ["chrome"]
chrome_stealth = ["chrome"]
-cookies = ["reqwest/cookies"]
+cookies = ["reqwest/cookies"]
+cron = []
52 changes: 45 additions & 7 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
spider = "1.49.13"
spider = "1.50.0"
```

And then the code:
@@ -87,7 +87,7 @@ We have a couple optional feature flags. Regex blacklisting, jemalloc backend, gl

```toml
[dependencies]
spider = { version = "1.49.13", features = ["regex", "ua_generator"] }
spider = { version = "1.50.0", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto-generating a random real User-Agent.
@@ -117,7 +117,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
-spider = { version = "1.49.13", features = ["decentralized"] }
+spider = { version = "1.50.0", features = ["decentralized"] }
```

@@ -137,7 +137,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.49.13", features = ["sync"] }
+spider = { version = "1.50.0", features = ["sync"] }
```

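The example body is collapsed in this diff view; here is a minimal sketch of the subscribe pattern, mirroring the cron example added later in this same README (the channel-size argument and `get_url` accessor are taken from that example):

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Subscribe before crawling; 16 is the broadcast channel capacity.
    let mut rx = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    website.crawl().await;
    // The crawl is done; stop the printer task.
    join_handle.abort();
}
```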
@@ -167,7 +167,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.49.13", features = ["regex"] }
+spider = { version = "1.50.0", features = ["regex"] }
```

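The accompanying example is also collapsed in this diff. A minimal sketch of blacklisting, assuming `configuration.blacklist_url` is the optional deny-list vector (field name per spider's docs of this era; treat it as an assumption):

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Assumption: blacklist_url is an Option<Vec<_>>; create it if empty,
    // then push a pattern. With the `regex` feature, entries are treated
    // as regular expressions rather than exact prefixes.
    website
        .configuration
        .blacklist_url
        .insert(Default::default())
        .push("/licenses/".into());
    website.crawl().await;
}
```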
@@ -194,7 +194,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
-spider = { version = "1.49.13", features = ["control"] }
+spider = { version = "1.50.0", features = ["control"] }
```

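The control example is collapsed here as well. As a sketch only: spider's control docs of this era pause and resume a crawl keyed by its URL through helpers in `spider::utils`; the exact names and signatures below are assumptions, not confirmed by this diff:

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Assumed API: spider::utils::{pause, resume}, keyed by the crawl's URL.
    use spider::utils::{pause, resume};

    let url = "https://choosealicense.com";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // Pause shortly after the crawl starts, then resume it.
        tokio::time::sleep(Duration::from_millis(100)).await;
        pause(url).await;
        tokio::time::sleep(Duration::from_secs(5)).await;
        resume(url).await;
    });

    website.crawl().await;
}
```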
@@ -258,11 +258,49 @@ async fn main() {

### Cron Jobs

Use cron jobs to run crawls continuously at any time.

```toml
[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }
```

```rust,no_run
extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Set the cron schedule, or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // Take ownership of the website. You can also use `website.run_cron`,
    // but then you have to abort the created handles manually.
    let runner = run_cron(website).await;

    // This controls when to stop. You do not need the sleep if your program
    // keeps running after the crawls.
    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
```

### Chrome

```toml
[dependencies]
-spider = { version = "1.49.13", features = ["chrome"] }
+spider = { version = "1.50.0", features = ["chrome"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the `chrome_headed` feature flag to enable headful browser usage when you need to debug.
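
A minimal sketch of a chrome-backed crawl; with the `chrome` feature compiled in, the ordinary `crawl` call drives chromium (the entrypoint and URL here follow the earlier examples):

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // With the `chrome` feature enabled this fetches pages through chromium;
    // fall back to `crawl_concurrent_raw` to skip the browser when needed.
    website.crawl().await;
}
```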