feat(cookies): add cookie jar optional feature
j-mendez committed Nov 24, 2023
1 parent 71379af commit 16c796a
Showing 8 changed files with 133 additions and 24 deletions.
83 changes: 78 additions & 5 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.49.11"
version = "1.49.13"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
version = "1.49.11"
version = "1.49.13"
path = "../spider"
features = ["serde"]

5 changes: 3 additions & 2 deletions spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.49.11"
version = "1.49.13"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -65,8 +65,9 @@ socks = ["reqwest/socks"]
reqwest_json = ["reqwest/json"]
sitemap = ["dep:sitemap"]
js = ["dep:jsdom"]
budget = []
chrome = ["dep:chromiumoxide"]
chrome_headed = ["chrome"]
chrome_cpu = ["chrome"]
chrome_stealth = ["chrome"]
budget = []
cookies = ["reqwest/cookies"]
15 changes: 8 additions & 7 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.toml`

```toml
[dependencies]
spider = "1.49.11"
spider = "1.49.13"
```

And then the code:
@@ -87,7 +87,7 @@ We have a couple of optional feature flags. Regex blacklisting, jemalloc backend, gl

```toml
[dependencies]
spider = { version = "1.49.11", features = ["regex", "ua_generator"] }
spider = { version = "1.49.13", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -109,14 +109,15 @@ spider = { version = "1.49.11", features = ["regex", "ua_generator"] }
1. `chrome_headed`: Enables headful chrome rendering [experimental].
1. `chrome_cpu`: Disables gpu usage for the chrome browser.
1. `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
1. `cookies`: Enables cookie storing and setting to use for requests (see the sketch below).
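
As a rough illustration of the new flag, here is a minimal sketch: the feature line follows the same pattern as the other flags above, and the crawl side assumes the `Website::new`/`crawl` API and re-exported tokio from the basic example, plus the `cookie_str` field added in this commit. The URL and cookie values are placeholders.

```toml
[dependencies]
spider = { version = "1.49.13", features = ["cookies"] }
```

```rust,no_run
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder site; any crawl target works the same way.
    let mut website = Website::new("https://example.com");
    // The raw cookie string is loaded into the reqwest cookie jar
    // before the client is built (see website.rs below).
    website.cookie_str = "foo=bar; Domain=example.com".into();
    website.crawl().await;
}
```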

### Decentralization

Move processing to a worker; this drastically increases performance, even if the worker is on the same machine, because the runtime splits IO work efficiently.

```toml
[dependencies]
spider = { version = "1.49.11", features = ["decentralized"] }
spider = { version = "1.49.13", features = ["decentralized"] }
```

```sh
@@ -136,7 +137,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
spider = { version = "1.49.11", features = ["sync"] }
spider = { version = "1.49.13", features = ["sync"] }
```

```rust,no_run
@@ -166,7 +167,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
spider = { version = "1.49.11", features = ["regex"] }
spider = { version = "1.49.13", features = ["regex"] }
```

```rust,no_run
@@ -193,7 +194,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
spider = { version = "1.49.11", features = ["control"] }
spider = { version = "1.49.13", features = ["control"] }
```

```rust
@@ -261,7 +262,7 @@ async fn main() {

```toml
[dependencies]
spider = { version = "1.49.11", features = ["chrome"] }
spider = { version = "1.49.13", features = ["chrome"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed for debugging.
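
A hedged sketch of switching between the two modes; it assumes `crawl_concurrent_raw` takes `&mut self` and is awaited like `crawl`, which is not shown in this diff:

```rust,no_run
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    // Chromium-backed crawl via the `chrome` feature.
    website.crawl().await;
    // Plain HTTP crawl without chromium; signature assumed to mirror `crawl`.
    website.crawl_concurrent_raw().await;
}
```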
1 change: 1 addition & 0 deletions spider/src/lib.rs
@@ -63,6 +63,7 @@
//! - `chrome_headed`: Enables headful chrome rendering [experimental].
//! - `chrome_cpu`: Disables gpu usage for the chrome browser.
//! - `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
//! - `cookies`: Enables cookie storing and setting to use for requests.
pub extern crate bytes;
pub extern crate compact_str;
41 changes: 37 additions & 4 deletions spider/src/website.rs
@@ -10,7 +10,7 @@ use compact_str::CompactString;
use hashbrown::HashMap;

use hashbrown::HashSet;
use reqwest::Client;
use reqwest::{Client, ClientBuilder};
use std::io::{Error, ErrorKind};
use std::sync::atomic::{AtomicI8, Ordering};
use std::sync::Arc;
@@ -154,6 +154,9 @@ pub struct Website {
#[cfg(feature = "budget")]
/// Crawl budget for the paths. This helps prevent crawling extra pages and limiting the amount.
pub budget: Option<HashMap<CaseInsensitiveString, u32>>,
#[cfg(feature = "cookies")]
/// Cookie string to use for network requests ex: "foo=bar; Domain=blog.spider"
pub cookie_str: String,
}

impl Website {
@@ -459,9 +462,8 @@ impl Website {
client
}

/// configure http client
#[cfg(not(feature = "decentralized"))]
pub fn configure_http_client(&mut self) -> Client {
/// build the http client
fn configure_http_client_builder(&mut self) -> ClientBuilder {
let host_str = self.domain_parsed.as_deref().cloned();
let default_policy = reqwest::redirect::Policy::default();
let policy = match host_str {
@@ -513,6 +515,37 @@ _ => client,
_ => client,
};

client
}

/// configure http client
#[cfg(all(not(feature = "decentralized"), not(feature = "cookies")))]
pub fn configure_http_client(&mut self) -> Client {
let client = self.configure_http_client_builder();

// should unwrap using native-tls-alpn
unsafe { client.build().unwrap_unchecked() }
}

/// build the client with cookie configurations
#[cfg(all(not(feature = "decentralized"), feature = "cookies"))]
pub fn configure_http_client(&mut self) -> Client {
let client = self.configure_http_client_builder();
let client = client.cookie_store(true);

let client = if !self.cookie_str.is_empty() && self.domain_parsed.is_some() {
match self.domain_parsed.clone() {
Some(p) => {
let cookie_store = reqwest::cookie::Jar::default();
cookie_store.add_cookie_str(&self.cookie_str, &p);
client.cookie_provider(cookie_store.into())
}
_ => client,
}
} else {
client
};

// should unwrap using native-tls-alpn
unsafe { client.build().unwrap_unchecked() }
}
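
For context, a standalone sketch of what the cookies branch above does with reqwest's cookie jar, using reqwest APIs available behind its `cookies` feature; the URL and cookie values are placeholders:

```rust
use std::sync::Arc;

use reqwest::cookie::Jar;
use reqwest::Url;

/// Mirrors the cookies branch of `configure_http_client`: seed a jar with the
/// raw cookie string, then hand it to the client builder.
fn build_client_with_cookie(cookie_str: &str, url: &Url) -> reqwest::Client {
    let jar = Jar::default();
    jar.add_cookie_str(cookie_str, url);

    reqwest::Client::builder()
        .cookie_store(true)
        .cookie_provider(Arc::new(jar))
        .build()
        .expect("client should build")
}

fn main() {
    let url: Url = "https://example.com".parse().expect("valid url");
    let _client = build_client_with_cookie("foo=bar; Domain=example.com", &url);
}
```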
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
version = "1.49.11"
version = "1.49.13"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler CLI written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -26,7 +26,7 @@ quote = "1.0.18"
failure_derive = "0.1.8"

[dependencies.spider]
version = "1.49.11"
version = "1.49.13"
path = "../spider"

[[bin]]
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
version = "1.49.11"
version = "1.49.13"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler as a worker or proxy."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ lazy_static = "1.4.0"
env_logger = "0.10.0"

[dependencies.spider]
version = "1.49.11"
version = "1.49.13"
path = "../spider"
features = ["serde", "flexbuffers"]

