Merge pull request #24 from pubky/dev

Dev
pubky · Nov 8, 2024 · 7a54454 · 7a54454
2 parents 6ff26ed + af2cecf
commit 7a54454
Showing 30 changed files with 1,568 additions and 530 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
 /target
 Cargo.lock
 reference 
+/docs/simulation/target
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,11 +2,37 @@
 
 All notable changes to mainline dht will be documented in this file.
 
-##  [3.0.0](https://github.com/pubky/mainline/compare/3a4c3312410e69201a287e40cb7b6dbb30c663f2..v3.0.0) - 2024-09-27
+##  [4.0.1](https://github.com/pubky/mainline/compare/3a4c3312410e69201a287e40cb7b6dbb30c663f2..v3.0.0) - 2024-09-27
+
+### Added
+
+- Export `errors` module containing `PutError` as a part of the response of `Rpc::put`.
+- `Dht::find_node()` and `AsyncDht::find_node()` to look up a certain target, without calling `get_peers`, and return the closest responding nodes.
+- `Dht::info()` and `AsyncDht::info()` to get some internal information about the node from one method.
+- `Info::dht_size_estimate` to get the ongoing dht size estimate resulting from watching results of all queries.
+- `Info::id` to get the Id of the node.
+- `measure_dht` example to estimate the DHT size.
 
 ### Changed
 
-- Removed all internal panic `#![deny(clippy::unwrap_used)]`
+- Removed all internal panic `#![deny(clippy::unwrap_used)]`.
 - `Testnet::new(size)` returns a `Result<Testnet>`.
-- `Dht::local_addr()` returns a `Result<SocketAddr>`.
-- `AsyncDht::local_addr()` returns a `Result<SocketAddr>`.
+- `Dht::local_addr()` and `AsyncDht::local_addr()` replaced with `::info()`.
+- `Dht::shutdown()` and `AsyncDht::shutdown()` are now idempotent, and return `()`.
+- `Rpc::drop` uses `tracing::debug!()` to log dropping the Rpc.
+- `Id::as_bytes()` instead of exposing internal `bytes` property.
+- Replace crate `Error` with more granular errors.
+- Replace Flume's `RecvError` with `expect()` message, since the sender should never be dropped too soon.
+- `DhtWasShutdown` error is a standalone error.
+- `InvalidIdSize` error is a standalone error.
+- Rename `DhtSettings` to `Settings`
+- Rename `DhtServer` to `DefaultServer`
+- `Dht::get_immutable()` and `AsyncDht::get_immutable()` return `Result<Option<bytes::Bytes>, DhtWasShutdown>`
+- `Node` fields are now all private, with `id()` and `address()` getters.
+- Changed `Settings` to be the Builder, and made its fields private.
+- Replaced `Rpc::new()` with `Settings::build_rpc()`.
+- Update the client version from `RS01` to `RS04`
+
+### Removed
+
+- Removed `mainline::error::Error` and `mainline::error::Result`.
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "mainline"
-version = "3.0.0"
+version = "3.0.1"
 authors = ["nuh.dev"]
 edition = "2018"
 description = "Simple, robust, BitTorrent's Mainline DHT implementation"
@@ -23,14 +23,20 @@ ed25519-dalek = "2.1.0"
 bytes = "1.5.0"
 tracing = "0.1"
 lru = { version = "0.12.2", default-features = false }
+document-features = "0.2.10"
 
 [dev-dependencies]
 clap = { version = "4.4.8", features = ["derive"] }
 futures = "0.3.29"
 tracing-subscriber = "0.3"
 
 [features]
+## Enable [Dht::as_async()] to use [async_dht::AsyncDht]
 async = ["flume/async"]
+
+## Private feature to export ClosestNodes struct. Not a public API.
+__private_simulation = []
+
 default = []
 
 [package.metadata.docs.rs]

diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ It should work as a routing / storing node as well, and has been running in prod
 
 **[API Docs](https://docs.rs/mainline/latest/mainline/)**
 
-## Get started
+## Getting started
 
 Check the [Examples](https://github.com/Nuhvi/mainline/tree/main/examples).
 
@@ -30,6 +30,8 @@ Supported BEPs:
 - [x] [BEP0043 Read-only DHT Nodes](https://www.bittorrent.org/beps/bep_0043.html)
 - [x] [BEP0044 Storing arbitrary data in the DHT](https://www.bittorrent.org/beps/bep_0044.html)
 
+This implementation also includes [measures against Vertical Sybil Attacks](./docs/sybil-resistance.md).
+
 ### Server
 
 Running as a server is the same as a client, but you also respond to incoming requests and serve as a routing and storing node, supporting the general routing of the DHT, and contributing to the storage capacity of the DHT.

diff --git a/docs/censorship-resistance.md b/docs/censorship-resistance.md
@@ -0,0 +1,105 @@
+# Censorship Resistance
+
+## Overview
+
+One of the main criticisms of distributed hash tables is their susceptibility to Sybil attacks,
+and by extension censorship. This document is an overview of the problem and how this implementation minimizes this risk.
+
+The [Real-World Sybil Attacks in BitTorrent Mainline DHT](https://www.cl.cam.ac.uk/~lw525/publications/security.pdf) paper divides Sybil attacks
+into “horizontal” and “vertical”: the former tries to flood the entire network with Sybil nodes, while the latter tries to target a specific region of
+the ID space, to censor specific info-hashes.
+
+Our strategy in this document is first to explain how we can transform all vertical attacks into horizontal attacks by necessity, and second to explore the
+cost of such horizontal attacks and the cost of resisting them. We consider the system resistant to censorship if the cost of resisting
+horizontal Sybil attacks is much lower than the cost of sustaining such attacks for extended periods of time.
+
+### Non goals
+
+For the sake of this document we will NOT discuss extreme forms of censorship, like filtering out UDP packets that look like BitTorrent messages at the ISP level,
+or filtering out packets that include specific info hashes. These forms of censorship apply to more than just DHTs, including DNS queries and more, and are better
+handled using VPNs and other firewall circumvention solutions, including HTTPS relays that are hard to filter out and whose purpose is hard to predict.
+
+We will focus on how to keep DHTs resistant to vulnerabilities that are inherent to their nature as open networks without a central reputation authority.
+
+Similarly, we will not discuss the effect of Sybil attacks on privacy. If one wants to keep their queries private, they are also advised to use a VPN or a trusted HTTPS server to relay their queries.
+
+## Vertical Sybil Attacks
+
+### Challenge
+
+In a DHT, nodes store a piece of information with a redundancy factor `k` (usually 20), meaning that a node tries to find the
+`k` closest nodes to the info hash using the XOR metric defined in [BEP_0005](https://www.bittorrent.org/beps/bep_0005.html) before
+storing the data in these nodes.
+
+This static redundancy factor opens the door to Vertical Sybil attacks, where a malicious actor runs enough nodes close to an info hash
+that a writer only writes to the attacker's Sybil nodes, making it easy for that attacker to censor that information from the rest of the network.
+
+Consider the following example, with a Dht of size `8` and `k=2`. Drawing nodes at their distances to a given target should look like this:
+
+```md
+             (1)    (2)                  (3)    (4)           (5)           (6)           (7)    (8)       
+|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
+```
+
+So, if an attacker injected two (even closer) nodes that don't match the distribution of the rest of the network (Vertical Sybil as opposed to Horizontal Sybil),
+then you would expect the example above to look like this instead:
+
+```md
+(s1)  (s2)   (1)    (2)                  (3)    (4)           (5)           (6)           (7)    (8)       
+|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
+```
+
+As you can see, if we only store data at the closest `k=2` nodes, the data would only be stored within attacker nodes, and thus successfully censored.
+
+### Solution
+
+The solution we use in this Mainline implementation is to use the `expected distance to k (edk)` instead of `k`.
+
+To understand what that means, consider that we have a rough estimation of the DHT size (which we obtain as explained in the
+documentation of the [Dht Size Estimate](./dht_size_estimate.md)). Then we can _expect_ that the closest `k` nodes are going to be
+within a range `edk`. For example, continuing the example from above, in a Dht of `8` nodes in a `16` ID space, we can expect
+the closest `2` nodes to be within distance `4`.
+
+```md
+(s1)  (s2)   (1)    (2)   [edk]          (3)    (4)           (5)           (6)           (7)    (8)       
+|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
+```
+
+If we store data in all nodes until `edk` (the expected distance of the first 2 nodes), we would store the data in at least 2 honest nodes.
+
+Because of the nature of Dht queries, we should expect to get a response from at least one of these honest nodes as we query closer and closer nodes to the target info hash.
+
+### Assumptions
+
+This strategy depends on an [accurate and consistent estimate of the DHT size](./dht_size_estimate.md), which itself depends on the assumption of a uniform
+distribution of nodes across the ID space. That uniform distribution can be verified separately by crawling the DHT, but it can also be enforced by only storing
+data in "secure nodes", which are nodes whose IDs are generated relative to their IP address according to [BEP_0042](https://www.bittorrent.org/beps/bep_0042.html).
+
+## Horizontal Sybil Attacks
+
+So if an attacker can't perform a vertical Sybil attack, it has to run more than 20 times the number of current honest nodes to have a good chance of taking over an info hash,
+i.e. being in control of all 20 closest nodes to a target.
+
+Firstly, because we have a good way to estimate the DHT size, we can all see the DHT size suddenly increasing 20x, which at least gives us all a chance to react to such an extreme attack.
+
+Secondly, because of [BEP_0042](https://www.bittorrent.org/beps/bep_0042.html), an IPv4 address can't have more than 8 nodes, so an attacker needs to have control of at least millions of IP addresses.
+
+Thirdly, the current DHT size estimate seems to be near the limits enforced by [BEP_0042](https://www.bittorrent.org/beps/bep_0042.html) (~10 million nodes), which means an attacker will
+need to create more than 9 million nodes and try to replace already running nodes with their Sybil nodes, except that [BEP_0005](https://www.bittorrent.org/beps/bep_0005.html) favors older nodes
+over newer ones.
+
+To summarize, an attacker needs to have control over millions of IP addresses, actually run millions of nodes, hope that existing nodes churn enough to give them a chance to replace them in nodes' routing tables,
+and hope that no one notices or reacts to such an attack. Even then they need to sustain that attack, because as soon as they give up, the network resumes its normal operation.
+
+It is safe to say that much simpler modes of censorship are much more likely to be employed instead.
+
+## Conclusion
+
+While theoretically DHTs are not immune to Sybil nodes, and while it is impossible to stop attempts to inject nodes all over the DHT to snoop on traffic, it is not at all easy or practical to
+disrupt the operation of a large DHT network.
+
+The security of a DHT thus boils down to the number of honest nodes. As long as we don't see a massive decline in the size of the DHT, Mainline will remain as unstoppable as a network based on
+the Internet can be.
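
Note on `docs/censorship-resistance.md`: the `edk` rule described there can be sketched with the toy numbers from the document (8 nodes, a 16-wide ID space, `k=2`). This is only an illustration under stated assumptions, not the crate's implementation: the function names `expected_distance_to_k` and `storage_set` are hypothetical, distances are plain `u64` values rather than 160-bit XOR distances, and the node positions are read off the ASCII diagrams.

```rust
/// Expected distance within which the `k` closest nodes should fall,
/// assuming node IDs are uniformly distributed over the ID space.
/// (Hypothetical helper; real IDs are 160-bit and would not fit in a u64.)
fn expected_distance_to_k(k: u64, id_space: u64, dht_size_estimate: u64) -> u64 {
    // On average one node is expected every `id_space / dht_size_estimate` units.
    k * id_space / dht_size_estimate.max(1)
}

/// Choose which nodes (given by their distance to the target) to store at:
/// every node within `edk`, and never fewer than the `k` closest ones.
fn storage_set(mut distances: Vec<u64>, k: usize, edk: u64) -> Vec<u64> {
    distances.sort_unstable();
    let within_edk = distances.iter().take_while(|d| **d <= edk).count();
    distances.truncate(within_edk.max(k));
    distances
}

fn main() {
    let edk = expected_distance_to_k(2, 16, 8);

    // Two Sybil nodes at distances 0 and 1, followed by the eight honest nodes.
    let seen = vec![0, 1, 2, 3, 6, 7, 9, 11, 13, 14];
    let targets = storage_set(seen, 2, edk);

    // With plain `k=2` only the Sybils (0 and 1) would be chosen;
    // with `edk` the honest nodes at distances 2 and 3 are included too.
    println!("edk = {}, store at distances {:?}", edk, targets);
}
```

Without the two Sybil nodes the same rule selects just the two closest honest nodes, so in this toy case the extra writes only happen when the neighbourhood of the target is denser than expected.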
diff --git a/docs/dht_size_estimate.md b/docs/dht_size_estimate.md
@@ -0,0 +1,98 @@
+# Dht Size Estimation
+
+This document describes the Dht size estimation used in this Mainline Dht implementation,
+within the context of [Sybil Resistance](./sybil-resistance.md).
+
+If you want to see a live estimation of the Dht size, you can run (in the root directory):
+
+```
+cd ./docs/simulation
+cargo run 
+```
+
+## How does it work?
+
+In order to get an accurate calculation of the Dht size, you should take
+as many lookups (at uniformly distributed targets) as you can,
+and calculate the average of the estimations based on their responding nodes.
+
+Consider a Dht with a 4 bit key space.
+Then we can map nodes in that keyspace by their distance to a given target of a lookup.
+
+Assuming a random but uniform distribution of nodes (which can be measured independently),
+you should see nodes distributed somewhat like this:
+
+```md
+             (1)    (2)                  (3)    (4)           (5)           (6)           (7)    (8)       
+|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
+```
+
+So if you make a lookup and obtain this partial view of the network:
+```md
+             (1)    (2)                  (3)                                (4)                  (5)       
+|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
+```
+
+Note: you see exponentially fewer distant nodes than closer ones, which is what you should expect from how
+the routing table works.
+
+Seeing one node at distance (d1=2) suggests that the routing table might contain 8 nodes,
+since its full length is 8 times (d1).
+
+Similarly, seeing two nodes at (d2=3) suggests that the routing table might contain ~11
+nodes, since the key space is more than 5 times (d2).
+
+If we repeat this estimation for as many nodes as the routing table's `k` bucket size,
+and take their average, we get a more accurate estimation of the dht.
+
+## Formula
+
+The estimated Dht size at each distance `di` is `en_i = i * d_max / di`, where `i` is the
+count of nodes discovered until this distance and `d_max` is the size of the key space.
+
+The final Dht size estimation is the average of the estimates, `(en_1 + en_2 + .. + en_n) / n`.
+
+## Simulation
+
+Running this [simulation](../examples/dht_size_estimate.rs) for 2 million nodes and after 12 lookups, we observe:
+
+- Mean estimate: 2,123,314 nodes 
+- Standard deviation: 7%
+- 95% Confidence Interval: +-14%
+
+Meaning that after 12 lookups, you can be confident you are not overestimating the Dht size by more than 10%.
+In fact, you are most likely underestimating it slightly due to the limitations of real networks.
+
+![distribution of estimated dht size after 4 lookups](./plot.png)
+
+Finally, the standard deviation seems to follow a power law `stddev = 0.281 * lookups^-0.529`, meaning that after only 4 lookups, you can get an estimate with a 95% confidence interval of +-28%.
+
+![Standard deviation relationship with number of lookups](./standard-deviation-vs-lookups.png)
+
+## Mapping simulation to real networks
+
+While the mean estimate slightly overestimates the real size in the simulation, the opposite is what should be expected in real networks.
+
+Unlike the simulation above, real networks are not perfect, meaning there is an error factor that can't be hard coded,
+as it depends on the response rate of the nodes you query: the more requests time out before you get a response, the more nodes
+you will miss, and the smaller you will think the Dht is.
+
+This is an error on the side of conservatism, and I can't think of anything in the real world that could distort the results
+expected from this simulation in the direction of overestimating the Dht size.
+
+### See it yourself
+
+You can measure the Dht size yourself by running:
+
+```
+cargo run --example measure_dht
+```
+
+Note that the estimate will be understated if you are on a network that is hostile to UDP packets, or behind a VPN that makes your requests look like they are coming from the
+same IP as many other users (causing nodes to rate limit your requests more often).
+
+## Acknowledgment
+
+This size estimation was based on [A New Method for Estimating P2P Network Size](https://eli.sohl.com/2020/06/05/dht-size-estimation.html#fnref:query-count).
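
Note on `docs/dht_size_estimate.md`: the formula `en_i = i * d_max / di` and the averaging step can be checked against the worked example in the document (nodes seen at distances 2, 3, 6, 11 and 14 in a 16-wide key space). The sketch below is an assumption-laden illustration, not the crate's code: `estimate_dht_size` and `expected_stddev` are hypothetical names, distances are `f64` instead of 160-bit IDs, and the power-law constants are simply the ones quoted from the simulation.

```rust
/// Estimate the Dht size from the sorted distances of the nodes that
/// responded to a single lookup: `en_i = i * d_max / d_i`, averaged over
/// the closest `k` nodes. Returns `None` if no usable distance is given.
fn estimate_dht_size(sorted_distances: &[f64], d_max: f64, k: usize) -> Option<f64> {
    let estimates: Vec<f64> = sorted_distances
        .iter()
        .take(k)
        .enumerate()
        // A distance of zero would divide by zero, so skip it but keep the count.
        .filter(|(_, d)| **d > 0.0)
        .map(|(i, d)| (i + 1) as f64 * d_max / d)
        .collect();

    if estimates.is_empty() {
        return None;
    }
    Some(estimates.iter().sum::<f64>() / estimates.len() as f64)
}

/// Relative standard deviation after `lookups` lookups, using the power law
/// quoted from the simulation (`stddev = 0.281 * lookups^-0.529`).
fn expected_stddev(lookups: u32) -> f64 {
    0.281 * f64::from(lookups).powf(-0.529)
}

fn main() {
    // Partial view from the example: nodes seen at these distances to the target.
    let seen = [2.0, 3.0, 6.0, 11.0, 14.0];

    // en_1 = 1 * 16 / 2 = 8 and en_2 = 2 * 16 / 3 ~ 10.67, averaged.
    let estimate = estimate_dht_size(&seen, 16.0, 2).expect("at least one node seen");
    println!("estimated size from the 2 closest nodes: {:.2}", estimate);

    // Roughly two standard deviations give the 95% interval quoted in the text.
    println!("95% interval after 4 lookups: +-{:.0}%", 2.0 * expected_stddev(4) * 100.0);
}
```

In practice the per-lookup estimates would themselves be averaged over many lookups at uniformly distributed targets, which is what drives the standard deviation down.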
diff --git a/docs/plot.png b/docs/plot.png
diff --git a/docs/simulation/Cargo.toml b/docs/simulation/Cargo.toml
@@ -0,0 +1,12 @@
+[package]
+name = "sim"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+clap = { version = "4.5.20", features = ["derive"] }
+ctrlc = "3.4.5"
+mainline = { version = "3.0.1", path = "../..", features = ["__private_simulation"] }
+num_cpus = "1.16.0"
+plotters = "0.3.7"
+statrs = "0.17.1"