Khashmir

Status: operational, but insufficient performance

Under active development:

IP spoofing and hijack prevention mechanism
Quarantine routing table for incoming candidates
Lazy routing table maintenance algorithm
Exponential keep-alive period (1x1 minute,2x2,4x4,8x8,16x16,infinite x32); 5.8 hours increase time.
fixed maintenance rate of 1 packet per second, then Max. post. keep-alive period.
Routing table be 600 peers with roughly 10min average peer alive time
Peer Hit-probability based bucket size in Clean_Cache routing table
Bucket size of 256, 128, 64, ... 8, 8, 8
Include 8 closest matching peers in outgoing DHT messages
Connectable peer harvesting algorithm
Mechanism to implement enhancement flags in least significant bits
recognize the Clean_Cache_Extension flag
Policy to bias Clean_Cache_Extension peers in some cases
Measure optimal bias which balances key distance and latency
Determine effect of
Dual cache lookup strategy for minimizing latency
Create No_Max_Connections_Limiter for silent discarding peers
No shared Max_Connections variable for swarm downloading and DHT connections
Use tracker and get_peers from PEX for golden peer discovery
Partial tracker for 6 golden peers and open TCP/IP connections
Scalable announce mechanism with intermediate peer tracking
Do not forward request for very popular requests, but start replying

Contents:

The usual approach for swarm discovery is by using a central tracker. However, the traditional tracker can be replaced by using a distributed hash table (DHT); the infohashes of .torrent files can be used as the key and the list of peers that are downloading the torrents are used as value. When a peer starts downloading a torrent it adds its contact information in DHT and it does a lookup in the DHT to discover other peers in the swarm. Note that a single DHT can be used for many torrents.

The standard BitTorrent also implements a DHT, called Khashmir, for swarm discovery. The Khashmir supplied with BitTorrent has been adapted in order to be used as decentralized tracker and therefore differs from the original Khashmir. We discuss the Khashmir supplied with BitTorrent.

Khashmir

Khashmir uses the infohash of a .torrent file as keys in the DHT. Infohashes are 160-bit and peers participating in Khashmir have a nodeID in the same 160-bit space. A value is a list of peers downloading/seeding the torrent that corresponds to the key. When a peer announces that it is downloading a torrent, its contact information is added to this list.

Routing

Khashmir uses the XOR operand as distance metric. The distance between two nodes, a and b, equals to a XOR b. The routing table is divided into 160 buckets, with each bucket covering a part of the nodeID space. Bucket i contains nodes with distance between 2ⁱ and 2ⁱ⁺¹. Newly encountered peers are added to the appropiate bucket until the bucket is full. Buckets have a predefined size which is currently 8.

.torrent File

The torrent file that uses a tracker contains an 'announce' key. However .torrent files "tracked" by Khashmir have instead a 'nodes' key which is a list of nodes in Khashmir. This peerlist enables the client to bootstrap into Khashmir.

e.g. nodes = "127.0.0.1", 6881], ["khashmirnode.theinternet.net", 7726

BitTorrent Protocol Extension

To be able to fill the routing table from peers received from an ordinary tracker the BitTorrent Protocol has been extended.

Peers participating in Khashmir set the least significant bit of the last reserved byte of the handshake. When a peer connects to another peer that also supports Khashmir, it should send a PORT message, indicating the port on which its Khashmir client is located. A PORT message begins with byte 0x09 and its payload are two bytes containing the Khashmir port of the sender. After having checked the port, the client can add the node to its routing table.

Technical Details

Using Khashmir

The source code of the mainline BitTorrent is divided in three directory: khashmir, BitTorrent and BTL. Unfortunately, the khashmir modules has dependencies on BitTorrent and BTL. The easy way is to copy the BitTorrent and BTL folder along. Another option is to copy the needed files from BitTorrent and BTL to the khashmir folder and adapt the import statements in the source (I have done this, you can find it at https://svn.tribler.org/abc/branches/fabian/tribler4.0.1_khashmir/Khashmir)

The main class is UTKhashmir in utkhashmir.py

Only a few methods of UTKhashmir are needed to use the DHT:

getPeers(infohash, callback)
returns peers that are downloading/seeding the torrent with the supplied infohash
callback should be a function that takes one parameter. The parameter is list of peers. A peer is a byte string of length 6, which is the ip and port (both in network order). Because of the the sloppy storage that Khashmir uses, values of a single key is stored across a few nodes. Every time Khashmir finds new values it calls the callback. Finally, the callback is called with an empty list.

announcePeer(infohash, port)
announces that we are downloading the torrent with the supplied infohash. However, for unknown reasons, I couldn't get this one working.

getPeersAndAnnounce(infohash, port, callback)
Does it all in one, and does actually work.

addContact(host, port)
checks whether the host is live and adds it to the routing table

Logging

For logging purposes it is sufficient to maintain a stable DHT which is inserted in the routing table at the first session. As long as this node is running, it is kept in the routing table, because Khashmir prefers old stable nodes over new ones. It is possible to log the start of every session, because at startup all nodes in the routing table are checked whether they're alive. The stable node is in the routing table of every node in the network and should therefore expect more traffic than usual.

Raw Server/Twisted

The mainline BitTorrent has a so-called raw server. The raw server is the component that takes care of creating and handling network connections. {Update: Twisted dependency is removed and the Tribler rawserver is integrated with Khashmir}

Default bootstrap node

Whenever BitTorrent encouters a khashmir .torrent with an empty nodes list, it defaults to the standard bootstrap node router.bittorrent.com:6881. Note this is not part of the Khashmir code but of BitTorrent client (BTL/ConvertedMetaInfo.py)

Problems

unconnectable peers in the cache
20 second timeout
8 nodes max. in parallel, can get stuck
Bootstrapping is prone to frequent failure when starting a session.
rate limiter drops vital packets; should turn off, needs repair (default 1Kbps is too little)

Planning

Raul at KTH

Week	Description
1	Understand and document algorithm details
2-4	Measure operation: timeouts, amount of stuck nodes, latency, dead-node cache pollution, NAT blocking, bandwidth usage vs. parallelism (4,8,16,32 peers), load balancing vs. node selection algorithms (peers attack the longest online peer in the network?),
5-7	Design and implement improvements
8	Measure effect of improvements for: Swarm discovery latency, bandwidth usage, and failure rate.
9-12	article writing

Future:

Enable swarms to grow beyond 1 million peers while maximizing network-friendliness. Possibly discover network topology characteristics. Then design policies to exploit them to maximize both performance and network-friendliness.

Measurement code

dhtmeasurements.tar.bz2 (or here)

This BZIP archive contains:

tribler4.1.7_web2: Tribler client. Khashmir operations have extra logging added. Contains a few khashmir*.py files to run a dht node without running the Tribler client. Furthermore, the rate limiter in khashmir has been disabled. tribler4.1.7_web2/khashmirtest2s.py: runs a khashmir discovery operations with khashmir.const.KRPC_TIMEOUT = 1.87, used by graph2 (see below) tribler4.1.7_web2/khashmirtest20s.py: runs a khashmir discovery operations with khashmir.const.KRPC_TIMEOUT = 20, used by graph2 (see below) tribler4.1.7_web2/khashmirtest2.py: starts a dht node, and then leaves you in the python debugger, so you can perform khashmir operations manually tribler4.1.7_web2/khashmirtest3.py: runs a dht node, used by bootstrapnode (see below) tribler4.1.7_web2/khashmirtest4.py: runs a 5 discovery operations on a 300s interval, used by graph1 (see below)

*/routing_table: khashmir's routing table

khashmir/ khashmir/bootstrapnode: khashmir/bootstrapnode/runbootstrapnode: runs a separate dht node on port 9999. This node is used to bootstrap the nodes that run the actual experiments. This bootstrap node is hardcoded in the py scripts mentioned above.

khashmir/graph1: measurements for Figure 4.6 and 4.7 in my thesis khashmir/graph1/991708: the torrent used for measurements khashmir/graph1/do: runs the experiment, khashmirtest4.py is executed multiple times khashmir/graph1/graph1: log of the experiment khashmir/graph1/latencies*: response times of requests (extracted from graph1)

khashmir/graph2: measurements for Figure 4.8 and 4.9 in my thesis khashmir/graph2/torrents/*: .torrents used for measurements khashmir/graph2/do: starts the measurements; for each torrent subdo is performed; results are put in the data directory khashmir/graph2/subdo: performs a scrape, a 2s khashmir discovery operation, and a 20s khashmir discovery operation. khashmir/graph2/data: results of measurements. For each torrent there are three files: a 20s timeout log, 2s timeout log, and a log of a tracker scrape. khashmir/graph2/process: post-processes results of the measurements

Open Issues

Unreachable nodes
Long (and static) timeout (20s)
Parallelism
Bootstrapping
Rate limiter
Kademlia's replacement cache not implemented
Iterative routing

Unreachable nodes

Experiments with real swarms indicate that the DHT is often functioning poorly. The main reason seems to be that the routing tables of DHT nodes are filled with IP addresses of peers that are either already offline or are not connectable due to NAT/Firewall issues. Experiments are planned to quantify this issue.

Every node can eventually appear as unreachable due to various reasons:

UDP is not reliable, therefore some messages can be lost.
Rate limiter (see its own section)
Overload

The node is firewalled (always unreachable for everyone)

A firewalled node can perform lookup operations but it doesn't route messages nor store values. The bad thing is that it is a free rider. The good thing is that it doesn't pollute others' routing tables. (see NATed node)

The node is behind a NAT box (sometimes unreachable by some nodes)

A NATed node can perform lookup operations and tries to route messages and to store values. Here is where really tricky issues are.

The node goes off-line aka "dead node" (unreachable for everyone while it is off-line)

There is no 'leave' message. Therefore when a node (D) leaves the DHT the nodes which have D in their routing tables will fail to reach D. The mechanisms to clean dead nodes are: (1) refresh every bucket every 15 minutes, and (2) replace an old node with a new one. Buckets which have not been used for 15 minutes are refreshed by doing a random lookup checking that every node in the bucket is still alive (reachable). Unreachable nodes are marked as dead*.

We believe that the Khashmir DHT is fundamentally incompatible with NAT usage. A node can include another node in their routing table after just a Ping message, no announce is required. No message or mechanism exists for a NATed node to prevent others from including it in their routing cache. To make matters worse, the naive DHT uses a Ping message to check if a peer is reachable and online. If a node just communicate with a NATed node, that Ping message will be successful, however, this NATed node is not globally connectible.

The complete algorithm operates as follows: A NATed node (N) sends a message (any kind of message) to another node (A). If A considers that N should be added to its routing table A checks N's reachability by sending a ping. N is behind a NAT which will allow traffic between A and N because N started the communication. To A, N is reachable so A adds N to its routing table. We call this "implicit NAT-node pollution", we explain why next. Another node (B) performs a lookup (getNodes or getPeers) on A. A replies with a bucket which includes N. Then, B tries to route the lookup message though N and waits for the timeout since the NAT box sees a connection from a non-known IP dropping the message. It is specially bad that the effect is indirect and can't be detected beforehand. It is indirect because, even if B implements a clever NAT detection, B will be affected by others' polluted routing tables. It can't be detected beforehand because the routing is done in real time (shortcut routing might help reducing the effect of the second issue).

Timeout

A timeout is harmful in lookup operations. It might be in other operations as well but I consider lookup operations much more delay critical. A node has to wait for a timeout (20s currently) when the 8 threads are stuck waiting for a reply. Considering 60% of the total nodes firewalled or NATed, chances are some queries are never replied.

I don't think that a static timeout is adequate. Nodes have different connections (latencies) and this could even vary for a single node at different times (congested network, node overloaded, etc.) 1.87 s. may be OK for a node connected to a university network but I think that every node should calculate its own value to better adapt to its environment.

The good thing of improving the timeout policy is direct benefit to the nodes implementing it. The bad thing is that it decreases the bad effects of a broken DHT without actually fixing it.

Parallelism

A lookup messages is routed through 8 nodes simultaneously. This is a trade-off between having to wait for timeouts (when less parallelism) and flood the DHT (when more parallelism). I think that we should avoid increasing parallelism by choosing the best nodes to route our messages.

It might be a good idea to find and follow shortcuts which decrease the number of hops drastically (more in the 'implementation ideas' section)

Bootstrapping

I need to check out this.

Rate limiter

I'm not sure it is a good idea to allocate unlimited bandwidth to the DHT. I can think of a node being the responsible for tracking a million peers swarm. I agree, however, that the DHT traffic should be prioritized (real time traffic) and spurious peaks should be allowed. What about a threshold of messages/minute? I don't have a clear idea of what we can do to solve this.

Replacement cache

Iterative routing

I don't see it much as a big problem but Crosby and Wallach consider it a key issue. I actually can see security issues in the recursive routing proposed. Anyway, any change regarding this issue means backwards incompatibility with Mainline DHT. I write it down here just to keep it in mind but I don't consider it important to us.

Measurements

Questions

Does an old node manage more traffic/maintenance overhead?
What's the best timeout policy?
NATed nodes
- How many NATed nodes try to pollute my routing table?
- How many NATed nodes are in my routing table? I.e. how many polluted entries are in my routing table?
- How long NATed nodes last in my routing table? (We can't tell if is was due to NAT timeout or the node just left)
- How many timeouts are due to NATed nodes? KEY question for DHT performance
- How many NATs can be discovered by sending a ping from a different UDP port?
Latency
- Latency to nodes from different locations

Hypothesis

Kademlia nodes prefer older nodes to newer ones. This algorithm is meant to make the DHT more stable because experimentally it's proven that nodes which have been alive long are likely to stay longer. Our hypothesis is that these old nodes manage the core traffic increasing its individual overhead. If this overhead is big enough, the architecture becomes less scalable and punish the nodes which contribute the most (wrong inactivation).
Crosby and Wallach experimented a few timeout policies for Mainline DHT. We expect to have similar results.
NATed nodes
- Too many. If BitTorrent experiment's figures are right around 60% of peers are unreachable. There are three reasons why a node appear unreachable: (1) dead, (2) firewalled, and (3) NATed. We expect that most of the unreachable nodes are behind NAT 25-45% of the total.
This depends on NAT timeouts. If NATtout > 15min. The probability of deleting a NATed node from the routing table is similar to a reachable node because the NATed node will not be detected. If NATtout < 15min and the bucket is not very active the NAT table timeout will eventually go off making the NATed node unreachable. Therefore, it will be removed from the routing table.

???See if Lucia has some data regarding NATtout???

If NATtout > 15min as long as the NATed node is alive. Otherwise, see above.
There are two ways NATed nodes produce timeouts: (1) A NAT table timeout is expired so the NATed node is unreachable (direct effect), and (2) during a lookup (getPeers) a node gives us a list which includes NATed nodes which are unreachable from us (indirect effect).
Latency
- According to Fabian's data 75% of the replies arrived within 1.87 seconds. According to Crosby and Wallach the average RTT is 1.6 s.

Methodology

Eperiment 1 (status: fixing scripts)
- Objective
  - How many no globally reachable nodes can we detect by pinging back from a different port (local_check) compared with remote_check (more accurate but complex/expensive)?
- Scenario
  - 2 nodes running on two different PCs (different IP, same LAN)
    - One node runs a DHT-capable BitTorrrent client (we're working with ktorrent) and "sniffer.py"
    - The other runs "remotechecker.py"
  - Process
    - sniffer uses tcpdump's output to analyze the traffic generated by the DHT. When a message is a query (type == 'a') registers the IP:port of the source node, sends a DHT ping from a different port and pass the IP:port information to remotechecker (via TCP)
    - remotechecker gets IP:port from sniffer, register the node, and pings it.
    - sniffer and remotechecker wait for 20 seconds. If a pong arrived, the node is consider reachable (R), otherwise the node is pinged again up to 3 times. If the node fails to reply 3 times, it is considered unreachable (U).
  - Findings
    - ktorrent 2.2.5 (Ubuntu package) pings several (2-5) nodes continuously (50-100 pings/sec). ktorrent 3.0.1 doesn't do that but it's very slow (30 torrents active, 20kb/s up/down cap).
    - Many nodes use many UDP ports when sending messages to us. This might be due to buggy implementations (don't think so) or NAT behavior. If the current DHT implementations only match identifier, the node would be considered as R when the reality is that the node is not behaving as it should.
    - Our data gathered previously is invalid because we didn't consider the previous point. We need to do some changes to the scripts.

Experiment 2 (status: planning/coding in order to use experiment 1's code for this one)
- Scenario
  - 20 nodes (N) running on PlanetLab hosts (different IP addresses)
    - Every node runs Tribler's Khashmir on port 22230.
    - Every node runs a script which sniffs tcpdump output on port 22230
  - One reference host (R) running on a PlanetLab host. The reference node does not run a DHT node.
- Process

Whenever the sniffer gets an incoming packet it sends a ping to the source from port 22231 and register whether it received a pong or the timeout expired (20s) (this is called local_check). It also sends the node's IP and port to the reference host. The same for outgoing messages. A reference host sends a DHT ping to the IP:port indicated by a node N and register whether it received a pong or the timeout expired (20s) (this is called indirect_check. We keep a list of checked nodes to avoid pinging a node several times.

Here is the table showing the reachability of the node depending the method used: R=reachable U=unreachable

Case	BTclient	local_check	indirect_check
1	U	U	U
2	R	U	U
3	R	R	U
4	R	R	R

Case 1

The node is globally unreachable (behind a firewall or dead)

Case 2

The node is partially reachable but its global reachability can be discovered just by using local_check. It's a NAT which matches IP and port

Case 3

The node is partially reachable and its global reachability cannot be discovered by local_check. It is a NAT which matches just IP address.

Case 4

Globally reachable node.

We hope to find very few cases 3. If so we would be able to discover global reachability by using a very simple mechanism (local_check).

Experiment 2

40 PlanetLab nodes running passive nodes 40 PlanetLab nodes running active nodes 60 popular torrent files 3 PlanetLab reference nodes (NAT checking + latency measurement)

Passive node (P)

A passive node joins the DHT and don't send any announcement nor getPeers. The only activity it does is maintenance and reply messages to other nodes. It might manage value storage if it happens to be close to any key.

Active node (A)

An active node joins the DHT and pretends to join 60 swarms. It sends announcement and getPeer messages every 30 minutes as it would do a Tribler peer downloading 60 files.

Modifications needed
- NAT checking (both)

Apart from checking the reachability of a node prior to add it to the routing table, the node will send the same ping from a different port and through every 3 reference nodes.

Reference node (R)

A reference node checks NATed nodes and measure latency. It receives messages from active and passive nodes containing the IP:port of the node to be checked. Every reference node has a list of nodes checked, if the node to be checked is not in the list a ping message is sent. The reference node waits for the pong and stores the time it took (RTT) or 0 when there was no pong (20s timeout).

Implementation Ideas

NAT detection

Check using another port

NATs keep a translation table used to let through packets which are likely the response for a a packet sent by the NATed node. If the NAT matches remote IP and port when using the NAT table a packet coming from the same IP but different port will be dropped. If this is the case, it's quite easy to check reachability by doing so. Example: Our node (A) is using port 1111 for DHT messages and receives a ping from a node (B). Then A sends a pong to B from port 1111. The pong will bypass a NAT because the NAT box matches A's IP and port and let it through. If A sends a ping from another port (e.g. 9999) the NAT box will match A's IP but not the port. Depending the NAT implementation the NAT box will drop the ping or not. We think (and hope) that most of the NAT implementations match IP and port giving us the opportunity to check global reachability by using this method. We are working on one experiment which will use this method and indirect check (below). This experiment will show the percentage of nodes that can reliably be checked with this method.

Indirect check

We can detect NATed nodes is by using another host (different IP address) as checker. When a node (A) receives a message from another node (B), A checks its routing table whether the bucket where B belongs is not full or there is any dead nodes. If so, A checks B's direct reachability by sending a ping to B and indirectly by sending a ping to B through a relay node (R). If A gets a response from B and R doesn't get any response it's likely that a NAT box let through A's message and dropped R's. The pros of this method are:

Reliability: I think that this the ultimate test for global reachability.
Speed: NATs are detected within seconds.

The cons of this method are:

Security risks: it could be used to perform DDoS.
Complexity: A must rely on other parties and a special message must be added.

Quarantine

NAT boxes translate IP:ports and keep a table in order to let through response packets from outside. This makes our DHT miserable because the NATed node is reachable from my node (I received a message from him, so my IP:port is in its NAT table) but from nowhere else. The NATed node is not _globally reachable_. The entries in the NAT table expire after some time. This timeout is not standard and Johan proposed to find out this value. (check Lucia's progress) The reason why the NAT timeout is important is because one way of checking reachability is by leaving new nodes in quarantine till the NAT timeout expires. Then my node can check the new node. I have no idea how long the NAT timeout is but Bjorn said that he have tested several NAT boxes and it's probably more than 20 minutes in most cases. If it was the case, I don't see this approach very attractive.

NAT flag

The idea of this methods is that nodes can check their reachability and flag it to others so others don't have to use more complex methods.

Message flag
Port numbering

This is a very simple method. Nodes which are reachable use a DHT port within a range (e.g. 26000-26999) and unreachable nodes use another range (e.g. 27000-27999). A good property is that we can see the flag even when we haven't received any message from the node. For instance, in a lookup (get_peers or get_nodes) a node (which is running an old implementation) returns its bucket containing several nodes. For the next hop in the lookup we will prefer nodes whose port is 26xxx and never use nodes whose port is 27xxx. There are some issues to be looked at:

NAT boxes translate ports. Therefore a node aware of its reachability will use 27xxx to send messages but the NAT box will change this port ruining our nice and simple method.
- Even if the NAT box is well configured to forward the traffic it's possible that the port used by the node (26xxx) is changed by the NAT box. (This should be researched)

Identifier flag

Identifiers are 160 bit long. The DHT only cares about the most-significant bits because it matches prefixes. Neighbor nodes share 40-50 bits and the rest of the bits are unused. The probability of sharing a prefix longer than 50 bits is pretty small. We need to do some experiments and maths but I think having 1 billion nodes the probability of sharing a 128-bit prefix is negligible. Once we know that the least-significant bits of node identifiers are 'useless' we can play with them. For instance, a identifier whose 32 least-significant bits are zeroes is announcing that it is not reachable while other with 32 ones is globally reachable. A node not supporting this flagging could get a random identifier whose 32 least-significant bits are the same. The probability of this is, though, very small (2^{(-31)).

This technique has all the pros port numbering has plus it is actually possible to implement.}

Dead node detection

Current system

Every bucket is checked, at least, every 15 minutes (i.e., every node in the routing table was alive (checked) 15 minutes ago.

TCP connection to key nodes

I don't like this method. It's backwards incompatible and adds complexity.

Aggressive checking
Progressive checking (check very often at the beginning and less often onwards)
Every bucket has, at least, one node alive (checked) 2 minutes ago

An smoother version of the current algorithm. Instead of checking the 8 nodes every 15 minutes, in every bucket a different node is checked every 2 minutes. Pros: The probability of finding every node in the bucket dead during a get_peers is lower. The DHT maintenance messages are more spaced in time, avoiding peaks. Cons: It might create a little more traffic although the checking could be done with pings which create less load for the checked nodes.

Dynamic timeout

Replacement routing table

Shortcut routing??

Glossary

Peer = The part of Tribler dealing with the BitTorrent protocol (data transfer, PEX, etc.). A peer is not part of the DHT. A peer announces, using its DHT node, its participation in the swarm.
Node = The part of Tribler dealing with the DTH (announcements, lookups, DHT maintenance, etc.). A node receives orders from the peer and uses the DHT to satisfy the peer's needs.
Key = A key is an identifier in the DHT. Storage (announcement) and lookup (getPeers) operations have a key as target which is the swarm identifier, i.e. its info_hash.
Value = List of BitTorrent peers participating in a swarm which annonced themselves in the DHT.
Bucket = The routing table is splited in buckets according to the number of shared prefix bits with the node's identifier. In Khashmir a bucket is form by 8 references to nodes.
NAT pollution = Insertion of a NATed node to a routing table.
NAT table timeout (NATtout)= Time which takes a NAT session to expire.

Attachments

dhtmeasurements.tar.bz2 (48.2MB)