Khashmir

Status: operational, but insufficient performance. Under active development.
The usual approach to swarm discovery is to use a central tracker. However, the traditional tracker can be replaced by a distributed hash table (DHT): the infohashes of .torrent files are used as keys, and the lists of peers downloading those torrents are used as values. When a peer starts downloading a torrent it adds its contact information to the DHT and does a lookup in the DHT to discover other peers in the swarm. Note that a single DHT can be used for many torrents. The standard BitTorrent client also implements a DHT, called Khashmir, for swarm discovery. The Khashmir supplied with BitTorrent has been adapted to be used as a decentralized tracker and therefore differs from the original Khashmir. We discuss the Khashmir supplied with BitTorrent.

Khashmir

Khashmir uses the infohash of a .torrent file as the key in the DHT. Infohashes are 160 bits, and peers participating in Khashmir have a nodeID in the same 160-bit space. A value is the list of peers downloading/seeding the torrent that corresponds to the key. When a peer announces that it is downloading a torrent, its contact information is added to this list.

Routing

Khashmir uses XOR as its distance metric: the distance between two nodes a and b equals a XOR b. The routing table is divided into 160 buckets, each bucket covering a part of the nodeID space. Bucket i contains nodes at a distance between 2^i and 2^(i+1). Newly encountered peers are added to the appropriate bucket until the bucket is full. Buckets have a predefined size, which is currently 8. (A small sketch of this metric appears at the end of the Technical Details section below.)

.torrent File

A torrent file that uses a tracker contains an 'announce' key. .torrent files "tracked" by Khashmir instead have a 'nodes' key, which is a list of nodes in Khashmir. This peer list enables the client to bootstrap into Khashmir, e.g. nodes = [["127.0.0.1", 6881], ["khashmirnode.theinternet.net", 7726]]

BitTorrent Protocol Extension

To be able to fill the routing table with peers received from an ordinary tracker, the BitTorrent protocol has been extended. Peers participating in Khashmir set the least significant bit of the last reserved byte of the handshake. When a peer connects to another peer that also supports Khashmir, it should send a PORT message indicating the port on which its Khashmir client is listening. A PORT message begins with the byte 0x09 and its payload is two bytes containing the Khashmir port of the sender. After having checked the port, the client can add the node to its routing table.

Technical Details

Using Khashmir

The source code of the mainline BitTorrent is divided into three directories: khashmir, BitTorrent and BTL. Unfortunately, the khashmir module has dependencies on BitTorrent and BTL. The easiest way is to copy the BitTorrent and BTL folders along. Another option is to copy the needed files from BitTorrent and BTL into the khashmir folder and adapt the import statements in the source (I have done this; you can find it at https://svn.tribler.org/abc/branches/fabian/tribler4.0.1_khashmir/Khashmir). The main class is UTKhashmir in utkhashmir.py. Only a few methods of UTKhashmir are needed to use the DHT.
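A rough usage sketch follows. Note that the constructor arguments and the method names (addContact, getPeersAndAnnounce) are assumptions based on utkhashmir.py and should be verified against the source; how the import works depends on how the folders were copied.

    from utkhashmir import UTKhashmir

    infohash = b'\x00' * 20        # 160-bit infohash of the torrent (example value)
    my_bt_port = 6881              # port our BitTorrent client listens on

    def peers_found(peers):
        # called with batches of (ip, port) tuples discovered for the infohash
        print("found peers:", peers)

    # host, DHT port and state directory are illustrative values
    dht = UTKhashmir('', 9999, '/tmp/khashmir_state')
    dht.addContact('router.bittorrent.com', 6881)               # bootstrap node
    dht.getPeersAndAnnounce(infohash, my_bt_port, peers_found)  # lookup + announce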
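Returning to the Routing section above, the XOR metric and bucket selection can be written out as follows (a sketch for illustration, not the actual Khashmir code):

    def distance(a, b):
        # XOR distance between two 160-bit nodeIDs represented as ints
        return a ^ b

    def bucket_index(my_id, other_id):
        # bucket i holds nodes with 2**i <= distance < 2**(i+1), i.e. i is the
        # position of the highest bit set in the distance
        d = distance(my_id, other_id)
        if d == 0:
            raise ValueError("a node does not store itself in its routing table")
        return d.bit_length() - 1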
Logging

For logging purposes it is sufficient to maintain a stable DHT node which is inserted into the routing table during the first session. As long as this node is running it is kept in the routing table, because Khashmir prefers old, stable nodes over new ones. It is possible to log the start of every session, because at startup all nodes in the routing table are checked to see whether they are alive. The stable node ends up in the routing table of every node in the network and should therefore expect more traffic than usual.

Raw Server/Twisted

The mainline BitTorrent has a so-called raw server. The raw server is the component that takes care of creating and handling network connections. {Update: the Twisted dependency has been removed and the Tribler rawserver is integrated with Khashmir}

Default bootstrap node

Whenever BitTorrent encounters a khashmir .torrent with an empty nodes list, it defaults to the standard bootstrap node router.bittorrent.com:6881. Note this is not part of the Khashmir code but of the BitTorrent client (BTL/ConvertedMetaInfo.py).
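A minimal sketch of this fallback, assuming the metainfo is available as a dict (the real logic lives in BTL/ConvertedMetaInfo.py and may differ):

    DEFAULT_BOOTSTRAP = [("router.bittorrent.com", 6881)]

    def bootstrap_nodes(metainfo):
        # Return DHT bootstrap nodes for a trackerless .torrent, falling back
        # to the hard-coded default when the 'nodes' list is missing or empty.
        nodes = metainfo.get("nodes")
        if not nodes:
            return DEFAULT_BOOTSTRAP
        return [(host, port) for host, port in nodes]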
Problems

Planning

Raul at KTH
Future: Enable swarms to grow beyond 1 million peers while maximizing network-friendliness. Possibly discover network topology characteristics, then design policies to exploit them to maximize both performance and network-friendliness.

Measurement code

dhtmeasurements.tar.bz2

This BZIP archive contains:

tribler4.1.7_web2: Tribler client. Khashmir operations have extra logging added. Contains a few khashmir*.py files to run a DHT node without running the Tribler client. Furthermore, the rate limiter in khashmir has been disabled.
tribler4.1.7_web2/khashmirtest2s.py: runs a khashmir discovery operation with khashmir.const.KRPC_TIMEOUT = 1.87, used by graph2 (see below)
tribler4.1.7_web2/khashmirtest20s.py: runs a khashmir discovery operation with khashmir.const.KRPC_TIMEOUT = 20, used by graph2 (see below)
tribler4.1.7_web2/khashmirtest2.py: starts a DHT node and then leaves you in the Python debugger, so you can perform khashmir operations manually
tribler4.1.7_web2/khashmirtest3.py: runs a DHT node, used by bootstrapnode (see below)
tribler4.1.7_web2/khashmirtest4.py: runs 5 discovery operations at a 300 s interval, used by graph1 (see below)
*/routing_table: khashmir's routing table
khashmir/
khashmir/bootstrapnode:
khashmir/bootstrapnode/runbootstrapnode: runs a separate DHT node on port 9999. This node is used to bootstrap the nodes that run the actual experiments. This bootstrap node is hardcoded in the .py scripts mentioned above.
khashmir/graph1: measurements for Figures 4.6 and 4.7 in my thesis
khashmir/graph1/991708: the torrent used for measurements
khashmir/graph1/do: runs the experiment; khashmirtest4.py is executed multiple times
khashmir/graph1/graph1: log of the experiment
khashmir/graph1/latencies*: response times of requests (extracted from graph1)
khashmir/graph2: measurements for Figures 4.8 and 4.9 in my thesis
khashmir/graph2/torrents/*: .torrents used for measurements
khashmir/graph2/do: starts the measurements; for each torrent subdo is performed; results are put in the data directory
khashmir/graph2/subdo: performs a scrape, a 2 s khashmir discovery operation, and a 20 s khashmir discovery operation
khashmir/graph2/data: results of measurements. For each torrent there are three files: a 20 s timeout log, a 2 s timeout log, and a log of a tracker scrape.
khashmir/graph2/process: post-processes results of the measurements

Open Issues
Unreachable nodes

Experiments with real swarms indicate that the DHT often functions poorly. The main reason seems to be that the routing tables of DHT nodes are filled with IP addresses of peers that are either already offline or not connectable due to NAT/firewall issues. Experiments are planned to quantify this issue. A node can appear unreachable for various reasons:
A firewalled node can perform lookup operations but it does not route messages or store values. The bad thing is that it is a free rider. The good thing is that it does not pollute others' routing tables (see NATed node below).
A NATed node can perform lookup operations and tries to route messages and store values. This is where the really tricky issues arise.
There is no 'leave' message. Therefore, when a node (D) leaves the DHT, the nodes that have D in their routing tables will fail to reach D. The mechanisms for cleaning up dead nodes are: (1) refresh every bucket every 15 minutes, and (2) replace an old node with a new one. Buckets that have not been used for 15 minutes are refreshed by doing a random lookup and checking that every node in the bucket is still alive (reachable). Unreachable nodes are marked as dead.

We believe that the Khashmir DHT is fundamentally incompatible with NAT usage. A node can include another node in its routing table after just a ping message; no announce is required. No message or mechanism exists for a NATed node to prevent others from including it in their routing cache. To make matters worse, the naive DHT uses a ping message to check whether a peer is reachable and online. If a node has just communicated with a NATed node, that ping will succeed, even though the NATed node is not globally connectable. The complete sequence operates as follows. A NATed node (N) sends a message (of any kind) to another node (A). If A considers that N should be added to its routing table, A checks N's reachability by sending a ping. N is behind a NAT which allows traffic between A and N because N started the communication. To A, N appears reachable, so A adds N to its routing table. We call this "implicit NAT-node pollution"; we explain why next. Another node (B) performs a lookup (getNodes or getPeers) on A. A replies with a bucket which includes N. B then tries to route the lookup message through N and waits for the timeout, since the NAT box sees a connection from an unknown IP and drops the message. It is especially bad that the effect is indirect and cannot be detected beforehand. It is indirect because, even if B implements clever NAT detection, B will still be affected by others' polluted routing tables. It cannot be detected beforehand because the routing is done in real time (shortcut routing might help reduce the effect of the second issue).

Timeout

A timeout is harmful in lookup operations. It might be harmful in other operations as well, but I consider lookup operations much more delay critical. A node has to wait for a timeout (currently 20 s) when the 8 threads are stuck waiting for a reply. Considering that 60% of all nodes are firewalled or NATed, chances are some queries are never answered. I don't think that a static timeout is adequate. Nodes have different connections (latencies), and this can even vary for a single node at different times (congested network, overloaded node, etc.). 1.87 s may be OK for a node connected to a university network, but I think that every node should calculate its own value to better adapt to its environment (a rough sketch of such an adaptive timeout follows the Bootstrapping note below). The good thing about improving the timeout policy is the direct benefit to the nodes implementing it. The bad thing is that it reduces the bad effects of a broken DHT without actually fixing it.

Parallelism

A lookup message is routed through 8 nodes simultaneously. This is a trade-off between having to wait for timeouts (with less parallelism) and flooding the DHT (with more parallelism). I think that we should avoid increasing parallelism by choosing the best nodes to route our messages through. It might be a good idea to find and follow shortcuts which decrease the number of hops drastically (more in the 'Implementation Ideas' section).

Bootstrapping

I need to check this out.
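The adaptive timeout mentioned in the Timeout section could, for instance, reuse the TCP-style RTT estimator. This is only a sketch of the idea, not part of Khashmir, and the constants are the usual TCP RTO defaults:

    class AdaptiveTimeout:
        # Per-node timeout derived from observed round-trip times, in the
        # spirit of TCP's RTO estimator (srtt + 4 * rttvar).
        def __init__(self, initial=2.0, alpha=0.125, beta=0.25, k=4.0,
                     floor=1.0, ceiling=20.0):
            self.srtt = initial           # smoothed RTT estimate (seconds)
            self.rttvar = initial / 2.0   # smoothed RTT variation
            self.alpha, self.beta, self.k = alpha, beta, k
            self.floor, self.ceiling = floor, ceiling

        def observe(self, rtt):
            # feed the RTT of every successfully answered query
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

        def timeout(self):
            # timeout to use for the next query to this node
            return min(self.ceiling, max(self.floor, self.srtt + self.k * self.rttvar))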
Rate limiter

I'm not sure it is a good idea to allocate unlimited bandwidth to the DHT. I can think of a node ending up responsible for tracking a million-peer swarm. I agree, however, that the DHT traffic should be prioritized (real-time traffic) and spurious peaks should be allowed. What about a threshold of messages per minute? I don't have a clear idea of what we can do to solve this.

Replacement cache

Iterative routing

I don't see this as much of a problem, but Crosby and Wallach consider it a key issue. I can actually see security issues in the proposed recursive routing. In any case, any change regarding this issue means backwards incompatibility with the Mainline DHT. I write it down here just to keep it in mind, but I don't consider it important to us.

Measurements

Questions
Hypothesis
??? See if Lucia has some data regarding the NAT timeout ???
Methodology
Whenever the sniffer gets an incoming packet, it sends a ping to the source from port 22231 and registers whether it received a pong or the timeout expired (20 s); this is called local_check. It also sends the node's IP and port to the reference host. The same is done for outgoing messages. A reference host sends a DHT ping to the IP:port indicated by a node N and registers whether it received a pong or the timeout expired (20 s); this is called indirect_check. We keep a list of checked nodes to avoid pinging a node several times. Here is the table showing the reachability of a node depending on the method used (R = reachable, U = unreachable):
Case 1 (local_check: U, indirect_check: U): The node is globally unreachable (behind a firewall or dead).
Case 2 (local_check: U, indirect_check: U): The node is partially reachable, but its global reachability can be discovered just by using local_check. It is behind a NAT that matches IP and port.
Case 3 (local_check: R, indirect_check: U): The node is partially reachable and its global reachability cannot be discovered by local_check. It is behind a NAT that matches just the IP address.
Case 4 (local_check: R, indirect_check: R): Globally reachable node.

We hope to find very few case-3 nodes. If so, we would be able to discover global reachability using a very simple mechanism (local_check).
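A minimal sketch of the local_check described above, assuming the probed node speaks KRPC over UDP as Khashmir does (the ports, transaction id, and error handling here are illustrative, not taken from the measurement code):

    import socket, os

    def krpc_ping(node_id):
        # hand-rolled bencoding of a KRPC ping query:
        # {"a": {"id": node_id}, "q": "ping", "t": "lc", "y": "q"}
        return b"d1:ad2:id20:" + node_id + b"e1:q4:ping1:t2:lc1:y1:qe"

    def local_check(target_ip, target_port, timeout=20.0, probe_port=22231):
        # Ping the target from a port different from our DHT port; return 'R'
        # if a pong arrives before the timeout, 'U' otherwise.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", probe_port))          # probe from the secondary port
        sock.settimeout(timeout)
        sock.sendto(krpc_ping(os.urandom(20)), (target_ip, target_port))
        try:
            data, addr = sock.recvfrom(1024)
            return "R" if addr[0] == target_ip else "U"
        except socket.timeout:
            return "U"
        finally:
            sock.close()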
40 PlanetLab nodes running passive nodes
40 PlanetLab nodes running active nodes
60 popular torrent files
3 PlanetLab reference nodes (NAT checking + latency measurement)
A passive node joins the DHT and does not send any announce or getPeers messages. Its only activity is maintenance and replying to messages from other nodes. It might store values if it happens to be close to a key.
An active node joins the DHT and pretends to join 60 swarms. It sends announce and getPeers messages every 30 minutes, as a Tribler peer downloading 60 files would.
Apart from checking the reachability of a node prior to adding it to the routing table, the node will send the same ping from a different port and through each of the 3 reference nodes.
A reference node checks NATed nodes and measures latency. It receives messages from active and passive nodes containing the IP:port of the node to be checked. Every reference node keeps a list of nodes already checked; if the node to be checked is not in the list, a ping message is sent. The reference node waits for the pong and stores the time it took (RTT), or 0 when there was no pong (20 s timeout).
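A rough sketch of that reference-node bookkeeping (the names and the ping helper are illustrative; the actual measurement scripts may differ):

    import time

    checked = {}   # (ip, port) -> RTT in seconds, or 0 meaning "no pong within 20 s"

    def reference_check(ip, port, send_ping_and_wait_for_pong):
        # send_ping_and_wait_for_pong(ip, port, timeout) is assumed to return
        # when the pong arrives and to raise TimeoutError otherwise.
        if (ip, port) in checked:            # never ping the same node twice
            return checked[(ip, port)]
        start = time.time()
        try:
            send_ping_and_wait_for_pong(ip, port, timeout=20.0)
            checked[(ip, port)] = time.time() - start
        except TimeoutError:
            checked[(ip, port)] = 0
        return checked[(ip, port)]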
Implementation Ideas

NAT detection

NATs keep a translation table used to let through packets that are likely the response to a packet sent by the NATed node. If the NAT matches remote IP and port when using this table, a packet coming from the same IP but a different port will be dropped. If this is the case, it is quite easy to check reachability. Example: our node (A) uses port 1111 for DHT messages and receives a ping from a node (B). A sends a pong to B from port 1111; the pong will pass B's NAT because the NAT box matches A's IP and port and lets it through. If A instead sends a ping from another port (e.g. 9999), the NAT box will match A's IP but not the port. Depending on the NAT implementation, the NAT box will drop the ping or not. We think (and hope) that most NAT implementations match IP and port, giving us the opportunity to check global reachability using this method. We are working on an experiment that will use this method and the indirect check (below). This experiment will show the percentage of nodes that can reliably be checked with this method.
We can also detect NATed nodes by using another host (with a different IP address) as a checker. When a node (A) receives a message from another node (B), A checks its routing table to see whether the bucket B belongs to is not full or contains dead nodes. If so, A checks B's reachability directly by sending a ping to B, and indirectly by sending a ping to B through a relay node (R). If A gets a response from B and R does not get any response, it is likely that a NAT box let A's message through and dropped R's. The pros of this method are:
The cons of this method are:
NAT boxes translate IP:ports and keep a table in order to let through response packets from outside. This makes our DHT miserable, because the NATed node is reachable from my node (I received a message from it, so my IP:port is in its NAT table) but from nowhere else. The NATed node is not _globally reachable_. The entries in the NAT table expire after some time. This timeout is not standardized, and Johan proposed finding out this value (check Lucia's progress). The NAT timeout matters because one way of checking reachability is to leave new nodes in quarantine until the NAT timeout expires; only then does my node check the new node. I have no idea how long the NAT timeout is, but Bjorn said that he has tested several NAT boxes and it is probably more than 20 minutes in most cases. If that is the case, I don't find this approach very attractive.

NAT flag

The idea of this method is that nodes can check their own reachability and flag it to others, so that others don't have to use more complex methods.
This is a very simple method: nodes which are reachable use a DHT port within one range (e.g. 26000-26999) and unreachable nodes use another range (e.g. 27000-27999). A good property is that we can see the flag even when we have not received any message from the node. For instance, in a lookup (get_peers or get_nodes) a node (possibly running an old implementation) returns a bucket containing several nodes. For the next hop in the lookup we will prefer nodes whose port is 26xxx and never use nodes whose port is 27xxx. There are some issues to be looked at:
Identifiers are 160 bits long. The DHT only cares about the most-significant bits because it matches prefixes. Neighboring nodes share 40-50 bits and the rest of the bits are unused. The probability of sharing a prefix longer than 50 bits is pretty small. We need to do some experiments and maths, but I think that even with 1 billion nodes the probability of sharing a 128-bit prefix is negligible. Once we know that the least-significant bits of node identifiers are 'useless' we can play with them. For instance, an identifier whose 32 least-significant bits are all zeros announces that the node is not reachable, while one whose 32 least-significant bits are all ones announces that it is globally reachable. A node not supporting this flagging could get a random identifier whose 32 least-significant bits happen to be all the same; the probability of this is, though, very small (2^-31). This technique has all the pros of the port-numbering scheme, plus it is actually possible to implement.
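A minimal sketch of the identifier flag, assuming the flag is carried in the 32 least-significant bits of the 160-bit nodeID exactly as described above (function names are illustrative):

    import os

    FLAG_BITS = 32
    FLAG_MASK = (1 << FLAG_BITS) - 1

    def make_node_id(globally_reachable):
        # Random 160-bit ID whose 32 LSBs encode reachability:
        # all ones if reachable, all zeros if not.
        high = int.from_bytes(os.urandom(20), "big") & ~FLAG_MASK
        return high | (FLAG_MASK if globally_reachable else 0)

    def reachability_flag(node_id):
        # 'reachable', 'unreachable', or 'unknown' (an old client whose random
        # ID has LSBs that are neither all zeros nor all ones).
        low = node_id & FLAG_MASK
        if low == FLAG_MASK:
            return "reachable"
        if low == 0:
            return "unreachable"
        return "unknown"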
Dead node detection

Every bucket is checked at least every 15 minutes (i.e., every node in the routing table was known to be alive at most 15 minutes ago).
I don't like this method. It's backwards incompatible and adds complexity.
A smoother version of the current algorithm: instead of checking all 8 nodes every 15 minutes, a different node in each bucket is checked every 2 minutes. Pros: the probability of finding every node in the bucket dead during a get_peers is lower, and the DHT maintenance messages are more spread out in time, avoiding peaks. Cons: it might create a little more traffic, although the checking could be done with pings, which create less load for the checked nodes.

Dynamic timeout

Replacement routing table

Shortcut routing??

Glossary