Content Search

Status: operational

Latest news: As of October 2009 we believe that Tribler includes the fastest known zero-server P2P implementation of keyword search: median response time is: 325 milliseconds. This means that search results in Tribler V5.2 often appear faster on your screen then in Google, Twitter or Youtube. Disclaimer: Tribler V5.2 will be released in Feb 2010 and is currently only available in source code form at svn.tribler.org

Finding content is the prime functionality of a P2P system. No matter how fast downloads are, everything starts with finding something interesting. Finding content should be as easy as using Google. Ideally, the first search result is exactly what you want and is a fast-downloading healthy swarm. Protection against spam is another key ingredient. If you were not looking for V-i-a-g-r-a or a special home video it should not appear in your search results. In user-generated video communities such as Youtube and Tribler this is a difficult problem. The key requirements of P2P search are thus: fast results, correct results and spam protection.

Search speed

Current median search speed of the latest Tribler Python code is 325 ms. The first search results thus quickly appear on your screen. For several years we have improved the search feature using our own overlay network. Search speed is being improved in every release by using more sophisticated algorithms. In the first years our search network was still small, slow and vulnerable to Tribler-specific spam attacks. The workings of the underlying P2P algorithms are described in various pages on this wiki: decentralised content discovery, DecentralizedRecommendation, SimilarityFunction, ClickLog, AdversarialContentSearch and Moderation (in progress).

To measure our search response time we conducted an experiment using our live search overlay. Using similar keywords that our users use we send queries into the P2P network. We recorded the response time for numerous queries by measuring the time between sending the query request and first incoming results. The graph below shows two lines, one for the old algorithm and the improved approach of 5.2 version. We are very please that now with 95% probability search results arrive within 1 second. Thus we achieved an important milestone: sub-second search in a full self-organising system.


Unbounded scalability and semantic clustering

Due to the lack of servers in Tribler architecture we believe our algorithms should be able to scale to many millions of users. This is important aim for the Tribler team as we aim for using Tribler in television hardware. With unbounded scalability every future Internet-connected television can be connected to a single global network.

In the past security, speed and search quality where in conflict. With Tribler we moved to a new generation: semantic clustering and full-replication of swarm info. Semantic clustering means that the peers in the network form connections with peers of similar TV viewing behavior. Or at a technical level: peers which download content with the same info_hash. We also exploit the fact that hard disks of today are big and cheap. We integrate full replication of metadata: every peer in the network stores search info for 50.000 swarms (.torrent files only).

The formula above is used to calculate how similar another peer in our overlay is. Assuming that peer u has created a peer-item matrix with received Buddycast messages and its own preference list, the similarity of peer u' to u is calculated by this formula :

  • Li denotes the list of peers who prefer(i.e. download)item i. c(u)denotes the length of peer u's preference list. Smoothing parameter mu is set to 1 and augment parameter k is set to 100000 in Buddycast3. It's easy to prove that k>sim(u',u)>=0. Furthermore, for a experience peer (i.e. having a number of preferece lists from other peers), 0.5k>sim(u',u)>=0.
  • |Li| denotes the length of the list. It is a reverse-frequency factor. The importance of an item for calculating peer-peer similarity decreases with the frequency of its occurrences in users’ preference lists.
  • Similarity is asymmetric: sim(u',u) is not equal to sim(u,u').

P2P Content Search Algorithms

New in upcoming V5.2 are P2P RSS feeds and moderations. Keyword search was the main improvement of V4.1. We had many complaints that the core P2P features of search & download where poor in Tribler V4.0. ContentSearch improvements :

  • Decentralised moderation of swarms New in V5.2
    • User-generation of metadata
    • Vote for you favorite moderator
    • fully implemented in zero-server technology
    • Hardened against spam attacks
  • P2P RSS feeds New in V5.2
    • spread list of approved swarms without big server
    • seamless translation of server-based RSS into P2P RSS
    • uses public/private key encryption based on elliptic curve crypto
    • cryptographic assurances in order to provide untampered feeds
  • Return many results Done: remote search
  • Works even after Tribler is just installed Done
  • Not consume much memory Done
  • Fast response time 325ms, world record
  • Returns no irrelevant results in progress
  • Partial matches to keywords are not shown done
  • Always use the AND operator when two keywords are provided
  • Provides clear user feedback, like TODO
    • Searched for silver in 40,000 files
    • Matches 1 - 20 of 89 for silver
    • No matches for your query Silver AND man.

Version V4.0 of Tribler keeps all .torrent information in main memory. A Google-like keyword search window will replace the browse all feature as the default opening screen For V4.1 we no longer keep all .torrent information in main memory. Like many other parts of Tribler this info needs to be loaded on-demand. Partial word matches are no longer shown and results with 1 matching keyword of a 2 keyword search are also no longer shown (aka default AND). A search of look 2007 does not return The.Outlook.2007.HDTV.

TIMEACTION
startuplist of all keyword of local stored .torrent files is loaded(+popularity); the filenames in a .torrent are ignored for this version
keyword search localmatching of keywords in main memory
keyword search remotesend out search to peers with at least 1000 collected torrents
result orderingresults are ordered by a complex combination of factors: their number of seeders, matching swarm names, matching included filenames in a swarm, plus special bonus for swarms in highly voted moderator channels
results displaymatching swarms are shown directly after local keyword search is completed
results highlightingkeywords matches on the torrent name are highlighted in bold, similar to Google
.torrent info loading.torrent details such as tracker URL are search in MegaCache after display of results
progressive updatesincoming results from the network search (Tribler peers + video sites) are inserted in real-time into the results, respecting the result ordering
limited window flickeringdue to the progressive updating, the results screen is constantly re-written. After the remote keyword search is done, the first page of the results screen is sorted and no longer updated. The slower Youtube results are appended.

Note that even the most basic in information retrieval algorithms are not yet included: Keyword frequency, BM25 in Tribler.

On the V4.1 : Enable remote queries of your MegaCache

  • Keyword search returns numerous results directly after install
  • Bootstrap process does not need to be very agressive
  • Remote keyword searches are not the main mechnism, only to help non-bootstrapped peers
  • Thus restrict to 10 keyword searches per random peer over its entire lifetime
  • Your taste buddies+friends have no restrictions on keyword searches
  • return maximum 20 alive+unchecked torrent names+details, not bulky hashes
  • Allow download of 10 .torrents for given keyword per peer over its lifetime
  • Present a counter to the user of how many .torrent files he can query
  • Show number of discovered peers and how many .torrent files they have on average

(partly?) Implemented by Arno...

For Tribler verion 4.2 :

Keyword searches are more powerfull

  • Buddycast is extended with a field with the last 10 keyword searches
  • Upon startup the opening screen of Tribler shows hot keywords of other peers
  • Clicking a hot keyword initiates a keyword search query
  • Combine the Youtube.com visual gridview with keyword browsing
  • Show for each hot keyword a few thumbnails

User Interface

Canceled Tribler user interface design: Mixing user-genrated video results and social network search has proven to be difficult for people to understand.

long-term Research ideas

Mapping of the user desires with the available content. Craft a iterative cycle of re-search using results feedback. Feedback on keyword search convergence. Real-time search keyword suggestion. pairs of keywords & click-log. Implicit tagging. Adversarial content search.

Usage of click-log approach in P2P search with privacy preservation and security

Arjen: click-log approach with maximum privacy. Spam and adversarial behavior is the key challenge of this research. Understanding the trade-off between performance, relevance, scalability, robustness, etc. Exploit the usage of social networks and friends for security & privacy. Use of aggregation to combat privacy leakage. Johan requirements: running experimental code and zero-servers.