Similarity Function

Status: deployed code

The similarity function is part of the system for fast keyword search. Every Tribler peer keeps open 10 overlay connections to peers with the highest similarity. This results in an P2P overlay which based on content semantics, or a semantic overlay.

An example of a semantic overlay is shown here:

In which Peer 1 and 3 are connected, because they are interested in Animals. Peer 2 and 3, because of Programming. But no connection exists between Peer 1 and 2, because they do not have overlapping interests. The similarity function in Tribler thus offers semantic clustering and scalability.

Peer-Peer Similarity

In the overlay peers exchange information regarding their download history. Using this information a User x Item matrix is constucted.

A row in the User x Item matrix represent the preference list of a peer. When a peer has downloaded an item, this is represented by a 1.

Using the User x Item matrix, similarity can be calculated by formula (1):

M₁₀, denotes the number of items User 1 downloaded, but User 2 did not.
M₀₁, denotes the number of items User 2 downloaded, but User 1 did not.
M₁₁, denotes the number of items both users downloaded.
L, denotes a preference for users with profiles of at least this length. Users which have a shorter profile (downloaded less items) are discounted. In Tribler, L is set to 40.

This function is based upon the Cosine similarity function, the only change is in the notation. Because of Tribler using boolean ratings (downloaded a file yes/no), the function can be rewritten to only comparing the number of items which are/are not overlapping. The profilelength boost was implemented because during research it was shown that the normal Cosine function has a preference to users with small profiles, which are less usefull when discovering new items.

Peer Recommendation

Peer recommendation has two utilities in Tribler:

Select the most similar peer as the target of a Buddycast message. After every 15sec, the client sends a BuddyCast message to either the most similar connectable peer or a random one according to a ratio (the ratio is set to 1 in Buddycast3). Addtionally, each peer will not be sent more than one Buddycast message in a time window (the time window is set to 4 hours in Buddycast3).
Put the top-10 taste buddies in the Buddycast message.

Similarity Updating

When the client starts, the similarity of each peer-peer pair is updated (Full-Update).
When a user changes his/her preference list (i.e. downloading a new item), similarity between all users can change thus again a Full-Update is run.
When a peer receives a new Buddycast message, only the peer-peer similarity to the sender is updated (Single-Update). For the taste buddies and random peers in the message, the peer picks out the ones of which it has no preference list and then updates the indirect peer-peer similarity with them by formula (3). Note that the indirect peer-peer similarity is stored as negtive value in order to be distinct with the direct one. (REALLY?)

Torrent Relevance

Using the similarities between peers, torrent relevancies can be calculated. Whenever a BuddyCast message is received a torrent is selected to be collected by the meta-data handler. This torrent should be the torrent most interresting to the peer. By using the top-50 most similar peers and their items, a relevancy can be calculated for all items by simply using the sum of all similarities for the items the peer which has sent the BuddyCast message has.

If the peer has no particular relevant items, the most popular item is selected as determined by the number of peers which have downloaded it according to a peers own local MegaCache.

Research

The Similarity function as described above is a result of a master thesis research done by Niels Zeilemaker. During this research the following research question was aswered: Find a similarity function from which an efficient P2P topology for keyword search emerges.

Literature Research

Literature Research: Optimizing Peer-to-Peer Keyword Search using Collaborative Filtering

A document created in preparation for the Master Thesis Research consisting of 40-50 Pages and 5-7 Chapters.

Research includes

Current state of P2P search strategies.
Current state of Similarity functions, both memory and model-based.
Methods used to evaluate performance of Similarity functions.
Problem applying existing Similarity functions to Distributed Environment.

In total 31 references used:

[attachment"Toward the Next Generation of Recommender Systems A Survey of the State-of-the-Art and Possible Extensions.pdf?format=raw" Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions]	IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 6, JUNE 2005	Adomavicius et al.
Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach	Department of Information & Decision Sciences Carlson School of Management University of Minnesota	Adomavicius et al.
Content-Based, Collaborative Recommondation	COMMUNICATIONS OF THE ACM March 1997/Vol. 40, No. 3	Balabanovic et al.
Empirical Analysis of Predictive Algorithms for Collaborative Filtering	Microsoft Research	Breese et al.
Eigentaste: A Constant Time Collaborative Filtering Algorithm	IEOR and EECS Departments University of California, Berkeley	Goldberg et al.
Combining Collaborative Filtering with Personal Agents for Better Recommendations	GroupLens Research Group	Good et al.
An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms	Department of Computer Science, Oregon State University	Herlocker et al.
The EigenTrust Algorithm for Reputation Management in P2P Networks	CIKM '01: Proceedings of the tenth international conference on Information and knowledge management	Kamvar et al.
Evaluation of Item-Based Top-N Recommendation Algorithms	University of Minnesota	Karypis et al.
Pollution in p2p file sharing systems	IEEE INFOCOM	Liang et al.
Search Performance Analysis in Peer-to-Peer Networks	Dept. of Electrical Eng. National Taiwan University, Taipei, Taiwan	Lin et al.
Amazon.com Recommendations	Amazon.com	Linden et al.
Content-Boosted Collaborative Filtering for Improved Recommendations	Proceedings of the Eighteenth National Conference on Artificial Intelligence(AAAI-2002)	Melville et al.
PocketLens: Toward a Personal Recommender System	University of Minnesota	Miller et al.
Passive Profiling from Server Logs in an Online Recruitement Enviroment	Proceedings of IJCAI Workshop on Intelligent Techniques for Web Personalization (ITWP2001)	Rafter et al.
Peer-to-Peer Architecture Case Study: Gnutella Network	Department of Computer Science The University of Chicago	Ripeanu et al.
Survey or research toward robust p2p networks search methods	Comput. Netw. 50(17)	Risson et al.
Percolation search in power law networks: Making unstructured peer-to-peer networks scalable	Proceedings of IEEE P2P04	Sarshar et al.
Item-Based Collaborative Filtering Recommendation Algorithms	GroupLens Research Group	Sarwar et al.
E-commerce recommendation applications	Data Min. Knowl. Discov.	Schafer et al.
Internet study 2007	http://www.ipoque.com	Schulze et al.
Personalized Recommendation in Social Tagging Systems using Hierarchical Clustering	RecSys '08: Proceedings of the 2008 ACM conference on Recommender Systems	Shepitsen et al.
Load reducing in the kad p2p system	Fifth International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P, 2007)	Steiner et al.
Improving lookup performance over a widely-deployed dht	IEEE International Conference on Computer Communications INFOCOM 2006	Stutzbach et al.
Adaptive Web Search Based on User Profile Constructed without Any Effort from Users	Proceedings of the 13th international conference on World Wide Web	Sugiyama et al.
Clustering methods for collaborative filtering	AAAI Press, 1998	Ungar et al.
Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion	Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology	Wang et al.
A User-Item Relevance Model for Log-Based Collaborative Filtering	Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology	Wang et al.

Dataset

During the research a dataset was created by parsing superpeer logs in order to get insight into availible data (See SuperpeerLogs).

From this dataset a subset was created using the top-50000 users with the most downloaded files. This subset has 252.469 items and 50.000 users. Using tf-idf 31.906 items were assigned to a category. This helps evaluating the performance of more complex similarity functions and was done manually. First using the tf/idf the more frequent terms were discovered. Which were used to create a list of categories. All items matching all or a combination of the terms of a category where written to a file. These files were then checked manually and incorrect items were removed/disabled.

Evaluation of Candidate Functions

Using the dataset, then several well known similarity functions were evaluated. Due to the unique circumstances in a P2P enviroment (a huge amount of items, resulting in a very sparse dataset 0.99993), time was invested in trying to reduce the sparsity of the dataset.

A method which resulted in a improvement of the recommendations and thus the similarity between users, uses the swarmnames of items to find similar ones. These additional psuedo-ratings then allow for overlap and thus reduce the sparsity.

This more complex similarity function is called ItemItem(Levenshtein) and is not yet implemented in Tribler. The final thesis report can be downloaded here.

Attachments

Adaptive Web Search Based on User Profile Constructed without Any Effort from Users.pdf (304.6KB)
Amazon-Recommendations.pdf (353.8KB)
breese_98_empirical.pdf (242.2KB)
clements_08_optimizing.pdf (310.8KB)
clustering methods for collaborative filtering.pdf (242.1KB)
Combining Collaborative Filtering with Personal Agents for Better Recommendations.pdf (89.8KB)
compfunction.PNG (5.5KB)
Content-Boosted Collaborative Filtering for Improved Recommendations.pdf (97.5KB)
Cosine, Anderberg voor Arno Boudewijn.pdf (316.2KB)
e-commerce.pdf (583.4KB)
Eigentaste A Constant Time Collaborative Filtering Algorithm.pdf (775.8KB)
eigentrust.pdf (198.1KB)
fab-content-based-filtering.pdf (470.5KB)
gnutella-rc.pdf (166.9KB)
han_04_scalable.pdf (483.3KB)
herlocker_02_empirical.pdf (135.1KB)
improving lookup performance over a widely-deployed dht.pdf (197.8KB)
Incorporating Contextual Information in Recommender Systems.PDF (449.5KB)
indirectsimfunction.PNG (13.9KB)
ipoque internet survey.pdf (4.0MB)
item2peersimfunction.PNG (4.7KB)
Karypis2001.pdf (169.9KB)
Load reducing in the kad p2p system.pdf (170.6KB)
Meeting 17 September 2009.pdf (222.9KB)
mirza_03_studying.pdf (634.7KB)
Passive Profiling from Server Logs in an Online Recruitement Enviroment.pdf (103.8KB)
percolation-based search.pdf (321.6KB)
Personalized Recommendation in Social Tagging Systems.pdf (556.8KB)
pollution in p2p file sharing system.pdf (2.5MB)
precrec.png (51.2KB)
ResearchAssignment.pdf (1.2MB)
sarwar_01_itembased.pdf (250.6KB)
Search Performance Analysis in Peer-to-Peer Networks.pdf (70.3KB)
semantic.png (8.1KB)
simfunction.PNG (11.2KB)
survey or research toward robust p2p networks search methods.pdf (1.4MB)
thesis.pdf (978.4KB)
Toward a Personal Recommender System.pdf (1.0MB)
userxitem.png (9.6KB)
Wang2006.pdf (397.2KB)
wang_06_distributed.pdf (548.0KB)
wang_06_unifying.pdf (390.2KB)