ClickLogs: Goals and Research Questions

How relevant is a document for a given search term? This is the central question when we try to answer a user's search query. In a decentralized setting such as torrent search, we mainly have to work with metadata like titles: the actual content is not available at search time, so it cannot be analyzed to better predict relevance. User feedback, however, is a precious resource for estimating relevance. Implicit feedback in particular, i.e. feedback which the user provides simply by using the system, is especially valuable because it can be collected cheaply and at large scale.

This package aims to improve search quality in Tribler by collecting and processing a particular kind of implicit feedback: click log data. Click logs record which document was selected after searching for which keywords, and at which position in the result list the selected document appeared. This data can be used to improve future result lists. The simplest scenario is one in which, for a given keyword, it is always the same torrent that gets selected even though it is displayed at rank #16; an obvious improvement would be to always put this torrent at the top of the list for that keyword. Of course, real situations are more complicated, and how to address them optimally is one of the questions of this research project.

The goals of this project are three-fold:
1. Improvement of Tribler Search

Find what you are looking for, and have a pleasant experience while you are at it - this should be the offer of Tribler's search engine, particularly towards average users who do not want to browse many pages of results or become particularly smart about search terms. The clicklog approach aims to support both search experience and search quality.

1.1. Integrating reranking mechanisms into Tribler Search

The main step is to rerank search results based on the latent information in the clicklog data, as explained above.

1.2. "Suggest" feature using a database of frequent keywords

Another use of clicklog data is a search field offering "auto-complete" or rather "suggest" functionality: while the user is typing, the program offers possible extensions to the letters typed so far.

2. Exploring Reranking Algorithms

We aim to examine the influence of different reranking algorithms, and of using reranking at all. Therefore, the plan is to randomly pick a reranking algorithm at search time and save its identifier with the clicklog data. If we can accumulate enough data, we should be able to make a statement about the usefulness of reranking by looking, in the simplest case, at the average position of clicked results: if clicked results sit significantly higher in reranked lists than in un-reranked lists, reranking should be considered successful.

3. User-Generated Meta-Data

3.1 Spam Issues

Tribler's server-free infrastructure creates a different situation than for most other clicklog applications. Normally, clicklog analysis is performed by the owner of a centralized search engine; with Tribler, the only clicks really observed are those of the local user, and other clicks are only received via the BuddyCast protocol. Since these transmissions might be faked, it has to be examined to what extent faked clicklogs could be used to, e.g., promote spam, and what counter-measures are appropriate.
How do these issues scale, and which additional counter-measures are required once we are facing explicitly given tags in the future?

3.2 Privacy Issues

In the current phase, the goal is to overcome the technical difficulties in distributing and using clicklog data, i.e., the goal is a technological one. On a conceptual level, however, we have to ask ourselves in which situations clicklog data might be harmful in terms of privacy, and which counter-measures could be taken. Such questions will be explicitly addressed in a later step of the project.

Contact: Nicolas Neubauer, Neural Information Processing Group, TU Berlin [neubauer@cs.tu-berlin.de]

Implementation Notes

How to enable query logging

As of rev. 10204, you may enable logging of search keywords by adding something like

    RemoteQueryMsgHandler.getInstance().setLogFile("c:\crawl.txt")

to startAPI in Tribler/Main/tribler.py.

Clicklog wire protocol

As of overlay version 8, the preference part of BuddyCast messages looks like this:

    [["hash1", ["linux", "ubuntu"], 1, 2], ["hash2", ...], ...]

whereas it would have looked like this earlier on:

    ["hash1", "hash2", ...]

This means that instead of a list of hashes, a list of lists is transmitted. Each inner list contains the hash, the keywords used to find the corresponding torrent, the click position (where 1 is in fact the second position in the list), and the number of the reranking strategy used (here: 2).

Important methods in the clicklog data lifecycle

The data for this message is first stored in Core/CacheDB/SqliteCacheDBHandler:MyPreferenceDBHandler.addClicklogToMyPreference. The message itself is created in Core/CacheDB/SqliteCacheDBHandler:MyPreferenceDBHandler.getRecentLivePrefListWithClicklog. On the passive (receiving) side, it is digested (as is pre-OL 8 BuddyCast data) and converted into a more convenient dict for further internal processing in Core/CacheDB/buddycast.py:BuddyCastCore.createPreferenceDictionaryList. The resulting data is finally stored in Core/CacheDB/SqliteCacheDBHandler:PreferenceDBHandler.addPreferences.
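To make the conversion concrete, here is a minimal sketch of turning a preference entry into a dict. The dict keys used here (torrent_hash, search_terms, click_position, reranking_strategy) are illustrative assumptions, not necessarily the keys produced by createPreferenceDictionaryList.

    # Illustrative only: parse overlay-version-8 style preference entries into dicts.
    # Field names below are assumptions, not necessarily those used inside Tribler.

    def parse_preference_entry(entry):
        """Accept both the old format (a bare hash) and the new list format."""
        if isinstance(entry, str):
            # pre-overlay-8: just the torrent hash, no clicklog data
            return {"torrent_hash": entry,
                    "search_terms": [],
                    "click_position": -1,
                    "reranking_strategy": -1}
        torrent_hash, search_terms, click_position, reranking_strategy = entry
        return {"torrent_hash": torrent_hash,
                "search_terms": list(search_terms),
                "click_position": click_position,      # 1 means second position in the list
                "reranking_strategy": reranking_strategy}

    prefs = [["hash1", ["linux", "ubuntu"], 1, 2]]
    print([parse_preference_entry(e) for e in prefs])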
Overall Todos

1. Database
2. Store clicks
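Items 1 and 2 are not spelled out above. As a rough sketch, and only under the assumption that clicks end up in a local sqlite table, the storage could look like the following; table and column names are hypothetical and not Tribler's actual schema behind MyPreferenceDBHandler.

    # Hypothetical sqlite schema and insert for locally observed clicks.
    # The real tables live behind MyPreferenceDBHandler; names here are made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ClickLog (
            torrent_hash       TEXT NOT NULL,
            search_terms       TEXT NOT NULL,   -- space-separated keywords
            click_position     INTEGER,         -- rank in the result list (1 = second entry)
            reranking_strategy INTEGER,         -- id of the strategy used for this search
            click_time         INTEGER          -- unix timestamp
        )""")

    def store_click(torrent_hash, keywords, position, strategy, when):
        conn.execute(
            "INSERT INTO ClickLog VALUES (?, ?, ?, ?, ?)",
            (torrent_hash, " ".join(keywords), position, strategy, when))
        conn.commit()

    store_click("hash1", ["linux", "ubuntu"], 1, 2, 1234567890)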
3. Extend BuddyCast Protocol

Create and parse an additional data structure passed on with each BuddyCast transmission. This structure contains the clicklog data for the n most recent torrent downloads, where n is equal to the number of torrents transmitted by BuddyCast in the first place. The constants

    MAX_KEYWORDS_STORED = 5
    MAX_KEYWORD_LENGTH = 50

define the maximum number of terms per torrent and characters per term that are accepted by a client, in order to prevent spamming.

4. Stress testing

After a number of improvements, time requirements are down to an acceptable size, I believe. See attachments for measurements performed with a dataset of search queries collected earlier on by the Tribler team.

5. Using the data

AutoComplete
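The suggest lookup itself is not detailed here. As a rough illustration, assuming the logged keywords sit in the hypothetical ClickLog table sketched above, a prefix lookup over the most frequent stored terms could look as follows.

    # Illustrative prefix lookup for the "suggest" feature.
    # Assumes the hypothetical ClickLog table from the sketch above; Tribler's
    # actual keyword storage may be organized differently.

    def suggest(conn, prefix, limit=5):
        """Return up to `limit` frequent logged keywords starting with `prefix`."""
        rows = conn.execute("SELECT search_terms FROM ClickLog").fetchall()
        counts = {}
        for (terms,) in rows:
            for term in terms.split():
                if term.startswith(prefix):
                    counts[term] = counts.get(term, 0) + 1
        return [t for t, _ in sorted(counts.items(), key=lambda kv: -kv[1])][:limit]

    # e.g. suggest(conn, "ubu") -> ["ubuntu"]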
Reranking

Right now, 50% of the time (choosing by the hour ;) no reranking takes place, and 50% of the time a simple heuristic is applied: the top entry and the second entry returned for a search query may be swapped if:
The first two criteria are supposed to prevent reranking from being applied without a proper data basis, whereas the third criterion is the actual reranking criterion: if a torrent is consistently selected at a bad position, we might want to move it up.

Earlier considerations on reranking

Different reranking algorithms include:
Implement heuristics for improving rankings. Very generally, rank documents higher if we know peers found them using the current search terms.
In order to be able to make any statement about the effect and quality of reranking algorithms, we should switch between them randomly (even if we only use one actual reranking strategy, we should alternate it with strategy 0, the current "no reranking" behavior). We could do more elaborate learning if, for a click on position 3, we also stored the non-clicked torrents; I guess this is out of the question traffic-wise?

Code-wise, this means writing alternative implementations of Core.Search.KeywordSearch and passing them on to Core.Search.SearchManager. Preferably, SearchManager would receive a couple of instances to choose from, each with a "getRerankingStrategyID()" or so, choose one at random at every search (while evaluating), and make the choice readable for later logging; a sketch of this idea follows below.
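The following is a rough, non-authoritative sketch of that idea. Only getRerankingStrategyID comes from the text above; the class names, the clicks_for helper and the click-count boost are invented for illustration.

    # Sketch only: pluggable reranking strategies picked at random per search.
    # Only getRerankingStrategyID() is taken from the notes above; everything else
    # (class names, the clicks_for helper) is hypothetical.
    import random

    class NoReranking:
        def getRerankingStrategyID(self):
            return 0
        def rerank(self, hits, keywords):
            return hits                      # leave the original order untouched

    class ClickBoostReranking:
        """Move hits up if peers clicked them for the same search terms."""
        def __init__(self, clicks_for):
            self.clicks_for = clicks_for     # callable: (infohash, keywords) -> click count
        def getRerankingStrategyID(self):
            return 1
        def rerank(self, hits, keywords):
            return sorted(hits, key=lambda h: -self.clicks_for(h, keywords))

    class SearchManagerSketch:
        def __init__(self, strategies):
            self.strategies = strategies
        def search(self, hits, keywords):
            strategy = random.choice(self.strategies)
            # the chosen ID would be logged together with any resulting click
            return strategy.getRerankingStrategyID(), strategy.rerank(hits, keywords)

Logging the returned strategy ID alongside each click is what would later allow comparing, for example, the average clicked position per strategy, as described in section 2.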
6. Tracking the data