Metadata Enrichment in Tribler (MET) Notes

Miscellaneous notes moved here from the main page to reduce clutter while keeping them archived.

MTurk Experiment

First Experiment

Data:
Results Before:
Results After:
Mockups

Preliminary findings studying some data from the ClickLog crawl
Current Tribler code for determining keywords of a swarm:

    keywords = Set(split_into_keywords(torrent_name))

    # search through the .torrent file for potential keywords in
    # the filenames
    for filename in torrentdef.get_files_as_unicode():
        keywords.update(split_into_keywords(filename))

After 20 days of crawling:
(Only swarms whose names are known are counted.)

Discovery of torrents/terms

The following three plots show how many torrents are discovered over time, and how many terms are extracted from them. We also show how many terms remain after simple filtering (word length > 2, freq > 1). Note that the act of searching causes torrents to be discovered at a higher rate, because metadata for unknown infohashes in the search results gets requested.

This raises the question of whether active peers will tend to get a "narrower" view of the network than idling peers. When an active peer is shown a term cloud and clicks in it, it performs a search on neighbouring peers. Due to Tribler's semantic overlay, these peers tend to have tastes similar to those of the searching peer. If a term cloud only showed the most popular terms (according to the peer's own MegaCache), clicking on these terms might reinforce the popularity of the clicked terms.

Frequency of discovered terms

800k+ dataset

Dataset 1
Dataset 2
Clustering
Status: Research put on hold for now, as it proved to be quite complex.

In a multilevel tag cloud, we cannot simply display the top N terms ranked on df (document frequency) and, when the user selects one of these terms, display the top M terms that co-occur with the selected term, again ranked on df. The graph below illustrates why. In this graph, the red nodes represent the top N=100 level 1 tags. For each level 1 tag, the top M=100 co-occurring tags are connected by an edge. "Pure" level 2 tags are colored green in the graph. As you can see, most level 1 tags also show up at the second level.

The following situation shows why we do not want this ranking. Suppose that in the global top N terms, "x" and "y" are the two most frequent terms, but that they have a co-frequency of merely 1. Suppose the user selects term "x". For the second level of terms, we do not want "y" to be ranked high at all.

A better approach seems to be to rank level 2 term candidates using the co-df. This results in a graph with 2275 nodes, as opposed to only 165 nodes.

The above images show how terms co-occur.
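To make the difference between ranking on df and ranking on co-df concrete, here is a minimal Python sketch. It is not Tribler code: `split_into_keywords` below is a hypothetical stand-in for the real helper, the function names `document_frequencies` and `rank_level2_by_co_df` are invented for this note, and the swarm names are made-up sample data. It also assumes that "freq" in the simple filter (word length > 2, freq > 1) means document frequency, which the notes do not state explicitly.

```python
import re
from collections import Counter


def split_into_keywords(name):
    # Hypothetical stand-in for Tribler's helper (an assumption):
    # lowercase the name and split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def document_frequencies(swarm_names, min_len=3, min_df=2):
    # df = number of swarms whose name contains the term.
    df = Counter()
    for name in swarm_names:
        df.update(set(split_into_keywords(name)))
    # Simple filtering: word length > 2 and frequency > 1.
    return Counter({t: n for t, n in df.items()
                    if len(t) >= min_len and n >= min_df})


def rank_level2_by_co_df(swarm_names, selected, allowed, top_m=100):
    # co-df = number of swarms that contain both `selected` and the candidate.
    co_df = Counter()
    for name in swarm_names:
        terms = set(split_into_keywords(name))
        if selected in terms:
            co_df.update(t for t in terms if t != selected and t in allowed)
    return co_df.most_common(top_m)


# Made-up sample data: "ubuntu" and "music" play the roles of "x" and "y".
swarms = [
    "ubuntu.desktop.iso",
    "ubuntu.server.iso",
    "ubuntu.desktop.amd64",
    "ubuntu.music",          # the only swarm where both terms co-occur
    "music.album.flac",
    "music.album.mp3",
    "music.live.flac",
]

df = document_frequencies(swarms)
level2 = rank_level2_by_co_df(swarms, "ubuntu", allowed=df)

# Ranking the same candidates on plain df would put "music" first.
naive_first = max(dict(level2), key=df.get)
```

In this sample, ranking the level 2 candidates on df puts "music" at the top even though it shares only one swarm with "ubuntu", whereas ranking on co-df pushes it to the bottom of the list.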