Rich Metadata

This page describes my work on Rich Metadata, done during my MSc thesis at TU Delft. The goal of the thesis is to analyse, design, implement and evaluate a system that allows the dissemination of rich metadata in the context of a P2P overlay, taking into account the intrinsic difficulties of this kind of environment, such as full decentralisation and the presence of a large number of nodes for which few assumptions about availability and reliability can be made. Specifically, we study a possible solution for integrating subtitle ingestion and distribution within the Tribler P2P overlay and its Channels abstraction, focusing on the network overhead that would be introduced, on the availability of metadata within the network, and on the quality of users' experience.

Fast Links
For full reference, please consult the attachments at the bottom of this page (some of them are currently work in progress).

Definitions

We introduce here several definitions that will be used throughout this page. The reader is assumed to be familiar with the concepts and terms related to Tribler, especially those concerning Channels (see LivePlaylists).
Problem Description

The focus of this work is to analyse possible mechanisms and infrastructures that allow a content publisher to enrich the objects of their publications (items) with Rich Metadata (RMD) and to disseminate it to interested consumers within a P2P network. RMD dissemination can be divided into three different but related aspects, namely:
Preliminary problem analysis

Please see attachment:preliminary_analysys_document.pdf.

Goals
Assumptions
Subtitle Support

The first step towards full Rich Metadata support has been the implementation of a Subtitle Exchange Subsystem integrated into Tribler. It is meant to be a seamless extension of the Channels concept, allowing publishers to enrich the items in their channels with subtitles in several languages. Please consult the User Guide (attachment:user_manual.pdf) for a quick-start tutorial on how to use subtitles.

Design Overview
Protocol Description

In this section an informal description of the dissemination and retrieval protocol is given. A complete and more formal specification of the "on-the-wire" protocol can be found here: attachment:subtitles_on_the_wire_spec.pdf

Subtitle Ingestion

To start the ingestion process, a publisher must provide a path pointing to the local copy of the subtitle file, the language of the selected subtitle, and an item previously added to their channel with which the subtitles are to be associated. After that, the subsystem performs the following actions:
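A hypothetical sketch of such an ingestion entry point is given below. The class and method names, the checksum choice, and the local-store layout are illustrative assumptions, not Tribler's actual API; the sketch only mirrors the three inputs described above (subtitle path, language, associated item).

```python
import hashlib
import shutil
from pathlib import Path

class SubtitleIngestor:
    """Illustrative sketch of subtitle ingestion (names are assumptions)."""

    def __init__(self, store_dir):
        self.store_dir = Path(store_dir)
        self.store_dir.mkdir(parents=True, exist_ok=True)

    def ingest(self, subtitle_path, lang, channel_id, infohash):
        """Copy the subtitle into the local store and return a metadata record."""
        data = Path(subtitle_path).read_bytes()
        checksum = hashlib.sha1(data).hexdigest()  # lets peers verify received content
        dest = self.store_dir / "{0}_{1}.srt".format(infohash, lang)
        shutil.copyfile(subtitle_path, dest)
        return {
            "channel": channel_id,   # channel the item belongs to
            "infohash": infohash,    # item the subtitles are associated with
            "lang": lang,
            "checksum": checksum,
            "path": str(dest),
        }
```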
Subtitle announcement dissemination

The availability of subtitles within channels is disseminated via ChannelCast. The protocol has been updated, and its messages now include information about subtitles for each gossiped torrent. Attention has been paid to guaranteeing full backward compatibility: old messages are understood by instances running the new protocol, and instances running old versions of the protocol are still able to understand new messages. A ChannelCast message is made of several entries, each representing an item (i.e. a torrent) published in a channel. Every peer builds the messages it sends from a mix of items from its own channel and from the channels it is subscribed to. Our enhancement of the protocol adds to each entry in the message a so-called rich metadata entry. The figure below shows the contents of this entry.
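The backward-compatibility property described above can be sketched as follows. This is not the actual wire format (the field names and the dict representation are assumptions); it only illustrates the general technique of appending an optional field that old parsers ignore and new parsers read when present.

```python
def make_entry(publisher_id, infohash, torrent_name, subtitle_langs=None):
    """Build one ChannelCast entry; the rich metadata field is optional."""
    entry = {
        "publisher_id": publisher_id,
        "infohash": infohash,
        "torrentname": torrent_name,
    }
    if subtitle_langs:
        # The hypothetical "rich metadata entry": which subtitle
        # languages are announced for this item.
        entry["rich_metadata"] = {"langs": sorted(subtitle_langs)}
    return entry

def parse_entry(entry):
    """Tolerant parser: entries without the new field still parse cleanly."""
    return entry["publisher_id"], entry["infohash"], entry.get("rich_metadata")
```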
Subtitle retrieval

Subtitles are retrieved via a simple asynchronous request-response protocol. A peer willing to retrieve the contents of one or more subtitles for an item within a channel can send request messages to peers that have announced them through ChannelCast. Receivers can directly respond with a message including the contents as payload. If the receiving peer does not have the requested content for any reason, it simply discards the request. Two new Overlay Message Types have been introduced to implement the protocol. GET_SUBS (a) is sent by peers as a request message.
SUBS (b) is sent as a response to a GET_SUBS request, in case the requested contents are available.
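The request-response exchange can be sketched as below. The message names GET_SUBS and SUBS come from the text; everything else (handler signatures, the in-memory store, the dict message encoding) is an illustrative assumption, not the actual overlay message format.

```python
GET_SUBS = "GET_SUBS"  # request one or more subtitles for an item
SUBS = "SUBS"          # response carrying subtitle contents as payload

class SubtitlePeer:
    def __init__(self, local_subtitles=None):
        # (channel_id, infohash, lang) -> subtitle contents
        self.local_subtitles = dict(local_subtitles or {})
        self.received = {}

    def make_request(self, channel_id, infohash, langs):
        return {"type": GET_SUBS, "channel": channel_id,
                "infohash": infohash, "langs": list(langs)}

    def handle_message(self, msg):
        """Return a reply message, or None (unservable requests are discarded)."""
        if msg["type"] == GET_SUBS:
            payload = {}
            for lang in msg["langs"]:
                key = (msg["channel"], msg["infohash"], lang)
                if key in self.local_subtitles:
                    payload[lang] = self.local_subtitles[key]
            if not payload:
                return None  # nothing available: silently discard the request
            return {"type": SUBS, "channel": msg["channel"],
                    "infohash": msg["infohash"], "contents": payload}
        if msg["type"] == SUBS:
            for lang, data in msg["contents"].items():
                self.received[(msg["channel"], msg["infohash"], lang)] = data
            return None
```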
An example of the entire diffusion process is shown in the image below. The figure represents a sample of the network showing one publisher (green circle), subscribers to its channel (purple circles), and other peers. The publisher has added two subtitles for an item in its channel, one in Dutch and one in Italian. A yellow shadow underneath a peer means that the announcement for the subtitles has been received. Vertical and horizontal patterns inside the circles indicate the availability of the actual contents of one of the two subtitles, respectively.
Architecture

In this section the architecture of the Subtitles Subsystem is presented: first we show the logical organization of the software components, their responsibilities, and how they integrate with the underlying Tribler platform. Then the focus moves to the individual modules that make up the software components and the interactions they perform to achieve the desired behavior. The figure below shows a logical view of the macro-blocks of the system. The components already implemented by the platform are pictured in a lighter shade of grey.
The responsibilities of the above components are further split across the modules they are made of, as shown in the next figure along with their mutual dependencies. Please consult the API reference for full details (attachment:api.pdf or http://subtitles.zxq.net). The next figures show the basic interactions between these modules in several sequence diagrams. They are not meant to be fully detailed, but to let the reader understand the basic program flow. (Click on each figure to zoom.)
Experimental results

We have performed several experiments to evaluate the effectiveness of the implemented solution; further evaluation tests are still in progress and will be published on this page as soon as they are available. The experiments are based on a small-scale emulation of peers in the Tribler overlay implementing the subtitle exchange features. The choice of emulation instead of a more common large-scale simulation was made primarily because we needed realistic responses from the implemented software, which could also reveal possible problems in a real deployment: our system was in fact integrated and merged into the P2P-Next NextShare platform, and the experiments we performed gave us a certain degree of confidence in its reliability.

Experiments Framework

For our emulation we used 7 identical machines. Each of them has the following characteristics:
All of them were connected to the same LAN segment. On each node 20 instances of Tribler are executed, and at each run their MegaCaches are reset. The experiment code has been slightly modified with respect to the production code to allow remote control and event logging. Two different settings were explored:
For each of these settings we varied the uptime of the publisher and the number and session characteristics of the subscribers. To analyse the results of each experiment run efficiently, we also assembled a simple framework that allows easy collection, archival and parsing of the logs. The framework also generates several graphs summarising the basic characteristics of each experiment.

The modified Tribler code can be found at: http://svn.tribler.org/abc/branches/andrea/d10-03-30-release5.2-r15178/
The scripts used to run the experiments are available at: http://svn.tribler.org/abc/branches/andrea/tribler-run-scripts/
The Python code of the parsing and graphing framework is at: http://svn.tribler.org/abc/branches/andrea/logs-parser/

Evaluation

We wanted an experimental environment able to simulate the swarm dynamics of real clients, especially as concerns the churn rate and the uptime of single instances. To do so, we decided to control the 140+ instances in our experiments based on real Tribler usage traces collected during previous experiments. The traces we used describe the behaviour of 5000 users, including the start and end of their sessions, their connectability, and their download actions. For each experiment run we randomly select 140 users out of the traces and associate each of them with one instance. That instance is then commanded to connect to and disconnect from the overlay following the actions of its corresponding real user. The announcement and retrieval phases of our subtitle diffusion solution are studied by introducing a moderator instance into the emulated environment and publishing on it a single subtitle for one of its channel items. This simplification makes it easier to trace and measure the parameters we are interested in, without altering the validity of our results.
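The trace-driven session control described above can be sketched as follows. The record layout of the traces (a list of `(start, end)` session intervals per user) is an assumption for illustration; the actual trace format is not specified here.

```python
import random

def assign_traces(traces, num_instances=140, seed=None):
    """Randomly pick num_instances users from the traces (5000 in our case)
    and assign each to one emulated Tribler instance."""
    rng = random.Random(seed)
    chosen = rng.sample(traces, num_instances)
    # instance i replays the sessions of chosen[i]
    return {i: user for i, user in enumerate(chosen)}

def session_events(user):
    """Flatten a user's sessions into (time, action) events that the
    controller replays to connect/disconnect the instance."""
    events = []
    for start, end in user["sessions"]:
        events.append((start, "connect"))
        events.append((end, "disconnect"))
    return sorted(events)
```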
The moderating instance is artificially controlled to go up and down depending on the single experiment run, and - depending on the single experiment - we introduced one or more subscribers to estimate the effects of their presence.

Swarm Description

We report here some graphs that describe the topology and behavior of the emulated swarms. Since the traces that guide the peers' sessions are randomly chosen and different for each experiment, there is substantial variability in the characteristics of each run. The parameters we considered to describe swarm behavior are:
For each of them we report below a boxplot describing how the parameter varies among different experiments, followed by a representative example.
Announcement Overhead

We measured the overhead of the subtitle announcement phase, in terms of bandwidth at publishers and subscribers, for the current protocol implementation. The size of a single announcement item can be easily computed a priori:
This makes a total of 120 bytes of overhead per announcement. We measured how the bandwidth usage evolves over time: the figures below show the results. These images are taken from an experiment run for 24 hours in the official Tribler overlay. The bandwidth used for announcements grows periodically (in an almost perfectly linear fashion) as connections with the experimental peers are established and gossip with them happens: the total bandwidth consumed in one day amounts to only 25 KBytes, confirming that the goal of minimizing overhead is met. Nevertheless, 25 KBytes is much more than the minimum bandwidth required to inform all 140 peers in the simulations: that bandwidth would amount to 140 * 120 bytes = ~16 KBytes. Therefore bandwidth must be wasted on duplicates: this is confirmed in the figure below. Nearly 50% of the announcements sent are duplicates. Duplicates start to grow only after 4 hours of simulation time because Tribler blocks peers with which it has already gossiped for a window of that length. We believe this is not sufficient, since once the 4-hour window has elapsed, duplicates start to grow much faster than useful announcements.

Announcement Coverage

During our tests we looked at another important parameter of announcement diffusion, which we call the diffusion coverage. In particular, we measured the fraction of peers that appeared online at least once that were reached by an announcement message sent by the moderator or by any subscriber. The following figure shows the evolution of coverage over time for three different experiment settings. All three were run in Overlay #42, but similar results were also achieved when running the experiment in the official Tribler overlay.
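The duplicate-bandwidth argument can be reproduced with a few lines of arithmetic. Note this is a byte-level estimate; the "nearly 50%" figure quoted above counts duplicate messages in the plot and need not match the byte fraction exactly.

```python
ANNOUNCEMENT_SIZE = 120      # bytes per announcement (total computed above)
PEERS = 140                  # emulated peers in a run
OBSERVED_TOTAL = 25 * 1024   # ~25 KBytes measured over 24 hours

minimum_total = PEERS * ANNOUNCEMENT_SIZE   # every peer informed exactly once
wasted = OBSERVED_TOTAL - minimum_total     # bytes spent on duplicates
wasted_fraction = wasted / OBSERVED_TOTAL   # at least ~a third of the bytes
```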
An important result is that in every case, after around 1500 seconds from the beginning of the emulation (i.e. 25 minutes), more than 80% of the peers that appeared have been reached by announcements, and after ≈5000 seconds (i.e. 83 minutes) the coverage is stable between 90% and 100% in E1 and E3. In E2, instead, the 4-hour absence of the moderator keeping peers fed with its announcements makes the coverage drop to 80% as new peers connect to the overlay, but it soon goes up again as the moderator reappears. It is important to notice that in E3 the presence of a subscriber compensates for the absence of the moderator when it goes down.

Diffusion Speed

A central factor in the effectiveness of metadata diffusion is the timely reception of the announcement by a peer. This time represents the window in which a peer searching for an item will not receive results for the corresponding subtitles; therefore we want that window to be as small as possible. The next figure shows the CDF of waiting times in a single experiment. More than half of the emulated peers receive the announcement within 600 seconds of their first appearance in the overlay, and 90% of them receive it within 1200 seconds.

Subtitles as Overlay Messages: Overhead

We also wanted to measure the impact in terms of bandwidth of sending subtitles in overlay messages. To mimic user behaviour, every emulated Tribler instance decides with a 33% chance to download a subtitle for which it has received an announcement. Moreover, since it is unrealistic for a human user to download a subtitle as soon as the announcement is received, we delay the request by an amount of time drawn from a uniform distribution over the window (0, 3600] seconds. Below we show the results for an emulation including one publisher and one subscriber: the first figure illustrates the total bandwidth used for subtitle exchange, while the second shows how this is split between the publisher and the subscriber.
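The emulated user behaviour described above (33% download probability, uniform delay) can be sketched in a few lines; the function name is illustrative, not taken from the experiment scripts.

```python
import random

def schedule_subtitle_request(announcement_time, rng=random):
    """On receiving an announcement, request the subtitle with 33%
    probability, after a delay drawn uniformly from (0, 3600] seconds
    (approximated here with random.uniform over [0, 3600])."""
    if rng.random() >= 0.33:
        return None  # this peer never requests the subtitle
    delay = rng.uniform(0.0, 3600.0)
    return announcement_time + delay
```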
The size of the exchanged subtitle is approximately 60 KBytes, covering a film 1 hour and 30 minutes long. The overall bandwidth consumed in the 12-hour emulation amounts to ~3500 KBytes. A slightly faster growth is recognizable in the first part of the emulation, in which the majority of the peers' first appearances are concentrated. The next figure shows how the requests are split between the subscriber and the publisher. Notice that in the middle part of the simulation the publisher's line stops growing, since the moderator goes down for 4 hours.
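As a rough back-of-the-envelope check of these figures (not part of the original analysis): with 140 peers each downloading a ~60 KByte subtitle with 33% probability, the expected volume is about 2772 KBytes, the same order of magnitude as the ~3500 KBytes observed; the remainder is not accounted for by this simple estimate.

```python
SUBTITLE_KB = 60      # approximate size of the exchanged subtitle
PEERS = 140           # emulated peers
P_DOWNLOAD = 0.33     # per-peer download probability
OBSERVED_KB = 3500    # total observed over the 12-hour emulation

expected_kb = PEERS * P_DOWNLOAD * SUBTITLE_KB   # expected transfer volume
implied_transfers = OBSERVED_KB / SUBTITLE_KB    # transfers implied by the total
```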
Conclusions and Future WorkWhat we have done:
What we concluded:
What are the limitations:
Possible future work:
Attachments