Rich Metadata

This page describes my work on Rich Metadata done during my MSc thesis at TUDelft.

The goal of the thesis is to analyse, design, implement, and evaluate a system that allows the dissemination of rich metadata in a P2P overlay, taking into account the intrinsic difficulties of this kind of environment, such as full decentralisation and the presence of a large number of nodes about whose availability and reliability few assumptions can be made. Specifically, we study a possible solution for integrating subtitle ingestion and distribution within the Tribler P2P overlay and its Channels abstraction, focusing on the network overhead that would be introduced, on the availability of metadata within the network, and on the quality of the user experience.

For full reference, please consult the attachments at the bottom of this page (some of them are currently work in progress).

Definitions

We introduce here several definitions that will be used throughout this page. The reader is assumed to be familiar with the concepts and terms related to Tribler, especially those concerning Channels (see LivePlaylists).

(Digital) Item
any kind of digitally representable information meant to be consumed by a user.
Content
the representation of an item (e.g. a file)
User
any participant in the P2P Network that can produce, publish and consume items
Publisher
A user that advertises one or more items to other users using channels. A publisher does not necessarily coincide with the producer of the item, nor does it necessarily have to host the content itself.
Channel
A collection of items published by a single publisher. See LivePlaylists.
Subscriber
A user who has expressed interest in a particular channel.
Metadata
Any kind of digitally representable knowledge about an item. The implementation concentrates on textual subtitles.
Metacontent
The representation of an instance of metadata (e.g. the file that contains the metadata).

Problem Description

The focus of this work is to analyse possible mechanisms and infrastructures that allow a content publisher to enrich the objects of their publications (items) with Rich Metadata (RMD) and to disseminate it to interested consumers within a P2P network.

RMD dissemination can be divided in three different but related aspects, namely:

  • Metadata ingestion
  • Metadata announcement
  • Metadata retrieval
Metadata ingestion
identifies the process, and the mechanisms involved in it, that allow a publisher to enrich one of the items in its channel with some metadata, and to make this metadata available to other users.
Metadata announcement
is the process and the related set of techniques used in the system in order to inform users of the existence and availability of metadata for a particular item in a channel. It enables users in the system to perform discovery of metadata.
Metadata retrieval
is the process, and the related techniques, through which any user in the system gets the physical representation of a metadata instance, i.e., its metacontent.

Preliminary problem analysis

Please see attachment:preliminary_analysys_document.pdf.

Goals

  • Maximize the speed of metadata announcement spread
  • Minimize the bandwidth usage
  • Maximize the availability of metacontents throughout the network

Assumptions

  • users are able to instantiate overlay connections to each other: we assume that the middleware providing the P2P Overlay services offers some kind of NAT traversal mechanism, allowing peers to exchange messages regardless of whether they are behind NAT devices.
  • publishers are assumed to have a local copy of the rich metacontent they want to publish
  • publishers of content are willing to donate their bandwidth to disseminate their contents for significant amounts of time

Subtitle Support

The first step towards full Rich Metadata support has been the implementation of a Subtitle Exchange Subsystem integrated into Tribler. It is meant to be a seamless extension of the Channels concept, allowing publishers to enrich items in their channels with subtitles in several languages.

Please consult the User Guide (attachment:user_manual.pdf) for a Quick Start tutorial on how to use subtitles.

Design Overview

  • Using the Subtitle Subsystem, a user can publish and disseminate subtitle metacontents for multimedia items distributed via BitTorrent.
  • Currently only subtitles in http://wiki.videolan.org/SubRip format (aka .srt) are supported.
  • Information about published subtitles is disseminated in a gossip-like fashion throughout the network.
  • Augmented ChannelCast protocol: for each gossiped item, also subtitles information is distributed.
  • The actual subtitles contents are requested and sent to interested peers via two new ad-hoc Overlay messages.
  • Direct distribution of subtitles through single messages has been preferred to BitTorrent-based transfers because of the small size of subtitle files (usually less than 100 KB). The bootstrap time of a BitTorrent transfer for such a small file would be much greater than the actual download time, causing excessive latency.
  • To increase the availability of published subtitles, we leverage the subscription mechanism already implemented in ChannelCast through VoteCast: subscribers of a channel automatically download subtitles as soon as they are available, thus becoming an additional source for them.
  • From the received gossip messages, peers build a map of available content sources from which to retrieve them.

Protocol Description

In this section an informal description of the dissemination and retrieval protocol is given. A complete and more formal specification of the "on-the-wire" protocol can be found here: attachment:subtitles_on_the_wire_spec.pdf.

Subtitle Ingestion

To start the ingestion process, a publisher must provide a path pointing to the local copy of the subtitle file and the language of the subtitle, and must choose an item previously added to their channel with which the subtitle is to be associated. After that, the subsystem performs the following actions:

  1. Create a copy of the subtitle file and store it in the subtitles collecting dir.

To minimize the chance of filename collisions with other collected subtitles, the file is renamed with the hex representation of the SHA-1 digest over the concatenation of the item's infohash, the channel's PermID and the 3-character ISO 639-2 language code of the subtitle (e.g. b96f1a2398cece6f558ced8d259fd979d2b087ab.srt). A sketch of this renaming step is shown after the list below.

  2. Insert information about the published subtitle into the appropriate section of the local MegaCache.
  3. Trigger the dissemination process.
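
As an illustration of the renaming step in point 1, here is a minimal sketch in Python; the function and argument names are hypothetical, not the actual Tribler code.

    import hashlib

    def subtitle_filename(infohash, channel_permid, lang_code):
        # Hypothetical helper: infohash and channel_permid are assumed to be
        # byte strings, lang_code a 3-character ISO 639-2 code (e.g. "ita").
        # SHA-1 over the concatenation of the three values, hex encoded.
        digest = hashlib.sha1(infohash + channel_permid + lang_code.encode("ascii"))
        return digest.hexdigest() + ".srt"  # e.g. b96f1a2398cece6f558ced8d259fd979d2b087ab.srt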

Subtitles announcement dissemination

Availability of subtitles within channels is disseminated via ChannelCast. The protocol has been updated and its messages now include information about subtitles for each gossiped torrent.

Attention has been paid to guaranteeing full backward compatibility: old messages are understood by instances running the new protocol, and instances running old versions of the protocol are still able to understand new messages.

A ChannelCast message is made of several entries: each of them represents an item (i.e. a torrent) published in a channel. Every peer builds the messages it sends by including a mix of items from its own channel and from the channels it is subscribed to.

Our enhancement of the protocol adds to each entry in the message a so-called rich metadata entry. The figure below shows the contents of this entry.

description
is a bit string of at most 500 bytes. It is meant to provide a freeform textual description of the item, but it is currently unused.
timestamp
is a Unix timestamp reflecting the date the rich metadata entry was last modified by its publisher.
languages mask
is a 32-bit string. Each slot in the string corresponds to one of the supported subtitle languages. A bit set to '1' means that the moderator has published a subtitle in the corresponding language. (See attachment:subtitles_on_the_wire_spec.pdf for a list of the supported languages.)
sha1 checksums
is a list of SHA-1 digests. There is one entry for each published subtitle, corresponding to the hash of the .srt file. This is included to avoid third parties spreading fake copies of subtitles.
signature
is an ECC signature over the above fields, computed with the publisher's private key. It can be used by receiving peers to check that the message has not been altered.
have mask
is a 32-bit string, similar to the languages mask. It represents the subset of the published subtitles that are available at the node currently sending the message. This allows receiving peers to build a map of the physical locations of subtitle contents.
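
To clarify how the two 32-bit masks and the checksum list could be used together, here is a minimal sketch. The language ordering and helper names are assumptions made only for illustration; the authoritative bit layout is defined in attachment:subtitles_on_the_wire_spec.pdf.

    import hashlib

    # Example ordering only: the real bit-to-language mapping is defined in the
    # on-the-wire specification.
    SUPPORTED_LANGUAGES = ["ara", "dut", "eng", "fre", "ger", "ita", "spa"]

    def make_mask(lang_codes):
        # Set one bit per published language in a 32-bit integer mask.
        mask = 0
        for code in lang_codes:
            mask |= 1 << SUPPORTED_LANGUAGES.index(code)
        return mask

    def has_language(mask, lang_code):
        return bool(mask & (1 << SUPPORTED_LANGUAGES.index(lang_code)))

    def verify_subtitle(srt_bytes, announced_sha1):
        # Compare a received .srt against the digest carried in the announcement,
        # so that fake copies spread by third parties can be rejected.
        return hashlib.sha1(srt_bytes).digest() == announced_sha1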

Subtitles retrieval

Subtitles are retrieved via a simple asynchronous request-response protocol.

A peer willing to retrieve the contents of one or more subtitles for an item within a channel can send request messages to peers that announced through ChannelCast that they have them. Receivers can directly respond with a message including the contents as payload. If the receiving peer does not have the requested content for any reason, it simply discards the request.

Two new Overlay Message Types have been introduced to implement the protocol.

GET_SUBS (a) is sent by peers as the request message.

publisher_id
is the PermID identifying the channel where the requested subtitle(s) were published.
infohash
is the infohash of the item (i.e. the torrent) to which the subtitles are attached.
request_mask
is a 32-bit string. Each '1' in the string represents a requested subtitle in the corresponding language. In a correct message, the request_mask is a subset of the languages mask sent during the announcement phase.

SUBS (b) is sent as a response to a GET_SUBS request, in case the requested contents are available.

publisher_id
is the PermID identifying the channel where the requested subtitle(s) were published.
infohash
is the infohash of the item (i.e. the torrent) to which the subtitles are attached.
answer_mask
is a 32-bit string. Each '1' in the string represents a subtitle in a given language included in this response message. It is a subset of the request_mask sent in the GET_SUBS message.
payload
is a variable-size list of the UTF-8 encoded subtitle contents. The maximum size of this field is 1 MB.
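
The following sketch shows how a receiving peer could process a GET_SUBS request and build the corresponding SUBS response. Function and variable names are illustrative, not the actual Tribler implementation.

    # Hypothetical handler for the request-response exchange described above.
    def handle_get_subs(publisher_id, infohash, request_mask,
                        announced_languages_mask, local_subtitles):
        # local_subtitles: dict mapping a language bit position to the UTF-8
        # encoded contents of the locally stored .srt file.
        if request_mask & ~announced_languages_mask:
            return None  # asks for languages never announced: discard the request
        answer_mask = 0
        payload = []
        for bit in range(32):
            if request_mask & (1 << bit) and bit in local_subtitles:
                answer_mask |= 1 << bit
                payload.append(local_subtitles[bit])
        if not payload:
            return None  # nothing available locally: the request is simply discarded
        # Fields of the SUBS response message.
        return (publisher_id, infohash, answer_mask, payload)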

An example of the entire diffusion process is shown in the image below.

The figure represents a sample of the network showing one publisher (green circle), subscribers of his channel (purple circles), and other peers. The publisher has added two subtitles for an item in its channel, one in Dutch and one in Italian. A yellow shadow underneath a peer means that the announcement for the subtitles has been received. Vertical and horizontal patterns inside the circles indicate, respectively, the local availability of one or the other of the actual subtitle contents.

  • (a) The moderator gossips with its (logical) neighborhood about the availability of the two subtitles. The languages mask and the have mask are shown above the arrows with list notation. In this case they coincide because the moderator owns both subtitle contents.
  • (b) As soon as it receives the announcements, the subscriber sends requests to the moderator for every subtitle content the moderator has. This is done in order to create replicas of the subtitles, thus increasing their availability. Another peer also sends a request to the moderator, but only for the Italian subtitle.
  • (c) The subscriber starts to spread information about the subtitles through ChannelCast. The languages mask and have mask are equal for him as well, since he owns copies of both subtitles.
  • (d) The moderator goes offline; in the meanwhile, the peer who previously requested the Italian subtitle has become a subscriber. He therefore spreads subtitle information via ChannelCast. This time his have mask only includes the Italian bit, since he does not have a local copy of the Dutch subtitle.

Architecture

In this section the architecture of the Subtitles Subsystem is presented: first we show the logical organization of the software components, their responsibilities, and how they integrate with the underlying Tribler platform. Then the focus moves to the individual modules that make up the software components and the interactions they perform to achieve the desired behavior.

In the figure below a logical view of the macro-blocks of the system is shown. The components that were already implemented by the platform are pictured in a lighter shade of grey.

RMD Gossip
creates outgoing gossip messages and processes incoming ones, managing Rich Metadata (subtitle) information. It implements the necessary extension to the ChannelCast module.
RMD DB
A newly added component of the Tribler MegaCache, it manages the information needs of the whole Subtitles Subsystem, facilitating the retrieval and persistence of rich metadata.
RMD Fetcher
Implements the request-response protocol described in the previous section.
RMD Manager
coordinates the operation of the components in the subsystem.
RMD User Services
provides an API to be used by clients of the subsystem.
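
As a rough illustration of how these components could be wired together, the sketch below shows a coordinator in the spirit of the RMD Manager. All class and method names are hypothetical; please refer to attachment:api.pdf for the real interfaces.

    class RichMetadataManager:
        """Coordinates the components of the Subtitles Subsystem (sketch only)."""

        def __init__(self, rmd_db, rmd_gossip, rmd_fetcher):
            self.rmd_db = rmd_db            # persistence (MegaCache extension)
            self.rmd_gossip = rmd_gossip    # ChannelCast extension for announcements
            self.rmd_fetcher = rmd_fetcher  # GET_SUBS / SUBS request-response protocol

        def publish_subtitle(self, channel_permid, infohash, lang_code, srt_path):
            # Ingest the subtitle locally, then trigger its announcement.
            self.rmd_db.store_subtitle(channel_permid, infohash, lang_code, srt_path)
            self.rmd_gossip.announce(channel_permid, infohash)

        def retrieve_subtitle(self, channel_permid, infohash, lang_code, callback):
            # Ask the fetcher to request the content from a peer that announced it.
            self.rmd_fetcher.request(channel_permid, infohash, lang_code, callback)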

The responsibilities of the above components are further split across the modules they are made of, as shown in the next figure along with their reciprocal dependencies. Please consult the API reference for full details (attachment:api.pdf or http://subtitles.zxq.net).

In the next figures the basic interactions between these modules are shown in several sequence diagrams. They are not meant to be fully detailed, but to let the reader understand the basic program flow. (Click on each figure to zoom.)

  • Peer gossiping subtitles.

  • Peer receiving gossip about subtitles.

  • Peer creating and sending a subtitle contents request.

  • Peer receiving some subtitle contents through a SUBS message.

Experimental results

We have performed several experiments to evaluate the effectiveness of the implemented solution; further evaluation tests are still in progress and will be published on this page as soon as they are available.

The experiments are based on a small scale emulation of peers in the Tribler overlay implementing the subtitle exchange features.

The choice of emulation instead of a more common large-scale simulation was made primarily because we needed a realistic response from the implemented software, one that could also reveal possible problems in a real deployment: our system was in fact integrated and merged into the P2P Next - Next Share platform, and the experiments we performed gave us a certain degree of confidence about its reliability.

Experiments Framework

For our emulation we used 7 identical machines. Each of them has the following characteristics:

  • 8 Intel Xeon E5345 @ 2.33 GHz CPUs
  • 16 GB primary memory
  • Fedora Core 8.something running Linux 2.6.23
  • ??Connectivity specs??

All of them were connected to the same LAN segment.

On each node 20 instances of Tribler are executed, and their MegaCaches are reset at the start of each run. The experiment code has been slightly modified with respect to the production code to allow remote control and event logging.

Two different settings were explored:

  1. The emulated Tribler instances are run inside the "official" Tribler Overlay. Therefore they are able to gossip and communicate with "vanilla" versions of the client spread throughout the Internet.
  2. The emulated Tribler instances are run inside an "ad-hoc" isolated overlay, called Overlay #42. Inside this overlay they cannot see other peers, and they can only interact with each other.

For each of these settings we varied the uptime of the publisher and the number and session characteristics of the subscribers.

To efficiently analyse the results of each experiment run, we also assembled a simple framework whose role is to allow easy collection, archival and parsing of the logs. The framework also allows generating several graphs summarising the basic characteristics of each experiment.

The modified Tribler code can be found at: http://svn.tribler.org/abc/branches/andrea/d10-03-30-release5.2-r15178/

The scripts used to run the experiments are available at: http://svn.tribler.org/abc/branches/andrea/tribler-run-scripts/

The Python code of the parsing and graphing framework is instead at: http://svn.tribler.org/abc/branches/andrea/logs-parser/

Evaluation

We wanted an experimental environment able to simulate the swarm dynamics of real clients, especially regarding the churn rate and the uptime of single instances. To do so, we decided to control the 140+ instances in our experiments based on real Tribler usage traces collected during previous experiments.

The traces we used describe the behaviour of 5000 users, including the start and end of their sessions, their connectability and their download actions. For each experiment run we randomly select 140 users out of the traces and associate each of them with one instance. That instance is then commanded to connect to and disconnect from the overlay following the actions of its corresponding real user. A minimal sketch of this trace-driven control is shown below.
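
The sketch assumes a simple trace object with per-session start and end times and a generic scheduling callback; both are illustrative, not the actual experiment scripts.

    import random

    NUM_INSTANCES = 140

    def schedule_instances(traces, instances, schedule):
        # Pick one real user trace per emulated instance, at random.
        selected = random.sample(traces, NUM_INSTANCES)
        for instance, user_trace in zip(instances, selected):
            for session in user_trace.sessions:
                # Replay the real user's behaviour: connect at session start,
                # disconnect at session end.
                schedule(session.start_time, instance.connect_to_overlay)
                schedule(session.end_time, instance.disconnect_from_overlay)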

The announcement and retrieval phases of our subtitle diffusion solution are studied by introducing a moderator instance into the emulated environment and publishing on it a single subtitle for one of its channel items. This simplification makes it easier to trace and measure the parameters we are interested in, without altering the validity of our results. The moderating instance is artificially controlled to go up and down depending on the single experiment run, and - depending on the single experiment - we introduced one or more subscribers to estimate the effects of their presence.

Swarm Description

We report here some graphs that describe the topology and behavior of the emulated swarms. Since the traces that guide the peers' sessions are randomly chosen and different for each experiment, there is substantial variability in the characteristics of each run.

The parameters that we considered to describe swarms behavior are:

  • The number of peers connected to the swarm at each instant of simulation time
  • The distribution of session lengths among peers
  • The frequency of churn events in a run (i.e. the frequency of events that modify the topology of the swarm)

For each of them we report below a boxplot graph describing how the parameter varies among different experiments, and then we report a representative example for each of them.

number of peers
it is highly variable both during a single emulation and across different emulations; this is due to the randomness in choosing the traces. However, the medians in each experiment vary from a minimum of 59 concurrently connected peers to 82. Similarly, if the first and third quartiles within a single experiment are considered, it can be noticed that they vary in a limited range, around 20 peers wide.

churn events distribution
from the boxplot above it can be seen that in all the reported experiment runs the churn rate is particularly high, and that from this point of view the experiments do not vary significantly from each other. In the figure next to it we show, as an example, the CDF of the churn events for one of the experiments: the distribution is highly concentrated around the origin of the axis, meaning that the interval between two consecutive churn events tends to be very small; in the example, 50% of churn events happen within 60 seconds of each other.

session length distribution
from the boxplot graph it is easy to see that the session length is highly variable within a single experiment, but the characteristics of this attribute are more or less the same across different experiments. In the graph on the right, the CDF of the session length distribution is shown for a single run. For many peers the session length is extremely small (less than 90 minutes for 25% of the peers), while for the upper 25% it is longer than 12 hours. (We did not measure sessions longer than 12 hours since our emulations are at most 1 day long; measuring sessions longer than half this window would bias the results towards short sessions. For an explanation, consult the thesis.)

Announcement Overhead

We measured the overhead of the subtitle announcement phase, in terms of bandwidth on publishers and subscribers, for the current protocol implementation. The size of a single announcement item can be easily computed a priori:

  • 2 bytes for the bencoding of the (empty) description string
  • a 12-byte string for the timestamp expressed as a bencoded integer
  • 6 bytes for the bencoded 32-bit languages mask
  • 25 bytes for the bencoded list of SHA-1 digests (containing only 1 entry for 1 subtitle)
  • 67 bytes for the ECC signature
  • 6 bytes for the have mask
  • 2 bytes for the enclosing list.

This makes a total of 120 bytes of overhead per announcement.
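
As a quick sanity check, the listed field sizes can be summed programmatically (the values are taken directly from the list above):

    # Per-announcement overhead, field by field, in bytes.
    field_sizes = {
        "description (empty, bencoded)": 2,
        "timestamp (bencoded integer)": 12,
        "languages mask (32-bit, bencoded)": 6,
        "SHA-1 digest list (1 subtitle)": 25,
        "ECC signature": 67,
        "have mask": 6,
        "enclosing list": 2,
    }
    assert sum(field_sizes.values()) == 120  # bytes per announcement entry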

We measured how the bandwidth usage evolves over time: the figures below show the results.

The image above is taken from an experiment run for 24 hours in the official Tribler overlay. The bandwidth used for announcements periodically grows (in an almost perfectly linear fashion) as connections with the experimental peers are established and gossip with them happens: the total bandwidth consumed in one day amounts to only 25 KB, confirming that the goal of minimizing overhead is met.

Nevertheless, 25 KB is much more than the minimum bandwidth required to inform all the 140 peers in the simulations: that bandwidth would amount to 140 * 120 bytes = 16,800 bytes, i.e. about 16 KB. Therefore bandwidth must be wasted on sending duplicates: this is confirmed in the figure below.

Nearly 50% of the announcements sent are duplicates. The fact that duplicates start to grow only after 4 hours of simulation time is due to the fact that Tribler blocks peers with which it has already gossiped for a window of that length. We believe that this is not sufficient, since once the 4-hour timespan has elapsed the duplicates start to grow much faster than the useful announcements.

Announcement Coverage

During our tests we looked at another important parameter of announcement diffusion, which we call the diffusion coverage. In particular, we measured the fraction of peers that appeared online at least once and were reached by an announcement message sent by the moderator or by any subscriber.

The following figure shows the evolution of coverage over time for three different experiment settings. All three were run in Overlay #42, but similar results were also achieved when running the experiment in the official Tribler Overlay.

  • experiment E1 represents the base case. In that scenario we run a moderator continuously for 12 hours.
  • experiment E2 differs from E1 in that the moderator goes up and down every 4 hours.
  • experiment E3 introduces a subscriber into the same setting as E2.

An important result is that in every case, after around 1500 seconds from the beginning of the emulation (i.e. 25 minutes), more than 80% of the peers that appeared have been reached by announcements, and after 5000 seconds (i.e. 83 minutes) the coverage is stable between 90% and 100% in E1 and E3. In E2, instead, the 4-hour absence of the moderator, which would otherwise keep peers fed with its announcements, makes the coverage drop to 80% as new peers connect to the overlay, but it soon goes up again as the moderator reappears. It is important to notice that in E3 the presence of a subscriber manages to compensate for the absence of the moderator while it is down.

Diffusion speed

A central factor for the effectiveness of metadata diffusion is the timely reception of the announcement by a peer. This time represents the window in which a peer searching for an item will not receive results for the corresponding subtitles: therefore we want that window to be as small as possible.

In the next figure we show the CDF of waiting times in a single experiment.

More than half of the emulated peers receive the announcement within 600 seconds of their first appearance in the overlay, and 90% of them receive it within 1200 seconds.

Subtitles as Overlay Messages: Overhead

We also wanted to measure the impact in terms of bandwidth usage of sending subtitles in overlay messages. To mimic user behaviour, every emulated Tribler instance decides with a 33% chance to download a subtitle for which it has received an announcement. Moreover, since it is unrealistic for a human user to download a subtitle as soon as the announcement is received, we delay the request by an amount of time that follows a uniform distribution in the window (0, 3600] seconds.
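
A minimal sketch of this emulated user behaviour; the function name and the scheduling callback are hypothetical, not the actual experiment code.

    import random

    DOWNLOAD_PROBABILITY = 0.33
    MAX_DELAY_SECONDS = 3600

    def maybe_schedule_download(schedule_request, announcement):
        # Each announced subtitle is downloaded with a 33% chance ...
        if random.random() < DOWNLOAD_PROBABILITY:
            # ... after a delay drawn (approximately) uniformly from (0, 3600] seconds.
            delay = random.uniform(0, MAX_DELAY_SECONDS)
            schedule_request(announcement, delay)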

Below we show the results for an emulation including one publisher and one subscriber: the first figure illustrates the total bandwidth used for subtitle exchange, while the second shows how this is split between the publisher and the subscriber. The size of the exchanged subtitle is approximately 60 KB, covering a film about 1 hour 30 minutes long.

The overall bandwidth consumed in the 12-hour emulation amounts to approximately 3500 KB. A slightly faster growth is recognizable in the first part of the emulation, in which the majority of the peers' first appearances are concentrated.

The next figure shows how the requests are split between the subscriber and the publisher. Notice that in the middle part of the simulation the publisher's line stops growing, since the moderator goes down for 4 hours.

Conclusions and Future Work

What we have done:

  • Design and implementation of an extension of the ChannelCast protocol allowing a user to publish and disseminate Rich Metadata throughout the overlay network.
  • Design and implementation of a simple asynchronous request-reply protocol, allowing subtitles exchange via Overlay Network messages.
  • Measurements of the implemented solution through an emulation framework.

What we concluded:

  • The ChannelCast extension shows itself to be extremely lightweight, contributing a bandwidth overhead of only 25 KB per day.
  • The direct transfer of subtitles through overlay messages guarantees low-latency transfers for small-sized contents such as subtitles, paid for with a small overhead on the publisher and possible subscribers.

What are the limitations:

  • Even if the announcement messages are extremely small, we found that peers exchange a large number of duplicate messages, accounting for 50% of the announcement traffic. This can become a problem if more complex and richer announcements are to be sent.
  • The mechanism of direct subtitle exchange constitutes a relevant limitation to scalability, since all peers rely only on the publisher and its possible subscribers to retrieve contents. The bandwidth capacity of the publisher can also constitute a bottleneck: this is even more true if the exchanged contents are bigger in size than simple subtitle files.

Possible future work:

  • Create an extensible framework to disseminate a broader variety of Metadata (e.g. thumbnails, video previews, timed descriptions or ads...)
  • Work on something able to overcome the duplicates problem (Bloom filters?)
  • Invert the dependency item -> metadata. Now metadata refers to items, but metadata should be the centre of a channel: metadata is what describes an item in a channel. This would allow, for example, several physical contents (e.g. videos with different qualities) to be linked to the same item in a channel. Exploring an MPEG-21-like approach could be interesting.