SocioCast: Social Network castingThesis: http://www.tribler.org/attachment/wiki/SocioCast/thesis.pdf Research DirectionResearch Question
Definitions
Assumptions - made credible using literature or sound reasoning
Subproblems - made credible using results from theoretic models
TestingEach of the subproblems posed above should be solved using a model of SocioCast. First model: Explorative test. Hubs are interesting nodes and therefore we looked at their coverage. Tests have been done on a 100.000 size network with on average 10 links per node. These networks have been generated with an algorithm for scale-free networks and an algorithm for small-world networks. From the outcome it can be concluded that the coverage of hubs in scale free networks is far greater than those in small world networks. Also the coverage of hubs in a scale-free network outperforms the coverage of 25 leaf nodes in contrast to the small-world network. The problem of this first test is that the coverage is concluded from a breadth first search from the source. This does not take into account the bucket size (size of the cache used to store network links per peer). Second model: Run a simulation of a generated scale-free network bootstrapping. Unfortunately calculating SRI many times for every node is very expensive. exploration speed - Compare exploration speed of a new node with network churn and without and also connectibility issues. The problem is that connectibility can only be circumvented if contact is done from 2 ways. Simulations with only 1 active node are essentially the same as the first model, only select_peer is different. scalability collusion resistance traitor resistance To compare different designs Performance Indicators should be introduced
Comparisons of SocioCast vs existing networksThis information is to be used in the background chapter and the recommendation chapter. last.fm friend network is similar
interest groups
neighbors
Last.fm recommendations are directed towards a user or group. Tribler approvals and moderations are however undirected. This means these systems are hard to compare. Last.fm played tracks/artists can be linked to downloads. Tight social networks are formed by communication between members. Benefit of having friends in Last.fm is
Benefit of having groups in Last.fm is
Ideas for Tribler: Interest groups are a good idea as this will create small worlds for peers with similar interests. Probably does not make SocioCast resistant against attacks but makes finding relevant information easier. Share download behavior with friends/group. Might be more privacy sensitive than listening behaviour. Does however give implicit recommendations. Sending of explicit recommendations to friends/groups. (spam?) The neighbour algorithm in Last.fm is significantly different from recommendations in Tribler. If monitoring of behavior of users in Tribler it could be enhanced in the Last.fm way.. Easy adding of users by fetching them from gmail or hotmail would be nice, but could give privacy concerns. Research questions / Hurdlesidea study advisor Eliminate uncontrollable variables: good/bad peers. Evaluate SRI performance theoretically using a manually constructed network.
explain choice for maxflow sri using the above Network Which social network structure will arise if any and can this be influenced by the protocol? According to research? the existing social networks of Flickr, Youtube, Orkut and Livejournal are scale free and small world networks. Youtube has an interesting property where some peers are celebrities. These celebrities are highly connected but unlike hubs they do not have a higher than random chance to be connected to other highly connected peers. The problem I foresee is that it is not clear to users why they are building a social network. This may result into users making many friends just for the fun of it or leaving the functionality for what it is and not make any friends at all. A social network can only be used to infer some kind of trust information if relationships are meaningful. Unfortunately this can not be tested or prophesied and therefore only the recommendation can be made to make friendship relations more meaningful. Unlike friendship relations, moderator approvals are of clear use to users because it will help the user pick moderations from the same moderator for next files. Without a social network these moderator approvals are however useless, because without any ties it is hard to distinguish approvals of honest peers from collusion networks Which type of peers will be the originators of useful approvals? Hopefully hubs as these have the highest network coverage.. Can this be tested or made likely to believe? We assume the big information providers will be in large part the hubs in the network. The reasoning behind this is that these nodes put a lot of effort into building a meaningful social network and are active in the Tribler community. This will make the chance that they produce a lot of meaningful moderator approvals larger than average. Leaf nodes are less active in the network or potential hubs that have just started. These nodes are highly clustered with their friends which may or may not create specific interest clusters. This will make finding moderator approvals far more easy. Youtube like celebrities, be it approval or moderator providers. Celebrity moderators means many approvals which is good. Approval celebrities create a big star form network without scalefree characteristics. We assume the big information providers will be in large part the hubs in the network. The reasoning behind this is that these nodes put a lot of effort into building a meaningful social network and are active in the Tribler community. This will make the chance that they produce a lot of meaningful moderator approvals larger than average. Leaf nodes are less active in the network or potential hubs that have just started. These nodes are highly clustered with their friends which may or may not create specific interest clusters. This will make finding moderator approvals far more easy. The youtube dataset has favorites, but these point towards movies and not users. Could look at the correlation between amount of friends and amount of favorites to predict moderator approval behavior. Is the current youtube dataset sufficient? Doubtful. Peers have favorites, which can not be used to simulate approvals as these link to files and not peers. The correlation between the connectedness of nodes and the amount of (useful) approvals should be extracted. What is the effect of friend spam and can it be countered?
Should SocioCast work with bidirectional relationships or unidirectional and how does this affect the choice of real datasets and results? Protocol What message size should be chosen for sociocast messages? No idea.. Small message size will however hamper breadth first search of the social network. Whether this is bad remains to be seen. If hubs are the most interesting nodes peers can choose to send these relations first to direct the discovery algorithm. Should incoming relations be tracked and propagated? Incoming relations can only be trusted if they are reachable from the current path. It could help speed up discovery, but then again if peers use breadth first search it should not be necessary. Should it be made possible to remove relationships and how to do it? If friend spam becomes common practice it should be possible to remove connections. This could be done by adding a lifetime to relationships so they have to be renewed every few days or something. This could however cause social networks to collapse if central peers remain offline longer than the lifetime of their relationships. Which relations should be selected to propagate first? Hubs and peers with (useful?) approvals. Is bootstrapping a problem, and should peers be bootstrapped? Without a social network peers will have trouble distinguishing good metadata approvers from bad approvers. Bartercast could help, otherwise the amount of approvals could be compared to rate moderations. SRI ( Social Relevance Index )Is a global trust level which can be used for moderator trustworthiness calculation. 2 versions: BFS - The SRI of node b from the point of view of node a is 1/depth where depth the depth of b in the social graph that is known in a. Maxflow? Using maxflow is more interesting. Graph discovery focuses towards highly connected peers and hubs gain a higher SRI. Helps to quickly traverse a scale-free network where BFS would choke. Bootstrapping WARNING: guestimates
Due to the low amount of people that indicate moderators other indicators of trustworthiness should be used. A friend relation does not imply that a peer should be trusted concerning their opinion about moderators and therefore it should be used as a mere indicator. Not part of research though. Hack to bootstrap.. should be mentioned as a recommendation for bootstrapping, but does not need testing or modeling. Hard to calculate SRI if it is possible to alternate friendship and bartercast connections. Therefore bartercast 'trust' should only be used if no other information is available and should not be mixed with social connections. Should approvals be used as friendships links? If hubs are clustered and approved these links can help discover other hubs? Peers that insert valuable approvals into the network may choose to befriend other peers with valuable approvals. Only if this is the case approvals be used in the SRI to determine the relevance of social paths containing 2 or more approvals. This would require a name change for SRI as it will be a specific index for approval relevance and not social relevance. Epidemic protocol algorithm
Architecture:
A very interesting problem is: how to efficiently fill the 10MB of data (e.g. bootstrap phase) plus how to keep this data updated when only a few updates per day occur (e.g. maintenance phase). The challenge is not to spend 960 messages per 4 hour BuddyCast cycle on bandwidth. Only store entries which are most interesting. Problem is lack of info, thus collect info based on social distance. Possibly:
The epidemic protocol should be push-based because of the sparseness of social information. In the Youtube dataset almost half of the users has no friends. A push-based method will prevent users that have no friends from contacting other users without friends. SocioCast will be tied to BuddyCast like BarterCast. Buddycast3 connects 50% of the time to random peers and 50% of the time to taste buddies. This is unwanted behavior for SocioCast, because peers should expand their social network tree. Therefore it is proposed that Buddycast4 will connect 33% of the time to random peers, 33% of the time to taste buddies and 33% of the time to peers with a high SRI. Protocol message layoutThe layout is very similar to deployed protocols. Each peer only broadcasts its own relations to others. A message contains 2 parts:
The size of messages should not exceed 2kb. The sizes of all messageparts are guesstimates, but when 10 friends and 10 moderators are packed the message will not exceed this 2kb limitation. Design principle for security: Only exchange first hand observations of social network edges. The reason that the public keys of moderators are encrypted with MD5 are twofold. First it saves a lot of space in the protocol and second to help moderators remain anonymous. Peers need only know the opinion of their friends about moderations they have and thus MD5 is sufficient. MEGACACHESocioCast will require 3 SQL tables to keep track of friends and moderators seperately. All tables use peerid, which is linked to the public key of peers in the peer table.
Local relations will also be stored in these tables. Timestamp indicates at what time the relation was added to the local database.
As explained before SocioCast messages do not contain the PermID of moderators. It is possible based on moderations in the local database to deduct what PermID is likely referred to. However due to hash collisions the hash will be stored. Timestamp indicates at what time the relation was added to the local database.
xx_timestamp signifies the timestamp of the last relationship that has been sent and receive to and from peer peerid. This table will be purged based on purge actions in the friends table. Previously consideredCentrality is the social importance of a node. The measures for centrality are: degree centrality, betweenness, closeness, and eigenvector centrality. Random Walk Centrality might be interesting because the centrality of nodes is inversely proportional to the degree of branches along a path and it is determined relative to the observer, not as a global value per node. RWC is also a measure of effectiveness of communication, information from a node with high centrality is propagated more quickly. RWC paper It is hard to get a global overview of network connections free from tampering. And if unsafe relations are considered centrality and other network analyses may not be accurate enough and influenced. Youtube datasetSome statistical graphs have been constructed from a Youtube dataset. This Youtube dataset consists of 239535 users. About a third of all agents have favorites, 81330 users to be exact. The distribution of the amount of favorites per agent seems to be a powerlaw. A similar distribution can be seen for favorite holders per movie. We have also looked at the social network in this dataset. This is however not a complete crawl which has several implications. For each user that has been crawled it is known what his friends are. It is however possible that these friends themselves are not crawled. The networks that have been found are sparser than in reality and due to missing connections they are probably more connected than what is found in the following graphs. We have first looked at the distribution of social network sizes. The largest network consists of 134246 users of which 105793 users are part of our dataset. That means that we have no information about outgoing and incoming connections from the remaining 28453 users. The small amount of networks of size 1 can be explained due to Youtube allowing users to add themselves as a friend. This last graph is the result from a peer discovery algorithm performed on the largest network. In each step the algorithm determines the friends of newly discovered peers, the fringe. The algorithm started at a random node, therefore the fringe in step 0 is 1. The social small world experiment done in 1967 at Harvard University determined that social connections are of length 5.5 on average. The average in our graph is slightly higher which can be explained due to our dataset being sparser than in reality. http://en.wikipedia.org/wiki/Small_world_phenomenon For a better overview of social networks read the paper included below and the following website: http://socialnetworks.mpi-sws.mpg.de/ Remarks. Attachments
|