SocioCast: Social Network casting

Thesis: http://www.tribler.org/attachment/wiki/SocioCast/thesis.pdf

Research Direction

Research Question

Is a social network sufficient means for peers to distinguish good moderators

Definitions

  1. Hubs are the top 1% connected peers.
  2. Leaf nodes are the bottom 50% connected peers.
  3. Coverage is the social influence of a node compared to other nodes.

Assumptions - made credible using literature or sound reasoning

  1. The social network formed by SocioCast will be a combination of a scale-free and a small-world network.
  2. Hubs are the main providers of good approvals.

Subproblems - made credible using results from theoretic models

  1. A social network can be formed by SocioCast given that peers have added friends.
  2. SocioCast scales considering a fixed cache size to store relations.
  3. The coverage of hubs is significantly higher than the coverage of (colluding) leaf nodes.
  4. Socially relevant information equals good information.
  5. Resistant against friend spam attack.
  6. Resistant against traitoring hub attack.

Testing

Each of the subproblems posed above should be solved using a model of SocioCast.

First model: Explorative test. Hubs are interesting nodes and therefore we looked at their coverage. Tests have been done on a 100.000 size network with on average 10 links per node. These networks have been generated with an algorithm for scale-free networks and an algorithm for small-world networks. From the outcome it can be concluded that the coverage of hubs in scale free networks is far greater than those in small world networks. Also the coverage of hubs in a scale-free network outperforms the coverage of 25 leaf nodes in contrast to the small-world network.

The problem of this first test is that the coverage is concluded from a breadth first search from the source. This does not take into account the bucket size (size of the cache used to store network links per peer).

Second model: Run a simulation of a generated scale-free network bootstrapping. Unfortunately calculating SRI many times for every node is very expensive.

exploration speed - Compare exploration speed of a new node with network churn and without and also connectibility issues. The problem is that connectibility can only be circumvented if contact is done from 2 ways. Simulations with only 1 active node are essentially the same as the first model, only select_peer is different.

scalability

collusion resistance

traitor resistance

To compare different designs Performance Indicators should be introduced

  • global hub coverage given a fixed relation cache
  • bootstrap speed

Comparisons of SocioCast vs existing networks

This information is to be used in the background chapter and the recommendation chapter.

last.fm

friend network is similar

  • added via hotmail/gmail

interest groups

  • invite policy (open, owner approval, member approval, closed)
  • owner can link to content
  • users can recommend content
  • communication via forums / journal

neighbors

  • view listening behavior of users that listen to similar music

Last.fm recommendations are directed towards a user or group. Tribler approvals and moderations are however undirected. This means these systems are hard to compare.

Last.fm played tracks/artists can be linked to downloads.

Tight social networks are formed by communication between members.

Benefit of having friends in Last.fm is

  • keep track of your friends listening behavior

Benefit of having groups in Last.fm is

  • sense of belonging
  • communication
  • specific recommendations (ska concerts are recommended in the Ska user group)

Ideas for Tribler:

Interest groups are a good idea as this will create small worlds for peers with similar interests. Probably does not make SocioCast resistant against attacks but makes finding relevant information easier.

Share download behavior with friends/group. Might be more privacy sensitive than listening behaviour. Does however give implicit recommendations.

Sending of explicit recommendations to friends/groups. (spam?)

The neighbour algorithm in Last.fm is significantly different from recommendations in Tribler. If monitoring of behavior of users in Tribler it could be enhanced in the Last.fm way..

Easy adding of users by fetching them from gmail or hotmail would be nice, but could give privacy concerns.

Research questions / Hurdles

idea study advisor

Eliminate uncontrollable variables: good/bad peers. Evaluate SRI performance theoretically using a manually constructed network.

  • social relevance of friends
  • social relevance of distant celebrities
  • social relevance of distant non-popular leaf injectors
  • social relevance of a collusion network

explain choice for maxflow sri using the above

Network

Which social network structure will arise if any and can this be influenced by the protocol?

According to research? the existing social networks of Flickr, Youtube, Orkut and Livejournal are scale free and small world networks. Youtube has an interesting property where some peers are celebrities. These celebrities are highly connected but unlike hubs they do not have a higher than random chance to be connected to other highly connected peers.

The problem I foresee is that it is not clear to users why they are building a social network. This may result into users making many friends just for the fun of it or leaving the functionality for what it is and not make any friends at all. A social network can only be used to infer some kind of trust information if relationships are meaningful. Unfortunately this can not be tested or prophesied and therefore only the recommendation can be made to make friendship relations more meaningful.

Unlike friendship relations, moderator approvals are of clear use to users because it will help the user pick moderations from the same moderator for next files. Without a social network these moderator approvals are however useless, because without any ties it is hard to distinguish approvals of honest peers from collusion networks

Which type of peers will be the originators of useful approvals?

Hopefully hubs as these have the highest network coverage.. Can this be tested or made likely to believe?

We assume the big information providers will be in large part the hubs in the network. The reasoning behind this is that these nodes put a lot of effort into building a meaningful social network and are active in the Tribler community. This will make the chance that they produce a lot of meaningful moderator approvals larger than average.

Leaf nodes are less active in the network or potential hubs that have just started. These nodes are highly clustered with their friends which may or may not create specific interest clusters. This will make finding moderator approvals far more easy.

Youtube like celebrities, be it approval or moderator providers. Celebrity moderators means many approvals which is good. Approval celebrities create a big star form network without scalefree characteristics.

We assume the big information providers will be in large part the hubs in the network. The reasoning behind this is that these nodes put a lot of effort into building a meaningful social network and are active in the Tribler community. This will make the chance that they produce a lot of meaningful moderator approvals larger than average.

Leaf nodes are less active in the network or potential hubs that have just started. These nodes are highly clustered with their friends which may or may not create specific interest clusters. This will make finding moderator approvals far more easy. The youtube dataset has favorites, but these point towards movies and not users. Could look at the correlation between amount of friends and amount of favorites to predict moderator approval behavior.

Is the current youtube dataset sufficient?

Doubtful. Peers have favorites, which can not be used to simulate approvals as these link to files and not peers. The correlation between the connectedness of nodes and the amount of (useful) approvals should be extracted.

What is the effect of friend spam and can it be countered?

  • Is britney Spears spam? - Paper that describes ways to determine whether profiles are real or not.
  • Alan Mislove notes this is easily fooled and therefore network centrality and clustering should be used.

Should SocioCast work with bidirectional relationships or unidirectional and how does this affect the choice of real datasets and results?

Protocol

What message size should be chosen for sociocast messages?

No idea.. Small message size will however hamper breadth first search of the social network. Whether this is bad remains to be seen. If hubs are the most interesting nodes peers can choose to send these relations first to direct the discovery algorithm.

Should incoming relations be tracked and propagated?

Incoming relations can only be trusted if they are reachable from the current path. It could help speed up discovery, but then again if peers use breadth first search it should not be necessary.

Should it be made possible to remove relationships and how to do it?

If friend spam becomes common practice it should be possible to remove connections. This could be done by adding a lifetime to relationships so they have to be renewed every few days or something. This could however cause social networks to collapse if central peers remain offline longer than the lifetime of their relationships.

Which relations should be selected to propagate first?

Hubs and peers with (useful?) approvals.

Is bootstrapping a problem, and should peers be bootstrapped?

Without a social network peers will have trouble distinguishing good metadata approvers from bad approvers. Bartercast could help, otherwise the amount of approvals could be compared to rate moderations.

SRI ( Social Relevance Index )

Is a global trust level which can be used for moderator trustworthiness calculation.

2 versions:

BFS - The SRI of node b from the point of view of node a is 1/depth where depth the depth of b in the social graph that is known in a.

Maxflow?

Using maxflow is more interesting. Graph discovery focuses towards highly connected peers and hubs gain a higher SRI. Helps to quickly traverse a scale-free network where BFS would choke.

Bootstrapping

WARNING: guestimates

  • 1% of all people indicate trusted moderators
  • 10% of all people add friends
  • 100% of all peers have barter connections with other peers

Due to the low amount of people that indicate moderators other indicators of trustworthiness should be used. A friend relation does not imply that a peer should be trusted concerning their opinion about moderators and therefore it should be used as a mere indicator. Not part of research though.

Hack to bootstrap.. should be mentioned as a recommendation for bootstrapping, but does not need testing or modeling. Hard to calculate SRI if it is possible to alternate friendship and bartercast connections. Therefore bartercast 'trust' should only be used if no other information is available and should not be mixed with social connections.

Should approvals be used as friendships links? If hubs are clustered and approved these links can help discover other hubs?

Peers that insert valuable approvals into the network may choose to befriend other peers with valuable approvals. Only if this is the case approvals be used in the SRI to determine the relevance of social paths containing 2 or more approvals. This would require a name change for SRI as it will be a specific index for approval relevance and not social relevance.

Epidemic protocol algorithm

  • Sync, push/pull, storage of state versus bandwidth users
  • exchange table
    (keep last 3 days peers, keep top1000 SRI() peers, keep freq online&new friends peers)
  • Purge policies
  • Hot rumor mongering ?
  • Social_relevance_ranking_algorithm
    • In-Direct downloaded X MByte == Y Max_Flow points
    • Direct downloaded X MByte == Y points
    • f-o-a-f = X points
    • Friend = X points
    • Moderator with X approvals == Y points

Architecture:

  • BuddyCast4 links; candidate cache, connect to Highest SRI() 33% of the time.
  • Bootstrap into 1 friend or f-o-a-f
  • BarterCast links

A very interesting problem is: how to efficiently fill the 10MB of data (e.g. bootstrap phase) plus how to keep this data updated when only a few updates per day occur (e.g. maintenance phase). The challenge is not to spend 960 messages per 4 hour BuddyCast cycle on bandwidth.

Only store entries which are most interesting. Problem is lack of info, thus collect info based on social distance.

Possibly:

  • first pull 10MB
  • afterward do rumor mongering
  • do not send outgoing SocioCast messages
  • Only send out when:
    • received hot rumor
    • User updated his social network (voted or added/deleted a friend)
  • possibly use a 1 month expire on SocioCast messages to purge non-active peers
    • possibly expire on both time, social distance, and BarterCast activity.

The epidemic protocol should be push-based because of the sparseness of social information. In the Youtube dataset almost half of the users has no friends. A push-based method will prevent users that have no friends from contacting other users without friends.

SocioCast will be tied to BuddyCast like BarterCast. Buddycast3 connects 50% of the time to random peers and 50% of the time to taste buddies. This is unwanted behavior for SocioCast, because peers should expand their social network tree. Therefore it is proposed that Buddycast4 will connect 33% of the time to random peers, 33% of the time to taste buddies and 33% of the time to peers with a high SRI.

Protocol message layout

The layout is very similar to deployed protocols. Each peer only broadcasts its own relations to others. A message contains 2 parts:

Friends Moderator Approvals
public key 112 bytes MD5(public key) 32 bytes
ip 4 bytes
port 2 bytes
last_seen 4 bytes
connectable 1 byte
total max 10x 123 bytes total max 10x 32 bytes

The size of messages should not exceed 2kb. The sizes of all messageparts are guesstimates, but when 10 friends and 10 moderators are packed the message will not exceed this 2kb limitation.

Design principle for security: Only exchange first hand observations of social network edges.

The reason that the public keys of moderators are encrypted with MD5 are twofold. First it saves a lot of space in the protocol and second to help moderators remain anonymous. Peers need only know the opinion of their friends about moderations they have and thus MD5 is sufficient.

MEGACACHE

SocioCast will require 3 SQL tables to keep track of friends and moderators seperately. All tables use peerid, which is linked to the public key of peers in the peer table.

Friends Table
peerid_from ? bytes
peerid_to 32 bytes
timestamp 4 bytes

Local relations will also be stored in these tables. Timestamp indicates at what time the relation was added to the local database.

Moderator Table
peerid_from ? bytes
moderator_MD5 32 bytes
timestamp 4 bytes

As explained before SocioCast messages do not contain the PermID of moderators. It is possible based on moderations in the local database to deduct what PermID is likely referred to. However due to hash collisions the hash will be stored. Timestamp indicates at what time the relation was added to the local database.

Exchange Table
peerid ? bytes
incoming_timestamp 4 bytes
outgoing_timestamp 4 bytes

xx_timestamp signifies the timestamp of the last relationship that has been sent and receive to and from peer peerid. This table will be purged based on purge actions in the friends table.

Previously considered

Centrality is the social importance of a node. The measures for centrality are: degree centrality, betweenness, closeness, and eigenvector centrality. Random Walk Centrality might be interesting because the centrality of nodes is inversely proportional to the degree of branches along a path and it is determined relative to the observer, not as a global value per node. RWC is also a measure of effectiveness of communication, information from a node with high centrality is propagated more quickly. RWC paper

It is hard to get a global overview of network connections free from tampering. And if unsafe relations are considered centrality and other network analyses may not be accurate enough and influenced.

Youtube dataset

Some statistical graphs have been constructed from a Youtube dataset. This Youtube dataset consists of 239535 users. About a third of all agents have favorites, 81330 users to be exact.

The distribution of the amount of favorites per agent seems to be a powerlaw.

A similar distribution can be seen for favorite holders per movie.

We have also looked at the social network in this dataset. This is however not a complete crawl which has several implications. For each user that has been crawled it is known what his friends are. It is however possible that these friends themselves are not crawled. The networks that have been found are sparser than in reality and due to missing connections they are probably more connected than what is found in the following graphs.

We have first looked at the distribution of social network sizes. The largest network consists of 134246 users of which 105793 users are part of our dataset. That means that we have no information about outgoing and incoming connections from the remaining 28453 users. The small amount of networks of size 1 can be explained due to Youtube allowing users to add themselves as a friend.

This last graph is the result from a peer discovery algorithm performed on the largest network. In each step the algorithm determines the friends of newly discovered peers, the fringe. The algorithm started at a random node, therefore the fringe in step 0 is 1. The social small world experiment done in 1967 at Harvard University determined that social connections are of length 5.5 on average. The average in our graph is slightly higher which can be explained due to our dataset being sparser than in reality. http://en.wikipedia.org/wiki/Small_world_phenomenon

For a better overview of social networks read the paper included below and the following website: http://socialnetworks.mpi-sws.mpg.de/

Remarks

.

Attachments