ohohlfeld.com : blog
I discovered an interesting compilation of data about various popular social networks, obtained from Google Ad Planner and Google Insights. The report is entitled The 2008 Social Network Analysis Report - Geographic - Demographic and Traffic Data Revealed.
The data provided for Facebook is quite interesting: although the site initially targeted colleges, most of its current users seem to be older, according to the information provided by Google. This is even more visible for the micro-blogging service Twitter. On LinkedIn, the majority of users seem to be past college age, earn more money and have a higher education. Considering that LinkedIn is a “network for professionals”, this is not unexpected. However, one has to rely on the validity of data provided by a third party.
There are many social networks available, dedicated to different needs. However, none of them focuses on researchers as its clientele. This seems to be changing with Mendeley, a social network dedicated to scientists. The site is still in an early beta phase and does not have many users yet, but it already seems promising. Mendeley provides a client (also available for Linux, and running fine on my 64-bit Ubuntu installation) that allows managing one's publications and synchronises with the Mendeley profile.
As I want to explore this new network, I created my Mendeley profile just a couple of hours ago. Unlike the experience reported by Daniel Lemire, importing my publications from a BibTeX database was fairly easy. A feature that I am currently missing is the ability to publish a less detailed CV, as is possible on LinkedIn; when providing details about my education or professional experience, I am forced to provide dates as well.

Spam (Image source)
Direct marketing is not a new approach; its history dates back to the 19th century, when the first mail-order catalogues were distributed. Nowadays, unsolicited bulk e-mail annoys Internet users worldwide on a daily basis. While distributing mail-order catalogues involved some cost, the marginal cost of sending an e-mail is tiny. E-mail-based campaigns are therefore profitable even when only a tiny fraction of recipients buys the advertised product. The bad news, as highlighted by Kanich et al., is that “a perverse byproduct of this dynamic is that sending as much spam as possible is likely to maximise profit”. To maximise the reach of their advertisements, spammers have to fight the developers of anti-spam technology: the two sides play a cat-and-mouse game in which spammers must adapt to the latest filtering techniques in order to reach as many people as possible.
However, the continued presence of spam, despite years of energetic deployment of anti-spam technology, demonstrates that spam campaigns are profitable. So the natural question arises: who goes for spam?
This issue is addressed in a paper entitled Spamalytics: An Empirical Analysis of Spam Marketing Conversion, presented at the 15th ACM Conference on Computer and Communications Security on Tuesday, October 28.
The authors are interested in the conversion rate of spam, which is the probability that an unsolicited e-mail will ultimately elicit a sale. To measure it, they infiltrate ongoing spam campaigns sent using the Storm botnet and provide measurements for the different stages of the spam conversion pipeline, as shown in the above figure. In order to understand their methodology, we need to briefly review the way Storm works.

- Storm Botnet Architecture (Source: Kanich et al.)
Storm is a peer-to-peer botnet that propagates via spam. The above figure shows the three primary classes of Storm nodes involved in sending spam: worker bots, proxy bots and master servers. While the worker bots are responsible for actually sending the spam, proxy bots act as conduits between workers and master servers. A host infected by the Storm binary advertised in spam mails becomes either a worker bot (if it is not reachable from the Internet, e.g. due to firewall restrictions) or a proxy bot. As the command and control traffic directed to the worker bots is unencrypted and always passes through a proxy bot, a man-in-the-middle attack is possible, and Kanich et al. carry it out in the paper: by rewriting the command and control traffic directed to worker bots, they could change spam templates, dictionaries and addresses and adapt them to their needs.
Their methodology can be summarised as follows. They hosted a set of Storm proxy bots, created duplicates of the websites advertised in the spam, and rewrote the command and control traffic so that the worker bots advertised their sites instead of the original ones. Thus, no user received more spam, but some users received spam that was less dangerous than it would otherwise have been.
Over the course of their experiment, they rewrote the content of about 470 million spam mails sent in three campaigns: about 347 million messages in a pharmacy campaign, and about 83 million and 38 million messages in two Storm self-propagation campaigns using postcard and April Fool's Day themes, respectively. They received 28 purchases on the fake page for the advertised pharmaceutical product and 541 infections with the fake Storm binary, geographically distributed as shown below:

This translates into the following conversion rates (caution: the results are not intended to be generalised to other contexts!):
- 1 in 12,500,000 pharmacy spams led to a purchase.
- 1 in 265,000 greeting card spams led to an infected machine.
- 1 in 178,000 April Fool’s Day spams led to an infected machine.
- 1 in 10 people visiting an infection website downloaded the executable and ran it.
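The "1 in N" figures are simply the number of messages sent divided by the number of conversions. As a minimal sketch, using only the rounded pharmacy numbers quoted in this post (about 347 million messages and 28 purchases), the arithmetic looks as follows; the function name `one_in_n` is made up for this illustration.

```python
def one_in_n(messages_sent, conversions):
    """Return N such that roughly 1 in N messages led to a conversion."""
    if conversions <= 0:
        raise ValueError("need at least one conversion to compute a rate")
    return messages_sent / conversions


# Rounded figures from the post: ~347 million pharmacy spams, 28 purchases.
pharmacy = one_in_n(347_000_000, 28)
print(f"about 1 in {pharmacy:,.0f} pharmacy spams led to a purchase")
```

With these rounded inputs the result is roughly 1 in 12.4 million, close to the 1 in 12.5 million listed above; the small difference presumably comes from rounding the message count.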
Much more information can be found in their paper (see below), such as the top 10 most targeted e-mail address domains, filtering statistics at each stage of the conversion pipeline, statistics about the efficiency of the anti-spam methods deployed by typical free e-mail providers (e.g. Hotmail and Google Mail), the time-to-click distribution (the first users visited the advertised page 10 seconds (sic!) after the spam was sent), the effects of blacklisting and more.
The paper is very well written and provides new insights into how spam works. Interested readers should therefore consider reading this piece of well-conducted research.
Source: C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. Voelker, V. Paxson, S. Savage. Spamalytics: An Empirical Analysis of Spam Marketing Conversion. 15th ACM Conference on Computer and Communications Security 2008, Alexandria, VA, USA. [Summary, PDF Paper, BibTeX]
Why do TCP flows, which carry the vast majority of Internet traffic, transmit at the rates they do? Suppose a user upgrades their access line to one that can carry 16 Mbit/s downstream. When retrieving data from the Internet, however, the user observes that the maximum throughput is not reached. What are possible reasons for this behaviour? How can the root causes of low throughput be identified? Knowledge of the factors that determine TCP’s throughput is valuable for users as well as network operators.
Answers to these questions are provided by tools for TCP Root Cause Analysis. While some factors are obvious, others need further investigation. Identifying the causes that explain the throughput observed at a given time instant is non-trivial when traffic can only be observed at a particular point in the network, without access to, e.g., the client host.
TCP’s throughput may be limited for several reasons. A server might not have enough bandwidth to saturate the user’s access line, or a link connecting the user to the server might be a bottleneck and thus limit the throughput. High load in the network can cause congestion, and a TCP sender will limit its sending rate in congestion avoidance phases. Furthermore, the sending rate might simply be application limited, e.g. in the case of a voice over IP client, which sends small amounts of data at very frequent intervals and therefore does not attempt to use all of the available network resources. Such application limitation is a major cause of throughput limitation (Siekkinen, PAM 2007).
While most of the rate-limiting factors cannot be controlled by the user, some can. The amount of unacknowledged data that can be outstanding at any time is defined by the TCP window. Assume the window size is 64 KB and the round-trip time (RTT) is one second. At most 64 KB can then be transferred per second, because the sender has to wait for acknowledgements before it may send more data. Thus the RTT, which increases with the distance to the remote server, can also be a limiting factor. Moreover, if the window size is low, either at the sender (sender window) or at the receiver (receiver window), the application will also experience low throughput. In contrast to other limiting factors, the size of the (advertised) receiver window can be directly controlled by the user, and a misconfiguration might be one possible cause of low throughput. The pioneering work by Zhang et al. found that congestion and limited windows are common causes of low throughput in observed TCP connections and showed that congestion is not always the cause of throughput limitations. Therefore, the view that throughput is limited by the network only is too restrictive.
Some root causes can be identified using a nice online tool. It allows one to quickly obtain some TCP connection statistics (such as RTT and congestion window measurements) for a TCP Root Cause Analysis by simply accessing a web page. The tool can be accessed here. An extended tool is proposed by Siekkinen et al.
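The relation between window size, RTT and throughput described above can be written down in a few lines. The following is only a back-of-the-envelope sketch: it ignores slow start, losses and protocol overhead, the function names are made up for this illustration, and the 100 ms RTT in the second example is an assumed value rather than a measurement.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput: at most one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes / rtt_seconds


def required_window_bytes(bandwidth_bits_per_s, rtt_seconds):
    """Window needed to keep a link of the given bandwidth busy."""
    return bandwidth_bits_per_s / 8 * rtt_seconds


# The example from the text: a 64 KB window and an RTT of one second.
print(max_throughput_bytes_per_s(64 * 1024, 1.0), "bytes/s at most")

# Assumed scenario: the 16 Mbit/s access line with an RTT of 100 ms.
print(required_window_bytes(16_000_000, 0.1), "bytes of window needed")
```

Under these assumptions, a 64 KB window would not be enough to fill the 16 Mbit/s line at 100 ms RTT, which illustrates how a small receiver window alone can explain low throughput.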
Readers interested in TCP Root Cause Analysis should consider reading the following papers:
Will future gadgets be the “face of big brother”? Jonathan Zittrain, professor of Internet law at Oxford Internet institute, published a book entitled The Future of the Internet—And How to Stop it. The online version of the book is distributed under the Creative Commons Attribution Non-Commercial Share-Alike 3.0 license and can be downloaded here.
In this book, Zittrain worries about whether the Internet can survive the freedom that produced it. Its openness is one reason why it spread beyond nerds and their friends and allowed other people to exploit the space later on. In the beginning, there was a general-purpose machine (the PC), which was restricted only by its hardware limitations and had no locks limiting its use to what the manufacturer accepted.
Nowadays, there are high-tech gadgets that are usually more usable and more seamless. On the other hand, they are also controlled by their manufacturers, who are free to introduce locks. The introduction illustrates this with the iPhone as an example of a gorgeous but restricted high-tech gadget. In order to make the phone a more user-friendly device, Apple controls everything that runs on it, to prevent people from uploading crap that would stop the phone from operating properly. The first version was not open to third-party software at all; this restriction was relaxed later on. However, there is still the enforced limitation that third-party software cannot be exchanged directly between users. If Bob writes a great piece of software for secure instant messaging, he cannot give it directly to Alice but needs to share it through the iTunes App Store, which is run by Apple. Even though this is less restrictive than disallowing third-party software altogether, Apple can still control what runs on the iPhone and kill any software that is not wanted, and governments can compel it to do so. This is a difference from the general-purpose machine we know as the personal computer.
The Internet follows an innovation model that he calls the “two people in a room” model, which differs from the models preferred in most enterprises. Metaphorically speaking, the two people in that imaginary room are hacking without following exact plans and without even having a business plan or pre-defined milestones. This differs from the way the CEOs interviewed by Zittrain deal with innovation. In their models, everything has to be planned ahead of time and well justified. But most of the interesting innovations of the last decades came from people who were not planning very far ahead.
Some people were working on KaZaA and wounded the music industry in more than a week. When they were done, they were about to take on the telecommunication industry and came up with Skype. This is the disruptive and unpredictable nature of the Internet as we know it. The question discussed in the book is how this might change in the future Internet.
Zittrain gives a number of scenarios for the Internet of the future, of which a plausible one is the “not with a bang, but with a whimper” scenario. In this scenario, we might end up in an ecosystem where innovation takes place in the way known from many other industries: there is competition among a bunch of firms and, occasionally, they come up with something good. As Zittrain said in an interview with National Public Radio (NPR), he worries that, metaphorically, we will end up with a technical elite class whose members swap files with each other, while the mainstream has only a narrow connection to it. The mainstream would be using platforms where innovation takes place very slowly and which offer a great capacity for monitoring and control.
Zittrain writes on page 5:
A lockdown on PCs and a corresponding rise of tethered appliances will eliminate what today we take for granted: a world where mainstream technology can be influenced, even revolutionized, out of left field. Stopping this future depends on some wisely developed and implemented locks, along with new technologies and a community ethos that secures the keys to those locks among groups with shared norms and a sense of public purpose, rather than in the hands of a single gatekeeping entity, whether public or private.
Although I have not finished reading the book yet, it appears to be well written and worth reading for people interested in discussions on the future Internet.
Inspired by some work presented at IWQoS dealing with social networks and small-world characteristics, I zoned out and wondered whether someone had analysed Facebook and, e.g., confirmed the six degrees of separation assumption stated by Milgram. In 2006, an analysis of one million profiles of the German Facebook clone StudiVZ was presented in [0]. The findings provide interesting insights into StudiVZ, but the presented evaluation does not include an extensive social network analysis. As the number of users on Facebook is much higher than on StudiVZ and, from an international perspective, Facebook is more widely known, I would expect more work dealing with Facebook that gives more interesting insights into today’s social networks.
A student work presented at the University of Oslo by Sasan Zarghooni [1] focuses on self-presentation management on Facebook. Self-presentation management is understood as the management of the impression a person makes on other people. An introduction of the classical theory proposed by Goffman [2] is followed by a discussion aiming to show whether this theory can explain the self-representational behaviour observed on Facebook.
Goffman introduced a dramaturgical approach in [2], in which he compared self-presentation to stage acting. An actor plays a role for a specific audience in a front stage area and retreats to a backstage area, where he changes his behaviour. This concept can be illustrated by the example of a teacher who acts in an authoritarian manner in an unruly class (front stage) but behaves differently at a family reunion. The concept of front stages and backstages helps to understand why people behave differently in different places.
Some findings presented in [1]:
- “The e-mail like messaging system on Facebook allows for backstage interaction, and this way two friends may discuss the darkest secrets of their lives on Facebook without any other friends knowing.”
- A study by Ellison [3] “found that Facebook led to a substantial increase in subjective well-being and self-esteem for shy people (…) because Facebook provides users with better control over how they self-present”
- “It could suggest [A study by Walther [4]] that people consider their pictures to be the most important way of self-presenting: those who perceive themselves photogeneous do not engage heavily in other forms of self-presentation because they have already done a successfull self-presentation, whereas those who consider themselves less attractive wish to compensate”.
The work in [1] clearly states that “the more contacts or friends we have, the stronger is the need to segregate those who receive a particular self-presentation from those who receive another one”. This is the main reason why I believe that the discussion should move away from a particular medium (e.g. Facebook) towards a more macroscopic view. Different social networks provide different stages for different types of roles: business networks such as Xing or LinkedIn are used to manage a business role, whereas Facebook and StudiVZ appear to be used more for managing a role revealed to (closer) friends.
All in all, [1] is a well-written student paper that is easy to read and gives a good introduction to Goffman’s theory of self-presentation.
[0] StudiVZ analysis
[1] Sasan Zarghooni, “A Study of Self-Presentation in Light of Facebook“, University of Oslo, 2007
[2] Goffman, E: “The Presentation of Self in Everyday Life”, 1982
WordPress 2.3 (the blog software I’m using) introduced the concept of canonical URLs, which caused a problem on my host after upgrading. To sum it up, there are many possible ways to access an article or a page in WordPress and the concept of canonical URLs will redirect pages to the permanent link specified in the settings. After upgrading to WordPress >2.3, my blog was no longer reachable, due to an infinite redirection loop:
$ wget blog.ohohlfeld.com
--2008-06-24 18:43:50--  http://blog.ohohlfeld.com/
Resolving blog.ohohlfeld.com... 83.236.4.78
Connecting to blog.ohohlfeld.com|83.236.4.78|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://blog.ohohlfeld.com/ [following]
--2008-06-24 18:43:51--  http://blog.ohohlfeld.com/
Reusing existing connection to blog.ohohlfeld.com:80.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://blog.ohohlfeld.com/ [following]
--2008-06-24 18:43:51--  http://blog.ohohlfeld.com/
Reusing existing connection to blog.ohohlfeld.com:80.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://blog.ohohlfeld.com/ [following]
(...) repeated indefinitely
This problem is caused by a Host header in the HTTP request that mod_proxy sets in an unexpected way. When mod_proxy is used, the Apache server running the proxy forwards requests to the corresponding internal web servers. In this internal HTTP request, the Apache proxy sets Host: to the name of the internal host, so HTTP_HOST contains a different host name from the one sent by the requesting browser. The original host name is passed on by mod_proxy in X-Forwarded-Host. Thus, this is a special issue caused by this uncommon Apache configuration.
As WordPress relies on a correctly set HTTP_HOST in some situations (redirects by the URL canonicalization, calling wp-cron, the self tag in Atom feeds, …), HTTP_HOST should be overwritten when a valid X-Forwarded-Host header is present in the HTTP request, as Andy pointed out. Basically (without further sanity checks), this can be accomplished by adding the following code to wp-config.php:
// Behind mod_proxy: restore the host name originally requested by the client.
if ( isset( $_SERVER['HTTP_X_FORWARDED_HOST'] ) ) {
    $_SERVER['HTTP_HOST'] = $_SERVER['HTTP_X_FORWARDED_HOST'];
}
Moreover, when plugins such as Wassup rely on acquiring the IP address of the calling user, REMOTE_ADDR should be overwritten as well, since it will otherwise always point to the Apache server running mod_proxy. I solved this by adding the following code to wp-config.php:
$_SERVER['REMOTE_ADDR'] = zen_get_ip_address();
The zen_get_ip_address() function will return the correct IP address.
However, as already mentioned, these changes are not necessary when the HTTP header fields discussed above are not modified by an intermediate proxy, which should be the normal case!
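The snippet above deliberately leaves out the "further sanity checks". To illustrate what such checks might look like, here is a small language-neutral sketch in Python rather than PHP. Everything in it is an assumption made for illustration: the function name `effective_host`, the simple host name pattern, and the idea of only trusting X-Forwarded-Host when the request arrives from a proxy address we operate ourselves.

```python
import re

# Hypothetical pattern: letters, digits, dots and hyphens, plus an optional port.
HOST_PATTERN = re.compile(r"[A-Za-z0-9.-]+(:\d+)?")


def effective_host(headers, remote_addr, trusted_proxies):
    """Return the host name the client most likely requested.

    headers: dict of request headers (exact-case keys, an assumption here)
    remote_addr: IP address of the peer that connected to this server
    trusted_proxies: set of IP addresses of proxies we run ourselves
    """
    host = headers.get("Host", "")
    forwarded = headers.get("X-Forwarded-Host", "")
    # Only believe the forwarded header if the request came from our own proxy.
    if forwarded and remote_addr in trusted_proxies:
        # Chained proxies may append values; take the first entry.
        candidate = forwarded.split(",")[0].strip()
        if HOST_PATTERN.fullmatch(candidate):
            return candidate
    return host


# Placeholder names and addresses, not the real setup of this blog.
request_headers = {"Host": "internal.example", "X-Forwarded-Host": "blog.example.org"}
print(effective_host(request_headers, "10.0.0.1", {"10.0.0.1"}))
```

The point of the design is that a forwarded header is client-controllable data unless it was set by a proxy under our own control, so the fallback is always the plain Host value.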
I was reading the 1/2008 (March) issue of EURESCOM mess@ge covering The Future Internet today. An article by Milon Gupta introduced the OMEGA project, which is running from 2008 to 2010 and focuses on “Gigabit speed at home without cable clutter”. The project is motivated by the fact that the home network could become the bottleneck in the future, high-speed Internet, as “many devices are limited to wireless transmission rates of 54 megabit per second, or require troublesome wiring to achieve higher rates”. The article proposes the following solution to install comfortable, high-speed home networks: “OMEGA will overcome these limitations by increasing the speed to one gigabit per second and by connecting home devices to the Internet and to each other through power line communications and wireless connections”.
Power line communication is a technique where, roughly speaking, data is transmitted over electric power lines instead of dedicated and more appropriate network cables. This approach is convenient, as a network of power lines is already installed in today’s homes and devices can be plugged in anywhere to get connectivity without having to run network cables. However, using power lines as a carrier for wideband signals, such as high-speed network communication, is a questionable approach: power lines are untwisted and unshielded and thus form a large antenna that radiates the signals sent over them. There is therefore a high potential for interfering with other radio services or for suffering interference from them. Power line communication can make HF radio services, which allow worldwide communication and thus enable invaluable services such as emergency networks, unusable. Is it really worth losing an invaluable resource just to send data over inappropriate wires? More information can be found here:
Is sending wideband signals over untwisted and unshielded wires really the pinnacle of electrical engineering? Maybe research should look for alternatives instead of suggesting the deployment of the next-best but inappropriate solution?
PS: Does this highlight the need to revisit the song released by the Buggles in 1979 entitled Video Killed the Radio Star?
I have just returned from the meeting of VDE/ITG’s Next Generation Internet section. A prototype P2P SIP implementation was shown. The implementation was straightforward, using classical techniques such as a Bamboo DHT for storing accounting data, hybrid encryption of the communication and the IETF P2P SIP location data tag for exchanging location information.
Other, highly interesting topics of this meeting were carrier Ethernet (the IEEE 802.1(Qay) protocols) and NGOSS.
The last IETF meeting was held in Philadelphia last week. Recall my previous posting about turning the protocol stack upside-down. Jonathan Rosenberg, who is with Cisco, presented a new trend in a talk at the meeting: TCP over UDP. The main motivation for this approach is NAT traversal, which works pretty well with UDP. Unfortunately, UDP lacks, among other things, flow control, a convenience that application-level programmers do not want to miss. Combining both advantages (NAT traversal and, e.g., flow control) results in tunnelling TCP over UDP. Thus, there is one more best-effort datagram protocol layer in between. Does this endeavour highlight the need for a new Internet, as researchers in the Future Internet sector are claiming?
© 2001-2008 by Oliver Hohlfeld, B.Sc.