ohohlfeld.com : blog
I discovered an interesting compilation of data about various popular social networks, obtained from Google Ad Planner and Google Insights. The report is entitled The 2008 Social Network Analysis Report - Geographic - Demographic and Traffic Data Revealed.
The data provided for Facebook seems quite interesting: while the site initially targeted colleges, most of its current users seem to be older, according to the information provided by Google. This is even more visible for the micro-blogging service Twitter. On LinkedIn, the majority of users seem to be past college age, earn more money and have a higher education. Considering that LinkedIn is a “network for professionals”, this is not unexpected. However, one has to rely on the validity of data provided by a third party.
There are a lot of social networks available, dedicated to different needs. However, none of them focuses on researchers as its clientele. This seems to be changing with Mendeley, a social network dedicated to scientists. The site is still in an early beta phase and does not have many users yet, but it already seems promising. Mendeley provides a client (which is also available for Linux and runs fine on my 64-bit Ubuntu installation) that allows managing one’s publications and synchronises with the Mendeley profile.
As I want to explore this new network, I created my Mendeley profile just a couple of hours ago. Unlike what Daniel Lemire experienced, importing my publications from a BibTeX database was fairly easy. A feature that I’m currently missing is the option to publish a less detailed CV, as is possible on LinkedIn; when providing details about my education or professional experience, I’m also forced to provide dates.

Do we need mathematicians? This question was addressed by Ian Stewart, professor of mathematics at the University of Warwick, UK, during this year’s Queen’s Lecture at the Technical University of Berlin. The annual Queen’s Lecture has been held in Berlin since 1965 and was initially a gift from Queen Elizabeth II on the occasion of her visit to Berlin. This year, the attendees were looking forward to getting an answer to Do We Need Mathematicians: will Ian Stewart make his profession obsolete by answering with no, or will there be more support for mathematics in the German Year of Mathematics?
Of course, everyone knew the answer, as many of the luxuries of the 20th century are based on mathematics, and Ian Stewart immediately rephrased his opening question to Why Do We Need Mathematicians? Asking this may be essential at a time when many countries are worried about decreasing numbers of first-semester students in “tough subjects” such as mathematics.
He then quickly moved on to present some common beliefs and showed why they are wrong:
- The only job one can get with a degree in mathematics is school teacher: Fewer than 5% of British students graduating in mathematics go into school teaching, while 25% are employed in the financial sector.
- We did all the maths in school, nothing new happens in mathematics, and if there were something new, we would have heard about it; why would someone invent even more maths when there is already too much: About 1-2 million pages of mathematics are published per year.
- We don’t use maths in our daily life: Mathematics is the Cinderella that doesn’t come to the ball. It works behind the scenes and only some will notice it.
- We have computers: Drugs don’t make doctors obsolete, telescopes don’t make astronomers obsolete and microscopes don’t make biologists obsolete. In fact, a computer is just a tool that helps the mathematician get routine work done and concentrate on actually solving problems. But one always needs a professional to use a tool efficiently.

Ninety Mile Beach in Australia (Source: Wikipedia)
A large part of his talk was devoted to demonstrating where mathematics is used in daily life, following the example of going on holiday: from the early stage of looking for possible locations, through flying to the actual destination, to lying on the beach and taking pictures. In the course of his demonstration he showed that maths is everywhere.
When using the Internet for holiday planning, a user will use a search engine whose results are ranked by the PageRank algorithm, and the data communication with the travel agency might be secured using several cryptographic techniques.

Aeroplane (Source: Wikipedia)
When taking an aeroplane to the chosen destination, one encounters a huge amount of complexity. a) Why does the aeroplane fly? One needs the principles of fluid dynamics and the Navier-Stokes equations. A digital wind tunnel is used to simulate and study the aerodynamics of a planned aircraft, so the Navier-Stokes equations are solved in the simulator. b) How do they navigate? A good deal of number theory is involved in the Global Positioning System. c) How do they know when to send which plane where? Addressing this subject involves network analysis, which is a hot topic in mathematics and ranges from computer networks such as the Internet to biological networks such as those describing the spread of epidemics.

Digital Camera (Source: Wikipedia)
Having arrived at the beach, one wants to take souvenir photos, but how does the camera manage to store so many pictures on one memory card? Shrinking the amount of data without losing the picture involves complex techniques of data compression, which Ian Stewart demonstrated by explaining the basics of JPEG.
When the Queen’s Lecture was over, there was a nice reception where the university orchestra played traditional and modern British music and the university served food and drinks.
All in all, I really enjoyed listening to his talk. He presented a nice collection of examples showing why mathematics is necessary. However, I believe the talk was addressed to a different audience and was somewhat less relevant for students and employees at a technical university, who are very familiar with the presented topics. From a computer scientist’s perspective, the examples were well known and thus could be considered boring, but the way he presented them was not. In fact, his audience consisted of exactly those people who did not need to be convinced that mathematics is important. Nevertheless, he is doing a great job in actually communicating science, and I believe such a powerful talk will have a very strong effect on an audience that is less technophile.

Spam (Image source)
Direct marketing is not a new approach; its history dates back to the 19th century, when the first mail-order catalogues were distributed. Nowadays, unsolicited bulk e-mail annoys Internet users world-wide on a daily basis. While distributing mail-order catalogues involved some costs, the marginal cost of sending an e-mail is tiny. Therefore, e-mail based campaigns are profitable even when only a tiny fraction of recipients goes for the advertised product. The bad news, as highlighted by Kanich et al., is that “a perverse byproduct of this dynamic is that sending as much spam as possible is likely to maximise profit”. The developers of anti-spam software thus play a cat-and-mouse game with the senders of spam, who have to adapt to the latest spam filtering technologies in order to reach as many people as possible.
However, the continued presence of spam, despite years of energetic deployment of anti-spam technology, demonstrates the profitability of spam campaigns. So the natural question arises: who goes for spam?
This issue is addressed in a paper entitled Spamalytics: An Empirical Analysis of Spam Marketing Conversion, presented at the 15th ACM Conference on Computer and Communications Security on Tuesday, October 28.
The authors are interested in the conversion rate of spam, which is the probability that an unsolicited e-mail will ultimately elicit a sale. To this end, they infiltrate ongoing spam campaigns sent using the Storm botnet to provide measurements for the different stages of the spam conversion pipeline, as shown in the above figure. In order to understand their methodology, we need to briefly review the way Storm works.

Storm Botnet Architecture (Source: Kanich et al.)
Storm is a peer-to-peer botnet that propagates via spam. The above figure shows the three primary classes of Storm nodes involved in sending spam: worker bots, proxy bots and master servers. While the worker bots are responsible for actually sending the spam, proxy bots act as conduits between workers and master servers. When a user downloads the Storm binary advertised in spam mails, the infected host becomes either a worker bot (if it is not reachable from the Internet, e.g. due to firewall restrictions) or a proxy bot. As the command and control traffic directed to the worker bots is unencrypted and always passes through a proxy bot, a man-in-the-middle attack is possible, and Kanich et al. carry one out in the paper: by rewriting the command and control traffic directed to worker bots, they could change spam templates, dictionaries and addresses and adapt them to their needs.
Their methodology can be summarised as follows. They hosted a set of Storm proxy bots, created duplicates of websites advertised in spam and rewrote the command and control traffic so that the worker bots advertised their sites instead of the original ones. Thus, no user received more spam, but some users received spam that was less dangerous than it would otherwise have been.
Over the course of their experiment, they rewrote the content of about 470 million spam mails sent in three campaigns: about 347 million in a pharmacy campaign, and about 83 million and 38 million in two Storm self-advertisement campaigns themed on postcards and April Fool’s Day, respectively. They received 28 purchases on the fake page for the advertised pharmaceutical product and 541 infections with the fake Storm binary, geographically distributed as shown below:

This translates into the following conversion rates (caution: results are not intended to be generalised in other contexts!):
- 1 in 12,500,000 pharmacy spams led to a purchase.
- 1 in 265,000 greeting card spams led to an infected machine.
- 1 in 178,000 April Fool’s Day spams led to an infected machine.
- 1 in 10 people visiting an infection website downloaded the executable and ran it.
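The pharmacy figure can be roughly reproduced from the numbers quoted in this post. Here is a minimal Python sketch; it uses only the rounded counts given above (about 347 million mails, 28 purchases), so the result is approximate, and the function name `one_in` is my own invention, not something from the paper.

```python
def one_in(sent: int, conversions: int) -> float:
    """Return N such that one in N sent mails led to a conversion."""
    if conversions <= 0:
        raise ValueError("need at least one conversion")
    return sent / conversions


# Rounded figures as quoted in this post (not the exact counts from the paper).
pharmacy_sent = 347_000_000
pharmacy_purchases = 28

n = one_in(pharmacy_sent, pharmacy_purchases)
print(f"1 in {n:,.0f} pharmacy spams led to a purchase")
print(f"conversion rate: {100 / n:.7f} %")
```

With these rounded inputs the ratio comes out at roughly 1 in 12.4 million, which is in line with the 1 in 12,500,000 listed above.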
Much more information can be found in their paper (see below), such as the top-10 most targeted e-mail address domains, filtering statistics at each stage of the conversion pipeline, statistics about the efficiency of anti-spam methods deployed by typical free e-mail providers (e.g. Hotmail and Google Mail), the time-to-click distribution (the first users visited the advertised page 10 seconds (sic!) after the spam was sent), the effects of blacklisting and more.
The paper is very well written and leads to new insights into how spam works. Interested readers should therefore consider reading this piece of well-conducted research.
Source: C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. Voelker, V. Paxson, S. Savage. Spamalytics: An Empirical Analysis of Spam Marketing Conversion. 15th ACM Conference on Computer and Communications Security 2008, Alexandria, VA, USA. [Summary, PDF Paper, BibTeX]
Internet cartography is a serious and challenging research topic. Such maps are useful, for example, for evaluating applications and algorithms in simulation. As Rob Sherwood, who is now with Deutsche Telekom USA, pointed out when taking questions after his ACM SIGCOMM 2008 talk on DisCarte: we have conferences dealing with Internet measurement, but we don’t know exactly what the Internet looks like. While traceroute, as an active measurement technique, reveals parts of the Internet topology, other parts stay hidden because some routers don’t respond to traceroute’s probes. Moreover, traceroute reveals only interfaces and not the routers themselves. As routers are usually equipped with many interfaces, mapping the interfaces obtained from traceroute probes to routers is another issue that makes Internet cartography a complex activity. In his paper, Rob Sherwood proposed a combined approach that makes use of the Record Route (RR) mechanism provided by IP (cf. RFC 791) and outperforms current mapping techniques. The Record Route option maintains a list of IP addresses (up to a certain length) that records the route of an Internet datagram. Obviously, it is merely a matter of time until network providers who want to obfuscate their backbone topology configure their routers to no longer append their addresses to the RR list, but I agree with Sherwood’s view: having a possibly outdated but rather accurate map of the Internet is better than nothing.
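To make the Record Route mechanism a bit more concrete, here is a small Python sketch that reads the addresses recorded so far out of the raw bytes of an RR option. It follows the option layout in RFC 791 (type 7, a length byte, a pointer byte, then 4-byte address slots), and since the IP header leaves at most 40 bytes for options, at most nine addresses fit. The function name and the sanity checks are my own; this is not code from DisCarte.

```python
import ipaddress
from typing import List

RR_OPTION_TYPE = 7  # Record Route option type defined in RFC 791


def parse_record_route(option: bytes) -> List[str]:
    """Return the addresses recorded so far in a Record Route option."""
    if len(option) < 3 or option[0] != RR_OPTION_TYPE:
        raise ValueError("not a Record Route option")
    length, pointer = option[1], option[2]
    # Sanity checks (my own choice): the length byte must match the buffer,
    # the smallest legal pointer is 4, and slots are 4 bytes wide.
    if length != len(option) or pointer < 4 or (pointer - 4) % 4 != 0:
        raise ValueError("malformed Record Route option")
    # The pointer is 1-based and marks the next free slot, so recorded
    # addresses occupy the bytes from offset 3 up to offset pointer - 1.
    end = min(pointer - 1, length)
    recorded = option[3:end]
    return [str(ipaddress.IPv4Address(recorded[i:i + 4]))
            for i in range(0, len(recorded), 4)]


# Hypothetical option with room for three addresses, two already recorded.
example = (bytes([7, 15, 12]) + bytes([10, 0, 0, 1])
           + bytes([192, 0, 2, 1]) + bytes(4))
print(parse_record_route(example))
```

Once the pointer exceeds the length, the list is full and later routers forward the datagram without recording themselves, which is one reason why RR alone cannot map long paths.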
While traceroute represents an active measurement approach, partial information about the network topology can also be obtained from passive measurements, as Brian Eriksson of the University of Wisconsin-Madison showed in his SIGCOMM 2008 paper. However, a certain amount of active probing is still necessary, as Eriksson highlighted in his talk.
Map sources:
- A source of publicly available topologies of commercial and research backbone networks, which can be directly fed into simulators, can be found here.
- Internet Mapping Project: A large number of maps, maintained until 2000.
- Backbone maps as pixel graphics can be found here.
- EPFL’s netscope is designed to monitor overlay networks.
If you know of more mapping sources, please place a link in the comments of this post. Many thanks in advance.
Why do TCP flows, which carry the vast majority of Internet traffic, transmit at the rates they do? A user upgrades his access line to a pipe that is able to carry 16 Mbit/s downstream. However, when retrieving data from the Internet, the user observes that the maximum throughput is not reached. What are possible reasons for this behaviour? How can root causes for low throughput be identified? Knowledge of the factors that determine TCP’s throughput is therefore valuable for users as well as network operators.
Answers to these questions are provided by tools for TCP Root Cause Analysis. While some factors are obvious, others need further investigation. However, identifying possible causes that explain the observed throughput at a given instant is non-trivial when traffic can only be observed at a particular point in the network, without access to e.g. the client host.
TCP’s throughput may be limited for several reasons. A server might not have enough bandwidth to saturate the user’s access line, or a link connecting the user to the server might be a bottleneck and thus limit the throughput. High load in the network can cause congestion, and a TCP sender will limit its sending rate in congestion avoidance phases. Furthermore, the sending rate might simply be application limited, e.g. in the case of a voice over IP client that sends small amounts of data at short intervals, so that the application does not attempt to use all of the available network resources. Such application limitation is a major cause of throughput limitation (Siekkinen, PAM 2007).
While most of the rate limiting factors cannot be controlled by the users, some can. The amount of unacknowledged data that can be outstanding at any time is defined by the TCP window. Assume the window size is 64 KB and the round trip time (RTT) is one second. Then at most 64 KB can be transferred per second, no matter how fast the access line is. Thus the RTT, which increases with increasing distance to the remote server, can also be a limiting factor. Moreover, if the window size is low, either at the sender (sender window) or at the receiver (receiver window), the application will also experience low throughput. In contrast to other limiting factors, the size of the (advertised) receiver window can be directly controlled by the user, and a misconfiguration might be one possible cause of low throughput. The pioneering work by Zhang et al. found that congestion and limited windows are common causes of low throughput in observed TCP connections and showed that congestion is not always the cause of throughput limitations. Therefore, the view that throughput is limited by the network only is too restrictive.
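The window argument boils down to simple arithmetic: at most one window of data can be in flight per round trip. Here is a short Python sketch of that bound, taking the window in the example above to be 64 kilobytes (65,536 bytes); the function names and the 100 ms RTT in the second call are my own assumptions for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput in bit/s: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(rate_bps: float, rtt_seconds: float) -> float:
    """Window (bandwidth-delay product) needed to sustain rate_bps."""
    return rate_bps * rtt_seconds / 8


# 64 KB window and 1 s RTT, as in the example above.
print(max_throughput_bps(64 * 1024, 1.0), "bit/s")

# Window needed to fill a 16 Mbit/s line at an assumed RTT of 100 ms.
print(window_needed_bytes(16_000_000, 0.1), "bytes")
```

The first call gives about 0.5 Mbit/s, far below a 16 Mbit/s access line. The second gives a window of about 200,000 bytes, which is more than fits into the 16-bit window field of the TCP header, so filling such a line at that RTT requires the window scale option.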
Some root causes can be identified using a nice online tool. It allows one to quickly obtain some TCP connection statistics (like RTT and congestion window measurements) for performing a TCP Root Cause Analysis by simply accessing a web page. The tool can be accessed here. An extended tool is proposed by Siekkinen et al.
Readers interested in TCP Root Cause Analysis should consider reading the following papers:
As SIGCOMM 2008, held in Seattle this year, is getting closer, I noticed that the accepted papers are now available online. They can be accessed here. Researchers from my group at Deutsche Telekom Laboratories will present their Time Machine, which allows later inspection of network activity that becomes interesting in retrospect.
Edit: Several papers are reviewed in the blog of Michael Mitzenmacher.
I received a pointer to a mindmap illustrating steps that should be considered when writing a good research paper. This mindmap can be seen here.
Will future gadgets be the “face of big brother”? Jonathan Zittrain, professor of Internet law at the Oxford Internet Institute, published a book entitled The Future of the Internet - And How to Stop It. The online version of the book is distributed under the Creative Commons Attribution Non-Commercial Share-Alike 3.0 license and can be downloaded here.
In this book, Zittrain worries about whether the Internet can survive the freedom that produced it. Its openness is one reason why it spread beyond nerds and their friends, allowing other people to exploit the space later on. In the beginning, there was a general purpose machine (the PC), which was restricted only by its hardware limitations and did not have locks to limit its usage to what the manufacturer accepted.
Nowadays, there are high-tech gadgets that usually offer better usability and a more seamless experience. On the other hand, they are also controlled by their manufacturers, who are free to introduce locks. This is illustrated in the introduction by taking the iPhone as an example of a gorgeous but restricted high-tech gadget. In order to make the phone a more user friendly device, Apple controls everything running on it, to prevent people from uploading crap that will stop the phone from operating properly. The first version was not open to any third-party software at all, a restriction that was relaxed later on. However, there is still the enforced limitation that third-party software cannot be exchanged directly between users. When Bob has written a great piece of software for security enabled instant messaging, he cannot give it directly to Alice but needs to share it via the iTunes application store, run by Apple. Even though this is less restrictive than disallowing third-party software altogether, Apple can still control what is run on the iPhone and kill any software that is not wanted, and governments can force Apple to do so. This is the difference from the general purpose machine we know as the personal computer.
The Internet follows an innovation model that he calls the “two people in a room” model, which differs from the ones preferred in most enterprises. Metaphorically speaking, the two people in that imaginary room are hacking without following exact plans, and without even having a business plan or pre-defined milestones. This is different from the way the CEOs Zittrain asked deal with innovation. In their models, people need to ensure that everything is planned ahead of time and well justified. But most of the interesting innovations of the last decades came from people who were not planning very far ahead.
Some people were working on KaZaA and wounded the music industry in more than a week. When they were done, they were about to take on the telecommunication industry and came up with Skype. This is the disruptive and unpredictable nature of the Internet as we know it. The question discussed in the book is how this might change in the future Internet.
Zittrain therefore gives a number of scenarios for the Internet of the future, a plausible one being the “not with a bang, but with a whimper” scenario. In this scenario, we might end up in an ecosystem where innovation takes place in the way known from many other industries: there is competition among a bunch of firms and, occasionally, they come up with something good. As Zittrain said in an interview with National Public Radio (NPR), he worries that, metaphorically, we will end up with a technical elite class that swaps files with each other, while the mainstream has only a narrow connection to it. The mainstream would then use platforms where innovation takes place very slowly and which come with a great capacity to monitor and control.
Zittrain writes on page 5:
A lockdown on PCs and a corresponding rise of tethered appliances will eliminate what today we take for granted: a world where mainstream technology can be influenced, even revolutionized, out of left field. Stopping this future depends on some wisely developed and implemented locks, along with new technologies and a community ethos that secures the keys to those locks among groups with shared norms and a sense of public purpose, rather than in the hands of a single gatekeeping entity, whether public or private.
Although I have not finished reading the book yet, it appears to be well written and worth reading for people interested in discussions on the future Internet.
Just a quick side note: The recently elected spokesman of GI/ITG’s MMB section, Prof. Markus Siegle, suited the action to the word. Papers that were published at the 13th (2006) and 14th (2008) GI/ITG Conference on Measurement, Modelling and Evaluation of Computer and Communication Systems (MMB) have been added to the DBLP academic library, run by Michael Ley at the University of Trier. Thus, these publications can now be included more easily in typical author performance and reputation measures.
© 2001-2008 by Oliver Hohlfeld, B.Sc.