ohohlfeld.com : blog

Mendeley: Social Network Dedicated to Researchers

December 7, 2008

Filed under: Social Network, internet, research — Tags: , , — Oliver @ 4:42 pm

There are many social networks available, each dedicated to different needs. However, none of them focuses on researchers as its clientele. This seems to be changing with Mendeley, a social network dedicated to scientists. The site is still in an early beta phase and does not have many users yet, but it already looks promising. Mendeley provides a client (also available for Linux, and it runs fine on my 64-bit Ubuntu installation) that allows managing one’s publications and synchronises them with the Mendeley profile.

As I want to explore this new network, I created my Mendeley profile just a couple of hours ago. In contrast to the experience reported by Daniel Lemire, importing my publications from a BibTeX database was fairly easy. A feature I am currently missing is the option to publish a less detailed CV, as is possible on LinkedIn; when providing details about my education or professional experience, I am also forced to provide dates.

Spamalytics: Who goes for Spam?

November 2, 2008

Filed under: internet, papers, research — Tags: , , — Oliver @ 5:48 pm

Spam (Image source)

Direct marketing is not a new approach; its history dates back to the 19th century, when the first mail-order catalogues were distributed. Nowadays, unsolicited bulk e-mail annoys Internet users world-wide on a daily basis. While distributing mail-order catalogues involved some cost, the marginal cost of sending an e-mail is tiny. Therefore, e-mail based campaigns are profitable even when only a negligible fraction of recipients goes for the advertised product. The bad news, as highlighted by Kanich et al., is that “a perverse byproduct of this dynamic is that sending as much spam as possible is likely to maximise profit”. In order to maximise the reach of their advertisements, spammers play a cat-and-mouse game with the developers of anti-spam software: they have to adapt to the latest spam filtering technologies in order to reach as many people as possible.

However, the continued presence of spam, despite years of energetic deployment of anti-spam technology, demonstrates the profitability of spam campaigns. So the natural question arises: who goes for spam?

This issue is addressed in a paper entitled Spamalytics: An Empirical Analysis of Spam Marketing Conversion, presented at the 15th ACM Conference on Computer and Communications Security on Tuesday, October 28.

Spam Conversion Pipeline (Image source)

The authors are interested in the conversion rate of spam, which is the probability that an unsolicited e-mail will ultimately elicit a sale. To measure it, they infiltrated ongoing spam campaigns sent using the Storm botnet, which allowed them to take measurements at the different stages of the spam conversion pipeline shown in the above figure. In order to understand their methodology, we need to briefly review the way Storm works.

Storm Botnet Architecture (Source: Kanich et al.)

Storm is a peer-to-peer botnet that propagates via spam. The above figure shows the three primary classes of Storm nodes involved in sending spam: worker bots, proxy bots and master servers. While the worker bots are responsible for actually sending the spam, proxy bots act as conduits between workers and master servers. When a host is infected by the Storm binary advertised in spam mails, it becomes either a worker bot (if it is not reachable from the Internet, e.g. due to firewall restrictions) or a proxy bot. As the command and control traffic directed to the worker bots is unencrypted and always passes through a proxy bot, a man-in-the-middle attack is possible, and Kanich et al. carried one out in the paper: by rewriting the command and control traffic directed to worker bots, they could change spam templates, dictionaries and addresses and adapt them to their needs.

Their methodology can be summarised as follows. They hosted a set of Storm proxy bots, created duplicates of the websites advertised in the spam, and rewrote the command and control traffic so that the worker bots advertised their sites instead of the original ones. Thus, no user received more spam, but some users received spam that was less dangerous than it would have been otherwise.

Over the course of their experiment, they rewrote the content of about 470 million spam mails sent in three campaigns: about 347 million in a pharmacy campaign, and about 83 million and 38 million in two Storm self-advertisement campaigns that used postcards and April Fool’s Day as lures, respectively. They received 28 purchases on the fake page for the advertised pharmaceutical product and 541 infections with the fake Storm binary, geographically distributed as shown below:

This translates into the following conversion rates (caution: the results are not intended to be generalised to other contexts!):

  • 1 in 12,500,000 pharmacy spams led to a purchase.
  • 1 in 265,000 greeting card spams led to an infected machine.
  • 1 in 178,000 April Fool’s Day spams led to an infected machine.
  • 1 in 10 people visiting an infection website downloaded the executable and ran it.
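
The "1 in N" rates above follow from dividing the number of mails sent by the number of conversions. Here is a minimal sketch in Python that uses only the rounded figures quoted in this post. The helper name one_in is made up for illustration, and the paper's exact counts will differ slightly from these rounded values:

```python
def one_in(sent: int, conversions: int) -> float:
    """Return N such that 1 in N sent mails led to a conversion."""
    if conversions <= 0:
        raise ValueError("need at least one conversion")
    return sent / conversions


# Rounded figures quoted in this post, not the paper's exact counts.
pharmacy_sent = 347_000_000
pharmacy_purchases = 28

# The post gives only the combined infection count for the two
# self-advertisement campaigns, so only a combined rate is computed here.
infection_sent = 83_000_000 + 38_000_000
infections = 541

if __name__ == "__main__":
    print(f"pharmacy: 1 in {one_in(pharmacy_sent, pharmacy_purchases):,.0f}")
    print(f"infection (combined): 1 in {one_in(infection_sent, infections):,.0f}")
```

With these rounded inputs, the pharmacy rate comes out at roughly 1 in 12.4 million, which is in line with the figure in the list.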

Much more information can be found in their paper (see below), such as the top 10 most targeted e-mail address domains, filtering statistics for each stage of the conversion pipeline, statistics on the efficiency of the anti-spam methods deployed by typical free e-mail providers (e.g. Hotmail and Google Mail), the time-to-click distribution (the first users visited the advertised page 10 seconds (sic!) after the spam was sent), and the effects of blacklisting.
The paper is very well written and provides new insights into how spam works. Interested readers should therefore consider reading this piece of well-conducted research.

Source: C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. Voelker, V. Paxson, S. Savage. Spamalytics: An Empirical Analysis of Spam Marketing Conversion. 15th ACM Conference on Computer and Communications Security 2008, Alexandria, VA, USA. [Summary, PDF Paper, BibTeX]

SIGCOMM 2008 Papers Available

August 15, 2008

Filed under: conferences, papers, research — Tags: , — Oliver @ 10:47 am

As SIGCOMM 2008, held in Seattle this year, is getting closer, I noticed that the accepted papers are now available online. They can be accessed here. Several researchers from my group at Deutsche Telekom Laboratories will present their Time Machine, which allows later inspection of network activity that becomes interesting in retrospect.

Edit: Several papers are reviewed in the blog of Michael Mitzenmacher.

Howto Write a Good Research Paper Mindmap

August 14, 2008

Filed under: papers, research — Tags: , , , — Oliver @ 8:45 am

I received a pointer to a mindmap illustrating the steps that should be considered when writing a good research paper. The mindmap can be seen here.

Papers of the 13th and 14th MMB added to DBLP library

July 13, 2008

Filed under: conferences, research — Tags: , — Oliver @ 7:47 am

Just a quick side note: The recently elected spokesman of GI/ITG’s MMB section, Prof. Markus Siegle, suited the action to the word. Papers published at the 13th (2006) and 14th (2008) GI/ITG Conference on Measurement, Modelling and Evaluation of Computer and Communication Systems (MMB) have been added to the DBLP academic library, run by Michael Ley at the University of Trier. Thus, these publications can now be included in typical author performance and reputation measures more easily. ;-)

Why Do People Vote? Genetic Sources of Political Participation

July 3, 2008

Filed under: Biology, Genetics, Politics, journals, papers, research — Oliver @ 10:21 pm

Why do people vote? “When one person votes, everyone with the same preferences benefits from the increased likelihood that their preferred outcome will result. Yet those who do vote must bear the cost of time and effort required to learn about election alternatives and go to the polls. In large populations, the probability that a single vote will change the outcome of an election is minuscule (..), meaning that even very small costs to the individual typically outweigh the expected benefits he or she would receive from voting. As a result, classic game theoretic models that assume individuals are self-interested and fully optimizing in their behavior show that the equilibrium amount of voter turnout approaches zero as the population becomes large (..). Yet in spite of this theoretical result, millions of people do vote, suggesting behavior drives their decision (..). In addition, the fact that millions of people abstain suggests that there may be inherent variation in the human tendency to participate in politics.” [1]
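
The argument in the quote can be written as a simple expected-value comparison: for a purely self-interested individual, voting pays off only if p * B > C, where p is the probability of casting the decisive vote, B is the personal benefit of the preferred outcome and C is the cost of voting. Below is a minimal sketch of this comparison. All numbers are hypothetical and chosen only to illustrate the orders of magnitude; they are not taken from [1]:

```python
def net_benefit_of_voting(p_decisive: float, benefit: float, cost: float) -> float:
    """Expected net benefit of voting for a purely self-interested voter."""
    return p_decisive * benefit - cost


# Hypothetical values for illustration only (not taken from [1]).
p = 1e-7      # chance that a single vote decides the election
B = 10_000.0  # value of the preferred outcome to the voter
C = 1.0       # cost of getting informed and going to the polls

if __name__ == "__main__":
    # The result is negative: even a small cost dominates the expected benefit.
    print(net_benefit_of_voting(p, B, C))
```

As long as p shrinks with the size of the electorate, the product p * B becomes tiny and any positive cost C makes abstaining the "rational" choice, which is the theoretical result the quote contrasts with actual turnout.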

I read some papers this evening that were published very recently, in 2008 (although the results are older, as indicated e.g. by an earlier abstract published here by Adam Kolber in 2007). These papers claim to have found evidence for the heritability of voting. A paper entitled Genetic Variation in Political Participation [1] by Fowler, Baker and Dawes, who are with the University of California in San Diego, published in the May 2008 issue of The American Political Science Review, showed in two independent studies of twins that voter turnout has very high heritability. Their findings are derived from data on 168 monozygotic twins, who were conceived from a single fertilized egg and share 100 % of their genes, and 102 dizygotic twins, who were conceived from two separate eggs and share only 50 % of their genes on average. The data was gathered from a twin registry and voter registration records in Los Angeles county. Moreover, data from a nationally representative sample was used for an independent replication of the results. Their findings can be summarised as follows [2]:

While the choice of a particular candidate or party does not appear to be heritable, a significant proportion of the variation in the decision to participate in politics can be attributed to genetic factors. Fowler, Baker, and Dawes (2008) recently studied the voting behavior of two populations of twins and showed that heritability accounted for 53% of the variation in validated turnout of those living in Los Angeles county and 72% of the self-reported turnout in a nationally representative sample of young adults. They also showed that heritability accounted for 60% of the variation in a general index of political participation, including contributing to campaigns, running for office, volunteering for political organizations, and attending protests. These results were the first to suggest that humans exhibit inherent variability in their willingness to participate in politics.
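
The twin-study logic (identical twins share all of their genes, fraternal twins half on average) underlies a classic back-of-the-envelope estimate known as Falconer's formula, which derives heritability from the trait correlations within the two kinds of twin pairs. The sketch below shows that textbook approximation only. It is not necessarily the estimation procedure used in [1], and the correlations are hypothetical values chosen for illustration:

```python
def falconer(r_mz: float, r_dz: float) -> dict:
    """Falconer's estimates from twin correlations of a trait.

    r_mz: trait correlation within monozygotic (identical) twin pairs
    r_dz: trait correlation within dizygotic (fraternal) twin pairs
    """
    return {
        "heritability": 2 * (r_mz - r_dz),      # additive genetic share
        "shared_environment": 2 * r_dz - r_mz,  # environment common to both twins
        "unique_environment": 1 - r_mz,         # individual environment and noise
    }


if __name__ == "__main__":
    # Hypothetical correlations, not data from Fowler et al.
    for name, share in falconer(r_mz=0.7, r_dz=0.4).items():
        print(f"{name}: {share:.2f}")
```

The intuition is that any extra similarity of identical twins over fraternal twins is attributed to the additional half of shared genes, so doubling that difference gives the genetic share of the variation.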

As discussed in the paper, these findings would help to explain why models based primarily on environmental variables fit observed behavior poorly, and they would be consistent with two well-known features of voting: i) parental turnout behavior has been shown to be one of the strongest predictors of turnout behavior in young adults, and ii) turnout behavior has been shown to be habitual: the majority of people either always vote or always abstain (cf. [1]).

However, this paper does not address the specific mechanism that links genes to participation. Thus, further investigation is merited to find out why genes matter so much.

The first paper raises the question of which genes matter. This is addressed in follow-up work by Fowler et al. entitled Two Genes Predict Voter Turnout, published in The Journal of Politics in July 2008. In this paper, the authors hypothesize “that people with more transcriptionally efficient alleles of the MAOA and 5HTT genes are more likely to vote” [2].

In summary: Fowler et al. [1, 2] claim to have found evidence that genes contribute to variation in turnout. Their results suggest that both genes and environmental influences matter, without specifying to what degree each factor affects turnout. They do not claim that genetic effects are more important than environmental effects.

What these papers are not about: These papers do not claim that the choice of a particular candidate is influenced by genetics. Moreover, they do not state that genetics is the only factor that influences turnout; it is one factor among others.

German speakers might find this post by Marc interesting.

References:
[1] James H. Fowler, Laura A. Baker, Christopher T. Dawes: “Genetic Variation in Political Participation“. In: The American Political Science Review, May 2008: Volume 102, No. 2
[2] James H. Fowler, Christopher T. Dawes: “Two Genes Predict Voter Turnout“. In: The Journal of Politics, Vol. 70, No. 3, July 2008, Pp. 579–594

Self-Presentation in Light of Facebook

July 1, 2008

Inspired by some work presented at IWQoS dealing with social networks and small-world characteristics, I zoned out and wondered whether someone had analysed Facebook and, e.g., confirmed the six degrees of separation assumption stated by Milgram. In 2006, an analysis of one million profiles of the German Facebook clone StudiVZ was presented in [0]. The findings provide interesting insights into StudiVZ, but the presented evaluation does not include an extensive social network analysis. As the number of users on Facebook is much higher than on StudiVZ and, from an international perspective, Facebook is more widely known, I would expect more work dealing with Facebook that gives more interesting insights into today’s social networks.

A student paper presented at the University of Oslo by Sasan Zarghooni [1] focuses on self-presentation management on Facebook. Self-presentation management is understood as the management of the impression a person makes on other people. An introduction to the classical theory proposed by Goffman [2] is followed by a discussion aiming to show whether this theory can explain the self-presentational behaviour observed on Facebook.

Goffman introduced a dramaturgical approach in [2], where he compared self-presentation to stage acting. An actor plays a role for a specific audience in a front stage area and retreats to a backstage, where he will change his behaviour. This concept can be illustrated by the example of a teacher who acts in an authoritarian manner in an unruly class (front stage) but behaves differently at a family reunion. The concept of front and back stages helps to understand why people behave differently in different places.

Some findings presented in [1]:

  • “The e-mail like messaging system on Facebook allows for backstage interaction, and this way two friends may discuss the darkest secrets of their lives on Facebook without any other friends knowing.”
  • A study by Ellison [3] “found that Facebook led to a substantial increase in subjective well-being and self-esteem for shy people (…) because Facebook provides users with better control over how they self-present”
  • “It could suggest [A study by Walther [4]] that people consider their pictures to be the most important way of self-presenting: those who perceive themselves photogeneous do not engage heavily in other forms of self-presentation because they have already done a successfull self-presentation, whereas those who consider themselves less attractive wish to compensate”.

The work in [1] clearly states that “the more contacts or friends we have, the stronger is the need to segregate those who receive a particular self-presentation from those who receive another one”. This is the main reason why I believe that the discussion should move away from a particular medium (e.g. Facebook) towards a more macroscopic view. Different social networks provide different stages for different types of roles; business networks such as Xing or LinkedIn are used to manage a business role, whereas Facebook and StudiVZ appear to be used more for managing a role revealed to (closer) friends.

All in all, [1] is a well-written student paper that is easy to read and gives a good introduction to Goffman’s theory of self-presentation.

[0] StudiVZ analysis

[1] Sasan Zarghooni, “A Study of Self-Presentation in Light of Facebook“, University of Oslo, 2007

[2] Goffman, E: “The Presentation of Self in Everyday Life”, 1982

Cold Topics in Networking

June 19, 2008

Filed under: research — Tags: , , — Oliver @ 12:33 pm

Jon Crowcroft published an article about Cold Topics in Networking in ACM SIGCOMM Computer Communication Review issue 1/08. In this article, he gives a rough heuristic on how to classify topics as being cold and gives some examples of cold topics afterwards.

Self-plagiarism in Academia

June 16, 2008

Filed under: conferences, papers, research — Tags: , — Oliver @ 4:07 pm

Thanks to the Internet, it is easy to “steal” parts or all of the work of others (e.g. essays, theses or other works assigned to students) and to reuse them without labelling them as the work of others (citing). Writing an essay by using the cut & paste technique to copy text blocks from the Internet is easy and quick. Why should a student spend much time on writing an essay that has already been written before? According to a report by the BBC, student plagiarism is common in the UK and probably becoming more so. In order to limit plagiarism, universities publish guidelines on how to avoid it. But what exactly is plagiarism? Wikipedia defines it as follows:

Plagiarism is the practice of claiming or implying original authorship of (or incorporating material from) someone else’s written or creative work, in whole or in part, into one’s own without adequate acknowledgement.

Can there be such a thing as self-plagiarism? Can we steal something from our own work? Yes, in some sense, and it is a problem in academia. I reported recently that I am currently involved in the review process for an academic conference. A couple of days ago, one of the reviewers, who worked on a paper that was also assigned to me, claimed to have found a case of self-plagiarism and asked the conference chairs to check it. Subsequently, the chairs asked the reviewers to check this claim and revisit their reviews if needed. In the end, the paper was rejected due to self-plagiarism.

What happened here and why is it bad to steal from oneself? As a first step, I am going to reconsider the term “to steal” in the context of self-plagiarism. It may be adequate when speaking about plagiarism in the sense of stealing a text, but authors cannot steal their own work. I only used this term to highlight the problem of plagiarism in the introduction of this post. According to Roig, “self-plagiarism occurs when authors reuse their own previously written work or data in a ‘new’ written product without letting the reader know that this material has appeared elsewhere” [Roi06]. Thus, self-plagiarism is more about (deceitful and fraudulent) concealment than stealing.

But why is it a problem in academia when authors reuse previously written work without citing it? Well, it is a problem because scientific papers are expected to be novel. A research paper should present something new, something that was not known before: a new result, a new algorithm, whatever. This makes it interesting and justifies a new publication. Thus, reusing an existing paper means knowingly publishing a known fact while claiming to present something new, e.g. in order to increase one’s Google Scholar rating. Academic conferences want to publish and discuss unpublished work, and thus self-plagiarism is a problem. (It is all right to publish an extended version, or an article based on several conference papers, in an academic journal.)

And why is self-plagiarism tempting? Well, reusing a previously published paper is much less work than doing original research, and it increases the number of published papers. The number of published papers is a simple metric that may be used to guess the “competence” of a researcher (as discussed in a previous post). Thus, the more papers published, the better: publish or perish! This may entice an author into doing so.

IWQoS 2008 — A résumé

June 7, 2008

IWQoS 2008 Proceedings

The workshop is finally over and I am back in Germany. All in all, I have to say that IWQoS was a very interesting workshop with contributions of very high quality. I want to present a brief résumé here, but I am not giving an extensive review and thus recommend that you take a look at the program yourself.

  • Two-state Markov models for describing transmission channels are still popular (e.g. used by Liu et al.)
  • Algorithms in the field of Pre-Congestion Notification (PCN) are being subjected to performance evaluations, which is a good thing in general: evaluations of RED active queue management were published only when RED was already widely deployed, and thus came too late to be taken into account. It seems that this is not the case for PCN.
  • An interesting contribution has been made to the field of profile-based traffic classification in the work of Hu et al., where data mining techniques are applied to generate distinct behavioral application profiles. The authors present an evaluation of a rule set for BitTorrent and PPLive. In contrast to the techniques presented in our talk about Spam and Traffic Profiling techniques in 2006, this approach seems to be more flexible, at least at first sight.
  • YouTube has again been the subject of an extensive evaluation. In contrast to the papers presented at the Internet Measurement Conference in 2007, this paper discusses the social networks formed in YouTube and their small-world character.
  • The invited talk given by a colleague of David Hutchison, entitled QoS: (Still) a Grand Challenge?, reviewed the evolution of QoS techniques starting from ATM and Broadband ISDN. The conclusion drawn from this talk is that QoS is still a considerable challenge and that security and resilience issues need to be taken more seriously, which seems reasonable. However, it remains to be seen whether the delivery of 100 Mbit/s to the home really changes the world as much as highlighted in the talk. What I know about ADSL service providers is that most users do not make extensive use of the big pipe they pay for and rather stick with occasionally using HTTP and checking their mail. In the first days of ADSL deployment, those access lines were used extensively by power users, which resulted in a high increase of traffic in the core. Nowadays, however, traffic in the core increases much more slowly with an increasing number of ADSL users, as most users do not use their access link very extensively. I am wondering whether this will be similar for 100 Mbit/s access links in the future.
© 2001-2008 by Oliver Hohlfeld, B.Sc. | Imprint