ohohlfeld.com : blog

Spamalytics: Who goes for Spam?

November 2, 2008

Filed under: internet, papers, research — Oliver @ 5:48 pm


Direct marketing is not a new approach: its history dates back to the 19th century, when the first mail-order catalogues were distributed. Nowadays, unsolicited bulk e-mail annoys Internet users world-wide on a daily basis. While distributing mail-order catalogues involved real costs, the marginal cost of sending an e-mail is tiny. Therefore, e-mail based campaigns are profitable even when only a negligible fraction of recipients goes for the advertised product. The bad news, as highlighted by Kanich et al., is that “a perverse byproduct of this dynamic is that sending as much spam as possible is likely to maximise profit”. In order to maximise the reach of their advertisements, spammers play a cat-and-mouse game with the developers of anti-spam software: they have to adapt to the latest spam filtering technologies in order to reach as many people as possible.

However, the continued presence of spam, despite years of energetic deployment of anti-spam technology, demonstrates that spam campaigns are profitable. So the natural question arises: who goes for spam?

This issue is addressed in a paper entitled Spamalytics: An Empirical Analysis of Spam Marketing Conversion, presented at the 15th ACM Conference on Computer and Communications Security on Tuesday, October 28.

Spam Conversion Pipeline (Image source)

The authors are interested in the conversion rate of spam, which is the probability that an unsolicited e-mail will ultimately elicit a sale. To measure it, they infiltrate ongoing spam campaigns sent using the Storm botnet and provide measurements for the different stages of the spam conversion pipeline shown in the above figure. In order to understand their methodology, we need to briefly review the way Storm works.

Storm Botnet Architecture (Source: Kanich et al.)

Storm is a peer-to-peer botnet that propagates via spam. The above figure shows the three primary classes of Storm nodes involved in sending spam: worker bots, proxy bots and master servers. While the worker bots are responsible for actually sending the spam, proxy bots act as conduits between workers and master servers. A host that downloads and runs the Storm binary advertised in spam mails becomes either a worker bot (if it is not reachable from the Internet, e.g. due to firewall restrictions) or a proxy bot. As the command and control traffic directed to the worker bots is unencrypted and always passes through a proxy bot, a man-in-the-middle attack is possible, and Kanich et al. carry one out in the paper: by rewriting the command and control traffic directed to worker bots, they could change spam templates, dictionaries and addresses and adapt them to their needs.

Their methodology can be summarised as follows. They hosted a set of Storm proxy bots, created duplicates of the websites advertised in spam and rewrote the command and control traffic so that the worker bots advertised their sites instead of the original ones. Thus, no user received more spam, but some users received spam that was less dangerous than it would otherwise have been.

Over the course of their experiment, they rewrote the content of about 470 million spam mails sent in three campaigns: about 347 million in a pharmacy campaign, and about 83 million and 38 million in two Storm self-advertisement campaigns using postcard and April Fool’s Day lures, respectively. They received 28 purchases on the fake page for the advertised pharmaceutical product and 541 infections with the fake Storm binary, geographically distributed as shown below:

This translates into the following conversion rates (caution: the results are not intended to be generalised to other contexts!):

  • 1 in 12,500,000 pharmacy spams led to a purchase.
  • 1 in 265,000 greeting card spams led to an infected machine.
  • 1 in 178,000 April Fool’s Day spams led to an infected machine.
  • 1 in 10 people visiting an infection website downloaded the executable and ran it.
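
A “1 in N” figure like the ones above is simply the number of mails sent divided by the number of conversions. Here is a minimal Python sketch of that arithmetic, using only the rounded pharmacy numbers quoted in this post; the function name `one_in` is my own and does not come from the paper.

```python
def one_in(sent: int, conversions: int) -> int:
    """Return N such that roughly 1 in N sent mails led to a conversion."""
    if conversions <= 0:
        raise ValueError("need at least one conversion")
    return round(sent / conversions)


# Rounded figures from this post: about 347 million pharmacy spams, 28 purchases.
pharmacy = one_in(347_000_000, 28)
print(f"1 in {pharmacy:,} pharmacy spams led to a purchase")
```

With these rounded inputs the result lands near 12.4 million, in the same range as the figure quoted in the list; the small gap presumably comes from rounding.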

Much more information can be found in their paper (see below), such as the top-10 most targeted e-mail address domains, filtering statistics at each stage of the conversion pipeline, statistics about the efficiency of anti-spam methods deployed by typical free e-mail providers (e.g. Hotmail and Google Mail), the time-to-click distribution (the first users visited the advertised page just 10 seconds after the spam was sent) and the effects of blacklisting.
The paper is very well written and leads to new insights into how spam works. Interested readers should therefore consider reading this piece of well-conducted research.

Source: C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. Voelker, V. Paxson, S. Savage. Spamalytics: An Empirical Analysis of Spam Marketing Conversion. 15th ACM Conference on Computer and Communications Security 2008, Alexandria, VA, USA. [Summary, PDF Paper, BibTeX]


SIGCOMM 2008 Papers Available

August 15, 2008

Filed under: conferences, papers, research — Oliver @ 10:47 am

As SIGCOMM 2008, held in Seattle this year, is getting closer, I noticed that the accepted papers are now available online. They can be accessed here. Researchers from my group at Deutsche Telekom Laboratories will present their Time Machine, which allows later inspection of network activity that becomes interesting in retrospect.

Edit: Several papers are reviewed in the blog of Michael Mitzenmacher.

Howto Write a Good Research Paper Mindmap

August 14, 2008

Filed under: papers, research — Oliver @ 8:45 am

I received a pointer to a mindmap illustrating the steps that should be considered when writing a good research paper. This mindmap can be seen here.

Why Do People Vote? Genetic Sources of Political Participation

July 3, 2008

Filed under: Biology, Genetics, Politics, journals, papers, research — Oliver @ 10:21 pm

Why do people vote? “When one person votes, everyone with the same preferences benefits from the increased likelihood that their preferred outcome will result. Yet those who do vote must bear the cost of time and effort required to learn about election alternatives and go to the polls. In large populations, the probability that a single vote will change the outcome of an election is minuscule (..), meaning that even very small costs to the individual typically outweigh the expected benefits he or she would receive from voting. As a result, classic game theoretic models that assume individuals are self-interested and fully optimizing in their behavior show that the equilibrium amount of voter turnout approaches zero as the population becomes large (..). Yet in spite of this theoretical result, millions of people do vote, suggesting behavior drives their decision (..). In addition, the fact that millions of people abstain suggests that there may be inherent variation in the human tendency to participate in politics.” [1]
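
The cost-benefit argument in this quote can be made concrete with a tiny Python sketch: a purely self-interested voter weighs the probability of casting the decisive vote times the personal benefit of winning against the cost of voting. All numbers below are hypothetical and chosen only for illustration; they are not taken from [1].

```python
def expected_net_benefit(p_decisive: float, benefit: float, cost: float) -> float:
    """Expected payoff of voting for a purely self-interested voter."""
    return p_decisive * benefit - cost


# Hypothetical values: a one-in-ten-million chance of being decisive,
# a large personal benefit if the preferred outcome wins, and a small cost.
net = expected_net_benefit(p_decisive=1e-7, benefit=10_000.0, cost=1.0)
print(net)  # negative: the small cost outweighs the expected benefit
```

Even a small cost dominates once the probability of being decisive becomes tiny, which is why such models predict turnout close to zero in large populations.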

I read some papers this evening that were published very recently in 2008 (although the results are older, as indicated e.g. by an earlier abstract published here by Adam Kolber in 2007). These papers claim to have found evidence that voting is heritable. A paper entitled Genetic Variation in Political Participation [1] by Fowler, Baker and Dawes, who are with the University of California in San Diego, published in the May 2008 issue of The American Political Science Review, showed in two independent studies of twins that voter turnout has very high heritability. Their findings are derived from data on 168 monozygotic twins, who were conceived from a single fertilized egg and share 100 % of their genes, and 102 dizygotic twins, who were conceived from two separate eggs and share only 50 % of their genes on average. The data was gathered from a twin registry and voter registration records in Los Angeles county. Moreover, data from a nationally representative sample was used for an independent replication of the results. Their findings can be summarised as follows [2]:

While the choice of a particular candidate or party does not appear to be heritable, a significant proportion of the variation in the decision to participate in politics can be attributed to genetic factors. Fowler, Baker, and Dawes (2008) recently studied the voting behavior of two populations of twins and showed that heritability accounted for 53% of the variation in validated turnout of those living in Los Angeles county and 72% of the self-reported turnout in a nationally representative sample of young adults. They also showed that heritability accounted for 60% of the variation in a general index of political participation, including contributing to campaigns, running for office, volunteering for political organizations, and attending protests. These results were the first to suggest that humans exhibit inherent variability in their willingness to participate in politics.
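
To give an intuition for how a twin study can yield a heritability estimate at all, here is a minimal Python sketch. It does not reproduce the estimation procedure of Fowler et al., which is not described in this post; instead it uses Falconer’s formula, a classic textbook approximation that doubles the difference between the trait correlation of monozygotic pairs and that of dizygotic pairs. The correlation values are made up for illustration.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's approximation: h^2 = 2 * (r_MZ - r_DZ).

    r_mz: trait correlation between monozygotic twins
    r_dz: trait correlation between dizygotic twins
    """
    return 2.0 * (r_mz - r_dz)


# Hypothetical correlations, not taken from the paper.
h2 = falconer_heritability(r_mz=0.7, r_dz=0.4)
print(f"estimated heritability: {h2:.2f}")
```

The idea is that monozygotic twins share all of their genes and dizygotic twins about half, so a larger similarity among monozygotic pairs is attributed to genetic factors.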

As discussed in the paper, these findings would help to explain why models based primarily on environmental variables fit observed behavior poorly, and they would be consistent with two well-known features of voting: i) parental turnout behavior has been shown to be one of the strongest predictors of turnout behavior in young adults and ii) turnout behavior has been shown to be habitual, i.e. the majority of people either always vote or always abstain (cf. [1]).

However, this paper does not address the specific mechanism that links genes to participation. Thus, finding out why genes matter so much merits further investigation.

The first paper raises the question of which genes matter, which is addressed in a follow-up work by Fowler et al. entitled Two Genes Predict Voter Turnout, published in The Journal of Politics in July 2008. In this paper, the authors hypothesize “that people with more transcriptionally efficient alleles of the MAOA and 5HTT genes are more likely to vote” [2].

In summary: Fowler et al. [1, 2] claim to have found evidence that genes do contribute to variation in turnout. Their results suggest that both genes and environmental influences matter, without specifying to what degree each factor affects turnout. They do not claim that genetic effects are more important than environmental effects.

What these papers are not about: These papers do not claim that the choice of a particular candidate is influenced by genetics. Moreover, these papers do not state that genetics is the only factor that influences turnout; it is one factor among others.

German speakers might find this post by Marc interesting.

References:
[1] James H. Fowler, Laura A. Baker, Christopher T. Dawes: “Genetic Variation in Political Participation”. In: The American Political Science Review, Vol. 102, No. 2, May 2008
[2] James H. Fowler, Christopher T. Dawes: “Two Genes Predict Voter Turnout”. In: The Journal of Politics, Vol. 70, No. 3, July 2008, pp. 579–594

Self-Presentation in Light of Facebook

July 1, 2008

Inspired by some work presented at IWQoS dealing with social networks and small-world characteristics, I zoned out and wondered whether someone had analysed Facebook and, e.g., verified the six degrees of separation assumption stated by Milgram. In 2006, an analysis of one million profiles of the German Facebook clone StudiVZ was presented in [0]. The findings provide interesting insights into StudiVZ, but the presented evaluation does not include an extensive social network analysis. As the number of users on Facebook is much higher than on StudiVZ and, from an international perspective, Facebook is more widely known, I would expect more work dealing with Facebook that gives more interesting insights into today’s social networks.
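
Checking the six degrees assumption on a friendship graph boils down to computing shortest path lengths between pairs of users. A minimal Python sketch using breadth-first search follows; the toy network and all names in it are invented for illustration, and a real analysis of millions of profiles would need sampling and a far more memory-efficient graph representation.

```python
from collections import deque


def degrees_of_separation(friends, start, goal):
    """Breadth-first search: number of friendship hops from start to goal.

    friends maps each person to a list of direct friends.
    Returns None if the two people are not connected at all.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


# Invented toy network: a chain ann - bob - cem - dia, plus an isolated user.
toy_network = {
    "ann": ["bob"],
    "bob": ["ann", "cem"],
    "cem": ["bob", "dia"],
    "dia": ["cem"],
    "eve": [],
}
print(degrees_of_separation(toy_network, "ann", "dia"))
```

Running this over many randomly chosen pairs and looking at the distribution of hop counts would be one simple way to test the assumption.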

A student paper presented at the University of Oslo by Sasan Zarghooni [1] focuses on self-presentation management on Facebook. Self-presentation management is understood as the management of the impression a person makes on other people. An introduction to the classical theory proposed by Goffman [2] is followed by a discussion of whether this theory can explain the self-presentational behaviour observed on Facebook.

Goffman introduced a dramaturgical approach in [2], where he compared self-presentation to stage acting. An actor plays a role for a specific audience in a front stage area and retreats to a backstage area, where he changes his behaviour. This concept can be illustrated by the example of a teacher who acts in an authoritarian manner in an unruly class (front stage), but behaves differently at a family reunion. The concept of front and back stages helps to understand why people behave differently in different places.

Some findings presented in [1]:

  • “The e-mail like messaging system on Facebook allows for backstage interaction, and this way two friends may discuss the darkest secrets of their lives on Facebook without any other friends knowing.”
  • A study by Ellison [3] “found that Facebook led to a substantial increase in subjective well-being and self-esteem for shy people (…) because Facebook provides users with better control over how they self-present”
  • “It could suggest [A study by Walther [4]] that people consider their pictures to be the most important way of self-presenting: those who perceive themselves photogeneous do not engage heavily in other forms of self-presentation because they have already done a successfull self-presentation, whereas those who consider themselves less attractive wish to compensate”.

The work in [1] clearly states that “the more contacts or friends we have, the stronger is the need to segregate those who receive a particular self-presentation from those who receive another one”. This is the main reason why I believe that the discussion should be detached from a particular medium (e.g. Facebook) to a more macroscopic view. Different social networks provide different stages for different types of roles; business networks such as Xing or LinkedIn are used to manage a business role, whereas Facebook and StudiVZ appear to be more used for managing a role revealed to (closer) friends.

All in all, [1] is a well-written student paper which is easy to read and gives a good introduction to Goffman’s theory of self-presentation.

[0] StudiVZ analysis

[1] Sasan Zarghooni, “A Study of Self-Presentation in Light of Facebook”, University of Oslo, 2007

[2] Goffman, E: “The Presentation of Self in Everyday Life”, 1982

Self-plagiarism in Academia

June 16, 2008

Filed under: conferences, papers, research — Oliver @ 4:07 pm

Due to the Internet it is easy to “steal” parts or the whole of the work of others (e.g. essays, theses or other works assigned to students) and re-use them without labelling them as the work of others, i.e. without citing. Writing an essay by using the cut & paste technique to copy text blocks from the Internet is easy and quick. Why should a student spend much time on writing an essay that has already been written before? According to a report by the BBC, student plagiarism is common in the UK and probably becoming more so. In order to limit plagiarism, universities publish guidelines on how to avoid it. But what exactly is plagiarism? Wikipedia defines it as follows:

Plagiarism is the practice of claiming or implying original authorship of (or incorporating material from) someone else’s written or creative work, in whole or in part, into one’s own without adequate acknowledgement.

Can there be such a thing as self-plagiarism? Can we steal something from our own work? Yes, in some sense, and it is a problem in academia. I recently reported that I’m currently involved in the review process for an academic conference. A couple of days ago, one of the reviewers, who worked on a paper that was also assigned to me, claimed to have found a case of self-plagiarism and notified the conference chairs. Subsequently, the chairs asked the reviewers to check this claim and revisit their reviews if needed. In the end, the paper was rejected due to self-plagiarism.

What happened here and why is it bad to steal from oneself? As a first step, I’m going to revisit the term “to steal” in the context of self-plagiarism. It may be adequate when speaking about plagiarism in the sense of stealing a text, but an author cannot steal his own work. I only used this term to highlight the problem of plagiarism in the introduction of this post. According to Roig, “self-plagiarism occurs when authors reuse their own previously written work or data in a ‘new’ written product without letting the reader know that this material has appeared elsewhere” [Roi06]. Thus, self-plagiarism is more about (deceitful and fraudulent) concealment than stealing.

But why can it be a problem in academia when authors reuse previously written work without citing it? Well, it is a problem because scientific papers are expected to be novel. A research paper should present something new, something that was not known before. A new result, a new algorithm, whatever. This makes it interesting and justifies a new publication. Thus, reusing an existing paper means knowingly publishing a known fact while claiming to present something new, e.g. in order to increase one’s Google Scholar rating. Academic conferences want to publish and discuss unpublished work and thus self-plagiarism is a problem. (It is alright to publish an extended version or an article based on several conference papers in an academic journal.)

And why is self-plagiarism tempting? Well, reusing a previously published paper is much less work than doing original research, and it increases the number of published papers. The number of published papers is a simple metric that may be used to guess the “competence” of a researcher (as discussed in a previous post). Thus, the more papers published, the better: publish or perish! This may entice an author into doing so.

When a machine only slowly forgets - Exploiting TrueCrypt et al.

February 24, 2008

Filed under: papers, research — Oliver @ 10:43 pm

Researchers at Princeton University released a highly interesting paper on Thursday, which demonstrates that DRAM contents are not immediately lost when the system is turned off. Their paper shows how this property can be used to attack state-of-the-art hard drive encryption tools, such as TrueCrypt, when the attacker gets physical access to the machine.

The root of the problem lies in an unexpected property of today’s DRAM memories. DRAMs are the main memory chips used to store data while the system is running. Virtually everybody, including experts, will tell you that DRAM contents are lost when you turn off the power. But this isn’t so. Our research shows that data in DRAM actually fades out gradually over a period of seconds to minutes, enabling an attacker to read the full contents of memory by cutting power and then rebooting into a malicious operating system. (…) This is deadly for disk encryption products because they rely on keeping master decryption keys in DRAM. This was thought to be safe because the operating system would keep any malicious programs from accessing the keys in memory, and there was no way to get rid of the operating system without cutting power to the machine, which “everybody knew” would cause the keys to be erased. (Source)

Further information, including images and videos as well as an experimental guide to quickly reproduce these results using Linux, can be found here.

Internet Measurement Conference 2007 Papers Out

October 28, 2007

Filed under: conferences, papers, research — Oliver @ 9:17 pm

Just a brief announcement: the papers presented at IMC 07 are available on the web. There are many interesting publications and it is worth looking at some of them.

Two papers cover YouTube content [1] and traffic [2]. The first one received the best paper award. The paper by Cha et al. [1] is devoted to the analysis of user-generated content offered on YouTube. It examines content production patterns, user participation and how web surfers find content. It was interesting to me that the authors also analysed content aliasing, i.e. the presence of multiple copies of the same video. They stated that “Most videos have 1 to 4 aliases, while the maximum number of aliases is 89 (…) A large number of aliases are uploaded on the same day as the original video or within a week.” (cf. Section 6.1). Moreover, they showed that simple caching of the most popular videos can offload server traffic by as much as 50%.

In contrast, Gill et al. [2] characterise YouTube traffic measured at the edge (a university network) during 85 consecutive days. YouTube traffic was responsible for 4.6 % of the total traffic on the campus Internet link (625,593 videos viewed). The authors also highlight that local (in-network) caches could shrink the traffic, as 50 % of the video requests relate to previously requested videos. They state that caching could reduce YouTube traffic on the campus Internet link by a factor of 2, translating to 3.19 TB. It was quite interesting to see that although YouTube imposes a limit of 100 MB on the maximum video file size, 0.1 % of the analysed videos were larger than that limit. Only 10 % of the analysed videos were larger than 21.9 MB. The file sizes reflect the short duration of most videos: “the mean video duration observed on campus is 4.15 minutes with a median of 3.33 minutes (…) 52.3 % of the videos in the all time popular category are between 3 and 5 minutes long.” They also evaluated the encoding bit-rate of the served videos (suggesting that the target audience is broadband users) as well as the age and rating of the videos. Social networks were also the subject of [3].
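
The caching argument rests on how many requests repeat an earlier one. As a rough Python sketch of that measurement, the function below counts the share of requests in a trace that refer to a video already requested before; with an unbounded cache and no expiry, this share would be the cache hit ratio. The trace and the function name are invented for illustration and are not taken from [2].

```python
def repeat_request_fraction(requests):
    """Fraction of requests that refer to a previously requested video.

    With an unlimited cache that never evicts anything, this equals the
    hit ratio the cache would achieve on the given trace.
    """
    seen = set()
    repeats = 0
    for video_id in requests:
        if video_id in seen:
            repeats += 1
        else:
            seen.add(video_id)
    return repeats / len(requests) if requests else 0.0


# Invented toy trace of video identifiers in request order.
trace = ["a", "b", "a", "c", "a", "b"]
print(repeat_request_fraction(trace))
```

A real cache is bounded, so its hit ratio would be lower than this upper bound and would depend on the eviction policy.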

Dischinger et al. [4] presented a nice analysis of residential broadband access networks (focusing on cable and DSL links) by sending ICMP ping probes and TCP reset packets to sinks. The main research questions were: “1) what are the typical bandwidth, latency and loss characteristics of residential broadband links? 2) how do the characteristics of broadband networks differ from those of academic or corporate networks and 3) what are the implications of broadband-network properties for future protocol and system designers?” Some of the findings were that “many cable links show high variation in link bandwidths over short timescales. Packet transmissions over cable suffer [from?] high jitter as a result of cable’s time-slotted access policy. DSL links show large last-hop delays and considerable deployment of active queue management policies such as random early detection (RED).”

All in all, there are many highly interesting papers and I suggest taking a look at them.

References:

[1] Cha et al.: “I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System” (2007)
[2] Gill et al.: “YouTube Traffic Characterization: A View From the Edge” (2007)
[3] Mislove et al.: “Measurement and Analysis of Online Social Networks” (2007)
[4] Dischinger et al.: “Characterizing Residential Broadband Networks” (2007)

Men are louder than women

October 10, 2007

Filed under: journals, papers, research — Oliver @ 10:26 am

Well, when I was looking for some older papers, I found a funny piece of work in the Bell System Technical Journal [1]. Basically, the authors conducted a speech volume measurement on their own network and drew some interesting conclusions from the measurement data.

They discovered that business calls tend to have a somewhat higher speech volume than social calls:

Speech volumes on business calls average slightly higher than those on social calls, partially because business talkers are predominately men and business calls tend to be over long distances. (…) Over-all, men tend to talk slightly louder than women, and business conversations are louder than social ones.

They also give some statistics on the distribution of speakers:

Approximately 73 per cent of the business calls observed were made by male speakers, whereas females made 81 per cent of the social calls. (…) most of the local telephone calls were made by women.

So is this a sign of old-fashioned (?) gender roles, where men are predominantly in business and thus earn the money for a living, whereas the role of women is (was?) socialising? So men were hunters who provided food, whereas women made local phone calls to socialise and invite other families to eat the food the men had hunted? :-)
Seriously, I’d love to see some more recent statistics, but I don’t think that service providers still analyse calls in that way. Fortunately, the role of women (especially their job opportunities) is changing nowadays, so I would expect different results if the study were conducted in 2007. If you know of some related work, please let me know.

(..) there is an increase in near-end speech volume of approximately 1 dB per 1000 miles. This increase may be caused by increased noise and distortion on longer toll connections or may be psychological.

The last point is quite funny, as I know people who intuitively speak louder on long-distance calls because they think they can reach the far-away listener better that way and assume that the quality of the line is generally poor. :-)

[1] K. Adoo, Speech Volumes on Bell System Message Circuits–1960 Survey. Bell System Technical Journal, 1963.

ACM Sigcomm 2007 - When randomness plays with you

August 30, 2007

Filed under: conferences, papers, research — Oliver @ 10:13 pm

This year’s ACM Sigcomm conference is held in Kyoto, Japan. There is a paper about BubbleStorm, a flexible P2P system for meta data distribution and lookup, and another about the analysis of Skype traffic. The paper by Xie et al. addresses the dynamics of IP addresses. Oliveira et al. studied the evolution of the AS topology. These were the most interesting papers to me.

Update (2008-02-24): some theory related papers presented at ACM Sigcomm’07 are reviewed here.

© 2001-2008 by Oliver Hohlfeld, B.Sc. | Imprint
