A linguistic investigation of junk emails
Purpose of the project
The goal of this project is to investigate the linguistic features of junk emails and, possibly, to design a junk email filter based on linguistic information rather than on a "bag-of-words" approach.
People
Resources
- The corpus of junk emails can be downloaded from here. This corpus consists of 1563 messages received by us in the last few years, but they are not necessarily unique messages.
- Because we are interested in the linguistic features of junk emails, we decided to remove duplicates. The corpus without duplicates can be downloaded from here. Duplicates were removed automatically: messages were treated as duplicates not only when they matched exactly, but also when they differed only in small formatting details. More details about the method will be available in our forthcoming paper at LREC 2002: "A corpus-based investigation of junk emails".
- A frequency list generated from the corpus without duplicates can be downloaded, as well as a lemmatised list.
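The duplicate removal described above can be illustrated with a minimal sketch. The actual method is the one described in the paper; here we simply assume that "small formatting differences" means differences in whitespace and letter case, and the function names `normalise` and `remove_duplicates` are hypothetical.

```python
import hashlib
import re


def normalise(message: str) -> str:
    """Collapse runs of whitespace and lowercase the text.

    Assumption: formatting-only variants differ just in spacing,
    line breaks and letter case.
    """
    return re.sub(r"\s+", " ", message).strip().lower()


def remove_duplicates(messages):
    """Keep the first occurrence of each message, comparing normalised text."""
    seen = set()
    unique = []
    for message in messages:
        # Hash the normalised text so only short keys are stored.
        key = hashlib.sha1(normalise(message).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(message)
    return unique


# Toy messages, not taken from the corpus.
messages = [
    "Make money FAST!\nClick here.",
    "Make  money fast!   Click here.",
    "Lose weight now.",
]
unique = remove_duplicates(messages)
```

In this toy example the first two messages differ only in spacing and case, so only the first of them is kept.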
Word clouds
Word cloud generated from the frequency list using the words which appear at least 5 times in the corpus.
Word cloud generated from the frequency list using the words which appear at least 5 times in the corpus after the stopwords were removed.
Both word clouds were produced using Wordle.
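The filtering behind the two word clouds (a minimum frequency of 5, with or without stopwords) could be reproduced along the following lines. This is only a sketch: the tokenisation rule, the tiny `STOPWORDS` set and the function name `frequency_list` are assumptions made for illustration, not the settings used to build the lists on this page.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list for illustration only.
STOPWORDS = frozenset({"the", "a", "to", "and", "of", "you"})


def frequency_list(texts, min_count=5, stopwords=frozenset()):
    """Count lowercase word tokens and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Assumption: a token is a run of letters or apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return {
        word: count
        for word, count in counts.items()
        if count >= min_count and word not in stopwords
    }


# Toy input, not taken from the corpus.
texts = ["free money for you"] * 5 + ["the free offer"] * 2
with_stopwords = frequency_list(texts)
without_stopwords = frequency_list(texts, stopwords=STOPWORDS)
```

The resulting word-to-count mappings are the kind of input a word cloud tool such as Wordle takes.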
Papers
- C. Orasan and R. Krishnamurthy (2002) "A corpus-based investigation of junk emails", In Proceedings of Language Resources and Evaluation Conference (LREC-2002), Las Palmas, Spain (pdf)
Other resources
- Corpus of emails including junk mail built by Ion Androutsopoulos
- Extensive information on lawsuits, news and opinions concerning junk emails can be found at www.junkemail.org and spam.abuse.net