The SMS-Mv1.0 Corpus


The SMS-Mv1.0 corpus is a set of SMS messages initiated by 35 instances of mobile malware (abbreviated as SIMM messages) together with benign SMS messages taken from the SMS Spam Collection v.1 corpus. Altogether, it contains 155 SIMM and 4539 benign SMS messages.


The SMS-Mv1.0 corpus can be downloaded here.


The benign SMS messages are subject to the following restrictions:

  • we excluded all ham SMS messages longer than 160 characters (238x)
  • we excluded all ham SMS messages which occurred more than three times:
    • Sorry, I'll call later (27/30 excluded)
    • I cant pick the phone right now. Pls send a message (9/12 excluded)
    • Ok... (7/10 excluded)
    • Okie (1/4 excluded)
    • OK (1/4 excluded)
    • Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silent 6. Spl character 7. Not matured 8. Stylish 9. Simple Pls reply.. (1/4 excluded)
    • 7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st "Ur Lovely Friendship"... good morning dear (1/4 excluded)
  • we replaced &amp; by an ampersand (49x)
  • we replaced 0x92 by an apostrophe (31x)
  • we replaced &lt;#&gt; by an empty string (49x)
  • we replaced &lt; by an empty string (27x)
  • we replaced &gt; by an empty string (27x)
  • we replaced 0x96 by an empty string (3x)
  • we replaced 0x94 by an empty string (2x)
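
The restrictions above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the corpus: the function names are hypothetical, the order of the steps (length check, then duplicate cap, then replacements) is an assumption, and so are the entity spellings &amp;, &lt; and &gt; and the reading of the bytes 0x92, 0x94 and 0x96 as latin-1 characters.

```python
from collections import Counter

MAX_LEN = 160     # length limit from the list above
MAX_COPIES = 3    # identical messages are kept at most this many times

# (old, new) pairs from the list above. The "&...;" entity spellings and the
# latin-1 view of the bytes 0x92/0x94/0x96 are assumptions of this sketch.
REPLACEMENTS = [
    ("&lt;#&gt;", ""),   # must run before the plain &lt; and &gt; rules
    ("&lt;", ""),
    ("&gt;", ""),
    ("&amp;", "&"),
    ("\x92", "'"),
    ("\x96", ""),
    ("\x94", ""),
]


def clean(message):
    """Apply the character replacements to one message."""
    for old, new in REPLACEMENTS:
        message = message.replace(old, new)
    return message


def filter_benign(messages):
    """Drop over-long messages and every copy beyond the third, then clean."""
    seen = Counter()
    kept = []
    for msg in messages:
        if len(msg) > MAX_LEN:
            continue
        seen[msg] += 1
        if seen[msg] > MAX_COPIES:
            continue
        kept.append(clean(msg))
    return kept
```

For example, passing ten copies of "Ok..." to filter_benign keeps three of them, which matches the "7/10 excluded" entry in the list.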

All SIMM messages in the SMS-Mv1.0 Corpus were observed during the analysis of the 35 mobile malware instances listed in [1]. We applied rules similar to those for the benign SMS messages:

  • we excluded all SIMM messages longer than 160 characters
  • we considered only the first three sent SIMM messages with an identical dynamic structure
  • if the mobile malware instance tried to send empty messages, we considered only the first two occurrences
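
The three SIMM rules can be sketched in the same way. The sketch assumes that the messages of a single malware instance are available in sending order as (structure, message) pairs, where the structure string identifies the dynamic structure; this input format and the function name filter_simm are hypothetical.

```python
from collections import Counter


def filter_simm(sent, max_len=160, max_per_structure=3, max_empty=2):
    """Select SIMM messages of one malware instance.

    sent: iterable of (structure, message) pairs in sending order
          (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    kept = []
    for structure, message in sent:
        if len(message) > max_len:
            continue
        if message == "":
            # Empty messages are counted on their own, with the lower limit.
            key, limit = ("empty",), max_empty
        else:
            key, limit = ("structure", structure), max_per_structure
        counts[key] += 1
        if counts[key] > limit:
            continue
        kept.append(message)
    return kept
```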


The corpus comprises the following files:

  • "Benign SMS messages.txt" contains all benign SMS messages, one per line.
  • "General structure of SIMM messages.txt" describes only the general structure of the SIMM messages we observed during the analysis, not the number of messages sent. The SIMM messages are given one per line. Everything after the %% string is a comment and not part of the SIMM message. Each block of SIMM messages starts with the name and the SHA-256 hash value of the corresponding mobile malware instance. Angle brackets <> denote variables in SIMM messages. Where a SIMM message can change its content dynamically, an example is sometimes given on the same line.
  • "The SIMM messages.txt" includes the SIMM messages used for the evaluation in [1]. Here the general dynamic structures are replaced by randomly chosen valid strings that satisfy the compilation restrictions above. See [1] for more details.
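
As a rough illustration of the %% comment marker and the <> variables described above, a single line of "General structure of SIMM messages.txt" could be split as follows. The function name and the sample line in the usage note are hypothetical, stripping whitespace in front of %% is an assumption, and the block headers with the malware name and hash are not handled.

```python
import re

# Text between angle brackets is treated as a variable name.
VARIABLE = re.compile(r"<([^<>]*)>")


def parse_simm_line(line):
    """Split one line into (message, comment, variables)."""
    message, _sep, comment = line.rstrip("\n").partition("%%")
    # Assumption: whitespace directly before "%%" is not part of the message.
    message = message.rstrip()
    return message, comment.strip(), VARIABLE.findall(message)
```

A made-up line such as "REG <imei> <code> %% e.g. REG 1234 77" would yield the message "REG <imei> <code>", the comment "e.g. REG 1234 77" and the variables imei and code.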


The SIMM part of the SMS-Mv1.0 corpus was collected by Marian Kühnel and the benign part by Tiago Agostinho de Almeida and José María Gómez Hidalgo. If you find the SMS-Mv1.0 Corpus useful, please write me an email or cite the paper below:

[1] M. Kühnel, U. Meyer. 4GMOP: Mopping Malware Initiated SMS Traffic in Mobile Networks. In the 16th Information Security Conference, ISC '13, pages 1-16. Springer, 2013.


The SMS-Mv1.0 Corpus is licensed under the GPLv3.