At least 97% of Americans use text messaging on their mobile phones every day. According to Portio Research, 8.3 trillion messages were exchanged over mobile phones in 2016. This rising flood of big data amounts to roughly 23 billion messages per day, or 16 million messages per minute. There were around 6.4 billion mobile subscribers worldwide by the end of 2012. According to Portio Research, the mobile subscriber base was set to grow at a CAGR of 4.8% from 2014 to 2017. By the end of 2017, the number of mobile subscribers had reached 7.4 billion.
The proliferation of smart devices powered by exponentially growing computing power has driven a significant rise in the global smartphone system-on-chip (SoC) market, led by Qualcomm, Apple, MediaTek, Samsung, HiSilicon, Spreadtrum, and a large number of other smartphone chip manufacturers. Powering these chips with artificial intelligence technology paves the path to 5G, with higher performance and better signal processing. Despite the multi-functional and advanced capabilities of smartphones, simple text messaging has continued to soar in markets worldwide. The exponential growth of computing power is what made it possible to generate such massive volumes of data through text messages.
The timeline of the mobile handset industry, from 1983 (when the first mobile handset launched) to 2002 (the first mobile phone with a touchscreen), shows the significant advances in computing and SoC architecture behind this tsunami of SMS data. The first SMS communication service launched in 1992, 3G mobile services launched in 2002, and 4G networks launched in 2010. Faster delivery increased communication through SMS, which businesses and individuals now use to manage a significant part of their lives. The mobile messaging service industry generated revenues of $212 billion in 2012 alone.
Figure 1. Adapted from Portio Research.
According to Portio Research, SMS traffic jumped from just 0.5 billion messages to 100 billion messages between 1996 and 1999. Over the next four years, by the end of 2003, SMS traffic more than quadrupled to 450 billion messages. Only two years later, in 2005, SMS traffic passed the one-trillion-message mark. By 2009, the world had seen traffic of five trillion messages. In 2015, traffic peaked at 8.3 trillion messages. SMS traffic grew by leaps and bounds with application-to-person and person-to-application messaging in the banking, mobile health, and mobile payments sectors. This growth also made room for abundant spam from telemarketers sending SMS texts. Nowadays, many recruiting agencies send SMS messages advertising job positions, sometimes to subscribers and sometimes to people who never subscribed. I receive a heavy volume of SMS messages from recruiting agencies without having subscribed. This has been coordinated by some of the people who hacked my accounts on Facebook and Twitter. I still have those messages.
According to two Forrester Research publications, Mobile Media Application Spending Forecast 2012–2017 (EU-7) and Mobile Media Application Spending Forecast 2012–2017 (US), six billion SMS messages are sent in the US alone every day. The majority of these texts, 80%, are generated by American adults. The rise of spam can be attributed to the 45% open rate of SMS, as opposed to a 20% open rate for email. The response rate is also higher for text messages, at 45% versus 6% for email. Americans exchange twice as many text messages as phone calls.
Naïve Bayes classifier
Given the exponential growth in big data and SMS traffic, there has also been significant growth in SMS spam as a medium for committing fraud and advertising job opportunities. Spam filtering can be implemented with a Naïve Bayes classifier that labels each SMS as either spam or ham. In essence, a Naïve Bayes classifier can work as anti-spam software with high accuracy. In this Python implementation, the classifier reached an accuracy of 99.38% on the training set and 98.15% on the test set. A Kaggle dataset was used for the spam detection. The dataset file has two columns, labeled v1 and v2: the v1 column contains the label, either spam or ham, while the v2 column contains the actual SMS message.
Mobile users in the US receive approximately 1.1 billion spam SMS messages and Chinese mobile users receive 8.29 billion every week from various advertising media and fraudulent corporations. Many classifiers can be applied to the SMS spam filtering problem, such as rule induction, neural networks, decision trees, Naïve Bayes, k-nearest neighbors, and support vector machines. Classifying SMS text is quite different from classifying email, because the length of an SMS is limited to 160 characters. The featurization therefore has to be adequate to distinguish ham from spam with very little text. Historically, the Naïve Bayes classification algorithm has proven highly effective at identifying spam.
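To make the spam/ham decision concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier written from scratch with the Python standard library. It is not the program shared on GitHub and it does not reproduce the accuracy figures above. The class name NaiveBayesSMS, the tokenizer, and the six training messages are hypothetical illustrations, and add-one (Laplace) smoothing is an assumption about how unseen word/class pairs are handled.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSMS:
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)       # messages per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total_words = {
            label: sum(c.values()) for label, c in self.word_counts.items()
        }
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        n_messages = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / n_messages)  # class prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log(
                    (count + 1) / (self.total_words[label] + vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy messages in the v1 (label) / v2 (text) layout; not Kaggle data.
train = [
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Call me when you get home"),
    ("ham", "I will text you the address later"),
    ("spam", "Free ringtone! Text WIN now to claim your prize"),
    ("spam", "URGENT call now to claim your free cash prize"),
    ("spam", "You have won a free holiday, call customer service now"),
]
model = NaiveBayesSMS().fit([text for _, text in train],
                            [label for label, _ in train])

print(model.predict("Claim your free prize now"))
print(model.predict("Are you home for lunch"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters even for messages capped at 160 characters once the vocabulary grows.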
Figure 2. Kaggle SPAM dataset.
Figure 3. Adapted from the 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob) research paper.
TF-IDF Vectorizer vs. Common vectorizer strategies
As with other machine learning problems, the process starts by loading the dataset in Python with ISO-8859-1 encoding, and then building the model by training and testing the Naïve Bayes algorithm. Any irrelevant columns in the file need to be dropped. Feature extraction can be performed either with a count vectorizer or with a TF-IDF vectorizer. CountVectorizer implements both tokenization and occurrence counting in a single class: the text is split into word tokens and the occurrences of each word are counted across the corpus of text documents. Alternatively, a TF-IDF vectorizer can be applied. In a large text corpus, words such as "the", "a", or "is" occur very often in English yet carry little information about the content of a message, and TF-IDF reduces their weight. In scikit-learn, TfidfTransformer reweights an existing matrix of word counts, while TfidfVectorizer performs the counting and the TF-IDF weighting in one step.
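The difference between raw counts and TF-IDF weights can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the scikit-learn implementation: the in-memory bytes stand in for the Kaggle file, the "extra" column and the three messages are invented, and the function names are hypothetical. The weighting uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn documents as its default. Unlike scikit-learn's default tokenizer, this one keeps single-character words such as "a".

```python
import csv
import io
import math
import re
from collections import Counter

# Invented stand-in for the dataset file: ISO-8859-1 bytes with the
# v1/v2 columns plus one irrelevant column ("extra") to drop.
RAW = (
    "v1,v2,extra\n"
    "ham,Is the meeting at the café today,\n"
    "spam,Call now for a free ringtone,\n"
    "spam,Free prize call the customer service now,\n"
).encode("iso-8859-1")


def load_rows(raw_bytes):
    """Decode as ISO-8859-1 and keep only the label and message columns."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("iso-8859-1")))
    return [(row["v1"], row["v2"]) for row in reader]


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def count_vectors(docs):
    """Return the sorted vocabulary and one row of word counts per message."""
    vocab = sorted({token for doc in docs for token in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocab])
    return vocab, rows


def tfidf_vectors(vocab, counts):
    """Reweight counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    doc_freq = [sum(1 for row in counts if row[j] > 0)
                for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted_rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        weighted_rows.append([x / norm for x in weighted])
    return weighted_rows


rows = load_rows(RAW)
labels = [label for label, _ in rows]
messages = [text for _, text in rows]
vocab, counts = count_vectors(messages)
tfidf = tfidf_vectors(vocab, counts)

# In the second message, "call" and "ringtone" have the same raw count,
# but "call" appears in two messages and "ringtone" in only one.
i_call, i_ring = vocab.index("call"), vocab.index("ringtone")
print(counts[1][i_call], counts[1][i_ring])
print(round(tfidf[1][i_call], 3), round(tfidf[1][i_ring], 3))
```

The two printed lines show the design choice: counting treats both words equally, while TF-IDF gives the rarer word the larger weight.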
Generating a word cloud with the wordcloud library shows the most frequently repeated spam words, such as call, free, now, UK, ringtone, customer service, chat, landline, and text, rendered in a combination of blue and green. The word cloud generated by the program is shown below:
The data visualization for ham produces the following word cloud from the program.
Figure 5. Python program output from Jupyter console for HAM text.
I have shared the program on GitHub at GPSingularity.
Figure 6. JupyterLab Python notebook.
References
Arifin, D. D., Shaufiah, & Bijaksana, M. A. (2017, January). Enhancing spam detection on mobile phone Short Message Service (SMS) performance using FP-growth and Naive Bayes Classifier. 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). http://dx.doi.org/10.1109/APWiMob.2016.7811442
O’Grady, M. (2012). SMS Usage Remains Strong In The US: 6 Billion SMS Messages Are Sent Each Day. Retrieved May 13, 2015, from https://go.forrester.com/blogs/12-06-19-sms_usage_remains_strong_in_the_us_6_billion_sms_messages_are_sent_each_day/
Portio Research (2017). WorldWide SMS Markets 2014–2017. Retrieved May 13, 2018, from http://www.xconnect.net/wp-content/uploads/worldwide-sms-markets-portio-strikeiron.pdf
Smith, A. (2015). U.S. Smartphone Use in 2015. Retrieved May 13, 2018, from http://www.pewinternet.org/2015/04/01/us-smartphone-use-in-2015/
Srivastava, S. (2017). Global Smartphone SoC Market Crossed US$8 Billion in Q3 2017, A Third Quarter Record. Retrieved May 13, 2018, from https://www.counterpointresearch.com/global-smartphone-soc-market-crossed-us8-billion-q3-2017-third-quarter-record/