Natural language processing is a subfield of artificial intelligence. Building a machine or tool that processes data through natural language processing requires mathematics, statistics, algorithms, and Python programming. Advanced techniques such as Word2Vec convert words into vectors, which makes it easier to process text with mathematics and deep learning algorithms. Python can handle the language humans speak, write, and understand. Before we begin the practical implementation of Python code in JupyterLab, it is critical to understand the essentials of natural language processing and machine learning classifiers. The following figure depicts the multiple branches of natural language processing.
It is crucial to set the context and define the background problem that an enterprise is trying to solve through natural language processing and machine learning in Python. Once the problem statement is defined, the next step is to identify the dataset and preprocess it so that a machine can understand the data through Python. Feature engineering is another critical aspect of a data science problem, as it captures the linguistics of the text. Any machine learning classifier can then be applied to the preprocessed dataset to identify an email as either spam or non-spam. In this scenario, I will be processing a preprocessed dataset that contains 48 word-frequency columns. Each word-frequency measure represents the number of times a particular word appears in a document, divided by the total number of words in the document, multiplied by 100. In this preprocessed dataset from UCI, the last column is the label: 1 = SPAM and 0 = not SPAM. This dataset does not require much data wrangling or preprocessing before measuring the accuracy of classification with the Naïve Bayes and AdaBoost classifiers.
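The word-frequency measure described above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not the UCI preprocessing pipeline itself; the sample email text is invented.

```python
# A minimal sketch of the word-frequency measure used by the UCI Spambase
# dataset: 100 * (occurrences of a word) / (total words in the document).
# The sample email text below is invented for illustration.

def word_frequency(document: str, word: str) -> float:
    """Return the percentage of tokens in `document` equal to `word`."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)

email = "free money free offer claim your free prize now"
print(round(word_frequency(email, "free"), 2))  # 3 of 9 tokens -> 33.33
```

Each of the 48 columns in the dataset holds this percentage for one specific word, computed over one email.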
There are several other machine learning classifiers, all part of the supervised machine learning family, that can be applied to the email spam detection problem. As part of solving the problem, a classification model is first trained to learn the data features from the training samples. Once training is complete, the model can classify any new data. Binary classification is the most common form and can determine whether an email is SPAM or non-SPAM. Multiclass classification extends this beyond binary classification, allowing more than two possible classes and outcomes. Handwritten digit recognition, identifying the digits 0 through 9 on bank check systems, has long been a classic research and development problem of this kind. Multi-label classification is another type of algorithm, typically applied in bioinformatics and genomics, where a protein can have multiple functions. In this scenario, breaking a multi-label classifier down into many binary classifiers, one per label, can be a solution.
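The decomposition of a multi-label problem into many binary classifiers can be sketched with scikit-learn. The toy data below is invented for illustration: each sample can carry any subset of three labels, and one binary logistic regression is fitted per label.

```python
# A minimal sketch of decomposing a multi-label problem into binary
# classifiers. The synthetic data is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Each label depends on a different feature, so a sample may carry 0-3 labels.
Y = np.column_stack([(X[:, i] > 0).astype(int) for i in range(3)])

# MultiOutputClassifier fits one independent binary classifier per label.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]).shape)  # one binary prediction per label: (2, 3)
```

A multiclass problem, by contrast, assigns exactly one of several classes to each sample, which most scikit-learn estimators handle natively.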
Figure 2: Adapted from Python Machine Learning by Example.
Bayes’ theorem relates two events, A and B, for example, predicting whether there will be a storm tomorrow, or the probability of getting a head or a tail when a coin is flipped. P(A|B) is the probability of hypothesis A given the observed data B; it represents the posterior probability. P(B|A) represents the probability of observing B given that A is true. P(A) denotes the probability of A being true; it is the prior probability of A. P(B) denotes the probability of the data B, and the theorem combines them as P(A|B) = P(B|A) P(A) / P(B).
Figure 3: Bayes’ Theorem
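The theorem can be made concrete with a small worked example in the spam setting. The probabilities below are invented for illustration: suppose 20% of email is spam (the prior), the word "free" appears in 60% of spam messages, and in 5% of non-spam messages.

```python
# A worked example of Bayes' theorem for spam detection.
# All probabilities here are assumed values for illustration only.
p_spam = 0.20                 # P(A): prior probability an email is spam
p_free_given_spam = 0.60      # P(B|A): likelihood of "free" given spam
p_free_given_ham = 0.05      # P(B|not A): likelihood of "free" given non-spam

# P(B): total probability of observing the word "free"
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given the word "free"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # -> 0.75
```

Seeing the word "free" raises the probability of spam from the 20% prior to a 75% posterior, which is exactly the update the Naïve Bayes classifier performs for every word feature.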
A Python environment has to be set up on macOS or Linux; a step-by-step installation guide is outside the scope of this article. Either the PyCharm or Anaconda distribution can be downloaded to set up the Python and JupyterLab environments. If PyCharm is chosen as the Python environment, individual packages have to be installed through the pip command. If Anaconda is set up, it installs all the necessary Python 3.6 packages, so there won’t be any issues when the Naïve Bayes classifier is accessed from scikit-learn. Most of the time, natural language processing requires massive preprocessing. As long as the data has no imbalances and is a good fit, any machine learning classifier can be applied to classify the SPAM problem in the emails. Python’s scikit-learn comes with a Naïve Bayes classifier for multinomial models. The multinomial Naïve Bayes classifier works with high accuracy on discrete features such as word counts, and it is also known to work well with TF-IDF features. The data has been shuffled so that the Python program accesses different chunks of data, and the last 100 rows have been set aside for testing. The machine learning model with the Naïve Bayes classifier has shown an accuracy of 87%. Any other classifier can be applied as well; the AdaBoost classifier has demonstrated an accuracy of 93%.
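The experiment can be sketched with scikit-learn's `MultinomialNB` and `AdaBoostClassifier`. Since the UCI Spambase file is not bundled here, synthetic non-negative "word frequency" counts stand in for the real features, so the resulting accuracies will differ from the 87% and 93% reported above; with the real CSV, the `X` and `labels` arrays would simply be read from the file instead.

```python
# A minimal sketch of the evaluation described above. The data below is a
# synthetic stand-in for the UCI Spambase features, invented for illustration.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(42)
n = 600
labels = rng.integers(0, 2, size=n)            # 1 = SPAM, 0 = not SPAM
# MultinomialNB expects non-negative counts; spam rows get inflated counts
# in the first few features, mimicking spam-heavy words.
X = rng.poisson(lam=1.0, size=(n, 48)).astype(float)
X[labels == 1, :5] += rng.poisson(lam=3.0, size=(int(labels.sum()), 5))

# Shuffle, then hold out the last 100 rows for testing.
order = rng.permutation(n)
X, labels = X[order], labels[order]
X_train, y_train = X[:-100], labels[:-100]
X_test, y_test = X[-100:], labels[-100:]

nb = MultinomialNB().fit(X_train, y_train)
ada = AdaBoostClassifier(n_estimators=50).fit(X_train, y_train)
print("Naive Bayes accuracy:", nb.score(X_test, y_test))
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```

Both classifiers share the same `fit`/`score` interface, which is why swapping Naïve Bayes for AdaBoost requires changing only a single line.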
The JupyterLab notebook has been shared on GitHub at GPSingularity.
Cox, T. (2018). Raspberry Pi 3 Cookbook for Python Programmers — Third Edition (3 ed.). Birmingham, England: Packt Publishing.
Hardeniya, N. (2016). Natural Language Processing: Python and NLTK. Birmingham, England: Packt Publishing.
Liu, Y. H. (2017). Python Machine Learning By Example. Birmingham, England: Packt Publishing.
Thanaki, J. (2017). Python Natural Language Processing. Birmingham, England: Packt Publishing.