Spam or Ham: Introduction to Natural Language Processing Part 2

Spam … or ham?

This is the second part of my series covering the basics of natural language processing. If you haven’t read the first part yet, you can find it here .

In this part, we will go through an end to end walk through of building a very simple text classifier in Python 3. We will be using the SMS Spam Collection Dataset which tags 5,574 text messages based on whether they are “spam” or “ham” (not spam).

Our goal is to build a predictive model which will determine whether a text message is spam or ham. For the code, see here .

Let’s begin!

After importing the data, I changed the column names to be more descriptive.

import pandas as pd
data = pd.read_csv("spam.csv", encoding = "latin-1")
data = data[['v1', 'v2']]
data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

The first few entries of our data set looks like this:

From briefly exploring our data, we gain some insight into the text that we are working with: colloquial English. This particular data set also has 87% messages labelled as “ham” and 13% messages labelled as “spam”. The class imbalance will become important later when assessing the strength of our classifier.

Data Cleaning

Cleaning textual data is a little different than regular data cleaning. There is a much heavier emphasis on text normalisation than removing outliers or leverage points.

As per Wikipedia:

Text normalisation is the process of transforming text into a single canonical form that it might not have had before.

When used correctly, it reduces noise, groups terms with similar semantic meanings and reduces computational costs by giving us a smaller matrix to work with.

I will go through several common methods of normalisation, but keep in mind that it is not always a good idea to use them. Reserving the judgement for when to use what is the human component in data science.

No Free Lunch Theorem: there is never one solution that works well with everything. Try it with your data set to determine if it works for your special use case.

Case normalisation

Does casing provide any relevant information as to whether a text message is spam or ham? Is HELLO semantically the same as hello or Hello ? You could argue based on prior knowledge that spam messages tend to use more upper casing to capture the readers’ attention. Or, you could argue that it makes no difference and that all words should be reduced to the same case.

Removing stop words

Stop words are common words which do not add predictive value because they are found everywhere. They are analogous to the villain’s remark from the Incredibles movie that:

when everyone is a super, no one will be [a super] .

Some commonly agreed upon stop words from the English language:

  • because
  • to

There is a lot of debate over when removing stop words is a good idea. This practice is used in many information retrieval tasks (such as search engine querying), but can be detrimental when syntactical understanding of language is required.

Removing punctuations, special symbols

Once again, we must consider the importance of punctuation and special symbols to our classifier’s predictive capabilities. We must also consider the importance of each symbol’s functionality. For example, the apostrophe allows us to define contractions and differentiate between words like it’s and its.


Both of these techniques reduce inflection forms to normalise words with the same lemma. The difference between lemmatising and stemming is that lemmatising performs this reduction by considering the context of the word while stemming does not. The drawback is that there is currently no lemmatiser or stemmer with a very high accuracy rate.

Examples of stemming from Stanford NLP

Less common normalisation techniques include error correction, converting words to their parts of speech or mapping synonyms using a synonym dictionary. With the exception of error correction and synonym mapping, there are many pre-built tools for the other normalisation techniques, all of which can be found in the nltk library .

For this particular classification problem, we will only use case normalisation. The rationale is that it will be hard to apply a stemmer or lemmatiser onto colloquial English and that since the text messages are so short, removing stop words might not leave us with much to work with.

def review_messages(msg):
# converting messages to lowercase
msg = msg.lower()
return msg

For reference, this function does case normalisation, removing stop words and lemmatising.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stopwords = set(stopwords.words('english'))
def alternative_review_messages(msg):
# converting messages to lowercase
msg = msg.lower()
# removing stopwords
msg = [word for word in msg.split() if word not in stopwords]
# using a lemmatized
msg = " ".join([lemmatizer.lemmatize(word) for word in msg])
return msg

We apply the first function to normalise the text messages.

data['text'] = data['text'].apply(review_messages)

Vectorizing the Text

In the first part of this series, we explored the most basic type of word vectorizer, the Bag of Words Model, which will not work very well for our Spam or Ham classifier due to its simplicity.

Instead, we will use the TF-IDF vectorizer (Term Frequency — Inverse Document Frequency), a similar embedding technique which takes into account the importance of each term to document.

While most vectorizers have their unique advantages, it is not always clear which one to use. In our case, the TF-IDF vectorizer was chosen for its simplicity and efficiency in vectorizing documents such as text messages.

TF-IDF vectorizes documents by calculating a TF-IDF statistic between the document and each term in the vocabulary. The document vector is constructed by using each statistic as an element in the vector.

The TF-IDF statistic for term i in document j is calculated as follows:

After settling with TF-IDF, we must decide the granularity of our vectorizer. A popular alternative to assigning each word as its own term is to use a tokenizer. A tokenizer splits documents into tokens (thus assigning each token to its own term) based on white space and special characters.

For example, the phrase what’s going on might be split into what, ‘s, going, on .

The tokenizer is able to extract more information than word level analysers. However, tokenizers do not work well with colloquial English and may encounter issues splitting URLs or emails. As such, we will use a word level analyser, which assigns each word to its own term.

Before training the vectorizer, we split our data into a training set and a testing set. 10% of our data is allocated for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size = 0.1, random_state = 1)
# training the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

Building and Testing the Classifier

The next step is to select the type of classifier to use. Typically in this step, we will choose several candidate classifiers and evaluate them against the testing set to see which one works the best. However, for our purposes, we can assume that a Support Vector Machine works well enough.

A larger value of C represents a smaller hyperplane. Parameters such as this can be precisely tuned via grid search.

from sklearn import svm
svm = svm.SVC(C=1000), y_train)

Now, let’s test it.

from sklearn.metrics import confusion_matrix
X_test = vectorizer.transform(X_test)
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))

The results aren’t bad at all! We have no false positives and around 15% false negatives.

Let’s test it against a few new examples.

def pred(msg):
msg = vectorizer.transform([msg])
prediction = svm.predict(msg)
return prediction[0]

Looks right to me 🙂

I hope you enjoyed Part 2 of this tutorial. Once again, the entire code peice can be found here. Part 3 will be out shortly. It will cover neural based word embeddings and more complex applications of NLP.