Fake Jobs Text Classification using Naive Bayes

Text classification using Naive Bayes can help solve many real-world challenges, including the identification of fake job postings.

Fake job posts have been known to trick victims into scams and to harvest their personal data for the posters’ benefit. As such, I want to tackle this issue by creating a Chrome extension that can automatically detect fake job postings on sites like Indeed, Seek, and Jora.

I will be writing a series of blog posts on developing a Fake Job Posting Classifier, extending it into a Chrome extension, and finally making improvements and fine-tuning the model to reach its maximum potential (hopefully).

In this first blog post, I plan to introduce the following:

  1. Why use Naive Bayes for Natural Language Processing (NLP)
  2. Exploratory Data Analysis
  3. Data Preparation
  4. Modelling using Naive Bayes

Why Naive Bayes?

Because Naive Bayes is easy to implement, scales well, and is well suited to text classification problems.

The Naive Bayes algorithms are a family of probabilistic classifiers based on applying Bayes’ theorem to predict the label of a text. Because they are probabilistic, they calculate the probability of each label given a text and return the label with the highest probability.

It’s essentially a basic algorithm that can classify data based on conditional probability.
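To make this concrete: for a document d made up of words w1, …, wn and a label y (real or fake), Bayes’ theorem gives

P(y | d) = P(d | y) × P(y) / P(d)

The “naive” part is the assumption that words are conditionally independent given the label, so the likelihood factorises into per-word probabilities:

P(d | y) ≈ P(w1 | y) × P(w2 | y) × … × P(wn | y)

The classifier then returns whichever label y maximises P(y | d).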


Exploratory Data Analysis

Before we dive into building a text classification Naive Bayes model for the fake jobs dataset, it’s good practice to gather insights about our data.

We first start off by importing our much-needed Python libraries (os, matplotlib, numpy, and pandas) and getting a preview of our data using the .head() method.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Build the path to the dataset relative to the working directory
file = 'fake_job_postings.csv'
script_dir = os.getcwd()
data_path = os.path.normcase(os.path.join(script_dir, 'data/'))
full_path = os.path.normcase(os.path.join(data_path, file))

# Load the CSV, using job_id as the index and forcing the text columns to strings
fk_job = pd.read_csv(full_path, index_col='job_id',
                     dtype={
                         'company_profile': str,
                         'description': str,
                         'requirements': str,
                         'benefits': str},
                     na_values="")
fk_job.head()
[Figure: preview of the fake jobs dataset]

By using the .head() method, we get to preview our data in a tabular format.

We could also use other methods such as .info() and .describe() to do some quick analysis of the data, but I’ll leave those for you to try out.
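For reference, both are standard pandas one-liners:

fk_job.info()        # column dtypes, non-null counts, and memory usage
fk_job.describe()    # summary statistics for the numeric columns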

To understand our data better, we can also take our y label needed for training, “fraudulent”, and find its distinct count using the .value_counts() method.

fk_job.fraudulent.value_counts()
[Figure: count of real and fake job postings: 17,014 real, 866 fake]

Here, we can see that 17,014 job postings are real and 866 are fake. That means about 4.8% of our job postings are fraudulent (866 out of 17,880 in total).
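If you’d rather have pandas compute that proportion directly, value_counts() accepts a normalize flag (a small convenience, not in the original snippet):

fk_job.fraudulent.value_counts(normalize=True)  # fractions of the total instead of raw counts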

We can also make graphs to give us a visual representation of the biases in our data. For example, if we want to know where all these job postings come from (geographically), we can plot countries against the count of job postings.

The code below takes our data, groups it by “country” and “fraudulent”, and aggregates by count.

# Count postings per (country, fraudulent) pair, then pivot fraudulent into columns 0 and 1
ccount = fk_job.groupby(['country', 'fraudulent']).size().unstack('fraudulent', fill_value=0)
ccount = ccount.sort_values(by=[0, 1], ascending=False)
ccount_10 = ccount[:10]  # Take the top 10 countries

# Stacked bar chart: real postings (column 0) with fraudulent ones (column 1) on top
fig = plt.figure()
axi = fig.add_axes([0, 0, 1, 1])
axi.bar(ccount_10.index, ccount_10[0], color='b')
axi.bar(ccount_10.index, ccount_10[1], bottom=ccount_10[0], color='r')
axi.set_ylabel('Count')
axi.set_xlabel('Country')
axi.set_title('Job postings by country')
axi.legend(labels=['Real', 'Fraudulent'])
[Figure: job postings by country, with the US in the majority]

From our plot, we can see that the majority of our data comes from the US, at around the 10,000 mark, whereas GB (the United Kingdom) only hovers around the 2,000 mark.

We can also tell from the plot that the majority of the fraudulent job postings come from the US, which makes sense given that most of our data is from the US.

This could mean that the model we’re going to develop will only predict accurately within the US job market.


Data Preparation

Now that we’ve taken a good look at the data, we’re ready to start creating our Naive Bayes model. Here, we bring in the additional libraries we need, such as tensorflow.keras (for the tokenizer) and sklearn.

Because I want to keep it simple, I will only be using the job description field from the data to create our model. In the future, we could incorporate the other features once we’ve identified similar fields in other data assets and on job search sites.

As NLP requires the text to be vectorised before it’s fed into a machine learning model, we can do so as below:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 100
vocab_size = 1500
embedding_dim = 32  # unused here; reserved for the neural-network version in a later post

# Replace NaN descriptions with empty strings and pull out the arrays we need
sentence = {}
sentence['descriptions'] = fk_job['description'].replace(np.nan, '', regex=True).to_numpy()
sentence['labels'] = fk_job['fraudulent'].to_numpy()

# Map each word to an integer index, with an out-of-vocabulary token for unseen words
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(sentence['descriptions'])
sequences = tokenizer.texts_to_sequences(sentence['descriptions'])

# Pad (or truncate) every sequence to the same length
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

In the code above, we first replace all the NaN values with an empty string. This is just for data cleaning purposes.

Following that, we tokenize the “descriptions” so that each word is converted into a number. We do that because texts have to be converted into vectors before Naive Bayes can perform its classification.

Finally, we pad the sequences to ensure that all our job descriptions are of the same length. While padding is mainly needed for neural networks, I’m doing it here so the pipeline is ready for when we start using neural networks to build this classifier.
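If tokenizing and padding feel abstract, here’s a quick illustration on two made-up sentences (not from the dataset):

toy = ['work from home and earn cash fast', 'senior software engineer wanted']
toy_tok = Tokenizer(num_words=50, oov_token='<OOV>')
toy_tok.fit_on_texts(toy)
print(toy_tok.texts_to_sequences(toy))
# [[2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12]] - each word gets an integer index (1 is reserved for <OOV>)
print(pad_sequences(toy_tok.texts_to_sequences(toy), maxlen=8, padding='post'))
# the shorter sentence is padded with trailing zeros so both rows have length 8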


Modelling using Naive Bayes

Before we start creating our Naive Bayes model, we first split our data into training and testing sets. For simplicity’s sake, we’re just going to use a 70:30 ratio of training data to testing data, as below:

from sklearn.model_selection import train_test_split

# Hold out 30% of the data for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, sentence['labels'], test_size=0.3, random_state=0)

From there, we just have to fit our training data to the Naive Bayes model and start predicting on our testing data.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Fit a Gaussian Naive Bayes model and score it on the held-out test set
model_GNB = GaussianNB()
model_GNB.fit(x_train, y_train)
y_predict = model_GNB.predict(x_test)
accuracy_score(y_test, y_predict)
# 0.8885160328113348 accuracy

By utilising sklearn’s accuracy_score() function, we can easily calculate the accuracy of our classifier. Here, we got an accuracy of about 88.8%, which is a pretty lucky number for our Feng Shui! One caveat: since only about 4.8% of the postings are fake, a classifier that always predicts “real” would score around 95%, so accuracy alone doesn’t tell the whole story on an imbalanced dataset like this one.

While this was a very straightforward way to train and test our Naive Bayes model, I’d recommend validating the stability of the model using k-fold cross-validation to ensure that the model has picked up most of the patterns in the data.
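As a rough sketch of what that looks like with sklearn (the 5-fold choice here is mine, not from this post):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Train and score the model on 5 different train/test partitions of the data
scores = cross_val_score(GaussianNB(), padded_sequences, sentence['labels'], cv=5)
print(scores.mean(), scores.std())  # mean accuracy and its spread across folds

A small spread across folds suggests the performance is stable rather than an artefact of one particular split.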


Summary

In this blog post, you’ve not only learned how to create a Naive Bayes text classifier in Python, but you’ve also learned how to do exploratory data analysis by building plots, and how to prepare data by tokenizing and padding.

Next, we’ll learn to save our model, deploy the model to a Flask API, and test it out using Postman.

If you have questions about creating a text classifier, leave a comment below 🙂

