Bayesian Theorem and its connection to Sentiment Analysis - Jamie Maguire

In my last post, I introduced sentiment analysis, the Naïve Bayes classification technique and why you or your business might be interested in this.

In this post I’ll delve into the underlying theorem in more detail, walk through an example and show how it’s connected to sentiment analysis.

The rule itself is written like this: (Boone)

p(A|B) = p(B|A) p(A) / p(B)

Now let’s break this down and explain each component:

p(A|B): ‘The probability of A given B’. This is the probability of observing outcome A, given that evidence B is present. This is what we want to find out. (Boone)

p(B|A): This is the probability of the evidence turning up, given that the outcome has occurred.

p(A): This is the probability of the outcome occurring, without the knowledge of the new evidence.

p(B): This is the probability of the evidence arising, without regard to the outcome.

The sample data set discussed by (Amiune) illustrates how the theorem can be applied to decide whether or not an email is spam if it has the word “buy” in the mail body.

P(spam|words) = P(words|spam) P(spam) / P(words)

We have a database of 100 emails.

  • 60 of those 100 emails are spam
  • 48 of those 60 emails that are spam have the word “buy”
  • 12 of those 60 emails that are spam don’t have the word “buy”
  • 40 of those 100 emails aren’t spam
  • 4 of those 40 emails that aren’t spam have the word “buy”
  • 36 of those 40 emails that aren’t spam don’t have the word “buy”

What is the probability that an email is spam if it has the word “buy” in the content?

The answer to the above is as follows:

There are 48 emails that are spam and have the word “buy”.

And there are 52 emails that have the word “buy”: 48 that are spam plus 4 that aren’t spam.

So the probability that an email is spam if it has the word “buy” is 48/52 ≈ 0.92, and we should probably put this email in the spam folder.
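The counting argument above can be checked with a few lines of Python. This is only a sketch: the counts come from the email database in this post, while the function and variable names are made up for illustration.

```python
# Counts taken from the 100-email database described above.
spam_with_buy = 48       # spam emails containing "buy"
not_spam_with_buy = 4    # non-spam emails containing "buy"


def spam_given_word(spam_hits: int, not_spam_hits: int) -> float:
    """Share of the emails containing the word that are spam."""
    return spam_hits / (spam_hits + not_spam_hits)


probability = spam_given_word(spam_with_buy, not_spam_with_buy)
print(round(probability, 4))  # 0.9231
```

Only the emails that contain “buy” appear in the denominator, which is why the 48 emails without the word play no part in the result.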

Redefining the Problem to Use Probabilities

As mentioned previously, the rule and notation are based on probabilities, so we can redefine the problem to use probabilities rather than quantities, using the same database of emails:

  • 60% of those emails are spam
  • 80% of those emails that are spam have the word “buy”
  • 20% of those emails that are spam don’t have the word “buy”
  • 40% of those emails aren’t spam
  • 10% of those emails that aren’t spam have the word “buy”
  • 90% of those emails that aren’t spam don’t have the word “buy”

What is the probability that an email is spam if it has the word “buy”?

The notation to arrive at the answer looks like this:

P(spam) = probability that an email is spam
P(not spam) = probability that an email isn’t spam
P(“buy”|spam) = probability that an email that is spam has the word “buy”
P(“buy”|not spam) = probability that an email that isn’t spam has the word “buy”
P(spam|“buy”) = probability that an email that has the word “buy” is spam

So P(spam|“buy”) is the answer we are looking for.
P(“buy”|spam) * P(spam) is the proportion of all emails that are spam and have the word “buy”.
P(“buy”|not spam) * P(not spam) is the proportion of all emails that aren’t spam and have the word “buy”.

Summing the previous two, P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam), gives the proportion of all emails that have the word “buy”.

Meaning the resulting equation looks like this:

P(spam|“buy”) = P(“buy”|spam) * P(spam) / (P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam))

This is Bayesian Theorem, with p(B) expanded into its spam and not-spam parts.
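As a rough sketch, the equation above translates directly into a small Python function. The function name and parameter names are hypothetical. The inputs are the percentages listed earlier, and P(not spam) is assumed to be 1 - P(spam) because every email is either spam or not.

```python
def bayes_posterior(p_word_given_spam: float,
                    p_spam: float,
                    p_word_given_not_spam: float) -> float:
    """P(spam | word), with the evidence term expanded over both classes."""
    p_not_spam = 1.0 - p_spam  # assumption: only two classes exist
    # Denominator: proportion of all emails that contain the word.
    evidence = (p_word_given_spam * p_spam
                + p_word_given_not_spam * p_not_spam)
    return p_word_given_spam * p_spam / evidence


result = bayes_posterior(p_word_given_spam=0.8,
                         p_spam=0.6,
                         p_word_given_not_spam=0.1)
print(round(result, 4))  # 0.9231
```

The answer matches the one obtained by counting emails, since the percentages describe the same database.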

Or, to inject the numbers: 0.8 * 0.6 / (0.8 * 0.6 + 0.1 * 0.4) = 0.48 / 0.52 ≈ 0.923

A simulation of this scenario gave 0.9222485960747988, in line with the exact value.

Or in plain English, based on our existing data set, there is a 92% chance that an email containing the word “buy” is spam.
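The simulation behind the figure quoted above is not shown in this post. One plausible way to produce such an estimate, offered purely as an assumption about the approach, is to generate random emails from the stated percentages and measure the share of “buy” emails that are spam. The function name, sample size and seed below are illustrative choices.

```python
import random


def simulate_spam_given_buy(n_emails: int, seed: int = 0) -> float:
    """Estimate P(spam | "buy") by sampling synthetic emails."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    buy_total = 0
    buy_and_spam = 0
    for _ in range(n_emails):
        is_spam = rng.random() < 0.6          # P(spam) = 0.6
        p_buy = 0.8 if is_spam else 0.1       # P("buy" | class)
        if rng.random() < p_buy:
            buy_total += 1
            if is_spam:
                buy_and_spam += 1
    return buy_and_spam / buy_total


estimate = simulate_spam_given_buy(100_000)
print(estimate)
```

Because the emails are sampled at random, the estimate lands near 0.48 / 0.52 without matching it exactly, and it moves slightly with the seed and the sample size.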

So how do we use this theorem to perform sentiment analysis?  Read on!

Sentiment Analysis Using Bayesian Theorem

Performing sentiment analysis using Bayesian Theorem involves writing a Naïve Bayesian Classifier, which is based on the Bayes Rule that we’ve just discussed.  This rule is a way of calculating the conditional probability of an event from a given set of known probabilities.  As we’ve just seen, the rule is often used in email systems to detect whether an email is spam or legitimate based on the presence of a certain set of keywords.  For sentiment analysis, the classes become sentiments such as positive and negative, and the evidence is the words in the text.
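To make the idea concrete, here is a minimal sketch of a Naïve Bayes sentiment classifier in Python. It is not the sample classifier referred to in this post. The class name, the whitespace tokenisation, the add-one smoothing and the four training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Toy classifier: picks the label with the highest P(label | words)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocabulary = set()

    def train(self, text: str, label: str) -> None:
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, text: str) -> str:
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # log P(label): the prior, from how often the label was seen
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # log P(word | label) with add-one smoothing, so an
                # unseen word does not force the probability to zero
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            scores[label] = score
        # P(words) is the same for every label, so it can be ignored
        return max(scores, key=scores.get)


# Made-up training sentences, purely for illustration.
classifier = NaiveBayesSentiment()
classifier.train("I love this phone", "positive")
classifier.train("what a great day", "positive")
classifier.train("I hate the traffic", "negative")
classifier.train("this film was terrible", "negative")

print(classifier.classify("I love this great film"))  # positive
print(classifier.classify("terrible traffic"))        # negative
```

The “naïve” part is the assumption that each word contributes independently, which is what allows the per-word probabilities to be multiplied together (added, once logarithms are taken).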

You can find a sample classifier on GitHub. Have a play around with it and see how you get on.  In my next post, I’ll talk a little bit more about the difficulties of sentiment analysis and how some of these can be alleviated.

In the meantime, feel free to reach out if you have any questions or comments.
