What is Sentiment Analysis?
Sentiment Analysis (SA) is the computational study of people’s opinions or
emotions as expressed in online text and social media. It is also sometimes
referred to as opinion mining or emotion analysis. SA is a classification
process, typically carried out at the document level, the sentence level or
the aspect level. Aspect-level SA classifies sentiment with respect to
specific aspects of an entity, because an opinion holder can have different
opinions about different aspects of the same entity. For example, in SA of
restaurant reviews the same person can tweet about the food with a positive
sentiment (“The food was awesome,”) and about the waiting time with a
negative sentiment (“but we had to wait for very long, I hated it!”).
An introduction to different study methods
A mix of machine learning (ML) and lexicon-based methods is used to evaluate
emotion in text. Machine learning techniques first train an algorithm on
inputs with known outputs (labels) so that it can later work with new, unseen
data. Lexicon-based analysis rests on the assumption that the overall polarity
of a sentence or piece of text is the sum of the polarities of its individual
words. The Naïve Bayes algorithm, n-gram sentiment analysis and SVM methods
are a few of the commonly used approaches in such studies.
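The lexicon assumption above can be sketched in a few lines of Python. This is only a toy illustration: the `LEXICON` words and their scores, and the helper names `lexicon_polarity` and `label`, are made up for this example and are not taken from any published lexicon.

```python
# Toy polarity lexicon. The words and scores are illustrative assumptions,
# not values from a real sentiment lexicon.
LEXICON = {
    "awesome": 2, "good": 1, "happy": 1,
    "long": -1, "hated": -2, "sucks": -2,
}

def lexicon_polarity(text):
    """Sum the polarity of every known word; unknown words count as 0."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return sum(LEXICON.get(w, 0) for w in words)

def label(score):
    """Map a summed score to one of three coarse sentiment labels."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

food = "The food was awesome"
wait = "but we had to wait for very long, I hated it!"

print(label(lexicon_polarity(food)))  # positive
print(label(lexicon_polarity(wait)))  # negative
```

Scoring the two halves of the restaurant review separately is what lets the same reviewer come out positive about the food and negative about the wait. A plain sum like this ignores negation ("not good") and context, which is one reason it is usually compared against trained classifiers.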
Online text/sms coaching
Let us go back to the study of online one-on-one chat conversations between a coach and a mentee, where the goal is to use emotion/sentiment analysis to quantify the mental state or happiness level of the mentee. Typically in SA, as described above, we study one-time tweets or reviews written by different people. The opinion is meant to take a side, or can at least be characterized by some polarity: good or not good, in favor or not in favor, and so on. Also, because it is an opinion, it is about something.
When we study a conversation, however, we are dealing with a very different problem. A conversation between two people can be about many things and may have varying contexts, which also depend on the flow of the conversation. What the person seeks is support, and we wish to evaluate through emotion analysis the mental health of the person and how the chat conversations with the coach affect it, both within a single conversation and over weeks or months of support. To evaluate that effectively using ML/software/logic, we study the conversations to understand their structure, if there is any. For example, let us explore how a typical conversation between a coach and a mentee may go.
Most conversations begin with a "Hello" or a "How are you doing?". Written conversations are normally shorter and, especially in a professional therapy setup, time-bound. The initial responses to such questions depend on the mentee's level of mental happiness and on how comfortable he or she is with online therapy and with the therapist. Only after a series of further exchanges does the coach really start to get to know how the person is doing. Still, for the ML we need to look at the first few responses from the patient or mentee to get a general idea of how the person is feeling at the beginning of the day, before getting help from the therapist. For example, a first response like “It really sucks!” to "How are you doing today?" can serve as baseline information about the patient's health, which can be used later to evaluate whether the person felt better at the end of the chat.
Normally the coach will then ask a series of questions to understand what is affecting the patient. After that, the coach begins to make suggestions and apply principles of therapy to help the patient deal with his or her current situation. Similarly, when the conversation ends, the last few text exchanges can be used to analyze whether anything changed in the emotional wellbeing of the mentee after chatting with the coach. These pieces of text can be collected, saved and used as training data for SA processing.
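One way to collect those opening and closing pieces of text is sketched below. It assumes, purely for illustration, that a chat is stored as a list of `(speaker, text)` pairs. The helper name `opening_and_closing` and every sample message other than the two quoted above are hypothetical.

```python
def opening_and_closing(messages, speaker="mentee", n=2):
    """Return the first n and the last n messages written by `speaker`."""
    own = [text for who, text in messages if who == speaker]
    return own[:n], own[-n:]

# Hypothetical transcript; only the first two lines come from the text above.
chat = [
    ("coach", "How are you doing today?"),
    ("mentee", "It really sucks!"),
    ("coach", "What happened?"),
    ("mentee", "Work has been overwhelming."),
    ("coach", "Let's try breaking the week into smaller steps."),
    ("mentee", "That sounds doable."),
    ("mentee", "Thanks, I feel a bit better now."),
]

opening, closing = opening_and_closing(chat, n=1)
print(opening)  # ['It really sucks!']
print(closing)  # ['Thanks, I feel a bit better now.']
```

The opening messages give the baseline and the closing messages give the end-of-session state, so the two lists can be annotated and classified separately. How many messages `n` to keep on each side is a choice the study would have to tune.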
Phase I
Research Methods with ML/Python
A sample of patients aged 25-44 years who receive support from trained coaches via SMS texting/online chat is studied for improvement in emotional and mental wellbeing over a period of 12 weeks. An initial pilot study starts with training data in which the opening messages of each chat are manually annotated as one of [‘positive’, ‘negative’, ‘neutral’]. SA is then run on test data using various available methods, either a machine learning algorithm (Naïve Bayes, SVM, etc.) or a lexicon-based approach, to classify the messages into the same wellbeing markers. A trend from a lower marker of happiness such as ‘negative’ or ‘neutral’ towards a higher marker is considered progress, and is desired both within a chat session and over the weeks. The accuracy of the methods is measured, improved and compared, and the best method is recommended for a detailed study.
The Python library scikit-learn is used for machine learning, and nltk
for tokenizing and stemming to build the word list from which a bag of
words is finally created. pandas and numpy are used for data processing;
the training data is checked for skew across the ‘positive’, ‘negative’
and ‘neutral’ labels. Plotly is used for plotting graphs, among other
things to see the skew of the data. The data is split 80/20 into
training and test data, and the test data is finally classified using
different machine learning algorithms. Emoticons, exclamation marks and
other special characters, as well as any abbreviations, are ignored for
now.
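A minimal sketch of such a pipeline with scikit-learn might look like the following. The fifteen sample messages and their labels are invented for illustration, stemming with nltk is left out to keep the example short, and `random_state=0` is an arbitrary choice to make the split repeatable. `CountVectorizer` builds the bag of words, and its default tokenizer already drops punctuation and single-character tokens.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented opening messages with hand-assigned labels (illustrative only).
texts = [
    "I feel great today", "things are going well", "I am happy with my week",
    "feeling good and rested", "today was a lovely day",
    "it really sucks", "I feel terrible", "everything is going wrong",
    "I am so tired and sad", "this week was awful",
    "I am okay", "nothing much happened", "just a regular day",
    "same as usual", "not sure how I feel",
]
labels = ["positive"] * 5 + ["negative"] * 5 + ["neutral"] * 5

# Check how evenly the three labels are represented (the skew check).
print(Counter(labels))

# 80/20 split between training data and test data.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)

# Bag-of-words features followed by a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predicted = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(list(zip(X_test, predicted)), accuracy)
```

Swapping `MultinomialNB()` for another estimator, for example `sklearn.svm.LinearSVC()`, leaves the rest of the pipeline unchanged, which makes it easy to compare methods on the same split. With a data set this small the accuracy figure means very little; it is here only to show where the number comes from.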
Phase II
The well-being markers are now classified into more specific
categories: [‘Easy’, ‘Normal’, ‘Difficult’, ‘At risk’]. SA is done at
the beginning and end of each chat, and is also evaluated over time
(12 weeks?) to measure how wellbeing progresses under the online support
offered by the coach. A trend from a lower marker of happiness such as
‘At risk’ towards a higher marker such as ‘Easy’ is considered progress,
and is desired both within a chat session and over time.
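To make "moving towards a higher marker" measurable, the four markers can be mapped to an ordinal scale, as in the sketch below. The text only fixes ‘At risk’ as the lowest marker and ‘Easy’ as the highest; placing ‘Difficult’ below ‘Normal’ is an assumption, and the names `MARKER_RANK` and `session_change` as well as the sample sessions are hypothetical.

```python
# Assumed ordering from lowest to highest wellbeing marker.
MARKER_RANK = {"At risk": 0, "Difficult": 1, "Normal": 2, "Easy": 3}

def session_change(start, end):
    """Positive result: the mentee ended the chat on a higher marker."""
    return MARKER_RANK[end] - MARKER_RANK[start]

# Hypothetical (start marker, end marker) pairs for three weekly sessions.
sessions = [
    ("At risk", "Difficult"),
    ("Difficult", "Normal"),
    ("Difficult", "Easy"),
]

# Change within each chat session.
within = [session_change(start, end) for start, end in sessions]

# Change from the first session's start to the last session's end.
overall = session_change(sessions[0][0], sessions[-1][1])

print(within)   # [1, 1, 2]
print(overall)  # 3
```

Treating the markers as equally spaced ranks is a simplification; it is enough to tell whether a trend points up or down, but the size of a step should not be over-interpreted.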