Sentiment Analysis Using Autoencoders

Prudvi, Rajendar, Srunith & Furqan | May 2023

Sentiment analysis is the process of determining whether the meaning of a sentence is positive, negative, or neutral. By extracting sentiments from social media, we can make recommendations that simplify the task of choosing between options. Autoencoders have gained a lot of attention in recent years as a building block of deep learning. They work by reconstructing outputs from the given inputs, and a loss function makes sure that the outputs do not differ too much from the inputs.

In a neural network implementation of an autoencoder, the hidden layer is taken as the learned feature. Plain autoencoders bring some difficulty: much effort has to be spent on regularization and on the loss function to prevent overfitting. The loss function in particular receives little attention, yet it is a critical issue when modelling textual data. The usual loss functions are squared Euclidean distance and element-wise KL divergence. The problem with these losses is that they try to reconstruct every dimension of the input independently. However, we argue that this is not the optimal approach for text classification, for two reasons. First, in natural language the distribution of word frequencies obeys a power law: a few of the most frequent words account for most of the probability mass of word occurrences. Because of this, the autoencoder puts much more effort into reconstructing the most frequent words and ignores the less frequent ones. This can lead to poor performance and lower accuracy, especially when the class distinction is not carried well by the frequent words.
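To make this concrete, here is a minimal NumPy sketch (an illustration, not the project's code) of the two losses just mentioned. Both sum over every input dimension with equal weight, which is why the most frequent words dominate the objective:

    import numpy as np

    def squared_euclidean(x, x_hat):
        # Sum of squared reconstruction errors over all input dimensions.
        return np.sum((x - x_hat) ** 2)

    def elementwise_kl(x, x_hat, eps=1e-8):
        # Element-wise KL divergence, assuming inputs and reconstructions
        # are (clipped into) the open interval (0, 1).
        x = np.clip(x, eps, 1 - eps)
        x_hat = np.clip(x_hat, eps, 1 - eps)
        return np.sum(x * np.log(x / x_hat)
                      + (1 - x) * np.log((1 - x) / (1 - x_hat)))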

For sentiment analysis this problem is especially severe, because the useful and important features occupy only a small fraction of the whole vocabulary, and reconstructing irrelevant words such as 'chef' or 'tasty' very well is not likely to help learn more useful representations for classifying the sentiment of restaurant reviews. The second reason is that explicitly reconstructing all the words of the input text is also very expensive, because the latent space has to carry everything the words contain, whether useful or not. Since the vocabulary can reach five to ten thousand words even for a medium-sized dataset, the hidden layer would have to be chosen very large to learn all the inputs and yield good outputs; such a large hidden layer is a great disadvantage to the capacity of the model and makes it too difficult to scale to large problems.

An autoencoder is a kind of artificial neural network whose main aim is to learn efficient codings of the data in an unsupervised fashion. When we want to transmit something high-dimensional, such as an image, over the internet, it can take a lot of time or space; with the help of autoencoders we can reduce its dimensionality so that it takes less time and space. Many variants add assumptions and constraints to the basic model with the aim of forcing the learned (hidden) representations of the input to take on helpful properties. Autoencoders have been used to solve several problems, from facial recognition to learning the semantic meaning of words. An autoencoder turns the unsupervised learning problem into a supervised one through self-reconstruction, which allows us to use all the tools developed for supervised learning, such as backpropagation, to train it efficiently. In essence, it learns the input and copies it to the output. It has an internal (hidden) layer that describes a code used to represent the input, and an output layer. It consists of two main parts: an encoder that maps the input into the code, and a decoder that maps the code back to a reconstruction of the original input, whether image or text. The idea of autoencoders has become very popular in the field of neural networks, and the first applications date back to the 1980s. The most important applications of autoencoders are dimensionality reduction and feature learning, but recently they have also been used for studying generative models.
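To illustrate the encoder/decoder structure described above, here is a minimal dense autoencoder sketch in Keras; the layer sizes are hypothetical placeholders, not settings taken from the project:

    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim = 2000  # hypothetical vocabulary size
    code_dim = 32     # hypothetical size of the learned code

    inputs = keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu")(inputs)       # encoder
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)  # decoder
    autoencoder = keras.Model(inputs, outputs)

    # Self-reconstruction makes the task supervised: the input is its own
    # target, so ordinary backpropagation can train the network.
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    # autoencoder.fit(X, X, epochs=20, batch_size=32)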

Machine Learning

The main steps for any machine learning project are:

i. Gathering data

ii. Preprocessing

iii. Training a model

The first step in any machine learning project is to import the required packages; the second is to load the dataset.
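As a sketch, the imports for a pipeline like ours might look as follows (the exact set is an assumption and depends on your environment):

    import re
    import pandas as pd
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split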

1. Loading the dataset: As our project is related to a recommendation system, we chose a restaurant reviews dataset containing two columns, Review and Label, and 1000 rows. None of the rows are empty, which is a good sign for the accuracy metric. The dataset we took is in TSV (tab-separated values) format: a tab-delimited file extension used with spreadsheet software, whose files hold raw data and can be imported into and exported from spreadsheets in a multi-purpose way.
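A possible loading step, assuming the imports above and a file named Restaurant_Reviews.tsv (a placeholder name; adjust it to your copy of the dataset):

    df = pd.read_csv("Restaurant_Reviews.tsv", sep="\t", quoting=3)

    print(df.shape)             # expect (1000, 2)
    print(df.columns.tolist())  # expect ['Review', 'Label']
    print(df.isnull().sum())    # confirm that no rows are empty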

2. Data pre-processing: This is the main part of any machine learning pipeline. In our project the data is text, so we need to convert it into categorical (numerical) form. Cleaning the data: Beautiful Soup is a Python library for removing HTML tags from the data, and it commonly saves programmers a lot of time. With it we removed all the HTML tags, and we also removed the numbers and the punctuation in each string. Porter stemmer: stemming is the process of producing the root word without gerund (-ing) forms; stemming programs are commonly referred to as stemming algorithms. Our data is now free of tags, punctuation, -ing forms, numbers, stopwords, and upper/lowercase inconsistencies. We then split the text into tokens, using a tokenizer to accomplish the task. Now comes the biggest step of our project: converting the tokens into categorical data. Word embeddings: many techniques exist for this in various libraries, such as one-hot encoding, label encoding, and GloVe; we chose CountVectorizer to do the job. Finally, our categorical data is ready.
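The following sketch covers the cleaning, stemming, and vectorization steps described above, assuming the imports and DataFrame from the previous steps; the vocabulary cap of 1500 is an assumption, not the project's exact setting:

    import nltk
    nltk.download("stopwords", quiet=True)

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def clean_review(text):
        text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
        text = re.sub(r"[^a-zA-Z]", " ", text)                # drop numbers and punctuation
        tokens = text.lower().split()                         # lowercase and tokenize
        tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
        return " ".join(tokens)

    corpus = [clean_review(r) for r in df["Review"]]

    # CountVectorizer turns the cleaned reviews into a document-term count matrix.
    vectorizer = CountVectorizer(max_features=1500)
    X = vectorizer.fit_transform(corpus).toarray()
    y = df["Label"].values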

3. Passing to the autoencoder: We selected a sparse autoencoder, which adds a sparsity constraint to the autoencoder and gives good results.
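One common way to impose the sparsity constraint is an L1 penalty on the hidden activations; the Keras sketch below illustrates this, with the code size and regularization weight as assumptions rather than the project's exact settings:

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    # Binarize the counts so a sigmoid output layer can reconstruct them.
    X_bin = (X > 0).astype("float32")

    input_dim = X_bin.shape[1]
    code_dim = 64  # hypothetical size of the hidden representation

    inputs = keras.Input(shape=(input_dim,))
    # The L1 activity regularizer pushes most hidden activations toward
    # zero, which is the sparsity constraint on the learned code.
    code = layers.Dense(code_dim, activation="relu",
                        activity_regularizer=regularizers.l1(1e-5))(inputs)
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)

    sparse_ae = keras.Model(inputs, outputs)
    sparse_ae.compile(optimizer="adam", loss="binary_crossentropy")
    sparse_ae.fit(X_bin, X_bin, epochs=20, batch_size=32, verbose=0)

    # Keep only the encoder: its output is the learned feature vector
    # for each review.
    encoder = keras.Model(inputs, code)
    features = encoder.predict(X_bin)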

4. Selecting a model: We selected a logistic regression classifier and trained it. The representations we received from the autoencoder are passed to the logistic regression classifier and evaluated on the test data.
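A sketch of this final step with scikit-learn, using the encoder output from the previous step (the 80/20 train/test split is an assumption):

    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.2, random_state=42)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))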
