
Word Prediction Using RNN


Introduction

Hello everyone! In this blog, we explore the concept of Word Prediction using RNNs, which stands for Recurrent Neural Networks. It is a fascinating topic both for its complexity and for its predictive power. We hope you enjoy it and that, by the end, you understand the extraordinary concept of the RNN.

So let us begin with what exactly word prediction is. Also called language modeling, word prediction is, as the name suggests, the task of predicting which word comes next, and it shows up in everyday tasks as well as application-specific ones. For example, when using Gmail, Google Docs, or even the keyboards on our smartphones, we can see the machine trying to predict words or sentences. This is the magic of machine learning!

Although such text prediction can be approached in many ways, using an RNN is the one we will explore. It is grounded in a field (or extension) of machine learning called Natural Language Processing.

In this article, we will cover the key concepts of neural networks and their various types, brief information about the Recurrent Neural Network (RNN) and the principle behind its use for word prediction, a quick look at the benefits and limitations RNNs offer, and the importance of LSTMs for predicting words and performing language modeling.

NEURAL NETWORK

A neural network, if you have noticed, sounds like "neuron". That is because it is essentially an imitation of how our neurons, and the brain as a whole, work. It consists of different layers connected to each other, thus forming a network. It learns from vast volumes of data, and complex algorithms are used to train it. These are also known as Simulated Neural Networks (SNNs) or Artificial Neural Networks (ANNs), and they serve as the heart of deep learning algorithms. Let us explore how these brain-like networks work.

Artificial neural networks (ANNs) are made up of node layers: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of any individual node exceeds the specified threshold value, that node is activated, transmitting data to the next layer of the network. The figure below represents the working of a neural network.

Working of Neural Network
Source: https://www.ibm.com/cloud/learn/neural-networks 
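
To make this concrete, here is a minimal sketch (in Python with NumPy) of a single artificial neuron; the inputs, weights, bias, and threshold are made-up values purely for illustration.

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """A single artificial neuron: weighted sum of inputs plus a bias,
    passed through a simple step activation against a threshold."""
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > threshold else 0

# Illustrative values only: three inputs with hand-picked weights.
inputs = np.array([0.8, 0.2, 0.5])
weights = np.array([0.4, -0.6, 0.9])
bias = -0.1

print(neuron(inputs, weights, bias))  # fires (outputs 1) if the sum passes the threshold
```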

TYPES OF NEURAL NETWORKS:
Just as biological neurons are used for many different tasks, artificial neurons can be used in plenty of ways too. Neural networks can be categorized into different types based on the application, their method of implementation, and many other factors. The following are some of the most common types of neural networks, along with their common use cases.

1. Feedforward Neural Network
Feed-forward neural networks are networks in which the data only moves in one direction. By that, we mean that there is no feedback as such: data travels from the input layer, through the hidden layers, and then towards the output layer. It works somewhat like a mathematical function, where some input maps to a certain output based on a learned pattern. In the case of neural networks, the inputs are multiplied by weights, summed, and then passed through an activation function (for example, a step activation function).

Uses: Classification Problems, Speech Recognition, Computer Vision, Face Recognition.

A fun fact: whenever feedforward neural networks are coupled with feedback systems, guess what we get: an RNN!

Feed Forward Neural Network
Image courtesy: https://www.mygreatlearning.com/blog/types-of-neural-networks/
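
To get a feel for this one-way flow of data, here is a rough NumPy sketch that pushes an input vector through one hidden layer and an output layer; the layer sizes and random weights are illustrative assumptions, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative shapes: 4 inputs -> 8 hidden units -> 3 output classes.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

x = rng.normal(size=4)          # input layer
h = relu(x @ W1 + b1)           # hidden layer: weights, bias, activation
y = softmax(h @ W2 + b2)        # output layer: class probabilities

print(y)  # data flowed strictly forward, with no feedback loop
```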

2. Convolutional Neural Network
While most networks arrange their neurons in a two-dimensional array, a Convolutional Neural Network has a three-dimensional arrangement of neurons. Each neuron processes only a small part of the data. Features of the input are recognized by processing it batch-wise in small patches, and appropriate transformations are applied to the data depending on the patterns found in it.

Uses: Image Classification and Detection, Pattern Recognition, Machine Translation.

Here is an example of how a CNN works with visual data. In this computing process, the image is converted from the HSI to the RGB scale.
Architecture of CNN
Image courtesy: https://www.mygreatlearning.com/blog/types-of-neural-networks/
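
As a hedged illustration of such a network, below is a minimal Keras sketch of a small image classifier (assuming TensorFlow/Keras is installed); the 28x28 grayscale input shape, filter counts, and 10 output classes are arbitrary choices for the example.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative CNN: small 28x28 grayscale images, 10 output classes.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # each filter scans small patches of the image
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```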

3. Recurrent Neural Network
As we discussed before, it is similar in working to a feedforward neural network. The major difference is that in a recurrent neural network there is a mechanism that feeds the output of a layer back to the input. This feedback lets the network take its own earlier predictions into account, making the system somewhat self-correcting. During computation, each neuron functions as a memory cell, passing some information forward as the process progresses. The most interesting part is that, by holding information for future steps, the network is able to learn and adjust accordingly. By traversing the data in this way, it is able to increase its own accuracy as it goes.

Uses: Time series prediction, sentiment analysis, text processing, voice recognition.

Here is an example showcasing the feedback system in an RNN.
Structure of RNN
Image Courtesy: https://builtin.com/data-science/recurrent-neural-networks-and-lstm
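
As a rough sketch of an RNN applied to one of these use cases (sentiment analysis over integer-encoded text), the Keras model below processes a sequence step by step through a SimpleRNN layer whose hidden state acts as the feedback memory; the vocabulary size, sequence length, and layer sizes are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000   # assumed vocabulary size
SEQ_LEN = 100         # assumed (padded) text length

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 32),        # word indices -> dense vectors
    layers.SimpleRNN(32),                    # hidden state carries information across time steps
    layers.Dense(1, activation="sigmoid"),   # positive / negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```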

Let us now talk about LSTM, to understand the detailed working of an RNN.

LSTM is used as a fundamental building block for the layers of an RNN. It consists of gates that manage memory. There are three types of gates: the input gate, the output gate, and the forget gate. These gates manage the memory and decide when information is let in, when it is output, and when it should be forgotten. The aim of LSTM is to store memory for an extended period of time and use it accordingly.

Uses: Grammar Learning, Human Action Recognition, Market Prediction, Speech Recognition.
 
Here is a depiction of the 3 gates of LSTM.  
LSTM with its three gates
Image courtesy: https://builtin.com/data-science/recurrent-neural-networks-and-lstm
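
To make the three gates a little more concrete, here is a rough NumPy sketch of a single LSTM cell step following the standard gate equations; the tiny dimensions and random weights are purely illustrative, and in practice a library layer (such as Keras's LSTM) handles all of this internally.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: W, U, b hold stacked weights for the forget,
    input, and output gates plus the candidate cell update."""
    z = W @ x_t + U @ h_prev + b                   # shape (4 * hidden,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates squash values to (0, 1)
    g = np.tanh(g)                                 # candidate new memory
    c_t = f * c_prev + i * g                       # forget old memory, write new memory
    h_t = o * np.tanh(c_t)                         # output gate decides what is exposed
    return h_t, c_t

# Illustrative sizes: 3-dimensional input, 5-dimensional hidden state.
rng = np.random.default_rng(1)
hidden, inp = 5, 3
W = rng.normal(size=(4 * hidden, inp))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, U, b)
```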

Recurrent Neural Networks are especially important in areas where information needs to be remembered, which makes them a perfect fit for word prediction. Thanks to their internal memory, RNNs retain the information necessary for predicting future words.

The figure below represents a Recurrent Neural Network. Different neural network layers are compressed into a single layer of the RNN. Here A, B, and C are parameters of the network that are used to improve the model's output. The output is fed back as feedback, while x is the input layer, h is the hidden layer, and y is the output layer.

Fully connected Recurrent Neural Network
Image courtesy: https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn#what_is_a_neural_network

WORKING OF RNN:

In an RNN, information flows through a loop. When the network makes a decision, it considers the current input and what it has learned from previous inputs. Because an RNN has internal memory, it can store both past and present information. Moreover, RNNs apply weights to the current input as well as to the previous input.

Let us discuss the working of an RNN. The feedback system creates a loop that helps the model build accuracy and improve continuously. This is the special advantage of the RNN.
The figure below shows the working of a fully connected RNN.
Here,
● The input layer (‘x’): takes and processes the information, passing it on to the middle layer (‘h’)
● The hidden layer (‘h’): each hidden layer has its own activation function, weights, and biases.

Thus, the RNN standardizes the activation functions, weights, and biases so that each hidden layer has the same parameters. As a result, instead of creating multiple hidden layers, it creates one layer and loops over it as many times as required.

 
Working of Recurrent Neural Network
Image courtesy: https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn#what_is_a_neural_network
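
A minimal NumPy sketch of that looping behaviour is shown below: one shared set of weights is reused at every step, and the hidden state carries information forward. The dimensions, random weights, and dummy inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
inp, hidden = 4, 6

# One shared set of parameters, reused at every time step (the "single layer that loops").
W_x = rng.normal(size=(hidden, inp))     # input -> hidden
W_h = rng.normal(size=(hidden, hidden))  # previous hidden -> hidden (the feedback)
b = np.zeros(hidden)

sequence = [rng.normal(size=inp) for _ in range(5)]  # five dummy input vectors
h = np.zeros(hidden)                                 # the memory starts out empty

for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h + b)  # current input + what was remembered so far

print(h)  # final hidden state summarizes the whole sequence
```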

LIMITATIONS OF RNN:

Although the RNN is one of the best methods out there, there are two main issues with recurrent neural networks (RNNs), which are described below.

EXPLODING GRADIENTS:

Exploding gradients occur when error gradients accumulate and result in very large updates to the weights during training. It is a classic example of exponential growth: errors compound as they are multiplied back through the layers. This makes the model unstable and unable to learn from the given training data.

VANISHING GRADIENTS: 

Vanishing gradients emerge when the values of the gradient become very small, so the model learns extremely slowly or fails to learn at all. This happens when a deep multilayer feed-forward network or an RNN fails to propagate a usable gradient signal from the output end of the model back to the layers near the input end; repeated application of activation functions with small derivatives shrinks the gradient at every layer. As a result, models with many layers either fail to learn on a given dataset or converge prematurely to an inaccurate solution.
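
A toy numerical illustration of both problems: during backpropagation through time, the gradient is repeatedly scaled by roughly the same recurrent factor, so a factor slightly above 1 explodes over many steps while a factor below 1 vanishes. The numbers below are made up purely to show the exponential behaviour.

```python
# Repeatedly multiplying a gradient-like quantity, as backpropagation
# through time effectively does, either explodes or vanishes it.
steps = 50
grad_explode, grad_vanish = 1.0, 1.0
for _ in range(steps):
    grad_explode *= 1.2   # recurrent factor slightly above 1
    grad_vanish *= 0.8    # recurrent factor slightly below 1

print(f"after {steps} steps: exploding ~ {grad_explode:.2e}, vanishing ~ {grad_vanish:.2e}")
# after 50 steps: exploding ~ 9.10e+03, vanishing ~ 1.43e-05
```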

BENEFITS OF RNN:

The major benefit of the RNN is its unique ability to retain memory over time. This makes it advantageous for applications where memory-based predictions are necessary. Moreover, recurrent neural networks can also be combined with other neural networks, such as convolutional networks, in hybrid models. This puts the RNN in a very interesting position.


WORD/TEXT PREDICTION:

RNNs, or Recurrent Neural Networks, repeat themselves by using a feedback loop: the output from the previous step is fed as input to the current step. A particularly important recurrent architecture is the Long Short-Term Memory (LSTM) model. Here the objective is to predict the next word(s) for the user, given a set of existing words. This procedure can be more complex in languages other than English.

PREDICTION OF NEXT WORD

The figure below represents how a sequence of three words results in the prediction of the fourth word.
The network has three hidden layers, each of which is an affine function (i.e., a matrix multiplication plus a bias) followed by a non-linearity; the final hidden layer is then followed by an output layer with its own activation function.
The input vectors representing each word in the sequence are lookups in a word embedding matrix, indexed by a one-hot encoded vector identifying the term in the vocabulary. All input words share the same word embedding matrix. In this context, a word is really a token representing a word or a punctuation mark.
As a result, the output is a one-hot encoded vector representing the predicted fourth word in the sequence.
Here,
  • The first hidden layer takes the vector describing the first word in the sequence as input, and its output activations serve as one of the inputs to the second hidden layer.
  • The second hidden layer accepts the first hidden layer's activations together with the vector representing the second word. These two inputs can be either added or concatenated.
  • The third hidden layer follows the same scheme as the second, accepting the activations from the second hidden layer combined with the vector representing the third word in the sequence. Again, these inputs can be added or concatenated.
The output from the final hidden layer passes through an activation function that produces an output denoting a word from the vocabulary as a one-hot encoded vector.
Crucially, the second and third hidden layers can use the same weight matrix, opening the opportunity to refactor this into a loop, making the network recurrent.

A fully connected network for text prediction.
Source: Fastai deep learning course V3 by Jeremy Howard.
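
Here is a rough NumPy sketch of that fully connected three-step network: each word index is looked up in a shared embedding matrix, each hidden layer combines the new word's embedding with the previous layer's activations (added here), and the output layer produces a distribution over the vocabulary. The vocabulary size, dimensions, weights, and word indices are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, emb_dim, hidden = 1000, 16, 32

embedding = rng.normal(size=(vocab_size, emb_dim))   # shared word embedding matrix
W_in = rng.normal(size=(hidden, emb_dim))            # word embedding -> hidden
W_hh = rng.normal(size=(hidden, hidden))             # previous activations -> hidden
W_out = rng.normal(size=(vocab_size, hidden))        # hidden -> vocabulary scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

words = [12, 407, 93]   # indices of the three known words (made up)

h = np.tanh(W_in @ embedding[words[0]])               # first hidden layer: first word only
h = np.tanh(W_in @ embedding[words[1]] + W_hh @ h)    # second: second word + previous activations (added)
h = np.tanh(W_in @ embedding[words[2]] + W_hh @ h)    # third: third word + previous activations

probs = softmax(W_out @ h)                # distribution over the vocabulary
predicted_word = int(np.argmax(probs))    # index of the predicted fourth word
```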

Word Embedding

A word embedding is a class of approaches for representing words and documents with dense vectors. The geometric relationship between words in an embedding can capture the semantic relationship between them: words closer to each other have a stronger relationship than words far apart. Using the vectors from a word embedding helps keep the resulting activations from being extremely sparse. An embedding can represent any word in relatively few dimensions, far fewer than the number of unique words in our text.

The following figure gives a few examples of word vectors. Vectors that are nearer to each other (in cosine or geometric distance) represent words that are more closely related. There could be a "male to female" vector representing the relationship between a word and its feminine counterpart. That vector may help us anticipate "king" when "he" is used and "queen" when "she" is used in the sentence.

Examples of Word vector

Source: https://towardsdatascience.com/word-embeddings-and-the-chamber-of-secrets-lstm-gru-tf-keras-de3f5c21bf16
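
As a toy sketch of these geometric relationships, the snippet below uses tiny hand-made vectors (not real trained embeddings) to show how cosine similarity can recover a "he is to king as she is to ..." style analogy.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Tiny, hand-made 3-dimensional "embeddings" (real ones have hundreds of dimensions).
vectors = {
    "king":  np.array([0.9, 0.7, 0.1]),
    "queen": np.array([0.9, 0.1, 0.7]),
    "man":   np.array([0.1, 0.7, 0.1]),
    "woman": np.array([0.1, 0.1, 0.7]),
}

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)   # -> queen
```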



Refactoring the Loop

For the network to become recurrent, a loop must be factored into its model. Because all words share the same embedding, the second and third layers can use exactly the same weights, which makes it straightforward to roll them into a loop. Each loop iteration takes as input the vector representing the next word in the sequence together with the output activations from the previous iteration. These inputs are concatenated or added together.
The output from the last iteration is passed through the final layer's activation function, which converts it to a one-hot encoded vector representing the predicted next word in the vocabulary.
The figure below shows how this allows the network to predict the word at the end of a sequence of arbitrary length.

A basic RNN.
Source: Fastai deep learning course V3 by Jeremy Howard.
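
Refactoring the earlier sketch into a loop gives a basic recurrent cell: the same weights are applied at every iteration, so the network can handle a sequence of any length. As before, the sizes, weights, and word indices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size, emb_dim, hidden = 1000, 16, 32

embedding = rng.normal(size=(vocab_size, emb_dim))
W_in  = rng.normal(size=(hidden, emb_dim))
W_hh  = rng.normal(size=(hidden, hidden))
W_out = rng.normal(size=(vocab_size, hidden))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(word_indices):
    """Run the shared weights in a loop over a sequence of any length."""
    h = np.zeros(hidden)
    for idx in word_indices:
        # each iteration: embedded next word + activations carried over from the last iteration
        h = np.tanh(W_in @ embedding[idx] + W_hh @ h)
    return int(np.argmax(softmax(W_out @ h)))   # index of the predicted next word

print(predict_next([12, 407, 93, 55, 2]))       # works for an arbitrary-length sequence
```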

Significance of LSTM

During backpropagation, the vanishing gradient is a problem faced by neural networks. It has a considerable effect: the weight update process is heavily affected and the model becomes ineffective. So we use the LSTM, which holds a hidden state and a memory cell controlled by three gates: the forget gate, the input gate, and the output gate.
The forget gate controls which information is no longer required and should be discarded. The input gate ensures that new information is written to the cell, and the output gate decides which parts of the cell state are passed on to the next hidden state. The sigmoid function used in each gate equation squashes the value to between 0 and 1.

Architecture of RNN

Source: https://medium.com/@antonio.lopardo/the-basics-of-language-modeling-1c8832f21079


Text Generation Using LSTM:

This figure depicts the architecture of an LSTM. Here, X is the input and the subscript t denotes the time instant. As can be noticed, c and h are inputs arriving from the previous step. The forget gate contains the weights that decide precisely what information must be cleared before moving on to the next gate; sigmoids are used here. At the input gate, new information is written into the cell. Finally, the output gate outputs the information that is passed to the next LSTM cell.

Architecture of LSTM

Source: https://medium.com/@antonio.lopardo/the-basics-of-language-modeling-1c8832f21079 

Key steps for Text Prediction:

  • Load the essential libraries required for LSTM and NLP purposes
  • Load the text data
  • Perform the required text cleaning
  • Construct a dictionary mapping words to integer values
  • Prepare the dataset as input and output sets using the dictionary
  • Define the LSTM model for text generation (a minimal code sketch follows this list)
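
The original post does not include code for these steps, but here is a minimal sketch of them with TensorFlow/Keras, assuming a plain-text corpus saved as corpus.txt; the filename, the five-word context window, the layer sizes, and the epoch count are all illustrative assumptions.

```python
import re
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 1-3. Load the text and do some light cleaning.
text = open("corpus.txt", encoding="utf-8").read().lower()   # assumed corpus file
words = re.findall(r"[a-z']+", text)

# 4. Build a dictionary mapping each word to an integer (0 is reserved for padding).
word_to_id = {w: i + 1 for i, w in enumerate(sorted(set(words)))}
vocab_size = len(word_to_id) + 1
tokens = [word_to_id[w] for w in words]

# 5. Prepare input/output pairs: five context words predict the next word.
context = 5
X = np.array([tokens[i:i + context] for i in range(len(tokens) - context)])
y = np.array(tokens[context:])

# 6. Define and train the LSTM model for next-word prediction.
model = keras.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=20, verbose=0)   # epoch count is illustrative
```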

The figure below represents the outcome after the implementation of the above steps.

Result of Text Prediction 

Source: https://bansalh944.medium.com/text-generation-using-lstm-b6ced8629b03


We learned a lot while preparing this blog. Here are the references we went through:


Blog Link: https://textgenerationrnn.blogspot.com/2022/01/word-prediction-using-rnn.html



Thank you for reading the article! 

The contributors to this article are:
Arya Patil- Elecs- B-17
Varun Shelke-Elecs-B-48
Swapnil Bonde-Elecs-B-59
Yah Tobre-Elecs-B-63

