Quick Tour of Natural Language Processing

Satishkumar Moparthi
9 min read · Jan 18, 2021


Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. It is used for practical purposes that help us with everyday activities, such as texting, e-mail, and communicating across languages.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Everything we express (either verbally or in writing) carries huge amounts of information. The topic we choose, our tone, our selection of words, everything adds some type of information that can be interpreted and value extracted from it. In theory, we can understand and even predict human behavior using that information.

In simple terms, NLP represents the automatic handling of natural human language like speech or text, and although the concept itself is fascinating, the real value behind this technology comes from the use cases.

What is text?

You can think of text as a sequence of characters, words, phrases, sentences, paragraphs, named entities, and so on.

In academic terms, a text is anything that conveys a set of meanings to the person who examines it. You might have thought that texts were limited to written materials, such as books, magazines, newspapers, and ‘zines (an informal term for magazine that refers especially to fanzines and webzines). Those items are indeed texts — but so are movies, paintings, television shows, songs, political cartoons, online materials, advertisements, maps, works of art, and even rooms full of people. If we can look at something, explore it, find layers of meaning in it, and draw information and conclusions from it, we’re looking at a text.

Text Pre-processing

Machine learning takes data in the form of numbers. There are many encoding techniques, such as Bag of Words, n-grams, TF-IDF, Word2Vec, and one-hot encoding (OHE), to encode text into a numeric vector. But before encoding we first need to clean the text data, and this process of preparing (or cleaning) text data before encoding is called text preprocessing. It transforms text into a more digestible form so that machine learning algorithms can perform better.

In natural language processing, text preprocessing is the practice of cleaning and preparing text data. NLTK and re are common Python libraries used to handle many text preprocessing tasks.

Text cleaning

  1. Text / Case normalization

Text normalization is the process of transforming text into a canonical (standard) form. In NLP, text (lowercase t) is different from Text (capital T), so here we convert every word to lowercase. In natural language processing, normalization encompasses many text preprocessing tasks, including stemming, lemmatization, lowercasing, and stop word removal.

hi welcome to the course on text analytics. text analytics is a very important course
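A minimal sketch of case normalization in Python; the input string is reconstructed from the example output above, so treat it as illustrative:

raw_text = "Hi welcome to the course on Text Analytics. TEXT analytics is a very important course"

# Convert every character to lower case so that "Text", "TEXT" and "text"
# all map to the same token.
normalized_text = raw_text.lower()
print(normalized_text)
# hi welcome to the course on text analytics. text analytics is a very important course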

2. Tokenizing the text

In natural language processing, tokenization is the text preprocessing task of breaking up text into smaller components of text (known as tokens). You can think of a token as a useful unit for semantic processing. There are many tokenizing methods as mentioned below.

['Hi', 'welcome', 'to', 'the', 'course', 'on', 'Text', 'Analytics', '.', 'TEXT', 'analytics', 'is', 'a', 'very', 'important', 'course']
['This', 'hotel', 'is', 'awesome', ',', 'isn', "'", 't', 'it', '?', 'it', 'couldn', "'", 't', 'have', 'been', 'a', 'better', 'place', 'than', 'this']
['LMAO', '#killing', 'it', ',', 'luv', 'mah', 'lyf', 'YOLO', 'LOL', ':D', ':D', '<3', '@raju']
['#chilling', '#lifegoals', '#yolo', '#wanderlust']
['chilling', 'lifegoals', 'yolo', 'wanderlust']
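A sketch of the NLTK tokenizers that produce outputs like the ones above; the input strings are reconstructed from those outputs, so treat them as assumptions:

from nltk.tokenize import word_tokenize, wordpunct_tokenize, TweetTokenizer, regexp_tokenize
# nltk.download('punkt') may be needed the first time word_tokenize is used.

# Standard word tokenizer: splits on whitespace and punctuation.
print(word_tokenize("Hi welcome to the course on Text Analytics. TEXT analytics is a very important course"))

# Word-punctuation tokenizer: splits contractions such as "isn't" into 'isn', "'", 't'.
print(wordpunct_tokenize("This hotel is awesome, isn't it? it couldn't have been a better place than this"))

# Tweet tokenizer: keeps hashtags, emoticons and @mentions together.
print(TweetTokenizer().tokenize("LMAO #killing it, luv mah lyf YOLO LOL :D :D <3 @raju"))

# Regular-expression tokenizer: pull out only the hashtags, then strip the '#'.
tags = regexp_tokenize("#chilling #lifegoals #yolo #wanderlust", r"#\w+")
print(tags)
print([t.lstrip("#") for t in tags])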

Removing stop words and punctuation

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'][nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\SatishMoparthi\AppData\Roaming\nltk_data...
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"][nltk_data] Unzipping corpora\stopwords.zip.['able', 'work', 'today', 'taking']
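A sketch of the removal step itself; the input sentence is a guess that reproduces the ['able', 'work', 'today', 'taking'] output above:

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")   # the stop word list printed above
nltk.download("punkt")       # tokenizer models

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("I was not able to work today as I am taking an off".lower())

# Keep only tokens that are neither stop words nor punctuation characters.
cleaned = [t for t in tokens if t not in stop_words and t not in string.punctuation]
print(cleaned)   # ['able', 'work', 'today', 'taking']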

3. Stemming

Stemming is the process of reducing a word to its word stem by chopping off suffixes and prefixes. The stem need not be a valid dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

  1. Porter stemmer — the oldest one, originally developed in 1979.
  2. Snowball stemmer — a more sophisticated stemmer that supports multiple languages and is faster than the Porter stemmer.
['he', 'is', 'veri', 'method', 'and', 'orderli', 'in', 'hi', 'execut']
['he', 'is', 'veri', 'method', 'and', 'order', 'in', 'his', 'execut']
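A sketch comparing the two stemmers; the sentence is reconstructed from the stems above, so it is only an assumption:

from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize

tokens = word_tokenize("He is very methodical and orderly in his execution".lower())

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Porter turns 'his' into 'hi' and 'orderly' into 'orderli'.
print([porter.stem(t) for t in tokens])
# Snowball keeps 'his' and reduces 'orderly' all the way to 'order'.
print([snowball.stem(t) for t in tokens])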

4. Lemmatization

  • Like stemming, lemmatization takes the word to its root form, called the lemma.
  • It involves resolving words to their dictionary form.
  • The lemma of a word is its dictionary or canonical form.
  • The lemmatizer in NLTK uses the WordNet data set, which comprises lists of synonyms.
  • The lemmatizer is less aggressive than a stemmer in taking a word to its root form.
  • If the word to be lemmatized is not part of the dictionary, it is left as is.
  • This ensures that the meaning of the sentence is not altered.
  • In most scenarios, the number of distinct words after lemmatization can be about the same as before.
  • Every step in text cleaning helps in reducing the number of words, but the lemmatizer might not make much of a difference.
['he', 'is', 'driving', 'and', 'drive', 'the', 'down', 'of', 'the', 'drived', 'vehicle']
['he', 'be', 'drive', 'and', 'drive', 'the', 'down', 'of', 'the', 'drive', 'vehicle']
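A sketch of the WordNet lemmatizer with and without a part-of-speech hint; the token list is reconstructed from the outputs above, so treat it as illustrative:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()   # nltk.download("wordnet") may be needed once
tokens = ['he', 'is', 'driving', 'and', 'drives', 'the', 'down', 'of', 'the', 'drived', 'vehicle']

# Default POS is noun, so most verb forms are left untouched.
print([lemmatizer.lemmatize(t) for t in tokens])

# With pos='v' each token is treated as a verb: 'is' -> 'be', 'driving'/'drived' -> 'drive'.
print([lemmatizer.lemmatize(t, pos='v') for t in tokens])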
Difference between lemmatization and stemming:
['study', 'studying', 'cry', 'cry', 'his', 'like', 'execute', 'orderly', 'university', 'universal']
['studi', 'studi', 'cri', 'cri', 'his', 'like', 'execut', 'order', 'univers', 'univers']
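A generic sketch of the same comparison; the word list is illustrative and simply chosen to show the pattern above:

from nltk.stem import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

words = ['studies', 'studying', 'cries', 'cry', 'universities', 'universal']

# Lemmatization returns dictionary words: ['study', 'studying', 'cry', 'cry', 'university', 'universal']
print([lemmatizer.lemmatize(w) for w in words])

# Stemming chops suffixes and may produce non-words: ['studi', 'studi', 'cri', 'cri', 'univers', 'univers']
print([stemmer.stem(w) for w in words])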

Visualizing Text Data

  1. Word Cloud : Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud.
Word Cloud
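A minimal word cloud sketch, assuming the third-party wordcloud package and matplotlib are installed; the corpus string is illustrative:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

corpus = "good plain paper realli good interest paper interest good course text analytics"

# Build the cloud; more frequent words are drawn larger.
wc = WordCloud(background_color="white", width=800, height=400).generate(corpus)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()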

2. Bar Graph

Bar Graph
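A bar graph of word frequencies can be drawn with a simple counter; the token list here is illustrative:

from collections import Counter
import matplotlib.pyplot as plt

tokens = ['good', 'plain', 'paper', 'realli', 'good', 'interest', 'paper', 'interest']

# Count token frequencies and keep the 10 most common.
words, freqs = zip(*Counter(tokens).most_common(10))

plt.bar(words, freqs)
plt.xlabel("word")
plt.ylabel("frequency")
plt.show()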

Converting Text into Numbers

  1. DTM — Document Term Matrix: A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. The term-document matrix (TDM) is the transpose of the DTM.

Bag of Words creates a sparse matrix.

Sparsity = number of zero elements / total elements

Document Term Matrix

   good  interest  paper  plain  realli  thi
0     2         1      1      1       1    1
1     0         1      1      0       0    0

Term Document Matrix

          0  1
good      2  0
interest  1  1
paper     1  1
plain     1  0
realli    1  0
thi       1  0
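A sketch of building a DTM and TDM with scikit-learn's CountVectorizer. The two documents are taken from the TF example below, so the result is close to (but not exactly) the table above, which also contains a 'thi' column:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good plain paper realli good interest", "paper interest"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)   # sparse matrix: rows = documents, columns = terms

dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
print(dtm_df)     # document-term matrix
print(dtm_df.T)   # term-document matrix is simply the transpose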

2. Term Frequency ( tf ) : Gives us the frequency of the word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases.

Term frequency — No. of occurrences of a word/Total No. of words in a doc

doc0 = ‘good plain paper realli good interest’
doc1 = ‘paper interest’

TF(good,doc0) = 2/6 = 1/3
TF(interest,doc0)=1/6

3. Document Frequency ( df ) :

Document frequency — No. of documents in which the word is present/Total no. of docs

doc0 = ‘good plain paper realli good interest’
doc1 = ‘paper interest’

DF(good) = DF(plain) = DF(realli) = 1/2
DF(interest) = DF(paper) = 2/2 = 1

The word good is present in only one document out of two, while the word interest is present in both documents.

4. Inverse document frequency ( idf ): The inverse document frequency is a measure of whether a term is common or rare in a given document corpus. It is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that ratio. IDF tries to weigh how relevant a word is to a document.

Inverse document frequency — ln(1/DF) = ln(Total no. docs/No. of documents in which the word is present)

IDF values will be low if the word is present in most or all of the documents.
IDF values will be high if the word is present in only a few of the documents.

Why do we use log for IDF?

The usage of log can be understood from Zipf's law. Word occurrences in English follow a power-law distribution: some words such as 'the' and 'in' are used very frequently, while words such as 'geometry' and 'civilization' occur much less often in text. IDF without a log would produce large numbers that dominate in the ML model. Taking the log brings down the dominance of IDF in the TF-IDF values.

5. TF-IDF : TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF helps in document searching:
1. how frequent the word is within a document
2. how relevant the word is across documents

A high tf-idf means a word is more frequent in, and more relevant to, a document.

TFIDF(good, 0) = TF(good, 0) * IDF(good) = 2/6 * 0.69 = 0.23
TFIDF(interest, 0) = TF(interest, 0) * IDF(interest) = 1/6 * 0 = 0
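A sketch that reproduces this arithmetic with the formulas above (natural log, no smoothing). Note that scikit-learn's TfidfVectorizer uses a smoothed IDF by default, so its numbers would differ slightly:

import math

docs = ["good plain paper realli good interest", "paper interest"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(word, doc):
    # share of the document's words that are this word
    return doc.count(word) / len(doc)

def idf(word):
    # ln(total docs / docs containing the word)
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(N / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(round(tfidf("good", tokenized[0]), 2))      # 0.23
print(round(tfidf("interest", tokenized[0]), 2))  # 0.0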

6. n-grams ( unigram, bigram, 3-gram, 4-gram ) :

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles

Count vectorizer can be applied on n-grams:

  • 1-gram — A single word as a feature
  • bi-gram — A pair of words as a feature
  • tri-gram — 3 words as a feature
  • By default, the count vectorizer is applied on 1-grams.
  • The ngram_range argument allows us to get the DTM for any combination of n-grams.

2-gram code and its output

Unigram and bigram code and its output
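A sketch of what those code screenshots likely showed, using CountVectorizer's ngram_range argument; the documents are the same illustrative pair used earlier:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good plain paper realli good interest", "paper interest"]

# Bigrams only: every feature is a pair of consecutive words.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
print(pd.DataFrame(bigram_vec.fit_transform(docs).toarray(),
                   columns=bigram_vec.get_feature_names_out()))

# Unigrams and bigrams together.
uni_bi_vec = CountVectorizer(ngram_range=(1, 2))
print(pd.DataFrame(uni_bi_vec.fit_transform(docs).toarray(),
                   columns=uni_bi_vec.get_feature_names_out()))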

7. Word2Vec / Average Word2Vec / Tf-Idf weighted word2vec

This technique captures the semantic meaning of words and sentences. Each word is transformed into a dense vector of fixed dimensionality. This is not a sparse vector, and its dimensionality is generally much smaller than that of BOW (DTM) and TF-IDF representations. If two words are semantically similar, then the vectors of these words are closer geometrically.

Word2Vec also retains relationships between words such as King:Queen::Man:Woman

Vector(man) — Vector(woman) || Vector(king) — Vector(Queen)

Vector(walking) — Vector(walked) || Vector(swimming) — Vector(swam)

|| -> parallel

Google took a data corpus from Google News to train its Word2Vec model.
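A hedged sketch of loading those pre-trained vectors with gensim; the download is large (roughly 1.6 GB):

import gensim.downloader as api

# 300-dimensional Word2Vec vectors trained on the Google News corpus.
w2v = api.load("word2vec-google-news-300")

print(w2v["king"].shape)                                              # (300,)
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(w2v.similarity("walking", "swimming"))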

Average W2V:

Let's take some reviews from end users as an example. Each review is a sequence of words / sentences.

The vector v1 of review r1 is

average w2v(r1) = (w2v(w1) + w2v(w2) + … + w2v(wn)) / n, where n is the number of words in r1
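A sketch of that average, assuming w2v is a gensim KeyedVectors object such as the one loaded above:

import numpy as np

def average_w2v(review_tokens, w2v, dim=300):
    # Average the vectors of the words that exist in the vocabulary.
    vectors = [w2v[t] for t in review_tokens if t in w2v]
    if not vectors:
        return np.zeros(dim)
    return np.sum(vectors, axis=0) / len(vectors)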

Tf-Idf weighted word2vec

tfidf w2v(r1) = (tfidf(w1)*w2v(w1) + tfidf(w2)*w2v(w2) + …) / (tfidf(w1) + tfidf(w2) + …)
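A sketch of the weighted version, assuming tfidf_weights maps each word to its tf-idf score for this review (for example, one row of a fitted TfidfVectorizer):

import numpy as np

def tfidf_weighted_w2v(review_tokens, w2v, tfidf_weights, dim=300):
    weighted_sum = np.zeros(dim)
    weight_total = 0.0
    for t in review_tokens:
        if t in w2v and t in tfidf_weights:
            weighted_sum += tfidf_weights[t] * w2v[t]
            weight_total += tfidf_weights[t]
    # Divide by the sum of weights; fall back to zeros if no word matched.
    return weighted_sum / weight_total if weight_total else weighted_sum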
