Natural Language Processing
We are learning NLP!
Tokens:
We,are,learning,NLP,!
Syntax means the proper arrangement of words according to grammatical rules.
When a syntactically correct sentence also conveys meaning, we are dealing with semantics.
His future is very bright. (semantics)
Data Processing
1) Pre-processing steps
(Text Normalisation)
For example, slang, short forms, misspelled words and characters with special
meanings need to be converted into their canonical form:
Words                  Canonical form
b4, beefore, bifore    before
Gni8                   Good night
tysm                   Thank you so much
gr8, grt               great
Statement: Gn tke care
Perform text normalisation
Answer: Good night take care. (corpus)
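A minimal Python sketch of this kind of normalisation, assuming a small hand-made lookup table (the table and the normalise() helper below are only illustrations, not a standard library):

# Map non-canonical words to their canonical form using a lookup table.
canonical = {
    "b4": "before", "beefore": "before", "bifore": "before",
    "gn": "good night", "gni8": "good night",
    "tysm": "thank you so much",
    "gr8": "great", "grt": "great",
    "tke": "take",
}

def normalise(text):
    words = text.lower().split()
    return " ".join(canonical.get(w, w) for w in words)

print(normalise("Gn tke care"))   # prints: good night take care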
There are multiple steps for performing text normalisation:
1. Sentence Segmentation
2. Tokenization
3. Removing stop words, special characters and numbers
4. Converting text to a common case
5. Stemming
6. Lemmatization
Step 1: Sentence Segmentation
It is the process of sentence boundary detection, which breaks the corpus
down into individual sentences.
(Boundary detection example: a full stop)
(Corpus here means the entire body of text, e.g. a whole paragraph)
World War II (1939-1945) was a global conflict between the Allies—mainly
the U.S., U.K., Soviet Union, and China—and the Axis powers, led by Germany,
Italy, and Japan. It began with Germany's invasion of Poland and included
major events like the Holocaust, the bombing of Hiroshima and Nagasaki, and
the D-Day invasion. The war ended with the Axis powers' defeat, reshaping
global politics and leading to the creation of the United Nations, the Cold
War, and decolonization movements worldwide. It remains the deadliest
conflict in human history.
(Sentence Segmentation)
1. World War II (1939-1945) was a global conflict between the Allies—mainly
the U.S., U.K., Soviet Union, and China—and the Axis powers, led by Germany,
Italy, and Japan.
2. It began with Germany's invasion of Poland and included major events like
the Holocaust, the bombing of Hiroshima and Nagasaki, and the D-Day invasion.
3. The war ended with the Axis powers' defeat, reshaping global politics and
leading to the creation of the United Nations, the Cold War, and
decolonization movements worldwide.
4. It remains the deadliest conflict in human history.
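Sentence segmentation can also be done with NLTK's sent_tokenize (NLTK is the Python library introduced later in these notes; this sketch assumes it is installed and its tokenizer data has been downloaded with nltk.download()):

from nltk.tokenize import sent_tokenize

corpus = ("It began with Germany's invasion of Poland. The war ended with the "
          "Axis powers' defeat. It remains the deadliest conflict in human history.")
# Split the corpus at sentence boundaries (full stops here).
for number, sentence in enumerate(sent_tokenize(corpus), start=1):
    print(number, sentence)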
Step 2: Tokenization
It is the process of dividing a sentence further into tokens.
A token can be any word, number or special character that forms
a sentence.
It is achieved by finding the boundary of a word, that is, where one
word ends and another word begins; for example, in English the space
is the boundary detector.
For example:
It remains the deadliest
conflict in human history.
Tokens:
It,remains,the,deadliest,conflict,in,human,history,.
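The same step in code, using NLTK's word_tokenize (assumes NLTK and its tokenizer data are available); note that it also treats the full stop as a separate token:

from nltk.tokenize import word_tokenize

sentence = "It remains the deadliest conflict in human history."
print(word_tokenize(sentence))
# ['It', 'remains', 'the', 'deadliest', 'conflict', 'in', 'human', 'history', '.']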
Step 3: Removing stopwords, special characters and numbers
Stopwords are frequently occurring words that help make a sentence meaningful,
but for the machine they are of little use as they provide hardly any
information about the corpus.
Example: a, an, and, are, as, for, from, is, into, in, if, on, or, such, the,
there, to
An Apple had fallen on the head of Newton
Tokenization: An,Apple,had,fallen,on,the,head,of,Newton
Removal of stopwords: Apple had fallen head Newton
Step 4: Converting text to common case
Converting all the tokens to a common case, i.e. either small letters/lowercase
or capital letters/uppercase.
Example: APPLE HAD FALLEN HEAD NEWTON
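Steps 3 and 4 in code, using plain Python. The stopword set below is the illustrative list from these notes plus "of" (an assumption; most stopword lists include it), and the remaining tokens are converted to uppercase as the common case:

stopwords = {"a", "an", "and", "are", "as", "for", "from", "is", "into", "in",
             "if", "on", "or", "such", "the", "there", "to", "of"}

tokens = ["An", "Apple", "had", "fallen", "on", "the", "head", "of", "Newton"]
# Remove stopwords (comparison is done in lowercase), then convert to a common case.
filtered = [t for t in tokens if t.lower() not in stopwords]
print([t.upper() for t in filtered])
# ['APPLE', 'HAD', 'FALLEN', 'HEAD', 'NEWTON']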
Step 5: Stemming
The process of removing affixes from a word to get back its base word (stem)
is called stemming.
Example: healed   (stem/base word: heal   affix: -ed)
         healing  (stem/base word: heal   affix: -ing)
         taking   (stem/base word: tak    affix: -ing)
         studies  (stem/base word: studi  affix: -es)
         studying (stem/base word: study  affix: -ing)
Keep in mind that the stem left after removing the affixes may not be a
meaningful word.
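Stemming can be tried with NLTK's PorterStemmer (assumes NLTK is installed). The exact stems depend on the stemming algorithm, so they may differ slightly from the hand-worked examples above, and they need not be real words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "taking", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))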
Step 6: Lemmatization
This is also a process of removing the affixes from a word, but here the aim
is to obtain a meaningful base word. The word we get after removing the affix
is called the lemma. Since it always focuses on producing a meaningful lemma,
this process takes more time than stemming but gives better results.
Word      Affixes   Lemma
healed    -ed       heal
studies   -es       study
taking    -ing      take
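Lemmatization can be tried with NLTK's WordNetLemmatizer (assumes NLTK is installed and the wordnet data has been downloaded). Passing pos="v" tells it to treat the words as verbs:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["healed", "studies", "taking"]:
    # Each affix is removed so that a meaningful lemma (heal, study, take) remains.
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))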
To perform NLP we have multiple well-known tools and techniques; some of them are:
1. Bag of Words 2. Term Frequency and Inverse Document Frequency (TFIDF)
3. Natural Language Toolkit (NLTK)
Now we are going to learn "Bag of Words" (BoW).
Alan Turing, an English mathematician and a pioneer of modern computing,
is widely regarded as the father of theoretical computer science and a
founding figure of AI. We will use a short paragraph about him as our
example corpus.
BoW is an algorithm that represents a text by the meaningful words (tokens)
it contains, ignoring their order, just like a bag full of words scattered
with no specific order.
Alan Mathison Turing was an English mathematician, computer scientist,
logician, cryptanalyst, philosopher and theoretical biologist. He was highly
influential in the development of theoretical computer science, providing a
formalisation of the concepts of algorithm and computation with the Turing
machine, which can be considered a model of a general-purpose computer.
Turing is widely considered to be the father of theoretical computer science.
Dataset: ('Alan', 1), ('Mathison', 1), ('Turing', 3), ('was', 2) and so on
So BoW returns (i) A vocabulary of words for the corpus
(ii) The frequency of these words
The steps involved in BoW algorithm are:
1) Text Normalisation: The collection of data is processed to get a
normalised corpus.
2) Create Dictionary: This step creates a list of all the unique words
available in the normalised corpus.
3) Create Document Vectors: For each document in the corpus, create a
list of the dictionary words with their number of occurrences.
4) Create Document Vectors for all the Documents: Repeat step 3 for all
documents in the corpus to create a document vector table.
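A minimal Bag of Words sketch of these four steps in plain Python, applied to the three small documents used in the TFIDF example later in these notes (they are assumed to be already normalised):

documents = ["I like oranges",
             "I also like grapes",
             "oranges and grapes are rich in vitaminC"]

# Step 2: create the dictionary of unique words, in order of first appearance.
dictionary = []
for doc in documents:
    for word in doc.split():
        if word not in dictionary:
            dictionary.append(word)
print(dictionary)

# Steps 3 and 4: create a document vector for every document.
for doc in documents:
    tokens = doc.split()
    print([tokens.count(word) for word in dictionary])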
To perform text normalisation in code
we will use a Python library called nltk.
nltk stands for Natural Language Toolkit.
To install nltk we will use:
pip install nltk
Then we will open Python in cmd
by typing: python
Then we will type: >>> import nltk
Now we need to download the nltk data packages:
>>> nltk.download()
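nltk.download() opens a downloader from which the data packages can be chosen. Alternatively, the individual packages used in the sketches above can be fetched directly (package names may vary slightly between NLTK versions):

>>> nltk.download('punkt')       # tokenizer data for sent_tokenize / word_tokenize
>>> nltk.download('stopwords')   # NLTK's own stopword lists
>>> nltk.download('wordnet')     # data for the WordNetLemmatizer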
TFIDF -> Term Frequency and Inverse Document Frequency
This method is considered a better version of BoW (Bag of Words). Why?
1) BoW gives us a numeric vector (i.e. the word frequencies of a document),
while TFIDF also gives numeric values, but these reflect the importance of
each word in a document.
2) TFIDF is a statistical measure of how important a word is in a document.
Term Frequency means the frequency of a word in one document.
Term frequency can be easily found from the document vector table.
Now, how do we build the document vector table?
Let's see.
Suppose we have three documents.
Document 1: I like oranges
Document 2: I also like grapes
Document 3: Oranges and grapes are rich in vitaminC
Now we perform text normalisation on each document:
Document 1: [I, like, oranges]
Document 2: [I, also, like, grapes]
Document 3: [oranges, and, grapes, are, rich, in, vitaminC]
Creating a dictionary (all the unique words in the corpus):
I like oranges also grapes and are rich in vitaminC
Now creating the document vectors:
1) The list of words from the dictionary is written in a row. Then, for each
word in document 1, if it matches a word in the dictionary put 1 under that
word, and put 0 under the words that do not match. Repeat the steps with
document 2 and document 3 respectively.
             I   like  oranges  also  grapes  and  are  rich  in  vitaminC
Document 1:  1    1      1       0      0      0    0    0    0      0
Document 2:  1    1      0       1      1      0    0    0    0      0
Document 3:  0    0      1       0      1      1    1    1    1      1
Thus we get the term frequency of each word in each document.
Document Frequency is the number of documents in which the word occurs,
irrespective of how many times it has occurred in those documents.
For example:
             I   like  oranges  also  grapes  and  are  rich  in  vitaminC
             2    2      2       1      2      1    1    1    1      1
Inverse Document Frequency is obtained by putting the document frequency in
the denominator and the total number of documents in the numerator.
             I    like  oranges  also  grapes  and  are  rich  in  vitaminC
            3/2   3/2    3/2     3/1    3/2    3/1  3/1  3/1  3/1    3/1
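Putting the whole TFIDF walkthrough into a short Python sketch, using the definitions from these notes (IDF written as the ratio N/DF; note that most practical implementations additionally take a logarithm of this ratio):

documents = [["I", "like", "oranges"],
             ["I", "also", "like", "grapes"],
             ["oranges", "and", "grapes", "are", "rich", "in", "vitaminC"]]
dictionary = ["I", "like", "oranges", "also", "grapes",
              "and", "are", "rich", "in", "vitaminC"]
N = len(documents)   # total number of documents

for word in dictionary:
    tf = [doc.count(word) for doc in documents]       # term frequency per document
    df = sum(1 for doc in documents if word in doc)   # document frequency
    print(word, "TF:", tf, "DF:", df, "IDF:", str(N) + "/" + str(df))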