Natural Language Processing CBSE Class 10

Natural Language Processing

Example sentence: We are learning NLP!
Tokens:
We,are,learning,NLP,!
Syntax means the proper arrangement of words according to grammatical rules.
When a syntactically correct sentence also conveys meaning, we talk about its semantics.
Example: His future is very bright. (semantics)
Data Processing
1) Pre-processing steps
(Text Normalisation)
Words such as slang, short forms, misspellings and special characters need to be
converted into a canonical (standard) form.
  
Words                     Canonical form
b4, beefore, bifore       before
Gni8                      Good night
tysm                      Thank you so much
gr8, grt                  great
 
Statement:   Gn tke care
Perform text normalisation
Answer: Good night take care. (corpus)
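A minimal sketch of this kind of normalisation in Python, using a small hand-made replacement dictionary (the mapping below is only an illustrative sample, not a standard list):

# A tiny canonical-form normaliser (sample mapping only, for illustration)
replacements = {
    "b4": "before", "beefore": "before", "bifore": "before",
    "gn": "good night", "gni8": "good night",
    "tysm": "thank you so much",
    "gr8": "great", "grt": "great",
    "tke": "take",
}

def normalise(text):
    # Replace each word with its canonical form if it is in the dictionary
    words = text.lower().split()
    return " ".join(replacements.get(w, w) for w in words)

print(normalise("Gn tke care"))   # -> good night take care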
 
There are multiple steps for performing text normalisation:
1. Sentence Segmentation   2. Tokenization
3. Removing stop words, special characters and numbers
4. Converting text to a common case
5. Stemming
6. Lemmatization 
Step 1: Sentence Segmentation
It is the process of sentence boundary detection, which divides the corpus into individual sentences.
(Boundary detection example: a full stop marks the end of a sentence.)
(Here, corpus means the entire paragraph of text.)
World War II (1939-1945) was a global conflict between the Allies—mainly the U.S.,
U.K., Soviet Union, and China—and the Axis powers, led by Germany, Italy, and Japan.
It began with Germany's invasion of Poland and included 
major events like the Holocaust, the bombing of Hiroshima and 
Nagasaki, and the D-Day invasion. The war ended with the 
Axis powers' defeat, reshaping global politics and leading 
to the creation of the United Nations, the Cold War, and 
decolonization movements worldwide. It remains the deadliest 
conflict in human history.
(Sentence Segmentation)
1. World War II (1939-1945) was a global conflict between the Allies—mainly the U.S., U.K., Soviet Union, and China—and the Axis powers, led by Germany, Italy, and Japan.
2. It began with Germany's invasion of Poland and included major events like the Holocaust, the bombing of Hiroshima and Nagasaki, and the D-Day invasion.
3. The war ended with the Axis powers' defeat, reshaping global politics and leading to the creation of the United Nations, the Cold War, and decolonization movements worldwide.
4. It remains the deadliest conflict in human history.
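A small sketch of sentence segmentation using the nltk library introduced later in these notes (the punkt tokenizer data is assumed to be downloaded):

from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # run once to fetch the sentence tokenizer data
text = ("It began with Germany's invasion of Poland. The war ended with the "
        "Axis powers' defeat. It remains the deadliest conflict in human history.")
for number, sentence in enumerate(sent_tokenize(text), start=1):
    print(number, sentence)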
Step 2: Tokenization
It is the process of dividing a sentence further into tokens.
A token can be any word, number or special character that forms a sentence.
Tokenization is achieved by finding the boundary of a word, that is, where one
word ends and another begins; for example, in English the space is the
boundary detector.
For example:
It remains the deadliest 
conflict in human history.
Tokens:
It,remains,the,deadliest,conflict,in,human,history,.
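The same step with nltk's word tokenizer (a sketch; the punkt data is assumed to be downloaded):

from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once if not already downloaded
sentence = "It remains the deadliest conflict in human history."
print(word_tokenize(sentence))
# ['It', 'remains', 'the', 'deadliest', 'conflict', 'in', 'human', 'history', '.']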
Step 3: Removing stopwords,special characters and numbers.
Stopwords are frequently occurring words that help make a sentence meaningful,
but for the machine they are of little use as they do not provide much
information about the corpus.
Example:   a, an, and, are, as, for, from, is, into, in, if, on, or, such, the,
there, to
An Apple had fallen on the head of Newton
Tokenization:  An, Apple, had, fallen, on, the, head, of, Newton
Removal of stopwords:  Apple had fallen head Newton
Step 4:  Converting text to common case
Converting all the tokens to a common case, i.e. either small letters/lowercase
or capital letters/uppercase.
Example:   APPLE HAD FALLEN HEAD NEWTON
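A combined sketch of Steps 3 and 4 using nltk's built-in English stopword list (the stopwords data is assumed to be downloaded; note that nltk's list is longer than the short list shown above, so it also removes 'had'):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')  # run once to fetch the stopword list
sentence = "An Apple had fallen on the head of Newton"
stop_words = set(stopwords.words('english'))

# Step 3: keep only the tokens that are not stopwords (compared in lowercase)
tokens = [t for t in word_tokenize(sentence) if t.lower() not in stop_words]
# Step 4: convert the remaining tokens to a common (lower) case
tokens = [t.lower() for t in tokens]
print(tokens)   # -> ['apple', 'fallen', 'head', 'newton']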
Step 5: Stemming
The process of removing affixes from words to get back the base word is called
stemming.

Word        Affix   Stem/base word
healed      -ed     heal
healing     -ing    heal
taking      -ing    tak
studies     -es     studi
studying    -ing    study

Keep in mind that the stem left after removing the affix may not be a
meaningful word (e.g. tak, studi).
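A sketch of stemming with nltk's PorterStemmer; note that its output can differ slightly from the hand-worked table above, since different stemmers apply different rules:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "taking", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))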
Step 6: Lemmatization
This is also a process of removing affixes from a word, but it always produces
a meaningful base word. The word we get after removing the affix is called the
lemma. Since lemmatization always focuses on creating a meaningful lemma, it
takes longer than stemming but gives better results.

Word       Affix   Lemma
healed     -ed     heal
studies    -es     study
taking     -ing    take
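A sketch of lemmatization with nltk's WordNetLemmatizer (the wordnet data is assumed to be downloaded; passing the part of speech, e.g. pos='v' for verbs, helps it find the correct lemma):

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # run once to fetch the WordNet data
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # study (default noun lookup)
print(lemmatizer.lemmatize("healed", pos="v"))   # heal
print(lemmatizer.lemmatize("taking", pos="v"))   # take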
To perform NLP we have multiple well-known approaches and tools; some of them are:
1. Bag of words  2.  Term Frequency and Inverse Document Frequency(TFIDF)
3. Natural Language Toolkit (NLTK)
Now we are going to learn "Bag of Words".
Alan Turing is widely regarded as the father of AI; his 1950 paper "Computing
Machinery and Intelligence" laid the theoretical foundations of machine
intelligence. The passage about him below is used as the sample corpus for BoW.
BoW is an algorithm that represents a document as a collection of its
meaningful words, also known as tokens, scattered with no specific order,
just like a bag full of words.

Alan Mathison Turing was an English mathematician, 
computer scientist, logician, cryptanalyst, 
philosopher and theoretical biologist.
He was highly influential in the development 
of theoretical computer science, 
providing a formalisation of 
the concepts of algorithm and computation with the 
Turing machine, which can be considered a model 
of a general-purpose computer.
Turing is widely considered to be the father of 
theoretical computer science.
Dataset:   ('Alan',1), ('Mathison',1), ('Turing',3), ('was',2) and so on
So BoW returns (i) A vocabulary of words for the corpus
               (ii) The frequency of these words
The steps involved in the BoW algorithm are:
1) Text Normalisation: The collection of data is processed to get the
normalised corpus.
2) Create Dictionary: Make a list of all the unique words available in the
normalised corpus.
3) Create Document Vectors: For each document in the corpus, create a list of
the unique words with their number of occurrences.
4) Create Document Vectors for all the Documents: Repeat step 3 for all
documents in the corpus to create the document vector table.
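A minimal sketch of these four steps in plain Python, assuming the documents have already been text-normalised:

from collections import Counter

# Step 1 (assumed already done): normalised documents
documents = [
    "i like oranges",
    "i also like grapes",
    "oranges and grapes are rich in vitaminc",
]

# Step 2: create a dictionary of all unique words in the corpus
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: create a document vector for every document
print(vocabulary)
for doc in documents:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])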
To perform text normalisation in code we will use a Python library called nltk.
nltk stands for Natural Language Toolkit.
To install nltk we will use:
pip install nltk

Then we will open Python in cmd by typing:  python
Then we will type:  >>> import nltk
Now we need to download the nltk data packages:
>>> nltk.download()
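Instead of downloading everything, you can fetch just the data packages used in the steps above (a sketch; these package names are from current nltk distributions):

import nltk

nltk.download('punkt')       # tokenizer data for sentence segmentation and tokenization
nltk.download('stopwords')   # list of common English stopwords
nltk.download('wordnet')     # dictionary data used by the lemmatizer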
TFIDF -> Term Frequency and Inverse Document Frequency
This method is considered a better version of BoW (Bag of Words). Why?
1) BoW gives us a numeric vector (i.e. the word frequency of a document),
whereas TFIDF, although also numeric, tells us the importance of each word in
a document.
2) TFIDF is a statistical measure of how important a word is in a document.

Term Frequency means the frequency of a word in one document. 
Term frequency can be easily found from the document vector table.
Now, how do we build the document vector table?
Let's see.
Suppose we have three documents.
Document 1: I like oranges
Document 2: I also like grapes
Document 3: Oranges and grapes are rich in vitaminC
First we perform text normalisation on each document:
Document 1: [i, like, oranges]
Document 2: [i, also, like, grapes]
Document 3: [oranges, and, grapes, are, rich, in, vitaminc]
Creating a dictionary
I  like  oranges  also grapes  and  are   rich  in vitaminC
Now creating a document vector:
1) Write the list of words from the dictionary in a row. Then, for each word in
Document 1, if it matches a word in the dictionary put 1 under it, and put 0
under the words that do not match. Repeat the steps for Document 2 and
Document 3 respectively.
              I    like  oranges  also  grapes  and  are  rich  in   vitaminC
Document 1:   1    1     1        0     0       0    0    0     0    0
Document 2:   1    1     0        1     1       0    0    0     0    0
Document 3:   0    0     1        0     1       1    1    1     1    1
Thus we get the term frequency
Document Frequency is the number of documents in which the word occurs,
irrespective of how many times it has occurred in those documents.
For example:
I    like  oranges  also  grapes  and  are  rich  in   vitaminC
2    2     2        1     2       1    1    1     1    1
Inverse Document Frequency is obtained by putting the document frequency in the
denominator and the total number of documents in the numerator.
I    like  oranges  also  grapes  and  are  rich  in   vitaminC
3/2  3/2   3/2      3/1   3/2     3/1  3/1  3/1   3/1  3/1
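A small sketch that computes all three tables for the example documents (here IDF is kept as the simple ratio total documents / document frequency shown above; many implementations additionally take its logarithm):

from collections import Counter

documents = [
    "i like oranges",
    "i also like grapes",
    "oranges and grapes are rich in vitaminc",
]

# Dictionary of unique words, in order of first appearance
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Term frequency: count of each word in each document
tf = [[Counter(doc.split())[w] for w in vocabulary] for doc in documents]

# Document frequency: number of documents containing each word
df = [sum(1 for doc in documents if w in doc.split()) for w in vocabulary]

# Inverse document frequency: total documents divided by document frequency
n = len(documents)
idf = [n / d for d in df]

print(vocabulary)
print(tf)    # term frequency rows for Document 1, 2, 3
print(df)    # [2, 2, 2, 1, 2, 1, 1, 1, 1, 1]
print(idf)   # [1.5, 1.5, 1.5, 3.0, 1.5, 3.0, 3.0, 3.0, 3.0, 3.0]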

