Chapter 9: Natural Language Processing (NLP)
🧠 NLP = “Giving machines the ability to read, understand, and generate human language.”
🔹 1. Text Preprocessing (Cleaning Text Before Feeding to Models)
Key Steps:
Step | Purpose |
---|---|
Tokenization | Split text into words or sentences |
Lowercasing | Uniform case for comparison |
Stopword Removal | Remove common words (like "is", "the") |
Stemming | Reduce words to root (e.g., "playing" → "play") |
Lemmatization | Reduce words to their dictionary form (lemma); slower but more accurate than stemming |
Removing Punctuation & Numbers | Clean irrelevant symbols |
Example (Using NLTK):
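A minimal preprocessing sketch, assuming NLTK is installed; the sample sentence is invented for illustration, and the `nltk.download` calls fetch the required resources on first run.

```python
# Minimal text preprocessing sketch with NLTK (illustrative sentence).
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required resources
# (newer NLTK versions may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The children were playing happily in the gardens!"

# 1. Tokenization + lowercasing
tokens = [t.lower() for t in word_tokenize(text)]

# 2. Remove punctuation and numbers (keep alphabetic tokens only)
tokens = [t for t in tokens if t.isalpha()]

# 3. Stopword removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 4. Stemming vs. lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # stems, e.g. "playing" -> "play"
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms (lemmas)
```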
🔹 2. Text Representation (Feature Engineering)
Technique | Description |
---|---|
Bag of Words (BoW) | Count how often each word appears |
TF-IDF | Weights words higher when they are frequent in a document but rare across the corpus |
Word2Vec | Converts words to dense vectors with meaning |
GloVe | Pretrained word vectors from huge corpora |
BERT/GPT Embeddings | Deep contextual representations |
📌 Goal: Convert text into numbers so ML/DL models can process it.
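To make the difference between Bag of Words and TF-IDF concrete, here is a small scikit-learn sketch; the three example sentences are invented for illustration.

```python
# Bag of Words vs. TF-IDF with scikit-learn (illustrative corpus).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great story",
]

# Bag of Words: raw term counts, one row per document, one column per word
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: down-weights words that appear in many documents (like "the")
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```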
🔹 3. NLP Models & Techniques
Task | Model/Algorithm |
---|---|
Sentiment Analysis | Logistic Regression, LSTM, BERT |
Text Classification | Naive Bayes, SVM, CNN |
Named Entity Recognition (NER) | SpaCy, Transformers |
Machine Translation | Sequence-to-sequence, Transformer |
Text Summarization | Seq2Seq + Attention, Pegasus |
Question Answering | BERT, GPT models |
Chatbots | RNN + DialogFlow, GPT-4, Retrieval-based |
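As a quick taste of one of these tasks, the following sketch runs spaCy's pretrained NER pipeline; it assumes the small English model has been installed with `python -m spacy download en_core_web_sm`, and the example sentence is illustrative.

```python
# Named Entity Recognition with spaCy's pretrained pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each entity comes with a text span and a predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Apple" ORG, "U.K." GPE, "$1 billion" MONEY
```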
🔹 4. Transformers, BERT & GPT
🔸 Transformers
- Use self-attention to understand context in sequences
- Core of modern NLP
- Input can be processed in parallel, unlike RNNs
🔸 BERT (Bidirectional Encoder Representations from Transformers)
- Reads text in both directions (context-aware)
- Fine-tuned for:
  - Sentiment Analysis
  - Question Answering
  - NER
🔸 GPT (Generative Pre-trained Transformer)
- Focuses on text generation
- Examples: GPT-2, GPT-3, GPT-4
- Used in ChatGPT, AI writers, etc. (see the pipeline sketch below)
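A minimal sketch of using pretrained BERT-style and GPT-style models through the HuggingFace `pipeline` API; the sentiment pipeline downloads the library's default model on first use, "gpt2" is the small public GPT-2 checkpoint, and the input sentences are invented.

```python
# Pretrained Transformer models via the HuggingFace pipeline API.
from transformers import pipeline

# BERT-style model fine-tuned for sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("This chapter makes transformers easy to follow!"))

# GPT-2 for text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20))
```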
🔹 5. Hands-on Projects in NLP
Project | Tools & Models |
---|---|
Chatbot | RNN/Transformer, Preprocessed text |
Sentiment Analysis | LSTM/BERT + IMDB dataset |
Text Summarizer | Seq2Seq + Attention |
Spam Classifier | TF-IDF + Logistic Regression |
Question Answering Bot | BERT/Q&A Dataset |
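As a starting point for the spam classifier project, here is a compact sketch combining TF-IDF with Logistic Regression; the tiny labelled dataset is made up, and in practice you would train on a real corpus such as the SMS Spam Collection.

```python
# Spam classifier sketch: TF-IDF features + Logistic Regression (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Win a free prize now, click here",
    "Lowest prices on meds, limited offer",
    "Are we still meeting for lunch today?",
    "Please review the attached report before Friday",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorize and classify in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Claim your free offer today"]))   # likely 'spam'
print(model.predict(["Can you send me the report?"]))   # likely 'ham'
```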
🔹 6. Popular Libraries for NLP
Library | Purpose |
---|---|
NLTK | Preprocessing & linguistic tasks |
spaCy | Fast NLP tasks (NER, POS tagging) |
Scikit-learn | BoW, TF-IDF + ML models |
Transformers (HuggingFace) | Pretrained models (BERT, GPT, RoBERTa) |
TextBlob | Simple NLP operations |
OpenAI API | GPT-based text generation |
✅ Summary of Chapter 9
Topic | Summary |
---|---|
Text Preprocessing | Clean, tokenize, remove noise |
Text Vectorization | Convert text to numeric (BoW, TF-IDF, Word2Vec, BERT) |
NLP Models | Classification, generation, translation |
Transformers | State-of-the-art for all major NLP tasks |
Libraries | HuggingFace, NLTK, spaCy, OpenAI API |
💡 Mini Tasks:
- Build a spam/ham classifier using Naive Bayes + TF-IDF.
- Train a sentiment analyzer using LSTM or BERT on the IMDB dataset.
- Create a simple chatbot using nltk.chat or a Transformer.
- Use HuggingFace to load a BERT model for Q&A (see the sketch below).
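For the last mini task, a minimal sketch using the HuggingFace question-answering pipeline; the model name and example context are illustrative choices, and omitting `model` falls back to the pipeline's default SQuAD-fine-tuned model.

```python
# Question answering with a BERT-style model from HuggingFace.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("BERT reads text in both directions, which makes it well suited "
           "to tasks such as question answering and named entity recognition.")

result = qa(question="What tasks is BERT well suited to?", context=context)
print(result["answer"], result["score"])
```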