Chapter 9: Natural Language Processing (NLP)
🧠 NLP = “Giving machines the ability to read, understand, and generate human language.”
🔹 1. Text Preprocessing (Cleaning Text Before Feeding to Models)
Key Steps:
| Step | Purpose |
|---|---|
| Tokenization | Split text into words or sentences |
| Lowercasing | Uniform case for comparison |
| Stopword Removal | Remove common words (like "is", "the") |
| Stemming | Reduce words to root (e.g., "playing" → "play") |
| Lemmatization | Reduce words to their dictionary form, or lemma (e.g., "children" → "child"); more accurate than stemming |
| Removing Punctuation & Numbers | Clean irrelevant symbols |
Example (Using NLTK):
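A minimal sketch of the steps above using NLTK (it assumes the `punkt`, `stopwords`, and `wordnet` resources have been downloaded; resource names can vary slightly between NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The children were playing games in the gardens, scoring 3 points!"

# 1. Tokenization + 2. Lowercasing
tokens = [t.lower() for t in word_tokenize(text)]

# 3. Remove punctuation and numbers
tokens = [t for t in tokens if t.isalpha()]

# 4. Stopword removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 5. Stemming (crude suffix stripping)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# 6. Lemmatization (dictionary-based, usually cleaner)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(stems)   # e.g. ['children', 'play', 'game', 'garden', 'score', 'point']
print(lemmas)  # e.g. ['child', 'playing', 'game', 'garden', 'scoring', 'point']
```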
🔹 2. Text Representation (Feature Engineering)
| Technique | Description |
|---|---|
| Bag of Words (BoW) | Count how often each word appears |
| TF-IDF | Weighs rare but important words higher |
| Word2Vec | Converts words to dense vectors with meaning |
| GloVe | Pretrained word vectors from huge corpora |
| BERT/GPT Embeddings | Deep contextual representations |
📌 Goal: Convert text into numbers so ML/DL models can understand.
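As a rough illustration of the first two techniques, here is a short scikit-learn sketch (the toy corpus is invented for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "a great great film",
]

# Bag of Words: raw word counts per document
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # vocabulary learned from the corpus
print(X_bow.toarray())              # one row of counts per document

# TF-IDF: counts re-weighted so words shared by every document count less
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```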
🔹 3. NLP Models & Techniques
| Task | Model/Algorithm |
|---|---|
| Sentiment Analysis | Logistic Regression, LSTM, BERT |
| Text Classification | Naive Bayes, SVM, CNN |
| Named Entity Recognition (NER) | spaCy, Transformers |
| Machine Translation | Sequence-to-sequence, Transformer |
| Text Summarization | Seq2Seq + Attention, Pegasus |
| Question Answering | BERT, GPT models |
| Chatbots | RNN + DialogFlow, GPT-4, Retrieval-based |
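To make the first row concrete, a tiny sentiment-analysis sketch with TF-IDF features and Logistic Regression might look like this (the sentences and labels are invented for illustration; a real project would use a dataset such as IMDB):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data: 1 = positive, 0 = negative
texts = [
    "I loved this movie, it was fantastic",
    "What a wonderful and touching film",
    "Absolutely terrible, a waste of time",
    "I hated every minute of it",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a Logistic Regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["this film was wonderful"]))   # most likely [1] (positive)
print(model.predict(["a terrible waste of time"]))  # most likely [0] (negative)
```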
🔹 4. Transformers, BERT & GPT
🔸 Transformers
Use self-attention to understand context in sequences.
- Core of modern NLP
- Input can be processed in parallel, unlike RNNs
🔸 BERT (Bidirectional Encoder Representations from Transformers)
- Reads text in both directions (context-aware)
- Fine-tuned for:
  - Sentiment Analysis
  - Question Answering
  - NER
🔸 GPT (Generative Pre-trained Transformer)
- Focuses on text generation
- Examples: GPT-2, GPT-3, GPT-4
- Used in ChatGPT, AI writers, etc. (see the sketch after this list)
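A small sketch of the BERT-vs-GPT distinction using HuggingFace pipelines (the checkpoints bert-base-uncased and gpt2 are just common public choices, not the only options):

```python
from transformers import pipeline

# BERT is an encoder: it uses context on both sides of a masked word
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# GPT-2 is a decoder: it generates text left to right
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20)[0]["generated_text"])
```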
🔹 5. Hands-on Projects in NLP
| Project | Tools & Models |
|---|---|
| Chatbot | RNN/Transformer, Preprocessed text |
| Sentiment Analysis | LSTM/BERT + IMDB dataset |
| Text Summarizer | Seq2Seq + Attention |
| Spam Classifier | TF-IDF + Logistic Regression |
| Question Answering Bot | BERT/Q&A Dataset |
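For the last project in the table, a minimal extractive question-answering sketch with a HuggingFace pipeline could look like this (distilbert-base-cased-distilled-squad is one publicly available checkpoint fine-tuned on a Q&A dataset; the context paragraph is made up):

```python
from transformers import pipeline

# A BERT-style model fine-tuned for extractive question answering
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The Transformer architecture was introduced in 2017 and relies on "
    "self-attention instead of recurrence to model sequences."
)

result = qa(question="What does the Transformer rely on?", context=context)
print(result["answer"], round(result["score"], 3))  # expected answer: "self-attention"
```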
🔹 6. Popular Libraries for NLP
| Library | Purpose |
|---|---|
| NLTK | Preprocessing & linguistic tasks |
| spaCy | Fast NLP tasks (NER, POS tagging) |
| Scikit-learn | BoW, TF-IDF + ML models |
| Transformers (HuggingFace) | Pretrained models (BERT, GPT, RoBERTa) |
| TextBlob | Simple NLP operations |
| OpenAI API | GPT-based text generation |
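As a quick illustration of the spaCy row, a POS-tagging and NER sketch (it assumes the small English model has been installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple was founded by Steve Jobs in California in 1976.")

# Part-of-speech tags for the first few tokens
for token in doc[:4]:
    print(token.text, token.pos_)

# Named entities detected in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Steve Jobs PERSON, California GPE
```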
✅ Summary of Chapter 9
| Topic | Summary |
|---|---|
| Text Preprocessing | Clean, tokenize, remove noise |
| Text Vectorization | Convert text to numeric (BoW, TF-IDF, Word2Vec, BERT) |
| NLP Models | Classification, generation, translation |
| Transformers | State-of-the-art for all major NLP tasks |
| Libraries | HuggingFace, NLTK, spaCy, OpenAI API |
💡 Mini Tasks:
- Build a spam/ham classifier using Naive Bayes + TF-IDF (see the sketch after this list).
- Train a sentiment analyzer using LSTM or BERT on IMDB data.
- Create a simple chatbot using nltk.chat or a Transformer.
- Use HuggingFace to load a BERT model for Q&A.
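A starting point for the first task, using a tiny invented set of messages (a real version would train on a labelled SMS or email spam corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages: 1 = spam, 0 = ham
messages = [
    "WIN a FREE prize now, click here",
    "Congratulations, you won a free lottery ticket",
    "Are we still meeting for lunch tomorrow?",
    "Please send me the notes from class",
]
labels = [1, 1, 0, 0]

# TF-IDF features + Multinomial Naive Bayes, as in the task description
spam_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_clf.fit(messages, labels)

print(spam_clf.predict(["free prize, click now"]))      # likely spam -> [1]
print(spam_clf.predict(["see you at lunch tomorrow"]))  # likely ham  -> [0]
```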