
Natural Language Processing (NLP)

How machines understand, interpret, and generate human language -- from tokenization to word embeddings.

What is NLP?

Natural Language Processing is a field in AI that helps computers understand and work with human language. It powers Google Translate, Siri, Alexa, ChatGPT, spam filters, and sentiment analysis tools.

Humans speak natural languages (English, Tamil, Hindi). Computers understand only numbers. NLP is the translator between the two.

What NLP Can Do

Task                         Example
Understand Meaning           Know that "I'm feeling down" means someone is sad
Extract Information (NER)    Pull names, dates, locations from articles
Translate Languages          English to Japanese using translation models
Generate Text                Write paragraphs or code from prompts
Summarize Documents          Condense a 2000-word article into 3 lines
Answer Questions             Like ChatGPT does

1. Tokenization

Splits text into individual units (tokens) that ML models can process. Models cannot understand raw text -- they need discrete units.

Word Tokenization

"Hello world" -> ["Hello", "world"]

Character Tokenization

"Hello" -> ["H", "e", "l", "l", "o"]

Subword Tokenization

"playing" -> ["play", "##ing"] (used in Transformers)

import nltk
nltk.download("punkt")  # tokenizer models, needed once
from nltk.tokenize import word_tokenize

word_tokenize("I'm learning NLP.")
# ['I', "'m", 'learning', 'NLP', '.']
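The three splitting strategies can also be sketched in plain Python. This is a toy illustration only: the subword rule below is hand-written, not a real BPE/WordPiece tokenizer.

```python
# Toy illustration of tokenization granularities (no trained tokenizer).

def word_tokens(text):
    # Whitespace split: a rough stand-in for a real word tokenizer
    return text.split()

def char_tokens(text):
    # Every character becomes its own token
    return list(text)

print(word_tokens("Hello world"))  # ['Hello', 'world']
print(char_tokens("Hello"))        # ['H', 'e', 'l', 'l', 'o']

# Illustrative subword split: strip one known suffix, mimicking how a
# WordPiece-style tokenizer might produce ["play", "##ing"]
word = "playing"
if word.endswith("ing"):
    print([word[:-3], "##ing"])    # ['play', '##ing']
```

Real subword tokenizers learn their vocabulary from data; the point here is only that one word can map to several tokens.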

2. Stopword Removal

Removes frequent words like "the", "is", "a", "an" that carry little meaning in classification tasks.

import nltk
nltk.download("stopwords")  # word lists, needed once
from nltk.corpus import stopwords

stopwords.words('english')  # includes 'is', 'the', etc.
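Filtering itself is just a membership check over the token list. A minimal sketch, using a tiny hand-picked stopword set rather than NLTK's full English list:

```python
# Tiny hand-picked sample of stopwords (NLTK's real list is much longer)
STOPWORDS = {"the", "is", "a", "an", "in", "of", "and"}

tokens = ["the", "cat", "is", "in", "the", "garden"]
# Keep only tokens that are not stopwords (case-insensitive)
filtered = [t for t in tokens if t.lower() not in STOPWORDS]
print(filtered)  # ['cat', 'garden']
```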

3. Stemming vs Lemmatization

Stemming                              Lemmatization
Cuts suffixes: "studies" -> "studi"   Finds the root word: "studies" -> "study"
Fast but less accurate                Slower but linguistically correct
"flies" -> "fli" (not a real word)    "flies" -> "fly"
import nltk
nltk.download("wordnet")  # lemma dictionary, needed once
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))          # studi
print(lemmatizer.lemmatize("studies"))  # study
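To see why the two differ, here is a deliberately naive sketch of each idea in plain Python. The suffix rules and the lemma table are invented for illustration; NLTK's PorterStemmer and WordNetLemmatizer are the real tools.

```python
# Naive stemmer: blindly chop common suffixes -- fast, but can output non-words
def naive_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            # "ies" becomes "i" (studies -> studi), others are just removed
            return word[:-len(suffix)] + ("i" if suffix == "ies" else "")
    return word

# Naive lemmatizer: look the word up in a dictionary of real root words
LEMMA_TABLE = {"studies": "study", "flies": "fly", "better": "good"}

def naive_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(naive_stem("studies"))       # studi  (not a real word)
print(naive_lemmatize("studies"))  # study  (a real word)
```

The trade-off is visible directly: the stemmer is a few string rules, while the lemmatizer needs a dictionary, which is why lemmatization is slower but linguistically correct.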

4. Bag of Words (BoW)

The simplest way to convert text to numbers. Count how many times each word appears -- ignore word order entirely.

"I love NLP" -> [1, 1, 1, 0] (vocabulary: [I, love, NLP, Python]) "I love Python" -> [1, 1, 0, 1]
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love Python"]
cv = CountVectorizer()  # note: drops single-character tokens like "I" by default
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # ['love', 'nlp', 'python']
print(X.toarray())                 # [[1, 1, 0], [1, 0, 1]]
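The same counting can be done by hand to show there is no magic involved: build a shared vocabulary, then count each word per document. Unlike CountVectorizer, this sketch keeps single-character tokens and is case-sensitive.

```python
# Bag of words by hand: shared vocabulary + per-document word counts
from collections import Counter

docs = ["I love NLP", "I love Python"]

# Union of all words, sorted for a stable column order
vocab = sorted({w for d in docs for w in d.split()})
print(vocab)  # ['I', 'NLP', 'Python', 'love']

# One count vector per document, columns aligned with vocab
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vectors)  # [[1, 1, 0, 1], [1, 0, 1, 1]]
```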

5. TF-IDF (Term Frequency - Inverse Document Frequency)

Improves BoW by reducing the weight of common words and boosting rare but important terms.

TF(t, d) = (count of term t in document d) / (total terms in document d)
IDF(t)   = log(total documents / documents containing term t)
TF-IDF(t, d) = TF(t, d) * IDF(t)

High TF-IDF: the term is frequent in one document but rare across the corpus
Low TF-IDF:  the term is common everywhere (like "the") or absent
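The formulas can be worked through by hand. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this plain textbook version:

```python
# TF-IDF computed directly from the definitions above
import math

docs = [["i", "love", "nlp"],
        ["nlp", "loves", "me"],
        ["i", "love", "python", "and", "nlp"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / df)

# "nlp" appears in every document -> IDF = log(3/3) = 0, so TF-IDF = 0
print(tf("nlp", docs[0]) * idf("nlp", docs))        # 0.0
# "python" appears in only one document -> it gets a higher weight
print(tf("python", docs[2]) * idf("python", docs))  # 0.2 * log(3) ~ 0.2197
```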
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP loves me", "I love Python and NLP"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray())

6. Word Embeddings

Dense vector representations where words with similar meaning are close in vector space. Unlike BoW/TF-IDF, embeddings capture semantic relationships.

The famous analogy: king - man + woman = queen. This is only possible with embeddings, not BoW or TF-IDF.
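The analogy is literal vector arithmetic. A minimal sketch on tiny hand-made 2D vectors: the dimensions ("royalty", "maleness") are invented here purely for illustration, whereas real Word2Vec dimensions are learned and not individually interpretable.

```python
# Toy 2D "embeddings": [royalty, maleness] -- invented for illustration
king  = [1.0, 1.0]
man   = [0.0, 1.0]
woman = [0.0, 0.0]
queen = [1.0, 0.0]

# king - man + woman, element-wise
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # [1.0, 0.0]

# Nearest word to the result by squared Euclidean distance
vocab = {"king": king, "man": man, "woman": woman, "queen": queen}
nearest = min(vocab, key=lambda w: sum((a - b) ** 2 for a, b in zip(vocab[w], result)))
print(nearest)  # queen
```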

Word2Vec (Google, 2013)

Trained on huge text corpora. Captures semantic analogies.

GloVe (Stanford, 2014)

Uses global word co-occurrence statistics.

FastText (Facebook, 2016)

Works with subwords. Handles rare words and misspellings.

BERT (Google, 2018)

Contextual embeddings. Same word gets different vectors in different sentences.

from gensim.models import Word2Vec

# Training a small toy model
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["I", "love", "deep", "learning"],
    ["NLP", "is", "fun"],
    ["Python", "is", "great", "for", "NLP"],
]
model = Word2Vec(sentences, vector_size=20, window=3, min_count=1)

# Vector for the word 'NLP'
print(model.wv['NLP'])

# Find similar words
print(model.wv.most_similar('love'))

Progression of Text Representations

Method               Captures Meaning?        Word Order?  Context?
Bag of Words         No                       No           No
TF-IDF               Partially (importance)   No           No
Word2Vec / GloVe     Yes (static)             No           No
BERT / Transformers  Yes (contextual)         Yes          Yes
