
Natural Language Processing (NLP)

How machines understand, interpret, and generate human language -- from tokenization to word embeddings.

What is NLP?

Natural Language Processing is a field in AI that helps computers understand and work with human language. It powers Google Translate, Siri, Alexa, ChatGPT, spam filters, and sentiment analysis tools.

Humans speak natural languages (English, Tamil, Hindi). Computers understand only numbers. NLP is the translator between the two.

What NLP Can Do

Task                         Example
Understand Meaning           Know that "I'm feeling down" means someone is sad
Extract Information (NER)    Pull names, dates, locations from articles
Translate Languages          English to Japanese using translation models
Generate Text                Write paragraphs or code from prompts
Summarize Documents          Condense a 2000-word article into 3 lines
Answer Questions             Like ChatGPT does

1. Tokenization

Splits text into individual units (tokens) that ML models can process. Models cannot understand raw text -- they need discrete units.

Word Tokenization

"Hello world" -> ["Hello", "world"]

Character Tokenization

"Hello" -> ["H", "e", "l", "l", "o"]

Subword Tokenization

"playing" -> ["play", "##ing"] (used in Transformers)

import nltk
nltk.download("punkt")  # tokenizer models, needed once
from nltk.tokenize import word_tokenize

word_tokenize("I'm learning NLP.")
# ['I', "'m", 'learning', 'NLP', '.']
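The three splitting strategies can also be sketched in plain Python. This is a toy illustration only: the subword rule below is hand-written, not a real BPE/WordPiece tokenizer.

```python
# Toy illustration of tokenization granularities (no trained tokenizer).

def word_tokens(text):
    # Whitespace split: a rough stand-in for a real word tokenizer
    return text.split()

def char_tokens(text):
    # Every character becomes its own token
    return list(text)

print(word_tokens("Hello world"))  # ['Hello', 'world']
print(char_tokens("Hello"))        # ['H', 'e', 'l', 'l', 'o']

# Illustrative subword split: strip one known suffix, mimicking how a
# WordPiece-style tokenizer might produce ["play", "##ing"]
word = "playing"
if word.endswith("ing"):
    print([word[:-3], "##ing"])    # ['play', '##ing']
```

Real subword tokenizers learn their vocabulary from data; the point here is only that one word can map to several tokens.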

2. Stopword Removal

Removes frequent words like "the", "is", "a", "an" that carry little meaning in classification tasks.

import nltk
nltk.download("stopwords")  # word lists, needed once
from nltk.corpus import stopwords

stopwords.words('english')  # includes 'is', 'the', etc.
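Filtering itself is just a membership check over the token list. A minimal sketch, using a tiny hand-picked stopword set rather than NLTK's full English list:

```python
# Tiny hand-picked sample of stopwords (NLTK's real list is much longer)
STOPWORDS = {"the", "is", "a", "an", "in", "of", "and"}

tokens = ["the", "cat", "is", "in", "the", "garden"]
# Keep only tokens that are not stopwords (case-insensitive)
filtered = [t for t in tokens if t.lower() not in STOPWORDS]
print(filtered)  # ['cat', 'garden']
```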

3. Stemming vs Lemmatization

Stemming                              Lemmatization
Cuts suffixes: "studies" -> "studi"   Finds the root word: "studies" -> "study"
Fast but less accurate                Slower but linguistically correct
"flies" -> "fli" (not a real word)    "flies" -> "fly"
import nltk
nltk.download("wordnet")  # lemma dictionary, needed once
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))          # studi
print(lemmatizer.lemmatize("studies"))  # study
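To see why the two differ, here is a deliberately naive sketch of each idea in plain Python. The suffix rules and the lemma table are invented for illustration; NLTK's PorterStemmer and WordNetLemmatizer are the real tools.

```python
# Naive stemmer: blindly chop common suffixes -- fast, but can output non-words
def naive_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            # "ies" becomes "i" (studies -> studi), others are just removed
            return word[:-len(suffix)] + ("i" if suffix == "ies" else "")
    return word

# Naive lemmatizer: look the word up in a dictionary of real root words
LEMMA_TABLE = {"studies": "study", "flies": "fly", "better": "good"}

def naive_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(naive_stem("studies"))       # studi  (not a real word)
print(naive_lemmatize("studies"))  # study  (a real word)
```

The trade-off is visible directly: the stemmer is a few string rules, while the lemmatizer needs a dictionary, which is why lemmatization is slower but linguistically correct.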

4. Bag of Words (BoW)

The simplest way to convert text to numbers. Count how many times each word appears -- ignore word order entirely.

"I love NLP" -> [1, 1, 1, 0] (vocabulary: [I, love, NLP, Python]) "I love Python" -> [1, 1, 0, 1]
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love Python"]
cv = CountVectorizer()  # note: drops single-character tokens like "I" by default
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # ['love', 'nlp', 'python']
print(X.toarray())                 # [[1, 1, 0], [1, 0, 1]]
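The same counting can be done by hand to show there is no magic involved: build a shared vocabulary, then count each word per document. Unlike CountVectorizer, this sketch keeps single-character tokens and is case-sensitive.

```python
# Bag of words by hand: shared vocabulary + per-document word counts
from collections import Counter

docs = ["I love NLP", "I love Python"]

# Union of all words, sorted for a stable column order
vocab = sorted({w for d in docs for w in d.split()})
print(vocab)  # ['I', 'NLP', 'Python', 'love']

# One count vector per document, columns aligned with vocab
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vectors)  # [[1, 1, 0, 1], [1, 0, 1, 1]]
```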

5. TF-IDF (Term Frequency - Inverse Document Frequency)

Improves BoW by reducing the weight of common words and boosting rare but important terms.

TF(t, d) = (count of term t in document d) / (total terms in document d)
IDF(t)   = log(total documents / documents containing term t)
TF-IDF(t, d) = TF(t, d) * IDF(t)

High TF-IDF: the term is frequent in one document but rare across the corpus
Low TF-IDF:  the term is common everywhere (like "the") or absent
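The formulas can be worked through by hand. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this plain textbook version:

```python
# TF-IDF computed directly from the definitions above
import math

docs = [["i", "love", "nlp"],
        ["nlp", "loves", "me"],
        ["i", "love", "python", "and", "nlp"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / df)

# "nlp" appears in every document -> IDF = log(3/3) = 0, so TF-IDF = 0
print(tf("nlp", docs[0]) * idf("nlp", docs))        # 0.0
# "python" appears in only one document -> it gets a higher weight
print(tf("python", docs[2]) * idf("python", docs))  # 0.2 * log(3) ~ 0.2197
```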
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP loves me", "I love Python and NLP"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray())

6. Word Embeddings

Dense vector representations where words with similar meaning are close in vector space. Unlike BoW/TF-IDF, embeddings capture semantic relationships.

The famous analogy: king - man + woman = queen. This is only possible with embeddings, not BoW or TF-IDF.
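The analogy is literal vector arithmetic. A minimal sketch on tiny hand-made 2D vectors: the dimensions ("royalty", "maleness") are invented here purely for illustration, whereas real Word2Vec dimensions are learned and not individually interpretable.

```python
# Toy 2D "embeddings": [royalty, maleness] -- invented for illustration
king  = [1.0, 1.0]
man   = [0.0, 1.0]
woman = [0.0, 0.0]
queen = [1.0, 0.0]

# king - man + woman, element-wise
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # [1.0, 0.0]

# Nearest word to the result by squared Euclidean distance
vocab = {"king": king, "man": man, "woman": woman, "queen": queen}
nearest = min(vocab, key=lambda w: sum((a - b) ** 2 for a, b in zip(vocab[w], result)))
print(nearest)  # queen
```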

Word2Vec (Google, 2013)

Trained on huge text corpora. Captures semantic analogies.

GloVe (Stanford, 2014)

Uses global word co-occurrence statistics.

FastText (Facebook, 2016)

Works with subwords. Handles rare words and misspellings.

BERT (Google, 2018)

Contextual embeddings. Same word gets different vectors in different sentences.

from gensim.models import Word2Vec

# Training a small toy model
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["I", "love", "deep", "learning"],
    ["NLP", "is", "fun"],
    ["Python", "is", "great", "for", "NLP"],
]
model = Word2Vec(sentences, vector_size=20, window=3, min_count=1)

# Vector for the word 'NLP'
print(model.wv['NLP'])

# Find similar words
print(model.wv.most_similar('love'))

Progression of Text Representations

Method               Captures Meaning?        Word Order?  Context?
Bag of Words         No                       No           No
TF-IDF               Partially (importance)   No           No
Word2Vec / GloVe     Yes (static)             No           No
BERT / Transformers  Yes (contextual)         Yes          Yes
