How machines understand, interpret, and generate human language -- from tokenization to word embeddings.
Natural Language Processing (NLP) is a field of AI that enables computers to understand and work with human language. It powers Google Translate, Siri, Alexa, ChatGPT, spam filters, and sentiment analysis tools.
Humans speak natural languages (English, Tamil, Hindi). Computers understand only numbers. NLP is the translator between the two.
| Task | Example |
|---|---|
| Understand Meaning | Know that "I'm feeling down" means someone is sad |
| Extract Information (NER) | Pull names, dates, locations from articles |
| Translate Languages | English to Japanese using translation models |
| Generate Text | Write paragraphs or code from prompts |
| Summarize Documents | Condense a 2000-word article into 3 lines |
| Answer Questions | Like ChatGPT does |
Tokenization splits text into individual units (tokens) that ML models can process. Models cannot understand raw text; they need discrete units.
"Hello world" -> ["Hello", "world"]
"Hello" -> ["H", "e", "l", "l", "o"]
"playing" -> ["play", "##ing"] (used in Transformers)
Stop-word removal drops frequent words like "the", "is", "a", "an" that carry little meaning in classification tasks.
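A minimal sketch of stop-word filtering; the tiny stop-word set here is illustrative, and real libraries such as NLTK and spaCy ship much larger curated lists:

```python
# Illustrative stop-word set -- real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "a", "an", "in", "of", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat is in the garden".split()))
# ['cat', 'garden']
```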
| Stemming | Lemmatization |
|---|---|
| Cuts suffix: "studies" -> "studi" | Finds root word: "studies" -> "study" |
| Fast but less accurate | Slower but linguistically correct |
| "Running" -> "Runn" | "Running" -> "Run" |
Bag of Words (BoW) is the simplest way to convert text to numbers: count how many times each word appears and ignore word order entirely.
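A one-function BoW sketch using the standard library; production code would typically use something like scikit-learn's `CountVectorizer` instead:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, count occurrences;
    # word order is discarded entirely.
    return Counter(text.lower().split())

print(bag_of_words("the dog chased the cat"))
# Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})
```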
TF-IDF (term frequency-inverse document frequency) improves on BoW by reducing the weight of common words and boosting rare but informative terms.
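A minimal implementation of the classic formula tf-idf(t, d) = tf(t, d) × log(N / df(t)); note that libraries like scikit-learn use smoothed variants, and the three-document corpus here is made up for illustration:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf_idf(term, doc, corpus):
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)       # term frequency in this doc
    df = sum(term in d.split() for d in corpus)    # docs containing the term
    idf = math.log(len(corpus) / df)               # rarer term -> higher idf
    return tf * idf

# "the" appears in 2 of 3 docs, "cat" in only 1 -- so even though
# "the" occurs twice in the first doc, "cat" scores higher:
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```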
Word embeddings are dense vector representations in which words with similar meanings are close together in vector space. Unlike BoW/TF-IDF, embeddings capture semantic relationships.
The famous analogy: king - man + woman = queen. This is only possible with embeddings, not BoW or TF-IDF.
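The analogy can be demonstrated with vector arithmetic and cosine similarity. The 2-D vectors below are hand-crafted purely for illustration; real Word2Vec embeddings have hundreds of dimensions and are learned from data:

```python
import math

# Hand-crafted toy "embeddings" -- NOT trained vectors.
vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man":   [0.5, 0.8],
    "woman": [0.5, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman, computed component-wise:
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]

# The nearest remaining word to the result should be "queen".
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```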
Word2Vec: Trained on huge text corpora. Captures semantic analogies.
GloVe: Uses global word co-occurrence statistics.
FastText: Works with subwords. Handles rare words and misspellings.
BERT: Contextual embeddings. The same word gets different vectors in different sentences.
| Method | Captures Meaning? | Word Order? | Context? |
|---|---|---|---|
| Bag of Words | No | No | No |
| TF-IDF | Partially (importance) | No | No |
| Word2Vec / GloVe | Yes (static) | No | No |
| BERT / Transformers | Yes (contextual) | Yes | Yes |
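The "Word Order?" column of the table can be demonstrated directly: a BoW representation cannot tell two opposite sentences apart when they use the same words.

```python
from collections import Counter

# Two sentences with opposite meanings but identical word counts:
a = Counter("man bites dog".split())
b = Counter("dog bites man".split())

print(a == b)  # True -- BoW sees these sentences as identical
```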