Create an AI chatbot on Voiceflow
Create a class website on Google Sites to use as a class assistant.
Embed the chatbot into the free website.
Neuron/Node:
Neural Network
Deep Learning
GPT-3 has roughly 175 billion parameters (often loosely called "neurons")
Word2Vec:
Vec means vector
Sequence-to-sequence learning (Seq2Seq): is a machine learning approach where models are trained to transform input sequences (like words or characters) into output sequences, often of different lengths or types, commonly used in tasks like language translation and text summarization.
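A minimal sketch of the idea, assuming PyTorch and a toy vocabulary (all names and sizes here are illustrative, not from the course):

```python
# A toy encoder-decoder (seq2seq) sketch in PyTorch; vocab sizes and
# dimensions are made up for illustration.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_in, vocab_out, hidden=64):
        super().__init__()
        self.embed_in = nn.Embedding(vocab_in, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed_out = nn.Embedding(vocab_out, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, vocab_out)

    def forward(self, src, tgt):
        _, state = self.encoder(self.embed_in(src))        # compress input sequence
        out, _ = self.decoder(self.embed_out(tgt), state)  # generate output sequence
        return self.project(out)                           # logits for each output token

model = Seq2Seq(vocab_in=100, vocab_out=120)
src = torch.randint(0, 100, (2, 7))  # batch of 2 input sequences, length 7
tgt = torch.randint(0, 120, (2, 5))  # output sequences of a different length (5)
print(model(src, tgt).shape)         # torch.Size([2, 5, 120])
```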
Transformers:
Transformers use context to tell "river bank" apart from "robbing the bank".
ARTICLE: NIPS 2017: Attention is All You Need
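A hedged sketch of the "bank" example using the Hugging Face transformers library (bert-base-uncased is an illustrative model choice, not one named in the notes):

```python
# Contextual embeddings from a transformer: the vector for "bank" differs
# between the two sentences. Model choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # one vector per token
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("He sat on the river bank.")
v2 = bank_vector("They were caught robbing the bank.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: different senses
```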
Theory of mind test
Ethical progression of building AI
Online moderation
Use AI to find self-harm and possible hate crimes
AI can predict possible cases (pre-crime?)
***Create a STREAMLIT application
Fork the files and edit the code on GitHub
Launch the Streamlit app
Edit the JSON code
What is Natural Language?
A set of agreed-upon words
NLP is the merger of AI and linguistics
Natural Language Understanding (NLU) - understanding
Natural Language Generation (NLG) - generate a response
Tokenization
word-tokenizer
Use the nltk package to tokenize words (see the sketch after this list)
Stemming
Lemmatization
Named Entity Recognition (NER)
Extracts important entities from text
Use NER to extract top entities and query only the entities and not the whole dataset.
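A minimal NLTK sketch of this pipeline, assuming the listed resources download cleanly (resource names vary slightly across NLTK versions; the sentence is a made-up example):

```python
# Tokenize, stem, lemmatize, and run NER with NLTK.
import nltk
for pkg in ["punkt", "wordnet", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Apple is opening new stores in Chicago."
tokens = nltk.word_tokenize(text)                 # tokenization
print(tokens)
print([PorterStemmer().stem(t) for t in tokens])  # stemming: crude suffix chopping
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # lemmatization: dictionary form

tree = nltk.ne_chunk(nltk.pos_tag(tokens))        # NER: extract named entities
print([(" ".join(w for w, _ in st.leaves()), st.label())
       for st in tree.subtrees() if st.label() != "S"])
```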
1. Word Vectorization
Turn the words into numbers and clean them
Use the Octoparse web scraper to gather Amazon data.
2. Recommender systems
Use unstructured data
Create actionable systems that can be used to build better products
What problem can we solve?
Songs on Spotify, movies on Netflix, videos on YouTube, related posts on Twitter, similar dishes on UberEATS, and ads on Facebook.
3. Where do they get the Data?
Explicit data
Implicit data
Google Tag Manager - tracks every click and interaction you make with the website.
https://www.analyticsmania.com/post/google-tag-manager-use-cases/
4. Kinds of Recommenders
Collaborative Filtering - Find similar users
Content-Based Filtering
5. Word2Vec
Bag of Words
TF-IDF Vectorizer - an NLP algorithm that weights each keyword by how often it appears in a document (term frequency), offset by how common it is across the corpus (group) of documents (inverse document frequency).
Every word gets its own TF and IDF scores (its weight)
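A small scikit-learn sketch of Bag of Words next to TF-IDF (toy corpus, not the Amazon data above):

```python
# Bag of Words vs. TF-IDF with scikit-learn on a made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer()                  # Bag of Words: raw counts per document
print(bow.fit_transform(corpus).toarray())

tfidf = TfidfVectorizer()                # TF-IDF: counts reweighted by rarity
X = tfidf.fit_transform(corpus)
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```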
Notes:
Look at Lakera AI for information on hacking AI to make it answer things it's not supposed to.
Check out Dataiku and OpenAI Whisper.
Read chapter 3
Each item in the dataset is represented as a unique vector.
Similarity between items is quantified by calculating the cosine similarity between their vectors.
Cosine similarity measures the cosine of the angle between two vectors, producing a score from -1 to 1.
A score close to 1 indicates high similarity, while a score near -1 implies significant difference.
This technique is widely used in information retrieval, text analysis, and machine learning for assessing the likeness between data points.
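A quick scikit-learn sketch of those two extremes (toy vectors chosen to make the scores obvious):

```python
# Cosine similarity extremes on toy item vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])    # same direction as a -> score of 1
c = np.array([[-1.0, -2.0, 0.0]])  # opposite direction -> score of -1

print(cosine_similarity(a, b))  # [[1.]]
print(cosine_similarity(a, c))  # [[-1.]]
```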
Set up environment
Preprocess text
Get it in the format 'text'/'label'
Apply cleaning function
Stem and lemmatize
TF-IDF, BoW, and the model
Create model with Vectorized data
Evaluate NLP Model
Test model with new data
Make predictions
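A compact sketch of this whole sequence, assuming scikit-learn and a made-up 'text'/'label' dataset:

```python
# End-to-end: vectorize with TF-IDF, train a binary classifier, evaluate,
# then predict on new text. The six labeled examples are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "loved it",
         "awful experience", "highly recommend", "would not buy again"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())  # vectorize + classify
model.fit(X_train, y_train)                           # create model with vectorized data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate NLP model
print(model.predict(["this was great"]))              # predict on new data
```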
*Work on Modules 3, 4 & 5
Basic cleaning
Lemma
Tokens (this is not Word2Vec)
Vectorize - turn words into numbers
Bag of Words
TF-IDF
ML Model
Binary classification
Create recommendation or Sentiment analysis with model
Word2Vec
Cluster in 2-dimensional space
Continuous Bag of Words (CBOW)
SkipGram
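A gensim sketch of both training modes on a toy corpus (sg is gensim's switch between CBOW and skip-gram; all data here is illustrative):

```python
# Word2Vec in gensim: sg=0 trains CBOW, sg=1 trains skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])                       # each word is a 50-dim vector
print(skipgram.wv.most_similar("cat", topn=3))  # nearest neighbors in vector space
```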
Additional Information:
Sequence to sequence learning
Transformer
Review NLTK VADER
Read Chapter 5 - Word Embedding
Use nlp-fundamentals to create a custom GPT web app on Streamlit
Sentiment Intensity Analyzer
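A minimal sketch of the Sentiment Intensity Analyzer via NLTK (assumes the vader_lexicon resource is available; the sentence is an example):

```python
# VADER sentiment scores via NLTK.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this class!"))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```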
Create your own GPT app with this template
Look for Text Classification datasets on Kaggle
Look for datasets that have columns with large amounts of text.
Word2Vec
Word Embedding
ChatGPT only works with word embeddings.
Word embeddings (W.E.) are how you create a custom GPT
What are embeddings?
OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
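A hedged sketch of fetching embeddings from the OpenAI API (the model name is one current option and the input strings are illustrative; requires an API key):

```python
# Embed two strings and compare them; OPENAI_API_KEY must be set
# in the environment.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["river bank", "rob the bank"],
)
v1, v2 = (d.embedding for d in resp.data)  # each is a list of floats

# cosine similarity by hand: smaller distance/angle -> higher relatedness
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(dot / norm)
```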
Review LlamaIndex
Review LangChain
ExamTopics.com
Review Mistral AI
Review Groq.com
uses Llama and Mixtral
AWS Access:
SageMaker
Ask Dr. about the certification exams
Resume-screening companies use CBOW to scan resumes
Extract the skills from the job description
Ask for the word density for each skill
I need my resume to reflect this keyword density
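A quick sketch of that density check in plain Python (the job description and skill list are hypothetical):

```python
# Count how often each target skill appears in a job description.
import re
from collections import Counter

job_description = """Seeking a data analyst with Python, SQL, and Tableau.
Python and SQL are used daily; Tableau dashboards are built weekly."""
skills = ["python", "sql", "tableau"]

words = re.findall(r"[a-z]+", job_description.lower())
counts = Counter(words)
for skill in skills:
    print(skill, counts[skill], f"{counts[skill] / len(words):.1%}")  # density per skill
```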
Review Pinecone
HW: Complete 'Word Embedding' on the website
Week 8 - Azure for Personal ChatGPT
Azure
Figure out use cases
Search internal documents with GPT
Liquid resume
Search real estate data
Semantic Search
Review Enterprise or Solutions Architecture
AI
NLP
Colab - Learn to code a neural network
Can't have empty values or nulls
Fill missing values (e.g., replace them with 1)
Regression Model
Classification Model
Make all data Numeric
Scale the data
Turn non-numeric values into 0 or 1.
A 1-10 scale can be bucketed: 1-5 = Bad, 6-10 = Good
Then Bad = 0 and Good = 1
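A toy pandas/scikit-learn sketch of these preprocessing steps (all column names and values are illustrative):

```python
# Toy preprocessing: impute nulls, binarize non-numeric values, bucket a
# 1-10 rating into Bad/Good, and scale.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [25, None, 40],
    "rating": [3, 8, 10],            # 1-10 scale
    "owns_home": ["yes", "no", "yes"],
})

df["age"] = df["age"].fillna(df["age"].mean())            # no empty values or nulls
df["owns_home"] = (df["owns_home"] == "yes").astype(int)  # non-numeric -> 0 or 1
df["good_rating"] = (df["rating"] >= 6).astype(int)       # 1-5 = Bad (0), 6-10 = Good (1)
df[["age"]] = MinMaxScaler().fit_transform(df[["age"]])   # scale the data
print(df)
```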
You can overtrain the models
Overfitting
Review the code
Complete 3 different types of models