Create an AI chatbot on Voiceflow
Create a class website on Google Sites to use as a class assistant.
Embed the chatbot into the free website.
Neuron/Node:
Neural Network
Deep Learning
GPT-3 has roughly 175 billion parameters (often loosely called "neurons")
Word2Vec:
Vec means vector
Sequence-to-sequence learning (Seq2Seq): is a machine learning approach where models are trained to transform input sequences (like words or characters) into output sequences, often of different lengths or types, commonly used in tasks like language translation and text summarization.
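A minimal sketch of the idea, assuming PyTorch and a toy vocabulary (all names and sizes here are illustrative, not from the course):

```python
# A toy encoder-decoder (seq2seq) sketch in PyTorch; vocab sizes and
# dimensions are made up for illustration.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_in, vocab_out, hidden=64):
        super().__init__()
        self.embed_in = nn.Embedding(vocab_in, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed_out = nn.Embedding(vocab_out, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, vocab_out)

    def forward(self, src, tgt):
        _, state = self.encoder(self.embed_in(src))        # compress input sequence
        out, _ = self.decoder(self.embed_out(tgt), state)  # generate output sequence
        return self.project(out)                           # logits for each output token

model = Seq2Seq(vocab_in=100, vocab_out=120)
src = torch.randint(0, 100, (2, 7))  # batch of 2 input sequences, length 7
tgt = torch.randint(0, 120, (2, 5))  # output sequences of a different length (5)
print(model(src, tgt).shape)         # torch.Size([2, 5, 120])
```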
Transformers:
Transformers use context to tell "river bank" apart from "robbing the bank".
ARTICLE: NIPS 2017: Attention is All You Need
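A hedged sketch of the "bank" example using the Hugging Face transformers library (bert-base-uncased is an illustrative model choice, not one named in the notes):

```python
# Contextual embeddings from a transformer: the vector for "bank" differs
# between the two sentences. Model choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # one vector per token
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("He sat on the river bank.")
v2 = bank_vector("They were caught robbing the bank.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: different senses
```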
Theory of mind test
Ethical progression of building AI
Online moderation
Use AI to find self-harm and possible hate crimes
AI can predict possible cases (pre-crime?)
***Create a STREAMLIT application
Fork the files and edit the code on GitHub
Launch the Streamlit app
Edit the JSON code
What is Natural Language?
A set of agreed-upon words
NLP is the merger of AI and linguistics
Natural Language Understanding (NLU) - understanding
Natural Language Generation (NLG) - generate a response
Tokenization
word-tokenizer
Use the nltk package to tokenize words (see the sketch after this list)
Stemming
Lemmatization
Named Entity Recognition (NER)
Extracts important entities from text
Use NER to extract top entities and query only the entities and not the whole dataset.
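A minimal NLTK sketch of this pipeline, assuming the listed resources download cleanly (resource names vary slightly across NLTK versions; the sentence is a made-up example):

```python
# Tokenize, stem, lemmatize, and run NER with NLTK.
import nltk
for pkg in ["punkt", "wordnet", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Apple is opening new stores in Chicago."
tokens = nltk.word_tokenize(text)                 # tokenization
print(tokens)
print([PorterStemmer().stem(t) for t in tokens])  # stemming: crude suffix chopping
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # lemmatization: dictionary form

tree = nltk.ne_chunk(nltk.pos_tag(tokens))        # NER: extract named entities
print([(" ".join(w for w, _ in st.leaves()), st.label())
       for st in tree.subtrees() if st.label() != "S"])
```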
1. Word Vectorization
Turn the words into numbers and clean them
Use the Octoparse web scraper to gather Amazon data.
2. Recommender systems
Use unstructured data
Create actionable systems that can be used to build better products
What problem can we solve?
Songs on Spotify, movies on Netflix, videos on YouTube, related posts on Twitter, similar dishes on UberEATS, and ads on Facebook.
3. Where do they get the Data?
Explicit data
Implicit data
Google Tag Manager - tracks every click and interaction you make with the website.
https://www.analyticsmania.com/post/google-tag-manager-use-cases/
4. Kinds of Recommenders
Collaborative Filtering - Find similar users
Content-Based Filtering
5. Word2Vec
Bag of Words
TF-IDF Vectorizer - an NLP algorithm that weights each keyword by how often it appears in a document (term frequency), offset by how common it is across the corpus (group) of documents (inverse document frequency).
Every word gets its own TF and IDF scores (its weight)
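A small scikit-learn sketch of Bag of Words next to TF-IDF (toy corpus, not the Amazon data above):

```python
# Bag of Words vs. TF-IDF with scikit-learn on a made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer()                  # Bag of Words: raw counts per document
print(bow.fit_transform(corpus).toarray())

tfidf = TfidfVectorizer()                # TF-IDF: counts reweighted by rarity
X = tfidf.fit_transform(corpus)
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```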
Notes:
Look at Lakera AI for information on hacking AI to make it answer things it's not supposed to.
Check out Dataiku and OpenAI Whisper.
Read chapter 3
Each item in the dataset is represented as a unique vector.
Similarity between items is quantified by calculating the cosine similarity between their vectors.
Cosine similarity measures the cosine of the angle between two vectors, producing a score from -1 to 1.
A score close to 1 indicates high similarity, while a score near -1 implies significant difference.
This technique is widely used in information retrieval, text analysis, and machine learning for assessing the likeness between data points.
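A quick scikit-learn sketch of those two extremes (toy vectors chosen to make the scores obvious):

```python
# Cosine similarity extremes on toy item vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])    # same direction as a -> score of 1
c = np.array([[-1.0, -2.0, 0.0]])  # opposite direction -> score of -1

print(cosine_similarity(a, b))  # [[1.]]
print(cosine_similarity(a, c))  # [[-1.]]
```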
Set up environment
Preprocess text
Get it in the format 'text'/'label'
Apply cleaning function
Stem and lemmatize
TF-IDF, BoW, and the model
Create model with Vectorized data
Evaluate NLP Model
Test model with new data
Make predictions
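A compact sketch of this whole sequence, assuming scikit-learn and a made-up 'text'/'label' dataset:

```python
# End-to-end: vectorize with TF-IDF, train a binary classifier, evaluate,
# then predict on new text. The six labeled examples are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "loved it",
         "awful experience", "highly recommend", "would not buy again"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())  # vectorize + classify
model.fit(X_train, y_train)                           # create model with vectorized data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate NLP model
print(model.predict(["this was great"]))              # predict on new data
```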
*Work on Modules 3, 4 & 5
Basic cleaning
Lemma
Tokens (this is not Word2Vec)
Vectorize - turn words into numbers
Bag of Words
TF-IDF
ML Model
Binary classification
Create recommendation or Sentiment analysis with model
Word2Vec
Cluster in 2-dimensional space
Continuous Bag of Words (CBOW)
SkipGram
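A gensim sketch of both training modes on a toy corpus (sg is gensim's switch between CBOW and skip-gram; all data here is illustrative):

```python
# Word2Vec in gensim: sg=0 trains CBOW, sg=1 trains skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])                       # each word is a 50-dim vector
print(skipgram.wv.most_similar("cat", topn=3))  # nearest neighbors in vector space
```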
Additional Information:
Sequence to sequence learning
Transformer
Review NLTK VADER
Read Chapter 5 - Word Embedding
Use nlp-fundamentals to create a custom GPT web app on Streamlit
Sentiment Intensity Analyzer
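A minimal sketch of the Sentiment Intensity Analyzer via NLTK (assumes the vader_lexicon resource is available; the sentence is an example):

```python
# VADER sentiment scores via NLTK.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this class!"))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```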
Create your own GPT app with this template
Look for Text Classification datasets on Kaggle
Look for datasets that have columns with large amounts of text.
Word2Vec
Word Embedding
ChatGPT only works with word embeddings.
Word embeddings (W.E.) are how you create a custom GPT
What are embeddings?
OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
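A hedged sketch of fetching embeddings from the OpenAI API (the model name is one current option and the input strings are illustrative; requires an API key):

```python
# Embed two strings and compare them; OPENAI_API_KEY must be set
# in the environment.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["river bank", "rob the bank"],
)
v1, v2 = (d.embedding for d in resp.data)  # each is a list of floats

# cosine similarity by hand: smaller distance/angle -> higher relatedness
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(dot / norm)
```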
Review LlamaIndex
Review LangChain
ExamTopics.com
Review Mistral AI
Review Groq.com
uses Llama and Mixtral
AWS Access:
SageMaker
Ask Dr. about the certification exams
Resume-screening companies use CBOW to scan resumes
Extract the skills from the job description
Ask for the word density for each skill
I need my resume to reflect this keyword density
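A quick sketch of that density check in plain Python (the job description and skill list are hypothetical):

```python
# Count how often each target skill appears in a job description.
import re
from collections import Counter

job_description = """Seeking a data analyst with Python, SQL, and Tableau.
Python and SQL are used daily; Tableau dashboards are built weekly."""
skills = ["python", "sql", "tableau"]

words = re.findall(r"[a-z]+", job_description.lower())
counts = Counter(words)
for skill in skills:
    print(skill, counts[skill], f"{counts[skill] / len(words):.1%}")  # density per skill
```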
Review Pinecone
HW: Complete 'Word Embedding' on the website
Week 8 - Azure for Personal ChatGPT
Azure
Figure out use cases
Search internal documents with GPT
Liquid resume
Search real estate data
Semantic Search
Review Enterprise or Solutions Architecture
AI
NLP
Colab - Learn to code a neural network
Can't have empty values or nulls
Fill missing values (e.g., replace them with 1)
Regression Model
Classification Model
Make all data Numeric
Scale the data
Turn non-numeric values into 0 or 1.
A 1-10 scale can be bucketed: 1-5 = Bad, 6-10 = Good
Then Bad = 0 and Good = 1
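A toy pandas/scikit-learn sketch of these preprocessing steps (all column names and values are illustrative):

```python
# Toy preprocessing: impute nulls, binarize non-numeric values, bucket a
# 1-10 rating into Bad/Good, and scale.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [25, None, 40],
    "rating": [3, 8, 10],            # 1-10 scale
    "owns_home": ["yes", "no", "yes"],
})

df["age"] = df["age"].fillna(df["age"].mean())            # no empty values or nulls
df["owns_home"] = (df["owns_home"] == "yes").astype(int)  # non-numeric -> 0 or 1
df["good_rating"] = (df["rating"] >= 6).astype(int)       # 1-5 = Bad (0), 6-10 = Good (1)
df[["age"]] = MinMaxScaler().fit_transform(df[["age"]])   # scale the data
print(df)
```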
You can overtrain the models
Overfitting
Review the code
Complete 3 different types of models