
Make an AI sound like a YouTuber (LAB): Crash Course AI #8

October 9, 2019


Flower, dog, anxious, senior,
car, item, president, worried, avocado, Zendaya, licorice, Nerdfighter, toothbrush, zany, expedient. This isn’t really a Vlogbrothers video. It’s just a random string of words. There aren’t any coherent sentences. It looks like John-Green-bot could use some
help speaking a bit more like human John Green – sounds like an excellent task for Natural
Language Processing. INTRO Hey, I’m Jabril and welcome to Crash Course
AI! Today, we’re going to tackle another hands-on
lab. Our goal today is to get John-Green-bot to
produce language that sounds like human John Green… and have some fun while doing it. We’ll be writing all of our code using a
language called Python in a tool called Google Colaboratory, and as you watch this video,
you can follow along with the code in your browser from the link we put in the description. In these Colaboratory files, there’s some
regular text explaining what we’re trying to do, and pieces of code that you can run
by pushing the play button. Now, these pieces of code build on each other,
so keep in mind that we have to run them in order from top to bottom, otherwise we might
get an error. To actually run the code and experiment with
changing it you’ll have to either click “open in playground” at the top of the
page or open the File menu and click “Save a Copy to Drive”. And just an FYI: you’ll need a Google account
for this. Now, we’re going to build an AI model that
plays a clever game of fill-in-the-blank. We’ll be able to give John-Green-bot any
word prompt like “good morning,” and he’ll be able to finish the sentence. Like any AI, John-Green-bot won’t really
understand anything, but AI generally does a really good job of finding
and copying patterns. When we teach any AI system to understand
and produce language, we’re really asking it to find and copy patterns in some behavior. So to build a natural language processing
AI, we need to do four things: First, gather and clean the data. Second, set up the model. Third, train the model. And fourth, make predictions. So let’s start with the first step: gather
and clean the data. In this case, the data are lots of examples
of human John Green talking, and thankfully, he’s talked a lot online. We need some way to process his speech. And how can we do that? Subtitles. And conveniently there’s a whole database
of subtitle files on the Nerdfighteria Wiki that I pulled from. I went ahead and collected a bunch and put
them into one big file that’s hosted on Crash Course AI’s GitHub. This first bit of code in 1.1 loads it. So if you wanted to try to make your AI sound
like someone else, like Michael from Vsauce, or me, this is where you’d load all that
text instead. Data gathering is often the hardest and slowest
part of any machine learning project, but in this instance it’s pretty straightforward. Regardless, we aren’t done yet. Now
we need to clean and prep our data for our model. This is called preprocessing. Remember, a computer can only process data
as numbers, so we need to split our sentences into words, and then convert our words into
numbers. When we’re building a natural language processing
program the term “word” may not capture everything we need to know. How many instances there are of a word can
also be useful. So instead, we’ll use the terms lexical
type and lexical token. Now a lexical type is a word, and a lexical
token is a specific instance of a word, including any repeats. So, for example, in the sentence: The goal of machine learning is to make a
learning machine. We have eleven lexical tokens but only nine
lexical types, because “learning” and “machine” both occur twice. In natural language processing, tokenization
is the process of splitting a sentence into a list of lexical tokens. In English, we put spaces between words, so
let’s start by slicing up the sentence at the spaces. “Good morning Hank, it’s Tuesday.” would turn into a list like this: “Good”, “morning”, “Hank,”, “it’s”, “Tuesday.” And we would have five tokens. However, there are a few problems. Something tells me we don’t really want
a lexical type for Hank-comma and Tuesday-period, so let’s add some extra rules for punctuation. Thankfully, there are prewritten libraries
for this. Using one of those, the list would look something
like this: “Good”, “morning”, “Hank”, “,”, “it”, “’s”, “Tuesday”, “.” In this case we would have eight tokens instead
of five, and tokenization even helped split up our contraction “it’s” into “it”
and “apostrophe-s.” Looking back at our code, before tokenization,
we had over 30,000 lexical types. This code also splits our data into a training
dataset and a validation dataset. We want to make sure the model learns from
the training data, but we can test it on new data it’s never seen before. That’s what the validation dataset is for. We can count up our lexical types and lexical
tokens with this bit of code in box 1.3. And it looks like we actually have about 23,000
unique lexical types. But remember how many instances of a word
can also be useful. This code block here at step 1.4 allows us
to see how many lexical types occur once, twice, and so on. It looks like we’ve got a lot of rare words
— almost 10,000 words occur only once! Having rare words is really tricky for AI
systems, because they’re trying to find and copy patterns, so they need lots of examples
of how to use each word. Oh, human John Green, you master of prose. Let’s see what weird words you use. Pisgah? What even is a lilliputian? Some of these are pretty tricky and are going
to be too hard for John-Green-bot’s AI to learn with just this dataset. But others seem doable if we take advantage
of morphology. Morphology is the way a word gets shape-shifted
to match a tense, like you’d add an “ED” to make something past tense, or when you
shorten or combine words to make them totes-amazeballs. Dear viewers, I did not write that in the
script. In English, we can remove a lot of extra word
endings, like ED, ING, or LY, through a process called stemming. And so, with a few simple rules, we can clean
up our data even more. I’m also going to simplify the data by replacing numbers with the hashtag or pound sign, whatever you want to call it. This should take care of a lot of rare words. Now we have 3,000 fewer lexical types, and
only about 8,000 words occur just once. We really need multiple examples of each word
for our AI to learn patterns reliably, so we’ll simplify even more by replacing each
of those 8,000 or so rare words with the word ‘unk’, short for unknown. Basically, we don’t want John-Green-bot
to get embarrassed if he sees a word he doesn’t know. So by hiding some words, we can teach John-Green-bot
how to keep writing when he bumps into one-time made-up words like zombicorns. And just to satisfy my curiosity… Yeah, John-Green-bot doesn’t need words
like “whippersnappers” or “zombification”. John, what’s up with the fixation on zombies? Anyway, we’ll be fine without them. Now that we finally have our data all cleaned
and put together, we’re done with preprocessing and can move on to Step 2: setting up the
model for John-Green-bot. There are a couple key things that we need
to do. First, we need to convert the sentences into
lists of numbers. We want one number for every lexical type, so
we’ll build a dictionary that assigns every word in our vocabulary a number. Second, unlike us, the model can read a bunch
of words at the same time, and we want to take advantage of that to help John-Green-bot
learn quickly. So we’re going to split our data into pieces
called batches. Here, we’re telling the model to read 20
sequences (which have 35 words each) at the same time! Alright! Now, it’s time to finally build our AI. We’re going to program John-Green-bot with
a simple language model that takes in a few words and tries to complete the rest of the
sentence. So we’ll need two key parts, an embedding
matrix and a recurrent neural network or RNN. Just like we discussed in the Natural Language
Processing video last week, this is an “Encoder-Decoder” framework. So let’s take it apart. An embedding matrix is a big list of vectors,
which is basically a big table of numbers, where each row corresponds to a different
word. These vector-rows capture how related two
words are. So if two words are used in similar ways,
then the numbers in their vectors should be similar. But to start, we don’t know anything about
the words, so we just assign every word a vector with random numbers. Remember we replaced all the words with numbers
in our training data, so now when the system reads in a number, it just looks up that row
in the table and uses the corresponding vector as an input. Part 1 is done: Words become indices, which
become vectors, and our embedding matrix is ready to use. Now, we need a model that can use those vectors
intelligently. This is where the RNN comes in. We talked about the structure of a recurrent
neural network in our last video too, but it’s basically a model that slowly builds
a hidden representation by incorporating one new word at a time. Depending on the task, the RNN will combine
new knowledge in different ways. With John-Green-bot, we’re training our
RNN with sequences of words from Vlogbrothers scripts. Ultimately, our AI is trying to build a good
summary to make sure a sentence has some overall meaning, and it’s keeping track of the last
word to produce a sentence that sounds like English. The RNN’s output after reading the final
word so far in a sentence is what we’ll use to predict the next word. And this is what we’ll use to train John-Green-bot’s AI after we build it. All of this is wrapped up in code block 2.3. So Part 2 is done. We’ve got our embedding matrix and our RNN. Now, we’re ready for Step 3: train our model. Remember when we split the data into pieces
called batches? And remember earlier in Crash Course AI when
we used backpropagation to train neural networks? Well we can put those pieces together, iterate
over our dataset, and run backpropagation on each example to train the model’s weights. So in step 3.1 we’re defining how to train
our model, in step 3.2 we’re defining how to evaluate it, and in step 3.3
we’re actually creating our model, which means training and evaluating it. Over the span of one epoch of training this
model, the network will loop over every batch of data — reading it in, building representations,
predicting the next word, and then updating its guesses. This will train over 10 epochs, which might
take a couple minutes. We’re printing two numbers with each epoch,
which are the model’s training and validation perplexities. As the model learns, it realizes there are
fewer and fewer good choices for the next word. The perplexity is a measure of how well the
model has narrowed down the choices. Okay, it looks like the model is done training
and has a perplexity of about 45 on train and 72 on validation, but it started with
perplexities in the hundreds! We can roughly interpret perplexity as the average
number of guesses the model makes before it predicts the right answer. After seeing the data once, the model needed
over 300 guesses for the next word, but now it’s narrowed it down to fewer than 50. That’s a pretty good improvement, even though
it’s far from perfect. Time to see what the model can write, but
to do that, we need one final ingredient. So far in Crash Course AI, we’ve talked
a lot about the one best label or the one best prediction an AI can make, but this doesn’t
always make sense for certain problems. If you wrote stories by always having characters
do the next obvious thing, they’d be pretty boring. So Step 4 is inference, the part of AI where
the machine gets to make some choices, but we can still help it a little bit. Let’s think about what the final layer of
the RNN is actually doing. We talk about it like it’s outputting a
single label or prediction, but actually the network is producing a bunch of scores or
probabilities. The most likely word has the highest probability,
the next most likely word has the second highest probability, and so on. Because we get probabilities at every step,
instead of taking the best one each time to produce 1 sentence, we could sample 3 words
and start 3 new sentences. Each of those 3 sentences could then start
3 more new sentences… and then we have a branching diagram of possibilities. Inference is so important because what the
model can produce and what we want aren’t necessarily the same thing. What we want is a really good sentence, but
the model can only tell us the score for one word at a time. Let’s look at this branching diagram. Whenever we choose a word, we create a new
branch, and keep track of its score or probability. If we multiply each score through to the end
of the branch, we see that the top branch made the best-scoring choice, but a worse
sentence overall. So we’re going to implement a basic sampler
in our program. This will take a bunch of random paths, so
we can sort the results by the probability of the full sentences, and we can see which
sentences are best overall. Also, when asking John-Green-bot to generate
all these sentences, we need to give him a word to start. I’m going to try “Good” for now, but
you can try other things by changing the code in 4.1. Remember the preprocessing we did on our data? That’s why these sentences look a little off,
with hashtags for numbers, and the space before word endings that we introduced when stemming. And look at the sentence you get from taking
the highest probability word each time. “Good morning Hank, it’s Tuesday. I’m going to be like, I’m going to be
like, I’m going to be like, I’m going to see.” It isn’t as interesting as the ones where we mixed it up a bit and took different branches. To be honest though… none of these are great
Vlogbrothers scripts. That’s because of two important things: First, there’s our data. Remember, we didn’t have many examples of
how to use each word. In fact, we had to cut out a lot of “rare
words” during training because they only showed up once, so we couldn’t teach John-Green-bot
to recognize any patterns related to them. Lots of state-of-the-art models address this
by downloading data from Wikipedia, large collections of books, or even Reddit when
they train their models. We’ll include some links in the description
if you want to play with some fancier models. But the second, bigger issue is that AI models
are missing the understanding we have as humans. Even if John-Green-bot split up words perfectly
and predicted sentences that sound like English, it’s still John-Green-bot using tools like
tokenization, an embedding matrix, and a simple language model to predict the next word. When human John Green writes, he uses his
understanding of the world. In Vlogbrothers videos, for example, he considers the perspective of Hank
or whoever’s watching. He’s not just trying to predict which next
word has the highest probability. Building models that interact with people,
and the world, is why natural language processing is so exciting, but it’s also why it’ll
take a lot more work to get John-Green-bot to generate language as well as human John
Green does. We’ve left a bunch of notes in the code
for you to play with if you want to make your own AI. You can train for longer, change the sentence
prompt, or, if you’re feeling adventurous, replace the text data to speak in someone
else’s voice. If you end up using this to make something
cool let us know in the comments. Thanks for watching, see you
next week. PBS Digital Studios wants to hear from you. We do a survey every year that asks what you’re into, your favorite PBS shows, and things you would like to see more of from PBS Digital Studios. You even get to vote on potential new shows. All of this helps us make more stuff that you want to see. The survey takes about 10 minutes and you might win a sweet t-shirt. Link is in the description. Thanks. Crash Course AI is produced in association
with PBS Digital Studios! If you want to help keep all Crash Course
free for everybody, forever, you can join our community on Patreon. And if you want to learn more about NLP check out this video from Crash Course Computer Science.
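
If you want to try the preprocessing ideas from Step 1 and the start of Step 2 without opening the notebook, here is a minimal Python sketch. It is not the code from the Colaboratory file: the function names, the regular expression, and the "occurs fewer than twice" cutoff are assumptions made for illustration, and it skips stemming and number replacement. It shows tokenizing with punctuation rules, counting lexical tokens versus lexical types, replacing rare words with unk, and assigning every remaining word a number.

```python
import re
from collections import Counter

# Hypothetical tokenizer rule (an assumption, not the notebook's library):
# a run of word characters, OR an apostrophe plus letters ('s), OR one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|['’]\w+|[^\w\s]")


def tokenize(text):
    """Split text into words, contraction endings such as 's, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


def replace_rare(tokens, min_count=2, unk="unk"):
    """Swap every token whose type occurs fewer than min_count times for the unk placeholder."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else unk for tok in tokens]


def build_vocab(tokens):
    """Give each lexical type the next free integer, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


# Tokenization: eight tokens instead of the five we get from splitting at spaces.
greeting = tokenize("Good morning Hank, it's Tuesday.")

# Lexical tokens vs. lexical types for the example sentence (words only, lowercased).
words = "the goal of machine learning is to make a learning machine".split()
num_tokens = len(words)       # every instance counts
num_types = len(set(words))   # repeats count once

# Hide words that occur only once, then turn what is left into numbers.
cleaned = replace_rare(words)
vocab = build_vocab(cleaned)
ids = [vocab[tok] for tok in cleaned]

print(greeting)
print(num_tokens, "tokens,", num_types, "types")
print(cleaned)
print(ids)
```

On a sentence this short almost everything becomes unk, which is exactly why the model needs a lot of text before hiding rare words starts to help.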

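The ideas from Step 3 and Step 4 can also be sketched in a few lines of Python. Everything here is a toy: the probability table below is made up for illustration and stands in for the trained RNN, and the function names are hypothetical. Perplexity is computed the usual way, as the exponential of the average negative log probability the model gave to each correct next word. The sampler then compares always taking the highest-probability word against taking random paths and sorting them by the probability of the full sentence.

```python
import math
import random

# Made-up next-word probabilities standing in for a trained model (an assumption).
TOY_MODEL = {
    "good": {"morning": 0.6, "news": 0.4},
    "morning": {"hank": 0.5, "everyone": 0.3, "all": 0.2},
    "news": {"everyone": 0.9, "hank": 0.1},
}


def perplexity(correct_word_probs):
    """exp of the average negative log probability assigned to the correct words."""
    avg_nll = -sum(math.log(p) for p in correct_word_probs) / len(correct_word_probs)
    return math.exp(avg_nll)


def greedy(start, steps):
    """Always take the single most likely next word."""
    words, score = [start], 1.0
    for _ in range(steps):
        options = TOY_MODEL.get(words[-1], {})
        if not options:
            break
        word = max(options, key=options.get)
        score *= options[word]
        words.append(word)
    return words, score


def sample_path(start, steps, rng):
    """Pick each next word at random, weighted by its probability."""
    words, score = [start], 1.0
    for _ in range(steps):
        options = TOY_MODEL.get(words[-1], {})
        if not options:
            break
        word = rng.choices(list(options), weights=list(options.values()))[0]
        score *= options[word]
        words.append(word)
    return words, score


def sample_sentences(start, steps, num_samples, seed=0):
    """Take many random paths, drop duplicates, and sort by whole-sentence probability."""
    rng = random.Random(seed)
    paths = {}
    for _ in range(num_samples):
        words, score = sample_path(start, steps, rng)
        paths[tuple(words)] = score
    return sorted(paths.items(), key=lambda item: item[1], reverse=True)


# A model that gives every correct word probability 1/50 has a perplexity of about 50.
print(perplexity([1 / 50] * 10))

greedy_words, greedy_score = greedy("good", 2)
ranked = sample_sentences("good", 2, 200)
print("greedy:", greedy_words, greedy_score)
print("best sampled:", ranked[0])
```

In this toy table, the greedy path takes "morning" first because it has the highest score, but the path that starts with the lower-scoring "news" ends up with the more probable full sentence (about 0.36 versus about 0.30). That is the branching-diagram point from the video in miniature.
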
54 Comments

  • Reply Violet Holiday October 4, 2019 at 7:45 pm

    I'm early

  • Reply Lou 2 October 4, 2019 at 7:45 pm

    2nd

  • Reply TimeAndChance October 4, 2019 at 7:45 pm

    first

  • Reply Nathan Animation October 4, 2019 at 7:45 pm

    Lol I like this

  • Reply Mitsuyuki-Hime October 4, 2019 at 7:47 pm

    Wild

  • Reply Justine Chang Lee Hau October 4, 2019 at 7:49 pm

    Hey! It's you!

  • Reply Smoof October 4, 2019 at 7:50 pm

    Take notes Reddit Youtubers

  • Reply Kyle Meinhardt October 4, 2019 at 7:51 pm

    One of the first

  • Reply Watts The Safeword October 4, 2019 at 7:51 pm

    I wonder if this youtube AI will also block all LGBT terminology. 😂

  • Reply Kyle Meinhardt October 4, 2019 at 7:51 pm

    6th?

  • Reply Hope Moore October 4, 2019 at 7:52 pm

    I haven't seen 1 through 7, but I was excited to see Jabrils as the host! I tell people to watch his machine learning videos! You're awesome, dude.

  • Reply Sumit Patil October 4, 2019 at 7:55 pm

    Oy….where is my natural AI sound…🙄

  • Reply ПΣJI-ƧΛMΛ October 4, 2019 at 8:02 pm

    i have no reason to be here

  • Reply Andrzej October 4, 2019 at 8:02 pm

    I'm very interested in the subject, I find the presenter and the robot highly annoying. Don't you have any women over there?

  • Reply Ryker Quackenbush October 4, 2019 at 8:04 pm

    AI needs courage.
    Times one billion.

  • Reply SwordQuake2 October 4, 2019 at 8:29 pm

    >Doesn't know what Lilliputian is.

  • Reply Timothy McDaniel October 4, 2019 at 8:46 pm

    This video is totes amazeballs.

  • Reply Jacob Parry October 4, 2019 at 9:25 pm

    I keep thinking these are, like traditional linguistics vids and I always get disappointed😭

  • Reply joseph duenas October 4, 2019 at 9:32 pm

    jabrillll 👌🏽👌🏽👌🏽🔥🔥🔥

  • Reply IceMetalPunk October 4, 2019 at 10:06 pm

    The pure defeat in your voice as you read the words "totes amazeballs" was visceral. I completely empathize with you; when I first saw that commercial that emphasized the phrase "totes McGoats", I immediately had a migraine. Stay strong, Jabril!

  • Reply Flaming Basketball Club October 4, 2019 at 10:54 pm

    Can't we learn more about Natural Language Processing on Codecademy?

  • Reply Learn and Grow - Kids TV October 4, 2019 at 11:22 pm

    🤔 💡Thanks for sharing! 😊

  • Reply IceMetalPunk October 4, 2019 at 11:25 pm

    Thank you for introducing me to Hugging Face… now I'm just playing with it all day and alternately being very impressed and laughing my butt off.

  • Reply S T A L K E R #SavetheLotus October 5, 2019 at 12:16 am

    Quite Shy

  • Reply Geoffrey Winn October 5, 2019 at 12:41 am

    Educational!

  • Reply Ace Hardy October 5, 2019 at 12:57 am

    👑

  • Reply N Squared October 5, 2019 at 1:56 am

    You monster. What is a life worth without knowledge of the word "whippersnapper"

  • Reply fidelio October 5, 2019 at 2:14 am

    5:34 ole jg and his bs.

  • Reply erin baggarly October 5, 2019 at 2:20 am

    I'm sorry. I stumbled into the wrong CC. I was looking for John. Keep up the good work. lm an old wonderer and wanderer apparently. Best wishes, Me.

  • Reply John Royce October 5, 2019 at 3:05 am

    Surprising lack of Kizuna Ai jokes

  • Reply Sh1nrue October 5, 2019 at 3:42 am

    wait, jabrils can talk directly ???! where's the dubbing ??

  • Reply William Barros October 5, 2019 at 5:58 am

    All those crash courses are FUN with Knowledge.But you should develop an Android App with a library of the courses. Thus we can listen and study them!

  • Reply Arghya Polley October 5, 2019 at 8:51 am

    please share the code

  • Reply Andy Morrall October 5, 2019 at 8:56 am

    If you want the AI to produce "interesting" sentences, it not only needs a model of what is possible to say, it needs a model of the listener, so it knows how acceptable they might find "totes amazeballs" and "Lilliputian".

  • Reply Alexi Xeno October 5, 2019 at 10:16 am

    Oh gawd… I feel sorry for any Ai based on V-sauce. That man is chaotic enough, let's (not) create a AI trying to imitate that orderly randomness

  • Reply Herr Vorragend October 5, 2019 at 10:55 am

    What the …! Zombicorns?!
    Now I really want John Greenbot to write a poem about Zombicorns 😀

  • Reply Palace Of Wisdom October 5, 2019 at 3:28 pm

    I forced a bot to watch 1000 hours of Vlogbrothers…

  • Reply Debt Collector October 5, 2019 at 3:35 pm

    OMG jabrils on crashcourse

  • Reply Mikerhinos October 5, 2019 at 6:00 pm

    Be careful because in EU we now have to deal with RGPD and the voice is a personnal data so you can't use someone's voice to train your model without an autorization… 🙁

  • Reply Ryuk Baduk October 5, 2019 at 8:25 pm

    Yall test the open AI's new language stuffs

  • Reply Postposterous October 5, 2019 at 9:13 pm

    Wow – your growth has been exponential. Don't know what else to say… we made some videos too…for some reason. 😎

  • Reply John Opalko October 5, 2019 at 9:45 pm

    So, basically, it's a sophisticated implementation of Markov chains?

  • Reply TheJaredtheJaredlong October 5, 2019 at 11:14 pm

    I hope someone compiles a super data set of every word John Green has ever written and spoken online: his books, vlog brothers, brotherhood 2.0, crash course, twitter, commencement speeches, tour speeches, hacked personal emails, etc.

  • Reply John Bradley Evans October 5, 2019 at 11:29 pm

    I am so armadillo.

  • Reply ACTIONKEY October 6, 2019 at 5:31 am

    CrashCourse
    – AI examples are a good idea

  • Reply Fraser McFadyen October 6, 2019 at 9:19 am

    I wrote this before I was going on the road and then I’m not doing that the time of time and then maybe I should go get a ride and get my stuff to get to my truck.

  • Reply Fraser McFadyen October 6, 2019 at 9:26 am

    I know this is just an exercise, but part of what makes John Green sound like John Green is his use of infrequent words like “whippersnapper” (not that infrequent, perhaps) and original coinages like “zombicorns”. How could you program that?

  • Reply Mark Susskind October 6, 2019 at 11:46 am

    Nerdfighters obsessed over zombies first.

  • Reply Mark Susskind October 6, 2019 at 11:49 am

    Totes-Amazeballs? Scriptwriter_AI must've not ruled out rare words.

  • Reply Alisa Adler October 6, 2019 at 1:34 pm

    it's interesting if AI just uses patterns, how does it make jokes? I read about such program. Is it that easy to make a joke out of patterns?

  • Reply Michael Villalobos October 7, 2019 at 2:23 pm

    love this video

  • Reply Jany JJ October 7, 2019 at 7:31 pm

    I WOULD LOVEEEE IT IF CRASH COURSE HAD AN ACCOUNTING COURSE!!❤️️.

  • Reply I want my wig back you imbecile October 8, 2019 at 12:37 am

    Just wondering, is there a Math crash course(playlist)? 🤔

  • Reply Unripe Banana October 8, 2019 at 3:32 pm

    Now we need to see all the words that were said once in the actual videos
