Python Exercise for Beginners – Implementing One Hot Encoding

In today’s Python exercise we will create a one-hot encoding algorithm to convert categorical data to a numerical format.

We need to do this because most machine learning algorithms can only work with numbers, not strings.

What is One Hot Encoding?

One-hot encoding is a way of representing categorical information in binary format, but in such a way that only one digit in the binary number is set to 1. This is why it is called one-hot because only one bit is ON at any time in the binary number.
The categorical data we are talking about is nominal, that is, data where order does not apply.

If the category has a natural order, for instance (Day of the Week: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday), then you don’t need to use one-hot encoding. You can just assign an integer to each day of the week starting from zero.

In general, it’s a bad idea to assign an ordinal value to a category that is nominal (e.g. cat or dog), as the machine learning algorithm will assume that there is a natural order in the category.
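
To make the difference concrete, here is a minimal sketch that gives the days of the week integer codes and gives a nominal category one vector position per item. The names day_codes and animal_vectors are just illustrative, they are not used elsewhere in this exercise.

```python
# Ordinal category: the order carries meaning, so integer codes are reasonable.
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
day_codes = {day: position for position, day in enumerate(days)}

# Nominal category: no order, so each item gets its own position in a vector.
animals = ["cat", "dog"]
animal_vectors = {
    animal: [1 if i == position else 0 for i in range(len(animals))]
    for position, animal in enumerate(animals)
}

print(day_codes["Tuesday"])   # 2
print(animal_vectors["cat"])  # [1, 0]
print(animal_vectors["dog"])  # [0, 1]
```

Neither animal vector is "bigger" than the other, which is exactly the property we want for a nominal category.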

In this video, I am going to show you several ways of creating one-hot encoding for different types of data that you might come across.

First, let’s look at a simple example of one-hot encoding, using this list of categories:

categories=["cat", "fox", "badger", "person"]

As you can see this category is nominal, meaning that there is no natural order that can be applied. So this is a good candidate to use one-hot encoding.

To apply this algorithm, we need to count the number of elements in the category. And with this we will have the length of our binary number:

length_of_binary_number = len(categories)
print(length_of_binary_number)
4

Manual One-Hot-Encoding

To represent this category using one-hot encoding, we could simply define each category manually:

one_hot_encodings = dict()

one_hot_encodings["cat"] = 0b1000
one_hot_encodings["fox"] = 0b0100
one_hot_encodings["badger"] = 0b0010
one_hot_encodings["person"] = 0b0001

print(one_hot_encodings)
{'cat': 8, 'fox': 4, 'badger': 2, 'person': 1}

You will notice that we didn’t assign 0b0000 to any of the elements. That was on purpose: a one-hot code must have exactly one bit set to 1, and 0b0000 has none.
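
The manual table can also be generated with a bit shift. This is only a sketch of one possible way to do it: each category gets a single 1 bit, shifted into a different position, so 0b0000 can never appear.

```python
categories = ["cat", "fox", "badger", "person"]
n = len(categories)

# Shift a single 1 bit into a different position for each category.
# The first category gets the leftmost bit, matching the manual table.
one_hot_encodings = {
    category: 1 << (n - 1 - position)
    for position, category in enumerate(categories)
}

for category, code in one_hot_encodings.items():
    # the format spec pads the binary string with zeros to n digits
    print(category, format(code, f"0{n}b"))
```

This prints 1000, 0100, 0010 and 0001 next to the four categories, the same values we typed by hand.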

One-Hot-Encoding for larger categories

To encode categories with a small number of elements, it’s OK to do it manually and use just a binary number. But when you are dealing with larger categories with thousands of items, representing every item with a single binary number is not practical.

Consider, for instance, a 32-bit integer, a common fixed-width integer type. (Python’s own integers can grow beyond 32 bits, but the fixed-width types used by NumPy and most machine learning libraries cannot.)

One-hot encoding only allows us to use one bit to represent an item in a category. So if we are limited to a 32-bit number, we can encode a category with a maximum of 32 elements!

To represent categories with a large number of items, we need to use arrays.
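
As a small illustration of the array idea, here is a sketch that uses the rows of a NumPy identity matrix as one-hot vectors for the four categories from earlier. The dictionary name one_hot_vectors is only for this example.

```python
import numpy as np

categories = ["cat", "fox", "badger", "person"]

# Row i of the identity matrix has a single 1 in column i,
# so it can serve as the one-hot vector for category i.
identity = np.eye(len(categories), dtype=int)
one_hot_vectors = {category: identity[i] for i, category in enumerate(categories)}

print(one_hot_vectors["badger"])  # [0 0 1 0]
```

With an array, the number of categories is limited by memory rather than by the width of an integer.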

Encoding the words in the Bible using one-hot encoding

In this exercise, we are going to use a NumPy array to store the features in a one-hot encoding format resulting from finding every word that is different in the English text of Genesis.

First, we download the text of Genesis using the requests library:

import requests

r = requests.get("http://www.stewartonbibleschool.org/bible/text/genesis.txt", stream = True)

# Check that the text was retrieved successfully
if r.status_code == 200:
    bible_genesis_text = r.text
    print(bible_genesis_text[0:2000])
The First Book of Moses called

GENESIS

1:1: In the beginning God created the heaven and the earth.
1:2: And the earth was without form, and void; and darkness was upon the face of the deep.  And the Spirit of God moved upon the face of the waters.
1:3: And God said, Let there be light: and there was light.
1:4: And God saw the light, that it was good: and God divided the light from the darkness.
1:5: And God called the light Day, and the darkness he called Night.  And the evening and the morning were the first day.
1:6: And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.
1:7: And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.
1:8: And God called the firmament Heaven.  And the evening and the morning were the second day.
1:9: And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.
1:10: And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.
1:11: And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.
1:12: And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good. 
1:13: And the evening and the morning were the third day.
1:14: And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years: 
1:15: And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so. 
1:16: And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.
1:17: And

Removing verse numbers using Python regular expressions

We are going to do some very light data cleansing, something you are expected to do in any data science project.

Now we need to iterate through each verse in Genesis, being careful to discard the verse numbers. We do not want these for the one-hot encoding.

Since the length of the verse number differs, we can’t just remove a fixed number of characters from each line.

This is a great place to use a simple regular expression.

import re

lines =  bible_genesis_text.splitlines()

verses_without_number=""
for line in lines:
  verse_without_number = re.sub(r"\d+:\d+:", "", line)
  verses_without_number += (verse_without_number + "\n")

print(verses_without_number[0:2000])
The First Book of Moses called

GENESIS

 In the beginning God created the heaven and the earth.
 And the earth was without form, and void; and darkness was upon the face of the deep.  And the Spirit of God moved upon the face of the waters.
 And God said, Let there be light: and there was light.
 And God saw the light, that it was good: and God divided the light from the darkness.
 And God called the light Day, and the darkness he called Night.  And the evening and the morning were the first day.
 And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.
 And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.
 And God called the firmament Heaven.  And the evening and the morning were the second day.
 And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.
 And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.
 And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.
 And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good. 
 And the evening and the morning were the third day.
 And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years: 
 And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so. 
 And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.
 And God set them in the firmament of the heaven to give light upon the earth,
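
If you want to try the pattern on its own, here is a small sketch using two verses copied from the output above plus the title line. It is a slight variation on the code above, not the same thing: the pattern is anchored with ^ and also swallows the whitespace after the verse number, so no leading space is left behind.

```python
import re

# Sample lines in the same "chapter:verse:" layout as the downloaded text.
sample_lines = [
    "1:1: In the beginning God created the heaven and the earth.",
    "1:13: And the evening and the morning were the third day.",
    "GENESIS",
]

# ^ anchors the match to the start of the line;
# \s* also drops the space that follows the verse number.
verse_prefix = re.compile(r"^\d+:\d+:\s*")

cleaned = [verse_prefix.sub("", line) for line in sample_lines]
print(cleaned)
```

Lines without a verse number, such as the title, pass through unchanged.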

Creating a NumPy array with a one-hot encoding for each word in Genesis

print(verses_without_number.split()[0:25])
['The', 'First', 'Book', 'of', 'Moses', 'called', 'GENESIS', 'In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth.', 'And', 'the', 'earth', 'was', 'without', 'form,', 'and', 'void;']

Notice we still have punctuation attached to some of the words. We don’t want it!
Luckily, we don’t need to come up with a regex for that. Python has a trick up its sleeve:

import string

string.punctuation
'!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'

If we import the string module, we have access to a handy constant, string.punctuation, which contains the ASCII punctuation characters that you will normally find in a text.

Using str.strip() we can then remove the leading and trailing punctuation from our words before applying one-hot encoding:

for word in verses_without_number.split()[0:25]:
  word_without_punctuation = word.strip(string.punctuation)
  print(word_without_punctuation)
The
First
Book
of
Moses
called
GENESIS
In
the
beginning
God
created
the
heaven
and
the
earth
And
the
earth
was
without
form
and
void
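
One detail worth knowing: str.strip() only trims the two ends of a word, so punctuation inside a word (like the hyphen in Abel-mizraim) survives. The short sketch below shows this, together with str.translate() as one possible way to delete punctuation everywhere. We do not use the translate version in this exercise.

```python
import string

# str.strip() only trims characters from the two ends of the string.
print("form,".strip(string.punctuation))         # form
print("Abel-mizraim".strip(string.punctuation))  # Abel-mizraim
print("(void;)".strip(string.punctuation))       # void

# To delete punctuation everywhere, a translation table is one option.
remove_all = str.maketrans("", "", string.punctuation)
print("Abel-mizraim".translate(remove_all))      # Abelmizraim
```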

Generating One-hot encoding using NumPy Arrays

Now that we have cleansed our data, we are ready to generate our one-hot encodings. First we build an index that maps each distinct word to a column number:

words_index = {}
words_in_genesis_list = verses_without_number.split()

for word in words_in_genesis_list:
  word_without_punctuation = word.strip(string.punctuation)
  # test the cleaned word, so each word keeps the index of its first appearance
  if word_without_punctuation not in words_index:
    words_index[word_without_punctuation] = len(words_index)

list(words_index.items())[0:10]
[('The', 0),
 ('First', 1),
 ('Book', 2),
 ('of', 3),
 ('Moses', 4),
 ('called', 5),
 ('GENESIS', 6),
 ('In', 7),
 ('the', 8),
 ('beginning', 9)]

Now it is time to generate the one-hot encodings using NumPy:

import numpy as np

# one row per word in the text, one column per distinct word
result = np.zeros((len(words_in_genesis_list), len(words_index)))
for index, word in enumerate(words_in_genesis_list):
  word_without_punctuation = word.strip(string.punctuation)
  hot_index = words_index[word_without_punctuation]
  result[index, hot_index] = 1

print(result.shape)
result[0:100]
(38267, 2670)

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
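
The loop can also be replaced by a single NumPy indexing step. Here is a minimal sketch on a handful of words rather than the whole book. It assumes the same idea of a word index built in order of first appearance, and it uses a uint8 array, which is my own choice to keep the matrix small.

```python
import numpy as np

words = ["In", "the", "beginning", "God", "created", "the", "heaven"]

# Assign each distinct word the next free index, in order of first appearance.
words_index = {}
for word in words:
    if word not in words_index:
        words_index[word] = len(words_index)

# Look up the column for every word, then set all the 1s in a single step.
columns = np.array([words_index[word] for word in words])
result = np.zeros((len(words), len(words_index)), dtype=np.uint8)
result[np.arange(len(words)), columns] = 1

print(result.shape)        # (7, 6)
print(result.sum(axis=1))  # [1 1 1 1 1 1 1]
```

Every row sums to 1, which is a quick sanity check that each word got exactly one hot column.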

One-hot Encoding with Pandas

Since one-hot encoding is used quite often in data science, you will find that it is already implemented for you in the most popular data science libraries.

In the pandas library you can apply one-hot encoding to a column of a DataFrame using the get_dummies() function.

First, let’s create a pandas DataFrame with a single column from our existing data set.

import pandas as pd
# we use a lambda function and map to strip all the punctuation
words_in_genesis_list = list(map( lambda x: x.strip(string.punctuation), words_in_genesis_list))
df = pd.DataFrame(words_in_genesis_list, columns=["Word"])
df
         Word
0         The
1       First
2        Book
3          of
4       Moses
...       ...
38262      in
38263       a
38264  coffin
38265      in
38266   Egypt

38267 rows × 1 columns

To apply one-hot encoding we simply do:

pd.get_dummies(df["Word"], prefix="word")
  word_A word_Abel word_Abel-mizraim word_Abida word_Abide word_Abimael word_Abimelech word_Abimelech’s word_Abraham word_Abraham’s ... word_yoke word_yonder word_you word_young word_younger word_youngest word_your word_yours word_yourselves word_youth
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
...
38262 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
38263 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
38264 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
38265 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
38266 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

38267 rows × 2670 columns

 

We have ended up with a very large number of columns in the pandas DataFrame, one for each distinct word. Obviously, this is not for human consumption. But a machine learning algorithm needs the data in exactly this form to make some sense of it.
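
The full table is too wide to inspect, so here is a sketch of the same call on seven words only. The dtype=int argument is my addition: newer pandas versions return True/False columns by default, and asking for integers gives the 0/1 layout shown above.

```python
import pandas as pd

df = pd.DataFrame({"Word": ["God", "created", "the", "heaven", "and", "the", "earth"]})

# dtype=int asks for 0/1 columns; newer pandas versions default to True/False.
dummies = pd.get_dummies(df["Word"], prefix="word", dtype=int)

print(dummies.shape)                # (7, 6)
print(dummies["word_the"].tolist())  # [0, 0, 1, 0, 0, 1, 0]
```

There are seven rows but only six columns, because "the" appears twice and shares one column.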

One-Hot encoding with Sklearn

import sklearn.preprocessing as preprocessing

labelEncoder = preprocessing.LabelEncoder()
sk_words_index = labelEncoder.fit_transform(words_in_genesis_list)
print(sk_words_index)
onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(sk_words_index.reshape(-1, 1))
one_hot_encoded_words = onehotEnc.transform(sk_words_index.reshape(-1, 1))

print("The One-Hot-Encoded verses")
print(one_hot_encoded_words.toarray())
print(one_hot_encoded_words.shape)
[ 600  208  102 ...  973 1535  159]
The One-Hot-Encoded verses
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(38267, 2670)
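
As far as I know, recent versions of scikit-learn let OneHotEncoder work on strings directly, so the LabelEncoder step is optional. Here is a minimal sketch on a few words, assuming a reasonably recent scikit-learn is installed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

words = np.array(["God", "created", "the", "heaven", "and", "the", "earth"])

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(words.reshape(-1, 1))

print(encoder.categories_[0])  # the distinct words, one per output column
print(encoded.shape)           # (7, 6)
print(encoded.toarray()[2])    # the one-hot row for the first "the"
```

The result is a sparse matrix, which is why toarray() is needed before printing. For a vocabulary of thousands of words, the sparse form saves a lot of memory.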

One Hot Encoding with Keras

One-hot encoding with Keras comes with a few extra features. For instance, by default, all words are converted to lowercase so that the same word in a different case is not treated as a different word. You can also specify the maximum number of words to keep when building the word index, based on word frequency. This can be useful if you are looking to remove outliers. Finally, you can customize the filter that Keras uses to strip unwanted characters from the text (e.g. punctuation).

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(
    lower=False,
    num_words=None
)

tokenizer.fit_on_texts(words_in_genesis_list)
#sequences = tokenizer.texts_to_sequences(verses_without_number_list)
one_hot_results = tokenizer.texts_to_matrix(words_in_genesis_list, mode='binary')
one_hot_results.shape
(38267, 2686)

Interesting that the results from the Keras One-Hot-Encoding differ slightly from the other results. Why is that?

keras_word_index = tokenizer.word_index

for word in words_index:
  if word not in keras_word_index:
    print(word)
Tubal-cain
Hazar-maveth
El-paran
En-mishpat
Hazezon-tamar
Beer-lahai-roi
Beer-sheba
Jehovah-jireh
Kirjath-arba
Lahai-roi
Padan-aram
Jegar-sahadutha
El-elohe-Israel
El-beth-el
Allon-bachuth
Ben-oni
Baal-hanan
Zaphnath-paaneah
Poti-pherah
Abel-mizraim

Aha. It seems that Keras treats words containing a hyphen (-) a bit differently.

keras_word_index = tokenizer.word_index

for word in keras_word_index:
  if word not in words_index:
    print(word)
Beer
sheba
aram
El
roi
Poti
pherah
cain
Lahai
Baal
hanan
Hazar
maveth
paran
En
mishpat
Hazezon
tamar
lahai
Jehovah
jireh
Kirjath
arba
Jegar
sahadutha
elohe
beth
el
Allon
bachuth
Ben
oni
Zaphnath
paaneah
mizraim

I think the culprit is the filter list, which strips every hyphen (-). Let’s remove - from the filters argument and see what happens:

tokenizer = Tokenizer(
    lower=False,
    num_words=None,
    filters='!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n',
    split=' '
)

tokenizer.fit_on_texts(words_in_genesis_list)
#sequences = tokenizer.texts_to_sequences(verses_without_number_list)
one_hot_results = tokenizer.texts_to_matrix(words_in_genesis_list, mode='binary')
one_hot_results.shape
(38267, 2671)
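
The effect of the filter list on hyphenated words can be reproduced without Keras. The function below is only my rough approximation of a filter-then-split tokenizer, not the actual Keras code, and the names filter_and_split and DEFAULT_LIKE_FILTERS are made up for this sketch.

```python
# A filter string that includes the hyphen (assumed here for illustration).
DEFAULT_LIKE_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def filter_and_split(text, filters, split=" "):
    """Replace every filtered character with the split character, then split."""
    table = str.maketrans({character: split for character in filters})
    return [token for token in text.translate(table).split(split) if token]

print(filter_and_split("Beer-lahai-roi", DEFAULT_LIKE_FILTERS))
# ['Beer', 'lahai', 'roi']

no_hyphen = DEFAULT_LIKE_FILTERS.replace("-", "")
print(filter_and_split("Beer-lahai-roi", no_hyphen))
# ['Beer-lahai-roi']
```

When the hyphen is in the filter list, one word turns into three tokens, which matches the fragments we saw printed earlier.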

It seems we still have an extra column in the Keras matrix (2671 vs 2670).

keras_word_index = tokenizer.word_index

for word in keras_word_index:
  if word not in words_index:
    print(word)
for word in words_index:
  if word not in keras_word_index:
    print(word)

print(len(keras_word_index))
print(len(words_index))
2670
2670

Neither loop prints a word and both indexes have 2670 entries, so the two word indexes contain the same words. But why the extra column?
The clue is in the Keras documentation: index 0 is reserved and is never assigned to a word. Hence no row has a 1 in the first column.
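
Here is a tiny sketch of why a reserved index 0 produces one extra column. It does not use Keras at all. It just builds a hypothetical word index that starts counting at 1 and fills a NumPy matrix from it.

```python
import numpy as np

words = ["In", "the", "beginning", "God"]

# Hypothetical word index that reserves 0 and starts counting at 1.
word_index = {word: position + 1 for position, word in enumerate(words)}

# One extra column is needed so that the largest index still fits.
matrix = np.zeros((len(words), len(word_index) + 1))
for row, word in enumerate(words):
    matrix[row, word_index[word]] = 1

print(matrix.shape)        # (4, 5)
print(matrix[:, 0].sum())  # 0.0
```

Four words give five columns, and the first column stays empty, just like the 2670 words and 2671 columns above.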

Resources

Source Code

https://github.com/armindocachada/python-for-beginners-exercises

