Gender Prediction – Machine Learning Hands-On Tutorial – Part 1

Recently, with the introduction of virtual personal assistants such as Apple’s Siri, Microsoft’s Cortana and Amazon’s Alexa, AI and machine learning are all the rage.

The big boys over in Silicon Valley have been putting their efforts into making complex AI algorithms, which would have required a PhD in the relevant field not too long ago, accessible to the general public. Wit.ai and LUIS are some of the natural language products I have evaluated over the past few months. (They are a form of machine learning, the branch of computer science that studies the design of algorithms that can learn.) I am all but convinced that this is the way of the future…

In these few blog entries, I will try to create a simplistic example demonstrating how some of these AI SaaS products work under the hood, starting with:

  • Part 1, where we create a simple Python script that uses scikit-learn to generate a machine learning model and then uses the model to make some predictions.
  • Part 2, where we look at how to create a web-based entries editor using the Vue.js JavaScript framework, which can i) create a new session for the user and ii) enter the entries as training sets. At the same time, we will also look at Firebase, which holds all the training sets for the different sessions.
  • Part 3, where we take the engine proven in Part 1 and host it as a web API. We will use Python’s Flask to create a web-accessible API through which the user can i) start training with the training dataset from our Firebase, ii) check the status of the trained models by session ID and, of course, iii) query our engine using the models that were created.

In the end we will have a full application consisting of a Vue.js-based training set entry editor for data stored on Google’s Firebase, and a web API for querying, hosted on Heroku.
I have hosted a working end result here.

The first step in just about any data science project is loading your data, and that is also the starting point of this tutorial.
The dataset can be found here.
We split the data with a typical 80/20 ratio into a training set and a test set. The name (2nd column) is then used to create training patterns by taking substrings of it, e.g.:

Mary => { 'firstLetter1': 'm', 'firstLetter2': 'ma', 'firstLetter3': 'mar', 'lastLetter3': 'ary', 'lastLetter2': 'ry', 'lastLetter1': 'y' }
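To make the pattern concrete, here is a minimal sketch of the substring extraction. The function name name_features is chosen for illustration only. It also shows what happens with a name shorter than three letters: the slices simply return whatever characters are available.

```python
def name_features(name):
    """Return prefix and suffix slices of a lower-cased first name."""
    name = name.lower()
    return {
        "firstLetter1": name[0],
        "firstLetter2": name[0:2],
        "firstLetter3": name[0:3],
        "lastLetter1": name[-1],
        "lastLetter2": name[-2:],
        "lastLetter3": name[-3:],
    }


mary = name_features("Mary")
print(mary)

# A two-letter name: the three-letter slices fall back to the whole name.
short = name_features("Al")
print(short)
```

Note that the slices overlap heavily for short names, so such names contribute fewer distinct patterns.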

With the name features on one side and the gender labels on the other, we fit the model on the training set, hoping that the computer will be able to find some pattern that correlates a person’s gender with the substrings of their name. This is all done through the magic of a Pipeline, which first passes the feature dictionaries through a DictVectorizer transformer and then trains a Decision Tree classifier on the vectorized sets. (It seems to work pretty brilliantly.)
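A decision tree cannot work on dictionaries of strings directly, which is why the vectorizing step is needed. The sketch below, using two hand-written feature dicts trimmed to two keys for illustration, shows how scikit-learn’s DictVectorizer turns each distinct key/value pair into its own 0/1 column.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hand-written feature dicts (illustrative only, trimmed to two keys).
samples = [
    {"firstLetter1": "m", "lastLetter1": "y"},
    {"firstLetter1": "p", "lastLetter1": "y"},
]

# sparse=False returns a dense array, which is easier to read when printed.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(samples)

# One column per distinct "key=value" pair seen during fitting.
print(vectorizer.feature_names_)
# One row per sample; 1.0 marks the pairs present in that sample.
print(matrix)
```

Both samples share the lastLetter1=y column, while each gets its own firstLetter1 column. On the full dataset this produces a very wide but mostly empty matrix, which is why the vectorizer returns a sparse matrix by default.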

Once it is done, we save the trained model as myPipeline.pkl. This allows us to reuse the model for many future predictions of gender from a first name.
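The save-and-reload round trip can be sketched on its own as follows. The toy training data and the "F"/"M" labels are made up for illustration and are not taken from the real dataset. The sketch imports the standalone joblib package, which is an assumption about your environment; on older scikit-learn releases the same functions are reached through sklearn.externals.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib; older setups use sklearn.externals
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data, invented for this sketch only.
X = [{"lastLetter1": "a"}, {"lastLetter1": "o"}, {"lastLetter1": "e"}, {"lastLetter1": "n"}]
y = ["F", "M", "F", "M"]

pipeline = Pipeline([("dict", DictVectorizer()), ("dtc", DecisionTreeClassifier())])
pipeline.fit(X, y)

# Write into a temporary directory so the sketch cannot overwrite a real model.
model_path = os.path.join(tempfile.mkdtemp(), "myPipeline.pkl")
joblib.dump(pipeline, model_path, compress=1)

# The reloaded object is a complete pipeline: vectorizer and classifier together.
restored = joblib.load(model_path)
before = list(pipeline.predict(X))
after = list(restored.predict(X))
print(before == after)
```

Because the whole pipeline is saved, the prediction script only needs to rebuild the feature dictionaries; the vectorizer’s column layout travels with the model.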

The scripts in this tutorial were written against the Python 2 era of the scikit-learn stack, so we need an older version of Python to run them as written. I am using

$ python --version
Python 2.7.14 :: Anaconda, Inc.

To install the dependencies, I normally take the no-brainer path of using pip to download and set up the requirements. It works much like other typical install tools, such as Debian’s apt or Node’s npm.

$ pip install pandas numpy scikit-learn scipy

Now that all the library requirements are fulfilled, we can run the script, which generates a trained model, myPipeline.pkl, to be used later by the subsequent script.

$ python generateModel.py


import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.externals import joblib

# Load the CSV and drop the first column, leaving column 0 = name, column 1 = gender
names = pd.read_csv('namesDataset.csv')
names = names.values[:, 1:]

# 80% reserved for Training
TRAIN_SPLIT = 0.8

def featureF(name):
    name = name.lower()
    return {
        'firstLetter1': name[0],
        'firstLetter2': name[0:2],
        'firstLetter3': name[0:3],
        'lastLetter1': name[-1],
        'lastLetter2': name[-2:],
        'lastLetter3': name[-3:],
    }

features = np.vectorize(featureF)

X = features(names[:, 0]) # X contains the features

y = names[:, 1] # y contains the targets

# Shuffle sorted names list for better training
X, y = shuffle(X, y)
split = int(TRAIN_SPLIT * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

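# Vectorize the feature dicts, then train a decision tree on the result
pipeline = Pipeline([('dict', DictVectorizer()), ('dtc', DecisionTreeClassifier())])
pipeline.fit(X_train, y_train)

# Persist the trained pipeline for reuse by the prediction script
joblib.dump(pipeline, 'myPipeline.pkl', compress=1)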

# Quick sanity check on a few names
print(pipeline.predict(features(["Percy", "Alex", "Emma"])))
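The script above sets aside X_test and y_test but never uses them. A minimal sketch of how the held-out 20% could be used to measure accuracy follows. The ten labelled names and their "F"/"M" labels are hypothetical, invented for this sketch, and the helper name name_features is illustrative. With so few names the score means little; the point is only the mechanics.

```python
import random

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

TRAIN_SPLIT = 0.8  # same 80/20 ratio as the tutorial script


def name_features(name):
    """Prefix and suffix slices of a lower-cased name."""
    name = name.lower()
    return {
        "firstLetter1": name[0],
        "firstLetter2": name[0:2],
        "firstLetter3": name[0:3],
        "lastLetter1": name[-1],
        "lastLetter2": name[-2:],
        "lastLetter3": name[-3:],
    }


# Hypothetical labelled names, invented for this sketch (not the real dataset).
labelled = [
    ("Mary", "F"), ("Anna", "F"), ("Emma", "F"), ("Julia", "F"), ("Sophia", "F"),
    ("John", "M"), ("Peter", "M"), ("Thomas", "M"), ("Paul", "M"), ("Simon", "M"),
]

random.seed(0)  # fixed seed so the shuffle is repeatable
random.shuffle(labelled)

cut = int(TRAIN_SPLIT * len(labelled))
train, test = labelled[:cut], labelled[cut:]

pipeline = Pipeline([("dict", DictVectorizer()), ("dtc", DecisionTreeClassifier())])
pipeline.fit([name_features(n) for n, _ in train], [g for _, g in train])

# Fraction of held-out names whose label was predicted correctly.
accuracy = pipeline.score([name_features(n) for n, _ in test], [g for _, g in test])
print("held-out accuracy: %.2f" % accuracy)
```

In the tutorial script the equivalent call would be pipeline.score(X_test, y_test), placed after pipeline.fit.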

Since we are using a rather comprehensive set of data, the process is going to take a bit of time.
Once the trained model has been created, we can use it to predict the gender for a name. (You may change the test set, which can be found close to the bottom of the prediction source code.)

$ python genderPredict.py


import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.externals import joblib

def featureF(name):
    name = name.lower()
    return {
        'firstLetter1': name[0],
        'firstLetter2': name[0:2],
        'firstLetter3': name[0:3],
        'lastLetter1': name[-1],
        'lastLetter2': name[-2:],
        'lastLetter3': name[-3:],
    }

features = np.vectorize(featureF)

# Load the pipeline trained and saved by generateModel.py
pipeline = joblib.load('myPipeline.pkl', mmap_mode=None)

# Testing
testSet = ["Percy", "Navya", "Bravya", "Mariam", "Thomas", "Paul", "Janet", "Ashley"]
print(testSet)
print(pipeline.predict(features(testSet)))
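If you change the test set, bear in mind that the feature function indexes name[0], so an empty string raises an IndexError. A small guard along these lines can filter the input first. The helper name clean_names is hypothetical and not part of the scripts above.

```python
def clean_names(raw_names):
    """Keep only non-empty strings, stripped of surrounding whitespace."""
    cleaned = []
    for raw in raw_names:
        if not isinstance(raw, str):
            continue  # skip None, numbers and other non-string entries
        name = raw.strip()
        if name:
            cleaned.append(name)
    return cleaned


candidates = ["Percy", "  Janet ", "", None, "Al"]
usable = clean_names(candidates)
print(usable)
```

The cleaned list can then be passed to the feature function and on to pipeline.predict as before.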
