We recently worked with one of India's largest NBFCs (Non-Banking Financial Companies) to solve an interesting problem they had. Azure ML is insanely powerful, and it let us start solving the problem straight away instead of having to set up environments first.
Like I said, this is one of India's largest NBFCs, and their customer support team deals with a lot of email. They used to manually classify each email on three labels - category, type and subtype. Their pain points were the following:
- The 50-member customer support team takes 24-48 hours to identify the labels through the manual process.
- A new employee who joins the support team takes a lot of time to onboard, understandably due to the training involved.
- Manual classification is prone to errors - because we are human :)
What if we could free up humans from this mundane classification task so that they can use their time to assist customers better? That's exactly what we did, using a combination of tools:
SQL Server for the initial dataset they had, and
Azure ML for machine learning. What follows is an overview of the overall workflow - how we went about solving this problem and how we achieved the desired accuracy.
This is a classic use case for Natural Language Processing. However, the challenge here was unique because it involves multi-class, multi-label classification. Email preprocessing and feature selection play a very important role in attaining the desired accuracy, so we will cover in detail the various steps involved in building a healthy training set of appropriate size. The classification of an email depends on the feature list; in this case, the features are words in the emails. Selecting the feature list plays a crucial role because natural language contains many low-information words such as
"the". The support emails are also a bit complex because each email can contain multiple queries. The training set should have an equal and appropriate set of features across all labels for the machine to learn.
There are two common ways of representing features: bag-of-words with bi-gram representation, or TF-IDF (term frequency multiplied by inverse document frequency). Since the feature list runs into a very large number of words, there is no conclusively best representation or algorithm - it really depends on the data and the accuracy you achieve.
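To make the two representations concrete, here is a minimal sketch in plain Python (the toy corpus and token lists are hypothetical, not the actual email data; a real pipeline would use a library implementation rather than these hand-rolled functions):

```python
import math
from collections import Counter

# Toy corpus standing in for preprocessed support emails (illustrative only).
docs = [
    ["loan", "statement", "request"],
    ["loan", "closure", "request"],
    ["account", "statement", "enquiry"],
]

def bigram_bow(tokens):
    """Bag-of-words with bi-grams: counts of unigrams plus adjacent pairs."""
    grams = list(tokens) + [" ".join(p) for p in zip(tokens, tokens[1:])]
    return Counter(grams)

def tfidf(tokens, corpus):
    """TF-IDF: term frequency scaled by inverse document frequency."""
    tf = Counter(tokens)
    n = len(corpus)
    return {
        t: (c / len(tokens)) * math.log(n / sum(1 for d in corpus if t in d))
        for t, c in tf.items()
    }

print(bigram_bow(docs[0]))
print(tfidf(docs[0], docs))
```

Bag-of-words keeps raw counts (the bi-grams preserve some word order), while TF-IDF down-weights words that appear in most documents - which is why, as noted above, the better choice depends on the data.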
Step#1 - Data Preparation
This step involves reading the training set and passing it through basic data-cleaning processes: removing special characters, ignoring numbers and email addresses, removing stopwords and converting to lowercase. Advanced natural language preprocessing tasks such as stemming and lemmatization are also performed here. The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
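The cleaning steps above can be sketched in a few lines of Python. This is a toy version, assuming a tiny hand-picked stopword list and a deliberately crude suffix-chopping stemmer; a real pipeline would use a full stopword corpus and a proper stemmer or lemmatizer (e.g. Porter stemming or WordNet lemmatization):

```python
import re

# Toy stopword list and suffix rules, purely for illustration.
STOPWORDS = {"the", "a", "an", "is", "to", "my", "for", "please", "and", "of"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stemming: chop common endings

def preprocess(email):
    text = email.lower()                      # lowercase
    text = re.sub(r"\S+@\S+", " ", text)      # drop email addresses
    text = re.sub(r"[^a-z\s]", " ", text)     # drop digits / special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]            # keep a minimum stem length
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Please send the statement for Loan #4521 to me@example.com"))
# -> ['send', 'statement', 'loan']
```

Note how crude stemming can produce non-words ("closing" becomes "clos") - exactly the trade-off between stemming and lemmatization described above.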
Step#2 - Feature Extraction
The extraction of words is done by the feature hashing algorithm, which maps word counts into a hash table of a given bit size. If the bit size is 20, the hash table contains 2^20 buckets; the right bit size depends on the dataset and on how many features we want in the training set. In the same module we select the n-gram size within the bag-of-words representation - unigrams, bigrams or trigrams. Out of all the features available, it is very important to select the right ones to reduce dimensionality; the experiment uses the Chi-squared method to find the appropriate list of features.
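The hashing trick can be sketched as follows. This is not Azure ML's internal implementation - just a minimal illustration, assuming CRC32 as the hash function and a small bit size so the bucket count stays readable:

```python
import zlib
from collections import Counter

BIT_SIZE = 10  # 2**10 = 1024 buckets; real experiments would use e.g. 20 bits

def ngrams(tokens, n=2):
    """Unigrams plus all k-grams up to size n over the token list."""
    out = list(tokens)
    for k in range(2, n + 1):
        out += [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return out

def hash_features(tokens, bits=BIT_SIZE, n=2):
    """Map each n-gram to one of 2**bits buckets; colliding grams share a bucket."""
    vec = Counter()
    for g in ngrams(tokens, n):
        vec[zlib.crc32(g.encode()) % (1 << bits)] += 1
    return vec

print(hash_features(["loan", "statement", "request"]))
```

The appeal is a fixed-size feature vector regardless of vocabulary size; the cost is that a small bit size causes more hash collisions, which is why the bit size has to be tuned against the dataset.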
Step#3 - Multiclass Classification algorithm
This step uses multiple classification algorithms to classify the test emails against the training set. Checking the different algorithms, we found that the decision forest algorithm provides the best accuracy for this dataset. A decision forest is based on the decision tree algorithm: it rapidly builds a series of decision trees that learn from the training dataset.
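The aggregation step of a decision forest - many trees voting on a class - can be illustrated with stub "trees". The three rule functions below are hypothetical hard-coded stand-ins for trained decision trees, purely to show the majority-vote mechanism:

```python
from collections import Counter

# Stub "trees": in a real decision forest each would be a trained decision
# tree; here they are hand-written rules (illustrative only).
def tree_keyword(tokens):
    return "Request" if "send" in tokens else "Enquiry"

def tree_length(tokens):
    return "Complaint" if len(tokens) > 8 else "Request"

def tree_negation(tokens):
    return "Complaint" if "not" in tokens else "Request"

def forest_predict(tokens, trees=(tree_keyword, tree_length, tree_negation)):
    """A decision forest classifies by majority vote across its trees."""
    votes = Counter(t(tokens) for t in trees)
    return votes.most_common(1)[0][0]

print(forest_predict(["please", "send", "statement"]))  # -> Request
```

Each tree alone is weak and error-prone; the vote across many diverse trees is what gives the forest its accuracy.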
Step#4 - Evaluation
The precision of the model depends on the number of correctly predicted classes against the actual classes; the overall performance of the model is judged on the precision, accuracy and recall evaluation metrics.
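These three metrics can be computed directly from the actual and predicted labels. A minimal one-vs-rest sketch (the label lists below are hypothetical, not the NBFC's data):

```python
def evaluate(actual, predicted, positive):
    """Accuracy overall; precision/recall for one class, one-vs-rest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return accuracy, precision, recall

actual    = ["Request", "Enquiry", "Request", "Complaint", "Request"]
predicted = ["Request", "Request", "Request", "Complaint", "Enquiry"]
print(evaluate(actual, predicted, "Request"))
```

For a multi-class problem, precision and recall are computed per class like this and then averaged; accuracy alone can be misleading when, as here, a few classes dominate the data.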
Experiment And Results
The classification involves predicting the category, type and subtype labels. The classification algorithm runs sequentially on the training and test data, starting with category: the output of the 1st label is used to predict the 2nd label, and likewise the 3rd label takes the first two labels as input. Because this is multi-label classification, there can be multiple ways of running the experiment. Upon reading the training data, modules like Split Data can be used to separate it into training and test data. There are multiple ways of using the Split Data module - randomized, stratified, recommender, relative, etc. The Split Data module is also used for selecting an appropriate training set, i.e. an equal number of rows across all trained classes. R-script and Python execution modules are available in the studio to run compiled programs for the preprocessing steps. In our experiment we use R code, which is well suited to data-cleaning work. The studio also lets us import these programs as zipped module files that are held in memory and run during the execution of the module.
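The sequential, chained labelling described above can be sketched as follows. The three stage functions are hypothetical stubs standing in for the trained models - the point is only the data flow, where each stage sees the email's features plus the labels predicted so far:

```python
# Chained multi-label classification: later stages take earlier predictions
# as extra inputs. Stage functions are stubs (illustrative only).
def predict_category(features):
    return "Request" if "send" in features else "Enquiry"

def predict_type(features, category):
    return "SR" if category == "Request" else "NSR"

def predict_subtype(features, category, type_):
    topic = "statement" if "statement" in features else "general"
    return f"{category}-{type_}-{topic}"

def classify(features):
    category = predict_category(features)
    type_ = predict_type(features, category)          # uses 1st label
    subtype = predict_subtype(features, category, type_)  # uses first 2 labels
    return category, type_, subtype

print(classify(["please", "send", "loan", "statement"]))
```

The design trade-off: chaining lets later labels exploit earlier ones, but an error in the first prediction propagates down the chain - one reason the first-stage (category) accuracy matters most.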
Azure ML Studio provides a very intuitive way of creating the modules and running the ML algorithm. It has data sources that can access data via
Azure Blob storage or
SQL Database. Multiple experiments can be created for preprocessing, feature selection, machine learning and evaluation - the advantage of multiple experiments is that they decouple the process, making each part easier to design, implement and debug. The studio provides text analytics modules for feature extraction using the hashing method; hashing is fast because features are stored and queried by hash value rather than by string comparison. Once the features are hashed and extracted, it is very important to select the best subset to reduce dimensionality. The feature set directly impacts training time and accuracy; there is no single right number of features, but ML Studio makes it easy to change the values and test the resulting accuracy.
The next step of our experiment is to run the classification algorithm using the studio modules. The multi-class algorithms available include Decision Forest, Decision Jungle, Logistic Regression and Neural Network. To choose the most appropriate algorithm for text learning, we need to try multiple algorithms with their most optimal parameters to avoid overfitting. Azure ML Studio helps with modules like Sweep Parameters, which internally runs the algorithm over multiple sets of values and finds the most optimal one without us changing and re-running the experiment every time. This can be repeated against the other available algorithms to arrive at the best-accuracy model. The evaluation modules give us the accuracy, recall and precision values, from which we arrive at the most suitable classification algorithm for the problem. In our case, we found Decision Forest to be the right algorithm.
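The idea behind a parameter sweep can be sketched in a few lines. This is a minimal grid search in the spirit of the Sweep Parameters module, not its actual implementation; `train_and_score` is a hypothetical stub where a real sweep would train and evaluate the model:

```python
from itertools import product

# Stub scoring function: pretend bigger/deeper forests help up to a plateau
# (illustrative only; a real sweep trains a model and measures accuracy).
def train_and_score(num_trees, max_depth):
    return min(num_trees * max_depth, 50) / 100

grid = {"num_trees": [8, 32, 64], "max_depth": [4, 16]}

def sweep(grid):
    """Try every parameter combination, keep the best-scoring settings."""
    best_score, best_params = -1.0, None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

print(sweep(grid))
```

The module saves exactly this kind of tedium: the sweep runs every combination and reports the winner, instead of us editing parameters and re-running by hand.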
As you can see from the steps above, Azure ML Studio helps an ML developer like me easily access, preprocess, classify and evaluate with the help of numerous modules. As a developer I do not have to worry about setup, configuration, processing or deployment.
Results and Correlation Against the Training Set
Example – for the category attribute, the Request – NSR class reaches 93.3%, while Enquiry – SR and Enquiry – NSR are at 59.1%. These classes have a large training set, hence the high accuracy; other classes like Complaint – SR and Request – SR have very small training sets.
Studying the classes and the proportion of the dataset present in the training set further, we see a correlation between class accuracy and training-set size. Considering only the classes with a good proportion of training data, the accuracy achieved is better: category 76.8%, type 67.6% and subtype 60.9%.
Next Steps to Increase Accuracy
- A high volume of healthy training data – the training set should have a high volume of distinctive keywords; perhaps sit with the customer care team to understand their manual classification process and the keywords they look for.
- Appropriate distribution of the training set across all classes – the current dataset is dominated by a few classes, and distributing the training set across all classes is very important.
Hope this was useful. In future posts I plan to write how-to guides for using Azure ML Studio. If you have a machine learning problem, please feel free to get in touch - we would love to solve it.
Happy Machine Learning! :)