At Outware, Part of Arq Group, we keenly follow the latest trends in the rapidly evolving mobile technology ecosystem and look for the opportunities that new technologies present. Machine Learning and AI have been experiencing a massive boom thanks to growing public interest and technological advances in recent years, and we’ve been conducting various experiments with potential new use cases for this technology on mobile devices. As an iOS developer I’ve been training and deploying machine learning models directly onto iOS devices using Apple’s relatively new CoreML framework.
This article is broadly divided into two parts. In the first part, we discuss the history of native Machine Learning on mobile devices and the current, state-of-the-art performance that is possible on iOS. In the second part, we share the knowledge we accumulated in the process of building our first machine learning-powered iOS application, and review the tools and techniques we found most useful in reaching our goal. We then analyse the performance of the tools we used in building our food recognition application and how they compare to each other. Finally, we make some predictions for the future of machine learning in the mobile ecosystem and the direction in which the market is moving.
The graphic above from Google Trends illustrates the popularity of ‘machine learning’ as a search term over time, with a marked increase occurring around mid-2015 (all details can be viewed here).
Machine learning on mobile devices pre-CoreML/post-CoreML:
Machine learning on iOS devices has only existed for a relatively short period. Prior to CoreML’s first release in iOS 11, Apple’s first foray into native machine learning was in iOS 10 with two low-level APIs for building convolutional neural networks (CNNs): BNNS and MPSCNN (this article by Matthijs Hollemans provides a good explanation and comparison of the two APIs). The main distinction between BNNS (Basic Neural Network Subroutines) and MPSCNN (Metal Performance Shaders Convolutional Neural Network) is that BNNS is optimised for the CPU, which offers superior performance for prediction inference, and MPSCNN is optimised for the GPU, which offers superior performance for training a model. Providing APIs that can access the device’s GPU and let developers declare their own neural network architecture and data flow was a good first step, but it was still a very incomplete solution for viable on-device machine learning. It provided no interface for quickly converting third-party models into an iOS-readable format, and the developer was required to manually declare the neural network architecture and manage the data flow between all neurons in the network.
Apple resolved some of the ambiguity and complexity of the two ML frameworks released in iOS 10 by releasing CoreML in iOS 11. Broadly speaking, CoreML provides another layer of abstraction above the two frameworks, which eliminates the need for the developer to be concerned with the lower-level implementation details. It offers an easy interface for encapsulating a machine learning model that utilises both the CPU and GPU while performing inference, and third-party model architectures can be converted into the CoreML format.
CoreML has only been public since the release of iOS 11 in September 2017, and it is due to see major improvements with the eventual release of iOS 12, which is currently in beta. The beta version of CoreML 2 already shows marked improvements in accuracy and efficiency.
Given today’s climate of widespread fear of major tech companies breaching user privacy, and of artificial intelligence gaining ever deeper insights into user behaviour and interests, there’s both excitement and apprehension about the potential of this powerful new technology. In 2017 a news story from Facebook’s Artificial Intelligence Research Lab about two chatbots developing their own hybrid version of English to communicate provoked mass hysteria about the potential of machines to self-learn and become insubordinate to their human creators. Deploying machine learning models locally onto mobile devices could help calm public anxiety about artificial intelligence systems constantly feeding off user data across a distributed network and self-training at an exponential rate. One advantage of this approach is that it assures users that none of their sensitive private data will be captured and used by third-party providers to retrain their artificial intelligence systems. At a more practical level, it is also much more efficient: it minimises the network data usage incurred by dispatching data to the cloud for machine learning prediction tasks, and it gives users a much faster and more seamless experience.
Why we trained a food recognition model:
Given the new capabilities of on-device machine learning with the latest releases of iOS, it seems like ML has matured to the point of being able to cater to valid market use cases. There is a strong case for embarking on an internal research project that could coincide with future application domains we’ve identified in the mobile ecosystem.
We see improving in-store experiences with mobile technology as a major opportunity for the retail sector, especially utilising technologies such as AR and image recognition. The most common everyday retail experience for a regular person is probably grocery shopping, so we asked ourselves whether there was a way to improve this experience with mobile technology. Amazon have gained a lot of attention in the past year with their highly ambitious Amazon Go project: a physical grocery store that relies entirely on computer vision and object tracking to identify each shopper and the items they place in their baskets. This new system eliminates the need for customers to pay at a register, as their account is charged as soon as they exit the store. In the process, Amazon is also able to collect much richer data about all their customers and their shopping habits.
Beyond Amazon Go, computer vision techniques have proven very valuable in the food production industry for automating vital quality control mechanisms in the production flow. Given this premise that the grocery shopping experience is ripe for innovation, we identified a simple use case: building an image classification model for fresh produce that displays associated information to the user. Fresh produce was targeted as a more achievable preliminary goal because we could avoid the complexity of having to account for specific product packaging and text recognition (OCR) requirements when building our image classification model.
How we trained our food recognition model:
The first step involved in building our fresh food image classification model was collecting our datasets. We found the best resource for collecting datasets for computer vision is ImageNet, an open image database founded by leading computer vision researchers from Stanford University and other US universities. For each image class we wanted to train our image classifier to recognise, we downloaded the first 500 images of that fresh food class from ImageNet. Once our datasets were collected, we then went about the daunting process of training our image classification model.
Over the course of the year we experimented with many of the most powerful tools that exist in the machine learning community, principally TensorFlow (the most prominent tool in Google’s AI arsenal for training and building machine learning models) and Keras (a powerful machine learning toolkit built on top of the TensorFlow engine). Although both of these platforms are very powerful in their own right, we initially ran into many complications when trying to employ them in our training process. Firstly, the learning curve was very steep: any would-be developer who wished to use these tools would spend a long time getting their head around the implementation details and understanding how to tune hyperparameters in a model training task. Secondly, we found there was no native support for CoreML built into these frameworks. The only viable method of converting these models into CoreML format required the early-stage coremltools library, which we found was unable to convert certain TensorFlow and Keras models we had trained.
After a lot of frustration trying to achieve results with TensorFlow, we stumbled upon Apple’s unheralded machine learning toolkit TuriCreate. TuriCreate is a machine learning framework based on the technology produced by the company Turi (formerly known as GraphLab), which Apple acquired in August 2016. It’s thankfully a much simpler framework to use, and we were able to successfully train and deploy our first CoreML image classification model within one day of getting started. The only disadvantages we found with TuriCreate were that it offers less fine control of the hyperparameters in the training process than you would get using TensorFlow, and that it’s designed specifically for the Apple ecosystem.
(Thankfully, more recent experimentation and improvements to the TensorFlow toolkit brought us some success: we found tfcoreml was capable of converting a TensorFlow version of our food recognition model into CoreML, and TensorFlow released TFLite, a mobile-optimised version of their architecture.)
Our first prototype involved deploying our food recognition model into an iOS app, then utilising the device’s live camera feed and performing a model inference every second with the latest frame from the camera stream. When our food recognition model has identified a class of food with a high probability in three consecutive frames, the app determines that the user most likely wants to learn more about this food class, and a 3D label is displayed in AR. The user then has the option to press on that label, which directs them to a details screen displaying the nutritional information for that particular food.
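The three-consecutive-frames rule described above can be sketched in a few lines of Python. This is a minimal illustration only, not the code from our app: the class name ConsecutiveFrameDetector, the update method and the 0.8 confidence threshold are all assumptions made for the example, since the only fixed requirements are a ‘high probability’ and three frames in a row.

```python
from typing import Optional


class ConsecutiveFrameDetector:
    """Confirms a label once it wins `required_frames` frames in a row.

    Hypothetical helper: the 0.8 threshold is an assumed value.
    """

    def __init__(self, min_confidence: float = 0.8, required_frames: int = 3):
        self.min_confidence = min_confidence
        self.required_frames = required_frames
        self._current_label: Optional[str] = None
        self._streak = 0

    def update(self, label: str, confidence: float) -> Optional[str]:
        """Feed one frame's top prediction; return the label once confirmed."""
        if confidence < self.min_confidence:
            # A low-confidence frame breaks the streak entirely.
            self._current_label = None
            self._streak = 0
            return None
        if label == self._current_label:
            self._streak += 1
        else:
            # A different label starts a new streak of length one.
            self._current_label = label
            self._streak = 1
        return label if self._streak >= self.required_frames else None


# Simulated once-per-second predictions as (label, confidence) pairs.
detector = ConsecutiveFrameDetector()
frames = [("banana", 0.91), ("banana", 0.95), ("banana", 0.88)]
results = [detector.update(label, conf) for label, conf in frames]
```

Here only the third frame confirms the label. Resetting the streak on any low-confidence or differing frame is a deliberately strict choice that trades a little responsiveness for fewer spurious AR labels.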
The link above shows a demo of our food recognition application recognising classes of food and displaying 3D models of the recognised food in augmented reality. The user can tap on the floating 3D model and be redirected to a nutrition information screen, which also has links to potential recipes that include this food item.
Performance comparison between CoreML 2/CoreML 1.5 and other ML platforms:
In our food recognition application, the original 77-class CoreML 1.5 food recognition model was trained with Apple’s TuriCreate Python framework, which produced a 90 MB model. In comparison, the CoreML 2 beta model was trained on the exact same dataset without writing any TuriCreate Python scripts, instead using the new CreateML framework, which requires virtually no code and can be managed entirely through a drag-and-drop interface. The resulting model produced significantly more accurate prediction results.
Most impressive, however, was the massive reduction in model size for CoreML 2: the resulting model was only 1.2 MB. Compared with the original 90 MB model, this is a 75-fold decrease in file size, achieved while also producing significantly more accurate predictions. One of the leading techniques that made this technological breakthrough possible was the use of model quantisation (its implementation is explained thoroughly in this CoreML demonstration from the 2018 Apple WWDC).
Basically, model quantisation is a method whereby all the ‘weights’ in an ML model are compressed by decreasing the precision of their values, while maintaining comparable accuracy. This technique is also employed by TensorFlow Lite to reduce the size of models intended for deployment on mobile and embedded devices, and appears to be a major improvement for ML capabilities on edge devices. When we re-trained our food recognition model with TensorFlow and deployed it as a TFLite model, the file size was similarly very small at less than 4 MB. Clearly, Apple’s team of data scientists and ML researchers have put a lot of attention and effort into CoreML over the past year since its first release: they see a lot of value in its future potential, and it’s something everyone in the iOS development community should keep their eye on.
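To make the idea concrete, here is a minimal sketch of linear 8-bit weight quantisation in plain Python. It is an illustration under stated assumptions rather than the scheme CoreML or TFLite actually implements: the function names quantise and dequantise are hypothetical, and the ‘weights’ are random numbers standing in for a trained layer.

```python
import random
import struct


def quantise(weights, bits=8):
    """Map float weights onto 2**bits evenly spaced integer levels."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    # Fall back to a scale of 1.0 if every weight is identical.
    scale = (hi - lo) / levels or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantise(codes, lo, scale):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]


# Stand-in for a trained layer: 10,000 small random weights (assumed data).
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(10_000)]

codes, lo, scale = quantise(weights)
restored = dequantise(codes, lo, scale)

# Storage cost as 32-bit floats versus one byte per quantised weight.
float32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
uint8_bytes = len(bytes(codes))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch each weight shrinks from four bytes to one, and the rounding error per weight is at most half of one quantisation step (scale / 2). Note that 8-bit storage on its own accounts for a fourfold reduction, so a much larger reduction in a real model would also depend on factors this sketch does not model.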
The images above illustrate the power of weight quantisation in reducing model file sizes while retaining similar model accuracy, as well as a direct comparison of the size of CoreML models for each version of iOS. (Source: ‘What’s new in CoreML, Part 1’, Apple WWDC 2018)
Predicting the future of machine learning on mobile:
Given the sheer potential of machine learning, many new applications have been identified as areas of major growth, and the developer community is producing a breadth of research and literature on the topic. Leading US advisory firm Gartner published a research paper early this year that outlined ten major applications of mobile-based machine learning which they predict will have a major impact in the future. One broad theme present in many of the predicted trends was personalisation, where more insight about the user is gained and the system is optimised to react to their common behavioural patterns. This includes biometric recognition-based authentication, which already exists in iOS with FaceID and TouchID. Machine learning can also be deployed to optimise system resource management, such as tracking apps that are running in the background and terminating them depending on whether they’ve been classified as commonly used apps.
AR indoor wayfinding is a new technology with a lot of potential for mass market adoption once its performance becomes more reliable and it becomes more scalable to implement. Using computer vision to recognise a user’s surroundings and their exact bearing and position could be the missing puzzle piece in widespread adoption of this technology. This has the potential to replace the time-consuming and expensive solution of installing beacon systems in an indoor space and to make the technology much more scalable. Blippar, an augmented reality company from the UK, has made major inroads towards developing a solution to this problem that relies on a computer vision approach. Any player in this area of the marketplace who can successfully harness computer vision as a cornerstone of their solution has the potential to take over this emerging market.
Beyond image recognition, machine learning is gaining popularity as a tool for personal photography. An image manipulation app from Russia called Prisma gained mass popularity over the past two years by offering users the ability to transform their personal photographs into artworks based on the style of famous artists, using a machine learning technique called style transfer. Given how new these image style techniques are, as well as the popularity of image sharing and social media in the mobile ecosystem, it seems reasonable to expect new innovations and further market adoption in this area.
In conclusion, machine learning has the potential to transform many areas of day-to-day human life for the better if harnessed correctly, and mobile devices appear to be one of the most ideal platforms for mobilising this technology. There’s a concerted effort among the major technology corporations to rapidly prototype and improve their machine learning tools for mobile devices, and whoever does the best job of positioning themselves for this upcoming wave in the global marketplace stands to reap major benefits.