
I was originally planning to write the second article in my series on Kubeflow. However, I got bogged down with work, and in the interim Oracle released a new model deployment framework called GraphPipe. I still plan to look at Kubeflow at some point, since it has many nice features (like A/B testing and multi-armed bandit updates to redirect traffic), but I found GraphPipe easier to use and, at least according to its website, much faster than JSON-based APIs (like Kubeflow's).
In this article I’m going to walk through an example of deploying a trained PyTorch model using GraphPipe and my own model agnostic (MA) library (which now includes support for GraphPipe). For this example I chose ChexNet (the one from Rajpurkar et al.) and the implementation by arroweng (i.e. Weng et al.) that is publicly available on GitHub.
1. Refactor the code to support "single example" processing (or whatever mode you need for production).
This first step will be the same in almost all cases. It is also generally the most time consuming and painstaking, since it usually requires reading very carefully through the implementation code. My model_agnostic class aims to make this process somewhat easier (though unfortunately it still sucks).
1(a) Load the model weights
Before we even try to refactor, we need to make sure that we can load the model weights (surprisingly, this causes many more problems than you would ever believe). To do this we subclass PytorchModel from MA. For now we won’t worry about the preprocessing and process_result steps; instead we will just focus on loading the model weights.
There are two ways to load a PyTorch model, and MA supports both. The first is if you initially saved the complete model with torch.save(the_model, some_path).
In practice this is pretty rare; most of the time you save just the state_dict and not the whole model (i.e. torch.save(the_model.state_dict(), path)).
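For reference, the two cases look roughly like this in plain PyTorch (the file names are placeholders, and I use torchvision's DenseNet just as a stand-in architecture):

```python
import torch
import torchvision

# Case 1: the entire model object was pickled with torch.save(the_model, some_path)
model = torch.load("whole_model.pth", map_location="cpu")

# Case 2 (far more common): only the state_dict was saved, so you rebuild
# the architecture first and then load the weights into it
model = torchvision.models.densenet121()
model.load_state_dict(torch.load("state_dict.pth", map_location="cpu"))

model.eval()  # inference only: disables dropout and batch-norm updates
```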
Here arroweng provides the saved state_dict, so we have to implement create_model. ChexNet is essentially just a DenseNet121 modified for 14 conditions, so all create_model needs to do is return a DenseNet121 class. MA handles the rest in PytorchModel, such as transforming the state_dict to a CPU-usable format (or vice versa). Finally, we set training mode to false, as we want only forward propagation to take place.
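Put together, the wrapper ends up looking something like the sketch below. The PytorchModel/create_model names follow the description above, but the import path, the exact method signature, and the DenseNet121 definition are illustrative rather than a drop-in copy of arroweng's code:

```python
import torch.nn as nn
import torchvision

# Import path is a guess; adjust to wherever MA's PytorchModel actually lives.
from model_agnostic.pytorch_model import PytorchModel


class DenseNet121(nn.Module):
    """DenseNet121 with the classifier swapped out for 14 conditions (ChexNet)."""

    def __init__(self, out_size=14):
        super(DenseNet121, self).__init__()
        self.densenet121 = torchvision.models.densenet121(pretrained=True)
        num_features = self.densenet121.classifier.in_features
        self.densenet121.classifier = nn.Sequential(
            nn.Linear(num_features, out_size),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.densenet121(x)


class ChexNet(PytorchModel):
    def create_model(self):
        # MA only needs the bare architecture back; it then loads the
        # state_dict, remaps it to CPU if necessary, and we switch the
        # model to evaluation mode before serving.
        return DenseNet121(out_size=14)
```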
1(b) Implement preprocessing and process_result
In order to test that this setup works, we need to run the created model on some data. These same functions will also be used later when we deploy the model, so now is the time to get them as slimmed down as possible. This is often one of the more time-consuming steps, because the original code frequently evaluates the model in batches and goes through a data loader, whereas we need it to run on an individual example. As such, this process will vary considerably for your specific model/use case. Interestingly, for this particular model arroweng used data augmentation with TenCrop even during testing. Therefore, in my preprocessing method I decided to include an augmentation parameter that the user can set to True or False.
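To make that concrete, a single-example preprocessing function for this model might look roughly like the sketch below. It is shown as a free function for readability (in MA it would be the preprocessing method on the wrapper class), and the normalization constants are the usual ImageNet values rather than anything taken verbatim from arroweng's code:

```python
import torch
import torchvision.transforms as transforms
from PIL import Image

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def preprocess_image(image_path, augmentation=False):
    """Turn a single image file into a tensor the network can consume."""
    image = Image.open(image_path).convert("RGB")
    normalize = transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)
    if augmentation:
        # TenCrop-style test-time augmentation: ten 224x224 crops stacked
        # into a single (10, 3, 224, 224) batch.
        transform = transforms.Compose([
            transforms.Resize(256),
            transforms.TenCrop(224),
            transforms.Lambda(lambda crops: torch.stack(
                [normalize(transforms.ToTensor()(crop)) for crop in crops])),
        ])
        return transform(image)
    # Single-example path: one (1, 3, 224, 224) tensor.
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])
    return transform(image).unsqueeze(0)
```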
2. Convert PyTorch model to Caffe2
This part is relatively straightforward and well documented on the PyTorch website. Basically, it involves tracing an execution of the PyTorch model and exporting it to ONNX so that Caffe2 can run it.
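Concretely, the export looks something like this. DenseNet121 here is the architecture from the step 1(a) sketch (with the ChexNet weights already loaded in practice), and the dummy input shape matches the non-augmented, single-example case:

```python
import torch
import torch.onnx

# The model should already have its trained weights loaded and be in eval
# mode so that tracing records inference behaviour.
model = DenseNet121(out_size=14)
model.eval()

# The dummy input only needs the right shape; tracing records the operations
# executed on it and writes them out as an ONNX graph.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "chexnet-py.onnx",
                  export_params=True, verbose=True)
```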
To test that the model was successfully translated to Caffe2, I used code along the lines of the following. If everything works, it should run with no errors.
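Sketched here along the lines of the standard ONNX/Caffe2 verification recipe (the file name matches the export step above):

```python
import numpy as np
import onnx
import caffe2.python.onnx.backend as backend

# Load the exported graph and make sure it is a well-formed ONNX model.
model = onnx.load("chexnet-py.onnx")
onnx.checker.check_model(model)

# Run a single forward pass through the Caffe2 backend on random data.
rep = backend.prepare(model, device="CPU")
outputs = rep.run(np.random.randn(1, 3, 224, 224).astype(np.float32))
print(outputs[0].shape)  # expect (1, 14): one score per condition
```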
3. Serving with GraphPipe
3(a) Run GraphPipe Docker container
Now take your exported ONNX file (in this case chexnet-py.onnx) and put it either in the cloud or in a local directory. Then run the following command:
docker run -it --rm -e https_proxy=${https_proxy} -p 9000:9000 sleepsonthefloor/graphpipe-onnx:cpu --value-inputs=https://raw.githubusercontent.com/isaacmg/s2i_pytorch_chex/master/value_inputs2.json --model=http://url_to_model.onnx --listen=0.0.0.0:9000
This command pulls the GraphPipe ONNX CPU Docker image, your model, and the other information needed to run the container. The value-inputs file specifies the dimensionality of your input values:
{"0": [1, [1, 3, 224, 224]]}
So in this example (if you don't plan on using the cropping augmentation) the input is batch size one, three channels (i.e. RGB), and a 224×224 image. If everything goes right you should see something like "INFO[0005] Listening on 0.0.0.0:9000" in your terminal.
3(b) Define GraphPipe serving class
Your preprocessing function will remain the same, and you will want to define your process_result function if you haven’t already. MA handles all the behind-the-scenes calling of the GraphPipe Docker container. To use it, all you need is something like the following.
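The sketch below is illustrative rather than MA's literal interface: treat the ChexNetGraphpipe class name, preprocess_image, and process_result as placeholders for your own wrapper, preprocessing, and result-handling code. The essential point is that the call to the container boils down to graphpipe's remote.execute:

```python
import numpy as np
from graphpipe import remote  # pip install graphpipe


class ChexNetGraphpipe(object):
    """Thin client around the GraphPipe container started in step 3(a)."""

    def __init__(self, url="http://127.0.0.1:9000"):
        self.url = url

    def predict(self, image_path):
        tensor = preprocess_image(image_path)        # same preprocessing as step 1(b)
        data = tensor.numpy().astype(np.float32)     # GraphPipe expects plain ndarrays
        raw_scores = remote.execute(self.url, data)  # call out to the Docker container
        return process_result(raw_scores)            # e.g. map scores to condition names
```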
You can now wrap this class in any Django or Flask API to finish deployment.
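For instance, a bare-bones Flask wrapper (the route, form field, and temp path are arbitrary choices for illustration) could look like:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
model = ChexNetGraphpipe()  # the serving class sketched above


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a multipart upload with an "image" field; save it to disk and
    # push it through preprocessing -> GraphPipe -> process_result.
    image_file = request.files["image"]
    image_file.save("/tmp/upload.png")
    return jsonify(model.predict("/tmp/upload.png"))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```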
Conclusion
Now you might wonder why this is any better than simply running PyTorch in a Flask/Django REST API. The answer is that (1) this approach is generally faster, as GraphPipe optimizes the model prediction phase; (2) it is highly scalable; and (3) the API can be called by any application in any language (provided you can perform the equivalent preprocessing). In my opinion the biggest benefit is (2), as it is now very easy to spawn new Docker containers if prediction latency becomes the application bottleneck.
The finished product is available on my GitHub (I will add all the exporting scripts soon). Also, in a few days I should have the model demo up and running on Heroku. I have a couple more articles planned for my series on model deployment, including Kubeflow, GraphPipe deployment of TensorFlow models, and deploying models in Java production applications (with a particular emphasis on real-time streaming predictions with Flink). I’m not sure what order I’ll publish them in, but I do plan on getting around to it eventually. Finally, at some point I plan on doing a benchmark of which methods are truly fastest in terms of prediction speed.
More resources