Data Science as Software: from Notebooks to Tools [Part 3]
This is the final part of the series on how to move on from Jupyter notebooks to software solutions in Data Science. Part 1 covered the basics of setting up the working environment and data exploration. Part 2 dived deep into data pre-processing and modelling. Part 3 deals with how you can move on from Jupyter, front-end development and your daily work in the code. The overall agenda of the series is the following:
- Setting up your working environment [Part 1]
- Important modules for data exploration [Part 1]
- Machine Learning Part 1: Data pre-processing [Part 2]
- Machine Learning Part 2: Models [Part 2]
- Moving on from Jupyter [Part 3]
- Shiny stuff: when do we get a front end? [Part 3]
- Your daily work in the code: keeping standards [Part 3]
Assume you’re done with the first iteration of your prototype: you have a pipeline for loading the data, you pre-process it, transform it as necessary and are able to train a model. Now how do you move on from here?
The first step is to understand the differences between programming in Jupyter and using an IDE (PyCharm will be our example). Jupyter is great for the following:
- Prototyping: You can throw code together quickly and produce something workable in a matter of hours.
- Direct output: Since your code runs in cells, the output of every cell is displayed directly after the cell runs. This enhances your ability to produce prototypes or try out new approaches, since you get feedback fast.
- Visualization: Jupyter is great for visualization. Since you can easily include libraries such as matplotlib, you not only get feedback on variable values, you can also produce plots, charts, tables and much more.
- Variables in memory: this is a special feature of Jupyter that needs to be highlighted. Variables of a notebook are kept in memory as long as they are not inside a method. This is true for any Python script that is run directly, but relying on it is encouraged in Jupyter due to the prototyping nature of notebooks.
You can see that Jupyter is a great tool for exploration and prototyping, and I strongly suggest that you use it when starting out in Data Science projects. But after the first working prototype it is time to bring some software craftsmanship into your project. Using an IDE offers the following advantages:
- Modularity: You can easily navigate through your code, be it methods, classes or modules (even libraries). This reinforces a modular approach to your logic, meaning that you divide your code into components.
- Code quality: Your IDE offers many tools for enhancing code quality, such as enforcing coding guidelines (e.g. PEP 8), highlighting syntax errors immediately and flagging unused variables, to name just a few.
- Refactoring: Refactoring means restructuring your code without changing the underlying logic. Simple refactoring tasks such as renaming variables or methods are straightforward with copy and paste, but moving a method implementation from one module to another is more difficult. IDEs offer tools to support you in this.
- Git support: Version control is very important when projects have more than a single person working on them, and even then I would strongly suggest using a VCS (version control system). Git is probably the most famous distributed VCS currently available, and your IDE ideally has an integration for it.
- Code inspection: Whenever you have bugs or performance issues you can do much more than add print statements. Stepping through critical code line by line with the debugger or measuring performance with a profiler are only two of the things your IDE supports.
The following comparison highlights the differences between Jupyter and an IDE.
How do you start utilizing IDEs more and move on from Jupyter-only? In my opinion, the simplest approach to achieve this is the following.
- Methods: Divide your code into methods instead of having “code prose” (a continuous script running everything line by line). Split your logic into methods with clearly defined inputs and outputs. Try to keep things abstract; that way you can exchange one method for another without having to rewrite large parts of your code.
- Modules: Group your methods together and create modules or classes out of them. Classes are needed when you have multiple objects that you want to manipulate individually, each with its own attributes. Modules are more generic and group together larger sets of methods that all serve a specific goal.
- Refactoring: Refactoring is restructuring or refining your code. There is no perfect code, and a coding project is in a sense never finished. That’s why you should try to minimize dependencies between methods in a way that allows you to restructure your code (e.g. exchange one normalizing method for another). Refactoring should occur regularly, e.g. 20 minutes every day or a few hours every Friday (“Refactoring Friday”).
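The steps above can be sketched in a few lines. This is a hypothetical example (the function names and the min-max normalization are illustrative choices, not a prescribed pipeline): each step lives in its own method, and the normalizer is passed in as a parameter so it can be exchanged without touching the rest of the pipeline.

```python
def load_data(path):
    """Read one numeric value per line from a text file."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def normalize_min_max(values):
    """Scale values into [0, 1]; easy to swap for another normalizer."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def run_pipeline(path, normalize=normalize_min_max):
    """Compose the steps; the normalizer is injected, not hard-coded."""
    return normalize(load_data(path))
```

Because `run_pipeline` only depends on the *interface* of its normalizer (a list in, a list out), swapping in a z-score normalizer later is a one-line change at the call site.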
These are of course only the first steps along your journey, but they should greatly help you and enhance code quality and code performance.
Knowing how to proceed from Jupyter to a software tool is one thing, but you have to practise this knowledge in your daily work. For this, you have a guiding principle: “clean code”. The book Clean Code by Robert Martin gives you language-agnostic principles and guidelines to follow. Here are a few things you can take away without having to read the book:
- Variable, module and method names should reflect what is done in these modules/methods or what the variable stores.
- Keep methods short and concise, meaning: a method has a clearly defined scope and does just one thing.
- DRY: Don’t repeat yourself. Knowledge inside the code should be kept in one place. If you do the same thing more than twice, make a method out of it.
- KISS: Keep it simple, stupid. Why overcomplicate things? Try to strive for the simplest solution possible.
- Refactor regularly. Your software is growing and you constantly get new insights or ideas to try out. Take time to refactor what you did and tidy your code up.
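As a small illustration of the DRY principle (the cleaning rule and names here are made up for the example): instead of copy-pasting the same string-cleaning expression wherever it is needed, the knowledge of what a “clean” name means lives in exactly one method.

```python
# Before (repeated in several places):
#   name_a = raw_a.strip().lower().replace("  ", " ")
#   name_b = raw_b.strip().lower().replace("  ", " ")

def clean_name(raw):
    """Single place that defines what a 'clean' name means."""
    return raw.strip().lower().replace("  ", " ")

name_a = clean_name("  Alice SMITH ")
name_b = clean_name(" Bob  Jones")
```

If the definition of “clean” ever changes (say, also removing punctuation), there is now only one method to update.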
A very important topic is code versioning. How do you keep track of what was done to the code over time? For this, version control software (VCS) is needed. Although different types exist (centralized and decentralized), Git is probably the most famous among them.
Git in its most basic form follows a distinct pattern, and that is more than enough to get you started. Git consists of a remote repository, the shared code base all contributors pull from, and a local repository, the one you are working in. The following commands get you started:
- git pull: fetching and merging changes from the remote repository.
- git add: staging files that were changed for the next commit.
- git commit: recording the staged files as a commit in your local repository.
- git push: publishing committed changes to the remote repository.
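A minimal session putting these commands together might look like this (the file name and commit message are placeholders; the demo creates a throwaway local repository, and the pull/push steps are shown commented out because they need a configured remote):

```shell
git init demo && cd demo                 # throwaway repository for the demo
git config user.email "you@example.com"  # identity needed for the commit
git config user.name "You"
echo "print('hello')" > preprocessing.py
git add preprocessing.py                 # stage the changed file
git commit -m "Add preprocessing script" # record it in the local repository
# With a remote configured, you would sync like this:
# git pull origin main
# git push origin main
```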
What you need to take away from this is that code versioning is needed even if you work alone on your code since you might want to roll back to a previous version or you want to keep different branches in parallel.
Before we conclude this series I want to touch on a topic that is quite important but sometimes overlooked: how do you share your results? A Jupyter notebook can only get you so far, and in a professional context, sharing results is at least as valuable as getting them in the first place. Sharing results in this context means:
- how do you make your model accessible inside a software application?
- how do you visualize results of your model?
First things first: making your model accessible inside a software application is done through software interfaces. You can code a method as a wrapper around your model and thus have an interface. But this alone only works in a monolithic application, meaning that code, database and frontend all run on the very same machine, which is quite unrealistic in practice.
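Such a wrapper might look like the sketch below. Everything here is hypothetical (the class name, the `predict_label` method and the dummy model are illustrative): the point is that the rest of the application talks only to the wrapper, never to the model object directly, so the model can be retrained or replaced behind the interface.

```python
class SentimentService:
    """Facade that hides the concrete model behind one method."""

    def __init__(self, model):
        self._model = model  # e.g. a scikit-learn pipeline

    def predict_label(self, text):
        """The only entry point the application uses."""
        return self._model.predict([text])[0]

class DummyModel:
    """Trivial stand-in for a trained model, for demonstration only."""
    def predict(self, texts):
        return ["positive" if "good" in t else "negative" for t in texts]

service = SentimentService(DummyModel())
label = service.predict_label("a good result")
```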
In the most basic form you have a database, backend application (containing your model and any other business logic) and a frontend to communicate with the backend. A very simple schematic of this infrastructure is displayed here:
The simplest way to accomplish this is to write a backend application that contains your model and provides interfaces. Python offers wrappers for popular databases (PyMongo for MongoDB or psycopg for PostgreSQL) so that you can access them directly.
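These wrappers share a common connect/cursor/execute shape. The sketch below uses Python's built-in sqlite3 module so it is self-contained; psycopg for PostgreSQL follows essentially the same pattern, with `connect()` taking connection credentials instead of a file path. Table and column names are made up for the example.

```python
import sqlite3

# In-memory database stands in for a real database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY, label TEXT)")
cur.execute("INSERT INTO predictions (label) VALUES (?)", ("positive",))
conn.commit()  # persist the write
rows = cur.execute("SELECT label FROM predictions").fetchall()
conn.close()
```

Note the parameterized query (`?` placeholder): passing values this way instead of formatting them into the SQL string avoids injection problems, and the same idiom works with psycopg (`%s` placeholders).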
On the other end, providing an interface is best done by creating REST APIs and running your application via Flask or Django as a web server. Separate articles on how to accomplish this will follow, so stay tuned!
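To give a flavour of what such a REST interface can look like, here is a minimal Flask sketch. The endpoint name and the inline `predict` function are placeholders; in a real backend the latter would call your wrapped model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(text):
    """Placeholder for the real model call."""
    return "positive" if "good" in text else "negative"

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    """Accept JSON {"text": ...} and return the model's label."""
    payload = request.get_json()
    return jsonify({"label": predict(payload["text"])})

# For local development you would start it with: flask --app <module> run
```

A client would then POST `{"text": "a good result"}` to `/predict` and receive `{"label": "positive"}` back, without ever knowing which model or library sits behind the endpoint.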
Wrapping up this series, what did we cover so far?
Part 1 showed the basics of setting up a working environment in Python and modules for initial data exploration such as Jupyter notebooks and Pandas. Part 2 offered deeper insights into data pre-processing in general and for the domains of image, language and audio data in particular; it also covered libraries for machine learning models. This part covered how to move on from Jupyter, coding standards to keep in mind and a general approach to architecture.