
How to Un-Delete Your Jupyter Notebooks

The metadata science of hacking Jupyter notebooks with SQL and command line fu


Mar 26 · 7 min read


Who doesn’t love Jupyter notebooks? They’re interactive, giving you the instant gratification of immediate feedback. They’re extensible: you can even deploy them as websites. Most importantly for data scientists and machine learning engineers, they’re expressive: they span the space between the scientists and engineers who manipulate data and the lay audience that consumes and wants to understand the information that data represents.

But Jupyter notebooks have their drawbacks. They’re big JSON files that store the code, markdown, input, output, and metadata of every cell that you run. To understand what I mean, here’s a short notebook I wrote to define and test the sigmoid function.

A Jupyter notebook.

And here’s what (part of) it looks like when IPython isn’t rendering it (I’ve abridged all but the first actual input cell, because even for a short notebook it’s long and ugly):

{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "version": "3.6.8-final"
    },
    "orig_nbformat": 2,
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "npconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": 3,
    "kernelspec": {
      "name": "python36864bitvenvscivenv55fc700d3ea9447888c06400e9b2b088",
      "display_name": "Python 3.6.8 64-bit ('venv-sci': venv)"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import random"
      ]
    },
    ...
}

It’s kind of hard to version control them as a result. It also means there’s a lot of junk data in there you usually don’t care very much about saving, like the cell execution count and outputs.

The next time you wonder why your Jupyter notebook is running so slowly, open it in a plain-text editor and see how many massive dataframes are just hanging out in your notebook’s stored outputs.
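If you’d rather not eyeball the raw JSON, here is a minimal sketch of the same check in Python, using only the standard library. The function names are mine, not part of Jupyter, and the sketch assumes an nbformat 4 notebook laid out like the one shown above.

```python
import json


def output_sizes(notebook):
    """Return (cell index, characters of serialized output) for each code cell."""
    sizes = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") == "code":
            sizes.append((index, len(json.dumps(cell.get("outputs", [])))))
    return sizes


def strip_outputs(notebook):
    """Return a copy of the notebook with outputs and execution counts cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Usage (hypothetical filename):
# with open("sigmoid.ipynb", encoding="utf-8") as handle:
#     notebook = json.load(handle)
# print(sorted(output_sizes(notebook), key=lambda pair: pair[1], reverse=True)[:5])
```

Stripping outputs this way makes a notebook friendlier to version control, but it also throws away exactly the junk you might want later, so do it on a copy.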

Under the right circumstances, though, all that junk can look like precious gems. Those circumstances usually involve:

  • Accidentally closing a notebook without saving it
  • Hitting the wrong keyboard shortcut and deleting important cells
  • Opening the same notebook in multiple browser windows and overwriting your own work

Anyone who’s ever strayed into the dangerous territory of doing their development in an IPython notebook has done these things at least once, probably at the same time.

If this is you, and you’re here because you Googled “recover deleted jupyter notebook refreshed browser window”, don’t panic. First I’m going to tell you how to fix it.* Then I’m going to tell you how to prevent it from happening again.

*Yes, you can fix it. (Probably.)

a guide, made for you by me, in Jupyter notebook, about Jupyter notebook

Requirements:

  • A Python virtual environment with Jupyter, IPython (with nbformat and nbconvert) and jupytext installed (preferably a fresh venv, so you can test things out)
  • Working installation of sqlite3 (optional but recommended: a database browser like SQLiteStudio)
  • Hope and determination

Scenario: You accidentally deleted a couple of cells in an active notebook

(Unscientific) estimated probability of recovery: 90%

Relative difficulty: Easier

In the least-worst case, you’ve hit x on a cell you didn’t actually want to delete, and now you want to get back its code or data.

Method 1: In and Out

When you use a notebook, the IPython kernel runs your code. The IPython kernel is a process that is separate from your Python interpreter. (That’s also why you need to link a new kernel to a new virtual environment. The two are not automatically connected.) It sends and receives messages, like your code cells, using JSON.

When you run a cell or hit “save”, the notebook server sends your code as JSON to a notebook on your computer that stores your input and output. So the little words In and Out next to your cell aren’t just words, they’re containers holding your session history: In is a list of the inputs you ran, and Out is a dictionary of results keyed by execution number. You can print them out and index into them.

IPython In[] container
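To make that concrete, here is a small sketch that turns the two containers into recoverable text. The helper name dump_session is my own invention. It takes the containers as arguments so it can be tried outside a notebook too, and it assumes In is a list of source strings and Out maps execution numbers to results.

```python
def dump_session(inputs, outputs):
    """Join every non-empty input, appending any saved output as comments."""
    chunks = []
    for number, source in enumerate(inputs):
        if not source.strip():
            continue  # skip blanks, such as the empty placeholder at In[0]
        chunk = f"# In[{number}]\n{source}"
        if number in outputs:
            # repr() may span several lines (think dataframes), so comment each one
            for line in repr(outputs[number]).splitlines():
                chunk += f"\n# Out[{number}]: {line}"
        chunks.append(chunk)
    return "\n\n".join(chunks)


# Usage, inside the notebook whose kernel is still alive:
# print(dump_session(In, Out))
```

This only works while the kernel that ran your cells is still running. Once it’s gone, In and Out go with it.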

Method 2: IPython Magic

Use the %history line magic to print your input history (last in, first out). This powerful command grants you access to your current and past sessions by absolute or relative number.

If the current IPython process is still connected, and you’ve installed nbformat in your virtual environment, execute this code in a cell to recover your notebook:

>>> %notebook your_notebook_filename_backup.ipynb

This magic renders the entire current session history as a new Jupyter notebook.

Well, that wasn’t so bad.

This won’t always work, and you might need to expend a bit of effort weeding extraneous cells from the output.

There are lots more things you can do with the history magic. Here are a few recipes I find useful:

  • %history -l [LIMIT]: get the last n inputs
  • %history -g -f FILENAME: write your entire saved history to a file
  • %history -n -g [PATTERN]: search your history with a glob pattern and print the session and line numbers
  • %history -u: get only the unique history from the current session
  • %history [RANGE] -t: get the native history, a.k.a. the IPython-generated source code (good for debugging)
  • %history [SESSION]/[RANGE] -p -o: print input and output with the >>> prompt (nice for readmes and documentation)

If you’ve been working in a really big data science notebook for a long time, the %notebook magic strategy might produce a lot of noise that you don’t want. Use the other parameters to whittle down the output of %history -g, then use jupytext (explained below) to convert the results.

Scenario: You closed an unsaved notebook

(Unscientific) estimated probability of recovery: 70–85%

Relative difficulty: Harder

Remember how we said version control is hard with notebooks? A kernel can connect to more than one frontend at the same time. Which means those two browser tabs with the same notebook open can access the same variables. Which is how you overwrote your code in the first place.

IPython stores your session history in a database. By default, you can find it under your home directory in a folder called .ipython/profile_default.

$ ls ~/.ipython/profile_default
db  history.sqlite  log  pid  security  startup

Back up history.sqlite to a copy.

$ cd ~/.ipython/profile_default
$ cp history.sqlite history-bak.sqlite

Open the backup, either in a database browser or via the sqlite3 command line interface. It has three tables: history, output_history, and sessions. Depending on what you want to recover, you may need to join all three, so brush up on your SQL.
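If you prefer Python to a database browser, the sketch below lists each session with the number of input lines saved for it, which is usually enough to spot the one you lost. The function name is mine. The column names (session, start, end, num_cmds, line) are what I see in my own history database, so treat them as an assumption and check yours if the query fails.

```python
import sqlite3


def list_sessions(connection):
    """Return (session, start, end, num_cmds, lines_saved) rows, newest first."""
    query = """
        select s.session, s.start, s."end", s.num_cmds, count(h.line) as lines_saved
        from sessions as s
        left join history as h on h.session = s.session
        group by s.session
        order by s.session desc
    """
    return connection.execute(query).fetchall()


# Usage, always against the backup copy rather than the live file:
# from pathlib import Path
# backup = Path.home() / ".ipython" / "profile_default" / "history-bak.sqlite"
# connection = sqlite3.connect(str(backup))
# for row in list_sessions(connection)[:10]:
#     print(row)
# connection.close()
```

A session with no end timestamp but plenty of saved lines is a good candidate for the kernel that died on you.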

SQLiteStudio view of IPython history

Eyeball it

If you can tell from the database browser GUI which session number in the history table has your code, then your life is a bit simpler.

SQLiteStudio view of history table

Either execute the SQL command in the browser or on the command line:

sqlite3 ~/.ipython/profile_default/history-bak.sqlite \
  "select source || char(10) from history where session = 1;" > recovered.py

All this does is specify the session number and the filename to write to (in the example given, 1 and recovered.py) and select your source code from the database, separating each block with a newline character (which in ASCII is 10).
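The same export can be sketched in Python with the standard sqlite3 module, which is handy if the sqlite3 command line tool isn’t installed. The function name and the with_line_numbers flag are my own additions, and the sketch assumes the history table has session, line, and source columns.

```python
import sqlite3


def session_source(connection, session, with_line_numbers=False):
    """Return one session's inputs as text, one blank line between blocks."""
    rows = connection.execute(
        "select line, source from history where session = ? order by line",
        (session,),
    ).fetchall()
    blocks = []
    for line, source in rows:
        prefix = f"# Line: {line}\n" if with_line_numbers else ""
        blocks.append(prefix + source)
    text = "\n\n".join(blocks)
    return text + "\n" if text else ""


# Usage (paths follow the examples in this article):
# from pathlib import Path
# backup = Path.home() / ".ipython" / "profile_default" / "history-bak.sqlite"
# connection = sqlite3.connect(str(backup))
# Path("recovered.py").write_text(session_source(connection, 1), encoding="utf-8")
# connection.close()
```

Passing the session number as a query parameter, rather than pasting it into the SQL string, keeps the function reusable for any session.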

If you wanted to select the line number as a Python comment, you could do so with a query like:

"select '# Line:' || line || char(10) || source || char(10) from history where session = 1;"

Once you have a Python script, you can pretty much breathe easy. But you could turn it back into a notebook with jupytext, a miraculous tool that can convert plaintext formats to Jupyter notebooks.

jupytext --to notebook recovered.py
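If jupytext isn’t available, a bare-bones notebook can also be assembled by hand, since a notebook is just JSON. This is a rough sketch rather than a replacement for jupytext: the function names are mine, every recovered block becomes a code cell, and I’m assuming the nbformat 4.2 layout shown at the top of this article.

```python
import json


def sources_to_notebook(sources):
    """Build a minimal nbformat 4 notebook dict with one code cell per source."""
    cells = [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": source,
        }
        for source in sources
        if source.strip()  # drop empty inputs
    ]
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 2}


def write_notebook(sources, path):
    """Write the recovered sources to an .ipynb file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sources_to_notebook(sources), handle, indent=1)


# Usage (hypothetical filename):
# write_notebook(["import numpy as np", "np.zeros(3)"], "recovered.ipynb")
```

Jupyter will ask you to pick a kernel when you open the result, because the sketch leaves the notebook metadata empty.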

Not too terrible!

Scenario: You opened your notebook in multiple tabs, reloaded an old version, erased all your work, and killed your kernel

(Unscientific) estimated probability of recovery: 50–75%

Relative difficulty: Hardest

None of the above worked, but you’re not ready to give up yet.

Hard mode

Go back to whatever tool you’re using to navigate your history-bak.sqlite database. The queries you write will require creative search techniques that make the most of the information you have:

  • Timestamps for session starts (always non-null)
  • Timestamps for session ends (sometimes null in useful ways)
  • Output history
  • Number of commands executed per session
  • Your input (code and markdown)
  • IPython’s rendered source code

For example, you could find everything you wrote involving pytest this year with a query like:

select line, source
from history
join sessions
on sessions.session = history.session
where sessions.start >= '2020-01-01 00:00:00.000000'
and history.source like '%pytest%';
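Here is one way to wrap that kind of search in Python so the pattern and the cutoff date become parameters. The function name is hypothetical. The sketch assumes session start times are stored as text like '2020-01-01 00:00:00.000000', so that plain string comparison orders them correctly.

```python
import sqlite3


def search_history(connection, pattern, since):
    """Return (session, line, source) rows containing pattern, from sessions started on or after since."""
    query = """
        select history.session, history.line, history.source
        from history
        join sessions on sessions.session = history.session
        where sessions.start >= ?
          and history.source like ?
        order by history.session, history.line
    """
    return connection.execute(query, (since, f"%{pattern}%")).fetchall()


# Usage, against the backup copy:
# connection = sqlite3.connect("history-bak.sqlite")
# for session, line, source in search_history(connection, "pytest", "2020-01-01"):
#     print(f"# session {session}, line {line}\n{source}\n")
# connection.close()
```

Keeping the session number in the results tells you which session to export in full once you’ve found a familiar line.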

Once you’ve shaped your view to the rows you want, you can export it to an executable .py file as before, and convert it back to .ipynb with jupytext.

How to avoid needing this article

As you savor your relief at not having to rewrite your notebook from scratch, take a moment to ponder a few measures to guard against future expeditions into history.sqlite:

  • Don’t connect identical frontends to the same kernel. In other words, don’t keep the same notebook open in multiple browser tabs. Using Visual Studio Code as your Jupyter IDE largely eliminates this risk.
  • Back up your IPython history database file regularly just in case.
  • Convert your notebooks to plaintext whenever possible, at least for backup. Jupytext makes this almost trivial.
  • Use IPython’s %store magic to store variables, macros, and aliases in the IPython database. All you need to do is find your ipython_config.py file in profile_default (or run ipython profile create if you don’t have one), and add this line: c.StoreMagics.autorestore = True. You can store, alias, and access anything from environment variables to small machine learning models if you want to. Here’s the full documentation.
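For the backup habit, a minimal sketch might look like the following. It uses the backup method on sqlite3 connections (Python 3.7 and later) instead of a plain file copy, because a kernel may be writing to the database while you copy it. The function name, the backup folder, and the dated filename pattern are all my own choices.

```python
import sqlite3
from datetime import date
from pathlib import Path


def backup_history(source_path, backup_dir):
    """Copy a SQLite database into backup_dir under a dated filename."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target_path = backup_dir / f"history-{date.today().isoformat()}.sqlite"
    source = sqlite3.connect(str(source_path))
    target = sqlite3.connect(str(target_path))
    try:
        source.backup(target)  # copies a consistent snapshot of the database
    finally:
        target.close()
        source.close()
    return target_path


# Usage (the backup folder name is hypothetical):
# backup_history(
#     Path.home() / ".ipython" / "profile_default" / "history.sqlite",
#     Path.home() / "ipython-history-backups",
# )
```

Run it from a scheduled job or a shell startup script and you get at most one copy per day, since a second run on the same date overwrites that day’s file.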

What are some of your biggest challenges with Jupyter notebooks? Drop a comment on topics you’d like to tackle in future posts.
