Jupyter Notebooks
(Image Credit: Wikimedia Foundation, CC)
(URL: https://github.com/clizarraga-UAD7/Workshops/wiki/Jupyter-Notebooks)
Getting started with Jupyter Notebooks: A Python Programming Environment for Data Analysis and Modeling.
There are multiple scientific programming environments for carrying out data analysis and modeling. Python is one of the top-rated general-purpose programming languages and is widely used in Data Science. Using Python with a Jupyter Notebook provides an efficient way of presenting the input code and the results, offering a reproducible interactive computing environment as well as an explainable technical document for readers to follow. In this workshop, participants will use a Python programming environment based on Jupyter Notebooks for performing data analytics and data visualization.
Learning objectives
- List of offline and online Jupyter Notebooks that support Python programming.
- Understand how to run and start a Jupyter Notebook session.
- Describe the Jupyter Notebooks user interface.
- Identify main Python general libraries and their purpose.
- Demonstrate how to read data files into Jupyter using the Pandas library.
- Try out NumPy library for numerical/mathematical functions.
- Show the use of SciPy library for scientific computing capabilities.
- Exhibit the use of Matplotlib and Seaborn visualization libraries to make simple data plots.
- Indicate how to save a Jupyter Notebook and end your Jupyter session.
Please see the Slides of the Workshop
What are Jupyter Notebooks?
Jupyter Notebooks is a product of the Jupyter Project, a community dedicated to producing open-source interactive development environments for science and scientific computing. It supports a group of programming languages, mainly Julia, Python and R. Today, the community offers 150+ programming language options that run on Jupyter Notebooks.
In what sort of Applications are Jupyter Notebooks being used?
You can find a wide variety of scientific applications where Jupyter Notebooks are being used:
- Artificial Intelligence and Machine Learning
- Biology, Chemistry and Physics
- Earth Sciences and Geospatial Analysis
- Economics and Finance
- Linguistics, Natural Language Processing and Text Mining
- Mathematics and Statistics
- Psychology and Neurosciences
- Signal, Sound and Video Analysis
- Many other...
A Jupyter Notebook is composed of two types of cells: a Code cell, where the user inputs segments of code, and a Text cell, where the user can input text segments enhanced with the Markdown language.
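To make the two cell types concrete: a notebook file (.ipynb) is a JSON document that holds a list of cells. The sketch below builds a minimal notebook by hand, with one Text (Markdown) cell and one Code cell, using only the Python standard library. The cell contents are illustrative assumptions; in practice the Jupyter interface creates and edits this file for you.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell, one Code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My analysis\n", "Text cells accept *Markdown*."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not executed yet
            "outputs": [],            # filled in when the cell runs
            "source": ["print('Hello from a code cell')"],
        },
    ],
}

# Serialize to the JSON text that would be stored in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Saving notebook_json to a file with the .ipynb extension should give a file that Jupyter can open, although this has not been tested against every Jupyter version.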
The next generation of Jupyter Notebooks is named Jupyter Lab. It is the interface used now, even when people refer to it as a Jupyter Notebook (see the Jupyter Lab Documentation).
Jupyter Notebooks started as IPython Notebooks back in 2011, and became the Jupyter Project in 2015. IPython itself was first designed around 2001 by Fernando Perez, of UC Berkeley, who wanted to replicate the notebook interface of Wolfram Mathematica (Mathematica Notebook Examples).
There are other Notebooks that are used for more special programming environments:
- BeakerX. BeakerX is a collection of kernels and extensions to the Jupyter interactive computing environment. BeakerX supports Java Virtual Machine, Python and JavaScript. Also it supports Groovy, Scala, Clojure, Kotlin, Java, and SQL.
- Apache Zeppelin. A web-based multipurpose notebook designed for big data computing environments. It supports Apache Spark, Apache Flink, Python, R, JDBC, the Markdown language and shell scripts.
If you want to work locally on a computer, there are several options for installing Jupyter Notebooks. We mention two of them.
First method: Anaconda Python. After downloading and installing it, run the command jupyter lab from a terminal window. A new browser tab will open with Jupyter. Any package that is not yet installed can be added with the conda install command from a Command Line Interface (i.e. a terminal window).
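Before installing anything, it can help to see which libraries are already available in the current environment. The following is a minimal sketch using only the standard library; the package list is an assumption based on the libraries covered in this workshop, and the helper name is_installed is hypothetical.

```python
import importlib.util

# Import names of the libraries used in this workshop (illustrative list).
packages = ["numpy", "pandas", "scipy", "matplotlib", "seaborn"]


def is_installed(name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(name) is not None


status = {name: is_installed(name) for name in packages}
for name, found in status.items():
    print(f"{name:12s} {'installed' if found else 'missing'}")
```

Any package reported as missing can then be installed from the terminal with conda install followed by the package name.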
Second method: Jupyter SciPy Notebook - Docker image. This is a basic Jupyter Notebook with common Python libraries installed. First, install Docker Desktop if you don't have it already. Then download the latest Jupyter Notebook image with SciPy installed from DockerHub. To do this, run the following command in a terminal:
docker pull jupyter/scipy-notebook:latest
You will see that it is downloading a series of different software layers that compose the Docker image.
After downloading the image, go to your desired working directory on your computer. The next step is to launch a Docker container by running:
docker run -it --rm -v "${PWD}":/home/jovyan/work -p 8888:8888 jupyter/scipy-notebook
Your files will be stored in the local directory from which you launched the Docker container, which is mounted at /home/jovyan/work inside the container. You can find more information about Docker containers in these notes.
Then open a new browser tab with the URL localhost:8888. It will ask you for a token; copy the 48-character string that appears after token= in the terminal output. Alternatively, copy the full URL shown in the running terminal, similar to http://127.0.0.1:8888/lab?token=7d5a143bfe924787eba5e20110407204854bc7664fa8f1d4, and paste it into a new browser tab.
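To illustrate how that URL is put together, the sketch below splits the example address into its host, port and token parts with the standard library. It only uses the sample URL shown above; your own terminal will print a different token.

```python
from urllib.parse import urlparse, parse_qs

# Example URL of the form printed by the Jupyter server in the terminal.
url = "http://127.0.0.1:8888/lab?token=7d5a143bfe924787eba5e20110407204854bc7664fa8f1d4"

parsed = urlparse(url)
token = parse_qs(parsed.query)["token"][0]

print(parsed.netloc)  # host and port the server listens on
print(token)          # the value to paste when the login page asks for a token
print(len(token))     # number of characters in the token
```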
There are several options of Docker images of Jupyter Notebooks. The Docker image jupyter-scipy-notebook has a size of 948MB, and has an Ubuntu OS underneath.
The Jupyter Datascience Notebook - Docker Image has a larger size of 1.49GB. It includes Python, R and Julia programming language options.
There are several cloud-based options for using Jupyter Notebooks via the web browser, which need no installation or configuration.
Open Platforms Use:
- Google Colab. [ Getting Started ].
- MyBinder. [ Getting Started ].
Reserved to University of Arizona users:
- Cyverse.Org. Cyberinfrastructure support for Life Science research, National Science Foundation.
- UA HPC. University of Arizona High Performance Computing.
Python has a collection of Libraries that are used in data analysis and modeling.
- NumPy. The basic library for scientific computing with Python. It includes a broad set of mathematical functions, random number generators, linear algebra tools, Fourier transforms, and more. [ NumPy Tutorials | NumPy User Guide ].
- Pandas. A data manipulation and analysis tool. [ Pandas Tutorial | Pandas User Guide ].
- Matplotlib. A complete library for creating data visualizations. [ Matplotlib Tutorials | Matplotlib User Guide | Matplotlib Example Gallery ].
- Seaborn. A statistical visualization library based on Matplotlib. [ Seaborn Tutorial | Seaborn Example Gallery ].
- SciPy. A collection of algorithms for numerical computing. [ SciPy User Guide | SciPy Cookbook ].
- Scikit-Learn. A collection of Machine Learning algorithms in Python. [ Scikit-Learn User Guide | Scikit-Learn Examples ].
- Scikit-Image. A collection of algorithms for image processing.
- Natural Language Toolkit - NLTK. [ NLTK Book | NLTK Examples ].
- Hugging Face. A Machine Learning library. [ Documentation ].
- More available Python Libraries: PyPi - Python Packages Index | Anaconda Packages.
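As a brief illustration of the first two libraries in the list, the sketch below does a vectorized calculation with NumPy and reads a small table with Pandas. The CSV text is made-up sample data standing in for a real data file, and the column names are assumptions chosen for the example.

```python
import io

import numpy as np
import pandas as pd

# NumPy: vectorized math on arrays, no explicit loop needed.
x = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced points from 0 to 1
y = x ** 2

# Pandas: read tabular data. A CSV string stands in for a data file here;
# with a real file you would pass its path to pd.read_csv instead.
csv_text = "city,temp_c\nTucson,38.5\nPhoenix,41.0\nFlagstaff,27.5\n"
df = pd.read_csv(io.StringIO(csv_text))

mean_temp = df["temp_c"].mean()

print(y)
print(df)
print(mean_temp)
```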
Other specialized Machine Learning Libraries in Python.
The development of specialized applications of big data analysis and machine learning modeling is very dynamic. There is also a large set of libraries, of which we mention only a few.
- TensorFlow 2. [ TensorFlow 2 Overview ].
- Keras. [ Keras Overview ].
- PyTorch. [ PyTorch Tutorials ].
- [More...]
You can download the following Notebooks to your Google Colab session, to follow the tutorial and do the proposed exercises found there.
- Basic introduction to Python
- NumPy Basic Tutorial.
- Reading data files in Jupyter Notebooks
- More examples
- Examples from Python Data Science Handbook. Jake VanderPlas.
There are many sources of popular datasets used for learning Data Analysis in Python.
- Google Dataset Search
- Kaggle
- Papers with Code
- University of California at Irvine
- US Census Bureau
- US Data.Gov
More datasets: Kdnuggets: Complete Collection of Data Repositories (Part 1)
Some libraries in Python come with a collection of datasets that help us practice data analysis.
- pydataset: Package with 700+ datasets,
- seaborn: Data Visualization package,
- sklearn: Machine Learning package,
- statsmodels: Statistical Model package,
- nltk: Natural Language Tool Kit package.
For example, the pydataset library:
# Install pydataset package (if not installed)
!pip install pydataset
# Import package
from pydataset import data
# Check out datasets
data()
dataset_id title
0 AirPassengers Monthly Airline Passenger Numbers 1949-1960
1 BJsales Sales Data with Leading Indicator
2 BOD Biochemical Oxygen Demand
3 Formaldehyde Determination of Formaldehyde
4 HairEyeColor Hair and Eye Color of Statistics Students
... ... ...
752 VerbAgg Verbal Aggression item responses
753 cake Breakage Angle of Chocolate Cakes
754 cbpp Contagious bovine pleuropneumonia
755 grouseticks Data on red grouse ticks from Elston et al. 2001
756 sleepstudy Reaction times in a sleep deprivation study
# Load as a dataframe
df = data('iris')
df.head(n=25)
Using seaborn datasets:
# Import seaborn
import seaborn as sns
# Check out available datasets
print(sns.get_dataset_names())
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']
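The sklearn package mentioned in the list of libraries also bundles a few small datasets that load without a network connection. A minimal sketch, assuming scikit-learn 0.23 or later and pandas are installed:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of plain NumPy arrays.
iris = load_iris(as_frame=True)
df_iris = iris.frame  # four feature columns plus a 'target' column

print(df_iris.shape)
print(df_iris.head())
print(list(iris.target_names))
```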
- Python NumPy Tutorial. Justin Johnson. CS231n: Deep Learning for Computer Vision. Stanford University.
- NumPy Illustrated: A visual guide to NumPy. Lev Maximov.
- Python for Data Analysis, 3rd. Edition. Wes McKinney (O'Reilly Media, 2021).
- Python Data Science Handbook. Jake VanderPlas (O'Reilly Media, 2016).
- Google Colab Tips. Amit Chaudhary
- VSCode on Google Colab. Amit Chaudhary
Extra:
- Nbtutor. Visualize Python code execution (line-by-line) in Jupyter Notebooks.
- Create a presentation from a Jupyter Notebook. Aleksandra Płońska, Piotr Płoński.
More on the Data Science Learning Resources Wiki
Created: 01/22/2022 (C. Lizarraga); Last update: 10/04/2022 (C. Lizarraga)
University of Arizona, D7 Data Science Institute, 2022.
Carlos Lizárraga, Data Lab, Data Science Institute, University of Arizona.