Python is a general-purpose programming language that has become a leading language in the scientific computing and data science landscape.
Python's success has been driven primarily by its intuitive syntax, flexibility, and extensibility: it is easy to learn, it can be used for a wide variety of tasks, and it is straightforward to incorporate numerical computing libraries written in high-performance languages like C++ and Fortran. Additionally, Python benefits from a large and strong community of developers committed to free and open source software.
Python libraries for Scientific Computing and Data Science are extensive, secure, and mature. Companies like Google, Microsoft, Apple, Dropbox, and Netflix use Python for several critical applications, in particular those based on data processing and machine learning. Companies like Instagram and YouTube were built from the ground up in Python.
In academia, Python is the de facto standard for research and applications in artificial intelligence (AI), machine learning (ML), and big data analysis. The major frameworks for AI and ML, such as TensorFlow, Keras, PyTorch, and MXNet, provide Python as their primary interface.
Python also has a strong and continuously growing presence in the Data Visualization field, with libraries like Matplotlib, Seaborn, Altair, and Plotly.
Python's presence in statistics is weaker and less mature than that of languages like R, Stata, and SPSS. Libraries like Statsmodels and PyMC3 are in constant development and are helping to close the gap between Python and other frameworks. This is not to say that you cannot do statistics in Python. As a matter of fact, you can do virtually everything that can be done in other languages, but it may require more effort or more advanced knowledge of Python.
The two pillars of Scientific Computing and Data Science in Python are NumPy and Pandas. NumPy is a library for numerical computing, particularly matrix-like computation. Pandas is a library for data analysis and data frame manipulation. NumPy and Pandas are commonly used in tandem as the base for any kind of data processing and modeling.
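As a rough illustration of how the two libraries work in tandem, here is a minimal sketch. The data, the column names (`height`, `width`, `area`), and the variable names are made up for this example and are not taken from the notebooks.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: 3 samples (rows) by 2 features (columns).
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# NumPy handles the numerical side: column-wise means without an explicit loop.
column_means = values.mean(axis=0)

# Pandas wraps the same array with labels for data-frame style manipulation.
df = pd.DataFrame(values, columns=["height", "width"])
df["area"] = df["height"] * df["width"]

# Move back to a plain NumPy array when only the numbers are needed.
areas = df["area"].to_numpy()

print(column_means)
print(df)
print(areas)
```

The pattern to notice is the round trip: arrays go into a data frame for labeled manipulation and come back out as arrays for numerical work.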
In this project, I focus on the fundamentals of NumPy and Pandas for Data Science, Machine Learning, and Scientific Computing in Python. I also plan to introduce the UNIX shell and Python basics so that you can use NumPy and Pandas effectively.
If you want to acquire the math fundamentals before approaching Python and its libraries, I am working on another project that covers them (here). The Linear Algebra chapter is the most important and can be found here.
I'll cover the following topics:
- Introduction to the UNIX shell and Bash
- Introduction to Jupyter Notebooks: setup, user guide, and best practices
- Introduction to Python basics
- Introduction to NumPy
- Introduction to Pandas
The content of each section will be delivered as Jupyter Notebooks and/or Markdown files.
There are two ways to access the contents: remotely and locally.
The remote option does not require any installation or configuration on your part. It is click and play. However, you will have much less control over the Python setup, you will probably have to use a slow machine in the cloud, and you won't be able to save and recover anything you change in the Notebooks (you will have to download the Notebooks with the changes). Since this is a beginner workshop, I highly recommend using this option, as performance and control are not issues at this point.
The local option requires following a series of instructions to download and set up everything on your computer. I provide the code to copy, paste, and run in the terminal, but these instructions only work on Unix-like systems such as Linux and macOS. If you are a Windows user, you have two options:
- download and install a terminal emulator like Git Bash or Cygwin
- install the Windows Subsystem for Linux (WSL).
If you are a beginner, Git Bash or Cygwin should work just fine. I do not advise trying WSL unless you feel comfortable using the terminal.
This will build an online development environment with the repository contents. Beware that it may take 2-3 minutes to be ready.
Then navigate to the notebooks directory and click on TUTORIAL-NAME.ipynb.
To obtain the files locally, run this in the command line:
git clone https://github.com/pabloinsente/intro-sc-python.git
To set up your system, you need Python 3.6 installed on a Linux or Mac machine. Check that you have Python installed by running this in the terminal:
python3 --version
You should see something like:
Python 3.6.X # the X stands for any number
If you do not have Python 3.6 installed, go to the Python website, search for Python 3.6.8 under the "Looking for a specific release?" section, and follow the download and installation instructions.
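The version check above can also be done from inside Python, which is handy once you are working in a notebook. This is a small sketch; the helper name `meets_minimum` is hypothetical, and it assumes that versions newer than 3.6 are also acceptable, which the setup instructions do not state explicitly.

```python
import sys

# Minimum version taken from the setup instructions above.
MINIMUM_VERSION = (3, 6)


def meets_minimum(version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor, ...) tuple is at least `minimum`.

    Assumption: anything newer than the minimum is treated as acceptable.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = sys.version_info
    status = "OK" if meets_minimum(current) else "too old"
    print("Python {}.{}.{}: {}".format(
        current.major, current.minor, current.micro, status))
```

Comparing tuples rather than version strings avoids mistakes such as treating "3.10" as older than "3.6".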
Once you have Python installed, it is recommended to use a virtual environment before installing the dependencies. To do this, navigate into the cloned repository in the console:
cd intro-sc-python
Note that you may need to change the path to cd into the directory.
Then run this inside that directory to create the virtual environment:
python3 -m venv venv
And activate your environment by running:
source venv/bin/activate
Make sure to have the latest pip version:
pip install --upgrade pip
Install dependencies by running:
pip install -r requirements.txt
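To confirm that the installation worked, you can check from Python whether the main packages are importable inside the activated environment. This is only a sketch: the package list below is an assumption based on the topics of this project, and requirements.txt remains the authoritative list. The helper name `find_missing` is hypothetical.

```python
import importlib.util

# Assumed package names; the authoritative list lives in requirements.txt.
EXPECTED_PACKAGES = ["numpy", "pandas", "jupyterlab"]


def find_missing(packages):
    """Return the names in `packages` that cannot be found for import."""
    return [name for name in packages
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(EXPECTED_PACKAGES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All expected packages are installed.")
```

`importlib.util.find_spec` only looks a module up without importing it, so the check is quick even for large libraries.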
To run a notebook, launch Jupyter Lab from the root of the repository, pointing it at the file in the notebooks directory:
jupyter lab ./notebooks/TUTORIAL-NAME.ipynb