Intermediate Python: NumPy
If you’ve recently completed a course or book on the basics of Python, and are now wondering where to go next, exploring different Python packages would be a natural next step. The NumPy package (short for Numerical Python) is pretty straightforward, yet is quite useful, especially for scientific computation, data science, and machine learning applications.
Many data analysis and machine learning Python libraries are built on top of NumPy, so mastering the basics is crucial to using those libraries successfully down the road. This article isn’t intended to serve as a comprehensive or in-depth resource for NumPy. Rather, it is an introduction to the package, and a nudge in the right direction for programmers who are newer to Python and may want to explore scientific or data science applications.
Prerequisites
To follow along with the code snippets in this article, you need Python installed on your machine, as well as the NumPy package. The command
pip install numpy
should do the trick, but if you have any problems with that, the NumPy site can help you set up in the Getting Started section.
Also, I recommend using Jupyter Notebook or the IPython console in Spyder to follow along, but IDLE also works. Jupyter and Spyder come with Anaconda, which you can download from the Anaconda website, and the distribution should have NumPy installed already.
Why NumPy?
One of the biggest advantages of using the NumPy package is the ndarray (n-dimensional array) data structure. The NumPy ndarray is much more powerful than the Python list, and provides a larger variety of operations and functions than a Python array (from the standard library’s array module). To understand these advantages, we first need to dig a little into Python’s elementary data types.
Python is a dynamically-typed language, and that’s one of the features that contributes to its ease of use. Python allows us to assign a variable an integer value, then reassign that same variable to a different type (like a string):
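As a small illustration of that rebinding (the variable names here are just placeholders I picked for the example):

```python
# Dynamic typing: the same name can be rebound to values of different types.
x = 5
first_type = type(x)     # int

x = "five"
second_type = type(x)    # str

print(first_type, second_type)
```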
However, in a statically-typed language like C++, we first have to declare a variable’s type before we can assign it a value. After the variable has been declared, we can’t reassign it a value of a different type. For example, after int x = 5; the assignment x = "five"; fails to compile.
This dynamic typing is pretty convenient, but comes at a cost. The standard Python interpreter (CPython) is implemented in C, and elementary data types in Python are actually not raw data types, but pointers to C structures that contain a number of different values. The extra information stored in a Python data type like an integer is what allows dynamic typing, but it comes with significant overhead, and the performance cost becomes apparent when dealing with a very large amount of data.
The same trade-off between flexibility and performance applies to Python lists as well. Since lists can be heterogeneous (contain different data types in the same list), each element in a list is a reference to a full Python object that carries its own type and reference information. Heterogeneous lists benefit from this sort of structure, but when all elements of a list are of the same elementary type, the type information that is stored turns out to be redundant, and wasteful of precious memory.
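To get a rough feel for that overhead, here is a minimal sketch comparing a list of integers with an ndarray holding the same values. It assumes CPython and uses sys.getsizeof, which only gives an approximate picture, so treat the numbers as an estimate rather than an exact measurement.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
nd = np.arange(n, dtype=np.int64)

# The list stores references; every integer is a separate Python object
# with its own type and reference-count fields.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# The ndarray stores raw 8-byte integers in one contiguous buffer.
array_bytes = nd.nbytes

print("list (approx.):", list_bytes, "bytes")
print("ndarray data:  ", array_bytes, "bytes")
```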
Python arrays are much more efficient at storing uniform data types than lists, but the NumPy ndarray provides functionality that arrays don’t (e.g. matrix and vector operations).
Array Creation
First off, check if you have NumPy installed — import and check that you have at least version 1.8.
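A quick way to do that check, using the conventional np alias:

```python
import numpy as np

# __version__ is a plain string such as "1.26.4" or "2.1.0".
print(np.__version__)
```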
NOTE: You can just import numpy instead of importing it as np, but for the rest of the tutorial, wherever you see np, just replace it with numpy (e.g. np.array() → numpy.array()). Also, I may be a little inconsistent when using the terms “array” or “ndarray”, so just remember these terms refer to the same thing.
Now let’s see how we can create multidimensional arrays (ndarrays) with NumPy. Since we compared ndarrays to Python lists, let’s first see how NumPy lets us create an array from a list:
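A minimal sketch of that, using a short Fibonacci-style list as the input (the name fib is just my choice for the example):

```python
import numpy as np

fib = np.array([1, 1, 2, 3, 5, 8, 13])

print(fib)
print(type(fib))
# An integer dtype; the exact width depends on your platform and NumPy version.
print(fib.dtype)
```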
Passing the Python list [1,1,2,3,5,8,13] to np.array() creates an ndarray of integer values (32-bit or 64-bit, depending on your platform and NumPy version). The values held in an ndarray will always be of the same type. With all ndarrays, the .dtype property will return the data type of the values the array holds. See the NumPy docs on data types for details.
If we pass a list containing values of different types to np.array(), NumPy will up-cast the values so they can be of the same type:
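Here is a small sketch of up-casting with made-up values: mixing integers with a float gives a float array, and mixing numbers with a string gives an array of strings.

```python
import numpy as np

# Integers and a float: everything becomes a float.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers, mixed_numbers.dtype)

# Numbers and a string: everything becomes a (unicode) string.
mixed_text = np.array([1, 2.5, "three"])
print(mixed_text, mixed_text.dtype)
```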
Note: be sure that when you call array() you provide a list of numbers as a single argument, np.array([1,2,3]), instead of just the numbers as multiple arguments, np.array(1,2,3); this is a pretty common error.
NumPy also allows you to explicitly specify the data type of the array when you create it with the dtype argument:
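For instance, a sketch that stores the same three numbers once as 64-bit floats and once as 16-bit integers (the dtypes are arbitrary picks for illustration):

```python
import numpy as np

as_float = np.array([1, 2, 3], dtype=np.float64)
as_small_int = np.array([1, 2, 3], dtype=np.int16)

print(as_float, as_float.dtype)
# itemsize is the number of bytes used per element.
print(as_small_int, as_small_int.dtype, as_small_int.itemsize)
```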
It’s common to know the size of the array, but not know the contents of the array at the time of creation. In this case, NumPy allows the creation of an array of a specified size with placeholder values:
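One routine for this is np.zeros(), which I’m assuming is the one the text has in mind here. A minimal sketch:

```python
import numpy as np

# A 1-D array of five zeros and a 2x3 array of zeros (floats by default).
flat_zeros = np.zeros(5)
grid_zeros = np.zeros((2, 3))

print(flat_zeros)
print(grid_zeros)
```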
np.ones() and np.empty() can also be used to return an array with all 1’s, or without initializing entries, respectively. If you would like to specify a value to use as a placeholder, use np.full(shape, fill_value):
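A short sketch of all three routines, with shapes and the fill value 7 chosen arbitrarily:

```python
import numpy as np

ones = np.ones(3)              # three 1.0 values
scratch = np.empty((2, 2))     # contents are arbitrary until you assign them
sevens = np.full((2, 2), 7)    # every entry is 7

print(ones)
print(scratch.shape)
print(sevens)
```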
Arrays can also be initialized with random values. Using np.random.random(s) you can create an array of size s filled with random values in the half-open interval [0, 1), that is, from 0 up to but not including 1. Passing an integer value will yield a 1-D array of that length:
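For example, a quick sketch (the values will differ on every run):

```python
import numpy as np

samples = np.random.random(5)   # five floats in [0, 1)
print(samples)
```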
You can pass the dimensions for a higher-dimension array:
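Here the dimensions go in as a tuple; the 3x4 shape is just an example:

```python
import numpy as np

random_grid = np.random.random((3, 4))   # 3 rows, 4 columns
print(random_grid)
print(random_grid.shape)
```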
If you want an array with random integer values, use np.random.randint(low, high, size). Specify the minimum value for low, the upper bound for high (which is exclusive, so the largest value returned is high - 1), and of course the size of the array for size, just as we did with np.random.random().
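As a sketch, simulating dice rolls shows why the exclusive upper bound matters: to get values from 1 to 6 you pass 7 as the upper bound. The dice scenario is my own example.

```python
import numpy as np

# Two rows of five dice rolls; 7 is excluded, so values are 1 through 6.
rolls = np.random.randint(1, 7, size=(2, 5))
print(rolls)
```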
There are a number of other pretty useful ways of creating ndarrays, including:
- filling with evenly-spaced values in a given range
- filling an array with random numbers over a normal distribution
- creating an identity matrix
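The three routines in the list above can be sketched as follows. I’m assuming np.arange() and np.linspace() for evenly-spaced values, np.random.normal() for the normal distribution, and np.eye() for the identity matrix; NumPy offers other ways to do each of these.

```python
import numpy as np

evens = np.arange(0, 10, 2)        # start, stop (exclusive), step
spaced = np.linspace(0, 1, 5)      # 5 points from 0 to 1, both included

# Mean 0, standard deviation 1, arranged in a 2x3 array.
gaussian = np.random.normal(0, 1, size=(2, 3))

identity = np.eye(3)               # 3x3 identity matrix

print(evens)
print(spaced)
print(gaussian)
print(identity)
```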
If you’re interested, check out the array creation documentation to explore these array creation routines along with many others.
Array Manipulation
Creating arrays is fine, but where NumPy really shines is the methods for manipulation and computation with arrays. Not only are these methods straightforward and convenient to use, but when it comes to element-wise operations (especially on large arrays), these methods have pretty exceptional performance — much greater than that of iterating through each element, like you might normally do without NumPy.
The ndarray object allows us to perform arithmetic operations element-wise on two arrays of the same size:
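A minimal sketch with two small arrays, plus the same operators applied to plain lists for contrast (all values are made up for the example):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

total = a + b         # element-wise: [11, 22, 33]
difference = b - a    # element-wise: [9, 18, 27]

# With plain lists, + concatenates instead of adding.
joined = [1, 2, 3] + [10, 20, 30]

# And - is not defined for lists at all.
try:
    [10, 20, 30] - [1, 2, 3]
    list_subtraction_failed = False
except TypeError:
    list_subtraction_failed = True

print(total, difference, joined, list_subtraction_failed)
```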
Remember, using the + operator between two lists doesn’t add them element-wise. This actually results in concatenation of the two lists. Furthermore, if we try to use the - operator between two lists, Python will raise an error since, again, lists do not naturally allow for element-wise operations without explicitly stating so with for-loops.
Element-wise vs Matrix Multiplication
If you have ever used MATLAB before, you know how easy it can be to work with n-dimensional arrays and matrices. NumPy does an excellent job of providing some of that convenient functionality, and may feel much more familiar to MATLAB users than working with bare-bones Python.
Matrix multiplication in MATLAB is as simple as using the * operator on two matrices (e.g. a * b). With NumPy, the * operator will actually return element-wise multiplication.
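To see what * does in NumPy, here is a sketch with two small 2x2 arrays (values picked for easy mental arithmetic):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Each entry is multiplied by the entry in the same position.
elementwise = a * b
print(elementwise)   # [[ 5 12]
                     #  [21 32]]
```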
For matrix multiplication, the @ operator is used for arrays.
While on the topic of matrix multiplication, NumPy also has a matrix class, which is actually a subclass of ndarray. The array class is for general-purpose use, while the matrix class was designed for linear algebra computations and always works with exactly 2 dimensions. The current NumPy documentation recommends using regular arrays in all cases, even for linear algebra, and notes that the matrix class may be removed in the future. If you want to work with higher-dimensional arrays (3-D for example), the array class supports this.
What’s more, the matrix class does not use the same operator semantics as array does (for example, * on matrix objects performs matrix multiplication). So for this article, we will focus on the array class.
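Coming back to the @ operator on regular arrays, here is a minimal sketch with two small 2x2 arrays. np.matmul() is the function form of the same operation.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Rows of a are combined with columns of b.
product = a @ b
print(product)   # [[19 22]
                 #  [43 50]]

same_product = np.matmul(a, b)
```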
Just like multiplying two arrays with each other, we can multiply all the elements of an array by a single number. NumPy also makes it really convenient to get properties of arrays, such as the sum, min/max, number of dimensions (ndim), and the size (total number of elements) of an array.
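A short sketch of these on a small 2x3 array (the values are arbitrary):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

doubled = arr * 2        # every element multiplied by 2

print(doubled)
print(arr.sum())         # 21
print(arr.min(), arr.max())   # 1 6
print(arr.ndim)          # 2 dimensions
print(arr.size)          # 6 elements in total
print(arr.shape)         # (2, 3)
```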
You also have access to basic statistical values for your arrays:
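For example, with a small made-up data set (note that std() computes the population standard deviation by default):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()         # 5.0
std = data.std()           # 2.0
median = np.median(data)   # 4.5

print(mean, std, median)
```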
Get Elements Based on a Condition
One pretty cool NumPy function is numpy.where(). This allows you to find the elements of an array that satisfy a specified condition. If, for example, you had a 5x5 array of integers from 0–50, and you wanted to know where the values that are greater than 25 are, you could do the following:
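A minimal sketch of that lookup. To keep the result reproducible I’m using the even numbers from 0 to 48 as a stand-in for the 5x5 array, rather than random integers.

```python
import numpy as np

grid = np.arange(0, 50, 2).reshape(5, 5)   # even numbers 0..48 in a 5x5 array

# With only a condition, np.where returns one index array per dimension.
rows, cols = np.where(grid > 25)

# Those indices can be used to pull out the matching values.
large_values = grid[rows, cols]

print(rows)
print(cols)
print(large_values)
```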
If you wanted to find low values (e.g. < 15) and replace them with a -1:
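A sketch of that replacement, again on a stand-in 5x5 array of even numbers from 0 to 48. Note that np.where() returns a new array and leaves the original untouched.

```python
import numpy as np

grid = np.arange(0, 50, 2).reshape(5, 5)

# Where the condition holds take -1, otherwise keep the original value.
replaced = np.where(grid < 15, -1, grid)

print(replaced)
```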
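There’s a lot you can do with this function, especially with a little creative thinking, but just remember how it works: np.where(cond[, x, y]) takes, element by element, the value from x where cond is satisfied and the value from y where it is not. Called with only cond, it returns the indices of the elements that satisfy the condition.
Indexing, Slicing, and Reshaping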
Indexing works the same as with lists:
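A short sketch, including the comma-separated form that NumPy adds for multi-dimensional arrays (values are arbitrary):

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
first = a[0]       # 10
last = a[-1]       # 50, negative indices count from the end

m = np.array([[1, 2, 3],
              [4, 5, 6]])
nested_style = m[1][2]   # list-style indexing: row 1, then column 2
tuple_style = m[1, 2]    # NumPy shorthand for the same element

print(first, last, nested_style, tuple_style)
```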
To check if a value is present in an array, the in keyword can be used just as with lists:
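For example, on a small 2-D array, where the check looks at every element regardless of dimension:

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [4, 5, 6]])

has_five = 5 in values   # True
has_nine = 9 in values   # False

print(has_five, has_nine)
```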
Slicing works with arrays just as it does with lists, and arrays can be sliced in multiple dimensions.
Arrays in NumPy have a lot of functionality when it comes to transformation. For example, if you have a 3x5 array and would like to reshape it to a 5x3:
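Here is a sketch covering both multi-dimensional slicing and the reshape, using a 3x5 array filled with the numbers 0 to 14:

```python
import numpy as np

grid = np.arange(15).reshape(3, 5)

# Slicing: one slice per dimension, separated by commas.
first_two_rows = grid[:2]        # rows 0 and 1, all columns
middle_columns = grid[:, 1:4]    # all rows, columns 1 through 3
corner = grid[:2, :2]            # top-left 2x2 block

# Reshaping keeps the elements in order and changes only the layout.
reshaped = grid.reshape(5, 3)

print(corner)
print(reshaped)
```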
You can also transpose arrays using array.T.
Arrays can also be transposed using np.transpose(a), where a is the array you wish to transpose.
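Both forms side by side, on a small 2x3 array:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)    # [[0 1 2]
                                  #  [3 4 5]]

flipped = m.T                     # rows become columns: shape (3, 2)
also_flipped = np.transpose(m)    # function form of the same operation

print(flipped)
```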
We’ve really only scratched the surface of what you can do with the NumPy library, and if you’d like to know what else you can do, the official documentation has a great Getting Started guide and, of course, is where you can explore the rest of the library. If you want to use Python for scientific computation, machine learning, or data science, NumPy is one of the libraries that you should really get comfortable with.
Here are a few other related Python libraries I’d recommend checking out, which are considered core in this domain:
To conclude, I’ll send you off with a few resources for continuing your journey in mastering Python for data science and scientific applications.
Python Data Science Handbook by Jake VanderPlas — This is a really excellent primer for getting started with Data Science. He also goes over NumPy towards the beginning, but in much more detail than this article.
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Aurélien Géron — If you’re really interested in machine learning and deep learning, this book might be the most often recommended book for getting started.
Towards Data Science — Data Science, Machine Learning, AI, general programming. This is one of the best publications on Medium, and is an excellent resource for a huge range of topics relating to Data Science.
Machine Learning Course Offered by Stanford with Andrew Ng — Not exactly specific to Python, but if you’d like to really get into machine learning and dive into some of the theory along with hands-on work, if you have the time to dedicate to the course, this is an excellent place to start. Andrew Ng is brilliant.
Thanks for reading!