I once walked into a company completely unprepared as a data scientist. While I expected to be training models, my role turned out to be software engineering and the app made the heaviest use of numpy I’d ever seen.
While I’d used np.array() to convert a list to an array many times, I wasn’t prepared for line after line of linspace, meshgrid and vsplit.
I needed to get comfortable with numpy fast if I was going to be able to read and write code.
This is a curated list of numpy array functions and examples I’ve built for myself.
We’ll cover background info on Arrays in the first section, then get to the advanced functions that will help you become faster working with data.
Table of Contents:
1. Array Overview
2. Generating Arrays
3. Manipulating Arrays
1) Array Overview
What are Arrays?
Arrays are a data structure for storing homogeneous data. That means all elements are the same type.
Numpy’s array class is ndarray, meaning “N-dimensional array”.
import numpy as np
arr = np.array([[1,2],[3,4]])
type(arr)
#=> numpy.ndarray
It’s n-dimensional because the number of dimensions is set by the shape you pass when initializing it, so the same class covers 1D vectors, 2D matrices and much higher-dimensional arrays.
For example:
np.zeros((2)) generates a 1D array. np.zeros((2,2)) generates a 2D array. np.zeros((2,2,2)) generates a 3D array. np.zeros((2,2,2,2)) generates a 4D array. And so on…
np.zeros((2))
#=> array([0., 0.])
np.zeros((2,2))
#=> array([[0., 0.],
#=>        [0., 0.]])
np.zeros((2,2,2))
#=> array([[[0., 0.],
#=>         [0., 0.]],
#=>
#=>        [[0., 0.],
#=>         [0., 0.]]])
...
Arrays vs Lists
- Arrays use less memory than lists
- Arrays have significantly more functionality
- Arrays require data to be homogeneous; lists do not
- Arithmetic on arrays is applied element by element, whereas + and * on lists concatenate and repeat
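Here is a minimal sketch of how arithmetic behaves on arrays compared with lists. The variable names are my own, and the @ operator (true matrix multiplication) isn't covered elsewhere in this post; I include it only for contrast with the element-wise * operator.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# * multiplies matching positions (element-wise).
elementwise = a * b        # [[10, 40], [90, 160]]

# @ is the actual matrix product.
matrix_product = a @ b     # [[70, 100], [150, 220]]

# On a plain list, * repeats the list instead of multiplying.
doubled_list = [1, 2] * 2  # [1, 2, 1, 2]
```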
Important Parameters
shape: a tuple representing the dimensions of an array. An array of shape (2,3,2) is a 2x3x2 array, and looks like the one below.
np.zeros((2,3,2))
#=> array([[[0., 0.],
#=>         [0., 0.],
#=>         [0., 0.]],
#=>
#=>        [[0., 0.],
#=>         [0., 0.],
#=>         [0., 0.]]])
dtype: the type of value stored in an array. Arrays are homogeneous, so we can’t mix multiple data types like strings and integers. The value of dtype can be np.float64, np.int8, int, str or one of several other types.
2) Generating Arrays
zeros
Generate an array of zeros with a specified shape.
This is useful when you want to initialize weights in an ML model to 0 before beginning training. This is also often used to initialize an array with a specific shape and then overwrite it with your own values.
np.zeros((2,3))
#=> array([[0., 0., 0.],
#=> [0., 0., 0.]])
ones
Generate an array of ones with a specified shape.
Useful if you need to initialize values to 1 before incrementally subtracting from them.
np.ones((2,3))
#=> array([[1., 1., 1.],
#=> [1., 1., 1.]])
empty
np.empty() is a little different from zeros and ones, as it doesn’t preset any values in the array; the contents are whatever happened to be in memory. Some people say it’s slightly faster to initialize, but the difference is usually negligible.
This is sometimes used when initializing an array in advance of filling it with data for the sake of readable code.
arr = np.empty((2,2))
arr
#=> array([[1.00000000e+000, 1.49166815e-154],
#=> [4.44659081e-323, 0.00000000e+000]])
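A small sketch of the "allocate first, fill later" pattern described above. The loop and the name squares are just for illustration; the point is that every slot should be written before the array is read.

```python
import numpy as np

# Pre-allocate; the contents are arbitrary until every slot is written.
squares = np.empty(5)

for i in range(5):
    squares[i] = i ** 2  # overwrite each placeholder value

# squares now holds 0, 1, 4, 9, 16 as floats (np.empty defaults to float64)
```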
full
Initialize an array with a given value.
Below we initialize an array with 10. And then another array with ['a','b'] pairs.
np.full((3,2), 10)
#=> array([[10, 10],
#=>        [10, 10],
#=>        [10, 10]])
np.full((3,2), ['a','b'])
#=> array([['a', 'b'],
#=>        ['a', 'b'],
#=>        ['a', 'b']], dtype='<U1')
array
This is probably what you’ve seen the most in real life. It initializes an array from an “array-like” object.
Useful if you’re storing data in another data structure but need to convert it into a numpy object so it can be passed to sklearn.
li = ['a','b','c']
np.array(li)
#=> array(['a', 'b', 'c'], dtype='<U1')
Note: np.array also has a parameter called copy. It defaults to True, which guarantees a new array object is generated rather than pointing to an existing object.
_like
There are several _like functions corresponding to the functions we’ve discussed: empty_like, ones_like, zeros_like and full_like.
They generate an array with the same shape as the passed-in array but with their own values. So ones_like generates an array of ones, but you pass it an existing array and it takes the shape of that, instead of you specifying the shape directly.
a1 = np.array([[1,2],[3,4]])
#=> array([[1, 2],
#=>        [3, 4]])
np.ones_like(a1)
#=> array([[1, 1],
#=>        [1, 1]])
Notice how the 2nd array of 1’s took on the shape of the first array.
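The other _like functions follow the same pattern. This is a rough sketch with names of my own choosing; the dtype argument is optional and, when left out, the result inherits the type of the template array.

```python
import numpy as np

template = np.array([[1, 2], [3, 4]])

zeros = np.zeros_like(template)                   # integer zeros, shape (2, 2)
sevens = np.full_like(template, 7)                # every element is 7
float_ones = np.ones_like(template, dtype=float)  # same shape, dtype overridden
```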
rand
Generate an array with random values.
This is useful when you want to initialize the weights in a model to random values before training, which is likely more common than initializing them to zero.
np.random.rand(3,2)
#=> array([[0.94664048, 0.76616114],
#=> [0.395549 , 0.84680126],
#=> [0.42873 , 0.77736086]])
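If you need the random values to be repeatable, one option is numpy's Generator interface. This is a sketch under my own assumptions: the seed value 42 and the variable names are arbitrary, and note that rng.random takes the shape as a tuple, unlike np.random.rand.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
weights = rng.random((3, 2))          # uniform floats in [0, 1)

# A second generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(seed=42)
same_weights = rng_again.random((3, 2))
```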
asarray
np.asarray converts its input the same way np.array does, but it skips the copy when the input is already an array of the right type. See np.array above.
arange
Generates an array of values with a set interval between a lower and an upper limit, where the upper limit is excluded. It’s numpy’s counterpart to list(range(50,60,2)).
Below we generate an array of every second value between 50 and 60.
np.arange(50,60,2)
#=> array([50, 52, 54, 56, 58])
linspace
Generates an array of numbers with equal intervals between 2 other numbers. Instead of specifying the interval directly like arange, we specify how many numbers to generate between the upper and lower limit.
Below we return an array of 6 numbers between 10 and 20, and 5 numbers between 0 and 2.
np.linspace(10, 20, 6)
#=> array([10., 12., 14., 16., 18., 20.])
np.linspace(0, 2, 5)
#=> array([0. , 0.5, 1. , 1.5, 2. ])
Notice how we specify the number of elements in the array instead of stating the interval itself.
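Two optional linspace parameters come in handy here, and they aren't covered above so treat this as a side note: endpoint=False drops the upper limit (making the result line up with arange), and retstep=True also returns the interval that was computed.

```python
import numpy as np

with_end = np.linspace(0, 1, 5)                     # 0, 0.25, 0.5, 0.75, 1
without_end = np.linspace(0, 1, 4, endpoint=False)  # 0, 0.25, 0.5, 0.75
same_as_arange = np.arange(0, 1, 0.25)              # same values as without_end

# retstep=True returns the spacing alongside the array.
values, step = np.linspace(10, 20, 6, retstep=True)  # step is 2.0
```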
meshgrid
Generates a matrix of coordinates based on 2 input arrays.
This can be a little tricky to wrap your head around. So let’s walk through an example. Generate 2 arrays and pass those to np.meshgrid.
x = np.array([1,2,3])
y = np.array([-3,-2,-1])
xcors, ycors = np.meshgrid(x, y)
xcors
#=> [[1 2 3]
#=>  [1 2 3]
#=>  [1 2 3]]
ycors
#=> [[-3 -3 -3]
#=>  [-2 -2 -2]
#=>  [-1 -1 -1]]
Here we can see 2 different matrices outputted, based on the values and shape of inputted arrays.
But don’t imagine this as 2 separate matrices. Those are actually pairs of (x,y) coordinates representing points in a plane. I’ve combined them below.
[[(1, -3), (2, -3), (3, -3)],
 [(1, -2), (2, -2), (3, -2)],
 [(1, -1), (2, -1), (3, -1)]]
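To make the coordinate-pair idea concrete, here is a sketch that evaluates a function at every grid point and also builds the combined pairs. The function z = x + y is an arbitrary choice of mine, and np.stack is borrowed from later in this post.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([-3, -2, -1])
xcors, ycors = np.meshgrid(x, y)

# Evaluate z = x + y at every (x, y) grid point in one vectorized step.
z = xcors + ycors

# Pair the coordinates up: shape (3, 3, 2), where the last axis holds (x, y).
points = np.stack((xcors, ycors), axis=-1)
```

This is the usual setup for a contour or surface plot: one array of x coordinates, one of y coordinates, and one of values with the same shape.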
3) Manipulating Arrays
copy
Make a copy of an existing array.
Assigning an array to a new variable name will point back to the original array. You need to be careful with this behaviour so you don’t unintentionally modify existing variables.
Consider this example. Although we modify a2, the value of a1 also changes.
a1 = np.array([1,2,3])
a2 = a1
a2[0] = 10
a1
#=> array([10, 2, 3])
Now compare that to this. We modify a2 but a1 does not change… because we made a copy!
a1 = np.array([1,2,3])
a2 = a1.copy()
a2[0] = 10
a1
#=> array([1, 2, 3])
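The same trap shows up with slices, which share memory with the original array. A small sketch with my own variable names; np.shares_memory is not mentioned above, but it is a convenient way to check what is going on.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])

view = a1[:2]                # basic slicing returns a view, not a copy
view[0] = 10                 # writes through to a1

independent = a1[:2].copy()  # an explicit copy breaks the link
independent[1] = 99          # a1 is unaffected

linked = np.shares_memory(a1, view)           # True
unlinked = np.shares_memory(a1, independent)  # False
```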
shape
Get the shape of an array.
Very useful when dealing with massive multi-dimensional arrays where it’s not possible to eyeball the dimensions.
a = np.array([[1,2],[3,4],[5,6]])
a.shape
#=> (3, 2)
reshape
Reshapes an array.
This is insanely useful and I can’t imagine using a library like Keras without it. Let’s walk through an example of creating and reshaping an array.
Generate an array.
a = np.array([[1,2],[3,4],[5,6]])
a
#=> array([[1, 2],
#=> [3, 4],
#=> [5, 6]])
Check its shape.
a.shape
#=> (3, 2)
Reshape the array from 3x2 to 2x3.
a.reshape(2,3)
#=> array([[1, 2, 3],
#=> [4, 5, 6]])
Flatten the array into 1 dimension.
a.reshape(6)
#=> array([1, 2, 3, 4, 5, 6])
Reshape the array into a 6x1 matrix.
a.reshape(6,1)
#=> array([[1],
#=> [2],
#=> [3],
#=> [4],
#=> [5],
#=> [6]])
Reshape the array into 3 dimensions, 2x3x1.
a.reshape(2,3,1)
#=> array([[[1],
#=> [2],
#=> [3]],
#=>
#=> [[4],
#=> [5],
#=> [6]]])
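You don't always have to spell out every dimension. As a sketch: passing -1 asks numpy to infer that dimension from the array's size, and np.newaxis (not covered above) is one way to add a leading dimension of length 1, such as a batch dimension.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

flat = a.reshape(-1)          # shape (6,): the length is inferred
two_rows = a.reshape(2, -1)   # shape (2, 3): the column count is inferred
batched = a[np.newaxis, ...]  # shape (1, 3, 2): new leading axis
```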
resize
Similar to reshape but it mutates the original array.
a = np.array([['a','b'],['c','d']])
a
#=> array([['a', 'b'],
#=>        ['c', 'd']], dtype='<U1')
a.reshape(1,4)
#=> array([['a', 'b', 'c', 'd']], dtype='<U1')
a
#=> array([['a', 'b'],
#=>        ['c', 'd']], dtype='<U1')
a.resize(1,4)
a
#=> array([['a', 'b', 'c', 'd']], dtype='<U1')
Notice how calling reshape didn’t change a, but calling resize permanently changed its shape.
transpose
Transposes an array.
Can be useful for swapping rows and columns before generating a pandas data frame or doing aggregate calculations like count or sum.
a = np.array([['s','t','u'],['x','y','z']])
a
#=> array([['s', 't', 'u'],
#=>        ['x', 'y', 'z']], dtype='<U1')
a.T
#=> array([['s', 'x'],
#=>        ['t', 'y'],
#=>        ['u', 'z']], dtype='<U1')
Notice how everything has been flipped over the diagonal that starts at s, so rows have become columns.
flatten
Flattens an array into 1 dimension and returns a copy.
This achieves the same result as reshape(6) below. But flatten can be useful when you don’t know the size of an array in advance.
a = np.array([[1,2,3],['a','b','c']])
a.flatten()
#=> array(['1', '2', '3', 'a', 'b', 'c'], dtype='<U21')
a.reshape(6)
#=> array(['1', '2', '3', 'a', 'b', 'c'], dtype='<U21')
ravel
Flattens an array-like object into 1 dimension. Similar to flatten but it returns a view of the array where possible, instead of always making a copy.
The big benefit though is that it can be used on non-arrays like lists, where flatten would fail.
np.ravel([[1,2,3],[4,5,6]])
#=> array([1, 2, 3, 4, 5, 6])
np.flatten([[1,2,3],[4,5,6]])
#=> AttributeError: module 'numpy' has no attribute 'flatten'
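The view-versus-copy difference matters as soon as you write to the result. A sketch with my own variable names, assuming a freshly created (contiguous) array, which is the case where ravel can hand back a view:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = a.flatten()  # always a new array
flat_view = a.ravel()    # a view here, because a is contiguous

flat_copy[0] = 100       # does not touch a
flat_view[1] = 200       # writes through to a

# a is now [[1, 200, 3], [4, 5, 6]]
```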
hsplit
Horizontally splits an array into subarrays.
You can imagine this like splitting each column in a matrix into its own array.
Useful in ML for splitting out time-series data if each column describes an object, and each row is a time period for those objects.
a = np.array([[1,2,3],[4,5,6]])
a
#=> array([[1, 2, 3],
#=>        [4, 5, 6]])
np.hsplit(a,3)
#=> [array([[1],[4]]),
#=>  array([[2],[5]]),
#=>  array([[3],[6]])]
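The second argument doesn't have to be a count. As a sketch, passing a list of column indices splits at those positions instead, which helps when the pieces are uneven. The 2x4 array and the names below are my own example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

# Split before column 1 and before column 3.
left, middle, right = np.hsplit(a, [1, 3])

# left   holds column 0        -> [[1], [5]]
# middle holds columns 1 and 2 -> [[2, 3], [6, 7]]
# right  holds column 3        -> [[4], [8]]
```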
vsplit
Vertically splits an array into subarrays.
You can imagine this as splitting off each row into its own array.
Useful in ML if each row represents an object and each column is a different feature of those objects.
a = np.array([[1,2,3],[4,5,6]])
a
#=> array([[1, 2, 3],
#=>        [4, 5, 6]])
np.vsplit(a,2)
#=> [array([[1, 2, 3]]),
#=>  array([[4, 5, 6]])]
stack
Joins arrays along a new axis.
This is essentially the opposite of vsplit and hsplit in that it combines separate arrays into a single array.
Along axis=0
a = np.array(['a', 'b', 'c'])
b = np.array(['d', 'e', 'f'])
np.stack((a, b), axis=0)
#=> array([['a', 'b', 'c'],
#=>        ['d', 'e', 'f']], dtype='<U1')
Along axis=1
a = np.array(['a', 'b', 'c'])
b = np.array(['d', 'e', 'f'])
np.stack((a, b), axis=1)
#=> array([['a', 'd'],
#=>        ['b', 'e'],
#=>        ['c', 'f']], dtype='<U1')
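One thing to watch when undoing a split: stack always adds a new axis, so it won't give back the original shape on its own. A sketch of the round trip, using np.vstack (not covered above) to rebuild the array exactly:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

rows = np.vsplit(a, 2)     # two arrays, each of shape (1, 3)
rebuilt = np.vstack(rows)  # joins along the existing row axis: shape (2, 3)
stacked = np.stack(rows)   # adds a new leading axis: shape (2, 1, 3)
```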
Conclusion
I consider this the basics of numpy. You’ll come across these functions repeatedly when reading existing code at work or doing tutorials online.
Comfort with the above means you won’t get stuck understanding how meshgrid is used to generate a matplotlib chart. Or how to quickly add a dimension so your data conforms with input requirements to a Keras model.
Are there any numpy functions you can’t live without?