Friday, November 27, 2020

6 Costly Numpy Mistakes to Avoid in Python

Numpy is one of the most central libraries in Python, but we all make simple mistakes with it, including the ones we know we shouldn't make but still haven't figured out how to avoid.

I'm a pretty average programmer and even now, I still fumble around with many of the problems I face in Python. So with that in mind, I decided to write down the common problems we face and how to go about fixing them.

It's embarrassing to have lived through these, but here we go!

1: Lists or Numpy Arrays?

When I started programming, I couldn't figure out the difference, and only when I began to use matrix methods did I really appreciate it.

Simply put, a list is an ordered collection of elements (which can be of mixed types), whereas a Numpy array is a grid of values of a single type that also carries information about the raw data, how to locate an element, and how to interpret an element.

Lists are declared with a pair of square brackets, ['this is a list'], whereas a Numpy array is defined as follows: np.array([1,2,3,4]).

The key difference between them is that Numpy arrays perform better: they take up less memory, operations on them run faster than on lists, and other libraries (e.g. scipy) have routines optimised for Numpy arrays.

Also, the output is different. For example,

import numpy as np

list1 = [1,2,3,4]
print("List: ", list1)
a = np.array([1,2,3,4])
print("Numpy Array: ", a)

Output is:

List: [1, 2, 3, 4]
Numpy Array: [1 2 3 4]

(Note the missing commas!)
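
The printed output is not the only thing that differs. As a minimal sketch (the variable names are only for illustration), the same operators behave quite differently on a list and on an array, which is an easy way to get a silently wrong result:

```python
import numpy as np

values_list = [1, 2, 3, 4]
values_array = np.array([1, 2, 3, 4])

# On a list, * repeats the list and + concatenates
doubled_list = values_list * 2       # [1, 2, 3, 4, 1, 2, 3, 4]
joined_list = values_list + [10]     # [1, 2, 3, 4, 10]

# On an array, the same operators work element by element
doubled_array = values_array * 2     # [2 4 6 8]
shifted_array = values_array + 10    # [11 12 13 14]

print(doubled_list, joined_list)
print(doubled_array, shifted_array)
```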

2: Miscalculating Reshapes

Sometimes you need to reshape a matrix and turn it into a vector and other times, you need to do the opposite and turn a vector back into a matrix.

Where I usually mess up is when I compute the new shape programmatically: my code would often have a bug in which the reshaped array does not have the same number of elements as the original matrix.

The following example shows how to reshape an array the right way.

import numpy as np
a = np.array([1,3,5,7,9,11])
b = a.reshape(3,2)
print("a: ",a)
print("b: ",b)

The output is,

a: [ 1 3 5 7 9 11]
b:
[[ 1 3]
[ 5 7]
[ 9 11]]
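
For contrast, here is a small sketch of what the mismatch looks like when it goes wrong, and one way to sidestep it. Passing -1 for one dimension asks Numpy to infer it from the total number of elements; the variable names are just for illustration.

```python
import numpy as np

a = np.array([1, 3, 5, 7, 9, 11])   # 6 elements

# 4 * 2 = 8 does not match the 6 elements in a, so Numpy raises ValueError
try:
    a.reshape(4, 2)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Reshape failed:", err)

# -1 asks Numpy to work out that dimension from the total size
b = a.reshape(-1, 2)
print(b.shape)   # (3, 2)

# Flattening back to a vector recovers the original values
c = b.reshape(-1)
print(np.array_equal(a, c))   # True
```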

3: Indexing Badly

Numpy arrays make indexing easy, but even then, I've still made so many stupid mistakes. Firstly, Numpy indices start at 0, so make sure you account for that. Also, when slicing, the stop index is excluded: a[1:4] returns the elements at indices 1, 2 and 3, but not the one at index 4.

So an array may be of length 79, but to index into the final item you'll have to use the index 78 (as indexing starts at 0).

The following example shows the correct way to create a new array from an existing array by slicing.

import numpy as np
a = np.array([1,3,5,7,9,11])
b = a[1:4]
print("Original Array, a: ", a)
print("New Array, b: ", b)

The output is,

Original Array, a: [ 1 3 5 7 9 11]
New Array, b: [3 5 7]
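
As a quick sketch of the off-by-one traps described above (variable names are illustrative only), the last element lives at index len(a) - 1, and using len(a) itself is out of bounds:

```python
import numpy as np

a = np.array([1, 3, 5, 7, 9, 11])

last_by_position = a[len(a) - 1]   # 11, the last valid index is len(a) - 1
last_by_negative = a[-1]           # 11, negative indices count from the end

# Using len(a) itself as an index is out of bounds
try:
    a[len(a)]
    out_of_bounds = False
except IndexError as err:
    out_of_bounds = True
    print("Indexing failed:", err)

# A slice stops before its stop index, so this has 4 - 1 = 3 elements
middle = a[1:4]
print(last_by_position, last_by_negative, middle)
```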

4: PATH variable issue

PATH variable problems are a dime a dozen and usually take me all day to fix.

The usual scenario is:

import numpy

and you’re returned:

ModuleNotFoundError: No module named 'numpy'

This happens despite having already installed numpy (pip3 install numpy) in your terminal. It typically occurs when the PATH variable is not set correctly, so the Python interpreter you are running is not the one that pip3 installed numpy into.

If you see this error, first check whether the PATH variable is set correctly and try to fix it. I usually use PyCharm's built-in package manager to install libraries, but if all else fails, ask a software engineering friend. Keep them close!
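
Before touching PATH, it can help to ask Python which interpreter is actually running and whether that interpreter can see numpy. The following is only a diagnostic sketch using the standard library; it does not fix anything by itself.

```python
import importlib.util
import sys

# The interpreter that is actually running this script
print("Interpreter:", sys.executable)

# find_spec returns None when this interpreter cannot find the module
spec = importlib.util.find_spec("numpy")
if spec is None:
    print("numpy is not visible to this interpreter")
else:
    print("numpy found at:", spec.origin)
```

If numpy is reported as missing, running python3 -m pip install numpy with the same interpreter that printed the path above installs it for that interpreter rather than for whichever pip3 happens to be first on PATH.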

5: Sorting

Another common mistake happens when you try to sort an array. Sorting is super easy in Python, but I'll often sort the wrong way and struggle to diagnose it until later.

You should always be clear about whether you want ascending or descending order: np.sort always returns ascending order, so a descending sort needs an extra step such as reversing the result. A Numpy array can also be sorted in other ways: along a particular axis, with a particular algorithm, or by named fields.

Naturally, you should think about which way you want to sort your array (given the problem you face), but for the reader, here are some examples:

# Python program to demonstrate sorting in numpy
import numpy as np

a = np.array([[1, 4, 2], [3, 4, 6], [0, -1, 5]])
# sort the flattened array (axis=None)
print("Array elements in sorted order:\n", np.sort(a, axis=None))
# sort each row (axis=1)
print("Row-wise sorted array:\n", np.sort(a, axis=1))
# sort each column (axis=0) and specify the sort algorithm
print("Column wise sort by applying merge-sort:\n", np.sort(a, axis=0, kind='mergesort'))
# Example to show sorting of a structured array: set alias names for dtypes
dtypes = [('name', 'S10'), ('grad_year', int), ('cgpa', float)]
# Values to be put in array
values = [('Hrithik', 2009, 8.5), ('Ajay', 2008, 8.7), ('Pankaj', 2008, 7.9), ('Aakash', 2009, 9.0)]
# Creating array
arr = np.array(values, dtype=dtypes)
print("\nArray sorted by names:\n", np.sort(arr, order='name'))
print("Array sorted by graduation year and then cgpa:\n", np.sort(arr, order=['grad_year', 'cgpa']))

The output is,

Array elements in sorted order:
[-1 0 1 2 3 4 4 5 6]
Row-wise sorted array:
[[ 1 2 4]
[ 3 4 6]
[-1 0 5]]
Column wise sort by applying merge-sort:
[[ 0 -1 2]
[ 1 4 5]
[ 3 4 6]]
Array sorted by names:
[(b'Aakash', 2009, 9. ) (b'Ajay', 2008, 8.7) (b'Hrithik', 2009, 8.5) (b'Pankaj', 2008, 7.9)]
Array sorted by graduation year and then cgpa:
[(b'Pankaj', 2008, 7.9) (b'Ajay', 2008, 8.7) (b'Hrithik', 2009, 8.5) (b'Aakash', 2009, 9. )]
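
Two sorting details are worth a separate sketch because they are easy to trip over: np.sort has no descending option, and the array's own sort method works in place. The variable names below are only for illustration.

```python
import numpy as np

a = np.array([3, 1, 2])

# np.sort returns a sorted copy in ascending order and leaves a untouched
ascending = np.sort(a)

# Reversing the ascending result gives descending order
descending = np.sort(a)[::-1]

# The ndarray.sort method sorts in place and returns None
b = a.copy()
result = b.sort()

print(ascending, descending)
print(a, b, result)
```

Assigning the return value of b.sort() to a variable, as done here on purpose, is a common source of a surprising None.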

6: Views vs Copies

This is quite a technical problem but a really interesting one. In the world of Python and Numpy, we have something called a view and something called a copy. A view shares its data with the original object, so changing one changes the other, whereas a copy is an entirely separate object with its own data.

Basic slicing (e.g. a[:3, :]) returns a view. Indexing with a list or array of indices (e.g. a[[0,1,2], :]) returns a copy: even though you’ve indexed into the original object, Numpy copies what you’ve selected and that’s what you’ll be seeing and using. This is slower, as it takes time to copy the required part of the object.

The following example should clarify this:

import numpy as np
a = np.random.randn(5,2)
print("Array is: ", a)
# basic slicing returns a view onto a
av = a[:3, :]
print(av.base is a)
print("av is a View and returns: ", av)
# indexing with a list of row indices returns a copy
ac = a[[0,1,2], :]
print(ac.base is a)
print("ac is a Copy and returns: ", ac)

The output is:

Array is: [[-9.04167793e-02 -9.86453934e-01]
[ 5.73769512e-01 1.56332206e+00]
[ 1.25860275e-01 -1.01739258e-03]
[-1.36741893e+00 5.46968242e-01]
[ 1.77061813e+00 1.19694848e+00]]
True
av is a View and returns: [[-9.04167793e-02 -9.86453934e-01]
[ 5.73769512e-01 1.56332206e+00]
[ 1.25860275e-01 -1.01739258e-03]]
False
ac is a Copy and returns: [[-9.04167793e-02 -9.86453934e-01]
[ 5.73769512e-01 1.56332206e+00]
[ 1.25860275e-01 -1.01739258e-03]]
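
The practical consequence is easiest to see by writing to each result. In this sketch (a small integer array is used instead of random numbers so the values are predictable), a write through the view shows up in the original, while a write to the copy does not:

```python
import numpy as np

a = np.arange(10).reshape(5, 2)   # [[0, 1], [2, 3], ..., [8, 9]]

av = a[:3, :]           # basic slice: a view
ac = a[[0, 1, 2], :]    # list of indices: a copy

print(np.shares_memory(a, av))   # True
print(np.shares_memory(a, ac))   # False

# Writing through the view changes the original array
av[0, 0] = 100
print(a[0, 0])   # 100

# Writing to the copy leaves the original alone
ac[0, 1] = -1
print(a[0, 1])   # 1

# .copy() gives an independent array when a slice must not alias the original
safe = a[:3, :].copy()
safe[0, 0] = 0
print(a[0, 0])   # still 100
```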


There are loads of ways to mess up your code by making silly mistakes, and for what it’s worth, I’ve pretty much made them all. The final mistake is the one I’d pay most attention to, because unnecessary copies can really slow down your programs.

The above mistakes are relatively simple, but if you’re cognisant of them, you’ll spend much less time debugging than I did. I may have reinstalled Python like 10 times because of problem 4 above!

Good luck!

Hopefully you guys found this interesting and keep in touch!
