Wednesday, February 26, 2020

Mastering String Methods in Pandas

Pandas is a popular Python library that provides easy-to-use data structures and data analysis tools. Pandas can be used for reading in data, generating statistics, aggregating, feature engineering for machine learning, and much more. The Pandas library also provides a suite of tools for string/text manipulation.
In this post, we will walk through some of the most important string manipulation methods provided by pandas.
Let’s get started!
First, let’s import the Pandas library
import pandas as pd
Now, let’s define an example pandas series containing strings:
s = pd.Series(['python is awesome', 'java is just ok', 'c++ is overrated'])
Let’s print this series:
print(s)
We notice that the series has ‘dtype: object’, which is the default type automatically inferred. In general, it is better to have a dedicated type. Since the release of Pandas 1.0, we are now able to specify dedicated types. Make sure Pandas is updated by executing the following command in a terminal:
pip install -U pandas
We can specify ‘dtype: string’ as follows:
s1 = pd.Series(['python is awesome', 'java is just ok', 'c++ is overrated'], dtype='string')
Let’s print the series:
print(s1)
We can see that the series type is now specified. It is best to specify the type rather than rely on the default ‘dtype: object’, because the object type allows accidental mixtures of types, which is not advisable. For example, with ‘dtype: object’ you can have a single series containing integers, strings, and floats. For this reason, the contents of a ‘dtype: object’ series can be ambiguous.
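As a quick sketch of why this matters (the values here are made up for illustration), an object series will happily hold a mixture of types, while a string series holds only text:

```python
import pandas as pd

# An object series silently accepts a mixture of types
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)

# A string series keeps every element as text
texts = pd.Series(['1', 'two', '3.0'], dtype='string')
print(texts.dtype)
```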
Next, let’s look at some specific string methods, starting with the ‘count()’ method. Let’s modify our series a bit for this example:
s = pd.Series(['python is awesome. I love python.', 'java is just ok. I like python more', 
                'c++ is overrated. You should consider learning another language, like java or python.'], dtype="string")
Let’s print the new series:
print(s)
Let’s count the number of times the word ‘python’ appears in each string:
print(s.str.count('python'))
We see this returns an integer series with the number of occurrences for each string.
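Note that ‘count()’ interprets its argument as a regular expression, so we can match more than a literal word. As a small extension of the example above (not from the original post), a character class matches both capitalizations:

```python
import pandas as pd

s = pd.Series(['python is awesome. I love python.',
               'java is just ok. I like python more',
               'c++ is overrated. You should consider learning another language, like java or python.'],
              dtype='string')

# count() treats the pattern as a regex, so [Pp]ython
# matches both 'python' and 'Python'
print(s.str.count(r'[Pp]ython'))
```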
Another method we can look at is the ‘isdigit()’ method, which returns a boolean series based on whether each string consists entirely of digits. Let’s define a new series to demonstrate the use of this method. Let’s say we have a series defined by a list of string digits, where missing values are recorded as ‘unknown’:
s2 = pd.Series(['100', 'unknown', '20', '240', 'unknown', '100'], dtype="string")
If we use the ‘isdigit()’ method, we get:
print(s2.str.isdigit())
We can also use the ‘match()’ method to check whether each string matches a pattern. Let’s check which strings match ‘100’:
print(s2.str.match('100'))
We can even check which strings match ‘un’:
print(s2.str.match('un'))
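One caveat worth noting: ‘match()’ anchors the pattern at the beginning of each string, so it is not a general substring test. To check for a pattern anywhere in a string, pandas provides ‘contains()’. A quick sketch of the difference:

```python
import pandas as pd

s2 = pd.Series(['100', 'unknown', '20', '240', 'unknown', '100'], dtype='string')

# match() only succeeds if the pattern matches from the start,
# so '00' never matches here
print(s2.str.match('00'))

# contains() finds the pattern anywhere in the string,
# so '100' matches
print(s2.str.contains('00'))
```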
These results match what we’d expect. We can also use methods to change the casing of the string text in our series. Let’s go back to our series containing opinions about different programming languages, ‘s1’:
s1 = pd.Series(['python is awesome. I love python.', 'java is just ok. I like python more', 
                'c++ is overrated. You should consider learning another language, like java or python.'], dtype="string")
We can use the ‘upper()’ method to capitalize the text in the strings in our series:
s_upper = s1.str.upper()
print(s_upper)
We can also use the ‘lower()’ method:
s_lower = s_upper.str.lower()
print(s_lower)
We can also get the length of each string using ‘len()’:
print(s1.str.len())
Let’s consider a few more interesting methods. We can use the ‘strip()’ method to remove whitespace. For this, let’s define and print a new example series containing strings with unwanted whitespace:
s3 = pd.Series([' python', 'java', 'ruby ', 'fortran '])
print(s3)
As you can see, there is whitespace to the left of ‘python’ and to the right of ‘ruby’ and ‘fortran’. We can remove this with the ‘strip()’ method:
print(s3.str.strip())
We can also remove whitespace on the left with ‘lstrip()’:
print(s3.str.lstrip())
and on the right with ‘rstrip()’:
print(s3.str.rstrip())
Note that in the previous examples I was working with the default ‘dtype: object’ but, again, try your best to remember to specify dtype='string' if you are working with strings.
You can also use the strip methods to remove unwanted characters from your text. Often, real text data contains the character ‘\n’, which indicates a new line. Let’s modify our series and demonstrate the use of strip in this case:
s3 = pd.Series([' python\n', 'java\n', 'ruby \n', 'fortran \n'])
print(s3)
And we can remove the ‘\n’ character with ‘strip()’:
print(s3.str.strip(' \n'))
In this specific example, I’d like to point out a difference in behavior between ‘dtype: object’ and ‘dtype: string’. If we specify dtype='string' and print the series:
s4 = pd.Series([' python\n', 'java\n', 'ruby \n', 'fortran \n'], dtype='string')
print(s4)
We see that ‘\n’ has been interpreted. Nonetheless, using ‘strip()’ on the newly specified series still works:
print(s4.str.strip(' \n'))
The last method we will look at is the ‘replace()’ method. Suppose we have a new series with poorly formatted dollar amounts:
s5 = pd.Series(['$#1200', 'dollar1,000', 'dollar10000', '$500'], dtype="string")
print(s5)
We can use the ‘replace()’ method to get rid of the unwanted ‘#’ in the first element:
print(s5.str.replace('#', ''))
We can also replace the text ‘dollar’ with an actual ‘$’ sign:
s5 = s5.str.replace('#', '')               
s5 = s5.str.replace('dollar', '$')
print(s5)
Finally, we can remove the ‘,’ from the second element:
s5 = s5.str.replace(',', '')
print(s5)
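As a natural next step (this goes beyond the original example, so treat it as a sketch), we can chain the replacements and convert the cleaned strings to a numeric type:

```python
import pandas as pd

s5 = pd.Series(['$#1200', 'dollar1,000', 'dollar10000', '$500'], dtype='string')

# Chain the cleanup steps from above
cleaned = (s5.str.replace('#', '')
             .str.replace('dollar', '$')
             .str.replace(',', ''))

# Drop the '$' sign (regex=False, since '$' is a regex metacharacter)
# and parse the result as nullable integers
amounts = cleaned.str.replace('$', '', regex=False).astype('Int64')
print(amounts)
```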
I will stop here but feel free to play around with the methods a bit more. You can try applying some of the Pandas methods to freely available data sets like Yelp or Amazon reviews which can be found on Kaggle or to your own work if it involves processing text data.
To summarize, we discussed some basic Pandas methods for string manipulation. We went over generating boolean series based on the presence of specific strings, checking for the presence of digits in strings, removing unwanted whitespace or characters, and replacing unwanted characters with a character of choice.
There are many more Pandas string methods I did not go over in this post. These include methods for concatenation, indexing, extracting substrings, pattern matching and much more. I will save these methods for a future article. I hope you found this post interesting and/or useful. The code in this post is available on GitHub. Thank you for reading!

Wednesday, February 19, 2020

Announcing Oracle Cloud Shell

I’m excited to announce the release of Cloud Shell for Oracle Cloud Infrastructure! Cloud Shell gives you access to an always available Linux shell directly in the Oracle Cloud Infrastructure Console. You can use the shell to interact with resources, follow labs and tutorials, and quickly run OCI CLI commands. Cloud Shell joins the existing suite of developer tools (CLI, SDKs, and APIs) as a way to help you manage your Oracle Cloud Infrastructure resources. And, it’s now available in all commercial regions.

Cloud Shell is free to use (with monthly limits on usage) and easy to access from the Console. You don't need to set up any resources to get started with Cloud Shell; just click the new icon at the top of the Console.
Cloud Shell offers the following features:
  • A pre-authenticated OCI CLI, so no set up is required to start using the CLI in Cloud Shell
  • A full Linux shell, with key developer tools for interacting with Oracle Cloud Infrastructure services, and preinstalled language runtimes
  • 5 GB of storage for your home directory, so you can save your work between Cloud Shell sessions
Your Cloud Shell machine is managed by the Cloud Shell service, and is available from anywhere you can access the Console. So, you always have a way to interact with Oracle Cloud Infrastructure resources, from your browser.

Getting Started

To launch Cloud Shell, you only need a valid IAM policy that grants you the use of Cloud Shell.
Sign in to the Console for your tenancy, and then click the Cloud Shell icon at the top of the page.
[Screenshot that highlights the Cloud Shell icon in the top-right corner of the page.]
While you navigate through the Console, Cloud Shell remains active in its own pane.
[Screenshot that shows the Cloud Shell pane open at the bottom of the Console page.]

Quickstart Examples

Here are just a few examples of what you can do with Cloud Shell.

Run an OCI CLI Command in the Console

  1. Open Cloud Shell.
  2. Run an OCI CLI command at the prompt. Try this one:
    $ oci os ns get
  3. Confirm the result (your tenancy’s unique namespace).

Use Preinstalled Tools to Interact with Cloud Resources

For example, you can connect to your Container Engine for Kubernetes cluster as follows:
  1. In the Console, navigate to your clusters (Developer Services > Container Clusters).
  2. Select a cluster, and then click Access Kubeconfig.
  3. Copy the oci ce cluster… command.
  4. Open Cloud Shell.
  5. Paste the copied CLI command into Cloud Shell.
  6. Verify that you can connect to your cluster with kubectl:
    $ kubectl get nodes
  7. Success!

More Information

For more information, see the Cloud Shell documentation and FAQs.

Thursday, February 13, 2020

How to Delete Millions of Rows Fast with SQL

It's a rare event. Removing data from a table. But every now and then you may need to do a bit of spring cleaning and clear down data.

This is easy to do with delete.
But this presents a problem. Deleting lots of rows can be slow. And there's a chance it'll take even longer because another session has locked the data you want to remove.
Luckily there's a trick to speed up the process:
Turn the DML into DDL!
In this post we'll start with a quick recap of how delete works. Then look at several alternatives you can use in Oracle Database to remove rows faster:
  • Truncate
  • Create-table-as-select
  • Dropping or truncating partitions
  • A filtered table move
If you want to see which is quickest, you can skip straight to the performance comparison.

How to Delete Rows with SQL

Removing rows is easy.
Use a delete statement. This lists the table you want to remove rows from. Make sure you add a where clause that identifies the data to wipe, or you'll delete all the rows!
delete from table_to_remove_data
where  rows_to_remove = 'Y';
I discuss how delete works - including why you probably don't want to do this - in more detail in this video.

But when you do want to erase data, delete can take a while. Particularly if you're clearing down most of the rows in a table.
So how can you make delete faster?
It's time to switch the SQL statements to DDL.
Let's start with the easiest case: emptying all the data from a table.

Delete all the Rows Fast with Truncate

If you want to wipe all the data in a table, the fastest, easiest way is with a truncate:
truncate table to_empty_it;
This is an instant metadata operation. It also resets the high-water mark for the table. By default it deallocates all the space above the minextents for the table. You can change this behaviour with the storage clause:
  • truncate table ... reuse storage leaves all the space allocated to the table
  • truncate table ... drop storage (the default) deallocates all the space above the minextents for the table
  • truncate table ... drop all storage deallocates all space from the table
Before using truncate you'll also need to check if there are any foreign keys pointing to this table. If there are (even if the child tables are empty), truncate will throw the following error:
truncate table to_empty_it;
 
ORA-02266: unique/primary keys in table referenced by enabled foreign keys
To get around this you'll need to disable the foreign keys referencing the table first. And re-enable them afterwards. For example:
alter table to_empty_it_child
  modify constraint fk disable;
 
truncate table to_empty_it;
 
alter table to_empty_it_child
  modify constraint fk enable;
Of course, the child tables must be empty too. If there are rows in them, you can't revalidate the foreign keys!
If you want to clear down the parent and child tables in one fell swoop, from Oracle Database 12c there is a trick. Add the cascade clause:
truncate table to_empty_it
  cascade;
Beware!

The cascade option will remove all the rows from your table. And the child tables referencing it. This is an easy way to end up with an empty database!

Use extreme caution when doing this.
Fortunately you need to declare the foreign keys as on delete cascade for this to work. This is a rare option. So it's unlikely you'll be able to do this. Make sure you double-check before running a cascaded truncate!
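For reference, here's a sketch of what such a declaration looks like (the column names are made up); the child's foreign key must be created with on delete cascade for a cascaded truncate to succeed:

```sql
create table to_empty_it_child (
  child_id  integer primary key,
  parent_id integer,
  constraint fk foreign key (parent_id)
    references to_empty_it
    on delete cascade
);

truncate table to_empty_it cascade;
```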
But deleting everything in a table is unusual. Normally you want to remove a subset of the rows. There are several other DDL tricks you can use to make this faster.
The most widely available is to save the rows you want to keep in a temporary table. Then switch the data over.

Remove Rows with Create-Table-as-Select

Hang on. Removing data by creating a table? How does that work?
Bear with me.
Inserting rows in a table is faster than deleting them. Loading data into a new table using create-table-as-select (CTAS) is faster still.
So if you're removing most of the rows from a table, instead of issuing a delete you can:
  • Create a new table saving the rows you want to keep
  • Truncate the original table
  • Load the saved rows back in with insert as select
For example:
create table rows_to_keep as
  select * from massive_table
  where  save_these = 'Y';
 
truncate table massive_table;
 
insert into massive_table
  select * from rows_to_keep;
Or there's another version of this technique which can be even faster.
Switch the tables over.
The process for this is similar: save the rows you want to keep in a new table, then rename the original table out of the way and rename the new table in its place.
For example:
create table rows_to_keep as
  select * from massive_table
  where  save_these = 'Y';
 
rename massive_table
  to massive_archived;
 
rename rows_to_keep
  to massive_table;
This only loads the data once. So can be even faster than using truncate + insert to swap the rows over as in the previous method.
To complete the switch you also need to copy any indexes, constraints, grants, etc. from the old table to the new. These steps could take a long time. So you'll need to test on your tables to see which CTAS method is quickest.
It's also more likely you'll make a mistake copying the dependencies.
But both of these options are fiddly. And come with a huge drawback:
You'll need to take your application offline to do these safely.
This makes CTAS methods a non-starter for many scenarios.
Luckily Oracle Database has a couple of other tricks available.

Delete All Rows in a Partition Fast

When you partition a table, you logically split it into many sub-tables. You can then do operations which only affect rows in a single partition.
This gives an easy, fast way to remove all the rows in a partition. Drop or truncate it!
alter table to_archive_from
  drop partition to_remove_it;
 
alter table to_archive_from
  truncate partition to_empty_it;
As with a regular truncate, you'll need to disable foreign keys on child tables before you can truncate a partition.
To use this method the following criteria need to be true:
  • The table is partitioned
  • You want to remove all the rows in a partition
So the big question:
Is it worth partitioning a table to make deletes faster?
To answer this, you first need to ask a follow-up question:
Will you want to repeat this delete on a regular schedule?
If this is a one-off cleanse, there's little to gain by partitioning the table first.
But for regular data removal, this can help a lot. For example, an information lifecycle management process. With this you archive the oldest data to another table, database or storage system.
Provided you've partitioned the table on insert date, it's simple to wipe out the data in the oldest partition.
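As a sketch of what that setup could look like (column names and dates here are hypothetical), you might range-partition the table on its insert date with one partition per month, then truncate the oldest partition during each archival run:

```sql
create table to_archive_from (
  id          integer,
  insert_date date,
  payload     varchar2(100)
)
partition by range (insert_date)
interval (numtoyminterval(1, 'MONTH')) (
  partition p_oldest values less than (date '2020-01-01')
);

-- Each archival run then removes the oldest month in one operation
alter table to_archive_from
  truncate partition p_oldest;
```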
But remember: partitioning a table affects all operations against it. Partitioning by date may make your archival process simple and fast, but it may also make key queries against the table slower.
Test your entire application workload against the table before diving in with partitioning! You may find the impact to other SQL statements makes partitioning infeasible for many of your tables.
Luckily Oracle Database 12c Release 2 added another fast way to remove lots of data:
A filtered table move.

Delete Data with a Filtered Table Move

Typically you use alter table … move to change which tablespace you store rows in. Or other physical properties of a table such as compression settings.
With a filtered table move you can migrate a subset of the data. Do this by adding a where clause to the statement:
alter table to_delete_from
  move including rows
  where rows_to_keep = 'Y';
This only moves the rows matching the where clause. Provided you want to remove a large chunk of data, this can be much faster than delete.
And it has an online clause. So unlike the CTAS methods, you can do this while the application is still running.
Though, like truncate, it will fail with an ORA-02266 if there are enabled foreign keys pointing to the table. So this needs careful planning to use on parent tables. Or an application outage.

Which is the Fastest Way to Delete Lots of Data?

So, the all-important question: which method is quickest?
To find out, I loaded a table with 100,000 rows. Then measured how long it took to delete 90 percent, 50 percent, and 10 percent of the data.
Except for the partitioning comparison, the tests were on a non-partitioned table. The partitioned table was split into ranges of 10,000 rows. So the tests truncated 9, 5, and 1 partitions respectively.
This graph shows the findings:
A regular delete was by far and away the slowest operation. When deleting 90% of the data it took 11 seconds on average. It was only competitive with the DDL methods when removing as little as 10% of the rows. And even then it had a similar runtime to many of them.
So it can be worth checking if there's a faster option than delete when removing relatively few rows.
Truncating partitions gave the best performance overall. Though it was marginally slower than the other DDL methods when removing most of the data. This makes it a great option if you want to schedule regular deletes. For example, to remove the oldest month of data.
But partitioning has other implications. Adding it to a table may be inappropriate for other use cases. Provided you're on Oracle Database 12.2 or higher, a filtered table move will often give the best performance to simplicity trade-off.
A final note. My test tables only had one index: the primary key. It's likely your tables have at least a couple of other indexes. Which will make the processes slower. And may change the relative speed of each.
So if you need to find the fastest way to delete your data, test each method on your table!
If you want to repeat the tests I ran, use the remove_rows procedure in this Live SQL script. Note the storage quota on Live SQL is too small to test deleting 100,000 rows! So you'll need to try another environment.

Summary

Removing most of the rows in a table with delete is a slow process. One that gets slower the more data you're wiping. Add in other user activity, such as updates that could block it, and deleting millions of rows could take minutes or hours to complete.
Changing the process from DML to DDL can make the process orders of magnitude faster. By far and away the safest of these is a filtered table move.
But if fast SQL is your goal, it's worth trying out a few of the techniques above to find which makes the process the quickest for you.
Just remember: DDL commits in Oracle Database!
So while using DDL can give great performance gains compared to delete, it's also more risky. Should anything go wrong in the process, you can't roll back the statement to recover the lost data. You have to restore from a backup.
Ensure you double, triple or even quadruple test your code to avoid accidentally removing lots of data from your production database!
