PyMySQL: Connecting Python and SQL for Data Science

 

Easily access MySQL databases and execute SQL queries in Python

Photo by Maikel Oosterink on Unsplash

SQL and Python are indispensable tools for data practitioners to work effectively with data.

A common use case would be the initial retrieval of data from relational databases using SQL queries, followed by subsequent manipulation and analysis of the data in Python with libraries such as pandas.

But did you know that these two seemingly isolated tasks can be combined into a single Python script to easily deliver the same outcome?

In this article, we discover the concepts and implementation of PyMySQL for connecting to and interacting with MySQL databases from Python.

(1) What is PyMySQL, and Why Use It?

PyMySQL is a pure-Python MySQL client library, which means it is a Python package that creates an API interface for us to access MySQL relational databases.

The documentation page states that PyMySQL was built based on PEP 249. It means PyMySQL was developed based on the Python Database API Specification, which was set to promote standardization of Python modules used for database access.

The key reason for using PyMySQL is that it serves as a handy interface to interact directly with MySQL databases by incorporating SQL statements within the confines of Python scripts.

It also means we do not require access to a separate RDBMS interface for running SQL queries needed for our downstream Python operations.

(2) Initial Setup

(i) Installation

We can install PyMySQL using pip:

pip install PyMySQL

The following version requirements also need to be met:

(ii) Data Setup

For this demo, we generate a 1000-row mock dataset from Mockaroo to represent a customers dataset. Then, the data is imported into a local MySQL database via the Table Data Import Wizard function to form a customers table.

Screenshot of the customers table in MySQL Workbench | Image by author

The data I generated can be found here in the project GitHub repo.

If you would like to know more free resources on generating mock data for experiments like this, do check out the following reference:

(iii) Import packages

We then import all the necessary packages for the project:

import pymysql
import pandas as pd
import matplotlib.pyplot as plt

(3) Connection and Cursor Objects

The PyMySQL library is built upon two key Python objects: connection objects and cursor objects.

  • Connection Object: Creates access to the MySQL database
  • Cursor Object: Executes SQL queries to communicate with the MySQL database, e.g., fetch database records, and call procedures.

(4) Establishing Connection to MySQL Server

Before interacting with the MySQL database, we must connect to the server. The first step is to define the configuration parameters.

Here are the default configuration settings we can use to connect to a local database instance. Note that these values may vary depending on the specific settings you created for your schema and database.

These parameters should ideally be stored in a separate .py file and parsed into the main Python script using tools like configparser.

With the configuration parameters defined, we can parse them into a PyMySQL connect method call, which can be wrapped within a Python function.

We then execute the connection by calling the Python function and returning the connection object as a variable:

(5) Retrieving Data from MySQL Database

Let's say we would like to extract a subset of the data (i.e., customers who drive BMW) from the customers table. We first define a function that does the following:

  • Create a cursor object from the connection object set up earlier.
  • Parse SQL query into the cursor object and execute the query. For this example, the query is SELECT * FROM customers WHERE car_make = 'BMW';, which we store as a string inside a Python variable called query.
  • Commit the execution to the connection object for the query to reach the database. It is because the connection object does not automatically commit by default.
  • Fetch all the data records resulting from the SQL query that was executed.
  • Convert the data records (which are returned as a list of dictionaries) into a pandas DataFrame.

The above steps are wrapped in the Python function (get_records) shown below:

Running the function returns the following output:

Image by author

Now that the records from the SQL query are stored as a Python pandas dataframe, we are free to perform a myriad of analysis and manipulation techniques on the data.

For example, we can use pandas and matplotlib to visualize the distribution of how often these customers visit the specific mobile application (based on the column freq_app_visit):

Histogram of the `freq_app_visit` column of the dataframe | Image by author

(6) Other CRUD Functions

Besides reading data from MySQL databases, we can also use PyMySQL to perform other CRUD (Create, Read, Update, Delete) functions to manage the database.

For example, we can execute a query that inserts a data record into the customers table. Like before, we create a Python function for this specific operation of data insertion.

Once done, we can view the newly added record using the get_record Python function we created earlier:

Retrieving data of the newly inserted record | Image by author

(7) Wrapping It Up

In this article, we learned how we could leverage the capabilities of PyMySQL to interact with MySQL databases using Python.

In particular, we saw the ease with which we can retrieve, manipulate, and analyze the data from MySQL databases within the confines of a single Python script or notebook.

You can check out the GitHub repo for this project here and the official PyMySQL documentation here.

If you would like to see how PyMySQL is used for cloud databases (rather than local databases as shown in this demo), do check out this TowardsDataScience writeup:

Comments

Popular posts from this blog

Flutter for Single-Page Scrollable Websites with Navigator 2.0

A Data Science Portfolio is More Valuable than a Resume

Better File Storage in Oracle Cloud