f you work in the field of data science, regardless of the job title or position, SQL is one of the most important skills you need to have. I would put it in the top 3 along with Python.
When I first started learning data science, I found it a little difficult to set up an environment for practicing SQL. I would need to install a database engine and an IDE for writing queries such as MySQL Workbench or pgAdmin.
It’s not very complicated to make this setup and you’re likely to use such environments when you get a job in data science. However, there is a much simpler option for practicing when you just get started learning.
This option is sqlite3, which is a built-in Python module. It provides an easy-to-use SQL interface. The sqlite3 module is based on SQLite, which is a C-language library that provides a lightweight disk-based database.
Since most of us start the data science journey with installing Python, why not make use of it to learn SQL as well?
In this article, we will go through several examples to learn how to:
- Create a database
- Create tables
- Populate tables
- Query tables
with the sqlite3 module.
Creating a database
The first step is to create a database. We can use the connect function for this task. It is used for connecting to a database and if the database does not exist, the connect function also creates it.
import sqlite3
con = sqlite3.connect("sample.db")
The next step is to create a database cursor, which is required for executing SQL statements and fetching results.
cursor = con.cursor()
We can see the tables in this database using the following query and the fetchall function. We are basically querying the built-in sqlite_master table.
check_tables = """SELECT name FROM sqlite_master WHERE type='table';"""
cursor.execute(check_tables)
cursor.fetchall()
# output
[]
Since the database does not contain any tables, the fetchall function returns an empty list.
Creating tables
Let’s create a table. We will write the query to create a table and then use the cursor to execute it.
create_products_table = """CREATE TABLE IF NOT EXISTS products (
id integer PRIMARY KEY,
category text NOT NULL,
price real,
cost real
);"""
cursor.execute(create_products_table)
Let’s check the tables in the sample database again.
check_tables = """SELECT name FROM sqlite_master WHERE type='table';"""
cursor.execute(check_tables)
cursor.fetchall()
# output
[('products',)]
We now see the products table exists in the database.
We can execute any query using the cursor. For instance, the following code snippet executes the query to select all the rows in the products table and assigns the result to a variable called result. Then, the fetchall function is called on the result variable to see what it contains.
result = cursor.execute("SELECT * FROM products")
result.fetchall()
# output
It does not return anything since the products table is empty. Let’s populate it using the insert into statement.
populate_products_table = """INSERT INTO products VALUES
(1001, 'A', 15.9, 12.9),
(1002, 'B', 24.9, 20.5),
(1003, 'A', 13.5, 10.6),
(1004, 'A', 17.5, 13.5),
(1005, 'B', 28.9, 23.5)
;"""
cursor.execute(populate_products_table)
Note: The insert statement implicitly opens a transaction, which needs to be committed before changes are saved in the database.
con.commit()
The products table should have 5 rows now. Let’s confirm by executing the select statement above.
result = cursor.execute("SELECT * FROM products")
result.fetchall()
# output
[(1001, 'A', 15.9, 12.9),
(1002, 'B', 24.9, 20.5),
(1003, 'A', 13.5, 10.6),
(1004, 'A', 17.5, 13.5),
(1005, 'B', 28.9, 23.5)]
We can, of course, write more advanced queries. For instance, the following query returns the rows that belong to category A.
result = cursor.execute("SELECT * FROM products WHERE category='A'")
result.fetchall()
# output
[(1001, 'A', 15.9, 12.9),
(1003, 'A', 13.5, 10.6),
(1004, 'A', 17.5, 13.5)]
One line of code to create a Pandas DataFrame
One of the nice things about sqlite3 module is that it is compatible with Pandas. Hence, we can easily write the results of a query into a Pandas DataFrame.
We can call the read_sql_query function of Pandas with the query and connection object.
products = pd.read_sql_query("SELECT * FROM products", con)
products
Let’s create another table and populate it.
create_customers_table = """CREATE TABLE IF NOT EXISTS customers (
id integer PRIMARY KEY,
customer id integer NOT NULL,
is_member integer NOT NULL,
purchase_date text,
purchased_product integer,
purchase_quantity integer
);"""
populate_customer_table = """INSERT INTO customers VALUES
(1, 110, 0, "2022-12-23", 1002, 5),
(2, 112, 0, "2022-12-14", 1001, 4),
(3, 113, 1, "2022-12-08", 1003, 6),
(4, 113, 1, "2022-12-14", 1002, 4),
(5, 114, 0, "2022-12-21", 1004, 10)
;"""
cursor.execute(create_customers_table)
cursor.execute(populate_customer_table)
con.commit()
We can create a DataFrame with the customers table as follows:
customers = pd.read_sql_query("SELECT * FROM customers", con)
customers
More complex queries
Let’s write a query that retrieves data from both products and customers tables. It will include a join statement.
query = '''
SELECT
customer_id,
purchased_product,
purchase_quantity,
price
FROM customers c
LEFT JOIN products p ON c.purchased_product = p.id
'''
cursor.execute(query)
cursor.fetchall()
# output
[(110, 1002, 5, 24.9),
(112, 1001, 4, 15.9),
(113, 1003, 6, 13.5),
(113, 1002, 4, 24.9),
(114, 1004, 10, 17.5)]
You can use the description attribute to see the column names in the output of the query:
cursor.description
# output
(('customer_id', None, None, None, None, None, None),
('purchased_product', None, None, None, None, None, None),
('purchase_quantity', None, None, None, None, None, None),
('price', None, None, None, None, None, None))
# save the names in a Python list using list comprehension
col_names = [description[0] for description in cursor.description]
col_names
# output
['customer_id', 'purchased_product', 'purchase_quantity', 'price']
Closing the connection
We can use the close method to close the current connection to the database.
con.close()
To verify if the changes are saved in the database, we can create a new connection and write a query to check.
con = sqlite3.connect("sample.db")
cursor = con.cursor()
result = cursor.execute("SELECT * FROM products")
result.fetchall()
# output
[(1001, 'A', 15.9, 12.9),
(1002, 'B', 24.9, 20.5),
(1003, 'A', 13.5, 10.6),
(1004, 'A', 17.5, 13.5),
(1005, 'B', 28.9, 23.5)]
Yes, the changes were saved in the sample.db database.
Conclusion
The sqlite3 module provides an easy-to-use environment for practicing SQL, which is of crucial importance for pretty much anyone working in the data science ecosystem.
Not having to install any additional tool or database engine makes it even more encouraging to use sqlite3. Moreover, all it takes to save the result of a query into a Pandas DataFrame is just one line of code. I think this is a great feature since Pandas is the most widely-used data manipulation and analysis tool.
No comments:
Post a Comment