Python and Apache Cassandra for Beginners
Learn how to connect Python to Cassandra and insert data with the DataStax Python driver, and use the Astra web console to query data.
Python is one of the most widely used programming languages, with a huge and supportive community, while Cassandra is one of the most popular NoSQL databases, traditionally used for web application storage and for data-centric applications that depend on quick retrieval of data. Together, they are found in many applications across various industries as well as in academia.
This Cassandra Python tutorial is intended for beginners in Python and Cassandra. The code samples you can see throughout the article are publicly available in this GitHub repository. We will guide you through setting up Python as well as DataStax Astra, a managed Cassandra-as-a-Service application hosted on any cloud for free. We will show you how to connect Python to Cassandra and insert data with the DataStax Python driver, and how to use the CQL console in the Astra web console to query data stored in Cassandra. The upcoming Python SDK for Astra will enable API access for REST, GraphQL and a schemaless JSON document API for a given Astra database instance, which will be reviewed in an upcoming article.
Getting Started with Python
Python has gained plenty of popularity over the past decade, and with good reason. Simply put, it is a programming language that is very readable, easy to learn and free to use. It is easy to get started with, yet it can be used for a wide variety of applications and areas. Many Python libraries exist for tasks such as string manipulation, machine learning, object-oriented programming and data engineering, which is the area we will focus on in this tutorial.
Installing Python
There are two major versions of Python available, Python 2 and Python 3. The former is no longer supported and receives no updates, while the latter is the current version and the one we will be using. Installing Python is straightforward irrespective of the OS that you are using.
The best way to install Python on your computer is through the official Python page. Based on the OS you are using, pick the appropriate installer from the official page.
To verify that Python is correctly installed on your computer, open a command line window and execute the following:
python --version
This will return the version that you just installed, for example:
Python 3.8.2
If you installed Python but get a "command not found" or similar error message, Python has most likely not been added to the PATH variable on your OS, so double-check that the directory Python was installed in is part of the PATH.
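The same check can be made from inside Python. The following is a minimal sketch, not part of the tutorial's repository; the helper name describe_python is made up for illustration. It reports the version of the running interpreter and where, if anywhere, a python executable resolves on the PATH.

```python
import shutil
import sys


def describe_python():
    """Return the running interpreter version and where 'python' resolves on PATH."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    # shutil.which returns None when no 'python' executable is found on PATH.
    on_path = shutil.which("python")
    return version, on_path


version, on_path = describe_python()
print(f"Running Python {version}")
print(f"'python' on PATH: {on_path or 'not found'}")
```

If the second line prints "not found" while the script itself runs, the interpreter was started under another name (for example python3) and the PATH entry for python is the thing to fix.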
Installing Python Dependencies
Like most programming languages, Python uses a separate utility to install packages. The package manager that ships with Python is called pip. Other popular tools include conda, another package manager, and virtualenv, which isolates project environments, but for the purposes of this Cassandra Python tutorial we are going to be using pip.
Each Python project that uses pip will usually have a file called requirements.txt in the root directory of the repository, just as we have a requirements.txt in our GitHub project. The contents of this file are very simple; each line consists of the name of a package, optionally followed by a specific version of that package:
cassandra-driver==3.25.0
numpy==1.19.3
astrapy==0.0.2
simplejson==3.17.2
To install the required packages, which include the Cassandra Python libraries cassandra-driver and astrapy, simply navigate to the root directory of the project in your command line and execute:
pip install -r requirements.txt
In the background, pip fetches these packages from the default, public Python Package Index, PyPI. You can inspect the PyPI page of each package, which lists the available versions, further documentation, links to the package's GitHub repository and examples of how to use it. For example, this is the page for numpy, and this is the page for cassandra-driver.
This completes the installation of the dependencies of our Python project.
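To confirm that the pinned packages actually ended up in your environment, something along these lines could be used. It is a sketch under the assumption that requirements.txt only contains simple name==version lines like the ones listed above; parse_requirements and installed_version are illustrative helper names, not part of pip or of the repository.

```python
from importlib import metadata


def parse_requirements(lines):
    """Map package name -> pinned version (or None) for simple 'name==version' lines."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


pins = parse_requirements([
    "cassandra-driver==3.25.0",
    "numpy==1.19.3",
    "astrapy==0.0.2",
    "simplejson==3.17.2",
])
for name, wanted in pins.items():
    print(f"{name}: pinned {wanted}, installed {installed_version(name)}")
```

importlib.metadata is available in the standard library from Python 3.8 onwards, so no extra dependency is needed for this check.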
What is Cassandra?
Cassandra is the leading open-source NoSQL distributed database management system (DBMS). In contrast to traditional SQL DBMSs like Oracle or SQL Server, NoSQL databases follow a different storage model. While SQL systems organise data in tables and columns driven by fixed schemas, Cassandra does not enforce a fixed schema: it allows a dynamic number of columns within the same table (or column family, as a table is often called within Cassandra) and can handle and store unstructured data. In addition, the language used to interact with Cassandra, the Cassandra Query Language (CQL), is a variant (and subset) of traditional SQL.
Cassandra is used by 90% of Fortune 100 companies. Its rapid growth can be explained by a rich feature set that is particularly beneficial for large amounts of data. Its distributed architecture delivers very fast writes and fast reads, has no single point of failure, which supports high availability, and shortens time to market thanks to the simplicity of deploying, managing and maintaining a Cassandra cluster.
Setting up a Cassandra database
Cassandra is open source, which means it is free to use. Depending on the OS you are using, you can download Cassandra and its dependencies locally, then configure and install them. Kubernetes users can also use Docker images, but the whole process can be tiresome, especially for first-time users.
The best way to get started with Cassandra is through a managed Cassandra database that is available over the web. DataStax Astra is a serverless database-as-a-service powered by Apache Cassandra which can be launched with just a few clicks, has a generous free tier, and is available on the major cloud providers (Amazon Web Services, Azure or Google Cloud).
The official Astra guide has all the information you need to create an Astra service; you need to register for an account and then select a few more details, such as choosing a cloud provider and naming your database.
Using astrapy to generate JSON and insert to Astra
Using React, Angular or Vue as a frontend? astrapy is a handy Cassandra Python library that creates a schemaless, JSON document-oriented API atop DataStax Astra's REST API. In this example we are going to authenticate to Astra using a token (instead of a client secret), generate a dummy JSON document and issue a PUT REST call to insert the JSON in an Astra collection.
Before navigating through the code, make sure you have the astrapy and simplejson libraries installed. You can check that by executing pip freeze. If you don't have them, install them from the requirements.txt file in the root of the project with pip install -r requirements.txt.
The first step in the code is to get an Astra HTTP client. Before doing so, we need to generate an application token and export five environment variables in total. Navigate to the DataStax Astra homepage and click on the database name:
Then, just below the database name, click on Connect:
This time, remain on the Connect using an API option:
In the right part of the page, note three of the five environment variables that you need to export:
- ASTRA_DB_ID
- ASTRA_DB_REGION
- ASTRA_DB_KEYSPACE
For the ASTRA_DB_APPLICATION_TOKEN environment variable, let's generate the connection credentials. Click where it says "here":
A new page will pop up. Select the role (choose R/W User) and click on Generate Token:
Once you do, you will get a window with all the details:
Keep the other details (Client Id and Client Secret) somewhere you can reference them, as we will use them later. Export the ASTRA_DB_APPLICATION_TOKEN environment variable, set to the token that was generated in this step.
Fill in any name you like for the fifth and final environment variable, ASTRA_DB_COLLECTION. In my case this variable is equal to demo_book.
Once you have exported the five environment variables according to the OS you are using, you are ready to start executing the Python script. The first part authenticates against DataStax Astra using token authentication:
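As a rough illustration of what token authentication involves, the sketch below reads the five environment variables and derives the pieces an HTTP client needs. It deliberately uses only the standard library rather than astrapy, whose exact API is not reproduced here. The helper names are made up, and the endpoint pattern and the X-Cassandra-Token header name are assumptions about the Astra REST API that you should check against the Connect page of your database.

```python
import os

REQUIRED_VARS = (
    "ASTRA_DB_ID",
    "ASTRA_DB_REGION",
    "ASTRA_DB_KEYSPACE",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_COLLECTION",
)


def load_astra_settings(env=None):
    """Collect the five variables, failing early with the names of any that are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def base_url(settings):
    # Assumed endpoint pattern: https://<database id>-<region>.apps.astra.datastax.com
    return "https://{ASTRA_DB_ID}-{ASTRA_DB_REGION}.apps.astra.datastax.com".format(**settings)


def auth_headers(settings):
    # Assumed header name for token authentication.
    return {
        "X-Cassandra-Token": settings["ASTRA_DB_APPLICATION_TOKEN"],
        "Content-Type": "application/json",
    }
```

Calling load_astra_settings() with no argument reads from the real environment; passing a dictionary instead makes the function easy to try out without exporting anything.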
Once we have the HTTP client object, it is passed to the next method, in which we create a JSON document and insert it into an Astra collection:
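The PUT call itself can be sketched with the standard library as follows. This is not the astrapy-based code from the repository. The URL layout (/api/rest/v2/namespaces/.../collections/...), the header name and the function names are assumptions made for illustration, and the document content is dummy data.

```python
import json
import urllib.request
import uuid


def build_put_request(base_url, keyspace, collection, token, document, doc_id=None):
    """Build (but do not send) a PUT request that stores one JSON document."""
    doc_id = doc_id or str(uuid.uuid4())  # random document id unless one is given
    url = f"{base_url}/api/rest/v2/namespaces/{keyspace}/collections/{collection}/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"X-Cassandra-Token": token, "Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request; this needs network access and valid Astra credentials."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Dummy document, purely illustrative.
dummy_book = {"title": "A dummy book", "pages": 123}
```

Separating the construction of the request from sending it keeps the part that needs credentials and a network connection in one small function.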
To execute the Python script, type python json_document_api.py and hit Enter.
Finally we can confirm that the document has been inserted successfully by issuing a curl command to retrieve it through the command line as follows:
This command will return all documents that have been inserted, in our case:
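For readers who prefer to stay in Python, the same retrieval could be sketched as a GET request. This is an alternative to the curl command rather than a reproduction of it, and it rests on the same assumptions as before about the endpoint layout and the X-Cassandra-Token header; build_get_request is an illustrative name.

```python
import json
import urllib.request


def build_get_request(base_url, keyspace, collection, token):
    """Build (but do not send) a GET request that lists documents in a collection."""
    url = f"{base_url}/api/rest/v2/namespaces/{keyspace}/collections/{collection}"
    return urllib.request.Request(url, headers={"X-Cassandra-Token": token}, method="GET")


def fetch_documents(request):
    """Send the request and decode the JSON body; needs network access and a valid token."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```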
Setting up connection from Python to Astra using a driver
The Cassandra Python cassandra-driver makes it easy to authenticate and insert tabular data in DataStax Astra. Up to this point we have installed Python and set up an Astra serverless instance that we can work with through a schemaless, JSON document-oriented API. Let's continue by interacting with our Astra database using the cassandra-driver as a schema-driven alternative to the Document API in the Astra Python SDK.
Before starting to code, we need to get the prerequisites for configuring the connection from Python to Astra. From the Astra dashboard page, click on the database name on the left-hand side of the page:
Then, just below the database name, click on Connect:
Choose Python from the drivers list:
A detailed guide of the steps you need to follow will now pop up on the right side of the page. First, download the bundle and store it locally; we will reference it later in our code. On the right-hand side of the page, click on the Download Bundle button and then on the Secure Connect Bundle button in the popup.
This will download the bundle to the default downloads directory configured in your browser, which is usually located in your home directory. The name of the bundle follows the format secure-connect-<database-name>.zip; in our case it is named secure-connect-cassandra-pythondemo.zip. Note the location of the downloaded zip.
How to create a Cassandra Table in Astra
In this section we are going to generate a fictional time series dataset in Python and insert the data into our Astra database using the DataStax Python driver (cassandra-driver).
First, we are going to create the Astra table that will hold our data. Navigate to the Astra dashboard page and click on the database name on the left-hand side of the page:
Click on the Cassandra Query Language (CQL) Console:
Copy the contents of intro/demo_readings.sql from the GitHub repo, paste them into the CQL Console and hit Enter:
This completes the creation of the Astra table. As you can see in the above DDL script, this time series dataset consists of a floating-point metric (value) that is captured at continuous intervals (value_ts) for a fictional pair of hardware identifiers (device_id and timeseries_id), along with the timestamp of when the record was captured (publication_ts).
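The actual DDL lives in intro/demo_readings.sql in the repository and is not reproduced here. For orientation only, a table with the five columns described above might be declared roughly as follows. The column types and the primary key are assumptions, and the keyspace and table names are taken from the query used later in this article. The statement is wrapped in Python so that it could also be run through a connected cassandra-driver session instead of the CQL Console.

```python
# Hypothetical DDL: column types and primary key are assumptions, not the repo's script.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS cassandra_pythondemo.demo_readings (
    device_id      int,
    timeseries_id  int,
    value_ts       timestamp,
    publication_ts timestamp,
    value          double,
    PRIMARY KEY ((device_id, timeseries_id), value_ts)
)
"""


def create_table(session):
    """Run the DDL through an already connected cassandra-driver session."""
    session.execute(CREATE_TABLE_CQL)
```

With this layout, all readings of one device and time series pair would share a partition and be ordered by value_ts within it, which is a common shape for time series tables in Cassandra.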
Inserting data in Cassandra with Python
Following the creation of the Cassandra table, let's move over to Python. First, make sure to git clone the project to your local filesystem:
git clone git@github.com:andyadamides/python-cassandra-intro.git
Note: If you don’t have git installed, follow this Github guide to do so.
The program consists of one Python script called main.py. The entrypoint is located in the final two lines of the script. The Python interpreter knows by design to start execution from this part of the script:
if __name__ == "__main__":
    main()
The main() method performs two high-level tasks: it establishes the connection with the Astra database and then inserts the data that has been generated:
def main():
    """The main routine."""
    session = getDBSession()
    generateDataEntrypoint(session)
Establishing the connection to the Astra database takes place in the getDBSession() method:
At this step, make sure to fill in the correct details for connecting to Astra. In particular, make sure to export the three environment variables that the Python code expects in order to securely and successfully establish the connection to Astra:
- ASTRA_PATH_TO_SECURE_BUNDLE
- ASTRA_CLIENT_ID
- ASTRA_CLIENT_SECRET
Note: ASTRA_CLIENT_ID and ASTRA_CLIENT_SECRET were generated in the above section.
Locate the bundle zip downloaded in the previous step, and copy it to a directory that the ASTRA_PATH_TO_SECURE_BUNDLE environment variable can point to:
cloud_config = {
    'secure_connect_bundle': os.environ.get('ASTRA_PATH_TO_SECURE_BUNDLE')
}
For example, if you put the secure bundle zip in the root of the Python project, the value of the ASTRA_PATH_TO_SECURE_BUNDLE environment variable would need to be equal to '../secure-connect-cassandra-pythondemo.zip' and the root directory of the project would include the following files and folders:
Similarly set ASTRA_CLIENT_ID and ASTRA_CLIENT_SECRET environment variables with the values from the previous step.
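Putting the three variables together, a getDBSession() along the following lines would establish the connection. Treat it as a sketch rather than the repository's exact code: read_astra_settings is an illustrative helper, while Cluster(cloud=...), PlainTextAuthProvider and cluster.connect() are the cassandra-driver calls normally used with a secure connect bundle.

```python
import os


def read_astra_settings(env=None):
    """Read the bundle path and credentials, failing early if any are missing."""
    env = os.environ if env is None else env
    names = ("ASTRA_PATH_TO_SECURE_BUNDLE", "ASTRA_CLIENT_ID", "ASTRA_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


def getDBSession():
    """Connect to Astra and return a cassandra-driver session."""
    # Imported here so the settings check above works before cassandra-driver is installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    settings = read_astra_settings()
    cloud_config = {"secure_connect_bundle": settings["ASTRA_PATH_TO_SECURE_BUNDLE"]}
    auth_provider = PlainTextAuthProvider(
        settings["ASTRA_CLIENT_ID"], settings["ASTRA_CLIENT_SECRET"]
    )
    cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
    return cluster.connect()
```

Failing early on a missing variable gives a clearer error than letting the driver attempt a connection with a None bundle path.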
Once the connection is established, we proceed to generate and insert data in the generateDataEntrypoint(session) method, passing in the Astra session object that was created by the getDBSession() method in the previous step.
Note: Hardcoding secrets is not recommended; make sure not to do this in production, and design your applications with a security-first mindset. We will not be covering best practices for sourcing secrets securely in this article.
As we are generating fictional data, we are making use of the numpy and random Python libraries to help create a list of ids based on an arbitrary range and then randomly pick one id:
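The idea can be sketched as below. To keep the example free of third-party dependencies it uses the built-in range where the original script uses numpy (numpy.arange would play the same role), and the id range itself is an arbitrary assumption.

```python
import random

# Hypothetical id range; the original script builds a similar list with numpy.
timeseries_ids = list(range(100000, 200000))


def pick_timeseries_id(ids, rng=random):
    """Randomly pick one id from the list of candidates."""
    return rng.choice(ids)


print(pick_timeseries_id(timeseries_ids))
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which can be handy when debugging.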
We also wrap the generation logic in two loops, which run as many times as configured by two static variables. For example, the following script will generate 2 time series with 10 rows each (based on the timeseries_to_generate and number_of_rows variables):
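A sketch of the outer loop is shown here; the inner loop over rows would live in the insert method. Unlike the original generateDataEntrypoint(session), this version takes the insert function and the id range as extra parameters so that it can be run on its own; both extra parameters are assumptions made for illustration.

```python
import random

timeseries_to_generate = 2
number_of_rows = 10


def generateDataEntrypoint(session, insert_rows, ids=range(100000, 200000)):
    """Outer loop: pick one random id per time series and delegate the row inserts."""
    for _ in range(timeseries_to_generate):
        t_id = random.choice(ids)
        print(f"Processing:{t_id}")
        # The inner loop over number_of_rows happens inside insert_rows.
        insert_rows(t_id, number_of_rows, session)
```

In the tutorial's script the role of insert_rows is played by generateInsertData(t_id, number_of_rows, session).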
Inserting the data to Astra takes place in the generateInsertData(t_id, number_of_rows, session) method:
First, we prepare dummy data with the readings and device_ids variables. For these variables we create two more arbitrary numpy lists in the same way we did with the timeseries ids previously. We also introduce a new Python module called datetime, which we use to create the value_ts and t_pub_ts variables.
Following the initialisation of the above variables, we prepare the insert statement for Astra with the insert_query variable. A PreparedStatement is particularly useful for queries that are executed multiple times in an application, which fits our use case. As you can see, we execute insert_query within the following for loop number_of_rows times, inserting data into the table we created in the previous step:
session.execute(insert_query, [device_id, t_id, value_ts, t_pub_ts, round(random.choice(readings),2)])
The session.execute(insert_query, data) function call is effectively using the Astra database session that we created in the above step. Its first argument is the insert_query prepared statement, and the second argument is an array-like object that contains the actual data to be inserted into the database table.
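Pulling these pieces together, the insert method might look roughly like the sketch below. session.prepare and session.execute are real cassandra-driver session methods, but the CQL text, the dummy value ranges and the one-second spacing between readings are assumptions, and plain Python ranges stand in for the numpy lists.

```python
import random
from datetime import datetime, timedelta, timezone

# Assumed statement; column order matches the parameter list passed to execute below.
INSERT_CQL = (
    "INSERT INTO cassandra_pythondemo.demo_readings "
    "(device_id, timeseries_id, value_ts, publication_ts, value) "
    "VALUES (?, ?, ?, ?, ?)"
)


def generateInsertData(t_id, number_of_rows, session):
    """Insert number_of_rows dummy readings for one time series."""
    readings = [random.uniform(0, 100) for _ in range(50)]  # dummy metric values
    device_ids = range(1000, 2000)                          # dummy device ids
    device_id = random.choice(device_ids)
    insert_query = session.prepare(INSERT_CQL)  # prepared once, reused for every row
    start = datetime.now(timezone.utc)
    for i in range(number_of_rows):
        value_ts = start + timedelta(seconds=i)  # one reading per second
        t_pub_ts = datetime.now(timezone.utc)    # when the record was published
        session.execute(
            insert_query,
            [device_id, t_id, value_ts, t_pub_ts, round(random.choice(readings), 2)],
        )
        print("Success")
```

Preparing the statement outside the loop is the point of the exercise: the server parses the query once and each iteration only ships the bound values.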
Executing the Python script and checking results in Astra
Let’s go ahead and execute the Python script.
Note: Make sure to git clone the repo and to complete all the prerequisites listed at the beginning of this article (installing Python, setting up Astra, setting up the Astra bundle and client id/secret in Python) before attempting to execute this script.
Open a command prompt and navigate to the location of the Python script file. Make sure to configure the amount of data to be generated (the timeseries_to_generate and number_of_rows variables). Then, execute:
python main.py
Based on your configuration, you will get an output similar to the following:
Processing:130483
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Processing:176990
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
The above indicates the execution of the Python program was successful. To check that data has been inserted in Astra, let’s move to the Astra console and execute a CQL script.
In the Astra console, navigate to the CQL Console tab:
Type the following CQL query and hit Enter:
select * from cassandra_pythondemo.demo_readings;
This gives the following result set:
The insertion script in Python worked like a charm and we have successfully inserted data in the Astra Cassandra database.
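The same check could also be made from Python instead of the CQL Console. The following sketch assumes an already connected cassandra-driver session and the column names described earlier; fetch_readings is an illustrative helper and the LIMIT is added only to keep the output short.

```python
def fetch_readings(session, limit=10):
    """Return up to `limit` rows from the demo table as plain tuples."""
    rows = session.execute(
        f"SELECT * FROM cassandra_pythondemo.demo_readings LIMIT {int(limit)}"
    )
    # By default the driver returns rows as named tuples keyed by column name.
    return [(row.device_id, row.timeseries_id, row.value_ts, row.value) for row in rows]
```

With a live session this would be called as fetch_readings(getDBSession()).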
Conclusion
We have now successfully built a Python script that connects to Astra, a serverless database-as-a-service powered by Apache Cassandra. We have shown how to navigate the Astra website to create new Cassandra tables and execute queries through the CQL Console, how to generate data in Python using libraries such as numpy and datetime, and how to configure the connection to Astra with the Python cassandra-driver and insert data with prepared statements. We have also used the astrapy Cassandra Python library to interact with the Astra Document API to insert and retrieve JSON data.
Learn More:
The code for this Cassandra Python tutorial can be found here:
https://github.com/andyadamides/python-cassandra-intro