Wednesday, December 4, 2019

Full stack heroku

Becoming the Full Stack Data Scientist: Part 1

Mandy Gu
Mandy Gu
Dec 4 · 6 min read
The full stack Data Scientist is the much coveted poster-child hire for all data roles. This person (essentially a unicorn) has strong business acumen, statistical background, modelling capabilities and data engineering know-hows.
On one hand, the full stack Data Scientist is working with engineering teams to deploy models to production and on the other, distilling complex machine learning ideas to business stakeholders.

I’m still trying to navigate what it means to be this type of mythical Data Scientist. Through this journey, my abilities as a developer and independent thinker grew in parallel to my holistic understanding of the data workflow.
I hope to convey my learnings through this series, which I anticipate will span across 5 instalments. Each part will intersperse my own musings followed by a technical portion relevant to one aspect of the data workflow.
In this article, I will introduce the beginnings of the holistic data workflow and walk through the first step of the data journey: setting up and connecting to a database.
Estimated time for setting up the database: < 30 minutes

The Holistic Workflow Part 1

In a world with fast automation where Google’s AutoML is quickly surpassing the capabilities of individuals, that type of Data Science role is becoming obsolete. Today’s top dollar Data Scientists are being paid for their critical thinking and not for these typesetter tasks.

Every data product begins with a business problem.

But that’s not the right order of operations. The solution should follow the problem, and not the other way around. At the end of the day, our impact is measured by the lift we drive to the company’s KPIs and not by the novelty of our approach.
Thus, every data product should begin with the identification of a business problem. The business scoping involves a problem quantification (what is the expected impact of my work?) and collaboration with domain experts and non-technical stakeholders to understand the value drivers.

Following the problem identification, we need to retrieve the data.

This step involves familiarity with the datawarehouse/data lake/data marts and the ETL workflows which keeps this data fresh.
The data storage unit is more than a collection of our training data. It also becomes the foundation of our feedback loop (the ideal workflow will write new data points with their ground truths into the same table(s) for the next training process to make the model even smarter) and can also be used to log model performance and help with model version control.
This is also where our practical example will kick in — working and connecting to an actual database. The next article will dive into ETL workflows, orchestrators and the such.

Set up a Postgres Database using Heroku

Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.
Heroku is a high level service built on top of AWS with a value proposition based on convenience. Heroku services, including Heroku Postgres, are ridiculously easy to set up. We’ll be using different Heroku services frequently for the rest of this series, so it’s definitely worth creating an account for!
  1. Create a Heroku account
  2. (Optional, but recommended) Install the Heroku Command Line Interface
  3. Go to the dashboard and click New to create a new app
4. Under the resources tab, search for Heroku Postgresas an add on
I selected Hobby dev (the free plan) for my database. The free plan limits the database to 10,000 rows, which should be enough for most hobby projects.
Amazing — now we have our Postgres database!

Connecting to the Database using Python

Python Dependencies

  • pandas
sqlalchemy is used to create an engine which connects to the database. The SQLAlchemy website proudly mentions that the tool is used by esteemed organizations such as Dropbox, Yelp and reddit.
pandas is a staple in every Data Scientist’s toolkit — it supports reading SQL outputs into dataframes as well as writing dataframes into SQL tables.

Database URI

On Heroku, the URI can be found by clicking into the database add-on on the app page.
On the add-on page, click the settings tab.
The Database Credentials section should be the first thing on the page (there is a view credential button on the right to show the actual credentials). Like any good administrator, I’ve kept my credentials hidden.
You’ll notice that one of the credentials is the URI itself. This is the string that you need to connect to the database. The string is constructed using the host, database, user, port, password.
The format of the URI string is:
postgresql://{user}:{password}@{host}:{port}/{schema}
The first term corresponds to the SQL dialect — since our database is a Postgres database, we specify postgresql.
Most SQL GUIs will require typing in each credential separately and not as a part of the URI.

Bringing in the Python

import pandas as pd
from sqlalchemy import create_engineURI = REDACTEDengine = create_engine(URI)# Execute a SQL query using pandas
df = pd.read_sql_query('select 1', con=engine)# Write a pandas dataframe into the database
df.to_sql('table_name', con=engine)
That’s 1 of 10,000 rows used!
Note that Heroku limits the number of connections to 20 and that this simple exercise up one. After closing the Python session, the connection to the database gets freed up.
We can modify the code to close the connection after we are done to be mindful about resources.
engine = create_engine(URI)conn = engine.connect()# Execute a SQL query using pandas
df = pd.read_sql_query('select 1', con=conn)# Write a pandas dataframe into the database
df.to_sql('table_name2', con=conn)conn.close()engine.dispose()

That’s all for today! In this article, we set up a Postgres database and connected to it via Python. This is essentially the foundation of every data warehouse — a myriad of connections to each service’s DBAPI.
Stay tuned for next time :-)

Towards Data Science

Sharing concepts, ideas, and codes.

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...