Tuesday, March 31, 2020

Learn how to read data into a Pandas DataFrame in 5 minutes

It is often said that data scientists spend 80% of their time preprocessing data, so let's dive into the data preprocessing pipeline, also known as the ETL (Extract, Transform, Load) pipeline, and look at its first stage. In this blog post, we will learn how to extract data from different data sources. Let's take a real-life dataset so it's easier to follow.
This lesson uses data from the World Bank. The data comes from two sources:
  1. World Bank Indicator Data — This data contains socio-economic indicators for countries around the world. A few example indicators include population, arable land, and central government debt.
  2. World Bank Project Data — This data set contains information about World Bank project lending since 1947.
We will extract this data from five different formats: CSV, JSON, XML, a SQL database, and an API.
1. CSV — CSV stands for comma-separated values. This is how the file looks:
id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Projec
Let's load this file using pandas.
import pandas as pd
df_projects = pd.read_csv('../data/projects_data.csv')

# /opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning:
# Columns (44) have mixed types. Specify dtype option on import or set low_memory=False.
We get a DtypeWarning. Pandas infers the data type of each column while reading the file, but one of our columns contains mixed data types, hence the warning. We can tell pandas to read every column as a string by passing the dtype parameter. Please refer to the pandas documentation to read more.
df_projects = pd.read_csv('../data/projects_data.csv',dtype=str)
Output:
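If you already know which column triggers the warning, you can also pass a dictionary to dtype so that only that column is read as a string while pandas infers the rest. A minimal sketch; treating countryname as the mixed-type column is just an assumption for illustration.
import pandas as pd

# read only the suspect column as a string and let pandas infer the rest
# ('countryname' is used purely as an example; replace it with the column flagged in the warning)
df_projects = pd.read_csv('../data/projects_data.csv', dtype={'countryname': str})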
Let’s read another CSV file:
df_population = pd.read_csv("../data/population_data.csv")
# ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63
Looks like something is wrong with this CSV file. Let's check its contents. If the file is small, you can open it directly in Notepad or Excel, or you can use the Python code below:
with open("../data/population_data.csv") as f:
    lis = [line.split() for line in f]        # create a list of lists
    for i, x in enumerate(lis):               # print the list items
        print("line{0} = {1}".format(i, x))

# Output (first five lines):
# line0 = ['\ufeff"Data', 'Source","World', 'Development', 'Indicators",']
# line1 = []
# line2 = ['"Last', 'Updated', 'Date","2018-06-28",']
# line3 = []
# line4 = ['"Country', 'Name","Country', 'Code","Indicator', 'Name","Indicator', 'Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017",']
The first four lines of the CSV file contain metadata rather than tabular data, which is why the parser fails. We can skip them using the skiprows parameter.
df_population = pd.read_csv("../data/population_data.csv",skiprows=4)
Output:
2. JSON — JSON (JavaScript Object Notation) is a file format with key/value pairs. This is how the file looks:
[{"id":"P162228","regionname":"Other","countryname":"World;World","prodline":"RE","lendinginstr":"Investment Project Financing"},{"id":"P163962","regionname":"Africa","countryname":"Democratic Republic of the Congo;Democratic Republic of the Congo","prodline":"PE","lendinginstr":"Investment Project Financing"},{"id":"P167672","regionname":"South Asia","countryname":"People's Republic of Bangladesh;People's Republic of Bangladesh","prodline":"PE","lendinginstr":"Investment Project Financing"}]
Thankfully, pandas can read JSON directly.
import pandas as pd
df_json = pd.read_json('population_data.json',orient='records')
Other Methods:
import json

# read in the JSON file
with open('population_data.json') as f:
    json_data = json.load(f)

# print the first record in the JSON file
print(json_data[0])
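Since json.load returns a plain list of dictionaries for a records-style file like this one, you can also hand that list straight to pandas. A minimal sketch, assuming population_data.json has the same flat record structure as the sample above:
import json
import pandas as pd

with open('population_data.json') as f:
    json_data = json.load(f)

# a list of records (dictionaries) converts directly to a DataFrame
df_json = pd.DataFrame(json_data)
print(df_json.head())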
3. XML — Another data format is XML (Extensible Markup Language). XML is very similar to HTML, at least in terms of formatting. This is how the file looks:
<ENTRY>
  <ID>P162228</ID>
  <REGIONNAME>Other</REGIONNAME>
  <COUNTRYNAME>World;World</COUNTRYNAME>
  <PRODLINE>RE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P163962</ID>
  <REGIONNAME>Africa</REGIONNAME>
  <COUNTRYNAME>Democratic Republic of the Congo;Democratic Republic of the Congo</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
There is a Python library called BeautifulSoup, which makes reading in and parsing XML data easier. Here is the link to the documentation: Beautiful Soup Documentation
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# open the population_data.xml file and load it into Beautiful Soup
with open("population_data.xml") as fp:
    soup = BeautifulSoup(fp, "lxml")  # lxml is the parser type
Let’s see what soup looks like:
<html><body><p><?xml version="1.0" encoding="utf-8"?>
<root xmlns:wb="http://www.worldbank.org">
<data>
<record>
<field key="ABW" name="Country or Area">Aruba</field>
<field key="SP.POP.TOTL" name="Item">Population, total</field>
<field name="Year">1960</field>
<field name="Value">54211</field>
</record>
<record>
<field key="ABW" name="Country or Area">Aruba</field>
<field key="SP.POP.TOTL" name="Item">Population, total</field>
<field name="Year">1961</field>
<field name="Value">55438</field>
</record>
How do we read XML into a DataFrame?
# build a dictionary with one list per column
data_dictionary = {'Country or Area': [], 'Year': [], 'Item': [], 'Value': []}

for record in soup.find_all('record'):
    for field in record.find_all('field'):
        data_dictionary[field['name']].append(field.text)

df = pd.DataFrame.from_dict(data_dictionary)
df = df.pivot(index='Country or Area', columns='Year', values='Value')
df.reset_index(level=0, inplace=True)
In short, we build a dictionary with one list per column (or, alternatively, one dictionary per row) and then convert that dictionary to a DataFrame.
Output:
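The row-oriented alternative mentioned above works the same way: build one dictionary per record and let pandas derive the columns. A minimal sketch, assuming the soup object from the Beautiful Soup step is still in scope:
import pandas as pd

# build one dictionary per <record>, then let pandas infer the columns
rows = []
for record in soup.find_all('record'):
    rows.append({field['name']: field.text for field in record.find_all('field')})

df_rows = pd.DataFrame(rows)
print(df_rows.head())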
4. SQL — SQL databases store data in tables linked by primary and foreign keys.
To read data from a SQL database, you need to have your data stored in one. To learn how to convert a CSV file into a SQL database, read this blog.
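As a quick illustration of how a CSV might end up in a SQLite database in the first place, here is a minimal sketch using pandas' to_sql; the database file name and table name are assumptions chosen to match the queries below.
import sqlite3
import pandas as pd

# load the cleaned CSV (skipping the four metadata rows, as shown earlier)
df_population = pd.read_csv('../data/population_data.csv', skiprows=4)

# write it into a SQLite database table called 'population_data'
conn = sqlite3.connect('population_data.db')
df_population.to_sql('population_data', conn, if_exists='replace', index=False)
conn.close()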
SQLite3 to Pandas
import sqlite3
import pandas as pd

# connect to the database
conn = sqlite3.connect('population_data.db')

# run a query
pd.read_sql('SELECT * FROM population_data', conn)
SQLAlchemy to Pandas
import pandas as pd
from sqlalchemy import create_engine

# create a database engine pointing at the SQLite file
engine = create_engine('sqlite:////home/workspace/3_sql_exercise/population_data.db')
pd.read_sql("SELECT * FROM population_data", engine)
5. API — Scraping data from the web can be a tedious process, but many companies have made their data available through APIs. APIs generally provide data in either JSON or XML format. Some APIs are public and require no login, whereas others are private and require users to generate an API key.
import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
r.json()
API calls differ from site to site; here we use the requests library to get the data in JSON format. Read more about different API structures in the provider's documentation.
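To turn the API response into a DataFrame, note that the World Bank API (with format=json) usually returns a two-element list: paging metadata first, then the actual records. A minimal sketch based on that assumption:
import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)

# element 0 holds paging metadata, element 1 holds the records (assumed layout of the response)
records = r.json()[1]

# flatten nested fields such as 'country' and 'indicator' into columns
df_api = pd.json_normalize(records)
print(df_api.head())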

Monday, March 30, 2020

Set up a Container Native Deployment with Oracle Container Pipelines

In this surreal time of the COVID-19 virus, a lot of conferences where I was going to speak this year were cancelled, so I decided to write my presentation in the form of an article. I'm not sure whether I'll be able to tell this story somewhere in the world this year, so here it is! This blog is also available on Oracle.com.
Delivering high-quality software in a 24/7 environment is a challenge for a lot of teams these days. Teams at multinationals need to deliver more functionality, more quickly; some of them are able to deliver new software every hour. How is it possible to meet these demands?

DevOps

You must have heard the term many times these days. Maybe you already work like this in your team. Either way, ways of working are changing rapidly, and DevOps is the approach a lot of teams aim for, whether as a starting point or as a way to improve their current practices and meet the high demands of their clients, business teams or company goals.

Other ways of Development & Operations

Besides new ways of working, teams are also facing new technologies and innovations in how to develop, build, deploy, operate and monitor applications. A lot of companies these days are shifting from on-premises to “Cloud Native” applications. This means, for instance, that the build pipeline of such an application might look a little different from that of a traditional application. Also take into consideration that the application landscape is being redesigned into a (partially) microservices or serverless landscape, supported by container-native platforms and private or commercial cloud vendors.

DevOps challenges – the “Ops” in DevOps

I won't go through all aspects of DevOps, but one challenge DevOps teams face is the Ops part of it. These are business-critical applications that don't allow any downtime or performance loss, yet a lot of teams are not well focused on the Ops side. As a result, application error rates are high: 32% of production apps have problems, and even worse, those problems are often only discovered when customers report them.
There are several reasons for this to happen:
  • Lack of continuity and/or automation
  • Lack of visibility

CALMSS Model

To help DevOps teams become more successful and achieve their goals, Forrester developed the so-called CALMSS model:
This whitepaper focuses on the technology part, which supports certain highlighted aspects of the CALMSS model, and especially on the “Oracle way” of interpreting these. Over the last few years, Oracle has done a major job of adopting and keeping up with Cloud Native technologies and embedding them in its current cloud offerings, which will be discussed in the coming parts.

Solutions for Cloud Native Deployments

Through the years, the industry has developed some solutions for automating deployments as much as possible, in line with the DevOps way of working.

Oracle Container Pipeline – A Cloud Native Container Pipeline

To reach a high level of automation, it is essential to automate your software delivery from development to operations. Especially for the cloud, Oracle has several technologies to support this. The ingredients you may need are the following: Versioning & Container Registry, Containers & Orchestration Engine, Provisioning, Container Pipelines, and Packaging & Deployments. Having these ingredients will enable your team to reach a higher level of delivery, because less manual work is needed once everything is implemented. Of course, the implementation itself takes time and investment, but it pays off later, once you are up to speed with it.

Versioning & Container Registry

For setting up a continuous pipeline, and in this case a container pipeline, you need a mechanism to version your source code and to register your application containers in a registry. There are various source code repositories:
  • Git (commonly used with GitHub)
  • SubVersion
  • Bitbucket
The most common is GitHub. There are even movements to make GitHub the “source of truth” and implement a GitOps methodology, in which you declaratively describe the entire desired state of your system in Git, and everything is controlled and automated with YAML files and pull requests.
Container registries are used to store the container images created during the application development process, so that they can be reused during development. Companies often use a public registry, but it is recommended to have a private registry. Within the Oracle Cloud, you can set up your own private registry, with benefits such as high availability and maintenance.

Set up your private container registry at Oracle Cloud Infrastructure

To be able to push and pull your containers to a private OCI registry, the following steps need to be applied:
  • Set up an auth token in your Cloud Console
  • From your OCI client, log in to your OCI registry with the Docker CLI (fra is the Frankfurt region):
docker login fra.ocir.io
You will be prompted for credentials. You can find your user details in your cloud user settings:
<tenancy-namespace>/oracleidentitycloudservice/<username>
Then, you can either pull an image, or push a locally built image to your private registry. First, tag the image for pushing:
docker tag <docker hub name>/helloworld:latest <region-key>.ocir.io/<tenancy-namespace>/<repo-name>/<image-name>:<tag>
Region, tenancy, etc. should apply to your own situation. Then push:
docker push <region-key>.ocir.io/<tenancy-namespace>/<repo-name>/<image-name>:<tag>
You can verify the push afterwards.

Containers & Orchestration Engine

From bare metal to virtualization to containerization: containers have gained significant popularity over the last few years. There are many container technologies to choose from, but there is a general lack of knowledge about the subtle differences between them and about when to use which. Docker has become the most popular, more or less the de facto standard for container engines and runtimes, though there are others, such as:
  • CoreOS - Rocket
  • Linux Containers (LXC), which Docker was originally built on top of
  • Kata containers
The underlying technology has actually been around for a long time: since 1979 it was hidden in UNIX, and later in Linux, as chroot, which isolates the root directory of a process and its children to a new location in the filesystem. This was the beginning of process isolation: segregating file access per process.
So, some basic characteristics of containers are:
  • Container: a configurable unit for a small set of services/applications, packaged as lightweight images
  • Containers share the OS kernel – no hypervisor (except for Kata Containers, which run their own kernel)
  • Isolated resources (CPU, Memory, Storage)
  • Application & Infrastructure software as a whole

Orchestration

In a container-based landscape there can be a large number of containers, which raises the question of how to manage and structure them. An orchestration platform can be the solution, and Kubernetes is such a platform: it manages storage, compute resources and networking, with “Infrastructure as Code” as the way to handle lifecycle management. The orchestration platforms available these days are:
  • Docker Swarm
  • Kubernetes, on premises or as managed cloud engines by Microsoft (AKS), Google (GKE), IBM/Red Hat, Amazon (EKS) and Oracle (OKE)
  • Red Hat OpenShift
Kubernetes can now be considered the standard, widely adopted platform for orchestrating containers. Once started as an internal project at Google, Kubernetes is now a framework for building distributed platforms: it manages and orchestrates container processes and takes care of the lifecycle management of runtime containers.
Some of the basic concepts of Kubernetes are:
  • Master: controls the Kubernetes nodes
  • Node: performs the tasks requested and assigned by the master
  • Pod: the scheduling unit for a group of one or more containers
  • Replication controller: controls that the desired number of identical pods runs across the cluster
  • Service: decouples work definitions from the pods and connects containers and pods
  • kubelet: runs on each node and reads container manifests to watch over the containers' lifecycles
  • kubectl: the command-line configuration tool for Kubernetes
  • etcd: the key–value store holding all cluster configuration

Provisioning

It's important to provision your infrastructure as code. Every cloud vendor now has a portal where you can set up a Kubernetes cluster in minutes, but when you want to do it repeatedly and in a more automated way, it's recommended to use the Terraform provider. Terraform is a tool for provisioning cloud environments, and its scripts can easily be integrated into your container pipeline, as shown in this whitepaper. For a detailed setup of Terraform, look at: https://community.oracle.com/blogs/mnemonic/2018/09/23/oracle-kubernetes-engine-setup-fast-with-terraform

Packaging & Deployments

In the open source community, it's all about adoption: if a technology or initiative is good, it will be embraced and eventually become some sort of standard. Speaking of packaging and deployment, Helm has become a widely used tool for this. It's a release and package management tool for Kubernetes and can be integrated with CI build tools (Maven, Jenkins, and Wercker).
This is a simple setup of the Helm components
Helm workflow according to V3

Container Pipelines

Setting up a Cloud Native container pipeline can be done with various technologies, and there are a lot of open source tools to facilitate this. A few well-known ones are:
  • JenkinsX and Tekton: you could run the default Jenkins in the cloud, but JenkinsX and Tekton are cloud native
  • Knative, originated by Google
    - K8S-native OSS framework for creating (CI/CD) systems
  • Spinnaker: an open source, multi-cloud continuous delivery platform initiated in 2014 by Netflix
  • OpenShift Pipelines
  • Azure DevOps Pipelines
  • Oracle Container Pipelines (Wercker): a Docker and Kubernetes native CI/CD platform for Kubernetes and microservices deployment
    - Formerly Wercker, acquired by Oracle
    - Partially open source (the CLI)

Melting this all together in: Oracle Container Pipelines

Oracle Container Pipelines is fully web based and integrates with tools like GitHub, GitLab and Bitbucket. It contains all the build workflows, together with their dependencies, environment settings and permissions. Before Oracle acquired it, it was called Wercker: a container-native, open source CI/CD automation platform for Kubernetes and microservice deployments, where every artifact can be a packaged Docker container.
At its core, it is a CI/CD tool designed specifically for container development. It's pretty much codeless, meant to work with containers, and can be used with any stack in any environment. Its central concept is the pipeline, which operates on code and artifacts to generate a container: a pipeline produces an artifact, which can be packaged into a container. The aim is to work directly with containers through the entire process, from development to operations. This means the code is containerized from the beginning, with a development environment that is almost the same as the production one.
Oracle Container Pipelines works with the following concepts:
  • Organizations
    - This is a team, group or department working together on a certain project, as a unit in Wercker. An organization hosts applications, and users can be part of one or more organizations.
  • Applications
    - An application contains the build workflows and consists of dependencies, environment configuration and permissions. Applications are linked to a versioning system, usually a project on GitHub, GitLab or Bitbucket, or your own SCM.
  • Steps
    - Stages in the pipeline, each an isolated script or compiled binary for accomplishing a specific automation task.
    - Such as install, build, configure, test. You can add an npm-install, maven build or a python script to test your build.
    - Or a Docker action (Push, Pull etc.).
  • Pipelines (pipeline consists of steps)
    - A series of steps that is triggered by a Git push or by the completion of another pipeline. This is more or less a GitOps approach; a typical example is a pipeline that builds a container, pushes it to the registry and deploys it to a Kubernetes cluster (OKE).
  • Workflows
    - A workflow is a set of chained and branched pipelines that forms multi-stage, multi-branch, complex CI/CD flows. Workflows are configured in the Web UI and depend on the wercker.yaml where the pipelines are configured, but they are not part of that YAML. Variations are based on branch, and workflows can be executed in parallel.
  • wercker.yaml
    - The central file that must be in the root of your Git repository. It defines the build of your application using the steps and pipelines you configured in Wercker.

Configure Wercker for OKE

To be able to deploy applications to your managed Kubernetes engine, you need to set some configuration items on the configuration tab of your Wercker application, so the pipeline can access your OKE cluster.

Blend in Helm and Terraform in your Wercker Workflow

Building an entire Oracle Kubernetes Engine cluster and deploying your applications to it can be achieved by creating pipelines that consist of all the necessary steps, such as:
  • Use a lightweight image for the build
  • Perform all the steps needed: Terraform commands, Helm commands, specific kubectl commands or scripts
  • Configure API keys (in case of a Terraform build of OKE), OCI and Kubernetes details

Terraform provision

Set up a temp Terraform box

Provision Kubernetes OKE cluster

Helm Steps

The Helm steps are basically the same as the other steps:
  • Push a build container image to your Container Registry
  • Set up a temp box for Helm install
  • Fetch the Helm repo and generate charts to install your application container
Example of the running pipeline
These are more or less the steps to take in order to set up a container-based pipeline where you:
  • Provision Infrastructure with Terraform
  • Do Helm initialization and repository Fetch
  • Install application containers with Helm

Conclusions

Wercker (or Oracle Container Pipelines) is, in my opinion, a good option for your containerized pipelines, with lots of options for different methods and technologies. It requires some work to set up, but its components can be integrated at different levels.
To me, it is currently unclear how this will evolve, especially with other more well-known platforms such as Jenkins-X and Tekton. I will closely follow the different solutions!
Michel Schildmeijer
ABOUT THE AUTHOR: MICHEL SCHILDMEIJER
Having made his start in the pharmacy sector, Michel transitioned to IT in 1996, working on a UNIX TTY terminal-based system and the MUMPS language. He currently works as a solutions architect at Qualogy, with a focus on middleware, application integration and service-oriented architecture. His passion for middleware started in 2000 when working as a support analyst for a financial institute with BEA WebLogic and Tuxedo. Michel is an expert on the WebLogic platform. He serves customers in his role as architect and advises them in all aspects of their IT landscape. He became an Oracle ACE in 2012 and wrote two books about WebLogic: Oracle WebLogic Server 11gR1 PS2: Administration Essentials and Oracle WebLogic Server 12c: First Look. He is a well-known speaker at national and international conferences and is recognised as an official Oracle Speaker. Read his blog: https://community.oracle.com/blogs/mnemonic
