Subrat's Technical Blog: Learn how to read data into a Pandas DataFrame in 5 minutes

Overview

It is said that Data Scientist spends 80% of their time in preprocessing the data, so lets deep dive into the data preprocessing pipeline also known as ETL pipeline and let's find out which stage takes the most time. In this blog post, we will learn how to extract data from different data sources. Let's take a real-life dataset so it’s easier to follow.

This lesson uses data from the World Bank. The data comes from two sources:

World Bank Indicator Data — This data contains socio-economic indicators for countries around the world. A few example indicators include population, arable land, and central government debt.
World Bank Project Data — This data set contains information about World Bank project lending since 1947.

Types of Data files

CSV — CSV stands for comma-separated value. This is how the file looks

id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Projec

Let's load this file using pandas.

import pandas as pd
df_projects = pd.read_csv('../data/projects_data.csv')#ERROR:
#/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (44) have mixed types. Specify dtype option on import or set low_memory=False.interactivity=interactivity, compiler=compiler, result=result)

We will get a DType warning error. Basically, pandas figure out the data types of our file and read them appropriately but one of our columns had multiple data types thus the warning error. We can pass the data type of the string while reading. Please refer to pandas documentation to read more.

df_projects = pd.read_csv('../data/projects_data.csv',dtype=str)

Output:

Let’s read another CSV file:

df_population = pd.read_csv("../data/population_data.csv")
# ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63

Looks like something is wrong with this CSV file. Let’s check the contents. Here, if the file is small you can directly open the CSV file using notepad/excel or you can use the python code below:

with open("../data/population_data.csv") as f:
    lis = [line.split() for line in f]        # create a list of lists
    #print(lis)
    for i, x in enumerate(lis):              #print the list items 
        print ("line{0} = {1}".format(i, x))#Output:line0 = ['\ufeff"Data', 'Source","World', 'Development', 'Indicators",']
line1 = []
line2 = ['"Last', 'Updated', 'Date","2018-06-28",']
line3 = []
line4 = ['"Country', 'Name","Country', 'Code","Indicator', 'Name","Indicator', 'Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017",']

Looks like the first 4 lines of the CSV files are corrupted. So, we can skip the first 4 rows by using skiprows parameter.

df_population = pd.read_csv("../data/population_data.csv",skiprows=4)

Output:

2. JSON — It is a file format with key/value pairs.

[{"id":"P162228","regionname":"Other","countryname":"World;World","prodline":"RE","lendinginstr":"Investment Project Financing"},{"id":"P163962","regionname":"Africa","countryname":"Democratic Republic of the Congo;Democratic Republic of the Congo","prodline":"PE","lendinginstr":"Investment Project Financing"},{"id":"P167672","regionname":"South Asia","countryname":"People\'s Republic of Bangladesh;People\'s Republic of Bangladesh","prodline":"PE","lendinginstr":"Investment Project Financing"}]

Thankfully, pandas have the feature to read JSON directly.

import pandas as pd
df_json = pd.read_json('population_data.json',orient='records')

Other Methods:

import json# read in the JSON file
with open('population_data.json') as f:
    json_data = json.load(f)# print the first record in the JSON file
print(json_data[0])

3. XML — Another data format is called XML (Extensible Markup Language). XML is very similar to HTML at least in terms of formatting.

<ENTRY>
  <ID>P162228</ID>
  <REGIONNAME>Other</REGIONNAME>
  <COUNTRYNAME>World;World</COUNTRYNAME>
  <PRODLINE>RE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P163962</ID>
  <REGIONNAME>Africa</REGIONNAME>
  <COUNTRYNAME>Democratic Republic of the Congo;Democratic Republic of the Congo</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>

There is a Python library called BeautifulSoup, which makes reading in and parsing XML data easier. Here is the link to the documentation: Beautiful Soup Documentation

# import the BeautifulSoup library
from bs4 import BeautifulSoup# open the population_data.xml file and load into Beautiful Soup
with open("population_data.xml") as fp:
    soup = BeautifulSoup(fp, "lxml") # lxml is the Parser type

Let’s see what soup looks like:

<html><body><p><?xml version="1.0" encoding="utf-8"?>
<root xmlns:wb="http://www.worldbank.org">
<data>
<record>
<field key="ABW" name="Country or Area">Aruba</field>
<field key="SP.POP.TOTL" name="Item">Population, total</field>
<field name="Year">1960</field>
<field name="Value">54211</field>
</record>
<record>
<field key="ABW" name="Country or Area">Aruba</field>
<field key="SP.POP.TOTL" name="Item">Population, total</field>
<field name="Year">1961</field>
<field name="Value">55438</field>
</record>

How to read XML as a data frame?

data_dictionary = {'Country or Area':[], 'Year':[], 'Item':[], 'Value':[]}for record in soup.find_all('record'):
    for record in record.find_all('field'):
        data_dictionary[record['name']].append(record.text)df = pd.DataFrame.from_dict(data_dictionary)
df = df.pivot(index='Country or Area', columns='Year', values='Value')df.reset_index(level=0, inplace=True)

Basically, we need to make a dictionary for each column or for each row and then convert the dictionary to dataframe.

Output:

4. SQL — SQL databases store data in tables using primary and foreign keys

To read data from the SQL database, you need to have your data stored in the database. To know how to Convert CSV to SQL DB read this blog.

SQLite3 to Pandas

import sqlite3
import pandas as pd# connect to the database
conn = sqlite3.connect('population_data.db')# run a query
pd.read_sql('SELECT * FROM population_data', conn)

SQLAlchemy to Pandas

import pandas as pd
from sqlalchemy import create_engineengine=create_engine('sqlite:////home/workspace/3_sql_exercise/population_data.db')
pd.read_sql("SELECT * FROM population_data", engine)

5. API — Extracting data from the Web is a very tiring process. But a lot of companies have made their data public through API. APIs generally provide data in either JSON or XML format. Some APIs are public where you do not need to log in whereas some are private, where users need to generate an API key.

import requests
import pandas as pdurl = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
r.json()

API calls are different for different websites, here we are using requests library to get the data in a JSON format. Read more about different API structures here

Subrat's Technical Blog

Tuesday, March 31, 2020

Learn how to read data into a Pandas DataFrame in 5 minutes

Overview

Types of Data files

No comments:

Microsoft Fabric : Dynamic Data Masking

Report Abuse