Subrat's Technical Blog: The Best (FREE) Data Repositories for Aspiring Data Scientists

Earlier this week, Google announced that its Dataset Search engine is now out of beta. This is a great accomplishment for the world and an invaluable tool for any aspiring Data Scientist in 2020.

In honor of the news, I thought I’d put together a list of my favorite data repositories, all of which I’ve used in the past, as a quick reference guide for any and all aspiring Data Scientists. No matter what industry you want to get into, there’s definitely a dataset for you here :)

Awesome Public Datasets

Awesome Public Datasets is a repository on GitHub of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses. Almost all of these are free, with a few exceptions here and there.

Data is Plural

Data is Plural is a weekly newsletter of useful/curious datasets. You can find a huge archive of datasets in their Google Doc. Just hit Ctrl + F, type a topic you’d like to look into, and see the dozens of results that pop up.
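If you’d rather search an archive like this from code than with Ctrl + F, here is a minimal sketch. It assumes you have exported the archive to a CSV file yourself. The column names ("edition", "headline", "text") and the sample rows are made-up placeholders for illustration, not the newsletter’s real layout or entries.

```python
import csv
import io

# Hypothetical stand-in for a CSV export of the archive. The column names
# and rows below are placeholders, not real newsletter entries.
SAMPLE_EXPORT = """edition,headline,text
2020.01.01,Placeholder headline A,A made-up blurb about hospital records.
2020.01.08,Placeholder headline B,A made-up blurb about bird migration.
2020.01.15,Hospital placeholder C,A made-up blurb about city budgets.
"""


def search_archive(csv_file, keyword):
    """Return every row in which any text field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if any(isinstance(value, str) and needle in value.lower() for value in row.values())
    ]


# Search the sample export; swap in open("your_export.csv", newline="") for a real file.
matches = search_archive(io.StringIO(SAMPLE_EXPORT), "hospital")
for row in matches:
    print(row["edition"], "-", row["headline"])
```

The function checks every column, so a keyword matches whether it appears in a headline or in the description, much like a page-wide Ctrl + F.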

Data World

Data World is an open data repository containing data contributed by thousands of users and organizations all across the world.

What I love about this site is that it contains data that is really hard to find elsewhere. In particular, healthcare is one of the more difficult industries to get publicly available data from (due to privacy concerns). But luckily, Data World has 3667 free health datasets you can use for your next project.

Google Dataset Search

A dataset search engine… powered by Google. No further explanation needed.

Kaggle

Kaggle enables data scientists and other developers to run machine learning contests, write and share code, and host datasets. The types of data science problems posted on Kaggle can be anything from predicting cancer occurrence by examining patient records to analyzing the sentiment evoked by movie reviews and how this affects audience reaction.

Makeover Monday

This repository is mostly for data visualizations, but I think what they do is a lot of fun.

Makeover Monday is an initiative started in the first week of 2016 by Andy Kriebel (Head Coach, the Information Lab UK, @vizwizbi) and Andy Cotgreave (Tableau Evangelist, @acotgreave).

Every week, usually on a Sunday, Andy K posts (via blog and Twitter) an original visualization to be “made over”. Some are awful; some are already great, in which case the challenge is to present a different angle on the original.

When you have finished your makeover, post a link to the visualization and/or a picture of it using the hashtag #MakeoverMonday. All the individual screenshots are compiled into one big Pinterest collage of combined visualizations.

r/datasets/

A place to share, find, and discuss datasets. You can request datasets from other subscribers as well as share and contribute your own.

UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited “papers” in all of computer science.

United States Government

Under the terms of the 2013 Federal Open Data Policy, newly-generated government data is required to be made available in open, machine-readable formats, while continuing to ensure privacy and security.

That’s going to be all for now. Please feel free to bookmark this article and use it as a quick reference for your data pursuits.

Did I miss your favorite repository? Let me know below so I can add it to the guide. Until next time everyone, happy coding.

Monday, January 27, 2020
