Best Public Datasets for Machine Learning and Data Science
Best open-access datasets for machine learning, data science, sentiment analysis, computer vision, natural language processing (NLP), clinical data, and others.
This resource is continuously updated. If you know of any other suitable open dataset, please let us know by emailing us at pub@towardsai.net or by dropping a comment below.
Dataset Finders
Google Dataset Search: Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they are hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. It’s a phenomenal dataset finder, and it contains over 25 million datasets.
Kaggle: Kaggle hosts a vast collection of datasets, with something for everyone from the enthusiast to the expert.
UCI Machine Learning Repository: The Machine Learning Repository at UCI provides an up-to-date resource for open-source datasets.
VisualData: Discover computer vision datasets by category; it allows searchable queries.
CMU Libraries: Discover high-quality datasets thanks to the collection of Huajin Wang, at CMU.
General Datasets
Housing Datasets
Boston Housing Dataset: Contains information collected by the US Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.
Geographic Datasets
Google-Landmarks-v2: An improved dataset for landmark recognition and retrieval. This dataset contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wikimedia Commons community.
Machine Learning Datasets
Mall Customers Dataset: The Mall customers dataset contains information about people visiting the mall in a particular city. The dataset consists of various columns like gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest.
IRIS Dataset: The iris dataset is a simple, beginner-friendly dataset that contains the sepal and petal length and width of iris flowers. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification.
MNIST Dataset: This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.
Fake News Detection Dataset: A CSV file with 7,796 rows and four columns: news, title, news text, and result.
Wine quality dataset: The dataset contains different chemical information about the wine. The dataset is suitable for classification and regression tasks.
SOCR data - Heights and Weights Dataset: This is a basic dataset for beginners. It contains only the heights and weights of 25,000 different 18-year-olds. This dataset can be used to build a model that predicts height or weight.
Titanic Dataset: The dataset contains information like name, age, sex, number of siblings aboard, and other information about 891 passengers in the training set and 418 passengers in the testing set.
Credit Card Fraud Detection Dataset: The dataset contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities.
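As a rough illustration of the kind of model the Heights and Weights entry above has in mind, the sketch below fits a straight line by ordinary least squares in plain Python. The sample values are invented for the example and are not taken from the SOCR data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalized; the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy values (height in inches, weight in pounds), NOT real SOCR records.
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
weights = [118.0, 124.0, 130.0, 136.0, 142.0]

slope, intercept = fit_line(heights, weights)
predicted = slope * 69.0 + intercept
print(f"weight ~ {slope:.2f} * height + {intercept:.2f}; prediction at 69 in: {predicted:.1f}")
```

With the real file, the two lists would be read from the downloaded CSV instead; a single-feature line like this is only a baseline to compare richer models against.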
Computer Vision Datasets
xView: xView is one of the largest publicly available datasets of overhead imagery. It contains images of complex scenes from around the world, annotated with bounding boxes.
ImageNet: One of the largest image datasets for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.
Kinetics-700: A large-scale dataset of YouTube video URLs covering 700 human-centered action classes. It contains roughly 650,000 video clips.
Google’s Open Images: A vast dataset from Google AI containing over 10 million images.
Cityscapes Dataset: This is an open-source dataset for computer vision projects. It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
IMDB-Wiki dataset: The IMDB-Wiki dataset is one of the most extensive open-source datasets of face images with gender and age labels. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
Color Detection Dataset: The dataset contains a CSV file with 865 color names, each with its corresponding RGB (red, green, and blue) values and its hexadecimal value.
Stanford Dogs Dataset: It contains 20,580 images of 120 different dog breeds.
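To make the Color Detection entry above more concrete, here is a minimal sketch of how a table of color names and RGB values can be used to name an arbitrary pixel: pick the entry with the smallest distance in RGB space. The column layout (`name,hex,r,g,b`) and the helper names are assumptions for illustration, and the five rows are ordinary well-known colors rather than rows copied from the dataset.

```python
import csv
import io

# A few well-known colors standing in for the real file; the column layout
# (name, hex, r, g, b) is an assumption, so adjust it to the CSV you download.
SAMPLE_CSV = """name,hex,r,g,b
Red,#ff0000,255,0,0
Green,#00ff00,0,255,0
Blue,#0000ff,0,0,255
White,#ffffff,255,255,255
Black,#000000,0,0,0
"""


def load_colors(text):
    """Parse CSV text into a list of (name, (r, g, b)) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["name"], (int(row["r"]), int(row["g"]), int(row["b"]))) for row in reader]


def nearest_color(rgb, colors):
    """Return the name whose RGB value has the smallest squared Euclidean distance to rgb."""
    def distance(entry):
        _, (r, g, b) = entry
        return (r - rgb[0]) ** 2 + (g - rgb[1]) ** 2 + (b - rgb[2]) ** 2
    return min(colors, key=distance)[0]


colors = load_colors(SAMPLE_CSV)
print(nearest_color((250, 20, 30), colors))
```

Plain Euclidean distance in RGB is the simplest choice; it does not match human color perception closely, so a perceptual color space may be a better fit for serious use.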
Sentiment Analysis Datasets
Lexicoder Sentiment Dictionary: This dictionary is built specifically for sentiment analysis. It contains roughly 3,000 negative sentiment words and roughly 2,000 positive sentiment words.
IMDB reviews: A dataset of 50,000 movie reviews labeled for sentiment, available on Kaggle.
Stanford Sentiment Treebank: Standard sentiment dataset with sentiment annotations.
Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
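A sentiment dictionary such as the one listed above is typically used by counting positive and negative word hits in a text. The sketch below shows that idea under loud assumptions: the two word sets are tiny hypothetical stand-ins (not entries from the Lexicoder dictionary), the example sentences are invented, and `classify` is an illustrative helper name.

```python
import re

# Tiny hypothetical word lists for illustration; they are NOT taken from the
# Lexicoder Sentiment Dictionary.
POSITIVE = {"good", "great", "friendly", "comfortable"}
NEGATIVE = {"bad", "late", "rude", "cancelled", "lost"}


def classify(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented example sentences, not tweets from the airline dataset.
for text in ["Great crew and a comfortable seat",
             "Flight cancelled and my bag was lost",
             "Boarding starts at gate 12"]:
    print(classify(text), "|", text)
```

A counting approach like this ignores negation and sarcasm, which is one reason labeled datasets such as the ones in this section are used to train and evaluate stronger models.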
Natural Language Processing (NLP) Datasets
HotpotQA Dataset: Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
Amazon Reviews: A vast dataset from Amazon, containing over 45 million Amazon reviews.
Rotten Tomatoes Reviews: Archive of more than 480,000 critic reviews (fresh or rotten).
SMS Spam Collection in English: A dataset of 5,574 English SMS messages, each labeled as spam or legitimate (ham).
Enron Email Dataset: It contains around 500,000 emails from about 150 users.
Recommender Systems Dataset: It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, and others that are used in building a recommender system.
UCI Spambase Dataset: Classifying emails as spam or non-spam is a prevalent and useful task. The dataset contains 4,601 emails, each described by 57 attributes. You can use it to build models that filter out spam.
IMDB reviews: The large movie review dataset consists of reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.
Self-driving (Autonomous Driving) Datasets
Waymo Open Dataset: This is a fantastic dataset resource from the folks at Waymo. It includes a vast collection of autonomous driving data, enough to train deep nets from scratch.
Berkeley DeepDrive BDD100k: One of the largest datasets for self-driving cars, containing over 100,000 videos that cover more than 1,100 hours of driving across New York and California.
Bosch Small Traffic Light Dataset: Dataset for small traffic lights for deep learning.
LaRa Traffic Light Recognition: Another dataset for traffic lights, recorded in Paris.
WPI datasets: Datasets for traffic light, pedestrian, and lane detection.
Comma.ai: It contains details such as a car’s speed, acceleration, steering angle, and GPS coordinates.
MIT AgeLab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab.
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Cityscapes Dataset: This is an extensive dataset of street scenes recorded in 50 different cities.
Clinical Datasets
COVID-19 Dataset: The Allen Institute for AI has released a vast research dataset of over 45,000 scholarly articles about COVID-19.
MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
Datasets for Recommender Systems
MovieLens: It contains rating data sets from the MovieLens web site.
Jester: It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering.
Million Song Dataset: It can be used for both collaborative and content-based filtering.
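For readers new to collaborative filtering, the following is a minimal user-based sketch on ratings that use Jester's -10.00 to +10.00 scale. The users, joke ids, and rating values are invented for illustration, and cosine similarity with a similarity-weighted average is just one common choice, not a method prescribed by any of the datasets above.

```python
import math

# Invented ratings on Jester's -10.00 to +10.00 scale; user and joke ids are hypothetical.
ratings = {
    "alice": {"joke1": 8.0, "joke2": -4.0, "joke3": 6.0},
    "bob":   {"joke1": 7.0, "joke2": -5.0, "joke3": 5.0, "joke4": 9.0},
    "carol": {"joke1": -6.0, "joke2": 8.0, "joke4": -7.0},
}


def cosine(u, v):
    """Cosine similarity over the items both users rated (0.0 if none in common)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for item, or None if unavailable."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None


print(predict("alice", "joke4"))
```

Here alice's tastes track bob's and run opposite to carol's, so the prediction for the unseen joke is pulled toward bob's high rating and away from carol's low one.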
Note:
If you are aware of other high-quality, free datasets that you would recommend for research and applications in machine learning, deep learning, data science, and related fields, please feel free to suggest them in the comments below or by emailing us directly at pub@towardsai.net.
If a suggestion is suitable, we will review it and include it in this list. Also, please let us know about your experience with any of these datasets in the comments section.
Happy learning!
Acknowledgments:
The authors would like to thank the members of Lionbridge and the largest AI Community for their immense support and constructive criticism during the preparation of this resource.
DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings are not intended to be final products, but rather a reflection of current thinking and a catalyst for discussion and improvement.