Machine Learning - NoSQL Style
Victor Lu, Consulting Senior Principal Consultant for Core Database, provides his analysis of NoSQL solutions.
Mini Big Bangs and Real-Time Machine Learning
HELP EXPLORE WHAT OUR UNIVERSE IS MADE OF!
Kaggle's "TrackML Particle Tracking Challenge" is a result of its partnership with CERN (the world's largest high energy physics laboratory) for real-time pre-processing and filtering of the most promising events from colliding protons, which are essentially recreating mini big bangs. The first place winner did not use much machine learning (only some logistic regression for candidate pruning), but rather classical mathematical modeling with statistics and 3D geometry.
The Kaggle competition winning solution reminds me of the fun days I spent at Fermi National Lab in the Chicago suburbs in the early 1990s. Our team photo appeared in Newsweek because of the top quark discovery. The work was recognized with the 2019 High Energy and Particle Physics Prize, awarded by the European Physical Society at a ceremony in Ghent, Belgium on July 15.
Particle physicists were among the first to use artificial intelligence techniques in software development, data analysis, and theoretical calculations. The first of a series of workshops on this topic, titled Artificial Intelligence in High-Energy and Nuclear Physics (AIHENP), was held in 1990. Even so, many of today's sophisticated machine learning tools were not available then for the top quark discovery, for data acquisition tasks such as real-time pre-processing and filtering, or for physics data analysis. What mattered was a clear understanding of the problem to solve, based on the deep domain knowledge that physicists had about the top quark; the design of the data acquisition system and event selection process that resulted in "clean data"; and an optimized infrastructure that stored that data and made it available for efficient analysis.
In 2019, for business domain experts to solve business problems and support executive initiatives, they also need "clean data" to be available at the right place and at the right time, especially if real-time analytics is important for the business use case.
Oracle Database is the best starting point, because analytic logic is built into the database itself. Oracle NoSQL Database complements Oracle Database as a solution for applications with the following characteristics:
- Produce and consume data at high volume and velocity
- Require instantaneous response time to match user expectations
- Are developed with continuously evolving data models
- Scale on demand based on dynamic workloads
What exactly can Oracle NoSQL Database do?
Think of other NoSQL solutions like Cosmos DB, DynamoDB, MongoDB, and Cassandra...
Here is what I have learned from top Oracle NoSQL Database experts Dave Rubin and Ashok Joshi.
Web-Scale Applications and NoSQL Database
Over the last decade, horizontally scalable, shared-nothing non-relational database architectures have helped to jumpstart many highly scalable web development projects. Modern microservices-based applications require engines that can easily handle multiple data models without compromising on the core guarantees of transactions, high performance, and geo-distribution. The Oracle RDBMS was the clear choice for decades for many applications and will continue to be the leading database in the cloud with the introduction of autonomous databases. To extend its lead in the enterprise application database market, Oracle introduced Oracle NoSQL Database.
Oracle NoSQL Database is for serious enterprise application developers. It provides a scalable, distributed NoSQL database designed to provide highly reliable, flexible, and available data management across a configurable set of storage nodes. In addition, customers can be productive immediately by choosing the fully managed NoSQL Database Cloud Service, a server-less data store that provides all back-end administration, furnishing your application with auto-scaling, low request latency, and high availability so that developers can focus on delivering business value. This service resides in Oracle's fast-growing list of globally distributed data centers and is currently live in seven regions across the world.
Customers can also choose to run Oracle NoSQL Database on-premise via global, multi-cloud, multi-region Oracle NoSQL clusters. For the serious enterprise application developer, full ACID (Atomicity, Consistency, Isolation, Durability) transaction support is crucial, and much like database security, they expect native support in the database, not some bolt-on additions that expose applications to data loss, data corruption, exposed customer data, or even closure of business because of data being wiped out by hackers. For Oracle NoSQL, each operation can be fully ACID, flushing and syncing all data, as well as requiring quorum acknowledgement.
Alternatively, applications may favor latency over consistency or durability by reading loosely consistent data or by writing to memory only, letting the data store flush to persistent storage asynchronously. Strongly consistent reads are obtained by reading directly from a shard’s dynamically elected leader node, which handles all write operations. Each shard in a NoSQL Database cluster has a dynamically elected leader node, which prevents a single point of failure. These mechanisms give the developer fine-grained control over both transactional and eventually consistent applications. Developers can define a default consistency and durability policy using the KVStoreConfig.setConsistency() and KVStoreConfig.setDurability() methods.
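To make the trade-off concrete, here is a minimal toy model in Python. It is only a sketch under my own assumptions and does not use the KVStoreConfig API or any Oracle client library: the Node and Shard classes, the reachable argument, and the simple-majority rule are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One storage node in a shard; holds a plain in-memory copy of the data."""
    name: str
    data: dict = field(default_factory=dict)


class Shard:
    """Toy shard: one leader takes all writes, replicas receive copies."""

    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas

    def put(self, key, value, reachable, require_quorum=True):
        """Write on the leader, then copy to the replicas named in `reachable`.

        With require_quorum=True the write is rejected unless a simple
        majority of all nodes (leader included) would acknowledge it.
        """
        targets = [r for r in self.replicas if r.name in reachable]
        acks = 1 + len(targets)  # the leader counts as one acknowledgement
        quorum = (1 + len(self.replicas)) // 2 + 1
        if require_quorum and acks < quorum:
            return False
        self.leader.data[key] = value
        for replica in targets:
            replica.data[key] = value
        return True

    def get(self, key, from_leader=True, replica_index=0):
        """Read the leader for the latest value, or one replica, which may lag."""
        node = self.leader if from_leader else self.replicas[replica_index]
        return node.data.get(key)


shard = Shard(Node("n1"), [Node("n2"), Node("n3")])
shard.put("user:1", "v1", reachable={"n2", "n3"})

# No replica is reachable: the quorum write is rejected, the relaxed one succeeds.
rejected = shard.put("user:1", "v2", reachable=set())
relaxed = shard.put("user:1", "v2", reachable=set(), require_quorum=False)

leader_read = shard.get("user:1", from_leader=True)    # sees the latest write
replica_read = shard.get("user:1", from_leader=False)  # still sees the old value
```

In this toy, the relaxed write returns quickly but exists on one node only, and a replica read can return stale data. That is the kind of latency versus consistency and durability choice the paragraph above describes.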
Oracle NoSQL Database provides critical features such as online elastic scale out and scale in, support for multiple data models (key/value, table, JSON (JavaScript Object Notation) documents and graph), SQL Query, full text search, authentication, authorization, and multi-datacenter disaster recovery.
Cosmos DB and DynamoDB
If you are familiar with other NoSQL engines, you will find comparable features in Oracle NoSQL Database.
Oracle and Microsoft have worked together for decades to provide customers with reliable, secure on-premise solutions. Both Oracle and Microsoft really understand customer needs both on-premise and in the cloud. With the enhanced Oracle-Microsoft cloud collaboration, customers will increasingly see jointly tested and validated deployment architectures and best practices. These include co-branded white papers with reference architectures, and distributed cloud deployment templates for generic and custom applications across Oracle Cloud and Azure.
Microsoft launched Cosmos DB as an Azure service replacing its previous NoSQL offering, DocumentDB, and competing against Amazon Web Services (AWS) DynamoDB. According to Microsoft, Cosmos DB is already a $100M annual business. Microsoft CEO Satya Nadella said: "This AI era is mostly first a data era. And that's where I think the opportunity lies...Cosmos DB happens to be one of the best database products to be able to capture the signals that you want around your customers from a variety of different sources."
Cosmos DB includes support for multiple data models, including document, graph, key-value, table and column-family. Oracle can actually claim thought leadership on the idea that a multi-model database is the right approach, and Oracle NoSQL Database has been built following the same principle.
Similar to Cosmos DB, when customers need to elastically scale throughput or storage, they can turn to Oracle NoSQL cloud service, a server-less datastore that provides all back end administration, furnishing your application with auto-scaling, low request latency, and high availability. Oracle NoSQL can be a world-wide distributed solution with data size ranging from kilobytes to petabytes. Independent of the scale, distribution, or failures, Oracle NoSQL continues to provide a single system image of the globally-distributed resources.
Cosmos DB and Oracle NoSQL are both powerful platforms that can be used to enable real-time analytics applications. Apache Spark is a powerful open source general-purpose cluster computing engine for performing high speed sophisticated analytics. Spark allows loading of data from a Hadoop file. This can be done by extending Hadoop's InputFormat class. So, any datastore that implements Hadoop's InputFormat specification can act as a data source to Spark. Oracle NoSQL Database provides a TableInputFormat class that extends Hadoop's InputFormat class, thereby allowing data to be loaded from Oracle NoSQL Database.
The Subscription Stream API can be used to subscribe to all logical changes (puts and deletes) made to an Oracle NoSQL Database table. This change feed can be used to notify downstream alerting systems and machine learning algorithms for real-time corrective actions and real-time analytics.
Similar to Cosmos DB and Google Cloud Spanner, Oracle NoSQL Database is a single database as a service that offers familiar database metaphors, high consistency and availability, horizontal scale, and minimal management hassle. Oracle NoSQL and Cosmos DB offer multiple consistency models in the same database and do not force a commitment to a conventional column-style, key/value, or document-based paradigm.
Oracle NoSQL Database is integrated out of the box with the following Oracle products: Big Data SQL, Internet of Things, Spatial and Graph, GoldenGate, Relational Database, Berkeley DB, Fusion Middleware, Enterprise Manager, SQL Developer, and Communications Elastic Charging Engine.
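The change-feed idea can be sketched in a few lines of Python. This is a toy in-memory publish/subscribe model, not the Subscription Stream API itself: the ChangeFeedTable class, its event dictionary layout, and the threshold of 100 are hypothetical and exist only to show how puts and deletes could fan out to an alerting consumer.

```python
class ChangeFeedTable:
    """Toy table that publishes every put and delete to its subscribers."""

    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives one event dict per change."""
        self.subscribers.append(callback)

    def _publish(self, op, key, value):
        event = {"op": op, "key": key, "value": value}
        for callback in self.subscribers:
            callback(event)

    def put(self, key, value):
        self.rows[key] = value
        self._publish("put", key, value)

    def delete(self, key):
        # Only rows that actually exist produce a delete event.
        if key in self.rows:
            del self.rows[key]
            self._publish("delete", key, None)


alerts = []


def alert_on_high_reading(event):
    """Downstream consumer: flag any put whose value exceeds a threshold."""
    if event["op"] == "put" and event["value"] > 100:
        alerts.append(event["key"])


seen = []  # a second consumer that simply records every event

table = ChangeFeedTable()
table.subscribe(alert_on_high_reading)
table.subscribe(seen.append)

table.put("sensor-1", 42)
table.put("sensor-2", 130)
table.delete("sensor-1")
table.delete("missing")  # no row, so no event is published
```

A real consumer would replace alert_on_high_reading with a call into an alerting system or an online model update, but the shape is the same: the data store pushes changes, and the consumer never polls.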
Oracle NoSQL Database is also going to be easily wired together with Oracle SaaS - the most complete, innovative, and proven cloud suite of SaaS applications that enable customers to transform their business with the latest intelligent technologies such as AI and machine learning.
Another popular NoSQL service is Amazon's DynamoDB. According to Amazon's Dynamo paper, the Dynamo system that preceded DynamoDB used several technologies that are now Oracle technologies. For example, the storage engines in use included Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, and MySQL.
MongoDB and Cassandra
Oracle's NoSQL Database Cloud Service, Microsoft's Cosmos DB, and Amazon's DynamoDB are all cloud-native services and are well-integrated with the latest features on each cloud platform. According to Amazon, this type of integration has several benefits over other NoSQL platforms because self-managed NoSQL databases have the following common problems:
- Outage (infrastructure failure, buggy patches) - Lost revenue
- Spotty or unpredictable performance - Unpredictable User Experience
- Extra effort to scale, upgrade and maintain - Lost time and resources
- Special training required - Less time to develop features that matter
While the above are true for many situations, there are more and more customers that want to develop hybrid cloud and multicloud solutions to avoid cloud vendor lock-in. For example, dev teams who work primarily with AWS need to plan ahead for app portability in multi-cloud deployments. For the customers who plan for hybrid cloud applications that need on-premise NoSQL database deployment, AWS DynamoDB simply is not an option.
MongoDB and Cassandra are comparable to Oracle NoSQL Database (on-premise). Developers and data scientists can leverage Oracle NoSQL Database as a flexible, scalable, and performant distributed database to meet the rigors of AI application development – a flexible data model, performance, scalability & redundancy, and tunable consistency.
Oracle NoSQL can serve as a metadata repository of all source data assets and analytics visualizations, stored in rich JSON document structures. Queries against Oracle NoSQL can be highly parallelized across the distributed database cluster.
Oracle NoSQL Database provides a set of aggregate operations that perform calculations on your datasets via the SQL-like query language. This and many other features provide customers with dynamic analysis capabilities.
A database administrator does not need to write application logic to query data any more. Instead, SQL can be used to inspect the data via the NoSQL query shell.
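As an illustration of what such an aggregate query looks like, here is a small runnable sketch. It uses SQLite from the Python standard library as a stand-in, so the dialect is SQLite's and not Oracle NoSQL Database's query language; the orders table and its columns are made up for the example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a SQL-capable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ann", "EMEA", 120.0), ("bob", "EMEA", 80.0), ("cy", "APAC", 50.0)],
)

# Count, sum, and average per region, computed by the database engine
# instead of by application code looping over rows.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The point is the division of labor: the aggregation is declared in one statement and executed next to the data, which is what makes ad hoc inspection from a query shell practical.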
Overall, Oracle NoSQL Cloud Service, Cosmos DB, and DynamoDB are the choices for customers who want a NoSQL engine that is best integrated with the latest cloud features. Oracle NoSQL, MongoDB, and Cassandra are the choices if a customer wants to build their own NoSQL data store either on-premise or on multiple cloud platforms.
Oracle Products for Enterprise Machine Learning
If you have developed applications using Cosmos DB, DynamoDB, MongoDB and Cassandra as data stores, Oracle NoSQL Database has comprehensive features to support the same use cases. Oracle NoSQL Database is only one of many Oracle products that you can choose from to support all your application needs, from traditional enterprise applications to IoT, robotics, or mobile/5G initiatives. Using Oracle products or third-party products from Oracle Cloud Marketplace, you can choose the right product combinations that work best for your dataset and access patterns. Here are some of the Oracle product combinations that can be used to fit your specific machine learning project needs.
Oracle RDBMS – Do Machine Learning All in a Converged Database
Oracle Database owns almost half of the world’s database market share and is the best in functionality, availability, performance and scalability. For many of the same reasons, Oracle Database is also the best relational database for machine learning. It’s multi-model, and specific data models address the needs of specific classes of applications. For example, document and multimedia data rely on formats like XML and JSON.
Graph databases, spatial databases and key-value stores are used for connectivity analysis, geographic analysis and high-performance lookup, respectively.
If you want to leverage the tamper-resistance and non-repudiation properties of blockchain in use cases that do not involve multiple organizations, look forward to the new Blockchain Tables feature. Blockchain Tables can also be used to improve data integrity and fraud protections of services provided by central authorities, such as an Escrow company, a SaaS service provider, Trade Exchange, or a Central Bank.
In many use cases, there is no need to leave Oracle Database to do machine learning, because the data is already stored, retrieved, processed, and analyzed efficiently there.
Oracle Big Data SQL (Cloud SQL) - Data Virtualization + Kafka + Spark
Oracle Big Data SQL enables organizations to immediately analyze data across Apache Hadoop, Apache Kafka, NoSQL, object stores, and Oracle Database leveraging their existing SQL skills, security policies, and applications with extreme performance. Data can be queried across many systems without being copied and replicated.
Oracle NoSQL + Oracle Functions + Object Storage – Go Serverless
Oracle Functions can be used to build a Serverless Real-Time Data Processing app on Oracle Cloud. It is a fully managed, highly scalable, on-demand, Functions-as-a-Service platform, built on enterprise-grade Oracle Cloud Infrastructure and powered by the Fn Project open-source engine. You don't have to worry about the underlying infrastructure because Oracle Functions will ensure that your app is highly available, scalable, secure, and monitored. You can develop functions to interact with the Oracle Cloud Infrastructure Object Storage. You’ll use Oracle Functions to process real-time streams, Oracle NoSQL Database to persist records in a NoSQL database, and Oracle Analytics products to aggregate data, with raw data archived to Oracle Object Storage.
Oracle NoSQL Database + Object Storage + Oracle Cloud Infrastructure Data Science
Oracle NoSQL Database is a great data store for low-latency reads and writes in web applications. However, when it is necessary to train a machine learning model by scanning all data in a NoSQL Database, the data can be copied to Oracle Object Storage. Oracle's forthcoming Data Science Platform can be used to build a recommendation engine that provides personalized recommendations to customers using this data.
Oracle Object Storage is used to bring data together from all available domains within the same region. Then data scientists can analyze data on the Data Science Platform to derive insights. New user profiles and user actions are distributed by Oracle NoSQL replication.
Oracle NoSQL Database + Oracle Analytics Cloud
NoSQL Database data can be extracted and loaded into Oracle Analytics Cloud together with data from other sources. Oracle Analytics Cloud powers deeper insights by embedding machine learning and AI into every aspect of the analytics process, making the customer’s job easier than ever. Oracle employs smart data preparation and discovery to enhance the customer’s overall experience. Natural language processing (NLP) and natural language generation (NLG) power modern, conversation-style analytics.
Oracle NoSQL Database makes it possible to create new data pipelines for Oracle Analytics Cloud and Oracle Analytics for Applications.
- Oracle Analytics for Fusion ERP (initial offering)
- Oracle Analytics for Fusion HCM
- Oracle Analytics for Fusion SCM
- Oracle Analytics for Fusion CX
- Oracle Analytics for NetSuite
Oracle NoSQL + Oracle RDBMS - Same Data, Multiple Presentations
When working on a machine learning project with a relational data source, it is usually necessary to join relational tables into a machine learning dataset. This process usually requires familiarity with the relational model. The data is presented in denormalized form through a (materialized) view and/or queried to represent a semantic model of the relationships between the relational tables. This dataset is necessary for machine learning (ML) algorithms, processing, and data exploration. It is possible to use a single Oracle NoSQL Database table modeled for the access patterns of multiple relational tables. This way, users can structure data so that it’s “pre-joined” right in the table. There is no longer a need to run analytical queries against production relational databases every time a machine learning prediction or query request is processed. Real-time applications, machine learning inference, and incremental/online training that process data related to a particular subset of employees, products, customers, or countries/regions can benefit from the low latency and horizontal scalability of Oracle NoSQL Database, leaving the core Oracle relational database for the critical business functions it was designed for. Similar use cases can be achieved using Oracle NoSQL Database and Oracle MySQL relational database.
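A small Python sketch shows what "pre-joined" means in practice. Everything here is hypothetical: the three toy relational tables, the build_prejoined helper, and the document layout are my own illustration, and a plain dictionary stands in for the NoSQL table keyed by customer ID.

```python
# Three toy "relational tables" (assumed schema, for illustration only).
customers = [
    {"customer_id": 1, "name": "Ann", "country": "DE"},
    {"customer_id": 2, "name": "Bob", "country": "US"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "product_id": 100, "qty": 2},
    {"order_id": 11, "customer_id": 1, "product_id": 101, "qty": 1},
    {"order_id": 12, "customer_id": 2, "product_id": 100, "qty": 5},
]
products = {
    100: {"name": "widget", "price": 3.0},
    101: {"name": "gadget", "price": 10.0},
}


def build_prejoined(customers, orders, products):
    """Return {customer_id: document} with orders and product details nested."""
    docs = {}
    for customer in customers:
        docs[customer["customer_id"]] = {**customer, "orders": [], "total_spend": 0.0}
    for order in orders:
        product = products[order["product_id"]]
        doc = docs[order["customer_id"]]
        doc["orders"].append(
            {
                "order_id": order["order_id"],
                "product": product["name"],
                "qty": order["qty"],
                "unit_price": product["price"],
            }
        )
        doc["total_spend"] += order["qty"] * product["price"]
    return docs


# The join runs once, when the documents are built.
table = build_prejoined(customers, orders, products)

# At inference time a single key lookup returns everything the model needs.
features = table[1]
```

The cost of the join is paid when the document is written or refreshed, so each prediction request becomes one key lookup and never touches the production relational database.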
Oracle RDBMS + Oracle NoSQL DB, NewSQL, AGI
The story does not end here though. The NoSQL boom came primarily from the need for so-called "web-scale" applications that require high performance and massive scalability to support a large number of users with an instant query response time.
Many of the early “web scale” NoSQL applications doing machine learning realized that losing a few data points early on, when they had such rich statistical density, didn’t matter. Many of these NoSQL solutions put aside the rich feature set offered by the traditional relational database management system, such as ACID guarantees, SQL, joins, and foreign keys.
Oracle NoSQL Database, on the other hand, is built with enterprise application requirements in mind. Oracle NoSQL Database Cloud Service is fully managed with zero administration. ACID transactions are fully supported for the data you store in Oracle NoSQL Database Cloud Service. If required, consistency can be relaxed in favor of latency. Developers can access data with SQL queries using Secondary Indexes.
Oracle RDBMS provides customers with dependable data integrity and durability, SQL query flexibility, and high performance with all kinds of workloads. Oracle Sharding enables distribution and replication of data across a pool of Oracle databases that do not share hardware or software. The pool of databases is presented to the application as a single logical database. Applications elastically scale (data, transactions and users) to any level, on any platform, simply by adding databases (shards) to the pool, up to 1000 shards. When customers develop applications using both Oracle NoSQL Database and Oracle RDBMS, they can have the best of both the NoSQL world and the RDBMS world.
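The "single logical database over many shards" idea can be sketched with a toy key router in Python. This is not how Oracle Sharding is implemented: the shard_for function, the ShardedStore class, and the plain modulo placement are assumptions made to keep the example short.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Presents several independent dictionaries as one logical key-value store."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for i in range(100):
    store.put(f"customer-{i}", i)

# Each key lives on exactly one shard; callers never choose the shard themselves.
sizes = [len(shard) for shard in store.shards]
```

With plain modulo placement, changing the shard count remaps most keys, which is why production systems typically rely on consistent hashing or movable chunks so that shards can be added without reshuffling everything.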
NewSQL is a new approach to relational databases that combines the transactional ACID guarantees of a relational database with the horizontal scalability of NoSQL. However, NewSQL databases are not yet ready for prime time for applications that require strong data consistency. For example, according to Daniel Abadi, NewSQL databases use consensus protocols to enforce consistency. Some require IT staff to devote resources to clock synchronization, or to pay a cloud provider a premium for this functionality. Application developers must code around the possibility of consistency violations that may arise as a result of clock skew jumps. Other NewSQL databases lack scalability and can only process a fixed number of messages per second.
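Why clock skew matters can be shown with a deliberately simplified Python toy. It does not model any particular NewSQL product or consensus protocol: the SkewedClock class, the last_write_wins rule, and the 50 ms skew values are assumptions chosen only to show how ordering by local timestamps can contradict real-time order.

```python
class SkewedClock:
    """A node-local clock that is off from true time by a fixed skew."""

    def __init__(self, skew_ms):
        self.skew_ms = skew_ms

    def now(self, true_time_ms):
        return true_time_ms + self.skew_ms


def last_write_wins(writes):
    """Resolve a conflict by keeping the write with the greatest timestamp."""
    return max(writes, key=lambda w: w["ts"])["value"]


fast = SkewedClock(+50)  # this node's clock runs 50 ms ahead
slow = SkewedClock(-50)  # this node's clock runs 50 ms behind

writes = [
    {"value": "first", "ts": fast.now(1000)},   # real time 1000, stamped 1050
    {"value": "second", "ts": slow.now(1040)},  # real time 1040, stamped 990
]

# The write that happened later in real time loses the conflict.
winner = last_write_wins(writes)
```

In this toy, the earlier write wins because its node's clock runs ahead, so a client that wrote "second" after seeing "first" would find its update silently discarded. Bounding skew with tightly synchronized clocks is what makes timestamp ordering safe, and that is the operational cost described above.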
Oracle has been the king of the database world for several decades. Having the best platform for enterprise real-time analytics/machine learning is only the start. AI is still in the early innings. A lot of the AI research community itself, while actively pursuing AGI, goes to great lengths to emphasize how far away we still are, perhaps out of concern that the media hype around AI may lead to dashed hopes and yet another AI winter. We will see database platforms evolve and support enterprise customers in their efforts to put machine learning/AI research into daily operations.
What will happen to the world of databases in the next 5-10 years to support machine learning? Keep an open mind and keep up to date - it is an OpenWorld out there...
To learn more about AI and machine learning, visit the Oracle AI page.