4.1 Data Modeling Basics
GridDB uses a unique Key-Container data model that resembles Key-Value. It has the following features.
- Key-Value data is grouped into containers, a concept resembling an RDB table.
- A schema defining the data types of a container can be set, and an index can be set on a column.
- Transactions can be carried out on a row basis within the container. In addition, ACID is guaranteed on a container basis.
GridDB manages data on a block, container, partition, and partition group basis.
- Block
A block is the data unit for persisting data to disk (this persistence processing is hereinafter referred to as a checkpoint) and is the smallest physical data management unit in GridDB. Data from multiple containers is arranged in a block. Before GridDB is started for the first time, a block size of either 64 KB or 1 MB can be selected in the cluster definition file. Specify 64 KB if the system has little installed memory or if the data volume grows slowly.
Because the database file is created during the initial startup of the system, the block size cannot be changed after GridDB has been started for the first time.
- Container
A container consists of multiple blocks and is the data structure that serves as the interface with the user. There are two container types: collection and time series.
- Table
A table is a special form of container that exists only in NewSQL products, where SQL can be used as the interface. A container or table must be created before an application can register data in it.
- Row
A row is a single line of data registered in a container or table. Multiple rows can be registered in a container or table, but they are not necessarily arranged in the same block; depending on when data is registered and updated, it is arranged in suitable blocks within partitions. A row normally consists of columns of multiple data types.
- Partition
A partition is a data management unit that includes one or more containers or tables.
- Partition Group
A group of multiple partitions is known as a partition group.
A partition is the unit of data arrangement between the nodes of a cluster. It is used to manage data movement for balancing the load between nodes and data multiplexing (replicas) in case of a failure. Replicas are arranged on the nodes that compose the cluster on a partition basis. The node that can update the containers inside a partition is known as the owner node, and one owner node is allocated to each partition. A node other than the owner that maintains a replica is a backup node. Depending on the number of replicas set, a partition has master data and one or more sets of backup data.
The data held by a partition group is saved on the OS disk as a physical database file. The number of partition groups created depends on the degree of parallelism of the database processing threads executed by the node.
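The relationship between containers, partitions, owner nodes, and backup nodes can be sketched in plain Java. This is a toy model and not GridDB's actual placement algorithm: hashing the container name to pick a partition and placing replicas on consecutive nodes are assumptions made purely for illustration, and the names `PartitionSketch`, `partitionOf`, `ownerOf`, and `backupsOf` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {
    final int partitionCount;
    final int nodeCount;
    final int replicaCount; // total copies per partition: 1 owner + (replicaCount - 1) backups

    PartitionSketch(int partitionCount, int nodeCount, int replicaCount) {
        if (partitionCount < 1 || nodeCount < 1 || replicaCount < 1 || replicaCount > nodeCount) {
            throw new IllegalArgumentException(
                "need at least one partition and 1 <= replicaCount <= nodeCount");
        }
        this.partitionCount = partitionCount;
        this.nodeCount = nodeCount;
        this.replicaCount = replicaCount;
    }

    // Assumption: a container is mapped to a partition by hashing its name.
    int partitionOf(String containerName) {
        return Math.floorMod(containerName.hashCode(), partitionCount);
    }

    // Exactly one owner node per partition; only the owner accepts updates.
    int ownerOf(int partition) {
        return partition % nodeCount;
    }

    // Backup nodes hold the remaining replicas (assumed here to be the next nodes in order).
    List<Integer> backupsOf(int partition) {
        List<Integer> backups = new ArrayList<>();
        for (int i = 1; i < replicaCount; i++) {
            backups.add((ownerOf(partition) + i) % nodeCount);
        }
        return backups;
    }

    public static void main(String[] args) {
        PartitionSketch cluster = new PartitionSketch(128, 3, 2);
        int p = cluster.partitionOf("SENSOR_1");
        System.out.println("partition=" + p
            + " owner=node" + cluster.ownerOf(p)
            + " backups=" + cluster.backupsOf(p));
    }
}
```

With a replica count of 2, every partition in this sketch has one owner and one backup, so losing a single node still leaves a copy of each partition on another node.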
To compare GridDB with other NoSQL databases: in a Key-Value database (such as Redis), the key points to an arbitrary value, and the attributes of that value usually cannot be indexed or queried. In a Key-Document database (such as MongoDB), the key points to a document, and different documents can have different structures.
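The difference between the three models can be illustrated with plain Java collections. This is only an in-memory analogy, not client code for GridDB, Redis, or MongoDB; the class `ModelContrast`, the row type `Reading`, and the sample keys are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ModelContrast {
    // One typed row; every row in a container shares this schema.
    static final class Reading {
        final long timestampMillis;
        final double liters;

        Reading(long timestampMillis, double liters) {
            this.timestampMillis = timestampMillis;
            this.liters = liters;
        }
    }

    // Key-Value: the key points to an opaque value the store cannot look inside.
    static final Map<String, String> keyValue = new HashMap<>();

    // Key-Document: the key points to a document; documents may differ in structure.
    static final Map<String, Map<String, Object>> keyDocument = new HashMap<>();

    // Key-Container: the key points to a container of rows that share one schema.
    static final Map<String, List<Reading>> keyContainer = new HashMap<>();

    // Because rows are typed, a column such as liters can be filtered on directly.
    static long countAbove(String containerKey, double minLiters) {
        long count = 0;
        for (Reading r : keyContainer.getOrDefault(containerKey, new ArrayList<>())) {
            if (r.liters > minLiters) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        keyValue.put("sensor1", "1700000000000,12.5");

        Map<String, Object> doc = new HashMap<>();
        doc.put("liters", 12.5);
        keyDocument.put("sensor1", doc);

        List<Reading> rows = new ArrayList<>();
        rows.add(new Reading(1700000000000L, 12.5));
        rows.add(new Reading(1700000060000L, 3.0));
        keyContainer.put("sensor1", rows);

        System.out.println(countAbove("sensor1", 5.0)); // prints 1
    }
}
```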
Containers
GridDB has two container types: Collections, which can have any type of row key, and TimeSeries, which always have a timestamp as the row key. They also offer several other unique features.
In Java, a container is defined by a static class whose fields are its columns; timestamps, simple strings, numbers, geometry types, blobs, and arrays of strings and numbers are all supported. In all other languages, the container is defined by a ContainerInfo object, but some languages do not support arrays and geometry types at this time.
In a Collection, row keys can be either unique or not; this is determined by placing a @RowKey annotation in front of the field's type in Java, or by the row_key parameter of the ContainerInfo object in other languages.
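To show how a plain class of fields can act as a schema, here is a minimal sketch that reads column names, column types, and the row key from a class using reflection. It does not use the GridDB client library: the nested `RowKey` annotation is a local stand-in declared only so the example compiles on its own, and `SchemaSketch`, `DeviceRecord`, `TemperatureReading`, `columnsOf`, `rowKeyOf`, and `hasTimestampKey` are hypothetical names. How GridDB itself inspects row classes is not described here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaSketch {
    // Local stand-in for a row-key annotation (not the GridDB class).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface RowKey {}

    // Collection-style row: the key can be of any supported type.
    static class DeviceRecord {
        @RowKey String deviceId;
        String location;
        double[] calibration;
    }

    // Time-series-style row: the key is a timestamp.
    static class TemperatureReading {
        @RowKey Date timestamp;
        double celsius;
    }

    // Column name -> Java type name, taken from the class's instance fields.
    static Map<String, String> columnsOf(Class<?> rowClass) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Field f : rowClass.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue; // skip anything that is not a per-row column
            }
            columns.put(f.getName(), f.getType().getSimpleName());
        }
        return columns;
    }

    // Name of the field marked as row key, or null if the class declares none.
    static String rowKeyOf(Class<?> rowClass) {
        for (Field f : rowClass.getDeclaredFields()) {
            if (f.isAnnotationPresent(RowKey.class)) {
                return f.getName();
            }
        }
        return null;
    }

    // True when the row key is a timestamp, as a time series requires.
    static boolean hasTimestampKey(Class<?> rowClass) {
        for (Field f : rowClass.getDeclaredFields()) {
            if (f.isAnnotationPresent(RowKey.class)) {
                return f.getType() == Date.class;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(columnsOf(DeviceRecord.class));
        System.out.println("row key: " + rowKeyOf(DeviceRecord.class));
        System.out.println("timestamp key: " + hasTimestampKey(TemperatureReading.class));
    }
}
```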
Data Modeling
With the Key-Container model, each device, application, sensor, account, or dataset gets its own container, and typically the same schema is used for each type of device. A unique ID is part of the container key, and a collection can be used to help organize the many container keys. Writes to and ad-hoc queries of an individual container are very fast because only that container needs to be locked, while the total time required to query many containers is usually lower than with Key-Column or relational data models.
Let's look at this example of a water company's sensor recording and billing application.
The above ER diagram would be represented in Java with the following classes:
    static class AccountRecord {
        @RowKey String accountId;
        String billingName;
        String billingAddress;
        String billingEmail;
        String[] sensorIds;
    }

    static class SensorReading {
        @RowKey Date timestamp;
        double liters;
        double psi;
    }
The following shows how GridDB puts data into different containers, as well as how the same data would be represented in a relational database.
Notice on the relational side that all of the sensors populate one table. (Note: we settled on inserting all sensors into one table because it was the more efficient method in an RDBMS.) Imagine now that the application scales out to several tens of thousands of sensors; the RDBMS table housing the sensor data will quickly become too cumbersome to do any meaningful work without significant slowdowns.
Over time, each sensor would push data into the appropriate SENSOR_$sensorId container, and customers would be written to the ACCOUNTS container along with the sensorIds that apply to their account. When it comes time to generate bills, the billing application would iterate through the rows of ACCOUNTS, reading each of the account's SENSOR_$sensorId containers to calculate the bill. GridDB's novel Key-Container data model gives developers an easy and efficient way to model data (time series or not) from many individual inputs as many containers that can be aggregated and iterated over.
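The billing pass described above can be sketched with in-memory maps standing in for the containers. This is not GridDB client code: `BillingSketch` and its methods are hypothetical, the `SENSOR_` prefix follows the container naming used in the text, and two assumptions are made that the text does not state, namely that each reading's `liters` value is the volume used since the previous reading and that the bill is a flat price per liter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BillingSketch {
    static final class Account {
        final String accountId;
        final String[] sensorIds;

        Account(String accountId, String[] sensorIds) {
            this.accountId = accountId;
            this.sensorIds = sensorIds;
        }
    }

    static final class Reading {
        final long timestampMillis;
        final double liters; // assumed: volume used since the previous reading

        Reading(long timestampMillis, double liters) {
            this.timestampMillis = timestampMillis;
            this.liters = liters;
        }
    }

    // Stand-in for the ACCOUNTS container, keyed by accountId.
    final Map<String, Account> accounts = new HashMap<>();
    // Stand-in for the per-sensor containers: container name -> rows.
    final Map<String, List<Reading>> sensorContainers = new HashMap<>();

    // One container per sensor; the sensor ID is part of the container key.
    static String containerNameFor(String sensorId) {
        return "SENSOR_" + sensorId;
    }

    // A sensor appends a row to its own container only.
    void record(String sensorId, long timestampMillis, double liters) {
        sensorContainers
            .computeIfAbsent(containerNameFor(sensorId), k -> new ArrayList<>())
            .add(new Reading(timestampMillis, liters));
    }

    // Sum the liters in each of the account's sensor containers, then price the total.
    double billFor(Account account, double pricePerLiter) {
        double totalLiters = 0;
        for (String sensorId : account.sensorIds) {
            List<Reading> rows = sensorContainers.get(containerNameFor(sensorId));
            if (rows == null) {
                continue; // sensor has not reported yet
            }
            for (Reading r : rows) {
                totalLiters += r.liters;
            }
        }
        return totalLiters * pricePerLiter;
    }

    // Iterate through every account row and compute its bill.
    Map<String, Double> billAll(double pricePerLiter) {
        Map<String, Double> bills = new HashMap<>();
        for (Account account : accounts.values()) {
            bills.put(account.accountId, billFor(account, pricePerLiter));
        }
        return bills;
    }

    public static void main(String[] args) {
        BillingSketch store = new BillingSketch();
        store.accounts.put("A1", new Account("A1", new String[] {"s1", "s2"}));
        store.record("s1", 1L, 10.0);
        store.record("s1", 2L, 2.5);
        store.record("s2", 1L, 4.0);
        System.out.println(store.billAll(2.0)); // prints {A1=33.0}
    }
}
```

Each write touches only one sensor's container, and the billing pass reads only the containers listed in an account's sensorIds, which mirrors the access pattern the Key-Container model is designed for.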