Whether you’re building your cluster on-site or through the cloud, a little bit of pre-planning can go a long way. When estimating the requirements, we should at the very least know:
* Memory usage
* Number of nodes in the cluster
* Disk usage

As a note: these rough calculations are strictly for GridDB's own use. The OS and other applications will add extra memory and disk-space requirements of their own, so please plan accordingly!
Memory Usage

GridDB operates in-memory, so the most crucial step is to estimate our memory requirements. We will need to estimate both the data size of our rows and the total number of rows, which together give us a rough estimate of the overall database size. The basic formula follows this model:
Memory capacity used = (row data size × no. of registered rows ÷ 0.75) + (8 × no. of registered rows × (assigned index number + 2) ÷ 0.66) (bytes)
These numbers may be a bit difficult to reach off the top of your head, so it is recommended to be generous with your expectations: it is safer to overestimate and be well-prepared than to be stingy and risk floundering. Once we reach our number, we will need to think about how many nodes our cluster will need to employ. Let's go through some example numbers together to help with visualization. Just as an example, let's say we knew we were going to have an application similar to this electric company. We would start by estimating a row data size of 100 bytes. For the number of rows, we can take a look at how many sensors we plan to deploy and extrapolate from there. For this example, let's choose a nice round number of 1 billion rows. We will also assume four indexes per container, so the index term (assigned index number + 2) works out to 6.
Memory capacity used = (100 bytes × 1,000,000,000 ÷ 0.75) + (8 × 1,000,000,000 × 6 ÷ 0.66)
= 133,333,333,333.3 bytes + 72,727,272,727.3 bytes
= 206,060,606,060.6 bytes ≈ 206GB
So according to these estimates, our cluster will need to be able to handle, at the very least, around 206GB of data.
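If you prefer to script this rather than work it out by hand, here is a minimal Python sketch of the same formula. The function name is our own, and the four-index assumption simply mirrors the example above; swap in your own numbers.

```python
def estimate_memory_bytes(row_size, row_count, index_count):
    """Approximate in-memory footprint: row data plus index overhead."""
    row_data = row_size * row_count / 0.75
    index_overhead = 8 * row_count * (index_count + 2) / 0.66
    return row_data + index_overhead

# Example figures from above: 100-byte rows, 1 billion rows, four indexes (assumed).
memory_bytes = estimate_memory_bytes(row_size=100, row_count=1_000_000_000, index_count=4)
print(f"Estimated memory: {memory_bytes / 1e9:.1f} GB")  # ~206.1 GB
```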
Number of Nodes

To estimate the number of nodes, we need the total memory usage we just estimated. We also need to know what replication factor is intended for the future cluster; the default value is 2.
Number of nodes = (Total memory usage ÷ Memory size per machine) × Number of replicas
Please note that the estimate arrived at here should be treated as a minimum. If it can be afforded, a larger number of nodes is better and safer for load balancing and higher availability. But before we move on, we should decide how much of our database we want to keep in-memory; essentially, we must decide how much of our data needs to be quickly accessible as "hot data" (which is stored in-memory). For this example, we will keep half of our data in-memory and the rest on disk, which works out to ~103GB of hot data. So now, back to our calculations.
Number of nodes = (103GB ÷ 16GB) × 2 ≈ 12.9
Rounding up, we will need at least 13 nodes with 16GB of memory each. We can adjust the replication factor and how much data we keep in-memory before settling on a number we like. For this example, we will leave the numbers as they are.
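The same arithmetic as a quick Python sketch; the 16GB-per-machine figure and the 50% hot-data split are just this example's assumptions.

```python
import math

def estimate_node_count(hot_data_gb, memory_per_node_gb, replicas=2):
    """Nodes needed to hold the hot (in-memory) data, including replicas."""
    return math.ceil(hot_data_gb / memory_per_node_gb * replicas)

# ~206GB total, half of it kept in memory, on machines with 16GB of RAM each.
hot_data_gb = 206 / 2
print(estimate_node_count(hot_data_gb, memory_per_node_gb=16, replicas=2))  # 13
```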
Disk Usage

Now we need to estimate the size of the files that GridDB will create. GridDB creates two kinds of files: a checkpoint file and a transaction log file. To gauge these numbers, we first need the memory usage of each node, which can be calculated like this:
Memory usage per node = (Total memory usage × Number of replicas) ÷ Number of nodes (bytes)
(103GB × 2) ÷ 13 = 15.85GB = memory usage per node
Using this calculation, we can estimate the size of the checkpoint file:
Checkpoint file size = Memory usage per node × 2 (bytes)
15.85GB × 2 = 31.7GB
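Both of these steps in Python, carrying over the example figures (103GB of hot data, replication factor 2, 13 nodes); these are rough per-node estimates only.

```python
# Per-node memory and checkpoint-file estimates for this example.
hot_data_gb = 103
replicas = 2
nodes = 13

memory_per_node_gb = hot_data_gb * replicas / nodes  # ~15.85 GB
checkpoint_file_gb = memory_per_node_gb * 2          # ~31.7 GB

print(f"Memory per node:      {memory_per_node_gb:.2f} GB")
print(f"Checkpoint file size: {checkpoint_file_gb:.2f} GB")
```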
Estimating the size of the transaction log is a bit more difficult because it depends heavily on the frequency of updates. We need to predict our rows' update frequency (per second) and then assume a checkpoint interval. For the row update frequency, we will guess that we receive 10,000 updates per second. The checkpoint interval's default value is 1,200 seconds (20 minutes). Armed with these estimates, we can estimate the size of our transaction log file:
Transaction log file size = Row data size × Row update frequency × Checkpoint interval (bytes)
File size = 100 bytes × 10,000 updates/second × 1,200 seconds = 1,200,000,000 bytes = 1.2GB
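Here is the same estimate as a short Python snippet; only the update rate is a workload-specific guess, while 1,200 seconds is the default checkpoint interval mentioned above.

```python
# Transaction log estimate for this example's workload assumptions.
row_size_bytes = 100
updates_per_second = 10_000
checkpoint_interval_s = 1_200

log_file_bytes = row_size_bytes * updates_per_second * checkpoint_interval_s
print(f"Transaction log file: {log_file_bytes / 1e9:.1f} GB")  # 1.2 GB
```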
Now we just need to add these numbers to figure out our disk usage per node:
Disk usage per node = Transaction log file size + Checkpoint file size + spillover from our in-memory data
Let’s plug in our numbers.
Disk usage per node = 31.7GB + 1.2GB + 15.85GB ≈ 48.8GB
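Finally, a small Python sketch that adds everything up per node and across the cluster; the spillover figure assumes the 50/50 hot/cold split chosen earlier.

```python
# Rough disk usage per node and for the whole cluster, using the figures above.
checkpoint_file_gb = 31.7
transaction_log_gb = 1.2
spillover_gb = 15.85  # the half of each node's data that stays on disk
nodes = 13

disk_per_node_gb = checkpoint_file_gb + transaction_log_gb + spillover_gb
print(f"Disk per node: {disk_per_node_gb:.1f} GB")          # ~48.8 GB
print(f"Cluster total: {disk_per_node_gb * nodes:.0f} GB")  # ~634 GB
```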
So based on our estimations, for our database of 1 billion rows we will need a 13-node GridDB cluster keeping roughly 100GB of hot data in memory. It will have a replication factor of 2 and require ~50GB of disk space per node. Rounding disk usage up to 50GB per node, the total disk storage needed comes to roughly 650GB. Now that we have a rough estimate of our database, we can start building!
If you have any questions about the blog, please create a Stack Overflow post here: https://stackoverflow.com/questions/ask?tags=griddb
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.