Look at the big picture: the presentation layer, and what the users would like to get out of it.

Sometimes the “perfect” database design forces the presentation layer to work around it. As a result, performance suffers, and the question soon becomes “do we need a stronger machine?” or the suggestion is made, “let’s redesign the database.” When you design an analytics solution, the data engineer or the analytics tools must be able to get the data fast, without stacking structure on structure (a view on top of a view…), while maintaining performance and efficiency.

There are many database architectures; choose the best fit according to your business needs:

  1. Schema on read or schema on write: Will the database be structured or unstructured? It depends on your source data.
  2. Reporting tool: Choose a reporting tool that has a native data driver for your database.
  3. Late transactions: How do you deal with a transaction that arrives late: reprocess it or ignore it?
  4. Duplication: How do you detect and handle duplicate records?
  5. How much data to keep: Although information is considered a gold mine, do you really need to keep all of it in your analytical database? Or can you work with a “thin” and fast system and move the old information to cheap, queryable storage?

Data transformation

Choose an ETL tool: a picture is worth a thousand words, and a graphical ETL tool can decrease development and maintenance time. The time saving grows as the process becomes more complicated from a business perspective. Does code generated by an ETL tool run slower or perform less effectively than hand-written code? Not really; in both cases it depends on the developer.

Micro batch processing 

Micro-batch processing breaks the transformation into small pieces of code, each transforming the data one step further. A full business process is composed of many such micro batches. The workflow can be started by a trigger: a file arriving, a message delivered from a topic, a queue, a TCP call, CDC, or a simple schedule. Developing micro batches has many advantages, some of which are listed here (a minimal sketch of a single micro batch follows the list):

  1. Commit points and recovery
  2. Maintenance and release
  3. Amount of data increases
  4. Agile development
  5. Debug and troubleshooting
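
To make the idea concrete, here is a minimal sketch in Python of a single micro batch (all file names and fields are hypothetical): it reads an input file with a known structure, applies one transformation per record, and writes an output file for the next step in the workflow to pick up.

    import json
    from pathlib import Path

    def transform(record: dict) -> dict:
        # One dedicated piece of business logic; here, normalising an amount field.
        record["amount"] = round(float(record.get("amount", 0)), 2)
        return record

    def run_micro_batch(in_path: Path, out_path: Path) -> None:
        # Known structure in (one JSON object per line), known structure out.
        with in_path.open() as src, out_path.open("w") as dst:
            for line in src:
                dst.write(json.dumps(transform(json.loads(line))) + "\n")

    if __name__ == "__main__":
        # In practice a file watcher, queue consumer or scheduler would supply the paths.
        run_micro_batch(Path("incoming/orders.jsonl"), Path("staging/orders_clean.jsonl"))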

Commit points and recovery

“The process failed, run it again.” The longer the process runs, the more resources it consumes and the bigger the backlog grows, the more painful such a failure becomes. Building recovery into your workflow becomes possible when the process is divided into small micro batches.
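
A minimal sketch of this idea, assuming a simple file-based checkpoint (the file name and step names are illustrative): each micro batch writes a commit point when it finishes, so a rerun skips the batches that already succeeded and restarts from the one that failed.

    import json
    from pathlib import Path

    CHECKPOINT = Path("workflow_state.json")  # hypothetical commit-point file

    def load_done() -> set:
        return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

    def mark_done(step_name: str, done: set) -> None:
        done.add(step_name)
        CHECKPOINT.write_text(json.dumps(sorted(done)))  # commit point

    def run_workflow(steps) -> None:
        # steps is an ordered list of (name, callable) pairs, one per micro batch.
        done = load_done()
        for name, func in steps:
            if name in done:
                continue  # already committed in a previous run, skip on recovery
            func()
            mark_done(name, done)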

Maintenance

A micro batch is a process with a single, dedicated function, so its development is more reliable and changes can be delivered faster. It is like a manufacturing plant: you know what goes in, and you know what the output is. Adding a feature or removing behavior will not always cause regression elsewhere.

Amount of data increases

A growing amount of data does not necessarily mean spending more money on resources or redeveloping the solution in a different technology. With micro batches it is easier to understand which task consumes the most resources; more often than not, a memory leak or an untuned process is what causes resource problems. If the process takes longer because the data has grown, the bulk of the data can be split into small pieces and processed in parallel, as in the sketch below.
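
As a rough illustration of that last point (chunk size and worker count are arbitrary placeholders), the same transformation can be fanned out over fixed-size slices of the input:

    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(rows: list) -> list:
        # The same micro-batch logic, applied to one slice of the data.
        return [{**row, "processed": True} for row in rows]

    def process_in_parallel(rows: list, chunk_size: int = 10_000, workers: int = 4) -> list:
        # Split the bulk into small pieces and run them side by side.
        chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
        results = []
        with ProcessPoolExecutor(max_workers=workers) as pool:
            for part in pool.map(process_chunk, chunks):
                results.extend(part)
        return results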

Agile Development

A micro batch has a clear entry point and result. A developer can mock the process with known input files and compare the output against expected files, and each developer can take a “piece” of the whole process to develop.
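
Because each micro batch is just a known file in and a known file out, it can be tested in isolation. A pytest-style sketch, reusing the hypothetical run_micro_batch step from the earlier example:

    import json

    from orders_batch import run_micro_batch  # the hypothetical step sketched earlier

    def test_orders_micro_batch(tmp_path):
        in_file = tmp_path / "orders.jsonl"
        out_file = tmp_path / "orders_clean.jsonl"
        in_file.write_text(json.dumps({"id": 1, "amount": "12.349"}) + "\n")

        run_micro_batch(in_file, out_file)  # the step under test

        expected = [{"id": 1, "amount": 12.35}]
        actual = [json.loads(line) for line in out_file.read_text().splitlines()]
        assert actual == expected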

Debug and troubleshooting

Because the pipeline looks like an assembly line, when the outcome does not match the desired result the developer can check the output of each step, unlike with a single monolithic process.

Data enrichment

The process can enrich the data from other data sources or from other microservices, and it can be written in any language. The only thing that matters is that the micro process receives a file with a known structure and produces a file with a known structure. “Known structure” does not have to mean CSV; it can be JSON, XML, or another format. Depending on the requirements, a micro-batch flow can even process a single entity row (structure) and run every second.
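
For example, an enrichment step might call another service per record and attach its answer to the output; a sketch with a hypothetical customer-service endpoint:

    import json
    import urllib.request
    from pathlib import Path

    # Hypothetical lookup endpoint; any service that returns JSON would do.
    CUSTOMER_API = "http://customer-service.local/api/customers/{id}"

    def enrich_record(record: dict) -> dict:
        with urllib.request.urlopen(CUSTOMER_API.format(id=record["customer_id"])) as resp:
            customer = json.load(resp)
        record["customer_segment"] = customer.get("segment", "unknown")
        return record

    def enrich_file(in_path: Path, out_path: Path) -> None:
        # Known structure in, known structure out; the language and format are free choices.
        with in_path.open() as src, out_path.open("w") as dst:
            for line in src:
                dst.write(json.dumps(enrich_record(json.loads(line))) + "\n")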

Building a workflow based on many micro batches

The heaviest resource consumers are aggregation, sorting, and keeping lookup data in memory.

Sorting

Most ETL tools use well-known sorting algorithms.
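
If data that does not fit in memory has to be sorted outside such a tool, one common approach is an external merge sort: sort manageable chunks to temporary files, then merge them. A rough sketch:

    import heapq
    import tempfile
    from pathlib import Path

    def external_sort(in_path: Path, out_path: Path, chunk_size: int = 100_000) -> None:
        runs = []
        with in_path.open() as src:
            while True:
                # Read and sort one chunk that comfortably fits in memory.
                chunk = [line for _, line in zip(range(chunk_size), src)]
                if not chunk:
                    break
                run = tempfile.NamedTemporaryFile("w+", delete=False)
                run.writelines(sorted(chunk))
                run.seek(0)
                runs.append(run)
        # Merge the sorted runs into one sorted output file.
        with out_path.open("w") as dst:
            dst.writelines(heapq.merge(*runs))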

Aggregation

With some ETL tools there is a big difference between aggregating sorted data and unsorted data. It is recommended to create data groups of a known size (partitions) and aggregate them. Some tools offer threaded processing and can manage those threads with a thread pool.
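
A rough sketch of that pattern (partition size and thread count are placeholders): aggregate each fixed-size partition on its own thread, then combine the partial results.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def aggregate_partition(rows: list) -> dict:
        # Sum amounts per key inside one known-size partition.
        totals = defaultdict(float)
        for row in rows:
            totals[row["key"]] += row["amount"]
        return totals

    def aggregate(rows: list, partition_size: int = 50_000, workers: int = 4) -> dict:
        partitions = [rows[i:i + partition_size] for i in range(0, len(rows), partition_size)]
        grand_total = defaultdict(float)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for partial in pool.map(aggregate_partition, partitions):
                for key, value in partial.items():
                    grand_total[key] += value
        return dict(grand_total)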

Lookup tables

When a lookup table is big, it is recommended to put it behind a microservice; there is no need to create a separate microservice for each lookup table.
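
One way this can look, sketched with a hypothetical product-lookup endpoint: the micro batch asks the service for each key instead of loading the whole table into memory, and a small local cache keeps repeated keys cheap.

    import json
    import urllib.request
    from functools import lru_cache

    # Hypothetical service that holds the large reference table.
    LOOKUP_URL = "http://product-lookup.local/api/products/{code}"

    @lru_cache(maxsize=10_000)
    def lookup_product(code: str) -> str:
        # Repeated codes are answered from the local cache, not the service.
        with urllib.request.urlopen(LOOKUP_URL.format(code=code)) as resp:
            return json.load(resp)["category"]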

Dividing a Flow

Where possible, it is recommended to separate the heavy tasks into different micro batches.

The entire task is called data preparation.
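
Putting the pieces together, the whole data-preparation flow can be expressed as an ordered list of micro batches driven by the checkpointing runner sketched in the recovery section (all names and paths are illustrative):

    from pathlib import Path

    steps = [
        ("clean",  lambda: run_micro_batch(Path("incoming/orders.jsonl"), Path("staging/orders_clean.jsonl"))),
        ("enrich", lambda: enrich_file(Path("staging/orders_clean.jsonl"), Path("staging/orders_enriched.jsonl"))),
        ("sort",   lambda: external_sort(Path("staging/orders_enriched.jsonl"), Path("staging/orders_sorted.jsonl"))),
    ]

    run_workflow(steps)  # skips any step that already committed on a previous run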
