Big Data platforms emerged as the amount of data began to grow dramatically with the spread of web and mobile applications. Large volumes of user activity data had to be stored on platforms with large, fast storage and enough compute resources to process it. These platforms must scale easily, without downtime, as the amount of data grows, and deliver good response times for user and process queries. Big Data platforms use clusters to split the data across many servers: each server stores a portion of the data on its local disk (a technique called sharding), and every shard is replicated to other nodes for high availability in case one node goes down.
When a query is executed, all servers process it in parallel, each against its local data, using the power of the whole cluster to return results as fast as possible. Big Data platforms also share several common architectural solutions:
- Columnar storage – instead of storing all columns of a row together, each column is stored in a dedicated storage segment. A query that touches only a few columns reads just those columns from disk, rather than whole records.
- Compression – because column values often repeat across many records, column data compresses well, reducing storage size and speeding up queries that read less data from disk.
- Cluster – the ability to distribute load and data among many servers and to scale out when more resources are needed.
- Sharding (partitioning) – data is distributed between many servers; each server stores and processes a portion of the whole dataset in a shared-nothing architecture.
- Parallelism – huge amounts of data are processed quickly because many cluster servers process the data at the same time.
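The sharding, replication, and parallelism ideas above can be sketched in a few lines of Python. This is a toy illustration under assumed names (`shard_for`, `put`, `count_matching` are all hypothetical), not how any real platform implements it: keys are hashed to a shard, each write is replicated to a neighbouring node, and a query fans out so every "node" scans only its local shard before the partial results are combined.

```python
# Toy sketch of hash-based sharding with replication and a scatter-gather
# query. All names are hypothetical; real platforms do this across servers.
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same shard.
    return zlib.crc32(key.encode()) % NUM_SHARDS

# Each "node" holds its own shard plus a replica of a neighbour's shard.
shards = [dict() for _ in range(NUM_SHARDS)]
replicas = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, value) -> None:
    s = shard_for(key)
    shards[s][key] = value
    replicas[(s + 1) % NUM_SHARDS][key] = value  # replicate for availability

def count_matching(predicate) -> int:
    # Fan the query out: every node scans only its local data in parallel,
    # then the partial counts are combined (scatter-gather).
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(
            lambda shard: sum(1 for v in shard.values() if predicate(v)),
            shards,
        )
    return sum(partials)

for i in range(1000):
    put(f"user-{i}", {"clicks": i % 10})

print(count_matching(lambda v: v["clicks"] >= 8))  # 200
```

Note that the query result is the same no matter how the keys happen to be distributed across shards, which is what lets a shared-nothing cluster grow by simply adding nodes and rebalancing shards.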
Traditional databases, which use shared central storage (a shared-everything architecture), cannot store huge amounts of data economically, and processing and querying that data would take far too long.
SeaData is an expert in the world of BigData and provides Data Architecture consulting services, DataOps, and Data Engineering projects to leading companies in Israel and the world.
We work in both on-premises and cloud environments, depending on the customer's use case.
Some of the technologies we specialize in are listed below:
Apache Hadoop is free, open-source software for massively distributed computation and Big Data storage. It can store petabytes of data and process it very fast, using a cluster of many commodity servers (nodes), where each data node stores a portion of the data and also serves as a compute node that processes its local data.
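The "process your local data" model described above is the essence of MapReduce, the programming model Hadoop popularized. The toy word count below only simulates it in plain Python (the block contents and function names are illustrative, not the Hadoop API): each node counts words in the file block it already stores, and the partial counts are merged in a reduce step, so the bulky data never has to move.

```python
# Toy illustration of Hadoop's "move the computation to the data" model.
# Names and data are illustrative; this is not the Hadoop API.
from collections import Counter
from functools import reduce

# Pretend a large file has been split into blocks stored on different nodes.
local_blocks = [
    "big data needs big storage",
    "big clusters process data in parallel",
]

def map_phase(block: str) -> Counter:
    # Runs on the node that already holds the block: no data transfer.
    return Counter(block.split())

def reduce_phase(partials) -> Counter:
    # Only the small partial counts travel to the reducer, not the data.
    return reduce(lambda a, b: a + b, partials, Counter())

totals = reduce_phase(map_phase(b) for b in local_blocks)
print(totals["big"])   # 3
print(totals["data"])  # 2
```

In a real Hadoop cluster the map tasks are scheduled on the data nodes that hold each HDFS block, which is exactly why adding nodes adds both storage and processing power at the same time.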
Read More about Hadoop
Google BigQuery is a fast, powerful, flexible, and cost-effective serverless data warehouse that is tightly integrated with the other services on Google Cloud Platform. Designed to help you make informed decisions quickly, the cloud-based data warehouse and analytics platform uses a built-in query engine and a highly scalable serverless computing model to process terabytes of data in seconds and petabytes in minutes.
Read more about Google Big Query
Amazon Redshift is a fully managed, cloud-based big data warehouse service offered by Amazon.
The platform provides a storage system that holds petabytes of data in easy-to-access clusters that can be queried in parallel. Each node in a cluster can be accessed independently by users and applications.
Read more about Amazon Redshift