Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

Exploring the Hadoop ecosystem — key tools to maximize your cluster’s potential

As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. Over time, various tools and technologies have grown around it that make Hadoop more powerful and even more widely applicable. As a result, it goes beyond the pure HDFS & MapReduce platform and offers, for example, SQL and NoSQL queries as well as real-time streaming.

Hive/HiveQL

Apache Hive is a data warehousing system that allows for SQL-like queries on a Hadoop cluster. Traditional relational databases struggle with horizontal scalability and ACID properties on very large datasets, which is where Hive shines. It enables querying data in HDFS through a SQL-like language, HiveQL, without having to write complex MapReduce jobs in Java. This makes Hadoop data accessible to business analysts and developers, who can use HiveQL (Hive Query Language) to create simple queries and build evaluations on top of Hadoop data architectures.

Hive was originally developed by Facebook for processing large volumes of structured and semi-structured data. It is particularly useful for batch analyses and can be operated with common business intelligence tools such as Tableau or Apache Superset.

The metastore is the central repository that stores metadata such as table definitions, column names, and HDFS location information. This makes it possible for Hive to manage and organize large datasets. The execution engine, in turn, converts HiveQL queries into tasks that Hadoop can process. Depending on the desired performance and infrastructure, you can choose between different execution engines (switching engines is shown in the example after the list):

  • MapReduce: The classic, slower approach.
  • Tez: A faster alternative to MapReduce.
  • Spark: The fastest option, which runs queries in-memory for optimal performance.
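As a minimal sketch of how this choice takes effect, the engine can be switched per session from the hive CLI. The property name hive.execution.engine is the standard Hive setting; the sales table and the availability of Tez on the cluster are assumptions for illustration.

# Switch the execution engine for this session and run a query (assumes Tez is installed)
hive -e "SET hive.execution.engine=tez; SELECT COUNT(*) FROM sales;"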

To use Hive in practice, various aspects should be considered to maximize performance. One of them is partitioning: data is not stored in one huge table, but in partitions that can be searched more quickly. For example, a company’s sales data can be partitioned by year and month:

CREATE TABLE sales_partitioned (
    customer_id STRING,
    amount DOUBLE
) PARTITIONED BY (year INT, month INT);

This means that only the specific partition that is required needs to be accessed during a query. It therefore makes sense to partition by columns that are filtered on frequently. Buckets can also be used to ensure that joins run faster and data is distributed evenly.

CREATE TABLE sales_bucketed (
    customer_id STRING,
    amount DOUBLE
) CLUSTERED BY (customer_id) INTO 10 BUCKETS;
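To illustrate how such tables are queried in practice, here is a hedged sketch using the hive CLI. The filter on year and month lets Hive prune partitions; the table from above and the presence of the CLI on the PATH are assumptions.

# Query only the March 2024 partition; Hive skips all other partitions (partition pruning)
hive -e "SELECT customer_id, SUM(amount) AS total
         FROM sales_partitioned
         WHERE year = 2024 AND month = 3
         GROUP BY customer_id;"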

In conclusion, Hive is a useful tool when you need structured queries on huge amounts of data. It also offers an easy way to connect common BI tools, such as Tableau, to data in Hadoop. However, if the application requires many short, low-latency read and write accesses, Hive is not the right tool.

Pig

Apache Pig takes this one step further and enables the parallel processing of large amounts of data in Hadoop. Compared to Hive, it is not focused on data reporting, but on the ETL processing of semi-structured and unstructured data. For these pipelines, it is not necessary to write complex MapReduce jobs in Java; instead, simple scripts can be written in Pig’s own language, Pig Latin.

In addition, Pig can handle various file formats, such as JSON or XML, and perform data transformations, such as merging, filtering, or grouping data sets. The general process (sketched in the example after the list) then looks like this:

  • Loading the Information: The data can be pulled from different data sources, such as HDFS or HBase.
  • Transforming the data: The data is then modified depending on the application so that you can filter, aggregate, or join it.
  • Saving the results: Finally, the processed data can be stored in various data systems, such as HDFS, HBase, or even relational databases.
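A minimal sketch of this load-transform-store flow, assuming a hypothetical comma-separated file sales.csv with the columns customer_id and amount; the script is run in local mode for testing:

# Write a small Pig Latin script and run it in local mode (use a cluster execution mode in production)
cat > sales_etl.pig <<'EOF'
sales   = LOAD 'sales.csv' USING PigStorage(',') AS (customer_id:chararray, amount:double);
big     = FILTER sales BY amount > 100.0;    -- keep only larger orders
grouped = GROUP big BY customer_id;
totals  = FOREACH grouped GENERATE group AS customer_id, SUM(big.amount) AS total;
STORE totals INTO 'sales_totals' USING PigStorage(',');
EOF
pig -x local sales_etl.pig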

Apache Pig differs from Hive in many fundamental ways. The most important are:

| Attribute | Pig | Hive |
| --- | --- | --- |
| Language | Pig Latin (script-based) | HiveQL (similar to SQL) |
| Target Group | Data Engineers | Business Analysts |
| Data Structure | Semi-structured and unstructured data | Structured data |
| Applications | ETL processes, data preparation, data transformation | SQL-based analyses, reporting |
| Optimization | Parallel processing | Optimized, analytical queries |
| Engine Options | MapReduce, Tez, Spark | Tez, Spark |

Apache Pig is a component of Hadoop that simplifies data processing through its script-based Pig Latin language and accelerates transformations by relying on parallel processing. It is particularly popular with data engineers who want to work on Hadoop without having to develop complex MapReduce programs in Java.

HBase

HBase is a key-value-based NoSQL database in Hadoop that stores data in a column-oriented manner. Compared to classic relational databases, it can be scaled horizontally, and new servers can be added to the cluster if required. The data model consists of various tables, in which each row is identified by a unique row key. This can be thought of as the primary key in a relational database.

Each table in turn is made up of columns that belong to a so-called column family and must be defined when the table is created. The key-value pairs are then stored in the cells of a column. By focusing on columns instead of rows, large amounts of data can be queried particularly efficiently.

This structure can also be seen when creating new data records. A unique row key is created first and the values for the individual columns can then be added to this.

// Insert a row with row key "1001"; values go into the "Personal" and "Orders" column families
Put put = new Put(Bytes.toBytes("1001"));
put.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Max"));
put.addColumn(Bytes.toBytes("Orders"), Bytes.toBytes("Product"), Bytes.toBytes("Laptop"));
table.put(put);

In each call, the column family is named first, followed by the column qualifier and the value. Queries use the same structure: the row is selected via its row key, and then the required column family and the columns it contains are read.

Get get = new Get(Bytes.toBytes("1001"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("Personal"), Bytes.toBytes("Name"));
System.out.println("Name: " + Bytes.toString(name));

The structure is based on a master-worker setup. The HMaster is the higher-level control unit of HBase and manages the underlying RegionServers. It is also responsible for load distribution by centrally monitoring system performance and assigning the so-called regions to the RegionServers. If a RegionServer fails, the HMaster ensures that its data is reassigned to other RegionServers so that operations can continue. A cluster can also have additional HMasters on standby that take over if the active HMaster itself fails; during operation, however, only one HMaster is ever active.

The RegionServers are the working units of HBase, as they store and manage the table data in the cluster and answer read and write requests. For this purpose, each HBase table is divided into several subsets, the so-called regions, which are then managed by the RegionServers. A RegionServer can manage several regions to balance the load between the nodes.

The RegionServers work directly with clients and therefore receive read and write requests directly. Incoming writes are first buffered in the so-called MemStore, and reads are served from memory where possible; if the required data is not available there, the permanent storage in HDFS is used. As soon as the MemStore has reached a certain size, the data it contains is flushed to an HFile in HDFS.

The storage backend for HBase is, therefore, HDFS, which is used as permanent storage. As already described, the HFiles are used for this, which can be distributed across several nodes. The advantage of this is horizontal scalability, as the data volumes can be distributed across different machines. In addition, different copies of the data are used to ensure reliability.

Finally, Apache Zookeeper serves as the superordinate instance of HBase and coordinates the distributed application. It monitors the HMaster and all RegionServers and automatically selects a new leader if an HMaster should fail. It also stores important metadata about the cluster and prevents conflicts if several clients want to access data at the same time. This enables the smooth operation of even larger clusters.

HBase is, therefore, a powerful NoSQL database that is suitable for Big Data applications. Thanks to its distributed architecture, HBase remains accessible even in the event of server failures and offers a combination of RAM-supported processing in the MemStore and the permanent storage of data in HDFS.

Spark

Apache Spark is a further development of the MapReduce concept and is up to 100x faster thanks to the use of in-memory computing. It has since developed into a comprehensive platform for various workloads, such as batch processing, data streaming, and even machine learning, thanks to the addition of many components. It is also compatible with a wide variety of data sources, including HDFS, Hive, and HBase.

At the heart of the components is Spark Core, which offers basic functions for distributed processing:

  • Task management: Calculations can be distributed and monitored across multiple nodes.
  • Fault tolerance: In the event of errors in individual nodes, these can be automatically restored.
  • In-memory computing: Data is stored in the server’s RAM to ensure fast processing and availability.

The central data structures of Apache Spark are the so-called Resilient Distributed Datasets (RDDs). They enable distributed processing across different nodes and have the following properties:

  • Resilient (fault-tolerant): Data can be restored in the event of node failures. The RDDs do not store the data themselves, but only the sequence of transformations. If a node fails, Spark can simply re-execute those transformations to restore the RDD.
  • Distributed: The information is distributed across multiple nodes.
  • Immutable: Once created, RDDs cannot be changed, only recreated.
  • Lazily evaluated (delayed execution): The operations are only executed during an action and not during the definition.

Apache Spark also consists of the following components:

  • Spark SQL provides an SQL engine for Spark and runs on datasets and DataFrames. As it works in-memory, processing is particularly fast, and it is therefore suitable for all applications where efficiency and speed play an important role.
  • Spark Streaming offers the possibility of processing continuous data streams in near real-time by converting them into mini-batches. It can be used, for example, to analyze social media posts or monitor IoT data. It also supports many common streaming data sources, such as Kafka or Flume.
  • With MLlib, Apache Spark offers an extensive library that contains a wide range of machine learning algorithms and can be applied directly to the stored data sets. This includes, for example, models for classification, regression, or even entire recommendation systems.
  • GraphX is a powerful tool for processing and analyzing graph data. This enables efficient analyses of relationships between data points and they can be calculated simultaneously in a distributed manner. There are also special PageRank algorithms for analyzing social networks.

Apache Spark is arguably one of the rising components of Hadoop, as it enables fast in-memory calculations that would previously have been unthinkable with MapReduce. Although Spark is not an exclusive component of Hadoop, as it can also use other file systems such as S3, the two systems are often used together in practice. Apache Spark is also enjoying increasing popularity due to its universal applicability and many functionalities.
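Because Spark and Hadoop are so frequently combined, a common way to run a Spark application on a YARN cluster looks roughly like the sketch below. The example class and jar ship with Spark; the executor sizing and the exact jar path are assumptions for illustration.

# Submit the bundled SparkPi example to a YARN cluster (resource sizes are illustrative)
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100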

Oozie

Apache Oozie is a workflow management and scheduling system that was developed specifically for Hadoop and plans the execution and automation of various Hadoop jobs, such as MapReduce, Spark, or Hive. The most important functionality here is that Oozie defines the dependencies between the jobs and executes them in a specific order. In addition, schedules or specific events can be defined for which the jobs are to be executed. If errors occur during execution, Oozie also has error-handling options and can restart the jobs.

A workflow is defined in XML so that the workflow engine can read it and start the jobs in the correct order. If a job fails, it can simply be repeated or other steps can be initiated. Oozie also has a database backend system, such as MySQL or PostgreSQL, which is used to store status information.
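In day-to-day use, a workflow is packaged (a workflow.xml plus a job.properties file) and submitted via the Oozie command-line client. A minimal sketch, assuming an Oozie server running on localhost and an already prepared job.properties; the job ID placeholder is returned by the first command:

# Submit and start a workflow, then check its status
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>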

Presto

Presto offers another option for applying distributed SQL queries to large amounts of data. Compared to other Hadoop technologies, such as Hive, queries are processed in real-time, and it is therefore optimized for data warehouses running on large, distributed systems. Presto offers broad support for all relevant data sources and does not require a schema definition, so data can be queried directly from the sources. It has also been optimized to work on distributed systems and can, therefore, be used on petabyte-sized data sets.

Presto uses a so-called massively parallel processing (MPP) architecture, which enables particularly efficient processing in distributed systems. As soon as the user sends an SQL query via the Presto CLI or a BI front end, the coordinator analyzes the query and creates an executable query plan. The worker nodes then execute the query fragments and return their partial results to the coordinator, which combines them into the final result.
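A hedged sketch of such an interactive query via the Presto CLI; the server address, the hive catalog, and the sales table are assumptions and depend on the actual deployment:

# Run an ad-hoc query against a Hive-backed catalog from the command line
presto --server localhost:8080 --catalog hive --schema default \
       --execute "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id LIMIT 10;"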

Presto differs from the related systems in Hadoop as follows:

| Attribute | Presto | Hive | Spark SQL |
| --- | --- | --- | --- |
| Query Speed | Milliseconds to seconds | Minutes (batch processing) | Seconds (in-memory) |
| Processing Model | Real-time SQL queries | Batch processing | In-memory processing |
| Data Sources | HDFS, S3, RDBMS, NoSQL, Kafka | HDFS, Hive tables | HDFS, Hive, RDBMS, streams |
| Use Case | Interactive queries, BI tools | Slow big data queries | Machine learning, streaming, SQL queries |

This makes Presto the best choice for fast SQL queries on a distributed big data environment like Hadoop.

What are alternatives to Hadoop?

For a long time, especially in the early 2010s, Hadoop was the leading technology for distributed data processing. However, several alternatives have since emerged that offer more advantages in certain scenarios or are simply better suited to today’s applications.

Cloud-native alternatives to Hadoop

Many companies have moved away from hosting their servers and on-premise systems and are instead moving their big data workloads to the cloud. There, they can benefit significantly from automatic scaling, lower maintenance costs, and better performance. In addition, many cloud providers also offer solutions that are much easier to manage than Hadoop and can, therefore, also be operated by less trained personnel.

Amazon EMR (Elastic MapReduce)

Amazon EMR is a managed big data service from AWS that provides Hadoop, Spark, and other distributed computing frameworks so that these clusters no longer need to be hosted on-premises. This means that companies no longer have to actively take care of cluster maintenance and administration. In addition to Hadoop, Amazon EMR supports many other open-source frameworks, such as Spark, Hive, Presto, and HBase. This broad support means that users can move their existing clusters to the cloud without any major problems.

For storage, Amazon EMR uses S3 as the primary storage layer instead of HDFS. This not only makes storage cheaper, as no permanent cluster is required, but also improves availability, as data is stored redundantly across multiple availability zones. In addition, computing and storage can be scaled independently of each other instead of being tied to the size of the cluster, as is the case with Hadoop.

The EMR File System (EMRFS) is a specially optimized interface that allows Hadoop or Spark to access S3 directly. It also handles consistency and enables metadata caching for better performance. If necessary, HDFS can still be used, for example, if local, temporary storage is required on the cluster nodes.

Another advantage of Amazon EMR over a classic Hadoop cluster is the ability to use dynamic auto-scaling to not only reduce costs but also improve performance. The cluster size and the available hardware are automatically adjusted to the CPU utilization or the job queue size so that costs are only incurred for the hardware that is needed.

So-called spot instances can then be added temporarily when they are needed. In a company, for example, it makes sense to add them at night when data from the productive systems is loaded into the data warehouse. During the day, on the other hand, smaller clusters are operated, and costs can be saved as a result.

Amazon EMR, therefore, offers several optimizations over running Hadoop locally. The optimized storage access to S3, the dynamic cluster scaling, which increases performance and simultaneously optimizes costs, and the improved network communication between the nodes are particularly advantageous. Overall, data can be processed faster and with fewer resources than with classic Hadoop clusters running on your own servers.

Google BigQuery

In the area of data warehousing, Google BigQuery offers a fully managed and serverless data warehouse that delivers fast SQL queries on large amounts of data. It relies on columnar data storage and uses Google’s Dremel technology to handle massive amounts of data efficiently. At the same time, it largely does away with cluster management and infrastructure maintenance.

In contrast to native Hadoop, BigQuery uses a columnar orientation and can, therefore, save immense amounts of storage space by using efficient compression methods. In addition, queries are accelerated as only the required columns need to be read rather than the entire row. This makes it possible to work much more efficiently, which is particularly noticeable with very large amounts of data.

BigQuery also uses Dremel technology, which executes SQL queries as parallel execution trees and distributes the workload across many machines. As such architectures often lose performance as soon as the partial results have to be merged again, BigQuery uses tree aggregation to combine the partial results efficiently.

BigQuery is the better alternative to Hadoop, especially for applications that focus on SQL queries, such as data warehouses or business intelligence. For unstructured data, on the other hand, Hadoop may be the more suitable alternative, although the cluster architecture and the associated costs must be taken into account. Finally, BigQuery also offers a good connection to the various machine learning offerings from Google, such as Google AI or AutoML, which should be taken into account when making a selection.

Snowflake

If you don’t want to become dependent on the Google Cloud with BigQuery or are already pursuing a multi-cloud strategy, Snowflake can be a valid alternative for building a cloud-native data warehouse. It offers dynamic scalability by separating computing power and storage requirements so that they can be adjusted independently of each other.

Compared to BigQuery, Snowflake is cloud-agnostic and can therefore be operated on common platforms such as AWS, Azure, or even in the Google Cloud. Although Snowflake also offers the option of scaling the hardware depending on requirements, there is no option for automatic scaling as with BigQuery. On the other hand, multiclusters can be created on which the data warehouse is distributed, thereby maximizing performance.

On the cost side, the providers differ due to their architecture. Thanks to its complete management and automatic scaling, BigQuery can bill per query and does not charge directly for provisioned computing power. With Snowflake, on the other hand, the choice of cloud provider is free, and in most cases billing boils down to a pay-as-you-go model in which the provider charges for storage and computing power.

Overall, Snowflake offers a more flexible solution that can be hosted by various providers or even operated as a multi-cloud service. However, this requires greater knowledge of how to operate the system, as the resources have to be adapted independently. BigQuery, on the other hand, has a serverless model, which means that no infrastructure management is required.

Open-source alternatives for Hadoop

In addition to these complete and large cloud data platforms, several powerful open-source programs have been specifically developed as alternatives to Hadoop and specifically address its weaknesses, such as real-time data processing, performance, and complexity of administration. As we have already seen, Apache Spark is very powerful and can be used as a replacement for a Hadoop cluster, which we will not cover again.

Apache Flink

Apache Flink is an open-source framework that was specially developed for distributed stream processing so that data can be processed continuously. In contrast to Hadoop’s batch processing or Spark’s micro-batches, data can be processed in near real-time with very low latency. This makes Apache Flink an alternative for applications in which information is generated continuously and needs to be reacted to in real-time, such as sensor data from machines.

While Spark Streaming processes the data in so-called mini-batches and thus simulates streaming, Apache Flink offers real streaming with an event-driven model that can process data just milliseconds after it arrives. This can further minimize latency as there is no delay due to mini-batches or other waiting times. For these reasons, Flink is much better suited to high-frequency data sources, such as sensors or financial market transactions, where every second counts.

Another advantage of Apache Flink is its advanced stateful processing. In many real-time applications, the context of an event plays an important role, such as the previous purchases of a customer for a product recommendation, and must therefore be saved. With Flink, this storage already takes place in the application so that long-term and stateful calculations can be carried out efficiently.

This becomes particularly clear when analyzing machine data in real-time, where previous anomalies, such as too high a temperature or faulty parts, must also be included in the current report and prediction. With Hadoop or Spark, a separate database must first be accessed for this, which leads to additional latency. With Flink, on the other hand, the machine’s historical anomalies are already stored in the application so that they can be accessed directly.

In conclusion, Flink is the better alternative for highly dynamic and event-based data processing. Hadoop, on the other hand, is based on batch processes and therefore cannot analyze data in real-time, as there is always a latency to wait for a completed data block.

Modern data warehouses

For a long time, Hadoop was the standard solution for processing large volumes of data. However, companies today also rely on modern data warehouses as an alternative, as these offer an optimized environment for structured data and thus enable faster SQL queries. In addition, there are a variety of cloud-native architectures that also offer automatic scaling, thus reducing administrative effort and saving costs.

In this section, we focus on the most common data warehouse alternatives to Hadoop and explain why they may be a better choice compared to Hadoop.

Amazon Redshift

Amazon Redshift is a cloud-based data warehouse that was developed for structured analyses with SQL. It is optimized for processing large relational data sets and enables fast column-based queries.

One of the main differences to traditional data warehouses is that data is stored in columns instead of rows, meaning that only the relevant columns need to be loaded for a query, which significantly increases efficiency. Hadoop, and HDFS in particular, is optimized for semi-structured and unstructured data and does not natively support SQL queries. This makes Redshift ideal for OLAP analyses in which large amounts of data need to be aggregated and filtered.

Another feature that increases query speed is the use of a massively parallel processing (MPP) system, in which queries are distributed across several nodes and processed in parallel. This achieves a very high degree of parallelization and processing speed.

In addition, Amazon Redshift offers very good integration into Amazon’s existing systems and can be seamlessly integrated into the AWS environment without the need for open-source tools, as is the case with Hadoop. Frequently used tools are:

  • Amazon S3 offers direct access to large amounts of data in cloud storage.
  • AWS Glue can be used for ETL processes in which data is prepared and transformed.
  • Amazon QuickSight is a possible tool for the visualization and analysis of data.
  • Finally, machine learning applications can be implemented with the various AWS ML services.

Amazon Redshift is a real alternative to Hadoop, especially for relational queries, if you are looking for a managed and scalable data warehouse solution and already have an existing AWS environment or want to build your architecture on top of it. Its column-based storage and massively parallel processing also offer a real advantage for high query speeds on large volumes of data.

Databricks (lakehouse platform)

Databricks is a cloud platform based on Apache Spark that has been specially optimized for data analysis, machine learning, and artificial intelligence. It extends the functionality of Spark with an easy-to-understand user interface and optimized cluster management, and it also offers the so-called Delta Lake, which provides data consistency, scalability, and performance advantages over Hadoop-based systems.

Databricks offers a fully managed environment that can be easily operated and automated using Spark clusters in the cloud. This eliminates the need for manual setup and configuration as with a Hadoop cluster. In addition, the use of Apache Spark is optimized so that batch and streaming processing can run faster and more efficiently. Finally, Databricks also includes automatic scaling, which is very valuable in the cloud environment as it can save costs and improve scalability.

The classic Hadoop platforms have the problem that they do not fulfill the ACID properties and, therefore, the consistency of the data is not always guaranteed due to the distribution across different servers. With Databricks, this problem is solved with the help of the so-called Delta Lake:

  • ACID transactions: The Delta Lake ensures that all transactions fulfill the ACID guidelines, allowing even complex pipelines to be executed completely and consistently. This ensures data integrity even in big data applications.
  • Schema evolution: The data models can be updated dynamically so that existing workflows do not have to be adapted.
  • Optimized storage & queries: Delta Lake uses processes such as indexing, caching, or automatic compression to make queries many times faster compared to classic Hadoop or HDFS environments.

Finally, Databricks goes beyond the classic big data framework by also offering an integrated machine learning & AI platform. The most common machine learning platforms, such as TensorFlow, scikit-learn, or PyTorch, are supported so that the stored data can be processed directly. As a result, Databricks offers a simple end-to-end pipeline for machine learning applications. From data preparation to the finished model, everything can take place in Databricks and the required resources can be flexibly booked in the cloud.

This makes Databricks a valid alternative to Hadoop if a data lake with ACID transactions and schema flexibility is required. It also offers additional components, such as the end-to-end solution for machine learning applications. In addition, the cluster in the cloud can not only be operated more easily and save costs by automatically adapting the hardware to the requirements, but it also offers significantly more performance than a classic Hadoop cluster due to its Spark basis.


In this part, we explored the Hadoop ecosystem, highlighting key tools like Hive, Spark, and HBase, each designed to enhance Hadoop’s capabilities for various data processing tasks. From SQL-like queries with Hive to fast, in-memory processing with Spark, these components provide flexibility for big data applications. While Hadoop remains a powerful framework, alternatives such as cloud-native solutions and modern data warehouses are worth considering for different needs.

This series has introduced you to Hadoop’s architecture, components, and ecosystem, giving you the foundation to build scalable, customized big data solutions. As the field continues to evolve, you’ll be equipped to choose the right tools to meet the demands of your data-driven projects.

Forget About Cloud Computing. On-Premises Is All the Rage Again

From startups to enterprise, companies are lowering costs and regaining control over their operations

Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins.

The tides are turning though. As much as cloud providers still proclaim that they’re the most cost-effective and efficient solution for businesses of all sizes, this is increasingly clashing with the day-to-day experience.

Cloud Computing was touted as the solution for scalability, flexibility, and reduced operational burdens. Increasingly, though, companies are finding that, at scale, the costs and control limitations outweigh the benefits.​

Attracted by free AWS credits, my CTO and I started out by setting up our entire company IT infrastructure in the cloud. However, we were shocked when we saw the costs ballooning after just a few software tests. We decided to invest in a high-quality server and moved our whole infrastructure onto it. And we’re not looking back: this decision is already saving us hundreds of euros per month.

We’re not the only ones: Dropbox already made this move in 2016 and saved close to $75 million over the ensuing two years. The company behind Basecamp, 37signals, completed this transition in 2022, and expects to save $7 million over five years.

We’ll dive deeper into the how and why of this trend and the cost savings that are associated with it. You can expect some practical insights that will help you make or influence such a decision at your company, too.

Cloud costs have been exploding

According to a recent study by Harness, 21% of enterprise cloud infrastructure spend—which will be equivalent to $44.5 billion in 2025—is wasted on underutilized resources. According to the study author, cloud spend is one of the biggest cost drivers for many software enterprises, second only to salaries.

The premise of this study is that developers must develop a keener eye on costs. However, I disagree. Cost control can only get you so far—and many smart developers are already spending inordinate amounts of their time on cost control instead of building actual products.

Cloud costs have a tendency to balloon over time: Storage costs per GB of data might seem low, but when you’re dealing with terabytes of data—which even we as a three-person startup are already doing—costs add up very quickly. Add to this retrieval and egress fees, and you’re faced with a bill you cannot unsee.

Steep retrieval and egress fees only serve one thing: Cloud providers want to incentivize you to keep as much data as possible on the platform, so they can make money off every operation. If you download data from the cloud, it will cost you inordinate amounts of money.

Variable costs based on CPU and GPU usage often spike during high-performance workloads. A report by CNCF found that almost half of Kubernetes adopters found that they’d exceeded their budget as a result. Kubernetes is an open-source container orchestration software that is often used for cloud deployments.

The pay-per-use model of the cloud has its advantages, but billing becomes unpredictable as a result. Costs can then explode during usage spikes. Cloud add-ons for security, monitoring, and data analytics also come at a premium, which often increases costs further.

As a result, many IT leaders have started migrating back to on-premises servers. A 2023 survey by Uptime found that 33% of respondents had repatriated at least some production applications in the past year.

Cloud providers have not restructured their billing in response to this trend. One could argue that doing so would seriously impact their profitability, especially in a largely consolidated market where competitive pressure by upstarts and outsiders is limited. As long as this is the case, the trend towards on-premises is expected to continue.

Cost efficiency and control

There is a reason that cloud providers tend to advertise so much to small firms and startups. The initial setup costs of a cloud infrastructure are low because of pay-as-you-go models and free credits.

The easy setup can be a trap, though, especially once you start scaling. (At my firm, we noticed our costs going out of control even before we scaled to a decent extent, simply because we handle large amounts of data.) Monthly costs for on-premises servers are fixed and predictable; costs for cloud services can quickly balloon beyond expectations.

As mentioned before, cloud providers also charge steep data egress fees, which can quickly add up when you’re considering a hybrid infrastructure.

Security costs can initially be higher on-premises. On the other hand, you have full control over everything you implement. Cloud providers cover infrastructure security, but you remain responsible for data security and configuration. This often requires paid add-ons.

[Embedded comparison table: https://datawrapper.dwcdn.net/czWge/1/]

A round-up can be found in the table above. On the whole, an on-premises infrastructure comes with higher setup costs and needs considerable know-how. This initial investment pays off quickly, though, because you tend to have very predictable monthly costs and full control over additions like security measures.

There are plenty of prominent examples of companies that have saved millions by moving back on-premises. Whether this is a good choice for you depends on several factors, though, which need to be assessed carefully.

Should you move back on-premises?

Whether you should make the shift back to server racks depends on several factors. The most important considerations in most cases are financial, operational, and strategic.

From a financial point of view, your company’s cash structure plays a big role. If you prefer lean capital expenditures and have no problem racking up high operational costs every month, then you should remain in the cloud. If you can afford a higher capital expenditure up front in exchange for lower ongoing costs, moving on-premises is the better choice.

At the end of the day, the total cost of ownership (TCO) is key. If your operational costs in the cloud are consistently lower than running servers yourself, then you should absolutely stay in the cloud.

From an operational point of view, staying on the cloud can make sense if you often face spikes in usage. On-premises servers can only carry so much traffic; cloud servers scale pretty seamlessly in proportion to demand. If expensive and specialized hardware is more accessible for you on the cloud, this is also a point in favor of staying on the cloud. On the other hand, if you are worried about complying with specific regulations (like GDPR, HIPAA, or CSRD for example), then the shared-responsibility model of cloud services is likely not for you.

Strategically speaking, having full control of your infrastructure can be a strategic advantage. It keeps you from getting locked in with a vendor and having to play along with whatever they bill you and what services they are able to offer you. If you plan a geographic expansion or rapidly deploy new services, then cloud can be advantageous though. In the long run, however, going on-premises might make sense even when you’re expanding geographically or in your scope of services, due to increased control and lower operational costs.

The decision to move back on-premises depends on several factors. Diagram generated with the help of Claude AI.

On the whole, if you value predictability, control, and compliance, you should consider running on-premises. If, on the other hand, you value flexibility, then staying on the cloud might be your better choice.

How to repatriate easily

If you are considering repatriating your services, here is a brief checklist to follow:

  • Assess Current Cloud Usage: Inventory applications and data volume.
  • Cost Analysis: Calculate current cloud costs vs. projected on-prem costs.
  • Select On-Prem Infrastructure: Servers, storage, and networking requirements.
  • Minimize Data Egress Costs: Use compression and schedule transfers during off-peak hours.
  • Security Planning: Firewalls, encryption, and access controls for on-prem.
  • Test and Migrate: Pilot migration for non-critical workloads first.
  • Monitor and Optimize: Set up monitoring for resources and adjust.

Repatriation is not just for enterprise companies that make the headlines. As the example of my firm shows, even small startups need to make this consideration. The earlier you make the migration, the less cash you’ll bleed.

The bottom line: Cloud is not dead, but the hype around it is dying

Cloud services aren’t going anywhere. They offer flexibility and scalability, which are unmatched for certain use cases. Startups and companies with unpredictable or rapidly growing workloads still benefit greatly from cloud solutions.

That being said, even early-stage companies can benefit from on-premises infrastructure, for example if the large data loads they’re handling would make the cloud bill balloon out of control. This was the case at my firm.

The cloud has often been marketed as a one-size-fits-all solution for everything from data storage to AI workloads. We can see that this is not the case; the reality is a bit more granular than this. As companies scale, the costs, compliance challenges, and performance limitations of cloud computing become impossible to ignore.

The hype around cloud services is dying because experience is showing us that there are real limits and plenty of hidden costs. In addition, cloud providers can often not adequately provide for security solutions, options for compliance, and user control if you don’t pay a hefty premium for all this.

Most companies will likely adopt a hybrid approach in the long run: On-premises offers control and predictability; cloud servers can jump into the fray when demand from users spikes.

There’s no real one-size-fits-all solution. However, there are specific criteria that should help guide your decision. Like every hype, there are ebbs and flows. The fact that cloud services are no longer hyped does not mean that you need to go all-in on server racks now. It does, however, invite a deeper reflection on the advantages this trend offers for your company.

Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Understanding Hadoop’s core components before installation and scaling

Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components — HDFS for storage, MapReduce for processing, YARN for resource management, and more. Then, we’ll guide you through installing Hadoop (both locally and in the cloud) and introduce some essential commands to help you navigate and operate your first Hadoop environment.

Which components are part of the Hadoop architecture?

Hadoop’s architecture is designed to be resilient and error-free, relying on several core components that work together. These components divide large datasets into smaller blocks, making them easier to process and distribute across a cluster of servers. This distributed approach enables efficient data processing—far more scalable than a centralized ‘supercomputer.’

Hadoop Components | Source: Author

The basic components of Hadoop are:

  • Hadoop Common comprises basic libraries and functionalities that are required by the other modules.
  • The Hadoop Distributed File System (HDFS) ensures that data is stored on different servers and enables a particularly large bandwidth.
  • Hadoop YARN takes care of resource distribution within the system and redistributes the load when individual computers reach their limits.
  • MapReduce is a programming model designed to make the processing of large amounts of data particularly efficient.

In 2020, Hadoop Ozone, which is used as an alternative to HDFS, was added to this basic architecture. It comprises a distributed object storage system that was specially designed for Big Data workloads to better handle modern data requirements, especially in the cloud environment.

HDFS (Hadoop Distributed File System)

Let’s dive into HDFS, the core storage system of Hadoop, designed specifically to meet the demands of big data processing. The basic principle is that files are not stored as a whole on a central server, but are divided into blocks of 128 MB or 256 MB in size and then distributed across different nodes in a computer cluster.

To ensure data integrity, each block is replicated three times across different servers. If one server fails, the system can still recover from the remaining copies. This replication makes it easy to fall back on another node in the event of a failure.

According to its documentation, Hadoop pursues the following goals with the use of HDFS:

  • Fast recovery from hardware failures by falling back on working components.
  • Provision of stream data processing.
  • Big data framework with the ability to process large data sets.
  • Standardized processes with the ability to easily migrate to new hardware or software.

Apache Hadoop works according to the so-called master-slave principle. In this cluster, there is one node that takes on the role of the master. It distributes the blocks from the data set to various slave nodes and remembers which partitions it has stored on which computers. Only the references to the blocks, i.e. the metadata, are stored on the master node. In high-availability setups, a standby master can take over if the active master fails (the Secondary NameNode, despite its name, only performs checkpointing of metadata rather than acting as a failover).

The master within the Apache Hadoop Distributed File System is called a NameNode. The slave nodes, in turn, are the so-called DataNodes. The task of the DataNodes is to store the actual data blocks and regularly report the status to the NameNode that they are still alive. If a DataNode fails, the data blocks are replicated by other nodes to ensure sufficient fault tolerance.

The client saves files that are distributed across the various DataNodes, which are typically spread over several racks; as a rule, there is only one DataNode per machine in a rack. The DataNode’s primary task is to manage the data blocks stored on its machine.

The NameNode, in turn, is responsible for remembering which data blocks are stored in which DataNode so that it can retrieve them on request. It also manages the files and can open, close, and, if necessary, rename them.

Finally, the DataNodes carry out the actual read and write processes of the client. The client receives the required information from the DataNodes when a query is made. They also ensure the replication of data so that the system can be operated in a fault-tolerant manner.
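To get a feel for how clients interact with HDFS, the hdfs command-line tool covers the most common operations. A minimal sketch; the file name and target directory are placeholders:

# Create a directory in HDFS, upload a local file, and inspect how it was stored
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put sales.csv /user/demo/
hdfs dfs -ls /user/demo
# Show the blocks and their replica locations for the uploaded file
hdfs fsck /user/demo/sales.csv -files -blocks -locations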

MapReduce

MapReduce is a programming model that supports the parallel processing of large amounts of data. It was originally developed by Google and can be divided into two phases:

  • Map: In the map phase, a process is defined that can transform the input data into key-value pairs. Several mappers can then be set up to process a large amount of data simultaneously to enable faster processing.
  • Reduce: The Reduce phase starts after all mappers have finished and aggregates all values that have the same key. The aggregation can involve various functions, such as the sum or the determination of the maximum value. Between the end of the Map phase and the start of the Reduce phase, the data is shuffled and sorted according to the keys.

A classic application for the MapReduce mechanism is word counting in documents, such as the seven Harry Potter volumes in our example. The task is to count how often the words “Harry” and “Potter” occur. To do this, in the map phase, each word is split into a key-value pair with the word as the key and the number one as the value, as the word has occurred once.

The positive aspect of this is that this task can run in parallel and independently of each other, so that, for example, a mapper can run for each band or even for each page individually. This means that the task is parallelized and can be implemented much faster. The scaling depends only on the available computing resources and can be increased as required if the appropriate hardware is available. The output of the map phase could look like this, for example:

[("Harry", 1), ("Potter", 1), ("Potter", 1), ("Harry", 1), ("Harry", 1)]

MapReduce using the example of word counts in Harry Potter books | Source: Author

Once all mappers have finished their work, the reduce phase can begin. For the word count example, all key-value pairs with the keys “Harry” and “Potter” should be grouped and counted. 

The grouping produces the following result:

[("Harry", [1,1,1]), ("Potter", [1,1])]

The grouped result is then aggregated. As the words are to be counted in our example, the grouped values are added together:

[("Harry", 3), ("Potter", 2)]

The advantage of this processing is that the task can be parallelized and at the same time only minimal file movement takes place. This means that even large volumes can be processed efficiently.
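To try this out, Hadoop ships with a ready-made word-count job that can be run against files in HDFS. A hedged sketch; the input directory and the jar version (here matching the 3.4.1 release used in the installation section below) are assumptions:

# Upload some text files and run the bundled word-count example
hdfs dfs -mkdir -p /books
hdfs dfs -put *.txt /books/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.1.jar \
  wordcount /books /books-wordcount
# Results are written as part-r-* files in the output directory
hdfs dfs -cat /books-wordcount/part-r-00000 | head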

Although many systems continue to use the MapReduce program, as used in the original Hadoop structure, more efficient frameworks, such as Apache Spark, have also been developed in the meantime. We will go into this in more detail later in the article.

YARN (Yet Another Resource Negotiator)

YARN (Yet Another Resource Negotiator) manages the hardware resources within the cluster. It separates resource management from data processing, which allows multiple applications (such as MapReduce, Spark, and Flink) to run efficiently on the same cluster. It focuses on key functions such as:

  • Management of performance and memory resources, such as CPU or SSD storage space.
  • Distribution of free resources to running processes, for example, MapReduce, Spark, or Flink.
  • Optimization and parallelization of job execution.

Similar to HDFS, YARN also follows a master-slave principle. The Resource Manager acts as the master and centrally monitors all resources in the entire cluster. It also allocates the available resources to the individual applications. The various node managers serve as slaves and are installed on each machine. They are responsible for the containers in which the applications run and monitor their resource consumption, such as memory space or CPU performance. These figures are fed back to the Resource Manager at regular intervals so that it can maintain an overview.

At a high level, a request to YARN looks like this: the client calls the Resource Manager and requests the execution of an application. This then searches for available resources in the cluster and, if possible, starts a new instance of the so-called Application Master, which initiates and monitors the execution of the application. This in turn requests the available resources from the node manager and starts the corresponding containers. The calculation can now run in parallel in the containers and is monitored by the Application Master. After successful processing, YARN releases the resources used for new jobs.
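In practice, the yarn command-line client can be used to observe this resource management at work. A minimal sketch; the application ID is a placeholder returned when a job is submitted:

# List the cluster's nodes and currently running applications
yarn node -list
yarn application -list
# Fetch the aggregated logs of a finished application
yarn logs -applicationId <application-id>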

Hadoop Common

Hadoop Common can be thought of as the foundation of the complete Hadoop ecosystem on which the main components can be built. It contains basic libraries, tools, and configuration files that can be used by all Hadoop components. The main components include:

  • Common libraries and utilities: Hadoop Common provides a set of Java libraries, APIs, and utilities needed to run the cluster. This includes, for example, mechanisms for communication between the nodes in the cluster or support for different serialization formats, such as Avro. Interfaces required for file management in HDFS or other file systems are also included.
  • Configuration management: Hadoop is based on a large number of XML-based configuration files, which define the main system parameters that are essential for operation. One central aspect is the network parameters required to control the machines in the cluster. In addition, the permitted storage locations for HDFS are defined here, and maximum resource sizes, such as the usable storage space, are determined.
  • Platform independence: Hadoop was originally developed specifically for Linux environments. However, it can also be extended to other operating systems with the help of Hadoop Common. This includes native code support for additional environments, such as macOS or Windows.
  • Tools for I/O (input/output): A big data framework processes huge volumes of data that need to be stored and processed efficiently. The necessary building blocks for various file systems, such as TextFiles or Parquet, are therefore stored in Hadoop Common. It also contains the functionalities for the supported compression methods, which ensure that storage space is saved and processing time is optimized.

Thanks to this uniform and central code base, Hadoop Common provides improved modularity within the framework and ensures that all components can work together seamlessly.

Hadoop Ozone

Hadoop Ozone is a distributed object storage system that was introduced as an alternative to HDFS and was developed specifically for big data workloads. HDFS was originally designed for large files with many gigabytes or even terabytes. However, it quickly reaches its limits when a large number of small files need to be stored. The main problem is the limitation of the NameNode, which stores metadata in RAM and, therefore, encounters memory problems when billions of small files are kept.

In addition, HDFS is designed for classic Hadoop use within a computing cluster. However, current architectures often use a hybrid approach with storage solutions in the cloud. Hadoop Ozone solves these problems by providing a scalable and flexible storage architecture that is optimized for Kubernetes and hybrid cloud environments.

Unlike HDFS, where a NameNode handles all file metadata, Hadoop Ozone introduces a more flexible architecture that doesn’t rely on a single centralized NameNode, improving scalability. Instead, it uses the following components: 

  • The Ozone Manager corresponds most closely to the HDFS NameNode, but only manages the bucket and volume metadata. It ensures efficient management of the objects and is also scalable, as not all file metadata has to be kept in RAM.
  • The Storage Container Manager (SCM) can best be imagined as the DataNode in HDFS and it has the task of managing and replicating the data in so-called containers. Various replication strategies are supported, such as triple copying or erasure coding to save space.
  • The Ozone S3 Gateway provides an S3-compatible API, so Ozone can be used as a replacement for Amazon S3. This means that applications developed for AWS S3 can be connected to Ozone and interact with it without the need for code changes.

This structure gives Hadoop Ozone various advantages over HDFS, which we have briefly summarized in the following table:

| Attribute | Hadoop Ozone | HDFS |
| --- | --- | --- |
| Storage Structure | Object-based (buckets & keys) | Block-based (files & blocks) |
| Scalability | Millions to billions of small files | Problems with many small files |
| NameNode Dependency | No central NameNode, scaling possible | NameNode is a bottleneck |
| Cloud Integration | Supports S3 API, Kubernetes, multi-cloud | Strongly tied to the Hadoop cluster |
| Replication Strategy | Classic 3-fold replication or erasure coding | Only 3-fold replication |
| Applications | Big data, Kubernetes, hybrid cloud, S3 replacement | Traditional Hadoop workloads |

Hadoop Ozone is a powerful extension of the ecosystem and enables the implementation of hybrid cloud architectures that would not have been possible with HDFS. It is also easy to scale as it is no longer dependent on a central name node. This means that big data applications with many, but small, files, such as those used for sensor measurements, can also be implemented without any problems.
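Working with Ozone feels similar to object stores such as S3. A hedged sketch using the ozone shell, assuming a running Ozone cluster; the volume, bucket, and key names are placeholders:

# Create a volume and a bucket, then upload and list a key (object)
ozone sh volume create /sales
ozone sh bucket create /sales/raw
ozone sh key put /sales/raw/2024-03.csv ./2024-03.csv
ozone sh key list /sales/raw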

How to start with Hadoop?

Hadoop is a robust and scalable big data framework that powers some of the world’s largest data-driven applications. While it can seem overwhelming for beginners due to its many components, this guide will walk you through the first steps to get started with Hadoop in simple, easy-to-follow stages.

Installation of Hadoop

Before we can start working with Hadoop, we must first install it in our respective environment. In this chapter, we differentiate between several scenarios, depending on whether the framework is installed locally or in the cloud. At the same time, it is generally advisable to work on systems that use Linux or macOS as the operating system, as additional adaptations are required for Windows. In addition, Java should already be available, at least Java 8 or 11, and internal communication via SSH should be possible.

Local Installation of Hadoop

To try out Hadoop on a local computer and familiarize yourself with it, you can perform a single-node installation so that all the necessary components run on the same computer. Before starting the installation, you can check the latest version you want to install at https://hadoop.apache.org/releases.html, in our case this is version 3.4.1. If a different version is required, the following commands can simply be changed so that the version number in the code is adjusted.

We then open a new terminal and execute the following code, which downloads the specified version from the Internet, unpacks the directory, and then changes to the unpacked directory.

wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xvzf hadoop-3.4.1.tar.gz
cd hadoop-3.4.1

If the first line fails, the download link is most likely outdated and the specified version may no longer be available; in that case, use a more recent version and run the code again. The installation directory has a size of about one gigabyte.

The environment variables can then be set to tell the system in which directory Hadoop is installed. Extending the PATH variable allows Hadoop commands to be executed from anywhere in the terminal without typing the full path to the Hadoop installation.

export HADOOP_HOME=~/hadoop-3.4.1 
export PATH=$PATH:$HADOOP_HOME/bin

Before we start the system, we can change the basic configuration of Hadoop, for example, to define specific directories for HDFS or specify the replication factor. There are a total of three important configuration files that we can adjust before starting:

  • core-site.xml configures basic Hadoop settings, such as the connection information for multiple nodes.
  • hdfs-site.xml contains special parameters for the HDFS setup, such as the typical directories for data storage or the replication factor, which determines how many replicas of the data are stored.
  • yarn-site.xml configures the YARN component, which is responsible for resource management and job scheduling.

For our local test, we can adjust the HDFS configuration so that the replication factor is set to 1, as we are only working on one server, and replication of the data is, therefore, not useful. To do this, we use a text editor, in our case nano, and open the configuration file for HDFS:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

The file then opens in the terminal and probably does not yet contain any entries. A new property entry can then be added inside the <configuration> element:

<property> 
    <name>dfs.replication</name> 
    <value>1</value> 
</property>

Various properties can then be set according to this format. The different keys that can be specified in the configuration files, including the permitted values, can be found at https://hadoop.apache.org/docs/current/hadoop-project-dist/. For HDFS, the relevant overview is the hdfs-default.xml reference.
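For example, a minimal core-site.xml for a single-node setup typically defines the default file system address; the host and port below (localhost:9000) are a common choice, but should be treated as an assumption that depends on your environment:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>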

Now that the configuration has been completed, Hadoop can be started. To do this, HDFS is initialized, which is the first important step after a new installation, and the directory that is to be used as the NameNode is formatted. The next two commands then start HDFS on all nodes that are configured in the cluster and the resource management YARN is started.

hdfs namenode -format 
start-dfs.sh 
start-yarn.sh

Problems may occur in this step if Java has not yet been installed; this can easily be fixed by installing Java 8 or 11. In addition, when I tried this on macOS, the NameNode and DataNode of HDFS had to be started explicitly:

~/hadoop-3.4.1/bin/hdfs --daemon start namenode
~/hadoop-3.4.1/bin/hdfs --daemon start datanode

For YARN, the same procedure works for the Resource and NodeManager:

~/hadoop-3.4.1/bin/yarn --daemon start resourcemanager
~/hadoop-3.4.1/bin/yarn --daemon start nodemanager

Finally, the running processes can be checked with the jps command to see whether all components have been started correctly.
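If everything is running, jps lists one Java process per component. The output below is only illustrative; the process IDs will differ on every machine:

jps
# Illustrative output:
# 4821 NameNode
# 4903 DataNode
# 5120 ResourceManager
# 5213 NodeManager
# 5342 Jps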

Hadoop installation in a distributed system

For resilient and productive processes, Hadoop is used in a distributed environment with multiple servers, known as nodes. This ensures greater scalability and availability. A distinction is typically made between the following cluster roles:

  • NameNode: This role stores the metadata and manages the file system (HDFS).
  • DataNode: This is where the actual data is stored and the calculations take place.
  • ResourceManager & NodeManagers: These manage the cluster resources for YARN.

The same commands that were explained in more detail in the last section can then be used on the individual servers. However, communication must also be established between them so that they can coordinate with each other. In general, the following sequence can be followed during installation:

  1. Set up several Linux-based servers to be used for the cluster.
  2. Set up SSH access between the servers so that they can communicate with each other and send data.
  3. Install Hadoop on each server and make the desired configurations.
  4. Assign roles and define the NameNodes and DataNodes in the cluster.
  5. Format NameNodes and then start the cluster.

The specific steps and the code to be executed then depend more on the actual implementation.
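As an illustration of steps 2 and 4, a minimal sketch for a small cluster could look like the following; the user hadoop and the hostnames datanode1 and datanode2 are placeholder assumptions:

# Step 2: generate an SSH key pair on the NameNode host and distribute it to the workers
ssh-keygen -t rsa -b 4096
ssh-copy-id hadoop@datanode1
ssh-copy-id hadoop@datanode2

# Step 4: list the DataNode hostnames in the workers file so the start scripts can reach them
echo "datanode1" >> $HADOOP_HOME/etc/hadoop/workers
echo "datanode2" >> $HADOOP_HOME/etc/hadoop/workers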

Hadoop installation in the cloud

Many companies use Hadoop in the cloud to avoid having to operate their own cluster, potentially save costs, and also be able to use modern hardware. The various providers already have predefined programs with which Hadoop can be used in their environments. The most common Hadoop cloud services are:

  • AWS EMR (Elastic MapReduce): This program is based on Hadoop and, as the name suggests, also uses MapReduce, which allows users to write programs in Java that process and store large amounts of data in a distributed manner. The cluster runs on virtual servers in the Amazon Elastic Compute Cloud (EC2) and stores the data in the Amazon Simple Storage Service (S3). The keyword “Elastic” comes from the fact that the system can change dynamically to adapt to the required computing power. Finally, AWS EMR also offers the option of using other Hadoop extensions such as Apache Spark or Presto.
  • Google Dataproc: Google’s alternative is called Dataproc and enables a fully managed and scalable Hadoop cluster in the Google Cloud. It integrates with BigQuery and uses Google Cloud Storage for data storage. Companies such as Vodafone and Twitter are already using this system.
  • Azure HDInsight: The Microsoft Azure Cloud offers HDInsight for complete Hadoop use in the cloud and also provides support for a wide range of other open-source programs.

The overall advantage of using the cloud is that no manual installation and maintenance work is required. Several nodes are used automatically and more are added depending on the computing requirements. For the customer, the advantage of automatic scaling is that costs can be controlled and only what is used is paid for.

With an on-premises cluster, on the other hand, the hardware is usually sized so that it can still handle peak loads, which means that much of it goes unused most of the time. Finally, another advantage of the cloud is that it makes it easier to integrate other systems that run with the same provider.

Basic Hadoop commands for beginners

Regardless of the architecture selected, the following commands can be used to perform very general and frequently recurring actions in Hadoop. This covers all areas that are required in an ETL process in Hadoop.

  • Upload File to HDFS: To be able to execute an HDFS command, the beginning hdfs dfs is always required. You use put to define that you want to upload a file from the local directory to HDFS. The local_file.txt describes the file to be uploaded. To do this, the command is either executed in the directory of the file or the complete path to the file is added instead of the file name. Finally, use /user/hadoop/ to define the directory in HDFS in which the file is to be stored.
hdfs dfs -put local_file.txt /user/hadoop/
  • List files in HDFS: You can use -ls to list all files and folders in the HDFS directory /user/hadoop/ and have them displayed as a list in the terminal.
hdfs dfs -ls /user/hadoop/
  • Download file from HDFS: The -get parameter downloads the file /user/hadoop/file.txt from the HDFS directory to the local directory. The dot . indicates that the file is stored in the current local directory in which the command is being executed. If this is not desired, you can define a corresponding local directory instead.
hdfs dfs -get /user/hadoop/file.txt 
  • Delete files in HDFS: Use -rm to delete the file /user/hadoop/file.txt from the HDFS directory. This command also automatically deletes all replications that are distributed across the cluster.
hdfs dfs -rm /user/hadoop/file.txt
  • Start MapReduce command (process data): MapReduce is the distributed computing model in Hadoop that can be used to process large amounts of data. Using hadoop jar indicates that a Hadoop job packaged in a “.jar” file is to be executed. The jar containing various example MapReduce programs is located at /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar. From these examples, the wordcount job is executed, which counts the words occurring in a text file. The data to be analyzed is located in the HDFS directory input/ and the results are then stored in the directory output/.
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input/ output/
  • Monitor the progress of a job: Despite the distributed computing power, many MapReduce jobs take a certain amount of time to run, depending on the amount of data. Their status can therefore be monitored in the terminal. The resources and running applications can be displayed using YARN. To be able to execute a command in this system, we start with the command yarn, and with the help of application-list we get a list of all active applications. Various information can be read from this list, such as the unique ID of the applications, the user who started them, and the progress in %.
yarn application -list
  • Display logs of a running job: To be able to delve deeper into a running process and identify potential problems at an early stage, we can read out the logs. The logs command is used for this, with which the logs of a specific application can be called up. The unique application ID is utilized to define this application. To do this, the APP_ID must be replaced by the actual ID in the following command, and the greater than and less than signs must be removed.
yarn logs -applicationId <APP_ID>

With the help of these commands, data can already be saved in HDFS, and MapReduce jobs can also be created. These are the central actions for filling the cluster with data and processing it.
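Putting several of these commands together, a minimal end-to-end run could look like the following sketch; the directories, the file name, and the output file part-r-00000 are examples that depend on your setup:

# Upload input data, run the word count example, and fetch the result
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put local_file.txt /user/hadoop/input/
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
hdfs dfs -get /user/hadoop/output/part-r-00000 .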

Debugging & logging in Hadoop

For the cluster to be sustainable in the long term, it is important to master basic debugging and logging commands. As Hadoop is a distributed system, errors can occur in a wide variety of components and nodes. It is therefore essential to be familiar with the corresponding commands so that errors can be found and resolved quickly.

Detailed log files for the various components are stored in the $HADOOP_HOME/logs directory. The log files for the various servers and components can then be found in their subdirectories. The most important ones are:

  • NameNode-Logs contains information about the HDFS metadata and possible connection problems:
cat $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log 
  • DataNode logs show problems with the storage of data blocks:
cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log
  • YARN ResourceManager logs reveal possible resource problems or errors in job scheduling:
cat $HADOOP_HOME/logs/yarn-hadoop-resourcemanager-<hostname>.log
  • NodeManager logs help with debugging executed jobs and their logic:
cat $HADOOP_HOME/logs/yarn-hadoop-nodemanager-<hostname>.log
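To get a quick overview across all of these log files at once, a simple text search for error entries can help; this is only a generic shell sketch, not a Hadoop-specific tool:

# Show the 20 most recent lines containing errors or exceptions across all logs
grep -iE "error|exception" $HADOOP_HOME/logs/*.log | tail -n 20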

With the help of these logs, specific problems in the processes can be identified and possible solutions can be derived from them. However, if there are problems in the entire cluster and you want to check the overall status across individual servers, it makes sense to carry out a detailed cluster analysis with the following command:

hdfs dfsadmin -report

This includes the number of active and failed DataNodes, as well as the available and occupied storage capacities. The replication status of the HDFS files is also displayed here and additional runtime information about the cluster is provided. An example output could then look something like this:

Configured Capacity: 10 TB
DFS Used: 2 TB
Remaining: 8 TB
Number of DataNodes: 5
DataNodes Available: 4
DataNodes Dead: 1

With these first steps, we have learned how to set up Hadoop in different environments, store and manage data in HDFS, execute MapReduce jobs, and read the logs to detect and fix errors. This will enable you to start your first Hadoop project and gain experience with big data frameworks.

In this part, we covered the core components of Hadoop, including HDFS, YARN, and MapReduce. We also walked through the installation process, from setting up Hadoop in a local or distributed environment to configuring key files such as core-site.xml and hdfs-site.xml. Understanding these components is crucial for efficiently storing and processing large datasets across clusters.

If this basic setup is not enough for your use case and you want to learn how you can extend your Hadoop cluster to make it more adaptable and scalable, then our next part is just right for you. We will dive deeper into the large Hadoop ecosystem including tools like Apache Spark, HBase, Hive, and many more that can make your cluster more scalable and adaptable. Stay tuned!

The post Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop appeared first on Towards Data Science.

]]>
Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies https://towardsdatascience.com/mastering-hadoop-part-1-installation-configuration-and-modern-big-data-strategies/ Wed, 12 Mar 2025 00:22:46 +0000 https://towardsdatascience.com/?p=599534 A comprehensive guide covering Hadoop setup, HDFS commands, MapReduce, debugging, advantages, challenges, and the future of big data technologies.

The post Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies appeared first on Towards Data Science.

]]>
Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become one of the leading Big Data management technologies in recent years. The system enables the distributed storage and processing of data across multiple servers. As a result, it offers a scalable solution for a wide range of applications from data analysis to machine learning.

This article provides a comprehensive overview of Hadoop and its components. We also examine the underlying architecture and provide practical tips for getting started with it.

Before we start, we need to mention that the topic of Hadoop is huge, and even though this article is already long, it does not come close to covering every aspect in detail. This is why we have split it into three parts, so you can decide for yourself how deep you want to dive into it:

Part 1: Hadoop 101: What it is, why it matters, and who should care

This part is for everyone interested in Big Data and Data Science who wants to get to know this classic tool and also understand its downsides.

Part 2: Getting Hands-On: Setting up and scaling Hadoop

All readers who weren’t scared off by the disadvantages of Hadoop and the size of the ecosystem can use this part as a guideline for starting their first local cluster and learning the basics of operating it.

Part 3: Hadoop ecosystem: Get the most out of your cluster

In this section, we go under the hood and explain the core components and how they can be further advanced to meet your requirements. 

Part 1: Hadoop 101: What it is, why it matters, and who should care

Hadoop is an open-source framework for the distributed storage and processing of large amounts of data. It was originally developed by Doug Cutting and Mike Cafarella and started as part of the open-source web search engine project Nutch. It was only later named Hadoop by Cutting, after his son’s toy elephant. This is where the yellow elephant in today’s logo comes from.

The original concept was based on two Google papers on distributed file systems and the MapReduce mechanism and initially comprised around 11,000 lines of code. Other methods, such as the YARN resource manager, were only added in 2012. Today, the ecosystem comprises a large number of components that go far beyond pure file storage.

Hadoop differs fundamentally from traditional relational databases (RDBMS):

Attribute | Hadoop | RDBMS
Data Structure | Structured, semi-structured, and unstructured data | Structured data
Processing | Batch processing or partial real-time processing | Transaction-based with SQL
Scalability | Horizontal scaling across multiple servers | Vertical scaling through stronger servers
Flexibility | Supports many data formats | Strict schemas must be adhered to
Costs | Open source with affordable hardware | Mostly open source, but with powerful, expensive servers

Which applications use Hadoop?

Hadoop is an important big data framework that has established itself in many companies and applications in recent years. In general, it can be used primarily for the storage of large and unstructured data volumes and, thanks to its distributed architecture, is particularly suitable for data-intensive applications that would not be manageable with traditional databases.

Typical use cases for Hadoop include: 

  • Big data analysis: Hadoop enables companies to centrally collect and store large amounts of data from different systems. This data can then be processed for further analysis and made available to users in reports. Both structured data, such as financial transactions or sensor data, and unstructured data, such as social media comments or website usage data, can be stored in Hadoop.
  • Log analysis & IT monitoring: In modern IT infrastructure, a wide variety of systems generate data in the form of logs that provide information about the status or log certain events. This information needs to be stored and reacted to in real-time, for example, to prevent failures if the memory is full or the program is not working as expected. Hadoop can take on the task of data storage by distributing the data across several nodes and processing it in parallel, while also analyzing the information in batches.
  • Machine learning & AI: Hadoop provides the basis for many machine learning and AI models by managing the data sets for large models. In text or image processing in particular, the model architectures require a lot of training data that takes up large amounts of memory. With the help of Hadoop, this storage can be managed and operated efficiently so that the focus can be on the architecture and training of the AI algorithms.
  • ETL processes: ETL processes are essential in companies to prepare the data so that it can be processed further or used for analysis. To do this, it must be collected from a wide variety of systems, then transformed and finally stored in a data lake or data warehouse. Hadoop can provide central support here by offering a good connection to different data sources and allowing Data Processing to be parallelized across multiple servers. In addition, cost efficiency can be increased, especially in comparison to classic ETL approaches with data warehouses.

The list of well-known companies that use Hadoop daily and have made it an integral part of their architecture is very long. Facebook, for example, uses Hadoop to process several petabytes of user data every day for advertisements, feed optimization, and machine learning. Twitter, on the other hand, uses Hadoop for real-time trend analysis or to detect spam, which should be flagged accordingly. Finally, Yahoo has one of the world’s largest Hadoop installations with over 40,000 nodes, which was set up to analyze search and advertising data.

What are the advantages and disadvantages of Hadoop?

Hadoop has become a powerful and popular big data framework used by many companies, especially in the 2010s, due to its ability to process large amounts of data in a distributed manner. In general, the following advantages arise when using Hadoop:

  • Scalability: The cluster can easily be scaled horizontally by adding new nodes that take on additional tasks for a job. This also makes it possible to process data volumes that exceed the capacity of a single computer.
  • Cost efficiency: This horizontal scalability also makes Hadoop very cost-efficient, as more low-cost computers can be added for better performance instead of equipping a server with expensive hardware and scaling vertically. In addition, Hadoop is open-source software and can therefore be used free of charge.
  • Flexibility: Hadoop can process both unstructured data and structured data, offering the flexibility to be used for a wide variety of applications. It offers additional flexibility by providing a large library of components that further extend the existing functionalities.
  • Fault tolerance: By replicating the data across different servers, the system can still function in the event of most hardware failures, as it simply falls back on another replication. This also results in high availability of the entire system.

However, the following disadvantages should also be taken into account:

  • Complexity: Due to the strong networking of the cluster and the individual servers in it, the administration of the system is rather complex, and a certain amount of training is required to set up and operate a Hadoop cluster correctly. However, this point can be avoided by using a cloud connection and the automatic scaling it contains.
  • Latency: Hadoop processes data in batches and therefore introduces latency, as the data is not processed in real time but only once enough data is available for a batch. Hadoop tries to mitigate this with mini-batches, but some latency remains.
  • Data management: Additional components are required for data management, such as data quality control or tracking the data sequence. Hadoop does not include any direct tools for data management.

Hadoop is a powerful tool for processing big data. Above all, scalability, cost efficiency, and flexibility are decisive advantages that have contributed to the widespread use of Hadoop. However, there are also some disadvantages, such as the latency caused by batch processing.

Does Hadoop have a future?

Hadoop has long been the leading technology for distributed big data processing, but new systems have also emerged and become increasingly relevant in recent years. One of the biggest trends is that most companies are turning to fully managed cloud data platforms that can run Hadoop-like workloads without the need for a dedicated cluster. This also makes them more cost-efficient, as only the hardware that is needed has to be paid for.

In addition, Apache Spark in particular has established itself as a faster alternative to MapReduce and therefore outperforms the classic Hadoop setup. It is also interesting because it offers an almost complete solution for AI workloads thanks to functionalities such as Spark Streaming and the MLlib machine learning library.

Although Hadoop remains a relevant big data framework, it is slowly losing importance these days. Even though many established companies continue to rely on the clusters that were set up some time ago, companies that are now starting with big data are using cloud solutions or specialized analysis software directly. Accordingly, the Hadoop platform is also evolving and offers new solutions that adapt to this zeitgeist.

Who should still learn Hadoop?

With the rise of cloud-native data platforms and modern distributed computing frameworks, you might be wondering: Is Hadoop still worth learning? The answer depends on your role, industry, and the scale of data you work with. While Hadoop is no longer the default choice for big data processing, it remains highly relevant in many enterprise environments. Hadoop could still be relevant for you if at least one of the following applies:

  • Your company still has a Hadoop-based data lake. 
  • The data you are storing is confidential and needs to be hosted on-premises. 
  • You work with ETL processes and data ingestion at scale.
  • Your goal is to optimize batch-processing jobs in a distributed environment. 
  • You need to work with tools like Hive, HBase, or Apache Spark on Hadoop. 
  • You want to optimize cost-efficient data storage and processing solutions. 

Hadoop is definitely not necessary for every data professional. If you’re working primarily with cloud-native analytics tools, serverless architectures, or lightweight data-wrangling tasks, spending time on Hadoop may not be the best investment. 

You can skip Hadoop if:

  • Your work is focused on SQL-based analytics with cloud-native solutions (e.g., BigQuery, Snowflake, Redshift).
  • You primarily handle small to mid-sized datasets in Python or Pandas.
  • Your company has already migrated away from Hadoop to fully cloud-based architectures.

Hadoop is no longer the cutting-edge technology it once was, but it is still important for many applications and companies with existing data lakes, large-scale ETL processes, or on-premises infrastructure. In the following part, we will finally get more practical and show how a simple cluster can be set up to build your big data framework with Hadoop.

The post Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies appeared first on Towards Data Science.

]]>
Kubernetes — Understanding and Utilizing Probes Effectively https://towardsdatascience.com/kubernetes-understanding-and-utilizing-probes-effectively/ Thu, 06 Mar 2025 03:59:54 +0000 https://towardsdatascience.com/?p=598812 Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

The post Kubernetes — Understanding and Utilizing Probes Effectively appeared first on Towards Data Science.

]]>
Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Reducing deployment times, making your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application reliability, reduce deployment downtime, and handle unexpected errors gracefully. In this article, we’ll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle and when used together, they create a rock-solid framework for maintaining your application availability and performance.

Startup: Optimizing start-up times

Start-up probes are evaluated once, when a new pod is spun up because of a scale-up event or a new deployment. They serve as a gatekeeper for the rest of the container checks, and fine-tuning them will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
  • Increase failureThreshold to a high enough value to accommodate for the worst-case start-up time.

The Startup probe will check whether your container has started by querying the configured path. It will additionally stop the triggering of the Liveness and Readiness probes until it is successful.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since K8s will completely restart your container and spin up a new one, add a failureThreshold to combat intermittent abnormalities.
  • Avoid using initialDelaySeconds as it is too restrictive — use a Start-up probe instead.

Be mindful that a failing Liveness probe will bring down your currently running pod and spin up a new one, so avoid making it too aggressive — that’s for the next one.

Readiness: Handling unexpected errors

The readiness probe determines whether the container should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost its connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive and reduce the periodSeconds.
  • Keep failureThreshold at a minimum; you want to fail quickly.
  • The timeout period should also be kept at a minimum to handle slower containers.
  • Give the readinessProbe ample time to recover by having a longer-running livenessProbe.

Readiness probes ensure that traffic will not reach a container not ready for it and as such it’s one of the most important ones in the stack.

Putting it all together

As you can see, even if all of the probes have their own distinct uses, the best way to improve your application’s resilience strategy is using them alongside each other.

Your startup probe will assist you in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It fires only once and also stops the execution of the rest of the probes until it successfully completes.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors and tells the cluster to bring up a new, fresh pod just for you.

The readiness probe is the one that tells K8s whether a pod should receive traffic. It can be extremely useful for dealing with intermittent errors or high resource consumption that results in slower response times.
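To make this concrete, the sketch below shows how the three sample configurations from above might sit together in a single container spec. The container name and image are placeholders, and the /health endpoint on port 80 is the same assumption used in the earlier samples:

containers:
  - name: app               # placeholder name
    image: my-app:latest    # placeholder image
    ports:
      - containerPort: 80
    startupProbe:
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 3
      failureThreshold: 1
      timeoutSeconds: 1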

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.
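For example, a liveness probe that runs a command instead of an HTTP request could look like the following sketch; the script path is a placeholder assumption:

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - /opt/app/healthcheck.sh   # placeholder health-check script
  periodSeconds: 10
  failureThreshold: 3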

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

The post Kubernetes — Understanding and Utilizing Probes Effectively appeared first on Towards Data Science.

]]>
Building a Data Engineering Center of Excellence https://towardsdatascience.com/building-a-data-engineering-center-of-excellence/ Fri, 14 Feb 2025 02:35:48 +0000 https://towardsdatascience.com/?p=597886 As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice and why data engineering is becoming increasingly […]

The post Building a Data Engineering Center of Excellence appeared first on Towards Data Science.

]]>
As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice and why data engineering is becoming increasingly critical for businesses today, and how you can build your very own Data Engineering Center of Excellence!

I’ve had the privilege to build, manage, lead, and foster a sizeable high-performing team of data warehouse & ELT engineers for many years. With the help of my team, I have spent a considerable amount of time every year consciously planning and preparing to manage the growth of our data month-over-month and address the changing reporting and analytics needs for our 20000+ global data consumers. We built many data warehouses to store and centralize massive amounts of data generated from many OLTP sources. We’ve implemented Kimball methodology by creating star schemas both within our on-premise data warehouses and in the ones in the cloud.

The objective is to enable our user-base to perform fast analytics and reporting on the data; so our analysts’ community and business users can make accurate data-driven decisions.

It took me about three years to transform teams (plural) of data warehouse and ETL programmers into one cohesive Data Engineering team.

I have compiled some of my learnings building a global data engineering team in this post in hopes that Data professionals and leaders of all levels of technical proficiency can benefit.

Evolution of the Data Engineer

It has never been a better time to be a data engineer. Over the last decade, we have seen a massive awakening of enterprises now recognizing their data as the company’s heartbeat, making data engineering the job function that ensures accurate, current, and quality data flow to the solutions that depend on it.

Historically, the role of Data Engineers has evolved from that of data warehouse developers and the ETL/ELT developers (extract, transform and load).

The data warehouse developers are responsible for designing, building, developing, administering, and maintaining data warehouses to meet an enterprise’s reporting needs. This is done primarily via extracting data from operational and transactional systems and piping it using extract transform load methodology (ETL/ ELT) to a storage layer like a data warehouse or a data lake. The data warehouse or the data lake is where data analysts, data scientists, and business users consume data. The developers also perform transformations to conform the ingested data to a data model with aggregated data for easy analysis.

A data engineer’s prime responsibility is to produce and make data securely available for multiple consumers.

Data engineers oversee the ingestion, transformation, modeling, delivery, and movement of data through every part of an organization. Data extraction happens from many different data sources & applications. Data Engineers load the data into data warehouses and data lakes, which are transformed not just for the Data Science & predictive analytics initiatives (as everyone likes to talk about) but primarily for data analysts. Data analysts & data scientists perform operational reporting, exploratory analytics, service-level agreement (SLA) based business intelligence reports and dashboards on the catered data. In this book, we will address all of these job functions.

The role of a data engineer is to acquire, store, and aggregate data from both cloud and on-premise, new, and existing systems, with data modeling and feasible data architecture. Without the data engineers, analysts and data scientists won’t have valuable data to work with, and hence, data engineers are the first to be hired at the inception of every new data team. Based on the data and analytics tools available within an enterprise, data engineering teams’ role profiles, constructs, and approaches have several options for what should be included in their responsibilities which we will discuss in this chapter.

Data Engineering team

Software is increasingly automating the historically manual and tedious tasks of data engineers. Data processing tools and technologies have evolved massively over several years and will continue to grow. For example, cloud-based data warehouses (Snowflake, for instance) have made data storage and processing affordable and fast. Data pipeline services (like Informatica IICS, Apache Airflow, Matillion, or Fivetran) have turned data extraction into work that can be completed quickly and efficiently. The data engineering team should be leveraging such technologies as force multipliers, taking a consistent and cohesive approach to integration and management of enterprise data, not just relying on legacy siloed approaches to building custom data pipelines with fragile, non-performant, hard to maintain code. Continuing with the latter approach will stifle the pace of innovation within the said enterprise and force the future focus to be around managing data infrastructure issues rather than how to help generate value for your business.

The primary role of an enterprise Data Engineering team should be to transform raw data into a shape that’s ready for analysis — laying the foundation for real-world analytics and data science application.

The Data Engineering team should serve as the librarian for enterprise-level data with the responsibility to curate the organization’s data and act as a resource for those who want to make use of it, such as Reporting & Analytics teams, Data Science teams, and other groups that are doing more self-service or business group driven analytics leveraging the enterprise data platform. This team should serve as the steward of organizational knowledge, managing and refining the catalog so that analysis can be done more effectively. Let’s look at the essential responsibilities of a well-functioning Data Engineering team.

Responsibilities of a Data Engineering Team

The Data Engineering team should provide a shared capability within the enterprise that cuts across to support both the Reporting/Analytics and Data Science capabilities to provide access to clean, transformed, formatted, scalable, and secure data ready for analysis. The Data Engineering teams’ core responsibilities should include:

· Build, manage, and optimize the core data platform infrastructure

· Build and maintain custom and off-the-shelf data integrations and ingestion pipelines from a variety of structured and unstructured sources

· Manage overall data pipeline orchestration

· Manage transformation of data either before or after load of raw data through both technical processes and business logic

· Support analytics teams with design and performance optimizations of data warehouses

Data is an Enterprise Asset.

Data as an Asset should be shared and protected.

Data should be valued as an Enterprise asset, leveraged across all Business Units to enhance the company’s value to its respective customer base by accelerating decision making, and improving competitive advantage with the help of data. Good data stewardship, legal and regulatory requirements dictate that we protect the data owned from unauthorized access and disclosure.

In other words, managing Security is a crucial responsibility.

Why Create a Centralized Data Engineering Team?

Treating Data Engineering as a standard and core capability that underpins both the Analytics and Data Science capabilities will help an enterprise evolve how to approach Data and Analytics. The enterprise needs to stop vertically treating data based on the technology stack involved as we tend to see often and move to more of a horizontal approach of managing a data fabric or mesh layer that cuts across the organization and can connect to various technologies as needed drive analytic initiatives. This is a new way of thinking and working, but it can drive efficiency as the various data organizations look to scale. Additionally — there is value in creating a dedicated structure and career path for Data Engineering resources. Data engineering skill sets are in high demand in the market; therefore, hiring outside the company can be costly. Companies must enable programmers, database administrators, and software developers with a career path to gain the needed experience with the above-defined skillsets by working across technologies. Usually, forming a data engineering center of excellence or a capability center would be the first step for making such progression possible.

Challenges for creating a centralized Data Engineering Team

The centralization of the Data Engineering team as a service approach is different from how Reporting & Analytics and Data Science teams operate. It does, in principle, mean giving up some level of control of resources and establishing new processes for how these teams will collaborate and work together to deliver initiatives.

The Data Engineering team will need to demonstrate that it can effectively support the needs of both Reporting & Analytics and Data Science teams, no matter how large these teams are. Data Engineering teams must effectively prioritize workloads while ensuring they can bring the right skillsets and experience to assigned projects.

Data engineering is essential because it serves as the backbone of data-driven companies. It enables analysts to work with clean and well-organized data, necessary for deriving insights and making sound decisions. To build a functioning data engineering practice, you need the following critical components:

Data Engineering Center of Excellence

The Data Engineering team should be a core capability within the enterprise, but it should effectively serve as a support function involved in almost everything data-related. It should interact with the Reporting and Analytics and Data Science teams in a collaborative support role to make the entire team successful.

The Data Engineering team doesn’t create direct business value — but its value should come from making the Reporting and Analytics and Data Science teams more productive and efficient, ensuring delivery of maximum value to business stakeholders through Data & Analytics initiatives. To make that possible, the six key responsibilities within the data engineering capability center are as follows:

Data Engineering Center of Excellence — Image by Author.

Let’s review the 6 pillars of responsibilities:

1. Determine Central Data Location for Collation and Wrangling

Understanding and having a strategy for a data lake (a centralized data repository or data warehouse for the mass consumption of data for analysis). This includes defining the requisite data tables and where they will be joined in the context of data engineering, and subsequently converting raw data into digestible and valuable formats.

2. Data Ingestion and Transformation

Moving data from one or more sources to a new destination (your data lake or cloud data warehouse) where it can be stored and further analyzed, and then converting the data from the format of the source system to that of the destination.

3. ETL/ELT Operations

Extracting, transforming, and loading data from one or more sources into a destination system to represent the data in a new context or style.

4. Data Modeling

Data modeling is an essential function of a data engineering team, granted not all data engineers excel with this capability. Formalizing relationships between data objects and business rules into a conceptual representation through understanding information system workflows, modeling required queries, designing tables, determining primary keys, and effectively utilizing data to create informed output.

I’ve seen engineers in interviews mess up more with this than coding in technical discussions. It’s essential to understand the differences between Dimensions, Facts, Aggregate tables.

5. Security and Access

Ensuring that sensitive data is protected and implementing proper authentication and authorization to reduce the risk of a data breach

6. Architecture and Administration

Defining the models, policies, and standards that govern what data is collected, where and how it is stored, and how such data is integrated into various analytical systems.

The six pillars of responsibilities for data engineering capabilities center on the ability to determine a central data location for collation and wrangling, ingest and transform data, execute ETL/ELT operations, model data, secure access and administer an architecture. While all companies have their own specific needs with regards to these functions, it is important to ensure that your team has the necessary skillset in order to build a foundation for big data success.

Besides the Data Engineering following are the other capability centers that need to be considered within an enterprise:

Analytics Capability Center

The analytics capability center enables consistent, effective, and efficient BI, analytics, and advanced analytics capabilities across the company. Assist business functions in triaging, prioritizing, and achieving their objectives and goals through reporting, analytics, and dashboard solutions, while providing operational reports and visualizations, self-service analytics, and required tools to automate the generation of such insights.

Data Science Capability Center

The data science capability center is for exploring cutting-edge technologies and concepts to unlock new insights and opportunities, better inform employees and create a culture of prescriptive information usage using Automated AI and Automated ML solutions such as H2O.ai, Dataiku, Aible, DataRobot, and C3.ai.

Data Governance

The data governance office empowers users with trusted, understood, and timely data to drive effectiveness while keeping the integrity and sanctity of data in the right hands for mass consumption.


As your company grows, you will want to make sure that the data engineering capabilities are in place to support the six pillars of responsibilities. By doing this, you will be able to ensure that all aspects of data management and analysis are covered and that your data is safe and accessible by those who need it. Have you started thinking about how your company will grow? What steps have you taken to put a centralized data engineering team in place?

Thank you for reading!

The post Building a Data Engineering Center of Excellence appeared first on Towards Data Science.

]]>
Polars vs. Pandas — An Independent Speed Comparison https://towardsdatascience.com/polars-vs-pandas-an-independent-speed-comparison/ Tue, 11 Feb 2025 21:07:55 +0000 https://towardsdatascience.com/?p=597637 Overview Introduction — Purpose and Reasons Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following: As you’ve probably understood from the title, I am going to provide a […]

The post Polars vs. Pandas — An Independent Speed Comparison appeared first on Towards Data Science.

]]>
Overview
  1. Introduction — Purpose and Reasons
  2. Datasets, Tasks, and Settings
  3. Results
  4. Conclusions
  5. Wrapping Up

Introduction — Purpose and Reasons

Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following:

  • Cloud costs: This is probably the biggest factor. More compute time equals more costs in most billing models. In other billing based on a certain amount of preallocated resources, you could have chosen a lower service level if the speed of your ingestion and processing was higher.
  • Data timeliness: If you have a real-time stream that takes 5 minutes to process data, then your users will have a lag of at least 5 minutes when viewing the data through e.g. a Power BI rapport. This difference can be a lot in certain situations. Even for batch jobs, the data timeliness is important. If you are running a batch job every hour, it is a lot better if it takes 2 minutes rather than 20 minutes.
  • Feedback loop: If your batch job takes only a minute to run, then you get a very quick feedback loop. This probably makes your job more enjoyable. In addition, it enables you to find logical mistakes more quickly.

As you’ve probably understood from the title, I am going to provide a speed comparison between the two Python libraries Polars and Pandas. If you know anything about Pandas and Polars from before, then you know that Polars is the (relatively) new kid on the block proclaiming to be much faster than Pandas. You probably also know that Polars is implemented in Rust, which is a trend for many other modern Python tools like uv and Ruff.

There are two distinct reasons that I want to do a speed comparison test between Polars and Pandas:

Reason 1 — Investigating Claims

Polars boasts on its website with the following claim: Compared to pandas, it (Polars) can achieve more than 30x performance gains.

As you can see, you can follow a link to the benchmarks that they have. It’s commendable that they have speed tests open source. But if you are writing the comparison tests for both your own tool and a competitor’s tool, then there might be a slight conflict of interest. I’m not here saying that they are purposefully overselling the speed of Polars, but rather that they might have unconsciously selected for favorable comparisons.

Hence the first reason to do a speed comparison test is simply to see whether this supports the claims presented by Polars or not.

Reason 2 — Greater granularity

Another reason for doing a speed comparison test between Polars and Pandas is to make it slightly more transparent where the performance gains might be.

This might be already clear if you’re an expert on both libraries. However, speed tests between Polars and Pandas are mostly of interest to those considering switching up their tool. In that case, you might not yet have played around much with Polars because you are unsure if it is worth it.

Hence the second reason to do a speed comparison is simply to see where the speed gains are located.

I want to test both libraries on different tasks both within data ingestion and Data Processing. I also want to consider datasets that are both small and large. I will stick to common tasks within data engineering, rather than esoteric tasks that one seldom uses.

What I will not do

  • I will not give a tutorial on either Pandas or Polars. If you want to learn Pandas or Polars, then a good place to start is their documentation.
  • I will not cover other common data processing libraries. This might be disappointing to a fan of PySpark, but having a distributed compute model makes comparisons a bit more difficult. You might find that PySpark is quicker than Polars on tasks that are very easy to parallelize, but slower on other tasks where keeping all the data in memory reduces travel times.
  • I will not provide full reproducibility. Since this is, in humble words, only a blog post, then I will only explain the datasets, tasks, and system settings that I have used. I will not host a complete running environment with the datasets and bundle everything neatly. This is not a precise scientific experiment, but rather a guide that only cares about rough estimations.

Finally, before we start, I want to say that I like both Polars and Pandas as tools. I’m not financially or otherwise compensated by any of them obviously, and don’t have any incentive other than being curious about their performance ☺

Datasets, Tasks, and Settings

Let’s first describe the datasets that I will be considering, the tasks that the libraries will perform, and the system settings that I will be running them on.

Datasets

A most companies, you will need to work with both small and (relatively) large datasets. In my opinion, a good data processing tool can tackle both ends of the spectrum. Small datasets challenge the start-up time of tasks, while larger datasets challenge scalability. I will consider two datasets, both can be found on Kaggle:

  • A small dataset in CSV format: It is no secret that CSV files are everywhere! Often they are quite small, coming from Excel files or database dumps. What better example of this than the classical iris dataset (licensed with CC0 1.0 Universal License) with 5 columns and 150 rows. The iris version I linked to on Kaggle has 6 columns, but the classical one does not have a running index column. So remove this column if you want precisely the same dataset as I have. The iris dataset is certainly small data by any stretch of the imagination.
  • A large dataset in Parquet format: The parquet format is super useful for large data as it has built-in compression column-wise (along with many other benefits). I will use the Transaction dataset (licensed with Apache License 2.0) representing financial transactions. The dataset has 24 columns and 7 483 766 rows. It is close to 3 GB in its CSV format found on Kaggle. I used Pandas & Pyarrow to convert this to a parquet file. The final result is only 905 MB due to the compression of the parquet file format. This is at the low end of what people call big data, but it will suffice for us.

Tasks

I will do a speed comparison on five different tasks. The first two are I/O tasks, while the last three are common tasks in data processing. Specifically, the tasks are:

  1. Reading data: I will read both files using the respective methods read_csv() and read_parquet() from the two libraries. I will not use any optional arguments as I want to compare their default behavior.
  2. Writing data: I will write both files back to identical copies as new files using the respective methods to_csv() and to_parquet() for Pandas and write_csv() and write_parquet() for Polars. I will not use any optional arguments as I want to compare their default behavior.
  3. Computing Numeric Expressions: For the iris dataset, I will compute the expression SepalLengthCm ** 2 + SepalWidthCm as a new column in a copy of the DataFrame. For the transactions dataset, I will simply compute the expression (amount + 10) ** 2 as a new column in a copy of the DataFrame. I will use the standard way to transform columns in Pandas, while in Polars I will use the standard functions all(), col(), and alias() to make an equivalent transformation (see the sketch after this list).
  4. Filters: For the iris dataset, I will select the rows corresponding to the criteria SepalLengthCm >= 5.0 and SepalWidthCm <= 4.0. For the transactions dataset, I will select the rows corresponding to the categorical criteria merchant_category == 'Restaurant'. I will use the standard filtering method based on Boolean expressions in each library. In pandas, this is syntax such as df_new = df[df['col'] < 5], while in Polars this is given similarly by the filter() function along with the col() function. I will use the and-operator & for both libraries to combine the two numeric conditions for the iris dataset.
  5. Group By: For the iris dataset, I will group by the Species column and calculate the mean values for each species of the four columns SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. For the transactions dataset, I will group by the column merchant_category and count the number of instances in each of the classes within merchant_category. Naturally, I will use the groupby() function in Pandas and the group_by() function in Polars in obvious ways.
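As a rough illustration of tasks 3–5 on the transactions dataset, the snippet below shows roughly equivalent Pandas and Polars code. The file name transactions.parquet and the new column name amount_transformed are assumptions of mine; only the column names amount and merchant_category come from the description above:

import pandas as pd
import polars as pl

# Read the data (file name is an assumption)
pdf = pd.read_parquet("transactions.parquet")
ldf = pl.read_parquet("transactions.parquet")

# Task 3 — numeric expression as a new column on a copy of the DataFrame
pdf_expr = pdf.copy()
pdf_expr["amount_transformed"] = (pdf_expr["amount"] + 10) ** 2
ldf_expr = ldf.with_columns(((pl.col("amount") + 10) ** 2).alias("amount_transformed"))

# Task 4 — filter on a categorical column
pdf_restaurants = pdf[pdf["merchant_category"] == "Restaurant"]
ldf_restaurants = ldf.filter(pl.col("merchant_category") == "Restaurant")

# Task 5 — group by and count instances per class
pdf_counts = pdf.groupby("merchant_category").size()
ldf_counts = ldf.group_by("merchant_category").agg(pl.len())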

Settings

  • System Settings: I’m running all the tasks locally with 16GB RAM and an Intel Core i5–10400F CPU with 6 Cores (12 logical cores through hyperthreading). So it’s not state-of-the-art by any means, but good enough for simple benchmarking.
  • Python: I’m running Python 3.12. This is not the most current stable version (which is Python 3.13), but I think this is a good thing. Commonly the latest supported Python version in cloud data warehouses is one or two versions behind.
  • Polars & Pandas: I’m using Polars version 1.21 and Pandas 2.2.3. These are roughly the newest stable releases to both packages.
  • Timeit: I’m using the standard timeit module in Python and finding the median of 10 runs.

Especially interesting will be how Polars can take advantage of the 12 logical cores through multithreading. There are ways to make Pandas take advantage of multiple processors, but I want to compare Polars and Pandas out of the box without any external modification. After all, this is probably how they are running in most companies around the world.
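For reference, the timing approach itself is nothing more than the standard library; the following is a minimal sketch of how a single task could be measured, where the wrapped function and file name are placeholders:

import statistics
import timeit

import polars as pl

def read_transactions():
    # Placeholder task: read the large Parquet file once
    pl.read_parquet("transactions.parquet")

# Run the task 10 times and report the median wall-clock time
times = timeit.repeat(read_transactions, repeat=10, number=1)
print(f"Median runtime: {statistics.median(times):.3f} seconds")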

Results

Here I will write down the results for each of the five tasks and make some minor comments. In the next section I will try to summarize the main points into a conclusion and point out a disadvantage that Polars has in this comparison:

Task 1 — Reading data

The median run time over 10 runs for the reading task was as follows:

# Iris Dataset
Pandas: 0.79 milliseconds
Polars: 0.31 milliseconds

# Transactions Dataset
Pandas: 14.14 seconds
Polars: 1.25 seconds

For reading the Iris dataset, Polars was roughly 2.5x faster than Pandas. For the transactions dataset, the difference is even starker where Polars was 11x faster than Pandas. We can see that Polars is much faster than Pandas for reading both small and large files. The performance difference grows with the size of the file.

Task 2 — Writing data

The median run time over 10 runs for the writing task was as follows:

# Iris Dataset
Pandas: 1.06 milliseconds
Polars: 0.60 milliseconds

# Transactions Dataset
Pandas: 20.55 seconds
Polars: 10.39 seconds

For writing the iris dataset, Polars was around 75% faster than Pandas. For the transactions dataset, Polars was roughly 2x as fast as Pandas. Again we see that Polars is faster than Pandas, but the difference here is smaller than for reading files. Still, a difference of close to 2x in performance is a massive difference.

Task 3 — Computing Numeric Expressions

The median run time over 10 runs for the computing numeric expressions task was as follows:

# Iris Dataset
Pandas: 0.35 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 54.58 milliseconds
Polars: 14.92 milliseconds

For computing the numeric expressions, Polars beats Pandas with a rate of roughly 2.5x for the iris dataset, and roughly 3.5x for the transactions dataset. This is a pretty massive difference. It should be noted that computing numeric expressions is fast in both libraries even for the large dataset transactions.

Task 4 — Filters

The median run time over 10 runs for the filters task was as follows:

# Iris Dataset
Pandas: 0.40 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 0.70 seconds
Polars: 0.07 seconds

For filters, Polars is 2.6x faster on the iris dataset and 10x as fast on the transactions dataset. This is probably the most surprising improvement for me since I suspected that the speed improvements for filtering tasks would not be this massive.

Task 5 — Group By

The median run time over 10 runs for the group by task was as follows:

# Iris Dataset
Pandas: 0.54 milliseconds
Polars: 0.18 milliseconds

# Transactions Dataset
Pandas: 334 milliseconds 
Polars: 126 milliseconds

For the group-by task, there is a 3x speed improvement for Polars in the case of the iris dataset. For the transactions dataset, there is a 2.6x improvement of Polars over Pandas.

Conclusions

Before highlighting each point below, I want to point out that Polars is somewhat in an unfair position throughout my comparisons. In practice, multiple data transformations are often performed one after another. For this, Polars has the lazy API that optimizes the whole chain before calculating (a sketch of what this looks like follows below). Since I have considered single ingestions and transformations, this advantage of Polars is hidden. How much this would improve in practical situations is not clear, but it would probably make the difference in performance even bigger.
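For context, here is a minimal sketch of a lazy Polars pipeline; nothing is executed until collect() is called, which gives the query optimizer a chance to fuse the chained steps. File and column names are placeholders.

import polars as pl

# scan_parquet() only builds a lazy query plan
result = (
    pl.scan_parquet("transactions.parquet")
    .filter(pl.col("merchant_category") == "Restaurant")
    .group_by("merchant_category")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()  # the optimizer runs and the plan executes here
)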

Data Ingestion

Polars is significantly faster than Pandas for both reading and writing data. The difference is largest in reading data, where we had a massive 11x difference in performance for the transactions dataset. On all measurements, Polars performs significantly better than Pandas.

Data Processing

Polars is significantly faster than Pandas for common data processing tasks. The difference was starkest for filters, but you can at least expect a 2–3x difference in performance across the board.

Final Verdict

Polars consistently performs faster than Pandas on all tasks with both small and large data. The improvements are very significant, ranging from a 2x improvement to a whopping 11x improvement. When it comes to reading large parquet files or performing filter statements, Polars is leaps and bounds in front of Pandas.

However…Nowhere here is Polars remotely close to performing 30x better than Pandas, as Polars’ benchmarking suggests. I would argue that the tasks that I have presented are standard tasks performed on realistic hardware infrastructure. So I think that my conclusions give us some room to question whether the claims put forward by Polars give a realistic picture of the improvements that you can expect.

Nevertheless, I am in no doubt that Polars is significantly faster than Pandas. Working with Polars is not more complicated than working with Pandas. So for your next data engineering project where the data fits in memory, I would strongly suggest that you opt for Polars rather than Pandas.

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post gave you a different perspective on the speed difference between Polars and Pandas. Please comment if you have a different experience with the performance difference between Polars and Pandas than what I have presented.

If you are interested in AI, Data Science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts:

The post Polars vs. Pandas — An Independent Speed Comparison appeared first on Towards Data Science.

]]>
Align Your Data Architecture for Universal Data Supply https://towardsdatascience.com/align-your-data-architecture-for-universal-data-supply-656349c9ae66/ Fri, 31 Jan 2025 17:02:00 +0000 https://towardsdatascience.com/align-your-data-architecture-for-universal-data-supply-656349c9ae66/ Follow me through the steps on how to evolve your architecture to align with your business needs

The post Align Your Data Architecture for Universal Data Supply appeared first on Towards Data Science.

]]>
Photo by Simone Hutsch on Unsplash

Now that we understand the business requirements, we need to check if the current Data Architecture supports them.

If you’re wondering what to assess in our data architecture and what the current setup looks like, check the business case description.

· Assessing against short-term requirements · Initial alignment approach · Medium-term requirements and long-term vision · Step-by-step conversion · Agility requires some foresight · Build your business process and information model · Holistically challenge your architecture · Decouple and evolve

Assessing against short-term requirements

Let’s recap the short-term requirements:

  1. Immediate feedback with automated compliance monitors: Providing timely feedback to staff on compliance to reinforce hand hygiene practices effectively. Calculate the compliance rates in near real time and show them on ward monitors using a simple traffic light visualization.
  2. Device availability and maintenance: Ensuring dispensers are always functional, with near real-time tracking for refills to avoid compliance failures due to empty dispensers.

The current weekly batch ETL process is obviously not able to deliver immediate feedback.

However, we could try to reduce the batch runtime as much as possible and loop it continuously. For near real-time feedback, we would also need to run the query continuously to get the latest compliance rate report.

Both of these technical requirements are challenging. The weekly batch process from the HIS handles large data volumes and can’t be adjusted to run in seconds. Continuous monitoring would also put a heavy load on the data warehouse if we keep the current model, which is optimized for tracking history.

Before we dig deeper to solve this, let’s also examine the second requirement.

The smart dispenser can be loaded with bottles of various sizes, tracked in the Dispenser Master Data. To calculate the current fill level, we subtract the total amount dispensed from the initial volume. Each time the bottle is replaced, the fill level should reset to the initial volume. To support this, the dispenser manufacturer has announced two new events to be implemented in a future release:

  • The dispenser will automatically track its fill level and send a refill warning when it reaches a configurable low point. This threshold is based on the estimated time until the bottle is empty (remaining time to failure).
  • When the dispenser’s bottle is replaced, it will send a bottle exchange event.

However, these improved devices won’t be available for about 12 months. As a workaround, the current ETL process needs to be updated to perform the required calculations and generate the events.
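As an illustration of the calculation this ETL workaround has to perform, here is a simplified sketch; the names are hypothetical, and the refill threshold is expressed as a remaining volume rather than the estimated remaining time to failure described above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DispenserState:
    dispenser_id: str
    initial_volume_ml: float      # bottle size from the Dispenser Master Data
    dispensed_ml: float = 0.0     # accumulated usage since the last exchange

    @property
    def fill_level_ml(self) -> float:
        return self.initial_volume_ml - self.dispensed_ml

def on_usage(state: DispenserState, usage_ml: float,
             warn_below_ml: float = 100.0) -> Optional[dict]:
    """Update the fill level and emit a refill warning event below the threshold."""
    state.dispensed_ml += usage_ml
    if state.fill_level_ml <= warn_below_ml:
        return {"event": "refill_warning",
                "dispenser_id": state.dispenser_id,
                "fill_level_ml": state.fill_level_ml}
    return None

def on_bottle_exchange(state: DispenserState, new_volume_ml: float) -> dict:
    """Reset the fill level when the bottle is replaced and emit the exchange event."""
    state.initial_volume_ml = new_volume_ml
    state.dispensed_ml = 0.0
    return {"event": "bottle_exchange", "dispenser_id": state.dispenser_id}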

A new report is needed based on these events to inform support staff about dispensers requiring timely bottle replacement. In medium-sized hospitals with 200–500 dispensers, intensive care units use about two 1-liter bottles of disinfectant per month. This means around 19 dispensers need refilling in the support staff’s weekly exchange plan.

Since dispenser usage varies widely across wards, the locations needing bottle replacements are spread throughout the hospital. Support staff would like to receive the bottle exchange list organized in an optimal route through the building.

Initial alignment approach

Following the principle "Never change a running system," we could try to reuse as many components as possible to minimize changes.

Initial idea to implement short-term requirements — Image by author

We would have to build NEW components (in green) and CHANGE existing components (in dark blue) to support the new requirements.

We know the batch needs to be replaced with stream processing for near real-time feedback. We consider using Change Data Capture (CDC), a technology to get updates on dispenser usage from the internal relational database. However, tests on the Dispenser Monitoring System showed that the Dispenser Usage Data Collector only updates the database every 5 minutes. To keep things simple, we decide to reschedule the weekly batch extraction process to sync with the monitoring system’s 5-minute update cycle.

By reducing the batch runtime and continuously looping over it, we effectively create a microbatch that supports stream processing. For more details, see my article on how to unify batch and stream processing.

Reducing the runtime of the HIS Data ETL batch process is a major challenge due to the large amount of data involved. We could decouple patient and occupancy data from the rest of the HIS data, but the HIS database extraction process is a complex, long-neglected COBOL program that no one dares to modify. The extraction logic is buried deep within the COBOL monolith, and there is limited knowledge of the source systems. Therefore, we consider implementing near real-time extraction of patient and occupancy data from HIS as "not feasible."

Instead, we plan to adjust the Compliance Rate Calculation to allow near real-time Dispenser Usage Data to be combined with the still-weekly updated HIS data. After discussing this with the hygiene specialists, we agree that the low rates of change in patient treatment and occupancy suggest the situation will remain fairly stable throughout the week.

The Continuous Compliance Rate On Ward Level will be stored in a real-time partition associated with the ward entity of the data warehouse. It will support short runtimes of the new Traffic Light Monitor Query that is scheduled as a successor to the respective ETL batch process.

Consequently, the monitor will be updated every 5 minutes, which seems close enough to near real-time. The new Exchange List Query will be scheduled weekly to create the Weekly Bottle-Exchange Plan to be sent by email to the support staff.

We feel confident that this will adequately address the short-term requirements.

Medium-term requirements and long-term vision

However, before we start sprinting ahead with the short-term solution, we should also examine the medium and long-term vision. Let’s recap the identified requirements:

  1. Granular data insights: Moving beyond aggregate reports to gain insight into compliance at more specific levels (e.g., by shift or even person).
  2. Actionable alerts for non-compliance: Leveraging historical data with near real-time extended monitoring data to enable systems to notify staff immediately of missed hygiene actions, ideally personalized by healthcare worker.
  3. Personalized compliance dashboards: Creating personalized dashboards that show each worker’s compliance history, improvement opportunities, and benchmarks.
  4. Integration with smart wearables: Utilizing wearable technology to give real-time and discrete feedback directly to healthcare workers, supporting compliance at the point of care.

These long-term visions highlight the need to significantly improve real-time processing capabilities. They also emphasize the importance of processing data at a more granular level and using intelligent processing to derive individualized insights. Processing personalized information raises security concerns that must be properly addressed as well. Finally, we need to seamlessly integrate advanced monitoring devices and smart wearables to receive personalized information in a secure, discreet, and timely manner.

That leads to a whole chain of additional challenges for our current architecture.

But it’s not only the requirements of the hygiene monitoring that are challenging; the hospital is also about to be taken over by a large private hospital operator.

This means the current HIS must be integrated into a larger system that will cover 30 hospitals. The goal is to extend the advanced monitoring functionality for hygiene dispensers so that other hospitals in the new operator’s network can also benefit. As a long-term vision, they want the monitoring functionality to be seamlessly integrated into their global HIS.

Another challenge is planning for the announced innovations from the dispenser manufacturer. Through ongoing discussions about remaining time to failure, refill warnings, and bottle exchange events, we know the manufacturer is open to enabling real-time streaming for Dispenser Usage Data. This would allow data to be sent directly to consumers, bypassing the current 5-minute batch process through the relational database.

Step-by-step conversion

We want to counter the enormous challenges facing our architecture with a gradual transformation.

Since we’ve learned that working agile is beneficial, we want to start with the initial idea and then refine the system in subsequent steps.

But is this really agile working?

Agility requires some foresight

What I often encounter is that people equate "acting in small incremental steps" with working agile. While it’s true that we want to evolve our architecture progressively, each step should aim at the long-term target.

If we constrain our evolution to what the current IT architecture can deliver, we might not be moving toward what is truly needed.

When we developed our initial alignment, we reasoned only about how to implement the first step within the existing architecture’s constraints. However, this approach narrows our view to what’s ‘feasible’ within the current setup.

So, let’s try the opposite and clearly address what’s needed, including the long-term requirements. Only then can we target the next steps to move the architecture in the right direction.

For architecture decisions, we don’t need to detail every aspect of the business processes using standards like Business Process Model and Notation (BPMN). We just need a high-level understanding of the process and information flow.

But what’s the right level of detail that allows us to make evolutionary architecture decisions?

Build your business process and information model

Let’s start at a very high level to find out what the right level of detail is.

In part 3 of my series on Challenges and Solutions in Data Mesh I have outlined an approach based on modeling patterns to model an ontology or enterprise data model. Let’s apply this approach to our example.

Note: We can’t create a complete ontology for the healthcare industry in this article. However, we can apply this approach to the small sub-topic relevant to our example.


Let’s identify the obvious modeling patterns relevant for our example:

Party & Role: The parties acting in our healthcare example include patients, medical device suppliers, healthcare professionals (doctors, nurses, hygiene specialists, etc.), the hospital operator, support staff and the hospital as an organizational unit.

Location: The hospital building address, patient rooms, floors, laboratories, operating rooms, etc.

Resource / Asset: The hospital as a building, medical devices like our intelligent dispensers, etc.

Document: All kinds of files representing patient information like diagnosis, written agreements, treatment plans, etc.

Event: We have identified dispenser-related events, such as bottle exchange and refill warnings, as well as healthcare practitioner-related events, like an identified hand hygiene opportunity or moment.

Task: From the doctor’s patient treatment plan, we can directly derive procedures or activities that healthcare workers need to perform. Monitoring these procedures is one of the many information requirements for delivering healthcare services.


The following high-level modeling patterns may not be as obvious for the healthcare setup in our example at first sight:

Product: Although we might not think of hospitals as being product-oriented, they certainly provide services like diagnoses or patient treatments. If pharmaceuticals, supplies, and medical equipment are offered, we can even talk about typical products. A better overall term would probably be a "health care offering".

Agreement: Not only agreements between provider networks and supplier agreements for the purchase of medical products and medicines but also agreements between patients and doctors.

Account: Our use case is mainly concerned with upholding best hygiene practices by closely monitoring and educating staff. We just don’t focus on accounting aspects here. However, accounting in general as well as claims management and payment settlement are very important healthcare business processes. A large part of the Hospital Information System (HIS) therefore deals with accounting.

Let’s visualize our use case with these high-level modeling patterns and their relationships.

Our example from the healthcare sector, illustrated with high-level modeling patterns – Image by author

What does this buy us?

With this high-level model we can identify ‘hygiene monitoring’ as an overall business process to observe patient care and take appropriate action so that infections associated with care are prevented in the best possible way.

We recognize ‘patient management’ as an overall process to manage and track all the patient care activities related to the healthcare plan prepared by the doctors.

We recognize ‘hospital management’ that organizes assets like hospital buildings with patient bedrooms as well as all medical devices and instrumentation inside. Patients and staff occupy and use these assets over time and this usage needs to be managed.

Let’s describe some of the processes:

  • A Doctor documents the Diagnosis derived from the examination of the Patient
  • A Doctor discusses the derived Diagnosis with the Patient and documents everything that has been agreed with the Patient about the recommended treatment in a Patient Treatment Plan.
  • The Agreement on the treatment triggers the Treatment Procedure and reflects the responsibility of the Doctor and Nurses for the patient’s treatment.
  • A Nurse responsible for Patient Bed Occupancy will assign a patient bed at the ward, which triggers a Patient Bed Allocation.
  • A Nurse responsible for the patient’s treatment takes a blood sample from the patient and triggers several Hand Hygiene Opportunities and Dispenser Hygiene Actions detected by Hygiene Monitoring.
  • The Hygiene Monitoring calculates compliance from Dispenser Hygiene Action, Hand Hygiene Opportunity, and Patient Bed Allocation information and documents it for the Continuous Compliance Monitor.
  • During the week ongoing Dispenser Hygiene Actions cause the Hygiene Monitoring to trigger Dispenser Refill Warnings.
  • A Hygiene Specialist responsible for the Hygiene Monitoring compiles a weekly Bottle Exchange Plan from accumulated Dispenser Refill Warnings.
  • Support Staff responsible for the weekly Exchange Bottle Tour receives the Bottle Exchange Plan and triggers Dispenser Bottle Exchange events when replacing empty bottles for the affected dispensers.
  • and so on …

This way we get an overall functional view of our business. The view is completely independent of the architectural style we’ll choose to actually implement the business requirements.

A high-level business process and information model is therefore a perfect artifact to discuss any use case with healthcare practitioners.

Holistically challenge your architecture

With such a thorough understanding of our business, we can challenge our architecture more holistically. Everything we already understand and know today can and should be used to drive our next step toward the target architecture.

Let’s examine why our initial architecture approach falls short of properly supporting all identified requirements:

  • Near real-time processing is only partly addressed

A traditional data warehouse architecture is not the ideal architectural approach for near real-time processing. In our example, the long-running HIS data extraction process is a batch-oriented monolith that cannot be tuned to support low-latency requirements.

We can split the monolith into independent extraction processes, but to really enable all involved applications for near real-time processing, we need to rethink the way we share data across applications.

As data engineers, we should create abstractions that relieve application developers of low-level data processing decisions. They should neither have to reason about whether a batch or stream processing style needs to be chosen, nor need to know how to actually implement this technically.

If we allow the application developers to implement the required business logic independent of these technical data details, it would greatly simplify their job.

You can get more details on how to practically implement this in my article on unifying batch and stream processing.

  • The initial alignment is driven by technology, not by business

Business requirements should drive the IT architecture decisions. If we turn a blind eye and soften the requirements to such an extent that it becomes ‘feasible’, we allow technology to drive the process.

The discussion with the hygiene specialists about the low rates of change in patient treatment and occupancy is such a softening of requirements. We know that there will be situations where the state will change during the week, but we accept the assumption of stability to keep the current IT architecture.

Even if we won’t be able to immediately change the complete architecture, we should take steps into the right direction. Even if we cannot enable all applications at once to support near real-time processing, we should take action to create support for it.

  • Smart devices, standard operational systems (HIS) and advanced monitoring need to be seamlessly integrated

The long-term vision is to seamlessly integrate the monitoring functionality with available HIS features. This includes the integration of various new (sub-)systems and new technical devices that are essential for operating the hospital.

With an architecture that focuses one-sidedly on analytical processing, we cannot adequately address these cross-cutting needs. We need to find ways to enable flexible data flow between all future participants in the system. Every application or system component needs to be connected to our mesh of data without having to change the component itself.

Overall, we can state that the initial architecture change plan won’t be a targeted step towards such a flexible integration approach.

Decouple and evolve

To ensure that each and every step is effective in moving towards our target architecture, we need a balanced decoupling of our current architecture components.

Universal data supply therefore defines the abstraction data as a product for the exchange of data between applications of any kind. To enable current applications to create data as a product without having to completely redesign them, we use data agents to (re-)direct data flow from the application to the mesh.

Modern Data And Application Engineering Breaks the Loss of Business Context

By using these abstractions, any application can also become near real-time capable. Because it doesn’t matter if the application is part of the operational or the analytical plane, the intended integration of the operational HIS with hygiene monitoring components is significantly simplified.

Operational and Analytical Data

Let’s examine how the decoupling helps, for instance, to integrate the current data warehouse to the mesh.

The data warehouse can be redefined to act like one among many applications in the mesh. We can, for instance, re-design the ETL component Continuous Compliance Rate on Ward Level as an independent application producing the data as a product abstraction. If we don’t want or can’t touch the ETL logic itself, we can instead use the data agent abstraction to transform data to the target structure.

We can do the same for Dispenser Exchange Events or any other ETL or query / reporting component identified. The COBOL monolith HIS Data can be decoupled by implementing a data agent that separates the data products HIS occupancy data and HIS patient data. This allows the data-delivering components to evolve completely independently of the consumers.

Whenever the dispenser vendor is ready to deliver advanced functionalities to directly create the required exchange events, we would just have to change the Dispenser Exchange Events component. Either the vendor can deliver the data as a product abstraction directly, or we can convert the dispenser’s proprietary data output by adapting Dispenser Exchange Event data agent and logic.

Aligned Architecture as an Adapted Data Mesh enabling universal data supply – Image by author

Whenever we are able to directly create HIS patient data or HIS occupancy data from the HIS, we can partly or completely decommission the HIS Data component without affecting the rest of the system.


We need to assess our architecture holistically, considering all known business requirements. A technology-constrained approach can lead to intermediate steps that are not geared towards what’s needed but just towards what seems feasible.

Dive deep into your business and derive technology-agnostic processes and information models. These models will foster your business understanding and at the same time allow your business to drive your architecture.


In subsequent steps, we will look at more technical details on how to design data as a product and data agents based on these ideas. Stay tuned for more insights!

The post Align Your Data Architecture for Universal Data Supply appeared first on Towards Data Science.

]]>
Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code https://towardsdatascience.com/stop-creating-bad-dags-optimize-your-airflow-environment-by-improving-your-python-code-146fcf4d27f7/ Thu, 30 Jan 2025 20:31:57 +0000 https://towardsdatascience.com/stop-creating-bad-dags-optimize-your-airflow-environment-by-improving-your-python-code-146fcf4d27f7/ Valuable tips to reduce your DAGs' parse time and save resources.

The post Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code appeared first on Towards Data Science.

]]>
Photo by Dan Roizer on Unsplash

Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows for companies worldwide. However, anyone who has already worked with Airflow in a production environment, especially in a complex one, knows that it can occasionally present some problems and weird bugs.

Among the many aspects you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we’ll explore in this article.

That said, this tutorial aims to introduce [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench), an open-source tool I developed to help Data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

Why Parse Time Matters

Regarding Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

By default, all your DAGs are parsed every 30 seconds – a frequency controlled by the configuration variable _min_file_process_interval_. This means that every 30 seconds, all the Python code that’s present in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

Two key Airflow components handle this process:

Together, both components (commonly referred to as the dag processor) are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run your dag processor as a separate component in your cluster.

If your environment only has a few dozen DAGs, it’s unlikely that the parsing process will cause any kind of problem. However, it’s common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

  • Delayed DAG scheduling.
  • Increased resource utilization.
  • Environment heartbeat issues.
  • Scheduler failures.
  • Excessive CPU and memory usage, wasting resources.

Now, imagine having an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

How to write better DAGs?

When writing Airflow DAGs, there are some important best practices to bear in mind to create optimized code. Although you can find a lot of tutorials on how to improve your DAGs, I’ll summarize some of the key principles that can significantly enhance your DAG performance.

Limit Top-Level Code

One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

The following code shows an example of a non-optimized DAG:
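A minimal sketch of such a DAG, assuming a hypothetical API endpoint, could look like this; the request and the DataFrame processing sit at the top level of the file, so they run on every parse.

import pandas as pd
import requests
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level work: executed every time the Scheduler parses this file
response = requests.get("https://example.com/api/config")  # hypothetical endpoint
config_df = pd.DataFrame(response.json())

def process_data(**context):
    print(f"Processing {len(config_df)} config rows")

with DAG(
    dag_id="non_optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_data", python_callable=process_data)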

In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly impact the parse time.

Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

The following code shows a better version of the same DAG:
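A minimal sketch of the optimized variant (same hypothetical endpoint), with the heavy imports and the API call moved inside the task callable so they only run at execution time:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_data(**context):
    # Heavy imports and I/O happen at task run time, not at parse time
    import pandas as pd
    import requests

    response = requests.get("https://example.com/api/config")  # hypothetical endpoint
    config_df = pd.DataFrame(response.json())
    print(f"Processing {len(config_df)} config rows")

with DAG(
    dag_id="optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_data", python_callable=process_data)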

Avoid Xcoms and Variables in Top-Level Code

Still on the same topic, it is particularly important to avoid using XComs and Variables in your top-level code. As stated in the Google documentation:

If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making multiple Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
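As a sketch of both options, assuming a hypothetical Airflow Variable named my_settings that stores a JSON object:

from airflow.models import Variable

# One database query retrieves several settings at once
settings = Variable.get("my_settings", deserialize_json=True)
bucket_name = settings["bucket"]
table_name = settings["table"]

# Alternatively, defer resolution with a Jinja template: the variable is only
# read when the task runs, not while the DAG file is being parsed, e.g.
# bash_command="echo {{ var.json.my_settings.bucket }}"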

Remove Unnecessary DAGs

Although it seems obvious, it’s always important to remember to periodically clean up unnecessary DAGs and files from your environment:

  • Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
  • Use .airflowignore: Specify the files Airflow should intentionally ignore, skipping parsing.
  • Review paused DAGs: Paused DAGs are still parsed by the Scheduler, consuming resources. If they are no longer required, consider removing or archiving them.

Change Airflow Configurations

Lastly, you could change some Airflow configurations to reduce the Scheduler resource usage:

  • min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler’s load at the cost of slower DAG updates.
  • dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage.

How to Measure DAG Parse Time?

We’ve discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.

For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command on Google CLI:

gcloud composer environments run $ENVIRONMENT_NAME \
    --location $LOCATION \
    dags report

While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report – a slow and time-consuming process.

Another possible approach, if you’re on Linux or Mac, is to run this command to measure the parse time locally on your machine:

time python airflow/example_dags/example.py

However, while simple, this approach is not practical for systematically measuring and comparing the parse times of multiple DAGs.

To address these challenges, I created the airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow’s native parse method.

Measuring and Comparing Your DAG’s Parse Times

The [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

Installing the Library

Before installation, it’s recommended to use a virtualenv to avoid library conflicts. Once set up, you can install the package by running the following command:

pip install airflow-parse-bench

Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

For example, if a DAG uses boto3 to interact with AWS, ensure that boto3 is installed in your environment. Otherwise, you’ll encounter parse errors.

After that, it’s necessary to initialize your Airflow database. This can be done by executing the following command:

airflow db init

In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, it’s not necessary to put real values on your variables, as the actual values aren’t required for parsing purposes:

airflow variables set MY_VARIABLE 'ANY TEST VALUE'

Without this, you’ll encounter an error like:

error: 'Variable MY_VARIABLE does not exist'

Using the Tool

After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

To measure its parse time, simply run:

airflow-parse-bench --path dag_test.py

This execution produces the following output:

Execution result. Image by author.

As observed, our DAG presented a parse time of 0.61 seconds. If I run the command again, I’ll see some small differences, as parse times can vary slightly across runs due to system and environmental factors:

Result of another execution of the same DAG. Image by author.

To obtain a more reliable number, it’s possible to aggregate multiple executions by specifying the number of iterations:

airflow-parse-bench --path dag_test.py --num-iterations 5

Although it takes a bit longer to finish, this calculates the average parse time across five executions.

Now, to evaluate the impact of the aforementioned optimizations, I replaced the code in my dag_test.py file with the optimized version shared earlier. After executing the same command, I got the following result:

Parse result of the optimized code. Image by author.

As you can see, just applying some good practices reduced the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

Further Exploring the Tool

There are other interesting features that I think are worth sharing.

As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

Besides that, to view all the parameters supported by the library, simply run:

airflow-parse-bench --help

Testing Multiple DAGs

In most cases, you likely have dozens of DAGs whose parse times you want to test. To address this use case, I created a folder named dags and put four Python files inside it.

To measure the parse times for all the DAGs in a folder, it’s just necessary to specify the folder path in the --path parameter:

airflow-parse-bench --path my_path/dags

Running this command produces a table summarizing the parse times for all the DAGs in the folder:

Testing the parse time of multiple DAGs. Image by author.

By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

airflow-parse-bench --path my_path/dags --order desc
Inverted sorting order. Image by author.

Skipping Unchanged DAGs

The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips the parse execution for DAGs that haven’t been modified since the last execution:

airflow-parse-bench --path my_path/dags --skip-unchanged

As shown below, when the DAGs remain unchanged, the output reflects no difference in parse times:

Output with no difference for unchanged files. Image by author.

Resetting the Database

All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

airflow-parse-bench --path my_path/dags --reset-db

This command resets the database and processes the DAGs as if it were the first execution.

Conclusion

Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

For this reason, the [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) library can be an important tool for helping data engineers create better DAGs. By testing your DAGs’ parse time locally, you can easily and quickly find your code bottlenecks, making your DAGs faster and more performant.

Since the code is executed locally, the measured parse time won’t be identical to the one in your Airflow cluster. However, if you are able to reduce the parse time on your local machine, the same improvement is likely to carry over to your cloud environment.

Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.

References

maximize the benefits of Cloud Composer and reduce parse times | Google Cloud Blog

Optimize Cloud Composer via Better Airflow DAGs | Google Cloud Blog

Scheduler – Airflow Documentation

Best Practices – Airflow Documentation

GitHub – AlvaroCavalcante/airflow-parse-bench: Stop creating bad DAGs! Use this tool to measure and…

The post Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code appeared first on Towards Data Science.

]]>
Battle of the Ducks https://towardsdatascience.com/battle-of-the-ducks-24fd55260fae/ Tue, 28 Jan 2025 14:01:58 +0000 https://towardsdatascience.com/battle-of-the-ducks-24fd55260fae/ DuckDB vs Fireducks: the ultimate throwdown

The post Battle of the Ducks appeared first on Towards Data Science.

]]>
Image by AI (Dalle-3)

As some of you may know, I’m a big fan of the DuckDB Python library, and I’ve written many articles on it. I was also one of the first to write an article about an even newer Python library called Fireducks and helped bring that to people’s attention.

If you’ve never heard of these useful libraries, check out the links below for an introduction to them.

DuckDB

New Pandas rival, FireDucks, brings the smoke!

Both libraries are increasing their share of data science workloads where it could be argued that data manipulation and general wrangling are at least as important as the data analysis and insight that the machine learning side of things brings.

The core foundations of both tools are very different; DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. Fireducks is designed to be a much faster replacement for Pandas.

Their key commonality, however, is that they are both highly performant for general mid-sized Data Processing tasks. If that’s your use case, which one should you choose? That’s what we’ll find out today.

Here are the tests I’ll perform.

  • read a large CSV file into memory, i.e. a DuckDB table and a Fireducks dataframe
  • perform some typical data processing tasks against both sets of in-memory data
  • create a new column in the in-memory data sets based on existing table/data frame column data.
  • write out the updated in-memory data sets as CSV and Parquet

Input data set

I created a CSV file with fake sales data containing 100 million records.

The schema of the input data is this,

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_name (str)
  • category (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of approximately 7.5GB.

# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")

Installing WSL2 Ubuntu

Fireducks only runs under Linux, so as I usually run Windows, I’ll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 here.

Setting up a dev environment

OK, we should set up a separate development environment before starting our coding examples. That way, what we do won’t interfere with other versions of libraries, programs, etc. that we might have on the go for other projects.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

#create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb

Test 1 – Reading a large CSV file and displaying the last 10 records

DuckDB

import duckdb

print(duckdb.__version__)

'1.1.3'
# DuckDB read CSV file 
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute(f"CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv',header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("\nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time

print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")

#
# DuckDB output
#

Last 10 rows of the file:
   order_id order_date  customer_id   customer_name  product_id product_names  
0  99999999 2023-06-16          102   Customer_9650         203         Chair   
1  99999998 2022-03-02          709  Customer_23966         208      Notebook   
2  99999997 2019-05-10          673  Customer_25709         202          Desk   
3  99999996 2011-10-21          593  Customer_29352         200        Laptop   
4  99999995 2011-10-24          501  Customer_29289         202          Desk   
5  99999994 2023-09-27          119  Customer_15532         209  Coffee Maker   
6  99999993 2015-01-15          294  Customer_27081         200        Laptop   
7  99999992 2016-04-07          379   Customer_1353         207           Pen   
8  99999991 2010-09-19          253  Customer_29439         204       Monitor   
9  99999990 2016-05-19          174  Customer_11294         210       Cabinet   

    categories  quantity  price   total  
0       Office         4  59.58  238.32  
1   Stationery         1  78.91   78.91  
2       Office         5   9.12   45.60  
3  Electronics         3  67.42  202.26  
4       Office         7  53.78  376.46  
5  Electronics         2  55.10  110.20  
6  Electronics         9  86.01  774.09  
7   Stationery         5  21.56  107.80  
8  Electronics         4   5.17   20.68  
9       Office         9  65.10  585.90  

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds

Fireducks

import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)

1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")         

#
# Fireducks output
#

          order_id  order_date  customer_id   customer_name  product_id  
99999990  99999990  2016-05-19          174  Customer_11294         210   
99999991  99999991  2010-09-19          253  Customer_29439         204   
99999992  99999992  2016-04-07          379   Customer_1353         207   
99999993  99999993  2015-01-15          294  Customer_27081         200   
99999994  99999994  2023-09-27          119  Customer_15532         209   
99999995  99999995  2011-10-24          501  Customer_29289         202   
99999996  99999996  2011-10-21          593  Customer_29352         200   
99999997  99999997  2019-05-10          673  Customer_25709         202   
99999998  99999998  2022-03-02          709  Customer_23966         208   
99999999  99999999  2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total  
99999990       Cabinet       Office         9  65.10  585.90  
99999991       Monitor  Electronics         4   5.17   20.68  
99999992           Pen   Stationery         5  21.56  107.80  
99999993        Laptop  Electronics         9  86.01  774.09  
99999994  Coffee Maker  Electronics         2  55.10  110.20  
99999995          Desk       Office         7  53.78  376.46  
99999996        Laptop  Electronics         3  67.42  202.26  
99999997          Desk       Office         5   9.12   45.60  
99999998      Notebook   Stationery         1  78.91   78.91  
99999999         Chair       Office         4  59.58  238.32 

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds

There is not much in it; DuckDB edges it by about 6 seconds.

Test 2 — Calculate total sales by category

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    categories, 
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

  categories  total_sales
0 Electronics 1.168493e+10
1 Stationery  7.014109e+09
2 Office      7.006807e+09
3 Sundry      2.338428e+09

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")

#
# Fireducks output
#

categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64

Fireducks: Time taken to calculate sales by category:  0.13571524620056152 seconds

There is not much in it there, either. Fireducks shades it.

Test 3 — Top 5 customer spend

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    customer_id, 
    customer_name, 
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""
start_time = time.time()

# Top 5 customers by total purchase
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

  customer_id customer_name  total_purchase
0 681         Customer_20387 6892.96
1 740         Customer_30499 6613.11
2 389         Customer_22686 6597.35
3 316         Customer_185   6565.38
4 529         Customer_1609  6494.35

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")

#
# Fireducks output
#

customer_id  customer_name 
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds

DuckDB wins that one, being almost twice as fast as Fireducks.

Test 4 — Monthly sales figures

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""
start_time = time.time()

# Monthly sales totals
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")

results

# 
# DuckDB output
#

DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

  month        monthly_sales
0 2010-01-01   1.699500e+08
1 2010-02-01   1.535730e+08
2 2010-03-01   1.702968e+08
3 2010-04-01   1.646421e+08
4 2010-05-01   1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08

168 rows × 2 columns

Fireducks

import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])

    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')

    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

start_time = time.time()
seasonal_trend()
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for seasonal trend calculation: {time.time() - start_time} seconds")

#
# Fireducks Output
#

       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds

DuckDB was significantly quicker in this example.

Test 5 — Average order by product

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""
start_time = time.time()

# Run the average order value query
results = con.execute(query_sql).df()

print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")

results

#
# DuckDB output
#

DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

  product_id product_names avg_order_value
0 206        Paper         280.529144
1 208        Notebook      280.497268
2 201        Smartphone    280.494779
3 207        Pen           280.491508
4 205        Printer       280.470150
5 200        Laptop        280.456913
6 209        Coffee Maker  280.445365
7 211        Plastic Cups  280.440161
8 210        Cabinet       280.426960
9 202        Desk          280.367135
10 203       Chair         280.364045
11 204       Monitor       280.329706

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for average order calculation: {elapsed_time} seconds")

#
# Fireducks output
#

product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds

Fireducks gets one back there and was twice as fast as DuckDB.
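It's worth remembering that the DuckDB timings in these tests include converting the result set into a pandas DataFrame via .df(). For a 12-row result that conversion should be negligible, but if you want to confirm it you can time the raw fetch and the DataFrame conversion separately. A quick sketch (not something I benchmarked for the article), reusing the same query_sql as above:

import time

# Raw fetch: returns a list of Python tuples, no DataFrame conversion
start_time = time.time()
rows = con.execute(query_sql).fetchall()
print(f"DuckDB (fetchall only): {time.time() - start_time} seconds")

# Same query, but materialised as a pandas DataFrame
start_time = time.time()
results = con.execute(query_sql).df()
print(f"DuckDB (with .df() conversion): {time.time() - start_time} seconds")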

Test 6— Product performance analysis

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
WITH yearly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT 
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""
start_time = time.time()

# Run the year-on-year growth query
results = con.execute(query_sql).df()

print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")

results

#
# DuckDB output
#

DuckDB: Time for product performance analysis calculation: 0.03958845138549805 seconds

   year total_sales prev_year_sales yoy_growth
0  2010 2.002066e+09 NaN            NaN
1  2011 2.002441e+09 2.002066e+09   0.018739
2  2012 2.008966e+09 2.002441e+09   0.325848
3  2013 2.002901e+09 2.008966e+09  -0.301900
4  2014 2.000773e+09 2.002901e+09  -0.106225
5  2015 2.001931e+09 2.000773e+09   0.057855
6  2016 2.008762e+09 2.001931e+09   0.341229
7  2017 2.002164e+09 2.008762e+09  -0.328457
8  2018 2.002383e+09 2.002164e+09   0.010927
9  2019 2.002891e+09 2.002383e+09   0.025383
10 2020 2.008585e+09 2.002891e+09   0.284318
11 2021 2.000244e+09 2.008585e+09  -0.415281
12 2022 2.004500e+09 2.000244e+09   0.212756
13 2023 1.995672e+09 2.004500e+09  -0.440401

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100

result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})

print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for product performance analysis calculation: {elapsed_time} seconds")

#
# Fireducks output
#

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Fireducks: Time for product performance analysis calculation: 0.17495489120483398 seconds

DuckDB is quicker this time.
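As a readability note on the DuckDB side, the query above repeats the LAG(total_sales) window expression three times. The window value can be computed once in a second CTE instead; the result should be identical, although I haven't re-timed this variant.

import duckdb
import time

query_sql = """
WITH yearly_sales AS (
    SELECT
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
),
with_prev AS (
    SELECT
        year,
        total_sales,
        LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales
    FROM yearly_sales
)
SELECT
    year,
    total_sales,
    prev_year_sales,
    (total_sales - prev_year_sales) / prev_year_sales * 100 AS yoy_growth
FROM with_prev
ORDER BY year
"""
start_time = time.time()
results = con.execute(query_sql).df()
print(f"DuckDB: Time for year-on-year growth (tidier query): {time.time() - start_time} seconds")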

Test 7 – Add a new column to the data set and update its value

DuckDB

import duckdb
import time

start_time = time.time()

# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE 
    WHEN total <= 100 THEN total * 1.125  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20   -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new columns
result = con.execute("""
    SELECT 
        *
    FROM sales
    LIMIT 10;
""").fetchdf()

print(result)

#
# DuckDB output
#

DuckDB: Time to add new column: 2.4016575813293457 seconds

   order_id order_date  customer_id   customer_name  product_id product_names  
0         0 2021-11-25          238  Customer_25600         211  Plastic Cups   
1         1 2017-06-10          534  Customer_14188         209  Coffee Maker   
2         2 2010-02-15          924  Customer_14013         207           Pen   
3         3 2011-01-26          633   Customer_6120         211  Plastic Cups   
4         4 2014-01-11          561   Customer_1352         205       Printer   
5         5 2021-04-19          533   Customer_5342         208      Notebook   
6         6 2012-03-14          684  Customer_21604         207           Pen   
7         7 2017-07-01          744  Customer_30291         201    Smartphone   
8         8 2013-02-13          678  Customer_32618         204       Monitor   
9         9 2023-01-04          340  Customer_16898         207           Pen   

    categories  quantity  price   total  total_with_tax  
0       Sundry         2  99.80  199.60      229.539993  
1  Electronics         8   7.19   57.52       64.709999  
2   Stationery         6  70.98  425.88      498.279602  
3       Sundry         6  94.38  566.28      679.536011  
4  Electronics         4  44.68  178.72      205.528000  
5   Stationery         4  21.85   87.40       98.324997  
6   Stationery         3  93.66  280.98      328.746613  
7  Electronics         6  39.41  236.46      276.658203  
8  Electronics         2   4.30    8.60        9.675000  
9   Stationery         2   6.67   13.34       15.007500  

Fireducks

import numpy as np
import time
import fireducks.pandas as pd

# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) &amp; (df_fire['total'] <= 200),
    (df_fire['total'] > 200) &amp; (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]

choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)

#
# Fireducks output
#

Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id order_date  customer_id   customer_name  product_id  
0                0 2021-11-25          238  Customer_25600         211   
1                1 2017-06-10          534  Customer_14188         209   
2                2 2010-02-15          924  Customer_14013         207   
3                3 2011-01-26          633   Customer_6120         211   
4                4 2014-01-11          561   Customer_1352         205   
...            ...        ...          ...             ...         ...   
99999995  99999995 2011-10-24          501  Customer_29289         202   
99999996  99999996 2011-10-21          593  Customer_29352         200   
99999997  99999997 2019-05-10          673  Customer_25709         202   
99999998  99999998 2022-03-02          709  Customer_23966         208   
99999999  99999999 2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total    month  year  
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021   
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017   
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010   
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011   
4              Printer  Electronics         4  44.68  178.72  2014-01  2014   
...                ...          ...       ...    ...     ...      ...   ...   
99999995          Desk       Office         7  53.78  376.46  2011-10  2011   
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011   
99999997          Desk       Office         5   9.12   45.60  2019-05  2019   
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022   
99999999         Chair       Office         4  59.58  238.32  2023-06  2023   

          total_with_tax  
0              229.54000  
1               64.71000  
2              498.27960  
3              679.53600  
4              205.52800  
...                  ...  
99999995       440.45820  
99999996       236.64420  
99999997        51.30000  
99999998        88.77375  
99999999       278.83440  

[100000000 rows x 13 columns]

They have very similar run times yet again. A draw.
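For what it's worth, the same tax tiers can also be expressed on the DataFrame side with pd.cut, which maps each total into a rate band in a single pass. This is a hypothetical pandas-style sketch, assuming Fireducks handles pd.cut the same way as pandas; I haven't benchmarked it against the np.select version.

import numpy as np
import fireducks.pandas as pd
import time

start_time = time.time()

# Map each 'total' into a tax multiplier band: (<=100], (100-200], (200-500], (>500)
bins = [-np.inf, 100, 200, 500, np.inf]
rates = [1.125, 1.15, 1.17, 1.20]
tax_rate = pd.cut(df_fire['total'], bins=bins, labels=rates).astype(float)

df_fire['total_with_tax'] = df_fire['total'] * tax_rate

print(f"Fireducks (pd.cut variant): Time to add new column: {time.time() - start_time} seconds")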

Test 8 – Write out the updated data to a CSV file

DuckDB

start_time = time.time()

# Write the modified sales table to a CSV file
con.execute("""
    COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")

print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")

DuckDB: Time to write CSV to file: 54.899176597595215 seconds

Fireducks

# Fireducks: write data back to CSV
import fireducks.pandas as pd
import time

# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2) 
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start total runtime timer
start_time = time.time()

df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv',quoting=0,index=False)

# Print total runtime
print(f"Fireducks: Time to write CSV  to file: {time.time() - start_time} seconds")

Fireducks: Time to write CSV  to file: 54.490307331085205 seconds

Too close to call again.
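With the times this close, one quick sanity check is that both libraries actually wrote files of comparable size, since differences in quoting or float formatting could skew a comparison like this. A hypothetical check on the two paths used above:

import os

for path in (
    '/mnt/d/sales_data/final_sales_data_duckdb.csv',
    '/mnt/d/sales_data/fireducks_sales.csv',
):
    size_gb = os.path.getsize(path) / 1024**3
    print(f"{path}: {size_gb:.2f} GB")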

Test 9— Write out the updated data to a Parquet file

DuckDB

# DuckDB write Parquet data
# 

start_time = time.time()

# Write the modified sales table to a Parquet file
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")

print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")

DuckDB: Time to write parquet to file: 30.011869192123413 seconds

Fireducks

import fireducks.pandas as pd
import time

# Start total runtime timer
start_time = time.time()

df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")

Fireducks: Time to write Parquet to file: 86.29632377624512 seconds

That’s the first major discrepancy between run times. Fireducks took almost a minute longer to write out its data to Parquet than did DuckDB.
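Before reading too much into that gap, it's worth checking the Parquet compression codec, as the two writers won't necessarily use the same defaults. As a suggestion (not something I re-ran for this article), you can pin the codec explicitly on both sides, for example to Snappy:

import time

# DuckDB: set the codec in the COPY options
start_time = time.time()
con.execute("""
    COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet'
    (FORMAT 'parquet', COMPRESSION 'snappy');
""")
print(f"DuckDB: Parquet write with snappy: {time.time() - start_time} seconds")

# Fireducks/pandas: pass the codec to to_parquet
start_time = time.time()
df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet', compression='snappy')
print(f"Fireducks: Parquet write with snappy: {time.time() - start_time} seconds")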

Summary

So, what are we to make of all this? Simply put, there is not much in it between these two libraries. Both are superfast and capable of processing large data sets. Once your data is in memory, either in a DuckDB table or a Fireducks dataframe, both libraries are equally capable of processing it in double-quick time.

The choice of which one to use depends on your existing infrastructure and skill set.

If you’re a database person, DuckDB is the obvious library to use, as your SQL skills would be instantly transferable.

Alternatively, if you’re already embedded in the Pandas world, Fireducks would be a great choice for you.
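It doesn't have to be an either/or choice, either. DuckDB can run SQL directly against an in-memory pandas DataFrame through its replacement scans, so a mixed workflow is possible. I haven't tested whether this works against a Fireducks DataFrame, so the snippet below is a sketch that assumes a plain pandas DataFrame called df:

import duckdb
import pandas as pd

# A small, made-up pandas DataFrame purely for illustration
df = pd.DataFrame({
    'product_names': ['Pen', 'Desk', 'Pen', 'Monitor'],
    'total': [13.34, 376.46, 280.98, 8.60],
})

# DuckDB resolves the table name 'df' to the local DataFrame
result = duckdb.sql("""
    SELECT product_names, AVG(total) AS avg_order_value
    FROM df
    GROUP BY product_names
    ORDER BY avg_order_value DESC
""").df()

print(result)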

_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me or subscribe to get notified when I post new content._

If you like this content, you might find these articles interesting, too.

Building a Data Dashboard

Speed up Pandas code with Numpy

The post Battle of the Ducks appeared first on Towards Data Science.

]]>