Modern Data Engineering Technologies to Learn in 2023

Ofili Lewis
Jan 27, 2023

Data engineering is a crucial part of data science and analytics: it covers acquiring, cleaning, and preparing data for analysis. In today’s rapidly evolving field, staying current with modern data engineering technologies is essential for any data professional. In this article, I will introduce you to some of the most popular and widely used data engineering technologies you may want to learn in 2023: Apache Kafka, Apache Spark, Apache Storm, Apache Flink, Kubernetes, Airflow, TensorFlow, Apache Arrow, MLflow, and Delta Lake.

As a data engineer, you should learn these technologies because they are widely used in industry for building efficient and scalable data pipelines and applications.

Apache Kafka, Apache Spark, Apache Storm, and Apache Flink are all powerful tools for processing real-time data streams, a crucial aspect of modern data engineering. By learning these technologies, data engineers can build pipelines that handle large volumes of data in real time, enabling organizations to make faster and more informed decisions.

Kubernetes is a popular tool for container orchestration, which is essential for deploying and managing data pipelines and applications in a distributed environment. Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows, which is necessary for managing the complex dependencies and execution of data pipelines.

TensorFlow is a powerful machine learning framework that is widely used in industry for building and deploying machine learning models. By learning TensorFlow, data engineers can build pipelines that feed training data to models and carry their predictions downstream, bridging data engineering and machine learning.

Apache Arrow, MLflow, and Delta Lake all address data management, another crucial aspect of data engineering: Arrow provides a fast in-memory data format, MLflow manages the machine learning lifecycle, and Delta Lake brings reliability to data lakes. Learning them helps data engineers build pipelines that handle large volumes of data efficiently.

Here is a brief overview of each of these technologies:

Apache Kafka

An open-source, distributed event streaming platform that is used for building real-time data pipelines and streaming apps.
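As a quick illustration, here is a minimal producer sketch using the kafka-python client (one of several Python clients); the broker address, topic name, and event shape are placeholders I chose for the example:

```python
import json


def serialize(event: dict) -> bytes:
    """Encode an event dict as UTF-8 JSON bytes for the topic."""
    return json.dumps(event).encode("utf-8")


if __name__ == "__main__":
    # Requires the kafka-python package and a broker at localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=serialize,
    )
    producer.send("page-views", {"user_id": 42, "action": "click"})
    producer.flush()  # block until the message is actually delivered
```

A consumer on the other side subscribes to the same topic and reads events in order within each partition, which is what makes Kafka a good backbone for real-time pipelines.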

Apache Spark

An open-source, distributed computing system that can process large amounts of data quickly. It is often used for big data processing, machine learning, and graph processing.
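To give a feel for the programming model, here is the classic word-count example in PySpark, running in local mode with no cluster (the input lines are made up for the sketch; a plain-Python version of the same count is included for comparison):

```python
def word_counts(lines):
    """Plain-Python reference of the same word count, for comparison."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    # Requires pyspark (and a JVM); local[*] means "run on this machine".
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("wordcount")
             .getOrCreate())
    rdd = spark.sparkContext.parallelize(["to be", "or not to be"])
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    print(dict(counts.collect()))
    spark.stop()
```

The same logic scales from a laptop to a cluster with no code changes, which is a large part of Spark’s appeal.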

Apache Storm

An open-source, distributed real-time computation system for processing streams of data.

Apache Flink

An open-source stream processing framework for building distributed, high-performance, always-available, and accurate streaming applications.
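Flink also has a Python API (PyFlink). The sketch below is a minimal DataStream job that doubles each element of a small in-memory collection; the job name and data are placeholders:

```python
def double(x: int) -> int:
    """The per-record transformation applied in the stream."""
    return x * 2


if __name__ == "__main__":
    # Requires the apache-flink package; runs on an embedded mini-cluster.
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.from_collection([1, 2, 3]).map(double).print()
    env.execute("doubling_job")  # nothing runs until execute() is called
```

In a real pipeline the source would be Kafka or a file system rather than an in-memory collection, but the dataflow style is the same.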

Kubernetes

An open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.

Airflow

An open-source platform to programmatically author, schedule, and monitor workflows.

TensorFlow

An open-source machine learning framework for building and deploying machine learning models.
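As a taste of the Keras API, the sketch below fits a one-neuron model to recover the toy relationship y = 2x (the data, optimizer settings, and epoch count are choices I made for the example, not a recipe):

```python
import numpy as np
import tensorflow as tf

# One dense neuron: effectively learns a line y = w*x + b.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),
              loss="mse")

xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = 2.0 * xs  # the target relationship the model should recover
model.fit(xs, ys, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]]), verbose=0))  # ≈ [[20.]]
```

Production models are vastly larger, but training, serving, and the data that flows between them follow this same pattern, which is why the framework matters to data engineers.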

Apache Arrow

An open-source columnar memory format for big data analytics. It accelerates analytical workloads by letting systems process and exchange data efficiently, without costly serialization.

MLflow

An open-source platform to manage the end-to-end machine learning lifecycle.

Delta Lake

An open-source storage layer that sits on top of existing data lake storage, such as HDFS, S3, or GCS, and provides ACID transactions, data versioning, and rollback.

These are just a few of the many technologies shaping modern data engineering. With the right training and practice, you can become proficient in using these tools to build efficient and effective data pipelines.
