Your Journey into Data Engineering

In this article, you will explore the tasks of a Data Engineer together with relevant services that are available on the Azure platform. You will gain an understanding of the appropriate storage services, enabling you to implement a solution based on a given set of business and technical requirements.

In this article we will answer the following questions:

What is Data Engineering?
What are use cases for Data Engineering?
What are the benefits of Data Engineering?
What Data Engineering services are available on Azure?
What are real-life customer cases concerning Data Engineering?

This article requires a basic understanding of Data and AI. It is a follow-up to our previous article, Cooking with Data and AI. If you would like an introduction to Data and AI, please refer to that article first.

What is Data Engineering?

Even though data engineering is a hot topic, and everyone has their own definition of it, we could not find an official definition of data engineering in the Cambridge dictionary. However, by looking at the two words separately (i.e., Data and Engineering), we can get a broader understanding of what data engineering means:

Data: information, especially facts or numbers, collected to be examined, considered, and used to help decision-making; or information in an electronic form that can be stored and used by a computer.

Engineering: the study of using scientific principles to design and build machines, structures, and other things, including bridges, roads, vehicles, and buildings.

To us, Data Engineering is the transforming and cleaning of data in such a way that it is ready to be used or consumed. The first thing that might come to mind is cleaning the data. However, when we take a deep dive into data engineering, there is more a data engineer must think about. A data engineer must also consider file formats and even data formats (e.g., date/time). In addition, a data engineer must decide whether to process data in real time or to schedule the processing in batches. Let’s not forget about data security, monitoring, and optimizing data storage. In other words: a data engineer thinks about the entire Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT) process.

To speak in cooking terms, data engineering is the cutting, washing, and prepping (transforming) of the ingredients (your data) for your dish in such a way that they can be cooked (ready for Analytics, Data Science, or visualizations).
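The point about date/time formats can be made concrete with a minimal sketch in Python. It assumes, purely for illustration, that source systems deliver dates in three known formats; the format list and the function name to_iso_date are hypothetical and not part of any Azure service.

```python
from datetime import datetime

# Hypothetical list of date formats delivered by source systems (an assumption).
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y %H:%M"]


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            # strptime raises ValueError when the text does not fit the format.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

For example, to_iso_date("28/10/2021") returns "2021-10-28", while a value that matches none of the formats returns None, so it can be flagged for review instead of being silently guessed.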

Some of the transformation techniques a data engineer might use are:

Handling missing values: deciding on standard rules for how to handle missing values. This can mean filling in missing values either with an “empty” value (Null, NaN, empty string, etc.) or with a constant.
Deduplicating: creating a ruleset that defines which rows count as duplicates, and deleting those rows accordingly.
Data format conversion: transforming data into the correct format. For example, casting numbers from String to Float.
Aggregation: the act of presenting the data in a summarized format.
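
The four techniques above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample rows, the field names, the rule "same order_id means duplicate", and the default amount of 0.0 are all made up. In practice such steps would typically run in a service such as Azure Data Factory or Azure Databricks; plain Python is used here only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical raw sales rows; all field names and values are made up.
raw_rows = [
    {"order_id": "1", "store": "Amsterdam", "amount": "10.50"},
    {"order_id": "2", "store": "Utrecht", "amount": ""},         # missing amount
    {"order_id": "1", "store": "Amsterdam", "amount": "10.50"},  # duplicate row
    {"order_id": "3", "store": "Amsterdam", "amount": "4.50"},
]


def clean(rows, default_amount=0.0):
    """Deduplicate rows, fill missing amounts, and convert strings to numbers."""
    seen_order_ids = set()
    cleaned = []
    for row in rows:
        # Deduplicating: assumed rule is that the same order_id is a duplicate.
        if row["order_id"] in seen_order_ids:
            continue
        seen_order_ids.add(row["order_id"])

        # Handling missing values: fill an empty amount with a constant.
        amount_text = row["amount"].strip()
        # Data format conversion: cast the amount from string to float.
        amount = float(amount_text) if amount_text else default_amount

        cleaned.append(
            {"order_id": int(row["order_id"]), "store": row["store"], "amount": amount}
        )
    return cleaned


def total_per_store(rows):
    """Aggregation: summarize the cleaned rows as a total amount per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


cleaned_rows = clean(raw_rows)
totals = total_per_store(cleaned_rows)
```

Which constant to use for missing values, and which columns define a duplicate, are business decisions; the sketch only shows where those rules end up in the code.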

If you sign up for our workshop, you’ll get to understand some of these transformation techniques, pipeline automation, storage options, etc. in more depth using various data services on Azure. You will become familiar with transformation techniques using Azure Data Factory, Data flow, and Azure Databricks. You will also learn how to perform data loads into different data storage options whilst transforming data.

What are the use cases for Data Engineering?

Given that data always needs to be cleaned and transformed before it is useful, all data-related cases are use cases for Data Engineering. Without cleaning and transforming data, it’s impossible to adhere to secure and reliable standards, because your data quality cannot be safeguarded and you end up with unreliable datasets. You can find Data Engineering in any industry: at ISVs that innovate based on data, on the web, and in healthcare, farming, and factories, to name a few.

Recently, we've noticed an increase in the number of devices and software that generate data to meet business and user needs. Given these developments, we have a need to store more data than ever before. As you can imagine, this data needs to be interpreted, managed, transformed, processed, aggregated, and last but not least visualized in reports to make well-informed decisions.

Regarding Data Engineering on Azure, use cases are plenty. Azure can work for a range of industries, including for example the web, healthcare, and the Internet of Things (IoT). Let's explore how Azure can make a difference in the healthcare industry.

Healthcare

In the healthcare industry, the use of Spark accelerates big-data analytics and AI solutions. On Azure you can run Spark in four ways: (1) open-source Apache Spark, (2) HDInsight, (3) Azure Databricks, and (4) Synapse Spark. As options 3 and 4 scale easily, they are well suited to, for example, genome studies or pharmacy sales forecasting at petabyte scale.

What are the benefits of Data Engineering?

Informed Business Decisions

Clean data helps businesses make informed decisions. Using untransformed data costs a lot of time and money, and it might lead to uninformed decisions, quality problems, etc. For example: if I cook a whole carrot without cutting it up first, it takes longer to cook than when I cut it into smaller pieces.

Speed and Efficiency

Clean and correctly stored data shortens the time needed to gather data: once you’ve standardized and cleaned it, you no longer need to search for where each piece of data has landed. Your data is organized, and you know where to find what, just like in the kitchen, where hopefully you didn’t store your milk in the dry store but in the fridge, where it belongs.

Insights

Once your data is clean and transformed, it’s easier to visualize and analyze for your business. This way you can gain insights, make predictions about future developments, and prepare your business for them. Marketing-wise, for example, you’ll be able to expand and grow your business because the data tells you where to target.

What Data Engineering services are available on Azure?

Looking at the DP-203 Data Engineering exam created by Microsoft, the self-paced learning paths showcase the following (non-streaming) data services related to data engineering. We list them from least to most comprehensive (note: this list is not exhaustive; other services exist as well):

Azure Data Lake Storage Gen2

Azure Data Lake Storage (ADLS) Gen2 is designed with big data in mind. ADLS Gen2 combines the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. This combination enables the user to store data within a folder-like structure, all the while obtaining file-level security. In addition to this, the user gets low-cost, easily scalable, tiered storage with high availability/disaster recovery capabilities. Note: this solution is used to store data. It does not include out-of-the-box query and data transformation options.

Azure Data Factory

Azure Data Factory (ADF) is a fully managed, serverless data integration service. It offers more than 90 built-in connectors to data sources through a graphical user interface. You can also code connectors for your own (on-premises) datasets. Note: this solution is used solely for data ingestion and transformation and is not an analytics platform. If you wish to do data analysis, however, ADF integrates with other tools such as Azure Databricks and Azure Synapse Analytics.

Azure Databricks

Azure Databricks will help you to gain insights from your data while running the latest version of Apache Spark. Set up your environment within minutes, enjoy autoscaling possibilities, and collaborate with your colleagues when using notebooks. Note: Azure Databricks currently supports multiple languages including Python, Scala, R, Java, and SQL.

Azure Synapse Analytics

Azure Synapse Analytics, previously known as Azure SQL Data Warehouse, is much more than just your ordinary SQL data warehouse. This comprehensive enterprise analytics service can fast-track your analytics journey by bringing together Spark for big data analytics and dedicated or serverless SQL pools for your data warehouse.

Azure Synapse Analytics is an all-in-one solution in which Data Factory is already integrated. Just as you would create a pipeline in Azure Data Factory, you can create one in Synapse Analytics, where it is referred to as a Synapse Pipeline. In addition to this, Azure Synapse Analytics offers integration with other Azure services such as Power BI, Azure Purview, Azure Cosmos DB, and Azure Machine Learning.

What are real-life customer cases concerning Data Engineering? The illimity case

Illimity is a digital-native bank that wanted to simplify and accelerate data management.

Illimity uses Azure Data Lake Storage, Azure Data Factory, Azure Synapse Analytics, and Azure Databricks to ingest, transform, load, and gain insights from its data. Raw data lands in Azure Data Lake Storage, which consolidates it and provides an easy way to collect, refine, and query it. For visualization and reporting, the bank uses Power BI. Together, these services form a comprehensive data management solution.

Their main goals were the following:

Data-driven decisions
Digital pioneer
Centralized data storage and management
Increased velocity of data access and use

Source: Microsoft Customer Story, "illimity optimizes data governance and streamlines compliance with Azure Purview"

Want to know more? Come to our workshop!

We will be giving a free workshop on the 28th of October for our first course, Deep dive into Data Engineering: Prepping Ingredients for Cooking. If you are interested, sign up here.

This is the second article in a sequence of many. The next topics will be on:

Data Science / Analytics
Data Visualizations

Stay tuned to find out more!