7 Data Engineering Projects to Level Up Your Skills in 2023
While data science has been hailed as the “sexiest job of the 21st century,” it’s not necessarily the only lucrative job working with data. On average, data engineers actually make $10,000 more than data scientists, and in recent years, data engineering has become the fastest-growing tech occupation. Data engineers plan, build, and maintain the backend infrastructure that enables analytics and data science professionals to extract insights from data.
If you’re looking to land a job in this promising industry, but don’t know where to start, then data engineering projects are the best way to demonstrate your skills to prospective employers. Keep reading to learn more about project ideas, where to find datasets, and how to promote your projects during the interview process.
What Is the Point of a Data Engineering Project?
If you’re looking for a data engineering job, but don’t yet have any experience as a data engineer, then a portfolio of data engineering projects is a great way to land your first role. The best data engineering projects showcase the end-to-end data process, from exploratory data analysis (EDA) and data cleaning to data modeling and visualization.
In these projects, make sure that you show evidence of data pipeline best practices. You should be able to spot failure points in data pipelines and build systems that are resistant to failure. Finally, create data visualizations to show the outcome of your project, and build a dedicated website to host your project, be it a portfolio or personal website.
Data Engineering Project Ideas
Analysis projects involve parsing large datasets for patterns, anomalies, and other insights. You can analyze a variety of data inputs, such as numbers, text, or audio.
Sentiment Analysis
Sentiment analysis (AKA “opinion mining”) is the use of natural language processing (NLP) to discover how people feel about a product, public figure, or political party.
Social media posts are ripe for this kind of analysis. You can obtain tweets from Twitter about a trending topic or hashtag using the Apache NiFi GetTwitter processor—which obtains real-time tweets and ingests them into a messaging queue—or use Twitter’s Recent Search Endpoint.
Once you’ve obtained your dataset, you can determine sentiment scores using Microsoft Azure’s Text Analytics Cognitive Service. You can then visualize the results using Python’s Plotly and Dash libraries, similar to what this Github user did.
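If you want to prototype the scoring step before wiring up a cloud service, a minimal lexicon-based scorer sketches the idea in plain Python. The word lists and tweets below are illustrative, and Azure’s service returns far more nuanced per-document scores:

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists).
POSITIVE = {"love", "great", "excited", "amazing", "win"}
NEGATIVE = {"hate", "awful", "terrible", "fail", "worst"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word hits,
    normalized by the total number of hits."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits

# Illustrative tweets, as a real pipeline would receive them from the queue.
tweets = [
    "I love this product, it's amazing!",
    "Worst launch ever, I hate it.",
]
scores = [sentiment_score(t) for t in tweets]
```

The scores can then be fed straight into a Plotly or Dash visualization, just as the cloud-scored results would be.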
Extract, Transform, Load (ETL)
Extract, Transform, Load (ETL) is the process of extracting data from its original source, preparing the data for analysis, and loading it into a target database. Most ETL tools can perform all three steps.
Building an ETL project shows you are familiar with the end-to-end data engineering process, from extracting and processing data to analyzing and visualizing data. One popular project is to build a data pipeline that ingests real-time sales data. Using this data pipeline, you can analyze sales metrics such as:
Total revenue and cost per country
Units sold vs. unit cost per region
Revenue vs profit by region and sales channel
Units sold by country
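As a sketch of the transform step, the first metric above (total revenue and cost per country) can be computed from raw sales records with plain Python. Field names and figures here are illustrative:

```python
from collections import defaultdict

# Illustrative sales records, as they might arrive from the extract step.
sales = [
    {"country": "US", "region": "Americas", "units": 10, "unit_price": 5.0, "unit_cost": 3.0},
    {"country": "US", "region": "Americas", "units": 4,  "unit_price": 5.0, "unit_cost": 3.0},
    {"country": "DE", "region": "EMEA",     "units": 7,  "unit_price": 6.0, "unit_cost": 4.0},
]

def revenue_and_cost_by_country(rows):
    """Transform step: aggregate total revenue and cost per country."""
    totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})
    for r in rows:
        totals[r["country"]]["revenue"] += r["units"] * r["unit_price"]
        totals[r["country"]]["cost"] += r["units"] * r["unit_cost"]
    return dict(totals)

metrics = revenue_and_cost_by_country(sales)
```

In a real pipeline the same aggregation would run in Spark or SQL against the ingested stream; the logic is identical.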
Sentiment Analysis on Stocks (Financial Sentiment Analysis)
Stock sentiment—i.e. how people are feeling about a stock—influences stock market volatility, trading volume, and company earnings. One great data engineering project is to use natural language processing to see how headlines and social media posts are affecting stock prices on a daily basis. For this project, Medium user @Bohmian extracted data from FinViz, a financial news aggregator that also features visualizations of stock data.
Extracting Inflation Data
Inflation is a pertinent topic for analysis, given that the US is experiencing the highest rate of inflation since 1982. You can analyze inflation by tracking changes in the price of goods and services online. Github user @uhussain built a pipeline using petabytes of web page data contained in Common Crawl, an open repository of web-crawl data containing raw webpage data, metadata extracts, and text extracts. The goal of this project is to calculate the inflation rate using the price of goods and services online.
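The core calculation is simple once the prices are extracted: the inflation rate is the percentage change in the cost of a fixed basket of goods between two periods. A minimal sketch, with illustrative prices:

```python
def inflation_rate(old_price: float, new_price: float) -> float:
    """Percentage change in price over a period."""
    return (new_price - old_price) / old_price * 100

# Illustrative basket: average online prices for the same goods a year apart.
basket_2021 = {"milk": 3.50, "bread": 2.00, "eggs": 2.50}
basket_2022 = {"milk": 4.20, "bread": 2.30, "eggs": 3.50}

old_total = sum(basket_2021.values())  # 8.00
new_total = sum(basket_2022.values())  # 10.00
rate = inflation_rate(old_total, new_total)  # 25.0 percent
```

The hard part of the Common Crawl project is the extraction at petabyte scale, not the arithmetic, which is why it makes such a strong pipeline showcase.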
Building Data Pipelines
A data pipeline is a set of tools and processes for moving data from one system to another; each step delivers an output that serves as the input for the next. Building a recommendation engine is a great project to show that you understand how to build data pipelines, since a complex pipeline must bring data from many sources to the recommendation engine, essentially combining product ratings with behavioral user data.
In this project, you can build a movie recommender system by deploying Databricks Spark on Azure and using Spark SQL to analyze the MovieLens dataset for user recommendations, then building the data pipeline that feeds the recommender.
Since you probably don’t have access to data on user behavior, ratings are a good place to start. You can scrape data for music, movies, video games, and books from rating sites such as Last.fm, MovieLens, GoodReads, or even Kaggle.
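To make the idea concrete, here is a minimal user-based recommender over a ratings dictionary: it finds the most similar user by cosine similarity and suggests titles that user rated but you haven’t. The names and ratings are illustrative, and a production system would run this logic in Spark over the full dataset:

```python
import math

# Illustrative user -> {movie: rating} data, as scraped from a ratings site.
ratings = {
    "alice": {"Dune": 5, "Arrival": 4, "Up": 1},
    "bob":   {"Dune": 4, "Arrival": 5, "Up": 2},
    "carol": {"Up": 5, "Coco": 4},
}

def cosine_sim(a, b):
    """Cosine similarity over the movies two users have both rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    return dot / (math.sqrt(sum(a[m] ** 2 for m in common)) *
                  math.sqrt(sum(b[m] ** 2 for m in common)))

def recommend(user, data):
    """Suggest unseen movies rated by the most similar other user."""
    others = [(cosine_sim(data[user], data[u]), u) for u in data if u != user]
    _, nearest = max(others)
    return [m for m in data[nearest] if m not in data[user]]
```

Note that cosine similarity over a single shared title is always 1.0, one reason real recommenders weight by the number of co-rated items.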
Creating a Data Repository
A data repository—also known as a data library or data archive—is a large database infrastructure that collects, manages, and stores datasets for data analysis, sharing, and reporting. A good data repository project collects and integrates data from numerous sources. This project on GitHub uses data from a fictional taxi company called Olber. The data is collected from two separate devices. Each taxi has a meter, which sends information about the duration, distance, pickup, and dropoff location for each ride. A separate device accepts payment from customers and sends data about cab fares. You can download that dataset here.
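The integration step in a project like this boils down to joining the two device feeds on a shared ride ID. A sketch with illustrative field names:

```python
# Feed 1: the taxi meter reports ride duration and distance.
rides = [
    {"ride_id": 1, "distance_km": 3.2, "duration_min": 11},
    {"ride_id": 2, "distance_km": 8.5, "duration_min": 24},
]
# Feed 2: the payment device reports fares and tips.
fares = [
    {"ride_id": 1, "fare": 9.50, "tip": 1.50},
    {"ride_id": 2, "fare": 21.00, "tip": 3.00},
]

def join_on_ride_id(rides, fares):
    """Merge the two device feeds into one record per ride."""
    fares_by_id = {f["ride_id"]: f for f in fares}
    return [{**r, **fares_by_id[r["ride_id"]]}
            for r in rides if r["ride_id"] in fares_by_id]

trips = join_on_ride_id(rides, fares)
```

The full project does the equivalent join in a streaming architecture, where late or missing records from one device are the interesting failure cases.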
Analyzing Security Breaches
The traditional approach to fighting cyberattacks involves gathering data about malware, data breaches, phishing campaigns, and other attack vectors, and extracting the data to create a digital fingerprint of the attack. These fingerprints are then compared against files and network traffic to detect potential threats.
However, predictive analytics can be used to discover a data breach before it happens, as is the case with this project. Machine learning solutions have enabled organizations to cut down the time it takes to detect cyber attacks by determining the probability of an attack and mounting defenses before cybercriminals infiltrate the system.
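In a real project you would train a model on labeled incident data, but the scoring step can be sketched with a toy logistic model over a few traffic features. The features and weights below are illustrative, not trained:

```python
import math

def attack_probability(failed_logins: int, bytes_out_mb: float,
                       off_hours: bool) -> float:
    """Toy logistic model: illustrative, hand-picked weights."""
    z = -4.0 + 0.6 * failed_logins + 0.02 * bytes_out_mb + 1.5 * off_hours
    return 1 / (1 + math.exp(-z))  # squash score into a probability

# Normal-looking traffic vs. a suspicious burst of activity.
low = attack_probability(failed_logins=1, bytes_out_mb=10, off_hours=False)
high = attack_probability(failed_logins=20, bytes_out_mb=500, off_hours=True)
```

The data engineering work is everything upstream of this function: collecting login events and network flows reliably enough that the model sees them in time to matter.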
Data Engineering Project Checklist
Whatever kind of data engineering project you decide to pursue, make sure that your project uses a variety of data sources and tools, and shows proficiency with the different stages of the data engineering process.
Various Data Sources Like APIs, CSVs, Webpages, JSON, etc.
Working with a variety of data sources shows that you know how to deal with structured and unstructured data, as well as obtain data using APIs and web scrapers. Valuable datasets can be found everywhere, from social media posts to web pages and more.
Data Ingestion
Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis. This target site is typically a data warehouse, which is a special kind of database designed for efficient reporting. The ingestion process is the backbone of an analytics architecture, because downstream analytics systems rely on consistent and accessible data. Collecting and cleansing the data reportedly takes 60-80% of the time in any analytics project, so plan accordingly.
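A minimal sketch of the ingestion step, using SQLite as a stand-in for the target warehouse (table and column names are illustrative):

```python
import sqlite3

# Rows extracted from a source system, ready to load into the target table.
rows = [("2023-01-05", "US", 120.0), ("2023-01-05", "DE", 80.0)]

conn = sqlite3.connect(":memory:")  # a real target would be Redshift, etc.
conn.execute("CREATE TABLE sales (day TEXT, country TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()

# Downstream analytics query against the loaded data.
total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
```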
Data Storage
Data storage and retrieval are critical components of an effective data pipeline. Building a data pipeline requires you to make trade-offs. For example, should you use a SQL or NoSQL database to store your data? If you are collecting semi-structured or unstructured data, a document store like MongoDB is a good fit, because it doesn’t force your records into a fixed schema. MySQL, by contrast, is best for structured datasets that are already familiar to you (i.e. they don’t require much cleaning), and it handles complex queries like joins more efficiently than a document store.
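The trade-off is easy to see side by side. Below, a SQLite table models the structured case, while a JSON column stands in for the schemaless documents a store like MongoDB keeps natively (table and field names are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured data fits a fixed schema and supports efficient SQL queries.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

# Semi-structured records vary row to row; a document store keeps them as-is.
conn.execute("CREATE TABLE events (id INTEGER, doc TEXT)")
conn.execute("INSERT INTO events VALUES (1, ?)",
             (json.dumps({"type": "click", "meta": {"x": 3}}),))

doc = json.loads(conn.execute("SELECT doc FROM events").fetchone()[0])
```

Querying inside `doc` requires parsing every row, which is exactly the kind of access pattern a document database optimizes for.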
Data Visualization
As a data engineer, it’s important that you can communicate complex technical concepts to a non-technical audience. This makes data visualization an important skill, and any data engineering project should include data visualizations. Visuals should be based on the following questions:
Who is my audience?
What questions do they have?
What answers do I have for them?
What other questions will my visualizations inspire?
Usage of Several Tools
To build a rich data infrastructure, data engineers require a mix of different programming languages, data management tools, data warehouses, and data processing tools. So make sure that your portfolio shows your proficiency with a range of different tools.
Data Engineering Platforms
You can use the following platforms to clean your data, automate workflows, and store and retrieve data.
Prefect is a dataflow automation platform that you can use to design, automate, and test your workflows. The platform has scheduling, monitoring, error handling, logging, and data serialization capabilities. The best part about Prefect is that you can build automated workflows for moving data from a source to a target location so the data can be used for analytics and reporting.
Cadence is a workflow engine and development framework that makes application development easier. It’s fault-oblivious, which means that you can write stateful applications without having to worry about handling complex process failures or non-functional requirements such as durability, availability, and scalability.
Amundsen is an open-source data catalog originally created by Lyft. Basically, it’s a data discovery application built on top of a metadata engine. It indexes data resources (such as tables, dashboards, streams) with a Google PageRank-inspired algorithm that recommends results based on names, descriptions, tags, and querying/viewing activity. Consequently, tables that are queried often show up higher in search results than less queried tables.
Great Expectations is a tool for maintaining data quality. Both the structure and content of a given data file will dictate what you can do with the data, so it’s important to understand these parameters before you proceed with a data project. Using validation rules to cleanse data helps prevent data quality issues from slipping into data products (remember: garbage in = garbage out).
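Great Expectations expresses checks like these declaratively; the underlying idea of validation rules can be sketched in plain Python (the rules and field names below are illustrative):

```python
def validate(row):
    """Return a list of rule violations for one record."""
    errors = []
    # Rule 1: price must exist, be numeric, and be non-negative.
    if not isinstance(row.get("price"), (int, float)) or row["price"] < 0:
        errors.append("price must be a non-negative number")
    # Rule 2: country must come from a known set of codes.
    if row.get("country") not in {"US", "DE", "FR"}:
        errors.append("country must be a known code")
    return errors

good_errors = validate({"price": 9.99, "country": "US"})
bad_errors = validate({"price": -1, "country": "Atlantis"})
```

Running rules like these at the pipeline boundary is what keeps bad records from ever reaching the warehouse.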
Top Data Engineering Tools
Amazon Redshift is a cloud-based petabyte-scale data warehouse service that manages the work of setting up, operating, and scaling a data warehouse. You can use it for real-time analytics, combining multiple data sources, log analytics, and more. Redshift costs a fraction of what competitors like Oracle and Teradata charge for comparable products and can handle huge volumes of data, so it’s best to use when you have a massive dataset.
While Tableau is one of the most popular data visualization tools, you should also consider using Tableau Prep, which allows you to clean, aggregate, merge, and prepare your data for analysis in Tableau. Tableau Prep consists of two products: Tableau Prep Builder for building your data flows, and Tableau Prep Conductor for scheduling, monitoring, and managing flows in your server environment.
Part of Google Cloud, Looker is a multicloud advanced analytics platform that allows you to create dynamic dashboards for in-depth analysis. Looker recently launched a new feature called Spectacles, which finds SQL errors by running queries in your database. This increases the reliability of your data and ensures that you eliminate errors before they hit production.
How To Promote Your Data Engineering Projects
Once you have a few data engineering projects under your belt, think about how to publicize your projects. In addition to displaying your projects on a portfolio, you can add them to your resume and LinkedIn profile, and also promote them on developer platforms such as Github and Stackoverflow.
The best way to showcase your projects is by building a portfolio website. For each project, include detailed documentation that explains what you’ve built. You can also create blog posts and Github repositories that showcase your problem statement, proposed architecture, data analysis process, and findings.
A good data engineering resume includes a comprehensive rundown of the tools and technologies you’ve used. During the screening process, recruiters are looking for your competency level with tools, so they’ll scan your resume for keywords.
However, the ultimate hiring decision is made by the engineering team. So make sure your resume includes the following:
Display a solid technical skillset (language-specific skills; databases, ETL and warehouse-related skills; operational programming problems; algorithms and data structures; understanding of system design)
Communicate the challenges you faced and how you solved them
Show that you can easily learn a new tech stack
List relevant skills and certifications
Demonstrate soft skills such as teamwork, communication, and adaptability by highlighting specific problems you’ve solved using these skills
A strong website explains your work experience, problem-solving skills, and your passion for the field. A well-written bio also conveys soft skills like verbal and written communication, and teamwork. Include links to your Github, Stackoverflow, and portfolio so that recruiters can see samples of your work and personal projects.
Like developers, data engineers are expected to have a presence on Github. A Github project board can help you demonstrate your skills as a data engineer or data scientist. Use Github to host your source code projects and collaborate with other Github users by reviewing their code and proposing changes. Github is also one of the largest coding communities around, and using it can provide wide exposure for your projects.
LinkedIn is undoubtedly the number-one professional networking platform in the world. Use LinkedIn to document the responsibilities of your role, projects, and activities you’ve participated in. Anything you can’t fit on your resume should go on your LinkedIn profile. Here, you can expand on your work experience descriptions, include hyperlinks to your work, and write a full biography that summarizes your professional career (also a chance to explain your backstory if you’re a career switcher).
Remember, LinkedIn profiles must be optimized for search engines, so look for keywords that are used in job descriptions. For example, if your prospective employers use the term “data cleaning” instead of “data scrubbing,” use the preferred keyword to ensure your profile surfaces in recruiter searches and your resume passes applicant tracking systems.
As one of the most popular online communities for programmers, Stackoverflow is a great place to network with other developers and search for jobs. Many data professionals use Stackoverflow on a daily basis to find answers to obscure programming-related questions, so getting accustomed to searching for answers on the platform is a valuable skill.
Data Engineering Project FAQs
How Do You Start a Data Engineering Project?
Start by thinking of a topic you’re curious about. Then find a dataset that can help you answer a related question. You can find free datasets on Kaggle, FiveThirtyEight, Google Trends, the U.S. Census Bureau, or Data.gov. You can also search for data from specific organizations or government agencies, or use an open API or web scraping tools to obtain data from web pages.
What Do Data Engineers Build?
Data engineers build and maintain an organization’s data infrastructure—including databases, data warehouses, and data pipelines. They also build tools for data analytics and data science teams.
Can You Add Data Engineering Projects to Your Resume?
Yes! Personal projects are a great way to showcase your knowledge of the end-to-end data process, especially if you lack relevant work experience. Projects also demonstrate work ethic and self-motivation. If you are switching careers, projects can be a great way to show off your domain expertise from another industry. If you’re an entry-level data engineer, list your projects on your resume, as you would with your work experience.
If you're interested in becoming a data engineer, explore the Washington University Data Engineering Boot Camp, where you'll benefit from curated curriculum and 1:1 mentorship.