Creating an Event-Driven ETL Pipeline for COVID-19 Data
The Challenge
Earlier this year a group of software engineers came together and formed a Python guild within our IT organization, and it has served as a great place to learn from other engineers and get feedback. Last week the #CloudGuruChallenge was posted in our Slack channel, and I was intrigued by the number of things I did NOT know how to do. I’m fortunate enough to work for an organization with great tooling available to all software engineers, and I wanted an opportunity to learn more about the frameworks that have been put in place to make my job so much easier.
The challenge was to create a daily ETL pipeline that ingests COVID-19 data from Johns Hopkins and the New York Times, as well as a visualization dashboard. You can see the full specifications here.
The Approach
I am more familiar with Jenkins and vanilla CloudFormation templates, but why hold myself back to only one type of build failure when I could expand my horizons (see Figure 1)? I wanted to use GitHub Actions as my CI/CD pipeline and the Serverless Framework for my IaC. I was shocked at the amount of YAML the Serverless Framework replaced, specifically with regard to Lambda functions. It also let me set up an EventBridge scheduled event with only two lines of YAML (sketched below) instead of… too many. I also decided on DynamoDB for my database, which turned out to be both a blessing and a curse.
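To show how little YAML that is, here is a minimal sketch of what the relevant part of a serverless.yml can look like; the service and handler names here are placeholders, not my actual config:

```yaml
service: covid-etl

provider:
  name: aws
  runtime: python3.8

functions:
  etl:
    handler: handler.run        # run() in handler.py
    events:
      - schedule: rate(1 day)   # the two lines: a daily scheduled trigger
```

Behind the scenes the framework expands those two `events` lines into the scheduled rule, the invoke permission, and the Lambda wiring you would otherwise hand-write in CloudFormation.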
The Bugs
One of the first issues I ran into was the classic chicken-and-egg problem with Lambda and S3: CloudFormation can’t create a Lambda function and the S3 bucket containing its code in the same template, because the code has to already exist in the bucket at the moment the function is created. After a few hours of over-engineering I decided to try the Serverless Framework, since it abstracts this issue away for you along with simplifying your template.
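To make the problem concrete, this is roughly the shape of the broken pattern (heavily simplified, resource names are mine, IAM role omitted):

```yaml
Resources:
  CodeBucket:
    Type: AWS::S3::Bucket
  EtlFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: handler.run
      Runtime: python3.8
      Role: !GetAtt EtlRole.Arn    # IAM role resource omitted for brevity
      Code:
        S3Bucket: !Ref CodeBucket  # the bucket is created empty...
        S3Key: covid_etl.zip       # ...so this object can't exist yet
```

The stack can order the bucket before the function, but nothing ever uploads the zip in between, so the function creation fails.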
I decided to use the pandas library for my data manipulation, and I ran into several issues where, after packaging, my Lambda deployment packages were too big to upload. After trying and failing to use a few different Serverless plugins, then convincing myself no one must use pandas with Lambda, I realized it was because I wasn’t zipping my deployments. I was never able to get GitHub Actions to work with Serverless plugins, but if you know something I don’t, drop me an answer on my SO question! I was able to get around the issue by running some extra commands in the action’s steps, as sketched below.
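The workaround looks roughly like this; the file names and paths are illustrative, not my exact workflow:

```yaml
# hypothetical excerpt from .github/workflows/deploy.yml
- name: Package function with dependencies zipped
  run: |
    pip install -r requirements.txt -t package/
    cd package && zip -r ../deployment.zip . && cd ..
    zip -g deployment.zip handler.py
- name: Deploy
  run: npx serverless deploy
```

With serverless.yml pointing at the pre-built zip (`package: artifact: deployment.zip`), the framework skips its own packaging step and just uploads the artifact.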
I held off on choosing how to visualize my dataset until after I had set up my ETL pipeline. I figured AWS QuickSight would be a great way to visualize my DynamoDB table and keep things within the AWS ecosystem. However, QuickSight does not currently support DynamoDB as a data source (it can read from S3, though). So I created another Lambda function that subscribes to the SNS topic fired upon a successful upload; when a message arrives, it exports the DynamoDB data to an S3 bucket for QuickSight to pick up. This took a little extra configuration, but I still think it’s a viable option given DynamoDB’s flexibility and speed.
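The export Lambda is simple enough to sketch. This is a minimal version, assuming the table and bucket names come in as environment variables (mine differ) and that a flat CSV is fine for QuickSight:

```python
import csv
import io
import os

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")


def handler(event, context):
    """Fires on the SNS message published after a successful upload."""
    table = dynamodb.Table(os.environ["TABLE_NAME"])

    # Scan the whole table, following pagination.
    items = []
    response = table.scan()
    items.extend(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    if not items:
        return

    # Serialize to CSV; assumes every item has the same attributes.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)

    # Drop the file where the QuickSight S3 data source expects it.
    s3.put_object(
        Bucket=os.environ["BUCKET_NAME"],
        Key="covid-data.csv",
        Body=buffer.getvalue(),
    )
```

A full table scan is fine at this scale (one row per day); a larger table would want an export job instead.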
The Results
I didn’t get too fancy with my QuickSight dashboard, but it does the job! Let me know if you would like to discuss an acquisition!
I’m looking forward to more challenges to come; thanks for reading!
Feel free to check out my GitHub, or find me on LinkedIn.
Figure 1: “The one with the Red Badge of Courage”