I would love to get your input on how to become a data engineer: what skills are required, and how did you do it, assuming you have a programming background?
Are you trying to switch technical specialties or coming in new? I’m assuming the former from your ask, but I’m going to generalize for all readers.
If it’s the former, I’d say most data engineers seem to come from one of two backgrounds.
The first is a standard development background focused mostly on back end work. You know how to use ORMs, you’re decent at SQL and database structure, and most importantly, you’ve been coding for a while.
The second is an analytical background: data analysis, data science, visualization, etc. You may not be as proficient at production coding practices, but you really grok data and all its messiness, such as when to use an aggregated, denormalized structure vs. a normalized transactional one.
Either way, you need to build up whatever skills you’re lacking. On the code side, that likely means Scala or Python. On the data side, it means getting more familiar with the various data storage types (relational databases, graph, document store, etc.) and the weirdness inherent in data that analysts and statisticians are happy to tell you about.
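To make the normalized vs. aggregated, denormalized distinction a bit more concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The table names, columns, and rows are all made up for illustration; a real warehouse would look different, but the shape of the idea is the same.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Normalized, transactional shape: each fact is stored once.
# (Hypothetical tables and data, purely for illustration.)
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# Aggregated, denormalized shape: one wide row per customer, built for reporting.
conn.execute("""
    CREATE TABLE customer_order_summary AS
    SELECT c.name        AS customer_name,
           COUNT(o.id)   AS order_count,
           SUM(o.amount) AS total_amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")

summary = conn.execute(
    "SELECT customer_name, order_count, total_amount "
    "FROM customer_order_summary ORDER BY customer_name"
).fetchall()
print(summary)
```

The normalized tables are easy to update one order at a time. The summary table is easier to query for a dashboard, but it has to be rebuilt or refreshed whenever the underlying orders change.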
If you’re coming in totally cold out of a boot camp or college, then start with the fundamentals: SQL and how databases work (this is the basis of all the more complicated stuff anyway), a programming language (Python is your fastest bet here if you’ve never programmed), and a sampling of a few other things: data modeling, how APIs work, ORMs, what data scientists/analysts/visualization people do, how back end datasources work for standard software, a bit of DevOps, etc. The big parts, though, are the basics of data and of programming to access it, process it, and store it elsewhere, because that’s the foundation.
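As a rough picture of that "access it, process it, store it elsewhere" loop, here’s a small sketch using only the Python standard library. The CSV content, the users table, and the cleaning rules are assumptions I made up for the example, not a recommended pipeline design.

```python
import csv
import io
import sqlite3
from datetime import date

# Stand-in for a source file or API response (made-up data).
RAW_CSV = """name,signup_date,plan
 Ada ,2021-01-05,pro
Grace,2021-02-11,
Linus,not-a-date,free
"""


def extract(text):
    """Access: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Process: trim whitespace, default missing plans, drop rows with bad dates."""
    cleaned = []
    for row in rows:
        try:
            signup = date.fromisoformat(row["signup_date"].strip())
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        plan = row["plan"].strip() or "unknown"
        cleaned.append((row["name"].strip(), signup.isoformat(), plan))
    return cleaned


def load(rows, conn):
    """Store: write the cleaned rows into a destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT name, signup_date, plan FROM users ORDER BY name"
).fetchall()
print(loaded)
```

Most of the real work tends to live in the transform step, which is where the messiness of the data shows up.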
Happy to answer more questions. I sort of fell into it over the years, but I work with a variety of other data engineers and make a point of teaching and spreading the word when I can.
Thank you for writing such a thoughtful reply, it means a lot. I have mostly programmed in Python, and I do know SQL, but only at a beginner level, since I mostly used an ORM instead of writing plain SQL queries. I would love to know more, like what resources I can use to broaden my knowledge.
No worries. The main reason for learning SQL is that it’s the basis of many of the query languages used by non-relational databases. I’d say knowing the basics will get you a long way from that standpoint, and you end up picking up the advanced stuff as you go.
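If you’ve mostly worked through an ORM, it can help to write the underlying query by hand once or twice. Here’s a small sketch with Python’s built-in sqlite3 module; the authors and posts tables and their rows are invented for the example, and the query is only roughly the kind of thing an ORM would generate for "posts by this author with at least N views".

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data, just enough to run a join.
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES authors(id),
        title TEXT,
        views INTEGER
    );
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES
        (1, 1, 'Intro to SQL', 120),
        (2, 1, 'Joins explained', 80),
        (3, 2, 'Indexing basics', 200);
""")

# Plain SQL version of "posts by this author with at least N views,
# most viewed first". The ? placeholders are bound parameters, which
# keeps user input out of the SQL string itself.
query = """
    SELECT p.title, p.views
    FROM posts AS p
    JOIN authors AS a ON a.id = p.author_id
    WHERE a.name = ? AND p.views >= ?
    ORDER BY p.views DESC
"""
rows = conn.execute(query, ("Ada", 100)).fetchall()
print(rows)
```

SELECT, JOIN, WHERE, and ORDER BY (plus GROUP BY) cover most of what you need at the start.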
I stumbled into how a lot of this works through experience, but I’d say doing basic free tutorials for various database types (Postgres for relational, MongoDB for document, Neo4j for graph) is a good start. There are a ton of links out there on Dev.to and various Medium sites (Towards Data Science often has some good stuff) to get you going on more of the basics of data engineering.
This is also an interesting way to look at things since you’re already doing Python.
Beyond that, I’d say look into the idea of DataOps (using DevOps, Agile, and Lean principles to work data projects) as a good way to get up to speed on where things are heading. The DataOps Podcast is a great way to do that, along with the ton of free info put out by a few companies pushing the idea forward.
Also, if you haven’t found it already, the Data Engineering podcast is a good place to get an idea of the toolsets in use and the challenges people run into.
There’s probably a lot more to cover there. I’d say just get the basics and a good hold on how to work with data in general, make sure you understand the different data storage types, and brush up on your coding (especially libraries like Pandas and PySpark for working with data frames). The joy and terror of data engineering is that it’s so wildly different for each implementation based on what you’re trying to build, but those basics don’t change much (the general ideas of how to build a data warehouse haven’t changed in like 40 years).
No personal experience, but I’ve come across some guides and learning resources that might help.
Thank you for sharing these
I recommend you check out “Designing Data-Intensive Applications” by Martin Kleppmann. It’s a good introduction to many of the aspects you’ll have to deal with as a Data Engineer. Podcasts are another good source of information: I can vouch for The Data Engineering Podcast and the data section of Software Engineering Daily. Both have some really good interviews with people from the industry that can shed light on how data engineering fits in at different companies.
Sure I’ll check them out
Thank you for sharing these resources