Data Engineer 101

With the popularity of mobile devices and the pervasiveness of information on the internet at record highs and still growing fast, "big data" has become an ever "sexy" concept. Given the huge and still-growing size of data and its dazzling complexity, over the past few years the "data engineer" role has been spun off from the general "software development engineer" role and has attracted more and more attention. However, to many people there is still a lot of mystery around this role, so here are some simple Q&As based on my past 7 years of experience working closely with data at a giant software company.

What is a Data Engineer


A Data Engineer is an engineer who builds the systems that handle data processing pipelines. Let's think about this from two perspectives. (1) From a business and management perspective, there are many important questions that need to be answered, such as "which region generates the most revenue?", "how has my customer base grown over the past year?", and "which target customer segments are most important and worth investing in?". All these questions (ideally) need solid data about the business to answer. When we have that data at hand to make a decision, it is called a "data-driven" decision. (2) From a practical/technical perspective, where and how do we get this data? In the very old-fashioned way, in small businesses, the owners and/or managers meticulously maintained the important numbers with pen and paper and did all the math by hand, as we all did at school. Then the calculator came in handy: we no longer needed to calculate by ourselves, but the manual book-keeping effort remained. Starting about 20-30 years ago, PCs went into every office and home, and EXCEL became the primary go-to tool for all kinds (big or small) of corporate data-keeping and report generation; however, the data entry work was still mostly manual and manageable, and minimal engineering effort was involved at that time. Over the past 15-20 years, the internet and things on the web went viral, and the traditional methods of collecting data became impossible by human effort alone, so engineers stepped in. At the beginning, they were just part of the traditional software engineering team, a handful of people who worked with the data.
More recently, as the data scale has grown (from hundreds or thousands of entries a day to hundreds or even thousands of entries per second -- think about a popular website's API calls) and the complexity has increased (the number of different types of data sources, the business relationships between different types of data, etc.), "ad-hoc" efforts from general software engineers are no longer enough to understand and build the more sophisticated data systems; hence the dedicated "data engineer" emerged. In short, data engineers are people who possess the technical skill set to make the enormous data collecting and processing work as automated as possible, in a timely fashion, so that the important business questions can be answered with the data presented.

What is the difference between a Data Engineer and a Data Scientist


Data Engineers are more focused on the data collection and data "massaging" (transformation) process, on the engineering side of building up the data pipeline system. At the output end of the data pipeline, the processed data most commonly lands in a centralized place called a "data warehouse". Data Scientists are usually more focused on working with the resulting data in the warehouse, on the analysis side. They look for patterns and insights, do more in-depth and ad-hoc complex analysis (beyond straightforward aggregation and segmentation) to help answer those important questions, and most likely raise more questions about the systems and the business (when some "unexpected" observation happens) and ask for more data to be collected (generating requirements to answer the "why" behind those "unexpected" findings). However, the above distinction holds in an ideal, perfect world, which rarely happens in real life. In most organizations, and especially in the early stage when a "data team" is formed, the initial hires are usually required to possess both skill sets and do it all: (a) collect/understand the requirements from the business side; (b) build the data pipeline systems; (c) generate automated reports; (d) do some ad-hoc analysis on the data to answer or raise questions.

What does a Data Engineer do in day-to-day work


This may sound repetitive, but it is very true: a typical data engineer spends most of their time at work doing one of these 4 things: (a) collect/understand the requirements from the business side; (b) build the data pipeline systems; (c) generate automated reports; (d) do some ad-hoc analysis on the data to answer or raise questions.

It heavily depends on the stage of your organization and the maturity level of the data team: the later (more mature) the stage, the more specialized the individual roles. Usually, there is 1 person collecting the requirements (mostly a program manager), 2-3 people building the pipelines (could be more, depending on the complexity of the system and the business logic), 1-2 people generating the reports (such tasks can be tricky, far more complex than one button click in EXCEL to get a pie chart or trend line; some rather sophisticated query-writing skills are required here), and 1 person doing the ad-hoc analysis work. This is just a rough ratio I have observed so far across the different projects I have been on, and the team size can grow proportionally as the project reaches a more evolved stage. Sometimes (a) and (d) are put into a single person's hands; such roles are called either "data program manager" or "data scientist", depending on whether (a) or (d) is more emphasized. In a startup or any early-stage data team, the "data engineer" will most likely do all 4 by him/herself. On the other hand, in a more "traditional" or "strictly defined" scope, a Data Engineer is more focused on (b) and (c), or either one of them (data pipeline building and report generation).

What is the desired skill set for a Data Engineer


Let's discuss the desired skill set based on the 4 important parts of the day-to-day work, and why we look for each of them.

(a) Collect/understand the requirements from the business side: communication, both oral (you will talk to your stakeholders/managers to solicit what they want, translate that "fuzzy" business language into "data models", and translate your model results back into business-meaningful insights/action items) and written (you will have numerous emails back and forth to clarify things and requirements, and sometimes formal design documents and progress reports). Please note, in the "oral" part, "listening" is sometimes more important than "speaking". Because you are mostly on the "receiving" side (especially at the initial stage of your career), only once you "listen" well and "receive" the message can you start to "speak" -- to clarify some requirements and restate your understanding, to make sure you got the information and the requirements right. I cannot emphasize enough the importance of this "confirmation" stage, as the cost of correcting such issues later, rather than at this early stage, can be extremely high.

(b) Build the data pipeline systems: database design, more specifically data warehouse design (that's the "house" of your data; a "good" or "bad" design can make the later query efforts of all downstream parties significantly different), and data pipeline building: (i) identify the data sources, (ii) understand their individual structures and the relationships between them (because you eventually need to "join or link" information across different data sources), (iii) design the pipeline, (iv) implement/build the pipeline (there are many software products on the market to choose from for implementation; prior knowledge of any of them is definitely a plus, however it's not uncommon that many pipelines are simply built with generic programming languages, such as Java, Python, Perl, SQL, or a mix of several). Most junior people think the "implementation" part (iv) is the most important skill required. I would say that's only partially true: it is quite fundamental at the early stage, as these are the basic "building bricks" for this job. However, the further your career advances (which probably also means you have failed quite a few times, either by "implementing wrong" or by "implementing right, but not the right things"), the more you will appreciate that "understanding what needs to be done" and "design" are far more important than "implementation" itself; it is like "do the right things" vs. "do things right". Many junior people may wonder: "What if I do not have any experience? I do not even know how to do things right (because I have never done it before), so how could I know what the 'right' things to do are?" My suggestion is simple: "LOOK AT THE DATA, USE YOUR COMMON SENSE". However, as Voltaire said, "common sense is not so common".
It takes time to learn how to use "common sense" to look at the data, to observe the "unexpected", and to ask questions; but the earlier you start proactively looking at the data and using "common sense" to ask questions, the steeper your learning curve will be along the road.
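The extract / clean / load steps above can be sketched end to end. This is a minimal, self-contained toy in plain Python (one of the generic languages mentioned above); the CSV fields and "warehouse" shape are hypothetical, chosen only to illustrate the pattern of validating rows with common-sense checks before loading.

```python
import csv
import io

def extract(raw_csv):
    """Source step: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Clean and normalize each row: cast types, drop malformed entries."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "region": row["region"].strip().lower(),
                "revenue": float(row["revenue"]),
            })
        except (KeyError, ValueError):
            continue  # a common-sense sanity check: skip rows that fail to parse
    return cleaned

def load(rows):
    """Land the data in the 'warehouse' shape: total revenue per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals

# Note the "dirty" input: inconsistent casing and one unparseable number.
raw = "region,revenue\nEast,100.5\nWest,200.0\neast,50.0\nWest,not_a_number\n"
print(load(transform(extract(raw))))  # {'east': 150.5, 'west': 200.0}
```

Real pipelines differ mainly in scale and tooling, not in this basic shape: the design decisions (what counts as a bad row, how sources are joined) matter far more than the few lines of implementation.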

(c) Generate automated reports: the "common sense" and "do the right things" ideas definitely apply to this stage as well; however, this stage can be a little too late to correct any significant errors/issues (which usually happened much earlier, up the processing pipeline or even at the design or requirement collection stage). From a purely technical point of view, experience with any data visualization software on the market is great; on the other hand, query-writing skills usually have a higher bar in order to handle complicated business scenarios. Some "visual sense and judgement" is also desirable, to make the reports easily understandable, especially for non-technical audiences. If you have some talent in UX design to make the reports' "look and feel" more attractive, that is definitely a plus, but rarely a requirement, not to mention that it is also mostly constrained by the software/tools you pick. From a totally different angle, I value "INTEGRITY" more highly than anything else. Here is why. When you have spent many hours or days or even weeks to finish some report, and the results just look so "ugly" and "unacceptable" (lots of data issues most likely first bubble up in the report generation step, as a graphical visual representation is much more powerful than pure numbers in a database), it is quite hard to accept such "ugly" figures, which causes lots of frustration and resentment (towards whoever might have made this happen), and it is also tempting to make some simple "twist" to make them look much nicer. There are lots of judgement calls here; sometimes such "twists" are necessary steps, if we have a sound understanding of, and reasons for, them. But most of the time it can be very dangerous, since your "twists" could mask important issues that are difficult to observe from other angles.
As a baseline, NEVER FAKE DATA, especially when you are on your own (and you know nobody will double-check your queries). INTEGRITY about the data (and as a PERSON) is above everything.
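To make "sophisticated query writing" concrete, here is a hedged sketch of the kind of report query that goes beyond a one-click chart: monthly revenue per region, plus each region's share of that month's total. The `orders` table and its columns are hypothetical; SQLite (via Python's sqlite3 module) stands in for whatever warehouse engine the team actually uses.

```python
import sqlite3

# Hypothetical "orders" table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL, order_date TEXT);
    INSERT INTO orders VALUES
        ('East', 100, '2015-01-10'), ('East', 300, '2015-02-05'),
        ('West', 250, '2015-01-20'), ('West', 50,  '2015-02-14');
""")

# Monthly revenue per region, with each region's percentage of that
# month's total computed by a correlated subquery.
query = """
    SELECT strftime('%Y-%m', o.order_date) AS month,
           o.region,
           SUM(o.amount) AS revenue,
           ROUND(100.0 * SUM(o.amount) /
                 (SELECT SUM(amount) FROM orders
                  WHERE strftime('%Y-%m', order_date) =
                        strftime('%Y-%m', o.order_date)), 1) AS pct_of_month
    FROM orders o
    GROUP BY month, o.region
    ORDER BY month, o.region
"""
for row in conn.execute(query):
    print(row)
```

Queries like this (correlated subqueries, multi-level grouping) are the bread and butter of report automation; the visualization tool only renders what the query has already shaped.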

(d) Do some ad-hoc analysis on the data to answer or raise questions. More critical/advanced skills: critical thinking, asking good questions, and understanding the bigger industry/business scenarios. Basic skills: query-writing, some basic statistical knowledge, and machine learning knowledge (with not much emphasis on the theoretical part, but more on the application side, like: can I solve this problem without using any learning algorithm? What ML problem does this real-life problem translate to: a classification problem or a clustering problem? Is it a supervised or unsupervised learning problem? If supervised, where can I get the labeled training data? Is the labeled data trustworthy and of acceptable quality? Which method should I use to address this issue? What about the nature of the data/problem leads me to believe so? How can I evaluate my results? If they are not good, what's the root cause? Is the data not clean, is the model not good, did we model it wrong, or should we not use a learning model at all and use a rule-based system instead? etc.)
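The "can I solve this without any learning algorithm?" question above can be made concrete: before reaching for a model, try a rule-based baseline on labeled data and measure how far it already gets you. The tiny labeled set and keyword rule below are hypothetical, chosen only to illustrate the evaluation habit.

```python
# Hypothetical labeled data: (text, label), where 1 = spam, 0 = not spam.
labeled = [
    ("win a free prize now",   1),
    ("meeting moved to 3pm",   0),
    ("free money click here",  1),
    ("lunch tomorrow",         0),
    ("prize claim free offer", 1),
]

def rule_based(text):
    """Baseline: flag as spam if any obvious keyword appears."""
    return int(any(word in text for word in ("free", "prize", "win")))

# Evaluate the baseline exactly as you would evaluate a trained model.
correct = sum(rule_based(text) == label for text, label in labeled)
accuracy = correct / len(labeled)
print(f"rule-based baseline accuracy: {accuracy:.0%}")  # 100% on this toy set
```

If a simple rule already scores well on trustworthy labels, a learning model may be unnecessary; if not, the same evaluation harness is ready for whatever model replaces the rule.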

What makes you a great Data Engineer


Besides all the basic technical requirements, such as programming skills, query-writing skills, database design skills, and statistical skills, and the more advanced soft skills like communication, critical thinking, curiosity, and constant learning, the single most important factor that will make you a great Data Engineer is PASSION for and LOVE of data: you feel the excitement of working on the data, looking at it, making sense of it, and utilizing it to achieve something bigger (like answering those big business questions and helping make very important business decisions). On the personality traits side, you need to be very logical and detail-oriented, which is super important for working with data, as data work is about arranging details together in a logical way. At the same time, you need to practice a "zooming" capability and flexibility when you examine the data, just like zooming a camera lens: from raw, detail-level data, to different aggregation angles, and to some even bigger/higher perspective, jumping out of the data itself to re-think what questions we intend to answer, and how they can be answered, by looking at this data. The more agilely you can "zoom" your data lens and your thinking process, the better you are as a Data Engineer, and the more ready you are to go to the next level of your career (on either the Data track or some other track).

What to learn, and how to learn it, to become a Data Engineer


I find it really difficult to present general advice on this, as it varies greatly based on the individual's background. I will put it into some rough levels, mostly on the technical side, as the soft skills are much more general and can be polished in many settings not limited to the Data Engineer role.

a. Basics: college-level mathematics, statistics, computer programming (any language: Java, C, C++, C#, Python, Perl, etc.). You need the capability to do procedural programming (give the computer instructions on what to do, like first get all the files, then open each of them one by one, then iterate over each file line by line and do something with it, etc.), and basic SQL (you need to know how to write result-oriented queries).
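The procedural pattern just described -- get all the files, open each one, walk it line by line, do something -- is worth seeing once in full. A minimal sketch in Python follows; the temporary directory, file names, and "count error lines" task are all made up so the example runs anywhere.

```python
import os
import tempfile

# Set up a throwaway directory with two small log files so the sketch is
# self-contained; in real work, the directory and the per-line task are given.
workdir = tempfile.mkdtemp()
for name, text in [("a.log", "ok\nerror: disk\nok\n"), ("b.log", "error: net\n")]:
    with open(os.path.join(workdir, name), "w") as f:
        f.write(text)

# The procedural pattern itself: first get all the files, then open each
# one by one, then iterate line by line and do something with each line.
error_count = 0
for filename in sorted(os.listdir(workdir)):
    with open(os.path.join(workdir, filename)) as f:
        for line in f:
            if line.startswith("error"):
                error_count += 1

print(error_count)  # 2
```

Being able to write this kind of loop fluently, in any language, is the baseline the section is describing.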

b. Intermediate: database design (data warehouse design) and object-oriented design (useful for data pipeline design).
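To give "data warehouse design" a concrete face, here is a hedged sketch of the classic star schema: one fact table of events surrounded by small dimension tables. All table and column names are hypothetical, and SQLite (via Python's sqlite3 module) stands in for a real warehouse engine.

```python
import sqlite3

# A minimal star schema: fact_sales at the center, dimensions around it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        region_id  INTEGER REFERENCES dim_region(region_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO fact_sales  VALUES (1, 1, 100), (2, 1, 250), (1, 1, 50);
""")

# With this design, "which region generates the most revenue?" is just a
# join plus a GROUP BY -- the payoff of a good warehouse layout.
query = """
    SELECT r.name, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_region r USING (region_id)
    GROUP BY r.name ORDER BY revenue DESC
"""
print(conn.execute(query).fetchall())  # [('West', 250.0), ('East', 150.0)]
```

The point of the design is exactly what the earlier section called a "good" warehouse: downstream queries stay short and obvious instead of fighting the schema.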

c. Advanced: understand big data platform usage and infrastructure (map-reduce, distributed file systems, big tables, etc.), and machine learning knowledge at the application level to solve real-life problems.
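The map-reduce idea mentioned above can be illustrated without any cluster at all: a toy in-process word count, where the same map and reduce functions could conceptually run across thousands of machines. The input lines are made up; the shuffle step is simulated with a plain dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return (word, sum(counts))

lines = ["big data big pipeline", "data pipeline data"]

# Shuffle step: group all mapped pairs by key, as the framework would
# across the network before handing each key's values to a reducer.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 3, 'pipeline': 2}
```

Real frameworks add distribution, fault tolerance, and disk spilling, but the mental model a data engineer needs is exactly this map / shuffle / reduce shape.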

All these topics can be found in numerous MOOC resources available online. However, even though most of them are designed to be "real-life" and "practical", they are, after all, still "courses" and need to be self-contained and "complete" in order to be solvable by students. The best way to learn is to get the chance to work on a genuinely real-life problem (through an intern project, or by offering a hand to a neighboring "data team" if you are already in a software company or IT team). You will get a sense of, and learn how to handle, the totally "imperfect" world: how "dirty" the data can be, how "incomplete" the information can be, and how you still need to "find a way" to do your work and produce results... only by that time, welcome to the real data world, and enjoy the journey!
