The industry-wide neglect of data design and data quality (and what you can do about it)
My favorite way of explaining the difference between data science and data engineering is this:
If data science is “making data useful,” then data engineering is “making data usable.”
These disciplines are so exciting that it’s easy to get ahead of ourselves and forget that before we can make data usable (let alone useful), we need to make data in the first place.
But what about “making data” in the first place?
The art of making good data is terribly neglected. If you have no data — no inputs — to work with, then there’s not an awful lot that your data engineers and data scientists can help you with.
But even when you do have some data, there’s a chance you’re missing something: data quality. If you’ve collected truly rancid data, forget about extracting value from it. It’s futile to battle the inescapable gravity of this basic law of nature: Garbage In, Garbage Out.
Data plays the same role in data science and AI as ingredients play in cooking. A spiffy kitchen full of all the most modern implements won’t save you; if your ingredients are garbage, you may as well give up. No matter how you slice and dice them, you’re not about to cook up anything worthwhile. That’s why you need to think about investing in good data before you rush headlong into your project.
If you care about results, invest in good data before chasing fancy algorithms, models, and a parade of data scientists.
Let me make a little guess about you, dear reader: you’re not new to Garbage In, Garbage Out (GIGO). Or QIQO for the more upbeat glass-half-full personalities out there (the Q is for quality). You’re practically begging me to say something you haven’t heard before, yet here I am chafing your patience with GIGO talk. Again. Yes, we’ve all repeated the GIGO principle ad nauseam. I’m at least as sick of it as you are.
But riddle me this. If we have a whole industry of GIGO-respecting professionals and we also understand that designing quality datasets isn’t trivial, where’s the evidence that we put our money where our mouths are?
If data quality is so obviously important (after all, it’s the foundation of the whole multibillion-dollar data/AI/ML/statistics/analytics shebang), what do we call the professionals who are responsible for it? This is not a trick question. All I want you to tell me is:
What’s the *job title* of the person whose primary role is the design, collection, curation, and documentation of high-quality datasets?
Except, unfortunately, it may as well be a trick question. Whenever I chat with a group of datafolk at a conference, I try to sneak the question in. And every time I’ve asked them who’s responsible for data quality in their organizations, they’ve never come up with anything remotely resembling consensus. Whose job is it? Data engineers say data engineers, statisticians say statisticians, researchers say researchers, UX designers say UX designers, product managers say product managers… GIGO ad nauseam indeed. Data quality seems to be exactly the kind of “everybody’s job” that ends up being nobody’s job, since it requires skills (!) yet no one seems to be investing in them intentionally, let alone sharing best practices.
Data quality is exactly the kind of “everybody’s job” that ends up being nobody’s job.
Maybe I care a little bit too much about the data science profession. If I were here just for my own career, I’d make a quick buck with data charlatanism, but I want data careers in general to matter. To be worth something. To be useful. To make the world better than we found it. So when I see the two most important prerequisites neglected (data quality and data leadership), it breaks my heart.
If the data quality professional / data designer / data curator / data collector / data steward / dataset engineer / data excellence expert career doesn’t even have a name (see?) or a community, no wonder you won’t find it on a resume or in a university program. What keywords will your recruiters use to search for candidates? What interview questions will you use to screen for the core skills? And good luck finding excellence — your candidate will need quite the symphony of skills.
First off, let’s recognize that we’re not talking about your kid cousin’s “data labeling” summer job here, the kind of job that involves mindless data entry and/or selecting all the cupcake shots among a purgatory of bakery thumbnails and/or going door to door with a paper survey. Thought I’d mention this because “isn’t it just data labeling?” is a question I’ve been asked multiple times in a tone of polite concern for my blood pressure. What a way to dismiss a whole category of genius.
“Isn’t it just data labeling?” No. (What a way to dismiss a whole category of genius.)
No, we’re talking about the kind of person who designs that data collection process in the first place. It takes at least a pinch of user experience design, a dash of decision science, a spoonful of survey design experience, a lump of psychology, a dollop of experimental social science with field experience (anyone who’s got real experience will anticipate the Philadelphia Problem for you in their sleep), and a chunk of statistics training too (though you don’t need a whole statistician), plus solid analytics experience, plenty of domain expertise, some project/program management skills, a bit of exposure to data product management, and enough of a data engineering background to think about data collection at scale. This is a rare blend — we urgently need a new specialization.
But until we’ve fought for a data-making career that is well recognized, well managed, and well rewarded, we’re stuck. Budding badasses with an aptitude for this array of skills would be lemmings to throw themselves at it. It’s a desk-in-the-basement kind of job these days, if it’s a job at all. To have any hope of building a mature data ecosystem, we must give a new generation of specialists a good home where they will be appreciated for flexing their specialist skills.
So what can you do?
If there are already people with these skills and talents who, despite a history of neglect, are stepping up in your organization to take on data quality, are you encouraging them? Are you nurturing them? Are you rewarding them? I hope you are. Whereas if you’re creating incentives to chase the paychecks in buzzy MLOps or PhD-spangled data science, you’re shooting yourself (and our whole industry) in the foot.
Google’s People + AI Research (PAIR) team recently released the Data Cards Playbook to help train the community in data design, data transparency, data quality, and data documentation best practices. I’m so proud of our work and I’m thrilled those materials are freely available for everyone’s benefit, but there’s still so much to learn. If you’re on this path too and passionately championing data excellence, please share the lessons you’re learning with the rest of the world.
If a research paper falls in a forest and no one uses it, did it make a sound? It’s a long journey from good ideas to an established discipline of excellence… a journey that needs all the cheerleading and amplifying it can get. If you believe in this and you can inspire even one other person to take it seriously, you’ll have played a vital part in building the future. Thank you in advance for spreading the word.
Our community has done a great job of celebrating data scientists. We’re doing a decent job of celebrating MLOps and data engineers. But we’re doing a pathetic job of celebrating the people on whom all the other data careers depend: the people who design data collection and are responsible for data excellence, documentation, and curation. Maybe we could start by naming them (I’d love to hear your suggestions) and at least acknowledging that they matter. From there, will we progress to training them, hiring them, and appreciating them for their specialized skills? I sure hope so.
Although the site emphasizes data documentation and AI (gotta catch that zeitgeist), the Data Cards Playbook is so much more. It’s the strongest set of general data design resources I’m aware of.