Touching
Data Science
16 June, 2020, 19:53 GMT +3
Data Science Job of a Dream
Eugene:
Hi Dennis, it's a pleasure to have you here!
Most of my fellow MITx learners are eager to land a job at a tech giant like Microsoft. We have so many questions for you! What do you do there?
Dennis Sawyers,
MICROSOFT:
My position right now is called Solution Architect in Data and AI. I do data science pretty much every single day across a vast array of customers, starting from data ingestion and going through the whole pipeline.
Basically, I do the following:

  • help people use the Azure Machine Learning suite to do data science
  • work with customers constantly to get them started on Azure
  • show them how to do machine learning on Azure
  • help them get access to their data
  • advise on how to implement those projects end to end

That means starting from data ingestion and going through the whole pipeline.
Right now, big-name companies are starving for data scientists. Top tech may not even be best for you.
Eugene:
What skill set would be enough to be selected for a data science position at Microsoft?
DENNIS SAWYERS,
MICROSOFT:
I was a data scientist at Ford Motor Company for four years. Don't get too attached to top tech. You can get into a big-name company too, and that can be even better for you. Google, Microsoft, and Amazon in particular all have major customers across the whole FORTUNE 500. When I got a job at Ford Motor Company, it was in the FORTUNE 10.

There's Chrysler, GM, and there are big companies all over the place: Walmart, Lowe's, grocery chains, restaurants, and fast-food chains like McDonald's. Just try to get into a big-name company, because right now all of those companies are starving for data scientists.
Eugene:
So – Data Science isn't all around?
DENNIS SAWYERS,
MICROSOFT:
Yeah. There's a significant problem with data science. The field is highly concentrated on the West Coast of the United States and maybe a bit in New York. Everywhere else in the country, there just aren't enough people.
Don't be afraid to start at a manufacturing company or something like that. Everyone's trying to get into Google, Amazon, Microsoft. Even if you do get in, in a junior position, it is hard to move up. If you start at a non-tech FORTUNE 500 company, you're going to be a much bigger fish in a smaller pond. You'll have a lot more responsibility right off the bat. You're going to be doing some fantastic projects. By the time you're a couple of years out of college, you'll already be doing super big-ticket projects that make a serious contribution to your company's business.

At Ford Motor Company, I had an analytics project where I increased the share of vehicles with activated modems from about 15% to 80% over a two-and-a-half-year period. Why was it an important metric? The business wanted customers to use its technology. Activating the modem involved downloading and using an app in which Ford Motor Company had invested tens of millions of dollars.

I did another AI project at Ford. I was predicting auto vehicle sales and market share. And that was super impactful. Executives at Ford used that to determine how to allocate billions of dollars of incentive spending.

Those were just two projects. Had I joined a more prominent tech company at the beginning, I wouldn't have had the seniority to tackle those problems by myself. With that ability to generate value, many companies, including top tech, will welcome you to positions with even more responsibility.
Use your spare time to do volunteer work
Try to get on the most impactful projects that you can. Use your spare time to do volunteer work. Tech competitions are good, but what's better is finding some charity/volunteer organizations and doing some stuff for them. That's another area where they don't have too many data scientists, but they have massive projects that make a difference in the world. Once you have done a few of those, you get a lot of credibility, and you learn a lot along the way too.
It's all about getting an understanding of what end-to-end data science really is
Eugene:
Why would anyone dreaming of big tech seriously consider anything else?
DENNIS SAWYERS,
MICROSOFT:
Once you learn how to contribute there, it's straightforward. Nothing can prevent you from being just as productive at a tech company or any other company.

I'm doing a cloud solution architect role right now in data and AI. I could very easily switch to AI research at Microsoft or to AI product development. Those are the natural roles for me. It's all about getting an understanding of what end-to-end data science really is.
During my studies, I focused too much on algorithms and the like. Students tend to do that: they focus excessively on accuracy metrics but forget about how the model plugs into a business ecosystem.
Every database starts out perfect
Eugene:
I once spent two days trying to improve accuracy from 97.96% to 98% on a lab project... So where does the student approach end and grown-up data science begin?
DENNIS SAWYERS,
MICROSOFT:
Most companies' data is terrible beyond belief (for data science purposes). For years, these companies have had IT systems set up by people who don't think about data from a long-term perspective. Nobody is following database 101.

A friend of mine likes to say that every database starts perfect. But as business requirements pile in, the database gets stranger and stranger. More and more of our job focuses on data cleansing. So get good at that.

My favorite tool for data cleansing is Alteryx. It's drag-and-drop GUI software, and it's the best tool for drag-and-drop data engineering.

Get good at data engineering in addition to the algorithms. A lot of problems in data science lie not so much in the algorithm as in the data ingestion part.

Get good at using auto ML and also HyperDrive, or any hyper-parameter tuning tool, but not the open-source one. If you want to focus on a single company, study its platform. If you want to work at:

  • Amazon – study AWS
  • Google – study GCP
  • Microsoft – study Azure Machine Learning service
Eugene:
Learning any of these tools is quite an undertaking. Which part is the most valuable to begin with?
DENNIS SAWYERS,
MICROSOFT:
All of these platforms have some version of hyper-parameter tuning. Know how to use those things. In terms of when to use those things: always.

The first thing that you should do when you start a new machine learning project is to transform your data and run it through auto ML. That will give you a baseline. Oftentimes people can beat it! But the automated machine learning algorithms are getting better and better and better. As a data scientist, you're spending less and less time on things like algorithm selection and algorithm tuning.
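
To make the "baseline first" idea concrete, here is a minimal sketch in plain Python. It is not Azure's automated ML or HyperDrive: it is a toy sweep over one hyper-parameter (k of a nearest-neighbour classifier) scored on a held-out set. The function names and the data are hypothetical and only illustrate the workflow of trying every candidate automatically before tuning anything by hand.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, valid, k):
    """Share of validation points that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)

def sweep(train, valid, candidates):
    """Score every candidate k and return (best_k, best_accuracy, all_scores)."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k], scores

# Hypothetical one-feature dataset: (feature, label) pairs.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.4, 1), (2.6, 0), (3.0, 1), (3.5, 1), (4.0, 1)]
valid = [(1.2, 0), (2.2, 0), (2.8, 1), (3.8, 1)]

best_k, best_acc, scores = sweep(train, valid, candidates=[1, 3, 5])
print("baseline:", best_k, best_acc, scores)
```

The score of the best candidate is the baseline; any hand-built model then has to beat that number to justify the extra effort.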

There's still a lot of room for creativity in taking a business problem and knowing what metric to minimize. Take accuracy, for example. False positives and false negatives are rarely equally bad.
Eugene:
For example?
DENNIS SAWYERS,
MICROSOFT:
Say you're a restaurant. You have a takeout place and items on hand that you sell to customers as they walk in. It's usually more OK to throw out extra items than to make customers wait. You will lose less money that way than if a customer walks in and walks out without buying anything.

In a problem like that, you're looking at precision or recall rather than accuracy. You're trying to assign a weight to say that throwing out food is only 10% as bad as losing a customer.
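
A small sketch of that weighting, assuming a binary setup that the interview does not spell out: label 1 means a customer wanted the item, a false positive is an item prepared and thrown out, and a false negative is a customer who found nothing. The 0.1 and 1.0 weights come from the "10% as bad" remark; the scores, labels, and function names are made up for illustration.

```python
def weighted_cost(y_true, y_pred, fp_cost=0.1, fn_cost=1.0):
    """Total cost where wasted food (FP) is 10% as bad as a lost customer (FN)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    return fp_cost * fp + fn_cost * fn

def error_rate(y_true, y_pred):
    """Plain misclassification rate, i.e. 1 - accuracy."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_threshold(y_true, scores, thresholds, objective):
    """Return the threshold whose predictions minimize the given objective."""
    def value(threshold):
        preds = [1 if s >= threshold else 0 for s in scores]
        return objective(y_true, preds)
    return min(thresholds, key=value)

# Hypothetical model scores (predicted chance of demand) and true outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_true = [0, 0, 1, 0, 0, 1, 1, 1]
thresholds = [0.25, 0.5, 0.65, 0.85]

by_accuracy = best_threshold(y_true, scores, thresholds, error_rate)
by_cost = best_threshold(y_true, scores, thresholds, weighted_cost)
print("accuracy picks", by_accuracy, "| business cost picks", by_cost)
```

On this toy data the accuracy-optimal threshold is the stricter one, while the cost-weighted objective prefers a lower threshold that wastes some food but never turns a customer away.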
Eugene:
So you are talking about a false-positive/false-negative trade-off that I tie to the actual business problem I am solving.
DENNIS SAWYERS,
MICROSOFT:
Right! And automated ML cannot do that and never will.

As a data scientist, you want to get to know well:

  1. traditional things like algorithms
  2. how to work with auto ML because you'll be using that a lot
  3. what the business needs.
There's something that I like to call the data science productivity problem. Many companies hire a ####-ton of data scientists, get a lot of data, and don't get the return on investment from it.

There are a few reasons for that. The biggest one is that their data is terrible, and they don't do enough legwork on cleansing it. So their data scientists have to do all the data cleansing. But data scientists are not data engineers: they are slow at it, and they're bad at it.
Take my previous job at Ford Motor Company, for example. We had some of the strangest data you have ever seen in your life: a field entirely populated with Greek letters, always three of them, like αβπ. There was no translation table because it was all in people's heads. Moreover, Greek letters turn into gibberish, since many databases, including Hadoop, don't support them. You lose the information when you move the data.
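
The interview does not say exactly how the letters were lost, so the following is only a sketch of one common mechanism: text moved between systems that disagree about character encoding. It uses Python's built-in codecs; the helper name is hypothetical.

```python
def survives_round_trip(text, encoding):
    """True if text can be stored in `encoding` and read back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The target encoding has no representation for some character.
        return False

codes = "αβπ"

# UTF-8 end to end keeps the letters intact.
print(survives_round_trip(codes, "utf-8"))

# A system limited to ASCII or Latin-1 cannot store Greek letters at all.
print(survives_round_trip(codes, "ascii"), survives_round_trip(codes, "latin-1"))

# Reading UTF-8 bytes as Latin-1 yields gibberish (mojibake), but the bytes
# are still there, so the damage can be reversed.
mojibake = codes.encode("utf-8").decode("latin-1")
recovered = mojibake.encode("latin-1").decode("utf-8")

# Replacing unsupported characters on export is a one-way street:
# every letter becomes "?" and the original values cannot be restored.
lossy = codes.encode("ascii", errors="replace").decode("ascii")
print(repr(mojibake), recovered, lossy)
```

The practical check before any migration is the first function: run it over a sample of each text field against the target system's encoding.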
Problems like that slow everything down. Once you have solved them, you should focus on scoring and retraining. In most companies, that is done manually.
Once you have a model that you like, the next step is to deploy it to a pipeline. By pipeline, I mean the concept in Azure called an Azure Machine Learning pipeline. You can set up scripts that automatically score data and automatically retrain the model, and you can schedule all of that automatically. So you never have to touch it again.

Master software engineering skills and make pipelines. If you can build automated scoring and retraining pipelines, you are productive. You won't be doing data cleansing all day, and you'll be able to do exciting work.
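
As a rough picture of what "automated scoring and retraining" means, here is a toy sketch in plain Python. It does not use the Azure Machine Learning SDK; the `MeanModel` class, the `run_pipeline` function, and the error threshold are hypothetical stand-ins, and it assumes the true values for a batch become available after scoring. In practice a scheduler would call the pipeline step for each new batch.

```python
from statistics import mean

class MeanModel:
    """Toy model: always predicts the mean of the targets it was trained on."""
    def __init__(self):
        self.value = 0.0

    def fit(self, targets):
        self.value = mean(targets)
        return self

    def predict(self, n):
        return [self.value] * n

def mean_abs_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def run_pipeline(model, history, batch, max_error=1.0):
    """Score one batch, then retrain on all data if the error drifted too far.

    Returns (predictions, error, retrained).
    """
    preds = model.predict(len(batch))          # scoring step
    error = mean_abs_error(batch, preds)       # monitoring step
    history.extend(batch)                      # keep the new data for training
    retrained = error > max_error
    if retrained:
        model.fit(history)                     # retraining step
    return preds, error, retrained

history = [10.0, 10.0, 10.0]
model = MeanModel().fit(history)

# First batch looks like the training data: no retraining needed.
_, error_1, retrained_1 = run_pipeline(model, history, [10.0, 11.0])

# Second batch has drifted: the pipeline retrains without anyone touching it.
_, error_2, retrained_2 = run_pipeline(model, history, [20.0, 20.0])
print(error_1, retrained_1, error_2, retrained_2, model.value)
```

The design point is that scoring, monitoring, and retraining live in one function that a schedule can call, so no step depends on a person remembering to run it.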

Think about it.
Eugene:
Which directions in DS/ML are most in demand?
Stay away from projects & examples that are a little bit too esoteric.
DENNIS SAWYERS,
MICROSOFT:
Stay away from projects & examples that are a little bit too esoteric. For example, I worked in a Carnegie Mellon research lab, where I built an anomaly detection algorithm around radiation portal monitors. We ended up using a self-scoring multivariate anomaly detection algorithm. The paper was cited about forty times and is very little known.

During an interview, it's a lot more challenging to talk about such rare topics than about a well-known classification algorithm that you applied to a novel problem.

I also worked for an NBA team when I was a student. I built a classification model using logistic regression and naive Bayes to predict which European basketball players to draft into the NBA.

Everybody knows basketball, and everybody knows naive Bayes and logistic regression. It's a lot easier to have that conversation.
Eugene:
What is your vision of Data Science development for the next ten years?
DENNIS SAWYERS,
MICROSOFT:
That is a great question. I will answer it with PowerPoint.
Eugene:
Thank you, Dennis! That was great!
Eugene S.