5 Data Science Projects That Will Get You Hired in 2019


You’ve been taking online courses, and reading a bunch of textbooks, but now what?
Getting a job in data science can seem intimidating.
The best way to showcase your skills is with a portfolio of data science projects.
This shows that you can actually apply data science skills.
Here are 5 types of data science projects that will boost your portfolio and help you land a data science job.
1. Data Cleaning
Data scientists can expect to spend up to 80% of their time cleaning data. This is a huge pain point.
If you can show that you’re experienced at cleaning data, you’ll immediately be more valuable.
To create a data cleaning project, find some messy datasets, and start cleaning.
If you don’t know where to look, try the Data.gov website. You can search for data or browse by topic.
For even more ideas, check out these 18 datasets. This answer posted on Quora also has some great sources for messy data.
Now that you have your data, you need to pick a tool.
If you’re using Python, be sure to check out the Pandas library.
For a quick intro, 10 Minutes to Pandas is a great tutorial for newcomers.
Another great article is Pythonic Data Cleaning With NumPy and Pandas.
This article has some great data cleaning examples, like dropping columns you don’t need from a DataFrame.
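Here’s a minimal sketch of that pattern (the file and column names are hypothetical):

```python
import pandas as pd

# Load the raw data (file name is hypothetical)
df = pd.read_csv("raw_data.csv")

# Columns that won't be used in the analysis (names are hypothetical)
to_drop = ["notes", "internal_id", "legacy_code"]

# Drop them; errors="ignore" skips any that aren't present
df = df.drop(columns=to_drop, errors="ignore")
```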
If you want to use R instead, use the dplyr package.
This package is great because it uses a “grammar of data manipulation.”
So what does that mean? Basically, if you have a data cleaning task, there’s a logical verb that’s got you covered.
For example, if you want to pick variables based on their names, you use select().
Chapter 5 from Hadley Wickham’s R for Data Science is a great dplyr tutorial.
This chapter is all about transforming data.
Here’s a more detailed use of the select() function.
In this example, we’re only selecting 4 out of the total 19 variables.
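If you’re working in Python, the pandas analogue of a select() call like that is plain column selection by name. Here’s a minimal sketch (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("flights.csv")  # hypothetical file with 19 columns

# Keep only the 4 variables we care about
subset = df[["year", "month", "day", "dep_delay"]]
```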
You’ve got the data, you’ve got the tools, now what?
If you’re unsure how to structure your project, use this outline (sketched in code below).
- Import data
- Join multiple data sets
- Detect missing values
- Look for anomalies (1, 2, 3, “Fred”?)
- Impute for missing values
- Data quality assurance
This is a very general outline to get you started.
Your actual workflow will depend on your project.
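To make that concrete, here’s a minimal pandas sketch of the outline (the files and columns are hypothetical):

```python
import pandas as pd

# 1. Import data (file names are hypothetical)
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# 2. Join multiple data sets
df = orders.merge(customers, on="customer_id", how="left")

# 3. Detect missing values
print(df.isnull().sum())

# 4. Look for anomalies (1, 2, 3, "Fred"?)
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")  # "Fred" becomes NaN
print(df["quantity"].describe())

# 5. Impute for missing values
df["quantity"] = df["quantity"].fillna(df["quantity"].median())

# 6. Data quality assurance
assert df["quantity"].notnull().all()
assert (df["quantity"] >= 0).all()
```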
For a detailed example of a real-life data cleaning project, check out this awesome article from Tich Mangono.
Tich’s project required data cleaning and reshaping for a dashboard app.
His data was in a wide format spread across numerous tabs, but the app required a long format.
Tich’s workflow looked something like this:
- Set up the environment and load the data
- Investigate the data
- Parse the different data tabs
- Standardize existing columns and create new ones
- Clean up the data using “apply” and “lambda” functions (see the sketch below)
- Reshape the data from wide to long by pivoting on multi-level indices and stacking
- Concatenate and save the final results back to Excel
As you can see, Tich’s workflow is a lot more detailed.
It’s important to get an initial feel for your data with a quick investigation, and then plan out your steps.
You’ll add and update tasks as you make discoveries about your data.
Even so, it’s still a good idea to have a game plan before you dive in.
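To make steps 5 and 6 concrete, here’s a minimal sketch of the apply/lambda cleanup and the wide-to-long reshape (the data is made up; this isn’t Tich’s actual code):

```python
import pandas as pd

# Hypothetical wide-format data: one row per region, one column per year
wide = pd.DataFrame({
    "region": ["North", "South"],
    "2016": ["1,200", "950"],
    "2017": ["1,350", "1,010"],
})

# Clean up with apply and a lambda: strip commas and convert to numbers
for col in ["2016", "2017"]:
    wide[col] = wide[col].apply(lambda x: int(x.replace(",", "")))

# Reshape from wide to long by stacking on the index
long = wide.set_index("region").stack().reset_index()
long.columns = ["region", "year", "value"]
print(long)
```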
2. Exploratory Data Analysis
Another important aspect of data science is exploratory data analysis (EDA).
This process involves generating questions, and investigating them with visualizations.
EDA is important because it allows you to understand your data and make unexpected discoveries.
A great source for EDA datasets is the IBM Analytics Community.
What I really like about these datasets is that they have a “real-world” feel. The Sales datasets are a good example.
For an EDA project with Python, use the Matplotlib library.
Matplotlib’s documentation has a ton of great tutorials to learn from.
This article from George Seif also has some great examples of data visualizations in Python with code.
One of his examples is an overlaid histogram.
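Here’s a minimal sketch of the same idea (his actual code and data differ):

```python
import numpy as np
import matplotlib.pyplot as plt

# Two hypothetical distributions to compare
np.random.seed(42)
group_a = np.random.normal(loc=60, scale=10, size=1000)
group_b = np.random.normal(loc=75, scale=12, size=1000)

# Overlaid histograms: shared bins and transparency make the overlap readable
bins = np.linspace(20, 120, 40)
plt.hist(group_a, bins=bins, alpha=0.5, label="Group A")
plt.hist(group_b, bins=bins, alpha=0.5, label="Group B")
plt.xlabel("Score")
plt.ylabel("Count")
plt.legend()
plt.show()
```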
For R users, be sure to use the ggplot2 package.
Like dplyr, this package utilizes a “grammar of” strategy. This time it’s a grammar of graphics.
I’m going to point you towards R for Data Science again, because Chapter 3 is a great ggplot2 tutorial.
Here’s an example of ggplot2’s elegance: the book’s first plot is just ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)).
To build an EDA project, keep the following topics in mind:
- Formulate relevant questions, and hypotheses
- Test those questions with visualizations
- Identify trends in the data
- Look for relationships between variables
- Communicate results with visualizations (scatter plots, histograms, etc.)
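Here’s a minimal sketch of that loop, using pandas and Matplotlib (the dataset and columns are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales.csv")  # hypothetical dataset

# Question: do bigger discounts drive bigger order sizes?
print(sales[["discount", "order_size"]].corr())

# Test it with a visualization
sales.plot.scatter(x="discount", y="order_size", alpha=0.3)
plt.title("Order size vs. discount")
plt.show()

# Look for trends over time
sales.groupby("month")["revenue"].sum().plot(kind="line")
plt.title("Monthly revenue")
plt.show()
```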
For a great EDA project example, check out this epic post from William Koehrsen.
This example takes a look at doctor’s appointment no-shows.
By some estimates, each missed appointment costs the US health care system nearly $200, so if you work in the health policy sector, this is a major issue.
3. Interactive Data Visualizations
Interactive data visualizations include tools such as dashboards.
These are useful for both data science teams, and more business-oriented end-users.
Dashboards allow data science teams to collaborate, and draw insights together.
They also provide a useful tool for end-users that don’t need all the fine details, just a quick and easy way to interact with their data.
For Python users, I recommend Dash by Plotly.
With Dash you can quickly create great-looking dashboards.
The Dash gallery includes an example dashboard of Uber rides in NYC.
To start learning Dash, check out the user guide.
For another great tutorial, check out Dash in 5 Minutes from Plotly.
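To give you a feel for the API, here’s a minimal Dash app sketch (the data is made up; the tutorials above build much richer apps):

```python
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash(__name__)

# A layout is a tree of HTML and graph components
app.layout = html.Div([
    html.H1("Rides per hour"),
    dcc.Graph(
        id="rides-by-hour",
        figure={
            "data": [{"type": "bar", "x": list(range(24)), "y": [5, 3, 2] * 8}],
            "layout": {"xaxis": {"title": "Hour of day"}},
        },
    ),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```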
For R users, a great tool for interactive visualizations is Shiny from RStudio.
You can host Shiny apps on a webpage, embed them directly in RMarkdown notebooks, or use them to build dashboards.
Check out this dashboard tutorial from RStudio to get started with Shiny.
For a great example, check out the Twin Cities Buses dashboard.
Here are some keys to creating dashboards:
- Know your audience
- Tell a story
- Keep things simple
- Use appropriate visualizations
The Art and Science of Effective Dashboard Design is a very detailed guide with even more key elements.
4. Machine Learning
A machine learning project is another important piece of your data science portfolio.
Now before you run off and start building some deep learning project, take a step back for a minute.
Rather than building a complex machine learning model, stick with the basics.
Linear regression and logistic regression are great to start with.
These models are easier to interpret and communicate to upper level management.
To get started with Python, use the scikit-learn library.
This library covers a ton of useful machine learning topics. Here are a few:
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
A great place to start learning is the logistic regression page.
There’s a good explanation of this model, and links to various examples.
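Here’s a minimal example of fitting a logistic regression with scikit-learn, using one of its built-in datasets so it’s self-contained:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A built-in dataset keeps the example self-contained
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so you can measure performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```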
For R users, check out the Caret package.
Here’s a good tutorial on logistic regression using Caret.
Your machine learning project should include the following:
- Reason why you chose a specific model
- Split data into train/test sets so you can detect overfitting
- Select the right evaluation metric (AUC, adj-R^2, confusion matrix, etc.)
- Feature engineering and selection
- Hyperparameter tuning
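Here’s a sketch of a few items from that checklist: a train/test split, AUC as the evaluation metric, and a small hyperparameter grid. It continues the scikit-learn example above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning: search over the regularization strength C
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluate with AUC on the held-out test set
probs = grid.predict_proba(X_test)[:, 1]
print("Best C:", grid.best_params_["C"])
print("Test AUC:", roc_auc_score(y_test, probs))
```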
I’d also recommend focusing on a project that has a business impact, such as predicting customer churn, fraud detection, or loan default.
These are more real-world than predicting flower type.
Here’s a good example from Denis Batalov on predicting customer churn.
I really like this example because Denis ties his result to a business impact.
He takes a look at the financial outcome of using vs. not using his model.
Based on his assumptions, he estimates roughly $2MM per year in savings by using his machine learning model.
If you can tie your results to a business impact, you’ll score some serious bonus points with potential employers.
5. Communication
The final project type should focus on communication.
Communication is an important aspect of data science.
Effectively communicating results is what separates the good data scientists from the great ones.
It doesn’t matter how fancy your model is, if you can’t explain it to teammates or customers, you won’t get their buy-in.
Here are 11 great presentations on SlideShare if you need some inspiration.
Here are some tips for creating great presentations:
- Know your audience
- Present relevant visualizations
- Don’t overcrowd your slides
- Make sure your presentation flows well
- Tie results to a business impact (reduced cost, increased revenue)
Also keep in mind that practice helps.
Find a friend, and present to them before the big day.
Another effective technique is to practice out loud alone, and record yourself.
This is a good approach because you can go back and see what was working and what wasn’t: maybe you were speaking too fast, or rambling on.
Use that feedback to adjust your technique as you practice.
Notebooks are also an effective communication tool.
Jupyter Notebooks and RMarkdown files are great ways for teams to communicate with each other.
These notebooks are great for building a portfolio.
You can convert these notebooks to Markdown, build a static website from them with Jekyll, and host it for free using GitHub Pages.
Be sure to document all of these different types of projects on GitHub and your GitHub Pages portfolio.
Provide links to your projects from your LinkedIn profile. This makes it easy for potential employers to find and review your work.
Conclusion
The best way to showcase your Data Science skills is with these 5 types of projects:
- Data Cleaning
- Exploratory Data Analysis
- Interactive Data Visualizations
- Machine Learning
- Communication
Be sure to document all of these on your portfolio website.
If you need help setting that up, check out our tutorial video.
For even more learning resources, check out these top data science books.
Stay positive, keep building projects, and you’ll be well on your way to landing a job in data science!
Leave a comment below and let me know which type of project you’re going to build first.