Picture This! A Beginner’s Guide to Understanding Data Visualization with Matplotlib.

It’s been said that a picture paints a thousand words. If that’s the case, then I can just post a picture here and call this roughly thousand-word blog post complete! But that wouldn’t be any fun, would it? Maybe we should update the expression to “a picture paints a thousand points of data” and go from there. That would make more sense, as this blog is about data science.

By www.freakingnews.com

The point I am trying to make, perhaps poorly, is that pictures are extremely important and valuable in interpreting and communicating results about data. And just to be clear, by pictures I mean data visualization. So I am going to explore one of the most basic and fundamental data visualization libraries in data science: Matplotlib!

Matplotlib is an open-source data visualization library for Python that is built on NumPy.

If you’re not familiar with Python and NumPy, you can learn more by clicking on their respective links above. Basically, Python is a very popular programming language used by data scientists, and NumPy is a Python library for numerical and scientific computation. Together with Matplotlib, they let data scientists create various visual representations of complex data in a simple way.

So now that we sort of know what Matplotlib is, why should we use it and how should we use it?

As I mentioned earlier, data visualization is a critical part of exploring data and sharing results about the data. Not only is visualization important in data science, it’s a pretty big part of life as well. Take a minute to think of a good book you read that was made into a good movie. The book probably contained a lot more information (or data) than the movie. And it likely took a lot longer to read than the runtime of the movie, unless you’re a speed reader. But a well-made movie can be a good visual explanation of the general themes of the book.

By Quora

So a Matplotlib visualization is like the movie, and the data it summarizes is like the book. Make sense?

A major part of a data scientist’s job is to analyze and understand data, then interpret and communicate what that data means. Often the audience is not technical, so a simple, concise visual of the results can go a long way.

There are a lot of different ways that Matplotlib can accomplish this. But for the sake of brevity, I am going to focus on three of the most common ones. And I encourage you to continue exploring on your own.

Let’s focus on Histograms, Bar Graphs, and Scatter Plots.

A histogram shows the distribution of numeric data. A very common use of a histogram is to group people’s ages into ranges and show how often something occurs within each age range. Look at the example below of the ages of women who frequent the library.

By SoftSchools.com

You can see why this is useful as it clearly shows the ages of women who visit the library in the buckets below. By the way, the technical term for those buckets is actually “bins”.

And once Matplotlib’s pyplot module is imported, you can create a histogram with as little as one line of code. It’s as simple as defining your x variable, which in this case would be the ages of the women who visited the library, and deciding how many bins, or “buckets”, you want to group them into. See the example code below.

import matplotlib.pyplot as plt
plt.hist(x, bins=10)
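For a fuller picture, here is a minimal sketch of a complete histogram script. The ages are made-up sample values (they are not the data behind the library chart), and the choice of 5 bins is only for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical ages of library visitors (made-up sample data)
ages = [22, 25, 31, 34, 35, 38, 41, 42, 45, 47, 52, 53, 58, 61, 67]

fig, ax = plt.subplots()

# bins=5 splits the range from the youngest to the oldest age
# into 5 equal-width bins and counts the ages that fall in each one
counts, bin_edges, patches = ax.hist(ages, bins=5, edgecolor="black")

ax.set_xlabel("Age")
ax.set_ylabel("Number of visitors")
ax.set_title("Ages of library visitors")

# plt.show()  # uncomment to display the figure when running as a script
```

Here `counts` holds the number of ages in each bin and `bin_edges` holds the boundaries between bins, so there is always one more edge than there are bins.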

So in summary, histograms take numeric data that has been split into some number of groups, or bins, and show how many values fall into each one.

Bar graphs or bar charts are frequently used visualizations and you probably see several a week without even realizing it. Since I’m having pizza tonight, I thought I’d show a bar graph comparing pizza topping preferences! I’m not surprised that anchovies and garlic didn’t make it into the sample but that’s what will be on my pie this evening!

By fifth grade

A bar graph shows the categories being compared along the X axis, at the bottom, and the values they represent along the Y axis, on the left, which in this graph is the number of people. So X marks the spot and Y answers the question of how many people like that pizza topping.

Bar graphs can also be created with just a few lines of simple code. In this case we define our data, which is the pizza toppings and the number of people who like each one. After that we just plot it out per the example below.

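import matplotlib.pyplot as plt
data = {'Cheese': 10, 'Mushrooms and olives': 15, 'Sausage': 20, 'Pepperoni': 25}
toppings = list(data.keys())    # category labels for the X axis
values = list(data.values())    # number of people for the Y axis
plt.bar(toppings, values, width=0.4)
plt.title("Favorite Pizza Toppings")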
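To make the roles of the two axes explicit, the same chart can be drawn with axis labels added. This is only a sketch: it reuses the topping counts from the snippet above, and the label text is my own wording.

```python
import matplotlib.pyplot as plt

# Topping counts taken from the example above
data = {"Cheese": 10, "Mushrooms and olives": 15, "Sausage": 20, "Pepperoni": 25}
toppings = list(data.keys())
values = list(data.values())

fig, ax = plt.subplots()

# One bar per category; the bar height is the value for that category
bars = ax.bar(toppings, values, width=0.4)

ax.set_xlabel("Pizza topping")      # categories on the X axis
ax.set_ylabel("Number of people")   # value scale on the Y axis
ax.set_title("Favorite Pizza Toppings")

# plt.show()  # uncomment to display the figure when running as a script
```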

To sum it up, bar graphs are frequently used to show comparisons across categories with one axis showing the specific categories being compared and the other axis showing a value scale.

Scatter plots are cool in that they show the relationship and correlation between two variables while also showing any outliers that may exist in the data. Take a look at the simple scatter plot below.

By ThingLink

This shows the relationship between a husband’s age (on the X axis) and a wife’s age (on the Y axis). It also shows a positive correlation between the two. The correlation is positive because as the husband’s age increases, so does the wife’s age. And generally they are close to one another in age. Of course there are outliers, which are the dots that fall outside of the rest. For example, you will see a dot way up at the top right side of the chart. That says that there is a 77-year-old husband in the data who is very lucky to have an 83-year-old wife. And if you make it to either of their ages you should consider yourself lucky!

And creating a scatter plot is as easy as one, two, three! Like the above examples, you just need to define your data. In this case X would be a list of the ages of husbands and Y would be a list of the ages of wives. Then you plot it out like the example below.

import matplotlib.pyplot as plt
plt.scatter(x, y, marker='o')
plt.xlabel("Husband's age")
plt.ylabel("Wife's age")
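As a rough sketch of how this fits together, the script below plots ten hypothetical couples and uses NumPy to put a number on the correlation. The ages are invented for illustration (the last pair simply mirrors the 77 and 83 mentioned above), and using `np.corrcoef` is my own addition rather than something the chart itself required.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ages for ten couples (illustrative values only)
husband_ages = [25, 30, 34, 41, 45, 52, 58, 63, 70, 77]
wife_ages = [24, 28, 35, 39, 44, 50, 55, 64, 68, 83]

# Pearson correlation coefficient: values close to +1 mean that
# the two variables tend to increase together
r = np.corrcoef(husband_ages, wife_ages)[0, 1]

fig, ax = plt.subplots()
ax.scatter(husband_ages, wife_ages, marker="o")
ax.set_xlabel("Husband's age")
ax.set_ylabel("Wife's age")
ax.set_title(f"Husband vs. wife age (r = {r:.2f})")

# plt.show()  # uncomment to display the figure when running as a script
```

A value of `r` near +1 matches what the eye sees in the plot: the dots climb from the bottom left to the top right.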

To conclude, scatter plots are very useful in visualizing the relationship and correlation between two variables while also identifying any outliers.

There is so much more to Matplotlib and data visualization tools in data science. But I thought that this brief introduction could help you to get your feet wet and start exploring more on your own.

By CU Management

I am a Data Scientist with a background in fintech and account management. I am a graduate of Flatiron School's Data Science program.