A practical Data Visualization exercise: historical weather records
I just spent $3.00 buying a public dataset from weather stations. A few days later, these beautiful graphs came out of it: made to see, learn and understand, not to analyse.
The task of a #DataViz designer is not an easy one. The goal of a good data visualization job is not to extract final conclusions, nor to analyze information in full detail. For that, we use AI bots or some other kind of data scientist. Analyzing machines (or minds) don’t need to visualize graphs at all: they use computational or mathematical tools that, in many cases, don’t produce a single plot.
Data Visualization is intended to be used by a human mind (provided with eyesight) to help it understand big sets of data that would take much longer to process by reading the raw, original files, with their large amounts of text and numbers. The human brain is trained to recognize patterns in images, still in a much more efficient way than a computer does. We do it constantly. That’s why converting abstract data into visual patterns is a good way to transform that data into knowledge.
In this article I describe a practical #DataViz exercise I just made at home, breaking down the key aspects of a typical Data Visualization project and how I handled them to achieve the final result.
Step 1: Start with a clear idea
Everything started with a very basic question: what’s the average weather of a single day of the year, in a given place? In other words: what is a typical 24th of March like in New York City? What are the odds that it’s going to be cloudy, rainy or sunny?
Applying basic statistical methods to historical weather records would do the job but, what if I wanted to know that for each and every day of the year, in many places?
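That per-day averaging can be sketched in a few lines. A minimal Java version (the method name and data layout are my own, not taken from the project):

```java
import java.util.*;

public class DayAverage {
    // Average a variable (e.g. T_MEDIA) per day of the year, across all
    // available years. The input maps day-of-year -> readings from every
    // year that has data for that day; missing days simply have no entry.
    static Map<Integer, Double> averagePerDay(Map<Integer, List<Double>> readings) {
        Map<Integer, Double> avg = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> e : readings.entrySet()) {
            double sum = 0;
            for (double v : e.getValue()) sum += v;
            avg.put(e.getKey(), sum / e.getValue().size());
        }
        return avg;
    }
}
```

Running this over one station gives a 365-entry answer to the original question; the problem is doing it for every station, which is what the rest of the article is about.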
Step 2: Get the (big) dataset
As I was pretty sure that such a dataset might exist, I rushed to the AEMET website (the national weather service in Spain) looking for it. After a few minutes I found they *sell* the whole dataset of weather station records, many of them storing data since the early 20th century. As I found the price very affordable (about 2.50€), I just paid and downloaded it.
I got the raw material.
Step 3: Open the vault
The next task is to open the package and see what’s inside. You never know what it’s going to be: it could be a rare file format, an aggregated file with mixed data, etc.
In my case it was basically a list of 291 files in CSV format. Every file was named after a long code, so another file relating those codes to the full name of each station was provided. Every single CSV file had a set of columns for several basic weather variables: maximum temperature, average temperature, highest pressure, rain… so, initially, the dataset was perfect for my purpose.
Step 4: Clean the data
One basic piece of advice I would give to beginners in this field is: never expect a perfect dataset. As I opened some of the 291 CSV files to overview them, I noticed that they were anything but homogeneous. Some stations started recording data in 1920, others in 1990 or 1963, and so on. Not only that: many stations started recording on the 1st of January of some year, but many other files start their first row in April, September or any other random day of the year.
And, of course, the datasets aren’t complete. Some stations didn’t record average temperatures for many weeks (I don’t know why), and some others only started recording certain variables after a few decades of records.
Completely cleaning, sorting out and reducing this dataset would take hours and headaches in front of spreadsheet software. So I decided to drop that job and focus on plotting everything, including the missing data.
Step 5: Choose the graph
One of the most difficult parts of Data Visualization is choosing the right type of graph. It depends on the number of variables you need to render, and on the intention of the graph. In my case, I wanted to represent the evolution of the average temperature (T_MEDIA) and the total amount of rain (PRECIP) for every day of each available year in the dataset. That leads immediately to a classical calendar view: a 365-cell grid, divided by month as a reference, or not divided at all. That’s right. But I needed to plot another dimension: time. The typical CSV file contained more than 10,000 rows of data, representing 30, 40 or 50 years of historical records.
How could I cast that time evolution into a single 365-cell calendar graph?
One possible solution is to lean on animation: a live graph that plays one year every second or so. But the intention was to get something that could be printed easily without losing any information.
Then an image came to my mind. When you chop the trunk of a tree, time is revealed in the concentric rings of the cross-section: each ring belongs to a past year. If every “ring” in my plot were a circle, it would make a full round every 360º, very close to the 365 days of a year (or 366 in a leap year).
So, an arc of about 1º would represent the temperature or rain of one single day, and each concentric ring would do the same for each passing year.
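The day-to-angle mapping is trivial; here is a sketch in plain Java (the function name is mine). Conveniently, in Processing, angle 0 sits at the three o’clock position and increases clockwise, which matches the layout of the final plots:

```java
public class RingMap {
    // Map a day of the year (1..daysInYear) to an angle in degrees.
    // Day 1 sits at 0 degrees; each day spans roughly 1 degree of arc.
    static double dayToAngleDeg(int dayOfYear, int daysInYear) {
        return (dayOfYear - 1) * 360.0 / daysInYear;
    }
}
```

Dividing by the actual number of days in each year (365 or 366) keeps every ring closed, at the cost of each day’s arc being slightly narrower than a full degree.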
Perfect. Now it’s time to plot!
Step 6: Data mapping and drawing
So, on the one hand I had hundreds of CSV files with 10,000+ rows of data each. Each row had more than a dozen columns, almost half of them empty.
On the other hand, I needed a method that would let me quickly read those files unattended, no matter if they were 12 or 300, extract the right data from them, cast it into a ~1º arc painted in a way that could be easily identified with hot/cold temperatures, and finally save the rendition to a digital image file.
There are plenty of software development tools that would allow me to do so but, as I lacked time (this project was made in my scarce spare time), I needed to bet on the best: the Processing language, a Java-based coding tool that is, in my humble opinion, the fastest and easiest tool for building graphics-based applications.
Processing is a high-level layer that enables even non-seasoned developers to access all the power of Java while sparing them much of its complexity and abstraction. Processing allowed me to read a CSV file with just one single line of code, and to loop through its contents almost as easily.
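Outside Processing, the row parsing behind that one-liner would look something like the sketch below. T_MEDIA and PRECIP are actual columns of the dataset; the FECHA (date) column name and the comma separator are my assumptions:

```java
import java.util.*;

public class CsvRow {
    // Split a CSV header line and a data line into a column -> value map.
    // Empty fields are kept, since missing data is frequent in this dataset.
    static Map<String, String> parse(String headerLine, String dataLine) {
        String[] headers = headerLine.split(",");
        String[] values = dataLine.split(",", -1); // -1 keeps trailing empty fields
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < headers.length && i < values.length; i++) {
            row.put(headers[i].trim(), values[i].trim());
        }
        return row;
    }
}
```

Keeping the empty fields (instead of dropping them) is what later allows the plotter to leave blank pixels wherever a station has no record.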
The applet I developed was able to:
- Read every line of the file
- Extract the year from the date
- Cast every date into a position on a circumference
- Draw an arc whose color was linked to the average temperature, or whose saturation was linked to the amount of rain. In Data Visualization you don’t have to care too much about the accuracy of the variable mapping: you can adjust the hue, saturation or brightness of the pixels as long as they are proportional to the values. What matters here is to show clear patterns, not to retrieve accurate original data.
- Draw a new colored arc for every year in the data file, each new ring slightly bigger than the previous one, so the rings go from the past (inner rings) to the present (outer rings)
- Save the plot in a PNG file
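The temperature-to-color step of that list can be reduced to a pure function. This is only a sketch of the idea (the thresholds and names are mine, and the real sketch uses Processing’s HSB color mode):

```java
public class TempColor {
    // Map an average temperature (Celsius) to a hue on a cold-to-warm scale:
    // tMin and below -> blue (240 degrees), tMax and above -> red (0 degrees).
    static float tempToHue(float t, float tMin, float tMax) {
        float clamped = Math.max(tMin, Math.min(tMax, t));
        float norm = (clamped - tMin) / (tMax - tMin); // 0 = coldest .. 1 = hottest
        return 240f * (1f - norm);
    }
}
```

As said above, the exact bounds don’t matter much as long as the mapping stays proportional: the eye reads the pattern, not the values.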
Well, with my code, all this duty was performed in less than 5 minutes for 289 data files, resulting in a total of 478 PNG image files: one set for the average temperatures and another one for the rain data.
NOW: The results
Okay but, what about those interesting insights? After taking a look at those beautiful 478 images, you can easily, and very quickly, label the climate of the place you are looking at. Let’s review some examples:
The graphs are quite self-explanatory; nevertheless, some labels were added. Every year starts with January at the “three o’clock” position, going clockwise, so April begins at the six o’clock position, July at the nine o’clock position, and so on.
When the app found missing data (which happened very often), it left the corresponding pixels blank. The color legend is the usual cold-to-warm one.
This way, it’s straightforward to identify the seasons. Remember that this dataset belongs to weather stations across Spanish territory, so it’s the Northern Hemisphere. The yellow-orange-red pie belongs to summer; the blue-violet pie belongs to winter. With that settled, let’s overview some other cases from other towns:
This ring clearly shows a cooler place, but one without harsh freezing winters either. It actually belongs to the beach town of Llanes, on the northern Atlantic coast.
This ring suggests a slightly warm place, with a very mild winter and a somewhat unusual summertime pattern. This plot belongs to El Hierro, the smallest island of the subtropical Canary Islands archipelago.
This plot belongs to Ciudad Real, an inland city with large temperature differences between the summer and a longer winter. In large time series like this one, we can try to “slice” the graph in order to appreciate (or not) more detailed patterns. Are current summers hotter than 50 years ago? The strip taken from the slice suggests that past summers were slightly milder than current ones. Don’t forget that the goal of a #DataViz job is to find patterns qualitatively, not to measure anything on the graph!
What about rain?
Let’s view some plots of the rainfall data:
This ring suggests a place where rain is not tied to a specific season; it’s just slightly more intense and frequent in autumn. This ring belongs to Santander’s airport, a city on the Cantabrian coast (Atlantic climate).
This “inverted C” pattern speaks of a place that is very dry during the summer, with mild rain scattered over the rest of the year and heavy episodes only in autumn; a typical pattern of the Mediterranean climate. The ring belongs to Malaga, a Mediterranean coastal capital city in southern Spain.
This graph is surprising: its “C” pattern suggests a place with a dry winter and a humid spring, while being dry on average over the year. This ring belongs to Teruel, an inland city of central-eastern Spain where winters tend to be extremely cold, so maybe snowfall was not recorded as rainfall. The place is dry during the rest of the year, with frequent thunderstorms, just as the graph suggests with its scattered-dots pattern over almost half of the year.
These ring-like graphs, while infrequent in common data-handling software like Microsoft Excel, help users identify patterns among thousands of data points in a single, fast overview of an image. As the user’s eye gets trained to this kind of rendering, more and different data can be fed into the app, representing daily information over many years.
Overlapping related rings representing different variables could also be useful to extract more insights from big data with just a quick look at a graph like this:
If you liked this article or its subject, or if you are interested in Data Visualization projects, please feel free to contact me or leave some feedback in the comments area.