Using Google Refine to easily uncover news stories in data

Google Refine is a powerful tool for spot checking whether a story might exist inside the data you’ve collected. It also might clue you in to another story based on data you already have.

Google refine allows you to both clean up, or normalize, data in a spreadsheet or CSV file and examine it for interesting trends, numbers, outliers, etc. It is unlike a typical spreadsheet program, and can help a reporter make a quick analysis of numbers or data without learning how to manipulate them on a spreadsheet.

You wouldn’t use Refine like you would Excel, to run formulas on existing values or anything like that. Refine is a supplement to that sort of analysis. It’s what you do before you start working with Excel to make sure you’re applying that analysis to the right data instead of what you may have thought was the right data.

Download and install Google Refine here or by searching for it (where else?) on Google.

We’re going to explore Google Refine using a partial file downloaded from Transparent.Utah.Gov. (Note: I have inserted errors into this file for training purposes.) This file contains 10,000 records of expenses filed by Salt Lake City in FY2011. You can download the file here.

When you open Refine, you’ll notice it brings up a web browser window. While it is browser-based, it is not Web-based. You aren’t uploading data somewhere. Refine is just using the browser to run the program. You can also use Refine without an Internet connection.

GETTING STARTED

Open Refine, and click “Create Project” on the left. At the top you’ll see a bunch of file types that Refine works with. In the middle you’ll see several ways you can get data into Refine. We’re going to use a file. So under “Get data from” click “This Computer” and click the “Choose file” button. Find your file and hit “Choose,” then hit “Next.”

What you see here is a preview of how your file will look with the current import settings in Refine. You may need to change things up to make it look right. For instance, there may be blank rows at the top you need to ignore. There may not be column headers. It may be a tab-separated file instead of a comma-separated file. Do whatever you need to do to make the preview start to make sense and click the “Update Preview” button to check your settings. (For this example file, the default settings should be fine.) Once you’ve done that, at the top right you’ll see a spot to give your project a name. Click the “Create Project” button when you’re ready.

NAVIGATING THE DATA

Now you’re looking at your main project screen. This is the spreadsheet you imported. In the large blue bar at the top you’ll see the total number of rows or records in your spreadsheet. Just below that you can select how many rows Refine shows you at once, either 5, 10, 25 or 50, and you can scan through the various pages of data. Choose to see 50 rows.

The wonderful thing about Refine is that it’s very difficult to screw up your data, and even if you do, there’s an unlimited number of undos to get back to the beginning. So don’t worry about contorting the data any way you want — you won’t hurt it.

Let’s dig into this spreadsheet and start testing out Refine’s capabilities.

THE MAIN TOOLS

Google Refine uses Facets and Filters to cut through data in a meaningful way. Using these, you can break a column or several columns down into various components.

Facets look at the contents of a column and group similar items together. Let’s look at a Text Facet on the Payee Name column of the spreadsheet. To get there, click the down arrow inside the “Payee Name” header. You’ll see the main array of Refine tools. Choose the first one, “Facet,” and choose “Text Facet.”

To the left of your table you’ll see the results of that facet.

The dark blue bar tells you which column you’re applying the facet to. The light blue bar shows how many results were found and allows you to sort the results by name (alphabetical) or count (most common to least common).

The default sort is by name, and you can see there are 68 choices — or 68 different names listed in the 10,000 rows of data in the Payee Name column. At the bottom of the results there’s a thin blue bar with two lines in it. You can drag that up or down to reveal more or fewer results.

Regardless of how you sort, next to each name you’ll see a number. That number represents the number of times that name appeared in the column you selected. Choose to sort by count. You’ll see that in this table 9,839 “Payee Name” results are “Not Applicable,” followed by several results for actual named entities.

Click on “Aramark Uniform Services.” On the left, that is highlighted, and on the right are all of the results in your spreadsheet that have Aramark Uniform Services listed as the Payee Name.

What if you want to look at all the results except those that have “Not Applicable” listed? Click “Not Applicable” on the left-hand side, and all the “Not Applicable” results show up on the right. On the left, in the blue bar next to “Payee Name” you’ll see the word “invert.” Click that. Now all of the entries except “Not Applicable” appear on the right.

You can also add other facets and filters to the results shown for a compounding effect. Put a text facet on the “DESC” column. You’ll get 53 choices. Choose “City Building Supplies.” Now the records on the right are only showing the ones where the “Payee Name” is something other than “Not Applicable” and where the “DESC” is “City Building Supplies.”

Let’s clear all the facets we currently have applied to the spreadsheet by clicking the X button in the blue bar at the top of each facet on the left-hand side. Now we’re back to seeing all of the data.

Now we’ll look at number facets. Click on the down arrow in the “Amount” column and select “facet,” then “numeric facet.” On the left you’ll see a graph showing you a range of numbers and how many records are in each segment of that range. On the right and left of that graph are two handles to allow you to select only the records in a certain numeric range. Grab the left handle and drag it right until it is just past the big block of records in the middle.

Now you’re looking at only records where the expense was greater than $100,000, with the largest being about $1.7 million. This is an easy way for you to find out the outliers of any spreadsheet — what’s unusual about the data. As a reporter, you may scan the list of Payee Names and wonder why R P Wetlands & Waterfowl LLC got more than $560,000 of Salt Lake City’s money. You might look at the Description field and ask why the city paid more than $5.3 million for City Data Processing Services on the same date.

Clear the numeric facet by clicking the X next to Amount in the blue bar on the left side.

Now click the down arrow in the “Posting Date” column again, choose “Facet” and “Timeline Facet.” On the left, you’ll see a similar graph to what you saw with the numeric facet, but this one is grouping all the items by the date they were filed. This looks less impressive because the data set we’re working with is not the full expense list for Salt Lake City for all of FY2011, it’s only a small subset. In fact, it’s only the expenses posted July 1-9, 2010. However, if we had a full dataset, you would be able to see whether there are certain dates throughout the year that the city consistently pays out more often, or whether it went a stretch without paying bills. You might add a numeric facet for any payment over $100,000 and see where those payments plot on the calendar.

Now for a note about Refine. You may have a column that appears to be dates. However, in spreadsheets, the program doesn’t know that 02-14-2012 or 4/15/03 is a date unless you indicate such, it thinks it’s really just another number or a bunch of text. If you run a timeline facet on what you believe is a date column, and it doesn’t seem to work, you need to tell Refine that the numbers in that column represent a date. To do that, you would click on the down arrow for the “Posting Date” column. Under the “Edit Cells” menu is a “Common Transforms” menu. You would select the “To date” option, and it would convert all those cells into actual dates.

Clear the timeline facet from the left-hand side.

CLUSTERS

Another great feature about Refine is its ability to “normalize” data. Sometimes even data that looks like it’s the same isn’t really, and you need to make it the same by normalizing it so that you can run an accurate analysis.

Run a text facet on the “Org1” column. You’ll see 21 choices. But if you look more closely you’ll see there are some things that don’t look right. For instance, there’s an “Airport” and an “Airort.” It looks like there’s a misspelling in the data. Then there’s a “Fire” and “Fire Department,” which are really the same thing. Refine makes this easy to fix.

There’s a “Cluster” button on the right-hand side of the list of choices. Click that. What comes up is a menu to help you clean the data. Refine combs all the data you have in the column to see whether some of it might represent the same thing. It can do this in two ways. The first method is called “key collision,” which you can see labeled just above the main box on the page.

What Refine is asking here is whether you think these similar items are, in fact, supposed to be the same. In the first instance, it found “Community Development Dept.” and the same phrase with a space at the end. Those should be the same, so click the checkbox under the word “Merge,” then at the very bottom click “Merge Selected & Re-Cluster.”

Now, we’re going to check the other method. Click on the box next to “Method” where it says “Key Collision” and change it to “Nearest Neighbor.” Now Refine brings up an issue with one item where “Department” is misspelled. Check Merge again and click the Re-Cluster button.

Some iterations are more difficult to spot than others, so Refine gives you a few options to make sure your data is squeaky clean. At the top are two numbers, the first one is Radius and the second is Block Chars. Changing these numbers tells Refine to look at your data in more ways to see whether some of it is incorrect.

Change the Block Chars number to 1. You can see the misspelling of “Airport,” which you can now merge. I suggest playing around with changing the numbers in both the Radius and Block Chars, as well as chaining the Distance Function to get a sense of what the different features might do. Can you find which set of features brings to light that there are several rows of “Public Services” that should merge with “Public Services Department”?

When you’re done, click “Close” at the bottom of the screen.

Now you’ve got a shorter list of options in the ORG1 window on the left. Looking down the list you can see that there are still some changes that can be made. For instance, there’s a listing for “Fire” and a listing for “Fire Department.” You can fix those manually. When you hover over the word “Fire,” the words “Edit” and “Include” appear next to it. Select “Edit.” You can now add “Department” and the “Fire” category will merge with the “Fire Department” entries.

Now you’ve properly sorted the data, and there are actually 16 choices instead of the original 21.

Finally, look at the “DESC” column. Notice how some entries are in all capital letters, and some are in lowercase? You can easily make this uniform by clicking the down arrow in the “DESC” column, selecting “Edit Cells,” then “Common Transforms” and then “To titlecase.” This will change the entire column’s entries to titlecase, which is much easier to read. You can do the same to the “Payee Name” column.

EXPORTING

When you’ve massaged the data to where you want it, you can export the file to Excel. Refine will only export the data you have selected. If you want to export the entire data set, clear all of your facets, and click “Export” in the top right corner of Refine. You can export to a variety of file types.

If you have facets applied to narrow the data, when you click export only that portion of the entire data set will be exported. So it’s important to note whether you have facets applied before you export.

Finally, readers will look at your PDFs

PDFs have long been a double-edged sword on our site.

They’re great because you can give readers so much more information without cluttering up your story with a bunch of numbers or names or extraneous details better suited for a sidebar or graphic. But as interesting or useful as that information may be, sticking a link to the PDF file in the story has never panned out traffic-wise. Readers just don’t seem to care enough about the information to click the link (or don’t notice it’s there because they don’t read every word) to get all that goodness.

Now we have a workable solution that won’t completely clutter the story. We’re using Scribd to embed documents inside the story text just like we do videos. The early returns are great. Far more people are reading the material in the few PDFs we’ve embedded than ever did through traditional linking.

So if you’ve got a PDF, text document of nearly any flavor (including those you can’t read because you don’t have the software) or PowerPoint presentation, it can now be shared in a meaningful way. Check with your friendly Web editor for details.