9 min read

How to process unstructured data

Move from scattered pieces of information to a usable database for identifying patterns, verifying facts, and forming the basis for your reporting

After conducting audience research to identify information needs, creating an information map of the sources that house the data you will need, and conducting targeted manual and automated data gathering from those sources, you are on your way to original reporting that fills gaps in your intended audience’s information needs.

It's likely that the data you have collected is messy, unstructured, and full of noise.

So before arriving at a reporting hypothesis and concept based on the gathered data, you will need to clean the data to make sense of it.

That step, the subject of this post, is called information processing: it turns that raw data into something you can use.

It’s about taking scattered pieces of information and building a cohesive picture, allowing you to identify patterns, verify facts, and ultimately form the basis for your reporting.

Jump: WHAT is information processing | WHY we process data | HOW to process data | WHAT comes next

For newsrooms with limited resources or operating in challenging environments, this cleaning phase is especially important. A well-organized dataset that meets information needs gives you a unique ability to remain relevant even when operating from overseas, especially in cases where censorship obstructs or distorts public debate on those issues.

Don’t neglect your information map, either. Remember that maintaining your information map is the difference between a one-off report and a sustainable, living database you can return to again and again for your reporting.

Security protocols should also be in place, including redacting personally identifying information from your database and encrypting both the database and your communications.

What is information processing?

Information processing is the methodical work of organizing, standardizing, and structuring your raw data so that you can search and analyze it. In our case, the purpose of doing this is to spot patterns that will help us arrive at a reporting hypothesis to test and news products to create.

The goal of information processing is to create a clean, consistent dataset that allows you to ask and answer complex questions. It's about adding structure and meaning to every piece of information you've gathered.

A lot of the information you'll gather—especially from interviews, social media, or forum discussions—is unstructured. It's text, images, and audio that don't fit into a neat row-and-column format. Your processing workflow must be able to transform this qualitative data into a format that can be analyzed alongside quantitative data.

One way to do this is by tagging your data with keywords or categories that can be searched and sorted in ways the raw data itself cannot. This way, your previously unstructured data can be integrated with your more structured data through these common tags.

For example, if you are searching for data on a specific job role, your images or interviews related to that role would come up when you search your database.
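To make this concrete, here is a minimal Python sketch of what tag-based searching can look like. The record fields, tag names, and sample entries are illustrative, not our actual schema.

```python
# Minimal sketch of tagging unstructured items so they can be searched
# alongside structured records. Field and tag names are illustrative.

records = [
    {"type": "image", "description": "Photo of a job ad outside a factory",
     "tags": ["construction worker", "job ad", "image"]},
    {"type": "interview", "description": "Worker describes overtime rules",
     "tags": ["construction worker", "working conditions"]},
    {"type": "row", "description": "Scraped listing with wage and benefits",
     "tags": ["delivery driver", "job board"]},
]

def search_by_tag(items, tag):
    """Return every item carrying the given tag, regardless of its format."""
    return [item for item in items if tag in item["tags"]]

for hit in search_by_tag(records, "construction worker"):
    print(hit["type"], "-", hit["description"])
```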

💡
How we cleaned and organized our image database

In our data gathering process, we came across images posted online of job ads in the physical world. It was possible to look at those images and fill in some of the appropriate columns of our database, such as the advertised wage, type of job, and benefits. Other information required a manual search, such as looking up the district in which the company was located.

Other images we collected were not so straightforward. We collected images of workers’ physical environments, and we tagged them with the type of job, the fact that it was an image, and a description of what the image was showing.

Depending on the amount and type of images or other non-standard data you are collecting, consider which details would be most useful for locating or using the data without having to go back to the image itself, unless the image proves key to your reporting.

Why we process information systematically

Our systematic approach to information processing directly supports service-oriented journalism. By creating a structured and searchable database of information, we are building a tool that can serve our audience's needs more effectively.

It moves us from a reactive model of journalism—where we report on an event after it happens—to a proactive model, where we can anticipate and address information gaps before they become critical.

A clean, well-organized dataset allows us to:

  • Identify trends: We can quickly spot patterns that would be invisible in a disorganized mess of data.
  • Answer people’s questions: We can use the database to answer specific, real-time questions from our audience, providing them with the information they need to make decisions.
  • Uncover systemic issues: By standardizing and connecting different data points, we can uncover systemic issues and patterns of wrongdoing that are not visible from a single source of information.

The final product of this process is a clean, structured, and searchable master database that we can use to analyze the data and to form and test hypotheses for our reporting.

The information processing workflow: A step-by-step guide

Our workflow is designed to be a repeatable process that turns raw, messy data into a clean, analyzable dataset. It's an adaptable process that can be scaled up for a large investigation or scaled down for a more focused project.

Step 1: Ingestion and cleaning

The first step is to get all your raw data into a single, centralized location. This could be a cloud-based spreadsheet, an SQL database, or a dedicated data management platform. The goal is to bring all the information—whether it's from web scraping, manual collection, or crowdsourcing—into one place.

💡
How we did it

We use scripts to automate the ingestion of scraped data directly into our master database.

For manual entries, we use a simple, standardized data entry form.

This doesn't have to be complex: For one of our projects, we were just working through Google Drive. On another project, we migrated to Notion.

We assigned team members different roles, each in charge of managing and cleaning different kinds or sources of content. At our regular team meetings, we discussed our progress and any need to change our process across the board.
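As an illustration of what an ingestion script can look like, here is a minimal Python sketch that pulls scraped CSV files into a single SQLite table. The folder layout, column names, and SQLite target are assumptions for the example; the same pattern applies if your master database lives in a cloud spreadsheet or Notion.

```python
# A minimal ingestion sketch, assuming scraped data arrives as CSV files
# in a local "scraped" folder. File names, column names, and the SQLite
# target are illustrative.
import csv
import sqlite3
from pathlib import Path

conn = sqlite3.connect("master.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS raw_entries ("
    "source_file TEXT, job_title TEXT, wage TEXT, benefits TEXT, url TEXT)"
)

for csv_path in Path("scraped").glob("*.csv"):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            conn.execute(
                "INSERT INTO raw_entries VALUES (?, ?, ?, ?, ?)",
                (csv_path.name, row.get("job_title"), row.get("wage"),
                 row.get("benefits"), row.get("url")),
            )

conn.commit()
conn.close()
```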

After your data is in one place, it needs to be “cleaned.” This is where you address the mess: removing duplicates, correcting typos, and handling missing values.

One challenge we encountered was that some data was missing information we were seeking. For example, when gathering job data in one of our projects, some data sources listed wages but not benefits. Instead of leaving those columns blank, we noted that the data was missing and why.

Noting these gaps and omissions was important in our verification stage, for a better understanding of what we know and don’t know, and why.
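Here is a small Python sketch of the kind of cleaning pass described above, using pandas to normalize formatting noise and drop duplicates. The column names and sample rows are invented for illustration.

```python
# A minimal cleaning sketch: normalize obvious formatting noise, then
# drop exact duplicates. Column names and rows are illustrative.
import pandas as pd

df = pd.DataFrame({
    "job_title": ["Construction worker", "construction  worker", "Cook"],
    "wage": ["$2,500/month", "$2,500/month", "$100/day"],
})

# Normalize whitespace and capitalization before deduplicating,
# otherwise near-identical rows slip through.
df["job_title"] = (
    df["job_title"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)
df = df.drop_duplicates()
print(df)
```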

Step 2: Standardization and structuring

In this step, you impose a standardized structure on your data. This involves defining a schema: a set of rules that dictates what information is collected and how it's formatted.

The schema we used for our job data includes fields like these (a minimal code sketch of the schema follows the list):

  • Job title: A standardized category (construction worker)
  • Wage: A single, normalized value ($2,500/month)
  • Benefits: A text description of other benefits, such as meals provided
  • Location: A standardized location tag (a district of a city)
  • Source: The original URL or other source where the data was found
  • Source type: A category for the type of source (social media, job board, interview, etc.)
  • Notes: A text field for any additional qualitative observations
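Here is one way to express such a schema in code, using a Python dataclass. The field names mirror the list above; the types, example values, and the monthly-USD wage convention are our assumptions for illustration.

```python
# A sketch of the schema as a typed record. Types and example values
# are assumptions; adapt them to your own project.
from dataclasses import dataclass

@dataclass
class JobEntry:
    job_title: str       # standardized category, e.g. "construction worker"
    wage_monthly: float  # single normalized value, e.g. USD per month
    benefits: str        # free-text description, e.g. "meals provided"
    location: str        # standardized location tag, e.g. a district name
    source: str          # original URL or other source
    source_type: str     # "social media", "job board", "interview", ...
    notes: str = ""      # any additional qualitative observations

entry = JobEntry(
    job_title="construction worker",
    wage_monthly=2500.0,
    benefits="meals provided",
    location="district-a",
    source="https://example.com/listing/123",
    source_type="job board",
)
print(entry)
```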
💡
How we standardized different income formats

For our project, we needed to standardize different wage formats (e.g., "$100/day" or "$2500/month" or “$32,500/year”) into a single format for comparison.

We chose a monthly wage structure, because this was the most commonly reported format in our data, and it is the way that our intended audience talks about their income.

To achieve this, we identified entries that did not follow this format and multiplied or divided appropriately to arrive at a monthly wage.

Although there is some arbitrariness to this—the actual number of working hours or days per month may vary—we processed all the entries in the same way for better comparison during our analysis.

If we had not standardized these entries for fear of introducing slight inaccuracies, the data would simply have been unusable to us.
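A minimal Python sketch of that conversion, assuming a fixed number of working days per month; the exact constant matters less than applying it uniformly across all entries.

```python
# Normalize wage strings like "$100/day" or "$32,500/year" to a monthly
# figure. DAYS_PER_MONTH is an assumption; adjust it to your context.
import re

DAYS_PER_MONTH = 26

def to_monthly(wage: str) -> float:
    amount = float(re.sub(r"[^\d.]", "", wage.split("/")[0]))
    period = wage.split("/")[1].strip().lower()
    if period == "month":
        return amount
    if period == "day":
        return amount * DAYS_PER_MONTH
    if period == "year":
        return amount / 12
    raise ValueError(f"Unrecognized wage format: {wage}")

for raw in ["$100/day", "$2500/month", "$32,500/year"]:
    print(raw, "->", round(to_monthly(raw), 2), "per month")
```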

By standardizing every entry, you make your data searchable and comparable. This allows you to perform analysis later.

Step 3: Categorization and tagging

Once your data is clean and structured, you need to add layers of metadata that will enable analysis. This is where you apply tags and categories that reflect your research questions (a small tagging sketch follows the list).

  • Thematic tagging: We use a standardized set of tags to identify key themes or topics, such as "wage dispute," "unsafe working conditions," or "recruitment scam." This allows us to quickly filter for and find relevant information.
  • Geographic tagging: We tag every piece of information with a specific location, which allows us to create geographic heatmaps and visualize trends in different regions.
  • Actor and network analysis: For qualitative data from interviews or social media, we tag key individuals, organizations, and their connections. This allows us to build a network map of who is influencing whom in the information ecosystem.
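As a simple illustration of thematic tagging, here is a Python sketch that applies tags based on keyword lists. The theme names and keywords are examples, not our production lists; in practice these lists grow out of the cleaning pass described later.

```python
# Rule-based thematic tagging: assign a theme when any of its keywords
# appears in the text. Keyword lists are illustrative.
THEME_KEYWORDS = {
    "wage dispute": ["unpaid", "withheld wages", "salary delay"],
    "unsafe working conditions": ["no helmet", "injury", "accident"],
    "recruitment scam": ["deposit required", "fake agency"],
}

def tag_themes(text: str) -> list[str]:
    text = text.lower()
    return [
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

note = "Workers report withheld wages and an accident on site last month."
print(tag_themes(note))  # ['wage dispute', 'unsafe working conditions']
```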
💡
What insights tagging revealed to us

One of the tags we used was for the required education level of each job.

By creating tags describing different education levels, from “none required” all the way up to post-graduate degrees, we could quickly see that education level was not a factor in the job role we chose for research.

This tagging was important because our intended audience expressed a need for information on educational opportunities to improve their job prospects, but this specific job role has decent pay and does not require a certain education level.

This information may not be known to our intended audience, so this finding from our tagging can fill a gap in their information needs.

The cleaning stage is also a good time to brainstorm tags. As you work through the data on that initial pass, you may notice types of information that you weren’t specifically looking for but that may become useful.

Common challenges and workarounds

Information processing is not without its challenges, especially when working with data from a restrictive environment.

Language and translation: Much of the data from our project is in a language not shared by everyone on the research team. This requires a system for translation and a way to handle nuances and slang that might be missed by automated tools.

  • Workaround: We use a combination of automated translation for initial parsing and a human translator for a final review. This hybrid approach ensures both speed and accuracy.
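A sketch of how that hybrid workflow can be encoded, with the machine-translation call left as a placeholder for whatever tool or API you use, and every entry flagged for human review:

```python
# Hybrid translation sketch: machine-translate for a first pass, then
# queue everything for a human reviewer. `machine_translate` is a
# placeholder, not a real library call.

def machine_translate(text: str) -> str:
    # Placeholder: call your translation tool or API of choice here.
    return f"[auto] {text}"

def process_entry(entry: dict) -> dict:
    entry["translation_draft"] = machine_translate(entry["original_text"])
    entry["translation_status"] = "needs human review"  # never skip this step
    return entry

entry = process_entry({"original_text": "original-language text goes here"})
print(entry["translation_status"])
```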

Incomplete or missing data: Data from censored or sensitive sources is often incomplete.

  • Workaround: We have a specific protocol for handling missing data. Instead of just leaving a cell blank, we tag it with a reason for the missing data (e.g., "redacted by censor," "not provided by source"). This allows us to track patterns in missing information, which can be a story in itself.
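Here is a small Python sketch of that protocol: instead of a blank cell, each missing value carries a reason label. The column names, sample rows, and reason mapping are illustrative.

```python
# Record why a value is absent rather than leaving the cell blank.
import pandas as pd

df = pd.DataFrame({
    "job_title": ["cook", "driver", "welder"],
    "benefits": ["meals provided", None, None],
})

# Hypothetical mapping of rows to reasons, filled in during review.
missing_reasons = {1: "not provided by source", 2: "redacted by censor"}

df["benefits_missing_reason"] = df.index.map(missing_reasons)
df.loc[df["benefits"].notna(), "benefits_missing_reason"] = ""

# Patterns in the reasons themselves can point to a story.
print(df["benefits_missing_reason"].value_counts())
```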

Scalability: Processing a massive amount of data can be technically challenging.

  • Workaround: We use lightweight, open-source tools that can be run on a standard laptop. For larger datasets, we use affordable cloud services that can handle the workload without a huge upfront investment.
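For example, one laptop-friendly approach is to stream a large file in chunks into SQLite rather than loading it all at once. The file name, chunk size, and table name below are assumptions for the sketch.

```python
# Process a large CSV in chunks so it fits in memory on a standard
# laptop, appending cleaned rows to SQLite as you go.
import sqlite3
import pandas as pd

conn = sqlite3.connect("master.db")

for chunk in pd.read_csv("large_scrape.csv", chunksize=50_000):
    chunk = chunk.drop_duplicates()
    chunk.to_sql("raw_entries", conn, if_exists="append", index=False)

conn.close()
```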
💡
Scaling this process for your newsroom

- For a small newsroom: You can do all of this using free or low-cost tools like Google Sheets and Airtable. Use a shared spreadsheet with a clear schema to standardize your data. Use a system of color-coding and notes to tag and verify information. The key is to be systematic and consistent, even on a small scale.

- For a large newsroom: You can invest in more powerful tools like a relational database, such as PostgreSQL or MySQL, and a data visualization platform like Tableau or Power BI. The principles remain the same, but the tools allow you to handle larger, more complex datasets.

Next steps for using your data

By transforming your raw data into a clean, structured, and searchable database, you create a powerful tool for your newsroom. This has allowed us to move beyond anecdotal evidence and start seeing clear, quantifiable patterns.

We now have a clean dataset that we can use to generate story hypotheses, create service-oriented tools, and build a sustainable resource.

With our database, we can ask specific questions such as, "What percentage of job listings offer benefits?" or "How have average wages changed over the last two years?" These questions lead to concrete, data-driven stories.
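As a sketch of how such questions translate into queries, here is a small Python example against an invented stand-in for the master database; the column names and figures are assumptions.

```python
# Example queries against a toy stand-in for the master database.
import pandas as pd

df = pd.DataFrame({
    "benefits": ["meals provided", "", "housing", "", ""],
    "wage_monthly": [2500, 2200, 2600, 2100, 2400],
    "year": [2023, 2023, 2024, 2024, 2024],
})

# What percentage of job listings mention benefits?
pct_with_benefits = (df["benefits"] != "").mean() * 100
print(f"{pct_with_benefits:.0f}% of listings mention benefits")

# How have average wages changed over time?
print(df.groupby("year")["wage_monthly"].mean())
```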

We can also use our database to build a tool for our audience, such as a searchable database of jobs or a wage comparison calculator.

Because of the initial investment and heavy lifting done up to this point, the database we've built is a long-term asset. When we update it with new information, entering it in our standard format takes far less effort. Our database is a durable foundation for our journalism in this project and future ones.

In the next post, we will talk about verifying the data, analyzing it and getting to a reporting hypothesis, and creating news products for our intended audience.

Follow our process through the Audience Research, Reporting, and other phases. If you haven’t already, sign up for our newsletter so you don’t miss out.

If you have feedback or questions, get in touch at hello@gazzetta.xyz.