Digital Technology
Nov 01, 2017

Shedding light on dark data

In May, Apple announced its acquisition of the startup Lattice Data. The acquisition is just one of many in recent years, all centered on technology firms that work with unstructured data, or “dark data.” You might be wondering what unstructured data is and what makes it dark. And why are tech titans like Apple eager to acquire these startups? To answer these questions, let’s first take a look at the differences between structured and unstructured data.

Structured vs. unstructured data

Today’s businesses are used to working with structured data. It’s information housed in databases and spreadsheets – usually numbers that can be easily organized in rows and columns. Because it’s in a structured format, it can be sliced, diced and configured to create actionable insight. Unstructured data, on the other hand, is information that exists outside of databases and spreadsheets. It’s embedded in disparate sources such as web pages, PDFs, documents, emails and social media. Often, it’s free text with a lot of variation in language, making it difficult to compile, process or analyze using traditional IT algorithms. As a result, executives are left in the dark, unable to use it to make business decisions.

Unfortunately, today’s businesses have much more unstructured data than structured data. It’s estimated that unstructured data accounts for over 80 percent of all business data. And given trends in data proliferation, it’s projected to grow to nearly 95 percent by 2020. This means even the most data-driven companies are only working with a small fraction of their business-critical information. They’re essentially sitting on a treasure trove of untapped insight.

Emerging players like Lattice Data are using artificial intelligence (AI) to help companies structure their unstructured data and make it not only digestible, but also usable.

Structuring the unstructured

In principle, structuring unstructured data involves:

  1. Identifying a business problem or question which can be solved through insight derived from unstructured data
  2. Reviewing the various data sources from across the organization
  3. Determining what each document is about
  4. Pinpointing where the relevant information is located within a document using context and background knowledge
  5. Extracting the information into a desirable, accessible format for analysis
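Steps 3 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword lists, the "amount due" pattern, the file names and the sample texts are all hypothetical, and simple rules stand in for the AI techniques discussed in this article.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists for step 3 (deciding what each document is about).
DOC_TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "contract": ("agreement", "hereinafter"),
}

# Hypothetical pattern for step 4 (locating the relevant value from its context).
AMOUNT_PATTERN = re.compile(r"amount due[:\s]*\$?([\d,]+\.\d{2})", re.IGNORECASE)


@dataclass
class Record:
    """Step 5: one structured row per source document."""
    source: str
    doc_type: str
    amount_due: Optional[float]


def classify(text: str) -> str:
    """Return the first document type whose keywords appear in the text."""
    lowered = text.lower()
    for doc_type, keywords in DOC_TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return "unknown"


def extract(source: str, text: str) -> Record:
    """Classify a document, then pull the amount due if it is an invoice."""
    doc_type = classify(text)
    match = AMOUNT_PATTERN.search(text) if doc_type == "invoice" else None
    amount = float(match.group(1).replace(",", "")) if match else None
    return Record(source, doc_type, amount)


# Made-up sample documents, standing in for emails and scanned files.
documents = {
    "email_017.txt": "Hi, please find the invoice attached. Amount due: $1,250.00 by Friday.",
    "scan_203.txt": "This Agreement is made between the parties, hereinafter the Client.",
}

records = [extract(name, text) for name, text in documents.items()]
for record in records:
    print(record)
```

Each document ends up as one row with the same fields, which is what makes the result usable in a database or spreadsheet.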

These tasks may seem pretty straightforward when a human is doing them on a small scale, but they become very difficult and complicated if you’re trying to work through millions upon millions of data sources, where the information you’re looking for can be in a variety of file formats and the same thing can be described in many different ways.

However, with AI, a machine can do it all for your organization – no matter the scale of the business – and fast. Specifically, AI branches, including natural language processing (NLP), computational linguistics and machine learning, enable the machine to understand contexts of information to pull the relevant data that your business is looking to structure.

NLP processes natural human speech and written language against existing data and patterns to understand what’s being looked for or said in the document. Similarly, computational linguistics extracts relevant information based on the linguistic context of the words surrounding it. Meanwhile, breakthroughs in machine learning enable the machine to acquire more data and “learn” as it’s applied, so it can more easily find the information you need based on an amassed database of existing information on your business and customers.

For instance, in the banking industry, Know Your Customer (KYC) is a process that banks use to identify and verify customers during account openings to prevent fraud. KYC is important for maintaining compliance with the numerous anti-money-laundering regulations. A lot of this customer data is unstructured and housed in tax filings, application forms, W-2s, etc. Right now, banks need to review all of these documents individually to verify customer identities and manage risk. This is a time-consuming endeavor because the data is unstructured. But by using NLP, computational linguistics and machine learning, banking institutions can quickly find the necessary data and structure it for customer verification and risk analysis. They can process accounts much faster than before – which leads to happier customers – while diminishing risk and improving compliance.
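As a rough illustration of the KYC example, the sketch below pulls a name and a tax ID out of two free-text documents and flags any field that does not match. Everything in it is an assumption made for illustration: the field patterns, the sample texts and the placeholder ID are invented, and plain pattern matching stands in for the NLP and machine learning a real system would need to cope with the variety of actual forms.

```python
import re

# Hypothetical field patterns; real forms vary far more than this.
FIELD_PATTERNS = {
    "name": re.compile(r"(?:applicant|employee) name:\s*([A-Za-z .'-]+)", re.IGNORECASE),
    "tax_id": re.compile(r"\b(\d{3}-\d{2}-\d{4})\b"),
}


def extract_fields(text):
    """Turn free text into a structured record (None when a field is missing)."""
    fields = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[field] = match.group(1).strip() if match else None
    return fields


def mismatched_fields(first, second):
    """List the fields that are missing or differ (ignoring case) between two records."""
    return [
        field
        for field in FIELD_PATTERNS
        if first[field] is None
        or second[field] is None
        or first[field].lower() != second[field].lower()
    ]


# Made-up documents with a placeholder ID, not real customer data.
application = "Applicant name: Jane Q. Sample\nTaxpayer ID 000-12-3456 provided at branch."
w2 = "Form W-2 Wage and Tax Statement\nEmployee name: JANE Q. SAMPLE\nSSN: 000-12-3456"

application_record = extract_fields(application)
w2_record = extract_fields(w2)
print(application_record)
print(w2_record)
print("Fields needing review:", mismatched_fields(application_record, w2_record))
```

In this sketch, an empty list means the two documents agree on every extracted field, so a reviewer would only need to look at the cases where something is flagged.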

Structuring unstructured data opens organizations up to an unprecedented amount of valuable information that they can use to drive greater customer value, improve customer interactions and be more predictive in their operations. So it comes as no surprise that companies like Apple, Microsoft, Google and Salesforce are all investing in AI technologies designed to shed light on dark data. When companies can harness their untapped dark data, they can gain a tremendous competitive advantage.

To learn more about how businesses can leverage their unstructured data to benefit the enterprise, read my latest article, “Three Ways to Make Sense Out of Dark Data.”

About the author

Sanjay Srivastava

Chief Digital Officer

Sanjay Srivastava is Chief Digital Officer at Genpact, where he runs the company’s growing Digital business, overseeing the Genpact Cora platform and all Digital products and services.