Analytics & Big Data
Nov 14, 2018

The key steps to data engineering: source, curate, store, and govern 

In today's digital age, companies rely on data-driven insights and analytics to thrive and stay on top of their market and competitors. Data can help answer tough business questions – and, when combined with new technologies like artificial intelligence (AI) and machine learning, it can power your business transformation. A pharmaceutical executive I recently met said he thinks of his organization as a “data company,” given its bleeding-edge AI approaches to understanding things like the genetic causes of diseases and identifying potential drug therapies. In other industries – from financial services to manufacturing, consumer goods, and retail – many companies are adopting similar data-centered approaches.

To be a true data company, it is first important to understand what data to leverage and how, which can be a complex process. In many cases, it is difficult to determine what data is available, and where. Moreover, sizable teams are often needed to process and prepare the information before it can be used to drive business outcomes. Collectively, these processes make up the critical steps in data engineering.

Source (ingest) the data

Previously, companies were primarily focused on tapping into structured data housed in easily digestible formats, such as spreadsheets and tables. Today, many are interested in a wider variety of data sources, including those beyond the corporate walls, like social media data, blogs, websites, emails, IoT devices, and sensors. These sources produce unstructured data that is not digestible by traditional means, but is still extremely valuable. By some estimates, unstructured data represents 80 percent of all available data today. The challenge with unstructured data lies in extracting contextual and meaningful information from it. Technologies like natural language processing and understanding allow companies to extract context and meaning from unstructured data and convert it into a structured form that can then feed powerful new analytics possibilities.
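To make the unstructured-to-structured idea concrete, here is a minimal sketch in Python. Real pipelines would use proper NLP tooling (entity recognition, language models); this stand-in uses simple pattern matching, and the field names and example text are illustrative assumptions, not part of any particular product:

```python
import re

def extract_structured(text):
    """Pull simple structured fields out of free text.

    A minimal stand-in for NLP-based extraction: regex matching
    illustrates how unstructured text can be converted into a
    structured record that downstream analytics can consume.
    """
    return {
        "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text),
        "dates": re.findall(r"\d{4}-\d{2}-\d{2}", text),
        "word_count": len(text.split()),
    }

# Hypothetical example: a free-text support note becomes a row of fields.
record = extract_structured(
    "Customer jane.doe@example.com reported the issue on 2018-11-14 via email."
)
print(record)
```

Once text is reduced to rows and columns like this, it can flow through the same curation and storage steps as any structured source.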

Curate the data

Once the data is sourced, the next step is to curate the data. There are multiple considerations when curating data, from cleansing and validating the data, to harmonizing and transforming it. For example, one system might have a text-based identifier, another might have a numeric identifier, and a third might have an alpha-numeric identifier. The challenge, then, becomes cleansing, validating, and harmonizing data from disparate sources and in different formats to a single view of the enterprise. This single view can help connect the dots across the enterprise and drive powerful analytics and business impact.
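The identifier example above can be sketched in a few lines of Python. The prefixes, zero-padding convention, and source-system names here are hypothetical assumptions chosen for illustration; a real harmonization step would encode whatever conventions the actual systems use:

```python
def harmonize_id(raw_id):
    """Normalize customer identifiers from different source systems.

    Assumed convention: uppercase the value, strip a leading
    'CUST-'/'C' style prefix, and drop zero-padding, so that
    'cust-00042' (text), 42 (numeric), and 'C42' (alphanumeric)
    all resolve to the same canonical key.
    """
    s = str(raw_id).strip().upper()
    for prefix in ("CUST-", "C-", "C"):
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    return s.lstrip("0") or "0"  # keep a lone zero intact

# Records from three hypothetical systems with differing ID formats.
sources = [("cust-00042", "billing"), (42, "crm"), ("C42", "support")]
merged = {}
for raw, system in sources:
    merged.setdefault(harmonize_id(raw), []).append(system)
print(merged)  # all three rows collapse onto one customer key
```

Collapsing disparate identifiers onto one key like this is what makes the "single view of the enterprise" possible: downstream analytics can join billing, CRM, and support activity for the same customer.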

Traditionally, the data-to-insights journey has been a long process, but recent advances in technology, such as the infusion of machine learning into data ingestion and data transformation platforms, are helping shorten the time to value by automating or augmenting some of the more complex parts of this process. The scalability of these self-service data preparation technologies is not yet up to enterprise standards, but their readiness is fast approaching – at least judging by the pace at which they are evolving.

Store the data

After the data has been processed, it needs to be stored in a cost-effective, easy-to-access, and secure way. Rapid advances in big data platforms like Hadoop and cloud-based data storage have made it increasingly easy to meet cost- and security-related objectives. The availability of powerful SQL engines on top of these platforms has also solved the ease-of-access problem, giving power users – business-side stakeholders and analysts alike – the ability to apply familiar SQL skills to analyze massive amounts of data on these platforms interactively and in near real time.
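The point about reusable SQL skills can be illustrated with a small sketch. Here an in-memory SQLite database stands in for a SQL engine running over a big data platform (such as Hive or Presto on Hadoop); the table, data, and query are illustrative assumptions, but the SQL itself is exactly what an analyst would write against those engines:

```python
import sqlite3

# In-memory SQLite as a stand-in for a SQL engine over a big data
# platform; schema and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 75.5), ("north", 30.0)],
)

# The same aggregate SQL an analyst already knows applies unchanged,
# whether the engine sits over SQLite or a Hadoop-scale store.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # one aggregated total per region
conn.close()
```

The value of these SQL-on-big-data engines is precisely this continuity: the query language stays the same while the storage layer scales underneath it.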

About the author

Binu Varghese

Global Delivery and Strategy Leader - Data Engineering

Follow Binu Varghese on LinkedIn