80% of your data will be unstructured within four years, according to Data Management Solutions Review.
Perhaps not surprisingly, the vast majority of the 100 or so new database products have embraced unstructured data as the primary issue they are managing, and the foundational premise of some of today’s most popular database technologies (e.g., Kafka) centers on three ideas: 1. That unstructured data exists; 2. That it’s spreading rapidly (like a virus); and 3. That it’s very hard to manage.
All of the above is well-intentioned and certainly not dismissible. However, the term “unstructured data” itself is very relative, potentially misleading, and requires deconstruction.
Let’s start with definitions.
What is unstructured data?
The generally agreed upon definition of unstructured data is something like this:
Information that doesn’t have a pre-defined model or is not organized in a pre-defined manner.
But it may be more helpful here to define “data”.
Per the Cambridge Dictionary, data is:
Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.
So if “data”, by definition, is something to be considered for decision-making, then the term “unstructured data” is an oxymoron. It’s like saying “objective beauty”.
That is because a piece of data, by itself, has no meaning without context, and hence can’t be used for decision-making.
Take a randomly picked individual data point: the number 9. The number 9, by itself, has no value whatsoever until we associate it with something. That association comes in the form of structure. Our number 9 only has meaning in relation to what’s around it, that is, other numbers or data points. Once you can see all of these data points together, you have context and can make meaningful decisions based on that information.
However, if we remove or destroy the structure, by, for example, randomly shuffling our data points, the data then becomes worthless even though we didn’t change any of its core elements. This ‘unstructured data’ can’t be processed by a computer, because from the outside it’s just a hodge-podge of random bytes.
Unstructured Data or Covertly Structured Data?
When English comedian Eric Morecambe was accused of “playing all the wrong notes” in a piano concerto, his response was, “I’m playing all the right notes. But not necessarily in the right order”.
So-called “unstructured data” has the same issue. In order for it to be useful, it has to be very unlike its moniker by having some kind of internal structure.
If data could really be completely unstructured, you’d only need to store a single record with a gigantic blob attached to it. But in the real world, there are often multiple entries, each with its own key to uniquely identify it. At a minimum, this means the people storing the data implicitly accept that primary keys are needed to split it up into chunks, each of which is associated with a discrete thing we wish to work with.
Over time, more and more of the hidden structure comes into view and we eventually find ourselves working with child records and foreign keys. We may even start to see that there are a whole series of fields within any ‘unstructured’ data that would arguably be better off being stored as a column in a table.
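As a rough illustration of that progression, the sketch below takes a handful of keyed JSON blobs and promotes two of their fields to real columns, leaving the rest as an application-private payload. The table, field names and sample orders are all invented for the example, and SQLite merely stands in for whatever database you actually use.

```python
import json
import sqlite3

# Hypothetical 'unstructured' store: one opaque JSON payload per key.
blobs = {
    "order-1": json.dumps({"customer_id": 7, "total_cents": 1999, "box_color": "blue"}),
    "order-2": json.dumps({"customer_id": 7, "total_cents": 500, "box_color": "red"}),
    "order-3": json.dumps({"customer_id": 8, "total_cents": 4250, "box_color": "blue"}),
}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_key   TEXT PRIMARY KEY,   -- the key the application already used
        customer_id INTEGER NOT NULL,   -- promoted: other systems need to join on it
        total_cents INTEGER NOT NULL,   -- promoted: needed for reporting
        payload     TEXT NOT NULL       -- everything else stays private to the app
    )
    """
)

for key, raw in blobs.items():
    doc = json.loads(raw)
    # Arguments are evaluated left to right, so the payload is serialized
    # only after the two promoted fields have been popped out of it.
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        (key, doc.pop("customer_id"), doc.pop("total_cents"), json.dumps(doc)),
    )

# The hidden structure is now queryable without decoding any payload.
rows = conn.execute(
    "SELECT order_key FROM orders WHERE customer_id = ? ORDER BY order_key", (7,)
).fetchall()
```

Attributes such as box_color, which matter only to the original application, can stay in the payload, while the fields other systems care about become visible from the outside.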
So, what we have in fact been dealing with all along is not really unstructured data but covertly structured data. It has a structure, but that structure is not visible outside the application.
This has made individual developers’ lives much simpler, as they no longer have to deal with corporate data models or other teams with different goals and timescales and are free to use whatever data structures make sense to them.
It also means that attributes which have no utility outside their original application (such as the color of a box in a UI) can be stored in the same place as the canonical reference to the object itself, which makes things easier.
The Emerging Challenges With Unstructured Data
As we’ve seen above, using ‘unstructured’ data allows us to get an initial version of an application up and running much faster, but as our application grows and is connected to other applications, a series of challenges emerge.
Not synchronizing the structure with other applications before deployment is arguably a form of technical debt. Minor issues, such as the format of address fields, become serious issues when you are trying to verify that two things are semantically the same but syntactically different because the developers never spoke to each other or with a DBA.
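To make the address problem concrete, here is a minimal sketch of two hypothetical applications storing the same address in different forms. The normalize_address helper and its abbreviation table are assumptions made for illustration; real address matching is far messier (for instance, “St” can also mean “Saint”).

```python
import re

# Hypothetical abbreviation table; a real system would need a far larger one
# or a dedicated address-validation service.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}

def normalize_address(raw: str) -> str:
    """Lower-case, strip punctuation, and expand a few common abbreviations."""
    words = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

app_a = "12 Main St., Apt 4"          # as stored by one application
app_b = "12 main street apartment 4"  # as stored by another

same_syntax = app_a == app_b
same_meaning = normalize_address(app_a) == normalize_address(app_b)
```

Every pair of applications that skipped the conversation up front ends up needing a reconciliation step of this kind later on.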
Even within an application, problems will emerge, as new use cases require features such as transactions, foreign key lookups and running totals that platforms designed for unstructured data struggle to support.
The top seven issues are:
1. Data sharing and reusing
In order to use data created by another application you now need to first understand its data format, and then understand its relationship to the items around it.
Because every application has to understand every other application’s data format before it can use that data, and because each application stores data in a format that’s specific to its own use case, data sharing and reuse become extremely difficult.
2. Performance at scale
From a network bandwidth and CPU perspective, changing a single bit in a 20K record can cost just as much as creating the same 20K record from scratch, because everything goes across the wire, every time.
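A back-of-the-envelope sketch of that cost, using a made-up record of roughly 20K and counting only serialized bytes. It ignores protocol overhead, compression and anything a particular product might do to optimize partial updates, so treat the numbers as illustrative only.

```python
import json

# Hypothetical record of roughly 20K, mostly one large field.
record = {"id": 42, "status": "open", "notes": "x" * 20_000}

def blob_update_bytes(rec, field, value):
    """Key-value style: read the whole blob, change one field, write it all back."""
    read_bytes = len(json.dumps(rec).encode())
    updated = {**rec, field: value}
    write_bytes = len(json.dumps(updated).encode())
    return read_bytes + write_bytes

def column_update_bytes(key, field, value):
    """Column style: only the key, the field name and the new value travel."""
    return len(json.dumps({"key": key, "set": {field: value}}).encode())

blob_cost = blob_update_bytes(record, "status", "closed")
column_cost = column_update_bytes(record["id"], "status", "closed")
```

In this toy model the blob update moves more than 40,000 bytes to change one short field, while the column-style update moves well under 100.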
3. Transactions
The database needs a mechanism to check that the new value is based on a valid old value and not a version of the data that has since been changed by someone else, making transactions a major challenge.
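One common way to perform that check is optimistic concurrency: every value carries a version number, and a write succeeds only if the version the writer read is still current. The toy in-memory store below is a hypothetical sketch of the idea, not the API of any real product, and it is single-threaded, so it needs no locking.

```python
class VersionedStore:
    """Toy in-memory store with optimistic concurrency (hypothetical, not a real API)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); a missing key is reported as version 0."""
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if nobody has changed the key since the caller read it."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False  # stale read: the caller must re-read and retry
        self._data[key] = (current_version + 1, value)
        return True

store = VersionedStore()
store.put_if_version("balance", 0, 100)

# Two clients read the same version of the balance...
v_alice, balance_alice = store.get("balance")
v_bob, balance_bob = store.get("balance")

# ...and both try to write back a value derived from it.
alice_ok = store.put_if_version("balance", v_alice, balance_alice - 30)
bob_ok = store.put_if_version("balance", v_bob, balance_bob - 50)
```

The first write wins and the second is rejected, so the second client has to re-read and retry. Without a check like this, the later write would silently overwrite the earlier one.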
4. Foreign keys
There are practical and technical limits to how much unstructured data you can store for a single key, and once you exceed those, you now have the same issues with primary keys, unique keys and foreign keys that you would in a traditional RDBMS, except with minimal support for such concepts in many NoSQL products.
5. Aggregate operations
In order to answer any non-trivial question, you need to retrieve the data and then process it on the client side. What would be a simple SUM() .. GROUP BY operation in an RDBMS turns into hours of work.
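The contrast can be sketched as follows, with invented sales records: the first half fetches and decodes every payload on the client, and the second half lets the database do the same work with SUM() and GROUP BY. SQLite is used here only as a convenient stand-in for an RDBMS.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical sales records stored as opaque JSON payloads.
payloads = [
    json.dumps({"region": "EU", "amount": 10}),
    json.dumps({"region": "US", "amount": 5}),
    json.dumps({"region": "EU", "amount": 7}),
]

# Client side: every payload is fetched, decoded and summed by the application.
client_totals = defaultdict(int)
for raw in payloads:
    doc = json.loads(raw)
    client_totals[doc["region"]] += doc["amount"]

# Database side: the same question once the fields are real columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(doc["region"], doc["amount"]) for doc in map(json.loads, payloads)],
)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

Both approaches give the same totals, but the client-side version has to pull every record across the wire first, which is what hurts as the data grows.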
6. GDPR and data protection
Under the European Union’s General Data Protection Regulation, organizations need to be able to clearly and easily identify all the data related to a given individual. If the key can be mapped directly to the individual, this is manageable. But what if the identifying information is buried in the payload?
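The difference between the two cases can be sketched like this. The key scheme, the payload layout and the email addresses are all invented; the point is only that a key lookup touches the key space, whereas finding a buried identifier means decoding and walking every payload.

```python
import json

# Hypothetical key-value store: some keys identify the person, some do not.
store = {
    "user:alice@example.com": json.dumps({"name": "Alice", "plan": "pro"}),
    "order:1001": json.dumps({"items": 3, "contact": {"email": "alice@example.com"}}),
    "order:1002": json.dumps({"items": 1, "contact": {"email": "bob@example.com"}}),
}

def contains_value(node, target):
    """Recursively search a decoded JSON document for an exact value."""
    if isinstance(node, dict):
        return any(contains_value(child, target) for child in node.values())
    if isinstance(node, list):
        return any(contains_value(child, target) for child in node)
    return node == target

def keys_for_person(email):
    """Return (keys found via the key itself, keys found only by scanning payloads)."""
    by_key = {key for key in store if key.endswith(":" + email)}
    by_payload = {
        key for key, raw in store.items()
        if contains_value(json.loads(raw), email)
    }
    return by_key, by_payload

by_key, by_payload = keys_for_person("alice@example.com")
```

Here the order record would be missed entirely by anyone who only looked at keys, and the only way to find it is a full scan of the store.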
7. Long-term data management
Without an externally visible date field, how do I know a record is old enough to delete? GDPR raises its ugly head again here, as EU-based companies are legally obliged to delete personal data they don’t need anymore.
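A small sketch of the retention problem, under an assumed one-year policy and with a fixed “now” so the result is reproducible. The field names ts and created are invented to show what happens when each developer picks a different name, or none at all.

```python
import json
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)                  # hypothetical retention policy
NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed for reproducibility

# With an externally visible date column, the check is a simple comparison.
rows = [
    {"key": "a", "created_at": datetime(2022, 6, 1, tzinfo=timezone.utc)},
    {"key": "b", "created_at": datetime(2023, 9, 1, tzinfo=timezone.utc)},
]
expired_rows = [row["key"] for row in rows if NOW - row["created_at"] > RETENTION]

# With opaque payloads, we have to decode each one and guess where the date is.
blobs = {
    "a": json.dumps({"ts": "2022-06-01T00:00:00+00:00"}),
    "b": json.dumps({"created": "2023-09-01T00:00:00+00:00"}),
    "c": json.dumps({"note": "no date recorded"}),
}

def blob_is_expired(raw, candidate_fields=("ts", "created")):
    """Return True/False if a date field is found, or None if the age is unknown."""
    doc = json.loads(raw)
    for name in candidate_fields:
        if name in doc:
            return NOW - datetime.fromisoformat(doc[name]) > RETENTION
    return None

expired_blobs = [key for key, raw in blobs.items() if blob_is_expired(raw) is True]
unknown_blobs = [key for key, raw in blobs.items() if blob_is_expired(raw) is None]
```

The record with no recognizable date is the awkward one: it can be neither safely deleted nor confidently kept.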
Unstructured Data: The Bottom Line
While nobody can question the success of products that focus specifically on unstructured data, in the long term, companies seeking to future-proof their applications and their products will need to use a hybrid model that combines an RDBMS with a key-value store. That’s because for data to be useful to anyone other than the application that creates it, it needs to be externally comprehensible.
There are ‘facts on the ground’ that back this up.
Many vendors that were pure key-value stores are now retroactively adding SQL layers and support for transactions.
Here at VoltDB we’re happy working in such a hybrid world and have customers who use us for both traditional structured data and unstructured data at the same time. In addition to being able to store and work with pure binary data, VoltDB can also work with JSON and can index individual elements inside ‘unstructured’ fields, opening the way to solve many of the issues we describe above.
Bottom line: Managing unstructured data isn’t a challenge if 1. You understand that it’s not truly unstructured but covertly structured, and 2. You have a data platform that can handle this covert structure.
That’s what we do at VoltDB, and that’s why we don’t stress out about unstructured data—nor do our customers.