Metadata: Key to the Kingdom

In my last post I talked about how content remains King both on the Internet and on intranets, where it is associated mostly with unstructured or file data.  The post ended by drawing a comparison between content and objects in storage based on the importance of metadata to both.  Metadata is critical because it provides the means for finding and retrieving data that may have been stored days or even years ago.  Whether you’re looking for a particular piece of content by browsing a folder tree (folder/sub-folder) or a group of content in a result set from a query (think Google, index/search), some amount of metadata is always used for the search.  If content is King, then metadata is certainly the key to the Kingdom.

I’ve been talking about the importance of metadata to storage since the late ’90s, and back then the “veterans” looked at me with crossed eyes.  It probably wasn’t so much that they didn’t get it, but that it was well outside the view people had of storage at the time, i.e. files, file systems and blocks.  However, I saw it as a way for storage vendors to increase their value to customers by adding informational value to the data they stored on disk and tape.  It was also based on a challenge I faced in the mid-90s when running several IT groups on the customer side: effectively managing an ever-increasing amount of file data coming from my end users.

The issue was that my file storage was growing, especially in user home directories and group shares.  Trying to prune these down to the most essential data was impossible.  There was no good way for my IT people to know what was important and what wasn’t just by looking at the file and folder tree.  There was no context for it other than who created it and how “fresh” it was, i.e. created or accessed recently.  It was just a bunch of stuff (call back to my 1st and 2nd posts).  So when I jumped into the vendor world and was given the direction to simply “build software that added value to our storage hardware,” it presented the opportunity to attack this growing problem.  Surely, I wasn’t the only customer who faced this issue.

Metadata was the one common thread I discovered when handed software projects in two different industries, health care and financial services.  Both needed to archive content for long periods and be sure it could be found and accessed in the future.  The key was the context that a richer set of metadata describing file assets (unstructured data) could provide, which is what I would have loved to have back in my IT days.  It added informational value to data beyond what the file system provided, enabling search, preservation and policy-based management.

When a search is conducted through an ECM application, an e-Discovery tool or a well-known index-search engine (Google, Autonomy, etc.), it is done based on the metadata within the application.  The differences between the tools lie somewhat in the depth or richness of the metadata they use.  ECM apps typically use a database, while index-search engines maintain a full-text index optimized for advanced queries, e.g. proximity, natural language, semantic networks, etc.  For management purposes metadata needs to have some structure to it so file assets can be “tagged” appropriately, and there are a number of metadata schemas already defined.

Source: Video Content Management in Broadcast, SMPTE Journal, February/March 2002

One method I have used to describe metadata and its depth is a hierarchy model I developed early on, which was published in the SMPTE Journal back in 2002.  There are just three levels, to keep it simple but useful.

Base Level metadata is what you get in a traditional file system, which has almost no context (file name, date created, owner, size, last accessed, etc.)
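To make the Base Level concrete, here is a minimal Python sketch of roughly what a file system can tell you about a file.  The function name and the dictionary keys are my own illustrative choices, and the exact attributes available vary by operating system.

```python
import os
import tempfile
from datetime import datetime, timezone


def base_level_metadata(path):
    """Collect the file-system-derived attributes for one file.

    The keys are illustrative names, not part of any standard schema.
    """
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "owner_uid": st.st_uid,  # numeric owner id; always 0 on Windows
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "last_accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }


if __name__ == "__main__":
    # Create a throwaway file so the example runs anywhere.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
        f.write(b"hello")
    for key, value in base_level_metadata(f.name).items():
        print(key, "=", value)
    os.remove(f.name)
```

Nothing in that output says what the file is about or why it matters, which is exactly the lack of context described above.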

Structured Metadata is based on a standard schema typically in name-value pairs.  There are a number already defined including Dublin Core, NBII Biological Metadata, Content Specification for Geospatial Metadata, DICOM, SMPTE, Video on Demand Content Specification, etc.  The structure provided by metadata standards enables simple rules for management such as if <tag> = blah, then replicate to clusterX.  It also enables more complex rules for managing and manipulating content such as transcoding specific MPEG files to streaming format, or transforming certain Word docs to PDF/A for long-term archive.  It also provides a mechanism for creating and persisting complex relationships between content objects.
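The “if <tag> = blah, then replicate to clusterX” idea can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the `ContentObject` and `Rule` classes, the action strings, and the use of Dublin Core style tag names such as `dc:type` are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class ContentObject:
    """A stored file plus its name-value metadata (hypothetical model)."""
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    """If metadata[tag] == value, the named action applies."""
    tag: str
    value: str
    action: str


def actions_for(obj, rules):
    """Return the actions whose tag/value condition matches the object."""
    return [r.action for r in rules if obj.metadata.get(r.tag) == r.value]


# Illustrative policy: tag names, values and actions are made up.
RULES = [
    Rule("dc:type", "contract", "replicate:clusterX"),
    Rule("dc:format", "application/msword", "transform:pdfa"),
]

if __name__ == "__main__":
    doc = ContentObject(
        "agreement.doc",
        {"dc:type": "contract", "dc:format": "application/msword"},
    )
    print(actions_for(doc, RULES))
```

The point of the sketch is that the rules never look inside the file; they only need the tags, which is why a consistent schema matters so much for policy-based management.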

Unstructured Metadata is the full-text indexing that most search engines perform.  These engines index the primary/unique words throughout a document, and those words are not stored in a structured index.  The challenge in storage is that these indexes can be as much as 80% or more of the size of the original data.  It can be tough to sell the value of that much additional capacity.  However, full-text index/search is often provided as a critical tool on internal corporate portals enabling enterprise-wide discovery, but not as part of the storage infrastructure.
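For readers who have not looked inside a search engine, here is a toy Python sketch of full-text indexing: a simple inverted index that maps each word to the documents containing it.  Real engines add ranking, stemming, proximity and much more; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    docs = {
        "a": "Patient scan archive",
        "b": "Archive of financial records",
    }
    idx = build_index(docs)
    print(search(idx, "financial archive"))
```

Because every distinct word gets an entry, it is easy to see how an index like this can approach the size of the data it describes.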

Object storage delivers a metadata capability that had previously been unavailable to application developers and end users.  The ability to add standards-based metadata and custom metadata as descriptors for files opens a whole new opportunity for information management and storage.  There remains a fair amount of work ahead of us because the traditional file system approach has become so ingrained in our perspective.  However, the massive growth of new content (unstructured data, files) and the tremendous storage capacity it will consume are necessitating a change.  Managing the amount of information foreseen requires applications and infrastructure to work together more cohesively, and metadata is a common thread that extends from the app through storage.

2 Responses to “Metadata: Key to the Kingdom”


  1. Hans Fremuth, April 15, 2010 at 7:00 am

    Derek,

    I look at your three tiers from a different angle: Tier 1 to me is file and file system derived information. Tier 2 is the “true” metadata, found in ‘headers’ inside the files (containing administrative, descriptive, rights-related etc. fields). Tier 3 is the almighty ‘index’.

    Tier 1 and 3 is machine-created, Tier 2 is dominated by the power of that roughly one pound heavy piece of grey matter that sits between human ears.

    I suppose nothing really replaces the power of a full-text index (I am not an expert in that field). But improving the quality of Tier 2 metadata sounds to me as a good bang for your buck. Standards such as XMP allow very sophisticated tagging and semantic assumptions (well, at least there are efforts on the way to make this practical, such as opensamsn.org)

  2. dgascon, April 15, 2010 at 7:21 am

    Hans,

    Thanks for the comment. I think your perspective is very much in alignment with mine. The tier 2 as I call it, or “true” metadata as you describe it is what I consider the most important type. You make a great point about Tiers 1 & 3 being machine-created, which I hadn’t really thought about. Also, I love your comment about the “heavy piece of grey matter” being critical to tagging. This is the area where I believe real value is derived for file-based data, but also one of the more complicated to implement.

    Derek


( ĭn’fәr-mā’zĕn )

A state of balance between business, information and technology.

Author: Derek Gascon

Veteran in product and marketing strategy for content mgmt, archive and storage delivering innovative software technologies to emerging markets.
