At the National Aeronautics and Space Administration, we are regularly faced with big data challenges in volume, velocity, and complexity amongst other things.
Volume is intriguing to me since big data issues related to it have been heavily investigated by researchers in the distributed computing, cloud computing, and cluster computing communities. Velocity is being studied by software, hardware, and data movement researchers, and is well on its way too.
Complexity is the scary problem of dealing with potentially many thousands of data and metadata attributes, with tens of thousands of file formats, and with parsing information out of these data files. It is one of the Essential Difficulties, a term coined by Fred Brooks.
These are problems that cannot be completely solved by technology or by approach. Compare them with Accidental Difficulties: issues in software development and design, such as efficiency, which can be mitigated by newer and higher-level domain-specific languages, by better Integrated Development Environments, or by specific compilers or tools.
Complexity is being addressed by researchers looking at ontologies, at new ways of understanding data such as Linked Data, at new approaches to data modeling, and at an area that is near and dear to my heart: content detection and analysis.
What is content detection and analysis? To me, this area grew out of the search engines and information retrieval (IR) community, and combines IR with databases, natural language processing (NLP), distributed systems development and open source.
Need a file identified? Use a content detection and analysis toolkit. Need to extract text and metadata from those files? Content detection and analysis toolkit. What language is the information in those files? Yep, you guessed it, content detection and analysis toolkit.
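To make the first of those questions concrete, here is a minimal Python sketch of identifying a file from its leading "magic bytes". It is a toy written for this post, not how any particular toolkit does it: the signature table and the name `detect_media_type` are assumptions for illustration, and a real toolkit combines far more signatures with file name hints and deeper inspection of the content.

```python
# Toy content-type detection by "magic bytes" (leading file signatures).
# Illustrative only: the table and the function name are assumptions made
# for this sketch, not the API or detection logic of any real toolkit.

MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_media_type(data: bytes) -> str:
    """Guess a media type from the first bytes of a file's content."""
    if not data:
        return "application/octet-stream"
    for signature, media_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return media_type
    # No signature matched: call it plain text if the bytes decode as UTF-8,
    # otherwise fall back to the generic binary type.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


if __name__ == "__main__":
    print(detect_media_type(b"%PDF-1.7\n"))
    print(detect_media_type(b"just some notes"))
```

Even this toy shows why the problem is hard: a ZIP signature alone cannot tell a plain archive from the many document formats that are packaged as ZIP containers, so real detection has to look further into the file.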
I helped to construct one of the de facto standard toolkits out there. It’s called Apache Tika. Tika aims to be the “digital babel fish”, allowing the user to understand any file automatically, rapidly and accurately. This capability is essential in NASA missions, where content (both science and business) keeps growing, and it is even more essential in science and instrument data processing, where a data system must quickly sift through files and identify, triage, and act on the data in them.
Tika grew out of the Apache Nutch project that spawned Apache Hadoop and an entire ecosystem and generation of amazing code and people. Tika was everything that we Nutch committers and project management committee members needed to make search engines (like Nutch, but also Google, Bing, etc.) understand any file, index its content and metadata, and make that information available for search.
Tika has an eight-year history and a growing community, and it has reached its 1.0 release. It has delivered functionality to NASA, to the National Science Foundation (NSF) community of researchers, to DARPA, to academia, and to content management systems and their users everywhere: Drupal, Plone, Alfresco, and others all use Apache Solr, and Solr in turn integrates Tika via its ExtractingRequestHandler component, which allows Solr to index content from any file.
While not a household name (yet!), Tika has been ported from Java to Python, to .NET, and to the emerging Julia language out of MIT. It enjoys an extremely strong user and developer base, with new users, projects, and funding for Tika springing up every day.
So what’s next for the digital babel fish? One area that we are actively looking at is Machine Translation (MT). It’s great that we can identify the language used in files of many types, but once we have content that’s in different languages, how do we harmonize it, or make the text and metadata uniform? MT systems are growing increasingly popular (think Google Translate, Bing Translate, etc.) and we are working to add these capabilities to Tika.
My team at NASA has added Translator API functionality, funded by the DARPA XDATA project and by the NSF, and we are actively expanding this API to support TranslatingParsers, to support massive translation activities for JSON datasets, and in general to take Tika to the next level in terms of content detection and analysis. Beyond these MT functionalities, we have also now plugged into Lingo24’s MT API.
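As a rough picture of what translating a JSON dataset involves, here is a small Python sketch that walks a parsed JSON value and passes every string value through a pluggable translation function. Everything in it is hypothetical and written only for illustration: `translate_json`, the `(text, source, target)` callable shape, and the toy word-lookup backend are assumptions of this sketch, not Tika's Translator API and not a real MT service.

```python
import json
from typing import Any, Callable

# Hypothetical shape for a translation backend: (text, source, target) -> text.
TranslateFn = Callable[[str, str, str], str]


def translate_json(value: Any, translate: TranslateFn, source: str, target: str) -> Any:
    """Return a copy of a parsed JSON value with every string value translated.

    Object keys, numbers, booleans, and nulls are left unchanged.
    """
    if isinstance(value, str):
        return translate(value, source, target)
    if isinstance(value, list):
        return [translate_json(item, translate, source, target) for item in value]
    if isinstance(value, dict):
        return {
            key: translate_json(item, translate, source, target)
            for key, item in value.items()
        }
    return value


def toy_translate(text: str, source: str, target: str) -> str:
    """Stand-in backend: a two-word Spanish-to-English lookup, not real MT."""
    lexicon = {"hola": "hello", "mundo": "world"}
    if (source, target) != ("es", "en"):
        return text
    return " ".join(lexicon.get(word.lower(), word) for word in text.split())


if __name__ == "__main__":
    record = json.loads('{"title": "Hola mundo", "tags": ["mundo"], "size": 3}')
    print(json.dumps(translate_json(record, toy_translate, "es", "en")))
```

Keeping the backend behind a plain callable means the same traversal could sit in front of any MT service. At real dataset scale, batching and rate limits matter far more than the traversal itself.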
My team is also funded by NASA, the NSF, and DARPA to add support for more scientific data formats to Tika, including improved NetCDF and HDF file support, along with support for Matlab, ENVI, and grib2 data files, so that information can be unlocked from these data.
If you haven’t checked out Tika, please have a look, and if you are interested in the work going on at NASA, please feel free to contact me!