by Darren Mottolini, Business and Research Manager Western Australia, FrontierSI; Ivana Ivanova, Senior Lecturer and Research Fellow Spatial Information Infrastructures, Curtin University and FrontierSI; and Tristan Reed, Research Associate, Curtin University
The energy industry has many and varied uses for datasets relating to spatial and geographical information, and traditionally, the process of knowing where to look for the right dataset has been challenging. However, Google recently launched a new online search engine specifically aimed at solving the challenge of finding and accessing the right dataset among the ever-growing and increasingly fragmented array of online dataset repositories. And according to researchers from FrontierSI, this is just the tip of the iceberg when it comes to what we can achieve in dataset searches.
There are many thousands of data repositories on the web. Local and national governments around the world publish their data online, which adds to these repositories, meaning we now have access to millions of datasets.
We have all experienced searching for a dataset that contains specific elements of information. The time spent searching, only to find that the data must be processed further to make it fit-for-purpose, costs business and government alike.
In September 2018, Google launched Dataset Search, which allows anyone to find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s personal web page.
Google Dataset Search applies the same principles as Google Search (which are also used for Google Scholar). As long as the right structure of metadata tags is included in the data repositories, it will index the metadata for discovery.
The purpose of Google Dataset Search is to improve the discovery of datasets from sectors such as life sciences, social sciences, civics and government. Google plans to do this by ensuring publishers provide structured metadata, which means each dataset must include support information describing the dataset.
The idea of using structured metadata to allow machine-to-machine linkage is an area of research FrontierSI (formerly the CRC for Spatial Information) has been engaged in for the past seven years.
While Google Dataset Search is a great advancement, the reality is that this new application is only dipping into the possibilities of what can be achieved through structured metadata and machine linkage. Google’s work is indirectly demonstrating that the research undertaken by FrontierSI has practical merit, while recognising there is still more work to be done.
Searching for the right dataset through smarter use of structured metadata is only the low-hanging fruit in optimising machine-to-machine links. Improving search so that users can use “natural language” phrases such as “what is the grain production output within the Wheatbelt” should not only get you to the right dataset but, in the future, provide you with the right answer. We see this as the next level of research that Australia is well positioned to lead.
Let’s explore what is standing in our way to not only improve what datasets we search for, but help generate the answers we need.
Structured metadata? Don’t we already have metadata standards?
Searching for spatial datasets in the geo-information domain relies on the existence of dedicated catalogues (including metadata catalogues, geoportals or clearinghouses) and complex, standards-compliant metadata, such as ISO 19115 [1].
Metadata is a structured collection of information fully describing the spatial resource, and includes information about the creator of the dataset, its spatial and temporal reference system, content, quality and constraints on its use. The ISO standard recommends a minimal metadata set which should serve for data discovery and identification, yet despite having a complex and exhaustive metadata standard, there are persistent and well-known problems with spatial data discovery.
Spatial metadata is scarce or, if available, not well maintained, which is caused by two major problems:
1. The use of standards is not mandatory, and even if mandated (e.g. by national or corporate Spatial Data Infrastructure (SDI) policy), the standard does not specify a minimum metadata requirement. As such, it is frequently up to data producers to decide how much metadata and what information to provide.
2. Metadata is provided in specialised jargon, understandable only by geo-information professionals and often only those from the same specialised area as the producer.
To add further difficulty, searching for spatial resources relies on prior knowledge of these dedicated catalogues and where they can be accessed. Currently, attempting to use mainstream search engines requires an intricate and advanced knowledge of crafting search query strings to guide the search engine to a specified data catalogue location. Once there, the search engine further needs to interact with the data catalogue’s system (such as by using an OGC Catalogue Web Service [2] request) to identify the right dataset based on the original query string.
There have been prior attempts to harmonise search for spatial datasets in dedicated catalogues with mainstream search engines; one such example is OpenSearch for GEO [3]. However, searching for the right dataset that is fit for the user’s desired purpose continues to be a challenge in the geospatial domain.
An initiative within INSPIRE, the European SDI, to align geospatial metadata standards with the Data Catalog Vocabulary (DCAT) [4] demonstrates the desire to expose currently “invisible” data repositories to the web, and aligns with recent developments in mainstream search engines, such as Google Dataset Search.
Google has recommended the use of RDF models and DCAT vocabularies to set up and design structured metadata for published data, but what does this mean? RDF stands for Resource Description Framework, a metadata model used as a general method for expressing conceptual descriptions or models of information implemented in web resources. It is a knowledge management technique founded on the idea of describing resources in the form of a triple, consisting of a subject, predicate and object.
In Figure 1, the subject is a property, the predicate expresses the relationship “isLocated”, the object in this case being a street. The expression would be “a property is located on a street” – subject, predicate, object.
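As a minimal sketch of the triple described above, a statement can be held as a three-field record. The `Triple` type, the `to_sentence` helper and the `ex:` identifiers are hypothetical names chosen for illustration; they are not taken from any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers; the "ex:" prefix stands for an example namespace.
statement = Triple(subject="ex:Property", predicate="ex:isLocated", obj="ex:Street")


def to_sentence(triple: Triple) -> str:
    """Render a triple in subject, predicate, object reading order."""
    return f"{triple.subject} {triple.predicate} {triple.obj}"


print(to_sentence(statement))
```

In practice the three parts would usually be full web identifiers (IRIs) rather than short strings, so that different publishers can refer to the same thing.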
DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogues published on the web. DCAT does not make any assumptions about the format of the datasets contained within a catalogue but provides a standardised method of expressing the structure of a data catalogue and the metadata records of datasets within it.
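To make this more concrete, the sketch below builds one DCAT-style dataset record and serialises it as JSON-LD using only the Python standard library. The `dcat:` and `dct:` terms come from the published DCAT and Dublin Core vocabularies, while the dataset itself, its title, keywords and the `example.org` URLs are invented for illustration.

```python
import json

# A single illustrative metadata record; all values are assumptions.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/tree-canopy",
    "@type": "dcat:Dataset",
    "dct:title": "Tree canopy cover (illustrative record)",
    "dct:description": "Hypothetical dataset used to show the record structure.",
    "dcat:keyword": ["tree canopy", "vegetation"],
    # DCAT describes where the data can be obtained, not its internal format.
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": "https://example.org/data/tree-canopy.geojson"},
    },
}

jsonld_text = json.dumps(record, indent=2)
print(jsonld_text)
```

A record like this would typically be embedded in, or linked from, the web page that describes the dataset so that a crawler can read it.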
How does it all work?
A collection of RDF statements intrinsically represents a directed graph data model, which is suited to generating knowledge: it shows how data is linked and allows machine inference to “fill in the gaps”. Taking as an example a mining haulage machine and the interlinked assets and components needed to build it, you can quickly map a graph data model by linking statements, as shown in Figure 2.
Once the statements are linked, we can use data analytics software (such as Apache Hadoop) to infer other linkages without having to hard-code them into applications or build new lookup tables. In practice, RDF data is often stored in ordinary relational databases or in native representations, which gives publishers a mechanism to start building and publishing their own RDF statements.
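The idea of inferring linkages that were never stated can be sketched with a small rule: if A has part B and B has part C, then A has part C. The component names, the `ex:hasPart` predicate and the `infer_transitive` function below are assumptions for illustration; they do not reproduce Figure 2 or the behaviour of any particular analytics product.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements about a haulage machine and its components.
stated: Set[Triple] = {
    ("ex:HaulTruck", "ex:hasPart", "ex:Engine"),
    ("ex:Engine", "ex:hasPart", "ex:FuelPump"),
    ("ex:HaulTruck", "ex:operatedAt", "ex:MineSite"),
}


def infer_transitive(triples: Set[Triple], predicate: str) -> Set[Triple]:
    """Return the input plus every link implied by treating `predicate` as transitive."""
    result = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        links = [(s, o) for (s, p, o) in result if p == predicate]
        for a, b in links:
            for c, d in links:
                new = (a, predicate, d)
                if b == c and new not in result:
                    result.add(new)
                    changed = True
    return result


graph = infer_transitive(stated, "ex:hasPart")
inferred = graph - stated
print(inferred)
```

Here the link between the truck and the fuel pump is derived from the two stated links rather than being stored in a lookup table.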
Why should the metadata of my dataset be structured this way?
When searching for datasets to meet the requirements of a solution, it matters that one thing can go by many names. In a cadastre, for example, a property can be called a lot, a parcel, a land boundary, a property boundary, a title boundary, or several other possible terms. These additional descriptions become important for a search engine.
However, in most cases search engines do not consider the fact that one “thing” may be called many different things by different groups of people. As such, if you were to describe a dataset as containing information on “tree canopy”, a user querying the search engine with a more general term such as “vegetation” would not find it, and so would remain unaware of a dataset that may meet the requirements of their solution, simply because it was described using other terms.
Google Dataset Search is currently a leader in this regard. For example, queries for “bore hole” and “borehole” yield effectively the same results. This is in contrast with other dataset search engines in use, such as CKAN, which ignores all records containing “borehole” if the search query is “bore hole”, and vice versa.
Through expressing metadata in a structured RDF format, vocabularies can be linked to elements of the metadata to “expand” or broaden the content. For example, existing or expert-generated vocabularies describing alternative representations for “bore hole”, “cadastre” or “tree canopy” could be used to automatically expand the keywords listed in the metadata records for the cases discussed above.
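A rough sketch of this kind of keyword expansion follows. The small `VOCAB` table stands in for an existing or expert-generated vocabulary; its entries, and the `expand_keywords` and `matches` helpers, are assumptions made for illustration rather than part of any published vocabulary or search engine.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical vocabulary: alternative labels and broader terms per keyword.
VOCAB: Dict[str, Dict[str, List[str]]] = {
    "tree canopy": {"alt": ["canopy cover"], "broader": ["vegetation"]},
    "borehole": {"alt": ["bore hole"], "broader": []},
    "property": {
        "alt": ["lot", "parcel", "land boundary", "property boundary", "title boundary"],
        "broader": ["cadastre"],
    },
}


def expand_keywords(keywords: Iterable[str]) -> Set[str]:
    """Add alternative and broader terms to the keywords listed in a metadata record."""
    expanded: Set[str] = set()
    for keyword in keywords:
        term = keyword.strip().lower()
        expanded.add(term)
        entry = VOCAB.get(term)
        if entry is not None:
            expanded.update(entry["alt"])
            expanded.update(entry["broader"])
    return expanded


def matches(query: str, record_keywords: Iterable[str]) -> bool:
    """True if the query equals any original or expanded keyword of the record."""
    return query.strip().lower() in expand_keywords(record_keywords)


print(matches("vegetation", ["tree canopy"]))
print(matches("bore hole", ["borehole"]))
```

The expansion happens on the metadata side, so the person searching does not need to know which term the publisher happened to use.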
Spatial data also intrinsically contains extra context, whether implied by the geographic extent of the spatial data itself or described explicitly in a metadata record as the extent that the data covers. By applying the principles of RDF “triples” to create context in the published metadata, dataset search engine results can be tailored for the end user by looking at the spatial relevance or suitability of a dataset.
One example would be describing a dataset’s extent as being “Northam”, a town in the Wheatbelt region of Western Australia. Using RDF-compliant vocabularies, a user can query a search engine with a phrase such as “in the Wheatbelt” and find said dataset. As such, a user looking to compare data from a set of related geographic areas only needs a single search query, rather than the many currently required.
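The Northam example can be sketched as a chain of “is within” statements that a search follows upwards. The dataset identifiers, the `WITHIN` and `DATASET_EXTENT` tables and the two helper functions are hypothetical; only the fact that Northam lies in the Wheatbelt region of Western Australia comes from the text above.

```python
from typing import Dict, List, Optional

# Each place points to the region that contains it (illustrative statements).
WITHIN: Dict[str, str] = {
    "ex:Northam": "ex:Wheatbelt",
    "ex:Wheatbelt": "ex:WesternAustralia",
    "ex:Perth": "ex:WesternAustralia",
}

# Hypothetical datasets and the extent named in their metadata.
DATASET_EXTENT: Dict[str, str] = {
    "ex:grain-production-northam": "ex:Northam",
    "ex:tree-canopy-perth": "ex:Perth",
}


def is_within(place: str, region: str) -> bool:
    """Follow the containment chain upwards from `place` looking for `region`."""
    current: Optional[str] = place
    seen = set()
    while current is not None and current not in seen:
        if current == region:
            return True
        seen.add(current)  # guards against accidental cycles in the data
        current = WITHIN.get(current)
    return False


def datasets_in(region: str) -> List[str]:
    """List datasets whose stated extent falls inside the queried region."""
    return sorted(d for d, extent in DATASET_EXTENT.items() if is_within(extent, region))


print(datasets_in("ex:Wheatbelt"))
print(datasets_in("ex:WesternAustralia"))
```

A single query for the region returns every dataset described at town level inside it, which is the behaviour the paragraph above describes.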
The Spatial Infrastructures program of FrontierSI has been at the forefront of research in this area for the past several years and new applications, such as Google Dataset Search, show ongoing promise that we are on the right path. For now, FrontierSI is continuing to improve how spatial metadata can better leverage the “web of data”, while supporting Australian data publishers to ready their data for Google Dataset Search.
[1] ISO 19115:2014 Geographic information – Metadata – Part 1: Fundamentals, ISO: Geneva, 2014, 167p.