How to build your own knowledge graph


Automatically categorise documents and use machine learning to improve the model

Do you have a lot of text documents stored on hard disks or in the cloud whose textual information you don't use directly in your business? Then this article is for you. Learn how you can leverage artificial intelligence to turn that dark data into valuable business insights, using a Knowledge Graph.

Categorisation of data

Many organisations have large amounts of information contained in free-text documents. Processing these documents often entails categorising the information contained in them. Humans read the documents and label them. At the same time, some metadata is usually added to the documents.

Labels and metadata are then stored in a database, together with a link to the original document. The documents themselves tend to lie dormant and are only kept alive for reference.

Over time, as the business changes, these documents cannot be used in a new context unless they are relabelled and reprocessed, which is cumbersome when done manually.

Automation to the rescue.

Using Natural Language Processing (NLP) techniques, such as topic modelling and named-entity recognition, one can quickly find the important topics and entities buried in the texts.

[Figure: topic modelling]

The list of topics found needs to be labeled to have any meaning. This becomes the basis of a taxonomy [1] or ontology [2] for the business.

According to Wikipedia [2], "an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains."

A human with domain knowledge is needed to do this labeling properly and to create an ontology. But it is a quick procedure: the human only judges the topics found, and does not have to read through all the documents.

As an example, we examined a few thousand research applications for government subsidies by public institutions. The figure below lists the top 50 topics found using algorithms implementing term frequency — inverse document frequency (TF-IDF)[3] and non-negative matrix factorisation (NMF)[4].

[Figure: the top 50 topics found (taxonomy)]

Some of the topics found are manually labeled as non-topics because they describe the document or its context itself, rather than the business domain the content is about. Such 'topics' are ignored from then on.

Machine Learning: the process of learning new topics and changing contexts

Once we have the topics—the ontology—each new document now can be automatically classified. No need to read through all the documents anymore! It's all automated.

Well, not quite. The topics were found using a limited data set, the training set. For those data we know the topics are correct, because a human has validated the labels. However, there is no guarantee that the training set is representative of new documents.

The system must continuously learn new topics and slowly changing contexts.

So what do we do if we want to introduce a new topic? Finding topics is one thing, but defining new topics through the ontology is something else. The system now has to learn that certain documents should be labeled with the new label.

We want to avoid introducing a human again.

So let’s start with a training set obtained from somewhere else, say a Wikipedia page about the new topic. In our example case we are looking for all applications in the energy sector.

[Figure: taxonomy and ontology]

We start with the Energy page from Wikipedia and define it as the training set. We use a linear support vector machine (SVM) [5] for the model training.

We train the model on N applications, and for each application we obtain a similarity score with the training set. We rank these scores and pick out the first M applications, with M < N.

The M applications most similar to the Wikipedia page are then examined by a human, who has to decide whether Energy is indeed a good label or not. This produces a new training set of M applications. This training set is part of the business domain and should therefore be more accurate than the generic Wikipedia page.
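The article trains a linear SVM for this step; as a simpler, hedged sketch of the ranking idea, the snippet below scores each application by TF-IDF cosine similarity to the seed text and picks the top M for human review. The seed and application texts are toy stand-ins; the real seed would be the Wikipedia "Energy" page.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the Wikipedia "Energy" page used as the seed.
seed = ("solar power wind turbines renewable energy grid "
        "fossil fuels electricity generation energy storage")
applications = [
    "improving solar cell efficiency for renewable energy",
    "gene expression in bacteria under heat stress",
    "wind turbine blades energy storage materials",
    "digitisation of medieval manuscripts",
]

vec = TfidfVectorizer()
X = vec.fit_transform(applications + [seed])

# Similarity of each application to the seed document.
scores = cosine_similarity(X[:-1], X[-1]).ravel()

# Rank and keep the top M for human review; the reviewer's yes/no
# answers become a new, in-domain training set.
M = 2
top_m = np.argsort(scores)[::-1][:M]
print([applications[i] for i in top_m])
```

The reviewer's decisions on these M candidates replace the generic seed with in-domain training data, which is the mechanism that shortens the cold-start period described above.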

This technique reduces the cold start period for machine learning. In fact, our dataset shows that 90% of all Energy-labeled applications are correctly found by the algorithm after just 19% of the documents labeled, when using the Wikipedia page as the initial training set. If Wikipedia is not used, we have to label 27% of the documents before 90% of all Energy-labeled applications are correctly found.

[Figure: avoiding the cold-start problem]

At this point, many projects halt, because the goal of categorisation and labelling is achieved. We take it a step further, though, and link the information retrieved from the unstructured data to the domain model and other data.

This results in what is called a Knowledge Graph.

What is a Knowledge Graph?

Basically, a Knowledge Graph is a bunch of interrelated information, usually limited to a specific business domain, and managed as a graph. The interrelations provide new insights into the business domain.

A more formal definition is given by Paulheim[6]: a Knowledge Graph

  1. describes real-world entities and their interrelations;

  2. defines possible classes and relations of entities in a schema;

  3. allows for potentially interrelating arbitrary entities with each other;

  4. covers various topical domains.

This article shows how you can build a Knowledge Graph for your business domain using existing unstructured data.

Building the Knowledge Graph: linking entities and metadata to the ontology

All the entities and metadata that belong to the documents can now be linked to the ontology describing the business domain. A natural way to represent these relations is in a graph.

Entities can also be linked to information obtained from elsewhere: legacy databases, open data, etc. This way, the information contained in the documents is augmented by other data.

[Figure: graph engine]

In our example data, we have millions of nodes and relationships in the graph. We use the well-known native graph database Neo4J [7] to hold the data.

[Figure: part of the knowledge graph in Neo4j]

What is shown in this graph is just a small part of the data. The nodes are colour-coded as follows:

  1. Blue: the document as a whole, represented by a unique ID

  2. Red: the topics found using the topic modelling

  3. Grey: business defined labels that group topics into broader fields

  4. Green: the research institutions involved in the research described in the document

  5. Yellow: the year a specific document was issued

As one can see, a giant network connecting the nodes exists. Each relationship has a meaning. This becomes clearer when we zoom in on a particular part of the graph.

Here we asked the business question: which research institutions from the city of Leuven applied for subsidies in 2015 for research related to "Bacteria", and which other topics are related to this research?

Similarly, one can ask questions like which research related to "Energy" is a collaboration between the Universities of Ghent and Antwerp. Etcetera.
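To make the idea concrete, here is a toy in-memory sketch of such a question. In production the question would run as a Cypher query against Neo4j; the node labels, relationship names, and example data below are illustrative assumptions, not the actual schema.

```python
# In Neo4j, the Leuven/Bacteria question is a Cypher pattern match,
# roughly (labels and relationship names are assumed):
#   MATCH (i:Institution {city: 'Leuven'})<-[:APPLIED_BY]-(d:Document),
#         (d)-[:HAS_TOPIC]->(:Topic {name: 'Bacteria'}),
#         (d)-[:ISSUED_IN]->(:Year {value: 2015}),
#         (d)-[:HAS_TOPIC]->(t:Topic)
#   RETURN i.name, collect(t.name)

# Hypothetical miniature of the graph data:
docs = {
    "APP-001": {"topics": {"Bacteria", "Antibiotics"},
                "institution": "KU Leuven", "year": 2015},
    "APP-002": {"topics": {"Energy", "Solar"},
                "institution": "Ghent University", "year": 2015},
    "APP-003": {"topics": {"Bacteria", "Genomics"},
                "institution": "KU Leuven", "year": 2016},
}
city = {"KU Leuven": "Leuven", "Ghent University": "Ghent"}

# Which Leuven institutions applied in 2015 for Bacteria research,
# and which other topics are linked to those applications?
hits = [(d["institution"], sorted(d["topics"] - {"Bacteria"}))
        for d in docs.values()
        if "Bacteria" in d["topics"]
        and d["year"] == 2015
        and city[d["institution"]] == "Leuven"]
print(hits)
```

The point of the graph model is that such multi-hop questions (institution to document to topic to year) become a single pattern match rather than a chain of table joins.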

Having a Knowledge Graph allows business insights that are otherwise hard to get to, with a focus on relationships.

A Generic Solution

Starting from unstructured textual data, applying topic modelling and other NLP techniques together with machine-learning algorithms yields the building blocks for a full Knowledge Graph.

The ontology is domain-dependent, but the techniques around it are generic by nature, summarised in the picture below.

Teamwork

This article is the result of an NLP hackathon hosted by the Flemish Government on 10 & 17 October 2018, and is a collective effort.

Thanks to everyone who participated in and around the team, and to all others who supported us and were crucial for the inspiration that made us win the hackathon in the start-up class.

References

  1. https://en.wikipedia.org/wiki/Taxonomy

  2. https://en.wikipedia.org/wiki/Ontology_(information_science)

  3. https://en.wikipedia.org/wiki/Tf%E2%80%93idf

  4. https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

  5. https://en.wikipedia.org/wiki/Support_vector_machine

  6. Paulheim (2016), Semantic Web: http://www.semantic-web-journal.net/system/files/swj1167.pdf

  7. https://neo4j.com/
