Ingredients for a Concept Mining Algorithm
In this post I will sketch the various ingredients needed for an EDXML concept mining algorithm and show how these fit together. Concept mining in EDXML is a simple form of machine reasoning that is inspired on how human data analysts extract knowledge from data. You can read all about it here.
Basics
Before we set off, let us establish some basic principles and terminology. An EDXML data set may define any number of concepts. In the story telling analogy, concepts are the characters that play a role in the story that the data is telling. Concepts are the things that the data set is about, like persons, locations, computers, and so on. A concept instance is a specific character, like a specific person for instance. The goal of concept mining is to extract knowledge about a particular concept instance. It is like reading a novel and writing down all properties of a specific character that are revealed throughout the book.
This knowledge consists of concept attributes. A concept attribute is a value of a specific event property that is associated with a particular concept. A possible concept attribute might be the name of a person.
For devising a concept mining algorithm it is useful to consider the EDXML data set as a graph where the nodes are all the concept attributes that exist in the data set. The edges that connect them represent possible associative reasoning steps to get from one concept attribute to another. One of the nodes is chosen to be the seed. The task of the mining algorithm is to "grow" a concept instance around the seed by finding all concept attributes that can be reached from it.
It is important to note that these edges are implicit in EDXML. Individual events do not contain any definitions of relations between their own object values, or the object values of other events. Edges can be inferred by combining events with their event type definitions. This will be explained in the next section.
Reasoning Steps
There are two possible reasoning steps that an algorithm should consider. Either it can use a property relation to step to related objects within the same event. Or it can step to the same object inside another event. Both are illustrated below.
Here we see events (indicated as 'E' in red), their properties (indicated as 'P' in blue), their associated concepts (indicated as 'C' in grey) and event objects (indicated as 'O' in green). On the left we see reasoning within a single event. It steps from an object to its associated property. Its event type defines a property relation, which allows stepping to the related property, and from there to the object of that property. On the right we see reasoning across events. It steps from an object of one event to the same object in the context of another event.
EDXML has the notion of shared objects. Shared objects are values of a particular object type that occur in multiple events. The events sharing it are implicitly related through this shared object. Each event in which the shared object occurs provides a different context for that object. In each context it may have a different meaning.
Suppose that we are mining a computer, and we know its IP address. And suppose that, while mining, we encounter an event and find this same IP address in it. Does that instance of the IP address also belong to the concept that we are mining? Well, it depends.
Suppose that this newly found event does not associate the IP address with a computer but with, say, a DNS record. In that case, it does not belong to our concept instance, because we are mining a computer, not a DNS record. So, the reasoning step to jump into that DNS record event should not be made. The fact that the same object may or may not be associated with the same concept is an important thing to realise when designing a concept mining algorithm. It means that each occurrence of a shared object must be considered separately.
As we have just seen, when considering if a particular reasoning step between two concept attributes should be taken or not it is necessary to check if the target attribute is associated with the same concept as the concept instance that we are mining. Here, the following question arises: When should two concepts be considered the same? To answer that question we need to remember the fact that EDXML concepts form a hierarchy. Some concepts can be really generic, like 'machine
'. Others may be more specific, like 'machine.computer.laptop
'.
Suppose that we are mining a concept instance representing a 'computer
' and we know its IP address. If we find that same IP address in another event where it is associated with 'computer.laptop
', what does that mean? Yes, these are different concepts. But both may very well refer to the same thing. If, for the purpose of this example, we assume that IP addresses are unique identifiers for both concepts, then we can say that we discovered that this 'computer
' is actually a 'computer.laptop
'.
This illustrates that, when two instances of a particular object are associated with different concepts, it may still make sense to include both in the mined concept instance. Here we identified another key design parameter for a concept mining algorithm: We need to establish the rules for relating attributes that are associated with different concepts.
The EDXML specification explains that the 'computer.laptop
' concept is a specialization of the 'computer
' concept, while the 'computer
' concept is a generalization of the 'computer.laptop
' concept. As a counterexample, the concepts 'computer.laptop
' and 'computer.smartphone
' do not have this relation because both are different specializations of the same root concept. Both are computers, but instances of the concept are either laptops or smartphones. Therefore, it makes sense not to take any reasoning steps that mix object instances associated with unrelated concepts. More precise: Given two instances of a shared object A and B, where A is already part of the mined concept instance, then B should only be included when the concepts associated with A and B have a specialization / generalization relation.
Up until this point we have considered reasoning steps across events. Let us now consider the other type of reasoning step: A reasoning step that associates two related objects within a single event. Now, things are quite different. An intra-concept relation may explicitly relate objects that are associated with any two concepts, no matter how much they differ. It may even relate concepts 'human
' and 'fly
', causing concept attributes of both humans and flies to coexist in a single concept instance.
Why is it possible to create such monstrosities? The EDXML specification permits this because one kind of thing may actually be viewed differently in different contexts. A knife may be seen as a tool, a weapon or a product. An intra-concept relation provides a means for an EDXML event type designer to explicitly tell a concept mining algorithm that two different concepts may actually refer to the same thing. That allows the concept mining process to safely combine attributes from concepts into a single concept instance, even if no specialization / generalization relation exists. It will trust the event type designer.
Object Hubs
We have explored some properties of the reasoning steps that form the edges in our reasoning graph. Of these edges, the ones that interconnect all the instances of a particular shared event object require some additional consideration. Given any of the nodes that represent instances of a particular shared object, the graph should provide a path to all other instances. Here, practical difficulties may arise. Mutually connecting n nodes requires approximately (n * n) / 2 bidirectional edges. The number of edges increases quadratically as the number of events referring to a shared object increases. This quickly becomes impractical for larger data sets.
This issue can be solved by introducing object hubs. An object hub is a special kind of node that represents a shared object. When all instances of a given shared object have an edge connecting them to their object hub, they are still mutually connected. Only this time, the number of required edges equals the number of instances of the shared object. The difference is illustrated below.
On the left we see five events (indicated 'E' in red) each having an instance of one shared object (indicated 'O' in green). Each instance of the shared object is connected to all other instances using 10 edges. To the right we see the same configuration where an object hub has been added (indicated 'H' in green). The object instances are connected to the hub using 5 edges.
Confidence
Real world data can be inaccurate, inconsistent or just plain wrong. This can be expressed in EDXML by means of confidences. For example, a property relation has a certain confidence which indicates a gross estimate of the odds that the relation actually exists. Confidence plays a key role in concept mining. Every associative reasoning step has a certain risk of incorrectly adding a concept attribute to the concept being mined. Every iteration of the mining process takes it farther from the initial seed and confidence of newly added attributes decreases. In practise, it makes sense to use a confidence cutoff to create a boundary for mined concept instances. This requires that the algorithm keeps track of how confidence propagates through the graph while reasoning.
There is one important detail to consider when tracking confidence: There may be multiple reasoning paths that lead to the same node. In order to obtain concept instances with good overall confidence it makes sense to choose the line of reasoning that has the lowest risk of error. This can be established by using Dijkstra's algorithm for finding the shortest path to a particular node in a graph.
There may be many nodes in the graph that represent the same concept attribute. As a result, the nodes that are found to belong to a single concept instance may include multiple instances of a single attribute. Each of them may have a different confidence. How should we interpret this?
Let us return to the story telling analogy once more. In this analogy, EDXML events are similar to the paragraphs in a novel. There may be many paragraphs that tell the same thing about a character. For example, one paragraph might feature a character telling about a car, possibly a blue one. And in another paragraph, another character is telling about the same car, noting that it was almost surely a blue car. Two observations of the same concept attribute, having different confidences. The combination of the two observations gives us more confidence about the color of the car than either of the two observations do on their own.
Confidence of concept attributes may be regarded in the same way. When multiple nodes confirm the same attribute with varying confidences, basic probability calculus may be used to compute the net confidence of the attribute.
Conclusion
We have identified a number of ingredients for a concept mining algorithm. These include the following:
- Iterative reasoning on a graph
- Use of shared object hubs
- Consider the concept associated with a node to decide about adding it to the concept instance
- Propagate confidence while reasoning
- Find the best line of reason using Dijkstra's algorithm