A Quick Recap on the Previous Post
In our last post, we learned that concept learning is merely a search problem in the space of hypotheses, and we also learned that a hypothesis can be represented by a conjunction of constraints on the attributes of our data. Moreover, we concluded that if a data instance x satisfies a given hypothesis h, then h will classify x as a positive instance, which implies that h(x) = 1. As a reminder, considering the set of attributes: sky, temperature, and wind, take a look at the following hypothesis:
<?, Cold, Strong>
In plain English, it means that our golf player, let's call him Joe, will play golf when it is cold and the wind is strong, regardless of the sky (denoted by the ? constraint for the attribute sky).
So, learning the concept of play means learning the days on which Joe plays golf (i.e., play = yes). This set of days can be described through a conjunction of constraints on the instance attributes. And we know that this set of constraints is nothing but the hypothesis that we hope to converge to after searching the space of hypotheses.
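To make this concrete, here is a minimal sketch (in Python, not from the original post) of how such a conjunctive hypothesis could classify an instance; the tuple representation and the function name h are just illustrative choices:

```python
def h(hypothesis, instance):
    """Return 1 if the instance satisfies the hypothesis, else 0."""
    for constraint, value in zip(hypothesis, instance):
        if constraint == "Ø":                      # no value is acceptable
            return 0
        if constraint != "?" and constraint != value:
            return 0                               # specific value does not match
    return 1

joe_plays = ("?", "Cold", "Strong")                   # <?, Cold, Strong>
print(h(joe_plays, ("Sunny", "Cold", "Strong")))      # 1 -> Joe plays
print(h(joe_plays, ("Rainy", "Warm", "Strong")))      # 0 -> Joe does not play
```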
Today we will focus more on the actual search step in the space of hypotheses and give you an idea of how one could estimate the size of the hypothesis space through an example. In our next post we will explain how the machine could search the hypothesis space in a more guided and logical way through the general-to-specific ordering of the hypotheses in H.
What is Inside a Hypothesis Space?
So we have learned that concept learning is nothing but a search problem in the space of hypotheses, and the goal is to find the hypothesis that fits the training examples best. Let's dig deeper and learn more about these hypotheses and the space they live in, using our previous example.
As we have mentioned in the last post, a hypothesis can be represented as a conjunction of constraints on the attributes. Considering the fact that we have 3 attributes (i.e., Sky, Temperature, and Wind), and also knowing that Sky has 2 possible values (i.e., Sunny, Rainy), Temperature has 3 possible values (i.e., Warm, Freezing, Cold), and Wind has 2 possible values (i.e., Strong, Weak), we can compute the number of possible instances in the instance space of our training examples, let's call it X:
Number of possibilities in X = 2 × 3 × 2 = 12
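If you want to double-check this count yourself, here is a small sketch (again, just illustrative, with the attribute lists written out as above) that enumerates the instance space with itertools.product:

```python
from itertools import product

# Possible values for each attribute, as listed above.
sky = ["Sunny", "Rainy"]
temperature = ["Warm", "Freezing", "Cold"]
wind = ["Strong", "Weak"]

# Every instance is one combination of attribute values.
X = list(product(sky, temperature, wind))
print(len(X))   # 12, i.e., 2 x 3 x 2
```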
Now, can we use this to compute the number of possible hypotheses in the space of hypotheses? Sure we can!
Reminder: A hypothesis can impose 3 different constraints on each attribute:
- Dictate a specific value for that attribute (e.g., Wind = Strong)
- Make any value acceptable, which we denoted by a question mark ?
- Make no value acceptable, which we denoted with Ø
This means that we need to consider 2 more possibilities per attribute to include both the ? and Ø constraints, as we have already counted all the possible values each attribute can take. So, the number of syntactically distinct hypotheses in the space of hypotheses, H, can be computed as follows:
Number of syntactically distinct hypotheses in H = (2+2) × (3+2) × (2+2) = 80
The extra number 2 in each factor denotes the 2 additional possibilities, the ? and Ø constraints, for each attribute, and we have added it to the number of possible values that each attribute can take. As a result, we have accounted for ALL 3 possible types of constraints mentioned earlier, for each and every attribute in our training set, X.
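Here is a quick sketch to verify the count: we simply extend each attribute's value list with the two extra symbols ? and Ø and enumerate all combinations (the variable names are just illustrative):

```python
from itertools import product

sky = ["Sunny", "Rainy"]
temperature = ["Warm", "Freezing", "Cold"]
wind = ["Strong", "Weak"]

# Each attribute can be a specific value, "?" (any value), or "Ø" (no value).
H_syntactic = list(product(sky + ["?", "Ø"],
                           temperature + ["?", "Ø"],
                           wind + ["?", "Ø"]))
print(len(H_syntactic))   # 80, i.e., (2+2) x (3+2) x (2+2)
```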
IMPORTANT: Let me ask you a question. Considering the fact that we have 3 attributes (i.e., Sky, Temperature, and Wind), check out the following three possible hypotheses:
<Ø, ?, ?>, <Ø, Ø, ?>, and <Ø, Ø, Ø>
Can you tell me which instances in our training set can satisfy any of these 3 hypotheses? That is right! None of them! The result of applying any of these 3 hypotheses to our training examples is nothing but an empty set! It is all because of Ø! Actually, having just 1 such constraint on any of our attributes is enough for the entire hypothesis to result in an empty set. Think about it! If you have Ø for an attribute in a hypothesis, h, this means that h says the following:
“I don’t care what values this attribute has in your training examples! None of these values will satisfy me and you shall not pass!”
As a result, none of the examples could pass h, and the output of h will be an empty set! So, it doesn't matter whether we have only one Ø for one attribute or multiple Ø's for multiple attributes in a hypothesis! They all lead to the same outcome, that is, an empty set!
So, when counting the hypotheses in the space of hypotheses, we can treat all the cases where there is at least one Ø on any of the attributes as just 1 case! As a result, instead of adding the extra 2 mentioned above, we only add 1 per attribute, to account for the ? constraint, and after computing the multiplication, we add a single 1 to the result. This last 1 stands for all the possible hypotheses in the space of hypotheses that have at least 1 Ø in any of their attributes! So, the number of semantically distinct hypotheses in H can be computed as follows:
Number of semantically distinct hypotheses in H = 1 + (2+1) × (3+1) × (2+1) = 37
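To convince ourselves of this number, here is one last sketch (building on the snippets above, and again only illustrative) that groups the 80 syntactically distinct hypotheses by the set of instances they cover; hypotheses that cover exactly the same instances are semantically identical:

```python
from itertools import product

sky = ["Sunny", "Rainy"]
temperature = ["Warm", "Freezing", "Cold"]
wind = ["Strong", "Weak"]

X = list(product(sky, temperature, wind))               # the 12 instances
H_syntactic = list(product(sky + ["?", "Ø"],
                           temperature + ["?", "Ø"],
                           wind + ["?", "Ø"]))           # the 80 hypotheses

def satisfies(hypothesis, instance):
    """True if the instance satisfies every constraint of the hypothesis."""
    return all(c != "Ø" and (c == "?" or c == v)
               for c, v in zip(hypothesis, instance))

# Two hypotheses are semantically identical if they cover the same instances.
extensions = {frozenset(x for x in X if satisfies(h, x)) for h in H_syntactic}
print(len(extensions))    # 37, i.e., 1 + (2+1) x (3+1) x (2+1)
```

All the Ø-containing hypotheses collapse into the single empty extension, which is exactly the extra 1 in the formula above.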
Perfect! This is nice and we are getting more comfortable with the whole story! However, please note that this is a very simple problem, and our training set is quite small. The space of hypotheses could be really, really large, or even infinite! As a result, any machine learning algorithm that searches this space has to use an efficient search method to find the hypotheses that fit the training set best! In the next section we will briefly talk about a nice ordering that could make our search smarter and more practical.
General-to-Specific Ordering of Hypotheses
Imagine that the hypothesis space is ludicrously huge, or even infinite! Is it even practical for the machine to examine each and every one of these hypotheses individually? Absolutely not! If only there were a way to have some sort of ordering among the hypotheses that could guide our search and make it smarter and more efficient! There is indeed a way to do this! Fortunately, all concept learning problems are similar in one useful aspect, namely, a structure that they all share:
“General-to-Specific Ordering of the Hypotheses”
A learning algorithm can use this very useful property to search the hypothesis space thoroughly, while not having to go through each and every one of the hypotheses individually.
In the next post, we will explain this idea of general-to-specific ordering of the hypotheses in more detail.