Knowledge Base Embeddings for DBpedia.
This is my final blog post for GSoC 2017, a three-month programme hosted by Google in which selected students from all over the world spend the summer working on projects for various open source organizations. I participated with the organization DBpedia.
In this post, I summarize the overall project, the final deliverables, and the conclusions. You can follow my week-by-week progress on the project in my earlier blog posts here.
In this project, I had the opportunity to try different approaches to building knowledge base embeddings for DBpedia. While popular datasets derived from Freebase and WordNet are widely used in research, no standard embedding dataset is available for DBpedia.
In the first month, out of the several methods found in our literature survey, I handpicked four approaches after trying several open source implementations on the WN18 and FB15K datasets.
The criteria for selecting these codes were:
1) They were cited as strong baselines in research papers, so evaluating them against a DBpedia set would be worthwhile.
2) Their open source implementations ran correctly.
These four approaches were TransE, DistMult, HolE, and ComplEx.
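For reference, these four approaches differ mainly in how they score a candidate triple (h, r, t). A minimal NumPy sketch of the four scoring functions, assuming entity and relation embeddings are given as plain vectors (complex-valued in the case of ComplEx):

```python
import numpy as np

def transe_score(h, r, t):
    # TransE: negative L2 distance between the translated head (h + r) and the tail
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    # DistMult: trilinear dot product, i.e. a diagonal bilinear model
    return np.sum(h * r * t)

def hole_score(h, r, t):
    # HolE: relation vector dotted with the circular correlation of head and tail,
    # computed efficiently in the Fourier domain
    corr = np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))
    return np.dot(r, corr)

def complex_score(h, r, t):
    # ComplEx: real part of the trilinear product with complex-valued embeddings;
    # with purely real embeddings this reduces to DistMult
    return np.real(np.sum(h * r * np.conj(t)))
```

Higher scores indicate more plausible triples; during training each model learns embeddings that rank observed triples above corrupted ones.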
In the second month, I prepared a DBpedia dataset by retrieving all the mapped triples from a DBpedia subset, and experimented with this dataset and the approaches above.
After discussing with my mentors, we decided to create three sets from this DBpedia data, with sizes of 10^4, 10^5, and 10^6 triples. Using these sizes, we would analyze how each approach scales and which method looks viable for the entire DBpedia.
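Creating the three sets can be sketched as random subsampling of the full triple file; the file name and the tab-separated one-triple-per-line format here are assumptions for illustration:

```python
import random

def sample_triples(path, n, seed=42):
    """Read (head, relation, tail) triples, one tab-separated triple per line,
    and return a random sample of n of them."""
    with open(path) as f:
        triples = [tuple(line.strip().split('\t')) for line in f if line.strip()]
    random.seed(seed)  # fixed seed so the experiment sets are reproducible
    return random.sample(triples, n)

# Hypothetical usage: build the three experiment sets from one dump file
# for size in (10**4, 10**5, 10**6):
#     subset = sample_triples('dbpedia_triples.tsv', size)
```

Fixing the random seed keeps the three sets reproducible across runs, which matters when comparing approaches on the same data.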
The final table after conducting all the experiments on the DBpedia set until convergence can be found below:
From the above table, you can see that I conducted experiments with all four approaches on the 10^4 and 10^5 sets. Due to time constraints, I ran the experiment on the 10^6 set only for DistMult. The reasons for choosing DistMult for the largest set are as follows:
1) Looking at the set of size 10^4, TransE gives the best Hits@10.
Overall, however, ComplEx gave the best filtered MRR, Hits@1, and Hits@3, while TransE's performance was visibly lower; DistMult's and ComplEx's performance lay somewhere between TransE's and HolE's. The bottleneck with HolE is that no GPU implementation is available, so it needed many more epochs and a much longer runtime.
Since we wanted a scalable system, I had to pick between DistMult and ComplEx, and DistMult needed far fewer epochs and a much shorter training runtime.
2) Looking at the set of size 10^5, which is presumably a better indicator of the approaches' performance due to the larger number of training samples, DistMult performed well on most of the metrics: raw MRR, filtered MRR, Hits@1, and Hits@3.
DistMult therefore became the obvious choice for scaling up to the 10^6 set.
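The metrics above are all derived from the rank of the correct entity among all candidates for each test triple. A minimal sketch of MRR and Hits@k, assuming the per-triple ranks (1-based) have already been computed:

```python
def mrr(ranks):
    # Mean reciprocal rank: average of 1/rank over all test triples
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    # Fraction of test triples whose correct entity ranks in the top k
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

The "raw" and "filtered" variants differ only in how the ranks are computed: the filtered setting removes other known-true triples from the candidate list before ranking, so filtered ranks are never worse than raw ones.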
To compare DistMult's runtime across the three sets and predict the training time for a full-sized DBpedia dataset, I executed DistMult on each set for a fixed number of epochs (300). After plotting the results as a line graph using Microsoft Excel's trendline feature, I got the following graph:
The closest curve fitting the data is a polynomial of order 2. From the curve, we can see that for a set of size 10^8 (the range of the entire DBpedia), the estimated training time is approximately 80,000 seconds, i.e. roughly 22 hours.
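The same order-2 extrapolation can be reproduced with NumPy in place of the Excel trendline; the timing values below are placeholders for illustration, not the actual measurements from the experiments:

```python
import numpy as np

# Observed (dataset size, training time) pairs for a fixed 300 epochs.
# Sizes are expressed in units of 10^4 triples to keep the fit well conditioned;
# the times are placeholder values, not the real measurements.
sizes = np.array([1.0, 10.0, 100.0])      # 10^4, 10^5, 10^6 triples
times = np.array([60.0, 700.0, 9000.0])   # seconds (illustrative)

# Fit a degree-2 polynomial, mirroring the Excel order-2 trendline
coeffs = np.polyfit(sizes, times, 2)

# Extrapolate to the full-DBpedia scale of ~10^8 triples (10^4 in these units)
estimate_seconds = np.polyval(coeffs, 1.0e4)
estimate_hours = estimate_seconds / 3600.0
```

With only three measured points, a degree-2 fit passes through them exactly, so the extrapolation is sensitive to the measured values; the 10^8 estimate should be read as a rough order-of-magnitude prediction.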
You can find the step-by-step execution of the code on my Github repo here.
You can find the final embeddings dataset, produced by running DistMult on the 10^6 set until convergence, here. (**add link**)