I have no interest in the Tatoeba project. A much better candidate for conversion would be BCCWJ, but the original data costs a fortune for regular people.
This data would also be most valuable for building a frequency list (and even a custom dictionary for an IME like mozc). Frequency lists built from the freely provided data already exist, but they are pretty much useless because the free data does not record the writing actually used in the source material.
Here is a sample of what the original data contains:
The “キー” is the original writing found in the source material, while the “語彙素” is the interpreted writing. Unfortunately, the “キー” isn’t provided in the free data, but it can be viewed within the free web search service called 中納言. Perhaps the data can be extracted, but it seems like a huge task.
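To illustrate why the original writing matters, here is a rough Python sketch that counts words by either column. It assumes the data were available as a tab-separated export with “キー” and “語彙素” header columns; that layout, the function name `count_writings`, and the three sample rows are made up for illustration and are not actual BCCWJ data.

```python
import csv
import io
from collections import Counter

# Hypothetical rows, invented for this example (not real BCCWJ data).
SAMPLE_TSV = (
    "キー\t語彙素\n"
    "できる\t出来る\n"
    "出来る\t出来る\n"
    "できる\t出来る\n"
)

def count_writings(tsv_text, column="キー"):
    """Count how often each value of `column` appears in a tab-separated export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader if row.get(column))

# Frequencies by the writing found in the source material.
by_surface = count_writings(SAMPLE_TSV, "キー")
# Frequencies by the interpreted writing.
by_lemma = count_writings(SAMPLE_TSV, "語彙素")
```

Counting by “語彙素” merges できる and 出来る into a single entry, which is exactly the distinction a frequency list or an IME dictionary would need to keep.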
Concerning kanjidic, I am interested but pyglossary unfortunately does not support it. I have already opened a feature request. I am sure there is another way to get a working dictionary but I do not believe I am knowledgeable enough to handle something more complex than a tab file.
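For what it is worth, a tab file might be reachable with a short script. The following is only a sketch, assuming the KANJIDIC2 XML release, where each `character` element holds a `literal`, `reading` elements with an `r_type` attribute, and `meaning` elements whose English entries carry no `m_lang` attribute. The function name, the output layout, and the inline sample are illustrative choices, and the sketch is untested against the full file.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the assumed KANJIDIC2 layout, for illustration only.
SAMPLE_XML = """<kanjidic2>
  <character>
    <literal>亜</literal>
    <reading_meaning>
      <rmgroup>
        <reading r_type="pinyin">ya4</reading>
        <reading r_type="ja_on">ア</reading>
        <reading r_type="ja_kun">つ.ぐ</reading>
        <meaning>Asia</meaning>
        <meaning>rank next</meaning>
        <meaning m_lang="fr">Asie</meaning>
      </rmgroup>
    </reading_meaning>
  </character>
</kanjidic2>"""

def kanjidic_to_tab(xml_text):
    """Return one 'kanji<TAB>readings | meanings' line per character."""
    root = ET.fromstring(xml_text)
    lines = []
    for character in root.iter("character"):
        literal = character.findtext("literal")
        # Keep only Japanese on/kun readings.
        readings = [r.text for r in character.iter("reading")
                    if r.get("r_type") in ("ja_on", "ja_kun")]
        # Meanings without an m_lang attribute are assumed to be English.
        meanings = [m.text for m in character.iter("meaning")
                    if m.get("m_lang") is None]
        definition = "; ".join(meanings)
        if readings:
            definition = "、".join(readings) + " | " + definition
        lines.append(f"{literal}\t{definition}")
    return "\n".join(lines)

tab_text = kanjidic_to_tab(SAMPLE_XML)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the resulting text could be saved as a plain headword/definition tab file.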
As for the “compound verbs dict”, I had never heard of it until now. I suppose it is this project? Looking through the data, I do not really understand why it would be useful, since any dictionary already includes explanations for compound words.