CFI Indexing tool/engine/scripts ?



My team is looking for a tool/script/SDK that can index an EPUB using the CFI specification. I've been looking high and low for anything, and have had no luck.



I have not seen such a tool either. Let me consult with some folks and see if I can find anything.

Standing by ! :-).....thanks !

I queried the IDPF Indexing working group folks. Here are some of their replies:

"Not that I have an answer to your question, but are they asking about "a
tool that indexes a book" or "a tool that is used to index a book"? Those
are two very different things. To my mind, the former implies text mining
and the latter implies human (indexer or author) involvement."

"I suspect folks will want both things. For some situations, the text mining approach can produce a "good enough" result for what is being looked for. That is "optimal" in the sense of "it gets me what I need the fastest and cheapest way." But if you talk with any of my friends at the American Society of Indexers you will likely get a very different perception. A "real" index really does need to be done by a real indexer (person). Even authors do notoriously bad indexes of their own books."

"On the surface, that prompts a "good luck with that" response, or maybe "in yer dreams!"

To be a little more constructive, if that person had a well defined taxonomy of what s/he was looking for in the content, this might actually someday be possible. There are tools that can do that kind of text mining to some extent, though they don't encode the result with CFI to my knowledge. This is not that much different from what web crawlers or text mining engines do for web content, so applying them to the XHTML content docs in an EPUB is actually not that much of a stretch. But a real indexer would not consider that a real index. And remember you have to be able to define what you're looking for in the first place.

But if what is being looked for is along the lines of a magic button that will "tell me all the things that this book is about, and where it talks about 'em" . . . then good luck with that. ;-) "

Net net, it is a hard problem unless you just want mindless data mining, a la Google. But even then, nobody knew of a data-mining tool that would generate CFIs.

Thank you for the information.

Regarding the responder's first question: yes, we want a tool that indexes a book (that is, creates the CFI tags, which can later be consumed by another script for searching). We are programmers over here, so some of the logic can be accounted for up front. We would ignore words that aren't relevant ("the", "a", "in", "you", etc.), and then likely pass the "real words" through a stemmer script to extract their roots, so "parting" and "parted" would both become "part". We would index on the root words, and on the search side look for words that contain an indexed root, which would get us back to all the matches.
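To make that pipeline concrete, here is a rough Python sketch of the stop-word, stemming, and root-word indexing steps. Everything in it is an assumption for illustration: the stop-word list is a tiny sample, toy_stem is a naive suffix stripper standing in for a real stemmer, and plain character offsets stand in for the CFIs we would eventually want to store. The search side here stems the query and looks up the root, which is one simple way to get back to all the matches.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "a", "in", "you", "and", "of", "to", "is"}

def toy_stem(word):
    """Naive suffix stripper, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_index(text):
    """Map each root word to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z']+", text.lower()):
        word = match.group()
        if word in STOP_WORDS:
            continue  # skip words that aren't relevant
        index[toy_stem(word)].append(match.start())
    return dict(index)

def search(index, query):
    """Stem the query the same way, then look up its root."""
    return index.get(toy_stem(query.lower()), [])

index = build_index("You parted the curtains in parting.")
print(index)                    # roots mapped to offsets
print(search(index, "parts"))   # offsets of "parted" and "parting"
```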

We were hoping for a script that would generate the CFI tags for whatever words we passed into it and leave the logic of choosing the words up to us, assuming another programmer hasn't already handled what I described a moment ago. Writing it all from scratch would be labour- and time-intensive, so we were hoping to avoid that. I've been looking all week to see if there are any scripts or tools out there that handle the CFI specification, and except for a couple of JavaScript scripts, I found nothing, which is rather surprising since CFI is supposed to be the de facto standard.
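For what it's worth, the kind of script we have in mind might look roughly like the Python sketch below. It is not an existing tool, and it makes several assumptions: the content document parses as well-formed XML, the caller already knows the spine step for that document (the hypothetical spine_step argument, e.g. "/6/4"), and the optional ID assertions in square brackets are left out. It follows the CFI step numbering as I read the spec: child elements get even indexes, the text chunks between them get odd indexes, and the character offset after the colon is counted in UTF-16 code units. The function names are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

def utf16_len(s):
    """Length of s in UTF-16 code units (how CFI counts character offsets)."""
    return len(s.encode("utf-16-le")) // 2

def word_cfis(xhtml, spine_step):
    """Return (word, cfi) pairs for every word in one XHTML content document.

    spine_step is assumed to be supplied by the caller, e.g. "/6/4".
    """
    root = ET.fromstring(xhtml)
    results = []

    def scan(text, steps, chunk_index):
        # Emit one CFI per word found in a single text chunk.
        if not text:
            return
        for match in re.finditer(r"\w+", text):
            offset = utf16_len(text[:match.start()])
            path = "".join(f"/{s}" for s in steps) + f"/{chunk_index}:{offset}"
            results.append((match.group().lower(), f"epubcfi({spine_step}!{path})"))

    def walk(elem, steps):
        scan(elem.text, steps, 1)                 # chunk before the first child
        for i, child in enumerate(elem, start=1):
            walk(child, steps + [2 * i])          # elements take even steps
            scan(child.tail, steps, 2 * i + 1)    # text after child i is odd

    walk(root, [])
    return results

doc = ("<html><head><title>T</title></head><body><p>Hello world</p>"
       "<p>Parting is <em>such</em> sweet sorrow</p></body></html>")
for word, cfi in word_cfis(doc, "/6/4"):
    print(word, cfi)
```

The word list and stemming logic would stay on our side; a script like this would only need to hand back the word and CFI pairs for us to filter and index.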

Anyway, if you come across any additional information, please feel free to post it here as I'll be checking back to see.


Barry O'Neill
