EPUB3 CFI


Can someone write a simple explanation of CFI?
I tried reading the spec and still don't understand it.
I first want to understand the beginning.
It starts with a / followed by a number. What does this number mean?
The first number is followed by a / and another number. What is the meaning of the second number?
I find the spec very hard to understand, and I still don't understand these first numbers.
After I understand them I will continue and try to understand the rest.

The easiest way to think of a CFI is as a path through the content in an EPUB to a specific location. You start walking from the package file, and each number refers to either an element or inter-element text/whitespace (odd numbers refer to inter-element content and even numbers to elements).

That's why you'll see CFIs starting with /6, as "6" refers to the third child element in the package document, which is always the <spine>. For example, you have this epubcfi at the top of the spec:

 

book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)

After the /6 you find another even /# step, which points to an <itemref> (4 here indicates the second <itemref>). The exclamation point after the step means to resolve this itemref to its content document, as that's where the next /# in the CFI will pick up. The ID in brackets allows for self-correction if the EPUB is later changed (e.g., if another content document were added to the spine before the one referenced, the CFI would be broken without an ID to compare against).

The next /4 means to go to the second child of this new content document we're now in. Assuming here it's an XHTML content document, the first child is the <head> and the second is the <body>.

The referencing just continues in this way until you reach a final element or a text offset within it (the :10 at the end means 10 characters into the text content). Also have a look at this earlier thread on CFI referencing for some more info.
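To make the walk above concrete, here is a rough Python sketch that splits a CFI into its steps. The function name parse_cfi and the tuple layout are my own invention for illustration, not anything from the spec, and it only handles the simple subset discussed in this thread: /N steps, optional [id] assertions, the ! indirection and one trailing :offset. It assumes no colon or escaped character appears inside a bracket, and it ignores ranges, side bias, and temporal or spatial offsets.

```python
import re

# One step: a slash, a number, and an optional [id] assertion.
STEP_RE = re.compile(r"/(\d+)(?:\[([^\]]*)\])?")

def parse_cfi(cfi):
    """Split a simple epubcfi(...) string into per-document step lists.

    Returns (parts, offset): parts holds one list of (number, id) tuples
    per document (the package document first, then each document reached
    through '!'); offset is the trailing character offset, or None.
    """
    match = re.fullmatch(r"epubcfi\((.*)\)", cfi)
    if match is None:
        raise ValueError("not an epubcfi(...) string")
    path = match.group(1)

    offset = None
    if ":" in path:
        # Simplification: treat the last colon as the character offset.
        path, raw_offset = path.rsplit(":", 1)
        offset = int(raw_offset)

    parts = []
    for segment in path.split("!"):
        steps = [(int(num), ident or None) for num, ident in STEP_RE.findall(segment)]
        parts.append(steps)
    return parts, offset

parts, offset = parse_cfi("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)")
print(parts[0])  # steps taken in the package document
print(parts[1])  # steps taken in the content document
print(offset)
```

For the CFI from the top of the spec, the first list is the /6 and /4[chap01ref] steps in the package document, the second list is the three steps inside the content document, and the offset is 10.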

CFIs can also point to locations in media, pixel locations in images, etc. They're very flexible, and give much greater granularity than simple ID referencing. They also allow a reading system to point into any location regardless of how the content author tagged and ID'd the content.

They're also more powerful than the simple outline I've given here. There are other features like side bias and temporal offsets that you'd also need to know about, but typically they are for machines to create, not humans.

Hope that helps.

Thank you, that clears up some of the mess.
From what you say, the first number is always 6?
For the second number, most of the examples I've seen use 4, but I have seen some examples with an odd number.
Why is it defined so that I always put the same number (6) first, and why do I have to multiply by 2?
The second number in your example (10) is not multiplied, so how do I know which number is multiplied and which is not?

Right, but the /7 references in section 2.3 are a typo that has been logged in the issue tracker, so ignore those.

There's no multiplying of numbers. Multiplying by two might work as a shorthand for getting the right element reference number, but it's not actually what is going on. As I mentioned above, inter-element text/whitespace and elements are numbered sequentially.

Consider this spine:

<spine><itemref id="tocref" idref="toc"/><itemref id="chap01ref" idref="c01"/></spine>

It contains two elements and three instances of interelement whitespace (collapsed in this case). The numeric referencing into this element therefore works like this:

       /1   /2     /3   /4     /5
<spine>  <itemref/>  <itemref/>  </spine>

There's simply no point in referencing the odd-numbered inter-element positions in the spine, as you need to reference an itemref. And note that this numbering would be the same even if the itemref tags weren't empty. You don't walk the descendants of child elements when building these reference points, as the content within an element represents another step down into the content (i.e., another step in your CFI path).
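If it helps, the numbering can be sketched in a few lines of Python using the standard library's xml.dom.minidom. The helper name element_steps is made up for this post. It only looks at the direct children of the element you hand it, which is the "don't walk the descendants" point above.

```python
from xml.dom import minidom

def element_steps(parent):
    """Return (cfi_index, element) pairs for the direct child elements of parent.

    Child elements get 2, 4, 6, ... in document order. Text and whitespace
    between them sit on the odd numbers and are not listed here.
    """
    steps = []
    index = 0
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            index += 2
            steps.append((index, node))
    return steps

spine = minidom.parseString(
    '<spine><itemref id="tocref" idref="toc"/>'
    '<itemref id="chap01ref" idref="c01"/></spine>'
).documentElement

for index, element in element_steps(spine):
    print(f"/{index}[{element.getAttribute('id')}]")
# prints /2[tocref] and then /4[chap01ref]
```

The result is the same whether or not there is whitespace between the itemref tags, because only elements advance the even counter.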

As noted above, the "10" refers to a character offset, which is why it is preceded by a colon not a slash (i.e., it means ten characters in from the text content position found at /3).

You'll need to read the path resolution section (3.1) of the specification to understand the meanings of the special characters in the CFIs, and what the numbers/characters that follow them mean. Once you grasp the path resolution using the /# notation, it does get much easier.

Thank you, your explanation really cleared it.
Reading the spec again after reading your explanation makes it clearer.
I think I will write an article about it so that other people can also understand.

It still seems strange.
In the example
epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
the number 3 is not even, although section 3.1.1 in the spec says:
Each element is assigned an even positive index: the first element is given index 2, the second element index 4, etc.

Each element is an even number, and inter-element content is odd. The inter-element content is typically text, but may just be whitespace, as between elements in the spine.

Have a look at the example in s. 3.1.10 which is where these CFIs are coming from. When you drill down into the CFI, ultimately the /10[para05] points to this paragraph:

 

<p id="para05">xxx<em>yyy</em>0123456789</p>

The /3 can now be seen to point to the second instance of text, which is the start of the string '0123..' (remembering not to count what is inside the em tags). Assuming the first position is '0', if you go ten characters into this string, you reach the point immediately after the '9'.

And to be complete, if you want to get to 'yyy', your CFI would instead be:

/10[para05]/2/1

Since the <em> tag is the first child element in the paragraph, it is referenced as /2. And as it contains no child elements, there is only one instance of text content, which you reach at /1.
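Both of those lookups can be checked with a small Python sketch. The resolve_step helper is hypothetical and deliberately minimal: it resolves one /N step against the direct children of a parent node, returning the element for an even N and the text between elements for an odd N. It does no bounds checking beyond rejecting N below 1, and it is not a full CFI resolver.

```python
from xml.dom import minidom

def resolve_step(parent, n):
    """Resolve a single /n step against the direct children of parent.

    Even n returns the (n/2)-th child element. Odd n returns the text that
    sits before the first element (n=1), between elements, or after the last.
    """
    if n < 1:
        raise ValueError("CFI step numbers start at 1")
    if n % 2 == 0:
        elements = [c for c in parent.childNodes if c.nodeType == c.ELEMENT_NODE]
        return elements[n // 2 - 1]
    # Collect one text chunk per gap between child elements; a gap may be empty.
    chunks = [""]
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            chunks.append("")
        elif child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            chunks[-1] += child.data
    return chunks[(n - 1) // 2]

para = minidom.parseString(
    '<p id="para05">xxx<em>yyy</em>0123456789</p>'
).documentElement

text = resolve_step(para, 3)                   # the /3 step
print(text)                                    # 0123456789
print(repr(text[:10]), repr(text[10:]))        # :10 lands just after the '9'
print(resolve_step(resolve_step(para, 2), 1))  # /2/1 gives yyy
```

Note how the text inside the em element never shows up when resolving odd steps on the paragraph itself; you only see it after taking the /2 step down into the em.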

The only time you would end with an even number followed by a character offset is when referencing into the alt attribute of an image, as noted in section 3.1.4:

A character offset terminating step may be present only following a /N step. For XHTML Content Documents, N would be an even number when referencing the alt text of an img element, and N would be odd when referencing text in a text node.
