Working Group Draft 6 May 2011
Copyright © 2011 International Digital Publishing Forum™
All rights reserved. This work is protected under Title 17 of the United States Code. Reproduction and dissemination of this work with changes is prohibited except with the written permission of the International Digital Publishing Forum (IDPF).
EPUB is a registered trademark of the International Digital Publishing Forum.
This specification, EPUB Canonical Fragment Identifier (epubcfi), defines a standardized method for referencing arbitrary content within an EPUB® Publication through the use of fragment identifiers.
The Web has proven that the concept of hyperlinking is tremendously powerful, but EPUB Publications have been denied much of the benefit that hyperlinking makes possible because of the lack of a standardized scheme to link into them. Although proprietary schemes have been developed and implemented for individual Reading Systems, without a commonly-understood syntax there has been no way to achieve cross-platform interoperability. The functionality that can see significant benefit from breaking down this barrier, however, is varied: from reading location maintenance to annotation attachment to navigation, the ability to point into any Publication opens a whole new dimension not previously available to developers and Authors.
This specification attempts to rectify this situation by defining an arbitrary structural reference that can uniquely identify any location, or simple range of locations, in a Publication: the EPUB CFI. The following considerations have strongly influenced the design and scope of this scheme:
The mechanism used to reference content should be interoperable: references to a reading position created by one Reading System should be usable by another.
Document references to EPUB content should be enabled in the same way that existing hyperlinks enable references throughout the Web.
Each location in an EPUB file should be able to be identified without the need to modify the document.
All fragment identifiers that reference the same logical location should be equal when compared.
Comparison operations, including tests for sorting and comparison, should be able to be performed without accessing the referenced files.
Simple manipulations should be possible without access to the original files (e.g., given a reference deep in a file, it should be possible to generate a reference to the start of the file).
Identifier resolution should be reasonably efficient (e.g., processing of the first chapter is not required to resolve a fragment identifier that points to the last chapter).
References should be able to recover their target locations through parser variations and document revisions.
Expression of simple, contiguous ranges should be supported.
An extensible mechanism to accommodate future reference recovery heuristics should be provided.
Please refer to the EPUB Specifications for definitions of EPUB-specific terminology used in this document.
A Publication-level EPUB CFI links into an EPUB Publication. The path preceding the EPUB CFI references the location of the Publication.
An intra-Publication EPUB CFI allows one Content Document to reference another within the same Publication. The path preceding the EPUB CFI references the current Publication's Package Document.
Refer to Intra-Publication CFIs for more information.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
All sections of this specification are normative except where identified by the informative status label "This section is informative". The application of informative status to sections and appendices applies to all child content and subsections they may contain.
All examples in this specification are informative.
This section is informative
A fragment identifier is the part of an IRI [RFC3987] that defines a location within a resource. Syntactically, it is the segment attached to the end of the resource IRI, starting with a hash (#). For HTML documents, IDs and named anchors are used as fragment identifiers, while for XML documents the Shorthand XPointer [XPTRSH] notation is used to refer to a given ID.
A Canonical Fragment Identifier (CFI) is a similar construct to these, but expresses a location within an EPUB Publication. For example:
demo.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
The function-like string immediately following the hash (epubcfi(…)) indicates that this fragment identifier conforms to the scheme defined by this specification, and the value contained in the parentheses is the syntax used to reference the location within the specified Publication (demo.epub). Using the processing rules defined in Path Resolution, any Reading System can parse this syntax, open the corresponding Content Document in the Publication and load the specified location for the User.
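As an informative illustration of the structure described above, the following sketch separates the resource from the value inside epubcfi(…). The function name split_epubcfi is hypothetical, and the sketch assumes the IRI has already been unescaped.

```python
def split_epubcfi(iri):
    """Split 'resource#epubcfi(...)' into (resource, cfi_body).

    Hypothetical helper: assumes the IRI is already unescaped.
    """
    resource, _, fragment = iri.partition("#")
    prefix, suffix = "epubcfi(", ")"
    if not (fragment.startswith(prefix) and fragment.endswith(suffix)):
        raise ValueError("not an EPUB CFI fragment: %r" % fragment)
    # Keep only the value contained in the parentheses.
    return resource, fragment[len(prefix):-len(suffix)]


resource, body = split_epubcfi(
    "demo.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)"
)
```

Here resource is the Publication reference and body is the path that a Reading System would go on to resolve.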
A complete definition of the EPUB CFI syntax is provided in the next section.
epub has been prepended to the name of the scheme as a more generic CFI-like scheme may be defined in the future for all XML+ZIP-based file formats.
A Canonical Fragment Identifier (CFI) consists of an initial sequence epubcfi that identifies this particular reference method, and a parenthesized path or range. A path is built up as a sequence of structural steps to reference a location. A range is a path followed by two local (or relative) paths that identify the start and end of the range.
Steps can either be navigational or terminating. Navigational steps may be repeated as necessary (e.g., to count elements, to process children or to follow references). There may be only one terminating step, which, if present, must be the last step in the sequence.
Substrings in brackets are extensible assertions that improve the robustness of traversing paths and migrating them from one revision of the document to another. These assertions preserve additional information about traversed elements of the document, which makes it possible to recover the intended location even after some modifications are made to the Publication.
The value definition in the syntax above can either be a quoted string (delimited by quote characters (")) or a sequence of characters. The following characters must only appear in a quoted string, however, as they would otherwise interfere with parsing:
quote (")
brackets ([, ])
parentheses ((, ))
comma (,)
semicolon (;)
Quote and backslash characters (\) contained within a quoted string must be escaped by adding a single backslash character before them.
Example of an EPUB CFI that points to a location after the text "c:\" drive.
#epubcfi(/6/7[chap05ref]!/4[body01]/10/2/1:3["\"c:\\\" drive"])
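The escaping rule above can be sketched as a pair of helpers. The names quote_value and unquote_value are hypothetical and not part of this specification; the sketch simply applies the single-backslash rule to quote and backslash characters.

```python
def quote_value(text):
    """Return text as a quoted string, escaping backslashes and quotes."""
    # Backslashes are escaped first so the ones added for quotes are not doubled.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def unquote_value(quoted):
    """Reverse quote_value; raise ValueError on malformed input."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("value is not a quoted string")
    out = []
    chars = iter(quoted[1:-1])
    for ch in chars:
        if ch == "\\":
            ch = next(chars, None)  # the escaped character
            if ch is None:
                raise ValueError("dangling backslash")
        out.append(ch)
    return "".join(out)


# The text from the example above: "c:\" drive
assertion = quote_value('"c:\\" drive')
```

With the text from the example above, quote_value produces the same bracket content as shown in that CFI.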
The following rules apply to the use of numbers and integers within the path or range:
leading zeros are not allowed for numbers or integers (to ensure uniqueness);
trailing zeros are not allowed in the fractional part of a number;
zero must be represented as the integer 0;
numbers in the range 1 > N > 0 must have a leading 0.;
integral numbers must be represented as integers.
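A minimal sketch of how an implementation might produce numbers that satisfy the rules above. The function name canonical_number is hypothetical, and rejecting negative or non-finite values is an assumption of this sketch rather than a rule stated here.

```python
from decimal import Decimal


def canonical_number(value):
    """Format value following the number rules listed above (sketch)."""
    d = Decimal(str(value))
    # Assumption: offsets are never negative or non-finite.
    if not d.is_finite() or d < 0:
        raise ValueError("unsupported number: %r" % (value,))
    if d == d.to_integral_value():
        # Integral numbers (including zero) are written as integers.
        return str(int(d))
    # Fixed-point text always has a leading digit; drop trailing zeros.
    return format(d, "f").rstrip("0")


examples = [canonical_number(v) for v in ("007", "0.50", ".5", "5.0", "0.0")]
```

Using Decimal rather than binary floating point keeps the textual value exact, which matters because CFIs are compared as strings of steps.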
When an EPUB CFI is used as part of an IRI [RFC3987], it must be escaped as per that specification. When extracted from an IRI, it must be unescaped prior to parsing, sorting or comparing.
The process of resolving an EPUB CFI to a location within a Publication begins with the root package element of the Package Document. Each step in the CFI is then processed one by one, left to right, applying the rules defined in the following subsections.
The EPUB CFI examples in the following subsections are based on the sample documents in Examples.
A step with a slash (/) followed by an integer refers to a child node or nodes in the following manner:
Each element is assigned an even positive index: the first element is given index 2, the second element index 4, etc.
Each (possibly empty) collection of non-element nodes before the first element, between elements, and after the last element is given an odd index according to its position (these typically refer to the text of the publication).
Non-element nodes that are not text nodes are always ignored (for the purposes of this specification, a text node includes text, CDATA sections and entity references).
This indexing method ensures that node identification is not sensitive to XML parser handling of whitespace text nodes, CDATA sections and entity references (e.g., to avoid the ambiguity that can arise depending on whether a parser collapses whitespace-only text nodes, keeps text, CDATA sections and entity references as distinct nodes or doesn't, or breaks text into multiple nodes).
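The indexing rules above can be illustrated with a short sketch built on Python's xml.dom.minidom. The helper names element_index and text_at are hypothetical. Note that minidom expands entity references during parsing, so only text and CDATA nodes need to be collected.

```python
from xml.dom import Node, minidom


def element_index(element):
    """Return the even index of element among its parent's children."""
    index = 0
    for sibling in element.parentNode.childNodes:
        if sibling.nodeType == Node.ELEMENT_NODE:
            index += 2
            if sibling is element:
                return index
    raise ValueError("element is not a child of its parent")


def text_at(parent, odd_index):
    """Return the concatenated text of the collection at odd_index."""
    position = 1  # the collection before the first element
    chunks = []
    for child in parent.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if position == odd_index:
                break  # the requested collection ends at this element
            position += 2
        elif (child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
              and position == odd_index):
            chunks.append(child.data)  # comments etc. are ignored
    return "".join(chunks)


doc = minidom.parseString('<p id="para05">xxx<em>yyy</em>0123456789</p>')
para = doc.documentElement
em = para.getElementsByTagName("em")[0]
```

For this paragraph, the em element receives index 2, and the text collections on either side of it receive indices 1 and 3.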
For a Standard EPUB CFI, the leading step in the CFI must start with a slash (/) followed by an even number that references the spine child element of the Package Document's root package element. The Package Document traversed by the CFI must be the one specified as the default rendition in the Publication's META-INF/container.xml file (i.e., the Package Document referenced by the first rootfile element in container.xml).
For an Intra-Publication EPUB CFI, the first step must start with a slash followed by a node number that references a position in the Package Document starting from the root package element.
When an EPUB CFI references an element that contains an ID [XML], the corresponding path step must include that ID in square brackets (i.e., after the slash (/) and even number that identifies the element).
Specification of identifiers adds robustness to the CFI scheme: a Reading System may determine that the location referenced by the CFI is not the original intended location, and may use the identifier to compute the set of steps that reach the desired destination in the content (see Intended Target Location Correction). The cost of this added robustness is that comparison (and sorting) of CFI strings may only be performed after logically stripping all bracketed substrings (see Sorting Rules).
A step with a leading exclamation point (!) indicates that the reference must be followed and the next step applied starting from the new target node (or root element node when a complete XML document is referenced).
Only the following references are honored:
For itemref in the Package Document spine, the reference is defined by the href attribute of the corresponding item element in the manifest (i.e., the one that the itemref's idref attribute references).
For [HTML5] iframe and embed elements, references are defined by the src attribute.
For the [HTML5] object element, the reference is defined by the data attribute.
For [SVG] image and use elements, references are defined by the xlink:href attribute.
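The first rule in the list above, following an itemref to its manifest item, might be sketched as follows using Python's xml.etree.ElementTree. The function name spine_item_href is hypothetical, and PACKAGE is a trimmed stand-in for a Package Document rather than a complete one.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

# Trimmed, illustrative Package Document (not a complete one).
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata/>
  <manifest>
    <item id="titlepage" href="titlepage.xhtml"/>
    <item id="chapter01" href="chapter01.xhtml"/>
  </manifest>
  <spine>
    <itemref id="titleref" idref="titlepage"/>
    <itemref id="chap01ref" idref="chapter01"/>
  </spine>
</package>"""


def spine_item_href(package_xml, spine_step, itemref_step):
    """Resolve the two leading even steps and the '!' that follows them."""
    root = ET.fromstring(package_xml)
    # Iterating an Element yields element children only, so the Nth
    # element child corresponds to even index 2 * N.
    spine = list(root)[spine_step // 2 - 1]
    itemref = list(spine)[itemref_step // 2 - 1]
    idref = itemref.get("idref")
    for item in root.iter(OPF_NS + "item"):
        if item.get("id") == idref:
            return item.get("href")
    raise LookupError("no manifest item with id %r" % idref)


href = spine_item_href(PACKAGE, 6, 4)  # the steps /6/4 followed by !
```

The returned href is relative to the Package Document; a Reading System would then open that Content Document and continue with the remaining steps from its root element.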
A terminating step with a leading colon (:) followed by an integer refers to a character offset.
The given character offset may apply to an element node only if this element is the [HTML5] img element with an alt attribute containing the text to which the character offset applies.
For text nodes, or a collection of nodes, the offset is zero-based and always refers to a position between characters, so 0 means before the first character and a number equal to the total UTF-16 length means after the last character. A character offset value greater than the UTF-16 length of the available text must not be specified.
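Because offsets are counted in UTF-16 code units, implementations in languages whose strings are indexed by code point need a conversion. The following is a minimal sketch in Python; the helper names utf16_length and check_offset are hypothetical.

```python
def utf16_length(text):
    """Return the length of text in UTF-16 code units."""
    # Each code unit is two bytes in UTF-16 (little-endian, no BOM).
    return len(text.encode("utf-16-le")) // 2


def check_offset(text, offset):
    """Return offset if it is a valid position in text, else raise."""
    if not 0 <= offset <= utf16_length(text):
        raise ValueError("offset %d is outside the available text" % offset)
    return offset


digits_length = utf16_length("0123456789")
emoji_length = utf16_length("a\U0001F600")  # the emoji takes two code units
```

A character outside the Basic Multilingual Plane therefore advances the offset by two, not one.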
A character offset terminating step may only be present following a /N step. For XHTML Content Documents, N would be an even number when referencing the alt text of an img element, and N would be odd when referencing text in a text node.
No other steps may follow a character offset terminating step.
A terminating step with a leading tilde (~) followed by a number indicates a temporal position for audio or video measured in seconds.
No other steps can follow a temporal offset terminating step.
A terminating step with a leading at sign (@) followed by two colon-separated numbers indicates a 2D spatial position within an image or video. The two numbers represent scaled locations in the x and y axes, and must be in the range 0 to 100 regardless of the image's native or display dimensions (i.e., the upper left is 0:0 and the lower right is 100:100).
No other steps can follow a spatial offset terminating step.
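As an illustration of the scaling described above, the following sketch converts a pixel position into a spatial offset step. The function name spatial_step is hypothetical, and rounding to two decimal places is an assumption of this sketch, not a requirement stated here.

```python
def _format_scaled(value):
    """Format with at most two decimals and no trailing zeros (assumption)."""
    return ("%.2f" % value).rstrip("0").rstrip(".")


def spatial_step(px_x, px_y, width, height):
    """Return an '@x:y' step for a pixel position in a width x height image."""
    x = 100 * px_x / width
    y = 100 * px_y / height
    if not (0 <= x <= 100 and 0 <= y <= 100):
        raise ValueError("position lies outside the image")
    return "@%s:%s" % (_format_scaled(x), _format_scaled(y))


center = spatial_step(320, 240, 640, 480)
```

Because the values are scaled, the same step identifies the same point whether the image is rendered at its native size or resized by the Reading System.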
A temporal and a spatial position may be used together. In this case, the temporal specification must precede the spatial one syntactically (e.g., ~23.5@5.75:97.6 refers to a point 23.5 seconds into a video in the lower left of the frame).
No other steps can follow a temporal-spatial position terminating step.
An EPUB CFI may specify a substring that should precede and/or follow the encountered point, but such assertions must only occur after a character offset terminating step.
For example, the following expression asserts that yyy is expected immediately before the encountered point using the sample content below:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[yyy])
An additional substring that follows the encountered point can be given after a comma. For example:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:3[xx,y])
refers to the position marked by the asterisk:
 x x x y y y 0 1 2 3 4 5 6 7 8 9
| | | * | | | | | | | | | | | |
If there is no preceding text, or only trailing text is specified, a comma must immediately precede the text assertion:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[,y])
There is no restriction on the amount of the preceding and following text that can be included in the match. Text is taken from the document ignoring element boundaries and whitespace is always collapsed (i.e., a non-empty sequence of contiguous whitespace characters is always replaced with a single space character).
A Reading System may determine that the location referenced by the CFI is not the original intended location (due to non-matching text), and may use the preceding/trailing text to compute the set of steps that reach the desired destination in the content (see Intended Target Location Correction). The cost of this added robustness is that comparison (and sorting) of CFI strings may only be performed after logically stripping all bracketed substrings (see Sorting Rules).
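A minimal sketch of checking a text location assertion, assuming the caller has already gathered the document text on each side of the referenced point (ignoring element boundaries). The names collapse and assertion_holds are hypothetical.

```python
import re


def collapse(text):
    """Replace each run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


def assertion_holds(preceding_text, following_text, before="", after=""):
    """Check the asserted substrings on both sides of a location.

    preceding_text and following_text are the document text before and
    after the point; an empty assertion string always matches.
    """
    return (collapse(preceding_text).endswith(before)
            and collapse(following_text).startswith(after))


# The assertion [xx,y] at the point between xxx and yyy:
matches = assertion_holds("xxx", "yyy0123456789", before="xx", after="y")
```

When the check fails, a Reading System may search the surrounding text for the asserted strings in order to derive a corrected location.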
In some situations, it is important to preserve which side of a location a reference points to. For example, when resolving a location in a dynamically paginated environment, it would make a difference if a location is attached to the content before or after it (e.g., to determine whether to display the verso or recto side at a page break).
The s parameter is used to preserve this sided-ness aspect of a location. It can take two values: b means that the location belongs with the content before it and a with the content after. This parameter must always be used inside square brackets at the end of the CFI, even if the ID [XML] or text location assertion is empty.
The location just after yyy in the sample content below can be expressed as belonging with the content before it as follows:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[;s=b])
Equally, it can be expressed including a text location assertion as:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[yyy;s=b])
The location at the start of the em element can be attached to the content preceding the em element as follows:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2[;s=b])
If the side bias in the preceding example were set to a rather than b, the location would be attached to the child content of the em element, not the content following the em element.
Since side bias is expressed as a parameter, it does not participate in CFI comparison (see Sorting Rules).
Side is not defined for locations with a spatial terminus.
Side bias is only meaningful when some type of break falls at the location (e.g., a page break or line break).
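The following sketch shows one way to separate the assertion text from the side bias inside a bracketed substring. The function name parse_assertion is hypothetical, and the sketch assumes the content contains no quoted strings (a full parser would have to honor quoting and escaping). Parameters other than s are skipped.

```python
def parse_assertion(content):
    """Split bracket content such as 'yyy;s=b' into (text, side).

    Assumption: content contains no quoted strings. side is 'a', 'b'
    or None when no valid s parameter is present.
    """
    text, *raw_params = content.split(";")
    side = None
    for raw in raw_params:
        name, sep, value = raw.partition("=")
        if sep and name == "s" and value in ("a", "b"):
            side = value
        # Any other parameter is not understood here and is ignored.
    return text, side


with_text = parse_assertion("yyy;s=b")
empty_text = parse_assertion(";s=b")
```

The second call corresponds to the case where the ID or text location assertion is empty and only the side bias is given.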
This section is informative
Given the following Package Document:
<?xml version="1.0"?>
<package version="2.0"
         unique-identifier="bookid"
         xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:opf="http://www.idpf.org/2007/opf">
    <metadata>
        <dc:title>...</dc:title>
        <dc:identifier id="bookid">...</dc:identifier>
        <dc:creator>...</dc:creator>
        <dc:language>en</dc:language>
    </metadata>
    <manifest>
        <item id="toc" properties="nav" href="toc.xhtml" media-type="application/xhtml+xml"/>
        <item id="titlepage" href="titlepage.xhtml" media-type="application/xhtml+xml"/>
        <item id="chapter01" href="chapter01.xhtml" media-type="application/xhtml+xml"/>
        <item id="chapter02" href="chapter02.xhtml" media-type="application/xhtml+xml"/>
        <item id="chapter03" href="chapter03.xhtml" media-type="application/xhtml+xml"/>
        <item id="chapter04" href="chapter04.xhtml" media-type="application/xhtml+xml"/>
    </manifest>
    <spine>
        <itemref id="titleref" idref="titlepage"/>
        <itemref id="chap01ref" idref="chapter01"/>
        <itemref id="chap02ref" idref="chapter02"/>
        <itemref id="chap03ref" idref="chapter03"/>
        <itemref id="chap04ref" idref="chapter04"/>
    </spine>
</package>
and the XHTML Content Document chapter01.xhtml:
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>...</title>
    </head>
    <body id="body01">
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p>...</p>
        <p id="para05">xxx<em>yyy</em>0123456789</p>
        <p>...</p>
        <p>...</p>
        <img id="svgimg" src="foo.svg" alt="..."/>
        <p>...</p>
        <p>...</p>
    </body>
</html>
Then the EPUB CFI:
#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
refers to the position right after the digit 9 in the paragraph with the ID para05. When producing CFIs for text locations, unless the text is defined by an img element's alt attribute, one should always start with the text node or text node collection (even if empty) that corresponds to the location and then trace the ancestor and reference chain to the Package Document root.
The following examples show how EPUB CFIs can be constructed to reference additional content locations.
Reference to the img element.
epubcfi(/6/4[chap01ref]!/4[body01]/16[svgimg])
Reference to the location just before xxx.
epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:0)
Reference to the location just before yyy.
epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:0)
Reference to the location just after yyy.
epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3)
In order to sort or compute relative locations of multiple EPUB CFIs referencing the same EPUB Publication, the following rules must be applied:
the "IRI un-escaped" core path is used;
all bracketed annotations are removed or ignored entirely;
steps that come earlier in the sequence are more important;
XML child nodes, character offsets and temporal positions are sorted in natural order;
the y position is more important than x;
omitted spatial position precedes all other spatial positions;
omitted temporal position precedes all other temporal positions;
temporal position is more important than spatial;
different step types come in the following order from least important to most important: character offset (:), child (/), temporal-spatial (~ or @), reference/indirect (!).
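The rule above that bracketed annotations are removed or ignored can be sketched as a small scanner. The function name strip_assertions is hypothetical. The sketch only removes bracketed substrings (honoring quoted strings and backslash escapes inside them) and does not implement the remaining ordering rules.

```python
def strip_assertions(cfi):
    """Return cfi with every bracketed substring removed."""
    out = []
    in_bracket = False
    in_quotes = False
    i = 0
    while i < len(cfi):
        ch = cfi[i]
        if in_bracket and in_quotes:
            if ch == "\\":
                i += 2  # skip the backslash and the escaped character
                continue
            if ch == '"':
                in_quotes = False
        elif in_bracket:
            if ch == '"':
                in_quotes = True
            elif ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        else:
            out.append(ch)  # part of the core path
        i += 1
    return "".join(out)


core = strip_assertions("/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[yyy;s=b]")
```

The resulting core path contains only steps and offsets, which is the form on which the comparison rules operate.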
An EPUB CFI can be used to reference content inside the container. This kind of referencing can be achieved by specifying a reference to the Package Document followed by a CFI, which must be resolved starting from the root package element.
For example, using the Package Document in the previous example, a reference to the location just after yyy in chapter01.xhtml might be written as follows:
<a href="pub.opf#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[;s=b])">location</a>
EPUB CFIs allow the expression of simple ranges extending from a start location to an end location. A range must be expressed as a triple of parent path (P), start subpath (S) and end subpath (E), in the form:
#epubcfi(P,S,E)
The parent path must end at a step that is common for resolving both the path of the start and end locations of the range, and each start and end subpath must resolve to a location in non-decreasing order in the document.
To determine the start and end locations of the range, the start and end subpaths must be concatenated to the parent path to create the start location path (PS) and end location path (PE).
Using the sample documents above, the following range represents the text from the second y in yyy up to (and including) digit 3:
epubcfi(/6/4[chap01ref]!/4[body01]/10[para05],/2/1:1,/3:4)
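The concatenation of the parent path with each subpath can be sketched as follows. The names split_top_level and range_endpoints are hypothetical. The sketch splits only on commas outside square brackets and assumes that no quoted string in an assertion contains a bracket character.

```python
def split_top_level(body):
    """Split body on commas that are not inside square brackets.

    Assumption: quoted strings do not contain '[' or ']'.
    """
    parts, current, depth = [], [], 0
    for ch in body:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def range_endpoints(body):
    """Return (PS, PE) for a range body of the form P,S,E."""
    parts = split_top_level(body)
    if len(parts) != 3:
        raise ValueError("not a range: %r" % body)
    parent, start, end = parts
    return parent + start, parent + end


start_path, end_path = range_endpoints(
    "/6/4[chap01ref]!/4[body01]/10[para05],/2/1:1,/3:4"
)
```

Each of the two resulting paths can then be resolved like any single-location CFI.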
Ranges must be compared according to their PS, then PE, components.
It is not valid to use a path to an element as a shorthand for the range from the beginning to the end of the element. Single path notation always denotes a location point, and a range is represented by the notation described above. There is no special step to produce a reference to the end of an element, as that would make sorting impossible without consulting the content of the document.
If a range is used where a single location is expected by the context, the start location must be used.
Side-bias parameters must not be used for ranges; the start of a range is implicitly attached to the content after the start location and the end is implicitly attached to the content before the end location.
As an EPUB Publication may be updated, corrected or otherwise altered over time, it is useful to be able to derive an EPUB CFI for the modified document from one that targeted a previous version. This specification provides two mechanisms to detect and adapt to content changes that impact CFIs: IDs [XML] and text location assertions.
When a Reading System is processing a CFI, it should check the correctness of any encountered assertions. For example, given the path /6/4[chap01ref]!..., the Reading System should verify that the element has the ID matching chap01ref when processing element 4 (for this example, an itemref in the spine). If not, the Reading System should locate the ID chap01ref within the document and correct the CFI (e.g., if a new itemref was inserted before the chap01ref itemref, the desired element number would now be 6 and the corrected CFI would be /6/6[chap01ref]!...).
Likewise, text location assertions should be used to check referenced target locations, and used to derive a corrected CFI that targets the desired text location.
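The ID check described above might be sketched as follows. The function name corrected_index is hypothetical, and for brevity the sketch searches only the siblings of the expected element rather than the whole document.

```python
import xml.etree.ElementTree as ET


def corrected_index(parent, index, expected_id):
    """Return the even index of the child carrying expected_id.

    Returns index unchanged when the assertion already holds, a new
    even index when the element has moved among its siblings, and
    None when the ID cannot be found (simplification: siblings only).
    """
    if index % 2:
        raise ValueError("ID assertions apply to even (element) steps")
    children = list(parent)  # element children only
    position = index // 2 - 1
    if 0 <= position < len(children) and children[position].get("id") == expected_id:
        return index
    for i, child in enumerate(children):
        if child.get("id") == expected_id:
            return 2 * (i + 1)
    return None


# A spine in which a new itemref was inserted before chap01ref.
spine = ET.fromstring(
    '<spine><itemref id="titleref"/><itemref id="newref"/>'
    '<itemref id="chap01ref"/></spine>'
)
new_index = corrected_index(spine, 4, "chap01ref")
```

In this revised spine the step /4[chap01ref] no longer matches, and the sketch reports 6 as the corrected element number, as in the example above.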
If one of the assertions fails during processing, and a corrected CFI cannot be derived (the ID is not found in the document, or text matches could not be found), the CFI must be considered an invalid reference. In cases where a Reading System cannot check for correctness (e.g., document-resident XML IDs are not available at CFI processing time), a Reading System must ignore the CFI assertions.
This notion of correcting CFIs can lead to circumstances where two different CFIs point to the same location (i.e. the "stale" CFI, pre-correction, and the corrected CFI). The corrected CFI should be used where possible. A Reading System and any surrounding content management system should attempt to replace stale CFIs with their corrected versions where possible.
This specification encourages the development of custom functions to assist with CFI correction where the intrinsic functionality is insufficient. Refer to Extending EPUB CFIs for more information on how to develop such functionality.
The provision for extensions (CSV parameter lists, prefixed by a parameter name, and separated by semicolons) allows Reading Systems to apply new or experimental heuristics to assist, for example, in migrating EPUB CFI fragments to updated documents.
It is recommended that any vendor-specific parameter names start with vnd. followed by the vendor name.
Implementations must ignore all parameters that they do not understand or cannot parse.
[RFC2119] Key words for use in RFCs to Indicate Requirement Levels (RFC 2119). March 1997.
[RFC3987] Internationalized Resource Identifiers (IRIs) (RFC 3987). January 2005.
[SVG] Scalable Vector Graphics (SVG) 1.1 (Second Edition). 22 June 2010.
[XML] Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008.
[XPTRSH] XPointer Shorthand Notation.