DjVu - WikiHQ

DjVu is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as providing smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 kB, black-and-white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy and paste and text search operations.

History

The DjVu technology was originally developed, from 1996 to 2001,

Prior to the standardization of PDF in 2008, DjVu was considered superior because it is an open file format, in contrast to the proprietary nature of PDF at the time. The declared higher compression ratio (and thus smaller file size) and the claimed ease of converting large volumes of text into DjVu format were other arguments for DjVu's superiority over PDF in 2004. Independent technologist Brewster Kahle in a 2004 talk on IT Conversations discussed the benefits of allowing easier access to DjVu files.

The DjVu library distributed as part of the open-source package DjVuLibre has become the reference implementation for the DjVu format. DjVuLibre has been maintained and updated by the original developers of DjVu since 2002.

The DjVu file format specification has gone through a number of revisions, the most recent being from 2005.

{| class="wikitable sortable"
|+ Revision history
! Version
! Release date
! Notes
|-
|
| 1996–1999
| Developmental versions by AT&T labs preceding the sale of the format to LizardTech.
|}

and the Internet Archive, browser plugins which allowed advanced online browsing, smaller file size for comparable quality of book scans and other image-heavy documents and support for embedding and searching full text from OCR.

Some features, such as thumbnail previews, were later integrated into the Internet Archive's BookReader, and DjVu browsing was deprecated in its favour around 2015, when some major browsers stopped supporting NPAPI and, with it, DjVu plugins.

Design

The DjVu file format is based on the Interchange File Format and is composed of hierarchically organized chunks. The IFF structure is preceded by a 4-byte <code>AT&T</code> magic number. Following is a single <code>FORM</code> chunk with a secondary identifier of either <code>DJVU</code> or <code>DJVM</code> for a single-page or a multi-page document, respectively.

All the chunks can be contained in a single file, in the case of so-called bundled documents, or spread across several files: one file for every page plus some files with shared chunks.
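The layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference parser: the function names (`read_form_header`, `list_chunks`, `make_chunk`) are hypothetical, and the 32-bit big-endian length field and even-byte padding are general IFF conventions that are assumed here to apply unchanged. The sample bytes are synthetic, not taken from a real document.

```python
import struct

MAGIC = b"AT&T"  # 4-byte magic number that precedes the IFF structure


def read_form_header(data: bytes):
    """Return (secondary_id, body) for the root FORM chunk of a DjVu byte string."""
    if data[:4] != MAGIC:
        raise ValueError("missing AT&T magic number")
    # Assumed IFF layout: 4-byte chunk ID followed by a big-endian 32-bit length.
    chunk_id, length = struct.unpack(">4sI", data[4:12])
    if chunk_id != b"FORM":
        raise ValueError("root chunk is not FORM")
    payload = data[12:12 + length]
    # The first 4 bytes of a FORM payload are its secondary identifier.
    return payload[:4].decode("ascii"), payload[4:]


def list_chunks(body: bytes):
    """Yield (chunk_id, length) for each chunk directly inside a FORM body."""
    offset = 0
    while offset + 8 <= len(body):
        chunk_id, length = struct.unpack(">4sI", body[offset:offset + 8])
        yield chunk_id.decode("ascii"), length
        # Assumed IFF rule: chunks start on even offsets, so skip a pad byte
        # after an odd-length payload.
        offset += 8 + length + (length & 1)


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build a synthetic chunk for demonstration purposes only."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


# A made-up single-page document with placeholder payloads.
sample = MAGIC + make_chunk(
    b"FORM",
    b"DJVU" + make_chunk(b"INFO", bytes(10)) + make_chunk(b"Sjbz", b"abc"),
)

kind, body = read_form_header(sample)
chunks = list(list_chunks(body))
print(kind, chunks)
```

A secondary identifier of `DJVU` indicates a single-page document and `DJVM` a multi-page one. In the multi-page case the children listed by `list_chunks` would themselves include nested `FORM` chunks, which this sketch reports but does not descend into.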

{| class="wikitable"
|+ Chunk types in DjVu files
!scope="col"| Chunk identifier
!scope="col"| Contained by
!scope="col"| Description
|-
!scope="row"| FORM:DJVU
| FORM:DJVM || Describes a single page. Can either be at the root of a document, making it a single-page document, or be referred to from a DIRM chunk.
|-
!scope="row"| FORM:DJVM
| || Describes a multi-page document. Is the document's root chunk.
|-
!scope="row"| FORM:DJVI
| FORM:DJVM || Contains data shared by multiple pages.
|-
!scope="row"| FORM:THUM
| FORM:DJVM || Contains thumbnails.
|-
!scope="row"| INFO
| FORM:DJVU || Must be the first chunk. Describes the page width, height, format version, resolution, gamma, and rotation.
|-
!scope="row"| DIRM
| FORM:DJVM || Must be the first chunk. References other chunks. These chunks can either follow this chunk inside the FORM:DJVM chunk or be contained in external files. These types of documents are referred to as bundled or indirect, respectively.
|-
!scope="row"| NAVM
| FORM:DJVM || If present, must immediately follow the DIRM chunk. Contains a BZZ-compressed outline of the document.
|-
!scope="row"| ANTa, ANTz
| FORM:DJVI or FORM:DJVU || Annotations.
|-
!scope="row"| TXTa, TXTz
| FORM:DJVU || Unicode text and layout information.
|-
!scope="row"| INCL
| FORM:DJVU || The ID of an included chunk.
|-
!scope="row"| Sjbz
| FORM:DJVU || BZZ-compressed JB2 bitonal data used to store the mask.
|-
!scope="row"| Djbz
| FORM:DJVI or FORM:DJVU || Shared shape table.
|-
!scope="row"| WMRM
| ? || JB2 data required to remove a watermark.
|-
!scope="row"| <s>CIDa</s>
| FORM:DJVU || Obsolete chunk with unknown content.
|}

DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using JB2, a method for bitonal data that is closely related to JBIG2.

Both JB2 and JBIG2 have the same problems when performing lossy compression. In 2013 it emerged that Xerox photocopiers and scanners had been substituting digits for similar-looking ones, for example replacing a 6 with an 8. A DjVu document has been spotted in the wild with character substitutions, such as an n with bleeding serifs turning into a u and an o with a spot inside turning into an e. The file does not record whether lossy compression was applied.
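The three-layer separation described above can be illustrated with a small Python sketch of how a viewer might recombine the layers. It is a simplification under stated assumptions: the function name `compose_page` is hypothetical, the layers are plain nested lists instead of decoded IW44 or JB2 data, and the low-resolution layers are upsampled with nearest-neighbour lookup, which is chosen here for brevity and is not claimed to be what real decoders do.

```python
def compose_page(mask, foreground, background, scale):
    """Rebuild full-resolution pixels from a mask and two low-resolution layers.

    mask: rows of 0/1 at full resolution (1 selects the foreground layer).
    foreground, background: rows of pixel values at 1/scale of that resolution.
    """
    page = []
    for y, mask_row in enumerate(mask):
        row = []
        for x, bit in enumerate(mask_row):
            layer = foreground if bit else background
            # Nearest-neighbour upsampling: each low-resolution pixel covers
            # a scale x scale block of mask pixels.
            row.append(layer[y // scale][x // scale])
        page.append(row)
    return page


# Toy data: a 6x3 mask at "300 dpi" over 2x1 colour layers at "100 dpi".
mask = [
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
foreground = [["K", "B"]]  # e.g. ink colours
background = [["w", "y"]]  # e.g. paper colours

page = compose_page(mask, foreground, background, scale=3)
for row in page:
    print("".join(row))
```

The mask carries the sharp detail, such as the outlines of text, while the two colour layers only need to say which colour to paint on either side of it. That is why the colour layers can be stored at a much lower resolution than the mask.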

Licensing

DjVu is an open file format with patents. The rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T Corporation, LizardTech, Celartem and ePapyrus Solutions K.K. (formerly Cuminas before joining ePapyrus Solutions, Inc.). Patents typically have an expiry term of about 20 years.

Celartem acquired LizardTech and Extensis.

Format adoption

Free creators, manipulators, converters, web browser plug-ins, and desktop viewers are available. In February 2016, the Internet Archive announced that DjVu would no longer be used for new uploads, citing, among other reasons, the format's declining use and the difficulty of maintaining its Java applet-based viewer for the format.

Format software

any2djvu converts .ps, .ps.gz, and .pdf files to .djvu (a DjVu file) via the Any2DjVu server, which is maintained by Léon Bottou and Yann LeCun and hosted by the Courant Institute of Mathematical Sciences at New York University, with hardware donated by Caminova, Inc.

Jakub Wilk's pdf2djvu creates DjVu files from PDF files for GNU/Linux OS (archived), including Ubuntu, and Cygwin (orphaned).

The selection of downloadable DjVu viewers is wider on Linux distributions than it is on Windows or macOS. Additionally, the format is rarely supported by proprietary scanning software.

DjVu is supported by a number of multi-format document viewers and e-book reader software on Linux (Okular, Evince, Zathura), Windows (Okular and SumatraPDF) and Android (Document Viewer, FBReader, EBookDroid, PocketBook).

DjVu.js Viewer is a project that develops a program library, a web application, and browser extensions for Firefox and Google Chrome, to view DjVu files.

External links

DjVu software downloads – Cuminas Corporation
DjVu.js Viewer used in: Firefox and Google Chrome