In computing, a polyglot is a computer program or script (or other file) written in a valid form of multiple programming languages or file formats. The name was coined by analogy to multilingualism. A polyglot file is composed by combining syntax from two or more different formats.
When the formats are source code to be compiled or interpreted, the file can be called a polyglot program. File formats and source code syntax are both fundamentally streams of bytes, and exploiting this commonality is key to the development of polyglots. Polyglot files have limited practical applications in compatibility, but can also present a security risk when used to bypass validation or to exploit a vulnerability.
History
Polyglot programs have been crafted as challenges and curios in hacker culture since at least the early 1990s. A notable early example, named simply <code>polyglot</code>, was published on the Usenet group rec.puzzles in 1991 and supported eight languages, though it was inspired by even earlier programs. In 2000, a polyglot program was named a winner in the International Obfuscated C Code Contest.
In the 21st century, polyglot programs and files gained attention as a covert channel mechanism for propagation of malware.
Construction
A polyglot is composed by combining syntax from two or more different formats, leveraging syntactic constructs that are either common to the formats or language-specific but carry a different meaning in each language. A file is a valid polyglot if it can be successfully interpreted by multiple interpreting programs. For example, a PDF-ZIP polyglot might be opened as a valid PDF document and also decompressed as a valid ZIP archive. To maintain validity across interpreting programs, one must ensure that constructs specific to one interpreter are not interpreted by another, and vice versa.
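As a minimal sketch of keeping one interpreter's constructs away from another, the following Python script holds a small source text intended to be valid both as Python and as a POSIX shell script. The shell reads the first line as the no-op command <code>:</code> and exits before reaching the Python code, while Python reads the shell commands as the body of a string literal. The names <code>POLYGLOT</code>, <code>run_as_python</code> and <code>run_as_shell</code> are illustrative, and the shell half assumes an <code>sh</code> executable on the search path.

```python
import contextlib
import io
import shutil
import subprocess

# Illustrative polyglot source, built line by line to keep the quoting readable.
POLYGLOT = "\n".join([
    "''':'",                       # sh: the no-op ':' command; Python: opens a string
    'echo "hello from sh"',        # sh: runs; Python: still inside the string
    "exit 0",                      # sh: stops reading here
    "'''",                         # Python: closes the string
    'print("hello from Python")',  # Python only
    "",
])


def run_as_python(source):
    """Execute the source as Python and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<polyglot>", "exec"), {})
    return buffer.getvalue()


def run_as_shell(source):
    """Feed the source to sh on standard input; return None if sh is missing."""
    shell = shutil.which("sh")
    if shell is None:
        return None
    result = subprocess.run(
        [shell], input=source, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(repr(run_as_python(POLYGLOT)))
    print(repr(run_as_shell(POLYGLOT)))
```

Holding the polyglot in a string rather than a separate file keeps the sketch self-contained; written to disk, the same text could be passed to either interpreter directly.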
Validity across interpreters is often maintained by hiding language-specific constructs in segments that the other format interprets as comments or plain text.
Polyglot markup applies the same idea to web documents. Such documents can be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For an HTML5 document to meet these criteria, it must have an HTML5 doctype and be written in well-formed XHTML. The same document can then be served as either HTML or XHTML, depending on browser support and MIME type.
The html-polyglot recommendation places further constraints on how elements are written. For example, to add an empty textarea to a page, one cannot use <code><textarea/></code>, but has to use <code><textarea></textarea></code> instead.
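The two conditions discussed above (well-formed XML, and no self-closing syntax on elements such as textarea) can be approximated with a short check. The sketch below is a heuristic under stated assumptions, not a validator for the html-polyglot recommendation: the function names are hypothetical, the regular expression ignores edge cases such as a <code>></code> character inside an attribute value, and the set of void elements is the HTML list of elements that never take a closing tag.

```python
import re
import xml.etree.ElementTree as ET

# HTML void elements; in this sketch only these may be written as <name/>.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Heuristic match for self-closing tags such as <br/> or <textarea />.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def unsafe_self_closing_tags(markup):
    """List non-void elements written with self-closing syntax."""
    return [
        match.group(1)
        for match in SELF_CLOSING.finditer(markup)
        if match.group(1).lower() not in VOID_ELEMENTS
    ]


TEMPLATE = (
    "<!DOCTYPE html>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>demo</title></head>"
    "<body>{body}<br/></body></html>"
)
GOOD = TEMPLATE.format(body="<textarea></textarea>")
BAD = TEMPLATE.format(body="<textarea/>")

print(is_well_formed_xml(GOOD), unsafe_self_closing_tags(GOOD))
print(is_well_formed_xml(BAD), unsafe_self_closing_tags(BAD))
```

Both sample documents are well-formed XML, which is why the XML check alone does not catch the problem: only the second check flags the self-closing textarea that an HTML parser would treat differently.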
Composing formats
The DICOM medical imaging format was designed to allow polyglotting with TIFF files, so that the same image data can be stored efficiently in a single file that can be interpreted by either DICOM or TIFF viewers.
Compatibility
The Python 2 and Python 3 programming languages were not designed to be compatible with each other, but there is sufficient commonality of syntax that a polyglot Python program can be written that runs in both versions.
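A minimal sketch of such a program is shown below. It restricts itself to syntax accepted by both versions: <code>print</code> is called with a single parenthesised argument, division uses a float operand so the result does not depend on the version's integer division rules, and the text type is looked up at run time. The helper names are illustrative only.

```python
import sys

try:
    text_type = unicode  # Python 2: separate Unicode string type
except NameError:
    text_type = str      # Python 3: str is already Unicode


def describe_interpreter():
    """Report the major version of the running interpreter."""
    return "running under Python %d" % sys.version_info[0]


def halve(value):
    """Divide by a float so both versions perform true division."""
    return value / 2.0


print(describe_interpreter())
print(halve(7))
print(isinstance(u"text", text_type))
```

Larger code bases often rely on the standard <code>__future__</code> module or a compatibility library rather than hand-written checks like these; the sketch only illustrates the shared subset of syntax.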
Security implications
A polyglot of two formats may steganographically compose a malicious payload within an ostensibly benign and widely accepted wrapper format, such as a JPEG file that allows arbitrary data in its comment field. A vulnerable JPEG renderer could then be coerced into executing the payload, handing control to the attacker. The mismatch between what the interpreting program expects and what the file actually contains is the root cause of the vulnerability.
Detecting malware concealed within polyglot files requires more sophisticated analysis than relying on file-type identification utilities such as <code>file</code>. In 2019, an evaluation of commercial anti-malware software determined that several such packages were unable to detect any of the polyglot malware under test. In a reported case involving medical imaging files, the polyglot nature of the attack, combined with regulatory considerations, led to disinfection complications: because "the malware is essentially fused to legitimate imaging files", "incident response teams and A/V software cannot delete the malware file as it contains protected patient health information".
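To illustrate why a single leading signature is not enough, the sketch below looks for several format markers in the same byte string, including markers that are not at offset zero. It is a rough heuristic, not a malware detector: the signature list is a small illustrative selection, the function names are hypothetical, and the 1024-byte window for the PDF header reflects the lenient behaviour of some readers, which is an assumption here.

```python
def detect_formats(data):
    """Return the set of formats whose signatures appear in the data."""
    found = set()
    # Signatures expected at the very start of the file.
    if data.startswith((b"GIF87a", b"GIF89a")):
        found.add("GIF")
    if data.startswith(b"\xff\xd8\xff"):
        found.add("JPEG")
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        found.add("PNG")
    if data.startswith(b"MZ"):
        found.add("PE/DOS executable")
    # Assumption: a PDF header is tolerated anywhere near the start.
    if b"%PDF-" in data[:1024]:
        found.add("PDF")
    # ZIP end-of-central-directory record: 22 bytes plus an optional
    # comment of up to 65535 bytes, located at the end of the file.
    if b"PK\x05\x06" in data[-(22 + 65535):]:
        found.add("ZIP")
    return found


def looks_like_polyglot(data):
    """Flag data that matches more than one format signature."""
    return len(detect_formats(data)) > 1
```

A file that begins with a GIF header and ends with a ZIP directory would be reported as both formats, whereas a check of the first bytes alone would see only a GIF.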
GIFAR attack
A Graphics Interchange Format Java Archive (GIFAR) is a polyglot file that is simultaneously in the GIF and JAR file formats. This technique can be used to exploit security vulnerabilities, for example by uploading a GIFAR to a website that allows image uploading (as it is a valid GIF file), and then causing the Java portion of the GIFAR to be executed as though it were part of the website's intended code, being delivered to the browser from the same origin. Java was patched in JRE 6 Update 11, with a CVE published in December 2008.
GIFARs are possible because GIF images store their header at the beginning of the file, and JAR files (as with any ZIP archive-based format) store their data at the end.
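The layout described above can be sketched with a benign example that appends a ZIP archive to a GIF image. This is not a reproduction of the GIFAR attack: it uses a plain ZIP archive rather than a JAR with Java classes, the embedded image is a commonly circulated 1x1 pixel GIF supplied here as an assumption, and the function names are illustrative.

```python
import base64
import io
import struct
import zipfile

# Assumed sample image: a widely used 42-byte, 1x1 pixel GIF.
TINY_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def build_zip(members):
    """Create an in-memory ZIP archive from a name-to-text mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def build_gif_zip_polyglot(gif_bytes, members):
    """Concatenate the image (read from the start) and the archive (read from the end)."""
    return gif_bytes + build_zip(members)


polyglot = build_gif_zip_polyglot(TINY_GIF, {"hello.txt": "hello from the archive"})

# A GIF reader starts at offset zero: signature, then width and height.
signature = polyglot[:6]
width, height = struct.unpack("<HH", polyglot[6:10])
print(signature, width, height)

# A ZIP reader locates the directory from the end of the same bytes.
with zipfile.ZipFile(io.BytesIO(polyglot)) as archive:
    print(archive.namelist())
    print(archive.read("hello.txt"))
```

Python's <code>zipfile</code> module tolerates data placed before the archive, which is the same property that self-extracting archives rely on.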
Related terminology
- Polyglot programming, the practice of building systems using multiple programming languages, but not necessarily in the same file.
- Polyglot persistence, the analogous practice of using multiple database technologies within one system.
See also
- Quine (computing)
External links
- CSE HTML Validator for Windows with polyglot markup support
- Benefits of polyglot XHTML5
- A polyglot in 451 different languages
- A polyglot in 16 different languages
- A polyglot in 8 different languages (written in COBOL, Pascal, Fortran, C, PostScript, Unix shell, Intel x86 machine language and Perl 5)
- A polyglot in 7 different languages (written in C, Pascal, PostScript, TeX, Bash, Perl and Befunge98)
- A polyglot in 6 different languages (written in Perl, C, Unix shell, Brainfuck, Whitespace and Befunge)
- List of generic polyglots
- A PDF-MP3 polyglot, being a PDF document which is also an MP3 audio version of its content
- PoC||GTFO, a security publication published as polyglot PDF documents
