htsplit

split an HTML file into tokens 

Command


SYNOPSIS

htsplit [-bcx] [-e convert|separate] [-f entfile] [-o outfile] [file ...]


DESCRIPTION

The htsplit command is a filter that splits an HTML file into tokens, which other shell scripts can then process further.

htsplit reads input from the file arguments on the command line; if there are none, it reads from the standard input. By default, the output produced by htsplit is functionally identical to the input; that is, a browser displays the input and the output in the same way.

The following rules are used to split the HTML input into tokens:

Options

-b 

breaks up all tokens even when adding a newline produces output that is no longer functionally identical to the input.

-c 

removes comments (all tags beginning with <!--). This also causes all scripting information in the HTML file to be lost.

-e convert|separate 

specifies how to handle entity references.

If set to separate, each entity reference is placed on a line of its own. For example, f&aelig;ries is displayed as:

f
&aelig;
ries
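
The separate behavior can be modeled with a short Python sketch. This is an illustration of the documented result only, not htsplit's implementation; the function name separate_entities and the pattern used to recognize an entity reference (&name; or &#number;) are assumptions made for this example.

```python
import re

# Assumption: an entity reference looks like "&name;" or "&#number;".
ENTITY_RE = re.compile(r"(&#?[A-Za-z0-9]+;)")

def separate_entities(text):
    """Return the pieces of text, with each entity reference as its own item."""
    # The capture group makes re.split keep the matched references in the result.
    return [piece for piece in ENTITY_RE.split(text) if piece]

# Prints f, &aelig; and ries on three separate lines.
for line in separate_entities("f&aelig;ries"):
    print(line)
```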

If set to convert, entity references are converted to ASCII representations. For example, &amp; is displayed as &.

To perform conversions, htsplit uses a list of entity definitions in $ROOTDIR/etc/entities (see FILES) and other entity definition files specified with the -f option.

If an entity is undefined, it is converted to white space. This results in the text before and after the entity appearing on separate lines. For example, if aelig is undefined, f&aelig;ries is displayed as:

f
ries
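
The convert behavior, including the treatment of undefined entities, can be sketched in Python as follows. This is a simplified model rather than htsplit's own code: the definitions dictionary stands in for the entity definition files, the function names are hypothetical, and the sketch assumes that white space ends a token so that each piece appears on its own line.

```python
import re

# Assumption: only named references of the form "&name;" are handled here.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def convert_entities(text, definitions):
    """Replace each entity with its definition, or with a space if undefined."""
    def replace(match):
        return definitions.get(match.group(1), " ")
    return ENTITY_RE.sub(replace, text)

def output_lines(text, definitions):
    """Split the converted text on white space, one piece per output line."""
    return convert_entities(text, definitions).split()

print(output_lines("f&aelig;ries", {}))               # aelig undefined
print(output_lines("f&aelig;ries", {"aelig": "ae"}))  # aelig defined
```

With an empty dictionary the word breaks into f and ries, matching the example above; once aelig is defined, the word stays in one piece.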

You can specify additional entity files to use for conversions with the -f option.

-f entfile 

specifies an additional entity definition file to be used with -e convert. This file is in the same format as $ROOTDIR/etc/entities (see FILES). When an entity is defined in both the additional file and in $ROOTDIR/etc/entities, htsplit uses the definition from the additional file.

The -f option can appear multiple times to specify multiple additional entity definition files. If an entity is defined in more than one of these files, htsplit uses the definition from the file whose -f option appears last on the command line.
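
The precedence rules for -f can be pictured as a layered dictionary merge. The following Python sketch is only a model of the documented rule; merge_definitions and the sample definitions are hypothetical and are not taken from any real entity file.

```python
def merge_definitions(default_defs, extra_defs_in_order):
    """Combine entity definitions so that later sources override earlier ones.

    default_defs models $ROOTDIR/etc/entities; extra_defs_in_order models the
    files named by -f, in the order they appear on the command line.
    """
    merged = dict(default_defs)
    for defs in extra_defs_in_order:
        merged.update(defs)  # a later -f file wins over everything before it
    return merged

defaults = {"aelig": "ae", "amp": "&"}   # illustrative values only
first_file = {"aelig": "AE"}             # models the first -f file
second_file = {"aelig": "e"}             # models the second -f file
print(merge_definitions(defaults, [first_file, second_file]))
```

In this example aelig resolves to the value from the second file, while amp keeps its default because neither additional file redefines it.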

-o outfile 

sends the output of htsplit to outfile rather than the standard output.

-x 

processes the input as XML rather than HTML. Tag names are not converted to uppercase and XML directives beginning with <? are recognized and retained.


FILES

$ROOTDIR/etc/entities 

contains a list of entity definitions that htsplit uses when converting entities to ASCII representations. Each entry in the list is an entity name followed by a single space and then the ASCII representation of that entity. Additional spaces are considered to be part of the ASCII representation. For example, in this file, the entity for a non-breaking space (&nbsp;) is defined as:

nbsp<space><space>

That is, it is the entity name followed by two spaces. The first space is the separator and the second space is the ASCII representation.

The ASCII representation of an entity may itself include entities which are also converted.

Lines beginning with the & character are treated as comments.
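
The file format described above can be read with a few lines of Python. This sketch is an assumption-laden illustration, not the parser htsplit uses: the function names are hypothetical, lines without a separating space are skipped, and nested expansion stops after max_depth passes so that a definition that refers to itself cannot loop forever. The copy and mks entries are invented sample data.

```python
import re

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def parse_entity_file(lines):
    """Parse entity definitions: name, one space, then the representation."""
    definitions = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("&"):
            continue  # blank line, or a comment line beginning with &
        name, sep, representation = line.partition(" ")
        if not sep:
            continue  # assumption: ignore lines that have no separator
        # Everything after the first space, including spaces, is the value.
        definitions[name] = representation
    return definitions

def expand(text, definitions, max_depth=10):
    """Convert entities repeatedly so representations may contain entities."""
    for _ in range(max_depth):
        converted = ENTITY_RE.sub(
            lambda m: definitions.get(m.group(1), " "), text)
        if converted == text:
            break
        text = converted
    return text

sample = ["& sample comment", "nbsp  ", "copy (c)", "mks MKS&copy;"]
defs = parse_entity_file(sample)
print(repr(defs["nbsp"]))     # a single space
print(expand("&mks;", defs))  # the nested &copy; is converted as well
```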


DIAGNOSTICS

Possible exit status values are:

0 

Successful completion.

>0 

An error occurred.
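
A calling script can treat any exit status greater than 0 as a failure. The Python sketch below builds a command line from the options shown in SYNOPSIS and checks the exit status; the helper names htsplit_argv and run_filter, and the file names page.html and my.ent, are hypothetical. It assumes htsplit is on the PATH when actually run.

```python
import subprocess

def htsplit_argv(files=(), entity_mode=None, entity_files=(), outfile=None):
    """Build an htsplit command line from the options listed in SYNOPSIS."""
    argv = ["htsplit"]
    if entity_mode is not None:
        if entity_mode not in ("convert", "separate"):
            raise ValueError("entity_mode must be 'convert' or 'separate'")
        argv += ["-e", entity_mode]
    for entfile in entity_files:
        argv += ["-f", entfile]  # order matters: the last -f file wins
    if outfile is not None:
        argv += ["-o", outfile]
    return argv + list(files)

def run_filter(argv, data=b""):
    """Run argv as a filter and raise if its exit status is greater than 0."""
    result = subprocess.run(argv, input=data, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} failed with exit status {result.returncode}")
    return result.stdout

print(htsplit_argv(["page.html"], entity_mode="convert",
                   entity_files=["my.ent"]))
```

Passing the result of htsplit_argv to run_filter returns the tokenized output as bytes, or raises RuntimeError when the command reports an error.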


AVAILABILITY

PTC MKS Toolkit for System Administrators
PTC MKS Toolkit for Developers
PTC MKS Toolkit for Interoperability
PTC MKS Toolkit for Professional Developers
PTC MKS Toolkit for Professional Developers 64-Bit Edition
PTC MKS Toolkit for Enterprise Developers
PTC MKS Toolkit for Enterprise Developers 64-Bit Edition


SEE ALSO

Commands:
htdiff


PTC MKS Toolkit 10.4 Documentation Build 39.