Site File Parameters Reference
Value Types
Most parameters take one of the following value types:
All the parameters use the format Parameter: value, where
the line starts with optional white space, followed by the parameter name,
followed immediately by a colon and more white space, followed by
the value.
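The line format described above can be illustrated with a small Python sketch. This is not sitescooper's own parser (sitescooper is written in Perl); the function name parse_param_line and the set of characters allowed in a parameter name are assumptions made for the example.

```python
import re

# Optional white space, the parameter name, a colon immediately after it,
# more white space, then the value. The character class used for the
# name is an assumption for this sketch.
PARAM_LINE = re.compile(r"^\s*([A-Za-z0-9_]+):\s+(.*?)\s*$")

def parse_param_line(line):
    """Return (name, value) for a 'Parameter: value' line, else None."""
    match = PARAM_LINE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical example lines.
print(parse_param_line("  URL: http://www.example.com/"))
print(parse_param_line("Levels: 2"))
```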
The Details
Here are the details on each parameter, and what it does. Note that
only the URL parameter is required; all the others are
optional, and have default values.
URL
Top-level URL of the site.
Sitescooper will always start scooping the site by requesting this URL.
This is required, and must appear before any other parameters.
Name
Name of the site. This name will be used in both the filename of the output
file and the name of the resulting PRC database for Palm conversion, if needed.
This is pretty much required.
Description
A one-line description of the site.
Optional, but a great help!
AuthorName
The name of the site file's author.
AuthorEmail
The email address of the site file's author.
Active
Whether this site is active, i.e. should be scooped.
SizeLimit
The top size limit, in kilobytes, of a scoop from this site.
By default, this is left at 0, which means that the
process-wide limit is inherited: 300K by default, or whatever
the user specified on the command line.
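As a rough illustration of the inheritance rule, the following Python sketch picks the limit that applies to a site. The names effective_size_limit and DEFAULT_PROCESS_LIMIT_KB are hypothetical; only the 0-means-inherit rule and the 300K default come from the description above.

```python
DEFAULT_PROCESS_LIMIT_KB = 300  # process-wide default described above

def effective_size_limit(site_limit_kb=0, process_limit_kb=DEFAULT_PROCESS_LIMIT_KB):
    """A SizeLimit of 0 means 'inherit the process-wide limit'."""
    return site_limit_kb if site_limit_kb > 0 else process_limit_kb

print(effective_size_limit())        # site left at 0, default process limit
print(effective_size_limit(0, 500))  # user raised the process-wide limit
print(effective_size_limit(120))     # site sets its own limit
```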
MinPages
The minimum number of pages that this site requires. If a scoop
generates fewer than this number of pages, it is assumed the site
has not been updated and the scoop is ignored.
By default, this is set to 1.
Levels
How many levels this site has. If unspecified, the site is
assumed to be made up of one page, i.e. a 1-level site.
AddURL
An easy way to add additional URLs to a site, along with the
top-level URL.
LayoutURL
LayoutURL is similar to URL, but defines a layout for a specific pattern. If a
page's URL falls within this pattern, and parameters are defined for this
layout, but not defined by the site file, the layout parameters will be used.
This allows an easy way to specify default values that several sites can use;
just define them in a layouts file such as lib/layouts.site and
they will be inherited.
ExceptionURL
ExceptionURL is like LayoutURL, but it takes priority over both LayoutURL and
the normal site file rules.
The idea is that you can define a site using URL, then after defining the rules
for the main site pages, you can specify rules for a different set of pages
that will be encountered while scooping the site.
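The priority order between ExceptionURL, the site file's own rules and LayoutURL can be sketched in Python as a simple lookup chain. This is only an illustration under stated assumptions: the function resolve_parameter is hypothetical, each rule set is modelled as a (URL pattern, parameters) pair, and a URL is taken to "fall within" a pattern when the pattern matches at the start of the URL.

```python
import re

def resolve_parameter(name, page_url, site_params, layouts, exceptions):
    """Look up a parameter: exceptions first, then the site file, then layouts.

    layouts and exceptions are lists of (url_pattern, params_dict) pairs.
    Matching the pattern at the start of the URL is an assumption.
    """
    for pattern, params in exceptions:
        if re.match(pattern, page_url) and name in params:
            return params[name]
    if name in site_params:
        return site_params[name]
    for pattern, params in layouts:
        if re.match(pattern, page_url) and name in params:
            return params[name]
    return None

# Hypothetical rule sets for an example site.
layouts = [(r"http://www\.example\.com/",
            {"StoryStart": "<!-- body -->", "StoryEnd": "<!-- /body -->"})]
exceptions = [(r"http://www\.example\.com/special/",
               {"StoryStart": "<!-- special -->"})]
site = {"StoryStart": "<!-- story -->"}

url = "http://www.example.com/a.html"
print(resolve_parameter("StoryStart", url, site, layouts, exceptions))
print(resolve_parameter("StoryEnd", url, site, layouts, exceptions))
```

Here StoryStart comes from the site file, while StoryEnd, which the site file does not define, is inherited from the layout.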
RequireCookie
If a site requires that an HTTP cookie be set before it
can be accessed, use this parameter. It takes a two-part value,
consisting of the cookie's hostname and key separated by whitespace. For
example, the economist_full.site site file uses
RequireCookie: www.economist.com econ-key.
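A minimal Python sketch of splitting such a two-part value follows. The helper name parse_require_cookie is hypothetical, and rejecting values that do not have exactly two parts is an assumption of the example rather than documented sitescooper behaviour.

```python
def parse_require_cookie(value):
    """Split a RequireCookie value into (hostname, key)."""
    parts = value.split()  # any run of whitespace separates the two parts
    if len(parts) != 2:
        raise ValueError("expected 'hostname key', got: %r" % value)
    hostname, key = parts
    return hostname, key

print(parse_require_cookie("www.economist.com econ-key"))
```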
Rights
The rights of reproduction for that site, in addition to whatever will be
scooped. This allows you to append arbitrary copyright text to the output,
instead of the default End of snarf - copyright retained by original
providers. message.
TableRender
How tables should be rendered in the output. There are 3 possible values:
LevelNLinksStart / IssueLinksStart / ContentsStart / StoryStart
Specify the start-of-links-area or start-of-story-area pattern for a page at
that level.
LevelNLinksEnd / IssueLinksEnd / ContentsEnd / StoryEnd
Specify the end-of-links-area or end-of-story-area pattern for a page at that
level.
LevelNLinksIncludeStartPattern / IssueLinksIncludeStartPattern / ContentsIncludeStartPattern / StoryIncludeStartPattern
Causes the start-of-links-area or start-of-story-area pattern to be
included in the resulting scooped HTML. By default, it is not.
LevelNLinksIncludeEndPattern / IssueLinksIncludeEndPattern / ContentsIncludeEndPattern / StoryIncludeEndPattern
Causes the end-of-links-area or end-of-story-area pattern to be
included in the resulting scooped HTML. By default, it is not.
LevelNPrint / IssuePrint / ContentsPrint
Specify whether a links-level page should be printed, i.e. included in the
output. The default is 0 for text-style output, or 1 for HTML-style output.
(HTML-style here means any output format that supports hyperlinks, including
iSilo etc.)
There is no StoryPrint, as stories are always printed.
LevelNCacheable / LevelNCachable / IssueCacheable / ContentsCacheable / StoryCacheable
Whether pages at that level should be cached. The default is 0, meaning they
are not cached (i.e. it is assumed that a page with the same URL may
not contain the same text next time around). Both Cacheable and
Cachable can be used, because it's a tricky word to spell ;)
LevelNDiff / IssueDiff / ContentsDiff / StoryDiff
Whether pages at that level should be diffed, i.e. their contents compared
against that of a previous run, and only the new elements used.
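The idea of keeping only the new elements can be sketched in Python as follows. This is a simplification under stated assumptions: it compares whole lines and ignores ordering, whereas the comparison sitescooper actually performs is not specified here, and the function name new_elements is hypothetical.

```python
def new_elements(previous_lines, current_lines):
    """Return the lines of the current page that were absent from the previous run."""
    seen = set(previous_lines)
    return [line for line in current_lines if line not in seen]

# Hypothetical page contents from two runs.
previous = ["<a href=/s/1>Story one</a>", "<a href=/s/2>Story two</a>"]
current = ["<a href=/s/3>Story three</a>", "<a href=/s/1>Story one</a>"]
print(new_elements(previous, current))
```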
LevelNUseTableSmarts / IssueUseTableSmarts / ContentsUseTableSmarts / UseTableSmarts
Should the automatic trimming of narrow tables take place? Narrow tables are
defined as tables with a width of less than 40% or less than 250 pixels. The
default is 1.
UseTableSmarts can also be called StoryUseTableSmarts for
clarity.
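The definition of a narrow table given above can be written as a small Python predicate. The function name is_narrow_table and the handling of a missing or unparseable width (treated as not narrow) are assumptions; the 40% and 250 pixel thresholds come from the text.

```python
import re

def is_narrow_table(width_attr):
    """Decide whether a table WIDTH attribute such as '30%' or '200' is narrow."""
    match = re.fullmatch(r"\s*(\d+)\s*(%?)\s*", width_attr)
    if match is None:
        return False  # assumption: unknown widths are not treated as narrow
    value = int(match.group(1))
    if match.group(2) == "%":
        return value < 40   # less than 40% of the page width
    return value < 250      # less than 250 pixels

print(is_narrow_table("30%"), is_narrow_table("600"))
```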
LevelNFollowLinks / IssueFollowLinks / ContentsFollowLinks / StoryFollowLinks
Should links which fit into the StoryURL, etc. pattern
for that level be followed to parallel pages, i.e. other pages at the same
level? This allows a site to handle situations where stories or links to
stories are split into "page 1 of 4" etc.
The default is 0.
LevelNAddURL / IssueAddURL / ContentsAddURL / StoryAddURL
Add a URL to the list that needs to be scooped at that level.
LevelNURL / IssueURL / ContentsURL / StoryURL
The URL pattern that a page for that level must fit into.
Multiple URLs can be specified on multiple lines.
Generally, pages at the highest level do not need this to be specified,
unless the FollowLinks parameter for that level is turned on.
ContentsFormat
The format for the links-level pages. Currently either
html or rss can be used. html is the default;
rss indicates that the XML RSS format is used by that site.
LevelNSkipURL / IssueSkipURL / ContentsSkipURL / StorySkipURL
If a URL at the given level matches this URL pattern, it will
not be examined by sitescooper. (Note: versions of sitescooper
before 2.3.x do not support LevelNSkipURL or IssueSkipURL.)
StoryHeadline
This specifies a regular expression pattern used to search for the
story's headline or title. This is primarily useful for DOC-format
output, where a bookmark is created at the start of each story
using the headline as a bookmark title.
The story HTML is searched for this pattern before StoryStart
and StoryEnd stripping takes place.
It should be specified as a regular expression containing a single
(pattern) subexpression; the text matched by
the part inside the parentheses is used as the headline text.
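To show how a single parenthesized subexpression yields the headline, here is a short Python sketch. The function extract_headline and the sample pattern and HTML are hypothetical, and Python's regular expressions are only broadly similar to the Perl ones sitescooper uses.

```python
import re

def extract_headline(story_html, headline_pattern):
    """Return the text captured by the pattern's single group, or None."""
    match = re.search(headline_pattern, story_html)
    if match is None:
        return None
    return match.group(1).strip()

# Hypothetical story page and StoryHeadline-style pattern.
html = "<html><body><h1>Big News</h1><p>Story text.</p></body></html>"
print(extract_headline(html, r"<h1>(.*?)</h1>"))
```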
StoryToPrintableSub
A Perl regular expression substitution used to convert story links to a form
more suitable for sitescooper output. For example, many sites provide multiple
views of a story, including a "printable" view for printing, and often the
"printable" view is more amenable to scooping than the non-printable version.
StoryToPrintableSub allows you to convert the story URLs to this
"printable" format.
The StoryURL pattern must match the "printable" version.
It does not need to match the original, "non-printable" format.
The format of a Perl substitution is as follows:
s,from-pattern,replacement, where from-pattern is a
Perl regexp pattern, generally containing (pattern)
subexpressions, and replacement is a replacement text containing
\number markers where the strings matched by the parenthesized
subexpressions are inserted.
See the FAQ entry
on multi-page stories in the Writing a .site File document
for more information.
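The effect of such a substitution can be approximated in Python, whose re.sub also understands \number markers in the replacement. This is a sketch under stated assumptions, not sitescooper's implementation: the helper apply_perl_style_sub is hypothetical, it ignores any trailing flags, it does not handle delimiters escaped inside the pattern, and the example URLs are made up.

```python
import re

def apply_perl_style_sub(expression, url):
    """Apply a substitution written as s,from-pattern,replacement, to a URL."""
    if len(expression) < 2 or expression[0] != "s":
        raise ValueError("not a substitution: %r" % expression)
    delimiter = expression[1]
    parts = expression[2:].split(delimiter)
    if len(parts) != 3:
        raise ValueError("expected s%sfrom%sto%s" % (delimiter, delimiter, delimiter))
    from_pattern, replacement, _flags = parts  # flags are ignored in this sketch
    # Like Perl's s/// without the g flag, replace only the first match.
    return re.sub(from_pattern, replacement, url, count=1)

# Hypothetical site whose printable pages live under /print/.
sub = r"s,/story/(\d+)\.html,/print/\1.html,"
print(apply_perl_style_sub(sub, "http://www.example.com/story/1234.html"))
```

With this substitution in place, the StoryURL pattern would need to match the /print/ form of the URL, as noted above.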
StoryLifetime
Very old stories, by default older than 90 days, are not scooped.
This limit can be changed using this parameter.
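The age check can be sketched in Python as below. The function is_too_old is hypothetical, and the sketch assumes the parameter's value is a number of days, which the 90-day default suggests but the text does not state outright.

```python
from datetime import datetime, timedelta

DEFAULT_STORY_LIFETIME_DAYS = 90  # default described above

def is_too_old(story_date, now, lifetime_days=DEFAULT_STORY_LIFETIME_DAYS):
    """True if the story is older than the lifetime and should not be scooped."""
    return now - story_date > timedelta(days=lifetime_days)

now = datetime(2000, 6, 1)
print(is_too_old(datetime(2000, 1, 1), now))                    # well over 90 days old
print(is_too_old(datetime(2000, 5, 1), now))                    # about a month old
print(is_too_old(datetime(2000, 5, 1), now, lifetime_days=30))  # stricter limit
```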
StoryHTMLHeader
Additional HTML which should be added to the top of any story
page.
StoryHTMLFooter
Additional HTML which should be added to the bottom of any story
page.
UseAltTagForURL
If an <img> tag refers to an image which matches this URL pattern, its
ALT text will be used in place of the image. By default, no ALT text is used.
Note: this URL pattern is for the image's URL, not the URL it may be linked to
(if there is one).
NeedLoginURL
If a page requires HTTP authentication to access, you can specify its pattern
here to avoid a needless HTTP transaction. Normally, the page is requested
first, and the server responds with a request for authentication; then the page
is re-requested. This allows you to skip the initial request for a minor
speed-up.
ImageURL
If an <img> tag refers to an image which matches this URL, that image tag
will be left in the scooped document. Normally all images are stripped.
Note that not all output formats support images however.
ImageOnlySite
Specify to sitescooper that no text is expected to appear on the resulting
page; the only thing scooped is the image.
ImageScaleToMaxWidth
Specify the maximum width of an image. By default, this is 300, the rough width
in pixels of the Palm handheld's screen; sites with large images, such as
comics, can specify a larger value, which requires the user to scroll around
the image but generally improves the readability of the picture.
This is not an ideal solution to the problem, by the way, so this parameter
may go away or change in some way in the future.
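For illustration, scaling an image down to a maximum width might look like the following Python sketch. The function scaled_size is hypothetical, and preserving the aspect ratio and leaving narrower images untouched are assumptions; only the default of 300 pixels comes from the text.

```python
def scaled_size(width, height, max_width=300):
    """Return (width, height) after shrinking an image to fit max_width."""
    if width <= max_width:
        return width, height  # assumption: smaller images are left alone
    scale = max_width / width
    return max_width, max(1, round(height * scale))

print(scaled_size(600, 400))                 # shrunk to the default width
print(scaled_size(900, 300, max_width=600))  # a comics site allowing wider images
```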
ImageProcess
A chunk of Perl code which will be used to transform every image that
sitescooper downloads.
The filename of the image downloaded from the website is passed in as
$img_in, and the processed image should be written to the file named in
$img_out. Set $img_out to the undef value if you want to
skip that image.
This parameter is intended to allow the use of image rotation, resizing or
quantizing code. For these purposes, the PerlMagick
module may prove very useful.
URLProcess
A chunk of Perl code which will be used to transform every URL that sitescooper
needs to download. This allows a huge degree of control over the links that
sitescooper operates on.
The URL to operate on is passed in as $_, and the post-processed URL is
expected to be in $_ afterwards. Set $_ to the undef
value if you want to skip that URL.
Note that links which do not pass the StoryURL, etc.
patterns will be dropped before URLProcess takes effect, so make sure
those patterns are open enough for this.
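The ordering described above (pattern check first, then URLProcess, with undefined results skipped) can be modelled in Python. This is an illustration only: the real parameter takes Perl code operating on $_, whereas here the transformation is a hypothetical Python callable that returns None to skip a URL, and matching the pattern from the start of the URL is an assumption.

```python
import re

def select_urls(links, story_url_pattern, url_process):
    """Filter links by the StoryURL-style pattern, then transform each one."""
    selected = []
    for link in links:
        if not re.match(story_url_pattern, link):
            continue  # dropped before url_process ever sees it
        processed = url_process(link)
        if processed is None:
            continue  # the transformation asked to skip this URL
        selected.append(processed)
    return selected

# Hypothetical links and transformation.
links = [
    "http://www.example.com/story/1.html",
    "http://www.example.com/ads/x.html",
    "http://www.example.com/story/skip.html",
]
process = lambda url: None if "skip" in url else url + "?print=1"
print(select_urls(links, r"http://www\.example\.com/story/", process))
```

The /ads/ link never reaches the transformation, which is why the URL patterns need to be open enough to admit every link that URLProcess is meant to rewrite.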
LevelNHTMLPreProcess / IssueHTMLPreProcess / ContentsHTMLPreProcess / StoryHTMLPreProcess
A chunk of Perl code which will be used to transform HTML pages before
sitescooper operates on them.
This takes place after sitescooper strips the StoryStart and StoryEnd
sections, and after the StoryHTMLHeader and StoryHTMLFooter sections are added (where
applicable).
The text to operate on is passed in as $_, and the output is
expected to be in $_ afterwards.
StoryPostProcess
A chunk of Perl code which will be used to transform every piece of text that
sitescooper outputs. Confusingly, this operates on pages at all levels, not
just story-level pages; sorry about that! This takes place after
sitescooper performs its own cleanup, StoryStart
and StoryEnd stripping, table-stripping, etc.
This parameter is deprecated, since the same processing is run for HTML output,
text output, DOC format etc., and levels are not differentiated. Using the LevelNHTMLPreProcess parameters is recommended
instead.
EvaluatePerl
Evaluate some arbitrary Perl code before running that site. This takes Perl
code as a value.
If the $skip_site variable is set to a non-zero value after the
EvaluatePerl code is run, the site is skipped.