ploticus: proc tabulate

Manual page for proc_tabulate(PL)

proc tabulate

DESCRIPTION

proc tabulate may be used to compute a one-way distribution on one data field, or a two-way distribution using two data fields. It produces a new data set, which becomes the "current" data set, ie. the data that will be accessed by subsequently executed plotting procs.

proc tabulate has a fixed maximum of 200 rows and 200 columns. Data does not have to be ordered in any particular way. See also the proc processdata count action, which requires ordered data but has no upper limit on number of "bins".

The showresults option is recommended so that you can see what the resulting data set looks like while developing a plot.

FEATURES

Can tabulate to produce one- or two-way distributions. Bins may be based on natural occurance or by preset lists or ranges. Percents may be calculated. Ordering of results may be controlled. Occurances may be counted, or accumulations may be done. Resulting text table may be displayed on screen or saved for other uses.

EXAMPLES

See the Gallery examples distrib and vermonth

PREREQUISITES

proc getdata must first be executed to define or access some data.

VARIABLES SET

NROWS = Number of rows in the data result.

NFIELDS = Number of fields per row in the data result.

Thus, if a one-way distribution is being done and there are 7 varieties, NRECORDS will hold 7, which may then be used (e.g. xrange: @NRECORDS+1) to automatically set scaling for a bargraph.

MODES

Either one-dimensional or two-dimensional modes.

If datafield1 is specified but datafield2 is not, a one-dimensional distribution will be computed. The result will be two data fields-- the first field will be the value and the second field will be the number of instances. The number of records in the result will be the number of bins.

If both datafield1 and datafield2 are specified then a two-dimensional distribution will be computed. Datafield1 will be distributed downward and datafield2 will be distributed across. The result's column headings will be usable as field names (the first column is always named rowname).

Many of the attributes are named with either 1 or 2 to correspond with either the distribution on datafield1, or datafield2.

NOTE

After proc tabulate executes, all subsequent plotting procedures in the script file will access its results for plotting. However, the original data is still in memory. If later it is necessary to plot the original data, proc usedata may be invoked.

MANDATORY ATTRIBUTES

The datafield1 attribute must be specified.

If a valuelist is not specified, all values encountered will get their own bin in the distribution.

ATTRIBUTES

datafield1 dfield

: Compute a distribution on this data field. Value will be in result data field 1 and N will be in result data field2. Example: datafield1: 1

datafield2 dfield

: Compute a two way distribution on datafield1 and this field. Distribution on datafield2 will be horizontal. See also MODES above. Example: datafield2: 5

axis1 x | y

: Axis to equivicate with the distribution on datafield1. This needs to be specified when working with data which is to be scaled using units such as date or time. Otherwise it does not need to be specified.

axis2 x|y

: Same as axis1.

valuelist1 stringlist

: Define a set of values that will be included in the distribution of datafield1. The ordering of this set determines the order that categories are presented in the result. This is a space- or comma-delimited list of values.
Example: valuelist: red green blue
: If ranges are being used (dorange1: yes), then this attribute may be used to explicitly define the ranges. See the following example for the syntax; by default, dash (-) is used to separate the low and hi values in a range, with no embedded spaces allowed.
: As a convenience, the letter "C" may be used in place of a low value in a range to indicate "continuous"; its effect if for the previous high value to be copied and taken as the next low value. This saves the tedium and error-prone-ness of having to enter values twice.
Example: valuelist1: 0-2.5 C-5 C-7.5 C-10
This would be equivalent to valuelist1: 0-2.5 2.5-5 5-7.5 7.5-10. Either way, a value of 2.5 would end up in the 2nd bin.

valuelist2

: Value list for datafield2. (see valuelist1)

doranges1 yes | no

: If yes, distribution on datafield1 will use ranges rather than values. If defined ranges overlap the higher bin has presidence. The ranges may be defined using either valuelist1, or rangespec1.

doranges2 yes | no

: If yes, distribution on datafield2 will use ranges rather than values. The ranges may be defined using either valuelist2, or rangespec2.

rangespec1 lowval binsize [hival]

: If doing ranges, this attribute may be used instead of valuelist if ranges of uniform size are to be used when tabulating. Ranges will begin at lowval and be of size binsize. Ranges will end when hival is passed, or when the high end of the axis is passed (if an axis has been defined). lowval and hival should be plotvalues . Implies doranges1: yes.
Example: rangespec1: 0 5 39
This would set up ranges 0-5, 5-10, 10-15, and so on, up to 35-40 (remember that where ranges overlap, the higher bin has presidence).

rangespec2 lowval binsize [hival]

: Same as rangespec1, but for datafield2.

accumfield dfield

: Normally, proc tabulate works by counting occurances. However, If accumfield is specified, instead of counting, an accumulation will be done using the specified field.

order1 natural | magnitude | reversemagnitude

: Specify order that categories are presented in the result.

order2 natural | magnitude | reversemagnitude

: Same as above for datafield2.

percents yes | no

: If yes, each tabulation column will be accompanied by a column of percents.

showresults yes | no

: If yes, the result data table will be written to the diagnostic stream for debugging purposes.

savetable filename | stdout | stderr

: [This attribute superceded by showresults above]

select conditional-expression

: conditional-expression is applied to each data record (row). If specified and if the expression evaluates to true, the data is included; otherwise it is excluded. Data fields are referenced by preceding them with two at-signs (@).
Example: select: @4 = G

rangesepchar char

: Allows user specification of the range separator character (the character that should be used to separate the low and high values of a range in the valuelists). Default range separator charactor is dash (-). Example: rangesepchar: ,

showrange low | avg

: If specified, controls the content of row or column labels when ranges are being used. Normal behavior is for the label to be formatted lowend - hiend. If this attribute is low, only the low value will be given. If this attribute is avg, an average of the low and high will be given. This attribute is useful when proc tabulate result range bins will be plotted by location, e.g. for a histogram (the 1st result data field can be used as the bar location).

data display engine
Copyright Steve Grubb

Markup created by unroff 1.0, December 10, 2002.