A Beautifier for the Perl Programming Language

Speech presented at The Perl Conference 2.0

Tim Maher, Ph.D

Head Software Instructor, CONSULTIX
POB 70563, Seattle WA 98107
t i m (AT)teachmeperl.com


NOTE: This paper was presented at The Perl Conference (TPC) 2.0, July, 1998.
A talk on an enhanced version of this program and other developments in Perl beautification was presented at TPC 5.0, in July of 2001. The slides are at http://teachmeperl.com/pb2001.html

I'm into Beauty. So much so, that when I first started learning Perl, I immediately began looking for a "Perl Program Beautifier," and was surprised when I couldn't find one.

What is a beautifier? A language-specific utility that reformats a program to conform to a standard of presentation. For example, blank lines might be inserted after procedure bodies and declarations, indentation might be adjusted to properly reflect nesting levels, excessively long lines might be split into shorter ones, and matching parentheses and braces might be vertically aligned to show program structure.

Why do I worry about program Beauty? For one thing, people like me, who use many languages, occasionally mix up their vocabularies, and end up speaking Language A to Interpreter B. For instance, /* introduces a comment in C, but generates pathnames in a UNIX shell, and // introduces an in-line comment in C++, but applies the matching operator in Perl. When you're lucky, this type of "language crosstalk" will immediately generate an error. But in less fortunate cases, the alien code might turn out to be acceptable but dysfunctional, leading to trouble down the road.

So one benefit of a good beautifier is that it can highlight, in some distinctive manner, material that might be syntactically acceptable but inappropriate. In this way, a beautifier can act as a debugging aid.

Another benefit of beautification comes from its imposition of a standard manner of depiction on programs. This makes it much easier for a programmer to read the writings of another, as well as his own programs in the future. In this way, a beautifier acts as an aid to communication and software maintenance.

Because of the importance of beautification utilities to programmers, I was shocked to learn that a language as mature and popular as Perl would be lacking one. But when I asked around, I was told by "Perl Gurus" that a beautifier would be very difficult to write because of Perl's complicated syntax, as stated in the Perl Frequently Asked Questions List (under Part 3, Programming Aids):
 

    3.4   Is there a pretty-printer (similar to indent(1) ) for Perl?

    . . . If what you mean is whether there is a program that will reformat the program such as indent(1) will do for C, then the answer is no. The complex feedback between the scanner and the parser (as in the things that confuse vgrind) make  [sic]  it challenging at best to write a standalone Perl parser.

But I figured it could not be as difficult as this makes it sound. For one thing, Perl knows how to parse Perl programs, and the Perl source code is freely available, so Perl could conceivably be reworked into a Perl beautifier.

Alternatively, the GNU source code for the popular C language beautifier "indent" is also available, so another approach would be to rework its 7k lines of C to handle the Perl language.

But I knew there was an easier approach, which would not require reworking anybody else's existing code, or the use of a language other than Perl. I knew this because my quest for beauty had already led me to write a rudimentary C++ beautifier in a three-command (sed | indent | sed) shell script (UNIX/World, August 1991, p. 134), and later a more robust C++ beautifier in 140 lines of C and shell code (Dr. Dobb's Journal, Dec. 1992, pp. 23s-27s).

These beautifiers certainly don't qualify as "standalone parsers" for C++, because they don't classify the program elements into meaningful units. But that doesn't prevent them from doing for C++ everything that indent and cb do for C! The trick is realizing that programs written in Language B can be successfully processed by beautifiers for Language A, if Language B bears a syntactic similarity to Language A, and if the Language B program can be temporarily disguised as Language A.

So with this C++ beautification experience under my belt, and a stubborn determination to prove that "Perl Beautification" could be accomplished if sufficient Hubris, Impatience, and Laziness could be mustered, I began writing in Perl the first fully functional "Perl Beautifier", pbeaut, in April of 1998 [1].

Beautification Strategy

As with its C++ predecessors, I approached the problem of writing pbeaut by capitalizing on the existence of mature beautification utilities for the C language, which has some fortunate syntactic similarities to Perl, and milking the UNIX filter model for all it's worth.

The basic approach, borrowed from my C++ beautifiers, is to use a pre-processor to disguise Perl code as C code, effect the beautification using standard C tools, and then convert the disguised Perl back to its original form using a post-processor.

The basic model is therefore:

        PERL CODE ->
          Perl-to-C Encoder ->
            Standard C beautifier ->
              C-to-Perl Decoder ->
                BEAUTIFIED PERL CODE

Here is a listing of the first pbeaut program:

    $ cat pbeaut
    #! /bin/sh
    # Tim Maher, Consultix. www.teachmeperl.com
    # pbeaut, v .1
    # these indent options work pretty well
    pencode $* | indent -npro -bl -bli0 -nce -npcs | pdecode
    $

The encoder, pencode, examines every character of the Perl program, from first to last, and rewrites certain character sequences as necessary to disguise the Perl code as C.

The C beautifier, in this case GNU indent, inserts tabs to properly represent nesting levels, aligns parentheses and braces, inserts newline characters to split long lines into shorter ones, and generally fools around with the layout of the code to make it look more orderly and to emphasize the program's structure.

The decoder, pdecode, undoes the disguises crafted by pencode to reveal the hidden Perl program elements in their newly beautified context.
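The three-stage model can be sketched in a few lines of Perl. What follows is only a toy illustration, not the real pencode or pdecode: the code word DOTDOT is hypothetical, and fake_indent is a stand-in that reproduces a single behavior of indent reported later in this paper (rewriting 0..24 as 0. .24).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Marker sequence; the paper's listings use _a_ for this purpose.
my $MARK = '_a_';

# Toy encoder: hide Perl's range operator behind an identifier-like token
# that a C beautifier has no reason to split. DOTDOT is a made-up code.
sub encode {
    my ($code) = @_;
    $code =~ s/\.\./ ${MARK}DOTDOT${MARK} /g;
    return $code;
}

# Stand-in for the C beautifier (NOT the real indent): it only mimics
# the splitting of ".." between digits that indent was observed to do.
sub fake_indent {
    my ($code) = @_;
    $code =~ s/(\d)\.\.(\d)/$1. .$2/g;
    return $code;
}

# Toy decoder: turn the token back into the range operator.
sub decode {
    my ($code) = @_;
    $code =~ s/ ?${MARK}DOTDOT${MARK} ?/ .. /g;
    return $code;
}

my $perl = 'for (0..24) { print }';
print "unprotected: ", fake_indent($perl), "\n";
print "protected:   ", decode(fake_indent(encode($perl))), "\n";
```

Passed through the stand-in beautifier unprotected, the range operator is split apart; disguised first and decoded afterwards, it survives the trip.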

The current "production" version of pbeaut (version .62; 412 lines of code) communicates various types of information to and from the encoder and decoder and does extensive error checking, but its basic function is the same as the simple version shown above.

De-Obfuscation Testing

Where should one look to find the ugliest Perl code on the planet? Why, the archives of past Obfuscated Perl Contests, of course, where contestants are rewarded for making their programs as inscrutable as possible (http://www.tpj.com/tpj/contest).

In this section, we'll examine the effects on Perl programs of C-style beautification using indent, as well as Perl-style beautification using pbeaut.

Here's a prize-winning entry from the 1996 contest:

    $ cat caton
    #F. First place: Russell Caton
    # (Reduced in size to fit on one line.)
    $-=100;while((($@)=(getpwent())[2])){push(@@,$@);}foreach(sort{$a<=>$b}@@){(($_<=$-)||($_==($-+++1)))?next:die"$-\n";}

After beautification with indent, it looks like this:

    $ indent -npro -br -nce -npcs  < caton  # -br: brace on line with keyword
    #F. First place: Russell Caton
    $ -= 100;
    while ((($ @) = (getpwent())[2])) {
            push(@@, $ @);
    }
    foreach(sort
            {
            $a <=> $b
            }
            @@) {
            (($_  <= $ -) || ($_  == ($ - +++1))) ? next : die "$-\n";
    }

Perhaps surprisingly, the program layout looks pretty good, owing to the fact that Perl inherited many of its basic features from C (brace-delimited blocks, &&/|| conjunctions, semicolon line-termination, operator syntax, etc.).

On the other hand, the representations of two variables ( $- , $@ ) were altered by the insertion of a space between the symbols. Does this bother Perl?

    $ perl -c caton
    caton syntax OK
    $ indent -npro -br -nce -npcs  < caton > caton.ind; perl -c caton.ind
    caton.ind syntax OK
    $ perl -w caton.ind
    101
    $

It doesn't bother Perl a bit! The program still produces the next available number from the /etc/passwd file (or NIS database). However, having the depictions of those variables messed up is likely to annoy most (non-obfuscatory) Perl programmers!

After beautification with pbeaut, the program looks like this:

    $- = 100;
    while ((($@) = (getpwent())[2])) {
        push(@@, $@);
    }
    foreach(sort{$a<=>$b} @@) {
        (($_ <= $-)  || ($_ == ($-++ + 1))) ? next : die "$-\n";
    }

As you can see, the preservation of variable names has been achieved, along with a much more Perlish representation of the foreach loop.

Here's another unattractive winning entry from the 1996 contest:

    #D. First place: Robert Klep
    # (Line breaks in original exactly as shown)
    $Y=-1.2;for(0..24){$X=-2;for(0..79){($r,$i)=(0,0);for(0..15){$n=$_;$r=($x=$
    r)*$x-($y=$i)*$y+$X;$i=2*$x*$y+$Y;$x*$x+$y*$y>4&&last}print unpack("\@$n a"
    ,".,:;=+itIYVXRBM ");$X+=3/80}$Y+=2.4/25}

Let's try beautifying this one with indent:

    $ indent -npro -br -nce -npcs < klep > klep.ind
    Standard input:1: Warning: old style assignment ambiguity in "=-". Assuming "= -"
    Standard input:1: Warning: old style assignment ambiguity in "=-". Assuming "= -"
    $ indent -npro -br -nce -npcs < klep 2>/dev/null | tee klep.ind
    $Y = -1.2;
    for (0. .24) {
            $X = -2;
            for (0. .79) {
                    ($r, $i) = (0, 0);
                    for (0. .15) {
                            $n = $_;
                            $r = ($x = $
                                  r) * $x - ($y = $i) * $y + $X;
                            $i = 2 * $x * $y + $Y;
                            $x *$x + $y * $y > 4 && last
                    }
                    print unpack("\@$n a"
                                 ,".,:;=+itIYVXRBM ");
                    $X += 3 / 80
            }
            $Y += 2.4 / 25
    }
    $ perl -c klep.ind
    klep.ind syntax OK
    $

The "$Y=-1.2" assignment was correctly interpreted as "$Y = -1.2", rather than the archaic C "$Y =- 1.2", so no problem there. And once again, the program structure looks good, and it passed Perl's syntax check.

However, instead of drawing the default Mandelbrot fractal on the screen in ASCII characters, here's what the program does:

    $ klep.ind
    $

It seems that the degenerate range specifications (e.g., 0. .79) in the for loops, although syntactically acceptable, resulted in only one iteration of each loop, which wasn't enough to show much output.
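The single iteration is easy to reproduce in isolation. As far as I can tell, Perl reads 0. .79 as the number 0. followed by the concatenation operator and the number 79, so the loop list holds one string instead of eighty numbers. The variable names below are mine, chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The range the programmer wrote.
my @intended = (0 .. 79);

# The same expression after indent split the operator:
# "0." is a complete number, and the lone "." concatenates it with 79.
my @mangled = (0. .79);

printf "intended: %d elements\n", scalar @intended;
printf "mangled:  %d element, namely '%s'\n", scalar @mangled, $mangled[0];
```

Because the mangled form is still legal Perl, perl -c has nothing to complain about, which is exactly what makes this kind of damage hard to spot.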

The lesson to be learned here is that a faulty beautification process can produce code that is attractive but dysfunctional!

pbeaut, on the other hand, produces the following reworked version of this program, which functions correctly:

      $ pbeaut -b 2 klep | cat -n     # -b 2: opening brace on line with keyword
          
           1  $Y = -1.2;
           2  for (0 .. 24) {
           3          $X = -2;
           4          for (0 .. 79) {
           5                  ($r, $i) = (0, 0);
           6                  for (0 .. 15) {
           7                          $n = $_;
           8                          $r = ($x = $  # NOTE: extraneous newline here
           9                                            r) * $x - ($y = $i) * $y + $X;
          10                          $i = 2 * $x * $y + $Y;
          11                          $x * $x + $y * $y > 4 && last
          12                  }
          13                  print unpack("\@$n a" # NOTE: extraneous newline here
          14                               ,".,:;=+itIYVXRBM ");
          15                  $X += 3 / 80
          16          }
          17          $Y += 2.4 / 25
          18  }
      $

This program looks pretty good, apart from the undesirable preservation of badly located newline characters from the original, which caused the splitting of the $r variable (lines 8-9) across a line break!

Although pbeaut tries to preserve programmer newlines by default, on the assumption they were sensibly placed, this behavior can be disabled through use of the -n (ignore newlines) invocation option:

      $ pbeaut -n -b 2 klep | cat -n  # -n: ignore newlines in original
           1  $Y = -1.2;
           2  for (0 .. 24) {
           3          $X = -2;
           4          for (0 .. 79) {
           5                  ($r, $i) = (0, 0);
           6                  for (0 .. 15) {
           7                          $n = $_;
           8                          $r = ($x = $r) * $x - ($y = $i) * $y + $X;
           9                          $i = 2 * $x * $y + $Y;
          10                          $x * $x + $y * $y > 4 && last
          11                  }
          12                  print unpack("\@$n a", ".,:;=+itIYVXRBM ");
          13                  $X += 3 / 80
          14          }
          15          $Y += 2.4 / 25
          16  }

That's better!

Testing with Perl Modules

It can certainly be fun decoding the purposely inscrutable scribblings of the obfuscatory programmers, but the real testing of pbeaut has focused on a different collection of programs - the modules of the standard Perl distribution. These 66 programs, totaling 20,247 lines of expert-grade Perl code, use most of the features of the language, in lots of different combinations.

Most of these programs can be successfully beautified. The others ( diagnostics.pm, getcwd.pl, perl5db.pl, CPAN.pm, English.pm ) cannot, due to their use of currently unsupported syntax features. (For example, semicolon delimiters with the substitution operator; see Current Limitations, below.)

How pbeaut Works

The job of pencode (currently about 1.7k lines of Perl5) is to make Perl code acceptable as C code, using one of four techniques:
• encapsulating within C comments
• enclosing in quotes
• encoding symbols alphabetically; e.g., { -> LB
• leaving code unaltered (for material that can be properly handled by C-oriented rules)

We'll use a small program called fix to illustrate the encoding, beautification, and decoding processes.

      $ cat fix
      #! /usr/bin/perl -wn
      # modify shebang pathname in input script
      # written strangely to involve certain syntax features

      ${prog}='perl';

      if ($. =~ /^1$/) {$_=s|^#! /usr/bin/$prog|#! /usr/local/bin/$prog|o;print;}
      else {print;} # other lines unaltered
      $ pencode fix | cat -n  # show line numbers     
      1  /* _a_C #! _a_RMa\/usr_a_RMa\/bin_a_RMa\/perl -wn C_a_ */     
      2  /* _a_C # modify shebang pathname in input script C_a_ */     
      3  /* _a_C # written strangely to involve certain syntax features C_a_ */
      4       
      5  "_a_F_${prog}_F_a_"='perl';
      6       
      7  if ("_a_F_$._F_a_" _a_EQ_a_TD "_a_M_/^1$/_M_a_")  # line wrapped for display
           {"_a_F_$__F_a_"="_a_S_s|^#! /usr/bin/$prog|#! /usr/local/bin/$prog|o_S_a_";print;}
      8  else {print;} /* _a_C # other lines unaltered C_a_ */

As you can see, the Perl comments were encapsulated within C comments (e.g., line 1), with a code sequence ending in a backslash (_a_RMa\) being inserted before each inner / . This is to guard against accidental termination of the C comment by material within the Perl comment. (Actually, the backslashes are only necessary for embedded slashes preceded by asterisks, but for simplicity all slashes are backslashed.)
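The comment disguise visible in the listing can be imitated with a pair of small subroutines. This is a sketch that reproduces only what line 1 of the listing shows; wrap_comment and unwrap_comment are names I made up, and the real pencode surely handles cases this does not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap a Perl comment in a C comment, prefixing every slash with
# <marker>RMa\ so that no "*/" can end the C comment early.
sub wrap_comment {
    my ($comment, $mark) = @_;
    (my $body = $comment) =~ s{/}{${mark}RMa\\/}g;
    return "/* ${mark}C $body C${mark} */";
}

# Reverse the disguise; lines that are not wrapped pass through unchanged.
sub unwrap_comment {
    my ($line, $mark) = @_;
    my $m = quotemeta $mark;
    if ($line =~ m{^/\* ${m}C (.*) C${m} \*/\z}) {
        my $body = $1;
        $body =~ s{${m}RMa\\/}{/}g;
        return $body;
    }
    return $line;
}

my $original = '#! /usr/bin/perl -wn';
my $wrapped  = wrap_comment($original, '_a_');
print "$wrapped\n";
print unwrap_comment($wrapped, '_a_'), "\n";

# A comment that would otherwise close the C comment prematurely.
print wrap_comment('# remove */*.bak', '_a_'), "\n";
```

The first line printed should match line 1 of the pencode output above, and the second should be the original comment again.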

The basic encoding sequence, _a_ , and an ancillary preceding or following syntax-specific code, mark the sequences added by pencode to facilitate their later removal by pdecode. (NOTE: if _a_ appears in the input program, pbeaut will select a different sequence for use.)
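How the alternative sequence gets chosen is not spelled out here, so the following is just one plausible way to do it: walk through candidates of the same shape (_a_, _b_, _c_, ...) and take the first one that never occurs in the program text. The subroutine name pick_marker and the candidate list are assumptions made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first candidate marker that does not occur in the program.
# The _a_ .. _z_ candidate list is an assumption, not pbeaut's actual list.
sub pick_marker {
    my ($program) = @_;
    for my $letter ('a' .. 'z') {
        my $marker = "_${letter}_";
        return $marker if index($program, $marker) < 0;
    }
    die "no unused marker sequence found\n";
}

print pick_marker('$x = 1;'), "\n";         # no clash with _a_
print pick_marker('$my_a_var = 1;'), "\n";  # _a_ occurs in the name
```

Checking the whole input before encoding begins means the decoder can later remove every occurrence of the marker without fear of touching the programmer's own text.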

To prevent indent from making bad decisions about formatting Perl symbol sequences, such as inserting spaces or newlines between them, certain ones are encoded with an alphabetic representation, such as EQ...TD for =~ (line 7).

For related reasons, in line 5, ${prog} gets quoted to prevent the C beautifier from splitting apart its components.

The job of pdecode (currently about 50 lines of Perl5) is to reverse the effects of pencode, by removing the C-comment "wrappers" encapsulating Perl comments, decoding the various encoded strings to their original forms, and so forth.
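To give a feel for the decoding step, here is a small sketch that undoes two of the disguises visible in line 7 of the encoded listing: the quoted wrappers tagged F, M, and S, and the EQ...TD code for =~ . It is based only on that listing; decode_line is a hypothetical name, and the real pdecode handles more cases than these.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Undo the quote wrappers and the =~ code seen in the encoded listing.
sub decode_line {
    my ($line, $mark) = @_;
    my $m = quotemeta $mark;

    # "<mark>F_ ... _F<mark>" (likewise M and S) becomes the bare contents.
    $line =~ s/"${m}([FMS])_(.*?)_\g{1}${m}"/$2/g;

    # <mark>EQ<mark>TD becomes the binding operator again.
    $line =~ s/${m}EQ${m}TD/=~/g;

    return $line;
}

my $encoded = 'if ("_a_F_$._F_a_" _a_EQ_a_TD "_a_M_/^1$/_M_a_")';
print decode_line($encoded, '_a_'), "\n";
```

Because every inserted sequence carries the marker, the decoder needs no knowledge of Perl syntax at all; it simply searches for its own fingerprints.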

Let's look at the result of subjecting fix to the full beautification process:

      $ nl -ba fix  # original, pre-beautification
          1  #! /usr/bin/perl -wn
          2  # modify shebang pathname in input script
          3  # written strangely to involve certain syntax features
          4  
          5  ${prog}='perl';
          6  
          7  if ($. =~ /^1$/) {$_=s|^#! /usr/bin/$prog|#! /usr/local/bin/$prog|o;print;}
          8  else {print;} # other lines unaltered
      $ pbeaut < fix | cat -n
          1 #! /usr/bin/perl -wn
          2 # modify shebang pathname in input script
          3 # written strangely to involve certain syntax features
          4
          5 ${prog} = 'perl';
          6
          7 if ($. =~ /^1$/) {
          8       $_ = s|^#! /usr/bin/$prog|#! /usr/local/bin/$prog|o;
          9       print;
          10 }
          11 else {
          12      print;
          13 } # other lines unaltered

Compared to the original program, spaces have been inserted around the assignment operators (=) on lines 5 and 8, and the one-line code blocks on lines 7 and 8 of the original have been spread out over several lines (7-10, and 11-13).

Current Limitations

pbeaut is aware of many of its limitations, and it will report them to the user whenever one is encountered during processing, or when it is invoked with the -l option:

    $ pbeaut -l
         LIMITATIONS OF pbeaut VERSION 0.62:
         Delimiters for match and substitution operators known to work: <{[(#/|!?
                 (others might work, but they haven't been tested)
         Whitespace before some delimiters not supported (e.g., m  /a/;
         s/// and tr/// are not allowed to change delimiters for the second part
         // sometimes formatted better if explicitly tagged as match, via m//
         q{}, qq(), etc., cannot have embedded }/) unless backslashed
         split must have white-space before first / (split //, not split//)
         Keyword "sub" and following subroutine name assumed to be on same line
         Here-Docs cannot omit Framing-Word (can't use blank line as terminator)
         TIPS: To disable beautification for a line, put "#LIT" at its end
               If you don't like the formatting of subs without parentheses,
                 try adding them: sub foo() rather than sub foo
         (c) Tim Maher, CONSULTIX.  www.teachmeperl.com  (206) 781-UNIX
            Contact author for restrictions on distribution and usage

NOTE: Although there are numerous options to indent that can be used to enable or disable different types of program alterations, only a few of the various combinations have thus far been tested with pencode/pdecode.

For this reason, the current pbeaut offers only two choices of indent options: "-b 1" (braces on lines following keywords) and "-b 2" (braces on same lines as keywords).

Perl Beautifier Demo Page

To allow interested parties to experiment with it, I provided a pbeaut Demo Page on the World-Wide-Web for many months, which allowed me to obtain useful information about the strengths and weaknesses of the early versions.

Status Report

pbeaut is progressing rapidly, but is not yet fully reliable. In some cases, it generates Perl code that is syntactically incorrect, usually due to the introduction or removal of spaces and/or newline characters.

Fortunately, pbeaut incorporates safeguards to prevent the accidental corruption of the original program, including:

• presenting the beautified result to the standard output, rather than altering the original program
• running perl -c (syntax check) on the beautified code and reporting any syntax errors to the user
• reporting statistics on the input program and its beautified output, to facilitate detection of code loss

Even when errors are introduced, they are typically of a very minor nature, and are easily detected and fixed by an experienced Perl programmer.
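Two of these safeguards are simple enough to sketch. The code below is an illustration under stated assumptions, not an excerpt from pbeaut: the subroutine names are invented, a UNIX-style shell is assumed for the 2>&1 redirection, and the count of non-whitespace characters is merely one statistic that could expose code loss (which statistics pbeaut actually reports is not described above).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run "perl -c" on a string of code; true if the syntax check passes.
sub syntax_ok {
    my ($code) = @_;
    my ($fh, $path) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    print {$fh} $code;
    close $fh or die "cannot close $path: $!";
    # $^X is the perl running this script; its report is captured and ignored.
    my $report = qx{"$^X" -c "$path" 2>&1};
    return $? == 0;
}

# Count non-whitespace characters. A beautifier that only moves
# whitespace around should leave this number unchanged.
sub ink_count {
    my ($code) = @_;
    my $count = () = $code =~ /\S/g;
    return $count;
}

my $before = 'for(0..3){print}';
my $after  = "for (0 .. 3) {\n        print\n}\n";

print "syntax OK\n"    if syntax_ok($after);
print "no code lost\n" if ink_count($before) == ink_count($after);
```

Neither check proves that the beautified program still behaves the same, as the 0. .79 episode showed, but together they catch the most common kinds of damage cheaply.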

Development and Testing Platforms

pbeaut was developed on Slackware Linux 3.0 using GNU indent 1.9.1, and has also been lightly tested under ULTRIX 4.3 using /usr/ucb/indent and /usr/bin/cb. I expect it to work just as well on other platforms having different versions of indent and cb, but that assumption is currently untested.

Summary (2001 update)

pbeaut is a utility that can rewrite most Perl programs into compliance with an adjustable standard of presentation. Since the writing of this article, Steve Hancock's PerlTidy has become the de facto standard beautification utility, and we recommend it highly.

Footnotes

[1]
I am aware of the 5/20/96 pb program by P. Lutus, Ashland. But unlike pbeaut, it does nothing but adjust indentation to properly reflect nesting levels, rather than offering the full power of C's indent and cb utilities.

© Copyright 1994-2008   Pacific Software Gurus, Inc.   All Rights Reserved.
