[Nickle] Bitsets considered harmful?

James LaMar giantrobot at f-m.fm
Tue Nov 30 08:36:44 PST 2004

The audacious subject of this post is only half-serious -- I'm starting
to think reimplementing the Mint parser generator using bitsets as a set
representation may have been a bad idea.

The reason behind this has to do with the nature of the canonical
collection/closure/first set/goto system of algorithms.  A bit of a
review: this system manipulates sets of LR(1) items.  Closure closes a
set of LR(1) items such that if the parsing location (dot) is in front
of a nonterminal in the set -- a corresponding item is added to that
set.  Goto takes a set of LR(1) items and some symbol and returns a set
which represents a new state of the parser.  The canonical collection
set of algorithms seems to correspond roughly to a breadth first search
of the set of states in the DFA-like entity which represents the parser.

The current implementation maps LR(1) items to integers (powers of 2 in
CVS, but I experimented with consecutive low integers on my own) and one
of the major things that makes mint slow seemed  to be going to and from
this mapping.  Since the set of LR(1) items is so incredibly vast, it
takes a while for the lookup to find the corresponding integer in the
hash table mapping LR(1) items to integers.  This is done (mostly) for
the benefit of the bitsets.

I'm thinking (and tell me if my thinking is wrong) that the other
interesting problem with bitsets is that all set operations are (in
general) linear in the size of domain instead of in the size of the sets
involved.  Set membership in particular is very poor for large domains. 
The constants, however, are so incredibly low that in practice it's a
very efficient set representation -- it's also a good set representation
when the sets being operated on have lots of elements (with respect to
the domain.)  However, the set of all LR(1) items associated with a
grammar is pretty big.  It grows linearly in the number of symbols,
roughly linearly in the average length of the productions, and linearly
again in the number of productions.  Furthermore, the sets of LR(1)
representing the canonical collection are mostly disjoint meaning that
for any appreciable number of sets, the bitset representation becomes
very poor.

I'm starting to think the most efficient repesention of sets of LR(1)
items might be best achieved using a red-black tree -- this would
preclude the need for a mapping from LR(1) items to the integers or the
use of bitsets.  What do you folks think?
  giantrobot at f-m.fm

More information about the Nickle mailing list