[Nickle] Bitsets considered harmful?

Tue Nov 30 11:11:32 PST 2004

Somehow I'm not making the right connection about how efficiently
finding the lsb solves the problem of mapping LR(1) items to consecutive
integers efficiently.  Can you explain this in more detail?  In the
current implementation, hashes are used only to map stuff (items,
strings) to consecutive integers so they can be used in bitsets -- is
this a wrongheaded notion?  If I have all the productions in the grammar
shoved into a similar integer <-> production mapping, I could create an
efficient bijection between the natural numbers like so:

For an LR(1) item with nonterminal S, production prod, and 'next' symbol
next:  

   S -> prod, next

Create an integer i such that:
i = (integer representing "next") + (integer representing "S") * (number
of symbols) +  (integer representing prod) * (number of symbols) ** 2

But this may not sufficiently speed up the set operation that's killing
us most -- union on bitsets in the domain of LR(1) item integer
representations.

On Tue, 30 Nov 2004 10:37:21 -0800, "Bart Massey" <nickle at po8.org> said:
> Note that Nickle contains an efficient implementation of
> Math::ilog() which would help a lot with your troubles.
> Hashes really aren't the right way to find the first 1 bit.
> Actually, ilog() does a lot of unecessary math: I just
> checked in a version of Nickle with Math::lsb() in it.  I'll
> be shocked if this doesn't solve your lookup speed problem.
> 
> Somebody should probably rewrite ilog() to work like lsb()
> in the case where base == 2 someday.
> 
> 	Bart
> 
> In message <1101832604.14475.209749476 at webmail.messagingengine.com> you
> wrote:
> > The audacious subject of this post is only half-serious -- I'm starting
> > to think reimplementing the Mint parser generator using bitsets as a set
> > representation may have been a bad idea.
> > 
> > The reason behind this has to do with the nature of the canonical
> > collection/closure/first set/goto system of algorithms.  A bit of a
> > review: this system manipulates sets of LR(1) items.  Closure closes a
> > set of LR(1) items such that if the parsing location (dot) is in front
> > of a nonterminal in the set -- a corresponding item is added to that
> > set.  Goto takes a set of LR(1) items and some symbol and returns a set
> > which represents a new state of the parser.  The canonical collection
> > set of algorithms seems to correspond roughly to a breadth first search
> > of the set of states in the DFA-like entity which represents the parser.
> > 
> > The current implementation maps LR(1) items to integers (powers of 2 in
> > CVS, but I experimented with consecutive low integers on my own) and one
> > of the major things that makes mint slow seemed  to be going to and from
> > this mapping.  Since the set of LR(1) items is so incredibly vast, it
> > takes a while for the lookup to find the corresponding integer in the
> > hash table mapping LR(1) items to integers.  This is done (mostly) for
> > the benefit of the bitsets.
> > 
> > I'm thinking (and tell me if my thinking is wrong) that the other
> > interesting problem with bitsets is that all set operations are (in
> > general) linear in the size of domain instead of in the size of the sets
> > involved.  Set membership in particular is very poor for large domains. 
> > The constants, however, are so incredibly low that in practice it's a
> > very efficient set representation -- it's also a good set representation
> > when the sets being operated on have lots of elements (with respect to
> > the domain.)  However, the set of all LR(1) items associated with a
> > grammar is pretty big.  It grows linearly in the number of symbols,
> > roughly linearly in the average length of the productions, and linearly
> > again in the number of productions.  Furthermore, the sets of LR(1)
> > representing the canonical collection are mostly disjoint meaning that
> > for any appreciable number of sets, the bitset representation becomes
> > very poor.
> > 
> > I'm starting to think the most efficient repesention of sets of LR(1)
> > items might be best achieved using a red-black tree -- this would
> > preclude the need for a mapping from LR(1) items to the integers or the
> > use of bitsets.  What do you folks think?
> > -- 
> >   James
> >   giantrobot at f-m.fm
> > 
> > 
> > _______________________________________________
> > Nickle mailing list
> > Nickle at nickle.org
> > http://nickle.org/mailman/listinfo/nickle
-- 
  James
  giantrobot at f-m.fm