[Nickle] Bitsets considered harmful?

Tue Nov 30 10:37:21 PST 2004

Note that Nickle contains an efficient implementation of
Math::ilog() which would help a lot with your troubles.
Hashes really aren't the right way to find the first 1 bit.
Actually, ilog() does a lot of unecessary math: I just
checked in a version of Nickle with Math::lsb() in it.  I'll
be shocked if this doesn't solve your lookup speed problem.

Somebody should probably rewrite ilog() to work like lsb()
in the case where base == 2 someday.

	Bart

In message <1101832604.14475.209749476 at webmail.messagingengine.com> you wrote:
> The audacious subject of this post is only half-serious -- I'm starting
> to think reimplementing the Mint parser generator using bitsets as a set
> representation may have been a bad idea.
> 
> The reason behind this has to do with the nature of the canonical
> collection/closure/first set/goto system of algorithms.  A bit of a
> review: this system manipulates sets of LR(1) items.  Closure closes a
> set of LR(1) items such that if the parsing location (dot) is in front
> of a nonterminal in the set -- a corresponding item is added to that
> set.  Goto takes a set of LR(1) items and some symbol and returns a set
> which represents a new state of the parser.  The canonical collection
> set of algorithms seems to correspond roughly to a breadth first search
> of the set of states in the DFA-like entity which represents the parser.
> 
> The current implementation maps LR(1) items to integers (powers of 2 in
> CVS, but I experimented with consecutive low integers on my own) and one
> of the major things that makes mint slow seemed  to be going to and from
> this mapping.  Since the set of LR(1) items is so incredibly vast, it
> takes a while for the lookup to find the corresponding integer in the
> hash table mapping LR(1) items to integers.  This is done (mostly) for
> the benefit of the bitsets.
> 
> I'm thinking (and tell me if my thinking is wrong) that the other
> interesting problem with bitsets is that all set operations are (in
> general) linear in the size of domain instead of in the size of the sets
> involved.  Set membership in particular is very poor for large domains. 
> The constants, however, are so incredibly low that in practice it's a
> very efficient set representation -- it's also a good set representation
> when the sets being operated on have lots of elements (with respect to
> the domain.)  However, the set of all LR(1) items associated with a
> grammar is pretty big.  It grows linearly in the number of symbols,
> roughly linearly in the average length of the productions, and linearly
> again in the number of productions.  Furthermore, the sets of LR(1)
> representing the canonical collection are mostly disjoint meaning that
> for any appreciable number of sets, the bitset representation becomes
> very poor.
> 
> I'm starting to think the most efficient repesention of sets of LR(1)
> items might be best achieved using a red-black tree -- this would
> preclude the need for a mapping from LR(1) items to the integers or the
> use of bitsets.  What do you folks think?
> -- 
>   James
>   giantrobot at f-m.fm
> 
> 
> _______________________________________________
> Nickle mailing list
> Nickle at nickle.org
> http://nickle.org/mailman/listinfo/nickle