#786 ✓invalid
ronin-24025 (at lighthouseapp)

Optimize Array.uniq to return in O(n) time

Reported by ronin-24025 (at lighthouseapp) | September 3rd, 2009 @ 07:47 PM | in 1.7

Array.uniq calls Array.include for each element of the source array. This is exponentially expensive and could be replaced with a lookup hash to get O(n) time.

For example, on a sorted array of 1000 numbers, 478 of which are unique:

Average method time over 100 calls:

Existing Array.uniq(false): 234ms
Existing Array.uniq(true): 6ms

Proposed Array.uniq(false): 1.4ms
Proposed Array.uniq(true): 1.4ms

This proposal would no longer need a 'sorted' argument.

Comments and changes to this ticket

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 3rd, 2009 @ 07:49 PM

    • Tag set to performance, section:lang
  • ronin-24025 (at lighthouseapp)
  • Yaffle

    Yaffle September 3rd, 2009 @ 07:56 PM

     var pushed = {};
     ...
     value = this[index];
     pushed[value] // for Objects their toString method will be called
    
  • Samuel Lebeau

    Samuel Lebeau September 3rd, 2009 @ 09:14 PM

    • Assigned user set to “Samuel Lebeau”
    • Tag changed from performance, section:lang to patched, performance, section:lang
    • Milestone set to 1.7
  • Samuel Lebeau

    Samuel Lebeau September 3rd, 2009 @ 09:53 PM

    • State changed from “new” to “enhancement”
  • Tobie Langel

    Tobie Langel September 3rd, 2009 @ 11:55 PM

    • Tag changed from patched, performance, section:lang to performance, section:lang
    • State changed from “enhancement” to “invalid”

    This won't work if your array contains objects, as pointed out by Yaffle. (Two different vanilla objects have an identical toString representation).

    Closing as invalid.

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 4th, 2009 @ 07:40 AM

    Both Array.uniq methods pass the existing tests. I propose we expand the tests. This new proposed method will need some enhancement to support objects, true, false, null, undefined, etc. and from some further tests, we should still be able to do it faster than the existing method. The existing method treats two {name: 'Sam'} objects as different objects. This is to be expected, but this could also be improved.

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 4th, 2009 @ 09:38 AM

    After some more testing of the existing Array.uniq:

    [{name returns:

    [{name

    There's a bug here in that 0 and 1 are excluded because Array.include does == comparison.

    I use Array.uniq on representations parsed from JSON, and face-value comparison for these kinds of arrays and objects would be more useful than comparison by reference.

    Here's a more complex (and ugly) patch that uses two lookup tables to check for uniqueness among strings, numbers, null, true, false, undefined, objects and arrays. Objects and arrays are compared using JSON.stringify. Function, instance, element, etc. duplicates are ignored as per the existing Array.uniq. This also fixes the 0,1 bug.

    On an array of 1000 objects, all duplicates:

    Average method time over 100 calls:

    Existing Array.uniq(false): 927.8ms
    Existing Array.uniq(true): 7ms

    Proposed Array.uniq(): 15ms

    Bear in mind that Array.uniq(true) requires the array to be sorted first requiring an extra 2ms to 16ms for this data set.

    This patch requires JSON.stringify and Object.isObject.

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 4th, 2009 @ 09:40 AM

    Apologies over the formatting issue:

    Existing Array.uniq:

    [{"name":"Sam"},{"name":"Batman"},{"name":"Sam"},null,true,false,"false",0,1,"cool",null,true,false]
    

    Returns:

    [{"name":"Sam"},{"name":"Batman"},{"name":"Sam"},null,true,false,"false","cool"]
    
  • Yaffle

    Yaffle September 4th, 2009 @ 02:27 PM

    I think, JSON is superfluously here.And using Hash for string elements may be good.

      function uniq(sorted) { 
        var p = {}, value, res = [], i, l, pushed, tmp;
        for(i=0,l=this.length;i<l;i++){
          value = this[i];
          if(sorted){
            pushed = res.last() === value;
          }else{
            if(typeof(value)==="string"){
              tmp = '_'+value;
              pushed = (tmp in p);
              p[tmp] = 1;
            }else{
              pushed = res.include(value);
            }
          }
          if(!pushed){
            res.push(value);
          }
        }
        return res;
      }
    

    WIth non string values: We can use flags for booleans or even mark Objects with some property(like "_countedByPrototype") when objects isn't read-only, but it's too terrifying. =)

    Anyway Array#uniq and Array#intersect is rarely used in prototype core and extension...

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 4th, 2009 @ 10:51 PM

    Thanks Yaffle, there's still the issue of 1 and 0 being seen as duplicates of true and false respectively.

    Array.include within the loop is too expensive to use and must be replaced with an index. Even though this requires more type checking, its faster (as in 15:1000). While uglier, it need not be less accurate.

    Then there is also the issue of ignoring duplicate {name: 'Sam'} or {} objects. Perhaps the most common use case for Array.uniq would be for data arrays anyway, rather than for arrays of class instances etc. such that comparison by reference would be less useful than comparison by face value. In any event, the performance of the proposed Array.uniq using JSON.stringify is faster than the existing Array.uniq.

    Then, on a related note, while Array.uniq may be more learnable for users coming from Ruby, perhaps Array.unique would make for a better method name.

  • ronin-24025 (at lighthouseapp)

    ronin-24025 (at lighthouseapp) September 11th, 2009 @ 09:39 AM

    Re: "Two different vanilla objects have an identical toString representation". This is no longer an issue with the updated implementation.

    Can we reopen? The optimization is dramatic (for an array of 1000 duplicate objects):

    15ms vs 1000ms
    

Please Sign in or create a free account to add a new ticket.

With your very own profile, you can contribute to projects, track your activity, watch tickets, receive and update tickets through your email and much more.

New-ticket Create new ticket

Create your profile

Help contribute to this project by taking a few moments to create your personal profile. Create your profile ยป

The Prototype JavaScript library.

Shared Ticket Bins

Pages