About "sort" and "uniq" on Linux

Posted: 2009-12-15 18:39
newway
I have run into something that may or may not be a bug. I have a test file (test.txt) with the following contents:
● ○
● ○
⌚
● ○
● ○
● ○
● ○
● ○
● ○
● ○
○ ●
● ○
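It is just 12 lines, UTF-8 encoded; the separator in the middle is a space, and there is no whitespace before the line break.
Running sort < test.txt gives output identical to the original file. Even stranger,
running uniq -c < test.txt outputs: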
12 ● ○

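In other words, sort and uniq treat the three Unicode character combinations "● ○", "○ ●" and "○ ⌚" as identical?
I tried sort's -R, -n and similar options, and none of them made any difference...
Why is that?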

Re: About "sort" and "uniq" on Linux

Posted: 2009-12-15 19:02
xhy

static int
compare (const struct line *a, const struct line *b)
{
  int diff;
  size_t alen, blen;

  /* First try to compare on the specified keys (if any).
     The only two cases with no key at all are unadorned sort,
     and unadorned sort -r. */
  if (keylist)
    {    
      diff = keycompare (a, b);
      if (diff || unique || stable)
        return diff;
    }    

  /* If the keys all compare equal (or no keys were specified)
     fall through to the default comparison.  */
  alen = a->length - 1, blen = b->length - 1; 

  if (alen == 0)
    diff = - NONZERO (blen);
  else if (blen == 0)
    diff = 1; 
  else if (hard_LC_COLLATE)
    diff = xmemcoll (a->text, alen, b->text, blen);
  else if (! (diff = memcmp (a->text, b->text, MIN (alen, blen))))
    diff = alen < blen ? -1 : alen != blen;

  return reverse ? -diff : diff;
}
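Judging from the sort.c source: when hard_LC_COLLATE is set (that is, the locale is something other than C/POSIX), lines are compared with xmemcoll, which follows the locale's collation rules; otherwise they are compared with memcmp. So run
export LC_ALL="C"
and sort will then compare lines by their raw bytes.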
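
To try this without changing the locale for the whole shell session, LC_ALL=C can also be set per command. The following is only a minimal sketch: count_lines_bytewise is a hypothetical helper name, and the three-line sample is made up in the spirit of test.txt rather than being the original file. It assumes uniq honors the locale the same way sort does, which is why both commands get LC_ALL=C.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): sort a file and count
# duplicate lines using byte-wise comparison instead of locale collation.
count_lines_bytewise() {
    # LC_ALL=C is set only for these two commands, so the caller's
    # locale stays untouched.
    LC_ALL=C sort -- "$1" | LC_ALL=C uniq -c
}

# Made-up sample resembling test.txt (UTF-8, one space between symbols).
sample=$(mktemp)
printf '%s\n' '● ○' '○ ●' '● ○' > "$sample"

count_lines_bytewise "$sample"

rm -f "$sample"
```

With byte-wise comparison the two distinct lines should be reported separately: "○ ●" once and "● ○" twice, rather than all lines being collapsed into one.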