[sword-devel] Character Frequency

Greg Hellings greg.hellings at gmail.com
Sun Jul 3 10:30:32 MST 2011


A few simple pipes in Unix can do the same thing with relative ease.

cat kjv.xml | sed -e 's/./&\n/g' | sort | uniq -c | sort -nr
1669596
1661832 "
1330866 o
1307266 r
1172801 s
1156121 e
1092384 n
1029125 m
 901465 t
 864037 >
 864037 <
 830916 =
 776214 a
 772641 w
 625029 h
 609087 :
 560652 g
 497519 l
 469056 /
 406801 i
 393184 0
 370919 p
 350731 1
 312386 H
 290358 2
 283469 8
 263960 3
 257239 d
 220707 .
 209066 5
 204056 b
 197713 4
 197400 c
 193701 7
 183464 6
 175932 G
 172006 9
 152074 -
 133127 I
 126782 M
 121721 D
 115182 N
 114636 v
 113384 T
 111775 u
 109108 y
 107290 P
  94242 A
  85226 S
  84923 f
  74768 ,
  73229 C
  39531 J
  36203 V
  35707 k
  34899
  25991 E
  24737 R
  23948 F
  20676 O
  18179 x
  16367 L
  10159 ;
   6930 z
   5389 K
   5047 B
   4036 …
   3421 ?
   3283 X
   2970 ¶
   2596 j
   2489 W
   2334 q
   2040 '
   1776 Z
    797 U
    551 Y
    313 !
    240 )
    240 (
    199 Q
     93 æ
      5 }
      5 {
      3 Æ
      1 ת
      1 ש
      1 ר
      1 ק
      1 צ
      1 פ
      1 ע
      1 ס
      1 נ
      1 מ
      1 ל
      1 כ
      1 י
      1 ט
      1 ח
      1 ז
      1 ו
      1 ה
      1 ד
      1 ג
      1 ב
      1 א

The format looks a bit nicer on the terminal. It takes about 75 seconds
to run on the file. A few simple lines in Python or the like take only
about 10s and are equally simple to whip up.
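
For illustration, a minimal sketch of what such a Python version might
look like. This is not the script that was timed above; it assumes the
input is UTF-8 and fits in memory, and the names char_frequencies and
main are made up for this example.

```python
import sys
from collections import Counter


def char_frequencies(text):
    """Return (character, count) pairs, most frequent first."""
    return Counter(text).most_common()


def main(path):
    # Decode as UTF-8 so multi-byte characters (e.g. Hebrew letters)
    # are counted as single characters rather than as separate bytes.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for ch, count in char_frequencies(text):
        # repr() keeps whitespace such as ' ' and '\n' visible in the output.
        print(f"{count:8d} {ch!r}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Run it as "python charfreq.py kjv.xml" (the script name is hypothetical).
Unlike the sed pipeline, this prints whitespace characters in quoted
form, so the counts for space and newline are easy to tell apart.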

--Greg

On Sun, Jul 3, 2011 at 11:53 AM, David Haslam <dfhmch at googlemail.com> wrote:
> A useful tool for analysing or editing source text files is BabelPad, the
> Unicode Text Editor (for Windows).
> http://www.babelstone.co.uk/Software/BabelPad.html
>
> One of the Menu Tool Options is Character Frequency.
>
> This can be very helpful to detect unexpected code points, such as when the
> translators were inconsistent when they were editing.
>
> David


