faceclick/data/README

This document explores the Emoji data and how to pare it down in different ways
to make a subset that:

* Works for my intended audience
* Is as small on disk as possible
* Has a great keyword search feature

After some research, I'm going with the emojibase.dev data set, which is based
on the official Unicode data files. It has excellent keywords ("tags") for
searching the emoji, labels, grouping, etc.

-------------------------------------------------------------------------------

Maybe I'm a dunce, but I had a heck of a time figuring out where to get the
JSON files from the https://emojibase.dev website. But I eventually found the
CDN where the raw files are hosted. I got the full raw JSON file here:

    https://cdn.jsdelivr.net/npm/emojibase-data@16.0.3/en/

Initially, I ran the data through `jq` to pretty-print it for readability
while I was getting started and named it emojibase.pretty.json. So any
references to the "pretty" file are talking about that.

(Side note: I started learning enough of the jq command line tool to use it for
all of my JSON manipulation needs, but then I realized that I already had a
full language I knew well at my disposal: Ruby. It's three lines to
read the file, and then I can just use familiar loops and such. You can
pretty-print the data and everything else.)
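For the curious, the Ruby replacement for that jq workflow is a sketch like
this (shown with an inline sample row instead of the real data.json):

```ruby
require "json"

# Parse a JSON document. The real script would use
# File.read("data.json") here instead of an inline sample.
raw  = '[{"label":"grinning face","emoji":"😀","tags":["face","smile"]}]'
data = JSON.parse(raw)

# Pretty-print it back out -- same idea as `jq .`.
pretty = JSON.pretty_generate(data)
puts pretty
```

From there, filtering and counting are just ordinary Ruby loops over hashes
and arrays.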
    31 
The name, as downloaded, of the Emojibase data set in this directory is:

    data.json

That's the FULL data with labels, groups, keywords, etc.

-------------------------------------------------------------------------------

Tools:

customizer.rb - Makes personalized alterations to the full emojibase JSON

getstats.rb   - Prints stats (including bytes used) about the relevant parts
                of a given JSON file

makejs.rb     - Processes an Emoji data set (presumably 'myemoj.json', generated
                by customizer.rb) and generates JavaScript (pretty much JSON,
                but namespaced to 'FC.data' for the final library, and with
                things that are not legal in JSON, such as comments and
                trailing commas).

go.sh         - Open it and see! (Automates my most common process, and is
                currently changing rapidly. It will probably end up doing the
                entire process of customizing the data, making an HTML
                contact sheet (to see which emoji are used), and exporting the
                JavaScript version of the data for use in the final picker.)

makesheet.rb  - Creates an HTML contact sheet for a given JSON input file.
                Sheet is a single page with all of the emoji and labels in
                tooltips.
                (Now includes stats from getstats.rb!)

-------------------------------------------------------------------------------

Is it worth trying to "compress" via indexed keywords, etc.?

Let's look at gzip compression for a rough idea (sizes in bytes):

               uncompressed     compressed
    data       385566           31635
    w/ groups  420085           32258
    emoj list  23137            5281
    emoj txt   11699            4762
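Those compressed sizes are easy to reproduce with Ruby's built-in Zlib. A
sketch (deflate is gzip minus a small header, close enough for a rough idea;
the table above used the actual data files):

```ruby
require "zlib"

# Rough "how well does this compress?" check: deflate a string and
# compare byte counts.
def deflated_size(str)
  Zlib::Deflate.deflate(str).bytesize
end

# Repetitive keyword-ish text compresses dramatically, like the real data.
sample = "grinning face smiling face cat face dog face " * 200
puts "#{sample.bytesize} -> #{deflated_size(sample)}"
```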
    75 
Most of the raw data stats below were gathered with getstats.rb.
    77 
-------------------------------------------------------------------------------

Full list (full-base-stats.rb):

    (file size of emojibase.pretty.json is 1174981 bytes)

    list len: 1941
    raw emoji len: 3377 (longer than list due to ligature combos!)
    raw emoji bytes: 12295 (much longer due to multibyte + ligatures)
    labels (bytes): 25721
    tags: 10108
    tags (bytes): 56816
    unique tags: 3615
    unique tags (bytes): 21079
    -----------------
    tags+labels+emoji (bytes): 94832
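A sketch of how getstats.rb-style numbers fall out of the data, using
Emojibase's `label`/`emoji`/`tags` field names and two made-up rows:

```ruby
# Two sample rows in Emojibase's shape; the real script parses data.json.
list = [
  { "label" => "grinning face", "emoji" => "😀", "tags" => %w[face smile] },
  { "label" => "cat face",      "emoji" => "🐱", "tags" => %w[face cat] },
]

emoji_bytes = list.sum { |e| e["emoji"].bytesize }  # multibyte adds up fast
tags        = list.flat_map { |e| e["tags"] }
unique      = tags.uniq

puts "list len: #{list.length}"
puts "raw emoji bytes: #{emoji_bytes}"
puts "tags: #{tags.length}"
puts "unique tags: #{unique.length} (#{unique.sum(&:bytesize)} bytes)"
```

Note that `bytesize` (not `length`) is what matters here: each of these emoji
is one "character" but four bytes of UTF-8.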
    94 
My list (myemoj.json):

    NOTE: The exact numbers below were out of date almost immediately because I
    found more items to remove. I'm not going to keep updating them here. But
    you can always re-run the script(s) for your data pleasures.

    File sizes:
        869445  With all emojibase data
        310256  With just labels, emoji, group, and tags

    Raw data:
    list len: 1778
    raw emoji len: 2758 (longer than list due to ligature combos!)
    raw emoji bytes: 10234 (much longer due to multibyte + ligatures)
    labels (bytes): 22946
    tags: 8885
    tags (bytes): 49539
    unique tags: 3571
    unique tags (bytes): 20776
    -----------------
    tags+labels+emoji (bytes): 82719

JSON vs Raw data - (all relating to myemoj.json)

    310256  Pretty formatted JSON
    185830  JSON       -124426 bytes (40% smaller than pretty)
    82719   Raw data   -103111 bytes (55% smaller than JSON)

So JSON encoding alone more than doubles the file size, and pretty JSON
nearly quadruples it.

I'll need SOME sort of encoding, and I suspect I'm going to end up with some
sort of hybrid with data packed into some sort of string. It will still be
lightning fast to chop up.
   128 
-------------------------------------------------------------------------------

New idea:

    Deconstruct labels into synthetic tags by splitting on space, then
    add those to the tag list and then re-construct the labels at runtime
    by using tag indexes!

    Here goes:

    78,393 mydata.js - with labels and tags
    88,663 mydata.js - with tags + label words + labels deconstructed
    87,932 mydata.js - same as above, but all lower case (not worth it)

    So that didn't work! The size went up because of all of the 4-digit
    index numbers.
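    A sketch of why the indexes bloat (the word table and indexes here are
    made up for illustration):

```ruby
# Hypothetical shared word table; the real one had ~3,600 entries,
# so most indexes are 4 digits long.
words = %w[grinning face smiling cat]

# "grinning face" stored as indexes into the table...
label_as_indexes = [0, 1]

# ...costs "[0,1]" here, but with 4-digit indexes it's "[1234,2345]":
# 4+ characters per word, plus brackets and commas -- often more than
# the short word being replaced.
label = label_as_indexes.map { |i| words[i] }.join(" ")
```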
   145 
    But... it got me thinking that part of the reason the result was so
    bulky is that the labels and tag references are quite redundant - I
    don't need to reference a tag from an emoji if I also reference that
    same tag from the emoji's label.

    So now I'm going to change it to a *simpler* system where I gather
    all "words" from both tags and labels:

        2811 unique words - tags
        3218 unique words - tags + labels split into words

    So labels only contain about 400 words not already in tags. This looks
    very promising!

        Also:
        3169 unique words - if we also make labels downcase...
        ...which I've decided not to do (it's commented out in customizer.rb)

    So the plan: store the words to re-construct the labels first, and then
    ONLY store the tags that aren't already part of the label...

    74,057 mydata.js - yeah! That's 4kb smaller than the raw labels and tags.
   168 
Conclusion: It's surprisingly hard to actually save any space when a small
4-digit number is actually stored as 4 whole characters, plus the surrounding
syntax of an array [] and commas to separate values.

It *would* be quite interesting to pack bits...but I'm pretty sure the unpacking
code would eat up most of the savings, and I don't see any sense in making
it more obfuscated than it already is. Obfuscation has never been the goal...

In fact, rather than three separate lists, I think I should have the tags and
labels nested with the emoji so it's actually readable. I will pay for the
additional quotes '' around each emoji, which comes to 2kb...hmm... Totally
worth it.

Also, any reference to a single-use word can add up to 7 completely wasted
characters, so I need to only store words that are used more than once...and
even then, probably only words that are 2 or more characters in size:

    emoji | label with ref  | tags
    ------+-----------------+--------
    ["X", "Big $23 dog", ["+",34,15]]

Wow, very surprised to find that there are only 404 unique words once you
de-dupe the synthetic tags from the labels.
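Decoding a `$`-reference label like the one above is nearly a one-liner. This
sketch assumes `$` plus digits marks an index into the shared word table (the
real scheme lives in makejs.rb; the names and table here are illustrative):

```ruby
# Hypothetical shared table of words that are used more than once.
WORDS = %w[face cat dog smiling]

# Expand "$N" references against the word table; anything else is a
# literal word stored inline.
def expand(label)
  label.gsub(/\$(\d+)/) { WORDS[Regexp.last_match(1).to_i] }
end

expand("Big $2 grin")  # -> "Big dog grin"
```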
   195 
-------------------------------------------------------------------------------
Next day:

Okay, so now I've got BOTH keyword lists (labels and tags) stored as
space-separated strings because that's way more compact (and readable!) than
a JS array and is trivial to split into an array in JS.

The tags and labels are de-duped *per emoji* because I'm going to search the
terms in both anyway. In fact, I think this will actually speed up search
on the other end if I don't ever even turn them into arrays, LOL. Kind of
amazing how going deep on a problem can really turn it inside-out and
end up simplifying...but I'm getting ahead of myself. Gotta see how big
the output is and then find the right blend of word usage vs word length.
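Searching without ever splitting is then just a substring test per emoji. A
sketch (the entry shape and matching rules are my guess at the final picker,
not its actual code):

```ruby
# Each entry: [emoji, "space-separated keywords from label + tags"].
DATA = [
  ["😀", "grinning face smile happy"],
  ["🐱", "cat face animal"],
]

# Plain substring match -- no arrays, no splitting, just String#include?.
def search(query)
  q = query.downcase
  DATA.select { |_, words| words.include?(q) }.map(&:first)
end

search("cat")   # -> ["🐱"]
search("face")  # -> ["😀", "🐱"]
```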
   209 
I have two parameters I can mess with now to try to make it as compact
as possible:

    min_word_usage_count = 2
    min_word_length = 1

Those settings give me...

    65,514 mydata.js - Now we're talkin!

The previous best was 74,057, so this is an 8.5Kb savings.

So I want to test a small number of permutations to see if I can improve
on that initial setting. I'm going to write a little script to automate
testing...

    ruby word_experiment.rb
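The experiment is just two nested ranges with the byte count first so a plain
sort finds the winner. A sketch of the shape of word_experiment.rb, with a
made-up size function standing in for actually regenerating and measuring
mydata.js:

```ruby
# Hypothetical stand-in for "rebuild mydata.js with these settings and
# measure the file" -- a fake curve with its sweet spot at 4/4, like
# the real results below. The real script runs the actual build.
def build_size(min_usage, min_len)
  62_000 + ((min_usage - 4)**2) * 500 + ((min_len - 4)**2) * 400
end

results = []
(1..5).each do |min_usage|
  (1..5).each do |min_len|
    results << [build_size(min_usage, min_len), min_usage, min_len]
  end
end

# Bytes first, so sorting puts the smallest build on top.
results.sort.each do |bytes, usage, len|
  puts "#{bytes} bytes. min usage count=#{usage} min word length=#{len}"
end
```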
   229 
I put the bytes output at the beginning so I can sort, so let's see...

    62719 bytes. min usage count=4 min word length=4
    62759 bytes. min usage count=5 min word length=4
    62817 bytes. min usage count=3 min word length=5
    62837 bytes. min usage count=3 min word length=4
    62919 bytes. min usage count=4 min word length=5
    63073 bytes. min usage count=5 min word length=5
    63153 bytes. min usage count=5 min word length=3
    63210 bytes. min usage count=4 min word length=3
    63280 bytes. min usage count=5 min word length=1
    63280 bytes. min usage count=5 min word length=2
    63360 bytes. min usage count=4 min word length=2
    63388 bytes. min usage count=4 min word length=1
    63472 bytes. min usage count=3 min word length=3
    63631 bytes. min usage count=3 min word length=2
    63656 bytes. min usage count=2 min word length=5
    63763 bytes. min usage count=3 min word length=1
    64084 bytes. min usage count=2 min word length=4
    65049 bytes. min usage count=2 min word length=3
    65307 bytes. min usage count=2 min word length=2
    65514 bytes. min usage count=2 min word length=1
    73302 bytes. min usage count=1 min word length=5
    75830 bytes. min usage count=1 min word length=4
    77661 bytes. min usage count=1 min word length=3
    78104 bytes. min usage count=1 min word length=2
    78400 bytes. min usage count=1 min word length=1
   257 
Okay, that's awesome. The total size goes down as I increase the word usage
count and minimum word length...until we get to the magical value of 4 for
each, and then it starts to creep back up again.

This was a highly variable problem, so an experiment was, by far, the easiest
and quickest way to find the optimal settings and shave off an additional
2.8Kb.

To be clear, at 74Kb, I had something pretty obfuscated, but at 63Kb, it is
waaaay more understandable ("readable" would probably be overstating it).

Now to re-write everything that uses this data to see if it, you know, works!

2025-08-13 IT WORKS!!!! And the whole thing is under 70Kb (or a little over
if you include the CSS).