When a Reddit user created a giant archive of all Reddit comments available at the time, I downloaded the data set and later indexed it with a current version of Lucene. The code for this is in my GitHub repository: reddit-data-tools. At the moment it only supports a limited kind of search from the command line.

Searching for "fun +author:girl~" yields a count of all matching documents and displays the top ten, including their comment text and a back link to Reddit.

Opening search index at F:/reddit_data/index-all. This may take a moment.
Going to search over 1532362437 documents.
Found: 1621 matching documents.
Going to display top 10:
Score: 2.742787 author: girfl, url: http://www.reddit.com/r/Hair/comments/17njif/c87a4al
That was fun!
Score: 2.5318978 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26s6br/chtxurs
that was fun! cya at the next family reunion? :)
Score: 2.4616015 author: qgirl, url: http://www.reddit.com/r/TwoXChromosomes/comments/au7t1/c0jfana
Yes! I almost always regret it when it's time to reverse the process, but it's fun anyway.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/AskReddit/comments/18g4t6/c8esn2j
I have old lady hands. They are very wrinkled and I used to get made fun all the time in high school.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/MakeupAddiction/comments/ytfi3/c5ypqon
Getting ready to go out is often more fun than actually being out.
Score: 2.4616015 author: zgirl, url: http://www.reddit.com/r/Fitness/comments/1fbagh/ca914bt
I also give fellow runners high fives as I pass them. It's fun.
Score: 2.4616015 author: girl8, url: http://www.reddit.com/r/AskReddit/comments/2d7g5o/cjmtslp
Have fun, be yourself. Don't go home the first few weeks no matter how much you might want to.
Score: 2.4616015 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26u8w0/chuijd2
Because we make fun of ourselves, so y'all are given free reign to do the same.
Score: 2.4616015 author: gdrl, url: http://www.reddit.com/r/Steam/comments/2riu4j/cngq8r9
Some of the fun is just gifting friends ultra ridiculous games and then having them do the "wtf?"
when they see/msg you. :)
Score: 2.4498854 author: girl_, url: http://www.reddit.com/r/berlinsocialclub/comments/lffry/c2t2vyf
Okay me and M^^^^ want to go out Saturday night. Where can we go thats fun without
people fornicating in the bathroom?
Search took 3979 ms

The clause "+author:girl~" selects all authors whose username is equal to or resembles "girl". The search took almost 4 seconds with the index residing on a hard disk while another process was hogging all the CPU.
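
To illustrate why usernames such as "girfl" or "qirl" match author:girl~, here is a minimal sketch of edit-distance matching. It assumes the fuzzy operator accepts terms within a small edit distance of the query term (a maximum of 2 edits is the assumption used here). The class and method names are made up for this example and are not part of reddit-data-tools or Lucene, and Lucene's real implementation works differently and far more efficiently than this naive loop.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: plain Levenshtein distance, not Lucene's actual implementation. */
final class FuzzyAuthorSketch {

    // Assumed maximum number of edits for a fuzzy match (hypothetical value for this sketch).
    static final int MAX_EDITS = 2;

    /** Number of single-character insertions, deletions or substitutions needed to turn a into b. */
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // The row just computed becomes the "previous" row of the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Keeps only the authors within MAX_EDITS of the query term. */
    static List<String> fuzzyMatches(String term, List<String> authors) {
        return authors.stream()
                .filter(author -> editDistance(term, author) <= MAX_EDITS)
                .collect(Collectors.toList());
    }
}
```

With these definitions, editDistance("girl", "girfl") is 1 (one inserted character), so "girfl" counts as a match, while a name like "gentleman" does not.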

Indexed fields are:

  • author
  • name (id of the comment in t1_-format)
  • body (text of the comment)
  • gilded
  • score
  • ups
  • downs
  • created_utc
  • parent_id
  • subreddit
  • id (id of the comment)
  • url (custom field, which *should* contain a valid link to the comment or the comment thread)

All fields are stored fields (so you can display the field value of a document).

"Body" is the only field that is considered as "text".


Another example: a search for "love story twilight" restricted to comments with more than 1000 upvotes (the links are not fully reliable yet):

Opening search index at F:/reddit_data/index-all. This may take a moment. 
Going to search over 1532362437 documents.
Found: 20 matching documents. Going to display top 10:
DocScore: 4.435478 author: dathom, ups:1103, url: http://www.reddit.com/r/AskReddit/comments/psoue/c3s132v
  Still a better love story than Twilight.
DocScore: 4.435478 author: Xenoo, ups:1358, url: http://www.reddit.com/r/funny/comments/qqhcm/c3zn0xo
  Still a better love story than twilight.
DocScore: 4.435478 author: unglad, ups:1986, url: http://www.reddit.com/r/nottheonion/comments/2ewday/ck3knl6
  OK maybe twilight was a better love story than this
Search took 4392 ms



Yesterday I encountered a strange bug: programmatically generated tar archives of a Lucene index were corrupted during creation. The process finished normally, but occasionally the resulting archive would be broken. It turns out that someone had (probably with good reason, once upon a time) created our own implementation of a tar packaging module in Java, based upon the Apache Ant tar task.

The problem with the original source code is that the Ant tar task is limited to an individual file size of 8 GByte, though the resulting archive may be far larger. This was fixed in the Apache Commons Compress library v1.4 some time ago, but you would have to use one of the GNU tar formats, which support unlimited file size.
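
The 8 GByte limit follows from the classic tar header layout, assuming the usual 12-byte size field that stores the entry size as octal digits followed by a terminator: 11 octal digits can express at most 8^11 - 1 bytes. A small sketch of that arithmetic (the class and method names are hypothetical and are not taken from Ant or from our own module):

```java
/** Illustrative arithmetic for the size field of a classic tar header. */
final class TarSizeLimit {

    // Assumption: 12-byte size field, of which 11 bytes hold octal digits and 1 byte a terminator.
    static final int OCTAL_DIGITS = 11;

    // Each octal digit carries 3 bits, so the maximum is 2^33 - 1 = 8^11 - 1 = 8589934591 bytes.
    static final long MAX_CLASSIC_ENTRY_BYTES = (1L << (3 * OCTAL_DIGITS)) - 1;

    /** True if a file of the given size can be described by a classic tar header. */
    static boolean fitsClassicHeader(long fileSizeInBytes) {
        return fileSizeInBytes >= 0 && fileSizeInBytes <= MAX_CLASSIC_ENTRY_BYTES;
    }

    /** Formats the size as zero-padded octal digits, the way it would appear in the header. */
    static String toOctalSizeField(long fileSizeInBytes) {
        if (!fitsClassicHeader(fileSizeInBytes)) {
            throw new IllegalArgumentException(
                    "Size does not fit into " + OCTAL_DIGITS + " octal digits: " + fileSizeInBytes);
        }
        return String.format("%0" + OCTAL_DIGITS + "o", fileSizeInBytes);
    }
}
```

A 25 GByte file fails fitsClassicHeader, while a file of a few hundred MByte passes without problems.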

The workaround for now seems to be to use Lucene's default indexRamBufferSize setting of 16 MByte, so that the segment files no longer grow to 25 GByte but stay below the 8 GByte limit. A change of the compression module in the near future (preferably to standard open source components instead of homebrew versions) is already planned.

  •     Added more tests to ProductionServiceSpec and cleaned up the code.
  •     Use Google Guava Ints/Longs.tryParse() to make code more elegant.
  •     Use new OptionalResult class instead of new RuntimeException to report problems.   
        (The ProductionController still uses the old problem reporting method via Exception, but is now ready for further refactoring)
  •     Add OptionalResult<T> class
        This helps reduce the places where problems are communicated via RuntimeException
        and also reduces cases where nullable objects are returned and not properly handled.
        The old pattern in Little Goblin is:
        1. if(problem){throw new RuntimeException(errorMessage)}
        2. catch RuntimeException e, render error message from e.getMessage() (this breaks with unexpected exceptions lacking a proper messageId...)
        The new pattern should be:
        1. for each problem: add error to OptionalResult
        2. return OptionalResult and handle valid/invalid case from there
        The old pattern mixes serious exceptions with common user errors (for example, using a too-short password),
        it often uses expensive Exception objects (with complete StackTraces) and can only ever communicate one problem.
        You can work around the expensive objects problem by using static exceptions, but it's still ugly.
  •     Remove a hasMany connection, which makes it easier to unit test.
  •     Use Java 8
  •     Changed formatting; added custom toString() to several classes to improve debugging and test output.
  •     Upgrade to Grails 2.4.4 - Closes #97
  •     Add Google's Guava library.
        This lib provides many useful constructs, for example Longs.tryParse(str), which parses Strings and returns null instead of throwing NumberFormatException in case of invalid input.
  •     Started work on ProductionServiceSpec.
  •     Refactored ConstraintUnitSpec so it's no longer a parent class for unit tests.
        The idea of a class which could provide all kinds of mocked items for tests was good, but failed to take implementation details into account. Different unit tests would need specialized Item or Product instances and this complexity would carry over to the providing class. Also, unit tests with the TestFor annotation would not necessarily want those classes to be Mock'ed. The current test writing strategy is to make the test classes as independent as possible. A little bit more redundant, a lot less confusing/complex.
  •     Update Tomcat plugin to version 8.
  •     Made Creature class abstract as there is no use case for instantiating Creature objects - LG always uses sub classes. 
        Also I think this might make ORM easier.
  •     In Item class changed field owner from Creature to PlayerCharacter.   
        Reason: I was having a hard time creating a unit test that would work with the extending class PlayerCharacter with this field.
        Then I changed Creature to be an abstract class as there won't be any Creature objects instantiated directly anyway.
        Now it looks like the ProductionServiceSpec will run without problems.
        Only drawback: Mobs, which extend Creature, cannot own Items now. But there was already a need to write new code for Mobs using Items.
  •     Fix #93; computeMaxProduction to return correct amount.
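
The OptionalResult pattern from the changelog entry above might look roughly like the following sketch. This is not the actual Little Goblin class: the method names (addError, setResult, isValid, getErrors, getResult), the message ids and the password rules are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a result holder that collects user-facing errors instead of throwing. */
final class OptionalResult<T> {
    private final List<String> errors = new ArrayList<>();
    private T result;

    void addError(String messageId) { errors.add(messageId); }

    void setResult(T value) { this.result = value; }

    /** Valid means: no errors were recorded and a result is present. */
    boolean isValid() { return errors.isEmpty() && result != null; }

    List<String> getErrors() { return Collections.unmodifiableList(errors); }

    T getResult() {
        if (!isValid()) {
            throw new IllegalStateException("No valid result available: " + errors);
        }
        return result;
    }
}

final class PasswordCheckSketch {
    // Assumed minimum length, chosen only for this example.
    static final int MIN_LENGTH = 8;

    /** Step 1 of the new pattern: add one error per problem. Step 2: return the result object. */
    static OptionalResult<String> checkPassword(String password) {
        OptionalResult<String> result = new OptionalResult<>();
        if (password.length() < MIN_LENGTH) {
            result.addError("error.password.too.short");
        }
        if (password.chars().noneMatch(Character::isDigit)) {
            result.addError("error.password.no.digit");
        }
        if (result.getErrors().isEmpty()) {
            result.setResult(password);
        }
        return result;
    }
}
```

Unlike the exception-based approach, a single call can report both problems for a password like "abc", and the caller decides how to handle the valid and the invalid case.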

Today I had a very puzzling Exception case in a Spock unit test for LittleGoblin:

    void "happyPathTest"() {
        // 'user@example.com' is a placeholder; the original address was obscured by the blog's spam protection
        def accountResult = userAccountService.createUserAccount('a username', 'a password',
                'user@example.com')
        // ... (rest of the test omitted)
    }

The test threw a NullPointerException (NPE) on the line with 'def accountResult'.

After making sure that the userAccountService was not null, I checked what createUserAccount returns. The service looks like this:

    package de.dewarim.goblin

    import grails.transaction.Transactional

    @Transactional
    class UserAccountService {

        AccountCreationResult createUserAccount(String username, String password, String email) {
            try {
                UserAccount newAccount = new UserAccount(username: username, email: email, userRealName: username)
                newAccount.passwd = password
                Role role = Role.findByName('ROLE_USER')
                UserRole userRole = new UserRole(newAccount, role)
                return new AccountCreationResult(userAccount: newAccount)
            }
            catch (Exception e) {
                // Any failure is reported through the result object instead of being rethrown.
                log.debug("Failed to create account: " + e.getMessage())
                return new AccountCreationResult(errorMessage: e.getMessage())
            }
        }
    }

So, due to the try...catch block, the method should _always_ return an object, never null.

After searching for quite some time I came upon the JIRA entry:

Unit test of service doesn't work with @Transactional annotation

It turns out that you need to add @Mock([...]) for all domain objects used in the service method, and also mock the service's transactionManager in the test's setup():

    userAccountService.transactionManager = Mock(PlatformTransactionManager) {
        getTransaction(_) >> Mock(TransactionStatus)
    }




Descriptions of error messages and possible solutions

Posts about Little Goblin, the Grails-based open source browser game engine, and its reference implementation.

The home page of Little Goblin is littlegoblin.de