Indexing all Reddit comments

When a Reddit user published a giant archive of all Reddit comments available at the time, I downloaded the data set and later indexed it with a current version of Lucene. The code for this is in my GitHub repository, reddit-data-tools. At the moment it only supports a limited kind of search from the command line.

Searching for "fun +author:girl~" yields a count of all matching documents and displays the top ten, including their comment text and a back link to Reddit.

Opening search index at F:/reddit_data/index-all. This may take a moment.
Going to search over 1532362437 documents.
Found: 1621 matching documents.
Going to display top 10:
Score: 2.742787 author: girfl, url: http://www.reddit.com/r/Hair/comments/17njif/c87a4al
That was fun!
Score: 2.5318978 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26s6br/chtxurs
that was fun! cya at the next family reunion? :)
Score: 2.4616015 author: qgirl, url: http://www.reddit.com/r/TwoXChromosomes/comments/au7t1/c0jfana
Yes! I almost always regret it when it's time to reverse the process, but it's fun anyway.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/AskReddit/comments/18g4t6/c8esn2j
I have old lady hands. They are very wrinkled and I used to get made fun all the time in high school.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/MakeupAddiction/comments/ytfi3/c5ypqon
Getting ready to go out is often more fun than actually being out.
Score: 2.4616015 author: zgirl, url: http://www.reddit.com/r/Fitness/comments/1fbagh/ca914bt
I also give fellow runners high fives as I pass them. It's fun.
Score: 2.4616015 author: girl8, url: http://www.reddit.com/r/AskReddit/comments/2d7g5o/cjmtslp
Have fun, be yourself. Don't go home the first few weeks no matter how much you might want to.
Score: 2.4616015 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26u8w0/chuijd2
Because we make fun of ourselves, so y'all are given free reign to do the same.
Score: 2.4616015 author: gdrl, url: http://www.reddit.com/r/Steam/comments/2riu4j/cngq8r9
Some of the fun is just gifting friends ultra ridiculous games and then having them do the "wtf?"
when they see/msg you. :)
Score: 2.4498854 author: girl_, url: http://www.reddit.com/r/berlinsocialclub/comments/lffry/c2t2vyf
Okay me and M^^^^ want to go out Saturday night. Where can we go thats fun without
people fornicating in the bathroom?
Search took 3979 ms

The clause "+author:girl~" selects all authors whose username equals or resembles "girl": the leading "+" makes the clause mandatory, and the trailing "~" requests a fuzzy match. The search took almost 4 seconds with the index residing on a hard disk and another process hogging the CPU.
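
To make "resembling" more concrete: Lucene's fuzzy matching is based on edit distance, and to my knowledge the bare "~" allows up to two edits by default. The following is only a minimal sketch of that idea using plain Levenshtein distance in Python. It is not how Lucene implements it internally, and the helper names `edit_distance` and `fuzzy_match` as well as the `max_edits` default are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term: str, candidate: str, max_edits: int = 2) -> bool:
    """True if candidate is within max_edits edits of term (assumed default: 2)."""
    return edit_distance(term, candidate) <= max_edits


# Author names taken from the search output above.
authors = ["girfl", "qirl", "qgirl", "zgirl", "girl8", "gdrl", "girl_"]
for author in authors:
    print(author, edit_distance("girl", author), fuzzy_match("girl", author))
```

Each of the author names in the result list is a single edit away from "girl", which is consistent with all of them being matched by the fuzzy clause.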

Indexed fields are:

  • author
  • name (full id of the comment, in t1_ format)
  • body (text of the comment)
  • gilded
  • score
  • ups
  • downs
  • created_utc
  • parent_id
  • subreddit
  • id (id of the comment)
  • url (custom field, which *should* contain a valid link to the comment or the comment thread)

All fields are stored fields (so you can display the field value of a document).

"body" is the only field that is treated as "text".
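
To illustrate the field list, here is a rough Python sketch of how one comment record could be flattened into these fields before indexing. It is not the code from reddit-data-tools. It assumes that each comment arrives as one JSON object, and that the thread id is available under a `link_id` key (which is not among the indexed fields listed above). The url shape simply mirrors the links in the search output, and `to_document` and `build_url` are hypothetical helper names.

```python
import json

# Field names as listed above; "url" is added separately because it is a custom field.
INDEXED_FIELDS = [
    "author", "name", "body", "gilded", "score", "ups", "downs",
    "created_utc", "parent_id", "subreddit", "id",
]


def build_url(subreddit, link_id, comment_id):
    """Assemble a link in the shape seen in the search output, or None if data is missing."""
    if not (subreddit and link_id and comment_id):
        return None
    # Assumption: link_id looks like "t3_17njif"; strip the type prefix if present.
    thread_id = link_id[3:] if link_id.startswith("t3_") else link_id
    return f"http://www.reddit.com/r/{subreddit}/comments/{thread_id}/{comment_id}"


def to_document(json_line):
    """Turn one JSON-encoded comment into a flat dict of the indexed fields."""
    comment = json.loads(json_line)
    # Fields missing from the record are kept as None rather than guessed.
    document = {field: comment.get(field) for field in INDEXED_FIELDS}
    document["url"] = build_url(
        comment.get("subreddit"), comment.get("link_id"), comment.get("id")
    )
    return document


# Sample record rebuilt from the first search hit above; only values visible there are filled in.
sample_line = json.dumps({
    "author": "girfl",
    "subreddit": "Hair",
    "id": "c87a4al",
    "link_id": "t3_17njif",
    "body": "That was fun!",
})
document = to_document(sample_line)
print(document["author"], document["url"])
```

Keeping the url construction in its own function makes it easy to swap out once a more reliable way of building the links is found.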

Another example: a search for "love story twilight" restricted to comments with more than 1000 upvotes (the links are currently not fully reliable):

Opening search index at F:/reddit_data/index-all. This may take a moment. 
Going to search over 1532362437 documents.
Found: 20 matching documents. Going to display top 10:
DocScore: 4.435478 author: dathom, ups:1103, url: http://www.reddit.com/r/AskReddit/comments/psoue/c3s132v
  Still a better love story than Twilight.
DocScore: 4.435478 author: Xenoo, ups:1358, url: http://www.reddit.com/r/funny/comments/qqhcm/c3zn0xo
  Still a better love story than twilight.
DocScore: 4.435478 author: unglad, ups:1986, url: http://www.reddit.com/r/nottheonion/comments/2ewday/ck3knl6
  OK maybe twilight was a better love story than this
(...)
Search took 4392 ms
