JonBlog
Thoughts on website ideas, PHP and other tech topics, plus going car-free
Stack Overflow bad post detector
Categories: Ideas, Outline

Some years ago I started work on a JavaScript implementation of a bad writing filter for Stack Overflow. I have some code for that sitting around, and one day I may even publish it!

I find myself wanting to play around with the Stack Exchange API, partly as an excuse to learn Laravel, but sadly my first use-case turned out not to be supported by the API. So I thought I might build a bad posts detector, using the search call. The idea is that community editors can see at a glance which posts are most in need of repair (and optionally down/close votes).

I’ll need a set of tables to store the “bad phrase” queries, each of which will be run in turn, with a set of question IDs being returned in each case. At the end of a run, questions can have a score totalled, with higher scores indicating higher levels of bad writing. The API permits 10,000 calls per day, which should be enough for a few full refreshes.

Initial ERD:

Table Column Type Nulls? Comments
rule id INTEGER (PK) No
title VARCHAR No Short identifier
description VARCHAR No Describes the purpose of this query
query VARCHAR No
post_type ENUM(question, answer) Null means no type restriction
score_title INTEGER No
score_body INTEGER No
set_id INTEGER Useful where we should take the highest score of a set only
max_results INTEGER No Limit of how many posts to grab
is_enabled BOOLEAN No Defaults to true, disabled if false
set id INTEGER (PK) No
description VARCHAR No Describes the purpose of this set
run id INTEGER (PK) No
created_at DATETIME No
run_stats run_id INTEGER (FK) No  Composite PK
query_id INTEGER (FK) No  Composite PK
count_items INTEGER No
completed_pages INTEGER No Starts at zero, equals page count when completed
count_pages INTEGER No
run_item id INTEGER (PK) No
run_id INTEGER (FK) No
query_id INTEGER (FK) No
question_id INTEGER No
answer_id INTEGER Not null if the found item is an answer
score_title INTEGER No
score_body INTEGER No

The last two tables probably need a bit of expansion:

  • run_stats: meta data per query per run
  • run_item: a found item for a query and run

Some examples of independent searches:

  • plz
  • coz
  • urgent

Some examples of search sets (which are equivalent but may be scored differently if required):

  • please help me out > please help me > please help = help me please > help please

There are many more search terms sketched out in the JavaScript original (see earlier link).

At present the score of a question or answer would be determined by a SUM() query with a GROUP BY, though if this gets excessively slow, query results can be snapshotted to a new table.

The UI of an app would show the latest results, in reverse score order, and show how that score is achieved. It will also list all live searches, and for an admin, it will allow searches to be edited. Easy!

 

Leave a Reply