Stack Overflow bad post detector

04.Sep.2015

Categories: Ideas, Outline

Some years ago I started work on a JavaScript implementation of a bad writing filter for Stack Overflow. I have some code for that sitting around, and one day I may even publish it!

I find myself wanting to play around with the Stack Exchange API, partly as an excuse to learn Laravel, but sadly my first use-case turned out not to be supported by the API. So I thought I might build a bad posts detector, using the search call. The idea is that community editors can see at a glance which posts are most in need of repair (and optionally down/close votes).

I’ll need a set of tables to store the “bad phrase” queries, each of which will be run in turn, with a set of question IDs being returned in each case. At the end of a run, questions can have a score totalled, with higher scores indicating higher levels of bad writing. The API permits 10,000 calls per day, which should be enough for a few full refreshes.

Initial ERD:

Table	Column	Type	Nulls?	Comments
rule	id	INTEGER (PK)	No
	title	VARCHAR	No	Short identifier
	description	VARCHAR	No	Describes the purpose of this query
	query	VARCHAR	No
	post_type	ENUM(question, answer)		Null means no type restriction
	score_title	INTEGER	No
	score_body	INTEGER	No
	set_id	INTEGER		Useful where we should take the highest score of a set only
	max_results	INTEGER	No	Limit of how many posts to grab
	is_enabled	BOOLEAN	No	Defaults to true, disabled if false
set	id	INTEGER (PK)	No
	description	VARCHAR	No	Describes the purpose of this set
run	id	INTEGER (PK)	No
	created_at	DATETIME	No
run_stats	run_id	INTEGER (FK)	No	Composite PK
	query_id	INTEGER (FK)	No	Composite PK
	count_items	INTEGER	No
	completed_pages	INTEGER	No	Starts at zero, equals page count when completed
	count_pages	INTEGER	No
run_item	id	INTEGER (PK)	No
	run_id	INTEGER (FK)	No
	query_id	INTEGER (FK)	No
	question_id	INTEGER	No
	answer_id	INTEGER		Not null if the found item is an answer
	score_title	INTEGER	No
	score_body	INTEGER	No

The last two tables probably need a bit of expansion:

run_stats: meta data per query per run
run_item: a found item for a query and run

Some examples of independent searches:

plz
coz
urgent

Some examples of search sets (which are equivalent but may be scored differently if required):

please help me out > please help me > please help = help me please > help please

There are many more search terms sketched out in the JavaScript original (see earlier link).

At present the score of a question or answer would be determined by a SUM() query with a GROUP BY, though if this gets excessively slow, query results can be snapshotted to a new table.

The UI of an app would show the latest results, in reverse score order, and show how that score is achieved. It will also list all live searches, and for an admin, it will allow searches to be edited. Easy!

2 Comments to “Stack Overflow bad post detector”

Toby Mills says:

5 September 2017 at 10:36 am

Hi Jon,
You just cleaned up one of my Stack Overflow posts for which I’m very grateful and wanted to say thank you but SO doesn’t have a messaging system that I could see (it’s been a while). Anyway, I tracked down your blog through your profile and saw this entry. Couldn’t help thinking this subject may have helped alert you to my lazy writing.

Anyway, thank you again and will try to be cleaner in my posts going forward.

T

Reply
- Jon says:
  
  19 September 2017 at 12:30 pm
  
  No worries, Toby, I did start writing some code to detect potential post improvements, but some folks beat me to it:
  
  https://socvr.org/faq#what-is-smoke-detector-and-how-can-i-participate
  
  My current approach is rather more lo-fi – I just use a variety of bookmarks to find common phrases that could do with editing. Not as impressive, but very reliable!
  
  Reply

2 Comments to “Stack Overflow bad post detector”

Leave a Reply

Pages

Categories

Archives

Link stuff