Some years ago I started work on a JavaScript implementation of a bad writing filter for Stack Overflow. I have some code for that sitting around, and one day I may even publish it!
I find myself wanting to play around with the Stack Exchange API, partly as an excuse to learn Laravel, but sadly my first use-case turned out not to be supported by the API. So I thought I might build a bad posts detector, using the search call. The idea is that community editors can see at a glance which posts are most in need of repair (and optionally down/close votes).
I’ll need a set of tables to store the “bad phrase” queries, each of which will be run in turn, with a set of question IDs being returned in each case. At the end of a run, questions can have a score totalled, with higher scores indicating higher levels of bad writing. The API permits 10,000 calls per day, which should be enough for a few full refreshes.
Initial ERD:
Table | Column | Type | Nulls? | Comments |
rule | id | INTEGER (PK) | No | |
title | VARCHAR | No | Short identifier | |
description | VARCHAR | No | Describes the purpose of this query | |
query | VARCHAR | No | ||
post_type | ENUM(question, answer) | Null means no type restriction | ||
score_title | INTEGER | No | ||
score_body | INTEGER | No | ||
set_id | INTEGER | Useful where we should take the highest score of a set only | ||
max_results | INTEGER | No | Limit of how many posts to grab | |
is_enabled | BOOLEAN | No | Defaults to true, disabled if false | |
set | id | INTEGER (PK) | No | |
description | VARCHAR | No | Describes the purpose of this set | |
run | id | INTEGER (PK) | No | |
created_at | DATETIME | No | ||
run_stats | run_id | INTEGER (FK) | No | Composite PK |
query_id | INTEGER (FK) | No | Composite PK | |
count_items | INTEGER | No | ||
completed_pages | INTEGER | No | Starts at zero, equals page count when completed | |
count_pages | INTEGER | No | ||
run_item | id | INTEGER (PK) | No | |
run_id | INTEGER (FK) | No | ||
query_id | INTEGER (FK) | No | ||
question_id | INTEGER | No | ||
answer_id | INTEGER | Not null if the found item is an answer | ||
score_title | INTEGER | No | ||
score_body | INTEGER | No |
The last two tables probably need a bit of expansion:
- run_stats: meta data per query per run
- run_item: a found item for a query and run
Some examples of independent searches:
- plz
- coz
- urgent
Some examples of search sets (which are equivalent but may be scored differently if required):
- please help me out > please help me > please help = help me please > help please
There are many more search terms sketched out in the JavaScript original (see earlier link).
At present the score of a question or answer would be determined by a SUM() query with a GROUP BY, though if this gets excessively slow, query results can be snapshotted to a new table.
The UI of an app would show the latest results, in reverse score order, and show how that score is achieved. It will also list all live searches, and for an admin, it will allow searches to be edited. Easy!
Hi Jon,
You just cleaned up one of my Stack Overflow posts for which I’m very grateful and wanted to say thank you but SO doesn’t have a messaging system that I could see (it’s been a while). Anyway, I tracked down your blog through your profile and saw this entry. Couldn’t help thinking this subject may have helped alert you to my lazy writing.
Anyway, thank you again and will try to be cleaner in my posts going forward.
T
No worries, Toby, I did start writing some code to detect potential post improvements, but some folks beat me to it:
https://socvr.org/faq#what-is-smoke-detector-and-how-can-i-participate
My current approach is rather more lo-fi – I just use a variety of bookmarks to find common phrases that could do with editing. Not as impressive, but very reliable!