Forum bad writing filter
Categories: Code, Ideas

Rightly or wrongly, one of the primary reasons new users get a difficult reception on Stack Overflow is sub-standard writing. This is sometimes due to the writer having English as a second language, and sometimes it is down to laziness. Unfortunately, manual requests for improvement don’t always go well, which prompts me to think that software could do a better job. So, I propose a simple website that mirrors the subject/body input boxes of a discussion forum, and runs a sequence of regular-expression JavaScript tests to advise the user on how the question might be improved. True, it’ll not be particularly smart, but I don’t think that matters: many of the rules required seem to be pretty trivial.

On the front page, I’d like to see a URL/question-id input box, so questions can be loaded directly from the site. Also, perhaps the rules can be split according to site, so General Question tests can be run as well as Stack Overflow tests (this permits special Stack Overflow-specific rules, such as omitting salutations). Lastly, the whole thing should be unit tested, either in JavaScript or PHP.
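
To make the per-site split concrete, here is a rough sketch of how the rule sets might be layered. Everything in it is an assumption for illustration: the names `generalRules`, `stackOverflowRules`, `ruleSets` and `adviseFor`, the rule object shape, and the two sample patterns are not taken from any existing code.

```javascript
// Hypothetical rule sets; names, shape and patterns are illustrative assumptions.
const generalRules = [
  { id: "thx", pattern: /\bthx\b/i, advice: "Write \"thanks\" in full." },
];

// Rules that only make sense on Stack Overflow, such as omitting salutations.
const stackOverflowRules = [
  {
    id: "salutation",
    pattern: /^\s*(?:hi|hello|hey)\b/i,
    advice: "Stack Overflow questions usually omit salutations.",
  },
];

// Each site gets the general rules plus its own extras.
const ruleSets = {
  general: generalRules,
  stackoverflow: [...generalRules, ...stackOverflowRules],
};

// Returns the IDs of the rules triggered by the text for the chosen site.
// Unknown sites fall back to the general rule set.
function adviseFor(site, text) {
  const rules = ruleSets[site] || ruleSets.general;
  return rules.filter((rule) => rule.pattern.test(text)).map((rule) => rule.id);
}
```

With these two sample rules, `adviseFor("stackoverflow", "Hi all, thx for reading")` reports both `thx` and `salutation`, while the same text under `"general"` reports only `thx`. Because the function is pure, it should be easy to unit test.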

I’ve made a list of a few rules I’d add (italics items are already coded):

  • kinda, wanna, gonna, sorta, thx, coz, bcoz, cuz, shud, teh, tht, thn, idk, (high score)
  • Misspellings where pairing makes it obvious: “there own”, “wat is”, “can u “, “can i “, “cos I”, “cos it”, “ur help”, “ur ideas”, “pls help”, “tryin “
  • Lazy abbreviations: “db” as an abbreviation of “database”, “for ex.” instead of “for example”,
  • Punctuation: floating punctuation e.g. ” . “, ” , ” and ” ; “
  • Punctuation: incorrectly spaced ellipsis, e.g. “. . . “
  • Punctuation: full stop, exclamation and question mark without at least one trailing space
  • Sigs: Use of “[[With ](Kind|Best) ]Regards,”
  • Colons and dashes in titles (likely a sign of tagging)
  • “Help please” in titles worse than same in body
  • “Help me understand X” better than “X help me” (perhaps: titles ending in “help me” or “help me”)
  • “Help me out” worse than “help me” due to redundant “out”
  • Titles containing “please” (high score)
  • Titles containing “please suggest/help” (very high score)
  • Titles containing “any body”, “anybody”
  • Titles containing “question” that do not contain “interview”
  • Titles containing “thanks” or “thank you” except for “thanks/(thank you)/(thank-you) mail/email/message/page”
  • Titles containing “Solved” or “(Solved)” or “[Solved]”
  • Sentences that start with a redundant “So,”, “So I’m”, “So im”, “Ok so”, “Okay so”, “So basically”,
  • Apostrophes highlighted with pairing: “im trying”, “your trying” (can we generalise to (im|your)\w<alpha>ing ?)
  • Apostrophes highlighted with pairing: “im new”, “im a”
  • Apostrophes: “doesnt”, “shouldnt”, “couldnt”, “wouldnt”, “wont”, “dont”, “theyre”, “weve”, “youve”, “isnt”, “whats”,
  • Case errors: MySQL, Zend, Android, PostgreSQL, Apache, PHP, HTML, CSS, WordPress, JavaScript, jsFiddle, jQuery, PDO, Visual Studio, iPhone, Matlab, Windows, REST, RESTful, MVC,
  • Case errors: Latin, English, German, etc.
  • Case errors: “Hello World”
  • Redundant spaces: “js Fiddle”, “Java Script”, “HTML 5”, “Symfony 2”, “any body”, “some body”, “out put”, “in put”,
  • Requires spaces: “SQLFiddle”,
  • “I need” appearing in a title without a question mark (so “Why do I need an IoC container as opposed to straightforward DI code?” is great, but “I need regular expression” is probably terrible)
  • Requests for free work: “tell me in detail”, “give [an] example”,
  • Bad grammar: “say me”, “recommend me”, “explain me”, “some helps”, “softwares”, “advices”,
  • Redundant apostrophes: “doc’s”, “CD’s”
  • Common mis-spellings: alot, thankyou, existant, retreive, beleive, dinamic, dinamically, dynamicly, CodeIgnitor, comming, posible, basicly,
  • Missing hyphen: “non ” (e.g. non existent)
  • “Doesn’t work” worse in titles than bodies
  • “Dont work” worse than “Don’t work” worse than “Doesn’t work” (actually, the missing apostrophe would be counted separately)
  • Redundant phrases: “As per [the] title”, “title says it all”
  • Redundant phrases: “waiting for your reply/response”,
  • Redundant phrases: “basically” (low score)
  • Salutations
  • Opening/closing PHP tags found outside backticks
  • Gender specificity: “Hi guys”, “Hi dudes”
  • Protesteth too much: “I searched a lot”
  • How about “dude”, “dudes” without the hi? (But not Data Dude, which is a software product)
  • Too subjective: “What’s the best”
  • Numbers less than 10 should not be in digits, unless they’re in quotes (hard)
  • 1st, 2nd, 3rd should be first, second and third etc.
  • Opening bracket without preceding space/CR or closing bracket without following space/CR
  • Titles should not be entirely in lower/upper case (high scoring)
  • Titles should be in sentence case, not title case (low scoring)
  • Titles should not be too long
  • Sentences and paragraphs should not be too long
  • Bodies should not be too short
  • Sentences without ending punctuation
  • Three or more contiguous exclamation/question marks
  • Two contiguous exclamation/question marks not in backticks inside body (maybe add non-scoring advisory for titles, since backticks don’t work there)
  • Huge warning for “give me (the|teh) code(s|z)”
  • Huge warning for “Urgent help”, “Its|It’s urgent”
  • Words all in caps, should use italics/bold
  • Sentences/phrases all in caps
  • Perhaps a rule for newb/noob/newbie – not sure what yet
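
A few of the rules above translate almost directly into regular expressions. The sketch below is one possible rendering, not a finished rule set: the names `sampleRules` and `findOffences` are made up for illustration, and I have assumed the trailing “s|z” in the “give me the code” rule is optional.

```javascript
// Illustrative patterns for a handful of the listed rules (names are hypothetical).
const sampleRules = {
  slang: /\b(?:kinda|wanna|gonna|sorta|thx|coz|bcoz|cuz|shud|teh|tht|thn|idk)\b/gi,
  // Punctuation floating between two spaces, e.g. " . ", " , " and " ; "
  floatingPunctuation: / [.,;] /g,
  missingApostrophe:
    /\b(?:doesnt|shouldnt|couldnt|wouldnt|wont|dont|theyre|weve|youve|isnt|whats)\b/gi,
  // Three or more contiguous exclamation/question marks
  repeatedMarks: /[!?]{3,}/g,
  // Assumption: the plural suffix ("s" or "z") is optional
  giveMeTheCode: /\bgive me (?:the|teh) code[sz]?\b/gi,
  // Accepts "Its", "It's" and "It’s"
  urgent: /\burgent help\b|\bit['’]?s urgent\b/gi,
};

// Runs every sample rule over the text and lists each match with its position.
function findOffences(text) {
  const offences = [];
  for (const [ruleId, pattern] of Object.entries(sampleRules)) {
    for (const match of text.matchAll(pattern)) {
      offences.push({ ruleId, text: match[0], index: match.index });
    }
  }
  return offences;
}
```

For example, `findOffences("I dont know , pls give me teh codez!!!")` returns five offences: `slang` (for “teh”), `floatingPunctuation`, `missingApostrophe`, `repeatedMarks` and `giveMeTheCode`. Note that “teh” is counted twice here, once as slang and once inside the code request, which is the sort of overlap a “skip other rules” mechanism would need to handle.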

It’d be great if we could automatically run an RSS search for each rule every day, so we can see the popularity of each offence!

Some thoughts on the algorithm:

  • Each rule records where it is valid: title, body, either
  • Each rule has a severity
  • Each rule has an explanatory text that is shown when it is triggered
  • All rules applied make up a score
  • Rule scores should specify if they are counted once or are counted for each offence
  • Non-certain rules are omitted from the score (e.g. “CD’s” is correct for possession but not plural)
  • Each rule has a unique string ID so it can be referred to
  • A match can skip one or more other rules
  • Space-indented and <pre> code in questions is skipped
  • Server-side rule browser
  • “Suggest a rule” feature
  • Tick rules manually that don’t apply, to improve the score
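
The points above could be sketched roughly as follows. This is a minimal outline under several assumptions: all the field names (`where`, `severity`, `perOffence`, `certain`, `skips`), the function names `stripCode` and `assess`, the severity values and the sample rules are hypothetical, and “space-indented” is assumed to mean four or more leading spaces.

```javascript
// Hypothetical rule definitions; order matters, since a rule can only skip later rules.
const rules = [
  { id: "title-please-help", where: "title", severity: 8, perOffence: false,
    certain: true, skips: ["title-please"],
    pattern: /\bplease (?:suggest|help)\b/gi,
    message: "Avoid \"please help\" in titles; describe the problem instead." },
  { id: "title-please", where: "title", severity: 5, perOffence: false,
    certain: true, skips: [],
    pattern: /\bplease\b/gi,
    message: "Titles read better without \"please\"." },
  { id: "missing-apostrophe", where: "either", severity: 1, perOffence: true,
    certain: true, skips: [],
    pattern: /\b(?:dont|doesnt|isnt)\b/gi,
    message: "This word needs an apostrophe." },
  // Non-certain: "CD's" is correct for possession, so it advises but does not score.
  { id: "plural-apostrophe", where: "either", severity: 2, perOffence: true,
    certain: false, skips: [],
    pattern: /\b(?:doc|CD)['’]s\b/g,
    message: "Plurals do not take an apostrophe." },
];

// Removes <pre> blocks and lines indented by four or more spaces (assumed convention).
function stripCode(body) {
  return body
    .replace(/<pre>[\s\S]*?<\/pre>/gi, "")
    .split("\n")
    .filter((line) => !/^ {4}/.test(line))
    .join("\n");
}

// Scores a question; `dismissed` holds rule IDs the user has ticked as not applicable.
function assess(title, body, dismissed = []) {
  const fields = { title, body: stripCode(body) };
  const skipped = new Set(dismissed);
  const triggered = [];
  let score = 0;

  for (const rule of rules) {
    if (skipped.has(rule.id)) continue;
    const targets = rule.where === "either" ? ["title", "body"] : [rule.where];
    const count = targets.reduce(
      (sum, field) => sum + [...fields[field].matchAll(rule.pattern)].length, 0);
    if (count === 0) continue;

    rule.skips.forEach((id) => skipped.add(id));
    triggered.push({ id: rule.id, message: rule.message, count, scored: rule.certain });
    if (rule.certain) score += rule.severity * (rule.perOffence ? count : 1);
  }
  return { score, triggered };
}
```

Listing the more specific rule first lets it suppress the general one, so “Please help” in a title is scored once rather than twice. Whether skipping should depend on rule order, or be resolved after all rules have run, is a design choice I have not settled.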
