Mar 15, 2016

Blocking Chinese and Korean spam for user-submitted content

I help maintain Power Poetry, a Drupal 7 site that serves as a platform for publishing poetry. It's a great site with awesome writing, and it has grown steadily in both the amount of content and the number of page views.

Unfortunately, that popularity has also attracted spam submissions, particularly in Chinese and Korean.

Our initial take was that, since the site is U.S.-centric, we should limit submissions to English and Spanish. So I did some research into language detection, found a service with an API that looked promising, and did an initial implementation. However, for technical reasons that I never did uncover, that API did not work on our hosting setup (even though it worked on my local system!).

We then changed our focus from language detection to the fact that our current problem was almost entirely due to submissions made in certain scripts (specific human writing systems).

I found that regular expressions in PHP support Unicode character properties, including script names. Two of those script names are Han, which covers Chinese characters (both traditional and simplified, as well as the kanji used in Japanese), and Hangul, which covers the Korean alphabet.
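
As a quick illustration of how these script properties behave, here is a minimal sketch that runs a few sample strings through preg_match(). The sample strings and the $results array are made up for this example and are not part of the site's code.

```php
<?php
// Hypothetical sample strings, used only to illustrate script matching.
$samples = array(
  'english'  => 'Roses are red',
  'spanish'  => 'Las rosas son rojas',
  'chinese'  => '玫瑰是红色的',
  'korean'   => '장미는 빨갛다',
  // Kanji belong to the Han script, so this string matches \p{Han} too.
  'japanese' => '日本語',
);

$results = array();
foreach ($samples as $label => $text) {
  // The "u" modifier tells PCRE to treat pattern and subject as UTF-8.
  $results[$label] = array(
    'han'    => preg_match('/\p{Han}/u', $text) === 1,
    'hangul' => preg_match('/\p{Hangul}/u', $text) === 1,
  );
}

foreach ($results as $label => $found) {
  printf("%-8s han=%s hangul=%s\n",
    $label,
    $found['han'] ? 'yes' : 'no',
    $found['hangul'] ? 'yes' : 'no'
  );
}
```

Only the Chinese and Japanese samples should report a Han match, and only the Korean sample should report a Hangul match.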

We decided to take a zero-tolerance approach. A single Chinese or Korean character causes the user submission not to validate. So far, implementing this has drastically reduced the amount of inappropriate content and comments.

Here's the code, which is called by our Drupal form validation functions.

/**
 * Check whether a string contains any characters from a
 * banned script, such as Chinese or Korean.
 *
 * @param string $text
 *   The piece of text to be checked.
 *
 * @return bool
 *   TRUE if any Chinese or Korean characters are detected.
 *   Otherwise, FALSE.
 */
function _contains_banned_scripts($text) {

  $unicode_modifier = 'u';

  // Detect whether there are any Chinese characters.
  $chinese_regex = '\p{Han}+';
  $preg_regex = '/' . $chinese_regex . '/' . $unicode_modifier;
  $chinese_found = preg_match($preg_regex, $text);

  if ($chinese_found == 1) {
    return TRUE;
  }

  // Detect whether there are any Korean characters.
  $korean_regex = '\p{Hangul}+';
  $preg_regex = '/' . $korean_regex . '/' . $unicode_modifier;
  $korean_found = preg_match($preg_regex, $text);

  if ($korean_found == 1) {
    return TRUE;
  }

  return FALSE;
}
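
One thing to watch for: preg_match() returns 1 on a match, 0 on no match, and FALSE on failure, and with the u modifier a subject that is not valid UTF-8 counts as a failure. The function above treats that failure the same as "no match". The following is a sketch of a stricter variant, not the code we run on the site. The name example_contains_banned_scripts() is hypothetical, and rejecting malformed input is an assumption about what a zero-tolerance policy would want.

```php
<?php
/**
 * Hypothetical stricter variant of the check (illustration only).
 *
 * Tests both scripts with one character class and treats a
 * preg_match() failure, such as invalid UTF-8 input, as a rejection.
 *
 * @param string $text
 *   The piece of text to be checked.
 *
 * @return bool
 *   TRUE if the text should be rejected. Otherwise, FALSE.
 */
function example_contains_banned_scripts($text) {
  $result = preg_match('/[\p{Han}\p{Hangul}]/u', $text);

  // FALSE means the match could not be attempted at all.
  // Assumption: input that cannot be checked is rejected.
  if ($result === FALSE) {
    return TRUE;
  }

  return $result === 1;
}

// A single banned character anywhere in the text is enough.
var_dump(example_contains_banned_scripts('Roses are red'));  // bool(false)
var_dump(example_contains_banned_scripts('Hello 世界'));      // bool(true)
var_dump(example_contains_banned_scripts('안녕하세요'));       // bool(true)
// Bytes that are not valid UTF-8 are rejected as well.
var_dump(example_contains_banned_scripts("\xB1\x31"));       // bool(true)
```

Whether malformed input should be rejected or let through is a policy choice. The point is to make that choice explicitly instead of letting a failed match pass as clean text.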


Sources:

Forum spam
https://en.wikipedia.org/wiki/Forum_spam

Unicode Regular Expressions
http://www.regular-expressions.info/unicode.html

Unicode Character Properties
https://secure.php.net/manual/en/regexp.reference.unicode.php