u/Bigolbagocats

r/PHP

How do you handle form spam in PHP?

Disclosure: I work at Cloudmersive as a technical writer, and the code below uses our SDK.

I’m curious how folks in this community are handling form spam in practice these days. Specifically, are the standard solutions (e.g. reCAPTCHA, honeypots, Akismet) actually covering you, or are you still seeing a ton of spam getting through?

While documenting this stuff I’ve noticed that most of these approaches check *how* a form was submitted rather than *what* was submitted. For example, if a human types “hi I can offer you great SEO services for $99 a month” into a sales contact form, it goes straight through reCAPTCHA because a human submitted it. The API I’ve been documenting reads the field values and classifies them against configurable categories. For that message, the request would look like:

{
  "InputFormFields": [
    {
      "FieldTitle": "Message",
      "FieldValue": "Hi, I can offer you great SEO services for only $99/month"
    }
  ],
  "AllowUnsolicitedSales": false,
  "AllowPromotionalContent": false,
  "AllowPhishing": false
}

And the response would come back like:

{
  "CleanResult": false,
  "SpamRiskLevel": 0.92,
  "ContainsSpam": true,
  "ContainsUnsolicitedSales": true,
  "ContainsPromotionalContent": true,
  "ContainsPhishingAttempt": false,
  "AnalysisRationale": "Message contains unsolicited sales pitch and promotional pricing"
}

And the PHP integration would look something like this:

composer require cloudmersive/cloudmersive_spam_api_client

<?php
require_once(__DIR__ . '/vendor/autoload.php');

// Configure API key authorization
$config = Swagger\Client\Configuration::getDefaultConfiguration()
    ->setApiKey('Apikey', 'YOUR_API_KEY');

$apiInstance = new Swagger\Client\Api\SpamDetectionApi(
    new GuzzleHttp\Client(),
    $config
);

// Build the request body with your form fields and spam policy settings
$body = new \Swagger\Client\Model\SpamDetectionAdvancedFormSubmissionRequest();
//e.g. $body->setInputFormFields([['field_title' => 'Message', 'field_value' => $_POST['message'] ?? '']]); 
//e.g. $body->setAllowUnsolicitedSales(false); 
//e.g. $body->setAllowPhishing(false);

try {
    $result = $apiInstance->spamDetectFormSubmissionAdvancedPost($body);

    // CleanResult is false if spam was detected
    if (!$result->getCleanResult()) {
        // Handle flagged submission — log it, reject it, queue for review, etc.
        error_log('Spam detected: ' . $result->getAnalysisRationale());
    }

    print_r($result);
} catch (Exception $e) {
    echo 'Exception when calling SpamDetectionApi->spamDetectFormSubmissionAdvancedPost: ' . $e->getMessage() . PHP_EOL;
}
?>

The commented-out body example reads from $_POST directly, since I think that’s probably the most realistic use case. Basically, you drop this in wherever you’re currently processing submissions.
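To make the field wiring a bit more concrete, here’s a minimal sketch that builds a plain array shaped like the JSON request above from submitted form data. buildSpamCheckPayload is a hypothetical helper I made up for illustration, not part of the SDK, and the field names ('name', 'email', 'message') are assumptions about your form. It deliberately avoids the SDK model classes, so treat it as a shape illustration rather than the real integration.

```php
<?php
/**
 * Hypothetical helper (not part of the SDK): turns submitted form fields
 * into an array shaped like the JSON request shown earlier in the post.
 */
function buildSpamCheckPayload(array $submitted, array $fieldNames): array
{
    $inputFields = [];
    foreach ($fieldNames as $name) {
        // Only read the fields we expect; missing or non-string values become ''.
        $value = $submitted[$name] ?? '';
        $inputFields[] = [
            'FieldTitle' => ucfirst($name),
            'FieldValue' => is_string($value) ? trim($value) : '',
        ];
    }

    return [
        'InputFormFields'         => $inputFields,
        'AllowUnsolicitedSales'   => false,
        'AllowPromotionalContent' => false,
        'AllowPhishing'           => false,
    ];
}

// In a real handler you would pass $_POST instead of this sample array.
$sample  = ['name' => 'Jane', 'message' => '  Hi, I can offer you great SEO services  '];
$payload = buildSpamCheckPayload($sample, ['name', 'email', 'message']);

echo json_encode($payload, JSON_PRETTY_PRINT), PHP_EOL;
```

Whitelisting the field names keeps unexpected keys (or nested arrays) in $_POST from ending up in the request.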

And on the flip side, if you’re doing content-based filtering of any kind, like this API does, how do you handle false positives? For instance, I’ve seen a bunch of legitimate sales inquiries come through a contact form that look a lot like spam.
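One possible way to soften false positives is to only hard-reject high-confidence hits and hold everything else that gets flagged for a human to look at. Here’s a rough sketch of that idea against the response shape shown above. triageSubmission is a hypothetical helper, and the 0.9 cutoff is a placeholder assumption, not a documented or recommended value.

```php
<?php
/**
 * Hypothetical triage helper: maps a decoded response array to an action.
 * The default 0.9 cutoff is an assumption for illustration only.
 */
function triageSubmission(array $response, float $rejectAt = 0.9): string
{
    // Nothing flagged: let the submission through.
    if (!empty($response['CleanResult'])) {
        return 'accept';
    }

    // Flagged with high confidence: reject outright.
    $risk = (float) ($response['SpamRiskLevel'] ?? 0.0);
    if ($risk >= $rejectAt) {
        return 'reject';
    }

    // Flagged but below the cutoff: queue for manual review instead of dropping it.
    return 'review';
}

$flagged    = json_decode('{"CleanResult": false, "SpamRiskLevel": 0.92}', true);
$borderline = ['CleanResult' => false, 'SpamRiskLevel' => 0.6];

echo triageSubmission($flagged), PHP_EOL;    // reject
echo triageSubmission($borderline), PHP_EOL; // review
```

A legitimate sales inquiry that scores in the middle would land in the review queue rather than disappearing, at the cost of someone having to check that queue.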

u/Bigolbagocats — 4 hours ago
r/csharp

Curious what this community thinks about non-generative AI in document pipelines

Disclosure: I work at Cloudmersive as a technical writer, and the code below uses our SDK.

Feels like every AI conversation right now is about frantic code generation or impending doom, but I’m curious about a different corner of it: how people are handling document boundary detection in batch processing pipelines, and whether AI is actually earning its place there or whether deterministic approaches are still generally considered more sensible.

While documenting this stuff I keep running into a scenario where a merged/combined file enters a pipeline (containing some mix of forms, ID cards, insurance docs, etc.) and something needs to figure out where one document ends and the next begins before any real processing can happen.

The API I’ve been documenting detects boundaries based on things like visual layout, headers, names, and document type, and returns the sub-documents as separate chunks, each with a description and its file bytes. Example output:

{
  "Successful": true,
  "SubDocuments": [
    {
      "StartPage": 0,
      "EndPage": 2,
      "DocumentDescription": "Driver's License - Jane Doe",
      "FileBytes": "..."
    },
    {
      "StartPage": 3,
      "EndPage": 6,
      "DocumentDescription": "Proof of Insurance - Policy #449201",
      "FileBytes": "..."
    }
  ]
}

And the C# integration to get there would look something like this:

dotnet add package Cloudmersive.APIClient.NETCore.DocumentAI --version 1.0.0

using System;
using System.IO;
using Cloudmersive.APIClient.NETCore.DocumentAI.Api;
using Cloudmersive.APIClient.NETCore.DocumentAI.Client;
using Cloudmersive.APIClient.NETCore.DocumentAI.Model;

namespace Example
{
    public class ExtractSplitExample
    {
        public static void Main()
        {
            Configuration.Default.AddApiKey("Apikey", "YOUR_API_KEY");

            var apiInstance = new ExtractApi();
            // Dispose the stream when Main exits, even if the API call throws
            using var inputFile = new FileStream("C:\\temp\\batch_upload.pdf", FileMode.Open, FileAccess.Read);

            try
            {
                SplitDocumentResponse result = apiInstance.ExtractSplit("Advanced", inputFile);

                foreach (var doc in result.SubDocuments)
                {
                    Console.WriteLine($"{doc.DocumentDescription}: pp. {doc.StartPage}–{doc.EndPage}");
                    File.WriteAllBytes($"output_{doc.StartPage}.pdf", Convert.FromBase64String(doc.FileBytes));
                }
            }
            catch (Exception e)
            {
                Console.WriteLine($"Error: {e.Message}");
            }
        }
    }
}

Here’s what I’m actually curious about (and I realize this may be fairly niche): for those of you processing unstructured/unpredictable batch documents, have you found a reliable way to handle things like boundary detection without reaching for AI somewhere in the pipeline? And on the flip side, if you are using AI classification somewhere in your pipeline, how are you handling the fact that it won’t always get it right?

u/Bigolbagocats — 10 days ago