[JOURNEY] AI Dataset Factory: Building the ultimate Vegan Keto AI Chef

Hey guys, taking a different approach here. I’m not doing lead gen or e-com. I’m building a proprietary dataset to train a custom LLM agent.

The Goal: Launch a paid “AI Chef” app for a highly specific niche (Vegan Keto).
The Angle: General models like GPT-4 suck at niche recipes. I’m going to scrape 50,000+ posts from specialized forums and blogs to build a fine-tuning dataset that no one else has.
The Stack: RTILA X + Hugging Face (for training) + Vercel.

The Plan:
I’m using RTILA to crawl a massive old-school vBulletin forum dedicated to vegan keto. I need to extract the post title, author, and the full text of the recipes.

Update 1: Data Cleaning Nightmare

The scraping is working, but the data is filthy. Forum posts have signatures, “Quote” blocks from previous replies, and weird HTML artifacts. If I feed this into an LLM, it’s going to output garbage.

I don’t want to save 50k dirty rows to my database and clean them later. Can I clean them inside RTILA before saving?

Yes, you can clean it in-memory!

When you call the extraction helper in your run_script, pass { saveToApi: false }. This pulls the raw array into your script’s memory instead of saving it to the database immediately.

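// Pull the raw rows into memory instead of saving them right away
const rawData = await helpers.extractData('recipes', { saveToApi: false });

// Clean the data using standard JS
const cleanedData = rawData.map(item => {
  // Guard against rows where full_text is missing
  let text = item.full_text ?? '';
  // Remove quoted replies (adjust the markers to match how quotes show up in your extracted text)
  text = text.replace(/---Quote---[\s\S]*?---End Quote---/g, '');
  return { ...item, full_text: text.trim() };
});

// Save the clean data
await helpers.saveData('recipes', cleanedData);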

This way, your database only stores cleaned rows. The same map step is also where you would strip signatures and leftover HTML.
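Since the original post also mentions signatures and HTML artifacts, here is a rough sketch of a fuller cleaning function in plain JS. cleanPost is a hypothetical helper name, not part of RTILA. The quote markers are the same assumed ones as in the snippet above, and the signature separator (a line of ten or more underscores) is a guess at how the forum renders signatures, so check a few raw rows before relying on it.

```javascript
// Hypothetical helper, not an RTILA API. All marker formats below are assumptions.
function cleanPost(rawText) {
  let text = String(rawText ?? '');

  // Drop quoted replies (assumed markers, same as the earlier snippet).
  text = text.replace(/---Quote---[\s\S]*?---End Quote---/g, '');

  // Cut the signature: everything from a line of 10+ underscores to the end (assumed separator).
  text = text.replace(/^[ \t]*_{10,}[ \t]*$[\s\S]*/m, '');

  // Turn <br> into newlines, then strip any remaining HTML tags.
  text = text.replace(/<br\s*\/?>/gi, '\n').replace(/<[^>]+>/g, '');

  // Decode a handful of common HTML entities in a single pass.
  const entities = {
    '&nbsp;': ' ', '&lt;': '<', '&gt;': '>',
    '&quot;': '"', '&#39;': "'", '&amp;': '&',
  };
  text = text.replace(/&(?:nbsp|lt|gt|quot|#39|amp);/g, match => entities[match]);

  // Collapse runs of spaces/tabs and excessive blank lines.
  text = text.replace(/[ \t]+/g, ' ').replace(/\n{3,}/g, '\n\n');

  return text.trim();
}
```

In the map step you would call it as full_text: cleanPost(item.full_text), and you may want to filter out rows whose full_text ends up empty before saving.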

Update 2: 10,000 Recipes Cleaned and Exported 🍲

That in-memory cleaning trick was exactly what I needed. I let the bot run overnight, handling pagination automatically. I now have a beautiful JSONL file with 10k perfectly formatted recipes.
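For anyone wondering what the JSONL step can look like, here is a minimal sketch. toJsonl is a hypothetical helper, the title field name is assumed (full_text comes from the snippet earlier in the thread), and the prompt/completion layout is just one common fine-tuning shape, not necessarily what I ended up using.

```javascript
// Hypothetical helper: turns cleaned rows into JSONL (one JSON object per line).
// Field names `title` and the prompt/completion layout are assumptions.
function toJsonl(records) {
  const lines = records
    // Skip rows that have no usable recipe text after cleaning.
    .filter(r => typeof r.full_text === 'string' && r.full_text.trim().length > 0)
    .map(r => JSON.stringify({
      prompt: `Write a vegan keto recipe: ${r.title ?? ''}`.trim(),
      completion: r.full_text.trim(),
    }));
  // JSON.stringify escapes newlines inside strings, so each record stays on one line.
  return lines.length > 0 ? lines.join('\n') + '\n' : '';
}
```

In Node, the returned string can be written out with fs.writeFileSync('recipes.jsonl', toJsonl(cleanedData)).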

Starting the fine-tuning job on Hugging Face today. I’ll share a link to the AI Chef once it’s deployed!