blog.andrewle.com

The One Little Trick Clickbait Websites Don't Want You to Know About

January 3, 2024

data, nlp

Clickbait headlines are my weakness. I don't truly believe any life hacks or secret ingredients will "change the way I ___ forever," but I can't help being curious about what they are. It would save a lot of time if I got the answers up front, which is a job that natural language processing (NLP) could perform. Many projects and research papers [1] use NLP to classify whether headlines are clickbait and, optionally, to spoil or TL;DR an article by revealing its subject.

You Down With NLP? Yeah You Know Me

I don't know much about NLP, but my naive explanation is that it's a way for software to understand language (text or voice) and to summarize or expound upon it by answering who, what, where, when, and why. In my brief reading, I've come across two methodologies in NLP: supervised and unsupervised. Supervised methods involve training a machine-learning model with human-labeled data. When given new material, the model attempts to find patterns similar to the existing, known data. A person then needs to verify, adjust, or reject the results. Unsupervised NLP takes plain, unlabeled input and discovers information, structure, and patterns from the text itself, sometimes helped along by grammar rules or supplied keywords. Here are some problems that can be solved using these two methods:

  1. Named Entity Recognition - identifying people, places, things, dates, etc.
  2. Sentiment Analysis - whether content is positive/negative/neutral
  3. Topic Modeling - what something is about, perhaps the content of Taylor Swift songs over time

IDK My BFF TF-IDF

The most straightforward way to spoil a clickbait article seems to be unsupervised topic modeling, since it needs no labeled data. One technique for that is term frequency–inverse document frequency (TF-IDF). The idea is that a word or phrase mentioned several times in an article is a good indicator of its topic, while words that are common everywhere, such as "a" and "the", get little or no weight. I ran each article's headline and text through an NL processor and returned the most frequently used terms:

  1. The Secret Baking Ingredient Ina Garten Always Keeps In The Pantry

Step 1 of 3 - Text Input

The Secret Baking Ingredient Ina Garten Always Keeps In The Pantry
Tips The Secret Baking Ingredient Ina Garten Always Keeps In The Pantry Joe Seer/Shutterstock By   Chris Day | Nov. 30, 2022  5:40 pm EST "Barefoot Contessa" host Ina Garten often shares unusual culinary tips with her fans. From the secret ingredient she adds to her margaritas to the mess-free way she cuts a cauliflower , the Food Network personality is full of surprising and helpful kitchen tips . There are probably many things you don't know about Garten , who is not only the author of 13 cookbooks , but also a  four-time winner of Outstanding Culinary Host for her show, "Barefoot Contessa." Another thing you may not know is the secret weapon she uses when she wants to deepen the flavor of baked goods — the one baking ingredient Garten insists on having in her pantry at all times. In her 2020  cookbook , "Modern Comfort Food," Garten shares a recipe for Black & White Cookies with a special something in the glaze — a surprising addition that you may already have in your pantry. Hint: It's the jar you may keep on the shelf for when your grandpa visits. A caffeine infusion alohadave/Shutterstock When you think of the Barefoot Contessa, you may think of complicated recipes using expensive and obscure ingredients like truffle butter or fleur de sel salt . However, the baking ingredient Ina Garten always keeps in her pantry is neither expensive nor obscure. According to Food & Wine , Garten's secret to enhancing the flavor of her baked goods, including her Outrageous Brownies , is the addition of instant coffee. Some of Garten's recipes call for instant coffee while others use espresso powder, and she clarifies the difference between the two on her website , explaining that she decides which one to use based on how much coffee flavor she wants to add to a recipe. 
Her Black & White Cookies recipe calls for half a teaspoon of instant coffee, but she uses a full tablespoon of Medaglia d'Oro, her preferred espresso powder, when she makes her decadent Peanut Butter Globs (per Food Network ). The multi-purpose Joe warat42/Shutterstock Ina Garten isn't the only one to find unexpected uses for instant coffee. Bon Appétit has a long list of ways for java aficionados to jazz up ordinary things using instant coffee, including by adding it to the likes of peanut butter and jelly sandwiches, ice cream, oatmeal, or salad dressing. The caffeine jolt we all rely upon can also work to perk up your houseplants and can keep cats out of your garden (per Waka Coffee ). You may be surprised to discover that the sugary foamy brew that became a TikTok sensation is made with instant coffee. That's right. Dalgona Coffee uses plain, old instant coffee and its newfound popularity on social media made sales of the coffee crystals surge in 2020 (via Eagle-Tribune ). Instant coffee can be your secret ingredient to better baking, too. So, do like Garten and always have some on hand.

Step 2 of 3 - Tokenization/Normalization

[
  'tips',               'secret',              'baking',
  'ingredient',         'ina',                 'garten',
  'always',             'keeps',               'pantry',
  'joe',                'seer',                'shutterstock',
  'chris',              'pm',                  'est',
  'barefoot',           'contessa',            'host',
  'culinary tips',      'fans',                'secret ingredient',
  'margaritas',         'mess',                'free way',
  'cauliflower',        'food',                'network',
  'personality',        'helpful kitchen',     'many things',
  'author',             'cookbooks',           'time',
  'winner',             'outstanding',         'culinary',
  'show',               'thing',               'know',
  'secret weapon',      'flavor',              'baked goods',
  'times',              'cookbook',            'modern',
  'comfort',            'shares',              'recipe',
  'black',              'white',               'cookies',
  'glaze',              'surprising addition', 'hint',
  'jar',                'shelf',               'grandpa',
  'visits',             'caffeine',            'infusion',
  'alohadave',          'recipes',             'ingredients',
  'truffle',            'butter',              'fleur',
  'sel',                'salt',                'baking ingredient',
  'wine',               'outrageous',          'brownies',
  'addition',           'instant coffee',      'call',
  'others',             'use',                 'espresso',
  'powder',             'difference',          'website',
  'much coffee',        'half',                'teaspoon',
  'full tablespoon',    'medaglia',            "d'oro",
  'preferred espresso', 'decadent peanut',     'globs',
  'multi-purpose',      'warat42',             'unexpected uses',
  'bon',                'appétit',             'long list',
  'ways',               'java',                'aficionados',
  'ordinary things',
  ... 26 more items
]
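
For anyone curious what a tokenization/normalization pass involves, here is a minimal Python sketch. It is not the processor that produced the list above: the `tokenize` function and the tiny `STOP_WORDS` set are made up for illustration, and unlike the real output it only yields single words, not phrases like 'instant coffee'.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real NLP libraries ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "her", "she", "is", "with", "for", "on"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # \w also matches accented letters, so "Appétit" stays in one piece;
    # apostrophes and hyphens inside a word are kept ("d'oro", "multi-purpose").
    words = re.findall(r"\w[\w'-]*", text.lower())
    return [w for w in words if w not in STOP_WORDS]


tokens = tokenize("The Secret Baking Ingredient Ina Garten Always Keeps In The Pantry")
print(tokens)
# ['secret', 'baking', 'ingredient', 'ina', 'garten', 'always', 'keeps', 'pantry']
```

Dropping stop words this early is what keeps "the" and "in" from dominating the counts later on.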

Step 3 of 3 - Weighted Results Using BM25 Vectorizer

[
  [ 'instant coffee', 0.540281 ],
  [ 'food', 0.486846 ],
  [ 'coffee', 0.486846 ]
]

First try was a success! Ina Garten adds instant coffee to her baked goods. Each term's weight is shown as a number between 0 and 1, and I've included the top 3 results for additional context.
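
The scoring step can be approximated with plain term frequency. To be clear, the sketch below is not BM25 and won't reproduce the numbers above; it just counts tokens and reports each one's share of the total, which also lands between 0 and 1. The `top_terms` helper and the token list are hypothetical.

```python
from collections import Counter


def top_terms(tokens, n=3):
    """Return the n most frequent tokens with their share of all tokens (0 to 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(term, count / total) for term, count in counts.most_common(n)]


# Hypothetical, hand-made token list standing in for a tokenized article.
tokens = [
    "instant coffee", "garten", "instant coffee", "pantry",
    "instant coffee", "garten", "recipe", "cookies",
]
print(top_terms(tokens, n=2))
# [('instant coffee', 0.375), ('garten', 0.25)]
```

A pure frequency count like this only works when the answer is repeated a lot, which is exactly what happens in the instant coffee article.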

  2. The Staple Pantry Ingredient That Makes A Game-Changing Beef Stew

Two for two...

  3. The Best Type Of Oil To Use When Making Popcorn

...aaand a fail. The article mentioned high-temperature oils and listed several kinds, each only once: canola, vegetable, peanut, and coconut. None of these even shows up in the top 10 words. I haven't been using the IDF part of TF-IDF. If I were, the model could potentially rule out some of the top-ranked words for being generic in the context of cooking.
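
Here is a rough sketch of what the IDF part might buy. It assumes the textbook weighting, term frequency multiplied by log(N / document frequency), and runs on a three-document corpus I invented for illustration; the `tf_idf` function is hypothetical and not what my processor does.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document as term frequency times inverse document frequency."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear at least once?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical miniature corpus of tokenized cooking articles (invented for illustration).
docs = [
    ["oil", "popcorn", "oil", "canola", "oil", "recipe"],
    ["oil", "recipe", "stew", "beef", "recipe"],
    ["oil", "recipe", "coffee", "cookies"],
]
popcorn_scores = tf_idf(docs)[0]
print(sorted(popcorn_scores, key=popcorn_scores.get, reverse=True))
# ['popcorn', 'canola', 'oil', 'recipe']
```

Even though "oil" is the most frequent word in the first document, it appears in every document, so its IDF is log(3/3) = 0 and it drops to the bottom. A word mentioned once, like "canola", floats up because no other document uses it. Whether that would rescue the popcorn article on a real corpus is something I haven't tested.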

Overall, this was a very rudimentary approach to the problem, and it doesn't work very well. Imagining the effort involved in improving it, maybe I'm better off just not knowing.