metadata – Terence Eden’s Blog

Fixing "Date/time not in ISO 8601 format" in Google Search Console

@edent — Wed, 24 Dec 2025 12:34:43 +0000

I like using microdata within my HTML to provide semantic metadata. One of my pages had this scrap of code on it:

9 June 2025 11:27

The Google Search Console was throwing this error:

I was fairly sure that was a valid ISO 8601 string. It certainly matched the description in the Google documentation. Nevertheless, I fiddled with a few different formats, but all failed.

On the advice of Barry Hunter, I tried changing the datetime attribute to content. That also didn't work.

Then I looked closely at the code.

The issue is the itemscope. Removing that allowed the code to pass validation. But why?

Here's what the Schema.org documentation says:

By adding itemscope, you are specifying that the HTML contained in the block is about a particular item.

The HTML specification gives this example:

Here, the image property is the value of the element. In this case google-logo.png. So what's the problem with the time example?

Well, <img> is a void element. It doesn't have any HTML content - so the metadata is taken from the src attribute.

But <time> is not a void element. It does contain HTML. So something like this would be valid:

2025-06-09T11:27:06+01:00

The text contained by the element is a valid ISO 8601 string.

My choice was either to present the ISO 8601 string to anyone viewing the page, or simply to remove the itemscope. So I chose the latter.
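For reference, here is a minimal Python sketch of how the two representations relate. It assumes the timestamp from the example above and hard-codes the UTC+1 offset; the variable names are illustrative and not taken from the site's code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+1 offset, matching the +01:00 in the example string.
utc_plus_one = timezone(timedelta(hours=1))
published = datetime(2025, 6, 9, 11, 27, 6, tzinfo=utc_plus_one)

# Machine-readable form: isoformat() on an aware datetime includes the offset.
iso_string = published.isoformat()

# Human-readable form, as shown to visitors.
human_string = f"{published.day} {published:%B %Y %H:%M}"

print(iso_string)
print(human_string)

# Round trip: the ISO 8601 string parses back to the same instant.
assert datetime.fromisoformat(iso_string) == published
```

The first string is what belongs in the datetime attribute; the second is what a human reader sees.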

Is IPA furigana a bad idea?

@edent — Thu, 10 Oct 2024 11:34:17 +0000

My name is Terence(/ˈtɛɹəns) Eden(ˈiːdən/).

Modern HTML allows the user to use the <ruby> element to annotate text.

This is usually used for furigana - which allows pronunciation to be placed above words.

For example: "シン・ゴジラ (Shin Godzilla)" shows you how to pronounce both words if you are unfamiliar with kanji. The text can be any language or use any characters. In Japanese, it is quite often used to show phonetic pronunciation using hiragana.

Because English is a composite language⁰, it isn't always easy for people to pronounce words¹.

So I have abused(?) the ruby syntax to show the International Phonetic Alphabet above the English words.

Is this a good idea? Is it a valid use of the syntax? Is it semantically correct? I don't know. But I do now know that it is possible.

I doubt the majority of people know the IPA, so it is of dubious use. It does make my name's pronunciation more apparent to machines.

An alternative is to use Schema.org. For example, my contact page has the following microdata:


    
        
            Terence 
            Eden

That allows humans to listen to the pronunciation of my name, and machines to see the IPA version.

Is there a better, more accessible, more useful way of encoding how to pronounce text?

⁰ Or a mongrel language
¹ Yes, I've seen that funny TikTok. And that one.

WebMentions, Privacy, and DDoS - Oh My!

@edent — Tue, 29 Nov 2022 12:34:15 +0000

Mastodon - the distributed social network - has two interesting challenges when it comes to how users share links. I'd like to discuss those issues and suggest a possible way forward.

When you click on a link on my website which takes you to another website, your browser sends a Referer⁰. This says to the other site "Hey, I came here using a link on shkspr.mobi". This is useful because it lets a site owner know who is linking to them. I love seeing which weird and wonderful sites have linked to my content.

It is also something of a privacy nightmare as it lets sites see who is clicking and from where they're clicking. So Mastodon sets a noreferrer¹ attribute on all links. This tells the browser not to send the Referer.

This means sites no longer know who is sending them traffic.

That's either a good thing from a privacy perspective or a disaster from a marketing perspective. Or a little bit of both.

Here's a related issue. When a user posts a link to your website on Mastodon, the server checks your page to see if there are any oEmbed tags for a rich link preview. But, at the moment, it doesn't check your website's robots.txt file - which lets it know whether it is allowed to scrape your content.

In the case of something like Twitter or Facebook, this is fine. If a million users post a link, the centralised social network checks the link once and caches the result.

With - potentially - thousands of distributed Mastodon sites, this presents a problem. If a popular account posts a link, their instance fetches a rich preview. Then every instance which has users following them also requests that URL. Essentially, this is a DDoS attack.

I can fix you

So here are my thoughts on how to fix this.

When a user posts a link to Mastodon, their instance should send a WebMention to the site hosting the link. This informs the website that someone has shared their content. Perhaps a user could adjust their privacy settings to allow or deny this.
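As a rough sketch of what that notification involves: a WebMention is an HTTP POST carrying two form-encoded fields, source and target. The Python below only prepares the request and never sends it. The endpoint URL is hypothetical, and a real sender would first discover the endpoint from the target page's Link header or link element.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_webmention(endpoint: str, source: str, target: str) -> Request:
    """Prepare (but do not send) a WebMention notification."""
    body = urlencode({"source": source, "target": target}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


webmention = build_webmention(
    endpoint="https://example.com/webmention",  # hypothetical endpoint
    source="https://mastodon.social/users/Edent/statuses/123",
    target="https://example.com/",
)
print(webmention.get_method(), webmention.full_url)
print(webmention.data.decode("ascii"))
```

Passing the prepared request to urllib.request.urlopen() would deliver it; that step is left out so the sketch has no network side effects.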

The instance would check the site's robots.txt and, if allowed, scrape the site to see if there were any Open Graph Protocol metadata elements on it.
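The robots.txt check could look something like this in Python, using the standard library's parser. The rules are parsed from a string so nothing is fetched; the "Mastodon" user-agent token and the example rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for example.com
example_robots = """\
User-agent: *
Disallow: /private/
"""

print(may_fetch(example_robots, "Mastodon", "https://example.com/blog/post"))
print(may_fetch(example_robots, "Mastodon", "https://example.com/private/page"))
```

An instance would only go on to scrape the page for metadata when this check passes.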

That metadata should be included in the post as it is shared across the network.

For example, a status could look like this:

{
  "id": "123",
  "created_at": "2022-03-16T14:44:31.580Z",
  "in_reply_to_id": null,
  "in_reply_to_account_id": null,
  "visibility": "public",
  "language": "en",
  "uri": "https://mastodon.social/users/Edent/statuses/123",
  "content": "Check out https://example.com/",
  "ogp_allowed": true,
  "ogp": {
      "og:title": "My amazing site",
      "og:image:url": "https://cdn.mastodon.social/cache/example.com/preview.jpg",
      "og:description": "A long description. Perhaps the first paragraph of the text."
      ...
   }
   ...
}

When a post is boosted across the network, the instances can see that there is rich metadata associated with the link. If there is an image associated with the post, that will be loaded from the cache on the original Mastodon instance - avoiding overloading the website.

Now, there is a flaw in this idea. A malicious Mastodon server could serve up a fake OGP image and description. So a link to McDonald's might display a fake image promoting Burger King.

To protect against this, a receiving instance could randomly or periodically check the OGP metadata that they receive. If it has been changed, they can update it.

Perhaps a diagram would help?

What other people say about the problem

David Gerard

@davidgerard@circumstances.run

yes, you should put a cache in front of a blog. nginx and wp-supercache do well. but.

mastodon's auto-DDOS feature is still obnoxious. and in a social network, technically designed in obnoxiousness is incompetent.

i realise it'd need extension of activitypub, but is anyone working on sending prerendered cards with the URL? just to save 1000 servers hammering the URL to generate their own cards locally.

2022-11-28, 14:44 7 boosts 23 favorites

Feedback?

Is this a problem? Does this present a viable solution? Have I missed something obvious? Please leave a comment and let me know 😃

⁰ This is a spleling mistake which is part of the specification so cannot be changed.
¹ This one is spelled correctly. Which makes life confusing for all involved.

Is Open Graph Protocol dead?

@edent — Sun, 06 Nov 2022 12:34:49 +0000

~~Facebook~~ Meta - like many other tech titans - has institutional Shiny Object Syndrome. It goes something like this:

Launch a product to great fanfare
Spend a few years hyping it as ✨the future✨
Stop answering emails and pull requests
If you're lucky, announce that the product is abandoned but, more likely, just forget about it.

Open Graph Protocol (OGP) is one of those products. The value-proposition is simple.

It's hard for computers to pick out the main headline, image, and other data from a complex web page.
Therefore, let's encourage websites to include metadata which tells our services what they should look at!

OGP works pretty well! When you share a link on Facebook, or Twitter, or Telegram - those services load the website in the background, look for OGP metadata, and display a friendly snippet.
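To make "look for OGP metadata" concrete, here is a minimal Python sketch which pulls og: properties out of a page using only the standard library. The sample HTML is made up, and real services are likely to be more forgiving about malformed markup.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Made-up page for illustration
sample_page = """
<html><head>
  <title>My amazing site</title>
  <meta property="og:title" content="My amazing site">
  <meta property="og:image" content="https://example.com/preview.jpg">
  <meta name="description" content="Not Open Graph, so ignored">
</head><body><h1>Hello</h1></body></html>
"""

print(extract_open_graph(sample_page))
```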

~~Facebook~~ Meta were the driving force behind OGP - and have now left it to fester.

The website - https://ogp.me/ - still works.
But the Facebook OGP Discussion Group is now full of spam.
The Developer Mailing List is broken.
The Google Documentation links to a dead Google+ page.
And the GitHub Page has been archived.

Is OGP finished?

And, that might be fine. ~~Facebook~~ Meta are a small company with limited resources. They can't afford to fund standards work indefinitely. And, anyway, OGP is complete, right? It has all the tags that anyone could ever possibly want. Why does it need any improving?

Well, that's not the case. We know, for example, that Twitter have created their own proprietary OGP-like meta tags. Similarly, Pinterest have their own as well. And even Google are going their own way with Rich Snippets.

This is annoying for developers. Now we have to write multiple different bits of metadata if we want our links to be supported on all platforms.

Standards work is never "finished". Developers want to add new features. Users want to interact with new forms of content.

Tomorrow someone is going to invent a way to share smells over the Internet. How does that get represented in an Open Graph Protocol compliant manner?

or or or...

We know from bitter experience that having several mutually incompatible ways to implement something is a nightmare for developers and provides a poor user-experience.

So we create standards bodies. They're not perfect, but a group of interested folks can do the hard work to try and satisfy oppositional stakeholders.

This is my plea to ~~Facebook~~ Meta. If you're no longer interested in improving OGP, OK. You do you. But hand it over to people who want to keep this going. Maybe it's the W3C, or IndieWeb, or Schema.org or someone. Hell, I'm not busy, I'll take it on.

Remember, if you love something, let it go.

Semantic Comments for WordPress

@edent — Thu, 28 Apr 2022 11:34:13 +0000

As regular readers will know, I love adding Semantic things to my blog.

The standard WordPress comments HTML isn't very semantic - so I thought I'd change that. Here's some code which you can add to your blog's theme - and an explanation of how it works.

The aim is to end up with some HTML which looks like this (edited for brevity):


    
        
            22-04-12 10:22
        
        
            
            
                Commenter's Name says:
        
        This is the text of my comment

Which will be interpreted as:

This adds elements as well as Schema.org microdata.

Howto

In comments.php you'll see something like this:


    <?php
        wp_list_comments( array(
            'style'       => 'ol',
            'short_ping'  => true,
            'avatar_size' => 64,
        ) );
    ?>

You need to add a new callback. In this case, I've called it my_comments_walker:


    <?php
        wp_list_comments( array(
            'style'       => 'ol',
            'short_ping'  => true,
            'avatar_size' => 64,
            'callback'    => 'my_comments_walker',
        ) );
    ?>

You can read more about WordPress Walkers on their documentation page.

Now that's done, you need to create a function in your functions.php file. I added this to the end of my file:

function my_comments_walker() {

    //  Basic comment data
    $comment_id          = get_comment_id();
    $comment             = get_comment( $comment_id );

    //  Date the comment was submitted
    $comment_date        = get_comment_date( "c" );
    //  In slightly more human-readable format
    $comment_date_human  = get_comment_date( "y-m-d H:i" );

    //  Author Details
    $comment_author      = get_comment_author();

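    //  Author's URL if they've added one
    $comment_author_url  = get_comment_author_url();

    //  If there's an Author URL, link it
    //  (Plain link markup shown here; add any microdata attributes you need)
    if ($comment_author_url != null) {
        $comment_author_name = "<a href=\"" . esc_url( $comment_author_url ) . "\">{$comment_author}</a>";
    } else {
        $comment_author_name = "{$comment_author}";
    }

    //  Provide a link to the comment anchor
    $comment_url_link = "<a href=\"" . esc_url( get_comment_link( $comment ) ) . "\">{$comment_date_human}</a>";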

    //  Author's Avatar based on ID
    //  As per https://developer.wordpress.org/reference/functions/get_avatar/ both alt & default must be set
    $gravatar            = get_avatar( $comment, 64, "", "", array('extra_attr' => 'itemprop="image"') );

    //  Comment needs newlines and links added
    $comment_text        = apply_filters( 'comment_text', get_comment_text(), $comment);


    //  The comment may have various classes. They are stored as an array
    $comment_classes     = get_comment_class();
    $comment_classes_text = "";
    foreach( $comment_classes as $class ) {
        $comment_classes_text .= $class . " ";
    }
    $comment_classes_text = trim($comment_classes_text);

    //  Link to open the reply box
    $comment_reply_link = get_comment_reply_link( [
                    'depth'     => 20,
                    'max_depth' => 100,
                    'before'    => '',
                    'after'     => ''
            ] );

    //  Write the comment HTML. No need for a closing </li> as WP handles that.
    echo <<< EOT
    
        
            
                $comment_url_link
            
            
                $gravatar
                $comment_author_name says:
            
            $comment_text
            $comment_reply_link
        
    EOT;
}

There are a few extra classes and spans which I use. You can remove them if you like.

And that's it. All your comments will have individual semantic metadata. If you think anything else should be included, please let me know.

How to add ISSN metadata to a web page

@edent — Fri, 17 Sep 2021 11:09:05 +0000

Inspired by John Hoare at the Dirty Feed blog - I've asked the British Library to assign my blog an International Standard Serial Number (ISSN).

An ISSN is an 8-digit code used to identify newspapers, journals, magazines and periodicals of all kinds and on all media–print and electronic.

Why?

Shut up.

OK. It turns out that lots of people cite my blog in academic papers - so I wanted to make it slightly easier for scholars of the future to use metadata to trace my vast influence on Human civilisation.

How?

I filled in a form on the British Library website. Didn't cost me a penny. Was pretty quick!

Metadata

I can stick a bit of text at the bottom of each page with the ISSN - but that doesn't make it easily discoverable by automated tools. How can I make an ISSN machine readable? There are a few ways.

Meta Elements

There is a limited list of official names. These are extensible, and Google Scholar recommends citation_issn. Which is as simple as adding the following to your page's <head>:

There are alternatives though.

Schema.org

In recent years, Schema.org has become the dominant form for representing metadata on the web. There are two ways you can implement it:

JSON-LD

JSON Linked Data involves adding a scrap of JavaScript to your HTML, like this:
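The exact markup depends on the site. As a minimal sketch, assuming the Schema.org Blog type with its issn property, and using 1234-5678 as a placeholder ISSN, the script element could be generated in Python like this:

```python
import json

ISSN = "1234-5678"  # placeholder, not a real registration

blog_metadata = {
    "@context": "https://schema.org",
    "@type": "Blog",  # assumption: the site is described as a Blog
    "issn": ISSN,
}

json_ld = json.dumps(blog_metadata, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

The printed element can be placed anywhere in the page's HTML.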

If you don't want to add a separate script, you can add the data inline using...

Microdata

The microdata specification uses the exact same data as Schema.org - but allows you to add the data directly into the web page like this:


   ...
   ISSN 1234-5678

That's probably the easiest way to do it.

Links

The ISSN registry allows you to look up any ISSN with a simple URL. Mine is at https://portal.issn.org/resource/ISSN/2753-1570.
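Before publishing an ISSN it is worth checking that it has been typed correctly. The eighth character is a check digit: the first seven digits are weighted 8 down to 2, summed, and the check digit is whatever brings the total to a multiple of 11 (with X standing for 10). A small Python sketch follows; the function names are my own.

```python
def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(digit) * weight
                for digit, weight in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Validate an ISSN written as NNNN-NNNC (hyphen optional)."""
    compact = issn.replace("-", "").upper()
    digits = compact[:7]
    if len(compact) != 8 or not (digits.isascii() and digits.isdigit()):
        return False
    return issn_check_digit(digits) == compact[7]


def issn_portal_url(issn: str) -> str:
    """Build the registry lookup URL in the form shown above."""
    return f"https://portal.issn.org/resource/ISSN/{issn}"


print(is_valid_issn("2753-1570"), issn_portal_url("2753-1570"))
print(is_valid_issn("1234-5678"))
```

Note that the placeholder 1234-5678 used in the examples on this page does not pass the check, so it cannot collide with a real publication.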

Belt and braces

So, this is what I've ended up doing - cramming everything in all at once.


   ...
   


   ...
   ISSN 1234-5678

Any other ways?

What am I missing? Can someone smarter than I tell me that there's an easier / better / more interoperable way to do this?

Reducing GPS accuracy in photos

@edent — Thu, 01 Oct 2020 11:56:50 +0000

Here's a quick one-liner to reduce the precision of location stored in a photo's EXIF metadata:

exiftool -c "%.2f" -TagsFromFile @ -GPSLatitude -GPSLongitude photo.jpg

(Thanks to the EXIFtool Forum for their help.)

Why is this useful?

Modern phones automatically attach a GPS location to every photo you take. GPS resolution is around 10 metres. When you share your photos, you're often sharing your precise location.

I wanted to upload some photos to the Wikimedia Commons of an interesting junction box installed in our home. I didn't want my home location stored on the Internet forever - but I thought it would be useful to include a rough location.

The above command takes a location of 51.123456,0.987654 and returns 51.12,0.99 (the %.2f format rounds to two decimal places rather than truncating). That's good enough to roughly show the location, without revealing it exactly.
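To get a feel for how coarse two decimal places are, here is a small Python sketch. It mirrors the %.2f formatting and estimates the size of one 0.01 degree step, assuming roughly 111,320 metres per degree of latitude (an approximation; the function names are illustrative).

```python
import math

METRES_PER_DEGREE_LATITUDE = 111_320  # rough average, assumed for this estimate


def format_coordinate(value: float, places: int = 2) -> str:
    """Format a coordinate the way a printf-style %.2f would."""
    return f"{value:.{places}f}"


def grid_size_metres(latitude: float, places: int = 2) -> tuple:
    """Approximate north-south and east-west size of one rounding step."""
    step = 10 ** -places
    north_south = step * METRES_PER_DEGREE_LATITUDE
    east_west = north_south * math.cos(math.radians(latitude))
    return north_south, east_west


print(format_coordinate(51.123456), format_coordinate(0.987654))
north_south, east_west = grid_size_metres(51.12)
print(f"about {north_south:.0f} m north-south by {east_west:.0f} m east-west")
```

At this latitude each rounded position therefore stands for an area on the order of a kilometre across.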

Adding Semantic Reviews / Rich Snippets to your WordPress Site

@edent — Sun, 12 Jul 2020 11:50:01 +0000

This is a real "scratch my own itch" post. I want to add Schema.org semantic metadata to the book reviews I write on my blog. This will enable "rich snippets" in search engines.

There are loads of WordPress plugins which do this. But where's the fun in that?! So here's how I quickly built it into my open source blog theme.

Screen options

First, let's add some screen options to the WordPress editor screen.

This is what it will look like when done:

This is how to add a custom metabox to the editor screen:

//  Place this in functions.php
//  Display the box
function edent_add_review_custom_box()
{
   $screens = ['post'];
   foreach ($screens as $screen) {
      add_meta_box(
         'edent_review_box_id', // Unique ID
         'Book Review Metadata',    // Box title
         'edent_review_box_html',// Content callback, must be of type callable
         $screen                 // Post type
       );
   }
}
add_action('add_meta_boxes', 'edent_add_review_custom_box');

The contents of the box are bog standard HTML:

//  Place this in functions.php
//  HTML for the box
function edent_review_box_html($post)
{
    $review_data = get_post_meta(get_the_ID(), "_edent_book_review_meta_key", true);
    echo "";

    $checked = "";
    if ($review_data["review"] == "true") {
        $checked = "checked";
    }
    echo "";

    echo "";

    echo "";

    echo "Embed Book Review: 
Rating: 
ISBN: ";
}

Done! We now have a box for metadata. That data will be POSTed every time the blogpost is saved. But where do the data go?

Saving data

This function runs every time the blogpost is saved. If the checkbox has been ticked, the metadata are saved to the database. If the checkbox is unticked, the metadata are deleted.

//  Place this in functions.php
//  Save the box
function edent_review_save_postdata($post_id)
{
   if (array_key_exists('edent_book_review', $_POST)) {
        if ($_POST['edent_book_review']["review"] == "true") {
            update_post_meta(
                $post_id,
                '_edent_book_review_meta_key',
                $_POST['edent_book_review']
            );
        } else {
            delete_post_meta(
                $post_id,
                '_edent_book_review_meta_key'
            );
        }
    }
}
add_action('save_post', 'edent_review_save_postdata');

Nice! But how do we get the data back out again?

Retrieving the data

We can use the get_post_meta() function to get all the metadata associated with a blog entry. We can then turn it into a Schema.org structured metadata entry.

function edent_book_review_display($post_id){
    // https://developer.wordpress.org/reference/functions/the_meta/
    $review_data = get_post_meta($post_id, "_edent_book_review_meta_key", true);
    if ($review_data["review"] == "true")
    {
        $blog_author_data = get_the_author_meta();

        $schema_review = array (
            '@context' => 'https://schema.org',
            '@type'    => 'Review',
            'author' =>
            array (
                '@type' => 'Person',
                'name'  => get_the_author_meta("user_firstname") . " " . get_the_author_meta("user_lastname"),
                'sameAs' =>
                array (
                    0 => get_the_author_meta("user_url"),
                ),
            ),
            'url' => get_permalink(),
            'datePublished' => get_the_date('c'),
            'publisher' =>
            array (
                '@type'  => 'Organization',
                'name'   => get_bloginfo("name"),
                'sameAs' => get_bloginfo("url"),
            ),
            'description' => mb_substr(get_the_excerpt(), 0, 198),
            'inLanguage'  => get_bloginfo("language"),
            'itemReviewed' =>
            array (
                '@type'  => 'Book',
                'name'   => $review_data["title"],
                'isbn'   => $review_data["isbn"],
                'sameAs' => $review_data["book_url"],
                'author' =>
                array (
                    '@type'  => 'Person',
                    'name'   => $review_data["author"],
                    'sameAs' => $review_data["author_url"],
                ),
                'datePublished' => $review_data["book_date"],
            ),
            'reviewRating' =>
            array (
                '@type' => 'Rating',
                'worstRating' => 0,
                'bestRating'  => 5,
                'ratingValue' => $review_data["rating"],
            ),
            'thumbnailUrl' => get_the_post_thumbnail_url(),
        );
        echo '<script type="application/ld+json">' . json_encode($schema_review) . '</script>';

        echo "";
        if (isset($review_data["rating"])) {
            echo "";
            $full = floor($review_data["rating"]);
            $half = 0;
            if ($review_data["rating"] - $full == 0.5)
            {
                $half = 1;
            }

            $empty = 5 - $half - $full;

            for ($i=0; $i < $full ; $i++) {
                echo "★";
            }
            if ($half == 1)
            {
                echo "⯪";
            }
            for ($i=0; $i < $empty ; $i++) {
                echo "☆";
            }
            echo "";
        }
        echo "";
        if ($review_data["amazon_url"] != "") {
            echo "Buy it on Amazon";
        }
        if ($review_data["author_url"] != "") {
            echo "Author's homepage";
        }
        if ($review_data["book_url"] != "") {
            echo "Publisher's details";
        }
        echo "";
    }
    echo "";
}

In index.php, after the_content(); add:

edent_book_review_display(get_the_ID());

Then, on the website, it will look something like this:

Note the use of the Unicode Half Star for the ratings.
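The star logic is easier to follow in isolation. This Python sketch mirrors the PHP above: whole stars first, then a half star only when the remainder is exactly 0.5, then empty stars up to five. The function name is illustrative.

```python
import math


def star_rating(rating: float, best: int = 5) -> str:
    """Render a rating as full, half and empty stars."""
    full = math.floor(rating)
    half = 1 if rating - full == 0.5 else 0
    empty = best - full - half
    return "★" * full + "⯪" * half + "☆" * empty


for example in (0, 3.5, 4, 5):
    print(example, star_rating(example))
```

As in the original, any fractional rating other than .5 is simply rounded down to whole stars.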

The source code of the site shows the output of the JSON LD:

When run through a Structured Data Testing Tool, it shows as a valid review:

And this means, when search engines access your blog, they will display rich snippets based on the semantic metadata.

You can see the final blog post to see how it works.

ToDo

My code is horrible and hasn't been tested, validated, or sanitised. It's only for my own blog, and I'm unlikely to hack myself, but that needs fixing.

I want to add review metadata for movies, games, and gadgets. That will either require multiple boxes, or a clever way to only show the necessary fields.

Removing default metadata from .opus files

@edent — Fri, 24 Apr 2020 11:03:00 +0000

I'm trying to create some ridiculously tiny audio files. The sort where every single byte matters.

I've encoded a small sample. But the opusenc tool automatically adds metadata - even if you don't specify any.

Using the amazing Mutagen Python library I was able to completely strip out all the metadata!

import mutagen
mutagen.File("example.opus").delete()

It edits the file immediately - so be careful!

But what is it actually doing? I wanted to understand a bit more - so let's go hex diving!

What the user sees

Running opusinfo example.opus gives:

New logical stream (#1, serial: 03fe3cc9): type opus
Encoded with libopus 1.3.1, libopusenc 0.2.1
User comments section follows...
    ENCODER=opusenc from opus-tools 0.2
    ENCODER_OPTIONS=--bitrate 6 --comp 10 --framesize 60 --padding 0
Opus stream 1:
    ...
Logical stream 1 ended

There are two "mandatory" comments. The ENCODER and the ENCODER_OPTIONS. I can't find a way to stop those being generated by opusenc.

The Opus File API gives some idea about the binary structure of the file.

But the real magic happens in the Opus Format Specification RFC. It details the header format in 32 bit clumps.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      'O'      |      'p'      |      'u'      |      's'      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |      'T'      |      'a'      |      'g'      |      's'      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                     Vendor String Length                      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                        Vendor String...                       :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                   User Comment List Length                    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                 User Comment #0 String Length                 |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                   User Comment #0 String...                   :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                 User Comment #1 String Length                 |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                                                               :

Let's take a look at our file in binary, jumping straight to the comment section.

0000004b: 4f70 7573  Opus
0000004f: 5461 6773  Tags

Starts as expected. Next is the Vendor String Length

00000053: 1f00 0000  ....

0x1f is 31 bytes. This is a 32 bit, unsigned, little endian number. Hence it is written as 1f00 0000, which is read as 0x0000001f.
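If the byte order feels backwards, a couple of lines of Python show the difference. The four bytes are copied from the dump above.

```python
import struct

raw = bytes.fromhex("1f000000")  # the four bytes at offset 0x53

# "<I" means little endian, unsigned 32 bit integer.
(vendor_length,) = struct.unpack("<I", raw)

# Reading the same bytes as big endian gives a very different number.
(misread_length,) = struct.unpack(">I", raw)

print(vendor_length, misread_length)
```

int.from_bytes(raw, "little") gives the same answer without needing struct.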

00000057: 6c69 626f  libo
0000005b: 7075 7320  pus 
0000005f: 312e 332e  1.3.
00000063: 312c 206c  1, l
00000067: 6962 6f70  ibop
0000006b: 7573 656e  usen
0000006f: 6320 302e  c 0.
00000073: 322e 31    2.1

According to the spec, no terminating null octet is necessary. So the next bytes are the User Comment List Length. Continuing on from the previous line:

00000073:        02     .
00000077: 0000 00    ...

There are two comments (again, 32 bit little endian).

This field indicates the number of user-supplied comments. It MAY indicate there are zero user-supplied comments, in which case there are no additional fields in the packet.

This means we can have an empty comment section! This is what you get by default:

00000077:        23  ...#
0000007b: 0000 00    ...

First string length is 0x23 = 35 bytes long. Again, little endian.

0000007e: 454e 434f  ENCO
00000082: 4445 523d  DER=
00000086: 6f70 7573  opus
0000008a: 656e 6320  enc 
0000008e: 6672 6f6d  from
00000092: 206f 7075   opu
00000096: 732d 746f  s-to
0000009a: 6f6c 7320  ols 
0000009e: 302e 3240  0.2@

After exactly 35 bytes, we get our next little endian number 0x40 = 64.

000000a1: 4000 0000  @...
000000a5: 454e 434f  ENCO
000000a9: 4445 525f  DER_
000000ad: 4f50 5449  OPTI
000000b1: 4f4e 533d  ONS=
000000b5: 2d2d 6269  --bi
000000b9: 7472 6174  trat
000000bd: 6520 3620  e 6 
000000c1: 2d2d 636f  --co
000000c5: 6d70 2031  mp 1
000000c9: 3020 2d2d  0 --
000000cd: 6672 616d  fram
000000d1: 6573 697a  esiz
000000d5: 6520 3630  e 60
000000d9: 202d 2d70   --p
000000dd: 6164 6469  addi
000000e1: 6e67 2030  ng 0

And that's the end of the comment section!
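The whole walk through can be condensed into a short Python sketch. It builds a comment packet from the strings reported by opusinfo and then parses it back, following the layout in the RFC diagram. The function names are my own, and the sketch ignores any padding which might follow the comments.

```python
import struct


def build_opus_tags(vendor: str, comments: list) -> bytes:
    """Assemble an OpusTags packet: magic, vendor, comment count, comments."""
    def length_prefixed(text: str) -> bytes:
        data = text.encode("utf-8")
        return struct.pack("<I", len(data)) + data

    return (b"OpusTags"
            + length_prefixed(vendor)
            + struct.pack("<I", len(comments))
            + b"".join(length_prefixed(c) for c in comments))


def parse_opus_tags(packet: bytes) -> tuple:
    """Return (vendor, [comments]) from an OpusTags packet."""
    if packet[:8] != b"OpusTags":
        raise ValueError("not an OpusTags packet")
    offset = 8

    def read_string() -> str:
        nonlocal offset
        (length,) = struct.unpack_from("<I", packet, offset)
        offset += 4
        text = packet[offset:offset + length].decode("utf-8")
        offset += length
        return text

    vendor = read_string()
    (count,) = struct.unpack_from("<I", packet, offset)
    offset += 4
    return vendor, [read_string() for _ in range(count)]


vendor = "libopus 1.3.1, libopusenc 0.2.1"
comments = [
    "ENCODER=opusenc from opus-tools 0.2",
    "ENCODER_OPTIONS=--bitrate 6 --comp 10 --framesize 60 --padding 0",
]
original = build_opus_tags(vendor, comments)
stripped = build_opus_tags(vendor, [])
print(len(original), len(stripped))
print(parse_opus_tags(original))
```

The two lengths are the size of the comment packet with and without the default comments.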

Manually editing the file

I started by setting the User Comment List Length to zero, and removing all the subsequent comment data. That didn't work. opusinfo gave the following errors:

WARNING: Hole in data (28 bytes) found at approximate offset 1492 bytes. Corrupted Ogg.
WARNING: Hole in data (51 bytes) found at approximate offset 1492 bytes. Corrupted Ogg.
WARNING: sequence number gap in stream 1. Got page 2 when expecting page 1. Indicates missing data.
WARNING: discontinuity in stream (1)

Back to the documentation!

An Ogg Opus stream is organized as follows (see Figure 1 for an example).

        Page 0         Pages 1 ... n        Pages (n+1) ...
     +------------+ +---+ +---+ ... +---+ +-----------+ +---------+ +--
     |            | |   | |   |     |   | |           | |         | |
     |+----------+| |+-----------------+| |+-------------------+ +-----
     |||ID Header|| ||  Comment Header || ||Audio Data Packet 1| | ...
     |+----------+| |+-----------------+| |+-------------------+ +-----
     |            | |   | |   |     |   | |           | |         | |
     +------------+ +---+ +---+ ... +---+ +-----------+ +---------+ +--
     ^      ^                           ^
     |      |                           |
     |      |                           Mandatory Page Break
     |      |
     |      ID header is contained on a single page
     |
     'Beginning Of Stream'

    Figure 1: Example Packet Organization for a Logical Ogg Opus Stream

There are two mandatory header packets. The first packet in the logical Ogg bitstream MUST contain the identification (ID) header, which uniquely identifies a stream as Opus audio. The format of this header is defined in Section 5.1. It is placed alone (without any other packet data) on the first page of the logical Ogg bitstream and completes on that page. This page has its 'beginning of stream' flag set.

The second packet in the logical Ogg bitstream MUST contain the comment header, which contains user-supplied metadata. The format of this header is defined in Section 5.2. It MAY span multiple pages, beginning on the second page of the logical stream. However many pages it spans, the comment header packet MUST finish the page on which it completes.

I tried saying there was one comment, with a length of zero and a null comment. That didn't work either.

I think this is because before the start of the comment header there is something describing how long the packet will be.

Headers

Here are the headers from the original file, and the one stripped by Mutagen.

Original Header

00000000: 4f67 6753 0002 0000  OggS....
00000008: 0000 0000 0000 c93c  .......<
00000010: fe03 0000 0000 f90e  ........
00000018: f775 0113 4f70 7573  .u..Opus
00000020: 4865 6164 0101 3801  Head..8.
00000028: 80bb 0000 0000 004f  .......O
00000030: 6767 5300 0000 0000  ggS.....
00000038: 0000 0000 00c9 3cfe  ......<.
00000040: 0301 0000 0035 dfaf  .....5..
00000048: 0601 9a4f 7075 7354  ...OpusT
00000050: 6167 731f 0000 006c  ags....l
00000058: 6962 6f70 7573 2031  ibopus 1

Stripped Header

00000000: 4f67 6753 0002 0000  OggS....
00000008: 0000 0000 0000 c93c  .......<
00000010: fe03 0000 0000 f90e  ........
00000018: f775 0113 4f70 7573  .u..Opus
00000020: 4865 6164 0101 3801  Head..8.
00000028: 80bb 0000 0000 004f  .......O
00000030: 6767 5300 0000 0000  ggS.....
00000038: 0000 0000 00c9 3cfe  ......<.
00000040: 0301 0000 00ae 941c  ........
00000048: 4e01 2f4f 7075 7354  N./OpusT
00000050: 6167 731f 0000 006c  ags....l
00000058: 6962 6f70 7573 2031  ibopus 1

The Difference

Original                                  Stripped
00000040: 0301 0000 0035 dfaf  .....5.. | 00000040: 0301 0000 00ae 941c  ........
00000048: 0601 9a4f 7075 7354  ...OpusT | 00000048: 4e01 2f4f 7075 7354  N./OpusT

So, something is happening in bytes 0x45 to 0x4a. But what?

A page is a header of 26 bytes, followed by the length of the data, followed by the data. The constructor is given a file-like object pointing to the start of an Ogg page. After the constructor is finished it is pointing to the start of the next page.

Mutagen Source Code

Unfortunately, my brain freezes up when I see things like

header = struct.unpack('<4sBBqIIiB', header_data)

But the code does point to the Ogg page format specification.

The LSb (least significant bit) comes first in the Bytes. Fields with more than one byte length are encoded LSB (least significant byte) first.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | capture_pattern: Magic number for page start "OggS"           | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | version       | header_type   | granule_position              | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               | bitstream_serial_number       | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               | page_sequence_number          | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               | CRC_checksum                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               | page_segments | segment_table | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | ...                                                           | 28-
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
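Reading Mutagen's `'<4sBBqIIiB'` format string against that layout makes it a bit less scary: `<` means little-endian, and each letter after it is one field. Here's a rough sketch which unpacks the second page of the original file. The bytes are typed in from the dump above (the page starts at offset 0x2F), and the variable names are mine, taken from the diagram rather than from Mutagen.

```python
import struct

# Second Ogg page of the original file, starting at offset 0x2F in the dump.
# Hand-copied from the hex above; only the first 28 bytes are needed here.
page = bytes.fromhex(
    "4f676753"          # capture_pattern "OggS"
    "00"                # version
    "00"                # header_type
    "0000000000000000"  # granule_position
    "c93cfe03"          # bitstream_serial_number
    "01000000"          # page_sequence_number
    "35dfaf06"          # CRC_checksum
    "01"                # page_segments
    "9a"                # segment_table (a single entry: 0x9A = 154 bytes)
)

# '<' little-endian, '4s' four raw bytes, 'B' unsigned byte,
# 'q' signed 64-bit, 'I' unsigned 32-bit, 'i' signed 32-bit.
(capture, version, header_type, granule,
 serial, sequence, crc, segments) = struct.unpack("<4sBBqIIiB", page[:27])

# The segment table follows the 27 unpacked bytes: one length byte per segment.
segment_table = page[27:27 + segments]

print(capture, sequence, hex(crc), segments, list(segment_table))
# b'OggS' 1 0x6afdf35 1 [154]
```

The `i` means the checksum is read as a signed number, so it can come out negative in Python; `I` would read the same four bytes as unsigned.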

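So, it is the CRC checksum which is different. The second page starts at offset 0x2F, which puts its four checksum bytes at 0x45 to 0x48. The other changed byte, 0x4A, is the page's single segment table entry, which gives the length of the packet: 0x9A (154 bytes) in the original and 0x2F (47 bytes) once the comments are stripped. The Vorbis framing documentation has a brief description of how the CRC is calculated - but the full documentation 404s.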
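For the record, the Ogg framing RFC (RFC 3533) gives the generator polynomial as 0x04c11db7 and says the checksum covers the whole page with the CRC field set to zero. Below is a minimal, unoptimised sketch of that. I'm assuming an initial value of zero, no bit reflection and no final XOR, which is my understanding of what libogg does, and the function names are my own. It can't be checked against the files above, because the dumps only show the first 96 bytes of each.

```python
import struct

POLY = 0x04C11DB7  # generator polynomial quoted in RFC 3533


def ogg_crc(data: bytes) -> int:
    """Bitwise CRC-32, MSB first, initial value 0, no final XOR (assumed)."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def fix_page_crc(page: bytes) -> bytes:
    """Return a copy of one complete Ogg page with bytes 22-25 recomputed."""
    zeroed = page[:22] + b"\x00" * 4 + page[26:]
    return page[:22] + struct.pack("<I", ogg_crc(zeroed)) + page[26:]
```

After hand-editing a page, the idea would be to pass the whole page (header, segment table and data) through `fix_page_crc` before writing it back to the file.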

Conclusion

Hand editing binary files is for mugs.

Interesting Email Metadata

@edent — Thu, 24 Nov 2016 08:22:49 +0000

For many years, my email footer said "Sent via my Casio cPhone" - my attempt to poke fun at the users who hadn't updated their iPhone's default email signature.

This leads to an interesting question:

Because 2016 is maximum news, I'm sure there are some interesting stories based on email releases which have been missed. Metadata tells stories.

So, what metadata can we pick up from an email?

In GMail, it's quite easy to see all the raw data sent with an email as it travels through the Internet.

Let's take a look at some of the more interesting fields.

Here's an email that I've sent from my mobile - I've redacted some bits for my privacy.

Received: from [192.168.1.42] (oxfd.cable.virginm.net. [82.6.ZZZ.ZZZ])
 by smtp.gmail.com with ESMTPSA id l6sm9069017wmg.11.2016.10.08.09.37.57
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Sat, 08 Oct 2016 09:37:58 -0700 (PDT)

Well, first off we can see the sender's internal IP address. That gives us a little insight into their network topology. Of more interest is the sender's external IP address.

This can leak all sorts of interesting information. Location, service provider, connection speed - even ISP contract details in some cases.
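Pulling those two addresses out programmatically is not hard. Here's a rough sketch using Python's standard library. The header is a made-up one in the same shape as mine (the hostname and the external address are placeholders from ranges reserved for documentation), and both the regular expression and `classify` are my own guesses at what is useful rather than a robust parser.

```python
import ipaddress
import re

# Made-up header in the same shape as the one above (placeholder host and IP).
received = "from [192.168.1.42] (host.example.net. [203.0.113.7]) by smtp.gmail.com"

# The three RFC 1918 ranges used on home and office networks.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def classify(address: str) -> str:
    """Label an IPv4 address as 'internal' (RFC 1918) or 'external'."""
    ip = ipaddress.ip_address(address)
    return "internal" if any(ip in net for net in RFC1918) else "external"


# Assumes the "from [ip] (hostname [ip])" layout; other servers format this differently.
match = re.search(r"from \[([^\]]+)\] \((\S+) \[([^\]]+)\]\)", received)
if match:
    claimed, hostname, seen = match.groups()
    print(claimed, classify(claimed))      # address the client reported for itself
    print(seen, classify(seen), hostname)  # address the mail server saw
```

The external address is the one worth feeding into a GeoIP or WHOIS lookup; the internal one only tells you how the sender's local network is numbered.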

Let's suppose someone sends an email which says "Sorry, at home with the flu today." You check the IP address and find that they're connected to the WiFi at Disney World. Isn't that interesting...

A little further down the headers, we find (again, redacted)

Message-ID:

Oh ho! What do we have here? The Message-ID is a unique string, but most email clients end it with their own distinctive suffix (the part after the @).

This means, if you received this message from me, you could tell which email program I used and (possibly) which device.

So if I send you an email saying "sorry, my phone is broken" - you'll be able to tell if that's a lie.
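Here's a minimal sketch of that check. The message and its Message-ID are invented for illustration (`mailapp.example` is not a real client), and `message_id_suffix` is a helper name of my own.

```python
from email import message_from_string

# Invented message; the Message-ID is a placeholder, not one of mine.
raw = (
    "From: someone@example.com\n"
    "Subject: Sorry, my phone is broken\n"
    "Message-ID: <k9x2b7.1475944677@mailapp.example>\n"
    "\n"
    "Typing this on my laptop, honest.\n"
)


def message_id_suffix(raw_message: str) -> str:
    """Return the part of the Message-ID after the '@', without angle brackets."""
    msg = message_from_string(raw_message)
    message_id = msg.get("Message-ID", "").strip().strip("<>")
    return message_id.rpartition("@")[2]


print(message_id_suffix(raw))  # mailapp.example
```

Comparing that suffix with the ones on earlier emails from the same person shows whether they have switched client.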

There's another leak of client information at the multipart boundary

Content-Type: multipart/alternative; boundary="--_com.syntomo.email_596674815977850"

----_com.syntomo.email_596674815977850
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

SGVyZSB3ZSBnbyEg
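Python's `email` module will hand over both the boundary and the decoded body. A small sketch, using the fragment above plus a closing delimiter line that I've added so that it parses as a complete message. The regular expression for spotting a package-style name is my own rough guess.

```python
import re
from email import message_from_string

raw = (
    'Content-Type: multipart/alternative; '
    'boundary="--_com.syntomo.email_596674815977850"\n'
    "\n"
    "----_com.syntomo.email_596674815977850\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "SGVyZSB3ZSBnbyEg\n"
    # Closing delimiter added for this example so the message is complete.
    "----_com.syntomo.email_596674815977850--\n"
)

msg = message_from_string(raw)
boundary = msg.get_boundary()

# Look for something shaped like a reverse-DNS app identifier.
package = re.search(r"[a-z]+(?:\.[a-z]+)+", boundary)

# Decode the base64 body of the first (and only) part.
first_part = msg.get_payload()[0]
body = first_part.get_payload(decode=True).decode("utf-8")

print(boundary)
print(package.group(0) if package else None)
print(repr(body))
```

The boundary only needs to be a string that doesn't appear in the message body, so a client is free to use a random one. This one chose to put its name in.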

Much more

This brief blog post only scratches the surface of what can be found - and what you could do with the information.

Other "interesting" metadata includes:

User's Timezone - not as accurate as an IP address, but if their phone says they're at GMT+2 but they claim to be at GMT-7, is that interesting?
Reply threading - was this email originally a reply?
What language their equipment is set to. Some email headers contain Accept-Language: and Content-Language: information. Why is your "Urgent email from the FBI" sent from a computer that's set to Chinese?
Software versions - do the sender's servers have known vulnerabilities?
Operating System - is the sender's equipment up to date?
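The timezone check, for instance, needs nothing more than the `Date:` header. A minimal sketch with an invented header value; `claimed_offset_hours` stands for whatever the sender says about where they are, and is an assumption of this example rather than something found in the email.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Invented Date header; a real one comes from the raw message.
date_header = "Sat, 08 Oct 2016 18:37:58 +0200"

# Where the sender claims to be, as a UTC offset in hours (assumed for this example).
claimed_offset_hours = -7

sent = parsedate_to_datetime(date_header)
header_offset = sent.utcoffset()
claimed_offset = timedelta(hours=claimed_offset_hours)

if header_offset != claimed_offset:
    print(f"Header says UTC{sent:%z}, sender claims UTC{claimed_offset_hours:+03d}00")
```

A mismatch proves nothing on its own, as plenty of people never change the clock on their laptop when they travel, but it is one more thing to ask about.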

I'm sure there are several other pieces of information which could prove interesting.

Manipulation

This is not a cast iron investigative tool. It is possible for programs to mangle the metadata - either deliberately or not. Some people will take care to mask their email footprint, others will not.

Metadata is everywhere. While your emails are unlikely to get leaked to the press (I hope!) you should consider just how easy it is for a little white lie to be uncovered.

Sent from my iPhone.