<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.steven-braun.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://www.steven-braun.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-02-09T14:23:04+00:00</updated><id>https://www.steven-braun.com/feed.xml</id><title type="html">blank</title><subtitle>PhD in Machine Learning from TU Darmstadt (AIML Lab)</subtitle><entry><title type="html">Emacs Org Mode: org-emphasize-dwim</title><link href="https://www.steven-braun.com/blog/2024/emacs-org-mode-emphasize-dwim/" rel="alternate" type="text/html" title="Emacs Org Mode: org-emphasize-dwim"/><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://www.steven-braun.com/blog/2024/emacs-org-mode-emphasize-dwim</id><content type="html" xml:base="https://www.steven-braun.com/blog/2024/emacs-org-mode-emphasize-dwim/"><![CDATA[<p><img class="img-fluid rounded z-depth-0 center tiny mb-4" src="/assets/posts/2024-11-15-emacs-org-mode-emphasize-dwim/org-mode-unicorn.svg" data-zoomable=""/></p> <p>I use Emacs for programming, note-taking in org-mode, and scientific writing in LaTeX. Org-mode offers a simple function <code class="language-plaintext highlighter-rouge">(org-emphasize &amp;optional CHAR)</code>, which inserts an emphasis at a point or region and prompts for <code class="language-plaintext highlighter-rouge">CHAR</code> when called interactively. When I write notes or documentation in org-mode, the usual application of <code class="language-plaintext highlighter-rouge">org-emphasize</code> is to apply markup such as bold, italic, code, or strikethrough to one or multiple words. 
As a former vim user, I came to Emacs via the popular <a href="https://github.com/doomemacs/doomemacs">Doom Emacs</a> configuration framework, which <em>emphasizes</em> vim concepts wherever it can. Therefore, my application of <code class="language-plaintext highlighter-rouge">org-emphasize</code> to regions usually involves first selecting a region with vim motions. In the case of a single word, this boils down to <code class="language-plaintext highlighter-rouge">ysiw&lt;CHAR&gt;</code>. To quote tpope’s README for <a href="https://github.com/tpope/vim-surround"><code class="language-plaintext highlighter-rouge">surround.vim</code></a>: <em>It’s easiest to explain with examples</em>. Pressing <code class="language-plaintext highlighter-rouge">ysiw*</code> (<strong>y</strong>ou <strong>s</strong>urround <strong>i</strong>nner <strong>w</strong>ord) at cursor position <code class="language-plaintext highlighter-rouge">[ ]</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hello [W]orld!
</code></pre></div></div> <p>leads to</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hello *[W]orld*!
</code></pre></div></div> <p>Since I rely heavily on <code class="language-plaintext highlighter-rouge">localleader</code> (keybinding <code class="language-plaintext highlighter-rouge">,</code>) for Emacs major-mode functionality, I’m inclined to map essential functions to my localleader group. While I could have Emacs emulate normal-mode commands such as <code class="language-plaintext highlighter-rouge">ysiw*</code> for <strong>bold</strong>, <code class="language-plaintext highlighter-rouge">ysiw/</code> for <em>italic</em>, <code class="language-plaintext highlighter-rouge">ysiw=</code> for <code class="language-plaintext highlighter-rouge">code</code> and <code class="language-plaintext highlighter-rouge">ysiw+</code> for <del>strikethrough</del>, I think it is more elegant to introduce a <strong>DWIM</strong> wrapper around <code class="language-plaintext highlighter-rouge">org-emphasize</code>:</p> <div class="language-emacs-lisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">sbraun/org-emphasize-dwim</span> <span class="p">(</span><span class="nb">char</span><span class="p">)</span>
  <span class="s">"DWIM (Do What I Mean) wrapper for org-emphasize.
   If there's an active region, apply emphasis to it.
   Otherwise, apply emphasis to the word at point.
   CHAR is the emphasis character to use."</span>
  <span class="p">(</span><span class="nv">interactive</span> <span class="s">"cEmphasis character: "</span><span class="p">)</span>
  <span class="c1">;; Check if there is an active region (e.g., text is selected).</span>
  <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nv">use-region-p</span><span class="p">)</span>
      <span class="c1">;; If a region is active, apply emphasis to the selected region.</span>
      <span class="p">(</span><span class="nv">org-emphasize</span> <span class="nb">char</span><span class="p">)</span>
    <span class="c1">;; Otherwise, apply emphasis to the word at point.</span>
    <span class="p">(</span><span class="nv">save-excursion</span>
      <span class="c1">;; Find the boundaries of the word at point.</span>
      <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">bounds</span> <span class="p">(</span><span class="nv">bounds-of-thing-at-point</span> <span class="ss">'word</span><span class="p">)))</span>
        <span class="p">(</span><span class="nb">when</span> <span class="nv">bounds</span>
          <span class="p">(</span><span class="nv">goto-char</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">bounds</span><span class="p">))</span>
          <span class="p">(</span><span class="nv">set-mark</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">bounds</span><span class="p">))</span>
          <span class="c1">;; Apply emphasis to the selected word.</span>
          <span class="p">(</span><span class="nv">org-emphasize</span> <span class="nb">char</span><span class="p">)</span>
          <span class="p">(</span><span class="nv">deactivate-mark</span><span class="p">))))))</span>
</code></pre></div></div> <p>With this defined, I can add keybindings[1], such as</p> <div class="language-emacs-lisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">map!</span> <span class="ss">:localleader</span>
      <span class="ss">:map</span> <span class="nv">org-mode-map</span>
      <span class="p">(</span><span class="ss">:prefix</span> <span class="p">(</span><span class="s">"t"</span> <span class="s">"text markup"</span><span class="p">)</span>
               <span class="ss">:desc</span> <span class="s">"italic"</span> <span class="s">"i"</span> <span class="nf">#'</span><span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span> <span class="p">(</span><span class="nv">sbraun/org-emphasize-dwim</span> <span class="nv">?/</span><span class="p">))</span>
               <span class="ss">:desc</span> <span class="s">"bold"</span>   <span class="s">"b"</span> <span class="nf">#'</span><span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span> <span class="p">(</span><span class="nv">sbraun/org-emphasize-dwim</span> <span class="nv">?*</span><span class="p">))</span>
               <span class="ss">:desc</span> <span class="s">"code"</span>   <span class="s">"c"</span> <span class="nf">#'</span><span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span> <span class="p">(</span><span class="nv">sbraun/org-emphasize-dwim</span> <span class="nv">?=</span><span class="p">))</span>
               <span class="ss">:desc</span> <span class="s">"strike"</span> <span class="s">"s"</span> <span class="nf">#'</span><span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span> <span class="p">(</span><span class="nv">sbraun/org-emphasize-dwim</span> <span class="nv">?+</span><span class="p">))))</span>
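<span class="c1">;; Sketch for vanilla Emacs, where Doom's map! macro is unavailable.</span>
<span class="c1">;; Assumption: the key "C-c t b" is a hypothetical choice, not from the original post.</span>
<span class="p">(</span><span class="nv">with-eval-after-load</span> <span class="ss">'org</span>
  <span class="p">(</span><span class="nv">define-key</span> <span class="nv">org-mode-map</span> <span class="p">(</span><span class="nv">kbd</span> <span class="s">"C-c t b"</span><span class="p">)</span>
    <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span> <span class="p">(</span><span class="nv">sbraun/org-emphasize-dwim</span> <span class="nv">?*</span><span class="p">))))</span>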
</code></pre></div></div> <p>Back to the original example, pressing <code class="language-plaintext highlighter-rouge">,ti</code> on:</p> <pre><code class="language-org">Hello [W]orld!
</code></pre> <p>leads to</p> <pre><code class="language-org">Hello /[W]orld/!
</code></pre> <p>which saves me … <em>drum roll</em> … two keystrokes compared to <code class="language-plaintext highlighter-rouge">ysiw/</code> – yay, what a way to procrastinate.</p> <p>[1] The <code class="language-plaintext highlighter-rouge">map!</code> syntax is specific to Doom Emacs, see also: <a href="https://github.com/doomemacs/doomemacs/blob/7bc39f2c1402794e76ea10b781dfe586fed7253b/docs/getting_started.org#binding-keys">binding-keys</a>.</p>]]></content><author><name>Steven Braun</name></author><summary type="html"><![CDATA[Add org-emphasize-dwim, which wraps org-emphasize and applies it either to the region or to the word at point.]]></summary></entry><entry><title type="html">Tiling macOS: Moving from i3wm to Yabai</title><link href="https://www.steven-braun.com/blog/2022/i3wm-to-yabai/" rel="alternate" type="text/html" title="Tiling macOS: Moving from i3wm to Yabai"/><published>2022-10-06T00:00:00+00:00</published><updated>2022-10-06T00:00:00+00:00</updated><id>https://www.steven-braun.com/blog/2022/i3wm-to-yabai</id><content type="html" xml:base="https://www.steven-braun.com/blog/2022/i3wm-to-yabai/"><![CDATA[<p><img class="img-fluid rounded z-depth-2 center normal mb-4" src="/assets/posts/2022-10-06-i3wm-to-yabai/yabai-screenshot.png" data-zoomable=""/></p> <p>After using Linux for almost a decade, I’ve finally gotten annoyed at all the little hiccups and issues that arise from time to time when working in Linux. ArchLinux has taught me more than anything else about the Linux world, its bleeding-edge character, and the issues that come along with it. This brought me to Fedora Linux about two years ago. While more stable in general, even Fedora has its sharp edges here and there. I’ve experienced issues with Bluetooth, audio sinks and sources, printers, and more on a daily to weekly basis. With less and less time available due to my research, constantly tinkering with my system was no longer an option. 
Therefore, I’ve decided to finally ditch Linux and give macOS a try.</p> <p>My Linux workflow was mainly keyboard-driven, using i3wm as a tiling window manager, Emacs as a programming IDE and as a note-taking tool with <a href="https://orgmode.org">Org mode</a> and <a href="https://www.orgroam.com">Org Roam</a>, and basically <em>living</em> in the terminal. Therefore, my first goal when switching to macOS was to replicate as much as possible of this exact workflow. While (Doom) Emacs worked basically out of the box using the <code class="language-plaintext highlighter-rouge">emacs-mac</code> build (<code class="language-plaintext highlighter-rouge">brew tap railwaycat/emacsmacport &amp;&amp; brew install emacs-mac --with-modules</code>) and my Zsh configuration worked without any major changes, finding a workable replacement for i3wm was a much harder task.</p> <h2 id="yabai-and-skhd-to-the-rescue">Yabai and skhd to the Rescue!</h2> <p>After fiddling around for a few days, I’ve settled on a setup that works really well for me: <a href="https://github.com/koekeishiya/yabai">yabai</a> as a tiling window manager and <a href="https://github.com/koekeishiya/skhd">skhd</a> to define keyboard shortcuts that perform <code class="language-plaintext highlighter-rouge">yabai</code> (and some other) commands, replicating most of the functionality that is available in i3wm. In the following, I will go through my yabai and skhd setup and explain how it can replicate the classic i3wm behavior. 
The yabai and skhd commands and settings shown in the examples usually go into their respective configuration files at <code class="language-plaintext highlighter-rouge">~/.config/yabai/yabairc</code> and <code class="language-plaintext highlighter-rouge">~/.config/skhd/skhdrc</code>.</p> <h3 id="open-terminal">Open Terminal</h3> <p>For a terminal-focused workflow, it was important to me to have a quick and simple way to open a new terminal instance bound to my preferred shortcut <code class="language-plaintext highlighter-rouge">cmd - return</code>. <a href="https://sw.kovidgoyal.net/kitty/">Kitty</a> allows this via the <code class="language-plaintext highlighter-rouge">kitty --single-instance -d ~</code> command. With skhd, we can map <code class="language-plaintext highlighter-rouge">cmd - return</code> to this exact call:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - <span class="k">return</span> : kitty <span class="nt">--single-instance</span> <span class="nt">-d</span> ~
</code></pre></div></div> <p>If you prefer <a href="https://iterm2.com">iTerm2</a> over kitty, we can quickly start a new iTerm2 session (as long as there is at least one iTerm2 window already running) with</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - <span class="k">return</span> : osascript <span class="nt">-e</span> <span class="s2">"tell application </span><span class="se">\"</span><span class="s2">iTerm2</span><span class="se">\"</span><span class="s2"> to set newSession to create window with default profile end tell"</span>
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/open-terminal.mp4" type="video/mp4"/> </video> <h3 id="close-window">Close Window</h3> <p>To quickly close windows, I map <code class="language-plaintext highlighter-rouge">cmd - q</code> to the specific yabai command:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - q : yabai <span class="nt">-m</span> window <span class="nt">--close</span>
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/close-window.mp4" type="video/mp4"/> </video> <h3 id="window-focus">Window Focus</h3> <p>Window management in yabai turns out to be pretty similar to i3wm in practice. Yabai allows compass-like focus commands with <code class="language-plaintext highlighter-rouge">yabai -m window --focus &lt;direction&gt;</code>. I typically use the vim-like keys and bind <code class="language-plaintext highlighter-rouge">h/j/k/l</code> to <code class="language-plaintext highlighter-rouge">west/south/north/east</code> respectively as follows:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - h : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> west
cmd - j : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> south
cmd - k : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> north
cmd - l : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> east
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/focus-window.mp4" type="video/mp4"/> </video> <p>If you have multiple displays, say next to each other, you can add an alternative command via <code class="language-plaintext highlighter-rouge">||</code> if the first command fails. That means if the focus is currently on the east-most window, and we call <code class="language-plaintext highlighter-rouge">yabai -m window --focus east</code>, but there is another display right of your current display, the following will handle switching the display as well:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - h : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> west <span class="o">||</span> yabai <span class="nt">-m</span> display <span class="nt">--focus</span> west
cmd - l : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> east <span class="o">||</span> yabai <span class="nt">-m</span> display <span class="nt">--focus</span> east
</code></pre></div></div> <p>Similarly, you can switch stacks conditionally, i.e., first try if you can focus the next or previous window in the current stack and if that fails, conditionally focus the next window south/north:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - j : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> stack.next <span class="o">||</span> yabai <span class="nt">-m</span> window <span class="nt">--focus</span> south
cmd - k : yabai <span class="nt">-m</span> window <span class="nt">--focus</span> stack.prev <span class="o">||</span> yabai <span class="nt">-m</span> window <span class="nt">--focus</span> north
</code></pre></div></div> <h3 id="move-windows">Move Windows</h3> <p>Similarly, windows can be moved (with my preferred keybinding <code class="language-plaintext highlighter-rouge">cmd + shift - h/j/k/l</code>):</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd + <span class="nb">shift</span> - h : yabai <span class="nt">-m</span> window <span class="nt">--warp</span> west <span class="o">||</span> yabai <span class="nt">-m</span> window <span class="nt">--display</span> west
cmd + <span class="nb">shift</span> - l : yabai <span class="nt">-m</span> window <span class="nt">--warp</span> east <span class="o">||</span> yabai <span class="nt">-m</span> window <span class="nt">--display</span> east
cmd + <span class="nb">shift</span> - j : yabai <span class="nt">-m</span> window <span class="nt">--warp</span> south 
cmd + <span class="nb">shift</span> - k : yabai <span class="nt">-m</span> window <span class="nt">--warp</span> north
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/move-window.mp4" type="video/mp4"/> </video> <h3 id="spaces">Spaces</h3> <p>The macOS equivalent of i3wm workspaces is “Desktops”, which yabai calls spaces. These can be focused in yabai via the <code class="language-plaintext highlighter-rouge">yabai -m space --focus &lt;label&gt;</code> command, where <code class="language-plaintext highlighter-rouge">&lt;label&gt;</code> is a tag you assign in your <code class="language-plaintext highlighter-rouge">yabairc</code> file:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yabai <span class="nt">-m</span> space 1 <span class="nt">--label</span> 1
yabai <span class="nt">-m</span> space 2 <span class="nt">--label</span> 2
yabai <span class="nt">-m</span> space 3 <span class="nt">--label</span> 3
yabai <span class="nt">-m</span> space 4 <span class="nt">--label</span> 4
yabai <span class="nt">-m</span> space 5 <span class="nt">--label</span> 5
yabai <span class="nt">-m</span> space 6 <span class="nt">--label</span> 6
yabai <span class="nt">-m</span> space 7 <span class="nt">--label</span> 7
yabai <span class="nt">-m</span> space 8 <span class="nt">--label</span> 8
yabai <span class="nt">-m</span> space 9 <span class="nt">--label</span> 9
yabai <span class="nt">-m</span> space 10 <span class="nt">--label</span> 10
</code></pre></div></div> <p>Then in <code class="language-plaintext highlighter-rouge">skhdrc</code>, you can use these labels to focus a particular space. Additionally, to simulate the <code class="language-plaintext highlighter-rouge">workspace_auto_back_and_forth yes</code> setting of i3wm, we can append the command to focus the most recent space if you press the keybinding for the same space again:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd - 1 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 1 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 2 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 2 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 3 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 3 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 4 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 4 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 5 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 5 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 6 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 6 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 7 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 7 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 8 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 8 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
cmd - 9 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 9 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent
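<span class="c"># Assumption (not in the original config): the 0 key targets the space labelled 10 in yabairc.</span>
cmd - 0 : yabai <span class="nt">-m</span> space <span class="nt">--focus</span> 10 <span class="o">||</span> yabai <span class="nt">-m</span> space <span class="nt">--focus</span> recent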
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/focus-space.mp4" type="video/mp4"/> </video> <p>Similar to moving windows around in a specific space, I bind <code class="language-plaintext highlighter-rouge">cmd + shift - &lt;label&gt;</code> to moving a window to a particular space:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd + <span class="nb">shift</span> - 1 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 1
cmd + <span class="nb">shift</span> - 2 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 2
cmd + <span class="nb">shift</span> - 3 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 3
cmd + <span class="nb">shift</span> - 4 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 4
cmd + <span class="nb">shift</span> - 5 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 5
cmd + <span class="nb">shift</span> - 6 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 6
cmd + <span class="nb">shift</span> - 7 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 7
cmd + <span class="nb">shift</span> - 8 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 8
cmd + <span class="nb">shift</span> - 9 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 9
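<span class="c"># Assumption (not in the original config): the 0 key sends the window to the space labelled 10.</span>
cmd + <span class="nb">shift</span> - 0 : yabai <span class="nt">-m</span> window <span class="nt">--space</span> 10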
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/move-window-to-space.mp4" type="video/mp4"/> </video> <h3 id="spacebar">Spacebar</h3> <p>As an i3bar replacement, there are several options:</p> <ul> <li><a href="https://github.com/Jean-Tinland/simple-bar">simple-bar</a>: An Übersicht widget, very customizable.</li> <li><a href="https://github.com/cmacrae/spacebar">spacebar</a>: A standalone bar application.</li> </ul> <p>To reserve some space in yabai for the bar, you need to set the <code class="language-plaintext highlighter-rouge">external_bar</code> option in yabai, which takes a display selector followed by the top and bottom padding (here: all displays, no top padding, and a bottom padding of 24):</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yabai <span class="nt">-m</span> config external_bar all:0:24
</code></pre></div></div> <h3 id="floating-window-settings">Floating Window Settings</h3> <p>Some windows are just not worth tiling, and you may collect more of those over time. For this, yabai lets you add rules that disable window management for specific apps or for windows with specific titles, so that they float instead:</p> <div class="language-zsh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">title</span><span class="o">=</span><span class="s1">'Settings$'</span> <span class="nv">manage</span><span class="o">=</span>off
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">app</span><span class="o">=</span><span class="s2">"^System Preferences$"</span> <span class="nv">manage</span><span class="o">=</span>off
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">app</span><span class="o">=</span><span class="s2">"^System Information$"</span> <span class="nv">manage</span><span class="o">=</span>off
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">title</span><span class="o">=</span><span class="s2">"^Preferences$"</span> <span class="nv">manage</span><span class="o">=</span>off
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">title</span><span class="o">=</span><span class="s2">"^Digital Colour Meter$"</span> <span class="nv">manage</span><span class="o">=</span>off
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">title</span><span class="o">=</span><span class="s2">"^General.*"</span> <span class="nv">manage</span><span class="o">=</span>off
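<span class="c"># Assumption: Calculator is a hypothetical extra app, not part of the original config.</span>
<span class="c"># Anchoring the pattern with ^ and $ matches the app name exactly.</span>
yabai <span class="nt">-m</span> rule <span class="nt">--add</span> <span class="nv">app</span><span class="o">=</span><span class="s2">"^Calculator$"</span> <span class="nv">manage</span><span class="o">=</span>off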
</code></pre></div></div> <video class="video mt-4 mb-4 z-depth-2" controls=""> <source src="/assets/posts/2022-10-06-i3wm-to-yabai/floating-window.mp4" type="video/mp4"/> </video> <h2 id="additional-resources">Additional Resources</h2> <p>My dotfiles are available <a href="https://github.com/braun-steven/dotfiles">here</a>, including my old <a href="https://github.com/braun-steven/dotfiles/blob/master/configs/i3/.config/i3/config">i3wm config</a>, my new <a href="https://github.com/braun-steven/dotfiles/blob/master/configs/yabai/.config/yabai/yabairc">yabai config</a>, and the <a href="https://github.com/braun-steven/dotfiles/blob/master/configs/skhd/.config/skhd/skhdrc">skhd config</a>.</p> <p>The yabai GitHub repository also hosts a great <a href="https://github.com/koekeishiya/yabai/wiki">Wiki</a> that covers everything from installation to configuration.</p>]]></content><author><name>Steven Braun</name></author><summary type="html"><![CDATA[A short summary of how one can replicate the typical Linux i3wm experience on macOS with yabai and skhd.]]></summary></entry><entry><title type="html">Optimizing Matplotlib Visualizations for Academic Papers</title><link href="https://www.steven-braun.com/blog/2021/matplotlib-viz/" rel="alternate" type="text/html" title="Optimizing Matplotlib Visualizations for Academic Papers"/><published>2021-10-27T00:00:00+00:00</published><updated>2021-10-27T00:00:00+00:00</updated><id>https://www.steven-braun.com/blog/2021/matplotlib-viz</id><content type="html" xml:base="https://www.steven-braun.com/blog/2021/matplotlib-viz/"><![CDATA[<p><img class="img-fluid rounded z-depth-0 center small" src="/assets/posts/2021-09-29-matplotlib-viz/featured.png" data-zoomable=""/></p> <p>Without further ado, let’s start with a matplotlib example. Let’s say we want to visualize the <a href="https://arxiv.org/abs/1708.02002">Focal Loss</a> objective for different values of \(\gamma\) w.r.t. 
the probability of the ground-truth class \(p_c^t\):</p> <p><img class="img-fluid center small" src="/assets/posts/2021-09-29-matplotlib-viz/base.png" data-zoomable=""/></p> <p>If we were to include this figure directly into some LaTeX document, it would look like this:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\documentclass</span><span class="na">[twocolumn]</span><span class="p">{</span>article<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>blindtext<span class="p">}</span>  <span class="c">% Lorem ipsum filler</span>
<span class="k">\usepackage</span><span class="p">{</span>graphicx<span class="p">}</span>  <span class="c">% includegraphics</span>
<span class="nt">\begin{document}</span>

<span class="nt">\begin{figure}</span>
  <span class="k">\centering</span>
  <span class="k">\includegraphics</span><span class="na">[width=\linewidth]</span><span class="p">{</span>./base.png<span class="p">}</span>
  <span class="k">\caption</span><span class="p">{</span>Lorem ipsum dolor sit amet, consectetuer adipisc-
ing elit. Etiam lobortis facilisis sem. Nullam nec mi et neque
pharetra sollicitudin. Praesent imperdiet mi nec ante. Donec
ullamcorper, felis non sodales commodo, lectus velit ultrices
augue, a dignissim nibh lectus placerat pede.<span class="p">}</span>
<span class="nt">\end{figure}</span>

<span class="c">% Some lipsum filler</span>
<span class="k">\Blindtext</span>
<span class="k">\Blindtext</span>
<span class="nt">\end{document}</span>
</code></pre></div></div> <p><img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-29-matplotlib-viz/base-latex.png" data-zoomable=""/></p> <p>Now, there are a few things that bug me:</p> <ol> <li>The sans-serif font used in the figure stands in contrast to the serif font used in LaTeX documents</li> <li>The figure font size is smaller than the document font size</li> <li>The axis grid is missing</li> </ol> <p>Note that point 2 can be okay if you need to save space and have multiple figures next to each other. If you have enough space, you should always ensure the same font size for all your text (including figures and tables). Furthermore, the grid from point 3 can be omitted if the exact data/values are not important. For everything else, axis grids are an easy, non-intrusive hint that lets the reader quickly compare values.</p> <p>The easiest way to accomplish the above is to use the popular <a href="https://github.com/garrettj403/SciencePlots">SciencePlots</a> Python package. It offers multiple matplotlib styles, which you can enable (after an <code class="language-plaintext highlighter-rouge">import scienceplots</code> in recent versions of the package) via:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="nf">use</span><span class="p">([</span><span class="sh">"</span><span class="s">science</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">grid</span><span class="sh">"</span><span class="p">])</span>
</code></pre></div></div> <p>The resulting image then includes a grid, a larger font size, and a serif font.</p> <p><img class="img-fluid center small" src="/assets/posts/2021-09-29-matplotlib-viz/sciplots.png" data-zoomable=""/></p> <p>To go one step further, one can adjust the legend box frame to look consistent with the axis frame, i.e., use <code class="language-plaintext highlighter-rouge">black</code> as the color and set the linewidth to <code class="language-plaintext highlighter-rouge">0.5</code>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">legend</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="nf">legend</span><span class="p">(</span><span class="n">fancybox</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="sh">"</span><span class="s">black</span><span class="sh">"</span><span class="p">)</span>
<span class="n">legend</span><span class="p">.</span><span class="nf">get_frame</span><span class="p">().</span><span class="nf">set_linewidth</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
</code></pre></div></div> <p><img class="img-fluid center small" src="/assets/posts/2021-09-29-matplotlib-viz/legend.png" data-zoomable=""/></p> <h3 id="figure-size">Figure Size</h3> <p>It is helpful to adjust the figure size to the actual size available in your LaTeX document. We can find the length of <code class="language-plaintext highlighter-rouge">\textwidth</code> by adding the following statement, which relies on macros from the <code class="language-plaintext highlighter-rouge">layouts</code> package, somewhere in the document body:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\printinunitsof</span><span class="p">{</span>in<span class="p">}</span><span class="k">\prntlen</span><span class="p">{</span><span class="k">\textwidth</span><span class="p">}</span>
</code></pre></div></div> <p>This prints the value of <code class="language-plaintext highlighter-rouge">\textwidth</code> in inches at the position where the statement is placed in the document. For the example case with a <code class="language-plaintext highlighter-rouge">twocolumn</code> article class, this returns <code class="language-plaintext highlighter-rouge">3.31314</code> inches. We now make the figure size relative to this base measure: the width is the text width multiplied by an optional scaling factor (for smaller or larger figures), and the height follows from the width via a fixed aspect ratio:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">textwidth</span> <span class="o">=</span> <span class="mf">3.31314</span>
<span class="n">aspect_ratio</span> <span class="o">=</span> <span class="mi">6</span><span class="o">/</span><span class="mi">8</span>
<span class="n">scale</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">width</span> <span class="o">=</span> <span class="n">textwidth</span> <span class="o">*</span> <span class="n">scale</span>
<span class="n">height</span> <span class="o">=</span> <span class="n">width</span> <span class="o">*</span> <span class="n">aspect_ratio</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="nf">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">))</span>
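
# Optional helper, added here as a sketch and not part of the original script:
# it wraps the arithmetic above so that several figures can share one text width.
# The name figure_size and its parameters are assumptions made for illustration.
def figure_size(textwidth_in, scale=1.0, aspect_ratio=6 / 8):
    """Return (width, height) in inches for a figure spanning scale * textwidth_in."""
    width = textwidth_in * scale
    return width, width * aspect_ratio

# Example: the size of a half-width figure in the same layout.
# It could be passed on via plt.figure(figsize=(half_width, half_height)).
half_width, half_height = figure_size(3.31314, scale=0.5)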
</code></pre></div></div> <h3 id="pgf-outputs">PGF Outputs</h3> <p>If we export the figure as a <code class="language-plaintext highlighter-rouge">.png</code> (or even worse: <code class="language-plaintext highlighter-rouge">.jpg</code>) file, the resulting visualization is rasterized and has a fixed height and width. Depending on the image size, this can lead to pixelated results when the reader enlarges the figure for a detailed inspection. Instead, we can export the figure either as a <code class="language-plaintext highlighter-rouge">.pdf</code> file or, even better, in the <code class="language-plaintext highlighter-rouge">.pgf</code> (Portable Graphics Format) format. The big advantage of <code class="language-plaintext highlighter-rouge">.pgf</code> over <code class="language-plaintext highlighter-rouge">.pdf</code> is that <code class="language-plaintext highlighter-rouge">.pgf</code> has no embedded fonts and only contains instructions that tell LaTeX how to draw the figure. As a result, all text in the figure is typeset with the very same font as the document text itself.</p> <p>We can enable the <code class="language-plaintext highlighter-rouge">pgf</code> backend in matplotlib with the following Python preamble:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">matplotlib</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="nf">use</span><span class="p">(</span><span class="sh">"</span><span class="s">pgf</span><span class="sh">"</span><span class="p">)</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="n">rcParams</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">pgf.texsystem</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">pdflatex</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">font.family</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">serif</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">text.usetex</span><span class="sh">"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">pgf.rcfonts</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div> <p>Now we save the figure in the <code class="language-plaintext highlighter-rouge">.pgf</code> format instead of the <code class="language-plaintext highlighter-rouge">.png</code> format.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- plt.savefig("fig.png")
</span><span class="gi">+ plt.savefig("fig.pgf")
</span></code></pre></div></div> <p>In LaTeX, support for compiling <code class="language-plaintext highlighter-rouge">.pgf</code> files is provided by the <code class="language-plaintext highlighter-rouge">pgfplots</code> package:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>pgfplots<span class="p">}</span>
</code></pre></div></div> <p>Finally, replace the <code class="language-plaintext highlighter-rouge">\includegraphics</code> statement with an <code class="language-plaintext highlighter-rouge">\input</code> statement as follows:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- \includegraphics[width=\linewidth]{fig.png}
</span><span class="gi">+ \input{fig.pgf}
</span></code></pre></div></div> <h3 id="final-result">Final Result</h3> <p>With this, we have addressed all issues pointed out earlier on. So let’s compare this directly in the resulting LaTeX output PDF, before (left) and after (right):</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-29-matplotlib-viz/base-latex.png" data-zoomable=""/> </div> <div class="col-sm mt-3 mt-md-0"> <img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-29-matplotlib-viz/final-latex.png" data-zoomable=""/> </div> </div> <div class="caption"> Before (left) and after (right) comparison when modifying a matplotlib figure to visually fit into a LaTeX document. We have now fixed the figure dimensions, text size and font, legend borders, and used the pgf format. </div> <p>The updated figure now looks more polished and visually fits into the context of the LaTeX document with higher consistency. For reference, you can find the python script that generated all above figures <a href="/assets/posts/2021-09-29-matplotlib-viz/figures.py">here</a> and the LaTeX document <a href="/assets/posts/2021-09-29-matplotlib-viz/main.tex">here</a>.</p>]]></content><author><name>Steven Braun</name></author><summary type="html"><![CDATA[In this post we will see few tricks to polish matplotlib figures, making them ready for inclusion in academic papers, i.e. 
LaTeX generated documents.]]></summary></entry><entry><title type="html">A Short Review of Axis-Aligned and Oriented Object Detection</title><link href="https://www.steven-braun.com/blog/2021/object-detection/" rel="alternate" type="text/html" title="A Short Review of Axis-Aligned and Oriented Object Detection"/><published>2021-09-21T00:00:00+00:00</published><updated>2021-09-21T00:00:00+00:00</updated><id>https://www.steven-braun.com/blog/2021/object-detection</id><content type="html" xml:base="https://www.steven-braun.com/blog/2021/object-detection/"><![CDATA[<p>This post is going to give a brief introduction to deep models, the history of object detection ranging from classic methods based on hand-crafted features to the latest deep learning object detectors, object detection datasets, and object detection evaluation metrics.</p> <h2 id="deep-models-preface">Deep Models Preface</h2> <p>The construction of well-performing deep models in complex computer vision tasks is often two-fold. The first step is to find a model architecture defined by a directed computation graph \(G = \left( V, E \right)\) connecting model inputs \(\left\{ \boldsymbol{X}_{0}, \dots, \boldsymbol{X}_{K_{\mathrm{in}}-1} \right\}\), with \(\boldsymbol{X}_{k} \in \mathbb{R}^{d_{0} \times \dots \times d_{D_{k}-1}}\), via nodes \(v_{i} \in V\) to model outputs \(\left\{ \boldsymbol{Y}_{0}, \dots, \boldsymbol{Y}_{K_{\mathrm{out}}-1} \right\}\) with \(\boldsymbol{Y}_{k} \in \mathbb{R}^{d_{0} \times \dots \times d_{D_{k}-1}}\). Each node \(v_{i}\) in the graph represents an operation \(f\) performed on one or more inputs which in turn generates one or more outputs. The operation can be arbitrary as long as it is differentiable w.r.t. each of its inputs, i.e.</p> \[\frac{\partial f \left( \boldsymbol{X}_{0}, \dots, \boldsymbol{X}_{K_{\mathrm{in}}-1} \right) }{\partial \boldsymbol{X}_{i}},\quad 0 \leq i \leq K_{\mathrm{in}} - 1\] <p>exists. These operations can be mainly divided into two groups. 
The first group consists of parametric operations, i.e. the operations’ output additionally depends on a set of weights that are adjustable during the optimization step. Prime examples for this group are fully connected layers, which are implemented as affine transformations:</p> \[f_{\text{linear}} \left( \boldsymbol{x} ; \boldsymbol{W}\right) = \boldsymbol{x} \cdot \boldsymbol{W}, \quad \boldsymbol{x} \in \mathbb{R}^{D_{\mathrm{in}}}, \quad \boldsymbol{W} \in \mathbb{R}^{D_{\mathrm{in}} \times D_{\mathrm{out}}} ,\] <p>where \(\boldsymbol{W}\) is the weight matrix (with an implicit bias encoding) that maps the input from \(\mathbb{R}^{D_{\mathrm{in}}}\) to \(\mathbb{R}^{D_{\mathrm{out}}}\). A second prime example are convolution layers, which implement the convolution operation (which is actually a cross-correlation) with a weight window, also called kernel map, \(\boldsymbol{W} \in \mathbb{R}^{K_{H} \times K_{W}}\), with \(K_{H}\) and \(K_{W}\) odd, over an input \(\boldsymbol{X} \in \mathbb{R}^{H \times W}\):</p> \[f_{\text{conv2d}, m, n} \left( \boldsymbol{X} ; \boldsymbol{W} \right) = \sum_{i = - \frac{K_{H}-1}{2}}^{\frac{K_{H}-1}{2}} \sum_{j = - \frac{K_{W}-1}{2}}^{\frac{K_{W}-1}{2}} W_{\frac{K_{H}-1}{2} + i, \frac{K_{W} - 1}{2} + j} X_{m+i, n+j} \quad .\] <p>Note that this operation in particular is the 2D convolution often used in image processing, which is only one of many possible convolution operations. Other convolution operations that are commonly used include 1D and 3D convolutions, as well as convolutions with stride, convolutions with dilation, depth-wise convolutions, and separable convolutions. Parametric operations must additionally be differentiable w.r.t. their weights, i.e. \(\partial f \left( \boldsymbol{X}_{0}, \dots, \boldsymbol{X}_{K_{\mathrm{in}} - 1} ; \boldsymbol{W} \right) / \partial \boldsymbol{W}\) exists.</p> <p>The second group is formed by non-parametric operations, that is, operations that do not include any learnable weights. 
Common examples of those are activation functions such as</p> \[\begin{aligned} \text{sigmoid} \left( x \right) &amp;= \frac{1}{1 + e^{-x}} \\ \text{tanh} \left( x \right) &amp;= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \\ \text{ReLU} \left( x \right) &amp;= \text{max}(0, x) \end{aligned}\] <p>which are usually used after affine transformations to achieve non-linearity. Other common non-parametric operations are pooling, normalization (although some normalization techniques such as Batch Normalization <d-cite key="ioffe2015batchnorm"></d-cite> do have learnable parameters), dropout <d-cite key="hinton2014dropout"></d-cite>, and softmax.</p> <p>The second important step in the development of deep models is the choice of a proper loss function. The loss function \(L\), also called a cost function, measures the cost that a model \(f\) incurs when making a decision or taking an action. The goal for the model is to minimize the loss value \(L( \boldsymbol{x}, \boldsymbol{y}, f)\) for some input \(\boldsymbol{x}\) and the a priori known ground-truth target \(\boldsymbol{y}\). This is achieved by computing the gradient of the loss w.r.t. the weights at each node in the computation graph using backpropagation, and then updating the weights by taking a step in the negative gradient direction. Simple machine learning tasks use single-term loss functions like the cross-entropy loss for classification or some form of distance metric such as the mean-squared-error for regression. Other tasks may require multiple objectives, such as regressing the coordinates of a bounding box and classifying the object inside that box in object detection. 
Therefore, loss functions can also be composed of multiple objectives, where each objective \(i\) is represented by its loss function \(L_{i}\), weighted by \(\lambda_{i} \in \mathbb{R}^{+}\):</p> \[L \left( \boldsymbol{x}, \boldsymbol{y}, f \right) = \sum_{i} \lambda_{i} L_{i} \left(\boldsymbol{x}, \boldsymbol{y}, f \right) .\] <p>It is common to include loss terms that are independent of the ground-truth target, such as weight decay <d-cite key="krogh1991weightdecay"></d-cite>, which regularizes the weights and encourages the model to keep them small, as well as the gradient penalty <d-cite key="gulrajani1027gradientpenalty"></d-cite>, which penalizes the norm of the gradients w.r.t. the inputs and is commonly found in successful Generative Adversarial Network <d-cite key="Goodfellow2014"></d-cite> architectures.</p> <h2 id="sec:background:evaluation">Quantification of Object Detection Performance</h2> <p>Before we dive into the methodology of object detection, the following briefly describes common datasets, as well as the de facto standard metric in the field of object detection, the <em>mean Average Precision</em> (mAP).</p> <h3 id="sec:background:evaluation:datasets">Datasets</h3> <p>In computer vision and machine learning in general, the quality of the data which is used to train a model is of utmost importance. The following section lists common datasets for horizontal object detection and oriented object detection.</p> <h4 id="sec:background:evaluation:datasets:horiz-obj-det">Horizontal Object Detection</h4> <h5 id="pascal-voc">PASCAL VOC</h5> <p>The PASCAL Visual Object Classes (VOC) Challenges<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> <d-cite key="pascal-voc"></d-cite> (2005 – 2012) include multiple tasks such as image classification, object detection, semantic segmentation, and action detection. 
The two most prominent datasets in use for object detection evaluation are VOC07 with 10k training images and 25k annotated objects, and VOC12 with 12k training images and 27k annotated objects. Both datasets contain 20 different classes which are common in everyday life situations such as persons, animals, vehicles, and indoor objects.</p> <h5 id="ilsvrc">ILSVRC</h5> <p>The ImageNet Large Scale Visual Recognition Challenge<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> <d-cite key="ILSVRC15"></d-cite> (2010 – 2017) is an object detection challenge based on the ImageNet <d-cite key="imagenet_cvpr09"></d-cite> dataset. It contains 200 classes and 517k images with 534k annotated objects, which makes it more than an order of magnitude larger than VOC.</p> <h5 id="coco">COCO</h5> <p>The Common Objects in Context (COCO)<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> <d-cite key="mscoco"></d-cite> (2015 – 2019) is a large-scale object detection, segmentation, and captioning dataset with 80 object categories in 164k images (COCO17) and 897k annotated objects. Before the Open Images Detection challenge (see below), COCO was the most challenging object detection dataset since it contains more object instances per image and more small objects (with a relative image area below 1%), as well as more densely located objects than VOC and ILSVRC.</p> <h5 id="oid">OID</h5> <p>The Open Images Detection (OID) challenge<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> <d-cite key="OpenImages"></d-cite> (2016 – 2020) released the largest object detection dataset to date in 2018, consisting of 1.9M images with 16M annotations across 600 object categories. 
Due to the dataset being relatively new, only very few papers publish evaluations for OID.</p> <h4 id="sec:background:evaluation:datasets:oriented-obj-det">Oriented Object Detection</h4> <p>The task of oriented object detection requires ground-truth orientation labels for each bounding box. For the above-mentioned datasets, these can only be obtained by applying a minimum-bounding-rectangle algorithm to the convex hull of the segmentation map of each object. Alternatively, dedicated oriented object detection datasets have been gathered, as listed below.</p> <h5 id="dota">DOTA</h5> <p>The Dataset for Object Detection in Aerial Images (DOTA) <d-cite key="Xia_2018_CVPR"></d-cite> was released as part of the Object Detection in Aerial Images (ODAI)<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> challenge in early 2018. The first version (1.0) of the dataset comprises a total of 2806 images collected from Google Earth, which vary in size between \(800 \times 800\) and \(4000 \times 4000\) pixels and contain 188k annotated objects in total. Each object instance annotation consists of an arbitrary quadrilateral, i.e. 8 degrees of freedom (four pairs of \(x\)- and \(y\)-coordinates), as well as one label from a set of 15 possible object categories. While all publications on oriented object detection evaluate on version 1.0 of DOTA, the dataset authors have additionally published version 1.5 which introduces an additional class and increases the number of annotations on the existing image base to 403k.</p> <h5 id="hrsc2016">HRSC2016</h5> <p>The High Resolution Ship Collection (HRSC)<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> <d-cite key="hrsc2016"></d-cite> dataset was collected from Google Earth and consists of 1.7k satellite images. Each image can contain multiple ships and each ship is annotated with a 5-tuple describing the pixel location, width, height, and rotation angle. 
Additionally, each ship is annotated with a label for the ship class, a specific category, and a ship type.</p> <h5 id="icdar">ICDAR</h5> <p>The International Conference on Document Analysis and Recognition (ICDAR)<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup> offers the ICDAR2015 <d-cite key="icdar15"></d-cite> challenge on incidental scene text detection containing 1.7k everyday scene images. Each text instance is annotated with a quadrilateral (8 degrees of freedom) specifying the arbitrary bounding box and the actual text content.</p> <h5 id="fddb">FDDB</h5> <p>The Face Detection Data Set and Benchmark (FDDB)<sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">8</a></sup> <d-cite key="fddbTech"></d-cite> is a dataset of faces, designed to study the problem of unconstrained face detection. The annotations consist of 5.2k faces in 2.8k images, where each instance is described by a 5-tuple specifying a rotated ellipse.</p> <h3 id="sec:background:evaluation-metric">Measuring Detection Accuracy</h3> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/iou.png" data-zoomable=""/></p> <div class="caption"> The Intersection over Union metric can be computed by measuring the proportion of the intersection between two bounding boxes w.r.t. their joint area, see (a). The IoU value ranges between $0.0$ when $A \cap B = \emptyset$ (b) and $1.0$ when $A \cap B = A \cup B$, implying that $A = B$ (c). </div> <p>The most common evaluation metric in object detection is the mean Average Precision (mAP), originally introduced in VOC07. The mAP score is computed as the average object detection precision, i.e. “<em>What proportion of positive detections was actually correct?</em>”, over different recall values, i.e. “<em>What proportion of actual positives was detected successfully?</em>”, evaluated for each object class separately and averaged afterward. 
The values for precision and recall can be computed as follows:</p> \[\begin{aligned} \text{Precision} &amp;= \frac{\text{TP}}{\text{TP + FP}} \\ \text{Recall} &amp;= \frac{\text{TP}}{\text{TP + FN}} , \end{aligned}\] <p>where TP is the number of true positives, while FP and FN are the numbers of false positives and false negatives respectively. The natural follow-up question is how a detection match and miss is decided. In a binary classification problem, we simply check for equality between the predicted and the target label. In the setting of object detection, the targets and predictions consist of tuples of coordinates that define a bounding box. Hence it is necessary to define a rule at which point the prediction matches the target box in the two-dimensional image space. This rule can be expressed as a hard threshold for the so-called Intersection over Union (IoU) value which measures the relative overlap between the two boxes \(|A \cap B|\) w.r.t. their common covered area \(|A \cup B|\) (see Figure above):</p> \[\text{IoU}\left( A, B \right) = \frac{|A \cap B|}{|A \cup B|} .\] <p>The IoU score ranges between a value of 0.0 when there is no overlap between the two boxes (\(A \cap B = \emptyset\), Figure (b)) and 1.0 when the boxes are equal (\(A \cap B = A \cup B\), Figure (c)). For a fixed IoU score threshold \(\tau\), we can now count the necessary statistics to compute the precision and recall values as follows:</p> <ul> <li> <p><strong>True Positives</strong>: Number of predicted boxes \(A\) that fulfil \(\text{IoU}\left( A, B \right) \geq \tau\) for at least a single target box \(B\), i.e. those objects which are correctly localized.</p> </li> <li> <p><strong>False Positives</strong>: Number of predicted boxes \(A\) that fulfil \(\text{IoU}\left( A, B \right) &lt; \tau\) for all target boxes \(B\), i.e. 
those predictions that did not sufficiently overlap with any target.</p> </li> <li> <p><strong>False Negatives</strong>: Number of target boxes \(B\) that fulfil \(\text{IoU}\left( A, B \right) &lt; \tau\) for all predicted boxes \(A\), i.e. those targets that had no sufficient overlap with any prediction.</p> </li> </ul> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/pr-curve.png" data-zoomable=""/></p> <div class="caption"> An example precision-recall curve. The blue line represents the true PR-Curve, while the dotted orange line is the 11-point interpolation which at recall point $\tilde{r}$ uses the maximum precision value $p \left( r \right)$ for all $r \geq \tilde{r}$ with $\tilde{r} \in \left\{ 0.0, 0.1, \dots, 1.0\right\}$. The Average Precision score is equal to the area under the 11-point interpolation of the precision-recall curve. </div> <p>After sorting all predictions by their object classification confidence, we can generate the precision-recall curve by counting the number of TP, FP, and FN cumulatively along the list of predictions in descending order of confidence, as shown by the blue line in the above figure. The Average Precision (AP) score reflects the area under the precision-recall curve. To be more robust against small changes in prediction confidences and the resulting changes in the precision-recall curve, the area under the curve is approximated using an 11-point interpolation, i.e. 
for each recall value \(\tilde{r}\) in the range \(\left[ 0, 1 \right]\) with a step size of 0.1, the corresponding precision \(p \left( \tilde{r} \right)\) is set to be the maximum precision over all \(r \geq \tilde{r}\):</p> \[\text{AP} = \int_{0}^{1} p \left( r \right) dr \approx \frac{1}{11} \sum_{\tilde{r} \in \{0.0, 0.1, \dots, 1.0\}} \max_{r \geq \tilde{r}} p \left( r \right)\] <p>The extension of the Average Precision to a multi-class problem is called the <em>mean</em> Average Precision (mAP):</p> \[\text{mAP} = \frac{1}{|C|} \sum_{c \in C} \text{AP}_{c} ,\] <p>where \(C\) is the set of available object classes and \(\text{AP}_{c}\) is the Average Precision score for a specific class \(c\). The IoU-based mAP score has become the prime metric for object detection evaluation. It is common to report a batch of scores for different IoU thresholds, namely \(\text{mAP}_{0.5}\) for a 0.5-IoU threshold, \(\text{mAP}_{0.75}\) for a 0.75-IoU threshold, and \(\text{mAP}_{\left[0.5, 0.95\right]}\), sometimes abbreviated as mAP or simply AP when clear from context, for an averaged mAP score over 10 equally spaced IoU thresholds between 0.5 and 0.95. Another distinction is the separation of \(\text{mAP}\) scores into different object sizes as is common in the COCO benchmark: \(\text{mAP}^{\text{small}}\), \(\text{mAP}^{\text{medium}}\), and \(\text{mAP}^{\text{large}}\) are mAP values for objects with an \(\text{area} &lt; 32^{2}\), \(32^{2} &lt; \text{area} &lt; 96^{2}\), and \(96^{2} &lt; \text{area}\), respectively, where the area is measured in squared pixels.</p> <h2 id="sec:background:hist-object-detect">A History of Object Detection</h2> <p>As modern deep learning-based object detection borrows many techniques from traditional approaches, it is important to quickly summarize these before moving on to more recent ones. 
In an era before the prominent rise of deep learning models in the last decade, robust image representations optimized towards a specific task could not simply be learned from the data but had to be carefully handcrafted. As used in <d-cite key="od-framework,trainable-system-od,example-based-od"></d-cite> and later successfully optimized in the first real-time application for human face detection by <d-cite key="Viola01rapidobject"></d-cite>, the basics of object detection used to be a straightforward approach: A sliding window is used to detect object instances in all locations and scales of an image. Each image sub-region under the current window position is then used to compute so-called Haar-like features (similar to Haar wavelets) and classifiers are then learned to distinguish between positive samples, i.e. feature representations of sub-regions which contain the object, and negative samples, i.e. those that count towards the background and are not of interest.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-21-object-detection/hog-test.png" data-zoomable=""/> </div> <div class="col-sm mt-3 mt-md-0"> <img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-21-object-detection/hog-neg.png" data-zoomable=""/> </div> <div class="col-sm mt-3 mt-md-0"> <img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-21-object-detection/hog-pos.png" data-zoomable=""/> </div> </div> <div class="caption"> Histogram of Oriented Gradients feature transformation applied on a test image. Left: Test image. Middle: HOG descriptors weighted by positive SVM weights. Right: HOG descriptors weighted by negative SVM weights. Source: <d-cite key="hog-features"></d-cite>. 
</div> <p>Introduced in <d-cite key="hog-features"></d-cite>, Histogram of Oriented Gradients (HOG) features were developed as an important improvement over scale-invariant feature transform (SIFT) <d-cite key="sift"></d-cite> descriptors. The main idea behind HOG features is that local object shapes and appearances in an image can be expressed as a distribution of color intensity gradients. The above figure shows a test image of a pedestrian (left) and the HOG descriptors weighted by positive and negative Support Vector Machine (SVM) weights, which were used to classify the presence of a pedestrian (middle and right, respectively). HOG features became an important foundation of many object detectors <d-cite key="Felzenszwalb2008ADT"></d-cite>; <d-cite key="5539906"></d-cite>; <d-cite key="ens-exem-svms"></d-cite>, as well as other computer vision tasks.</p> <p>The peak of traditional object detection methods was reached with the Deformable Part-based Model (DPM), which won the VOC-07, -08, and -09 detection challenges. It is built on the foundations of the HOG detector and views training as the task of learning how to decompose an object, while inference is an ensemble of detections of different object parts. Detecting a person would then be translated into the decomposed detection of a head, legs, hands, arms, and body, which was also called the “star-model” in <d-cite key="Felzenszwalb2008ADT"></d-cite>. <d-cite key="NIPS2011_4307"></d-cite> later improved this to “mixture models” <d-cite key="5539906"></d-cite>; <d-cite key="NIPS2011_4307"></d-cite>; <d-cite key="10.5555/2520924"></d-cite>, coping with objects of larger variation.</p> <h2 id="sec:background:modern-object-detect">Deep Learning based Object Detection</h2> <p>After the success of DPMs, improvements in object detectors stagnated. 
With the comeback of convolutional neural networks (CNN) <d-cite key="cnn-rebirth"></d-cite> in 2012, deep architectures have been developed to learn robust, high-level, and task-agnostic feature representations of images that easily superseded hand-crafted ones. Unsurprisingly, the field of object detection has quickly gained new traction due to successful deep models. This section goes into more detail on different approaches which can be grouped into two-stage detectors and one-stage detectors. The former separate the candidate region generation from the actual location regression and object classification, while the latter implement an end-to-end solution using a single deep neural network.</p> <h3 id="sec:background:two-stage-detectors">Two-Stage Detectors</h3> <p>Two-stage object detectors follow a detection paradigm that separates (1) a proposal detection step, where likely object locations are determined, from (2) a verification step, where each proposal is classified into one of the possible classes of objects and the proposed location is additionally fine-tuned.</p> <h5 id="sec:background:regions-with-cnn">Regions with CNN Features: R-CNN</h5> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/splash-method.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> R-CNN object detection system overview. The system (1) takes an input image, (2) extracts region proposals, (3) warps and forwards each proposal through a pre-trained CNN to obtain feature representations and finally (4) classifies each output using class-specific SVMs. Source: <d-cite key="rcnn"></d-cite>. </div> <p>The first of its kind in the two-stage category of object detectors was R-CNN (Regions with CNN features) <d-cite key="rcnn"></d-cite>. R-CNN starts with the generation of object proposals that serve as candidates for processing. 
These are obtained using the selective search algorithm <d-cite key="selective-search"></d-cite>, which is a region proposal procedure that computes hierarchical groupings of similar regions based on size, shape, color, and texture. Each object proposal is then warped into a fixed predetermined image size and forwarded through a CNN model, which is pre-trained on ImageNet, to extract a fixed 4096-dimensional feature vector. Afterward, class-specific SVMs perform the object recognition task by scoring each region proposal for their respective class. Finally, a greedy non-maximum suppression is applied, which rejects a region if its IoU with a higher-scoring region of the same class exceeds a learned threshold. The above figure gives a sketch of the inference pipeline. Additionally, the CNN can be fine-tuned on other datasets. To improve the object localization, <d-cite key="rcnn"></d-cite> have applied a separate bounding box regression stage (a similar strategy was introduced in <d-cite key="od-disc"></d-cite>) that uses class-based regressors to predict new bounding boxes based on the CNN features. R-CNN broke the stagnation in the field of object detection by pushing the VOC07 \(\text{mAP}_{0.5}\) score from 33.7% of DPM-v5 <d-cite key="dpm-v5"></d-cite> to 58.5%.</p> <h5 id="sec:background:fast-r-cnn">Fast R-CNN</h5> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/fast-rcnn-structure2.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> The architecture of Fast R-CNN. The full input image is fed through a CNN to generate feature maps. Based on RoIs, each feature map is pooled into a RoI feature vector and fed into a sequence of fully-connected layers with a classification (object classes) and regression (object localization) head. Source: <d-cite key="fast-rcnn"></d-cite>. </div> <p>R-CNN was superseded by Fast R-CNN <d-cite key="fast-rcnn"></d-cite>. 
In Fast R-CNN, the <em>full</em> input images are forwarded through a CNN backbone to produce feature maps. For each object proposal output by the selective search algorithm, a region of interest (RoI) pooling layer extracts a fixed-size feature vector from the feature map, inspired by spatial pyramid pooling in SPPNet <d-cite key="sppnet"></d-cite>. Then, each vector is fed into a sequence of fully-connected layers which split into two heads: one for softmax classification, outputting a probability vector \(\boldsymbol{p}\) of length \(|C|+1\) for the possible classes (with an additional class for the background), and one for bounding box regression, outputting a bounding box offset \(\boldsymbol{t}^{c} = ( t_{x}^{c}, t_{y}^{c}, t_{w}^{c}, t_{h}^{c} )\) for each of the \(|C|\) object classes (see figure above). To jointly train for classification and bounding box regression, a multi-task loss \(L\) on each labeled RoI is introduced:</p> \[L\left(\boldsymbol{p}, u, \boldsymbol{t}^u , \boldsymbol{v}\right) = L_{cls}(\boldsymbol{p}, u) + \lambda\left[u \geq 1\right]L_{loc}\left(\boldsymbol{t}^u, \boldsymbol{v}\right) ,\] <p>where \(u\) is the ground-truth label, \(\boldsymbol{v}\) is the ground-truth bounding box regression target, the Iverson bracket indicator function \([u \geq 1]\) evaluates to 1 when \(u \geq 1\) and 0 otherwise, and the hyperparameter \(\lambda\) controls the balance between the two task losses. For classification, the cross-entropy (log) loss \(L_{cls}(\boldsymbol{p}, u) = -\log p_u\) is used; for bounding box regression, the SmoothL1 loss is summed over the four box coordinates. For background RoIs (\(u = 0\)), \(L_{loc}\) is ignored due to missing ground-truth references.</p> <p>This approach is closer to an end-to-end solution than its predecessors as it gets rid of the multi-stage pipeline and can be trained given only the input image and the object proposals coming from an off-the-shelf algorithm. 
Fast R-CNN speeds up training by a factor of 9 (with a VGG16 backbone) and testing by a factor of 213, and pushes the \(\text{mAP}_{0.5}\) score on VOC07 to 70.0%.</p> <h5 id="sec:background:faster-r-cnn">Faster R-CNN</h5> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/faster-rcnn-rpn.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> The end-to-end unified architecture of Faster R-CNN including an 'attention' module called Region Proposal Network (RPN) generating likely object proposals (location and objectness) based on $k$ predefined anchor boxes. Source: <d-cite key="faster-rcnn"></d-cite>. </div> <p>Fast R-CNN was extended by Faster R-CNN <d-cite key="faster-rcnn"></d-cite>, replacing the external object proposal stage with an end-to-end approach (see figure above). The authors introduce a so-called Region Proposal Network (RPN) serving as an “attention” model: a fully convolutional network that takes an image of any size as input and predicts a set of bounding box proposals. Each proposal is assigned an <em>objectness</em> score that measures the membership to a set of object classes against the background class. Instead of directly generating the final bounding box locations of the object proposals, <d-cite key="faster-rcnn"></d-cite> introduce the notion of <em>anchors</em>, making Faster R-CNN the first anchor-based detector. Anchors serve as baseline reference boxes for which an offset has to be regressed (see figure above). Anchor boxes are predefined with different scales (e.g. \(32\times32\), \(64\times64\), …) and aspect ratios (e.g. 1:1, 1:2, 2:1, 5:1, …). 
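</p> <p>As a sketch of how such an offset can be expressed, the Faster R-CNN paper <d-cite key="faster-rcnn"></d-cite> parameterizes a box with center \((x, y)\), width \(w\), and height \(h\) relative to an anchor with center \((x_a, y_a)\), width \(w_a\), and height \(h_a\) as follows (the symbol names follow that paper and are introduced here only for illustration):</p> \[t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a} .\] <p>For example, assuming an anchor of size \(64 \times 64\) and an object of width 128 and height 64 that shares the anchor's center, the regression target would be \((t_x, t_y, t_w, t_h) = (0, 0, \log 2, 0)\).</p> <p>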
For each position in a convolutional feature map in the RPN, a sliding window approach computes a vector of \(2k\) objectness scores (one for the class “object” and one for the class “background” per anchor) as well as \(4k\) bounding box offsets, where \(k\) is the number of different anchors (the product of the number of anchor scales and the number of anchor aspect ratios). Therefore, the RPN generates \(W \cdot H \cdot k\) proposals for an intermediate feature map of size \(W \times H\).</p> <p>Embedding the region proposal generation into the network stack has improved the VOC07 \(\text{mAP}_{0.5}\) to 73.2% (COCO \(\text{mAP}_{0.5}=42.7\%\), COCO \(\text{mAP}_{\left[0.5, 0.95\right]}=21.9\%\)). Faster R-CNN thereby became the first end-to-end and the first near-realtime object detector (17 fps with a ZFNet <d-cite key="zfnet"></d-cite> backbone). Computational redundancies at the subsequent detection stage have later been reduced in R-FCN <d-cite key="rfcn"></d-cite> using fully convolutional networks, and in Light-Head R-CNN <d-cite key="Li2017LightHeadRI"></d-cite> by thinning out the prediction heads and replacing the CNN backbone with smaller networks (e.g. Xception <d-cite key="Chollet2017XceptionDL"></d-cite>).</p> <p>Faster R-CNN has been extended in Feature Pyramid Networks (FPN) <d-cite key="fpns"></d-cite>. Feature pyramids are a principal component in computer vision tasks for objectives that have to be solved at multiple scales. Until the development of FPN, deep learning-based object detectors had avoided feature pyramids, mostly due to their computational complexity and high memory usage. FPNs utilize the inherent multi-scale hierarchy present in deep convolutional networks and introduce feature pyramids with almost no extra cost. 
Instead of only using the very last output in the convolutional layer sequence, FPNs introduce a top-down architecture with lateral connections (see figure above) which extracts intermediate feature maps from another deep network (in this context called <em>backbone</em>). This architecture allows building high-level semantics at all scales, significantly improving scores on COCO to \(\text{mAP}_{0.5}=59.1\%\) and \(\text{mAP}_{\left[0.5, 0.95\right]}=36.2\%\). Feature Pyramid Networks have since become one of the basic building blocks for many newer object detectors.</p> <h3 id="sec:background:one-stage-detectors">One-Stage Detectors</h3> <p>In parallel to two-stage object detectors, a line of development with a very different approach took place. Instead of having a separate stage in the network that proposes where an object is likely, one-stage detectors predict object locations and classes on a grid that can be mapped onto the input image in a single pass, which is methodologically simpler and computationally faster. This also allows the object detector to be trained in a simple end-to-end fashion, optimizing the whole network in a unified way towards the defined objective, unlike two-stage methods, which usually have to define freezing phases for different network parts with different loss functions to achieve stable training (see “4-Step Alternating Training” in <d-cite key="faster-rcnn"></d-cite>).</p> <h5 id="sec:background:you-only-look">You Only Look Once (YOLO)</h5> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/yolo-model.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> YOLO divides the input image into a grid of $S\times S$ patches. Each grid cell then predicts $B$ bounding boxes, confidences for the boxes, and $C$ class probabilities. Source: <d-cite key="yolo"></d-cite>. </div> <p>The most prominent and also first one-stage object detector was YOLOv1 <d-cite key="yolo"></d-cite>. 
The network separates the input image into a grid of \(S \times S\) patches and predicts bounding boxes, object confidences, and class probabilities for each patch at the same time (see figure above), reaching \(\text{mAP}_{0.5} = 63.4\%\) on VOC07. YOLOv1 was improved in YOLOv2 <d-cite key="yolo-v2"></d-cite>, adopting anchor boxes from Faster R-CNN and achieving a \(\text{mAP}_{0.5}\) score of 78.6% on VOC07, and \(\text{mAP}_{0.5}=44.0\%\) and \(\text{mAP}_{[0.5,0.95]}=21.6\%\) on COCO. It nevertheless suffered from weak localization accuracy for small objects, which was addressed in YOLOv3 <d-cite key="yolo-v3"></d-cite> by making use of features from multiple scales, similar in concept to feature pyramids in FPNs (\(\text{mAP}_{0.5} = 57.9\%\) and \(\text{mAP}_{[0.5,0.95]} = 33.0\%\) on COCO).</p> <h5 id="sec:background:single-shot-detect">Single Shot Detection (SSD)</h5> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/ssd-vs-yolo.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> An architectural comparison of SSD (top) and YOLOv1 (bottom). SSD works without any fully-connected layers and adds convolutional feature layers on top of the backbone network to predict offsets for a set of anchor boxes at multiple feature scales. Evaluated mAP values are for the VOC07 dataset. Source: <d-cite key="ssd"></d-cite>. </div> <p>SSD <d-cite key="ssd"></d-cite> was the second one-stage object detector after YOLOv1. Its major contribution is using default boxes of different scales on different intermediate feature maps instead of only using the last layer (see figure above). 
Taking multiple feature scales into account, SSD offers advantages in terms of detection speed and accuracy over YOLOv1, especially for small objects (VOC07 \(\text{mAP}_{0.5}=76.8\%\), COCO \(\text{mAP}_{0.5}=46.5\%\), COCO \(\text{mAP}_{\left[0.5, 0.95\right]}=26.8\%\), and a fast version running at 59 FPS with VOC07 \(\text{mAP}_{0.5}=74.3\%\), depicted in the above figure).</p> <h5 id="sec:background:retinanet">RetinaNet</h5> <p>In RetinaNet <d-cite key="retinanet"></d-cite>, the authors observe that one-stage detectors have trailed two-stage detectors in terms of accuracy due to their unmanaged class imbalance of positive and negative samples. Two-stage detectors can use sampling heuristics such as a fixed foreground-to-background ratio (typically 1:3), or online hard example mining (OHEM) <d-cite key="ohem"></d-cite> to maintain the class balance between foreground (positive samples) and background (negative samples). Since one-stage detectors perform a single pass and have no stage to filter possible object proposals, they evaluate about \(10^4\) to \(10^{5}\) candidate locations per image while only a few locations contain objects, causing the following two issues: (1) training is inefficient since most locations are easy-to-classify background and do not contribute any useful learning signal, and (2) negative samples outnumber positive ones by a large margin, leading to degenerate models as the training can only adjust to few positives in comparison to negatives. To tackle this issue in one-stage detectors, the authors propose the so-called <em>Focal Loss</em>, a modified version of the cross-entropy loss. The main idea of Focal Loss is to automatically down-weight the contribution of easy examples during training and focus the model on hard examples with low confidences. 
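</p> <p>For reference, the Focal Loss as defined in the RetinaNet paper <d-cite key="retinanet"></d-cite> can be written for a binary classification problem as follows (the symbols are taken from that paper and are not used elsewhere in this post):</p> \[\text{FL}(p_t) = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right) ,\] <p>where \(p_t\) is the predicted probability of the ground-truth class, \(\gamma \geq 0\) is the focusing parameter, and \(\alpha_t\) is a class-balancing weight. For \(\gamma = 0\) and \(\alpha_t = 1\) this reduces to the standard cross-entropy loss. As a small worked example with \(\gamma = 2\), an easy sample with \(p_t = 0.9\) is scaled by the factor \((1 - 0.9)^2 = 0.01\), whereas a hard sample with \(p_t = 0.1\) keeps a factor of \((1 - 0.1)^2 = 0.81\).</p> <p>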
Using feature pyramids from FPN <d-cite key="fpns"></d-cite> (with a ResNeXt-101-FPN <d-cite key="resnext"></d-cite> backbone) and anchor boxes from Faster R-CNN, RetinaNet managed to beat previous state-of-the-art results with a COCO mAP of 40.8%, surpassing the best one-stage detector DSSD513 <d-cite key="fu2017dssd"></d-cite> (with a ResNet-101-DSSD backbone) at 33.2% mAP as well as the best two-stage detector Faster R-CNN (with an Inception-ResNet-v2-TDM <d-cite key="tdm"></d-cite> backbone) at 36.8% mAP.</p> <h3 id="sec:background:anchor-free-object">Anchor-Free Object Detection</h3> <p>Although anchor-based methods have shown great success, they come with increased methodological and computational complexity. The authors of <d-cite key="faster-rcnn"></d-cite> show that detection performance is sensitive to anchor box size, aspect ratio, and number. Therefore, additional hyperparameters need to be tuned to find anchors appropriate for the specific dataset. Moreover, anchor-based detectors have difficulties with objects of large shape variations, as well as small objects. This need for tuning also impedes the model’s generalization abilities, as anchors need to be redesigned for new tasks. Since anchors have to be placed densely on the input image (FPN <d-cite key="fpns"></d-cite>, for example, places 180k anchors), only a few anchors overlap with positive samples and most anchor boxes are labeled as negative samples during training. Logically, the question of whether anchor-based solutions are the optimal way to solve object detection arises. The recent field of <em>anchor-free</em> object detection tries to find answers to this exact question. 
First results show that approaches without object anchors are competitive and capable of beating state-of-the-art anchor-based models, with the advantage of being faster and methodologically less complex.</p> <p>Early approaches such as DenseBox <d-cite key="huang2015densebox"></d-cite>, the first unified end-to-end fully convolutional detector, UnitBox <d-cite key="unitbox"></d-cite>, tackling the localization with an IoU-based loss, and YOLOv1 <d-cite key="yolo"></d-cite>, focusing on real-time object detection, did not succeed in surpassing anchor-based systems such as Faster R-CNN at that time.</p> <p>CornerNet <d-cite key="cornernet"></d-cite> approaches the bounding box prediction by predicting a pair of keypoints, the top-left and bottom-right corners. The network predicts a heatmap for the top-left corner, a heatmap for the bottom-right corner, and an embedding for each corner which is supposed to group pairs of corners that belong to the same bounding box by minimizing the embedding vector distance between such pairs. Embeddings are produced using the Associative Embedding <d-cite key="NIPS2017_6822"></d-cite> technique to separate different instances. CornerNet achieves a mAP of 42.2% on COCO. CenterNet <d-cite key="centernet"></d-cite> builds on top of CornerNet and includes the prediction of a center keypoint used to perform center pooling. CenterNet additionally introduces cascade corner pooling as an extension to corner pooling from CornerNet, which performs max-pooling first along the bounding box borders and then along the orthogonal row/column towards the region center. By exploring the visual patterns within each predicted bounding box, this addresses the problem that CornerNet, like most other one-stage detectors, lacks an additional look into the cropped regions. 
CenterNet achieves a mAP of 47.0% on COCO, an improvement of 4.8 percentage points over CornerNet.</p> <p>ExtremeNet <d-cite key="extremenet"></d-cite>, on the other hand, is motivated by prior work that proposed to annotate bounding boxes by marking the objects’ four extreme points: top, bottom, left, right. In ExtremeNet, object extreme keypoints are predicted as follows. First, four multi-peak heatmaps, one for each extreme, are predicted for each object category to generate possible extreme points. Then, each combination of a top, bottom, left, and right extreme point is generated and its geometric center is compared to a fifth heatmap that predicts center keypoints. If the geometric center is close to one of the peaks in the center heatmap (i.e. the center heatmap score at that location is above a fixed threshold), the extreme point combination is valid and an object is predicted.</p> <p>FCOS <d-cite key="fcos"></d-cite> uses a fully convolutional one-stage detection approach introducing the notion of <em>center-ness</em>, which describes the normalized distance from the location to the center of the object that the location is responsible for. The center-ness is used to adjust the classification confidence and thus helps to suppress low-quality detected bounding boxes and improves overall performance by a large margin (\(\text{mAP}_{[0.5,0.95]}\) on the COCO minival validation subset of 37.1% with the center-ness branch, compared to 33.5% without it).</p> <p>FoveaBox <d-cite key="kong2019foveabox"></d-cite> is a single unified network, composed of a backbone network and two task-specific sub-networks, following the RetinaNet <d-cite key="retinanet"></d-cite> design. Its main contribution is the assignment of different object scales to different feature pyramid outputs, i.e. 
each pyramid layer is responsible for a certain interval of object sizes.</p> <h2 id="sec:background:rotat-agnos-object">Oriented Object Detection</h2> <p><img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-09-21-object-detection/rot-vs-ori.png" data-zoomable=""/></p> <div class="caption"> Comparison of axis-aligned (top) and oriented (bottom) bounding boxes in object detection on aerial view imagery. It becomes clear that the oriented bounding box representation is superior for oriented objects and generates tighter boxes that better cover the true area of the object. </div> <p>Oriented object detection has recently gained more attention in computer vision for aerial imagery, scene text, and face detection. In oriented object detection, bounding boxes are not bound to be strictly horizontal and vertical along the <em>x</em>- and <em>y</em>-axis but can be either rotated by an arbitrary angle or defined by an arbitrary quadrilateral consisting of the four corners. This allows for tighter bounding boxes, especially in densely populated regions as well as for objects that are not parallel to the physical <em>x-y</em> plane, where <em>x</em> is the horizontal and <em>y</em> is the vertical axis. The above figure shows the effect of using oriented bounding boxes, compared to horizontal bounding boxes. The top row shows axis-aligned bounding boxes on ships, vehicles, and airplanes in aerial view imagery while the bottom row shows their respective oriented bounding boxes. It is clear that the oriented bounding box representation is superior for oriented objects and generates tighter boxes that better cover the true area of the object.</p> <h4 id="two-stage-detectors">Two-Stage Detectors</h4> <p>Since this field has only recently received more attention, current approaches are usually based on methods that are successful in horizontal object detection. 
A straightforward approach to tackle oriented object detection is to extend the prediction of a bounding box with an additional parameter \(\theta\) that determines the rotation angle. One such approach uses a network similar to Faster R-CNN and expands the bounding box priors by multi-angle anchors including rotations between 0\(^{\circ}\) and 180\(^{\circ}\), thus increasing the anchor hyperparameter set significantly. Similarly, Rotation Region Proposal Networks (RRPN) adapt the Region Proposal Network from Faster R-CNN to generate RoIs with rotation angle information.</p> <p>Inspired by Mask R-CNN <d-cite key="mask-rcnn"></d-cite>, a semantic segmentation-guided RPN (sRPN) has been proposed that uses the atrous spatial pyramid pooling (ASPP) module from <d-cite key="Chen2017RethinkingAC"></d-cite> to suppress background clutter, together with a RoI module that fuses multi-level outputs from an FPN. Similarly, SCRDet++ <d-cite key="yang2020scrdet"></d-cite> introduced the idea of instance-level denoising on the feature maps into object detection to enhance the detection of small and cluttered objects, common in satellite images.</p> <p>RoI Transformer <d-cite key="roi-trans"></d-cite> is another two-stage anchor-based detector where the geometric transformation from horizontal bounding boxes to oriented bounding boxes is learned and applied to the RoI output of an RPN, such as in Faster R-CNN.</p> <p>The authors of <d-cite key="qian2019learning"></d-cite> argue that a five-parameter representation of oriented bounding boxes (center, width, height, and angle) can cause training instability as well as performance decreases. They ascribe this to the loss discontinuity which results from the natural periodicity of angles and therefore possible exchanges of width and height in the box representation. 
To circumvent this discontinuity, <d-cite key="qian2019learning"></d-cite> propose to use the quadrilateral representation in combination with a loss modulation that greedily minimizes the loss over the set of possible edge assignments.</p> <h4 id="one-stage-detectors">One-Stage Detectors</h4> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/salience-biased-loss-for-object-detection.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> Salience Biased Loss <d-cite key="sun2018salience"></d-cite> model architecture. The lower network part (green) is the RetinaNet backbone generating multiscale feature maps while the upper part (blue) is a second network which estimates the salience maps used to adapt the focal loss to the difficulty of the current image. Source: <d-cite key="sun2018salience"></d-cite>. </div> <p>Similar to <d-cite key="retinanet"></d-cite>, <d-cite key="sun2018salience"></d-cite> propose a novel loss function based on salience information directly extracted from the input image. Like the Focal Loss, the proposed Salience Biased Loss (SBL) treats training samples differently according to the complexity (saliency) of an image. This is estimated by an additional deep model trained on ImageNet <d-cite key="imagenet_cvpr09"></d-cite> in which the number of active neurons across different convolution layers is measured (see figure above):</p> \[S = \frac{1}{C \cdot W \cdot H} \sum_{c=1}^{C} \sum_{w=1}^{W} \sum_{h=1}^{H} f\left( \boldsymbol{x} \right)_{c,w,h} ,\] <p>where \(S\) is the average activation value across the layer and \(f\) is a convolution operation with an output feature map of size \(C \times W \times H\). The idea is that with increasing complexity, more neurons will be active. 
The saliency then scales an arbitrary base loss function to adapt the importance of training samples accordingly.</p> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/align-deep-features/architecture.pdf.png-1.png" data-zoomable=""/></p> <div class="caption"> Architecture illustration of S2A-Net consisting of a backbone pretrained on ImageNet, the Feature Pyramid Network to extract multiscale features, the Feature Alignment Module to generate oriented anchors using aligned convolutions and the Oriented Detection Module using Active Rotating Filters. Source: <d-cite key="han2020align"></d-cite>. </div> <p>Han et al. <d-cite key="han2020align"></d-cite> address the discrepancy between classification score and localization accuracy and ascribe this issue to the misalignment between anchor boxes and the axis-aligned convolutional features. Hence, they propose two modules (see the figure above for the full architecture). The first, the Feature Alignment Module (FAM), generates high-quality oriented anchors using an Anchor Refinement Network (ARN) and adaptively aligns the convolutional features according to the generated anchors using an Alignment Convolution Layer (ACL).</p> <p><img class="img-fluid rounded z-depth-0" src="/assets/posts/2021-09-21-object-detection/deform-convs.png" data-zoomable=""/></p> <div class="caption"> Sampling locations of different convolution operations using a $$3 \times 3$$ kernel. Figure (a) is the default convolution while (b) portrays deformable convolutions <d-cite key="deform-conv"></d-cite>, enabling learnable sampling offsets for each of the 9 locations. Figures (c) and (d) are aligned convolutions (AlignConv), which in turn are deformable convolutions but restricted to a global translation, rotation and scaling w.r.t. the full sampling window and are supposed to align the convolution operation to oriented anchor boxes. Source: <d-cite key="han2020align"></d-cite>. 
</div> <p>The ACL is a restricted Deformable Convolution Layer <d-cite key="deform-conv"></d-cite> in the sense that it learns the same translation, scaling, and rotation for each sampling location (see figure above). The second proposed module is the so-called Oriented Detection Module (ODM). Here, <d-cite key="han2020align"></d-cite> make use of Active Rotating Filters (ARF), introduced in Oriented Response Networks (ORN) <d-cite key="orn"></d-cite>. An ARF is a \(k \times k \times N\) convolutional filter that is rotated \(N - 1\) times during convolution, generating an output feature map of \(N\) orientation channels and thereby encoding \(N\) orientations directly into the feature maps.</p> <p>In <d-cite key="yang2020arbitrary"></d-cite>, the authors tackle the issue of discontinuous boundary effects on the loss due to the inherent angular periodicity and corner ordering by transforming the angular prediction task from a regression problem into a classification problem. They devise the Circular Smooth Label (CSL) technique, which handles the periodicity of angles and increases the error tolerance between adjacent angles.</p> <h4 id="one-stage-anchor-free-detectors">One-Stage Anchor-Free Detectors</h4> <p>The following contributions go one step further and remove the concept of anchors, generating predictions on a dense grid over the input image.</p> <p>IENet <d-cite key="lin2019ienet"></d-cite> is based on the one-stage anchor-free fully convolutional detector FCOS. The regression head from FCOS is extended in IENet by another branch that regresses the bounding box orientation, using a self-attention mechanism that incorporates the branch feature maps of the object classification and box regression branches.</p> <p>Axis-Learning <d-cite key="xiao2020axis"></d-cite> also builds on the dense sampling approach of FCOS and explores the prediction of an object axis, defined by the head point and tail point of the object along its elongated side (which can lead to ambiguity for near-square objects). 
The axis is extended by a width prediction which is interpreted to be orthogonal to the object axis.</p> <p>In PIoU <d-cite key="chen2020piou"></d-cite>, the authors argue that a distance-based regression loss such as SmoothL1 only loosely correlates with the actual IoU measurement, especially in the case of large aspect ratios. Therefore, they propose a novel Pixels-IoU (PIoU) loss, which exploits the IoU for optimization by pixelwise sampling, dramatically improving detection performance on objects with large aspect ratios.</p> <p>P-RSDet <d-cite key="zhou2020objects"></d-cite> replaces the Cartesian coordinate representation of bounding boxes with polar coordinates. The bounding box regression is thus performed by predicting the object’s center point, a polar radius, and two polar angles. Furthermore, to express the geometric constraint relationship between the polar radius and the polar angles, a novel Polar Ring Area Loss is proposed.</p> <p>An alternative formulation of the bounding box representation is defined in O2-DNet <d-cite key="wei2020o2-dnet"></d-cite>. 
Here, oriented objects are detected by predicting a pair of middle lines inside each target, showing similarity to the extreme keypoint detection schema proposed in ExtremeNet.</p> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p><a href="http://host.robots.ox.ac.uk/pascal/VOC/">http://host.robots.ox.ac.uk/pascal/VOC/</a>, accessed 2021-04-10 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p><a href="http://image-net.org/challenges/LSVRC/">http://image-net.org/challenges/LSVRC/</a>, accessed 2021-04-10 <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p><a href="http://cocodataset.org/">http://cocodataset.org/</a>, accessed 2021-04-10 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p><a href="https://storage.googleapis.com/openimages/web/index.html">https://storage.googleapis.com/openimages/web/index.html</a>, accessed 2021-04-10 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p><a href="https://captain-whu.github.io/ODAI/">https://captain-whu.github.io/ODAI/</a>, accessed 2021-04-10 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p><a href="https://www.kaggle.com/guofeng/hrsc2016">https://www.kaggle.com/guofeng/hrsc2016</a>, accessed 2021-04-10 <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:7"> <p><a href="https://iapr.org/archives/">https://iapr.org/archives/</a>, accessed 2021-04-10 <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:8"> <p><a href="http://vis-www.cs.umass.edu/fddb/">http://vis-www.cs.umass.edu/fddb/</a>, accessed 2021-04-10 <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Steven 
Braun</name></author><summary type="html"><![CDATA[This post is going to give a brief introduction to deep models, the history of object detection ranging from classic methods based on hand-crafted features to the latest deep learning object detectors, object detection datasets, and object detection evaluation metrics.]]></summary></entry><entry><title type="html">The arXiv PDF Command-Line Interface Downloader</title><link href="https://www.steven-braun.com/blog/2021/arxiv-downloader/" rel="alternate" type="text/html" title="The arXiv PDF Command-Line Interface Downloader"/><published>2021-07-12T00:00:00+00:00</published><updated>2021-07-12T00:00:00+00:00</updated><id>https://www.steven-braun.com/blog/2021/arxiv-downloader</id><content type="html" xml:base="https://www.steven-braun.com/blog/2021/arxiv-downloader/"><![CDATA[<p><img class="img-fluid rounded z-depth-1" src="/assets/posts/2021-08-02-arxiv-downloader/header.png" data-zoomable=""/></p> <h1 id="the-hassle-of-downloading-arxiv-papers">The Hassle of Downloading arXiv Papers</h1> <p>A few months ago I had acquired a tablet to read and annotate research papers wherever I want, stopping the waste of printing these in paper form and then keeping the annotated printed versions somewhere in my desk. This led me to a setup in which I store all research papers in some directory structure at <code class="language-plaintext highlighter-rouge">~/papers/</code> and have this directory be in sync with a synchronization service; currently OneDrive, which has an excellent third-party <a href="https://github.com/abraunegg/onedrive">Linux client</a>.</p> <p>Many of the papers I find are hosted on <a href="https://arxiv.org/">arXiv.org</a>. When I’ve got a paper which I’d like to read, e.g. <a href="https://arxiv.org/abs/2107.00630">Variational Diffusion Models</a>, I first have to manually download the paper into some location. 
By default, this results in a file named after the arXiv paper <code class="language-plaintext highlighter-rouge">id</code>, e.g. <code class="language-plaintext highlighter-rouge">2107.00630.pdf</code>, which is annoying as the <code class="language-plaintext highlighter-rouge">id</code> is hard to associate with the actual paper title. Therefore, I usually rename the paper according to its title:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mv </span>2107.00630.pdf ~/papers/generative-models/2107.00630v1.Variational_Diffusion_Models.pdf
</code></pre></div></div> <p>Now the paper is ready to be automatically synced and I can find it on my tablet device. After repeating the above multiple times, I was especially annoyed by the need to rename the file. Furthermore, it also happens that I often simply have the arXiv link and know that I want to have this paper on my tablet. I’m an avid Linux user and work from the terminal most of the time. Hence, it was natural to write a command-line tool that takes an arXiv link/id, downloads the paper into a preferred directory, and automatically renames the file according to the title. This is when I came up with a simple tool called <a href="https://github.com/braun-steven/arxiv-downloader">arxiv-downloader</a>, built on the neat arXiv Python wrapper <a href="https://github.com/lukasschwab/arxiv.py">lukasschwab/arxiv.py</a>.</p> <h1 id="the-solution-arxiv-downloader">The Solution: arxiv-downloader</h1> <p>This little tool is available on PyPI (<code class="language-plaintext highlighter-rouge">pip install arxiv-downloader</code>) and offers the <code class="language-plaintext highlighter-rouge">arxiv-downloader</code> script on the command line:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>arxiv-downloader <span class="nt">--url</span> https://arxiv.org/abs/2107.00630 <span class="nt">--directory</span> ~/papers/generative-models/

Directory /home/steven/papers/generative-models/ does not exist. Create? <span class="o">[</span>y/n] y
Starting download of article: <span class="s2">"Variational Diffusion Models"</span> <span class="o">(</span>2107.00630<span class="o">)</span>
Download finished! Result saved at:
/home/steven/papers/generative-models/2107.00630v1.Variational_Diffusion_Models.pdf
</code></pre></div></div> <p>This collapses the sequence of opening an arXiv link, manually downloading the PDF, renaming it according to the paper title, and moving it to its final location into a single command. Although this may save me only about half a minute per paper, I deeply felt the need to automate these steps. If you’re a computer scientist or programmer, I hope you can relate; if not, then just call me crazy :-).</p> <p>You can list all available arguments with <code class="language-plaintext highlighter-rouge">arxiv-downloader -h</code>:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>arxiv-downloader <span class="nt">-h</span>
usage: arxiv-downloader <span class="o">[</span><span class="nt">-h</span><span class="o">]</span> <span class="o">[</span><span class="nt">--url</span> URL] <span class="o">[</span><span class="nt">--id</span> ID] <span class="o">[</span><span class="nt">--directory</span> DIRECTORY] <span class="o">[</span><span class="nt">--source</span><span class="o">]</span>

arXiv Paper Downloader.

optional arguments:
  <span class="nt">-h</span>, <span class="nt">--help</span>            show this <span class="nb">help </span>message and <span class="nb">exit</span>
  <span class="nt">--url</span> URL, <span class="nt">-u</span> URL     arXiv article URL.
  <span class="nt">--id</span> ID, <span class="nt">-i</span> ID        arXiv article ID <span class="o">(</span><span class="k">for </span>https://arxiv.org/abs/2004.13316 this would be
                        2004.13316<span class="o">)</span><span class="nb">.</span>
  <span class="nt">--directory</span> DIRECTORY, <span class="nt">-d</span> DIRECTORY
                        Output directory.
  <span class="nt">--source</span>, <span class="nt">-s</span>          Whether to download the <span class="nb">source tar </span>file.
</code></pre></div></div> <p>The source code is available at <a href="https://github.com/braun-steven/arxiv-downloader">braun-steven/arxiv-downloader</a>. Feel free to contribute, leave feedback, and report issues.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In this post, I introduce arxiv-downloader, a command-line interface for conveniently downloading papers from arXiv.]]></summary></entry></feed>