.. _blueprint-debuginfod-server:

=================
debuginfod server
=================

Related issue: :issue:`Provide debuginfod server <957>`

Debusine builds and hosts Debian repositories but currently has no way to serve
debug symbols to developers.  When a crash occurs, a developer must manually
find and install the correct ``-dbgsym`` package for the exact binary they are
debugging.  This blueprint proposes bringing
`debuginfod <https://sourceware.org/elfutils/Debuginfod.html>`_ server
functionality to Debusine, allowing :manpage:`gdb(1)` to automatically fetch
debug symbols from a Debusine archive without any manual steps.

Goals
=====

- Extract ``.debug`` ELF files from ``-dbgsym`` packages during the
  :task:`Sbuild` task and store them in dedicated Debusine artifacts.
- Serve those artifacts through debuginfod-compatible HTTP endpoints
  (``/buildid/<id>/debuginfo``), scoped per archive.
- Document how to configure :manpage:`gdb(1)` to use a Debusine archive as a
  debuginfod source, so developers can get automatic symbol resolution with a
  single ``DEBUGINFOD_URLS`` environment variable.

Requirements
============

- The extraction pipeline must run inside the isolated sbuild worker so that
  a malformed or malicious ELF file cannot affect the web server or other
  builds.
- Each extracted ``.debug`` file must be stored in a Debusine artifact and
  be indexed by its build-ID for fast HTTP lookups.
- Debug symbol artifacts must be published into a :collection:`debian:suite`
  alongside their parent :artifact:`debian:binary-package` artifacts, via the
  existing ``RELATES_TO`` relation and :task:`CopyCollectionItems` mechanism.
- debuginfod URLs must be resolvable from the root of each archive, so that a
  single ``DEBUGINFOD_URLS`` entry covers all suites in the archive.
- The HTTP response must include the headers required by the debuginfod
  protocol specification: ``X-DEBUGINFOD-FILE`` and ``X-DEBUGINFOD-SIZE``.
- Responses must support HTTP 206 Partial Content so that :manpage:`gdb(1)`
  can fetch individual ELF sections without downloading the full file.
- Two :artifact:`debian:debug-symbols` artifacts in the same archive that
  share a build-ID must have identical file contents, analogous to the
  existing pool-file uniqueness constraints.
- The :collection:`debian:archive` collection must expose a per-archive
  configuration option (similar to Launchpad's ``build_debug_symbols``)
  controlling whether debug symbols are extracted.  When disabled,
  ``DEB_BUILD_OPTIONS=noautodbgsym`` is passed to sbuild to suppress
  ``-dbgsym`` package generation entirely.
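
The last requirement above can be sketched as a small helper that derives the
``DEB_BUILD_OPTIONS`` value from the archive setting.  This is only an
illustration: the function name and the way the value reaches sbuild are
assumptions, not part of the existing :task:`Sbuild` task.

.. code-block:: python

    def sbuild_build_options(build_debug_symbols: bool, existing: str = "") -> str:
        """Return a DEB_BUILD_OPTIONS value that honours the archive setting.

        ``existing`` is any space-separated option string already in use
        (hypothetical parameter, for illustration only).
        """
        options = existing.split()
        # Only suppress -dbgsym generation when the archive opted out.
        if not build_debug_symbols and "noautodbgsym" not in options:
            options.append("noautodbgsym")
        return " ".join(options)

With ``build_debug_symbols`` left at its default, the existing options pass
through unchanged; when it is disabled, ``noautodbgsym`` is appended once.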

Out of scope
============

The following parts of the official debuginfod specification are excluded:

- **Source file serving** (``/buildid/<id>/source``): the
  :manpage:`debuginfod(8)` man page explicitly notes that, due to Debian and
  Ubuntu packaging policies, debuginfod cannot resolve source files for
  ``.deb`` and ``.ddeb`` packages.  `debuginfod.debian.net has the same
  limitation <https://wiki.debian.org/Debuginfod#The_service_isn.27t_working.21__I_can.27t_download_the_source_code_while_debugging_a_package.21>`_.
- **Executable serving** (``/buildid/<id>/executable``): :artifact:`debian:binary-package`
  stores the whole ``.deb`` as a single file rather than broken-out binaries.
  Extracting individual executables on-the-fly inside a request handler is not
  feasible; pre-extraction at build time would require significant additional
  storage and pipeline changes that are out of scope for the initial
  implementation.
- **Metrics endpoint** (``/metrics``): this Prometheus statistics endpoint is
  designed for a standalone C++ daemon; Debusine's existing application-level
  logging and system-wide monitoring tools are more appropriate.
- **Metadata search** (``/metadata``): implementing a searchable JSON index
  for build-ID metadata requires a separate database design that is out of
  scope for the initial implementation.
- **IMA signatures** (``X-DEBUGINFOD-IMA-SIGNATURE``): this response header
  carries per-file Integrity Measurement Architecture signatures used primarily
  for RPM packages and has no standard applicability to Debian ``.deb``
  packages.
- **Upstream federation**: automatically forwarding unresolved build-ID
  requests to upstream servers such as ``debuginfod.debian.net`` is excluded.
  Debusine's primary use case is serving self-hosted, private, or localised
  archives where the operator controls the source.
- **DWZ supplement ingestion**: ``-dbgsym.deb`` packages built with DWZ
  compression ship a shared supplement ELF file at
  ``./usr/lib/debug/.dwz/<name>.debug`` that several regular ``.debug`` files
  reference via their ``.gnu_debugaltlink`` section.  Ingesting these
  supplements is excluded from the initial implementation to keep the first
  iteration small.  Until DWZ support is added, :manpage:`gdb(1)` requests
  for the supplement's build-ID return ``404``.  In practice, debug info
  built with DWZ is then shown without its alternate strings table;
  debugging does not fail entirely.

Background: dbgsym packages and ELF build-IDs
==============================================

When Debian builds a binary package it strips debug information to keep the
shipped binary small.  For example, ``util-linux_2.40.2-1_amd64.deb``
contains the stripped binary while
``util-linux-dbgsym_2.40.2-1_amd64.deb`` contains the DWARF debug symbols
for the exact same binary.

Inside the ``data.tar`` of every ``-dbgsym.deb`` the debug files follow a
fixed path convention::

    ./usr/lib/debug/.build-id/XX/YYYY.debug

where ``XX`` is the first two hex characters of the 40-character build-ID and
``YYYY`` is the remaining 38 hex characters.  The build-ID is assigned at link
time and uniquely identifies a specific binary.
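
The mapping between a build-ID and its path can be sketched in a few lines of
Python.  The helper names are hypothetical, and the sketch assumes the
40-character hex build-IDs described above.

.. code-block:: python

    import re

    # Accepts the path with or without the leading "./" used inside data.tar.
    _DEBUG_PATH_RE = re.compile(
        r"(?:\./)?usr/lib/debug/\.build-id/([0-9a-f]{2})/([0-9a-f]{38})\.debug"
    )
    _BUILD_ID_RE = re.compile(r"[0-9a-f]{40}")


    def debug_path_for(build_id: str) -> str:
        """Return the relative path (no leading "./") for a build-ID."""
        if not _BUILD_ID_RE.fullmatch(build_id):
            raise ValueError(f"not a 40-character hex build-ID: {build_id!r}")
        return f"usr/lib/debug/.build-id/{build_id[:2]}/{build_id[2:]}.debug"


    def build_id_from_path(path: str) -> str:
        """Recover the build-ID encoded in a debug file path."""
        match = _DEBUG_PATH_RE.fullmatch(path)
        if match is None:
            raise ValueError(f"not a build-ID debug path: {path!r}")
        return match.group(1) + match.group(2)

The two functions are inverses of each other, which is what allows a path to
be validated against the build-ID read from the ELF file itself.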

ELF classification uses three sections:

``.note.gnu.build-id``
    Contains the raw build-ID bytes.  Reading them sequentially and converting
    to lowercase hex yields the familiar 40-character string.

``.debug_info`` / ``.gnu_debugdata``
    Presence of either section confirms that DWARF debug information is
    embedded, marking the file as a ``.debug`` artifact to be extracted.

``.gnu_debugaltlink``
    Present when a package was built with DWZ compression.  Contains the
    build-ID of a shared DWZ supplement file living at
    ``./usr/lib/debug/.dwz/<name>.debug`` inside the same ``-dbgsym.deb``.
    DWZ supplement ingestion is excluded from the initial implementation
    (see "Out of scope" above); only files under
    ``./usr/lib/debug/.build-id/`` are ingested.
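
To illustrate what "reading the raw build-ID bytes" means, the following
sketch decodes the contents of a ``.note.gnu.build-id`` section using only
the standard library.  It is not the proposed implementation (which uses
``pyelftools``); the function name and the ``little_endian`` parameter are
assumptions, and the caller is assumed to have already located the section
bytes.

.. code-block:: python

    import struct
    from typing import Optional

    NT_GNU_BUILD_ID = 3  # note type used for GNU build-IDs


    def build_id_from_note(section: bytes, little_endian: bool = True) -> Optional[str]:
        """Return the lowercase hex build-ID found in a note section, if any."""
        header = "<III" if little_endian else ">III"
        offset = 0
        while offset + 12 <= len(section):
            namesz, descsz, note_type = struct.unpack_from(header, section, offset)
            offset += 12
            name = section[offset:offset + namesz]
            offset += (namesz + 3) & ~3  # name is padded to 4 bytes
            desc = section[offset:offset + descsz]
            offset += (descsz + 3) & ~3  # descriptor is padded to 4 bytes
            if (
                note_type == NT_GNU_BUILD_ID
                and name == b"GNU\x00"
                and len(desc) == descsz
            ):
                return desc.hex()
        return None

A 20-byte descriptor yields the 40-character string used throughout this
blueprint.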

Implementation plan
===================

.. artifact:: debian:debug-symbols

New artifact category: ``debian:debug-symbols``
------------------------------------------------

A new artifact category ``debian:debug-symbols`` will be added as an enum
value in ``debusine/artifacts/models.py``.  Each instance represents all
``.debug`` ELF files extracted from a single ``-dbgsym`` package.  Files are
stored as multiple entries within one artifact; each entry's
``FileInArtifact.path`` is set to
``usr/lib/debug/.build-id/<XX>/<YYYY>.debug``, where ``<XX>`` is the first
two hex characters of the build-ID and ``<YYYY>`` is the remaining 38
characters — the same path the file occupies inside the ``-dbgsym.deb``
with the leading ``./`` stripped.  Storing all debug files for one
``-dbgsym`` package in a single artifact avoids creating one artifact per
debug file, which would place excessive load on the database during
publishing and expiry.

The artifact data carries the following field:

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Field
     - Type
     - Purpose
   * - ``build_ids``
     - list of 40-char hex strings
     - Index of all build-IDs contained in this artifact

``X-DEBUGINFOD-SIZE`` is derived from the stored file size at serve time
and does not need to be persisted.

Extraction pipeline in the :task:`Sbuild` task
-----------------------------------------------

Debusine already opens ``.deb`` files in ``upload_artifact()`` to read
control data and create :artifact:`debian:binary-package` artifacts.  The
debug-symbol extraction follows the same pattern, added as two new helpers
called from ``_upload_binary_packages()``:

``_upload_debug_symbols(dbgsym_deb)``
    For each ``-dbgsym.deb`` in the build output:

    1. Open the ``data.tar`` archive.
    2. Iterate over every file whose path matches
       ``./usr/lib/debug/.build-id/**/*.debug``.  Each such file becomes
       one entry in the resulting :artifact:`debian:debug-symbols`
       artifact, with its build-ID joining the artifact's ``build_ids``
       list and its ``FileInArtifact.path`` set to
       ``usr/lib/debug/.build-id/<XX>/<YYYY>.debug`` (the same in-tar
       path with the leading ``./`` stripped).  Files under
       ``./usr/lib/debug/.dwz/`` (DWZ supplements) are not ingested in
       the initial implementation; see "Out of scope" above.
    3. For each file, parse the ELF structure with ``pyelftools`` to locate
       the ``.note.gnu.build-id`` section and extract the build-ID.  Also
       check for ``.debug_info`` / ``.gnu_debugdata`` to confirm it is a
       debug file.
    4. Verify that the in-tar path matches the convention
       ``./usr/lib/debug/.build-id/<XX>/<YYYY>.debug``, where ``<XX>`` and
       ``<YYYY>`` are derived from the build-ID extracted in step 3.  If
       the path does not match, the task fails with an explanatory error.
       ``dh_strip``'s ``make_debug`` function constructs this path
       deterministically from the build-ID and we are not aware of any
       tooling in Debian that constructs ``-dbgsym.deb`` packages by hand,
       so a mismatch indicates a malformed package and must not be
       silently ingested.
    5. Accumulate all such files into a single
       :artifact:`debian:debug-symbols` artifact and upload it.

``_create_debug_symbol_relations(debug_artifact, binary_artifact)``
    Records a ``RELATES_TO`` relation from each :artifact:`debian:debug-symbols`
    artifact to its parent :artifact:`debian:binary-package` artifact.

Running extraction inside the sbuild worker confines the blast radius of any
malformed or malicious ELF input to the isolated worker process and avoids
re-fetching files from artifact storage, since all build output is already
present on disk.
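
Steps 2 to 5 of ``_upload_debug_symbols()`` can be sketched as follows.  This
is a minimal illustration under stated assumptions: ``collect_debug_files``
and its ``read_build_id`` callback are hypothetical names, the callback
stands in for the ``pyelftools``-based parsing of step 3, and the returned
mapping stands in for the files that would be added to the artifact.

.. code-block:: python

    import tarfile
    from typing import Callable, Dict

    BUILD_ID_DIR = "usr/lib/debug/.build-id/"


    def collect_debug_files(
        data_tar: tarfile.TarFile,
        read_build_id: Callable[[bytes], str],
    ) -> Dict[str, bytes]:
        """Map artifact path to file contents for every ``.debug`` member."""
        collected: Dict[str, bytes] = {}
        for member in data_tar:
            name = member.name
            relative = name[2:] if name.startswith("./") else name
            # Step 2: only build-ID debug files; DWZ supplements are skipped.
            if not (
                member.isfile()
                and relative.startswith(BUILD_ID_DIR)
                and relative.endswith(".debug")
            ):
                continue
            handle = data_tar.extractfile(member)
            if handle is None:
                continue
            content = handle.read()
            # Step 3: the callback extracts the build-ID from the ELF data.
            build_id = read_build_id(content)
            # Step 4: the path must be the one derived from the build-ID.
            expected = f"{BUILD_ID_DIR}{build_id[:2]}/{build_id[2:]}.debug"
            if relative != expected:
                raise ValueError(
                    f"{name}: path does not match build-ID {build_id}"
                )
            # Step 5: accumulate for a single artifact.
            collected[relative] = content
        return collected

Raising on a mismatch rather than skipping the file mirrors the requirement
that a malformed package must not be silently ingested.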

Collection specification changes
---------------------------------

The specs in ``docs/reference/collections/specs/`` must be updated to
reflect the new artifact type:

- ``debian:suite`` and ``debian:archive``: a uniqueness constraint is added
  in both collection specifications, in the same shape as the existing
  ``pool-file`` constraints.  Within either a single
  :collection:`debian:suite` or a single :collection:`debian:archive`, two
  :artifact:`debian:debug-symbols` collection items that share a build-ID
  must refer to files with identical contents.  The suite-level constraint
  allows the archive-level constraint to be relaxed when an obsolete suite
  is removed; the archive-level constraint prevents two suites in the same
  archive from disagreeing about the file for a given build-ID.  The
  constraint text in the two specifications is essentially identical.
- ``debian:suite``: :artifact:`debian:debug-symbols` becomes a valid item
  type.  See the per-item data table under "Publishing into a suite" below.
- ``debian:archive``: a new boolean data field ``build_debug_symbols``
  (default ``true``) controls whether debug symbols are extracted for this
  archive.

Database index
--------------

A new Django migration adds a partial B-tree index on
``CollectionItem.data->>'build_id'`` conditioned on:

- ``category = 'debian:debug-symbols'``
- ``child_type = 'a'`` (artifact item in Debusine's ``CollectionItem`` model)
- ``parent_category = 'debian:suite'``

This pattern is taken directly from ``migration 0005``, which adds a similar
partial index for repository index path lookups.  The partial condition keeps
the index small by covering only debug-symbol rows in suite collections,
excluding the much larger set of binary and source package rows that carry no
``build_id`` field.

Because URLs are anchored at the archive level, a build-ID must resolve to
the same file contents across the entire archive, not just within a single
suite.  This is enforced by the uniqueness constraints described in the
collection specification changes above (one each in
:collection:`debian:suite` and :collection:`debian:archive`), rather than by
the index itself.

Publishing into a suite
-----------------------

When a binary package is published into a suite, :workflow:`package_publish`
must pull in the matching debug symbols alongside it.  A new helper
``_add_debug_symbols()`` follows the ``RELATES_TO`` relation recorded during
the :task:`Sbuild` task to obtain the :artifact:`debian:debug-symbols`
artifact, then queues it for copying into the suite via
:task:`CopyCollectionItems`.

Inside ``DebianSuiteManager.do_add_artifact()``, a new ``elif`` branch handles
the ``DEBUG_SYMBOLS`` category and creates one ``CollectionItem`` row per
build-ID contained in the artifact:

.. list-table:: Per-item data for ``debian:debug-symbols`` items in a suite
   :header-rows: 1
   :widths: 25 25 50

   * - Field
     - Type
     - Source
   * - ``build_id``
     - 40-char hex string
     - The build-ID this collection item represents
   * - ``srcpkg_name``
     - string
     - Mirrored from the parent :artifact:`debian:binary-package` item
   * - ``srcpkg_version``
     - string
     - Mirrored from the parent :artifact:`debian:binary-package` item
   * - ``package``
     - string
     - Mirrored from the parent :artifact:`debian:binary-package` item
   * - ``version``
     - string
     - Mirrored from the parent :artifact:`debian:binary-package` item
   * - ``architecture``
     - string
     - Mirrored from the parent :artifact:`debian:binary-package` item

The collection item is named ``debugsym:<build-id>``. The "parent"
binary-package item is the one reached via the ``RELATES_TO`` relation
recorded by ``_create_debug_symbol_relations()`` during the
:task:`Sbuild` task.
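
As an illustration of the table and naming rule above, the following sketch
derives the item names and per-item data from an artifact's ``build_ids``
list and the parent item's data.  The function name and the plain-dictionary
representation are assumptions; the real code would create ``CollectionItem``
rows inside ``DebianSuiteManager.do_add_artifact()``.

.. code-block:: python

    from typing import Dict, List, Tuple

    # Fields copied from the parent debian:binary-package item.
    MIRRORED_FIELDS = (
        "srcpkg_name",
        "srcpkg_version",
        "package",
        "version",
        "architecture",
    )


    def debug_symbol_items(
        build_ids: List[str], parent_data: Dict[str, str]
    ) -> List[Tuple[str, Dict[str, str]]]:
        """Return one (item name, item data) pair per build-ID."""
        items = []
        for build_id in build_ids:
            data = {"build_id": build_id}
            for field in MIRRORED_FIELDS:
                data[field] = parent_data[field]
            items.append((f"debugsym:{build_id}", data))
        return items

An artifact holding three debug files therefore produces three collection
items that differ only in ``build_id`` and name.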

If the suite already contains a ``CollectionItem`` with the same
``(name, parent_collection)`` — which occurs when a source package is built
reproducibly more than once — the existing item's file hash is compared
against the incoming file.  If the hashes match, the collision is logged and
ignored; if they differ, an error is raised, because differing contents under
the same build-ID indicate a non-reproducible build, which is a toolchain
problem.
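
The collision rule can be sketched as a small decision function.  The names
and the use of hex digests as plain strings are assumptions made for
illustration; the actual comparison would use the file hashes already stored
by Debusine.

.. code-block:: python

    import logging
    from typing import Optional

    logger = logging.getLogger(__name__)


    def should_add_item(
        build_id: str, existing_hash: Optional[str], incoming_hash: str
    ) -> bool:
        """Decide whether a new ``debugsym:<build-id>`` item must be created.

        ``existing_hash`` is None when the suite has no item for this
        build-ID yet.
        """
        if existing_hash is None:
            return True
        if existing_hash == incoming_hash:
            # Reproducible rebuild: same build-ID, same contents.
            logger.info("debugsym:%s already present, ignoring", build_id)
            return False
        raise ValueError(
            f"debugsym:{build_id}: contents differ for the same build-ID"
        )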

HTTP serving
------------

``DebugInfoView``
~~~~~~~~~~~~~~~~~

The view is scoped to an archive (inherits from ``ArchiveFileView``) rather
than a suite, so that a single ``DEBUGINFOD_URLS`` entry resolves build-IDs
across all suites in the archive.

URL pattern (:manpage:`gdb(1)` automatically appends the
``/buildid/<build-id>/debuginfo`` suffix to the configured base URL)::

    https://<archive-host>/<scope>/<workspace>/buildid/<build-id>/debuginfo
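
The following sketch shows how a client derives request URLs from
``DEBUGINFOD_URLS``, which holds a space-separated list of base URLs.  The
function name and the ``debusine.example.org`` host, scope, and workspace in
the usage note are placeholders, not real endpoints.

.. code-block:: python

    from typing import List


    def debuginfo_urls(debuginfod_urls: str, build_id: str) -> List[str]:
        """Return the debuginfo URL a client would try for each base URL."""
        return [
            f"{base.rstrip('/')}/buildid/{build_id}/debuginfo"
            for base in debuginfod_urls.split()
        ]

For example, with ``DEBUGINFOD_URLS`` set to
``https://debusine.example.org/myscope/myworkspace/``, a lookup for build-ID
``<build-id>`` requests
``https://debusine.example.org/myscope/myworkspace/buildid/<build-id>/debuginfo``.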

The view queries ``CollectionItem`` filtered by ``build_id`` and archive,
hitting the partial B-tree index for a fast lookup.  It then calls the
existing ``stream_file()`` helper and appends the mandatory response headers:

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - Header
     - Source
     - Description
   * - ``X-DEBUGINFOD-FILE``
     - computed from ``build_id`` as
       ``/usr/lib/debug/.build-id/<XX>/<YYYY>.debug``
     - Path of the ``.debug`` file within the ``-dbgsym`` package.  The leading
       ``./`` of the in-tar path is an implementation detail of the
       ``.deb`` format and is stripped before the value is emitted.
   * - ``X-DEBUGINFOD-SIZE``
     - ``file_in_artifact.file.size``
     - File size in bytes
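
As a sketch of the table above, the two header values can be computed from
the build-ID and the stored file size alone.  The function name is
hypothetical; in the view, the size would come from
``file_in_artifact.file.size``.

.. code-block:: python

    from typing import Dict


    def debuginfod_headers(build_id: str, size: int) -> Dict[str, str]:
        """Return the debuginfod response headers for one debug file."""
        path = f"/usr/lib/debug/.build-id/{build_id[:2]}/{build_id[2:]}.debug"
        return {
            "X-DEBUGINFOD-FILE": path,
            "X-DEBUGINFOD-SIZE": str(size),  # header values are strings
        }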

:manpage:`gdb(1)` sends a ``HEAD`` request before ``GET`` to check
availability. Django's
``django.views.generic.base.View.setup`` already aliases ``head`` to
``get`` when ``get`` is defined and ``head`` is not, so ``DebugInfoView``
needs no explicit handling. ``HEAD`` is covered by unit tests alongside
``GET`` and ``Range:`` requests.
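
To make the ``Range:`` behaviour covered by those tests concrete, the sketch
below resolves a single-range header against a file size and formats the
matching ``Content-Range`` value for a ``206`` response.  It is an
illustration only: the function names are hypothetical, multi-range requests
are deliberately unsupported, and the existing ``stream_file()`` helper may
already provide equivalent handling.

.. code-block:: python

    from typing import Optional, Tuple


    def parse_single_range(header: str, size: int) -> Optional[Tuple[int, int]]:
        """Return inclusive (start, end) byte offsets, or None if unusable."""
        unit, _, spec = header.partition("=")
        if unit.strip() != "bytes" or "," in spec:
            return None
        first, sep, last = spec.strip().partition("-")
        if not sep:
            return None
        try:
            if first == "":
                # Suffix form "bytes=-N": the last N bytes of the file.
                length = int(last)
                if length <= 0:
                    return None
                start, end = max(size - length, 0), size - 1
            else:
                start = int(first)
                end = int(last) if last else size - 1
        except ValueError:
            return None
        end = min(end, size - 1)  # clamp to the end of the file
        if start < 0 or start > end:
            return None
        return start, end


    def content_range(start: int, end: int, size: int) -> str:
        """Format the Content-Range header value for a 206 response."""
        return f"bytes {start}-{end}/{size}"

A ``None`` result corresponds to a request the server cannot satisfy as a
partial response, for example a start offset beyond the end of the file.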
