Posts on Thore Göbel

Yet another 25G home router build

Sat, 03 Feb 2024 20:00:00 +0100

I recently moved out of my shared student flat – where internet was centrally provided – and into my own flat. Finally I get to choose my internet setup! The obvious choice is init7’s Fiber7 offering: unthrottled, symmetric, open peering, a static /48 IPv6, peer-to-peer (no PON), no fuss. An internet speed limited only by hardware, not by economics trying to carve out pricing levels for every pocket (100 Mbit/s on fiber, seriously?).

Fiber7 costs 65 CHF/month, which is a medium market price in Switzerland. 10G and 25G have the same monthly price, only the setup fee differs, because the hardware cost differs. 65 CHF is not too expensive (there are more expensive offerings), but also not the cheapest (there are cheaper offerings, but at worse service quality).

Especially when you look beyond the Swiss border and into Germany, and you see Deutsche Telekom offering Magenta 2000: 2G up, 1G down for 140 €/month. Twice the price for 1/10th of the speed.

What is more, this speed smells a lot like GPON. But in the announcement Telekom says that they have XGSPON equipment on their end and that you need an XGSPON-capable router. But XGSPON can carry symmetric 10G! So is Telekom throttling XGSPON to GPON? Maybe. Ein Schelm, wer Böses denkt (only a rogue would think ill of it). Telekom is clearly leaving itself room to give customers an artificial speed upgrade in the future (or even to sell the upgrade at a premium).

All of this is to say: I am very happy init7 exists and provides this great service. Now I just need to get a router.

A number of other people have also written about their home router builds for their init7 connection:

Check them out! I got inspired by their blog posts, too. This post is just yet another home router build.

A small note on hardware before we start: There are two form factors for network cables. RJ45 (the classic) and SFP (and its variants). At 10G, people tend to use both SFP+ (data center) and RJ45 (home lab) connectors. At 25G, however, SFP28 is the go-to option (but Cat8 would work, too). Wikipedia has a full table showing all the different SFP types. Here is an extract:

Port	Speed
SFP	1G
SFP+	10G
SFP28	25G

Initial reconnaissance

My original plan was to simply go for 10G and buy an off-the-shelf appliance as a router: the DEC740 from Deciso, the makers of OPNsense, the router and firewall I wanted to use. It is fanless, quiet, and has a low power consumption of ~15W, and costs 750 €. The simple router boxes other ISPs give their customers are also in the 10-15W range.

But I needed to buy a network card anyway to connect my existing NAS to the router. So I started shopping. Initially, I was searching for 10G cards. Most 10G cards cost around 200–250 CHF. SFP28 cards tend to cost 300–350 CHF. (As of December 2023, YMMV.)

Eventually, I came across the Mellanox ConnectX-4 Lx on Digitec:

Or rather, I took this screenshot too late. I actually came across it a few days earlier when it was available not for 145 CHF but for 115 CHF! (This likely was some temporary deal from supplier jacob.de.)

The Mellanox ConnectX-4 Lx has two SFP28 ports, not SFP+ ports. I was hooked. If this 25G card is cheaper than the 10G cards (115 < 200), then let’s buy this one! And if my NAS has a 25G networking card, it would be a shame if my router only had 10G, right? Sure, SFP28 is backwards compatible with SFP+, but still.

Thus, I gave up on my “buy an off-the-shelf appliance and be done with it” plans and started researching a custom router build. Spoiler: at the end, it was way more expensive than the DEC740. But I learnt a lot about the hardware, and it was enormous fun doing the ~~shopping~~ research for parts and finally the assembly!

Settling on a NUC

My first idea was to build a tower-sized computer. Michael Stapelberg’s blog post drew my attention to the AMD Ryzen 7 Pro 5750GE, with only 35W TDP! Unfortunately, it was out-of-stock wherever I looked.

Later, Scott Diers’s build with the Intel NUC 9 Pro inspired me to take a look at NUCs instead of a full tower. I liked the smaller form factor. Before reading Scott’s post, I simply wasn’t aware that there exist NUCs that have PCIe slots (and you need a PCIe slot for the network card).

The reason that these NUCs have a PCIe slot is that they are designed for gamers. We can (ab)use that: instead of putting in a graphics card, we can put in a networking card!

Sadly, the “Pro” series that Scott used does not come with a PCIe slot anymore (for the 11, 12, 13 versions). Therefore, I had to go for the more expensive “Extreme” series. The Pros cost around 500 CHF while the Extremes cost around 1000 CHF…

So let’s compare the two most relevant NUC Extremes. The i9 variant of the NUC 12 Extreme and the NUC 13 Extreme compare as follows (an i7 variant also exists):

	NUC 12 Extreme	NUC 13 Extreme
Launch	February 2022	November 2022
CPU	12th Gen i9-12900	13th Gen i9-13900K
Core count	16 cores (8P+8E), 24 threads	24 cores (8P+16E), 32 threads
TDP	65 W	125 W

In all other regards (that are relevant to me) they are similar: 1 PCIe x16 Gen5 slot, 2 Thunderbolt 4 ports, lots of USB-A ports, up to 64 GB RAM (though 12 is DDR4, 13 is DDR5), 3 M.2 slots, two RJ45 ports (2.5 Gbit/s and 10 Gbit/s).

The only difference between the i9 variant and the i7 variant (other than the CPU) is the 10 Gbit/s port, which is only present on the i9 variant. I chose the i9 variant because I wanted the additional RJ45 port.

In the end, I chose the NUC 12 Extreme simply because the NUC 13 Extreme was not available in a reasonable timeframe from any online store. Note that the NUC 12’s TDP is lower, but the TDP doesn’t say much about the idle power consumption.

Having a 16-core home router is a waste. Let’s use it for other computing tasks as well! I decided to not build a router anymore, but instead build a server that happens to be a router. In practice, this means that the bare metal is running Proxmox as a hypervisor, and one of the VMs is running OPNsense as the router and firewall.

In total the server will feature: 16 cores, 64 GB RAM, 8 TB SSDs (but as RAID-1, so only 4 TB usable). A compute-heavy complement to my existing storage-heavy NAS (which I built a few years ago).

(Granted, it is still a desktop CPU not a server CPU. It has “only” 8 performance cores, the other 8 are energy-efficient cores. An Intel Xeon or AMD Epyc CPU is a very different beast.)

The final build

Fast forward, I ended up with the following parts for the ~~router~~ server:

Part	Price
Intel NUC 12 Extreme Kit	963 CHF
2x Samsung 990 Pro SSD 4TB	2x (265 € ~ 250 CHF)
Corsair Vengeance RAM 2x 32 GB	117 CHF
Mellanox ConnectX-4 Lx	115 CHF
Σ	1695 CHF

To connect the router to the optical termination outlet (OTO) of my flat, and to upgrade my NAS to 25G and connect it to the router, I also bought the following parts:

Part	Price
LC UPC to LC APC, simplex, single mode, 2m patch cable	3.30 € ~ 3 CHF
SFP28 optical transceiver	106 € ~ 99 CHF
SFP28 DAC Cable, 2m	37 € ~ 35 CHF
Mellanox ConnectX-4 Lx	115 CHF
Σ	252 CHF

I bought all of the above parts on digitec.ch, amazon.de, and fs.com.

You can also get the fiber patch cable and the transceiver directly from init7, for 77 CHF (10G) or for 222 CHF (25G) (because SFP28 transceivers are more expensive). With a referral code, you get 111 CHF off the hardware.

If you (like me) buy the parts yourself, you need to do your research to make sure you buy the right cable and transceiver. The wall outlets (OTO) in Switzerland are generally LC APC. SFP transceivers are almost always UPC (aka. PC), but can be either SC or LC. APC is green, UPC is blue. fs.com’s blog is a great resource to learn about the various types of cables and connectors. For the transceiver, follow the specs provided by init7 here and here. People seem to buy fiber equipment either on fs.com or on flexoptics.net.

Assembling it

Here are a few photos from assembling the server.

The inside: You can see the two SSDs on the left, the big silver heat sink that hides the CPU behind it, and the two pieces of RAM on the right. The box on the very right (facing the front of the NUC) is the power supply. Lying at the front, you can also see the box with the fan that goes on top of the CPU/SSD/RAM. The two grey bars will go on top of the SSDs, meaning that you don’t need dedicated heatsinks for the SSDs (also, they wouldn’t fit).

Installing the Mellanox card in the blue PCIe slot:

The view of the back: The top RJ45 port is 10 Gbit/s, the bottom one is 2.5 Gbit/s. On the right, the two SFP28 ports of the Mellanox card. Matchstick for scale.

Installing it

This is what the NUC looks like with everything installed:

And the back: The yellow fiber cable goes into the wall and out to init7. The black Direct Attach Cable (DAC) below goes into my NAS. The blue copper cables go into switches.

Because the NUC Extreme is designed for gamers, it has LEDs (of course it does):

Luckily, there is a physical button on the underside of the NUC allowing you to turn the LEDs on and off. No need to install any drivers.

As you can see e.g. on its Digitec page, the front of the NUC usually features a skull. Weird taste, probably reaching only a subset of gamers, and an even smaller subset of the general population, but okay. Luckily, the skull is just a small semi-transparent mask. You can unscrew the front of the panel and take the skull mask out, leaving you with the full, square LED front (as seen in my picture). (I learnt this from this review of the Intel NUC 11 Extreme.) Optionally, you can cut your own thin sheet to mask the light to another shape. I haven’t tried this yet since I leave the LEDs off anyway.

Speedtest

So does it work? Does it make the internet go swoosh?

First, an iPad Pro, with a 2.5 Gbit/s USB-C to RJ45 adapter:

(Note that unlike Android, iPadOS doesn’t have a dedicated status icon in the top right for Ethernet (like for WiFi).)

Next, the system monitor on my laptop when connected with the 2.5 Gbit/s adapter (290 MiB/s = 2430 Mbit/s):

The first half shows an iperf3 test to speedtest.init7.net. The second half shows a speedtest.net test in Firefox. Clearly my CPU (a 2 core, 4 threads i5 7th gen) is the bottleneck. Speedtest.net, Firefox, the Snap sandbox, the network stack, or something else somewhere, makes the CPU go more spinny than iperf3 in a terminal does. Either I can buy a new laptop, or someone can fix the software.

Apropos iperf3 in a terminal. Here is iperf3 running on the shiny new router:

A mere 15–16 Gbit/s. Far from the 25 Gbit/s we were hoping for. But at least more than 10 Gbit/s, so not a complete waste, I guess?

Doing an iperf3 between the new router and the old NAS, both with their SFP28 cards and the SFP28 DAC cable, I also only get ~15 Gbit/s. So the bottleneck is not init7, but somewhere in my setup. I just need to find the time to track it down. Using PCI passthrough for the NIC (since OPNsense is running in a VM) already improved the performance by ~1 Gbit/s compared to using Proxmox’ virtual bridge.

Conclusion

In this post I described my journey to building a home router. Compute-wise, the new server is more powerful than originally envisioned. That’s fine, I will use it to run VMs, e.g. a remote desktop to extend the lifetime of my old laptop. Speed-wise, the router is practically capable of 15 Gbit/s, and theoretically capable of 25 Gbit/s (once I manage to tune it).

Do I need 15 Gbit/s? No. Working, reliable 2.5 Gbit/s would be nice though.

In practice, the download performance for many things is still at most 1 Gbit/s, and 1.5 Gbit/s if you are lucky. I briefly tested this by downloading various installation images (Debian, Ubuntu, LineageOS, GrapheneOS).

It is a chicken-and-egg problem. If no one has fast internet at home, no one will provide fast servers, and no one will optimise the software stacks and transport protocols to “just work”. By having the hardware for 25 Gbit/s, we have eliminated the first bottleneck. Next, we need to eliminate the software and configuration bottlenecks.

Hopefully, as more and more people have faster internet, more developers will optimise for it, and those speeds will become usable out-of-the-box. I am looking forward to the day when I can download the 10 GB folder of photos my friend shared with me from our holiday trip at 2.5 Gbit/s in 30 seconds, and not in 2 minutes at the currently realistic 0.7 Gbit/s.

And maybe even in 8 seconds at 10 Gbit/s, if I decide to spend way too much money on a Thunderbolt-to-SFP+ adapter for my laptop.
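The download times above are simple arithmetic. Here is a minimal sketch of it, assuming a decimal gigabyte (10^9 bytes) and ignoring protocol overhead; the function name transfer_seconds is made up for this example.

```python
def transfer_seconds(size_gb: float, rate_gbit_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at rate_gbit_s gigabits per second."""
    size_gbit = size_gb * 8  # 1 byte = 8 bits
    return size_gbit / rate_gbit_s


# The 10 GB photo folder at the three speeds mentioned in the text.
for rate in (0.7, 2.5, 10.0):
    print(f"{rate} Gbit/s: {transfer_seconds(10, rate):.0f} s")
```

This gives roughly 114 s, 32 s and 8 s, which matches the "2 minutes", "30 seconds" and "8 seconds" quoted above.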

How Certificates are Born (Data Structure Edition)

Fri, 28 Jul 2023 13:00:00 +0000

Authentic key distribution is a fundamental problem of public key cryptography: how do you know that a public key really belongs to an entity (the “subject”)? A MITM attacker could have intercepted and changed the public key! In other words, we need an authentic public-key-to-identity binding.

One solution is to meet up in person and verify their public key directly. This is what Signal and WhatsApp do with their Safety Numbers.

But this doesn’t scale. How would you meet up with Google to verify their website’s public key? This is where certificates come in.

A Certificate Authority (CA) verifies the identity (“thore.io”) and that this identity controls a keypair (pk, sk). The CA then signs a certificate that basically states “I, the CA, certify that the secret key belonging to the public key pk is controlled by the entity thore.io”. Now anybody who trusts that the CA is honest and that it correctly verifies identities before issuing certificates can also trust that this public-key-to-identity binding is correct.

But what is the flow of creating certificates? What are the data structures?

Certificates are often seen as this big complex beast. And correctly so! The many extensions and options mean that it gets complicated fast. But the core data structures are not that complicated.

This post tries to break down these data structures. Using the data structures to guide us, we step through the process of creating a certificate.

Step 1: Certificate Request

It starts with what is colloquially known as a “Certificate Signing Request”. It usually uses the following PKCS#10 data structures (defined in RFC 2986).

PKCS#10 looks like this:

CertificationRequestInfo ::= SEQUENCE {
    version       INTEGER { v1(0) } (v1,...),
    subject       Name,
    subjectPKInfo SubjectPublicKeyInfo{{ PKInfoAlgorithms }},
    attributes    [0] Attributes{{ CRIAttributes }}
}

CertificationRequest ::= SEQUENCE {
    certificationRequestInfo CertificationRequestInfo,
    signatureAlgorithm AlgorithmIdentifier{{ SignatureAlgorithms }},
    signature          BIT STRING
}

If subject S has a keypair and wants to obtain a certificate, it first creates a CertificationRequestInfo and fills out the subject name (“thore.io”) and the details of their public key. Then S signs the CertificationRequestInfo and together with the signature puts it into a CertificationRequest.
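To make the nesting concrete, here is a minimal sketch of this step in Python. It is not real PKCS#10: the signature is textbook RSA with a tiny toy key (p=61, q=53), repr() stands in for DER encoding, and the names create_request, signature_matches and "toy-rsa-sha256" are invented for illustration.

```python
import hashlib
from dataclasses import dataclass

# Toy textbook-RSA keypair (p=61, q=53): insecure, chosen only so the example runs.
TOY_N, TOY_E, TOY_D = 3233, 17, 2753


def digest_int(data: bytes, modulus: int) -> int:
    """SHA-256 of data, reduced into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % modulus


@dataclass(frozen=True)
class CertificationRequestInfo:
    version: int
    subject: str
    subject_pk_info: tuple  # (modulus, public exponent) of the toy key

    def encode(self) -> bytes:
        # Stand-in for DER: any deterministic byte encoding serves the sketch.
        return repr((self.version, self.subject, self.subject_pk_info)).encode()


@dataclass(frozen=True)
class CertificationRequest:
    certification_request_info: CertificationRequestInfo
    signature_algorithm: str
    signature: int


def create_request(subject: str) -> CertificationRequest:
    """Subject S fills in the info, signs it with sk, and wraps both."""
    info = CertificationRequestInfo(version=0, subject=subject,
                                    subject_pk_info=(TOY_N, TOY_E))
    signature = pow(digest_int(info.encode(), TOY_N), TOY_D, TOY_N)
    return CertificationRequest(info, "toy-rsa-sha256", signature)


def signature_matches(request: CertificationRequest) -> bool:
    """Check the signature against the public key carried inside the request."""
    info = request.certification_request_info
    modulus, exponent = info.subject_pk_info
    expected = digest_int(info.encode(), modulus)
    return pow(request.signature, exponent, modulus) == expected


request = create_request("thore.io")
print(signature_matches(request))  # prints True
```

The point of the structure: the signature covers only the inner CertificationRequestInfo, and anyone can check it using the public key that sits inside that same structure.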

Step 2: X.509 Certificate

S then sends the CertificationRequest to a CA of their choice.

Now the CA needs to verify that the subject S really controls the public key S wants to bind to its identity. It can e.g. do this using ACME (defined in RFC 8555).

When the CA is satisfied it issues the certificate. Certificates generally use the X.509 format (defined in RFC 5280).

X.509 certificates look like this:

TBSCertificate  ::=  SEQUENCE  {
    version         [0]  EXPLICIT Version DEFAULT v1,
    serialNumber         CertificateSerialNumber,
    signature            AlgorithmIdentifier,
    issuer               Name,
    validity             Validity,
    subject              Name,
    subjectPublicKeyInfo SubjectPublicKeyInfo,
    issuerUniqueID  [1]  IMPLICIT UniqueIdentifier OPTIONAL,
    subjectUniqueID [2]  IMPLICIT UniqueIdentifier OPTIONAL,
    extensions      [3]  EXPLICIT Extensions OPTIONAL
}

Certificate  ::=  SEQUENCE  {
    tbsCertificate       TBSCertificate,
    signatureAlgorithm   AlgorithmIdentifier,
    signatureValue       BIT STRING
}

The CA first fills out the TBSCertificate (“to-be-signed” certificate). It copies over the values from the CertificationRequest, assigns a new serial number, chooses a validity period, and adds any extensions that it wants.¹

The CA then signs the TBSCertificate and puts everything together into the Certificate. We now have an X.509 certificate!
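The same flow as a minimal Python sketch, under loud assumptions: the CA key is a toy textbook-RSA key (p=61, q=53), repr() stands in for DER, and the issuer name, the 90-day lifetime, the algorithm identifier and the function names issue and verify are all hypothetical.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Toy textbook-RSA key of the CA (p=61, q=53): insecure, illustration only.
CA_N, CA_E, CA_D = 3233, 17, 2753
ALGORITHM = "toy-rsa-sha256"  # made-up identifier


def digest_int(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % CA_N


@dataclass(frozen=True)
class TBSCertificate:
    version: int
    serial_number: int
    signature: str  # algorithm identifier, duplicated in Certificate below
    issuer: str
    validity: tuple  # (not_before, not_after)
    subject: str
    subject_public_key_info: tuple
    extensions: tuple = ()

    def encode(self) -> bytes:
        return repr(self).encode()  # stand-in for DER encoding


@dataclass(frozen=True)
class Certificate:
    tbs_certificate: TBSCertificate
    signature_algorithm: str
    signature_value: int


def issue(subject: str, subject_public_key: tuple, now: datetime) -> Certificate:
    """Copy the requested values, add the CA's own fields, then sign."""
    tbs = TBSCertificate(
        version=2,  # the integer 2 denotes X.509 v3
        serial_number=secrets.randbits(64),
        signature=ALGORITHM,
        issuer="Hypothetical Toy CA",
        validity=(now, now + timedelta(days=90)),
        subject=subject,
        subject_public_key_info=subject_public_key,
    )
    return Certificate(tbs, ALGORITHM, pow(digest_int(tbs.encode()), CA_D, CA_N))


def verify(cert: Certificate) -> bool:
    """Relying parties check the CA's signature over the TBSCertificate."""
    expected = digest_int(cert.tbs_certificate.encode())
    return pow(cert.signature_value, CA_E, CA_N) == expected


cert = issue("thore.io", (2773, 17), datetime(2023, 7, 28, tzinfo=timezone.utc))
print(verify(cert), cert.tbs_certificate.validity[1].date())
```

Note how the subject's public key is only carried as data here. The signature is made with the CA's key, which is what turns the structure into a statement by the CA.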

Step 3: Signed Certificate Timestamp

Back in the day, we would now be done. The CA returns the X.509 certificate to the requester subject S who can use it to prove their identity (e.g. in a TLS handshake).

However, after a series of CAs misbehaving and wrongly issuing certificates, Google introduced Certificate Transparency (CT). With CT the CAs need to insert every certificate that they issue into a public append-only log. For efficiency, this log is a Merkle Tree with the leaves being the log entries.
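For intuition, the root hash of such a log can be computed as in the following minimal sketch. It follows my reading of the Merkle Tree Hash definition used by CT (leaf hashes are prefixed with 0x00, inner nodes with 0x01, and the split point is the largest power of two smaller than the number of entries) and assumes SHA-256. The entries are placeholder byte strings, not real TransItems.

```python
import hashlib


def merkle_tree_hash(entries: list) -> bytes:
    """Root hash over a list of byte strings, one per log entry."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        # Leaf: the 0x00 prefix keeps leaf hashes distinct from node hashes.
        return hashlib.sha256(b"\x00" + entries[0]).digest()
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    left = merkle_tree_hash(entries[:k])
    right = merkle_tree_hash(entries[k:])
    return hashlib.sha256(b"\x01" + left + right).digest()


log = [b"entry-0", b"entry-1", b"entry-2"]  # placeholder entries
print(merkle_tree_hash(log).hex())
```

Appending an entry changes the root, and because the left subtree is always a full power-of-two block, the hashes of old subtrees can be reused. This is what makes inclusion and consistency proofs short.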

CT is defined in RFC 9162. The relevant data structures are:

struct {
    VersionedTransType versioned_type;
    select (versioned_type) {
        case x509_entry_v2: TimestampedCertificateEntryDataV2;
        case precert_entry_v2: TimestampedCertificateEntryDataV2;
        case x509_sct_v2: SignedCertificateTimestampDataV2;
        case precert_sct_v2: SignedCertificateTimestampDataV2;
        case signed_tree_head_v2: SignedTreeHeadDataV2;
        case consistency_proof_v2: ConsistencyProofDataV2;
        case inclusion_proof_v2: InclusionProofDataV2;
    } data;
} TransItem;

opaque TBSCertificate<1..2^24-1>;

struct {
    uint64 timestamp;
    opaque issuer_key_hash<32..2^8-1>;
    TBSCertificate tbs_certificate;
    Extension sct_extensions<0..2^16-1>;
} TimestampedCertificateEntryDataV2;

struct {
    LogID log_id;
    uint64 timestamp;
    Extension sct_extensions<0..2^16-1>;
    opaque signature<1..2^16-1>;
} SignedCertificateTimestampDataV2;

First the CA sends the X.509 certificate to a CT log operator. (Alternatively the CA can also send a “precertificate”. This signals the CA’s binding intent to later issue a certificate.)

The log operator extracts the TBSCertificate from the X.509 certificate (after verifying the signature). The log operator then builds the TimestampedCertificateEntryDataV2 and puts it into a TransItem of type x509_entry_v2. This TransItem will eventually be inserted into the log (as one of the Merkle Tree leaves).

The log operator then creates a “Signed Certificate Timestamp (SCT)”: It signs the TransItem and puts the signature into the SignedCertificateTimestampDataV2. The timestamp and the sct_extensions are copied over from the TimestampedCertificateEntryDataV2. With this signature a log operator promises to include the corresponding TimestampedCertificateEntryDataV2 in the log.

Note that the SCT does not contain the certificate. Thus the SCT is only useful in combination with a certificate. Also, the SCT is not included in the log, only the certificate.

The log operator then returns the SCT to the CA.

Step 4: X.509 Certificate again

Chrome and Safari require CT and will only accept TLS connections with certificates that are included in two or more CT logs. Firefox does not enforce CT (at the time of writing in July 2023).

The browsers enforce this by verifying the >= 2 SCTs together with the certificate. There are three ways for the webserver to send the SCT to the browser: include the SCT directly in the certificate, or serve the SCT over OCSP, or as a TLS extension (see Section 6 of RFC 9162).

If the SCT is directly embedded into the certificate, the flow is slightly different: the CA does not submit an X.509 certificate to the log but instead it submits a precertificate. When the CA gets the SCT from the log operator, it includes the SCT in the extensions field of the X.509 certificate. Only then does the CA sign and issue the X.509 certificate.

Finally, the CA returns the X.509 certificate and the SCT to the subject.

Conclusion

This post gave a brief overview of the process of creating certificates and the data structures that the certificate data flows through.

We started with a Certificate Request, then issued an X.509 certificate, and then also obtained an SCT.

If you want to dive deeper read the linked RFCs.

If you want to inspect PEM/DER-encoded ASN.1 objects (e.g. a CSR) you can use openssl asn1parse. Alternatively der2ascii (GitHub) provides a nice hierarchical view.
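If you are curious what such tools do under the hood: DER is a nested tag-length-value encoding. Below is a minimal sketch of a walker in Python. It only handles single-byte tags and a handful of universal types, does no validation, and the sample bytes are hand-built for this example (a SEQUENCE holding INTEGER 0 and the UTF8String "thore.io"), not a real CSR.

```python
TAG_NAMES = {
    0x02: "INTEGER", 0x03: "BIT STRING", 0x05: "NULL",
    0x06: "OBJECT IDENTIFIER", 0x0C: "UTF8String",
    0x13: "PrintableString", 0x30: "SEQUENCE", 0x31: "SET",
}


def read_tlv(data: bytes, offset: int = 0):
    """Return (tag, value, next_offset) for the element starting at offset."""
    tag = data[offset]
    first_len = data[offset + 1]
    offset += 2
    if first_len < 0x80:  # short form: the byte is the length itself
        length = first_len
    else:  # long form: low 7 bits give the number of length bytes
        num_bytes = first_len & 0x7F
        length = int.from_bytes(data[offset:offset + num_bytes], "big")
        offset += num_bytes
    return tag, data[offset:offset + length], offset + length


def dump(data: bytes, depth: int = 0) -> list:
    """Render nested TLVs as indented lines, one per element."""
    lines = []
    offset = 0
    while offset < len(data):
        tag, value, offset = read_tlv(data, offset)
        name = TAG_NAMES.get(tag, f"tag 0x{tag:02x}")
        if tag & 0x20:  # constructed bit set: the value holds nested TLVs
            lines.append("  " * depth + name)
            lines.extend(dump(value, depth + 1))
        else:
            lines.append("  " * depth + f"{name}: {value.hex()}")
    return lines


sample = bytes.fromhex("300d0201000c0874686f72652e696f")
print("\n".join(dump(sample)))
```

For real certificates, stick to openssl asn1parse or der2ascii. They decode OIDs, strings and context-specific tags properly.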

¹ The TBSCertificate.signature field is redundant. Also yes, it should really be called TBSCertificate.signatureAlgorithm…

Please Don't Write Passwords to Android Logs

Wed, 10 May 2023 08:00:00 +0200

TLDR: Don’t log sensitive information. Not on Android, and not anywhere else. Still, in 2022 I found two Android apps that do exactly that.

This post starts with an introduction to logging on Android. I describe common pitfalls that lead to sensitive information being leaked to Android logs. I describe why this is bad and what impact it can have. Finally, I show two real-world apps that were/are leaking sensitive information in this way.

Logs on Android

Logs are useful because they allow developers to trace application behaviour. They allow developers to understand what happened to an application, what state it was in, what events occurred. This is useful when troubleshooting.

Apps on Android usually write logs by calling one of the methods of the android.util.Log class:

Log.v("MY_TAG", "Something happened");

These print statements are collected into a log stream. It’s like printing to stdout, but separate.

There are different log levels: verbose, debug, info, warning, error.

Normally, apps can only read their own logs, but not other apps’ logs. However, certain (system) apps can read all logs if they have the READ_LOGS permission (see below).

If debugging is enabled in the phone’s developer settings, you can view all logs on a computer using the ADB command adb logcat or in the logcat tab in Android Studio. Logcat can filter logs by tag and by log level.

Common Pitfalls

There are two common pitfalls that I have seen Android developers fall into:

Pitfall 1: The log level DEBUG is only another tag to filter the stream when viewing logs.

Logs written with Log.d are included in builds with build type “release”. Even when BuildConfig.DEBUG == false these log statements are included and will be printed!

Developers, you need to manually wrap these Log.d statements in an if (BuildConfig.DEBUG) { ... }. Or better, you should strip them completely from the release apk/dexfile (which is non-trivial).

Google could make this more clear, or make it easier for developers to strip DEBUG level statements.

Pitfall 2: Including network logging in release builds.

Logging network requests is useful during development, since you don’t need to set up a proxy and break open TLS. And with OkHttp’s HttpLoggingInterceptor it is as easy as adding a handful of lines:

HttpLoggingInterceptor logging = new HttpLoggingInterceptor(); // add this
logging.setLevel(HttpLoggingInterceptor.Level.BODY); // add this (the default level, NONE, logs nothing)
OkHttpClient client = new OkHttpClient.Builder()
  .addInterceptor(logging) // add this
  .build();

// add this to the build.gradle
implementation("com.squareup.okhttp3:logging-interceptor:1.0.0")

// and - bam! - all network requests are logged and viewable in logcat

Developers, you should only add the HttpLoggingInterceptor temporarily while debugging a specific issue. There is no real need to even commit the HttpLoggingInterceptor to git. And if you do, at least wrap it in if (BuildConfig.DEBUG) { ... } and add the dependency only to debug builds (with flavourDebugImplementation(...), see the docs).

Impact of Logging Sensitive Information

Logging sensitive data can result in data exposure to third parties.

On Android specifically, logging crosses the trust boundary between the app sandbox and the logging system.

For example, an app that uses an access token for network requests can store the token in its “shared preferences”. Despite the name, shared preferences are sandboxed and cannot be accessed by other apps. But when the access token is written to logcat it leaks outside the sandbox. Other apps can read these logs (with the appropriate permission, see below) and can now access the token!

The corresponding CWE is CWE-532 (Insertion of Sensitive Information into Log File).
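To get a feeling for how little effort it takes to pull secrets out of a log stream (or, more constructively, to check your own app's logcat output before a release), here is a minimal sketch in Python. The two regular expressions and the sample lines are assumptions made up for this example. Real logs will need patterns tailored to the app.

```python
import re

# Hypothetical patterns for two common leaks: bearer tokens and passwords.
SENSITIVE_PATTERNS = {
    "bearer token": re.compile(r"authorization:\s*bearer\s+(\S+)", re.IGNORECASE),
    "password": re.compile(r"password=([^&\s,)]+)", re.IGNORECASE),
}


def find_secrets(log_lines):
    """Return (kind, value) pairs for every match in the given log lines."""
    findings = []
    for line in log_lines:
        for kind, pattern in SENSITIVE_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((kind, match.group(1)))
    return findings


# Made-up log lines with placeholder secrets.
sample_log = [
    "I okhttp.OkHttpClient: authorization: bearer EXAMPLE-TOKEN",
    "D Action  : Logon(vertragsNr=123, password=hunter2, appInfo=...)",
    "I okhttp.OkHttpClient: cache-control: no-cache",
]
for kind, value in find_secrets(sample_log):
    print(f"{kind}: {value}")
```

Any process that can read the log stream can do the same, which is why the token should never reach logcat in the first place.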

Exploitability

Reading logs requires the android.permission.READ_LOGS permission.

User apps usually do not have this permission. It can be granted via adb (some apps use this), but for the average user this is an unrealistic hurdle.

System apps that come pre-installed sometimes have this permission.

And starting in Android 13, in addition to requiring the READ_LOGS permission, Android also prompts the user:

READ_LOGS Permission Request

But how widespread is this?

In my own test 44 system apps had the READ_LOGS permission on OxygenOS 11 (OnePlus 7, as of June 2022), and 21 apps on LineageOS 20 (without GApps, OnePlus 5, as of May 2023).

AppCensus also noted a similar problem of sensitive data exposure via logs back in April 2021. They found that 59 apps can READ_LOGS on a Xiaomi Redmi Note 9, and 89 apps on a Samsung Galaxy A11.

The following apps/packages can READ_LOGS on OxygenOS 11 (produced with this script):

android
cn.oneplus.nvbackup
com.android.dynsystem
com.android.inputdevices
com.android.keychain
com.android.localtransport
com.android.location.fused
com.android.providers.settings
com.android.server.telecom
com.android.settings
com.android.wallpaperbackup
com.dsi.ant.server
com.fingerprints.fingerprintsensortest
com.google.android.feedback
com.google.android.gms
com.google.android.gsf
com.oem.autotest
com.oem.logkitsdservice
com.oem.nfc
com.oem.oemlogkit
com.oem.rftoolkit
com.oneplus
com.oneplus.backuprestore.remoteservice
com.oneplus.brickmode
com.oneplus.camera.service
com.oneplus.config
com.oneplus.coreservice
com.oneplus.factorymode
com.oneplus.filemanager
com.oneplus.gamespace
com.oneplus.minidumpoptimization
com.oneplus.opbackup
com.oneplus.opbugreportlite
com.oneplus.orm
com.oneplus.screenrecord
com.oneplus.screenshot
com.oneplus.sdcardservice
com.oneplus.security
com.oneplus.setupwizard
com.oneplus.sound.tuner
com.qti.diagservices
com.tencent.soter.soterserver
net.oneplus.commonlogtool
net.oneplus.odm

And on LineageOS 20 (without GApps):

android
com.android.dynsystem
com.android.inputdevices
com.android.keychain
com.android.localtransport
com.android.location.fused
com.android.providers.settings
com.android.server.telecom
com.android.settings
com.android.shell
com.android.wallpaperbackup
com.dsi.ant.server
com.stevesoltys.seedvault
com.tencent.soter.soterserver
lineageos.platform
org.lineageos.lineageparts
org.lineageos.lineagesettings
org.lineageos.pocketmode
org.lineageos.settings.device
org.lineageos.settings.doze
org.lineageos.setupwizard

Notice that this includes Google Play Services (com.google.android.{gms, gsf}) and various proprietary OnePlus apps. It is unclear whether these apps do access the logs and/or send them to vendors. But they do have the permission to do so.

In addition, both OnePlus and Google Pixel devices have features that allow users to explicitly submit logs to the vendor.

OnePlus: Settings > System > Experience Improvement Programmes > System Stability program
Google Pixel: Settings > Tips & Support > Send feedback

Overall, the exploitability is low. Still, sensitive information could end up on the vendors’ servers.

Fixing it

Fixing this is easy. Just remove all logging statements (network and other) that expose sensitive information to logcat (at least in production builds).
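Removing the statements is the clean fix. Where a log line really has to stay, a different and weaker technique is to redact sensitive values before they are written. The following is a minimal sketch of that idea in Python, shown language-agnostically rather than as Android code. The key list, the REDACTED placeholder, the example URL and the function name redact_url are all assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that must never reach a log.
SENSITIVE_KEYS = {"password", "email", "token"}


def redact_url(url: str) -> str:
    """Replace the values of sensitive query parameters before logging a URL."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


url = "https://example.com/api/login/?password=hunter2&version=2&email=a%40b.c"
print(redact_url(url))
```

A deny-list like this only catches what you thought of in advance, so it is a safety net at best. Not logging request data in production builds remains the more robust choice.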

Real-World Examples

Below I show the two real-world apps that I found logging sensitive information, prompting me to write this post.

How did I find these? In both cases, the apps were buggy. Some unrelated feature was not working for me. And when I investigate bugs the first thing I do is open logcat. This is where I went “oh dear”.

Both apps have one thing in common (besides the logging issue): they don’t have a security.txt or a Vulnerability Disclosure Policy (let alone a Bug Bounty).¹ In both cases I had to jump through hoops to find a contact. For app 1 I luckily had a shared contact at the company that I could reach out to. For app 2 I had to go through normal support.

So in 2022, it was still challenging to report vulnerabilities, even for two apps that both handle financial transactions and that both have 100K+ downloads on Google Play.

In the logcat excerpts below I replaced sensitive information with 🚫🚫🚫.

App 1: Unnamed

The first app is a financial app. It was logging passwords, but also network traffic:

D Action  : Logon(vertragsNr=🚫🚫🚫, password=🚫🚫🚫, appInfo=AppInfo(appIdentifier=🚫🚫🚫, versionNumber=42.0, buildNumber=42, deviceName=OnePlus, ONEPLUS A5000, osVersion=Android 12 (32) | Code name: REL, unlocked=true, pushServiceType=GOOGLE), isBiometricLogon=false, biometryType=null)
D Network : GET /mobile/home/context -> HomeContext(sections=[HomeSection(type=FINANZEN, position=0, title=null), HomeSection(type=SHORTCUTS, position=1, title=null), HomeSection(type=ZU_ERLEDIGEN, position=2, title=You have), HomeSection(type=MARKETING_CONTENT, position=3, title=Discover)])

This log leaks the username (“vertragsNr”) and password. Also notice how the network responses are logged as data classes. They reflect the sections on the app’s home screen. This is bad, because the app was effectively streaming the screen state to logcat.

The log level “D” (debug) demonstrates Pitfall 1.

Disclosure Timeline

Wed 2022-06-22 21:30: I reach out to my shared contact at the company.
Thu 2022-06-23 08:18: They reply and say they will find the right contact internally.
Thu 2022-06-23 10:22: The security team reaches out to me.
Thu 2022-06-23 11:35: I send them my writeup.
Thu 2022-06-23 15:22: The security team acknowledges that they reproduced and patched it and will make a new release soon.
Thu 2022-06-23 22:45: I test their fix, and verify that it works.

Within a single day the issue was acknowledged and fixed! They clearly had the internal processes in place to handle security issues, even if they did not have a public security contact yet (now they do¹).

(Why am I not naming them? Their handling was exemplary, it is fixed, there is no public benefit knowing the name. If you keep your apps up-to-date, there is nothing else you as a user need to do to protect yourself. The only public benefit is in knowing that I found more than one app, because this demonstrates that this is a recurring issue that we as a security and developer community need to be aware of.)

App 2: Circuit Laundry

The second app is the app for a laundry service in the UK, called Circuit Laundry. This app allows you to log in, pay with your credit card within the app to top up your account balance, and use that account balance to start washing machines and tumble driers.

Just like App 1, Circuit also streams all network traffic to logcat. This includes credentials (passwords, bearer tokens), personally identifiable information (first name, last name, email address), account balance, and account id.

Logs written during login:

I okhttp.OkHttpClient: --> POST https://phoneadmin.flashcashonline.com/api/user/authenticate/?password=🚫🚫🚫&version=2&email=🚫🚫🚫&platform=Android
I okhttp.OkHttpClient: Content-Length: 0
I okhttp.OkHttpClient: --> END POST (0-byte body)
I okhttp.OkHttpClient: <-- 200 https://phoneadmin.flashcashonline.com/api/user/authenticate/?password=🚫🚫🚫&version=2&email=🚫🚫🚫&platform=Android (1201ms)
I okhttp.OkHttpClient: cache-control: no-cache
I okhttp.OkHttpClient: pragma: no-cache
I okhttp.OkHttpClient: content-type: application/json; charset=utf-8
I okhttp.OkHttpClient: expires: -1
I okhttp.OkHttpClient: vary: Accept-Encoding
I okhttp.OkHttpClient: set-cookie: ARRAffinity=6eaf2cedfb705ed6ce633e6c1ba37f34686cbd9dce05635559dff6fd9e92ea1a;Path=/;HttpOnly;Secure;Domain=phoneadmin2.azurewebsites.net
I okhttp.OkHttpClient: set-cookie: ARRAffinitySameSite=6eaf2cedfb705ed6ce633e6c1ba37f34686cbd9dce05635559dff6fd9e92ea1a;Path=/;HttpOnly;SameSite=None;Secure;Domain=phoneadmin2.azurewebsites.net
I okhttp.OkHttpClient: x-aspnet-version: 4.0.30319
I okhttp.OkHttpClient: x-powered-by: ASP.NET
I okhttp.OkHttpClient: x-cache: CONFIG_NOCACHE
I okhttp.OkHttpClient: x-azure-ref: 0xYNjYwAAAAAEtq35BO3kQa1pmW2pi1HQTFRTRURHRTEzMTAAYWFiZjcyNTctMmE5NC00MDY5LTlkMGMtMWM1NTQzYjNlZWIz
I okhttp.OkHttpClient: date: Thu, 03 Nov 2022 09:03:01 GMT
I okhttp.OkHttpClient: {"Data":{"AppUserId":🚫🚫🚫,"UserIdentification":"🚫🚫🚫","HasPromotions":false,"Token":{"Value":"🚫🚫🚫","Expires":null},"PrimaryLocation":"","AccountBalance":6.80,"InternalId":🚫🚫🚫,"ExternalKey":"🚫🚫🚫","AccountName":"🚫🚫🚫","AccountOperatorID":5,"AccountMinimumPurchaseAmount":5.00,"AccountLowBalanceIndicator":5.00,"AccountCurrencyTypeID":3,"AccountCurrencyUniCode":"20AC","IsRoomViewAvailable":true,"AccountWelcomeTitle":"Welcome to Circuit","AccountWelcomeText":"Welcome to Circuit","FirstName":"Thore","LastName":"Goebel","OptInNotification":false,"OptInMarketing":false,"EmailAddress":"🚫🚫🚫","AppVersion":"2","AppPlatform":"Android","MessageForUser":""},"Success":true,"Message":"Authenticated"}
I okhttp.OkHttpClient: <-- END HTTP (1727-byte body)

Logs written during app usage later:

I okhttp.OkHttpClient: --> GET https://phoneadmin.flashcashonline.com/api/user/
I okhttp.OkHttpClient: authorization: bearer 🚫🚫🚫
I okhttp.OkHttpClient: --> END GET
I okhttp.OkHttpClient: <-- 200 https://phoneadmin.flashcashonline.com/api/user/ (217ms)
I okhttp.OkHttpClient: cache-control: no-cache
I okhttp.OkHttpClient: pragma: no-cache
I okhttp.OkHttpClient: content-type: application/json; charset=utf-8
I okhttp.OkHttpClient: expires: -1
I okhttp.OkHttpClient: vary: Accept-Encoding
I okhttp.OkHttpClient: set-cookie: ARRAffinity=6eaf2cedfb705ed6ce633e6c1ba37f34686cbd9dce05635559dff6fd9e92ea1a;Path=/;HttpOnly;Secure;Domain=phoneadmin2.azurewebsites.net
I okhttp.OkHttpClient: set-cookie: ARRAffinitySameSite=6eaf2cedfb705ed6ce633e6c1ba37f34686cbd9dce05635559dff6fd9e92ea1a;Path=/;HttpOnly;SameSite=None;Secure;Domain=phoneadmin2.azurewebsites.net
I okhttp.OkHttpClient: x-aspnet-version: 4.0.30319
I okhttp.OkHttpClient: x-powered-by: ASP.NET
I okhttp.OkHttpClient: x-cache: CONFIG_NOCACHE
I okhttp.OkHttpClient: x-azure-ref: 0yYNjYwAAAACoPdEYF50OSrWF+8q6w4IZTFRTRURHRTEzMTAAYWFiZjcyNTctMmE5NC00MDY5LTlkMGMtMWM1NTQzYjNlZWIz
I okhttp.OkHttpClient: date: Thu, 03 Nov 2022 09:03:04 GMT
I okhttp.OkHttpClient: {"Data":{"AppUserId":🚫🚫🚫,"UserIdentification":"🚫🚫🚫","HasPromotions":false,"Token":null,"PrimaryLocation":null,"AccountBalance":6.80,"InternalId":🚫🚫🚫,"ExternalKey":"🚫🚫🚫","AccountName":"JLA LAB","AccountOperatorID":5,"AccountMinimumPurchaseAmount":5.00,"AccountLowBalanceIndicator":5.00,"AccountCurrencyTypeID":3,"AccountCurrencyUniCode":"20AC","IsRoomViewAvailable":false,"AccountWelcomeTitle":null,"AccountWelcomeText":null,"FirstName":"Thore","LastName":"Goebel","OptInNotification":false,"OptInMarketing":false,"EmailAddress":"🚫🚫🚫","AppVersion":"2","AppPlatform":"Android","MessageForUser":null},"Success":true,"Message":"Authorized"}
I okhttp.OkHttpClient: <-- END HTTP (669-byte body)

Disclosure Timeline

Mon 2022-11-07: Initial outreach to {security, info, contact, postmaster}@circuit.co.uk. {security, contact}@ bounced.
Mon 2022-11-14: Outreach to info@circuitcardtopup.com (found on Google Play) and dataprotection@jla.com (found in the Privacy Policy). Interestingly, the data protection contact does not reply.
Tue 2022-11-15: Circuit support (info@circuitcardtopup.com) replies. I send them the report.
Sat 2023-02-04: I ask for an update and notify them that I plan to publish in 90 days (i.e. on 2023-05-05). I decided to count the 90 days from today since I did not mention my intent to publish in November.
Wed 2023-02-08: Circuit support replies: “I have passed this on to our app developers again to look into it for you and have escalated it. I will update you once they come back to me.”
Mon 2023-04-17: I ask for an update.
Tue 2023-04-18: Circuit support replies: “There have been no updates provided to us as of yet but I will chase up with the app development team and let you know of any updates when we get them.”
2023-05-10: I publish this blog post.
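The date arithmetic in this timeline can be checked with a few lines of Python. This is only a sanity check of the dates listed above, nothing more:

```python
from datetime import date, timedelta

# Day on which the intent to publish was announced to the vendor.
notice = date(2023, 2, 4)

# A 90-day disclosure window, counted from the day of the notice.
deadline = notice + timedelta(days=90)

# Time between the initial outreach and publication.
days_open = (date(2023, 5, 10) - date(2022, 11, 7)).days

print(deadline.isoformat())  # 2023-05-05
print(days_open)             # 184, i.e. roughly six months
```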

The issue remains unfixed, ~6 months after I first reported it, and despite the fix being super simple. The affected version is 4.1.0 which was released on 31 March 2022.²

As a user there is very little you can do. You could reboot your phone after using the Circuit app (since this wipes logcat). You could also avoid explicitly submitting logs to vendors (Google Pixel’s “Send feedback”). Or you could take your laundry somewhere else.

Conclusion

First, if you are a developer, don’t write sensitive information to logs. On Android, by writing logs you are sending information outside of the app’s sandbox. As I have shown, tens of preinstalled apps can read logs. It is an easy issue to fix and avoid. But it also easily slips through PR reviews, and there are two common pitfalls for developers.
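To make the first point concrete, here is a minimal sketch of redacting secrets before a request line reaches a log. It is written in Python rather than in the Kotlin/Java of an Android app, and the sets of sensitive parameter and header names are assumptions made for the example, not taken from either app.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: which query parameters and headers count as sensitive.
SENSITIVE_PARAMS = {"password", "email", "token"}
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def redact_url(url: str) -> str:
    """Replace the values of sensitive query parameters with a placeholder."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def redact_header(name: str, value: str) -> str:
    """Return a log-safe 'name: value' line."""
    if name.lower() in SENSITIVE_HEADERS:
        value = "REDACTED"
    return f"{name}: {value}"


print(redact_url("https://example.com/api/user/authenticate/?password=hunter2&version=2"))
print(redact_header("authorization", "bearer abc123"))
```

The safer default is still to disable request logging in release builds entirely; redaction only helps for the fields you remembered to list.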

Second, if you are a business, have a security contact and a security.txt. Even if you have never received a report before. Some day you will. And even if your main business is not selling software, you still need to have a security contact.

P.S.: The Expires field in the security.txt is required.
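For reference, RFC 9116 requires a Contact and an Expires field, and the file is served at /.well-known/security.txt. The sketch below renders such a minimal file; the contact address and the 180-day validity period are placeholders chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def render_security_txt(contact: str, valid_days: int = 180, now=None) -> str:
    """Render a minimal security.txt with the two required fields."""
    now = now or datetime.now(timezone.utc)
    expires = (now + timedelta(days=valid_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"Contact: {contact}\nExpires: {expires}\n"


print(render_security_txt("mailto:security@example.com"))
```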

¹ App 1 did not have a security.txt when I found this issue in June 2022. Now in May 2023 they have one. App 2 still doesn’t.
² There is also another app called “Circuit Laundry Plus”. I haven’t tested it.

Cryptographic Parameters and Bits of Security: Classical and Quantum

Sun, 12 Feb 2023 20:00:00 +0000

Did you ever wonder how the parameter size of cryptographic algorithms relates to their security strength? Or how the security strength changes with quantum computers? Or do you keep forgetting when the bits-of-security are equal to the parameter size, and when they are halved?

You’re in luck! Here are the tables:

Security Strength	Symmetric Crypto	Finite Field Crypto (DSA, DH)	Integer Factorisation Crypto (RSA)	Elliptic Curve Crypto
80	-	1024 (public key \( g^x \)) / 160 (private key \( x \))	RSA-1024	160
112	-	2048 / 224	RSA-2048	224
128	AES-128	3072 / 256	RSA-3072	256
192	AES-192	7680 / 384	RSA-7680	384
256	AES-256	15360 / 512	RSA-15360	512

Table 1: Taken from NIST’s “Recommendation for Key Management” (SP 800-57 Part 1, Section 5.6.1.1). You can find similar tables on keylength.com.

Algorithm	Parameter Size [bits]	Bits of Security (classically)	Bits of Security (quantum)
AES-128	128	128	64 = none
AES-256	256	256	128
RSA-2048	2048	112	-
RSA-3072	3072	128	-
P-256	256	128	-
P-384	384	192	-
Curve25519	256	128	-
Curve448	448	224	-
SHA-256	256	256 (preimage) / 128 (collision)	128 / 128
SHA-384	384	384 / 192	192 / 192
HMAC-SHA256	256	256	128
HMAC-SHA384	384	384	192

Table 2: Assembled by me.

My motivation for collecting this information and writing this post is two-fold:

1. To answer the “why”: Where do these numbers come from? You can find the NIST table quoted a lot around the internet. But I always missed a concise summary of why the parameter sizes and the security strengths relate in this way.
2. To cover quantum: All tables I found only contained the security strengths in the classical case.

Point 2 is already solved by Table 2 above. For the remainder of this post I will address point 1 and give an explanation of the “why”. That is, I will explain why Table 2 has these values.

This post is not meant to explain all areas in depth. Rather, it is intended as a quick, summarising overview that pulls all the concepts together in one place. It assumes prior knowledge, refreshing only the points that are important in this context.

THIS POST DOES NOT CONSTITUTE CRYPTOGRAPHIC ADVICE. DO NOT TRUST A RANDOM PERSON ON THE INTERNET. YOU SHOULD SEEK PROPER CRYPTOGRAPHIC COUNSEL INSTEAD.

Preliminaries

Let’s start with some background. We will need this when we discuss for each type of algorithm how its parameter sizes relate to its classical and quantum strength.

Representing Numbers

We represent integers \( < 2^n = N \) as bit strings of length \( n \). We use capital \( N \) for integers, and lower case \( n \) for bit string lengths.

Note that \( \frac{2^n}{2} = 2^{n-1} = \mathcal{O}(2^n) \). Also note that \( \sqrt{2^n} = 2^{n/2} \).

Parameter Size

The “parameter size” is what we can tune about an algorithm to achieve different security strengths.

In most cases the parameter size is the key size. But not always! For example, for unkeyed hash functions (such as SHA-256) the parameter size is the size of the output.

Security Strength: Bits of Security

We measure the “strength” of an algorithm in “bits of security”, or “bits” for short. An algorithm has “\( n \) bits of security” if it takes \( \mathcal{O}(2^n) \) operations to break it.

Usually, an “operation” is equal to one algorithm invocation and one comparison. For example, one invocation of SHA-256 and one comparison of the output against a target value.

Ideally, an algorithm can only be “broken” by brute-forcing, i.e. by trying all possible values. For keys of length \( n \) there are \( 2^n \) many possible keys, so an average brute-force search succeeds after \( \frac{2^n}{2} \) attempts. But as noted above, this is still \( \mathcal{O}(2^n) \). So instead of “127 bits” we say “128 bits”. Thus you can read all the bit-of-security values in the tables above with a big-\( \mathcal{O} \) in mind.
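As a quick numerical illustration of why the average case is still quoted as 128 bits (a sketch, with `bits_of_security` being a helper name made up for this example):

```python
import math


def bits_of_security(operations: int) -> float:
    """Bits of security = log2 of the number of operations needed."""
    return math.log2(operations)


n = 128
worst_case = 2**n          # try every key
average_case = 2**n // 2   # expected number of attempts

print(bits_of_security(worst_case))    # 128.0
print(bits_of_security(average_case))  # 127.0
```

The one-bit difference is a constant factor of 2, which the big-\( \mathcal{O} \) reading absorbs.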

Classical: Discrete Logarithm

The Discrete Logarithm Problem (DLP) is: given two numbers \( a, b \), find a number \( x \) such that \( b^x = a \). That is, find \( x = \log_b a \). We work in a cyclic group \( G \) of order \( N \) (i.e. there are \( N \) group elements). For cryptography \( N \) is usually prime.

There exist classical algorithms such as Pollard’s rho that solve the DLP in time \( \mathcal{O}(\sqrt{N}) = \mathcal{O}(2^{n/2}) \).

Hence classically, for parameter size \( n \) bits we only get \( n/2 \) bits of security.
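To see the \( \sqrt{N} \) behaviour on a toy example, here is a sketch using baby-step giant-step. Note that this is a different generic algorithm from Pollard’s rho: it has the same \( \mathcal{O}(\sqrt{N}) \) running time but also needs \( \mathcal{O}(\sqrt{N}) \) memory, and it is easier to follow. The group (the order-1019 subgroup of the integers modulo 2039, generated by 4) is chosen for illustration only.

```python
import math


def discrete_log(base: int, target: int, order: int, modulus: int) -> int:
    """Find x in [0, order) with base**x == target (mod modulus).

    Baby-step giant-step: about sqrt(order) multiplications and
    sqrt(order) table entries.
    """
    m = math.isqrt(order) + 1
    # Baby steps: remember base**j for j = 0..m-1.
    baby = {}
    value = 1
    for j in range(m):
        baby.setdefault(value, j)
        value = value * base % modulus
    # Giant steps: multiply target by base**(-m) until we hit a baby step.
    factor = pow(base, -m, modulus)
    gamma = target % modulus
    for i in range(m):
        if gamma in baby:
            return (i * m + baby[gamma]) % order
        gamma = gamma * factor % modulus
    raise ValueError("target is not a power of base")


# Toy group: p = 2039 is a safe prime, so 4 generates a subgroup of prime order 1019.
p, q, g = 2039, 1019, 4
secret = 777
print(discrete_log(g, pow(g, secret, p), q, p))  # 777
```

With \( N = 1019 \) (about 10 bits), the search needs at most 32 baby steps and 32 giant steps, i.e. about \( 2^5 \) operations each.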

Quantum: Integer Factorisation and Discrete Logarithms – Shor’s Algorithm

Shor’s period-finding algorithm can be adapted to solve both the Discrete Logarithm Problem and the Integer Factorisation Problem. It is “efficient”, meaning that it runs in polynomial time.

Hence for quantum, for parameter size \( n \) bits we get \( \approx 0 \) bits of security.

Quantum: Brute-Force Search – Grover’s Algorithm

Grover’s algorithm speeds up unstructured search from \( \mathcal{O}(N) \) classically to \( \mathcal{O}(\sqrt{N}) \).

Hence for quantum, for parameter size \( n \) bits we only get \( n/2 \) bits of security.

Hashing: Pre-images, Collisions, and the Birthday Attack

For pre-image resistance, the problem is: given an output \( y \), find an input \( x \) such that \( h(x) = y \). Both classically and quantum the best known approach is brute-force search.

For collision resistance, the problem is to find two arbitrary inputs \( x_1, x_2 \) such that \( h(x_1) = h(x_2) \). Classically, this can be solved with a birthday attack in time \( \mathcal{O}(\sqrt{N}) = \mathcal{O}(2^{n/2}) \). For quantum, there is a paper claiming to be able to find collisions in time \( \mathcal{O}(\sqrt[3]{N}) = \mathcal{O}(2^{n/3}) \). However, Daniel Bernstein vigorously disagrees, concluding that the best practical collision attack even with quantum is still in time \( \mathcal{O}(2^{n/2}) \).
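The classical birthday bound is easy to observe on a deliberately weakened hash. The sketch below truncates SHA-256 to its first \( n \) bits (a toy construction for this example, not a real hash function) and hashes distinct inputs until two outputs coincide.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, n_bits: int) -> int:
    """First n_bits of SHA-256, as an integer (a toy n-bit hash)."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - n_bits)


def find_collision(n_bits: int):
    """Birthday attack: hash distinct inputs until two outputs coincide."""
    seen = {}
    for attempts in count(1):
        message = str(attempts).encode()
        h = truncated_hash(message, n_bits)
        if h in seen:
            return seen[h], message, attempts
        seen[h] = message


m1, m2, attempts = find_collision(32)
# For a 32-bit output, expect the attempt count to be on the order of 2**16.
print(m1, m2, attempts)
```

A pre-image search against the same 32-bit toy hash would need on the order of \( 2^{32} \) attempts instead.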

Tying It All Together

With the above in mind, we obtain the following relations that describe how the security strength \( S \) varies with the parameter size \( n \).

Symmetric crypto (AES, HMAC):
- Random symmetric key, breakable by brute force search.
- \( S = n \) classically.
- \( S = n/2 \) quantum (Grover’s).
Asymmetric crypto (RSA, DH, ECC):
- Discrete Logarithm and Integer Factorisation.
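- \( S = n/2 \) classically for generic groups such as elliptic curves (Pollard’s rho and others). RSA and finite-field DH fall well below \( n/2 \) because of sub-exponential attacks such as the number field sieve; use Table 1 for those.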
- \( S \approx 0 \) quantum (Shor’s).
Hashing:
- Pre-image resistance:
  - \( S = n \) classically.
  - \( S = n/2 \) quantum (Grover’s).
- Collision-resistance:
  - \( S = n/2 \) classically (Birthday attack).
  - \( S = n/2 \) OR \( S = n/3 \) quantum (Bernstein OR Brassard, Høyer, and Tapp).

Note that just like the value “128” in the security strength is an approximation, the \( = \) is also an approximation.

Also note that these relations and the values in Table 2 are upper bounds. They only represent the security strength under generic attacks. There can be algorithm-specific attacks. For example, SHA-1 has an output size of 160 bit, which would imply 80 bit security against generic collision attacks. However, the SHAttered attack estimates that it only needs \( \approx 2^{64} \) operations.
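The relations above can be written down as a small lookup function. This is a sketch: the function name and category labels are made up for this example, the collision case takes the \( n/2 \) side of the quantum debate, and RSA and finite-field DH are left out because their rows in Table 2 (e.g. 112 bits for RSA-2048) are read off Table 1 rather than computed from a one-line formula.

```python
def security_strength(kind: str, n: int) -> dict:
    """Upper bound on the strength in bits under generic attacks."""
    if kind == "symmetric":          # AES key, HMAC key
        return {"classical": n, "quantum": n // 2}
    if kind == "elliptic-curve":     # discrete logarithm in a generic group
        return {"classical": n // 2, "quantum": 0}
    if kind == "hash-preimage":
        return {"classical": n, "quantum": n // 2}
    if kind == "hash-collision":     # assuming no practical n/3 quantum attack
        return {"classical": n // 2, "quantum": n // 2}
    raise ValueError(f"unknown kind: {kind}")


print(security_strength("symmetric", 256))       # AES-256
print(security_strength("elliptic-curve", 448))  # Curve448
print(security_strength("hash-collision", 384))  # SHA-384
```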

What Security Strength Should You Target?

For a long time, 128 bit was commonly recommended. However, the NSA’s latest guidance in its Commercial National Security Algorithm Suite is at least 192 bit (except for RSA-3072, which is 128 bit).

Conclusion

We started this post with concrete values: a table of common cryptographic algorithms, their parameter size, and their security strength in both the classical and the quantum setting.

We ended this post with the formal relations that relate the parameter size to the security strength.

In between, we refreshed our knowledge to understand the why. We explained where these relations come from.

I hope this helps as a quick reference to look up the values, the relations, and the derivation of the relations.

Home Office Killed My Laptop Battery

Sun, 04 Dec 2022 14:00:00 +0000

This post is a collection of things I learned about batteries. This is my adventure diving into why doing too much home office killed my laptop battery, and my journey replacing it. It is not necessarily data-driven, but at least data-intuited.

Motivation

Before the pandemic, my laptop (a ThinkPad) was on and off the charger. I was walking around university, sitting in lectures, going to the library. I would charge my laptop when the battery was getting low, and then disconnect the charger again.

During the pandemic, my laptop was always sitting on my desk, connected to a monitor, a docking station — and a charger. It spent several months at full charge.

As of early 2020, my laptop was 3 years old and the battery capacity still at 85%. Over the course of the next 12 months, its capacity dropped to 65%. With a design capacity of 57 Wh, this is a drop from 48 Wh to 37 Wh. At a power consumption of 5 W, this is 2 hours less runtime! And it is a drop of 20 percentage points over 1 year, versus a drop of 15 percentage points over 3 years.
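The numbers in this paragraph follow from simple arithmetic, spelled out here as a sketch (the variable names are made up for this example):

```python
DESIGN_CAPACITY_WH = 57.0
POWER_DRAW_W = 5.0  # idle-level consumption

before_wh = DESIGN_CAPACITY_WH * 0.85   # early 2020
after_wh = DESIGN_CAPACITY_WH * 0.65    # twelve months later

lost_runtime_h = (before_wh - after_wh) / POWER_DRAW_W

print(f"{before_wh:.0f} Wh -> {after_wh:.0f} Wh, {lost_runtime_h:.1f} h less runtime")
# 48 Wh -> 37 Wh, 2.3 h less runtime
```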

Graph of Battery Capacity Over Time

Factors Impacting Battery Life

Why did this happen? What can I do to prolong the life of my laptop battery?

According to Lenovo, battery life is a function of:

age
number of charge cycles
amount of time at full charge
high temperature

In other words: anything that causes physical stress to the battery/electrons/chemicals decreases battery life. The battery “wears out” and is no longer able to hold the same amount of charge.

The “amount of time at full charge” killed my laptop battery. Sitting at 100% for a large portion of 2020 was what accelerated its death so significantly.

The “high temperature” killed my phone battery. This is a different story. But keep this in mind next summer when it is over 30°C!

Setting an upper charging limit

To protect my battery, I needed to prevent it from being at 100% for long periods of time (read: days non-stop).

To do this, I wanted to configure the battery to only charge up to a maximum value. One option is a command-line tool such as TLP (more details below). Another is a graphical tool such as KDE Plasma’s system settings:

Setting the charge limits in KDE Plasma 5.24 LTS

Personally, I have set the “value to start charging at” and the “value to stop charging at” to 75% and 80% respectively.¹ They seem like a reasonable tradeoff, while still having a nearly-full battery for when I pack up and leave my desk. 75% and 80% are also the default values that TLP sets. And Lenovo agrees: “For systems which are always connected to an AC power source, Lenovo recommends setting the upper charge limit to 80% or less.”

Now when my laptop is at the docking station for 2 days, it sits there at 80% instead of at 100%.

The added benefit is that when it sits there at 80% and I run compute-heavy tasks, the 20 or so watts come directly from the charger. It does not impact the battery, i.e. it is saving me some charging cycles (see below).

Reading Battery Statistics

The Linux kernel reads battery statistics from the hardware and exposes them. To read the current power consumption (in microwatts) and the number of full charging cycles:

cat /sys/class/power_supply/BAT0/power_now
cat /sys/class/power_supply/BAT0/cycle_count

For the former I made myself a bash alias: alias watt="cat /sys/class/power_supply/BAT0/power_now". I look at watt regularly to get an intuition of how my laptop is doing.
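If you prefer the reading in watts, the same files can be read from a short script. This is a sketch that assumes the kernel’s usual convention of reporting power_now in microwatts (tlp-stat shows the same reading converted to mW); the function names are made up for this example.

```python
from pathlib import Path

BATTERY = Path("/sys/class/power_supply/BAT0")


def read_int(path: Path) -> int:
    """Read a single integer from a sysfs attribute file."""
    return int(path.read_text().strip())


def power_watts(battery: Path = BATTERY) -> float:
    """Current power draw in watts (assumes power_now is in microwatts)."""
    return read_int(battery / "power_now") / 1_000_000


# Only print something on machines that actually have this battery.
if (BATTERY / "power_now").exists():
    print(f"{power_watts():.1f} W, {read_int(BATTERY / 'cycle_count')} cycles")
```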

I found the following rule-of-thumb numbers for my system (i5-7200U @ 2.5 GHz, 2 cores / 4 threads):

Task	Power consumption [Watt]
Idle (or slowly reading through a PDF)	5 W
Normal usage (browsing, typing, switching applications)	5.5 - 6 W
Streaming a movie	up to 10 W
Compiling on all 4 threads	20 W

I also found that screen brightness is not a significant factor (on my laptop!). At least, it is barely noticeable in the watt readings. Instead of forcing brightness down to 10% to save power, I now have it at a comfortable 30% most of the time.

For the cycle_count, note that this is the number of full cycles. I have had my new battery for 8 months (= 240 days) now. Despite using it daily, I am only at 60 cycles. One contributing factor is that my laptop now spends weekends and home-office days on my desk at the charger.

If you like graphs, KDE Plasma also ships with an Energy Monitor that shows you the historical power consumption:

KDE Energy Monitor

Using TLP to read statistics and set thresholds

TLP is a command-line tool that can read battery statistics and configure various settings, such as charging thresholds. In this section I will give a short primer on how I use TLP.

First, install it.

Now you can view the battery statistics:

$ sudo tlp-stat --battery
--- TLP 1.5.0 --------------------------------------------

+++ Battery Care
Plugin: thinkpad
Supported features: charge thresholds, recalibration
Driver usage:
* natacpi (thinkpad_acpi) = active (charge thresholds)
* tpacpi-bat (acpi_call)  = active (recalibration)
Parameter value ranges:
* START_CHARGE_THRESH_BAT0/1:  0(off)..96(default)..99
* STOP_CHARGE_THRESH_BAT0/1:   1..100(default)

+++ ThinkPad Battery Status: BAT0 (Main / Internal)
/sys/class/power_supply/BAT0/manufacturer                   = LGC
/sys/class/power_supply/BAT0/model_name                     = 01AV494
/sys/class/power_supply/BAT0/cycle_count                    =     60
/sys/class/power_supply/BAT0/energy_full_design             =  57000 [mWh]
/sys/class/power_supply/BAT0/energy_full                    =  59180 [mWh]
/sys/class/power_supply/BAT0/energy_now                     =  30780 [mWh]
/sys/class/power_supply/BAT0/power_now                      =   4883 [mW]
/sys/class/power_supply/BAT0/status                         = Discharging

/sys/class/power_supply/BAT0/charge_control_start_threshold =     75 [%]
/sys/class/power_supply/BAT0/charge_control_end_threshold   =     80 [%]
tpacpi-bat.BAT0.forceDischarge                              =      0

Charge                                                      =   52.0 [%]
Capacity                                                    =  103.8 [%]

As mentioned above, you can also obtain the power_supply values by cat-ing the files directly (this is my watt shortcut). The tlp-stat command just nicely aggregates them.
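The two percentages at the bottom of the output can be reproduced from the raw values. How TLP computes them internally is an assumption here, but the straightforward ratios give the same numbers:

```python
# Values copied from the tlp-stat output above.
energy_full_design = 57000  # mWh
energy_full = 59180         # mWh
energy_now = 30780          # mWh
power_now = 4883            # mW

charge_pct = 100 * energy_now / energy_full
capacity_pct = 100 * energy_full / energy_full_design
remaining_h = energy_now / power_now  # at the current draw

print(f"Charge    = {charge_pct:.1f} %")    # 52.0 %
print(f"Capacity  = {capacity_pct:.1f} %")  # 103.8 %
print(f"Remaining = {remaining_h:.1f} h")   # 6.3 h
```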

Next, configure the charging thresholds. To do this, create a file /etc/tlp.d/01-charge-thresholds.conf with the following content: ²

START_CHARGE_THRESH_BAT0=75
STOP_CHARGE_THRESH_BAT0=80

Make sure this file has the correct owner and correct permissions:

$ sudo chown root:root /etc/tlp.d/01-charge-thresholds.conf
$ sudo chmod 644 /etc/tlp.d/01-charge-thresholds.conf

Then apply the TLP config by running sudo tlp start. You should now see these changes when you run sudo tlp-stat -b.

What if you have a trip coming up, and you want to temporarily charge to 100%? Instead of messing with the config, just connect your charger and run:

$ sudo tlp fullcharge
Setting temporary charge thresholds for BAT0:
  stop  = 100
  start =  96
Charging starts now, keep AC connected.

sudo tlp-stat -b shows you the updated thresholds. Simply run sudo tlp start to revert back to the settings in your config file.

Note: The command tlp start takes the values from the config file, and writes them to the embedded controller (EC) via ACPI. tlp fullcharge does the same, but with the default values 96 and 100. The EC is a piece of hardware that controls the charging process. This way, charging will stop at 80% even if the laptop is turned off. But this also means that you have to manually run start or fullcharge for your changes to take effect!

For more information, read the docs, the FAQ or the manpage (man tlp).

Replacing the Battery

Now I knew how to slow down battery aging, and how to monitor the battery. But what if the battery is already dead?

Disclaimer: repeat this at your own risk. In my case, my laptop was 4 years old, so the warranty was void anyway.

When my old battery reached 65% capacity, I decided to replace it. Only having 6 hours of runtime was getting too impractical.

I ordered a replacement battery on AliExpress for 35 €, including shipping. For comparison: official replacement batteries cost 100-150 €.

Replacing the battery requires only a few screwdrivers. I recommend searching YouTube for any replacement video (even if it is not your exact model). It is a good way to get a rough view of what to expect, and to build confidence before you open up your own laptop.

Here are the steps for my ThinkPads (others should be similar):

Disconnect the charger.
Discharge the battery to 20-30%. Less electrical charge reduces the risk if something goes badly wrong.
Touch some grounded metal to ensure you are grounded as well. Prevent electrostatic discharge (ESD).
Shut down the laptop and reboot into the BIOS.
In the BIOS, go to Config > Power > Disable Built-In Battery. Then press Enter. (This ensures the power is cut and the battery is fully disconnected from the other electronics inside the laptop. You won’t be able to turn the laptop on again without a charger.)
Close the lid and turn the laptop upside down.
Remove the screws that are holding the back plate and remove the back plate.
Disconnect the internal battery cable.
Remove the screws that fix the battery in-place.
Take the old battery out, and put the new one in.
Screw the new battery in place. Then re-connect the internal battery cable.
Close the back-plate and screw it tight.
Connect a charger and boot your laptop.
Done! Proceed to calibrating the battery.

Disabling the battery in the BIOS

Calibrating the new battery

After I installed the new battery, it was time to calibrate it. Calibration means pushing the battery to both the low and the high extreme. This helps the battery controller figure out what level of charge constitutes “full charge” and “empty charge”.

This changes over the lifetime of the battery, because as the battery ages it is able to hold less charge. The controller remembered a lower charge value as “full” from the old battery. Calibrating it means teaching it that “full” is now at a higher charge.

This paper card that iFixit ships with phone replacement batteries sums up the steps:

Determining the Quality of the Replacement Battery

35 € for a laptop battery is rather cheap. Sure, I did some diligence. I compared batteries on AliExpress. I read through comments and reviews. I looked at clues indicating seller reputation and quality. But it is still a guessing game. With seller and buyer so far away, there is little accountability.

How can I verify that the cheap battery is not of low quality? What if the battery reports a higher capacity than it actually has? Can I trust the hardware readings? After all, the “Last full charge = 59.18 Wh” (see Energy Monitor screenshot above) seems strange given that “Design capacity = 57 Wh”.

It turns out that batteries are smart. ³ They have a built-in chip that reports the voltage, current, etc to the laptop. This means that the laptop relies on the values measured and reported by the battery.

This also means that a scam battery can report wrong values. For example, it can report a larger capacity than it actually has. I know that this happens with smartphones: on some cheap “offers” on AliExpress, the Android system settings claim that the phone has 16 GB RAM when in reality it has only 0.5 GB. (German news article)

Without taking the battery apart and doing more analysis, I just have to trust the battery.

One data point that I can control, though, is the overall runtime. With a nominal capacity of 57 Wh and a power consumption of 5.5 W, the new battery should last about 10 hours. So I measured how long the new battery would last during normal laptop usage. And indeed: the new battery got me back to running an entire day at university without having to charge. This has held up so far over the past 8 months. Because of this, I now also trust the watt readings – they vary with my laptop usage patterns in the way I expect them to.

I assume the 59.18 > 57 is due either to uncertainty in the physical measurement/calculation, or to a factory capacity that really is larger than nominal in order to account for manufacturing tolerances, that is, to prevent returns and angry customers.

Conclusion

In this post I explained what factors impact the lifespan of a modern laptop battery (age, number of charge cycles, amount of time at full charge, high temperature). Based on this, I also explained what you can do to prolong the lifespan (don’t keep it at 100% for long times). I also described how you can view battery statistics (use TLP). Finally, I showed how I replaced my laptop battery.

Note that I only focused on ThinkPads running Linux. Please let me know if you find similar ways to set the charging thresholds on macOS. There might be, but I haven’t researched it. Apple has Optimised Charging, but this seems to only work when charging overnight. It is not a solution if you keep your MacBook connected to your monitor via USB-C-with-power-delivery during an entire workday: it will charge to 100%.

¹ The lower bound ensures that the battery does not constantly fluctuate between “discharging” and “charging”. This would send the electrons back and forth, also wearing out the battery.
² You could also put these lines in /etc/tlp.conf. But I prefer leaving tlp.conf untouched, and instead put all my values into the tlp.d directory. This way, it is easier to distinguish what are TLP’s factory settings and what are my settings.
³ Wikipedia has more information on battery management systems. I originally learnt about the Smart Battery System (SBS) standard from this blog post about reversing the embedded controller firmware (the controller in the ThinkPad, not in the battery). That blog post also references this Black Hat paper on reversing the battery firmware, which is an equally interesting read.

FIDO's Future

Sun, 10 Jul 2022 14:00:00 +0000

In honor of World Password Day on May 5th the FIDO Alliance, together with Google, Apple and Microsoft, announced “plans to expand support for a common passwordless sign-in standard”. In other words, they (once again) publicly committed to FIDO.

This is not new. All three companies have long supported FIDO to some extent. Both Google Chrome and Microsoft Edge have had WebAuthn and CTAP2 support for a few years now. iOS has also had WebAuthn support since 14.5.

The two hot new things are:

smartphones as roaming authenticators via Bluetooth, and
multi-device credentials (“passkeys”), via cloud sync. ¹

The announcement has brought some renewed attention to FIDO. Reason enough to write down my thoughts on the current state of the FIDO ecosystem!

First, I will give a short introduction/refresher about FIDO and its concepts. Then, I will go into what I think is great about FIDO, as well as some of the problems that I see.

Background

Back in the early days, there was the password. Unfortunately, people took that too literally and chose words like “password” or “qwerty” (is that a word?) as their password.

In response to that (and recalling that good passwords should be long) we collectively decided to call them passphrases instead (phrase, as in “sentence”). Surely sentences are easier to remember! But that didn’t work either – people kept choosing “letmein”.

So now, “the industry has introduced the term passkeys” (source). “The industry” meaning Apple. ² ³

This is where FIDO comes in. The basic idea behind FIDO is that instead of user-chosen passwords we can use public key cryptography to authenticate users. Hence the word “passkey”.

Here is how FIDO replaces the registration and login processes (roughly speaking):

Upon registration, you create a public-private key pair. The server provides you with a challenge, you sign it, and – bam – you are now registered. This is trust-on-first-use.
Upon authentication (aka login), the server again sends you a challenge. You sign it, the server verifies it, and you are logged in!

Basically, this is all that FIDO is. Public key crypto and a challenge-response protocol.
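The two steps above can be sketched in a few lines. To stay within the Python standard library, toy textbook RSA stands in for the ECDSA signatures that real authenticators use, with a key built from two small Mersenne primes. This is insecure and for illustration only; real WebAuthn signatures also cover more than the bare challenge, which this sketch omits entirely.

```python
import hashlib
import secrets

# Toy textbook-RSA key from two small Mersenne primes. INSECURE: illustration only.
P, Q, E = 2**127 - 1, 2**89 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, never leaves the authenticator


def digest(challenge: bytes, n: int) -> int:
    """Hash the challenge and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n


def sign(challenge: bytes) -> int:
    """Authenticator side: sign the server's challenge with the private key."""
    return pow(digest(challenge, N), D, N)


def verify(challenge: bytes, signature: int, public_key=(N, E)) -> bool:
    """Server side: check the signature using only the public key."""
    n, e = public_key
    return pow(signature, e, n) == digest(challenge, n)


# Registration: the server stores the public key (N, E).
# Login: the server sends a fresh random challenge and verifies the response.
challenge = secrets.token_bytes(32)
response = sign(challenge)
print(verify(challenge, response))                # True
print(verify(secrets.token_bytes(32), response))  # False: signature is bound to the challenge
```

Because every login uses a fresh challenge, a recorded response cannot be replayed later.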

Unfortunately, you (the human) already have a hard time keeping 256-bit numbers in your head, let alone computing ECDSA signatures with those numbers. Luckily, all this cryptography is done for you by an authenticator. There are two main types of authenticators: First, platform authenticators, which are built into your device. Think Touch ID, Face ID, or Windows Hello. Second, roaming authenticators, that can “roam” around, between multiple devices. Think physical keys (YubiKeys, …). They communicate with your computer via USB, NFC or BLE (using the CTAP protocol). ⁴

In addition, there is WebAuthn. WebAuthn is the API that websites or desktop/mobile apps interact with (see the Mozilla docs). The browser or OS then handles the communication with the authenticator (either internally for platform authenticators, or via CTAP for roaming authenticators).

So overall, when people speak about FIDO or FIDO2, they mean WebAuthn and CTAP. Here is my amateur drawing of an overview of FIDO:

FIDO: The protocols, the entities and devices involved

For a deeper dive, see this glossary and this blog post as starting points for further reading. Adam Langley also has multiple great posts on security keys. Apple’s WWDC22 video nicely shows what the UX flows can look like.

By the way, did you know that the new SwissPass can act as a roaming authenticator over NFC? Every public transport user in Switzerland now gets a security key included in their travelcard! Since it is FIDO-compliant, you can use it to log in to any website that supports WebAuthn (on your phone, or on a laptop with an NFC reader).

What’s new?

One big UX problem for FIDO so far was that keys were bound to a single device. That meant you would need to manually enroll two or more security keys (in case you lose one), or manually enroll both your laptop and your phone. What is more, you would need to manually do this for every single service where you used FIDO. Just imagine switching phones, having to manually enroll your new phone in 10+ services again.

FIDO now proposes to solve this with multi-device keys. Effectively it lifts the requirement that a key needs to be bound to a secure hardware element, and allows it to be synced. Notably, the standard does not impose any requirements (security or other) on the nature of the sync service, and instead leaves it up to the vendor (e.g. iCloud).

Yes, this lowers the high bar that you get from hardware-bound secrets. But for the average user reusing their passwords, it is still a major step up. In high-security contexts, you can continue to use device-bound keys.

A second UX problem was the need to buy a security key. Without such a roaming authenticator that can plug into a USB port, how would you log into a new device?

This second problem is solved by specifying how smartphones can act as roaming authenticators. If you have a registered FIDO key on your smartphone, and you want to log into your email account on a workstation in a public library, you can scan a QR code on the workstation. The QR code contains information for local pairing over Bluetooth, thus proving proximity. Then your phone and the workstation will run the normal CTAP protocol, with your phone acting as a roaming authenticator. ⁵

These changes are exciting because they promise to clear the way and remove the two main blockers of mainstream adoption.

The good

The good: things that are great about FIDO and that we should do more of.

No shared secrets

One thing I love about FIDO is that it eliminates shared secrets. Instead, the secret (aka. private) key stays solely on the user’s device.

The long history of password leaks has shown that it is difficult to protect the treasure chests that password databases are. It is much easier to break into a single central server, than millions of end-user devices.

By using public key crypto we also get phishing protection: the secret never leaves the device, so there is nothing to be phished. ⁶

No human-generated secrets

With FIDO we move away from mostly human-generated secrets and towards (pseudo-)randomly generated secrets. Sure, for passwords we have mitigations like salting or special hash functions (scrypt, Argon2). But in the end, they are still band-aids working around a low-entropy source.
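
To make the "band-aid" point concrete, here is a minimal sketch using Python's standard library. The word list, the example password and the scrypt cost parameters are assumptions I picked for illustration (the cost is deliberately tiny so the demo runs fast). It shows that salting and a slow hash do not help once the secret itself is guessable, whereas a randomly generated secret simply is not in any dictionary.

```python
import hashlib
import secrets


def hash_secret(secret: bytes, salt: bytes) -> bytes:
    # Deliberately small cost parameters so the demo finishes quickly;
    # a real deployment would choose much higher ones.
    return hashlib.scrypt(secret, salt=salt, n=2**10, r=8, p=1, dklen=32)


def dictionary_attack(target_hash: bytes, salt: bytes, candidates):
    # The attacker knows the salt (it is stored next to the hash)
    # and simply tries every candidate from a word list.
    for candidate in candidates:
        if hash_secret(candidate, salt) == target_hash:
            return candidate
    return None


# Hypothetical, tiny word list standing in for a real leaked-password list.
WORDLIST = [b"123456", b"password", b"qwerty", b"letmein", b"hunter2"]

salt = secrets.token_bytes(16)

# Human-generated secret: low entropy, appears in the word list.
human_hash = hash_secret(b"hunter2", salt)
# Machine-generated secret: 256 bits from the OS random number generator.
random_hash = hash_secret(secrets.token_bytes(32), salt)

cracked_human = dictionary_attack(human_hash, salt, WORDLIST)
cracked_random = dictionary_attack(random_hash, salt, WORDLIST)
```

The slow hash only multiplies the attacker's cost per guess. It cannot add entropy that was never there.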

Industry push

Any technology can only make a difference if it has broad user adoption and widespread software support. Unfortunately, widespread adoption doesn’t just happen. Companies and people need to invest money and time. And sometimes, you need peer pressure to make a step.

We have seen this before with biometrics on laptops: Thinkpads have had fingerprint sensors for generations. Yet only after Apple pushed Touch ID on MacBooks did we start seeing improved software support (imho because people know it from Apple and also demand it from other products). Even on Linux fingerprint sensors are starting to “just work”!

With FIDO we can hope for the same effect. Google, Apple, Microsoft and Mozilla laid the foundations over the past few years (and took on the initial risk of investment!).

We are still far away from people being so used to passkeys that they start demanding it from all products. But with industry leaders committing and leading the way, we can get there.

The bad

The bad: things that I think are problematic about FIDO, and that should be talked about more.

This is not to say we should throw FIDO out of the window and start anew! In fact you will see that some problems are very familiar and keep coming up in different areas of IT.

TLS client certificates revisited

First of all, why are we inventing a new protocol?

FIDO and its use of public key cryptography inherently reminds me of TLS client certificates. In both cases, the user/client holds a private key and uses it to authenticate itself to the server.

The only difference is where the server’s trust in the private key comes from (or rather, its trust in the binding from the key to a human). In FIDO, it is trust-on-first use, i.e. the server remembers the (public key, identity)-binding. In TLS client authentication on the other hand, the server derives its trust from the certificate chain that is presented alongside the public key.

Other than that, the two are pretty similar:

Both use public key crypto.
Both can have their secret key material either stored simply on disk or in secure hardware (YubiKeys, smartcards). ⁷
Both have some form of freshness. This guarantees to the server that the client was recently alive (FIDO: challenge response, TLS: signing the session transcript).
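
The freshness property on the FIDO side can be sketched as a single-use, short-lived random challenge. This is only the server-side bookkeeping, under my own assumptions: the class name `ChallengeStore`, the 60 second lifetime and the in-memory dict are all made up for illustration, and verifying the authenticator's signature over the challenge is left out entirely.

```python
import secrets
import time


class ChallengeStore:
    """Hypothetical in-memory store for outstanding login challenges."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._pending = {}           # challenge bytes -> expiry timestamp

    def issue(self) -> bytes:
        # 32 random bytes: unguessable, so a valid response must be fresh.
        challenge = secrets.token_bytes(32)
        self._pending[challenge] = self._clock() + self._ttl
        return challenge

    def consume(self, challenge: bytes) -> bool:
        # pop() makes every challenge single-use, which blocks replays.
        expiry = self._pending.pop(challenge, None)
        return expiry is not None and self._clock() <= expiry
```

A server would call `issue()` when the login starts, and accept the signed response only if `consume()` returns `True` and the signature verifies against the stored public key.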

Which raises the question: why are we not using TLS client certificates?

Or, in other words: if nobody outside enterprise uses TLS client certs, if no consumer site offers client certs, if client cert usability is so bad – why should FIDO succeed where client certs did not?

And, more concretely: what does FIDO do better than TLS?

I already mentioned that FIDO has hardware attestation, and TLS client certs don’t. ⁷ Also, you might argue that it makes sense to separate transport and authentication into separate protocols.

On the other hand, I would argue that having both together is beneficial, since it allows you to bind the security of your channel to an identity. That is, you can run FIDO over a TLS connection that is being MITMed (assuming the TLS client trusted the server cert, or a trusted CA has been compromised) – the server won’t notice that the connection to the WebAuthn client is not private.

Also, in terms of performance it makes sense to bind transport and authentication together. In almost all cases you need a secure channel anyway to communicate further after authentication. Why spend extra round trips with FIDO if you can integrate it into TLS? At Google-scale, the extra requests will make a difference.

I would love to hear your thoughts on this question (FIDO vs TLS client certs)!

Proprietary solutions

Open standards don’t lead to open solutions.

For Android, Google implements FIDO support in their proprietary Play Services (rather than in AOSP). That means all the CustomROMs and Android-based OSes without ties to Google won’t have FIDO support for a long while.

Meanwhile, Microsoft’s announcement on World Password Day (while everybody else was announcing their push for FIDO) used the terms “Azure”, “Windows”, “Microsoft” noticeably more often than it did say “FIDO” or “WebAuthn”. So far, my impression is that they are boasting about FIDO and open standards, all while pushing their proprietary Authenticator app.

Additionally, banks love to build their own 2FA apps rather than use TOTP for their online banking logins. Will they switch to FIDO?

Furthermore, how will the new multi-device credentials work? Will users be locked into the iCloud ecosystem? Or will Apple offer a way for 1Password or Keepass to plug into its FIDO APIs and function as a passkey sync provider? This is especially critical since FIDO does not specify how multi-device keys are synced. The sync services will be the Achilles’ heel for the security of users’ passkeys.

Higher complexity for developers

Implementing FIDO is more complex than username+password. For the time being, this will continue to pose a barrier to entry.

You need to parse binary messages, handle keys, verify signatures, generate challenges, adhere to a specification, and so on. Of course, all of this will be (and should be!) outsourced into libraries – after all, you also don’t implement your own TLS code.

But even with the bit-juggling abstracted away, developers still need to learn the concepts and the terminology. It is not simply a matter of new HTTPSConnection(url). The WebAuthn API seems innocent: it is just create() and get(). But you need to understand how to prepare your inputs and how to interpret the outputs.

FIDO has many options and knobs that you can configure. What’s the difference between attestation and assertion? What are PublicKeyCredentials, and which ones should I choose? What level of user verification do I want? Do I really need to verify that counter? Should I allow multi-device credentials, or only keys that reside in TPMs?
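
To give a feeling for these knobs, here is a rough sketch of what a server might send to the browser before it calls create(). The field names follow the WebAuthn creation options as I understand them. The helper name `build_registration_options`, the chosen values and the base64url-in-JSON transport are my own assumptions, not something the spec or any particular library prescribes.

```python
import base64
import json
import secrets


def b64url(data: bytes) -> str:
    # Binary fields cannot go into JSON directly; base64url without
    # padding is a common (but not mandated) choice for transport.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_registration_options(rp_id: str, rp_name: str,
                               user_id: bytes, user_name: str) -> str:
    # Hypothetical helper: every value below is a decision the
    # developer has to understand and make.
    options = {
        "challenge": b64url(secrets.token_bytes(32)),
        "rp": {"id": rp_id, "name": rp_name},
        "user": {
            "id": b64url(user_id),
            "name": user_name,
            "displayName": user_name,
        },
        # -7 is the COSE identifier for ES256 (ECDSA with P-256 and SHA-256).
        "pubKeyCredParams": [{"type": "public-key", "alg": -7}],
        "authenticatorSelection": {
            "residentKey": "preferred",       # discoverable credential?
            "userVerification": "preferred",  # PIN/biometric required?
        },
        "attestation": "none",                # do we care about the hardware?
        "timeout": 60000,                     # milliseconds
    }
    return json.dumps(options)


options_json = build_registration_options(
    "example.org", "Example", b"user-1234", "alice")
```

On the client, the base64url fields would still need to be decoded back into binary buffers before being passed to create(), and the server has to remember the challenge to check the response later.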

I can imagine a world where web frameworks and content management systems offer FIDO support out of the box as part of the user management. But this will take time. Until then, developers will need to go through this learning curve. It is not impossible – but it is a barrier that I think is often neglected in the hype around FIDO.

Widespread support

With all the hype around FIDO: how good and how widespread will FIDO support actually be in reality?

On the authenticator side:

Firefox still does not support CTAP2 on macOS and Linux, only on Windows using Windows Hello (but they are working on it). This means that even though it exposes a WebAuthn API to JavaScript, you cannot use features like discoverable credentials. Discoverable credentials are important because they allow usernameless sign-in (as opposed to “only” passwordless). They were introduced in CTAP2. In practice, this means that you still cannot use your YubiKey for passwordless sign-in to Microsoft’s ecosystem using Firefox.

I.e. “FIDO support” in Firefox in July 2022 means “WebAuthn wrapped around CTAP1/U2F”.
Support on the Linux desktop (for desktop apps, similar to Windows Hello) will (most likely) take a while.
AOSP won’t support FIDO either any time soon (due to it being implemented in Play Services, see above).

On the website/client side:

I have over 200 entries in my password manager. Only a fraction of those offers TOTP (20%). And an even smaller fraction offers WebAuthn (5%). Given FIDO’s complexity: will FIDO surpass TOTP? Will it become the password-killer?

A real world example: ETH Zürich is planning to roll out 2FA in autumn 2022 for its single sign-on. But it will use OTP (presumably TOTP), not WebAuthn.
Almost all consumer websites offer WebAuthn merely for 2FA. That is, they are not replacing passwords. One exception is Nextcloud, which offers it for passwordless signin. Not a single website I use offers usernameless signins via WebAuthn.
On the upside, FIDO support is already significantly more widespread on consumer websites than TLS cert support is.

Research & academia

From an academic perspective, FIDO is a relatively new protocol, and research into it is just getting started.

There are some relatively recent papers formally modelling and formally verifying it (formal as in formal methods). One using ProVerif, another looking at the crypto in the computational model. Others are already presenting real-world attacks.

This reminds me of SSL/TLS and WEP/WPA: industry builds it, academia breaks it, industry patches it, repeat.

TLS 1.3 broke this cycle: it involved academia early on in the standardisation process. There was a lot of research into both the handshake and the record layer that was incorporated into the drafts. Also the research attacked the spec from multiple angles: proving cryptographic security computationally (as opposed to information theoretically), tool-based approaches (Tamarin, ProVerif), and formally verifying implementations.

This is not a central criticism of FIDO! FIDO is just starting out, you cannot expect dozens of researchers to be jumping onto it already. But if FIDO does prove itself to be here to stay and gains significant adoption, FIDO3 or FIDO4 should involve academia earlier in the process, just like TLS 1.3 did.

The name

WebAuthn has not even reached broad adoption, yet the name is already outdated: WebAuthn started out as a web standard, but with native/mobile/desktop support, it is now breaking out of the browser. Native iOS and Mac apps can use the WebAuthn protocol over native, non-browser APIs. Thus the term WebAuthn has become too narrow.

Even within FIDO (or should I say FIDO2? Or FIDO/WebAuthn?) there are legacy names that keep getting changed (e.g. resident keys aka. resident credentials aka. discoverable credentials).

In addition, it is often unclear what someone means when they talk about having “FIDO support”. Do they mean only WebAuthn? Do they mean CTAP2 as well (cf. Firefox)? What about U2F? And CTAP1? Wait, U2F and CTAP1 are the same thing.

The naming is not (yet) as messy as with USB (hello, USB 3.0 and USB 3.1 Gen 1 and USB 3.2 Gen 1x1). But it is definitely not very human friendly either – neither for consumers, nor for developers.

Will we see a broad migration from passwords to passkeys?

Mainstream adoption of passkeys depends on two questions: First: will passkeys be broadly available? And second: will users actually be willing to use them?

In terms of availability, we need both service-side and client-side support. For the service providers (social media, online shops, email providers, …) it is relatively straightforward: just pull in a WebAuthn library, consider all my discussions above, and don’t forget to get the recovery process right. On the client side, big vendors such as Microsoft, Google and Apple are finally making a push. But even they have more work to do: the reality of syncable credentials is still very much open. It is unclear if and how third-party services (such as today’s password managers) can hook into the client and sync passkeys.

And when everything is ready, all my favourite services, and even my obscure Linux distro and my smart TV support FIDO – will users adopt it? The UX of passkeys is remarkably similar to the UX of using a password manager:

One click to generate a new password/passkey.
One click to auto-fill an existing password/use the passkey.
Various features to list and manage all your passwords/keys.

Yet password managers have not taken off. If we as security professionals could never provide enough incentives to convince the broad public to use password managers – how will we be able to convince them to adopt passkey managers? How will we convince someone with an Apple iPad, a Windows laptop and an Android phone that passkeys are easier and faster than passwords?

Conclusion

Are passwords bad? Yes. Is FIDO the solution? (Some of) the industry definitely thinks so.

In reality however we are still a long way away. Even Google admits that “passwords will continue to be part of our lives” for a while. There are a lot of challenges yet to be solved.

This all sounds very pessimistic, but it shouldn’t. I am excited about replacing passwords. I am excited to see cryptography being applied to solve a real everyday problem. And I do want to be able to use my YubiKey on more than just 10 out of 200 accounts.

Apple has a history of pushing things forward, simply by adopting something and everybody else following suit. And with Microsoft and Google on board as well, there is a good chance that they can bridge the chicken-and-egg problem: websites don’t offer FIDO because there is poor client support, and clients don’t offer FIDO because it is not a feature that is widely requested among websites.

For us computer scientists and engineers there are a lot of interesting challenges ahead. Naming, discussing and understanding the good things as well as the bad things, the opportunities as well as the weaknesses, is key to tackling these challenges.

Please let me know if you found this useful, have any comments, found an error, or outright disagree with me – I would love to hear your opinion!

Yubico already wrote about it back in March and the white paper also dates earlier. Apple even tech-previewed syncing passkeys at WWDC 2021. Nevertheless, both are pretty recent developments from a daily user point-of-view, and thus worth the news! ↩︎
“Passkey” is not (yet) an official FIDO term afaik (as of May 2022). Apple coined it (in this context) at WWDC2021 afaik. ↩︎
But: don’t confuse Apple’s/FIDO’s passkeys with Bluetooth’s passkeys! (see Bluetooth Core Specification 5.3, Volume 3, Part C, Section 3.2.3) ↩︎
Interestingly, a smartphone and its Touch ID can be both: you can use it to log in to a website on your phone (same platform). Or you can use it to log you in to a website on your laptop (roaming via Bluetooth). ↩︎
For reliability reasons, Bluetooth is only used for an initial key exchange. After that, the devices communicate over the internet. ↩︎
On top of that, FIDO implementations ensure that a key is only used to log in on the intended site (much like password managers only prompt to autofill when the domain matches). This adds another layer of protection. ↩︎
One difference, however, is that FIDO has attestation to prove that a secret key is indeed stored in secure hardware. I am not aware that TLS client auth has this feature. ↩︎ ↩︎

TLS 1.3 0-RTT and the real world

Sun, 07 Nov 2021 14:30:00 +0100

Recently I was implementing some of the TLS 1.3 handshake as part of the Information Security Lab at ETH Zurich.

When working on the lab I was googling around and by chance came across this OpenSSL man page. Specifically, its “NOTES” section at the bottom. You can read it on your own, but the TLDR is: under certain operating systems and TCP settings, and given a not-too-large amount of application data, 0-RTT may inadvertently end up being 1-RTT. In other words, you go to great lengths to build a protocol that has low latency but wind up back at square one. (You can find the original issue here.)
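
If I read the man page correctly, one of the TCP settings in question is Nagle's algorithm, which holds back small writes in the hope of batching them. Assuming that is the culprit, a minimal sketch of switching it off on a plain TCP socket looks like this (the function name is made up, and whether disabling Nagle is a good trade-off depends on the application):

```python
import socket


def open_low_latency_socket() -> socket.socket:
    # Create an unconnected TCP socket and disable Nagle's algorithm,
    # so small writes are sent immediately instead of being delayed
    # until earlier data has been acknowledged.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The socket would then be connected and wrapped by the TLS library as usual. The point is only that the option lives below TLS, in a layer you might never look at.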

For me this is a reminder of two things:

Measuring performance matters (just like writing tests). Without measuring it, you wouldn’t have noticed that not all of your packets are doing 0-RTT.
You can have a nice higher level protocol, but the lower part of the stack can cause unexpected effects. Abstraction works only so far. Thus even if you generally work higher up, you need to have a solid understanding of what is happening underneath.

Beautiful UI

Sat, 23 May 2020 19:00:00 +0200

What makes an app great? A novel idea, a day-to-day problem that it solves elegantly, a smooth user experience, a beautiful interface.

A beautiful interface.

What is “beautiful”? How do we grasp that intuition and put it into something that we as designers and developers can implement in our apps?

In this post, I want to explore a number of things that I found make me stumble when I use an app. Things that I now watch out for when building an app. Things that I go to my company’s designers to ~~complain about~~ discuss. Things that I open issues and PRs with open source projects on GitHub for.

Basically, I describe a number of characteristics that a Beautiful UI has.

Disclaimer:

Firstly, I am not a designer. I have no clue about the concepts and structured approaches they teach in design school. I am writing this from the view of a user with a very keen eye, very high standards and a background in Android development.

Secondly, the last thing I want to do is to point fingers. The (anti-)examples I give are to illustrate. We simply learn more from our errors than from the things we already do right.

Lastly, I am biased towards Android, since this is what I know most about. However, the general rules should equally apply to other platforms.

Rule 1: Stick to the guidelines

Design systems exist for a reason: to provide a consistent set of rules for how things are to be done, and to save us time since a lot of design problems are already solved for us.

Material Design is the go-to choice for many Android apps. It contains a huge list of building blocks that we can use to build an app, allowing us to create a consistent and beautiful experience.

Choose a design system, and then stick to its guidelines. Be consistent. Not doing so gives the user the impression that we don’t really care about our brand’s appearance after all.

For example, Material defines what an app bar should look like. Jitsi however does not adhere to that. First, the app bar height is too small – notice how the space to the left of the back arrow is greater than the space below it. Second, the title is centered, whereas in Material for Android it should be left-aligned.

Another issue here is colouring: if we define a primary brand colour, we should use it everywhere. The ugly default green on the switches does not go well with the blue.

Jitsi also provides us with an example of a modal navigation drawer, colloquially known as the “hamburger menu” (since it hides behind a button that vaguely looks like a hamburger). To be fair, the Material specs are not particularly helpful, as they imply that (on a phone) the drawer should always take the full width minus the size of the app bar. Not even Google does this with their own apps, and there are various different articles discussing what might be the right width. Anyway, the width that Jitsi uses is far smaller than what anybody else uses. This break with practical Material Design is distracting.

Icons need to be consistent, too: ProtonVPN uses an old Holo icon amidst their Material-themed app (but hopefully not for much longer!).

Finally, leaving random, unthemed and seemingly unfinished screens in our app is not just not beautiful. It’s ugly. This is an actual settings page from the Publibike app: in the default Android theme, without any app bar or back button.

Stick to the guidelines. Everywhere.

To end on a good note, I would like to point out two apps that have done an awesome job in incorporating the latest Material Design: this files app and Threema. Case studies of well-executed and more custom design systems can be found with Slack, ASVZ and SwissCovid.

Rule 2: Be responsive

By “responsive” I mean fast, not skipping animation frames, lagging, or jumping around. This rule concerns the dynamic aspect of our interface.

Lack of responsiveness quickly results in the user experience being consciously or subconsciously worsened.

Most often this happens to apps that are not native to Android, but written in some cross-platform framework like React Native or Cordova. This is not to say that we cannot build responsive apps with them, but we need to go to greater lengths to achieve that. One cross-platform framework for which this is not true is Flutter, at least in my experience so far. I have yet to see a non-responsive Flutter app.

Rule 3: Have touch states

Touch states are the difference between interacting with a panel of glass and interacting with something natural and beautiful.

It doesn’t matter whether it’s the classic default-hover-focus-press state list or the animated Material ripple effect. I believe that touch states contribute a great deal to an app feeling “native”, “fluent” or “responsive”. In web apps the effect is even stronger: distinct hover states make the app feel much more “native”, even though it only runs in the browser. Thus touch states are closely related to responsiveness, but deserve to be considered on their own.

Unfortunately, there are numerous examples where touch states are missing. I’m sure you’ll find some in an app near you. If the user can click it and it does something, it needs a touch state.

When they are present, however, touch states should be a single colour. Not two, like in this example:

Rule 4: Align and pad

Consistent alignment gives an app a structured appearance. Misaligned items on the other hand make the app look messy and cluttered.

This effect is increased by too little padding. Remember Jitsi’s settings screen from above? Everything is crammed together, making texts harder to read and sections harder to differentiate at first glance.

One alignment example from ProtonMail:

And a very subtle centering example from Threema:

Subconsciously, something in the green header feels odd. It is only when we turn on layout bounds in the developer tools, that we see that the white content is top-aligned instead of being aligned to the horizontal center line:

Knowing this, it is easier to see (in the original image) that the green space above Hang up is smaller than the green space below. It’s like the arrow in the FedEx logo: once you know it, you can’t stop noticing it.

Summary

In this post, I described a number of points that I found to be common pitfalls when giving an app its final polish. I gave a few examples that I learned from and that continue to help me in not making the same mistakes.

These points complement Clean Code: just like Clean Code should be elegant, Beautiful UI should be delightful. Just like Clean Code should be pleasing to read for the developer, Beautiful UI should be pleasing to look at for the user.

Beautiful UI can be summarised in four rules:

Stick to the guidelines
Be responsive
Have touch states
Align and pad

Listing those is easy. Implementing them is hard.

Stitching Screenshots in GIMP

Thu, 21 May 2020 19:20:00 +0200

Today I once again found myself needing to stitch screenshots together. It has been a while since I last had to do this (back then, I used some random app). Because I started using GIMP for more and more tasks, I wanted to do this in GIMP as well now.

Luckily, I found this Reddit post. I will repeat the steps here, both as a mental aid and in case the original post disappears. Also, I have slightly adjusted the steps.

Open GIMP
File > Open the top most screenshot
File > Open as Layers... all other screenshots
∀ layers: set opacity to ~50%. (Select layer in the layer list on the left, then click opacity bar above.)
Image > Canvas Size...: liberally increase the canvas height
Press M to select the Move tool
∀ layers: select the layer and use the arrow keys to move it up/down. Use Shift+⇅ for rough moving. Zoom in on the canvas to get the alignment pixel-perfect. Increase the canvas height again if necessary.
∀ layers: set the opacity back to 100%
Image > Fit Canvas to Layers
File > Export As...

Filling the Proximity Tracing Gap

Thu, 21 May 2020 14:45:00 +0200

Tracing apps are all around the news these days. Some countries have actively used them for some time now, others are in the process of launching them, while even others are stuck in various debates.

Consequently, there has been a lot of press about tracing apps. However, I feel like most news sites - even the tech media - have failed to underline some crucial points on how these apps (could) work. Sure, some have noted, maybe even explained them. But many have not, and those that have did not sufficiently stress their importance.

What do I mean by important? Important, in the sense that understanding them will enable you to develop a much better mental model of what tracing apps are. Important, because understanding them enables you to better participate in the current debate on tracing apps.

My goal is to explain the concepts, while striking a balance between too little detail and overwhelming the reader. If you want little detail, stay with the mainstream media. If you want to really dive into the topic, read the various white papers. (Especially if you are studying Computer Science. They are a great case study of security protocols.)

A note before I start:

I won’t go into the pros and cons of tracing apps and the concerns they raise. I will simply try to explain a couple technical points.
Whenever I speak of “tracing apps”, I mean tracing apps based on the DP^3T system. It is the one currently being implemented in Switzerland (among other countries). I cannot speak for other approaches.
Contact tracing or proximity tracing? The Swiss Federal Office for Public Health FOPH as well as Google and Apple (mostly) use the term proximity tracing, so for the sake of consistency I will, too.
I am going to make simplifications in order to strike a middle ground.

1. What data is uploaded to the server?

TLDR:

always: random seed
opt-in: contact event data only for contacts with infected patients
never: location data, contact event data for uninfected contacts

Some terminology: a “contact” is not an entry in your address book, but rather the event of being in close proximity with another person for some time.

Understanding what data is uploaded and why requires a brief explanation of what happens during your use of the app.

In the design described in section 2 of the DP^3T white paper, your app creates a long and random (and thus unguessable) sequence, the so-called seed. That seed is initially secret and only stored on your phone. From that seed a list of so-called ephemeral IDs is derived. Note that the other way around is infeasible: given an ephemeral ID you cannot compute the seed. Also note that without knowledge of the seed, you cannot learn whether or not two given IDs come from the same seed, i.e. come from the same user.

While the app is running, it sends out these ephemeral IDs, going through the list one by one. One ID is broadcast for some time, then the next, and so on. Regularly changing the IDs is what makes it harder to track you, since for anyone but you, the IDs look random. Meanwhile, your app stores all the other apps’ ephemeral IDs that it receives.

If and only if you test positive (and you enter the code the test center gives you), then the app uploads your seed to a central server, usually run by the local health authority. (Then, your app generates a new seed to start fresh.)

One reason for uploading the seed is to save data: the list of ephemeral IDs is huge since you change them regularly. And you do need to change IDs in order to prevent tracking – recall that there is no obvious connection between any two IDs.

The server collects a day’s worth of seeds from newly infected patients. Once a day, your app downloads the list of seeds of infected patients.

Then locally on your phone, your app goes through this list of “infected seeds”. For each seed it computes the corresponding ephemeral IDs, just like the infected user did to send them out. Your app then checks whether it previously “saw” any one of them. Under certain conditions (e.g. if it finds a minimum number of contact events with infected persons) it issues a warning. Thus, only now your app can make sense of the ephemeral IDs that it “met”.
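
The flow above can be sketched in a few lines of Python. This is a simplification under my own assumptions, not DP^3T's actual construction: I derive the IDs with HMAC-SHA256 over a counter, truncate them to 16 bytes, and assume 96 IDs per seed. All function names are made up for illustration.

```python
import hashlib
import hmac
import secrets

IDS_PER_SEED = 96  # assumption: e.g. one ID per 15 minutes for one day


def new_seed() -> bytes:
    # Long and random, hence unguessable. Stays on the phone unless
    # the user tests positive and chooses to upload it.
    return secrets.token_bytes(32)


def ephemeral_ids(seed: bytes, count: int = IDS_PER_SEED) -> list:
    # Keyed one-way derivation: easy from seed to IDs, infeasible from
    # an ID back to the seed, and IDs look unrelated without the seed.
    return [
        hmac.new(seed, i.to_bytes(4, "big"), hashlib.sha256).digest()[:16]
        for i in range(count)
    ]


def count_matches(infected_seeds, observed_ids: set) -> int:
    # Runs locally on the phone: recompute every infected user's IDs
    # and check which of them were previously received over Bluetooth.
    matches = 0
    for seed in infected_seeds:
        matches += sum(1 for eid in ephemeral_ids(seed) if eid in observed_ids)
    return matches
```

A phone that observed three of Alice's IDs would get `count_matches([alice_seed], observed) == 3` once Alice's seed appears in the downloaded list, and zero for any seed it never met. Whether three contact events are enough to trigger a warning is a policy decision outside this sketch.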

DP^3T also describes a slight variant of the above (white paper section 3). This approach is currently not implemented by Google/Apple/Swiss PT, but it is still interesting enough to be mentioned as an outlook.

In this variant, your app would in the last step not receive a list of “infected seeds” and would not compute all the ephemeral IDs of the infected patients. Here is what happens instead:

The server still receives the seeds from the infected users like before. From them, it builds a so-called Cuckoo filter and sends that to all apps. Your app can check whether an ephemeral ID of another phone that it encountered was generated from a seed that the server put into the Cuckoo filter. Thus it can still issue a warning if needed. However it cannot in practice extract the “infected seeds” from the Cuckoo filter, thanks to the underlying maths.
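
For the curious, here is a toy Cuckoo filter to show the idea: the filter only keeps short fingerprints of the inserted items, so it can answer "have I probably seen this item?" without containing the items themselves. The parameters (1024 buckets, 4 slots each, 2-byte fingerprints) and the use of SHA-256 are my assumptions and have nothing to do with what DP^3T specifies.

```python
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # Power of two, so XOR-ing two indices stays within range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: bytes) -> bytes:
        return hashlib.sha256(b"fp" + item).digest()[:2]

    def _index(self, data: bytes) -> int:
        digest = hashlib.sha256(b"ix" + data).digest()
        return int.from_bytes(digest[:4], "big") % self.num_buckets

    def _alt_index(self, index: int, fp: bytes) -> int:
        # Partial-key cuckoo hashing: the other bucket is computable
        # from the current bucket and the fingerprint alone.
        return index ^ self._index(fp)

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item: bytes) -> bool:
        # No false negatives; false positives are possible but rare.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]
```

In this picture the server would insert the infected users' ephemeral IDs and ship the bucket table. A phone then calls `contains()` for each ID it observed.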

Furthermore, DP^3T describes that the app can upload data about your contact with infected patients (e.g. when, how long, how often). This is to support the work of epidemiologists. It is purely optional and currently not implemented in the Swiss app.

2. Where do I have privacy?

Privacy in this system has several facets:

Privacy against other users/apps

Privacy against other users is preserved.

Other apps do not learn your identity, since the only link between your random seed and your identity is the code from the test centre. But that is not sent to other users, it stays on the server.

Privacy is even stronger in DP^3T’s second variant: here, other users do not even learn the “infected seeds”. They only learn a Cuckoo filter, and from that the only thing that they can learn is whether or not they have a match.

Privacy against the central server

Privacy against the server depends on your situation:

You never test positive (or if you do, never enter the test centre code). Then no data is sent to the server.
You test positive and enter the code in the app. Then the server learns:

that you are infected. Okay, the health authority saw your positive test anyway.
your seed. Useless on its own though. Only useful to someone who met with you and can check for matching ephemeral IDs.

Crucially however, since the ephemeral IDs of your contacts do not leave your phone, the server is unable to build a social graph, a network of who is meeting whom. The only data your app sends to the server is “I am infected and this is my random sequence” but not “this is who I met”.

3. How are Google and Apple involved?

The Swiss proximity tracing app is going to use the software interface (API) that Google and Apple provide via system updates. What are the implications of this?

The idea behind using the operating system’s API is to split responsibilities: The OS handles the exchange of ephemeral IDs and the proximity measurement. Since it has much deeper access to Bluetooth, the hope is that the accuracy of the system will be better. The app can then locally query the operating system for the encounters, and handle all the logic for storing and comparing the IDs, communicating with the server, and displaying a pretty UI.

Another advantage arising from the collaboration of the two tech companies is increased compatibility, since they agreed on a common interface. Have you ever tried sending a file via Bluetooth from an Android phone to an iPhone? See what I mean.

Also, both Android and iOS give you control over which app has access to the proximity data, similar to the way you can control permissions.

Note, though, that the latest FAQ (as of writing) seems to be very careful not to say explicitly that no data will be shared with Google or Apple. It only talks about not sharing the identities of users with anyone, including Google and Apple.

Exposure logging settings in iOS 13.5

Final words

I hope this helps a little bit in understanding conceptually what is going on with these tracing apps.

Yet, in the end, it all comes down to trust. Trust that the implementation really does what the specification says. Trust that the software contains no critical bugs. Trust that the maths and formalisms of the specification are correct. How this trust is earned and lost is a different matter.

Should you find any error, feel free to contact me.

Learnings From Two Student Cluster Competitions

Wed, 27 Nov 2019 21:50:00 +0100

Having taken part in two Student Cluster Competitions now, I want to take a step back and reflect on what we learnt. That will help me organise my thoughts and will hopefully help other teams prepare for the challenge. With HPC being such a niche field, there is not much writing about the cluster competitions out there (though there does exist one book! But I have yet to read it…).

There are three significant Student Cluster Competitions: ASC in Asia, ISC in Frankfurt (Germany) and SC in the US. The last two are paired with supercomputing conferences. The oldest and “original” Supercomputing Conference (SC) has taken place since 1988 and is the largest, with over 11,000 visitors.

During each conference a Student Cluster Competition takes place: teams of 6 students build a small supercomputer with the help of their advisors and industry partners. They are given a set of benchmarks and real-world applications that they need to run within a power limit of 3 kW. Teams are then judged on the performance of their system as well as on interviews with a jury, in which they need to demonstrate their knowledge of HPC.

I was lucky to be part of RACKlette, the first team from Switzerland to take part in one of these competitions. We first competed at ISC19 in June and now again in November at SC19. Despite being newcomers we were quite successful: in Frankfurt we placed third overall and won the LINPACK award, and in Denver we placed fifth in a highly competitive field.

In this post, I will outline the main learnings our team had in the hope that they will be helpful to future teams. Among other things, I will go over what you should remember to bring with you, some tips on how to tune both your software and hardware, and finally what to expect from the competition in general.

What to bring

It’s easy, right? Your local supercomputing centre organises the shipment of the cluster, you bring your laptop, and you are all set.

Unfortunately, it’s far from that simple. Here is a - most likely incomplete - list of things you should bring to the competition:

Cabling:
- 6 Ethernet cables: each of you needs to connect to the cluster via wire.
- Ethernet-to-USB-C adapters (thank you Tim Apple)
- Power strip (multiple-socket outlet)
- CH-to-US adapter: different sockets!
Monitoring:
- External monitor: Provided at SC but not at ISC, ask CSCS to ship one with the cluster
- HDMI cable
- Raspberry Pi: Set up InfluxDB and Grafana to read out IPMI readings
- Second laptop: Always good if yours fails, plus you can use it to display the Grafana dashboard (the Pi is quite busy running the DB). Alternatively bring your iPad to be more productive with Sidecar (please can somebody build an equivalent for Linux?)
Varia:
- Caffeine: Very important at SC where the competition lasts 48 hours non-stop. The organisers also provide coke, but quickly there is only diet coke left, so buy some yourself at the local supermarket.
- Replacement SSD: One team lost their boot drive. It’s unlikely, but if it happens you want to be prepared.
- Paper clip: In case your teammate locked the rack but is now asleep and you need to reboot after a power outage.
- National flag: a Swiss flag always attracts extra love from your human fans, and more people stop by a nicely decorated booth. Looks great if blowing in the wind of the cooling fans.

Tuning applications

Tuning things is what this competition is all about. There are a thousand different means to that end, and thanks to the complexity of modern systems much of it comes down to lots of time invested and trial-and-error. But if you suddenly get a 2x speedup it can feel very worthwhile! (Remember, running inputs naively can take 6ish hours each!)

Besides getting a speedup, the other thing to keep in mind is the power limit. You cannot throw 16 Nvidia V100s at it, each consuming 250 W peak power, if the committee’s alarm goes off at 3 kW!
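The arithmetic behind that warning fits in a few lines of Python. The 3 kW limit and the 250 W per GPU come from the text above; the 1000 W reserved for CPUs, fans and everything else is a made-up placeholder, so measure your own system.

```python
POWER_LIMIT_W = 3000  # competition limit
V100_PEAK_W = 250     # peak draw per GPU


def peak_gpu_power(num_gpus, peak_w=V100_PEAK_W):
    """Worst-case power draw of the GPUs alone, in watts."""
    return num_gpus * peak_w


def max_gpus(limit_w, rest_of_system_w, peak_w=V100_PEAK_W):
    """Largest GPU count whose peak draw still fits next to the rest of the system."""
    return max(0, (limit_w - rest_of_system_w) // peak_w)


print(peak_gpu_power(16))             # 4000 W, already above the 3000 W limit
print(max_gpus(POWER_LIMIT_W, 1000))  # 8, with a hypothetical 1000 W for the rest
```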

Hardware

First, GPUs. When we competed we - like most teams - had Nvidia cards. You can use the CLI tool nvidia-smi to instruct your GPUs to run at a maximum frequency or maximum power. Try both: depending on the task, one may be faster, or one may result in a more stable power consumption.
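As a rough sketch of what those two knobs look like, the Python helpers below only assemble nvidia-smi command lines and print them. I am assuming that “maximum power” maps to the power limit option (-pl, in watts) and “maximum frequency” to application clocks (-ac, memory and graphics clock in MHz); the numbers are placeholders, not our competition settings. Valid clock pairs for your card are listed by nvidia-smi -q -d SUPPORTED_CLOCKS, and changing either setting usually requires root.

```python
import shlex


def power_limit_cmd(watts, gpu_index=None):
    """nvidia-smi call that caps the board power (-pl takes watts)."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict to one GPU
    return cmd + ["-pl", str(watts)]


def application_clocks_cmd(mem_mhz, graphics_mhz, gpu_index=None):
    """nvidia-smi call that sets application clocks (-ac takes MEM,GRAPHICS in MHz)."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    return cmd + ["-ac", f"{mem_mhz},{graphics_mhz}"]


# Placeholder values for illustration only.
print(shlex.join(power_limit_cmd(200)))
print(shlex.join(application_clocks_cmd(877, 1380, gpu_index=0)))
```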

Second, CPUs. Use cpupower frequency-info -p to read out and cpupower frequency-set to set frequencies. We wrote a script for that to quickly set frequencies across all nodes using pdsh.
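Our actual script is not reproduced here, but a minimal sketch of the idea could look like this in Python. The node names are hypothetical, and I am assuming you want to cap the maximum frequency with cpupower frequency-set -u (which normally needs root on the nodes). By default the function only returns the command it would run, so you can check it before touching the cluster.

```python
import shlex
import subprocess


def build_freq_command(nodes, max_freq_ghz):
    """Build a pdsh command that caps the CPU frequency on all given nodes."""
    if not nodes:
        raise ValueError("need at least one node")
    return [
        "pdsh", "-w", ",".join(nodes),  # -w takes a comma-separated host list
        "cpupower", "frequency-set", "-u", f"{max_freq_ghz}GHz",
    ]


def set_max_frequency(nodes, max_freq_ghz, dry_run=True):
    """Return the command line; execute it only if dry_run is False."""
    cmd = build_freq_command(nodes, max_freq_ghz)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


# Hypothetical node names.
print(set_max_frequency(["node01", "node02", "node03", "node04"], 2.6))
```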

And frequency matters! At SC19, for SST Ember the input had ten iterations, each repeatedly doing the same kind of work. When running at 2.6 GHz one iteration took 38 min on our hardware. At 2.8 GHz it only took 33 min. For ten iterations that saves you 50 minutes!
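Spelled out as a small Python calculation, using only the numbers from the paragraph above:

```python
def total_minutes(minutes_per_iteration, iterations):
    """Total runtime if every iteration takes the same time."""
    return minutes_per_iteration * iterations


ITERATIONS = 10
at_2_6_ghz = total_minutes(38, ITERATIONS)  # 380 min
at_2_8_ghz = total_minutes(33, ITERATIONS)  # 330 min
saved = at_2_6_ghz - at_2_8_ghz

print(saved)                               # 50 minutes saved
print(round(100 * saved / at_2_6_ghz, 1))  # 13.2 percent shorter runtime
```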

You can also set frequencies manually during a run. (Or automatically, if you’re so inclined to script it.) One prime example is LINPACK, which has a big power spike at the beginning and then uses less power if kept capped at the same frequency. In addition, here is an example of us manually adjusting the CPU frequency for SST Ember during SC19 (notice how each iteration has this bump at the beginning, and how close we are driving it to 3 kW):

Also be aware of turbo boost. To avoid sudden unexpected (and unnecessary) spikes over the power limit, you can limit the CPU frequency to something which the CPU can continuously maintain.

And third, cooling. Fans use power, too. During especially important runs like LINPACK it is possible (but not necessarily recommended) to slow down the fans via your motherboard’s IPMI. Luckily our Gigabyte motherboards had a web interface to do that, but there is also a CLI available.

DISCLAIMER: You MUST test this before the competition. If you set the fans too slow and your system becomes too hot during the run, it becomes unstable, components go into lets-rescue-myself mode and you need to reboot!

Temperature matters too. At SC19 we ran HPCG in the morning just fine, though reproducibly only just below the power limit. When we re-ran it in the afternoon, all three attempts spiked over the 3 kW limit (with the exact same configuration, i.e. same inputs, frequencies, fan speeds). We currently attribute this to a 1-2°C increase in temperature in the conference hall (because the IPMI was reading a 1-2°C higher CPU temperature, and our cooling fans were at fixed speeds).

Software

Try different compilers: gcc, icc, you name it. -O3 is a good starting point, plus -march=native to tell the compiler it can use all instruction set extensions of the machine you compile on, including its vector instructions. Do not overdo it with flags: at some point you will get worse results! Also be aware of flags that icc only has for compatibility reasons: e.g. use -xSKYLAKE-AVX512 instead of -march=native.

Also try different MPIs: OpenMPI, IntelMPI, MVAPICH, MPICH. According to their docs MPICH has checkpointing. We only found that out during SC19, but that sounds like something worth looking into. Note that all of them may have slightly different syntax.

Input files

Understand your input files! The organisers - especially at SC - give you inputs that are not optimal and are meant to test your understanding of the application. For example, for SST the input we got used --partitioner=linear, but there are something like five other partitioners to choose from.

Or if there are matrices involved think about how much you assign to each MPI rank.

Ideally you had a chance to play with these parameters beforehand and know what to do. However, it’s not unheard of that people (that is, I) have found out about other tuning opportunities while the competition was already going on. So don’t be afraid of a blind shot if you think it can reasonably give you more performance!

Varia

Always check the power reading on the provided official PDU! For one, it generally updates in real-time (i.e. every second) while your own or the organisers’ Grafana may only update every five to thirty seconds. Our team lost two hours of compute time between 3am and 5am because we were only watching our Grafana, which unfortunately decided to update with outdated data so we didn’t notice our job had finished. So watch the PDU!
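If you still want to rely on a dashboard, a cheap safeguard is to check how old its newest data point is. The Python sketch below is hypothetical and not something we ran at SC19: the 60 second threshold is an assumption (twice the slowest update interval mentioned above), and the timestamps are made up. How you obtain the time of the last sample depends on your monitoring setup.

```python
from datetime import datetime, timedelta


def is_stale(last_sample_time, now, max_age=timedelta(seconds=60)):
    """True if the newest data point is older than max_age."""
    return now - last_sample_time > max_age


# Made-up timestamps for illustration.
now = datetime(2019, 11, 20, 3, 30, 0)
print(is_stale(datetime(2019, 11, 20, 3, 29, 45), now))  # False: 15 s old
print(is_stale(datetime(2019, 11, 20, 3, 10, 0), now))   # True: 20 min old
```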

The Cluster Competition Experience: On the conference floor

Very exciting, you are finally there at the Supercomputing Conference! You spend a day searching for your cluster because it was shipped to a completely different booth somewhere else in the hall, but now you are all set up and ready to go!

Here are some tips:

Be ready to mingle with the other teams. Everyone is really nice and very outgoing, and chatting about how you optimise applications or what results you got for that input file is completely normal and okay. It’s great to see more people doing HPC!
Expect to be interviewed. Not only by the judges, but also by the conference organisers, your sponsors and Dan Olds. Dan publishes articles about HPC (e.g. on HPCwire) and is an incredibly kind, chatty though eloquent and entertaining character. You will notice when you see him! He interviews all teams every year. Because he has been at the cluster competitions since the beginning he knows them inside out.
At SC (and maybe at ISC in the future?) there is a power outage, planned by the committee. It is meant to test your ability to recover from such an event. This year at SC19 they took everyone by surprise by pulling the power not only once but twice!
If you can, walk around the conference hall and check out the exhibition. There is cool stuff on display, and sometimes you even get merchandise. Some booths host raffles, so there’s your chance to win a NUC!

Final words

The student cluster competition is an incredible experience. Meeting like-minded students, experiencing the conference itself, and learning a thousand things about HPC, computer systems, networking, compiling and application optimisation in the process make it a unique challenge but also a unique opportunity.

Although secretly, it’s only an international sweets exchange really: we brought Swiss chocolate and received sweets from Estonia, Poland and Shanghai in return.

This post also appeared on the RACKlette team blog

What is HPC?

Wed, 27 Nov 2019 14:36:22 +0100

Computer Science in itself is a huge field, and one of its sub-fields is HPC. HPC, short for High Performance Computing, is little known, even among Computer Scientists and even more so in the general public. You might have heard of, or seen pictures of, “supercomputers”. They fill huge rooms with aisle after aisle of cabinets (and often feature some pretty graphic on the outside).

So you might ask, what’s the difference between a supercomputer and one of Google’s datacentres?

First is their configuration: even though both have lots of cabinets stuffed with computer hardware, the hardware in a normal datacentre is set up in such a way that it functions as many “small” computers (but those are still bigger than your PC at home!). The hardware in a supercomputer on the other hand is all tied together to form one massive computer on which you can run your program.

And the second difference is their use case: common services like websites, databases or application backends usually run in datacentres. Supercomputers on the other hand are mainly used for scientific simulations. Those might be academic (simulating expanding galaxies or protein reactions) but can also be industrial (simulating airflow over a new airplane design iteration rather than having to physically build and test it in a wind tunnel). So both require a huge amount of compute power, but the first does so because it needs to be able to handle large numbers of clients and the second because it solves inherently difficult problems with brute force.

And even if you haven’t flown lately, HPC directly affects you every single day: Every few hours, a supercomputer at the Swiss National Supercomputing Centre (CSCS) in Lugano crunches the latest weather data and runs the meteorologists’ simulations to come up with a new weather forecast(*). MeteoSwiss is a leader in this area and thanks to being one of the first to run these simulations on GPUs it is able to achieve predictions for a much finer grid than neighbouring countries.

That’s a short overview of what HPC is and does. If you want to find out more have a look at the field’s two big news websites HPCwire and insideHPC.

(*) Please stop giving bad reviews in the App Store if the weather prediction is bad, it’s usually not the app developer’s fault but the meteorologist’s!