WeMine: OS for managing mining equipment at one of the world's largest crypto farms

4,500 m²

Farm area

24,000 GPUs, 2,917 ASICs

Hardware

128,459 ETH

Mined

Services

User Research, Product Strategy, Product Design

Role

I led the redesign, collaborating closely with stakeholders and the engineering team

Year

2020-2021 (1 yr 1 mo)

Platform

Web

Result

−648h/y

Cut in mining equipment downtime & low-performance periods

+$500K

Increase in annual profit

Challenge

Equipment repair time is a key metric. Every second of delay between detecting an issue and fixing it is money lost. Those seconds add up to $616,000 a year.

01

How might we

help technicians detect equipment issues in real-time?

02

How might we

minimize time from problem detection to repair initiation?

Status quo

I inherited a product from the previous team. They copied an existing solution without understanding the client's specifics. The result didn't fulfill business needs and required a redesign.

Process

Business research

Before I joined, the team rarely communicated with the farm's CEO and almost never with our users, the farm's technicians. As a result, no one properly understood what needed to be done.

To figure things out, I started by meeting with the CEO and set up regular meetings with him.

What I learned

I understood how the farm works, how it makes money, and what reduces ROI.

This helped formulate principles (first principles thinking) that became the foundation for product decisions:

Speed

  • A mining business earns money by solving mathematical puzzles for the blockchain network.

  • Whoever solves it first wins the reward.

  • Our solving speed depends on the combined computing power of all our devices.

  • Device issues happen constantly.

  • Each issue reduces power, which slows our solving speed and, with it, our income.

  • We can't prevent issues from arising, but can influence how quickly we fix them.

  • Fix faster, earn more.

Efficiency

  • Each issue reduces power differently.

  • The extent of reduction depends on the type and intensity of the issue.

  • Power reduction is distributed unevenly: 20% of problematic devices are responsible for 80% of total reduction.

  • Therefore, issues are not equal and cannot be fixed in random order.
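The uneven distribution described above can be illustrated with a small sketch. It assumes each problematic device has a measurable power loss; the `DeviceIssue` type, its fields, and the sample numbers are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class DeviceIssue:
    device_id: str      # hypothetical identifier
    power_loss: float   # lost computing power, arbitrary units (assumed)


def top_contributors(issues: list[DeviceIssue], share: float = 0.8) -> list[DeviceIssue]:
    """Return the smallest set of worst devices that covers `share` of the total loss."""
    ranked = sorted(issues, key=lambda i: i.power_loss, reverse=True)
    total = sum(i.power_loss for i in ranked)
    picked: list[DeviceIssue] = []
    covered = 0.0
    for issue in ranked:
        if total == 0 or covered >= share * total:
            break
        picked.append(issue)
        covered += issue.power_loss
    return picked


# Made-up sample: 10 problematic devices, total loss of 100 units.
losses = [55, 30, 4, 3, 2, 2, 1, 1, 1, 1]
issues = [DeviceIssue(f"d{n}", loss) for n, loss in enumerate(losses)]
worst = top_contributors(issues)
print([d.device_id for d in worst])  # the few devices worth fixing first
```

In this made-up sample, two of the ten devices account for 85 of the 100 lost units, which is why a random repair order would waste most of the technicians' time.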

User research

The next step was getting to know our users. I initiated a team trip to the farm. For two days we worked alongside technicians, doing their work.

The farm building

What I learned

As noted earlier, every second of delay between detecting and fixing issues is money lost. The trip helped me discover the main causes of this time gap:

  • Device issues are detected with a delay.

  • You can't start repairs until you determine what to fix and in what order.

Solutions

Challenge 01

How might we help technicians detect equipment issues in real-time?

Smart alerting system

Issues are detected with a delay, postponing repair initiation.

The later we react, the worse it gets: from slower device performance to its breakdown.

Before

A Telegram bot keeps technicians informed about everything happening on the farm.

The problem is that it serves many scenarios.

Messages arrive about once a minute, but only every sixth is about an issue.

Reacting to every notification is impractical.

This forces technicians to check for issues directly in the product.

After

I created a separate bot only for issues.

Messages arrive based on new rules:

  • Expensive issues: send every incident.

  • Cheap issues: bundle into one message once a set quantity is reached.

Notifications come with a special sound to stand out from other chats.
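The two alerting rules can be sketched roughly as follows. This is a minimal illustration, not the bot's actual implementation: the `IssueAlerter` class, the `bundle_size` threshold, and the message wording are all assumptions, and the sketch returns message strings instead of calling any messaging API.

```python
from dataclasses import dataclass, field


@dataclass
class IssueAlerter:
    bundle_size: int = 5                               # assumed threshold for cheap issues
    pending: list[str] = field(default_factory=list)   # cheap issues waiting to be bundled

    def handle(self, description: str, expensive: bool) -> list[str]:
        """Return the messages to send right now (possibly none)."""
        if expensive:
            # Expensive issues: every incident is sent immediately.
            return [f"URGENT: {description}"]
        # Cheap issues: collect until the set quantity is reached.
        self.pending.append(description)
        if len(self.pending) < self.bundle_size:
            return []
        count = len(self.pending)
        bundle = "; ".join(self.pending)
        self.pending.clear()
        return [f"{count} minor issues: {bundle}"]


alerter = IssueAlerter(bundle_size=2)
first = alerter.handle("rig 7 offline", expensive=True)
second = alerter.handle("fan slow", expensive=False)
third = alerter.handle("temp high", expensive=False)
print(first, second, third)
```

With this shape every message that reaches the technician is actionable, which matches the goal of reacting to every notification.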

Before

Every 7.5 min
255h 30m/year

Useful notifications: 16%

Notifications ignored

After

When issues occur
~103h 25m/year

Useful notifications: 100%

React to every notification

Challenge 02

How might we minimize time from issue detection to repair initiation?

When you detect issues, you can't start repairs without solving two intermediate tasks:

  • Gather information about all current issues.

  • Determine the order of fixing them.

Issue tracker

Information about current issues is scattered throughout the product, so gathering it postpones repair initiation.

The later we react, the worse it gets: from slower device performance to its breakdown.

Before

We only show a summary of how many devices are offline. But this is just 1 of 5 issue types. The rest are scattered throughout the product.

After

Added the remaining issue types. All the information at a glance.

Before

152h 5m/year
25 times a day × 60 sec

After

20h 17m/year
25 times a day × 8 sec
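The yearly figures for the issue tracker follow from the per-day numbers. A small sketch of the arithmetic, assuming a 365-day year and rounding to the nearest minute (both are assumptions, chosen because they reproduce the figures quoted here); `annual_time` is a hypothetical helper name:

```python
DAYS_PER_YEAR = 365  # assumption: the farm is monitored every day of the year


def annual_time(times_per_day: int, seconds_each: int) -> tuple[int, int]:
    """Return the yearly time as (hours, minutes), rounded to the nearest minute."""
    total_seconds = times_per_day * seconds_each * DAYS_PER_YEAR
    total_minutes = round(total_seconds / 60)
    return divmod(total_minutes, 60)


before = annual_time(25, 60)  # 25 checks a day, 60 seconds each
after = annual_time(25, 8)    # 25 checks a day, 8 seconds each
print(before, after)          # (152, 5) (20, 17)
```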

Profit-based device organization

Devices should be fixed in order of profit impact.

To determine the order, you need to find and compare data from hundreds of devices across dozens of folders.

No one can process that volume quickly and accurately at the same time. You have to sacrifice accuracy.

Suboptimal order—lost profit.

Before

Devices are distributed across folders by their physical location in the building (floor + row).

Each folder contains all devices from a location.

Prioritization needs only the problematic devices, but they're mixed with healthy ones, which vastly outnumber them.

You need to hunt for them, assess each one's condition, and create a repair order.

After

Added a new Issues page showing only problematic devices.

The algorithm sorts devices by profit impact. A technician gets a ready-to-go plan.

No hunting, no analysis—straight to work.
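The idea behind the Issues page can be sketched as a filter followed by a sort. The source does not describe the actual algorithm, so this is only an illustration: the `Device` type, the `lost_profit_per_hour` field, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    location: str                 # physical location, e.g. "floor 1, row 3"
    lost_profit_per_hour: float   # hypothetical measure of profit impact; 0 means healthy


def repair_plan(devices: list[Device]) -> list[Device]:
    """Keep only problematic devices and order them by profit impact, highest first."""
    problematic = [d for d in devices if d.lost_profit_per_hour > 0]
    return sorted(problematic, key=lambda d: d.lost_profit_per_hour, reverse=True)


farm = [
    Device("gpu-001", "floor 1, row 1", 0.0),
    Device("gpu-002", "floor 1, row 1", 0.4),
    Device("asic-17", "floor 2, row 5", 2.5),
    Device("gpu-310", "floor 3, row 2", 0.0),
    Device("gpu-411", "floor 3, row 9", 1.1),
]
plan = repair_plan(farm)
print([d.device_id for d in plan])  # ['asic-17', 'gpu-411', 'gpu-002']
```

Location stays on each device as a plain attribute, so a technician still knows where to go, but it no longer determines the order of work.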

Before

System mirrors the building structure

376h 50m/year
42 times a day × 89 sec

From 16 to 300+ actions
Depends on incident scale

Suboptimal order

After

System organized by business needs

12h 46m/year
42 times a day × 3 sec

1 action
Doesn't depend on incident scale

Optimal order
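The three before/after pairs in this case study appear to add up to the headline result of about 648 hours a year. The check below uses only the yearly figures quoted above; treating the headline number as their sum is an inference, not something the source states, and the dictionary keys are just labels for this sketch.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'Xh Ym' figure to minutes."""
    return hours * 60 + minutes


# (before, after) yearly time for each solution, as quoted in the case study.
figures = {
    "smart alerting": (to_minutes(255, 30), to_minutes(103, 25)),
    "issue tracker": (to_minutes(152, 5), to_minutes(20, 17)),
    "profit-based organization": (to_minutes(376, 50), to_minutes(12, 46)),
}

saved_minutes = sum(before - after for before, after in figures.values())
saved = divmod(saved_minutes, 60)
print(saved)  # (647, 57), i.e. roughly 648 hours a year
```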