2022 March 28

An attempt at an economic model of the open source ecosystem

Origins

This is a topic that I’ve always wanted to write about, but never thought very clearly about. The goal is very clear: to analyze various problems in the open source ecosystem by creating an economic model.

Last time on a CHAOSScast China Podcast, we were talking about “metrics”. The topic was: bottom-up metrics or top-down metrics.

The bottom-up approach is to start with various behaviors that are already observable in the open source world, such as star, fork, commit, PR, etc., and then use some sort of calculation to get a “metric”. The problem is that it is not clear what the resulting number means. We can do all kinds of weighted calculations, but we cannot explain why those particular weights are justified.

The top-down approach starts with a goal: we want to look at the activity, health, or some other characteristic of an open source project or community. We then try to build a computation by stitching together and combining various behaviors. The problem is that there is no practical way to do this: there are a million questions we can ask, but the answers are too hard to find.

The key is that we lack an unambiguous set of concepts that can form an understandable and realistic model. More fundamentally, we lack an understanding of open source software, and indeed of the nature of open source.

Starting from “the labor theory of value” (LTV)

I recently finished reading a very thin little book, “Interpreting the Labor theory of value”, which is only 59,000 words long but explains the theory very clearly. Here are a few brief excerpts from it.

Use Value: A product of labor can be sold as a commodity because it has a certain usefulness. The usefulness of this product, i.e., the property that satisfies some need of people, is called by Marx the use value of a commodity.

Exchange Value: Exchange value is first expressed as the relationship or proportion of the amount of one use value exchanged with another.

Exchange value (supplementary): later (in the money era) expressed in terms of the exchange relationship between commodities and money, i.e., the amount of money exchanged for a certain amount of commodities.

Value: The abstract human labor condensed in the commodity is the value of the commodity, and the common thing hidden in the exchange value of the commodity is the value.

The common thing hidden in the commodity exchange relationship or exchange value - the amount of labor - is the value of the commodity, and the exchange value is the expression of the value.

Socially Necessary Labor Time: Socially necessary labor time is the labor time required to produce a certain use value under the existing normal conditions of social production and at a socially average level of labor proficiency.

Use value is the natural property of a commodity, and exchange value is the social property of a commodity. In other words, exchange value is the price, which always fluctuates slightly up or down relative to the value.

In summary, use value indicates what the commodity is used for, and value indicates how much it is worth. This “worth”, however, is not the price of the commodity but the amount of human labor condensed into it.

Thinking about the use value and value of software through the concepts of the labor theory of value

In Marx’s time, the commodities he discussed were in “physical form”, so exchange was either goods for goods (barter) or goods for money. In the era of commercial software, however, what is the value of software (either source code or an executable) that can be copied indefinitely?

On the other hand, across all the goods in the world, the use values of different goods cannot really be compared with each other. But when both goods are software, can their use values be compared?

Software executed on a CPU is essentially replacing human mental work. In that sense, the essence of software is “the condensation of human brain power”.

It follows that the use value of software lies in how much human mental work it can replace, and so the use values of different software can be compared with each other.

For example, how long does it take a human, on average, to multiply one 100-digit integer by another? How long does it take some computer? Suppose the former is 1 day and the latter is 0.1 seconds. Then the human time saved is the value generated by one run of the software that performs this multiplication.

A slightly more complicated case is the computer rendering of a 3D animated film. How much time does the computer take, and how much does it cost? How many people, how much time, and how much money would it take to render the film by hand, assuming each frame is hand-drawn? The difference is the value that the 3D rendering software generates in one film production.

Based on this understanding of the nature of software, the use value of a piece of software depends on the replacement capacity of the software itself, multiplied by how many computers (CPUs) it runs on, multiplied by how many times it runs.
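The product described above can be written down as a small calculation. This is only a sketch under assumptions: the function name `use_value_seconds` is made up for illustration, “1 day” is taken to mean 24 hours, and the 100 CPUs and 10 runs per CPU in the second call are hypothetical figures, not data.

```python
def use_value_seconds(human_seconds, machine_seconds, cpus, runs_per_cpu):
    """Human time replaced: (saving per run) x (number of CPUs) x (runs per CPU)."""
    saved_per_run = human_seconds - machine_seconds
    return saved_per_run * cpus * runs_per_cpu


# Figures from the multiplication example: 1 day by hand, 0.1 s by computer.
ONE_DAY = 24 * 60 * 60  # assumption: "1 day" means 24 hours

# One run on one CPU.
per_run = use_value_seconds(ONE_DAY, 0.1, cpus=1, runs_per_cpu=1)

# Hypothetical deployment: 100 CPUs, 10 runs each.
fleet = use_value_seconds(ONE_DAY, 0.1, cpus=100, runs_per_cpu=10)
```

The point of the sketch is that the first factor is a property of the software, while the other two depend entirely on how widely it is deployed.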

Socially necessary labor time is likewise a theoretical quantity. As I understand it, for software it can be called the average cost of developing a piece of software.

So estimating the value of a piece of software can be converted into estimating its average development cost.

There are two reference points for this: function point based development cost estimation, which I often heard about in the past, and the Constructive Cost Model (COCOMO).

For more information you can refer to: Cost estimation in software engineering
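As a rough illustration of the COCOMO route, here is a minimal sketch of the basic COCOMO model with its standard published coefficients. The function name `basic_cocomo` is an assumption for illustration, and the model needs a size estimate in thousands of lines of code (KLOC), which is itself hard to obtain for a real project.

```python
# Basic COCOMO: effort = a * KLOC**b (person-months),
#               schedule = c * effort**d (months).
COCOMO_MODES = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc, mode="organic"):
    """Return (effort in person-months, schedule in months) for a given size."""
    a, b, c, d = COCOMO_MODES[mode]
    effort = a * kloc ** b
    schedule = c * effort ** d
    return effort, schedule


# Hypothetical project sizes, for illustration only.
small_effort, small_schedule = basic_cocomo(10)
large_effort, large_schedule = basic_cocomo(100)
```

In the terms used here, the effort figure plays the role of socially necessary labor time: an average cost for software of that size, not the cost any particular team actually paid.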

The exchange value of software is, essentially, not an exchange of money for the software itself but an exchange of money for the right to use it. Because of the special nature of software, the pricing of usage rights is also more complex than the pricing of traditional physical goods.

Two models of software pricing

Once we distinguish between use value and value, we can see that there are two pricing models in the software space. One is use-value based pricing, or use-case based pricing. The other is value-based pricing, that is, pricing based on socially necessary labor time, or development cost.

Pricing based on usage is really about selling copies. If you have 100 people in your company and everyone uses my Office software, you have to pay for 100 copies. If you buy my software and run it on a hundred servers with 32 cores each, every one of them should be counted and paid for.

Pricing based on development cost is actually “custom development”. I put forward the requirements, you quote the cost, and I pay a reasonable price at the end. You make a profit, and I am spared the cost of developing it myself.

The reason two pricing models exist still lies in the replicability of software. In the case of physical goods, every unit produced requires its own share of labor. Thus socially necessary labor time –> value –> use value form a tightly connected whole. Software and other digital goods, by contrast, can be reproduced infinitely, so their use value has nearly unlimited potential, while their socially necessary labor time, and thus the value determined by it, remains finite.

Use value in the modern software ecosystem

In the past (decades ago), software was just software, and software reuse was only an ideal. Now, with the development of modern programming languages, a reuse model for software development based on the “package management system” has emerged.

Thus a final piece of software that can be used directly and a package that can be reused have become two separate concepts. A final piece of software usually includes (depends on) a great many packages, and a popular open source package is used (depended on) by many different pieces of software.

It becomes very difficult to estimate the value of an open source software package when A uses B, B uses C and D, and D uses E. If we count carefully enough, we might eventually work out that E is used, directly or indirectly, by four different software packages in different situations.

Of course, we also need to know how many computers (CPUs) these different pieces of software run on, and how many times they have been run. For commercial software it might still be possible to track this data through sales, but for open source software it would be far too difficult.
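The counting of direct and indirect users described above can be sketched as a walk over the dependency graph. This is a minimal illustration, assuming the dependency data is already available as a dictionary; the function name `transitive_dependents` is hypothetical, and the graph contains only the A, B, C, D, E edges named in the text.

```python
from collections import defaultdict


def transitive_dependents(uses):
    """Map each package to the set of packages that use it, directly or indirectly."""
    # Invert the "X uses Y" edges so we can walk from a package to its users.
    used_by = defaultdict(set)
    for pkg, deps in uses.items():
        for dep in deps:
            used_by[dep].add(pkg)

    packages = set(uses) | set(used_by)
    result = {}
    for pkg in packages:
        seen, stack = set(), [pkg]
        while stack:
            current = stack.pop()
            for user in used_by.get(current, ()):
                if user not in seen:
                    seen.add(user)
                    stack.append(user)
        result[pkg] = seen
    return result


# "A uses B, B uses C and D, and D uses E."
uses = {"A": ["B"], "B": ["C", "D"], "D": ["E"]}
dependents = transitive_dependents(uses)
```

With only these edges, `dependents["E"]` contains A, B and D. A real ecosystem adds many more edges, and this still says nothing about how many machines each package runs on or how often.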

What does it mean to be active in an open source community?

Because it is impossible to actually count the real usage of a piece of open source software, we can only rely on the assumption that “the more users there are of open source software, the more active the open source community will be”.

Of course: there are many uncertainties in such an assumption.

The more mature the software, the more users it has, but the less active those users may be in the community.
The relationship between the number of users and the amount of use depends on the application area: for server-side, infrastructure-level software, a small number of administrators may maintain a large number of running instances.
Package dependencies produce a large amount of indirect use, which does not bring any increase in activity.

Therefore, the activity calculated for an open source community is not equivalent to the “value of the open source software”.

Meaning of Value Stream Network

Frank Zhao has recently written three blog posts that also discuss related topics.

In my opinion, the value stream network treats the open source software ecosystem as a complex network and uses the PageRank algorithm to “deduce”, as far as possible, the use value of a piece of open source software.
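To make the PageRank idea concrete, here is a minimal power-iteration sketch over a dependency graph. It is not Frank Zhao's implementation: the edge direction (from a package to the packages it uses), the damping factor of 0.85, the handling of packages with no dependencies, and the function name `pagerank` are all assumptions made for illustration.

```python
def pagerank(uses, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; edges point from a package to what it uses."""
    nodes = sorted(set(uses) | {d for deps in uses.values() for d in deps})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = uses.get(node, [])
            if targets:
                # Split this node's rank evenly among the packages it uses.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No dependencies: spread the rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Same toy graph as in the text: A uses B, B uses C and D, D uses E.
ranks = pagerank({"A": ["B"], "B": ["C", "D"], "D": ["E"]})
```

In this toy graph E ends up with the highest score and A with the lowest, which matches the intuition that rank flows toward widely depended-on packages. The score is a relative ordering, not a measure of labor time.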

This is of course very valuable work, but it does not solve the problem of calculating the value of open source software, or the socially necessary labor time.

The various kinds of time spent in an open source community

Every action in an open source community is initiated by some participant and costs that participant time. Some of this time is spent on the software development process itself. The rest is spent on behaviors that exist only in the community.

Take reasonable questions and normal answers as an example. If the questions and answers are archived, they save time for those who come later. However, if a user who lacks the ability to search and find answers asks a question directly, that user forcibly takes up other people's time. This kind of behavior will be rejected by the community, because it wastes the time the whole community could have invested effectively.

So, in any case, we can see three lengths of time.

T1 = the socially necessary labor time required to develop the current version from scratch.
T2 = the labor time the community as a whole actually invested in developing the current version from scratch.
T3 = the total time the whole community has spent, from the very start up to the current version, including time spent on community-only behaviors.

Usually T1 < T2 < T3

We can interpret the ratio of T1/T2 as the community’s ability to develop (the ability to take fewer detours); the ratio of T2/T3 as the community’s ability to govern (the ability to be less disturbed). Of course, it can be further extended: for example, the time spent on handling bugs, as a percentage of development time, represents the quality of open source software.
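The two ratios are simple enough to compute once the three times are known. The sketch below only illustrates the arithmetic: the function name, the dictionary keys, and the hour figures in the example call are hypothetical, and obtaining T1, T2 and T3 in the first place is the hard part.

```python
def community_ratios(t1, t2, t3):
    """Return T1/T2 (development ability) and T2/T3 (governance ability)."""
    if min(t1, t2, t3) <= 0:
        raise ValueError("T1, T2 and T3 must all be positive")
    return {
        "development_ability": t1 / t2,  # closer to 1 means fewer detours
        "governance_ability": t2 / t3,   # closer to 1 means less disturbance
    }


# Hypothetical figures in hours: T1 = 1000, T2 = 2500, T3 = 4000.
ratios = community_ratios(1000, 2500, 4000)
```

With these made-up numbers the development ability is 0.4 and the governance ability is 0.625.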

How to estimate the time spent on behaviors in a community

In the previous subsection, T1 looks relatively easy to estimate, perhaps based on function points, perhaps using COCOMO. But what about T2 and T3? The general principle is that the more complex a behavior is, the more difficult it is to estimate its average elapsed time, and the simpler the behavior, the easier it is to estimate.

For example, it is difficult to estimate the socially necessary labor time to write a piece of software with 1 million lines of code, but the time needed to write a class with get/set methods is very easy to estimate. Likewise, the socially necessary labor time for writing a novel of 1,000,000 words is very difficult to estimate (the web novel site Qidian, perhaps, already has very accurate data), but the time spent writing a 100-line document is very easy to estimate.

Similarly, many of the behaviors we observe in open source communities, whether starring, forking, creating an issue, or opening a PR, are much simpler than writing a complete piece of software. Their average time cost is correspondingly easier to estimate.

How to calculate open source community activity

We can define open source community activity as the total time an open source community attracts from people all over the world, per unit of time. This is easier to explain than a purely subjective weighting of various behaviors.

Suppose it takes a user 1 second to click on a star, and about 2 to 3 seconds to fork a repository.

When someone files an issue, we should count not the time that led up to the issue but the time spent writing it. The time before writing cannot be measured (debugging, running into all kinds of walls); the writing itself takes perhaps 2 to 3 seconds per word. Then, around this issue, some people keep checking on it and others join the discussion, until the issue is finally closed. In total that comes to about 30 to 60 minutes, or even more.

Writing code and submitting PRs can be calculated along similar lines.

From this point of view, the weight of a star is on the order of seconds, while the weight of an issue or a PR should be on the order of minutes or even hours. Calculated this way, activity becomes more reasonable.
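A time-based activity calculation along these lines might look like the sketch below. The per-event times are assumptions: 1 second per star and 3 seconds per fork follow the figures in the text, 45 minutes per issue is the midpoint of the 30 to 60 minute range, and 60 minutes per PR is a placeholder, since the text gives no number for PRs. The event counts in the example are hypothetical.

```python
# Assumed average time cost per event, in seconds.
SECONDS_PER_EVENT = {
    "star": 1,
    "fork": 3,
    "issue": 45 * 60,  # midpoint of the 30-60 minute estimate
    "pr": 60 * 60,     # placeholder: no figure is given for PRs
}


def activity_hours_per_day(event_counts, period_days):
    """Total estimated time invested in the community, in hours per day."""
    total_seconds = sum(
        SECONDS_PER_EVENT[kind] * count for kind, count in event_counts.items()
    )
    return total_seconds / 3600 / period_days


# Hypothetical month of events for one repository.
month = {"star": 1000, "fork": 100, "issue": 20, "pr": 10}
score = activity_hours_per_day(month, period_days=30)
```

With these made-up counts the repository attracts about 0.85 hours per day, and the 1000 stars contribute 1000 seconds while the 20 issues contribute 54000 seconds. The weights now have a unit and a meaning, which is the whole point.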

The significance of open source community activity

We can think of all the available time for all of humanity as a large constant. The various open source communities, and of course the Internet, SNS, short video and shopping platforms, are all competing for that time and attention.

The total amount of time programmers around the world can devote to writing code, writing documentation, answering questions, and debugging is actually limited. The ability of an open source community to attract enough attention and time investment is the basis for its success. And the open source competition between software and Internet companies is really a competition for programmers' attention and spare time. This is exactly the key point I mentioned years ago: “Open source projects should also talk about attention economy”.

Improvements for collaborative influence

In Frank’s blog, if developer d has activity \(A_{d,p1}\) on project p1 and activity \(A_{d,p2}\) on project p2, then the developer’s contribution to the collaborative association of these two projects is \(\frac{A_{d,p1}A_{d,p2}}{A_{d,p1}+A_{d,p2}}\).

In the problem section, Frank also writes: “The design of open-source collaborative networks is insensitive to activity, but it has requirements. That is, if low-cost behaviors such as star or fork are introduced into the activity, it will lead to a large number of connected relationships between projects, which in turn will lead to a decrease in the accuracy of determining the category of projects. That is, the effectiveness of current clustering relies heavily on the underlying activity design.”

Therefore, my suggestion is just one: use a time-based activity design to calculate collaborative influence. :P
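To see what the suggestion changes, the per-developer formula quoted above can be evaluated with activity measured in seconds. This is a sketch only: the function name `collaboration` is hypothetical, and the two developers and their time figures are made up for illustration.

```python
def collaboration(a_p1, a_p2):
    """A_{d,p1} * A_{d,p2} / (A_{d,p1} + A_{d,p2}) for one developer d."""
    if a_p1 + a_p2 == 0:
        return 0.0
    return a_p1 * a_p2 / (a_p1 + a_p2)


# Hypothetical developer who only starred both projects (1 second each).
drive_by = collaboration(1, 1)

# Hypothetical developer who spent an hour on issues or PRs in each project.
engaged = collaboration(3600, 3600)
```

Here the drive-by developer contributes 0.5 and the engaged developer contributes 1800. With time-based activity, low-cost behaviors such as star or fork still create a link between projects, but one so weak that it should barely affect clustering.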

Summary

With the above analysis, we can basically get a full picture of the economic model of an open source ecosystem.

Socially necessary labor time –> development cost of a specific version –> T1 –> value of a specific version
Total community time invested –> Open source community activity –> T3 –> Community competitiveness
Collaboration impact based on time investment –> Correlation between open source communities –> PageRank –> Ecological relevance
Value Stream Network –> Total Ecological Value of Open Source Software –> Use Value

Based on the above model, we may be able to develop further analysis.