A Data-Driven Research of Sales and Purchases on JD.com Platform

Unlike shoppers in malls or supermarkets, online consumers are “intangible,” and their purchasing behavior is affected by multiple factors, including product pricing, promotions and discounts, product and brand quality, and the platform on which they search for the product. In this research, I study the relationship between product sales and consumer characteristics, the relationship between product sales and product attributes, the demand curve, and search friction across platforms. I use data from a randomized field experiment involving more than 400 thousand customers and 30 thousand products on JD.com, one of the world’s largest online retailing platforms. The research has two focuses: 1) how different consumer characteristics affect sales; 2) how to set prices and how search friction differs across channels. I find that JD Plus membership, education level, and age have no significant relationship with product sales, whereas a higher user level is associated with higher sales. Sales are highly skewed: products with very high sales volumes make up only a small percentage of all products. Consumers living in more industrialized cities have more purchasing power. Women and single consumers are associated with higher spending. In addition, the better a product performs, the more it sells, and moderate pricing can increase product sales. Based on the search volume observed in different channels, sellers are advised to focus on app sales. With these results, producers can adjust the target consumers for different products and run targeted advertisements to maximize sales. Setting an appropriate price for a product is also crucial to a seller. Finally, knowing the search friction of different channels can help producers rearrange platform layouts so that search friction is reduced and more potential deals are made.


Introduction
In 2018, an estimated 1.8 billion people worldwide purchased goods online, and this number has continued to increase dramatically in recent years [1]. Total global online store sales are expected to reach the $2 trillion mark by the end of 2019, an increase of 6 per cent compared with 2017 [2]. The Asia-Pacific region is home to the largest and fastest growing e-commerce market, with total online retail revenues set to nearly double from $733bn in 2015 to $1.4tn by 2020. China is the largest of these markets, accounting for 80 per cent of the Asia-Pacific online retail market [3]. The growth of e-commerce retailing (or e-tailing) has given rise to many new and challenging problems at both strategic and operational levels.
In the context of e-tailing, this paper connects and contributes to three streams of literature. First, this research is related to a large literature on the determinants of consumers' online purchase behavior, such as attitude, facilitating conditions, perceived usefulness, enjoyment, social pressure, and transaction security [4]-[8]. For example, Tontini finds that the quality of e-services and online services is generally considered a key determinant of competitive advantage and is related to measuring customers' purchase intention [4]. These studies have made important contributions to the understanding of the key factors in online shopping. However, most of them adopted questionnaire surveys and did not take full advantage of big data. Using sales data from JD.com, this research tries to depict the digital image of users comprehensively.
The second related stream of research is the operations management literature on dynamic pricing, which usually focuses on building pricing models and algorithms in different industries to maximize profits. For example, one study examines which pricing strategy, dynamic pricing or preannounced pricing, is most beneficial to a firm [9]. Other studies focus on price promotions on retailing platforms and their effects on consumers' behavior in both the long run and the short run [10]. Moreover, some research focuses on the matching rate and the factors that may affect it, including market thickness [11]. Dynamic pricing relies on an accurate understanding of consumers' consumption behavior, and this paper provides further evidence for sellers and platforms designing pricing strategies, together with suggestions based on the understanding of consumers acquired here. Much of this research has focused on the seller side of the e-tailing market, including pricing, discount motivations, and matching rate. In my research, the focus is more on the consumer side, revealing how different characteristics of products and platforms affect consumers' purchasing behavior.
Third, this research provides insights for retailers about how they can increase their sales volume and revenue by allocating resources across different channels. Xu et al. demonstrated that the tablet channel acts as a substitute for the PC channel while it acts as a complement to the smartphone channel [12]. Chen et al. concluded that the online retailer's profit critically depends on customer loyalty [13]. Ronghui et al. suggested that an increase in product return probability or in the retailer's cost of handling a returned product can be beneficial to retailers [14]. Compared with risk aversion and service orientation, the convenience orientation of customers is higher, which urges consumers to choose online channels instead of offline channels. Knowing the demand and consumers' behavior when browsing products through various channels (PC, mobile devices, etc.) will help sellers set prices and decide on which channel to post their products in order to attract the largest number of potential buyers, get the most clicks from them, and reduce search friction as much as possible.
JD.com's proprietary data capture a "full customer experience cycle" that begins as soon as a customer starts browsing on the platform and ends when the customer receives the delivered products. Using these data, I hope to address several relationships related to product sales: 1) the relationship between sales and consumers' characteristics; 2) the relationship between each product's sales amount and its attributes; and 3) the demand curve of different products. Furthermore, I examine several other aspects related to consumer behavior and pricing strategy, including search friction and price discrimination.
The contributions to the literature are two-fold. First, this study uses big data to depict customers' characteristics in online retail, in contrast to the questionnaire surveys that many previous studies adopted. Second, it provides a framework for sellers to analyze users' behavior and design effective marketing strategies, including allocating resources properly across channels and designing pricing strategies.
The main methodologies in the paper include linear regression and various data analysis approaches. The remainder of this paper begins with an introduction to the data and the statistical observations that result from explorative analysis (Section 2). The methods and procedures adopted to address the main research questions are discussed in Sections 3 through 6. Section 7 concludes the paper.

Explorative Data Analysis
In this research, 457,298 potential JD consumers are observed purchasing 31,867 products. All the data are stored in five datasets, including: the "SKUs" table, which contains all the information related to a product, including the brand it belongs to, its attributes, and the dates when it enters and leaves the market; the "users" table, which contains all the information related to a user, including his or her user level, age, marital status, education, and city level; the "clicks" table, which records every click of a user on a product and the channel through which it was made; and the "orders" table, which includes the information of an actual order, including its price and discounts. After inspecting the datasets and constructing a bar chart of the sales distribution of all the products, I find that sales are highly skewed. The result is shown below (the graph only contains products whose quantity of sales is greater than 500).
In Figure 1, the horizontal axis represents the product ID, and the vertical axis represents product sales. Products selling more than 1000 units account for only 1.0% of the total, while those selling more than 100 units account for 8.0% of the total. The best-selling product, with "SKU_ID" "068F4481B3", sold 25,769 units, while the average sales per product in the sample were only 73.06 units. The sales distribution of JD.com's products is therefore highly skewed to the right, with only a small part of the sample selling in extremely high volumes.
The data also show signs of price discrimination. Taking the discounted price each consumer pays when purchasing products as the dependent variable and the characteristics of consumers as the independent variables, I fit a linear regression model and obtain the following results.
In Table 1, consumer attributes with P values less than 0.05 are statistically significant and can be used for analysis. Gender, age, and marital status all help determine the final price of a product, which suggests the possible existence of a certain degree of price discrimination. Take age as an example: the older the consumer, the less expensive the products are on average. Likewise, for marital status, married consumers tend to get a lower price than unmarried consumers.
Based on these results, sellers can adjust product features, target consumers, prices, and the timing of product entry and exit to achieve higher sales and maximum revenue.
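The skewness statistics quoted above (the share of products above a sales threshold, the mean, and the direction of skew) can be computed along the following lines. This is a minimal sketch on made-up numbers, not the code used for Figure 1: the SKU labels and all unit counts except 25,769 are hypothetical, and in practice the series would be aggregated from the "orders" table.

```python
import pandas as pd

# Hypothetical per-SKU unit sales (only 25769 is a figure from the text).
sales = pd.Series(
    {"sku_a": 25769, "sku_b": 1200, "sku_c": 150, "sku_d": 40, "sku_e": 12,
     "sku_f": 9, "sku_g": 5, "sku_h": 3, "sku_i": 2, "sku_j": 1},
    name="units_sold",
)

# Fraction of SKUs above each sales threshold.
share_over_1000 = (sales > 1000).mean()
share_over_100 = (sales > 100).mean()

# A mean far above the median and a positive skewness both
# indicate a right-skewed distribution.
mean_units = sales.mean()
median_units = sales.median()
skewness = sales.skew()

print(share_over_1000, share_over_100, mean_units, median_units, skewness)
```

Comparing the mean with the median is a quick check that does not depend on any plotting choices, such as the 500-unit cutoff used for the graph.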

Product Sales Analysis
Linear regression is used to find the relationship between a consumer's characteristics (education, city level, purchasing power, gender, marital status, etc.) and the sales volume of the products he or she bought. The data need to be preprocessed before these variables can be added. In the "users" table, each consumer has a user ID and is given a number or letter in each column to represent a user characteristic. The first five lines of the "users" table (Table 2) are shown below. Some features are represented by specific numbers, such as "user_level", which takes a value of 0, 1, 2, 3, or 4, where a higher "user_level" is associated with a higher total purchase value in the past. If the consumer is a JD Plus member, the value of the "PLUS" column is 1; otherwise it is 0. "education" is valued according to the consumer's education level: the greater the number, the higher the education level. "city_level" ranges from 1 to 5, with greater numbers representing less industrialized cities. "purchase power" also ranges from 1 to 5, with greater numbers indicating lower purchasing ability. Figure 2 shows the percentage of users having each value in each category.
A dummy variable is a numeric variable that represents categorical data. For example, in the "users" table, the gender of the user with the "user_ID" of "000089D6a6" is "F", indicating that the user is female. After the dummy variable transformation, the value of "gender_F" becomes "1" and the values of "gender_M" and "gender_U" become "0". In this way, the user's gender is encoded numerically so that it can enter the regression. A portion of the "users" table after the dummy variable transformation (Table 3) is shown below.
After combining the information in the "users" and "orders" tables (Table 3 and Table 4), linear regression can be used to determine the relationship between the sales of the products purchased by a consumer and the characteristics of that consumer. The "Y" value is the total sales of the products purchased by each consumer, while the "X" values are the characteristics of each consumer, including user level, marital status, gender, etc. The code for the linear regression can be seen in the Appendix (Code A).
The result of the linear regression is shown in Table 5. "user_level", "city_level", "purchasing_power", "gender", and "marital_status" are statistically significant, whereas JD Plus membership, education level, and age are not; these three factors have no obvious relationship with product sales and little impact on them. The higher the user level, the higher the consumption. Consumers living in more industrialized cities tend to bring in higher sales. In terms of marital status, single people tend to buy more than married people.
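The dummy variable transformation described above can be illustrated with `pandas.get_dummies`. This is a small sketch rather than the preprocessing actually used for Table 3: only the first user ID and the three gender codes come from the text, and the other two user IDs are hypothetical placeholders.

```python
import pandas as pd

# Toy "users" table: the first ID is quoted in the text, the others are made up.
users = pd.DataFrame({
    "user_ID": ["000089D6a6", "user_b", "user_c"],
    "gender": ["F", "M", "U"],
})

# One 0/1 column per category: gender_F, gender_M, gender_U.
users_encoded = pd.get_dummies(users, columns=["gender"], dtype=int)

print(users_encoded)
```

When all dummy columns of one category enter a regression together with an intercept, they are perfectly collinear, so one level is usually dropped (for example with `drop_first=True`) and serves as the baseline.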
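As an illustration of the merge-then-regress step, the following sketch aggregates order values per consumer, joins them to the consumer characteristics, and fits ordinary least squares with NumPy. It is not the Appendix code: the column name `order_value`, the user IDs, and all numbers are hypothetical, and treating "Y" as the summed order value per consumer is an assumption. The toy totals are constructed to satisfy Y = 10 + 5·user_level − 2·city_level exactly, so the fit should recover those coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical consumer characteristics.
users = pd.DataFrame({
    "user_ID": ["u1", "u2", "u3", "u4", "u5"],
    "user_level": [0, 1, 2, 3, 4],
    "city_level": [1, 3, 2, 5, 4],
})

# Hypothetical orders; a consumer may place several orders.
orders = pd.DataFrame({
    "user_ID": ["u1", "u1", "u2", "u3", "u3", "u4", "u5", "u5"],
    "order_value": [3, 5, 9, 10, 6, 15, 20, 2],
})

# Y: total order value per consumer, joined to that consumer's characteristics.
totals = orders.groupby("user_ID", as_index=False)["order_value"].sum()
data = users.merge(totals, on="user_ID", how="inner")

# X: intercept column plus the chosen characteristics.
features = ["user_level", "city_level"]
X = np.column_stack([np.ones(len(data)), data[features].to_numpy(dtype=float)])
y = data["order_value"].to_numpy(dtype=float)

# Least-squares estimate of [intercept, user_level, city_level].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept"] + features, coef)))
```

`np.linalg.lstsq` returns only the coefficients; the P values reported in the tables require a statistics package that also computes standard errors.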

Quantity of Sales Analysis
In this section, the main goal is to find the relationship between product sales and product attributes. For this purpose, the "SKUs" table (Table 6), which contains all the information about each product, becomes important. The first five lines of the "SKUs" table (Table 6) are shown below.
In Figure 3, the first attribute takes an integer value between 1 and 4 (unknown data is denoted by −1), and the second ranges between 30 and 100 (unknown data is also −1). For each attribute, a higher value indicates better performance of a certain functionality. The distribution of products across the values of each attribute is shown below. If linear regression is applied directly, with the attribute values as independent variables and the sales amount as the dependent variable, the relationship cannot be analyzed fairly: when a large number of products carry the same value of an attribute, their combined quantity of sales will almost certainly exceed that of a value carried by only a small number of products. In other words, the sales amount is distorted. To see this, take "attribute 1" as an example, as shown in Table 7. There are 813 products with the value of 1 and 2491 products with the value of 2 for "attribute 1". The products with the value of 1 sold 7952 units in total, and the products with the value of 2 sold 91,708 units in total. One may therefore conclude that the greater the value of "attribute 1", the greater the quantity of sales. However, this result is highly biased, because it may be the larger number of products with the value of 2, rather than the attribute itself, that makes their quantity of sales greater than that of products with the value of 1. To exclude this bias from the linear regression, I apply a volume adjustment twice, once based on the number of products for each value of "attribute 1" and once based on the number of products for each value of "attribute 2".
By dividing the quantity of sales for each value of "attribute 1" or "attribute 2" by the number of products with that value, the bias described above is largely excluded. The code for both adjustments, based on "attribute 1" (Code B) and on "attribute 2" (Code C), is shown in the Appendix. The independent variables of the linear regression are the attribute values, and the dependent variable is the sales amount. The result of the linear regression with the adjustment based on "attribute 1" is shown in Table 8.
The result of the linear regression with the adjustment based on "attribute 2" is shown in Table 9.
According to the results of both trials, the relationship between sales quantity and the value of an attribute is more pronounced when the adjustment is based on that same attribute. In other words, if the data are adjusted based on "attribute 1", the relationship revealed between quantity of sales and "attribute 1" is more apparent than that for "attribute 2". A smaller P value suggests higher significance. In the tables above, for example, the P value for "attribute 1" is 0.000 when the adjustment is based on "attribute 1", whereas the P value for "attribute 2" is 0.480 when the adjustment is based on "attribute 2", suggesting that the result is more significant if the adjustment is based on the respective attribute. In addition, these two regression results show that the higher the value of the two attributes, the higher the sales. This means that the better the performance and function of the product, the higher the sales.
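The adjustment can be checked by hand with the "attribute 1" figures quoted above. The sketch below is only a worked example of the division, not Code B or Code C, and the variable names are illustrative.

```python
# Figures quoted in the text for "attribute 1" (values 1 and 2).
product_count = {1: 813, 2: 2491}    # number of products carrying each value
units_sold = {1: 7952, 2: 91708}     # total units sold by those products

# Volume-adjusted sales: average units sold per product for each attribute value.
sales_per_product = {
    value: units_sold[value] / product_count[value]
    for value in product_count
}

for value, rate in sales_per_product.items():
    print(f"attribute 1 = {value}: {rate:.2f} units per product")
```

With these figures, products with the value of 2 sell about 36.8 units each against about 9.8 for the value of 1. The raw totals differ by a factor of more than 11, while the per-product averages differ by a factor of under 4, which shows how much of the raw gap comes from product count alone.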

Demand Analysis
In this section, the demand curve is the main focus. As shown in Figure 4, a demand curve shows the seller how willing consumers are to buy a product at each price. I choose the product with the highest quantity of sales for this analysis, which is the product with "sku_ID" "068f4481b3". To construct a demand curve, I first need to know the price a consumer sees when viewing the product. Since the "SKUs" table (shown in Section 4) contains several discounts that do not fully reflect the consumer's willingness to pay when seeing the product, including the bundle discount, coupon discount, and quantity discount, I use the final unit price as each consumer's willingness to pay for a product.
To discover the price associated with the most sales for a specific product, which is what the demand curve captures, I need to process the data in a way that avoids a bias similar to that discussed in Section 4. The sales quantity at a given price may be affected by the length of time that price is shown to consumers. For example, suppose a tube of toothpaste costs $50 for a year, meaning that anyone who sees the product during that year sees a price of $50, and the quantity of sales is 100. Then its price changes to $25 for only a month, and the quantity of sales is 10. Can we say that consumers' willingness to pay for this toothpaste is $50 because this price level has the larger quantity of sales? The answer is no, because consumers had more opportunities to buy at $50 simply because that price was in effect for longer. To exclude this bias, I apply the same technique as in Section 4: the y-axis shows the frequency of a certain price for a product divided by the duration for which the product was sold at that price, and the x-axis shows the final unit price. In this way, I can find the price at which a product has the most sales in a less biased way. The result is shown in Figure 5. According to these results, most consumers prefer a price of around 200 RMB for this product. Comparing prices from 0 RMB to 100 RMB with prices from 250 RMB to 350 RMB, more consumers prefer cheap prices than very expensive ones. This result reveals a pattern in consumer behavior: consumers prefer an "appropriate" price to prices that are extremely high or extremely low. A possible reason is that if a price is too low, the product becomes more affordable but may seem poor and low-quality in consumers' minds, leading to a relatively lower quantity of sales. If the price is too high, the product is unaffordable and the price exceeds its value in consumers' minds.
As a result, only an appropriate price leads to a higher quantity of sales. Beyond the best-selling product, the demand curves for the top five best-selling products can also be considered, as shown in Figure 6. For each product in Figure 6, there are prices with a relatively high quantity of sales (called the "best" price in the following paragraphs), which indicates that setting an appropriate price is crucial to a producer. An appropriate price, rather than a cheap price, leads to a higher quantity of sales. Furthermore, looking at the five graphs together reveals a difference: the "best" price falls at different points in the price range of each product. For example, for the best-selling product, the "best" price is 200 RMB, which is located almost in the middle of the price range of 0 to 350 RMB. For the third product, the "best" price is 40 RMB, toward the right of its price range. For the remaining products, "best" prices occur frequently, usually recurring each time the price increases by a certain amount. This finding may hint at the product types. When a "best" price appears only rarely and is located in the middle or slightly to the right of the price range, the products may be food, electronic devices, or other goods with potential safety hazards, so that consumers distrust very low prices and reject very high ones. When "best" prices appear frequently across the price range, the products may be toys or recreational equipment, which consumers value differently depending on their wealth. In conclusion, knowing the demand curves of several products may help producers characterize the products and tailor the pricing of each product.
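The duration adjustment used in this section can be sketched with the toothpaste figures from the example. The column names are hypothetical, and counting a year as 365 days and a month as 30 days is an assumption made only for the illustration; the actual analysis would derive sales counts and durations from the order records.

```python
import pandas as pd

# Toothpaste example: price level, units sold at that price,
# and the number of days the price was in effect (assumed day counts).
price_history = pd.DataFrame({
    "final_unit_price": [50, 25],
    "units_sold": [100, 10],
    "days_at_price": [365, 30],
})

# Sales per day at each price removes the effect of how long the price was shown.
price_history["units_per_day"] = (
    price_history["units_sold"] / price_history["days_at_price"]
)

# "Best" price by raw sales versus by duration-adjusted sales.
best_raw = price_history.loc[
    price_history["units_sold"].idxmax(), "final_unit_price"
]
best_adjusted = price_history.loc[
    price_history["units_per_day"].idxmax(), "final_unit_price"
]

print(price_history)
print(best_raw, best_adjusted)
```

On raw counts the $50 price looks best, but per day of exposure the $25 price sells more (10/30 against 100/365 units per day), which is the reversal that motivates the adjustment.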

Search Friction in the Online Marketplace
In this section, the search friction of different platforms is the main topic. Search friction is measured here by the number of clicks a consumer is willing to spend on a product via a given channel: more voluntary clicks indicate lower search friction and a higher probability of purchase. As a result, all manufacturers want their products to be more searchable. To count searches, I use the "clicks" table, which records the channel or platform through which each consumer clicked on a product. An example of the "clicks" table (Table 10) is shown below.
Figure 7 shows the distribution of clicks across channels. The app has the most clicks, followed by WeChat. Another key indicator of search friction across channels is the time spent browsing products through each channel. Table 11 shows that consumers spend the longest time clicking on mobile, which suggests that it has the lowest search friction, whereas WeChat has the highest search friction. Knowing the search friction of different channels, sellers can efficiently decide which channel is best for a specific product and which channel needs to be adjusted in order to reduce search friction. Figure 8 shows the total number of clicks for a brand under different channels. Using this information, sellers can see the search friction of each channel for a brand and adjust their selling strategy or even improve channels with relatively higher search friction. Take the brand with "brand_ID" "003938d449" as an example: the click counts for app and WeChat are significantly larger than those for pc and mobile, indicating higher search friction in the pc and mobile channels. The seller of products with this "brand_ID" can adjust his or her selling strategy accordingly, for instance by increasing advertising spending on the app and decreasing it on pc, or by redesigning the website for pc users to make it more attractive and convenient to browse, thus reducing search friction.
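The per-channel and per-brand click counts behind Figures 7 and 8 amount to simple tabulations of the "clicks" table. The sketch below shows one way to compute them; the brand labels, the column names `brand_ID` and `channel`, and every row are hypothetical, and it assumes the brand has already been attached to each click (for example by joining with the "SKUs" table).

```python
import pandas as pd

# Hypothetical click log: one row per click.
clicks = pd.DataFrame({
    "brand_ID": ["b1", "b1", "b1", "b1", "b2", "b2", "b2"],
    "channel": ["app", "app", "wechat", "pc", "app", "mobile", "wechat"],
})

# Total clicks per channel (the quantity plotted in a chart like Figure 7).
clicks_by_channel = clicks["channel"].value_counts()

# Clicks per brand and channel (the quantity plotted in a chart like Figure 8);
# combinations that never occur are filled with 0.
clicks_by_brand = pd.crosstab(clicks["brand_ID"], clicks["channel"])

print(clicks_by_channel)
print(clicks_by_brand)
```

Reading one row of the cross-tabulation gives a brand's click profile across channels, which is the comparison made for the example brand above.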

Conclusions
In this research, I mainly focused on two aspects: consumers' purchasing behavior towards different products and the strategies a producer can adopt to increase sales.
The first focus is on how different consumer characteristics affect sales. The quantity of sales is highly skewed in e-tailing, with a tiny portion of all products reaching extremely high sales. Furthermore, I found that living in a more industrialized city, having higher purchasing power and user level, and being single are all associated with higher consumption. Moreover, higher performance or better function of a product is associated with a higher quantity of sales.
The second focus is on pricing. I found that an appropriate price, neither too expensive nor too cheap, leads to a higher quantity of sales of a product. Also, the demand curves of several products help producers characterize the products and tailor the pricing of each product. In addition, according to the results on the search friction of different channels, it is better to concentrate on app selling because of its lower search friction for consumers across most products. At the same time, platforms with high search friction, such as the personal computer channel, can consider reforming the platform and rearranging the layout so that search friction is reduced.
Having analyzed the major results, I see several improvements that can be made in further research. In this paper, I focus on linear regression methods to study the various relationships in the e-commerce data from JD.com. Other regression techniques, such as nonlinear regression and generalized linear models, could be used to gain more insights. Machine learning methods, such as decision trees, nearest neighbors, and random forests, can also be applied to predict sales and consumers' clicking and purchasing behavior.
In the demand curve analysis, an issue exists because the price data are censored: only the prices of sold products are recorded. I hope to mitigate this issue by additionally collecting posted prices so as to get a full picture of the demand curve.