
Couchbase in telecom

Digital transformation is a global trend in big business and is vital for adapting an enterprise to the modern needs of its customers. On top of the usual large-company requirements, such as centralizing systems and integrating billing systems with subscriber databases, come requirements for high availability and real-time operation, which customers have come to expect from industry leaders (Google, Amazon, Netflix).


New challenges require new technologies and approaches that shorten the time needed to deliver customer-friendly features and personalized offers, let the business respond quickly to competitors' offers, and keep the cost of systems, IT infrastructure, data centers and qualified personnel under control. These trends have a major downside: a more complex architecture and bloated transactional databases that cannot cope with the flow and processing of information. Previous-generation technologies have a vertical scaling ceiling. For example, an Oracle DBMS instance handling a billion transactions per day runs at the limit of the most powerful x86 server.



To withstand this kind of load, which the Internet industry has faced for a long time, a new technology stack is used, including in-memory caches and NoSQL databases. For example, Apple uses Cassandra and Sberbank uses Ignite (GridGain); at MegaFon we use Couchbase and Tarantool.


MegaFon uses different architectural patterns for in-memory DBMSs:


  1. Simple cache, updated on a schedule or by events from the database and applications.
  2. All changes to the database go through the cache (write-through scenario), for example by connecting an Oracle client to Couchbase DCP.
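
The two patterns can be contrasted with a minimal sketch. It is illustrative only: plain dictionaries stand in for Couchbase and Oracle, the class and method names are hypothetical, and the write-through variant propagates changes synchronously, whereas a DCP-based setup would stream them to the database.

```python
class SimpleCache:
    """Pattern 1: the cache is a copy refreshed on a schedule or by an event."""

    def __init__(self, database):
        self.database = database  # stand-in for the main database
        self.cache = {}           # stand-in for the in-memory store

    def refresh(self, key):
        # Called by a scheduler or by a change event from the database or an application.
        self.cache[key] = self.database[key]

    def get(self, key):
        return self.cache.get(key)


class WriteThroughCache:
    """Pattern 2: every change goes through the cache and is then pushed to the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}

    def put(self, key, value):
        self.cache[key] = value     # the write lands in the cache first
        self.database[key] = value  # and is then propagated to the database

    def get(self, key):
        return self.cache.get(key)


# Pattern 1: the cache is stale until the next refresh.
oracle = {"sub:1": {"status": "blocked"}}
simple = SimpleCache(oracle)
simple.refresh("sub:1")
oracle["sub:1"] = {"status": "active"}
stale = simple.get("sub:1")
simple.refresh("sub:1")
fresh = simple.get("sub:1")

# Pattern 2: the database always follows the cache.
oracle2 = {}
write_through = WriteThroughCache(oracle2)
write_through.put("sub:2", {"status": "active"})
```

With the first pattern the reader of the cache may briefly see old data; with the second, the cache is the entry point for writes and the database never gets ahead of it.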

For one of our decision-making systems for the subscriber life cycle we use the first pattern, since only one application makes a decision based on the aggregated data and sends it to all systems, including the Oracle database. One of the most striking subscriber life cycle cases is blocking and unblocking on a negative balance. After all, every mobile subscriber who has topped up their balance wants to be back in touch and making calls immediately. Thanks to a separate application and Couchbase, we were able to reduce the time it takes to lift the block from 90 seconds to 30, and this is not the limit. The main database only receives a record of the change in the subscriber's status (Fig. 1).



Figure 1 (Interaction Example)

With the new technologies we cut the time to lift a financial block by a factor of three. But to get to the current results we went a long way through the architectural transformation of the billing landscape and the choice of a NoSQL database.


Why did we choose Couchbase? There are several reasons for this.


Performance requirements


  1. Handling up to 200,000 requests per second.
  2. Median response time (50th percentile): up to 5 ms (within one data center).
  3. Maximum response time (99th percentile): up to 15 ms (within one data center).
  4. Maximum insert throughput: 500 MB/s.
  5. Maximum number of insert operations: 100,000/s.
  6. Maximum number of change operations (document updates): 100,000/s.
  7. Maximum change throughput (document updates): 500 MB/s.
  8. Maximum number of read operations: 100,000/s.
  9. Maximum read throughput: 500 MB/s.
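
As a rough illustration of how such targets could be checked, the sketch below tests a set of latency samples against the 50th and 99th percentile limits and derives the average payload implied by the throughput figures. The nearest-rank percentile method, the function names and the sample values are assumptions made for the example, and 1 MB is taken as 10^6 bytes.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def meets_latency_targets(samples_ms, p50_limit_ms=5.0, p99_limit_ms=15.0):
    """Check the two latency requirements from the list above."""
    return (percentile(samples_ms, 50) <= p50_limit_ms
            and percentile(samples_ms, 99) <= p99_limit_ms)


# Hypothetical measurements in milliseconds.
mostly_fast = [2.0] * 99 + [20.0]      # one slow response out of 100
too_many_slow = [2.0] * 98 + [20.0] * 2  # two slow responses out of 100

ok = meets_latency_targets(mostly_fast)
not_ok = meets_latency_targets(too_many_slow)

# Average payload implied by 500 MB/s spread over 100,000 operations per second.
avg_payload_bytes = 500 * 10**6 / 100_000
```

Under these assumptions the throughput targets correspond to an average document of about 5 KB per operation.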

High-performance key search and data access


Couchbase is based on a distributed key-value (KV) store. A KV store is an extremely simple approach to data management: it keeps a unique identifier (key) together with a piece of arbitrary information. The store itself accepts any data, be it a binary blob or a JSON document. Because the KV implementation is so simple, data access comes with minimal latency. In our experience, network latency is usually 2-3 times higher than the time Couchbase itself needs to return data by key.


Dynamic storage schema (JSON)


Documents are stored on the Couchbase server in JSON format. The format supports both basic data types, such as numbers and strings, and complex types, such as nested dictionaries and arrays.


The data schema in Couchbase is a logical construct defined by the application and the developer. Because the schema is flexible and several variants can coexist, we can put a tag in the document, for example with version information. This lets the application decide how to process the document and makes it possible to migrate the database smoothly to a new data schema.
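
A minimal sketch of the version-tag idea follows. The field names (`version`, `phone`, `phones`) and the v1-to-v2 change are hypothetical; the point is only that the application reads the tag and upgrades old documents on the fly, so documents in both layouts can coexist during a migration.

```python
import json

CURRENT_VERSION = 2


def upgrade_v1_to_v2(doc):
    # Hypothetical change: v1 kept a single "phone" string, v2 keeps a "phones" list.
    upgraded = {k: v for k, v in doc.items() if k != "phone"}
    upgraded["phones"] = [doc["phone"]] if "phone" in doc else []
    upgraded["version"] = 2
    return upgraded


# One upgrade step per old version; later versions would add more entries.
UPGRADES = {1: upgrade_v1_to_v2}


def load_subscriber(raw_json):
    """Parse a stored document and bring it up to the current schema version."""
    doc = json.loads(raw_json)
    version = doc.get("version", 1)  # untagged documents are treated as v1
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["version"]
    return doc


old_doc = '{"version": 1, "id": "sub:1", "phone": "+70000000000"}'
new_doc = load_subscriber(old_doc)
```

The upgraded document can be written back on the next update, so the data set migrates gradually without a stop-the-world conversion.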


High availability


Availability is one of the key properties of an information system. Couchbase provides high data availability through a number of features. One of them is data replication (keeping several copies of the data on different servers of the cluster), which lets us keep the service running during routine maintenance or when some servers fail.



Figure 2 (Couchbase server replicas)

The second important feature for high availability is the internal Database Change Protocol (DCP). It provides high-speed transfer of changes to all copies of the data, to secondary indexes (GSI), to cross-cluster replication (XDCR) and to external consumers.


Bidirectional replication


A sound practice is to provide redundancy for all business processes and equipment. Ideally this is redundancy in Active-Active (AA) mode, where switching away from failed nodes happens automatically. Couchbase bidirectional replication makes AA mode possible. But our replication testing showed that it works well only between nearby data centers. When they are more than 100 km apart, conflicts appear. Couchbase has conflict resolution mechanisms based on timestamps and on sequence numbers. However, because of network delay, outdated data still ends up in the database. We therefore decided against bidirectional replication (cross-cluster consistency). All changes are made on one cluster only. Data is available for reading in all data centers (AA).
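
The sketch below illustrates, in a deliberately simplified form, why both kinds of conflict resolution can keep outdated data. It is not Couchbase's actual XDCR logic, which compares more metadata; the `Revision` type, its fields and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    value: str
    seqno: int         # number of mutations applied to the document on that cluster
    timestamp_ms: int  # local clock of the cluster that accepted the write


def resolve_by_seqno(a, b):
    # The revision with more mutations wins; ties fall back to the timestamp.
    return max(a, b, key=lambda r: (r.seqno, r.timestamp_ms))


def resolve_by_timestamp(a, b):
    # Last write wins; ties fall back to the mutation count.
    return max(a, b, key=lambda r: (r.timestamp_ms, r.seqno))


# Data center A blocks the subscriber after several quick updates ...
blocked = Revision("blocked", seqno=5, timestamp_ms=1_000)
# ... and data center B unblocks them later in real time, but its clock runs behind.
active = Revision("active", seqno=3, timestamp_ms=900)

winner_by_seqno = resolve_by_seqno(blocked, active)
winner_by_timestamp = resolve_by_timestamp(blocked, active)
```

In this scenario both strategies keep the older "blocked" state: one because the stale side accumulated more mutations, the other because the fresher write carries an earlier timestamp. Writing to a single cluster avoids the problem altogether.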


Horizontal scaling


Horizontal scaling is one of the important characteristics of most NoSQL databases (Fig. 3). What sets Couchbase apart is support for multidimensional scaling (MDS): within a cluster we can scale only the service whose performance we need to increase. For example, the game Pokemon GO uses such a separated architecture. At the start of the project, 5 servers with combined services were used. After the load increased, they switched to a separated architecture: 5 servers with data and 55 servers for processing queries and indexes. One drawback of scaling Couchbase is that problems with the orchestrator appear when there are more than 50 data nodes in the cluster.




Figure 3 MDS


Information security requirements


Information security requirements influenced our choice to a lesser extent, but meeting them was an additional argument in favor of one database or another. Since the cache may contain personal data, we must comply with the regulator's requirements. The question to decide is whether we use additional equipment or can meet these requirements with the database itself.


In the Enterprise edition, Couchbase supports traffic encryption, data encryption and per-user access control. This saves on hardware such as Cisco ASA.


Easy to upgrade


One of the significant advantages of Couchbase is a transparent upgrade mechanism and API support for older versions. While the cluster is being upgraded, it works in compatibility mode. New mechanisms become available only after the whole cluster has been upgraded. The impact on running applications is minimal because the old API remains supported.


PS: Upgrades and downgrades are allowed between adjacent major versions only.


Additional functionality


Logical distribution


Another interesting feature is grouping the servers of a cluster into logical groups that replicas are tied to. This makes it possible to place complete replica copies of a single cluster in different machine rooms, so that a complete copy of the data remains in the second machine room when one of them fails.



Figure 4 Server Groups
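
The idea behind logical groups can be sketched as a placement rule: the active copy of a partition and its replica never share a group. The placement function below is a hypothetical round-robin scheme written for illustration, not Couchbase's own algorithm, and the group and node names are made up.

```python
def place_replicas(nodes_by_group, partitions):
    """Put each partition's active copy and its replica in different server groups."""
    groups = sorted(nodes_by_group)
    if len(groups) < 2:
        raise ValueError("at least two server groups are needed")
    placement = {}
    for p in range(partitions):
        active_nodes = nodes_by_group[groups[p % len(groups)]]
        replica_nodes = nodes_by_group[groups[(p + 1) % len(groups)]]
        slot = p // len(groups)  # spreads partitions over the nodes inside a group
        placement[p] = {
            "active": active_nodes[slot % len(active_nodes)],
            "replica": replica_nodes[slot % len(replica_nodes)],
        }
    return placement


def survives_group_loss(placement, nodes_by_group, lost_group):
    """True if every partition still has at least one copy outside the lost group."""
    lost = set(nodes_by_group[lost_group])
    return all(c["active"] not in lost or c["replica"] not in lost
               for c in placement.values())


halls = {"hall-1": ["n1", "n2"], "hall-2": ["n3", "n4"]}
placement = place_replicas(halls, partitions=8)
safe = all(survives_group_loss(placement, halls, hall) for hall in halls)
```

Because each group holds either the active copy or the replica of every partition, losing a whole machine room still leaves a complete data set in the other one.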

Backup and Restore


Couchbase ships with ready-made backup and restore tools. Backups can run in three modes: full, differential and cumulative. In some cases this saves disk space and CPU resources.
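
The difference between the three modes can be shown with a small model. It assumes the usual meaning of the terms (differential: changes since the most recent backup of any kind; cumulative: changes since the most recent full backup) and represents changes as plain sequence numbers; the function and its arguments are hypothetical and unrelated to the actual backup tools.

```python
def backup(changes, history, mode):
    """Return the change sequence numbers a backup in the given mode would copy.

    changes: sorted list of all change sequence numbers seen so far.
    history: list of (mode, last_seqno) for earlier backups, oldest first.
    """
    if mode != "full" and not any(m == "full" for m, _ in history):
        raise ValueError("an incremental backup needs an earlier full backup")
    if mode == "full":
        since = 0
    elif mode == "differential":
        since = history[-1][1]  # most recent backup of any kind
    elif mode == "cumulative":
        since = [seq for m, seq in history if m == "full"][-1]  # most recent full backup
    else:
        raise ValueError(f"unknown mode: {mode}")
    copied = [s for s in changes if s > since]
    history.append((mode, changes[-1] if changes else since))
    return copied


history = []
full = backup([1, 2, 3], history, "full")
diff1 = backup([1, 2, 3, 4, 5], history, "differential")
diff2 = backup([1, 2, 3, 4, 5, 6, 7], history, "differential")
cumulative = backup([1, 2, 3, 4, 5, 6, 7, 8], history, "cumulative")
```

Differential backups stay small but need the whole chain for a restore, while a cumulative backup is larger and needs only the last full backup.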



Couchbase vs MongoDB


It is difficult to answer the question of choosing between alternative NoSQL databases, and often the best Unix is the one your admin knows. Let us try to explain why we preferred Couchbase over another very popular platform, MongoDB.


It is quite difficult to compare two different projects with different architecture and functionality. One of the parameters to which we paid attention is the ease of maintenance and the ability to quickly reconfigure the system to fit the needs of the business.


Table 1 Comparison

| | Couchbase | MongoDB |
| --- | --- | --- |
| Scaling | Automatic for the entire data set | Manual key selection |
| Data distribution | Data is always evenly distributed across all data nodes | Incorrect partitioning may result in uneven data distribution |
| Adding / removing a node or replica | Done in one step via the GUI, with rebalancing | Quite a challenge, with weight calculations for each collection |
| Distribution of replicas across racks / data centers | Implemented through logical groups | Not implemented |
| Automatic load sharing | Each node holds the same number of active records available for reading and writing | Not balanced. Secondary nodes do not support writes |
| Index scaling | Flexible: thanks to the distributed architecture, a separate index node can be added | Rigid: index scaling is tied to data scaling |
| Cluster metadata | Distributed across all cluster nodes | Configuration servers required |
| Integrated search | N1QL (SQL++) | JSON queries |



Table 2 Replication Comparison

| | Couchbase | MongoDB |
| --- | --- | --- |
| Architecture | Cross-cluster replication has no dependencies; the clusters are independent of each other | Intra-cluster only |
| Customization flexibility | Flexible (per-bucket setup, filters, tuning) | Speed tuning |
| Topology | Bidirectional replication, star, chain, etc. | Star |
| Active-Active mode | Supported | Not supported |



In general, Couchbase is more flexible and simpler in the settings required for our tasks and the rapidly changing hybrid architecture.


Operating experience


To begin with, here are the numbers our Couchbase-based system and cluster currently operate with.


  1. More than 80 million subscribers [i]
  2. 380 million JSON client information documents
  3. 3.5 TB HDD (we use memcached, information is stored on the disk for a quick start)
  4. 3 TB RAM
  5. 50 thousand operations per second (Figure 5)
  6. 50 microservices processing the entire message flow



Figure 5 Load

We started the first stages of the transformation with the third version of Couchbase. At first, when the project was launched, all applications worked stably. But as we moved additional logic to the new platform, the View mechanism began to behave unpredictably. That is, at some point the process hung and views on the affected nodes stopped returning data. Access to the data itself and its processing were not interrupted. The problem was fixed quite easily, by restarting the node, but this reduced the overall availability of the service. While working with Couchbase technical support, we were given an undocumented command that restarts only the view process.


curl -s --data 'cb_couch_sup:restart_couch().' -u Administrator:pass http://127.0.0.1:8091/diag/eval


The command is valid only in versions 3.x. [ii]


curl -s --data 'couch_server_sup:restart_core_server().' -u Administrator:Administrator http://127.0.0.1:8091/diag/eval


The command is valid only in versions 4.x.


Another problem in the third version was a broken data compaction mechanism. It had to be started manually based on monitoring metrics. Both problems kept not only the duty shift but also the functional engineers on edge.


For these reasons we decided to migrate to the fourth version. Migration with minimal impact on the service took about two weeks. The upgrade itself does not require complicated actions or supervision, but adding or removing a node triggers a rebalance that takes at least two hours. Along the way we found a way to speed up the upgrade by using a buffer server: in this case what runs is not a full rebalance but a transfer of data from one node to another. This reduced the process to 30 minutes.


When upgrading a production cluster, keep one nuance in mind: it works in compatibility mode, that is, the cluster operates at the level of the oldest software version present. The upside is that the upgrade goes smoothly and painlessly; the downside is that new features, such as the new compaction mechanism and N1QL, cannot be used until the entire cluster has been upgraded.
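
Compatibility mode can be pictured as taking the minimum over the versions of all nodes. The sketch below is an illustration only: the function names and the node version lists are hypothetical, and it relies on the fact that N1QL first appeared in the 4.0 release.

```python
def parse_version(text):
    """Turn '4.6.4' into (4, 6, 4) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def cluster_compat_version(node_versions):
    """The cluster behaves like its lowest-versioned node until every node is upgraded."""
    return min(parse_version(v) for v in node_versions)


def feature_enabled(node_versions, required):
    return cluster_compat_version(node_versions) >= parse_version(required)


# Example values: one node is still on a 3.x release in the middle of the upgrade.
mid_upgrade = ["4.6.4", "4.6.4", "3.1.6"]
after_upgrade = ["4.6.4", "4.6.4", "4.6.4"]

n1ql_mid = feature_enabled(mid_upgrade, "4.0")
n1ql_after = feature_enabled(after_upgrade, "4.0")
```

A single node left on the old release is enough to keep the new features switched off for the whole cluster.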


After the upgrade, only one problem turned out to be fixed: compaction, which began to work properly. The problem with the View mechanism remained, although it occurred much less often. It was finally fixed by the Couchbase developers in version 4.6.4.


While working through these problems with technical support, we learned that the view mechanism would no longer be developed. The reasoning was that most Couchbase customers use views for purposes other than those they were designed for, so Couchbase built the new N1QL mechanism. It is implemented as a separate service and is independent of the data nodes (Fig. 7).




Figure 7 Role Nodes

Version 4.6.4 closed all our critical problems. But because the data volume kept growing, we decided to migrate to the fifth version, which added a new storage backend for indexes; on our data, their footprint in memory and on disk shrank by a factor of one and a half. Unfortunately, we saw no reduction in the amount of data on the data nodes.


Conclusions


Overall, Couchbase has proved to be a mature system that holds up under high load, even in atypical use cases (Viber, for example, uses it as a database). Within MegaFon's hybrid architecture, the cluster can be easily adapted to any purpose without equipment downtime or serious server reconfiguration, which lets the company reduce personnel costs and make the service as convenient as possible for the subscriber.


PJSC MegaFon


2018 Kovalchuk Egor


[i] The system processes not only subscribers, but also devices with built-in sim cards, modems, etc.


[ii] Consult a specialist before use.

Source: https://habr.com/ru/post/436762/