More pain than gain
We can't tell you much about the client who came to us with this task: we signed an NDA. Let's just say the client was the security department of a 'large government organization'.
About 2 years ago, the organization's security department checked all applicants manually: they googled names, examined open databases of public agencies, dug through social networks, and took roundabout ways to check whether there was any connection between an applicant and a current employee. However, the organization wasn't that much of an old fogy: it had an app for automating the process. It just didn't help, because:
- the interface was way too complicated and overwhelming: analysts spent a lot of time on even the simplest tasks
- the app had no integrated databases, so analysts couldn't pull data automatically
- many of its features simply didn't work
- the app was maintained by a developer who resisted any changes. Some developers suffer from this syndrome: they love their product so much that they can't accept improvements
The system worked but didn't actually help: vetting potential employees was difficult with or without it. The client knew things had to change. The organization was looking for a contractor to develop a big data solution with new features that would replace the existing system.
The client spent a lot of time searching for big data developers who could handle the task. Several contractors had tried before us: a freelancer, a cooperative of freelancers, and a small web agency. Each of them failed.
The client needed to:
- get a working big data solution
- cut down man-hours thanks to the solution
- simplify the search for connections between applicants and existing employees (according to the law on government service, there must be no such conflicts of interest; it's an official restriction)
When the client came to us, we agreed to eat the elephant one bite at a time and start with a prototype. On the one hand, it limits the risk for the client: they see what they pay for and know what to expect. On the other hand, it let us find out whether we could develop such a big data app at all. Usually we don't offer prototypes, but this situation called for one: the app required a feature we had never worked on before, graphs that visually display relationships between people.
The prototype proved that we could develop the system, so we moved on with our usual scenario:
- Estimated the design
- Designed the system
- Estimated the development
- Developed the big data app
- Signed maintenance contract
I managed the project. Alongside me, a UI/UX designer, 4 developers, and QA engineers worked on it.
The project's tech stack is React + Ruby on Rails. We chose Ruby because it works well with databases, and the project involved a lot of them (many databases, too many). Besides, it has off-the-shelf solutions for Apache Solr, a full-text search platform, ready-made libraries for Neo4j (a graph database management system), workers, and foolproof background processing.
Ruby on Rails provides ActiveModel, a powerful tool for building data models, and ActiveRecord, a real Swiss Army knife of an ORM. We used eager and lazy loading, validations, scopes, prefixes, inheritance, callbacks, and other tricks.
The code name we gave the project was Big Brother. By the way, we packed the prototype with a killer feature: we found and used a loophole in the security system of vk.com. This feature helped us persuade the client that they had finally met the right contractor. 🙂
Prototype and loophole in the security of Vkontakte
We started with Excel automation. Of course, the client had been using Excel before, but they squandered a huge amount of time filling in all the data manually: create a document, key in the available information, add the passport number, google the Federal Tax ID, then go looking for penalties and debts and enter them into the table by hand.
To make big data processing easier, we integrated Xneo and some other third-party services into the tables, so the information was fetched and stored automatically. When we finished, it worked as follows: an analyst enters the available information into the table → the system queries the services and picks up the required data → the system enriches the table.
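The enrichment flow above can be sketched roughly like this. The service object, field names, and stub response below are illustrative assumptions, not the project's real integrations:

```ruby
require "json"

# Stub third-party lookup: in production this would be an HTTP call
# to a service such as Xneo. The returned fields are invented.
class StubLookupService
  def fetch(phone)
    { "tax_id" => "500100732259", "debts" => [] }
  end
end

class Enricher
  def initialize(services)
    @services = services
  end

  # Merge whatever each service returns into the analyst's row.
  def enrich(row)
    @services.reduce(row.dup) do |acc, service|
      acc.merge(service.fetch(acc["phone"]))
    end
  end
end

row = { "name" => "John Doe", "phone" => "+70000000000" }
enriched = Enricher.new([StubLookupService.new]).enrich(row)
puts JSON.pretty_generate(enriched)
```

Each new integration then becomes just another object with a `fetch` method added to the list, which keeps the pipeline easy to extend.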
While working on the prototype, we found a loophole in the security of the social network Vkontakte. This loophole allowed us to find personal accounts on vk.com with just a phone number.
The open API of vk.com doesn't allow this, but we found a way: we used the private API and proxies, and imitated real user activity in the mobile app. Vkontakte requested access to the user's contact list so it could notify them when a friend signed up, and we simply seized this opportunity.
This is how it worked: we added the numbers of potential employees to a contact list and checked the profiles linked to those numbers. It became the killer feature of the prototype: it showed the client what the system was capable of (wow, it can do many things) and proved that we could handle developing such a big data system. The companies that had tried to create the app before us failed to complete this task.
However, Vkontakte eventually found and patched this loophole, though that happened after the project had already started. End of the story; mischief no longer managed.
No references, no preferences
We had no references at all. When we started designing, it felt like something over our heads. We knew the client needed an app that would help their security department weed out unwanted hires and prevent nepotism. The client told us which features they wanted to see in the system, but that still wasn't enough to build a complete vision of the project.
So we started diving deep into the project. Yes, it took a lot of time up front, but it helped us design everything properly. Our lead designer immersed himself in developing user scenarios: he spent a great deal of time searching for his friends and for strangers on the internet, investigating geotags on their photos, digging through open sources, and even turning to the databases of the tax office and the road police. He tried to figure out where these people worked, lived, and celebrated holidays. It's easy nowadays, because people like sharing every moment of their lives.
The research helped us understand what information the client would have at the start: passport number, personal insurance number, and phone numbers, all of which can be filled in manually. This is how we came up with the profile setup form, grouping its sections into logical categories. Since the client didn't have special requests, we faced no significant edits, only some directorial adjustments along the lines of 'it'd be better to place this field here because it's more convenient for us'. We never argue over such corrections: the user's convenience comes first, and in this case the client and the user are the same person. After all, this is big data development for business.
When we finished the central part, we got down to designing the graph visualization. It looked like the most complicated task because of the restrictions imposed by the D3 library: we couldn't invent anything new, only customize what the library provides. It was hard, but we managed to pull it off. 🤘🏻
Your life in our hands
The app truly walked out of the best dystopias. It's composed of features that seem impossible, since most people believe their personal data is protected. For instance, a phone number is enough to gather all available data about a person: the system will find everything related to that number via integrated services. Among them:
- CROINFORM: provides access to the Unified State Register of Legal Entities, Unified State Register of Immovable Property, Federal Tax Service, Federal Bailiff Service, and other internet sources
- Xneo: finds open data from social network accounts, ads posted on avito.ru, cian.ru, and other C2C marketplaces, plus information about vehicle ownership
- Open API of vk.com
- Avinfobot: detects vehicle sale ads posted on drom.ru or avito.ru, plus information about parking tickets
- EGRUL bot: informs if a person is registered as a private entrepreneur
- MNP bot: finds the carrier operator and the region where the SIM card was registered
- Federal Penitentiary Service of Russia: checks if a person is wanted by authorities
- The Russian Ministry of Internal Affairs: verifies a passport
- Main Directorate for Traffic Safety in Russia: checks for parking tickets and other penalties
- Reestr Zalogov: searches for notices of property pledging
- KonturFocus: integrated with 26 more sources; searches for counterparties and corporate affiliations, and finds information by Federal Tax ID (entrepreneurial activities, ownership interests in legal entities)
When the organization suspects that a person has a friend inside the company, the security department can simply enter the phone numbers of both the applicant and the employee, and the system will unearth all of their common activities. If they have ever parked in the same parking lot, gone to the same school, or lived on the same street, the system will find it out. 🙃
An analyst on the security team can create a profile of a person, get all potential connections between that person and employees, and then export the data to an Excel table. Profiles are updated automatically, so even if you change your passport, the organization will know.
When sources are at sixes and sevens
Integrating with such a large number of sources is hard. Some of them had a fully-fledged API, but others could only be integrated through workarounds. For example, sometimes we had to monitor the requests sent by a service's own frontend and replicate them.
Periodically, some of the services glitched out and others changed their APIs, so we had to constantly keep an eye on the stability of each one. Besides, sometimes we sent too many requests to a service and exceeded its limits, and the system spat out errors: half of the list would be processed, and the other half couldn't be found. To make the system work flawlessly, we programmed a pause: it processes one person → returns the data on him or her → waits for a certain amount of time depending on the service → starts processing the next one. And so on until the end of the queue set by an analyst.
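A minimal sketch of that pause logic, assuming an injectable sleeper so the delay can be set per service (the class name, delay value, and callable are made up for illustration):

```ruby
# Processes a queue of people one at a time, pausing between requests
# to stay under a third-party service's rate limit.
class ThrottledQueue
  def initialize(delay:, sleeper: ->(s) { sleep(s) })
    @delay = delay
    @sleeper = sleeper # injectable so tests can skip real sleeping
  end

  def process(people)
    results = []
    people.each_with_index do |person, i|
      results << yield(person)
      # Pause between requests, but not after the last one in the queue.
      @sleeper.call(@delay) if i < people.size - 1
    end
    results
  end
end

pauses = []
queue = ThrottledQueue.new(delay: 2, sleeper: ->(s) { pauses << s })
results = queue.process(%w[alice bob]) { |name| "checked #{name}" }
```

In the real system the delay would differ per service; injecting the sleeper also makes the throttling testable without actually waiting.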
It would have been great if integration had been the only problem. Just connecting the system to the services isn't enough: the data needs to be found and recorded correctly. At the same time, every source keeps records differently: one service uses a date format like 'day/month/year', another 'day.month.year', or even 'day-month-year'. A human deals with this easily, but for a system it's a mind-blowing task, so we had to adapt the information from each source independently.
To solve the problem, we needed to normalize the data, so we connected the app to MongoDB. Everything then worked like this: we got the information → put it into MongoDB (in other words, filled the database with raw data) → normalized it → saved the structured data in PostgreSQL (i.e. cleaned up and consistent). This way everything was displayed properly, and the different data formats received from external services stopped being a problem for our system.
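One normalization step, unifying the date formats mentioned above into ISO 8601, might look like this pure-Ruby sketch; the exact format list and the nil fallback are assumptions:

```ruby
require "date"

# Date formats observed across the sources (the real list is an assumption).
DATE_FORMATS = ["%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y"].freeze

# Try each known format in turn and return an ISO 8601 string,
# or nil so unparseable values can be flagged for manual review.
def normalize_date(raw)
  DATE_FORMATS.each do |fmt|
    return Date.strptime(raw, fmt).iso8601
  rescue ArgumentError
    next
  end
  nil
end

normalize_date("31/12/1999") # => "1999-12-31"
normalize_date("31.12.1999") # => "1999-12-31"
```

The same pattern applies to phone numbers, names, and addresses: parse the raw MongoDB value into one canonical form before it reaches PostgreSQL.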
Battle of titans
As I said, we had never worked with graphs (diagrams that visualize the relationships between people) before this project, which made it the most interesting part for us. To draw graphs on the frontend, we decided to use the D3 library, which is often used for histograms, trees, and dynamic graphs. Its main advantage is its level of abstraction: D3 provides a lot of tools that you can combine however you need.
The D3 library is not that simple. Complicated terminology aside, it gives you a set of tools: say, a hammer, nails, and planks. Whatever you build with them, a first-class house or a rural outhouse, is always up to you.
We were satisfied with the library, yet we realized it would be hard to fit into the code. We had to embed D3, which has its own principles for structuring the document, into React, which manages the DOM structure of the document itself. The two entities conflicted with each other, each fighting for independence, and we needed to nest one inside the other. That's why we decided to look for something else that would get along with React.
We did find several solutions, but none of them offered the capability set we needed. So we went back to D3: we started by puzzling out the documentation and finished by coding the missing parts ourselves.
Lenin street in every town
We took the D3 library for the frontend and Neo4j for the backend. Neo4j is a graph database management system; it's great at building relations between different nodes, which is exactly what the client needed.
In the system, a 'node' is a person. Each node has a bunch of 'relationships' that represent a place of work, a place of residence, and so on. But here's the problem. Let's say we are creating a profile of a person who lives on Lenina Street. Do you know how many such addresses there are in Russia? Almost every city has a Lenina Street. What would the relationship be in this case? Right: incorrect. It would be just a pile of dots that can't be sorted out, and the system simply wouldn't understand how to connect people.
We figured there had to be a database that classified all addresses. We were right: we found the Federal Information Address System, a database that assigns a unique ID to each address object. To put it simply, any building in Russia has its own unique number. These IDs solved the problem of ambiguous nodes: all nodes are correctly understood by Neo4j, and the next time a new person is added to the system, Neo4j connects them to others if a relation exists.
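To show why a unique address ID matters in the graph, here is a sketch of the kind of Cypher the system could run against Neo4j. The labels, relationship type, and property names are assumptions about a schema we don't know; `$fias_id` and `$person_id` stand for bound query parameters:

```ruby
# MERGE matches an existing node with the same unique ID instead of
# creating yet another 'Lenina Str.' node for every city.
MERGE_ADDRESS = "MERGE (a:Address {fias_id: $fias_id}) RETURN a"

# Link a person to that address; runs after both nodes exist.
LINK_PERSON = <<~CYPHER
  MATCH (p:Person {id: $person_id})
  MATCH (a:Address {fias_id: $fias_id})
  MERGE (p)-[:LIVES_AT]->(a)
CYPHER

puts MERGE_ADDRESS
puts LINK_PERSON
```

Because MERGE keys on the unique ID, adding a second resident of the same building reuses the existing node, and Neo4j can then connect the two people through it.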
Different schools, same story. To find a unique school ID, we used the open API of vk.com, which assigns IDs following a general-to-specific approach. Thanks to this, our system always knows which school out of a hundred is meant.
Optimize this, that, and one of those
When we finished the visualization of relationships between applicants and employees, we ran into a new problem. Imagine that we are analysts at the organization and need to build connections between 100 people across all parameters at once. How long would that take? 10 minutes? We knew it needed to be optimized.
First, we added RAM. Neo4j was stifled by the volume of queries, and as a result, the app began to slow down. We raised the system's default limits to the maximum and forbade the Puma server's 'scavenger' from killing heavy processes: by default, it shot down the process of converting large graphs because they required too much memory. This way we managed to stop queries from being declined. But we wanted to do better.
We began digging into the objects we obtained from Neo4j and found a bottleneck. The point is that Neo4j needs a particular set of data to operate: a mutex, IDs, parameters, and a lot of duplicated values. The system, however, didn't need any of that to build graphs. We made it send only the required fields to JSON conversion, boosting performance by 20%.
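The payload-trimming idea can be illustrated like this; the field names are invented for the example, not taken from the real system:

```ruby
require "json"

# Only the fields the graph UI actually renders (illustrative names).
GRAPH_FIELDS = %w[id label relations].freeze

# Drop Neo4j's internal bookkeeping before JSON conversion so we
# serialize and ship only what the frontend needs.
def slim_for_graph(record)
  record.slice(*GRAPH_FIELDS)
end

raw = {
  "id" => 1,
  "label" => "John Doe",
  "relations" => [2, 3],
  "mutex" => "internal",  # bookkeeping the UI never reads
  "duplicates" => [1, 1]
}

slim = slim_for_graph(raw)
puts slim.to_json # a smaller payload, faster to convert and transfer
```

Serializing fewer keys per node adds up quickly when a graph contains hundreds of people, which is where the speedup came from.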
We decided to push further and came up with the idea of using a cache. With a cache, we could move the graph-building process into the background. It didn't reduce the build time, but now the user can start several processes simultaneously and come back for the results later.
Forms are a whole different story. The main problem was the amount of data the system received from third-party services. We began with a minimal version, but the further we went, the harder it got: during development, the client was constantly adding new data sources. The form became freakishly large, so we had to find a way to optimize the long list of fields.
To build the forms, we used the FinalForm library. It gave us many parameters that help monitor the state of individual fields or validate the whole form at once. We split all the forms into functional groups: forms that must be filled in to create a profile, and forms that are filled in by the system.
The search engine
Full-text search was another feature that required a well-thought-out solution. The problem was the same: a huge number of profiles and vast webs of relationships between people. The client didn't have any particular ideas on how to implement the feature, so we were given leeway. We connected the Apache Solr platform to search through the profiles.
We chose Apache Solr because we needed a tool that can deal with full-text search and grouping search results. Our experience showed that Solr copes with these tasks best.
Implementing the search engine wasn't difficult. Figuring out how to sort the search results was. We decided to add parameters to search by, and had to tinker with indexing because each parameter needed to be indexed differently. For example, if we know a surname, we can run:
- exact-match search
- search by first letters
- search by occurrence
Another example: when we need to find a profile by vehicle identification number, the system searches by the first characters. However, if we key in the full number, the system returns an exact match. So we had to set the right data format for each parameter so the system knew how to index it.
Here is how it works: the user types in 'John' → the system shows the most relevant results across all profiles, be it a first name, a surname, or a company name → the user can pick the parameter they are looking for and continue searching within it. That's convenient when we're talking about data this massive.
We also added Edge N-grams to enable live search. For instance, we want to find a company named ‘Purrweb’ 🙃 — we start typing in and see results starting with ‘p…’, ‘pu…’, ‘pur…’, ‘purr…’ and so on.
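Edge N-grams are easy to illustrate in a few lines of Ruby. Solr generates these prefixes at index time; this toy function only mimics the idea, and the size limits are made up:

```ruby
# Generate the leading-edge prefixes of a term, the way an Edge N-gram
# filter would, so that partial input like 'pur' can match 'purrweb'.
def edge_ngrams(term, min: 1, max: nil)
  max ||= term.length
  (min..[max, term.length].min).map { |n| term[0, n] }
end

edge_ngrams("purrweb", min: 1, max: 4)
# => ["p", "pu", "pur", "purr"]
```

Every prefix is stored in the index, so live search is just an exact lookup against these n-grams rather than an expensive wildcard scan.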
The client approved our implementation of the search but soon came up with an idea for improving it. The security department wanted it to work like Google: the user types in a query and the system hits them with everything related to it, not only results but also suggestions ('Did you mean …?' or 'People also search for …'). The user can accept a suggestion or not; if not, the system runs a search across all profiles. We kept the option to apply parameters, though.
Champions in heavyweight development
Working on such a project is interesting in itself. Working with such a client, a government organization, adds a bit of spice. Every call was attended by 5-7 people with straight faces and serious questions, each with their own vision of the project. Strategic decisions and agreements were approved by even more serious people. However, we were ready for negotiations: this is how things are done when developing big data for business.
We worked on a fixed-price basis (it was a tender, after all) and had a billion new features proposed for the scope. Small changes we made for free. For large ones, we spent 3-4 hours on a call explaining why we couldn't take them on.
At the same time, we swung between two extremes: features either came thick and fast, or the client had no specific requests for the next sprint. Big data developers who had no tasks were switched to other projects; when the client settled on a vision, we had to bring them back. You've probably heard that multitasking is a myth and a generally unhealthy thing, and it wasn't good for any of us either.
The project came to an end, the Security Department of the organization is already testing the system, and we are adding the great experience of working with a top-tier company and developing big data for business to our portfolio.
'There are still some features in the backlog. However, our employees already use the system. The app that handles big data helped us streamline the workflow: most tasks are now done 20x faster. We managed to boost performance,' the client says.
Now the app lets analysts create profiles and build relationships between people. On the one hand, we won: we learned a new technology, handled graph visualization, and developed big data for business.
On the other hand, we created an eye-pleasing and easy-to-use tool as if following the specifications of Orwell, Zamyatin, and Huxley at the same time. Now we can meet our friends and tell them that Big Brother really exists.
If you are looking for a company that provides big data development, feel free to contact us. We offer big data as a service: we analyze requirements, study competitors, and design, develop, and test big data applications.
- Anastasia Enina
- Vadim Subbotin
- Julia Tikhiy-Tishchenko
- Dmitriy Kryhtin
- Alexei Matyushin
- Kanat Askarov
- Maxim Orlov
- Konstantin Romanov
- Konstantin Zemlyakov
- Tatyana Perina
- Igor Andreev
- Daria Lobacheva