What exactly identifies a site visitor?

Question

While studying backend development, I ran into something I do not understand. There is a website, a Node.js server (the platform does not really matter, but Node.js is what interests me), and a visitor who has come to the site. Regardless of whether the site has registration or not, what exactly identifies the visitor?

I know that this is handled by sessions. I have read about them, but I did not understand where, how, and from what data the server figures out that this is the very same visitor. For example, two visitors come to the site, one from Moscow and the other from Vorkuta, and the site has a chat. The chat is as simple as it gets: anyone can post a message without registering. The chat must show messages from the Vorkuta user on a blue background and messages from the Moscow user on a green one. Then a new visitor arrives from Moscow, from the same IP address but from another computer in the room next to our Muscovite (say, it is his wife), and her messages should have an orange background ...

I am not asking for code (unless it is needed as an illustration). Please explain the logic by which the server perceives visitors. How does it determine who is who?

@andreymal, just don't swear))) And what if cookies are not used, or cannot be used?
The server checks whether the visitor has a cookie, and if not, gives him a unique key as a cookie; the visitor remembers it and presents it with every request, and the server remembers it too.
@Air well, then anything else where you can stuff a random string and where it will not get lost.
In the old days, when cookies did not work very well, this random string was put right into the address (as far as I know, PHP can still do this).
@andreymal, that is clear and, in principle, it is what I need; thank you for taking part, it all makes sense ... Please write up the details as an answer, I think it will be more useful for the future ... I do not want to do it myself, because you will answer better: there are surely pitfalls that I cannot know about yet ...

Accepted Answer · 2017-11-03T09:41:25

I will list all the user identification methods known to me.

IP address

I mention this method because it is the only one that cannot be faked. An address can be borrowed from someone else (proxy, VPN, Tor, or simply a dynamic IP), but this is usually harder than, for example, clearing cookies. You cannot delete an IP address the way you clear cookies: there always has to be one. Because of its relative reliability (not everyone bothers to keep hundreds of proxy servers ready for changing IP), it is often used to enhance security: for example, to limit the maximum number of requests per second / minute / hour from a single IP. However, an IP address does not let you distinguish different people who share one Internet connection, which contradicts the condition of the question, so we move on.
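The per-IP request limit mentioned above could be sketched roughly as follows. This is a minimal in-memory illustration, not something taken from the answer: the class name IpRateLimiter, its parameters, and the sliding-window approach are all assumptions, and a real server would also need to evict idle addresses and share state between processes.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class IpRateLimiter:
    """Hypothetical helper: allow at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        # Maps an IP address to the timestamps of its recent accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # too many requests from this address
        hits.append(now)
        return True
```

Note that the two Moscow visitors from the question share one address, so they would also share one quota here, which is exactly why IP alone does not answer the question.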

Plain login and password

The idea is simple: we bluntly send the login and password with every request. One implementation of this method is already part of the HTTP protocol itself, via the Authorization header, and is supported by all major web browsers and web servers.

In the HTTP version, it works as follows:

on the first visit to the site, the client has nothing and sends no additional information to the server. The server responds with a 401 Unauthorized error and adds a WWW-Authenticate HTTP header describing the login methods (for a simple login and password, this is Basic realm="default")
the client receives this and asks the user for a login and password. It then repeats its request, this time with the Authorization HTTP header, which contains the base64-encoded login and password: Basic YWRtaW46MTIzNDU2. Decoding this example gives admin:123456, the login and password separated by a colon
the site checks this and either responds normally or returns 401 again and asks for the login and password once more
The client then sends this Authorization: Basic YWRtaW46MTIzNDU2 header with all subsequent requests.
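To make the encoding concrete, here is a small sketch of building and parsing the value of the Authorization header for the Basic scheme. The function names build_basic_header and parse_basic_header are made up for illustration; checking the parsed login and password against stored credentials is left out.

```python
import base64
from typing import Optional, Tuple


def build_basic_header(login: str, password: str) -> str:
    """Client side: encode "login:password" as the Authorization header value."""
    raw = f"{login}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic_header(value: str) -> Optional[Tuple[str, str]]:
    """Server side: return (login, password), or None if the header is unusable."""
    scheme, _, encoded = value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return None
    # The login cannot contain a colon, so split on the first one only.
    login, sep, password = decoded.partition(":")
    if not sep:
        return None
    return login, password
```

For the example from the text, build_basic_header("admin", "123456") produces the same Basic YWRtaW46MTIzNDU2 value, which shows that base64 here is only an encoding and not protection.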

Advantages:

simplicity. The HTTP version is already implemented in web browsers and web servers, so nothing needs to be invented. If you build your own version, it is enough to verify the login and password on every request, with no further complications.

Problems:

without HTTPS there is no security at all: the login and password effectively travel over the Internet in clear text. The client also has to keep the password in memory in clear text;
the HTTP version in browsers works only within the current browser session; after the browser is restarted, the login and password must be entered again.

In fairness, I should note that HTTP supports more than a bare login and password (there is a list of authentication schemes, possibly complete), but I will not dwell on the other methods because they are not widely used.

Random string

The simplest method, the best balanced between safety and convenience, and the most popular one. Probably the most common cookie in the world, PHPSESSID, is exactly this. The idea is:

on the first visit to the site, the client has nothing. The site notices this, creates a new random string (a long one, so that it is hard to guess; at least 30 characters) and, together with the usual response, sends this generated string in some way (Set-Cookie, a redirect to a special link, or simply in the response body if, for example, it is a JSON API)
the client receives this string along with the response and stores it somewhere (a browser stores it in cookies by itself, an SPA can put it in localStorage, etc.)
on subsequent visits to the site, the client adds this string to its requests (cookies, the Authorization HTTP header, or just a GET parameter in the requested address)
if the client needs to be identified more specifically (by login and password, for example), the site records in its database that such-and-such random string corresponds to such-and-such login, and on subsequent requests reads this information back from the database.
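The steps above can be sketched as a tiny in-memory session store. Everything here is illustrative: the SessionStore class and its method names are assumptions, a plain dict stands in for the database, and how the token travels (cookie, header, or URL) is left to the surrounding web framework. The random string comes from Python's secrets module, which is meant for security-sensitive tokens.

```python
import secrets
from typing import Dict, Optional, Tuple


class SessionStore:
    """Hypothetical store that maps random session tokens to per-visitor data."""

    def __init__(self) -> None:
        self._sessions: Dict[str, dict] = {}

    def identify(self, token: Optional[str]) -> Tuple[str, dict]:
        """Return (token, data); issue a fresh token if none or an unknown one was sent."""
        if token is not None and token in self._sessions:
            return token, self._sessions[token]
        # 32 random bytes give a URL-safe string of about 43 characters.
        new_token = secrets.token_urlsafe(32)
        self._sessions[new_token] = {}
        return new_token, self._sessions[new_token]

    def logout_everywhere(self, login: str) -> None:
        """Delete every session that was bound to the given login."""
        stale = [t for t, data in self._sessions.items() if data.get("login") == login]
        for t in stale:
            del self._sessions[t]
```

In the chat from the question, the server would call identify() with whatever token the request carried and keep the background color in the returned data, so the husband and the wife get different colors even though they share an IP address.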

In PHP all of this is built in: when you call the session_start() function, a PHPSESSID cookie is generated from random letters and digits (or the existing one is read, if there is one). The data associated with this cookie is stored in the $_SESSION array, which you can read and modify. By default the contents of this array are saved to a file; on subsequent requests from the user this file is read automatically when session_start() is called, and all the data you put into $_SESSION while processing previous requests is restored. Details are in the documentation.

Advantages:

the simplicity is obvious;
when the IP address changes (a frequent occurrence on mobile phones), identification is not lost;
a “Log me out on all devices” button comes down to simply deleting all of the user's records in the database.

Problems:

the random string generator must be truly random (or not truly random but cryptographically secure; not uniqid()), because an attacker can try to guess a pseudorandom string (for example, by recovering the generator state in PHP or Python, or by guessing sessions created through uniqid(), as in Invision Power Board). Under no circumstances use a hash of the login, a hash of the password, the current time, a single string prepared in advance, or other non-random things as the string, since this makes guessing much easier. To find out how to get real randomness, read the documentation for your programming language. Or simply use a ready-made implementation such as session_start() in PHP;
additional server load. To find out which user is behind a random string, the server has to query the database. This is not a problem for the vast majority of sites, but for giants such as Google it already is;
cookies are sometimes buggy: for example, IE11 sends cookies to subdomains even when this was not requested (already fixed in Edge), which can leak data to third-party CDNs, for example. So watch how the browsers you target handle cookies. And do not forget about HttpOnly, so that cookies cannot be stolen through XSS (and about Secure, if the site uses HTTPS).

Nonrandom but protected string (for example, JWT)

The idea is this: we brazenly violate the above ban on non-random data and put into the string, for example, the user ID and, optionally, the user's access rights (for example, admin), the expiration date of the string, and any other data. But! To this string we append a hash computed over the data plus a secret string that only the site knows and gives to no one. When a request arrives from the client, the site checks that the hash is correct. This protects against tampering and forgery: to forge the data, the hash must be recalculated, and an attacker who does not know the secret string cannot do that. (The secret string must be VERY long, around a hundred characters, so that it cannot be guessed at all, since all the security rests on it.) (In JWT, instead of a plain secret string, you can sign with RSA, which increases security, but I will not describe all the implementation details; this answer is long enough as it is.)

Advantages:

less server load. The client has already sent all the necessary data; the server only has to compute the hash over this data and the secret string and check that it matches the one sent. There is no need to go to the database: the secret string usually sits in a variable nearby, so everything is fast;
the client itself can read the JWT and learn who it is (if the data is only protected by a hash and not encrypted);
identification is likewise not lost when the IP address changes.

Problems:

the implementation is complex. If you do everything yourself, you can make a mistake and end up with a security hole, so it is best to use ready-made implementations such as JWT;
a “Log me out on all devices” button cannot really be made. For a string with user data to become invalid, you must either change the secret string or record somewhere in the database that the string with this data is no longer valid. But all of this is quite troublesome and negates the advantages of this identification method. That is why such strings are usually made short-lived: for example, Google issues JWTs in its API that are valid for only half an hour (the expiration date is stored right in the JWT, so there is no need to go to the database);
the information may go stale. For example, if you write into the JWT that the user is an admin and then revoke the admin rights, the site, relying on the JWT data, will keep treating the client as an admin until the JWT itself expires. You can take the information from the database instead, but then it again becomes simpler to use a random string;
JWT and its analogues are usually long, because they contain all the necessary information; with a large amount of data, the string may, for example, not fit into a cookie.

Supercookies and other fingerprinting

The idea is to use technologies for something other than their intended purpose. Each browser and each OS has its own behavioral quirks, and these quirks can be used to determine fairly accurately who has come to the site. For example, they render text slightly differently, and browsers can be told apart by minor pixel differences in the rendered text. I will not describe everything in detail.

Advantages:

it is extremely hard to get rid of. You can if you really want to, of course, but it is a lot of hassle. This is no longer just a click on a “Clear cookies” button. The client device will be identified regardless of whether it has changed its IP address, cleared its cookies, etc.

Problems:

accuracy is not one hundred percent. All iPhones are pretty much the same, and telling one iPhone X from another iPhone X is unlikely to succeed (although this applies only to fingerprinting; with supercookies it is simpler);
users will find you and beat you painfully.

Thank you very much, you have made the picture clearer than I expected ... I would not say I am an absolute beginner; I understand the logic of programming, and much of what I read in your answer I already knew, though not all of it and only superficially, but now I understand it much more deeply ...
I have added a note about the caching problems related to cookies, if you do not mind.
Correction: true randomness is not necessary, pseudo-randomness suffices; the main thing is that the RNG is cryptographically secure.
@sanmai I rolled back your edits, which contradict the standards and are refuted by simple experiments
@sanmai the complete insecurity of Sberbank has already been written about by all and sundry, you will not tell me anything new :D

Climenkomud · Answer 2 · 2017-11-03T09:02:23

Cookies are usually used: a specific data string stored in the user's browser. You can write the algorithm for generating them yourself or use a built-in mechanism, for example PHPSESSID in PHP (see the session_start() function).

You can also identify the user without them, but in that case you will have to use other parameters that are available. These are, first of all, the User-Agent (the user's browser) and the user's IP address. In PHP these values are stored in $_SERVER:

$_SERVER['HTTP_USER_AGENT']
$_SERVER['REMOTE_ADDR']

The second option is, accordingly, less accurate, but it will work if the user has cookies disabled. An example of an error with the second option: two users are behind the same NAT and use the same browser.
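As a rough illustration of the cookie-less option, the two values can be combined into a single identifier. The function name weak_visitor_id and the choice of SHA-256 truncated to 16 hex characters are assumptions made for this sketch; the result is only as unique as the IP address and User-Agent pair itself.

```python
import hashlib


def weak_visitor_id(ip: str, user_agent: str) -> str:
    """Hypothetical helper: derive a short identifier from IP address and User-Agent."""
    # The newline separator keeps the two fields from running into each other.
    material = f"{ip}\n{user_agent}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:16]
```

Two users behind the same NAT with the same browser pass identical arguments and therefore get the same identifier, which is the error case described above.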

Answer 3 · 2018-04-04T08:00:31

In addition to what has been said about the shortcomings of cookies, the following can be added:

transparent server-side caching of requests becomes impossible if cookies are used (be it Varnish or transparent caching in nginx). Therefore, if you do not set cookies on the server side when they are not needed, the pages of the site can use the server cache and will open faster. Caching responses that carry a Set-Cookie header simply contradicts common sense, so they also bypass the cache.
setting up a CDN for static resources requires care if cookies are used. If your site opens without www at, for example, test.ru, then by setting a cookie on the site you can expect that cookie to be sent in requests to all subdomains, including, for example, cdn.test.ru. That is why sites that traditionally open without www often use a separate second-level domain for static resources rather than a subdomain. For example, yastatic.net at Yandex.
RFC 6265 was supposed to fix the last problem, but at the time of writing this answer some major browsers still do not fully support it. These old browsers cannot simply be dismissed yet, because the share of old browsers is not equally low everywhere in the world: for example, in March 2018 IE accounted for 16% of all requests in Japan. And that is a country of 127 million people. If you are building a global service, then one way or another you will have to act as if RFC 6265 did not exist yet.

In addition to JWT, one can recall other "long" cookies such as ASP.NET ViewState, which have the same caching and CDN problems as ordinary random session cookies, only worse. Such cookies can be very large, easily ten kilobytes, and if they are sent with every request for any image or static file on your site, this will definitely affect the speed of the site in all modes. Microsoft explicitly recommends not using them if you care about the speed of the site.

What else?

Instead of cookies, you can use all sorts of headers that can end up in the browser cache. For example, if you once sent an ETag header with some image or file, the next time the browser requests that resource it will send the value of this header back. The user cannot see that you are tracking them this way, that is, that you know the same person has come again, which is why such techniques are not welcome.
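The mechanism can be sketched as follows, purely to show why it works. The function serve_tracked_file and its return shape are invented for this illustration, and the request headers are assumed to arrive as a plain dict with canonical names; browsers echo a cached ETag back in the If-None-Match request header when revalidating.

```python
import secrets
from typing import Dict, Tuple


def serve_tracked_file(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
    """Hypothetical handler: return (status, response_headers, visitor_id)."""
    presented = request_headers.get("If-None-Match")
    if presented:
        # The browser is revalidating its cached copy and echoes our earlier ETag.
        return 304, {"ETag": presented}, presented.strip('"')
    # First visit: hand out a fresh value that doubles as an identifier.
    visitor_id = secrets.token_hex(16)
    return 200, {"ETag": f'"{visitor_id}"'}, visitor_id
```

Clearing the browser cache removes the stored ETag, so the identifier is lost just as a cookie would be.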

Summary

If you can do without identifying your users, your site will be able to work faster.

The server cache should be placed after the authorization module, not in front of it.
When a cache on a reverse proxy is used, the proxy can also be configured so that it does not touch cookies.
In general, pages should be divided into those where the content is the same for all users and those where the content differs.
The former do not need cookies by definition; the latter cannot be cached on the server, also by definition.
Thank you, gentlemen, both the answer and the comments are quite informative ...
@PavelMayorov how do you put nginx after the authorization module?
I described the nginx variant in the next sentence.
“Putting a cookie on the site, the same cookie will by default be transmitted in requests to all subdomains” - remove this lie please, you yourself saw that this problem exists only in IE
