Table of Contents
- Introduction
- Opening the torrent file
- Getting peers via the tracker
- Downloading from peers
- Conclusion
1. Introduction
1.1 About this guide
This guide will build a simple bittorrent client from scratch in node.js. By the end of this tutorial you should be able to use the command line to download the shared contents of a torrent file.
I had two audiences in mind for this guide. First is anyone who has learned the basics of javascript and looking for an intermediate level tutorial. Second is for web developers who may already have a good grasp of javascript but little experience working on the backend. This guide will go over things like reading/writing to file and network sockets.
I’ll be committing changes on this project using git, so you’ll be able to see how the project evolves step by step. You can find the project’s github page here. The first commit will just be a bare package.json, .gitignore file, and a sample torrent file for you to work with. I’ll also mark points throughout this guide with links to commits.
1.2 Overview of bittorrent
The bittorrent protocol has to two main parts.
Step 1: You need to send a request to something called a tracker, and the tracker will respond with a list of peers. More specifically, you tell the tracker which files you’re trying to download, and the tracker gives you the ip address of the users you download them from. Making a request to a tracker also adds your ip address to the list of users that can share that file.
Step 2: After you have the list of peer addresses, you want to connect to them directly and start downloading. This happens through an exchange of messages where they tell you what pieces they have, and you tell them which pieces you want.
You can get an idea of how this all works in the image below (credit to morehawes.co.uk):
1.3 Links and references
These are links that I referenced many times during this project:
wiki.theory.org/index.php/BitTorrentSpecification - This is an unofficial bittorrent specification but basically has everything you need to know. Detailed and very readable.
www.bittorrent.org/beps/bep_0015.html - The only thing thing you won’t find in the unofficial spec is how to form a request to a tracker that uses a UDP url. But you can find it here in this link. I’ll remind you about this link again when I cover trackers and UDP.
www.morehawes.co.uk/the-bittorrent-protocol - A very good high level explanation of the bittorrent protocol. With pictures!
www.kristenwidman.com/blog/33/how-to-write-a-bittorrent-client-part-1 - Another good high level explanation.
www.bittorrent.org/beps/bep_0020.html - BEP for peer id conventions.
2. Opening the torrent file
*Github commit #1: initial commit
Enough intro, let’s get started on the code! If you click the link above to go to our initial commit, you see that we’re basically starting from a blank slate. This is our current file structure:
You can see that I’ve added a torrent for us to work with. It contains a picture of a puppy in a cup. Let’s create a new file called index.js and start by opening the torrent file.
index.js:
readFileSync
is the easiest way to read the contents of a file. But if you
run this code you’ll realize that readFileSync
returns a buffer
, not a string. Later on you’ll see that all our network messages are sent
and received in the form of buffers, so it’s important that you have a good
understanding of how they work. If you’re not familiar with buffers in node.js
I’ve written a mini guide on buffers here. The short story is that buffers
represent a sequence of raw bytes. If you want to read the buffer as a string
you have to specify an encoding scheme (you can see I used utf-8 above).
The output should have looked something like this:
'd8:announce43:udp://tracker.coppersurfer.tk:6969/announce10:created by13:uTorrent/187013:creation datei1462355939e8:encoding5:UTF-84:infod6:lengthi124234e4:name9:puppy.jpg12:piece lengthi16384e6:pieces160:T�k�/�_(�S\u0011h%���+]q\'B\u0018�٠:����p"�j����1-g"\u0018�s(\u001b\u000f���V��=�h�m\u0017a�nF�2���N\r�ǩ�_�\u001e"2���\'�wO���-;\u0004ע\u0017�ؑ��L&����0\u001f�D_9��\t\\��O�h,n\u001a5g�(��仑,�\\߰�%��U��\u0019��C\u0007>��df��ee'
2.1 Bencode
*Github commit #2: open torrent, get tracker url
That output probably looked fairly incomprehensible to you, and that’s because you’ve probably never heard of bencode. Bencode is data serialization format, and I don’t think I’ve seen it used anywhere outside of torrent files. But you may be familiar with JSON or XML, and bencode is essentially the same idea, it just uses a slightly different format.
Here’s the same data again in JSON:
'{"announce":"udp://tracker.coppersurfer.tk:6969/announce","created by":"uTorrent/1870","creation date":1462355939,"encoding":"UTF-8","info":{"length":124234,"name":"puppy.jpg","piece length":16384,"pieces":"T�k�/�_(�S\\u0011h%���+]q\'B\\u0018�٠:����p\\"�j����1-g\\"\\u0018�s(\\u001b\\u000f���V��=�h�m\\u0017a�nF�2���N\\r�ǩ�_�\\u001e\\"2���\'�wO���-;\\u0004ע\\u0017�ؑ��L&����0\\u001f�D_9��\\t\\\\��O�h,n\\u001a5g�(��仑,�\\\\߰�%��U��\\u0019��C\\u0007>��df��"}}'
That should look a bit more familiar (except for the pieces property, which I’ll get to later). While it’s possible to write our own bencode parser, there are many good open source libraries that can do it for us. I just googled “bencode node” and installed the first result:
npm install --save bencode
Now we can parse our torrent into data that we can interact with:
You’ll notice that many of the values in our decoded torrent object are
buffers. It’s possible to convert all buffers to strings by passing ‘utf8’ (or
any other encoding scheme) as the second arguemnt to bencode.decode
, but
since we’ll be working mostly with buffers when we use this data in the
networking portion of this project, it’s best to keep them as buffers.
3. Getting peers via the tracker
In the code above I console.log
the announce property of the torrent. For
this particular file it happens to be udp://tracker.coppersurfer.tk:6969/announce
.
The announce url is what I’ve been calling the tracker’s url, it is the
location of the torrent’s tracker.
One interesting thing you’ll notice is that instead of the usual ‘http’ in front, this url has ‘udp’. This is because instead of the http protocol, you must use the udp protocol. It used to be that all trackers used http, but nowadays nearly all new torrents are using udp. So what’s the difference between the two protocols, and why switch to udp?
3.1 Http vs udp vs tcp
The main reason that most trackers now use udp is that udp has better performance than http. Http is built on top of another protocol called tcp, which we’ll use later in the project when we start actually downloading files from peers. So what’s the difference between tcp and udp?
The main difference is that tcp guarantees that when a user sends data, the other user will recieve that data in its entirety, uncorrupted, and in the correct order – but it must create a persistent connection between users before sending data and this can make tcp much slower than udp. In the case of upd, if the data being sent is small enough (less than 512 bytes) you don’t have to worry about receiving only part of the data or receiving data out of order. However, as we’ll see shortly, it’s possible that data sent will never reach its destination, and so you sometimes end up having to resend or re-request data.
For these reasons, udp is often a good choice for trackers because they send small messages, and we use tcp for when we actually transfer files between peers because those files tend to be larger and must arrive intact.
3.2 Sending messages with UDP
In the code below, I’ve made some changes to show how we can send and receive messages via UDP. NOTE: This example may not get a response back because the message being sent isn’t formatted correctly. This is just an example to show the various modules we’ll be using.
There’s a lot going on in this code so let’s try to break it down.
-
First we require 3 more modules,
dgram
,buffer
, andurl
. These are all just from the standard library. -
I use the
url
module’s parse method on our tracker url. This lets me easily extract different parts of the url like its protocol, hostname, port, etc. -
The
dgram
module is our module for udp, and here I’m creating a new socket instance. A socket is an object through which network communication can happen. We pass the argument ‘udp4’, which means we want to use the normal 4-byte IPv4 address format (e.g. 127.0.0.1). You can also pass ‘udp6’ for the newer IPv6 address format (e.g. FE80:CD00:0000:0CDE:1257:0000:211E:729C) but this format is still rarely used. -
In order to send a message through a socket, it must be in the form of a buffer, not a string or number.
Buffer.from
is an easy way to create a buffer from a string, see my mini guide for more info on buffers. -
The socket’s
send
method is used for sending messages. The first argument is the message as a buffer. The next two arguments let you send just part of the buffer as the message by specifying an offset and length of the buffer, but if you’re just sending the whole buffer you can just set the offset to 0 and the length to the whole length of the buffer. Next is the port and host of the receiver’s url. Finally the last argument is a callback for when the message has finished sending. -
Here we tell the socket how to handle incoming messages. Whenever a message comes back through the socket it will be passed to the callback function.
3.3 UDP tracker protocol and message format
Now we know how to send and receive messages through UDP, but in order to get a list of peers from the tracker, the tracker will be expecting messages to follow a specific protocol. In short you must:
- Send a connect request
- Get the connect response and extract the connection id
- Use the connection id to send an announce request - this is where we tell the tracker which files we’re interested in
- Get the announce response and extract the peers list
Since the amount of code related to the tracker is starting to get large, at this point I want to create a new file called tracker.js and move all tracker related code there.
tracker.js:
*Github commit #3: move UDP code into new file tracker.js
Here I’m exporting just the one function getPeers
which fairly closely follow
the 4 step protocol that I listed above. When we finally get the list of peers
we pass it to the callback. The other 6 functions are as
follows:
-
udpSend
is just a convenience function that mostly just callssocket.send
but lets me avoid having to set the offset and length arguments since I know I want to send the whole buffer, and sets a default callback which is just an empty function, since I mostly don’t need to do anything after sending the message (see point 5 in the previous section if you need a refresher onsocket.send
) -
respType
will check if the response was for the connect or the announce request. Since both responses come through the same socket, we want a way to distinguish them. -
to 6. These 4 methods will build and parse the connect and announce messages.
index.js has also been refactored to use our new getPeers
function.
index.js:
3.3.1 Connect messaging
Now let’s take a look at actually building the messages. Each message is a buffer with a specific format described in the BEP. Let’s take a look at the connect request first.
The BEP describes the connect request as follows:
Offset Size Name Value
0 64-bit integer connection_id 0x41727101980
8 32-bit integer action 0 // connect
12 32-bit integer transaction_id ? // random
16
This tells us that our message should start out with a 64-bit (i.e. 8 bytes) integer at index 0, and that the value should be 0x41727101980. Since we just write 8 bytes, the index of the next part is 8. Now we write 32-bit integer (4 bytes) with the value 0. This moves us up to an offset of 12 bytes, and we write a random 32-bit integer. So the total message length is 8 bytes + 4 bytes + 4bytes = 16 bytes long, and should look something like this:
<Buffer 00 00 04 17 27 10 19 80 00 00 00 00 a6 ec 6b 7d>
Let’s see how we can implement this in our tracker.js file.
tracker.js:
-
First we require the built-in crypto module to help us create a random number for our buffer. We’ll see that in action shortly.
-
Then we create a new empty buffer with a size of 16 bytes since we already know that the entire message should be 16 bytes long.
-
Here we write the the connection id, which should always be 0x41727101980 when writing the connection request. We use the method
writeUInt32BE
which writes an unsigned 32-bit integer in big-endian format (more info here). We pass the number 0x417 and an offset value of 0. And then again the number 0x27101980 at an offset of 4 bytes.You might be wondering 2 things: what’s with the 0x? and why do we have to split the number into two writes?
The 0x indicates that the number is a hexadecimal number, which can be a more conventient representation when working with bytes. Otherwise they’re basically the same as base 10 numbers.
The reason we have to write in 4 byte chunks, is that there is no method to write a 64 bit integer. Actually node.js doesn’t support precise 64-bit integers. But as you can see it’s easy to write a 64-bit hexadecimal number as a combination of two 32-bit hexadecimal numbers.
-
Next we write 0 for the action into the next 4 bytes, setting the offset at 8 bytes since just wrote an 8 byte integer. This values should always be 0 for the connection request.
-
For the final 4 bytes we generate a random 4-byte buffer using
crypto.randomBytes
which is a pretty handy way of creating a random 32-bit integer. To copy that buffer into our original buffer we use thecopy
method passing in the offset we would like to start writing at.
Parsing the response is much simpler. Here’s how the response is formatted:
Offset Size Name Value
0 32-bit integer action 0 // connect
4 32-bit integer transaction_id
8 64-bit integer connection_id
16
And here’s the parsing function:
tracker.js:
You can see I read the action and the transaction id as unsigned 32 bit
big-endian integers, passing in the offset. Then I just use the
slice
method to get the last 8 bytes. Since I can’t read a 64-bit integer it’s easier to
just leave it as a buffer.
github commit #4: add connection request and response functions
3.3.2 Announce messaging
Many of the concepts here are the same as the connect request and response. However there are some things to look out for that I’ll mention. Here’s the announce request below.
Offset Size Name Value
0 64-bit integer connection_id
8 32-bit integer action 1 // announce
12 32-bit integer transaction_id
16 20-byte string info_hash
36 20-byte string peer_id
56 64-bit integer downloaded
64 64-bit integer left
72 64-bit integer uploaded
80 32-bit integer event 0 // 0: none; 1: completed; 2: started; 3: stopped
84 32-bit integer IP address 0 // default
88 32-bit integer key ? // random
92 32-bit integer num_want -1 // default
96 16-bit integer port ? // should be betwee
98
tracker.js
util.js:
Here’s a quick list of things I want to mention about this code:
-
“peer id” is used to uniquely identify your client. I created a new file called util.js to generate an id for me. A peer id can basically be any random 20-byte string but most clients follow a convention detailed here. Basically “AT” is the name of my client (allen-torrent), and 0001 is the version number.
As you can see the id is only generated once. Normally an id is set every time the client loads and should be the same until it’s closed. We’ll we using the id again later.
-
The code for calculating “info hash” and “left” are a bit complicated so I wanted to split that out into a new file called torrent-parser.js. We’ll go over that in the next section, but for now you can just ignore those parts.
-
Notice that this function takes a
torrent
parameter now. Again, this is for the “info hash” and “left” fields, we’ll be going over that shortly. -
If you look at the BEP for this request it tells you what the offsets should be, you don’t have to count them out yourself!
-
I expect
connId
to be a buffer, since that’s how we left it in theparseConnResp
method, so I just copy it into the buffer directly. -
There are a couple spots where I need a 64-bit integer but since they’re always initialized to 0, I just used
Buffer.alloc(8)
, as this will just create an 8-byte buffer with all 0s. -
For the “num want” part I want to point out that I used writeInt32BE instead of writeUIntBE (note the U). Because the number is negative you cannot used the unsigned version.
-
Finally, for the port, the official spec says that the ports for bittorrent should be between 6881 and 6889, so I’ve decided to use a default of 6681.
Now let’s take a look at parsing the response:
Offset Size Name Value
0 32-bit integer action 1 // announce
4 32-bit integer transaction_id
8 32-bit integer interval
12 32-bit integer leechers
16 32-bit integer seeders
20 + 6 * n 32-bit integer IP address
24 + 6 * n 16-bit integer TCP port
20 + 6 * N
It’s a bit tricky because the number of addresses that come back isn’t fixed. The addresses come in groups of 6 bytes, the first 4 represent the IP address and the next 2 represent the port. So our code will need to correctly break up the addresses part of the response.
Most of this code is pretty straightforward. I think you could put the group
function outside or in a new file which would make it reusable, but for the
purposes of this tutorial I think it’s fine. Also a bit surprising that you can
call join
on a buffer. This is a bit fancy as it coerces the bytes into a
string but seems to work and I think it looks nice.
3.3.3 Info hash and torrent size
I created a new file called torrent-parser.js where I can keep all the code related getting information out of a torrent file. This means I also want to move the code for opening a torrent file here.
torrent-parser.js:
index.js:
github commit #5: add announce request and response
3.3.3.1 Info Hash
Now let’s go back to when we first opened the torrent file. Remember it looked something like this:
'{"announce":"udp://tracker.coppersurfer.tk:6969/announce","created by":"uTorrent/1870","creation date":1462355939,"encoding":"UTF-8","info":{"length":124234,"name":"puppy.jpg","piece length":16384,"pieces":"T�k�/�_(�S\\u0011h%���+]q\'B\\u0018�٠:����p\\"�j���1-g\\"\\u0018�s(\\u001b\\u000f���V��=�h�m\\u0017a�nF�2���N\\r�ǩ�_�\\u001e\\"2���\'�wO���-;\\u0004ע\\u0017�ؑ��L&����0\\u001f�D_9��\\t\\\\��O�h,n\\u001a5g�(��仑,�\\\\߰�%��U��\\u0019��C\\u0007>��df��"}}'
Last time we pulled the announce property from this object. Can you see that it also has an info property? If you were take the info property and pass it through a SHA1 hashing function, you would get the info hash! You can apply a SHA1 hash easily using the built-in crypto module.
torrent-parser.js:
Why use a SHA1 hashing function?
SHA1 is one of many hashing functions but it’s the one used by bittorrent so
in our case no other hashing function will do. We want to use a hash because
it’s a compact way to uniqely identify the torrent. A hashing function returns
a fixed length buffer (in this case 20-bytes long). For example, our example
torrent would output <Buffer 11 7e 3a 66 65 e8 ff 1b 15 7e 5e c3 78 23 57 8a db 8a 71 2b>
.
Because it’s very unlikely for two inputs to output the same hash value, and because the input (the info property) contains information about every piece of the torrent’s files (more about this later), it’s a good way to uniquely identify a torrent. That’s why we must send the info hash as part of the request to the tracker, we’re saying we want the list of peers that can share this exact torrent.
3.3.3.2 Torrent Size
When we were building the announce request we saw that we needed to fill out the “left” field. Actually we just send the whole size of the torrent files. Let’s see how we can calculate that.
torrent-parser.js:
github commit #6: add infoHash and size functions to torrent-parser
There are two cases we have to consider - torrents that have one file or more
than one file. If the torrent only has one file then we can get it’s size in
the torrent.info.length
property. But if it has multiple files, it will have
a torrent.info.files
property instead which is an array of objects for each
file. We have to iterate over these file objects and sum their length property.
There’s one more problem we have to consider with the file size, which is that it may be larger than a 32-bit integer. The easiest way to deal with this is to install a module to handle larger number for us. I’ve decided to use the ‘bignum’ library which is a popular one for this use case. Install it with the command:
npm install –save bignum
You can see that I write the number into a buffer using the bignum.toBuffer
function. The option {size: 8}
tells the function you want to write the
number to a buffer of size 8 bytes. This is also the buffer size required by
the announce request.
3.3.4 Finishing touches, response type and retries
Now almost all the pieces have come together for communicating with the
tracker. The last thing we have to do is write the respType
function to
identify whether a response was a connect response or an announce response.
There’s more than one way to write this method. After looking at the structure of the two response types I noticed that the connect response has an action value of 0 and the announce response has an action value of 1.
tracker.js:
github commit #7: add respType function to tracker.js
And that’s all we need to be able to run the program and get a list of peers for our torrent.
It’s possible that when you run the program it will hang. As I mentioned before it’s possible for udp messages to get dropped in transit, so if that happens just try rerunning the program. Or you could write your own function to retry after a timeout as a bonus exercise!
If you do attempt this, you should wait 2^n * 15 seconds between each request up to 8 requests total according to the BEP. This is called exponential backoff and the reason you want to do this is to balance two concerns.
-
It’s possible your message was lost in transmission, in which case you retransmit as soon as possible.
-
The message is still coming, but the network is experiencing traffic. In this case you don’t want to spam the network with even more requests.
This is a common technique for unreliable network communications.
4. Downloading from peers
Now that we’re able to get a list of peers for our files, we want to actually download the files from them. Here’s a basic overview of how this will work:
-
First you’ll want to create a tcp connection with all the peers in your list.The more peers you can get connected to the faster you can download your files.
-
After exchanging some messages with the peer as setup, you should start requesting pieces of the files you want. As we’ll see shortly, a torrent’s shared files are broken up into pieces so that you can download different parts of the files from different peers simultaneously.
-
Most likely there will be more pieces than peers, so once we’re done receiving a piece from a peer we’ll want to request the next piece we need from them. Ideally you want all the connections to be requesting different and new pieces so you’ll need to keep track of which pieces you already have and which ones you still need.
-
Finally, when you receive the pieces they’ll be stored in memory so you’ll need to write the data to your hard disk. Hopefully at this point you’ll be done!
4.1 Setup
Before we start I want to add a new file called download.js. Also, because we’re starting to have more files, I want to create a new folder called “src” and move our code there. Finally I’ve decided to remove the puppy.torrent file. Since it doesn’t have many peers, it’s not going to be that useful for this part of the project. I recommend that you find a small torrent with a lot of peers that you can play around with. Our new file structure should look like this:
I’ve also updated index.js to reflect these changes:
index.js:
Note the updated require paths. process.argv[2]
will get arguments from
the command line so you can run in the terminal something like:
node index.js /file/path/to/name-of-torrent.torrent
Also note I’m using the method exported by download.js.
github commit #8: restructure and add download.js
4.2 TCP connect to peers
Using tcp to send messages is similar to udp which we used before. In this case we use the “net” module instead of the “dgram” module. Let’s look at an example of how that would work.
You can see the tcp interface is very similar to using udp, but you have to
call the connect
method to create a connection before sending any messages.
Also it’s possible for the connection to fail, in which case we don’t want the
program to crash so we catch the error with socket.on('error', console.log)
.
This will log the error to console instead. Udp didn’t have this problem
because udp doesn’t need to create a connection.
Applying what we learned above about tcp, we can start writing our download.js file.
download.js:
We use our getPeers
method from the tracker.js file, and then for each peer
we create a tcp connection and start exchanging messages. Before we can go on
we need to know what messages we’ll be sending and receiving.
4.3 Protocol overview
Once a tcp connection is established the messages you send and receive have to follow the following protocol.
-
The first thing you want to do is let your peer know know which files you are interested in downloading from them, as well as some identifying info. If the peer doesn’t have the files you want they will close the connection, but if they do have the files they should send back a similar message as confirmation. This is called the “handshake”.
-
The most likely thing that will happen next is that the peer will let you know what pieces they have. This happens through the “have” and “bitfield” messages. Each “have” message contains a piece index as its payload. This means you will receive multiple have messages, one for each piece that your peer has.
The bitfield message serves a similar purpose, but does it in a different way. The bitfield message can tell you all the pieces that the peer has in just one message. It does this by sending a string of bits, one for each piece in the file. The index of each bit is the same as the piece index, and if they have that piece it will be set to 1, if not it will be set to 0. For example if you receive a bitfield that starts with 011001… that means they have the pieces at index 1, 2, and 5, but not the pieces at index 0, 3,and 4.
It’s possible to receive both “have” messages and a bitfield message, if which case you should combine them to get the full list of pieces.
-
Actually it’s possible to recieve another kind of message, the peer might decide they don’t want to share with you! That’s what the choke, unchoke, interested, and not interested messages are for. If you are choked, that means the peer does not want to share with you, if you are unchoked then the peer is willing to share. On the other hand, interested means you want what your peer has, whereas not interested means you don’t want what they have.
You always start out choked and not interested. So the first message you send should be the interested message. Then hopefully they will send you an unchoke message and you can move to the next step. If you receive a choke message message instead you can just let the connection drop.
-
At this point you’re ready start requesting. You can do this by sending “request” messages, which contains the index of the piece that you want (more details on this in the next section).
-
Finally you will receive a piece message, which will contain the bytes of data that you requested.
4.4 Message types
According to the spec the handshake message should be a buffer that looks like this:
handshake: <pstrlen><pstr><reserved><info_hash><peer_id>
pstrlen: string length of <pstr>, as a single raw byte
pstr: string identifier of the protocol
reserved: eight (8) reserved bytes. All current implementations use all zeroes.
peer_id: 20-byte string used as a unique ID for the client.
In version 1.0 of the BitTorrent protocol, pstrlen = 19, and pstr = "BitTorrent protocol".
Once the handshake has been established there are 10 different types of messages that can be exchanged, all following the same format:
- 4 bytes indicating the length of the message (excluding these 4 bytes)
- 1 byte for the id message
- the rest of the buffer is the message payload which varies by message
I’ll discuss these in more detail as they come up, but you can also find more info in the specs.
I’ve created a new file called message.js with some convenient functions to help build these message buffers.
message.js:
These functions are mostly straightforward, just pass a payload and then return a buffer with the appropriate length and id. Everything follows directly from the specs mentioned earlier.
github commit #9: add message.js
4.5 Grouping messages
Before going on to actually exchanging messages, there’s one more thing I need to address about tcp. You may have assumed that every time you recieve a data though a socket, it will be a single whole message. But this is not the case. Remember our code for receiving data looked like this:
socket.on('data', receivedBuffer => {
// do stuff with receivedBuffer here
})'
The problem is that the callback gets passed data as it becomes available and there’s no way to know how that data will be broken up. The socket might recieve only part of one message, or it might receive multiple messages at once. This is why every message starts with its length, to help you find the start and end of each message.
Things would be much easier for us if each time the callback was called it
would get passed a single whole message, so I want to write a function
onWhileMsg
that will do just that for us.
Like this:
download.js:
Here is the implementation of onWholeMsg
:
download.js:
How does this function work? First notice the distinction between the function
onWholeMsg
, the callback passed to onWholeMsg
, and the anonymous callback
passed to socket.on
. Next, the key to making this work is to use a closure.
Because the socket.on
callback is inside of the onWholeMsg
function, the
former will always be able to access the variables of the latter.
So every time the socket recieves data, the socket.on
callback is called. It
concats the new data with savedBuf
and as long as savedBuf
is long enough
to contain at least one whole message, it will pass it to the onWholeMsg
callback and then update savedBuf
by slicing out those messages. Basically
savedBuf
saves the pieces of incomplete messages between rounds of receiving
data from the socket.
I also have the handshake
variable in the closure. This is because the
handshake message doesn’t tell you its length as part of the message. The only
way you can tell you’re receiving a handshake message is that it’s always the
first message you’ll receive. That’s why I start with handshake
set to true,
and then the first time we receive a whole message I set it to false.
That all might have been a bit mind-bending if you’re not too familiar with
closures yet, but I encourage you to stick with it and read through the code
carefully. One thing that might help is to realize that the onWholeMsg
function is only getting called once, so the savedBuf
and handshake
variables are only initialized once. But then the socket.on
callback gets
called multiple times, each time getting and setting the same two variables.
github commit #10: add onWholeMsg function
4.6 Handshake
Now let’s code the handshake:
download.js:
-
We use the
buildHandshake
function we created in the message.js file, but this method requires the torrent object, so we now pass that in through thedownload
function. -
I didn’t want a long callback function in the
download
function so I created a new function calledmsgHandler
. This function will check what kind of message we are receiving and handle it accordingly. Here it checks if the message is a handshake response, and if so it sends the interested message and hopefully the peer will send an unchoke message. -
This function checks if the message is a handshake. Basically just checks that it’s the same length as a handshake and has pstr ‘BitTorrent protocol’.
github commit #11: add handshake handler
4.7 Pieces
After you establish the handshake your peers should tell you which pieces they have. Let’s take a moment to understand what pieces are exactly.
If you open up a torrent file, we saw that it contains data with various properties like the “announce” and “info” properties. Another property is the “piece length” property. This tells you how long a piece is in bytes. Let’s say hypothetically that you have a piece length of 1000 bytes. Then if the total size of the file(s) is 12000 bytes, that means the file should have 12 pieces. Note that the last piece might not be the full 1000 bytes. If the file were 12001 bytes large, then it would be a total of 13 pieces, where the last piece is just 1 byte large.
These pieces are indexed starting at 0, and this is how we know which piece it is that we are sending or receiving. For example, you might request the piece at index 0, that means from our previous example we want the first 1000 bytes of the file. If we ask for the piece at index 1, we want the second 1000 bytes and so on.
4.8 Handling messages
Now I want to start handling messages that aren’t the handshake message. Since these messages have a set format, I can just check their id to figure out what message it is. In order to help me do this I wrote added a function to message.js for parsing message buffers into their parts:
message.js:
If the message length isn’t greater than 4, then we know it is the keep-ahead message which has no id. If the length isn’t greater than 5 we know that it has no payload. If the id is 6, 7, 8, those messages split the pay load into index, begin, and block/length.
Now we can do this:
download.js:
github commit #12: add non-handshake message handlers
So the msgHandler
function receives a message, checks the id, and then passes
the payload, if any, to the appropriate handler function. You’ll notice we
aren’t handling all types of messages. The other messages are for sending files
rather than receiving them, so these are the messages I’ll focus on for now.
4.9 Managing connections and pieces
This is a critical point in the project because managing the connections and pieces involves a lot of interesting decisions and tradeoffs. So far we’ve mostly been following the specs directly, but from now on there are many possible solutions and I’ll be going through just one. That’s why I recommend taking some time to consider how you would implement these message handlers yourself before continuing.
Of course a big concern is efficiency. We want our downloads to finish as soon as possible. The tricky part about this is that not all peers will have all parts. Also not all peers can upload at the same rate. On top of that, it’s possible for connections to drop at any time, so we need a way of dealing with failed requests. How can we distribute the work of sharing the right pieces among all peers in order to have the fastest download speeds?
4.9.1 List of requested pieces
After some thought I decided on the following solution. First I would have a single list of all pieces that have already been requested that would get passed to each socket connection. Like this:
download.js:
The actual implementation of haveHandler
will be more detailed than this, but
you can see how the requested
list will get passed through and how it will be
used to determine whether or not a piece should be requested. You can also see
that there is just a single list that is shared by all connections. Some of you
may think that it’s inconvenient to have to pass the requested list to so many
functions. Again, there’s more than one possible solution, so I encourage you
to try you own.
This doesn’t yet account for failed requests, but I’ll be addressing that in a later section.
4.9.2 Job queue
Next I want to create a list per connection. This list will contain all the pieces that a single peer has. Why do we have to maintain this list? Why not just make a request for a piece as soon as we receive a “have” or “bitfield” message? The problem is that we would probaby end up requesting all the pieces from the very first peer we connect to and then since we don’t want to double request the same piece, none of the other peers would have pieces left to request.
Even if it’s possible to use a round-robin strategy so that each peer only gets a second piece to request after all peers have gotten at least one piece to request, there is still a problem. This strategy would lead to all peers having the same number of requests, but some peers will inevitably upload faster than others. Ideally we want the fastest peers to get more requests, rather than have multiple requests bottlenecked by the slowest peer.
A natural solution is to request just one or a few pieces from a peer at a time, and only make the next request after receiving a response. This way the faster peers will send their responses faster, “coming back” for more requests more frequently.
However, because we recieve the “have” and “bitfield” messages all at once, this means we’ll have to store the list of pieces that the peer has. This is so that we can wait until the piece’s response come back, then we can reference the list to see what to request next.
I refer to this as a job queue, because you can think of it like this: each connection has a list of pieces to request. They look at the first item on the list, and check if it’s in the list of already requested pieces or not. If not, they request the piece and wait for a response. Otherwise they discard the item and move on to the next one. When they receive a response, they move on to the next item on the list and repeat the process until the list is empty.
This list will also need to passed through to the handler functions, but it should be created per connection. Like this:
download.js:
So we only request a piece on a “have” if the queue was previously empty
(should be similar for bitfieldHandler
). From that point on we only request
pieces once we’ve received a piece response (in pieceHandler
). When we
receive a piece we can shift it out of the queue. If the piece has already been
requested, we again shift it out of the queue. You can see that happen in the
requestPiece
function which created since it is code shared by both handlers.
4.9.3 Request Failures
The last thing I want to go over before fully implementing the handler
functions is request failures. Right now we are adding the piece index to the
requested
array whenever we send a request. This way we know which pieces
have already been requested and then we can avoid the next peer from requesting
a duplicate piece.
However, it’s possible for us to request a piece but never receive it. This is
because a connection can drop at any time for whatever reason. Since we avoid
requesting pieces that have been added to the requested
array, these pieces
will never be received.
You might think we could just add pieces to the list when we receive them. But then between the time that the piece requested and received any other peer could also request that piece resulting in duplicate requests.
I’ve found that the easiest solution is to maintain two lists, one for
requested pieces and one for received pieces. We update the requested
list at
request time, and the received
list at receive time. Then whenever we have
requested all pieces but there are still pieces that we haven’t received, we
copy the received
list into the requested
list, and that will allow us to
rerequest those missing pieces.
However, it’s inconvenient to have to maintain multiple lists when just one data abstraction should be necessary. So I’ve create a new file called Pieces.js:
Pieces.js:
You can see that we need to initialize the Pieces instance with the total number of pieces, otherwise it’s not possible to know when we’re finished getting all the pieces. We assume that the array index is the same as the piece index, and then we set that index to true if it has been requested/received (all indexes start as false by default). This is a nice way to look up a piece’s status without iterating through the arrays.
This object can now be used to replace the requested
array from earlier.
4.10 Implementing the message handlers
4.10.1 Choke and unchoke
Dealing with choke and unchoke states is a little bit tricky because we don’t
want to request any pieces until we’ve been unchoked. I think the simplest way
to enforce this is to create a new object that holds both our queue
array as
well as a choked property.
Now we can add this new object and write out the choke and unchoke handlers.
download.js:
github commit #14: add choke and unchoke handlers
Here’s the new download.js file in its entirety since there’s quite a bit going on here. Fortunately most of it was covered in the previous sections. Most of these changes are simply passing our two new data structures Pieces and queue object through through to the handlers. I’ve marked all these lines with a 1.
In the previous section I changed the Pieces class so that its constructor
takes the total number of pieces as an argument. You can find this value by
looking up the torrent.info.pieces.length
property and dividing by 20. The
torrent.info.pieces
is a buffer that contains 20-byte SHA-1 hash of each
piece, and the length gives you the total number of bytes in the buffer.That’s
why we divide by 20 to get the total number of pieces.
The other big change is that we begin requesting pieces when we are unchoked
(which I’ve marked with 2). You can see that when you’re choked, you can’t make
any requests, and nothing happens. If you’re unchoked then we shift out piece
indexes until we find one that we don’t already have using the pieces.needed
method we coded earlier and request that piece. Finally we add the requested
index into pieces
and break the loop. The next request won’t happen until we
get a “piece response” which we’ll code up in the pieceHandler
function.
You might have noticed that I wrote a comment above socket.write(message.buildRequest(pieceIndex));
that something needs to be fixed. We’ll address this problem in the next
section.
4.10.2 Pieces vs. blocks
The function message.buildRequest
in the above code needs to take an object
with an index, begin, and a length property. These are the required fields for
the payload of a request message.
But what are these fields for exactly? Well index is easy, it’s the piece index
that is currently being passed in. What about begin and length?
These two properties are necessary because sometimes a piece is too big for one message. Although there is some dispute about the best size, it is typically considered to be around 2^14 (16384) bytes. This is called a “block”, where a piece consists of one or more blocks. If the piece length is greater than the length of single block, then it should be broken up into blocks with one message sent for each block.
So the “begin” field is the zero-based byte offset starting from the beginning of the piece, and the “length” is the length of the requested block. This is always going to be 2^14, except possibly the last block which might be less.
We’re going to need a number of modifications to accomodate this. But first I want to write some function in torrentParser.js that will help me determine the piece and block lengths for a given piece index.
torrent-parser.js:
In pieceLen
I had to use the bignum library to deal with the totalLength
variable because it’s in an 8 byte buffer. pieceLen
and blockLen
are
similar, they just return the piece length or block length, respectively,
unless they happen to be the very last piece or block. Then it might be shorter
than a full piece or block. Piece length can be found in the torrent file, and
block length is 2^14 bytes by convention.
Also, both the queue object and the Pieces class should be changed to deal with blocks instead of just pieces. For the queue object we just used a plain javascript object with two properties, but with the added complexity of dealing with blocks, I’ve decided to create a new Queue class.
Queue.js:
Now the queue is a list of pieceBlock
objects. I admit the terminology is getting
a little confusing but I’m referring to the object built in the queue
method
that have index, begin, and length properties. These pieceBlock
objects have the
same structure as the payload when we send a request message (check the
buildRequest
function in message.js), so we can pass them to the request
builder directly.
More generally, from now on we want to deal with these objects instead of the piece index, because it also gives us information about the block. Here’s an outline of when/where we would use these objects:
-
When we receive a “have” message, we get a piece index, but we fill the queue with these piece objects, one for each block for each piece.
-
When we deque an object we can pass it to the request builder and make a request for the associated block.
-
The Pieces class that tracks requested and received pieces should be able to add a
pieceBlock
. We’ll see this implemented next.
Note that the constructor need to get passed a torrent object now, instead of just the number of pieces.
I’ve prepended private variables with an underscore, and the other methods just
give us public methods to interact with the private _queue
variable. I’ve
left the choked
without an underscore to indicate it is a public property. If
you’re not familiar with private and public properties, it’s not that
important. The idea is that users of this module shouldn’t use the underscore
prepended properties.
Here is the Pieces class altered to handle blocks.
Pieces.js
Now the requested
and received
arrays, which used to hold the status of a
piece index, now holds an array of arrays, where the inner arrays hold the
status of a block at a give piece index. So if you wanted to find out the
status of a block at index 1 for a piece at index 7, you could look up
_requested[7][1]
and check if it’s set to true. As before all values are
initially set to false.
Also note that these methods expect to be passed a pieceBlock, the same ones we
saw in the Queue class. This means when we peek
a pieceBlock from a queue and
request it, or deque
a pieceBlock when we’ve received it, we can pass the
pieceBlock directly to addRequested
or addReceived
respectively.
Finally there’s a few changes that need to be made in download.js where we use the Queue and Pieces classes.
download.js:
github commit #15: support blocks
This code is mostly the same as before except that we’re passing and using our new Pieces and Queue classes. Also we now use the pieceBlock object instead of just the piece index.
4.10.3 Have and bitfield
These two messages do roughly the same thing, which is tell us which pieces the peer has for sharing. Let’s start with the “have” message.
download.js:
This will add the piece index into the queue. We also want to kick off the piece requesting process if this is the first item added to the queue. Why only the first? Because as we discussed before, we want to wait for the piece response to come back before requesting the next piece.
Now let’s look at bitfieldHandler
:
download.js:
github commit #16: add have and bitfield handlers
The payload here is a buffer, which you can think of as a long string of bits. If the peer has the index-0 piece, then the first bit will be a 1. If not it will be 0. If they have the index-1 piece, then the next bit will be 1, 0 if not. And so on. So we need a way to read individual bits out of the buffer.
I don’t want to dwell on this part too much. I think it’s a good exercise for the reader to try themselves. Basically repeatedly dividing by 2 and taking the remainder will convert a base-10 number to a binary number, giving you the digits of the binary number from least to most signifiant bit (right to left).
Just like haveHandler
we queue the piece index, and only make a request we
are adding the first piece in the queue.
4.10.4 Piece response handler
This is the final section before we can start downloading torrents! We just
need to write out the pieceHandler
which is for when we receive the actual
bytes of the piece we requested.
The main things we want to do in this function are:
- Add the piece to the list of received pieces.
- Write the bytes to file.
- Request the next needed piece. Or…
- If there are no more pieces to download in the queue we close the socket connection.
download.js:
I’m calling the response piece pieceResp
. It’s slightly different from a
pieceBlock because instead of the length the of piece, it contain the actual
data for the piece. But the index
and begin
properties are the same, so
you can pass it to the pieces.addReceived
method normally. We also get to use
our pieces.isDone
method here to check if we’re done. If we are we close the
socket, otherwise we request the next piece.
The only thing we need to do now is write the bytes to file. We can do this
using the fs
module by using the fs.openSync
method to open a file. This will return a file descriptor, which we pass to the
fs.write
method. Since want all the peers to be writing to the same file, creating the
file descriptor is one of the very first things we need to do:
download.js:
The openSync
method takes a file path (this includes the file name), which is
where the file will be created. Note that this exported function now takes a
second path
argument. We then use ‘w’ or write mode to create and write to
a new file. You could also change this code so that you can resume downloads
instead of overwriting and previous existing data, but we’ll keep things simple
for now. This file descriptor should get passed all the way down to the
pieceHandler
function.
The final pieceHandler
function should look like this:
download.js:
Since pieceResp.begin
only tells us the offset relative to the piece it’s in,
we have to calculate the absolute offset ourselves. Then we just write the
block of data to the file descriptor we create earlier. That’s it!
The very last thing we need to do is pass the filepath into the exported function.
index.js:
The only change we made here is to pass the torrent file’s name to the
download
function.
github commit #17: add piece handler
5. Conclusion
5.1 Improvements
Currently we can only download single files, not files in folders. This wouldn’t be hard to implement. You would just need to check the torrent object to see what the file structure looked like and where the start and end of each file was. In fact there are still many interesting problems you could work on to polish up this project. Here is a list:
- Add a graphic user interface
- Optimize for better download speeds and more efficient cpu usage. For example some clients calculate which pieces are the rarest and download those first.
- There’s also something called distributed hash tables which makes it possible to share torrents without the use of centralized trackers.
- You could write code to reconnect dropped connections
- You could look for more peers periodically.
- You could support pausing and resuming downloads.
- You could support uploading since currently our client only downloads.
- Sometimes peers are unable to connect to each other because they are behind a NAT which gives a proxy ip. You could look into NAT traversal strategies.
- It’s possible to bring bittorrent to the web using WebRTC, which creates a direct peer to peer connection.
The sky’s the limit!