[JDEV] Videoconferencing with jabber / Re:[speex-dev]Videoconferencing with speex and jabber

Mon Dec 1 12:08:30 CST 2003

Let me start off by apologizing to everyone on the list who is following 
this discussion for my horrible spelling lately. This discussion takes up 
a little more time than I actually have right now, but I'd still like to 
sort things out a little before I go on vacation on Dec. 5th.

On Mon, 1 Dec 2003 10:55:20 -0000, Richard Dobson <richard at dobson-i.net> 
wrote:

>> Having one user assume the role of server, and one of client, is really
>> no harder than a model in which you assume both are equal peers. It's
>> simply a matter of different roles. If you can think of any reason why
>> this is not true, please share it with the rest of us!
>
> I don't dispute that it's no harder (for one-to-one), simply that using a
> client/server model when a p2p model is more appropriate IMO can create
> more problems than it solves.

Then please point out those problems for me.
I doubt you can think of any for person to person. Not at all compared to 
a p2p solution. And *just* implementing that is enough to participate in 
conferences with more than 2 persons. You get this *for free*, so to 
speak.

Whether you'd choose to make an extension on this for conferencing over
direct links is up to you. I won't stop you. I'd encourage you if I could 
:)

>> However, using a client/server model will allow you to participate in a
>> conference on a server with more people *with no extra effort at all*.
>> Yet you still state you don't believe it will be easier?
>
> Yes, it is easier to implement because you don't need extra p2p, but IMO
> it's not really that much more to implement it, as you will already have
> a large amount of the necessary code in place once you have created a
> client with an inbuilt server.

So why not implement a c/s based solution for person to person and server 
conferencing (which will take about the same effort as implementing a p2p 
based solution for person to person). And then implement a direct link 
based conferencing solution, where each node is a server as defined in the
c/s spec (which will take about the same amount of effort as doing it
based on a p2p spec.)

Unless for some reason, you think the c/s spec would bring up issues, 
which you seemed to imply a bit back in this email.

>> What I *am* saying is that an entirely p2p based conferencing model (with
>> more than 2 persons involved) is a lot more complex than a client/server
>> model. Even more so if you only have to implement the client portion.
>> That's why this allows "thin" clients to still participate. It was you
>> yourself who argued against mixing and bandwidth requirements on thin
>> clients such as a pocket PC.
>
> Yes if you only implement the client portion

You actually make a good point here. Implementing client portion + server 
portion (that's just suitable for talking to one person) takes about the 
same effort as implementing a p2p solution. But I suppose you could go for 
the solution of only implementing the client portion in extreme cases 
where resources are very limited :)

> it will be a lot more work to
> add server or p2p, but if everyone does that (to save time and effort)
> your proposed system will fall apart because there will be no servers for
> people to connect to.

If I work on client X, of course I'll implement the server portion (for
single person to person chat at the least). Else client X can't talk with
client X! That'd be kind of dumb. But I can imagine that if I have an
assignment for a company to build a client that's capable of conferencing
on the company servers (so they can log, etc.) I could drop the server part.

More advanced clients are likely to also implement a server that supports
hosting a conference with more than 2 people. Or they'll implement a
direct link conferencing extension (still based on the same protocol
of course). Those two are complementary, not competitive. But as pointed out
by you, direct link person to person is definitely needed most, and as
pointed out by others, server based conferencing is needed most too.

That doesn't mean there are no use cases left for direct link based 
conferencing, but IMHO not enough to justify a spec that will miss out on 
server based conferencing when you can get that practically for free, and 
will complicate the spec and raise the requirements for conferencing.
Again, it's not impossible.

> As Mats Bengtsson suggests I think you should take a look at this
> http://www.skype.com/skype_p2pexplained.html their solution looks rather
> good (although goes further than I have been suggesting)

Skype uses UDP NAT traversal based on getting its IP from someone outside
the NAT (at least, so it was suggested either here or on SJIG), which is
currently being rejected by the jabber server folks, and if that doesn't
work it uses proxies on a p2p network. Those peers on the network 
basically act as a proxy. So I don't quite see the relationship with your 
proposal.

This sort of functionality can be used with SI. For example, you could 
make an SI bytestream over a JXTA network. With just a bit of cheating you
could probably even use the Skype network itself with SI.

> maybe what we
> really need to do rather than concocting our own solution is defer to the
> even greater experience of someone else and just try to integrate with an
> existing mechanism, just like we did with SOCKS5 for the bytestreams
> mechanism.

SOCKS5 is hardly integrated with an existing mechanism, it just uses part
of the same spec. Using SI you can integrate other solutions, almost
transparently, and fall back on others if they don't work. That doesn't
eliminate the need for a spec for setting up these things, and I see no
good reason to not use a c/s architecture there.

>> I think from the discussion it's pretty obvious what's needed/wanted
>> most are 2 things:
>> - person to person over a direct link
>> - conferencing with multiple persons on a server
>
> As you realise I dont think you need to use a server to talk with a small
> group of people.

You're turning a blind eye to the issues with p2p then. Other people have 
pointed them out, and I have. I'm not ruling out direct links conferencing 
at all, but after direct link person to person second most needed is 
server based conferencing. Both as a fall back for direct link person to 
person, and because in many cases (not ALL, I'm not suggesting that) 
that's the only *quality* way of having a conference with multiple 
persons. So why throw this away if we can get it, almost for free?

Again, this does not rule out what you want at all.

>> This can both be handled, without overlap, with a simple JEP based on a
>> c/s model. P2P won't cover this, nor will it be any simpler.
>
> Sorry but it can handle it as I have clearly shown,

What you're talking about is simply a *different* problem. It's a 
solution, and a good one, but for a *different* problem. It can't handle 
it, and it doesn't cover it.

> it won't be any simpler
> but IMO it's not much harder if you already have client/server code in
> place, and is far more reliable.

Well exactly, if you have a c/s spec with c/s code in place, you can use 
that to implement your solution. You won't need anything p2p, it's about 
the direct links.

>
>> Conferencing over individual direct links between persons is interesting
>> too, but too complex to be included in the basic JEP if you ask me.
>
> I dont think its really all that much harder as you know.

Well, with a c/s spec, clients (and servers for person to person) have it
very easy. Bandwidth reqs are low, CPU reqs are low, and you can talk to as
many persons at once as you want. Of course, the reqs for the server are
higher (when more than 2 persons are involved). But as I pointed out, not
THAT much higher than for a node in a direct link conference. In many use
cases there WILL be more advanced implementations, on more advanced
platforms with more resources, that can support being a server, and many
clients that couldn't be a server. But it's no issue for those clients,
since they only have to be a client. In many MANY cases p2p/direct link
style conferencing isn't an alternative. You too have pointed to the
dialups and the pocket PCs etc., I'm sure..
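
To put rough numbers on that, here is a minimal sketch that counts streams
and upstream bandwidth per role. It assumes a full mesh for the direct link
case, a host that is itself one of the participants in the c/s case, and a
single fixed codec bitrate; the names (Load, mesh_node_load, cs_client_load,
cs_host_load) are made up for illustration and come from no spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    streams_out: int      # encoded streams this node has to send
    streams_in: int       # encoded streams this node has to receive
    upstream_kbps: float  # upstream bandwidth needed for the outgoing streams


def mesh_node_load(participants: int, codec_kbps: float) -> Load:
    """Direct links: every node sends to and receives from every other node."""
    others = participants - 1
    return Load(others, others, others * codec_kbps)


def cs_client_load(codec_kbps: float) -> Load:
    """c/s client: one stream up to the host, one mixed stream back."""
    return Load(1, 1, codec_kbps)


def cs_host_load(participants: int, codec_kbps: float) -> Load:
    """c/s host (assumed to be a participant): one stream per other member."""
    others = participants - 1
    return Load(others, others, others * codec_kbps)


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(n, mesh_node_load(n, 16.0), cs_client_load(16.0), cs_host_load(n, 16.0))
```

Under these assumptions the host needs the same bandwidth as any node in
the mesh, while every plain client stays at a single stream no matter how
many people join. The extra cost on the host is CPU for mixing and
re-encoding, which this sketch does not model.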

>
>> Conferencing over direct links doesn't have to be p2p either. You can
>> base it on the c/s JEP with every individual participant acting as a
>> server. Not that much more complex than doing this on a p2p based model.
>
> But that is p2p is it not?

Any node (JID) in the network can be a server. This is a role in the
protocol. By having this role, you can support both direct link person to
person conversations, and conferences on a server. That's my point. If
instead the protocol uses the role of two equal "peers", this breaks
down.

[cut out some stuff where we pretty much agree I think]

>> So let's apply this to some real world situations. In how many cases do
>> all the clients have about the same available bandwidth, CPU, etc.? With
>> Joe Consumer this is unlikely.. it's a mix of dialup and broadband users.
>> If I'd want to talk to my mother, sister and brother at the same time, I
>> have a 1 mbit link, 1 will have a cheap DSL account, and the other 2 will
>> be on dialup most likely.
>
> I can see on dialup this is a problem, but as I detail below it can be
> complex determining the correct machine to run the server from (bandwidth
> available, CPU speed etc). This really needs to be automatic or we will
> make it that much harder for normal users to use; they might well not
> bother and continue using MSN etc instead. We must make sure we offer
> something that is at least as easy as MSN Messenger and the like to use,
> so whichever way we go, be it client/server or p2p or both, all that needs
> to be hidden from the user, and all they should need to do is select the
> people they wish to chat to and click "chat".

Does MSN even *do* conferencing with more than one person? (I don't know)
I think in most cases users will know who has the fastest connection, but 
I can imagine you'd prefer an automatic solution for this. That would be 
rather neat. Of course when you host all this on a server component the
choice is clear.
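
An automatic choice could be as simple as the sketch below: among the
participants that are able to host at all, pick the one with the most
upstream bandwidth, provided it can carry one stream per other member. How
a client would actually learn the upstream figures is left open here; the
Candidate fields, the pick_host name, the JIDs and the bandwidth numbers are
all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    jid: str              # hypothetical JID of a participant
    upstream_kbps: float  # assumed to be advertised or measured somehow
    can_host: bool        # False for thin clients that only implement the client role


def pick_host(candidates: Sequence[Candidate], codec_kbps: float) -> Optional[Candidate]:
    """Return the best-connected participant able to host, or None if nobody can."""
    needed = (len(candidates) - 1) * codec_kbps  # one outgoing stream per other member
    able = [c for c in candidates if c.can_host and c.upstream_kbps >= needed]
    return max(able, key=lambda c: c.upstream_kbps, default=None)


if __name__ == "__main__":
    family = [
        Candidate("me@example.org", 1000.0, True),      # the 1 mbit link
        Candidate("brother@example.org", 128.0, True),  # cheap DSL, figure assumed
        Candidate("mother@example.org", 33.6, False),   # dialup
        Candidate("sister@example.org", 33.6, False),   # dialup
    ]
    print(pick_host(family, codec_kbps=16.0))
```

A None result is the case where nobody in the group can host, which is
exactly where a server component (or a different approach) is needed.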

>> Again I don't think direct-link style conferencing is uninteresting or
>> unneeded, but it's a much more specific application than c/s
>> conferencing. And *again*, a c/s style approach will not prevent this
>> from being an extension.
>
> Good, but once we have a client/server system in clients we will have 90%
> of the code needed to implement it. It would be a mistake IMO, and could
> prove to create a messy protocol, if we don't consider how to include p2p
> function into the protocol we create from day one; otherwise when we
> extend it later it could end up either messy or we will end up duplicating
> lots of effort.

Agreed, when creating such a spec based on c/s, attention should be paid 
to allowing a direct-link conference style solution from the start. For 
that matter, it should also allow for things such as distributed hosting
of a conference (a sort of hybrid between direct links and c/s) or any 
other things people can come up with. It should just be as generic as 
possible.

>> And how's that? When 4 people talk at once, *all* clients will have to
>> mix 4 streams in the case of direct links. In the case of c/s only the
>> server will have to mix 4 streams. Explain..
>
> Yes but the server has to do more than simply mix the streams, it also
> has to re-encode the mixed streams. Also, if you want to remove echoes as
> you suggest below, or be able to ignore participants as someone has
> already suggested as useful functionality, you need to re-mix and
> re-encode all outgoing streams individually, which would I expect be quite
> a CPU drain. But in p2p mode, clients using available technologies
> (DirectX or the equivalent) don't even need to mix the streams, as you can
> play simultaneous WAVE streams at the same time; also the client doesn't
> need to re-encode the stream to send out again.

Well, I agree that, just like with the bandwidth requirements, demands on
the server will be higher than on a node in a direct link conference. Just 
not THAT much higher, unless you want some more advanced features. There's 
always trade-offs between the two solutions, and at times you could prefer 
yours over the other. But the point I'm making is that we can have *all* 
of them, relatively simply, with a c/s based architecture, even if a p2p
spec might be just a *little* easier to work with in your case, or at 
least sound more logical when reading the spec.

Of course you still have to mix when you use DirectX ;) Servers can use
existing technology too of course.. Servers (components) specializing in
hosting this kind of thing for companies or paying customers could even
use DSP hardware and such.

>> (only thing I could think of is if you want to create a separate mix for
>> each client, without their own channel in it, to prevent echo. Rather
>> than mixing new streams for each client you should just suppress echo
>> for each client. Admittedly, it increases demands on the server if you
>> want this, but not as bad as having to mix a new stream for each client)
>
> Not sure how you would suppress the echo of what someone said without
> re-coding the streams individually to exclude that person on their own
> incoming listening stream.

Well, aside from that you can suppress it client side... (which would
raise the requirements for our poor pocketPC clients a little too much) I'm
not an expert on audio technology but I'd imagine there are some heavy
optimizations possible when making different mixes based on the same
streams? I could be wrong of course..
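
One optimization of that kind is the "mix-minus" trick: sum all decoded
streams once, then subtract each participant's own samples from that sum
to get their personal mix. The sketch below assumes all streams are already
decoded to equal-rate 16-bit PCM frames; mix_minus and clip are illustrative
names, not part of any existing library or JEP.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def clip(sample: int) -> int:
    """Clamp a mixed sample back into the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, sample))


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Build one mix per participant that leaves out their own stream.

    `frames` maps a participant id to one decoded PCM frame.
    """
    if not frames:
        return {}
    length = min(len(samples) for samples in frames.values())

    # Sum every stream exactly once (kept unclipped so the subtraction is exact).
    total = [0] * length
    for samples in frames.values():
        for i in range(length):
            total[i] += samples[i]

    # Each personal mix is the shared total minus that participant's own samples.
    return {
        who: [clip(total[i] - samples[i]) for i in range(length)]
        for who, samples in frames.items()
    }


if __name__ == "__main__":
    print(mix_minus({"a": [1, 2], "b": [10, 20], "c": [100, 200]}))
```

This brings the mixing work down from roughly N*(N-1) additions per sample
to about 2*N. It does nothing about the other cost raised above: each
personal mix still has to be encoded separately.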

>
>> Yes, when the server quits the conference the others will get booted. If
>> this is a big issue for you, you could devise a fallback system to
>> another server (one of the clients for example) and still have a
>> massively less complex system than direct-link based conferencing. Since
>> servers are most likely to be the best machines with the best
>> connections this isn't such a big problem, but it's still easily solved
>> if you want.
>
> Good, this would have to be if I were to support this; problem is tho,
> adding in this sort of thing brings us even closer to the requirements of
> just using a p2p system,

Switching to a fallback server is *definitely* something different from
using your direct links system. Again, c/s and direct links based
conferencing are two different solutions to two different problems, except
perhaps in the most general sense.

People on the list made it very clear direct-link style conferencing with
multiple persons will not fill their most basic needs. If your only
problem would be worrying about whether the host dies, I'd recommend you go
with the solution I proposed rather than go direct link style. But I doubt
that's your only problem :)

> also would have to make it easy to start chats for
> normal users so the system needs to automatically determine which
> machine in the group is best suited to be the server and set it up as
> such without the user needing to do that themselves. There is also a
> problem with falling back in this situation, in that what if there is not
> a machine with enough bandwidth etc left to maintain the chat?

Of course this is a problem. If it won't work it won't work. If your
solution *would* work in that case, well that's why I think it would be
great to have. However, don't overestimate how often this will be the
case. But it's definitely so on XBox Live.. which is still a brilliant
example :)

> It will go down, which it shouldnt
> in p2p because all nodes will require the same amount of bandwidth to
> maintain it and it should keep going.
>
>> When there are a few clients with bad connections in the conversation
>> reliability will probably improve a bit too. Bad connection <-> Good
>> connection <-> bad connection is generally more reliable than bad
>> connection <-> bad connection. Esp. when you consider bandwidth usage
>> drops too.
>
> Yup but there is no real way without user intervention to make sure the
> server is on a reliable connection, but we need to make it as easy as
> possible otherwise normal people would not know what to do.

You could automate this.. (and use a remote control protocol to set
everything up transparently) but I don't think user intervention is a bad
thing here *necessarily*. Even the most oblivious of users know broadband is
better than dialup..

>> Latency is an interesting case, but in practice the results would
>> probably surprise you. Because on low-bandwidth nodes the bandwidth
>> requirements drop dramatically when they act as a client rather than a
>> node in the direct link conference, latency will actually improve in a
>> lot of cases!
>
> Thats good but do you have any real evidence of this?

I assume you have no problems with the idea that latency is lower on 
low-bandwidth connections when the bandwidth used is lower too? If not..
just play an online game, then exit it, turn on some filesharing network, 
and play the game again ;) That's just simple maths!

Even on my old "broadband" connection, where I had 15 KB/s upstream 
available, latency would jump from about 25-40ms to 50-400ms if I used
only 10KB/s of it for different purposes.
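
A toy model shows the same trend. This sketch treats the upstream link as
a single M/M/1 queue, which is an assumption for illustration only (real
links, packet sizes and traffic patterns differ, so it will not reproduce
the exact figures above). The mean_delay_ms name and the 500-byte packet
size are made up.

```python
def mean_delay_ms(link_bytes_per_s: float,
                  used_bytes_per_s: float,
                  packet_bytes: float = 500.0) -> float:
    """Mean time a packet spends queued plus in transmission, in milliseconds.

    M/M/1 result: W = 1 / (mu - lambda), with mu the rate at which the link
    can send packets and lambda the rate at which packets arrive.
    """
    if used_bytes_per_s >= link_bytes_per_s:
        return float("inf")  # the link is saturated and the queue grows without bound
    service_rate = link_bytes_per_s / packet_bytes  # packets per second the link can send
    arrival_rate = used_bytes_per_s / packet_bytes  # packets per second being offered
    return 1000.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    for used in (0, 5_000, 10_000, 14_000):
        print(used, round(mean_delay_ms(15_000, used), 1))
```

The point is only the shape of the curve: delay grows slowly at first and
then shoots up as the used share of a small upstream approaches its limit.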

Gaming provides another example.. in the old days when I played Quake, it'd
be a lot faster to play on my ISP's server with someone than for either of
us to host the server (latency would be higher and less reliable there).
Experience in using the old ICQ protocol gave me the same idea, even
though the amounts of data are *very* limited there.

If latency is your main point for choosing direct link conferencing, I'd
be very careful if I were you, cause the result might disappoint in many
cases.

>> So you can have the situation where a node in a direct-link
>> conference with 3 persons talking is barely able to keep up, with
>> horrible latency, while a client with the exact same quality connection
>> is enjoying a conference where 6 people are talking with lower latency!
>> (it wouldn't even be able to participate when 6 people are talking in a
>> direct link conference).
>
> You would have to have very low bandwidth to not be able to talk to
> those 6 people tho in p2p, but yea that could be a problem, but one of
> the people still needs to be on a good connection.

If you dedicate all your bandwidth to voice chat, use low quality codecs,
and have at least a 56k6 on a decent ISP, then you can probably talk to
more than just a few. But that's hardly always the case. And still, the
more streams are active, the higher latency will get, and the lower
reliability in some cases.

>> Now let's talk about out-of-sync mixing. With direct-link based
>> conferences every client will produce a different "mix" based on the
>> latency / bandwidth of their connections, and that of the other nodes.
>> This means when we're in a meeting, for me it can sound like 3 people
>> were talking at once, while for you it can sound like they didn't at
>> all. (that means I didn't hear what they said and I'll ask them to
>> repeat, while you'll be annoyed with me (even more ;) cause for you it
>> sounded like I could have heard perfectly).
>
> Sure that could be a problem, but its a problem people will be used to if
> they have ever made long distance phone calls, this sort of thing is the
> least of our worries IMO.

This problem doesn't occur when you make long distance phonecalls..??? How 
could it? It doesn't even happen in a long distance *conference* call!

With a serverside solution *everyone* will receive the same audiostream 
(with perhaps only their own stream omitted). With direct links every
client makes their own "mix".

Let's pretend you and I are in a conversation with person A and B and C. 
We're using direct links for conferencing. Person A asks a question. His
stream is broadcast individually to all nodes. Person B then starts to
answer, and so does person C. When person B notices person C also wants to 
answer (they both have a fast connection so little latency) person B shuts 
up, and C answers. I however am on a bad link. I receive the question from 
A, my link with person B just went bad a little, so him starting to answer 
didn't make it to me yet, but already I can hear C start to give his 
answer. Then the link with B clears up, and in the middle of what C is 
saying (way after he noticed B was gonna let him do the talking) I 
suddenly hear B start to answer, and then stop. So I ask if C can repeat 
himself. But your link with B and C is just fine, you didn't hear B talk 
through C when he was answering at all. So you ask yourself whether I was
sleeping during the meeting or something.

The more diverse your different types of connection are (unlike with XBox 
Live where they are all pretty much the same) the more of a problem this 
will be. Esp. if you use TCP sockets over an unreliable connection. This
does not happen *at all* with server based conferencing.

>> Of course there is a solution for this, syncing the mixes between nodes.
>> But then you lose all latency advantages, you'll be as slow as the
>> "weakest link". (and the weakest link will be a lot more stressed than
>> it would be in a c/s model). Of course compromises are possible..
>
> Sure

That doesn't mean doing in-sync mixing in a direct links conference isn't
still a bitch to pull off.. how do you determine what delay the faster
nodes should add? You'll need control channels at least, and of course you
don't want *those* to depend on a c/s architecture either. Good luck with
that ;)
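
The arithmetic itself is the easy part, as this rough sketch shows: if
every (sender, listener) latency in the mesh were known to everyone, each
listener would delay each incoming stream up to the slowest link in the
whole conference. The playout_delays name and the latency figures are
hypothetical; collecting and sharing those measurements is the hard part,
and is not modelled here.

```python
from typing import Dict, Tuple

Link = Tuple[str, str]  # (sender, listener)


def playout_delays(latency_ms: Dict[Link, float]) -> Dict[Link, float]:
    """Extra delay each listener adds per incoming stream so all mixes line up.

    Every stream is padded up to the latency of the slowest link in the mesh,
    which is why the whole conference ends up as slow as its weakest link.
    """
    slowest = max(latency_ms.values(), default=0.0)
    return {link: slowest - ms for link, ms in latency_ms.items()}


if __name__ == "__main__":
    measured = {("B", "C"): 40.0, ("B", "me"): 300.0, ("C", "me"): 120.0}
    for link, extra in playout_delays(measured).items():
        print(link, extra)
```

Note that the input is a complete, shared view of the mesh, so some control
channel between all nodes is unavoidable, and the result has to be
recomputed whenever a link's latency changes.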

> But p2p chats should not need
> a server IMO because they are short lived sessions for which you will
> have already located the other members of the chat via another means
> (your Jabber session). Please bear in mind that client/server systems are
> not always the best solution; just think, if the file sharing systems all
> went through central servers the bandwidth use would be unsustainable for
> the server admins.

That's not anything like what I am proposing. To start with, practically 
all person to person communication would be over direct links. Secondly, 
conferences would not be held on some gigantic server, rather there will 
be small clusters spread all over the place.

As you might know many p2p networks have made this same change, relying
more on the stronger better clients, letting them take some roles that 
traditionally were meant for servers, Peer caches, supernodes etc. At 
first this was just with control info, but Skype is the next step, using 
"peers" as proxy servers for data. (One could argue Skype is not the first 
one to do it though, there's Freenet for example)

I think it'd be great if we could take the same route with Jabber (I 
already named a SI/JXTA based solution as an example), but without ruling 
out the reliable and needed c/s model either. And I think I pointed out
fairly decently how we could.

> Although there is the fact that current audio chat systems are mostly
> p2p, e.g. XBox Live, MSN Messenger, AIM, Yahoo Messenger, H.323, SIP. We
> need to be careful not to dismiss all that research, development and
> reasoning that went into the decision for these people to go p2p.

With the exception of XBox Live perhaps, I wouldn't want to rely on any of 
them for conferencing with more than one person.

SIP and H.323 *can* depend on direct links for conferencing (as far
as I know); that doesn't mean they have to, or even do so in a lot of cases
(esp. SIP, which is often used just for replacing CSD channels!). If you're
under the impression that SIP and H.323 are never used in conjunction with
a "classic" phone conference you'd be very wrong I'm afraid. (as for
AIM, Yahoo, MSN, I didn't even know they support conferencing, let alone
how or what architecture they use for it on the protocol level)

Solutions like Net2Phone definitely connect to some server implementation,
even for non-conferencing.

> Maybe what we actually need to solve the low bandwidth problem of dial up
> users and the reliability problem of having a single point of failure is
> to have a hybrid client/server and p2p system where the people with
> sufficient bandwidth run as both servers and p2p between each other (like
> the idea of a supernode) and the low bandwidth users connect to one of
> those servers. It solves the low bandwidth user problem and the
> reliability problem by having multiple servers users can switch to if one
> goes down, and also the CPU usage problem by not having too many people
> all connected to one server.

In a previous email I already briefly touched on the subject, and some in
this email. I definitely think most of this could be handled in the SI
layer though (with a little cheating); a c/s based spec will not rule this
out at all.